Audio Post-Production Technical Primer

Having done a fair bit of audio post-production work, but coming from a music production background, I know that coming up to speed on the esoterica of audio post can be confusing. I have attempted to capture some of the basic technical information here. This is not intended to be a tutorial on the creative aspects of audio post, but rather a technical reference to some of the details about formats that one encounters in audio post work.

Sample Rates

Audio production and post-production work for film or video is typically done at 48kHz. Some folks like to work at 96kHz or 192kHz, for improved fidelity (real or imagined, I will not enter that debate here), but those rates are double and quadruple the standard 48kHz, so they are compatible rates for audio post.

An older rate sometimes encountered is 32kHz; this rate is used in older Nagra production systems, some DAT recorders, and even DV-25 cameras. Most of these systems are also capable of operating at 48kHz. DV-25 cameras are an interesting case here, because while DV-25 can encode 2 channels of 16-bit audio at 48kHz, if you want more than 2 channels on the camera, you can record 4 channels of 12-bit (non-linear) audio at 32kHz. However, when you encounter 32kHz, it will typically be as source material, which you would then convert (automatically or manually) to your working rate; it would not make sense to work at 32kHz for an audio post project.

Music production typically uses rates related to CD quality, i.e. 44100Hz, or more rarely the multiples 88200Hz and 176400Hz. Nowadays, this is arguably for historical reasons as much as anything, given the dramatically-changing nature of music distribution, and the fact that CD sales are declining while online music distribution (purchased or not) is increasing. Regardless, this is often the rate chosen for music production work for which the intended distribution is primarily a music distribution channel. I make the distinction of distribution because while film/video scoring is a music production task, the primary distribution medium is film or video (or online digital reflections thereof), and thus this work is almost always done at 48kHz, or the higher multiples thereof.

Frame Rates

Video and film have an inherent "frame rate", which is the speed at which the frames are intended to be presented to the viewer.

Background

Frame rate is expressed in frames per second, which is abbreviated as "fps". Higher frame rates mean more images are shown in the same amount of time. While it might be tempting to think that the higher the frame rate, the better the image, studies have shown (no, I don't have a reference...) that there is no practical benefit to presentation frame rates significantly higher than the film rate of 24 frames per second. Video rates are higher than film, but I imagine this is mostly for historical, electrical reasons, not for reasons of visual perception or temporal acuity (I can wax polysyllabic, too).

Frame Rate and Frame Length (Period)

One important point to keep in mind as you read this is the relationship between frame rate (number of frames displayed per second) and frame length (how much time each frame spans), the latter of which is also referred to as the frame "period". The relationship between frame rate and frame length is an inverse relationship, which is just a fancy way of saying that the higher the frame rate, the shorter the frame, and vice versa. From an audio perspective, the frame period can be measured as a number of samples per frame, and this can be a useful way to think about it: lower frame rates mean more samples per frame, higher frame rates mean fewer samples per frame.
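
To make that inverse relationship concrete, here is a tiny Python sketch (purely illustrative; the function name is mine, not from any tool) that computes the frame period in samples for a few common combinations:

    # Frame period, measured in samples, is just sample rate divided by frame rate.
    def samples_per_frame(sample_rate_hz, frame_rate_fps):
        return sample_rate_hz / frame_rate_fps

    print(samples_per_frame(48000, 24))             # 2000.0  (lower rate, longer frame)
    print(samples_per_frame(48000, 25))             # 1920.0
    print(samples_per_frame(48000, 30000 / 1001))   # ~1601.6 (higher rate, shorter frame)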

Odd Frame Lengths and the Cadence Run

For some frame rate and sample rate combinations, the frame length is not a whole number of samples. For example, a frame of 24 fps material at 32kHz sample rate results in 1333.33(...) samples per frame, which is not at all a handy number; while this particular case is not very common, it is an easy case to use for illustrative purposes, so bear with me, because when we eventually get to NTSC (which always exhibits this issue, regardless of audio sample rate), you will appreciate this simple case more.

Most often, we really do not care much about these fractional frame lengths, as they resolve soon enough. In some cases, however, we really do need a whole number of samples per frame (e.g. when laying back to a format that multiplexes the audio into each frame of video, like DV formats), and in other cases it is just inconvenient to have a fractional number of samples per frame.

Fortunately, all the common frame rates and sample rates actually eventually resolve to a whole number, if you span enough frames. In the example given above (24fps at 32kHz), we can say that every 3 frames of film at 32kHz would contain 4000 samples. This is referred to as the "cadence run", which is just a fancy term for the number of frames (the "run") it takes before the sample rate and the frame rate meet up at another common boundary. The use of the term "cadence" here implies that there is an established pattern to how those samples are distributed across the run. In the increasingly-tiring example we are using, with a run of 3 frames covering 4000 samples, one might imagine that we would use 1333 samples for each of the first and last frame of the run, and 1334 samples for the middle frame, which would distribute the 4000 samples of the run nicely across the 3 frames.
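
If you ever need to work out the cadence run for an arbitrary combination, reducing the exact samples-per-frame ratio to lowest terms does the job: the reduced denominator is the run length in frames, and the numerator is the total samples in the run. A minimal Python sketch (the helper name is just for illustration):

    from fractions import Fraction

    # Reduce samples-per-frame to lowest terms; denominator = frames in the run,
    # numerator = samples in the run.
    def cadence_run(sample_rate_hz, frame_rate):
        ratio = Fraction(sample_rate_hz) / Fraction(frame_rate)
        return ratio.numerator, ratio.denominator

    print(cadence_run(32000, 24))                     # (4000, 3): 4000 samples every 3 frames
    print(cadence_run(48000, Fraction(30000, 1001)))  # (8008, 5): the NTSC case we get to later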

Of course, this simplistic example runs screaming into the night when presented with the monstrous cadence we will need for dealing with 44100 Hz at 29.97 fps, but we won't worry about that particular hydra until it rears its ugly heads...

Interlacing

As previously mentioned, there is little value in pushing presentation frame rates higher than those currently in use. For a variety of reasons, however, traditional video formats actually sub-divide a frame into two fields, where each field consists of half the number of lines of the frame, spread evenly throughout the frame.

This technique, called "interlacing", was introduced to solve a technical problem with early cathode-ray tube televisions. These televisions had a glass surface, coated on the inside with a phosphor material. An electron gun at the back of the set would fire a beam of electrons which would then strike this phosphor material and make it glow for a short while, even after the electron beam had moved on to another part of the frame. The beam would proceed from the top-left corner and trace a horizontal line across the screen, and then move down a bit and back to the left-hand side and trace the next line, etc., until it reached the bottom, and then come back to the top and start over again.

If this were done continuously for every line of the frame, however, then by the time the cathode ray got around to tracing the bottom line of the frame, the top line would already have faded out, which would result in a flickering look to the video, as the frame would always be unevenly lit.

To resolve this, clever people in white coats with horn-rimmed glasses decided to distribute the line tracing over two passes, such that all the even-numbered lines were traced first, and then all the odd-numbered lines were traced, each in top-to-bottom order. In this way, by the time a line began to fade, the next line above and below it had already been drawn, and so the frame remained fairly consistently-lit.

From this was born the interlaced video that we typically encounter today. With these formats, each frame consists of two fields - referred to either as "upper" and "lower" fields, or "even" and "odd". The "odd" or "upper" field is the one that contains the top-most line of the video signal, whereas the "even" or "lower" field begins on the second line. Video editors (and especially image processing and video filters folks) get all excited talking about whether the video is "even first" or "lower-field dominant", but we will just smile indulgently (if a bit vacantly) and let them have their fun.

Some HD video formats actually use "progressive" frames, in which the entire frame is captured, and in turn presented, as a single contiguous image, as it is with film. In order to disambiguate these interlaced and progressive formats, the industry introduced the nomenclature of adding a "p" or "i" to the frame rate number, to indicate whether the rate is dual-field (interlaced) or single-frame (progressive).

High-speed Cameras

Note that while presentation frame rates much higher than 30Hz do not provide a significantly smoother visual experience, there are still some cases where higher frame rates are used. In those cases, the higher frame rate is not for presentation purposes, but rather so that the recorded high-rate image stream can be slowed down, and the resulting images will still appear smooth. This is referred to as "slow-motion", or simply "slo-mo". High-speed cameras are often used in sports for replays and highlights, and also for technical or scientific material, either for analysis, or to make the results more sensational (yay, Mythbusters!). Some very high-speed cameras are capable of frame rates above 1 million frames per second!

The important thing to note, however, is that these rates are for source material. From an audio post perspective, you will be working with material at normal presentation frame rates. The only place higher frame rates really come into play is if there is audio accompanying the high-speed footage, in which case you will likely need to slow down the audio to match the edited visual material.

The Various Frame Rates

Traditionally, there are 3 frame rates typically used in audio post work: 29.97 (approximately) fps used in NTSC video, 25 fps used for PAL video, and 24 fps used for film; for details on why 29.97 fps is approximate, see the later section on "True NTSC Rate". While 30 fps exists as a frame rate, in practice it is not used much, other than as a transitory rate for Telecine transfers until they are pulled down to 29.97 fps. Derivative rates include the NTSC high-definition rates of 59.94 (for 60i) and 23.976 (for 24p), both of which are also approximate rates, as explained later.

Film: 24 Frames-per-Second

Film runs at 24 fps. This is the slowest rate you would encounter (well, save 23.976, but we will get to that later), and true 24 fps is only used in film. Most online editors and DAWs work nicely with the standard video-related audio rate of 48kHz, for which a true film speed of 24fps results in exactly 2000 samples per frame.

Often, editing is actually done at video rates, and thus the material is converted first from 24 fps to either 25 fps or 29.97 fps (or 23.976 fps), via a process known as Telecine (discussed later).

PAL: 25 Frames-per-Second

If you're fortunate enough to live in a country that does not use NTSC, or work with/for such folks, then you will be using PAL or SECAM for video projects - since SECAM is similar enough to PAL for our purposes as to make the distinction largely academic, I will risk annoying the esoterica-minded portion of the video world and refer to the whole PAL/SECAM sector as simply PAL. Footnote: If you find yourself drunk at a party, and enjoy hearing yourself talk loudly, you can probably spout absolute nonsense for long periods of time about the difference between PAL and SECAM, and (if it even matters) nobody in the room will be able to cogently argue with you (which could be a good or bad thing, depending on how drunk you are).

PAL - which stands for Phase Alternating Line (there will not be a quiz later) - is based around a traditional 50Hz interlaced system (50i). In a 50i system like this there are 50 fields per second, thus 25 full frames per second. This is a very straight-forward, rational system that results in nice numbers for the sample count per frame when you use just about any standard sample rate, even music-oriented rates. When working with PAL at 48kHz, for example, each frame consists of 1920 audio samples. Nice and even.

Speaking purely geographically, most of the world uses PAL (yes, or SECAM, happy?).

NTSC: "29.97" Frames-per-Second

Naturally, North America does not use PAL. Neither does Central America, parts of South America, Japan, South Korea, Taiwan, Burma and a few other countries. These countries all use a video standard known as NTSC - which stands for National Television System Committee, or as some prefer Never Twice the Same Color - which was the original standard for television signals.

Originally, televisions had a 60Hz field refresh signal, which would have given us a nice even frame rate of 30fps. Naturally, some 50s nerd with a pocket protector, a pipe and the ubiquitous horn-rimmed spectacles (what I imagine Buddy Holly's father looked like) decided that this would never do. For reasons that make little sense to me, no matter how many times I read about it, when color was introduced to television, the frame rate was reduced to approximately 29.97fps (something about chroma signal sub-carrier frequency and the interference of... *snore*). And so we have the mess that is NTSC.

To make matters even more interesting, of course the rate is not really 29.97, but we will get into that fully later in the "True NTSC Rate" section. Suffice it to say, for now, that NTSC is about 0.1% off of 30fps, and that is plenty enough trouble to get on with.

The NTSC 29.97 rate is the fastest of the traditional frame rates, weighing in at a mere 1601.6 audio samples per full frame of video, when working at 48kHz sample rate. Yes, 1601.6 samples per frame... Which brings us back to the whole "cadence run" thing mentioned earlier: the cadence run of 29.97 at 48kHz is 8008 samples in 5 frames. The SMPTE standard defines the cadence run as the following sequence: 1602, 1601, 1602, 1601, 1602. There are not a lot of times when knowing the cadence run is useful - after all it catches back up to even again after 5 frames - but if you ever need to know it, there it is. If you think that is bad, you should see the cadence for 29.97 at 44100Hz... Footnote: The specification for DV video actually gives a different cadence run for 29.97 at 48kHz. But I'm not going to quote that, because we would really be getting off into the esoteric if I did that, now wouldn't we?
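
One simple way to reproduce that distribution is to snap each frame boundary to the nearest whole sample and take the differences; for this combination that happens to match the SMPTE sequence quoted above (the DV spec's different cadence, mentioned in the footnote, distributes the samples differently). A Python sketch, for illustration only:

    from fractions import Fraction

    def cadence_sequence(sample_rate_hz, frame_rate):
        per_frame = Fraction(sample_rate_hz) / frame_rate      # exact samples per frame
        run = per_frame.denominator                            # frames in the cadence run
        edges = [round(per_frame * i) for i in range(run + 1)] # boundaries snapped to samples
        return [edges[i + 1] - edges[i] for i in range(run)]

    print(cadence_sequence(48000, Fraction(30000, 1001)))      # [1602, 1601, 1602, 1601, 1602]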

Other NTSC Rates: "59.94" and "23.976" or "23.98" Frames-per-Second

When it came time to do HD television, the white coats and horn-rimmed spectacles were gone, replaced mostly by Japanese hardware engineers and cola-swilling software engineers. They had a really excellent opportunity here to ditch the mistakes of NTSC, and go with a nice even frame rate.

They didn't.

Naturally...

So for HD in NTSC countries, we are stuck with promising-sounding names like 1080i60, but in reality the "60" is approximately 29.97 fps. "But wait", you say, "how can 60 be approximately 29.97?" Well, the "i60" (or "60i") actually refers to the field count, since it is interlaced (thus the "i"), rather than progressive, so this refers to "60 fields per second", and since there are two fields per frame, we arrive at 30 fps, pulled down to the NTSC rate of our old friend 29.97 fps. Technically, it is sometimes useful (to someone, at least) to retain the notion of the field rate of 60, and still express it at NTSC rate, and thus was born the approximate field rate of 59.94, but in practice this is just the standard NTSC rate of 29.97.

Another common HD format is 720p24, which is 24 fps progressive (the "p"; you're catching on, aren't you? clever reader!), i.e. 24 full non-interlaced frames per second. As if it were that easy... Naturally, it is not really 24, but rather approximately 23.976. And to further confuse things, this is sometimes further approximated to 23.98, but it is still the same rate. This ends up being the slowest rate of them all, clocking in at a colossal 2002 samples per frame (at 48kHz); well OK, it's only slightly slower than 24 fps, but it is handy that it works out to an even 2002 samples per frame, anyway.

Note that 23.976 fps is also used in film workflows, in addition to the HD format. It is convenient during film transfer to simply copy every film frame to a video frame, slow the whole thing down by about 0.1%, and call it a day. If needed, then, the resulting 23.976 fps video can be converted to 29.97 fps (via a process called pull-down), and this process can be undone via "reverse pull-down" (why they didn't just call it "pull-up" is beyond me...). But that starts to stray into the territory of the video side of the world, and we're audio people, so needless to say sometimes you will get projects in 23.976...

Film to Video: Telecine and Back Again

Film is converted to video via a process called Telecine, which involves optically scanning the actual film frames (either in negative or positive form), and storing the result as a video signal - traditionally an analogue signal, but nowadays straight to digital.

For PAL, the frames are usually just played one-for-one. If the film was originally a theatrical release, then this means that the audio is generally sped up by about 4% in the process of converting the frames from 24fps film rate to 25fps PAL rate. This results in an increase in both speed and pitch. Sometimes the pitch is corrected by running it through a pitch shifter (shifting down by about a semitone), but often it is just left sped up, and apparently PAL audiences just live with it.
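
To put rough numbers on that (just the arithmetic, nothing more): the speed-up ratio is 25/24, which works out to about 4.2% faster and roughly 0.7 of a semitone sharp.

    import math

    speedup = 25 / 24                          # 24 fps film played out at 25 fps
    print((speedup - 1) * 100)                 # ~4.17 (percent faster)
    print(12 * math.log2(speedup))             # ~0.71 (semitones sharp, i.e. "about a semitone")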

When a PAL video is shot originally on film (for that coveted "film look"), it is sometimes actually filmed using a slightly overcranked camera that runs at 25fps, which makes the whole process much easier, since there is no change to the audio. Another condition that results in no audio change is when the film was shot at 24fps, but the Telecine process involves a Euro pulldown process (known as 2:2:2:2:2:2:2:2:2:2:2:3 pulldown), in which an extra field is inserted into the mix every 12 frames; since this results in no timing change from the original, it is not an issue audio engineers have to deal with much.

Of course for NTSC, things involve lots more drama... With NTSC, both the video and the audio are altered during the Telecine process. The video is subjected to a 2:3 pulldown process, in which every 4 film frames produce 5 video frames by repeating fields in a staggered fashion, thus smearing the correction visually and temporally. It is noticeable if you watch for it, and it does result in the perception that films often appear "not quite right" when viewed in NTSC. The result, though, is that the program material is not substantially changed temporally, so that the overall length of the program remains almost the same.

The devil is in the details, naturally, and the "almost" from above certainly impacts audio. The reason for this is that while 24 frames of film become 30 frames of video, these 30 frames of video are not displayed in 1 second (as the original 24 frames of film were), but rather in slightly more than a second, because the video runs at the NTSC rate of (about) 29.97 fps. The result is that audio that was recorded at a real production sample rate of 48000 Hz must be slowed down accordingly, placing it at a new rate of "about" 47952 Hz (actually 47952.047952047952(...) Hz, but who's counting?). Now for the most part this is taken care of by your DAW or NLE (non-linear editor) workstation or software, but when it is not, you will need to manually pull down the audio. This can be done either by applying a speed change (slow-down) to your audio, or by hacking the audio header to read 47952 Hz.
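
The pulled-down rate falls straight out of the 1000/1001 differential; here is the arithmetic as a quick Python check (note that relabeling the header to a flat 47952 Hz is itself a very slight approximation):

    from fractions import Fraction

    pull_down = Fraction(1000, 1001)           # the NTSC differential
    print(float(48000 * pull_down))            # 47952.047952047952... Hz

    # The "header hack" to 47952 Hz slows playback by exactly 0.999,
    # whereas the true pull-down is 1000/1001 (0.999000999...): close, not identical.
    print(float(Fraction(47952, 48000)))       # 0.999
    print(float(pull_down))                    # 0.999000999000999...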

Note that in some cases (rarer nowadays), your final destination is going to be back to film, in which case the speed changes will then be undone during the reverse Telecine process. If this is the case, be careful to take that into consideration: when you introduce new sound, you must remember that it will be sped up slightly (for NTSC) or slowed down (for PAL) during that process. Material that is pitch- and/or time-sensitive will be affected by this.

NTSC: More Than You Ever Wanted to Know

There are some subtleties to the NTSC television format, naturally, that rear their ugly heads on occasion, typically when we are lulled into a false sense of security, and think we are clever and have it all sorted out. Prominent among these bugaboos is the distinction of the true NTSC rate.

True NTSC Rate versus "29.97"

One frequent confusion regarding NTSC-rate video stems from people thinking that the "29.97" frame rate actually is 29.97 fps (frames per second).

It isn't...

It is actually exactly 30000 / 1001, or 29.97002997002997(repeats) frames per second. The SMPTE spec is clear on this, and this is the actual rate of "29.97" NTSC video.

Most references (including the Wikipedia entry for NTSC) will state the frame rate as "about" or "approximately" 29.97, glossing over the finer details, which are rather esoteric. Unfortunately, some references leave out the "approximately", which leads many to believe that the true rate really is exactly 29.97 fps.

To further confuse matters, some references will refer to a 0.1% difference in the frame period. This is technically correct, but misleading because while the period of a frame differs by exactly 0.1% (i.e. 1001/1000) from true 30 fps, the difference in the rate is not such a nice number. Remember, period is the inverse of rate.

The difference in rate is exactly 1000/1001; thus the NTSC rate is 30000/1001 (i.e. 30 fps multiplied by the differential of 1000/1001), which is 29.97002997002997(repeats) fps. Again, the NTSC rate differential does increase the frame period by exactly 0.1%, because period is the inverse of rate: the period ratio is 1001/1000, but the frame rate ratio is 1000/1001, which is 99.9000999000999(repeats)%.
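
If the rate-versus-period distinction still feels slippery, the exact fractions make it concrete (a quick check in Python, not anything from a spec):

    from fractions import Fraction

    ntsc_rate = Fraction(30000, 1001)              # true NTSC frame rate, in fps
    print(ntsc_rate / 30)                          # 1000/1001: the rate ratio (~99.9001%)
    print((1 / ntsc_rate) / Fraction(1, 30))       # 1001/1000: the period ratio (exactly +0.1%)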

Audio and Samples per Frame

From an audio perspective, this weird rate differential actually makes things much easier. At the true NTSC rate, audio has exactly 1601.6 samples per frame (note that samples/frame is the frame period in samples), or, more to the point, 8008 samples per 5 frames, repeating as the pattern {1602, 1601, 1602, 1601, 1602}, which you may hear referred to as the "SMPTE cadence". If it truly were exactly 29.97 fps, then audio would run at 1601.601601601601(repeats) samples per frame, which actually does resolve to an even 1,600,000 samples every 999 frames, but that's a mighty awkward pair of numbers.

Of course the error creep due to this subtle difference is gradual, and for video is largely inconsequential. For video, you would not be off by a full frame until after 1,000,000 frames, which is about 9 hours. But for sub-frame accuracy (which is more a concern for audio), you would be off by 1 sample after 1,000,000 samples, which (at 48kHz) is just 20 seconds. Furthermore, using the incorrect differential, one would be off by about a tenth of a frame after "just" an hour! That is a pretty significant difference, and an hour's worth of material is not that unusual in audio-for-video work.
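
If you want to check where those numbers come from, here is the arithmetic (assuming a working rate of 48kHz and an hour of material):

    from fractions import Fraction

    true_rate = Fraction(30000, 1001)              # the real NTSC frame rate
    assumed_rate = Fraction(2997, 100)             # "exactly 29.97", the incorrect assumption

    samples_true = 48000 / true_rate               # 1601.6 samples per frame
    samples_assumed = 48000 / assumed_rate         # 1601.6016016... samples per frame
    frames_per_hour = true_rate * 3600             # ~107892 frames actually played in an hour

    drift = frames_per_hour * (samples_assumed - samples_true)
    print(float(drift))                            # ~172.8 samples off after one hour
    print(float(drift / samples_true))             # ~0.108 frames, i.e. about a tenth of a frame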

Timecode

In order to edit film or video, or to edit audio for post, the editor needs a way to identify specific locations in time, relative to the film or video. This is done through timecode, which is a way of uniquely tagging or numbering frames in the video sequence, so that an editor can correlate events to their position in the video sequence.

Since video is all about the presentation of frames over time, the tagging system is based on time. But since the rate of video is greater than one frame per second, we need a finer granularity than just hours, minutes and seconds. So timecode includes a frames unit after the seconds, which indicates the count of the number of frames that have elapsed within the current second.

In some situations in audio post, it may be necessary to have sub-frame accuracy; this is because audio events often happen at a finer granularity than video. Sub-frames are not typically available in video- or film-editing systems; they are typically found in audio post systems. In some such systems, timecode is presented with a fractional sub-frames element. Some systems present this as a decimal format, in 100ths or 1000ths of a frame, whereas another, more traditional, representation offers 80 sub-frames per frame. If a sub-frame representation is available, you will need to check the documentation for that system, or check with the editor or assistant editor, to find out what the units are.
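
As an illustration of the traditional 80-sub-frames-per-frame representation, here is a hypothetical little conversion from a sample offset to frames-plus-sub-frames (the function and its behaviour are my own example, not any particular system's):

    def to_subframes(sample_offset, samples_per_frame, subframes_per_frame=80):
        # Whole frames first, then scale the remaining samples into sub-frames.
        frames, remainder = divmod(sample_offset, samples_per_frame)
        subframes = (remainder * subframes_per_frame) // samples_per_frame
        return frames, subframes

    # PAL at 48kHz: 1920 samples per frame, so each sub-frame spans 24 samples.
    print(to_subframes(4800, 1920))   # (2, 40): two frames plus half a frame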

Formats

While there are some more esoteric timecode formats - such as feet and frames, keycode or metres and frames - by far the most common format is time-based timecode. This time representation consists of two digits each of hours, minutes, seconds and frames, in the format HH:MM:SS:FF. The frames component differs based on frame rate, and on dropped-ness, as we will see below, but the rest is just a straight time representation, of hours, minutes and seconds. Timecode always has a timecode frame rate, indicating the number of frames per second represented in the timecode; this is not always the same as the video rate, especially when NTSC rates are involved.

It is important to note that timecode is a mechanism for counting frames. While the representation of timecode has an obvious correlation to a clock, it does not necessarily represent true ("wall clock") time. In some cases, the rate of the timecode differs from the true passage of time, with respect to how the frames are actually viewed or presented. In other cases, timecode represents the time at which the source video material was captured, rather than the time at which it is presented within a video sequence.

Regardless, even when timecode runs at the same rate as the clock on the wall, timecode is relative to some starting point. Audio editors most often use timecode values that are relative to the start of the video sequence, or full program material, upon which they are working. Timecode then provides the audio editor with a relative base to locate events to sync with video. In this respect, timecode is very important, as it provides a common language that audio editors can use to communicate with video editors, directors, and other primarily video-oriented folks.

Note that the range of timecode runs from the "timecode midnight" time of 00:00:00:00 through to the maximum timecode value of 23:59:59:ff (where the frames portion varies, based on the timecode frame rate). Timecode cannot go negative, i.e. there are no timecode values before 00:00:00:00. For this reason, editors will often start a program off with a timecode offset of 1 hour, i.e. they start at 01:00:00:00. This provides space for pre-roll events, such as count-downs, or other events that must occur before the start of the program material, such as calibration material like bars-and-tone. Do not be surprised or alarmed when material is delivered to you with a starting timecode other than 00:00:00:00; this is more normal than not, and just remember that timecode is always relative to some starting time. When in doubt, if it is not obvious, ask the video editor or director with whom you are working for clarification.

Non-Drop Frame

If you are unlucky enough to work with NTSC, we will get to your particular set of complications shortly, but for the sake of easing into things, we will discuss "rest of world" first... For PAL, and film, and other scenarios of (relative) sanity, the timecode frame rate is typically the same as the film/video rate, and the timecode is counted in a way that is basically equivalent to the passage of real ("wall clock") time.

What does this mean? It means that when playing video back at normal ("1x") speed, 1 minute of timecode time goes by in 1 minute of playback, and 1 hour of timecode time goes by in 1 hour of playback. There is no difference between the timecode rate, and the wall clock rate. Of course, the specific timecode that you see will not reflect the current time-of-day, because timecode is relative to the start of the program material, but elapsed time for both will be the same.

More to the point, non-NTSC editing systems will almost always use a timecode format called non-drop display. The irony here is that "non-drop" is actually the way of counting frames that makes the most sense. In a "non-drop" system, frames are counted normally, i.e. there are no gaps or breaks or dropped frames: the frames portion of the timecode proceeds from frame 0 to the maximal frames-per-second count, i.e. one less than the number of frames per second of the timecode frame rate, just like seconds and minutes. For example, in a 25 frames-per-second system, the frame count goes from 0 to 24; at the next frame, the frame count goes back to 0, and the second count is increased by one.
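
Counting non-drop timecode is just nested division: frames roll over into seconds, seconds into minutes, minutes into hours. A minimal sketch (illustrative only; 25 fps shown, plus a 30 fps value we will meet again in the NTSC drift example below):

    def nondrop_timecode(frame_number, fps):
        # frame_number is a 0-based count of frames from the start.
        seconds, frames = divmod(frame_number, fps)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return "%02d:%02d:%02d:%02d" % (hours, minutes, seconds, frames)

    print(nondrop_timecode(24, 25))     # 00:00:00:24 (last frame of the first second)
    print(nondrop_timecode(25, 25))     # 00:00:01:00 (frames roll over, seconds increment)
    print(nondrop_timecode(1798, 30))   # 00:00:59:28 (see the drift example in the next section)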

Since this is the most obvious way that timecode should be counted, why is it given the special name "non-drop"? The answer is that it only has a special name because it contrasts with the infamous drop-frame display. It could easily also be called "normal", or even, if you like, "rational" timecode... If you only ever work with non-drop timecode, you will not know why anyone ever came up with the term "non-drop", since it is a perfectly normal way to count frames.

This is a happy place to be in. Cherish your good fortune, if you find yourself able to work like this...

Drop Frame

So if there is "non-drop" timecode, there must also be "drop"... I think Dante described drop-frame timecode in a novel...

A problem arises when dealing with timecode for NTSC video rates. Because timecode is a count of whole frames, timecode for NTSC video is measured not in the true video rate of "about" 29.97 frames-per-second, but rather in the exact timecode frame rate of 30 frames per (timecode) second. But if we count frames in "non-drop" 30 fps timecode, where each video frame has a timecode one greater than the previous, then we end up with the timecode drifting away from the passage of real (wall clock) time.

This may seem counter-intuitive at first (and it is), so let's illustrate this with a concrete example. Let's suppose we are working with NTSC video with timecode counted as non-drop. For simplicity's sake, let's further suppose that we started the video at timecode "00:00:00:00" (we forgot to use an offset of 1 hour, sigh...). We start a stopwatch, while at the same time starting playback of the video from the beginning. After one minute has elapsed by the stopwatch, we stop both the video playback and the stopwatch. We are very careful, and very accurate, and we see that, indeed, one minute of time has passed by the stopwatch (i.e. in "wall clock" time). Now we look to our timecode burn-in on the video, and we see... Can you guess? It's not "00:01:00:00"... We see "00:00:59:28"...

What happened? Well, recall that our video rate is not 30 fps, it is "about" 29.97 fps. This means that after 1 second of wall clock time has elapsed, we have gone through (about) 29.97 frames of video. So after 1 minute of time, we have played (approximately) 1798 frames (60 seconds times 29.97 frames per second is 1798.2 frames). Since we are displaying this with non-drop timecode, where the frames are counted sequentially, with no missing frames, and 30 frames per timecode second, every 30 frames we advance the timecode by 1 second and reset the frames to 0, so we have advanced by 59 full seconds, and 28 full frames. Our timecode measurement was accurate, and our wall-clock time measurement was accurate, but since our timecode frame rate and our true video playback frame rate are not the same, the result is not the same.

In short, when counting frames at 30 fps, but playing them back at (about) 29.97 fps, we have an inherent drift between timecode time and real (wall clock) time. After only 1 minute, they have drifted apart by (a bit less than) 2 frames; after 1 hour, this drift will have accumulated to (about) 108 frames. This is a highly significant difference. We cannot simply ignore it and hope it goes away...

So what do you do with a mess, when you cannot ignore it? You sweep it under the rug, of course! Thus was born drop-frame timecode. With drop-frame timecode, frame numbers are skipped - or "dropped" - in order to "catch up" the timecode time to real time. In between points may be slightly out-of-sync between timecode time and real time, but overall the time gets back in sync because of these skipped frames.

So that's the general approach, but the actual way this is done is quite specific: 2 frames are dropped for every full minute (of timecode time), except for every full 10 minutes, for which the two frames are not dropped. This sounds a bit confusing, but we'll step through it in a moment. The result, though, is that every 10 minutes (of timecode time), 18 frames are skipped; thus every timecode hour 108 timecode frames are skipped.

Now for an example, we start at the drop-frame timecode 03:18:59;27. Step a couple frames forward, and we arrive at 03:18:59;29. Now our drop-frame rule tells us that every full minute (other than every full ten minutes) we skip two frames, so the next frame is not 03:19:00;00, but is instead 03:19:00;02 (because we skipped two frames). Now we advance through the rest of that minute, and arrive at 03:19:59;29. If we then advance yet another frame, we do not skip two frames this time, because we are at a full-ten-minute mark, so the next frame actually is 03:20:00;00.

Note that because of this drop-frame counting scheme, there can never be a drop-frame timecode that ends in ";00" or ";01", unless the minutes digits also end in a "0". So if you see a timecode like "03:23:00;01", that is not a valid drop-frame timecode value, because in drop-frame timecode, it should have skipped two frames from "03:22:59;29" to "03:23:00;02".
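
To make the counting rule concrete, here is one common way to convert a frame count to drop-frame timecode (a sketch, not quoted from any spec); note how it skips ;00 and ;01 at a normal minute boundary but not at a ten-minute boundary:

    def dropframe_timecode(frame_number, fps=30):
        drop = 2                                       # frame numbers dropped per minute
        frames_per_10min = 10 * 60 * fps - 9 * drop    # 17982 real frames per 10 timecode minutes
        frames_per_min = 60 * fps - drop               # 1798 real frames per dropped minute

        tens, rem = divmod(frame_number, frames_per_10min)
        if rem > drop:
            frame_number += drop * 9 * tens + drop * ((rem - drop) // frames_per_min)
        else:
            frame_number += drop * 9 * tens

        frames = frame_number % fps
        seconds = (frame_number // fps) % 60
        minutes = (frame_number // (fps * 60)) % 60
        hours = frame_number // (fps * 3600)
        return "%02d:%02d:%02d;%02d" % (hours, minutes, seconds, frames)

    print(dropframe_timecode(1799))     # 00:00:59;29
    print(dropframe_timecode(1800))     # 00:01:00;02  (;00 and ;01 are skipped)
    print(dropframe_timecode(17982))    # 00:10:00;00  (no skip at the ten-minute mark)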

This was mentioned before, but it bears repeating, because it is so important: drop-frame is a timecode counting mechanism, it is not a frame rate (timecode or otherwise). Furthermore, it has nothing to do with actual video frame rates. Lots of folks confuse the two, and refer to NTSC video rates as "drop-frame", but that is absolutely not accurate. The only relationship between the two is that the NTSC video rate of (about) 29.97 often (but not always) is counted using drop-frame timecode. However, it can easily also be counted in non-drop (you just get out of sync with wall clock if you do), and doing so has absolutely no effect on the speed at which the video is running. Drop-frame is about how frames are counted, not the speed (rate) at which the video runs.

Carrier Signals

Timecode is not very useful if it simply stays in the camera, so timecode is usually recorded along with the recorded video or film images. Furthermore, timecode is often also carried simultaneously to multiple devices (cameras and sound recorders) for purposes of synchronizing location information, or even clocking (although there are higher-accuracy clock forms that are typically used for this, some of which are also capable of carrying timecode). To carry timecode from one device to another, it must be encoded into a signal that can carry it, and likewise it needs to be able to be placed into recorded media, so it can be carried over to the post-production environment for use in editing.

Linear/Longitudinal Timecode (LTC)

For us older audio engineers, the most common form of timecode signal we would encounter in the past was LTC (pronounced as "lit-sey"), which stands for either linear timecode or longitudinal timecode, depending on whom you ask... This is actually an audio signal, typically an analogue signal, carried over audio cables or even recorded onto a spare track of an audio recording device.

The signal carries the timecode data, encoded into the analogue signal. Technically, this is done using a bi-phase signal composed of square waves of various pulse-widths, where a full-period in one direction is used to indicate a value of zero, and two half-periods alternated are used to indicate a value of one. These individual zero-or-one values (called "bits") are then collected into 80-bit words that span the time of a video frame. There are lots of interesting (nay, fascinating...) technical details involved in the encoding, but the net result is an audio signal that is very robust in the face of distortion, conversion, signal attenuation, drop-outs, and even sample-rate-conversion, and carries the timecode signal over any system that is capable of carrying an audio signal (analogue or digital).
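
To illustrate the zero/one signalling just described (and only that; this is not the full 80-bit LTC word layout), here is a toy encoder that emits two half-period levels per bit: every bit starts with a transition, and a one gets an extra transition mid-bit.

    def biphase_mark(bits, start_level=1):
        level = start_level
        halves = []
        for bit in bits:
            level = -level              # transition at the start of every bit period
            halves.append(level)
            if bit:
                level = -level          # extra mid-period transition encodes a one
            halves.append(level)
        return halves

    print(biphase_mark([0, 1, 1, 0]))
    # [-1, -1, 1, -1, 1, -1, 1, 1]: a zero holds one level for the full period,
    # a one flips halfway through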

The LTC signal is not very interesting to listen to, and is often used as a cliche sound effect for computers, since it sounds similar (albeit only vaguely) to old-style computer modems. Not that anyone knows what a modem sounds like nowadays, but for some reason everyone always thinks computers or data communications when they hear one of these modulated signals.

MIDI Timecode (MTC)

Of course, since timecode is, at its essence, a periodic digital signal, it is much easier to carry timecode over an existing digital transmission system. One of the oldest such systems still in use today is MIDI. Indeed, there is a system of MIDI events defined by the MIDI standard for carrying timecode over MIDI as a sequence of MIDI messages. This is known as MIDI timecode, abbreviated MTC. It has the distinct advantage of being easy (and cheap) to encode, carries over any existing MIDI connection (hardwired, wireless or through software), and can be recorded via many MIDI sequencers.
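
For a flavour of what this looks like on the wire, here is a sketch of building the eight quarter-frame messages for one timecode value, following the commonly documented MTC layout (status byte 0xF1, then a data byte whose upper nibble says which piece of the timecode is being sent); treat the bit-level details as illustrative rather than authoritative, and check the MIDI spec before relying on them.

    def mtc_quarter_frames(hours, minutes, seconds, frames, rate_code=3):
        # rate_code: 0 = 24fps, 1 = 25fps, 2 = 30fps drop-frame, 3 = 30fps non-drop.
        pieces = [
            frames & 0x0F,  (frames >> 4) & 0x01,
            seconds & 0x0F, (seconds >> 4) & 0x03,
            minutes & 0x0F, (minutes >> 4) & 0x03,
            hours & 0x0F,   ((hours >> 4) & 0x01) | (rate_code << 1),
        ]
        return [(0xF1, (piece << 4) | data) for piece, data in enumerate(pieces)]

    for status, data in mtc_quarter_frames(1, 0, 0, 0):
        print("%02X %02X" % (status, data))    # F1 00, F1 10, ... F1 76 for 01:00:00:00 non-drop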

Vertical Interval Timecode (VITC)

Timecode can also be carried within an analogue video signal, by encoding it into parts of the video signal originally left unused because they corresponded to the vertical retrace of the electron gun in an old CRT television. Because it uses this retrace interval, it is called vertical interval timecode, or VITC for short (pronounced "vit-sey"). In a video studio, an empty video signal is often sent throughout the building and used to synchronize all the devices to a single video clock; this signal is called blackburst, and usually carries a VITC signal within it. Since this page is largely devoted to audio post, rather than video editing, I will gloss over blackburst and VITC, but I mention them here so that you know the terms, should they come up.

Embedded Timecode

As more and more video systems go entirely digital, the more common case nowadays is for the timecode to be carried with, and stored with, the digital video. This is most often done in one of two ways: either the timecode is actually buried in ("muxed" with) the digital data that corresponds to each frame of video (and often audio), or the timecode is embedded as metadata in the container file that contains the video (and/or audio) media. Examples of the former are systems like DV and SDI, where the audio and video and timecode are all multiplexed into packets for transportation or for storage on media. Two examples of timecode as metadata are QuickTime movies, which store the timecode as a separate track, and broadcast-WAVE files, which carry timecode as a single piece of file-wide metadata, indicating the starting timecode of the audio.
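
For the broadcast-WAVE case, that file-wide metadata is a sample count: the bext chunk's TimeReference field holds the number of samples since midnight. Turning it into a starting timecode is simple arithmetic; a sketch, assuming non-drop counting (the helper is hypothetical, not any library's API):

    def bwf_start_timecode(time_reference_samples, sample_rate_hz, fps):
        # TimeReference is a sample count since midnight; convert to whole frames first.
        total_frames = (time_reference_samples * fps) // sample_rate_hz
        seconds, frames = divmod(total_frames, fps)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return "%02d:%02d:%02d:%02d" % (hours, minutes, seconds, frames)

    # One hour into the day, at 48kHz and 25 fps:
    print(bwf_start_timecode(48000 * 3600, 48000, 25))   # 01:00:00:00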

Conclusion

Hopefully this has been a useful read. There's a fair bit of info here, and some of it may be rather dry, but I think all of it is useful, when working on audio post stuff. Let us know if this is helpful, or not.


Copyright © 2010-2012 Kelly Jacklin. All rights reserved.