Some time ago, I tried to cut video files at specific timestamps. However, the tools I knew either did not cut accurately or had other drawbacks with those specific types of videos. In the following, I describe what I learned while analyzing the problem and writing my own tool to cut my videos. Please note that the description below might not be fully accurate in every case, but it should give you a good idea of the problems and possible solutions.
Basics of modern video codecs
With many audio and video codecs, streams can be cut by simply truncating them at the desired position, e.g., after the last or before the first video frame one wants to keep. For recent codecs like h264, however, this is not as simple, as such video streams use inter-frame prediction and contain different types of frames that are not chronologically ordered. An h264 stream may contain the following frame types:
- I frame (intra coded)
- P frame (predictive coded)
- B frame (bidirectionally predictive coded)
An I-frame is a so-called key frame that contains a complete picture. If your stream only contains I-frames, you could simply cut the stream after each frame as explained in the beginning. However, P- and B-frames do not contain a complete image and depend on the data in other frames. Hence, those other frames are required to create a picture ready for presentation from a B- or P-frame. This is especially beneficial if parts of the picture do not change for some time, e.g., in a static scene. Then, the P- and B-frames only contain the changes from the initial I-frame. While P-frames only depend on prior frames, B-frames also depend on subsequent frames. Therefore, you cannot cut after a B-frame as it might need the data of frames that come afterwards. One might think that it should be possible to cut after a P-frame as it only needs prior frames but this is also not possible in every case as we will learn in the following sections.
As mentioned, h264 streams may not only contain different types of frames but also frames that are not chronologically ordered. In an h264 stream, frames appear in decode order. This means that a full image can be calculated for the current frame at any time because each frame that contains necessary data is placed in the stream before the current frame. For example, if we look at the following sequence of pictures that are chronologically ordered:
I B B P
the first frame is independent, the following two B-frames depend on the I- as well as the P-frame and the P-frame depends only on the I-frame. If we want to be able to get a full image after reading each frame, the frames have to be in this order:
I P B B
Again, the I-frame is independent, the P-frame only depends on the I-frame and the two B-frames depend on the first two frames. After reading each frame, we have enough data to generate a full picture out of this frame but we cannot cut the stream after the P-frame, for example, as the two B-frames should be presented to a viewer before showing the P-frame.
To simplify the work with such streams, each frame usually has two numbers: a presentation time stamp (PTS) and a decode time stamp (DTS). Both contain a simple integer number that states at which time the frames have to be shown to a viewer and at which time they have to be decoded. For smooth video playback, for example, a video player will decode the frames as they are in the stream, i.e. with increasing DTS, and order the frames with increasing PTS afterwards before drawing them on the screen. If we look at the above example, the frames would have the following values:
Type  I  P  B  B
PTS   1  4  2  3
DTS   1  2  3  4
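The reordering a player performs can be illustrated with a small Python model of the table above. This is only a sketch: the `Frame` tuple, `DECODE_ORDER` and `presentation_order()` are names made up for illustration and are not part of the ffmpeg API.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    kind: str  # "I", "P" or "B"
    pts: int   # presentation time stamp
    dts: int   # decode time stamp


# The example stream in the order it is stored in the file (decode order).
DECODE_ORDER = [
    Frame("I", pts=1, dts=1),
    Frame("P", pts=4, dts=2),
    Frame("B", pts=2, dts=3),
    Frame("B", pts=3, dts=4),
]


def presentation_order(frames):
    """Return the frames sorted by PTS, i.e. in the order a viewer sees them."""
    return sorted(frames, key=lambda frame: frame.pts)
```

Sorting the decoded frames by PTS restores the chronological sequence I B B P, while the DTS values simply follow the order in the file.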
If we look at this again from the perspective of a cutting tool, doing the right thing is now considerably more complicated. First, we have to check the type of the last frame we want to keep. If it is an I-frame, we can simply cut the stream after the frame. If it is a P-frame, we have to cut the stream after the frames that have a lower PTS than our P-frame. If it is a B-frame, we cannot simply cut the stream, as the frame depends on following frames that we do not want to keep. To make things worse, we have a similar problem if only a part of the frames shall be dropped and the stream shall continue afterwards. For example, if the playback shall continue with a B-frame, we might need the data of previous frames that have been dropped. If a player started the playback with a P- or B-frame, one would see a mostly corrupted image for some time as parts of the picture are missing. However, one can still perceive the movements in the video as this information is stored in those frames.
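To make the three cases more tangible, here is a minimal Python sketch that checks whether a stream in decode order can be cut by plain truncation. The function name `truncation_index()` is hypothetical, and the check only looks at the ordering of the PTS values; it does not model the actual references between frames.

```python
def truncation_index(pts_in_decode_order, last_pts_to_keep):
    """Return how many leading frames to keep so that exactly the frames with
    PTS <= last_pts_to_keep survive, or None if plain truncation cannot do it."""
    keep = sum(1 for pts in pts_in_decode_order if pts <= last_pts_to_keep)
    head = pts_in_decode_order[:keep]
    # Truncation only works if the first `keep` frames are the ones we want.
    if all(pts <= last_pts_to_keep for pts in head):
        return keep
    return None
```

For the example stream with the PTS values 1 4 2 3 in decode order, cutting after the I-frame (PTS 1) or after the P-frame (PTS 4) yields a valid index, whereas cutting after one of the B-frames (PTS 2 or 3) yields `None` because the P-frame sits in front of them in the file.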
Therefore, many tools that cut video streams either cut only at I-frame boundaries or require a complete reencoding for frame-accurate cutting. Consequently, one has the choice between inaccurate cutting or quality loss and additional time overhead for the reencoding. Instead, avcut combines both approaches: the frames in a stream are copied as long as no cut is required in a so-called group of pictures (GOP) that consists of dependent frames, i.e. from an I-frame to the frame before the next I-frame. If a cut in such a group is required, all frames since the last I-frame are reencoded in order to ensure that the last frame before the cut does not depend on frames after the cut. Likewise, the frames are reencoded if the stream shall start or continue with a frame that is not an I-frame.
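The decision per GOP can be sketched roughly as follows. This is a simplified Python model under my own assumptions, not the code of avcut: frames are `(type, PTS)` tuples in decode order, only a single span `[drop_from, drop_until)` is removed, and the names `split_into_gops()` and `plan_gop()` are hypothetical.

```python
def split_into_gops(frames):
    """Group (kind, pts) tuples in decode order; every I-frame starts a new GOP."""
    gops = []
    for frame in frames:
        if frame[0] == "I" or not gops:
            gops.append([])
        gops[-1].append(frame)
    return gops


def plan_gop(gop, drop_from, drop_until):
    """Decide what happens to one GOP if all PTS in [drop_from, drop_until) are cut."""
    dropped = [drop_from <= pts < drop_until for _, pts in gop]
    if not any(dropped):
        return "copy"      # no cut inside: packets can be copied unchanged
    if all(dropped):
        return "drop"      # the whole GOP is removed
    return "reencode"      # cut inside the GOP: remaining frames are reencoded
```

With three GOPs covering the PTS ranges 0-3, 4-7 and 8-11 and a removed span of `[6, 12)`, the first GOP is copied, the second one has to be reencoded and the third one is dropped entirely.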
Besides avcut, there are other tools that are able to provide frame-accurate cutting on Linux, e.g., avidemux. However, avidemux is a large project, e.g., with a GUI and many other features, and for my purposes I only need something light-weight that is easy to handle. To do most of the hard work like demuxing, decoding, copying, encoding and muxing the video, avcut uses the ffmpeg library.
Muxing refers to the merging of frames from different (audio, video, subtitles, ...) streams into one combined stream that can be written to a file. In turn, demuxing separates the different streams when reading the video from a file. This is more complicated than one might imagine as a combined stream might contain multiple audio, video or other types of streams, e.g., subtitles, and each of them has its own properties that have to be synchronized for accurate presentation to a viewer. How this is handled in a combined stream is specified by the so-called container format. Common containers are, for example, AVI or, my favorite, Matroska (MKV).
Implementation of avcut
In this article, I will not talk much about the common parts of the ffmpeg API and concentrate on the less obvious code that was necessary to implement avcut. For a beginner tutorial, see dranger.com/ffmpeg/ and the FFmpeg wiki. If you want to check out the code, the project is available on GitHub: https://github.com/anyc/avcut. Please note that the following description of the API might not be accurate anymore due to API changes in newer FFmpeg versions.
First, you have to initialize ffmpeg with a call to av_register_all() and then you can open and analyze the input file with avformat_find_stream_info(). Before the common initialization of the decoder for each individual stream with avcodec_open2(), a first adjustment is necessary. Later, when we want to read a new frame from a file with av_read_frame(), this function actually returns a so-called packet in an AVPacket struct. This is a generic structure that contains data from a specific stream, e.g., video or audio frames, and does not provide much information about the contained data. To get the type of the video frame in a packet of a video stream, for example, the packet has to be decoded into an AVFrame struct. As avcut will decode and buffer all frames and corresponding packets since the last I-frame, we need to set refcounted_frames to 1 for each AVCodecContext structure that represents the codec and its properties used by a stream. If we do not do this, the function avcodec_decode_video2(), which decodes a video packet into a video frame, will reuse the memory of a given AVFrame structure. Hence, the data in our buffer would become invalid. If refcounted_frames is set to 1, avcodec_decode_video2() will not reuse the data of an AVFrame and, consequently, we become responsible for freeing the memory of the frame later.
If we have to reencode parts of the stream, we want to encode them with similar settings. Therefore, we have to copy the AVCodecContext structure from the input to the output streams using avcodec_copy_context(). Unfortunately, not all encoder settings are stored in the original file and we have to set properties like the encoding quality ourselves. As the parts we will reencode are usually small, avcut uses high-quality settings for reencoding. I also had to set the thread_count property of the encoder to 1 as higher values caused inexplicable faults while closing the encoder later.
Another problem that took some time was the concept of global headers. The h264 decoder needs some extra information about the stream in addition to the video frames. This so-called extradata can be stored in the header of the file (global headers) or before each keyframe (local headers). Now, if you use the Matroska container, you have to enable global headers. If this flag is set and you reencode frames, none of the packets will contain extradata. However, if the original file uses local headers and we copy some packets, the output will contain packets with the original local headers next to the new global header, which seems to confuse video players if you seek forward and backwards between new and old packets. To solve this problem, avcut sets the global headers flag but also copies the extradata explicitly to the beginning of each keyframe.
After all settings have been initialized, avcut copies the original metadata, e.g., the video title, using
av_dict_copy, before avcut writes the header to the file with
Another difficult concept for beginners is the use of different time bases. As written earlier, there are presentation (PTS) and decode timestamps (DTS). These timestamps do not contain the actual time but an integer value that can be multiplied with the respective time base to calculate the time offset from the start of the video in seconds. If the PTS of a packet and of a frame are equal, this does not mean that they have to be shown at the same time as both can use a different time base. This is necessary as, for example, certain container formats and codecs only work with specific time bases. A time base is defined as a quotient of two integers, e.g., a time base of 1/1000 is common for a stream in a Matroska container. Frames usually use the time base of the codec (AVCodecContext) and packets the time base of the stream (AVStream). Hence, if we need to calculate the PTS of a frame from its packet, we have to use the following formula:
PTS_frame = (PTS_packet * timebase_stream) / timebase_codec
To make things worse, depending on the container of the input file, the packets may not contain a PTS value or the value may not be entirely correct. Luckily, ffmpeg provides a function called av_frame_get_best_effort_timestamp() that tries to find the right PTS value for the given frame. It is important to understand the time concept in ffmpeg as we want to compare the timestamps with the cut points given by a user later. Again, ffmpeg provides helper functions like av_q2d() that rescale an integer timestamp from one timebase to another and convert a timebase stored as an AVRational structure into a floating-point value that can be used in the above formula, respectively.
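The arithmetic behind these conversions can be tried out with a few lines of Python. This is only a model of the formula above, not the ffmpeg API: `Fraction` stands in for an `AVRational`, and the helper names `rescale()` and `to_seconds()` are made up. Note that `round()` follows Python's rounding rules, which may differ from the rounding ffmpeg applies.

```python
from fractions import Fraction


def rescale(timestamp, src_timebase, dst_timebase):
    """Convert an integer timestamp from one time base to another."""
    return round(timestamp * src_timebase / dst_timebase)


def to_seconds(timestamp, timebase):
    """Offset from the start of the video in seconds."""
    return float(timestamp * timebase)


STREAM_TIMEBASE = Fraction(1, 1000)  # e.g. a stream in a Matroska container
CODEC_TIMEBASE = Fraction(1, 25)     # assumption: one tick per frame at 25 fps
```

A packet with a PTS of 1000 in the stream time base corresponds to one second, and `rescale(1000, STREAM_TIMEBASE, CODEC_TIMEBASE)` gives 25 in the assumed codec time base, which is again one second.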
As we want to drop frames in the video later, we may also have to modify the original PTS for the frames after the deleted scenes. If we do not modify the PTS, there would be a jump in the timestamps from one frame to the next and a player might suspend video playback until the wall clock time matches the PTS of the next frame. To calculate the right PTS, we calculate the time between the two cut points in seconds and divide this value by the timebase. The result is the length of the removed part in timebase units, which equals the number of dropped frames if one unit corresponds to one frame, as shown in the following formula:
PTS_new = PTS_old - (resume_timestamp - interrupt_timestamp) / av_q2d(timebase_codec)
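As a worked example of this formula, the following Python sketch shifts a timestamp after a removed span. The function name `shifted_pts()` is hypothetical, the cut points are given in seconds, and `Fraction` again stands in for the time base.

```python
from fractions import Fraction


def shifted_pts(pts_old, interrupt_s, resume_s, timebase):
    """Move a frame after a removed span forward by the span's length in ticks."""
    dropped_ticks = round(Fraction(resume_s - interrupt_s) / timebase)
    return pts_old - dropped_ticks
```

If the part between second 10 and second 20 is removed and the codec time base is 1/25, the removed span is 250 ticks long, so a frame with an old PTS of 600 ends up at 350.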
The final step, before we start to actually process packets and frames in the code, is the calculation of the starting DTS. If we pass packets to the muxer later, we have to ensure that the DTS is always smaller than or equal to the PTS, as we cannot decode a packet after we have presented it. If we start the DTS from zero, this becomes a problem with the h264 codec as the frames are not in order. As P-frames come before B-frames with a lower PTS, the condition can be violated. In the example above, the last B-frame has a PTS of 3 but a DTS of 4. To satisfy this requirement nonetheless, the DTS of the first packet does not have to start with zero and avcut starts with a DTS of (0 - GOP_size) instead. Hence, not the absolute but the relative value to the reference point (the first packet) determines the time when a packet should be decoded.
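The effect of the negative starting DTS can be checked with a small Python sketch. It is a model under my own assumptions: the helper names are hypothetical, the DTS simply increases by one per packet, and the example stream is shifted so that its PTS values start at zero (0 3 1 2 in decode order).

```python
def assign_dts(pts_in_decode_order, first_dts):
    """Give consecutive DTS values to the packets in decode order."""
    return [first_dts + i for i in range(len(pts_in_decode_order))]


def dts_is_valid(pts_in_decode_order, dts_values):
    """Check the condition from the text: every packet needs DTS <= PTS."""
    return all(dts <= pts for pts, dts in zip(pts_in_decode_order, dts_values))


PTS_DECODE_ORDER = [0, 3, 1, 2]  # I P B B with PTS starting at zero
GOP_SIZE = 4
```

Starting the DTS at zero gives the first B-frame a DTS of 2 but a PTS of 1, which violates the condition. Starting at `0 - GOP_SIZE` keeps every DTS below its PTS.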
In the main loop of avcut, the packets are read from the input file with av_read_frame(), stored in a buffer and decoded into a frame that is also stored in the buffer. This repeats until a second I-frame has been read which causes the buffer to be flushed. If there is no cut point between the two I-frames, the PTS and DTS of the packets are adjusted and they are written to the output file. If there is a cut point, the same procedure is applied to the packets before or after the cut point for audio or other types of streams. For video streams, however, the frames are passed to the encode function avcodec_encode_video2() and the resulting packets are written to the output.
Again, the last step is a little bit more complicated with codecs like h264. When we pass a frame to the encoder, it can return a packet but it will not always do this and if it does, the resulting packet might not contain the frame that we passed to the function. The same uncertainty applies to the decode function, by the way. The encoder buffers several frames in order to analyze them and to find similarities. So you might pass a few frames to the encoder before you actually get a packet in return. Consequently, if we have no more frames to pass to the encoder, there are still packets we have not received. The encoder expects that we pass a NULL pointer instead of a frame in such a case until it returns no more packets. After being flushed this way, some encoders crash if we start to pass frames to them again. However, this is exactly what avcut has to do if we cut the video multiple times. To avoid a crash, the codec has to be closed and reopened each time we finish encoding a sequence of frames.
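The delayed output and the flushing can be illustrated with a toy model in Python. `ToyEncoder` and `encode_sequence()` are invented for this sketch and do not mirror the ffmpeg API: the toy encoder simply holds back a fixed number of frames, `None` plays the role of the NULL pointer, and "reopening" is modelled by creating a fresh instance for each sequence.

```python
from collections import deque


class ToyEncoder:
    """Stand-in for an encoder that holds back `delay` frames before emitting."""

    def __init__(self, delay=2):
        self.delay = delay
        self.pending = deque()
        self.flushed = False

    def encode(self, frame):
        """Return a packet or None; pass frame=None to drain buffered frames."""
        if frame is None:
            self.flushed = True
            return f"pkt({self.pending.popleft()})" if self.pending else None
        if self.flushed:
            raise RuntimeError("encoder was flushed; reopen it first")
        self.pending.append(frame)
        if len(self.pending) > self.delay:
            return f"pkt({self.pending.popleft()})"
        return None


def encode_sequence(frames, delay=2):
    """Encode one run of frames with a fresh encoder and drain it completely."""
    encoder = ToyEncoder(delay)  # a new instance per sequence ("reopen")
    packets = [p for p in map(encoder.encode, frames) if p is not None]
    while (packet := encoder.encode(None)) is not None:
        packets.append(packet)
    return packets
```

With a delay of two, the first two calls to `encode()` return nothing, and the last two packets only appear once `None` is passed in. Feeding a frame to an already flushed `ToyEncoder` raises an error, which mimics why the real codec is closed and reopened between sequences.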
After all packets have been written to the output file, we finalize the writing with av_write_trailer() and close all handles.