AI for Encoding Coming in Different Phases
AI encoding is coming in two phases, according to Thierry Fautier, VP of Strategy at Harmonic. The first phase will focus on artificial intelligence (AI) and machine learning (ML) techniques applied to existing codecs such as AVC, HEVC, AV1, and AVS3. The second phase will focus on newer codecs like VVC and AV2.
Harmonic has already deployed a first version of such an AI-assisted encoding scheme, which it calls Content Aware Encoding (CAE) and which is embedded in its EyeQ offering. The idea is to use AI and the mechanics of the human visual system to “continuously assess video quality in real-time and focus bits where and when they matter most for the viewer.” Exactly how the algorithm works remains confidential, but Fautier says Harmonic’s operator customers see up to a 40% bit rate reduction for comparable quality when implementing CAE.
“There are now over 100 CAE deployments worldwide using AVC and HEVC mostly for OTT services,” noted Fautier, “and we have shown it can reduce bit rates for 8K using HEVC during the French Open trial we did in 2019 with France Televisions.”
CAE runs on existing encoders without additional computation power, at least in Harmonic’s implementation. Other AI techniques using existing codecs fall into two categories: implementations that require a big increase in CPU usage, and techniques like Convolutional Neural Networks (CNN) that are being studied in groups like MPEG. The focus with CNN solutions is to shift more of the compute load to the client side to save bandwidth. Researchers are therefore trying to figure out how to balance the load between AI-based algorithms that run on a neural network and the GPU/CPU processing needed for the raw encoding. The story here is not written yet.
It is also important to understand that AI techniques are based on a learning process (supervised or not) where a considerable CPU budget is used. One must also consider the CPU power used at run time to try to limit its impact when using an AI-based technique. Netflix and some others are using AI to make exhaustive encodes of all the parameter combinations and deduce the best set of resolution-bit rate combinations. This is very accurate but is also very CPU intensive and therefore not applicable to live applications. It is also not very green in terms of carbon footprint or in terms of dollars spent.
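The exhaustive approach described above can be illustrated with a minimal sketch: encode every resolution at every bit rate, score each result, and keep the best resolution per bit rate. The `Trial` type, the `best_ladder` function, and all of the numbers below are hypothetical and chosen only for illustration; they are not taken from Netflix or any other provider.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    height: int      # vertical resolution of the trial encode, in pixels
    kbps: int        # bit rate of the trial encode
    quality: float   # hypothetical quality score, higher is better


def best_ladder(trials):
    """For each bit rate tried, keep the resolution that scored highest."""
    best = {}
    for t in trials:
        if t.kbps not in best or t.quality > best[t.kbps].quality:
            best[t.kbps] = t
    return [best[k] for k in sorted(best)]


# Made-up scores for nine trial encodes of one title.
TRIALS = [
    Trial(540, 1000, 78.0), Trial(720, 1000, 74.0), Trial(1080, 1000, 65.0),
    Trial(540, 3000, 85.0), Trial(720, 3000, 90.0), Trial(1080, 3000, 88.0),
    Trial(540, 6000, 87.0), Trial(720, 6000, 93.0), Trial(1080, 6000, 96.0),
]

ladder = best_ladder(TRIALS)
ladder_heights = [t.height for t in ladder]
print(ladder_heights)  # [540, 720, 1080]
```

The cost is visible in the input: nine full encodes were needed to pick three ladder rungs, which is why this kind of search suits file-based content rather than live.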
As for directions in AI-assisted encoding being deployed on existing codecs, Fautier says there are three main areas of development: 1) dynamic resolution encoding; 2) dynamic frame rate encoding; and 3) layering.
Dynamic Resolution Encoding (DRE) is an extension to the encoding ladders that OTT content providers use today. The choice of the resolution-bit rate combinations to encode to has been an area of active research over the last few years, with per-title encoding being state-of-the-art today. This means the resolution-bit rate ladder choices are made on a scene-by-scene basis to optimize the storage requirements and streaming bandwidth. The main difference with DRE is that the choice is made in a single pass without any additional processing power, as opposed to the classical approach used in VOD, where all resolutions are encoded and the result is determined after comparing all the encodings.
Fautier says that Harmonic’s DRE approach is appropriate for live content, not just file-based OTT content. In fact, they have already proposed this to the Brazilian TV 3.0 Organization developing a solution for that country’s next-generation broadcast system.
The second area of development is dynamic frame rate encoding. Here, the idea is to encode only at the frame rate that is necessary. That is, talking heads can likely be encoded at 30 fps or lower without loss of detail, whereas live sports will probably need to be encoded at the frame rate at which they are captured. The objective is to reduce the compute load for the encoding process – by up to 30%, depending on the content. “This technique has been researched for many years without any success, but now, thanks to AI, we see very solid results,” says Fautier.
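To make the frame rate decision concrete, here is a deliberately simple sketch. It does not use AI and is not Harmonic’s method; it swaps in a plain motion threshold so the shape of the decision is visible. The function names, the threshold of 4.0, the 60 fps capture rate, and the tiny four-pixel "frames" are all assumptions made for illustration.

```python
def motion_score(frames):
    """Mean absolute luma difference between consecutive frames."""
    total = 0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def choose_fps(frames, capture_fps=60, threshold=4.0):
    """Keep the capture rate for high motion, otherwise halve it."""
    if motion_score(frames) >= threshold:
        return capture_fps
    return capture_fps // 2


# Each frame is a flat list of luma samples (hypothetical toy data).
talking_head = [[100, 100, 100, 100], [100, 101, 100, 100], [100, 101, 100, 101]]
sports = [[10, 200, 30, 90], [60, 120, 80, 10], [200, 20, 150, 70]]

print(choose_fps(talking_head))  # 30
print(choose_fps(sports))        # 60
```

A production system would have to make this call per scene and avoid visible judder when switching rates, which is presumably where the learned models Fautier mentions come in.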
The third area is layering. Scalable HEVC, LCEVC, and pre/post-processing pairing are all examples of this approach. With layering, you might, for example, encode a base layer at 4K resolution along with an enhancement layer that conveys the extra 8K details. These two layers may or may not be transmitted over the same transport system. For example, a 4K signal could be broadcast with an enhancement layer sent over an IP connection. If the receiving TV is 4K, it ignores the enhancement data. But if an 8K TV receives these signals, it can use this enhancement data to reconstruct and decode an 8K signal.
This layering approach can be done today using Scalable HEVC, deployed in the U.S. in ATSC 3.0 with a base layer in HD for mobile and an extension layer for 4K TVs. Scalable VVC and VVC-based LCEVC have been proposed to the TV 3.0 consortium. Also under investigation is the use of LCEVC to create a base layer of legacy HD AVC-encoded content with a UHD extension layer. Those techniques use standards-based approaches.
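The base-plus-enhancement idea can be sketched on a one-dimensional signal. This is only the principle: real scalable codecs such as Scalable HEVC or LCEVC compress and quantize both layers, which this toy does not. All function names and sample values are hypothetical, and the signal length is assumed to be even.

```python
def downscale(signal):
    """Halve the resolution by averaging neighbouring sample pairs."""
    return [(signal[i] + signal[i + 1]) // 2 for i in range(0, len(signal), 2)]


def upscale(signal):
    """Double the resolution by repeating each sample."""
    return [s for s in signal for _ in range(2)]


def encode_layers(original):
    """Split a signal into a base layer and an enhancement (residual) layer."""
    base = downscale(original)
    enhancement = [o - u for o, u in zip(original, upscale(base))]
    return base, enhancement


def decode_base(base):
    """A lower-resolution receiver uses the base layer and ignores the rest."""
    return base


def decode_enhanced(base, enhancement):
    """A higher-resolution receiver adds the residual to the upscaled base."""
    return [u + e for u, e in zip(upscale(base), enhancement)]


original = [10, 14, 20, 26, 30, 30, 8, 2]
base, enhancement = encode_layers(original)

print(base)                                           # [12, 23, 30, 5]
print(decode_enhanced(base, enhancement) == original)  # True
```

Because the two layers are separate lists, nothing stops them from travelling over different paths, which mirrors the broadcast-plus-IP example in the text.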
Pre/post-processing pairing, as proposed by Samsung with its Scalenet approach, uses a neural network to downscale the 8K content for distribution as 4K content with signaling and/or metadata to aid in the reconstruction of the 8K signal. One additional challenge with the use of neural networks for this approach is in establishing the standards for the interchange of encoding/processing data. MPEG is currently looking into this for its new version of video standards.
The use of AI, ML, and neural networks to aid in the encoding, distribution, and reconstruction of video is in its early days. But the above approaches seem to be solid areas of development that will really help reduce bit rates, storage needs and even compute complexity in the future. These are also likely to be the key elements that eventually enable a commercially viable 8K streaming/broadcast service.