Pre-encoding 8K with iSIZE BitSave
To understand the benefits of pre-encoding 8K content, we reached out to iSIZE Technologies. This London-based deep-tech company applies deep learning to video delivery and claims to offer the first AI-powered live psychovisual precoding step ahead of the encoder.
The iSIZE preprocessing approach
iSIZE uses machine learning to build a psychovisual preprocessor, offered as its BitSave product. This tool can then be deployed in front of any standard codec to lower bandwidth, raise video quality, or deliver some combination of the two. Typically, the model is pre-trained by iSIZE.
iSIZE implements recent theories in the field of psychovisual image analysis and rate-perception-distortion optimization (Reference: Blau, Yochai, and Tomer Michaeli. “Rethinking lossy compression: The rate-distortion-perception tradeoff.” International Conference on Machine Learning. PMLR, 2019).
If terms like “rate-perception-distortion optimization” are a little scary to you, rest assured, they were to us too at first. There’s a lot of jargon there for a simple idea. Compressing video with brute-force algorithms eventually reaches diminishing returns with any given technology, such as HEVC. Academic researchers from multiple fields have long said that, because of the way our eyes work, not all compression artifacts have the same effect. Some are detrimental to perceived quality, while others go unnoticed by viewers. The point of approaches like iSIZE’s, and others like LCEVC, is to apply these neuroscience findings to lower the complexity of encoding where the eye can’t notice, while still reaching the very best compression levels.
iSIZE’s CEO and founder, Sergio Grce, told us that models are trained “with a generalized set of fidelity, perceptual, and rate loss functions.” The resulting model is fully compatible with all existing video encoding, delivery, and decoding standards. More detailed information on iSIZE’s approach is available in a technical paper co-authored by iSIZE CTO Yiannis Andreopoulos and published at CVPR 2021 (Chadha, Aaron, and Yiannis Andreopoulos. “Deep Perceptual Preprocessing for Video Coding.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021).
So What’s a Deep Perceptual Optimizer?
iSIZE describes its approach as a codec-agnostic, server-side enhancement that optimizes for low-level metrics like SSIM (structural similarity index measure) and higher-level, more perceptually oriented metrics like VMAF.
iSIZE calls its preprocessing a deep perceptual optimizer (DPO) because it operates on single frames, applying a deep neural network that optimizes the perceptual quality of the subsequent encoding. Large volumes of content are needed to train the DPO offline. For this training, a virtualized model of an encoder incorporates the effects of inter- or intra-frame prediction, transform and quantization, and entropy encoding as learnable functions. This emulation of a practical encoder enables iSIZE to “teach” the preprocessing network how encoders distort the incoming pixel stream at typical encoding bitrates. In parallel, the model provides a rate estimate for a range of quality levels. “This trains DPO to minimize the expected bitrate of an encoder when encoding the DPO-processed content, while at the same time maximize the encoder’s perceptual quality,” explained Andreopoulos.
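To make the idea of combining “fidelity, perceptual, and rate loss functions” more concrete, here is a minimal sketch of a weighted training objective. It is not iSIZE’s actual loss: the weights, the use of mean squared error as the fidelity term, the assumption that the perceptual score lies between 0 and 1, and all function names are our own illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    fidelity: float = 1.0    # hypothetical weight on pixel-level error
    perceptual: float = 1.0  # hypothetical weight on perceptual-quality error
    rate: float = 0.1        # hypothetical weight on the estimated bitrate


def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(source, decoded, perceptual_score, est_bits_per_pixel,
               w=LossWeights()):
    """Weighted sum of fidelity, perceptual, and rate terms.

    perceptual_score is assumed to lie in [0, 1], with 1 meaning best
    quality, so (1 - perceptual_score) is the quantity to minimize.
    """
    fidelity_term = mse(source, decoded)
    perceptual_term = 1.0 - perceptual_score
    rate_term = est_bits_per_pixel
    return (w.fidelity * fidelity_term
            + w.perceptual * perceptual_term
            + w.rate * rate_term)


# Toy values: same reconstruction and quality, two different rate estimates.
src = [0.2, 0.5, 0.8]
dec = [0.2, 0.5, 0.8]
print(total_loss(src, dec, perceptual_score=0.9, est_bits_per_pixel=0.2))
print(total_loss(src, dec, perceptual_score=0.9, est_bits_per_pixel=0.4))
```

With quality held constant, the lower rate estimate yields the lower loss, which is the pressure that pushes a trained preprocessor toward content that encodes more cheaply.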
Perhaps a simpler way of thinking about this is in terms of optics and projected images. In a projector, a clean image is created on the imager, but the projection optics may distort the image that reaches the screen. If you can model how the lens distorts the image, you can create the conjugate of that distortion by adjusting the pixel values before they get to the imager. By writing this conjugate-distorted video to the imager, what reaches the screen is a much cleaner image.
If this analogy holds true, then iSIZE is treating the encoder as a noisy process with known distortions at certain bitrate levels. By modeling these encoder distortions, the company can create a closed-loop neural network that trains the preprocessor to compensate for them. The preprocessor changes the pixel-level values so that, when the video is run through the encoder, the distortions are reduced. There is no need for metadata, and no changes are required at the decoder either.
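The projector analogy can be illustrated with a toy example. The sketch below assumes the “lens” is a known gain and offset applied to each pixel, so its inverse can be written down exactly; the function names and numbers are ours. A real encoder is lossy and cannot be inverted this way, so this only illustrates the analogy, not the method itself.

```python
def lens(pixels, gain=0.8, offset=0.05):
    """Toy 'projection optics': a known gain and offset on every pixel."""
    return [gain * p + offset for p in pixels]


def precompensate(pixels, gain=0.8, offset=0.05):
    """Apply the inverse of the modeled distortion before the lens."""
    return [(p - offset) / gain for p in pixels]


def max_error(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))


target = [0.2, 0.5, 0.8]                  # the image we want on screen
uncorrected = lens(target)                # what the lens does to it
corrected = lens(precompensate(target))   # pre-distort, then project

print("error without pre-compensation:", max_error(uncorrected, target))
print("error with pre-compensation:   ", max_error(corrected, target))
```

The pre-compensated path lands on the target up to floating-point rounding, while the uncorrected path carries the full lens distortion.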
The DPO doesn’t need to have access to (or change) the encoder’s settings or even know the encoding standard, claims Andreopoulos. This seems a bit strange to us: each codec, and each manufacturer’s implementation of a codec, operates a bit differently and so, it would seem, should be modeled a bit differently. By analogy, each projection lens is different and requires its own measurement.
How, where, and why to use preprocessing?
iSIZE told us that the BitSave preprocessor could be placed into any live or file-based workflow just before the encoding pipeline without disrupting anything in the existing setup. The added latency is a single frame and, depending on the specific workflow, BitSave preprocessing time can be as low as 1.5 ms per frame on a GPU or similar high-performance hardware (for 1080p content, we believe). For the testing described below, iSIZE’s CTO told us that the overhead could be as low as 10% of the GPU or CPU encoding runtime.
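To put the 1.5 ms figure in context, the short calculation below compares it with the duration of one frame. The frame rates are our own assumptions for illustration, and the 1.5 ms value is iSIZE’s best-case number (for 1080p, we believe); the per-frame cost for 8K content was not given.

```python
def frame_interval_ms(fps):
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def share_of_frame_budget(preprocess_ms, fps):
    """Fraction of one frame interval consumed by preprocessing."""
    return preprocess_ms / frame_interval_ms(fps)


PREPROCESS_MS = 1.5  # best-case figure quoted by iSIZE

for fps in (30, 60):  # assumed frame rates, for illustration only
    print(f"{fps} fps: frame interval {frame_interval_ms(fps):.2f} ms, "
          f"preprocessing uses {share_of_frame_budget(PREPROCESS_MS, fps):.1%}")
```

At 60 frames per second, a frame lasts about 16.67 ms, so 1.5 ms of preprocessing takes up 9% of the frame interval.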
Bitrate reduction while maintaining equal or better quality is iSIZE’s most common use case. However, Grce told us that “8K encoding can reawaken unique workflow challenges. For example, encode speed, and live encoding density may bring long-gone issues back.” He explained that BitSave could help here by allowing the encoder to run at a lower level of complexity while maintaining an equal level of quality for a specific bitrate. In such a case, if an encoder is currently achieving the performance of “slower” or “veryslow” presets of x265, similar video quality and bitrate results can be achieved using the “fast” preset when the file is preprocessed with BitSave. As mentioned previously, iSIZE claims that the BitSave preprocessor adds a “relatively nominal” amount of additional processing time, resulting in a significant overall speed/density saving.
A single-pass preprocessing of the input video works for all resolutions and encoding bitrates since the deep perceptual optimization is done at the frame level. This feature simplifies adding a preprocessing step to adaptive bitrate (ABR) ladders and other complex workflows requiring multiple encodes.
Grce further told us that BitSave could be used whenever one or more of the following five scenarios are needed:
- Providing higher quality video experiences
- Lowering bitrates and storage
- Increasing processing speed/density
- Addressing sustainability concerns, notably by using fewer CPU/GPU resources and thus less power
- Providing scalable solutions, future-proofed for new codecs and higher resolutions
Description of encoding tests of the 8K content
At the request of the 8K Association, iSIZE performed tests on four 8K files:
- TYL_trailer_20200923_QUHD_HEVC_gamma_Rec709 (SDR/HEVC)
- TYL_trailer_20200923_QUHD_HEVC_PQ_Rec2020 (HDR/HEVC)
- TYL_trailer_20200923_QUHD_PR4x4_PQ_Rec2020 (HDR/ProRes)
- TYL_trailer_20200923_QUHD_PR4x4_Rec709 (SDR/ProRes)
iSIZE segmented each file into six parts for easier file handling and to help with VMAF and MS-SSIM accuracy by applying harmonic mean scores to smaller time segments. Two encodes were then run with the standard open-source x265 implementation of HEVC in FFmpeg:
- The first encode used x265 with a CRF value of 22 and the “veryslow” preset, with no BitSave preprocessing. This configuration is generally known to provide high-quality encoding, with VMAF values of 95 or higher (visually lossless).
- The second encode was preprocessed with BitSave and then encoded with x265, again using the “veryslow” preset. To target bitrate savings, the CRF was increased to 26. This would traditionally have a severe impact on quality, with VMAF and MS-SSIM dropping significantly below 95, but the results show that this is not the case for the preprocessed file.
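For readers who want to reproduce the general shape of these two encodes, the sketch below builds FFmpeg argument lists using the libx265 encoder with the `-preset` and `-crf` options. The exact command lines iSIZE used were not shared with us, the file names are hypothetical, and the BitSave step itself is not shown; the preprocessed segment is simply assumed to exist already.

```python
def x265_command(src, dst, crf, preset="veryslow"):
    """Build an FFmpeg argument list for a CRF-mode libx265 encode."""
    return ["ffmpeg", "-i", src, "-c:v", "libx265",
            "-preset", preset, "-crf", str(crf), dst]


# Hypothetical file names; the preprocessed segment is assumed to exist.
baseline = x265_command("segment_01.mov", "baseline_crf22.mp4", crf=22)
bitsave = x265_command("segment_01_preprocessed.mov", "bitsave_crf26.mp4",
                       crf=26)

# Print the commands rather than running them.
print(" ".join(baseline))
print(" ".join(bitsave))
```

Each list can be passed to `subprocess.run` on a machine where FFmpeg is built with libx265. The only differences between the two runs are the input file and the CRF value, which mirrors the test design described above.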
iSIZE provided the following analysis of the results for quality assessments, bitrates, and file sizes:
| Filename | ΔBitrate vs. Codec Only (%) | ΔVMAF vs. Codec Only | ΔMS-SSIM (×100) |
Note: All delta values refer to preprocessed+x265 vs. x265 with no preprocessing; negative values mean reduction in bitrate/quality vs. the x265 encoder at CRF 22; positive values imply increase. All results were produced with the standard libvmaf library of Netflix.
These results confirm iSIZE’s claim of significant bitrate savings when their BitSave preprocessor is used, and this comes even with a slight VMAF improvement and negligible drop in MS-SSIM scores.
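The delta columns and the harmonic-mean pooling can be sketched as follows. This assumes the harmonic mean is taken over a list of scores (per frame within a segment, or per segment within a file; we were not told which), and every number in the example is a made-up placeholder, not a measurement from these tests.

```python
from statistics import harmonic_mean


def pooled_score(scores):
    """Pool a list of quality scores with the harmonic mean."""
    return harmonic_mean(scores)


def delta_bitrate_percent(preprocessed_kbps, codec_only_kbps):
    """Bitrate change vs. codec only, in percent (negative = saving)."""
    return 100.0 * (preprocessed_kbps - codec_only_kbps) / codec_only_kbps


def delta_score(preprocessed, codec_only):
    """Quality change vs. codec only (positive = improvement)."""
    return preprocessed - codec_only


# Placeholder values for illustration only; these are NOT test results.
codec_only_scores = [90.0, 92.0, 94.0]
preprocessed_scores = [91.0, 93.0, 95.0]

print(delta_bitrate_percent(preprocessed_kbps=80.0, codec_only_kbps=100.0))
print(delta_score(pooled_score(preprocessed_scores),
                  pooled_score(codec_only_scores)))
```

The harmonic mean weights low scores more heavily than the arithmetic mean does, so a short stretch of poor quality pulls the pooled score down more than it would in a simple average.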
iSIZE explained that the differences in the visual quality metrics are well below the threshold of a noticeable difference, which indicates equivalent video quality: VMAF is around 98 and above, and MS-SSIM×100 is in the region of 99. These are only preliminary tests, because the highest-resolution VMAF models currently available for this testing target 4K (3840×2160) content. The resulting scores correspond to a viewing distance of approximately twice the nominal viewing distance (1.5× screen height), which makes the ΔVMAF scores less sensitive. While the current tests have this limitation, iSIZE anticipates seeing more positive ΔVMAF scores for 8K content if Netflix’s official VMAF repository makes an 8K VMAF model available. In other words, iSIZE believes these test results may be understating BitSave’s VQ and bitrate improvement potential.
While it is only preliminary testing, iSIZE’s initial evidence points to bitrate gains by applying a psychovisual preprocessing step to 8K content. According to the company, these gains allow for cost reduction and enable bitrate savings that are not feasible with conventional encoding. These results illustrate some of the benefits of pre-encoding 8K content, and we eagerly look forward to seeing this with our own eyes, perhaps at NAB?