Machine learning (ML) is a subset of artificial intelligence (AI) techniques that has shown great potential in solving challenging tasks in multimedia content processing. In the ´óÏó´«Ã½ Research & Development Visual Data Analytics team, we use ML to improve video compression, with techniques ranging from simple forms such as decision trees to more advanced deep Convolutional Neural Networks (CNNs).
CNNs are particularly effective at tasks related to visual content, such as object classification or image enhancement. However, CNN-based networks are typically very complex. Although this complexity is well justified for visual analytics tasks, it can limit their suitability for real-time applications like video coding, pose estimation and object recognition in autonomous driving. Furthermore, the intricacy of CNN architectures makes it challenging to understand exactly what these algorithms have learned, calling the trustworthiness of their deployment into question. How they manipulate data needs to be properly explained and understood to mitigate potentially unexpected outcomes.
To address these challenges and enable the consistent and reliable use of ML, ´óÏó´«Ã½ R&D is opening up the inner workings of deep networks for specific ML applications.
What is ML interpretability, and why do we need it?
Interpretability is an area of ML research that aims to explain, clearly and plainly, how trained ML algorithms arrive at their results. Neural network models often come with thousands of parameters that change during training as the network learns to identify patterns in the data. Trained models are then deployed for specific tasks, but are commonly used as black boxes, in the sense that it is typically not known how these algorithms make decisions to arrive at their outputs from the input data. By interpreting neural networks, it is possible to open the black box, providing explanations that support a transparent and reliable use of ML. This process can also uncover redundancies in the structure of the analysed model. Understanding the relationships learned by a neural network enables the derivation of streamlined, simple algorithms that can be applied in systems which require low-complexity solutions.
Interpretability is also highly useful in revealing biases in ML workflows, in fields such as image classification. A classic example of a biased application of ML is a classifier trained to distinguish wolves from huskies. The classifier bases its predictions on models learned from images of wolves with snow in the background and pictures of huskies without snow. Regardless of the relevant features of these animals (like colour or pose), the trained classifier predicts a wolf if there is snow in the image and a husky otherwise. By using interpretability, these undesirable effects can be detected and avoided before the model reaches deployment.
Our approach
We have developed an approach that focuses on reducing the complexity of CNNs by interpreting their learned parameters. This approach enables us to build new, simple models that preserve the advantages of the model originally learned by a CNN, while also making them transparent.
In the example presented in the video below, we use a network with three convolutional layers from which we removed the non-linear elements. Following the CNN training, we can derive how to compute the samples of the resulting image directly from the input, instead of performing the numerous convolutions defined by the CNN layers.
The resulting simplification fully describes how the network behaves and offers a substantial decrease in the number of parameters compared with the original, non-interpreted model. A limitation is that full simplification cannot be achieved for more complex networks because of the non-linearities they require. Nevertheless, in further research we have seen that some branches of very complex CNNs can still benefit from our approach.
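To make the idea concrete, here is a minimal sketch in PyTorch (not our actual training code; the layer and channel sizes are chosen purely for illustration) of why a small CNN with its non-linearities removed collapses into a single convolution: the cascade is linear and shift-invariant, so its impulse response is the one kernel that computes the output directly from the input.

```python
# Minimal sketch: collapsing a linear (non-linearity-free) 3-layer CNN into
# one equivalent convolution. Layer and channel sizes are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Three bias-free conv layers with no activations in between: the whole
# network is a linear, shift-invariant operator on the input image.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, bias=False),
    nn.Conv2d(8, 8, kernel_size=3, bias=False),
    nn.Conv2d(8, 1, kernel_size=3, bias=False),
)  # 720 parameters in total

# Read off the combined kernel by passing a unit impulse through the network:
# three 3x3 layers amount to a single 7x7 kernel (49 parameters).
impulse = torch.zeros(1, 1, 13, 13)
impulse[0, 0, 6, 6] = 1.0
with torch.no_grad():
    response = net(impulse)                      # shape (1, 1, 7, 7)

collapsed = nn.Conv2d(1, 1, kernel_size=7, bias=False)
with torch.no_grad():
    # PyTorch convolutions are cross-correlations, so flip the impulse response.
    collapsed.weight.copy_(response.flip(-2, -1))

# One direct 7x7 filter reproduces the three-layer network (up to float error).
x = torch.rand(1, 1, 32, 32)
with torch.no_grad():
    print(torch.allclose(net(x), collapsed(x), atol=1e-5))   # True
```

With non-linear activations in place this exact collapse no longer holds, which is why, as noted above, full simplification is limited to specific branches of more complex networks.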
ML in video coding
Motion compensation is one of the crucial concepts in video compression and can be successfully improved by using CNNs. However, CNN approaches applied to motion compensation lead to substantial increases in both runtime and memory consumption, making them largely unsuitable for real-time deployment.
To overcome these constraints, we trained our aforementioned streamlined network to predict the pixels for motion compensation more accurately than traditional methods. Then, in addition to the conventional prediction of pixels (shown in the video below), we implemented the interpreted models within a video codec.
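As a rough illustration of where such a collapsed kernel could slot in (a hypothetical sketch, not our actual codec integration; `learned_kernel`, its size and the block sizes are placeholders), the snippet below contrasts the fixed HEVC-style 8-tap half-pel interpolation filter with a single 2-D filter read off an interpreted network.

```python
# Hypothetical sketch of fractional-pel prediction for motion compensation:
# a fixed HEVC-style 8-tap half-pel filter versus a single 2-D kernel
# derived from an interpreted network. `learned_kernel` is a placeholder.
import numpy as np
from scipy.ndimage import correlate

# Standardised HEVC luma half-pel filter taps.
HEVC_HALF = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=np.float32) / 64.0

def half_pel_traditional(ref_block: np.ndarray) -> np.ndarray:
    """Horizontal half-pel interpolation with the fixed 8-tap filter."""
    return correlate(ref_block, HEVC_HALF[np.newaxis, :], mode="reflect")

def half_pel_interpreted(ref_block: np.ndarray, learned_kernel: np.ndarray) -> np.ndarray:
    """The same prediction step, filtering with one kernel read off an interpreted CNN."""
    return correlate(ref_block, learned_kernel, mode="reflect")

# Placeholder usage: a 16x16 reference block and a stand-in 7x7 kernel.
ref = np.random.rand(16, 16).astype(np.float32)
learned_kernel = np.random.rand(7, 7).astype(np.float32)
learned_kernel /= learned_kernel.sum()     # normalise so brightness is preserved

pred_fixed = half_pel_traditional(ref)
pred_learned = half_pel_interpreted(ref, learned_kernel)
```

Because the interpreted model reduces to a single filtering step like this, it avoids the layer-by-layer inference cost that makes full CNNs impractical inside a decoder.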
Our technique runs significantly faster than previous CNN-based efforts when tested within a video coding format. Our experiments have shown an 82% decrease in decoder runtime while retaining the coding benefits of previous approaches.
This open-source software is now available via the . You can read more about this work in , published in , and in , presented at .
Why is this important?
In addition to demonstrating how interpretability can optimise ML solutions in the video coding area, this work has a broader reach. As the wolves and huskies example above shows, prior research demonstrates that interpretability can be used to detect biases within ML models.
However, we have also shown how interpretation can make ML models much more efficient, addressing one of the most important aspects of implementing ML in real-time production and distribution workflows. We are demonstrating that interpretability is one of the pillars of sustainable AI, and we are continuing to examine its benefits in unravelling the ML black box.
This work was carried out within the in collaboration with the and the .
- ´óÏó´«Ã½ R&D - Video Coding
- ´óÏó´«Ã½ R&D - COGNITUS
- ´óÏó´«Ã½ R&D - Faster Video Compression Using Machine Learning
- ´óÏó´«Ã½ R&D - AI & Auto Colourisation - Black & White to Colour with Machine Learning
- ´óÏó´«Ã½ R&D - Capturing User Generated Content on ´óÏó´«Ã½ Music Day with COGNITUS
- ´óÏó´«Ã½ R&D - Testing AV1 and VVC
- ´óÏó´«Ã½ R&D - Turing codec: open-source HEVC video compression
- ´óÏó´«Ã½ R&D - Comparing MPEG and AOMedia
- ´óÏó´«Ã½ R&D - Joining the Alliance for Open Media