According to IEEE Spectrum, researchers at Meta have published a series of new papers on MAE (masked autoencoder) systems. MAE systems use SSL (self-supervised learning) to predict the missing parts of data and then restore incomplete text, images, video, and audio.

The general principle by which an MAE system restores different types of data is the same: predict the missing content from the information that remains, then fill the gaps with the predicted data.

With this technology, AI may be able to generate its own training labels (ground truth) automatically instead of relying on manual annotation. This would greatly improve the learning efficiency of AI models and could open new directions for their future development.

1. The essence of intelligence is the ability to predict, and SSL can raise the level of AI intelligence

The MAE system uses SSL (self-supervised learning). In SSL, the labels used for machine learning are derived from the data itself rather than from human annotation.

MAE systems can recover images, video, and audio by predicting the missing parts of heavily fragmented, corrupted data. This is also how the MAE system builds a “world model”.
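To make that concrete, here is a minimal sketch of how a self-supervised training target is constructed: part of the input is hidden, and the hidden part itself becomes the label. All names and shapes below are illustrative assumptions, not Meta’s code.

```python
import torch

# Hypothetical toy example: the supervision signal comes from the data itself.
data = torch.randn(8, 64)                      # a batch of 8 unlabeled samples
mask = torch.rand_like(data) < 0.75            # hide 75% of each sample's values
corrupted_input = data.masked_fill(mask, 0.0)  # what the model actually sees
target = data                                  # the "label" is the original data

def reconstruction_loss(prediction: torch.Tensor) -> torch.Tensor:
    # The loss is computed only on the hidden positions, so no human annotation
    # is ever needed: the data supplies its own ground truth.
    return ((prediction - target) ** 2)[mask].mean()
```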

Yann LeCun, chief AI scientist at Meta, said: “SSL is a prerequisite for AI systems to build ‘world models’. Only with SSL capabilities can AI acquire something like human reason and common sense, transfer knowledge, and adapt to different environments.” LeCun added that if the MAE system can predict the missing parts of data, it shows the AI understands that the world is three-dimensional and has a degree of discriminative ability, and only then does it become possible to predict complex human behavior.

Yann LeCun told IEEE Spectrum: “We want to create AI models that can learn autonomously like animals and humans.” LeCun believes that the essence of intelligence is the ability to predict. This view is shared by 2018 Turing Award winner Yoshua Bengio, who also sees the ability to reason about and predict the world as the key to intelligence.

Meta has cut AI video computing costs by 95%, and its AI can guess the original image even when half of the picture is covered.

▲ The left is the training image provided to the MAE model, the middle is the prediction result, and the right is the original image

2. A new kind of fill-in-the-blank puzzle? AI fills in the picture for you

Ross Girshick, a researcher in Meta’s AI division, co-authored a paper on the principles of the MAE system. The paper notes that Meta’s MAE system is built on the Transformer, a neural network architecture based on the attention mechanism. Attention lets a model reduce its reliance on external information, capture the internal relationships within the data or features, and improve training results.

▲ Paper on the principle of MAE
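As a rough illustration of what “capturing internal relationships” means, the sketch below implements plain scaled dot-product self-attention in PyTorch; the tensor sizes are arbitrary, and this is not Meta’s implementation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Each position compares itself with every other position (q @ k^T),
    # turns the similarities into weights, and mixes the values accordingly.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v

x = torch.randn(1, 196, 64)                    # e.g. 196 patch embeddings of width 64
out = scaled_dot_product_attention(x, x, x)    # self-attention: q = k = v = x
print(out.shape)                               # torch.Size([1, 196, 64])
```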

When processing text, the MAE system works on a corpus from which some content is missing. Once the system detects the missing spans, it fills them in with newly predicted blocks of text.
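The papers do not describe a public text API for this system, so as an illustration the snippet below uses a BERT-style masked language model from the Hugging Face transformers library, which performs the same kind of mask-and-predict completion on text.

```python
from transformers import pipeline

# Illustration only: a BERT-style masked language model (not Meta's MAE system)
# predicts a deliberately removed word from the surrounding context.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The missing parts of the [MASK] can be predicted from context."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```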

This technique can also be transferred to the way MAE systems process still images. The researchers decomposed each image into patches and let the MAE system fill in the missing ones. Ross Girshick said this was inspired by Google’s ViT (Vision Transformer) model.

The basic idea of ViT (Vision Transformer) is to apply the Transformer architecture to computer vision. Specifically, ViT divides a picture into patches of equal size, encodes each patch, and arranges the encoded patches into a sequence that the model can process. Following this idea, when the MAE system predicts missing image content, it decomposes the image into many small patches and fills in the missing regions with newly predicted patches.
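A minimal sketch of the patch step, assuming 224×224 RGB images and 16×16 patches (typical ViT defaults, not necessarily Meta’s exact settings):

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images of shape (B, C, H, W) into flattened patches (B, N, patch_size*patch_size*C)."""
    B, C, H, W = images.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    x = images.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                        # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size * patch_size * C)

imgs = torch.randn(2, 3, 224, 224)     # two dummy RGB images
patches = patchify(imgs)               # (2, 196, 768): a sequence of 196 patch vectors
print(patches.shape)
```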

3. Text and images have different information densities; masking 75% of an image gives the best results

The team found that because text and images have different information densities, the optimal proportion of data to mask also differs. When the MAE system restores still images, masking 75% of the data gives the best results; for text, the figure is about 15%.

▲ The researchers found that masking 75% of the image gave the best experimental results
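Below is a minimal sketch of MAE-style random masking at a 75% ratio (the exact sampling scheme in Meta’s code may differ); it builds on the `patchify` helper above.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random 25% of patches per sample; return the visible patches,
    a binary mask (1 = hidden), and the indices of the kept patches."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # one random score per patch
    ids_shuffle = noise.argsort(dim=1)             # patches with the lowest scores are kept
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = visible, 1 = hidden
    return visible, mask, ids_keep
```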

Language is a human-created symbol system that is highly semantic and information-dense. Each word carries a great deal of meaning, so if too many words are missing from a sentence, the MAE model will predict many possible completions and its accuracy drops. Images, by contrast, are natural signals with considerable spatial redundancy: neighboring regions of the same image tend to have similar pixel features, so lost image information can be recovered by the model from adjacent patches.

Ross Girshick explained that the MAE system works in two steps. First, an encoder learns the relationships between pixels from the dataset. A decoder then reconstructs the original image from the masked input. Once pre-training is complete, the MAE system discards the decoder and keeps the encoder for vision tasks such as classification and object detection.
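That two-step pipeline can be sketched as an asymmetric encoder-decoder. The toy module below follows the description above but omits positional embeddings and other details, and its layer sizes are arbitrary assumptions rather than Meta’s configuration.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy MAE-style autoencoder: the encoder sees only the visible patches,
    and a lightweight decoder reconstructs every patch from mask tokens."""
    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)        # predicts raw pixel patches
        self.num_patches = num_patches

    def forward(self, visible_patches, ids_keep):
        B = visible_patches.size(0)
        latent = self.encoder(self.embed(visible_patches))        # step 1: encode visible patches only
        tokens = self.mask_token.expand(B, self.num_patches, -1).clone()
        idx = ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        tokens.scatter_(1, idx, self.enc_to_dec(latent))          # put encoded patches back in place
        return self.head(self.decoder(tokens))                    # step 2: reconstruct all patches
```

Chained with the earlier sketches, `visible, mask, ids_keep = random_masking(patchify(imgs))` followed by `TinyMAE()(visible, ids_keep)` yields a prediction for every patch; after pre-training, only the encoder would be kept, as described above.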

“The encoder of the MAE system can be reused for tasks such as object recognition, which is a huge gain for us,” said Ross Girshick. This means that with the MAE system, the training signal (ground truth) can be produced automatically from the data instead of through manual labeling.

4. The MAE system can cut video computing costs by 95%

When the MAE system is applied to video, the researchers mask 95% of the data in each frame. Adjacent frames of a video are highly similar, which means video contains even more information redundancy than still images. Meta researcher Christoph Feichtenhofer said that with this approach the MAE system can cut computational cost by 95 percent, a major advantage for video workloads. He added that the technology could potentially be used for content moderation and classification tasks on Facebook and Instagram.
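A back-of-the-envelope calculation (with assumed clip and patch sizes, not Meta’s exact setup) shows why such aggressive masking pays off: the encoder only processes visible tokens, so its workload drops roughly in proportion to the visible fraction, and faster still for the quadratic self-attention term.

```python
frames, patches_per_frame = 16, 14 * 14          # e.g. 16 frames of 224x224 video, 16x16 patches
total_tokens = frames * patches_per_frame        # 3136 space-time tokens per clip
mask_ratio = 0.95
visible_tokens = int(total_tokens * (1 - mask_ratio))        # ~156 tokens reach the encoder

linear_cost = visible_tokens / total_tokens                  # MLP / projection layers scale linearly
attention_cost = (visible_tokens / total_tokens) ** 2        # self-attention scales quadratically
print(f"visible tokens: {visible_tokens}")
print(f"linear-layer cost: {linear_cost:.1%} of full; attention cost: {attention_cost:.2%} of full")
```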

For audio, the Meta AI team found an ingenious approach: they converted audio files into spectrograms, in other words, they turned sound into images. They then applied the same procedure as for images, masking patches of the spectrogram during training. Although the model can currently only process audio clips of a few seconds, it has already achieved good results.
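A minimal sketch of that audio step uses torchaudio to turn a waveform into a mel spectrogram, which can then be patchified and masked exactly like an image; the sample rate and spectrogram parameters below are assumptions.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000 * 4)             # 4 seconds of dummy mono audio at 16 kHz
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128)
spec = to_mel(waveform)                          # (1, 128, ~401): a frequency-by-time "image"
print(spec.shape)                                # ready for the same patchify + mask pipeline
```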

Potential audio applications of the technology include audio classification, improving voice calls, and finding better ways to compress audio files, said Bernie Huang, a member of the audio systems team.

▲ The MAE framework

Conclusion: the MAE system has broad application potential, but its accuracy must be weighed carefully

MAE systems can predict the missing parts of incomplete data to restore text, images, video, and audio.

The technology leaves great room for imagination and has real application potential, for example restoring photographs of archaeological remains or reconstructing lost historical documents. MAE systems may not only lead to breakthroughs in AI but could also bring surprises to other fields.

However, the MAE model also has shortcomings. Current experiments show that 100% accuracy cannot be achieved, and the model may generate content that never existed. These issues need to be carefully considered and studied when using MAE models to restore data.
