OpenAI’s New Video AI Faces Harsh Criticism from Meta’s Chief Scientist

February 23, 2024

Last week, OpenAI unveiled its latest artificial intelligence creation, Sora – a text-to-video generator capable of creating high-definition video clips up to a minute long from simple text prompts. The demo videos showcased impressive scenes like puppies playing in the snow, people walking down a Tokyo street, and camera pans through a museum.

However, not everyone in the AI community is impressed with Sora. Yann LeCun, Meta’s chief AI scientist and one of the pioneers of deep learning, harshly criticized OpenAI’s claims that Sora will enable the creation of “general purpose simulators of the physical world.”

In a post on X, LeCun argued that OpenAI’s approach of generating video by modeling pixels is “as wasteful and doomed to failure as the largely-abandoned idea of ‘analysis by synthesis.’” He contends that trying to model the world and generate realistic video by focusing on pixel-level details is inefficient and cannot properly handle the uncertainty inherent in making predictions about complex 3D spaces.

LeCun refers to a longstanding debate in machine learning between generative and discriminative models. Generative models learn the joint probability distribution of high-dimensional data, which lets them synthesize new samples such as images or video. Discriminative models instead model only the conditional probability of an output, such as a label, given the input data.
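The distinction can be made concrete with a toy example. The sketch below is illustrative only: the dataset, labels, and function names are hypothetical and have nothing to do with Sora or any real model. It estimates a joint distribution p(x, y) by counting, derives p(y | x) from it, and compares that with estimating p(y | x) directly.

```python
from collections import Counter

# Hypothetical toy dataset of (observation, outcome) pairs.
data = [("dark", "rain"), ("dark", "rain"), ("dark", "dry"),
        ("clear", "dry"), ("clear", "dry"), ("clear", "rain")]

def joint_distribution(pairs):
    """Generative view: estimate p(x, y) for every observed (x, y) pair."""
    counts = Counter(pairs)
    total = len(pairs)
    return {xy: c / total for xy, c in counts.items()}

def conditional_from_joint(joint, x):
    """Derive p(y | x) from the joint as p(x, y) / p(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

def conditional_direct(pairs, x):
    """Discriminative view: estimate p(y | x) without ever modeling p(x)."""
    labels = [y for xi, y in pairs if xi == x]
    counts = Counter(labels)
    return {y: c / len(labels) for y, c in counts.items()}

joint = joint_distribution(data)
via_joint = conditional_from_joint(joint, "dark")
direct = conditional_direct(data, "dark")
print(via_joint, direct)
```

Both routes give the same conditional here, but the generative route had to describe the whole space of (x, y) combinations first. With two binary variables that is trivial; with every pixel of every video frame, that extra burden is the one LeCun is objecting to.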

According to LeCun, generative models like Sora that generate video pixel-by-pixel are attempting to “infer” too many irrelevant details about the 3D world. He argues this is analogous to trying to predict the trajectory of a soccer ball by modeling the properties of every material it is made of, instead of just focusing on key parameters like mass and velocity.

While LeCun admits generative models work well for text generation in systems like ChatGPT, where text is discrete and drawn from a finite set of symbols, he believes modeling the rich, continuous visual world at the pixel level is intractable. There are simply too many variables to account for.

Instead, LeCun has been working on an alternative approach at Meta called V-JEPA. This model tries to avoid filling in every single pixel by discarding unpredictable visual information. According to Meta, this leads to better training efficiency and sample efficiency compared to generative video models like Sora.

By not fixating on reconstructing pixels, V-JEPA can focus on modeling the most salient aspects of a video sequence. LeCun believes this non-generative approach, which makes its predictions in an abstract representation space rather than in pixel space, has a better chance of eventually creating useful general simulations of the world.
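A rough sketch can show why skipping pixel reconstruction might help. Everything in it is an assumption made for illustration: the one-dimensional "frame", the noise, and especially the `encode` function, which is a hand-written block average standing in for an encoder that a system like V-JEPA would learn. It is not Meta's architecture or training objective.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(frame, block=4):
    """Hypothetical stand-in encoder: average each block of pixels,
    keeping coarse structure and discarding fine detail."""
    return [sum(frame[i:i + block]) / block
            for i in range(0, len(frame), block)]

rng = random.Random(0)

# Coarse, predictable content of a tiny one-dimensional "frame".
structure = [0.0] * 8 + [1.0] * 8

# The real next frame adds fine detail that no predictor could guess.
target = [s + rng.uniform(-0.3, 0.3) for s in structure]

# A predictor that gets the coarse content exactly right.
prediction = list(structure)

pixel_loss = mse(prediction, target)                    # penalizes the unguessable detail
latent_loss = mse(encode(prediction), encode(target))   # mostly ignores it
print(pixel_loss, latent_loss)
```

The pixel-space loss charges the predictor for detail it had no way of knowing, while the loss measured on the averaged representation is smaller because much of that detail cancels out. A learned encoder decides what to discard far less crudely than a block average, so this only hints at the intuition behind the approach.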

OpenAI’s splashy demos have captured the public imagination about AI’s potential. But LeCun’s criticism serves as an important counterpoint – despite rapid progress, these generative models still face fundamental limitations. We are far from building true artificial general intelligence.

Sora’s visual artifacts and unrealistic physics show that current technology remains brittle. OpenAI itself acknowledges that Sora suffers from strange object distortions and misunderstood physics that still need work.

Still, while LeCun and Meta are taking a different approach, other large companies such as Google, Baidu, and Adobe are racing alongside OpenAI to improve generative video models. These models have wide applications in content creation, low-cost animation, and predictive simulations.

Despite LeCun’s harsh evaluation, OpenAI’s work has pushed boundaries on what AI can create. But there are still miles to go before these systems achieve robust, human-level intelligence. We need diverse perspectives and healthy skepticism as researchers pursue progress in this fascinating domain.

The path towards artificial general intelligence remains arduous. There are unlikely to be any quick shortcuts, despite what hype cycles might suggest. Systems like Sora are impressive stepping stones, but cannot yet reliably mimic the complexity of the real world. As Meta’s alternative work shows, both generative and non-generative methods have merits as ways of modeling the world.

What is clear is that enhancing AI’s reasoning abilities will require continued open and vigorous debate within the research community. We need visionaries thinking outside the box about how to build more efficient, flexible, and grounded learning algorithms. This discourse will push the field forward, unlocking AGI’s immense promise while avoiding potential pitfalls.