China developing its own version of OpenAI’s text-to-video model

Bryan Ke
March 5, 2024
A group of researchers and artificial intelligence (AI) experts is collaborating to develop China’s response to Sora, OpenAI’s highly anticipated text-to-video model.
What it is: Peking University professors and Rabbitpre, an AI company based in Shenzhen, announced their collaboration, named Open-Sora, in a GitHub post on Friday. The project was facilitated through the Rabbitpre AIGC Joint Lab, a joint effort between the company and the university’s graduate school.
According to the team, Open-Sora aims to “reproduce OpenAI’s video generation model” with a “simple and scalable” repository. The group is seeking assistance from the open-source community for its development.
Progress so far: Using a three-part framework composed of a Video VQ-VAE, a Denoising Diffusion Transformer and a Condition Encoder, the group has generated reconstructed video samples at varying aspect ratios, resolutions and durations, including 10- and 18-second clips.
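To illustrate how those three parts could fit together, here is a minimal sketch in PyTorch. None of this code comes from the Open-Sora repository; every class name, layer choice and tensor shape below is an assumption made only to show the general data flow (video compressed to latents, text encoded as a condition, a transformer denoising in latent space), and the vector-quantization step of a real VQ-VAE is omitted for brevity.

import torch
import torch.nn as nn

class VideoVQVAE(nn.Module):
    """Hypothetical stand-in: compresses a video into a latent grid.
    The discrete codebook/quantization of a true VQ-VAE is omitted."""
    def __init__(self, in_ch=3, latent_dim=256):
        super().__init__()
        # 3D convolutions downsample space and time by 2x per layer.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(64, latent_dim, kernel_size=4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(64, in_ch, kernel_size=4, stride=2, padding=1),
        )

    def encode(self, video):  # video: (B, C, T, H, W)
        return self.encoder(video)

    def decode(self, latents):
        return self.decoder(latents)

class ConditionEncoder(nn.Module):
    """Hypothetical stand-in: embeds tokenized text prompts."""
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):  # (B, L) -> (B, L, dim)
        return self.embed(token_ids)

class DenoisingDiffusionTransformer(nn.Module):
    """Hypothetical stand-in: predicts the noise on latent tokens,
    conditioned on text by simple sequence concatenation."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, cond):  # both (B, N, dim)
        x = torch.cat([cond, noisy_latents], dim=1)
        x = self.transformer(x)
        # Keep only the latent positions, dropping the conditioning prefix.
        return self.out(x[:, cond.shape[1]:, :])

# One (untrained) forward pass to show the data flow end to end.
vae = VideoVQVAE()
text_enc = ConditionEncoder()
dit = DenoisingDiffusionTransformer()

video = torch.randn(1, 3, 8, 32, 32)            # (B, C, T, H, W)
latents = vae.encode(video)                      # -> (1, 256, 2, 8, 8)
tokens = latents.flatten(2).transpose(1, 2)      # -> (1, 128, 256) latent tokens
cond = text_enc(torch.randint(0, 32000, (1, 16)))
noise_pred = dit(tokens + torch.randn_like(tokens), cond)
print(noise_pred.shape)                          # torch.Size([1, 128, 256])

In a real system the diffusion model would be trained to predict the added noise over many timesteps, and the VQ-VAE decoder would turn the denoised latents back into frames; this sketch only traces the tensor shapes through the three components the team names.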
About Sora: Unveiled on Feb. 15, Sora is OpenAI’s first text-to-video model, capable of creating high-quality, realistic videos from text prompts alone. Generated videos can currently last up to a minute.

While the technology has been announced, OpenAI said it has no plans to make Sora generally available anytime soon. The company still needs to address several issues, such as curbing misinformation, hateful content and bias, and properly labeling generated videos.
What’s next: The Rabbitpre AIGC Joint Lab has outlined its future plans for Open-Sora, which include establishing a codebase and training an unconditional model on landscape datasets. In later stages, the group plans to train models that extend resolution and duration.
The team also plans to run experiments on a text-to-video landscape dataset, train a 1080p (1920 x 1080) model on a video-to-text dataset and develop a control model with additional conditions.
 