Author: Haotian
In addition to the "sinking" of AI toward localized deployment, the biggest recent change in the AI track is a breakthrough in multimodal video generation: from supporting only text-to-video to full-pipeline, integrated generation of text + image + audio.
Here are a few examples of the technical breakthroughs, to get a feel for where things stand:
1) ByteDance open-sourced the EX-4D framework: a monocular video can be turned into free-viewpoint 4D content in seconds, with a user approval rate of 70.7%. In other words, given an ordinary video, AI can automatically generate views from any angle, something that previously required a professional 3D modeling team;
2) Baidu's "Huixiang" platform: one image generates a 10-second video, with a claimed "movie-level" quality. Whether that is marketing exaggeration remains to be seen; the actual results won't be clear until the Pro version updates in August;
3) Google DeepMind Veo: it can generate 4K video and ambient sound synchronously. The key technical highlight is genuine "synchronization": previously, video and audio were produced by separate systems and stitched together, and true semantic matching is hard to achieve. In complex scenes, for example, the footstep sounds have to line up with the walking motion in the frame;
4) Douyin ContentV: 8 billion parameters, 1080p video generated in 2.3 seconds, at a cost of 3.67 yuan per 5 seconds. Frankly, the cost control is decent, but generation quality still falls short on complex scenes;
Why do these cases matter? Their value lies in the breakthroughs they represent in video quality, production cost, and application scenarios:
1. On the technology side, the complexity of multimodal video generation is effectively exponential: a single frame is on the order of 10^6 pixels; the video must maintain temporal coherence across at least 100 frames; audio must stay synchronized at roughly 10^4 samples per second; and 3D spatial consistency has to be maintained on top of all that.
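A minimal back-of-envelope sketch of that scale, using the same order-of-magnitude figures cited above (illustrative numbers only, not measurements of any particular model):

```python
# Back-of-envelope scale estimate for one short multimodal clip.
# Figures are the order-of-magnitude numbers cited above, purely illustrative.

PIXELS_PER_FRAME = 10**6      # ~1 megapixel per frame
FRAMES = 100                  # roughly 4 seconds at 24-30 fps
AUDIO_SAMPLE_RATE = 10**4     # ~10^4 audio samples per second
CLIP_SECONDS = 4

video_values = PIXELS_PER_FRAME * FRAMES * 3     # RGB values that must stay temporally coherent
audio_values = AUDIO_SAMPLE_RATE * CLIP_SECONDS  # audio samples that must stay in sync with motion

print(f"video values: {video_values:,}")   # 300,000,000
print(f"audio values: {audio_values:,}")   # 40,000
# On top of this, every generated view must also remain consistent in 3D space.
```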
In short, the technical complexity is substantial. The original approach was to have one super-large model handle everything; Sora is said to have burned through tens of thousands of H100s to reach its video generation capability. Now the same goal can be achieved through modular decomposition plus division of labor and cooperation among large models. ByteDance's EX-4D, for example, breaks the complex task into a depth estimation module, a viewpoint transformation module, a temporal interpolation module, a rendering optimization module, and so on. Each module specializes in one thing, and they cooperate through a coordination mechanism.
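To illustrate the "modular decomposition + coordination" pattern, here is a minimal sketch; the module names mirror the EX-4D description above, but the code is purely schematic and not ByteDance's actual implementation:

```python
# Schematic sketch of the "modular decomposition + coordination" pattern.
# Module names mirror the EX-4D description above; the logic is purely illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Stage:
    name: str
    run: Callable[[Dict], Dict]  # each specialist module transforms a shared working state

def depth_estimation(state: Dict) -> Dict:
    state["depth"] = f"depth map estimated from {state['video']}"
    return state

def viewpoint_transform(state: Dict) -> Dict:
    state["views"] = f"novel viewpoints derived from {state['depth']}"
    return state

def temporal_interpolation(state: Dict) -> Dict:
    state["smooth_views"] = f"temporally interpolated {state['views']}"
    return state

def rendering_optimization(state: Dict) -> Dict:
    state["output"] = f"final render of {state['smooth_views']}"
    return state

# The "coordination mechanism" here is just a linear pipeline; a real system
# would add feedback loops, retries, and quality checks between stages.
PIPELINE: List[Stage] = [
    Stage("depth estimation", depth_estimation),
    Stage("viewpoint transform", viewpoint_transform),
    Stage("temporal interpolation", temporal_interpolation),
    Stage("rendering optimization", rendering_optimization),
]

def generate_4d(monocular_video: str) -> str:
    state: Dict = {"video": monocular_video}
    for stage in PIPELINE:
        state = stage.run(state)
    return state["output"]

print(generate_4d("ordinary_clip.mp4"))
```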
2. On cost reduction: the gains come from optimizing the inference architecture itself, including a layered generation strategy (generate a low-resolution skeleton first, then enhance the imagery at high resolution), a cache reuse mechanism (reuse computation across similar scenes), and dynamic resource allocation (adjust model depth according to the complexity of the content).
It is this set of optimizations that brings Douyin's ContentV down to 3.67 yuan per 5 seconds.
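A minimal sketch of how those three ideas might fit together; the function names, thresholds, and heuristics are hypothetical, not ContentV's actual implementation:

```python
# Illustrative sketch of the three inference optimizations described above:
# layered (coarse-to-fine) generation, cache reuse across similar scenes,
# and dynamic model depth based on content complexity.
# All names and thresholds are hypothetical.
from functools import lru_cache

def scene_complexity(prompt: str) -> float:
    # Stand-in heuristic: longer, more detailed prompts count as more complex.
    return min(len(prompt.split()) / 50.0, 1.0)

def pick_model_depth(complexity: float) -> int:
    # Dynamic resource allocation: spend more layers only on complex content.
    if complexity < 0.3:
        return 12
    if complexity < 0.7:
        return 24
    return 48

@lru_cache(maxsize=256)
def generate_skeleton(prompt: str, low_res: int = 256) -> str:
    # Cache reuse: repeated or similar scene prompts skip the coarse pass entirely.
    return f"{low_res}px skeleton for '{prompt}'"

def upscale(skeleton: str, depth: int, high_res: int = 1080) -> str:
    # Layered generation: refine the cheap low-res skeleton into the final high-res clip.
    return f"{high_res}p video from [{skeleton}] via a {depth}-layer refiner"

def generate_clip(prompt: str) -> str:
    depth = pick_model_depth(scene_complexity(prompt))
    return upscale(generate_skeleton(prompt), depth)

print(generate_clip("a cat walking along a beach at sunset"))
```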
3. On application impact: traditional video production is a capital-intensive game of equipment, venues, actors, and post-production, where a 30-second commercial routinely costs hundreds of thousands of yuan. AI now compresses that process into a prompt plus a few minutes of waiting, and can achieve camera angles and effects that are hard to pull off with traditional shooting.
This shifts the barrier to video production from technology and money to creativity and aesthetics, which may trigger a reshuffle of the entire creator economy.
The question is: with all these changes on the demand side of web2AI technology, what do they have to do with web3AI?
1. First, the structure of compute demand has changed. AI used to compete on sheer compute scale: whoever had the larger homogeneous GPU cluster won. Multimodal video generation, however, needs a diversified mix of compute, which may create demand for distributed idle compute as well as for various distributed fine-tuning models, algorithms, and inference platforms.
2. Second, demand for data annotation will also grow. Producing a professional-grade video requires precise scene descriptions, reference images, audio styles, camera motion trajectories, lighting conditions, and so on, all of which become new professional annotation requirements. With web3-style incentives, photographers, sound engineers, 3D artists, and others can be motivated to contribute professional data elements, strengthening AI video generation with vertical, domain-expert annotation;
3. Finally, it is worth noting that as AI moves from centralized, large-scale resource allocation toward modular collaboration, that shift itself creates new demand for decentralized platforms. At that point, compute, data, models, incentives, and so on will combine into a self-reinforcing flywheel, which in turn will drive the convergence of web3AI and web2AI scenarios.