From chips to data: The next battle for AI

OORT | 2025-01-22 12:44

While the world remains focused on the war over AI chips, with its tariffs, intellectual property restrictions, supply chain sanctions, and geopolitical disputes, the data shortage that directly affects the future development of AI has largely been overlooked.

Earlier this year, Elon Musk astutely observed that AI companies have run out of data to train models on and have, in effect, "exhausted" the sum of human knowledge.

This article will explore the shrinking data pool and how decentralized AI (DeAI) can play a key role in addressing this challenge.

The data war is coming

First, it is important to be clear that data is not inexhaustible.

The data war has been foreshadowed: in 2023, a group of visual artists filed a landmark lawsuit against Stability AI, Midjourney, and DeviantArt, accusing the companies of using their works without permission to train generative AI models such as Stable Diffusion. Around the same time, Musk accused companies such as OpenAI of scraping data from Twitter (now X) without authorization, prompting X to tighten API pricing and access restrictions.

Similarly, Reddit sharply increased its API pricing, disrupting companies like OpenAI and Anthropic that rely on Reddit's user-generated content to train AI models. Reddit framed the decision as a way to monetize its data, but it also sparked debate about the tension between platforms that host user data and the AI companies seeking to use it.

These events highlight an increasingly obvious reality: we are running out of legally and ethically usable data.

Multiple fronts in the data war

The chip war is about producing the most powerful hardware; the data war is about obtaining the right datasets to train AI. The growing scarcity of ethical, high-quality data has become a bottleneck for many companies developing AI.

For large companies, the most feasible route is to buy data from centralized giants, costly as that is. Small businesses, however, face limited and often unaffordable options. Without proper methods or channels for collecting data, these companies will fall behind in AI development and innovation.

So how do we ethically and efficiently collect the data needed to advance AI development?

The data war will be fought on multiple fronts, each bringing unique challenges and opportunities.

Data collection

Who controls the channels for data collection, and how can ethical and legal compliance be ensured?

As lawsuits against tech giants for illegally scraping or using data pile up, new initiatives are beginning to emerge. For example, Harvard University has taken the lead in promoting data contributions with user consent to provide open-access datasets to the public. While such projects have their value, they are far from sufficient to meet the needs of commercial AI applications.

Synthetic data is also emerging as a potential solution. Companies such as Meta and Microsoft have begun using AI-generated data to fine-tune models such as Llama and Phi-4, and Google and OpenAI have likewise adopted synthetic data in their work. Synthetic data has challenges of its own, however, such as model "hallucination," which can undermine its accuracy and reliability.
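
For illustration only, here is a minimal Python sketch of the general shape of such a pipeline: prompt an existing model for new training examples, then apply basic quality gates before anything enters a fine-tuning set. The `generate()` stub, seed topics, and filters are assumptions for this sketch, not a description of any company's actual pipeline.

```python
import hashlib
import json

def generate(prompt: str) -> str:
    # Stub standing in for a call to an existing LLM (hosted API or local
    # model); it returns a placeholder so the sketch runs end to end.
    return f"Placeholder model output responding to: {prompt}"

SEED_TOPICS = ["explain a physics concept", "summarize a news article"]

def synthesize_examples(n_per_topic: int = 3) -> list[dict]:
    seen = set()
    dataset = []
    for topic in SEED_TOPICS:
        for _ in range(n_per_topic):
            instruction = generate(f"Write one training instruction about: {topic}")
            response = generate(f"Respond to this instruction:\n{instruction}")
            # Basic quality gates: drop empty outputs and exact duplicates.
            # Real pipelines add far stronger checks (fact verification,
            # model-based scoring) to limit hallucinated content.
            if not response.strip():
                continue
            key = hashlib.sha256(instruction.encode()).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            dataset.append({"instruction": instruction, "response": response})
    return dataset

if __name__ == "__main__":
    with open("synthetic_finetune.jsonl", "w", encoding="utf-8") as f:
        for row in synthesize_examples():
            f.write(json.dumps(row) + "\n")
```

The hallucination risk lives precisely in the gap this sketch leaves open: nothing above checks that a generated response is true, only that it is non-empty and not a duplicate.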

Decentralized data collection offers another promising alternative. By leveraging blockchain technology and using cryptocurrency to incentivize individuals to share data securely, decentralized models can address privacy, ownership, and quality concerns. Such solutions also democratize data access, enabling small businesses to compete in the AI ecosystem.
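
As a toy illustration of how these pieces fit together (a sketch under stated assumptions, not OORT's or any production protocol), the Python below pairs the two core mechanisms: a hash-chained, append-only log that makes contributions tamper-evident, and a token-style reward credited for each accepted submission. Only a hash of the data is recorded, one common way to keep the raw data private; all names here are hypothetical.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class Contribution:
    contributor: str   # e.g., a wallet address (hypothetical)
    data_hash: str     # hash of the submitted data, never the data itself
    prev_hash: str     # commitment to the previous entry: tamper evidence
    timestamp: float = field(default_factory=time.time)

    def entry_hash(self) -> str:
        payload = json.dumps(
            [self.contributor, self.data_hash, self.prev_hash, self.timestamp]
        )
        return hashlib.sha256(payload.encode()).hexdigest()

class ContributionLedger:
    """Append-only log: each entry commits to its predecessor, so
    rewriting any past record invalidates every later hash."""

    def __init__(self) -> None:
        self.entries: list[Contribution] = []
        self.rewards: dict[str, int] = {}

    def submit(self, contributor: str, raw_data: bytes, reward: int = 1) -> Contribution:
        prev = self.entries[-1].entry_hash() if self.entries else "genesis"
        entry = Contribution(contributor, hashlib.sha256(raw_data).hexdigest(), prev)
        self.entries.append(entry)
        # Token-style incentive: credit the contributor per accepted item.
        self.rewards[contributor] = self.rewards.get(contributor, 0) + reward
        return entry

ledger = ContributionLedger()
ledger.submit("0xAlice", b"sensor reading #1")
ledger.submit("0xBob", b"labeled image batch")
print(ledger.rewards)  # {'0xAlice': 1, '0xBob': 1}
```

A real deployment would replace the in-memory list with an actual blockchain and validate contributions before rewarding them, but the ownership and integrity properties described above come from exactly these two mechanisms.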

Data quality

Poor-quality data can lead to model bias, inaccurate predictions, and ultimately mistrust in AI systems. How can we ensure that the data used for AI training is accurate and representative?

Common industry practices include:

  • Rigorous data validation: Companies use advanced validation techniques to filter out errors, inconsistencies, and noise in datasets. This typically involves human oversight, automated checks, or a combination of both to verify data integrity.

  • Bias mitigation strategies: To ensure data is representative, companies deploy bias detection tools and diverse sampling techniques. In healthcare, for example, datasets must cover different demographic groups to avoid bias that could skew diagnostic models (a minimal version of such a check is sketched after this list).

  • Adherence to standards: Data security frameworks such as ISO/IEC 27001 and emerging ethical AI guidelines are becoming necessities for ensuring data quality and compliance with global standards.

  • Crowdsourced quality checking: Platforms such as Amazon Mechanical Turk are used for tasks such as labeling and verifying data. Although low-cost, these methods require supervision to ensure consistency and accuracy.

  • Decentralized verification: Blockchain and decentralized systems are emerging as tools for authenticating data sources, ensuring data authenticity and preventing tampering.
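
To ground the bias mitigation point above, here is a minimal sketch of one such check: comparing the demographic composition of a dataset against reference proportions and flagging any group that drifts beyond a tolerance. The reference shares, attribute name, and tolerance are illustrative assumptions, not values from any real healthcare dataset.

```python
from collections import Counter

# Hypothetical reference shares for a demographic attribute; in healthcare,
# these might come from census or patient-population statistics.
REFERENCE = {"group_a": 0.45, "group_b": 0.35, "group_c": 0.20}

def representation_gaps(records: list[dict], attr: str = "group",
                        tolerance: float = 0.05) -> dict[str, float]:
    """Return groups whose share of the dataset differs from the
    reference share by more than `tolerance` (absolute proportion)."""
    counts = Counter(r[attr] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in REFERENCE.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = observed - expected
    return gaps

# Example: a dataset that over-samples group_b and under-samples group_c.
sample = ([{"group": "group_a"}] * 45
          + [{"group": "group_b"}] * 50
          + [{"group": "group_c"}] * 5)
print(representation_gaps(sample))  # flags group_b (+0.15) and group_c (-0.15)
```

Checks like this are cheap to run on every dataset revision; the harder part, as the list above suggests, is agreeing on the reference proportions in the first place.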

In addition, regulators face the urgent challenge of establishing comprehensive data privacy and security rules that balance individual rights with technological innovation, while also addressing national security concerns such as protecting sensitive data from cyber threats, foreign exploitation, and abuse by hostile actors.

A rough road ahead

The impact of the data war is far-reaching. In healthcare, for example, access to high-quality patient data could revolutionize diagnosis and treatment planning, but strict privacy regulations stand in the way. Similarly, in the music industry, AI models trained on ethically sourced datasets could change everything from songwriting to copyright enforcement, provided they respect intellectual property rights.

These challenges highlight the importance of decentralized solutions that prioritize data transparency, quality, and accessibility. By leveraging decentralized systems, we can create a more equitable data ecosystem in which individuals retain control of their data, businesses gain access to ethical, high-quality datasets, and innovation advances without compromising privacy or security.

The shift from the chip war to the data war will reshape the AI ecosystem as it evolves, opening a significant opportunity for decentralized data solutions. By prioritizing ethical data collection and accessibility, decentralized AI has the potential to close the gap and lead the way to a fairer, more innovative AI future.

The battle for the best data has begun. Are we ready for it?

Author: Dr. Chong Li, Founder of OORT and Professor at Columbia University

Originally published in Forbes: https://www.forbes.com/sites/digital-assets/2025/01/20/from-chip-war-to-data-war-ais-next-battleground-explained/
