Author: JAE
Not long ago, nof1, a laboratory focused on artificial intelligence research in financial markets, announced on Twitter the launch of a groundbreaking experiment: Alpha Arena, a live trading test for large language models. The tweet drew over 14 million views, both within the crypto community and beyond it.
Conducted on Hyperliquid, the leading perpetual futures DEX (Perp DEX), the experiment placed six mainstream large language models (LLMs) in a real, competitive trading environment for the first time. Each model was allocated $10,000 in real capital to trade perpetual contracts independently. To date, DeepSeek holds the top position with a return of approximately 11%.
LLMs hold their first live-fire exercise in the crypto market, with DeepSeek currently in first place
Alpha Arena is a milestone because it moves beyond the limitations of traditional financial AI models. Previous financial AI research has largely been confined to historical backtesting, where a model's trades cannot affect market prices and the model only ever sees static data. In contrast, Alpha Arena creates a dynamic, zero-sum competitive environment that forces LLMs to adapt continuously to changing prices and liquidity and to make decisions in real time. This shift is why Alpha Arena is regarded as the "first live-fire exercise" for AI in the crypto market.
To ensure fairness in testing, nof1 fed all models the same prompts and data. This means that a model's performance will be primarily determined by its inherent reasoning architecture, the efficiency of its tools for converting analysis into trading instructions, and its ability to independently manage risk.
DeepSeek currently tops the leaderboard with a return of over 11%, followed by Claude at about 10%. Grok has dropped to third place at about 2%. All other models are showing losses.
On October 20, DeepSeek and Grok briefly led the leaderboard with returns of approximately 40%. However, all models pulled back together as the market declined, and returns shrank significantly, suggesting that LLMs may not yet be able to judge market conditions reliably.
Among the six models, Claude recorded the largest gains and losses and ran the most aggressive trading strategy. Gemini executed the most trades (64) and has paid the highest transaction fees to date, $600.42; it traded frequently without taking cost control into account. GPT-5's total loss reached $4,051, and its account equity curve declined steadily, leaving it in last place.
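The dollar figures above can be put on the same scale as the percentage returns with a little arithmetic. The following is a minimal sketch, not nof1's methodology: it assumes GPT-5's reported $4,051 loss is measured against the $10,000 starting capital mentioned earlier, and the function name `simple_return` is hypothetical.

```python
# Minimal sketch: convert a dollar profit or loss into a percentage return.
# Assumption: every model started with $10,000, as stated in the article.
STARTING_CAPITAL = 10_000.00


def simple_return(pnl: float, capital: float = STARTING_CAPITAL) -> float:
    """Return the profit or loss as a percentage of starting capital."""
    return pnl / capital * 100


# GPT-5's reported total loss, expressed as a negative profit.
gpt5_return = simple_return(-4_051.00)
print(f"GPT-5 return: {gpt5_return:.2f}%")  # prints "GPT-5 return: -40.51%"
```

Under that assumption, GPT-5's loss corresponds to a return of roughly -40.5%, which puts it on the same footing as the percentage returns quoted for the other models.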
Figure: Alpha Arena initial results comparison (October 21)
The data in the figure shows a clear disconnect between traditional LLM benchmark performance and net gains in real-world trading. In benchmarks like Finance Reasoning and AIME (Mathematics), GPT-5 and Grok-4 typically lead, demonstrating their ability to handle complex financial formulas and advanced mathematics.
However, financial markets aren't just about static mathematical reasoning; they are a dynamic system involving real-time data, market sentiment, and liquidity fluctuations. In the Alpha Arena live trading competition, DeepSeek V3.1 outperformed the other models. This suggests that the key to generating profits with LLMs lies not in static knowledge or complex reasoning scores, but in the ability to translate analysis into executable trading instructions. DeepSeek V3.1 achieved high returns with lower trading volume and a lower win rate, which suggests that a model can capture key price moves with just a few trades while keeping transaction fees under control.
Gemini 2.5 Pro offers a counterexample, showing how frequent trading and insensitivity to fees can damage an LLM's profitability. According to its trading records, Gemini's gains from trading actually exceeded its losses. However, perhaps because it cannot accurately estimate and optimize for fees, those fees consumed its entire trading profit and left it with a net loss.
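The fee effect can be made concrete with the two Gemini figures reported earlier (64 trades and $600.42 in fees). The sketch below is illustrative only: the gross trading profit of $400 is a made-up placeholder, since the article does not give Gemini's actual gross figure, and `net_pnl` is a hypothetical helper.

```python
# Rough sketch of how fees can turn a positive gross result into a net loss.
TRADES = 64                    # reported number of Gemini trades
TOTAL_FEES = 600.42            # reported total fees paid by Gemini, in USD
STARTING_CAPITAL = 10_000.00   # starting capital per model, in USD


def net_pnl(gross_pnl: float, fees: float) -> float:
    """Net profit or loss after subtracting transaction fees."""
    return gross_pnl - fees


avg_fee_per_trade = TOTAL_FEES / TRADES               # about $9.38 per trade
fee_drag_pct = TOTAL_FEES / STARTING_CAPITAL * 100    # about 6% of capital

# ASSUMPTION: placeholder value, not taken from Gemini's trading records.
HYPOTHETICAL_GROSS_PNL = 400.00
net = net_pnl(HYPOTHETICAL_GROSS_PNL, TOTAL_FEES)

print(f"Average fee per trade: ${avg_fee_per_trade:.2f}")
print(f"Fees as share of starting capital: {fee_drag_pct:.2f}%")
print(f"Net PnL on a hypothetical ${HYPOTHETICAL_GROSS_PNL:.0f} gross gain: ${net:.2f}")
```

With the reported numbers, fees alone amount to about 6% of the starting capital, so any gross trading profit below roughly $600 would end up as a net loss.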
AI trading will become widespread, but strategy homogeneity may trigger systemic risks
Commenting on the experiment, CZ posted on X that "AI + trading" is likely to become more common and to bring more trading volume.
The large-scale deployment of AI may also reshape liquidity and price discovery in the crypto market. Algorithmic trading is a core driver of modern financial markets. AI-driven algorithms can execute trades in as little as 0.01 seconds, far faster than the human reaction time of 0.1 to 0.3 seconds, which significantly improves market efficiency. Statistics show that global algorithmic trading volume in cryptocurrencies reached $94 trillion in 2023, with over 70% of this volume executed by bots.
As AI matures, it will enable more powerful automated trading. AI will not only improve market efficiency but also reduce slippage by providing liquidity across a wider range of assets and trading platforms, thereby improving overall market stability and resilience.
However, the autonomous, high-speed operation of AI in the crypto market may also amplify systemic financial risks. There is historical precedent: the 2010 "Flash Crash" in the Dow Jones Industrial Average showed that when many algorithmic trading systems with similar settings react to one another, they can set off a chain reaction that ends in a market crash.
In the AI + Crypto scenario, this risk may be amplified due to strategy homogeneity. Market observers have noted that the account equity curves of Grok-4 and DeepSeek are strikingly similar. The zero-sum environment of Alpha Arena will put all participating LLMs through a high-pressure adaptability test. In a zero-sum game, any LLM strategy that briefly leads may be detected and learned by other competitors.
In the future, if a large number of AI agents are developed on a few leading LLMs (such as DeepSeek V3.1 and Grok-4) and share similar training data and strategy logic, this will create what regulators call a "horizontal issue." Given the 24/7, highly leveraged nature of the crypto market, this convergence of strategies could lead to mutual detection and competition among agents. In the event of market volatility or unexpected inputs, all agents could trigger sell orders simultaneously, causing a "selling spiral" even more severe than the one in 2010.
On the other hand, CZ also expressed doubts in his tweet, voicing questions on the minds of many observers. It used to be widely believed that trading achieves the best results only with a superior, proprietary strategy. Now that the strategies of the six major LLMs are publicly available, will DeepSeek's strategy still be effective? How long will its profitability last? Will trading in the opposite direction of Gemini and GPT-5 yield higher returns than DeepSeek? Is Grok-4 learning from DeepSeek? Which model will perform best in extreme or one-sided market conditions? These questions can only be answered over time.
While many questions remain open, nof1's Alpha Arena is a highly innovative experiment that brings LLMs into the real crypto market. This "live-fire exercise" vividly demonstrates the enormous potential of AI to reshape the crypto market, and Alpha Arena is just the beginning.