PANews reported on February 26 that DeepSeek released DeepGEMM on the third day of its OpenSourceWeek: a CUDA library for FP8 GEMM that supports both dense matrix multiplication and mixture-of-experts (MoE) computation, used to optimize training and inference of its V3/R1 models.

DeepGEMM key features:

• Ultra-high performance: over 1,350 FP8 TFLOPS on Hopper GPUs

• Minimal dependencies: no heavy external dependencies, with code kept as clean and simple as a tutorial

• JIT compilation: no pre-compilation needed; kernels are compiled and optimized automatically at runtime

• Concise: the core code is only about 300 lines, yet it outperforms expert-tuned kernels for most matrix sizes

• Supports a dense layout and two MoE layouts (a usage sketch follows this list)
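
For context, a dense FP8 GEMM call through DeepGEMM's Python interface looks roughly like the sketch below. The function name gemm_fp8_fp8_bf16_nt, the (tensor, scale) tuple arguments, and the per-128-element scaling-factor shapes follow the project's public repository, but they should be read as illustrative assumptions rather than an authoritative API reference, since exact signatures may change between versions.

```python
# Minimal sketch of a dense FP8 GEMM with DeepGEMM.
# Assumptions: Hopper GPU, PyTorch with FP8 (E4M3) support, and the
# gemm_fp8_fp8_bf16_nt entry point as described in the DeepGEMM repo.
import torch
import deep_gemm

m, k, n = 4096, 7168, 4096  # example problem size (divisible by 128)

# FP8 (E4M3) inputs with float32 scaling factors, BF16 output.
lhs = torch.randn(m, k, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
rhs = torch.randn(n, k, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
lhs_scales = torch.ones(m, k // 128, device="cuda", dtype=torch.float32)        # 1x128 per-token groups
rhs_scales = torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32)  # 128x128 blocks
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)

# "nt": lhs is row-major, rhs is treated as transposed, per the library's naming.
deep_gemm.gemm_fp8_fp8_bf16_nt((lhs, lhs_scales), (rhs, rhs_scales), out)
```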