PANews reported on February 26 that DeepSeek released DeepGEMM on day three of its OpenSourceWeek: a CUDA library for FP8 GEMM that supports both dense matrix multiplication and mixture-of-experts (MoE) layouts, and is used to optimize training and inference of its V3/R1 models.
DeepGEMM key features:
• Ultra-high performance: over 1,350 FP8 TFLOPS on Hopper GPUs
• Minimal dependencies: no heavy external libraries, with code as clean and simple as a tutorial
• JIT compilation: no pre-compilation needed; kernels are compiled and optimized automatically at runtime
• Core logic of only about 300 lines, yet it outperforms expert-tuned kernels for most matrix sizes
• Supports the dense layout and two MoE layouts
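For context, the sketch below illustrates what an FP8 GEMM with fine-grained (per-block) scaling computes, written as a plain PyTorch dequantize-then-multiply reference. It is not DeepGEMM's actual API; the function names, the 128-wide block size, and the use of 1x128 scaling for both operands are simplifying assumptions made here for illustration. A real kernel such as DeepGEMM performs the scaling inside the matmul on the GPU rather than materializing FP32 copies.

```python
# Conceptual reference only -- not DeepGEMM's API.
# Shows FP8 (e4m3) quantization with per-block scales and the
# dequantized matmul that an FP8 GEMM kernel approximates.
import torch

BLOCK = 128  # assumed block width for the per-block scales

def quantize_fp8_per_block(x: torch.Tensor, block: int = BLOCK):
    """Quantize a 2-D tensor to FP8 e4m3 with one scale per 1 x block tile."""
    m, k = x.shape
    x_blocks = x.view(m, k // block, block)
    # One amax-based scale per block so values fit the e4m3 range (~448).
    scales = x_blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / 448.0
    x_fp8 = (x_blocks / scales).to(torch.float8_e4m3fn)
    return x_fp8.view(m, k), scales.squeeze(-1)  # scales: (m, k // block)

def fp8_gemm_reference(a_fp8, a_scale, b_fp8, b_scale, block: int = BLOCK):
    """Dequantize both operands and multiply: A (m x k) @ B^T (k x n) -> bf16."""
    m, k = a_fp8.shape
    n = b_fp8.shape[0]
    a = a_fp8.to(torch.float32).view(m, k // block, block) * a_scale.unsqueeze(-1)
    b = b_fp8.to(torch.float32).view(n, k // block, block) * b_scale.unsqueeze(-1)
    return (a.view(m, k) @ b.view(n, k).t()).to(torch.bfloat16)

# Usage: both operands stored in FP8 with per-block scales.
a_fp8, a_scale = quantize_fp8_per_block(torch.randn(256, 1024))
b_fp8, b_scale = quantize_fp8_per_block(torch.randn(512, 1024))
out = fp8_gemm_reference(a_fp8, a_scale, b_fp8, b_scale)  # (256, 512) bf16
```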