Researchers at DeepSeek have released a new experimental model, V3.2-exp, designed to significantly lower inference costs in long-context operations, as announced in a post on Hugging Face and an accompanying academic paper on GitHub.
The model’s key feature is DeepSeek Sparse Attention, a system that uses a module called a “lightning indexer” to prioritize specific excerpts from the context window. A separate “fine-grained token selection system” then chooses specific tokens from within those excerpts, which are loaded into the module’s limited attention window. Together, these steps allow the Sparse Attention model to operate over long portions of context with comparatively small server loads.
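To make the two-stage idea concrete, here is a minimal sketch of coarse-then-fine token selection in plain NumPy. This is not DeepSeek’s implementation: the function name `sparse_attention_sketch` and the parameters `block_size` and `top_k_tokens` are illustrative assumptions, and the block-mean scoring stands in for the lightning indexer described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_sketch(query, keys, values, block_size=64, top_k_tokens=256):
    """Illustrative two-stage sparse attention for a single query vector.

    Stage 1 (stand-in for the "lightning indexer"): cheaply score coarse
    blocks of the context and keep only the highest-scoring blocks.
    Stage 2 (stand-in for "fine-grained token selection"): within the
    surviving blocks, keep the top-k individual tokens and run ordinary
    attention over just that small subset.
    """
    n_tokens, d = keys.shape

    # Stage 1: cheap block-level relevance scores (query vs. mean key per block).
    n_blocks = int(np.ceil(n_tokens / block_size))
    block_scores = np.empty(n_blocks)
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        block_scores[b] = query @ block.mean(axis=0)

    # Keep enough top-scoring blocks to cover the token budget.
    keep_blocks = max(1, top_k_tokens // block_size)
    top_blocks = np.argsort(block_scores)[-keep_blocks:]

    # Gather candidate token indices from the selected blocks.
    candidates = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n_tokens))
        for b in top_blocks
    ])

    # Stage 2: exact scores for the candidates only, then keep the top-k tokens.
    token_scores = keys[candidates] @ query / np.sqrt(d)
    top = candidates[np.argsort(token_scores)[-min(top_k_tokens, len(candidates)):]]

    # Ordinary softmax attention, but only over the selected tokens.
    weights = softmax(keys[top] @ query / np.sqrt(d))
    return weights @ values[top]

# Toy usage: a 16k-token context attended with only a few hundred selected tokens.
rng = np.random.default_rng(0)
d, n = 64, 16_384
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
print(sparse_attention_sketch(q, K, V).shape)  # (64,)
```

The point of the sketch is the cost profile: the expensive softmax attention runs over only a few hundred selected tokens rather than the full context, which is the general mechanism by which this style of sparse attention reduces long-context serving load.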
DeepSeek’s preliminary testing indicates that the price of a simple API call can be cut by as much as half in long-context operations. Further testing is needed to validate these claims, but because the model is open-weight and freely available on Hugging Face, third parties will be able to evaluate the results presented in the paper.
DeepSeek’s new model is part of a series of recent breakthroughs addressing the issue of inference costs, meaning the server expenses of operating a pre-trained AI model, as distinct from the cost of training it. The researchers set out to make the basic transformer architecture operate more efficiently and found significant room for improvement.
Based in China, DeepSeek has been an unconventional player in the AI sector, particularly for those who view AI research as a nationalist competition between the U.S. and China. The company garnered attention earlier this year with its R1 model, trained primarily through reinforcement learning at a far lower cost than its American counterparts.
Although R1 did not spark the revolution in AI training that some predicted, and the new model’s release is unlikely to generate the same level of excitement, DeepSeek’s “sparse attention” approach could still teach U.S. providers valuable strategies for keeping inference costs low.