Huawei and SiliconFlow describe speedy LLM for CloudMatrix 384

"Our extensive evaluation with the DeepSeek-R1 model shows that CloudMatrix-Infer achieves state-of-the-art efficiency without sacrificing accuracy. CloudMatrix-Infer delivers a prefill throughput of 6,688 tokens/s per NPU, and a decode throughput of 1,943 tokens/s per NPU (at ยก50 ms TPOT). These results correspond to compute efficiencies of 4.45 tokens/s/TFLOPS for prefill and 1.29 tokens/s/TFLOPS for decode, both exceeding published results for SGLang on NVIDIA H100 and DeepSeek on NVIDIA H800. CloudMatrix-Infer also effectively manages the throughput-latency trade-off, sustaining a 538 tokens/s decode throughput even under the stricter sub-15 ms TPOT constraint. Furthermore, the INT8 quantization on Ascend 910C maintains model accuracy comparable to the official DeepSeek-R1 API across 16 distinct benchmarks."

Other contents

Slash Server Costs Without Hardware or Software Changes

AMD Bares Instinct MI350 GPU, Teases MI400 and MI500

Untethered

Qualcomm Nabs Alphawave to Bolster Data-Center Strategy

The DM&P Vortex86EX3 Keeps 90s-Era x86 Computing Alive

Chinese CPU maker Hygon to acquire HPC company Sugon

AMD Emerges to Challenge Nvidia on MLPerf Training

Israeli startup Speedata employs CGRA to accelerate data analytics

Broadcom Tomahawk 6 takes a swing at data-center and AI-cluster networks