MoE Expert Load Profiling on H100 GPUs

Cornell Zhang Research Group  |  Advisors: Prof. Zhiru Zhang, Yizhao Gao

Profiled Mixture-of-Experts inference on 8× NVIDIA H100 80GB GPUs to characterize expert selection patterns and quantify the performance cost of load imbalance in Expert Parallelism. Analyzed three MoE models — MiniMax-M2.5 (230B params), gpt-oss-120b (117B), and gpt-oss-20b (21B) — using SGLang, NVIDIA Nsight Systems, and DCGM.

Key Results

26B+ tokens routed
836M+ CUDA events analyzed
3 MoE models profiled
~20% decode efficiency loss identified

Plots

EP Wait Time Analysis

EP Wait Time Analysis: Per-GPU MoE compute duration and wait-time distribution for MiniMax-M2.5. GPU 3 is consistently the slowest (12.2μs avg), forcing a mean 32.6μs wait on the faster GPUs.
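The wait metric in this plot can be sketched as below. This is a simplification assuming all ranks synchronize at the layer boundary before the next all-to-all; the function name and sample durations are illustrative, not the actual analysis code.

```python
def ep_wait_times(durations_us):
    """Per-GPU EP wait: the gap between each GPU's MoE compute time and
    the slowest GPU's, since expert parallelism stalls every rank until
    the straggler finishes the layer."""
    slowest = max(durations_us)
    return [slowest - d for d in durations_us]

# One MoE layer in which GPU 3 is the straggler (illustrative numbers):
layer_durations = [8.1, 7.9, 8.4, 12.2, 8.0, 7.8, 8.3, 8.2]
waits = ep_wait_times(layer_durations)  # GPU 3 waits 0; all others stall
```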

Normalized Expert Load Heatmap

Normalized Expert Load Heatmap: Token routing distribution across 256 experts and 62 layers during prefill on pile-10k (10,000 samples). Hot spots indicate heavily loaded experts.

Expert Diversity

Expert Diversity: Each layer activates only 8–12 out of 256 experts per decode step (<5% utilization), concentrating load on a small subset of experts.

Load Balance Summary

Load Balance Summary: Coefficient of variation of 4–5.5 across all layers (left) and per-expert total token load spanning a 4–5× range (right).
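The balance metric in this summary, the coefficient of variation of per-expert token counts (population standard deviation divided by mean), can be sketched as follows; the function name and sample counts are illustrative.

```python
import statistics

def load_cv(token_counts):
    """Coefficient of variation of per-expert token load:
    population std / mean. 0 means perfectly balanced routing;
    values well above 1 indicate a few heavily loaded experts."""
    return statistics.pstdev(token_counts) / statistics.fmean(token_counts)

balanced = [1000] * 8          # every expert sees the same load -> CV 0
skewed = [5000] + [100] * 7    # one hot expert -> CV well above 1
```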

GPU Power & SM Utilization

GPU Power & SM Utilization: DCGM monitoring during a clean MiniMax-M2.5 decode run. SM utilization oscillates between ~15% and 100%, confirming GPU idle periods during EP synchronization.

gpt-oss-120b EP Analysis

gpt-oss-120b EP Analysis: Higher mean EP wait time (48.2μs vs 32.6μs) despite fewer experts (128 vs 256), suggesting more severe routing imbalance.

Methodology

Decode Trace

500 MATH-500 problems, 512 max new tokens. Captures per-token expert selection across 16,384 decode steps per model via instrumented SGLang inference.
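The per-token selection logging can be sketched roughly as below. The counter layout and the `record_step` hook are assumptions for illustration, not SGLang's actual instrumentation API; the layer/expert counts follow the MiniMax-M2.5 geometry described above.

```python
from collections import Counter

NUM_LAYERS, NUM_EXPERTS = 62, 256   # MiniMax-M2.5 geometry (assumed here)

# One counter per layer, mapping expert id -> times selected.
selection_counts = [Counter() for _ in range(NUM_LAYERS)]

def record_step(layer_idx, topk_expert_ids):
    """Accumulate one decode step's routing decision for one layer,
    e.g. called from a hook on the instrumented MoE layer."""
    selection_counts[layer_idx].update(topk_expert_ids)

# One decode step that routed a token to 8 of the 256 experts in layer 0:
record_step(0, [3, 17, 45, 99, 120, 181, 200, 255])
```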

Prefill Profiling

10,000 pile-10k samples. Captures aggregate token routing statistics across all experts and layers, exposing structural load imbalance during the prefill phase.

Hardware Profiling

Nsight Systems traces exported to SQLite (836M+ events). Queried for fused_moe_kernel timing spread across all 8 GPUs to isolate EP synchronization cost.
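A query of roughly this shape pulls per-GPU kernel durations out of the `nsys export --type sqlite` trace. The table and column names (CUPTI_ACTIVITY_KIND_KERNEL with start/end in nanoseconds, shortName resolved through StringIds) are my assumption about the export schema; verify them against your nsys version.

```python
import sqlite3

QUERY = """
SELECT k.deviceId, (k.end - k.start) AS dur_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON k.shortName = s.id
WHERE s.value LIKE '%fused_moe_kernel%'
"""

def moe_kernel_durations(db_path):
    """Return {deviceId: [duration_ns, ...]} for every fused-MoE launch,
    the raw input for the per-GPU timing-spread analysis."""
    per_gpu = {}
    with sqlite3.connect(db_path) as conn:
        for dev, dur in conn.execute(QUERY):
            per_gpu.setdefault(dev, []).append(dur)
    return per_gpu
```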

DCGM Monitoring

DCGM power and SM utilization sampled at 1-second intervals during clean inference runs. Corroborates Nsight findings with system-level GPU activity signals.
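A minimal post-processing sketch for the 1 Hz samples, e.g. collected with `dcgmi dmon -e 155,1002 -d 1000`. The field ids (155 for power usage, 1002 for SM activity) and the row layout parsed below are assumptions based on DCGM's documented field identifiers, not this project's actual tooling; check them against your DCGM release.

```python
def parse_dmon_row(line):
    """Parse one assumed `dcgmi dmon` sample row of the form
    'GPU <id> <power_w> <sm_active>'. Returns None for header or
    malformed lines."""
    parts = line.split()
    if len(parts) != 4 or parts[0] != "GPU":
        return None
    return {"gpu": int(parts[1]),
            "power_w": float(parts[2]),
            "sm_active": float(parts[3])}
```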

Technologies

SGLang v0.5.9 · PyTorch · NVIDIA Nsight Systems · NVIDIA DCGM · H100 80GB · AWS EC2 · Python · SQLite