Advisors: Prof. Zhiru Zhang, Yizhao Gao

Spring 2026 - MoE Inference Profiling & Roofline Analysis on 8×H100 GPUs

Phase 1: Real-Hardware Profiling

Profiled five models to track per-layer decode latency. The slowest GPU consistently took ~33% longer on MoE compute per layer than the fastest GPU due to token routing skew, leaving the remaining GPUs waiting at each layer's synchronization point.
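
For illustration, a minimal sketch of how the per-layer skew and implied wait time can be computed from per-GPU MoE compute durations; the durations array below is made up for the example, and the real values come from the CUDA traces described under Methodology & Stack.

```python
import numpy as np

# Hypothetical per-layer MoE compute durations in milliseconds,
# shape (num_layers, num_gpus). Real values come from the profiled traces.
durations = np.array([
    [1.20, 1.28, 1.60, 1.25, 1.31, 1.22, 1.27, 1.24],  # layer 0
    [1.23, 1.26, 1.64, 1.29, 1.30, 1.27, 1.25, 1.28],  # layer 1
])

fastest = durations.min(axis=1)   # best GPU per layer
slowest = durations.max(axis=1)   # straggler per layer

# Routing skew: how much longer the straggler computes than the fastest GPU.
skew = slowest / fastest - 1.0

# Under expert parallelism every GPU waits for the straggler before the
# next All-to-All, so per-GPU wait time is the gap to the slowest GPU.
wait_ms = slowest[:, None] - durations

print(f"mean per-layer skew: {skew.mean():.0%}")
print(f"mean wait per GPU per layer: {wait_ms.mean():.3f} ms")
```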

Fig 1: Per-GPU MoE compute duration, showing EP synchronization wait times.
Fig 2: Normalized expert load heatmap, highlighting token routing skew.
Phase 2: Roofline Modeling

Extended a simulator with a layer-by-layer Roofline model to classify layers as compute- or memory-bound. Finding: Expert Parallelism overhead scales with expert intermediate size, not expert count.
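
As a minimal sketch of the classification rule behind the Roofline model (nominal H100 peak numbers; per-layer FLOP and byte counts come from the simulator, and the example layer dimensions are Mixtral-like, for illustration only):

```python
from dataclasses import dataclass

# Nominal single-GPU peaks (datasheet-level, illustrative; sustained rates differ).
PEAK_FLOPS = 989e12           # H100 SXM dense BF16, FLOP/s
PEAK_BW = 3.35e12             # HBM3 bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # arithmetic intensity at the ridge point, FLOP/byte

@dataclass
class Layer:
    name: str
    flops: float         # FLOPs the layer executes for one decode step
    bytes_moved: float    # bytes read + written (weights, activations, KV cache)

def classify(layer: Layer) -> tuple[str, float]:
    """Return (bound, attainable FLOP/s) for one layer under the Roofline model."""
    intensity = layer.flops / layer.bytes_moved
    attainable = min(PEAK_FLOPS, PEAK_BW * intensity)
    return ("compute-bound" if intensity >= RIDGE else "memory-bound", attainable)

# Example: a single expert FFN (Mixtral-like dims: hidden 4096, intermediate 14336)
# serving one decode token streams all three weight matrices for ~1 FLOP/byte,
# far below the ridge point, so it is firmly memory-bound.
ffn = Layer("expert_ffn",
            flops=2 * 3 * 4096 * 14336,
            bytes_moved=3 * 4096 * 14336 * 2)
print(classify(ffn))
```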

Fig 3: Cross-model comparison of Tensor vs. Expert Parallelism efficiency.
Fig 4: Mixtral-8x7B roofline, showing the widest gap between TP and EP efficiency.
Fig 5: Qwen3-235B roofline; per-layer All-to-All overhead compounds across 94 layers.
Fig 6: gpt-oss-120b roofline, showing a moderate EP penalty.
Fig 7: DeepSeek-V2-Lite roofline; TP and EP points overlap closely due to large shared experts.
Methodology & Stack

Generated a decode trace over 500 MATH problems and a prefill profile of 10k samples. Analyzed 836M+ CUDA events via SQLite. Built with SGLang, PyTorch, DCGM, and Matplotlib.
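
For context, a minimal sketch of the kind of SQLite aggregation used over the exported CUDA events; the `kernels` table and its columns are a simplified, hypothetical schema, and the kernel-name filter is illustrative rather than the actual kernel naming.

```python
import sqlite3

# Hypothetical simplified schema: one row per CUDA kernel launch with
# (device_id, layer_idx, name, start_ns, end_ns).
QUERY = """
SELECT device_id,
       layer_idx,
       SUM(end_ns - start_ns) / 1e6 AS compute_ms
FROM kernels
WHERE name LIKE '%grouped_gemm%'   -- illustrative filter for MoE expert GEMMs
GROUP BY device_id, layer_idx
ORDER BY layer_idx, device_id;
"""

def per_gpu_layer_times(db_path: str) -> list[tuple[int, int, float]]:
    """Aggregate per-GPU, per-layer MoE compute time from the event database."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for device_id, layer_idx, compute_ms in per_gpu_layer_times("decode_trace.sqlite"):
        print(f"GPU {device_id}  layer {layer_idx:3d}  {compute_ms:8.3f} ms")
```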

Summer 2026

Continuing research with the Zhang Group. Project scope TBD.