Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without a proportional increase in compute. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.
Rating: 8.7 · Installs: 0 · Category: Machine Learning
Excellent MoE training skill with comprehensive coverage of architectures, routing mechanisms, and practical implementation details. The description clearly articulates when to use MoE (5× cost reduction, scaling capacity without proportional compute increase) and covers major frameworks (DeepSpeed, HuggingFace). Provides production-ready code for core MoE components (routing, load balancing, expert parallelism), complete DeepSpeed configurations, and Mixtral 8x7B implementation. Structure is well-organized with clear sections and references to additional files for advanced topics. The skill addresses a genuinely complex domain where CLI agents would struggle with the nuanced trade-offs in expert count, capacity factors, learning rates, and load balancing. Minor improvement opportunities: could add more explicit troubleshooting decision trees and quantitative benchmarks comparing MoE vs dense models across different scales.
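To illustrate the routing and load-balancing components the description and review refer to, here is a minimal, hedged sketch of a top-k gating layer with a Switch-Transformer-style auxiliary load-balancing loss. The class and parameter names (`TopKRouter`, `d_model`, `num_experts`, `k`) are illustrative assumptions, not the skill's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k MoE router with a load-balancing auxiliary loss."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.num_experts = num_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model)
        logits = self.gate(x)                              # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize the selected experts' weights so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Load-balancing loss: penalize the product of the fraction of tokens
        # dispatched to each expert (top-1 assignment) and the mean gate
        # probability per expert, scaled by num_experts (uniform routing -> 1).
        dispatch = F.one_hot(topk_idx[..., 0], self.num_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)           # f_i
        mean_probs = probs.mean(dim=0)                     # P_i
        aux_loss = self.num_experts * (tokens_per_expert * mean_probs).sum()

        return topk_idx, topk_probs, aux_loss
```

In practice, the auxiliary loss would be added to the language-modeling loss with a small coefficient (e.g. 0.01) so that routing stays balanced across experts without dominating training; the actual skill files cover this alongside capacity factors and expert parallelism.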