Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without a proportional increase in compute. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.
Rating: 8.7 · Installs: 0 · Category: Machine Learning
Excellent MoE training skill with comprehensive coverage of architectures, routing mechanisms, and practical implementation details. The description clearly articulates when to use MoE (5× cost reduction, scaling capacity without proportional compute increase) and covers major frameworks (DeepSpeed, HuggingFace). Provides production-ready code for core MoE components (routing, load balancing, expert parallelism), complete DeepSpeed configurations, and Mixtral 8x7B implementation. Structure is well-organized with clear sections and references to additional files for advanced topics. The skill addresses a genuinely complex domain where CLI agents would struggle with the nuanced trade-offs in expert count, capacity factors, learning rates, and load balancing. Minor improvement opportunities: could add more explicit troubleshooting decision trees and quantitative benchmarks comparing MoE vs dense models across different scales.
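To illustrate the routing and load-balancing components the description and review refer to, here is a minimal, hedged sketch of a top-k gating layer with a Switch-Transformer-style auxiliary load-balancing loss. The class and parameter names (`TopKRouter`, `d_model`, `num_experts`, `k`) are illustrative assumptions, not the skill's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k MoE router with a load-balancing auxiliary loss."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.num_experts = num_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model)
        logits = self.gate(x)                              # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize the selected experts' weights so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Load-balancing loss: penalize the product of the fraction of tokens
        # dispatched to each expert (top-1 assignment) and the mean gate
        # probability per expert, scaled by num_experts (uniform routing -> 1).
        dispatch = F.one_hot(topk_idx[..., 0], self.num_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)           # f_i
        mean_probs = probs.mean(dim=0)                     # P_i
        aux_loss = self.num_experts * (tokens_per_expert * mean_probs).sum()

        return topk_idx, topk_probs, aux_loss
```

In practice, the auxiliary loss would be added to the language-modeling loss with a small coefficient (e.g. 0.01) so that routing stays balanced across experts without dominating training; the actual skill files cover this alongside capacity factors and expert parallelism.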