Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs. dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without a proportional increase in compute. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.
Rating: 7.6
Installs: 0
Category: Machine Learning
Excellent MoE training skill with comprehensive coverage of architectures, routing mechanisms, and practical implementations. The description clearly articulates when to use MoE (5× cost reduction, sparse activation), and the skill provides production-ready code for both basic and advanced patterns (Mixtral 8x7B, PR-MoE). Strong task knowledge with detailed DeepSpeed configurations, load balancing strategies, and hyperparameter tuning guidelines. Well-structured with clear progression from basics to advanced topics, and references are appropriately delegated to separate files. High novelty as MoE training is complex, requiring specialized knowledge of routing, expert parallelism, and load balancing that would be difficult for a CLI agent to synthesize from scratch. Minor improvements could include more explicit routing algorithm comparisons and multi-framework examples beyond DeepSpeed.
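The listing above does not include the skill's own code, but as a rough illustration of the routing and load-balancing concepts it refers to, the following is a minimal sketch of top-2 token routing with a Switch-Transformer-style auxiliary balance loss. It is an illustrative assumption, not material from the skill: the `Top2Router` class, its parameters, and the loss formulation are hypothetical examples written in PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Illustrative top-2 gate: each token is routed to its two highest-scoring experts."""
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts = num_experts

    def forward(self, x):
        # x: (tokens, hidden_size)
        logits = self.gate(x)                              # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(2, dim=-1)       # per-token expert choices
        # Renormalize the two selected gate weights so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Auxiliary load-balancing loss (Switch-Transformer style, used here as an
        # illustrative assumption): penalize the product of the fraction of tokens
        # whose top-1 choice is expert e and the mean gate probability for expert e,
        # which pushes routing toward uniform expert utilization.
        tokens_per_expert = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
        mean_probs = probs.mean(0)
        aux_loss = self.num_experts * torch.sum(tokens_per_expert * mean_probs)

        return topk_idx, topk_probs, aux_loss

# Example usage: route 16 tokens of width 1024 across 8 experts.
router = Top2Router(hidden_size=1024, num_experts=8)
idx, weights, aux = router(torch.randn(16, 1024))
```

In a full MoE layer the indices and weights would drive a dispatch to expert FFNs (possibly sharded across devices via expert parallelism), and `aux` would be added to the training loss with a small coefficient.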