Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use this skill when training models larger than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, and DeepSeek.
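As a rough illustration of how those parallelism dimensions compose, the sketch below shows the standard decomposition where data parallelism fills whatever remains after tensor, pipeline, and context parallelism are fixed. This is not taken from the skill itself; the GPU counts and parallel sizes are hypothetical example values.

```python
# Sketch: decomposing a cluster's world size across Megatron-style parallelism
# dimensions. All concrete numbers here are hypothetical placeholders.

def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel size is what remains after tensor (TP), pipeline (PP),
    and context (CP) parallelism are fixed; the product must divide the
    world size evenly. Expert parallelism (EP) additionally shards MoE
    expert weights and is omitted from this simplified arithmetic."""
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "world size must divide evenly"
    return world_size // model_parallel

if __name__ == "__main__":
    # Hypothetical 512-GPU job (64 nodes x 8 GPUs).
    world_size = 512
    tp, pp, cp = 8, 4, 2  # tensor, pipeline, context parallel sizes
    dp = data_parallel_size(world_size, tp, pp, cp)
    print(f"TP={tp} PP={pp} CP={cp} -> DP={dp}")  # DP=8
```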
Rating: 8.7
Installs: 0
Category: AI & LLM
Exceptional skill for large-scale LLM training. The description clearly articulates when to use this skill (>1B parameters, GPU efficiency needs, advanced parallelism). SKILL.md provides comprehensive, production-ready workflows with concrete examples for LLaMA training, MoE models, and performance optimization. Task knowledge is outstanding with detailed code snippets, parallelism configuration tables, troubleshooting guides, and clear step-by-step checklists. Structure is clean with a well-organized overview and appropriate delegation to reference files for deep dives. Novelty is extremely high—training 70B-405B parameter models with advanced 5D parallelism (TP/PP/DP/CP/EP) and achieving 47% MFU would require hundreds of thousands of tokens for a CLI agent to figure out independently, making this skill highly cost-effective. Minor improvement possible: could slightly expand the 'When to use vs alternatives' section, but overall this is a production-grade, highly valuable skill.
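For context on the MFU figure cited above, the sketch below shows how model FLOPs utilization is commonly estimated, using the usual 6 x N FLOPs-per-token approximation for dense transformer training. The throughput, GPU count, and peak-FLOPs numbers are hypothetical placeholders, not measurements from this skill.

```python
# Sketch of an MFU (model FLOPs utilization) estimate; all inputs below are
# hypothetical example values.

def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Approximate MFU as achieved training FLOPs over hardware peak FLOPs,
    using the common 6 * N FLOPs-per-token estimate (forward + backward)."""
    achieved_flops = 6.0 * params * tokens_per_sec
    peak_flops = peak_flops_per_gpu * num_gpus
    return achieved_flops / peak_flops

if __name__ == "__main__":
    # Hypothetical 70B-parameter run on 512 H100s at 500k tokens/s.
    h100_bf16_peak = 989e12  # roughly the dense BF16 peak per H100, in FLOP/s
    print(f"MFU = {mfu(70e9, 500_000, 512, h100_bf16_peak):.1%}")
```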
