Optimizes transformer attention with Flash Attention for a 2-4x speedup and a 10-20x reduction in attention memory. Use it when training or running transformers with long sequences (>512 tokens), when attention is causing GPU memory issues, or when you need faster inference. Supports PyTorch native SDPA, the flash-attn library, H100 FP8, and sliding window attention.
Rating: 8.7
Installs: 0
Category: AI & LLM
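As a minimal illustration of the PyTorch-native SDPA path mentioned in the description, the sketch below pins dispatch to the flash backend via torch.nn.attention.sdpa_kernel (PyTorch 2.3+). The tensor shapes, dtype, and sizes are illustrative assumptions, not values taken from the skill itself.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# SDPA expects (batch, num_heads, seq_len, head_dim); the flash backend
# needs fp16/bf16 inputs on a CUDA device. Shapes here are illustrative.
q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash kernel so an unsupported configuration
# errors out instead of silently falling back to the math backend.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```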
Excellent skill with comprehensive, actionable guidance for Flash Attention optimization. The description clearly articulates when and why to use this skill (long sequences, memory constraints, speedup needs). Task knowledge is outstanding: three detailed workflows cover PyTorch native SDPA, the flash-attn library, and H100 FP8 optimization, each with copy-paste checklists, code examples, benchmarking, and troubleshooting. The structure is clean, progressing logically from quick start to advanced topics and appropriately deferring detailed benchmarks and integrations to reference files. Novelty is strong: implementing Flash Attention correctly requires specialized knowledge of GPU memory access patterns, proper tensor layouts, and hardware-specific optimizations that would otherwise cost a CLI agent substantial research and trial-and-error. Minor improvement areas: the skill could state exact memory-savings formulas and add more decision criteria for choosing between PyTorch SDPA and the flash-attn library.
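On the review's last point, one concrete decision criterion is the API surface itself. The sketch below, assuming the flash-attn 2.x flash_attn_func signature, shows the (batch, seq_len, num_heads, head_dim) layout it expects (SDPA uses (batch, num_heads, seq_len, head_dim)) and its built-in sliding-window support; window size and shapes are illustrative assumptions.

```python
import torch
from flash_attn import flash_attn_func  # assumes flash-attn 2.x is installed

# flash-attn expects (batch, seq_len, num_heads, head_dim), unlike SDPA,
# and requires fp16/bf16 tensors on an Ampere-or-newer GPU.
q = torch.randn(2, 2048, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 2048, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 2048, 8, 64, device="cuda", dtype=torch.float16)

# Causal attention restricted to a 1024-token window on the left;
# window_size=(-1, -1) would disable the sliding window. Values illustrative.
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))
```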
