Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
Rating: 7.6 · Installs: 0 · Category: Machine Learning
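As a concrete illustration of the loading path the description refers to, here is a minimal sketch using the Hugging Face transformers AWQ integration. This is not code from the skill itself: the checkpoint name is a placeholder chosen for illustration, and any AWQ-quantized checkpoint on the Hub would load the same way, assuming the autoawq package and a CUDA GPU are available.

```python
# Minimal sketch: loading a pre-quantized AWQ checkpoint via transformers.
# The checkpoint name is illustrative, not taken from the skill.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # placeholder AWQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spreads layers across available GPUs
)

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```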
Excellent skill documentation for AWQ quantization. The description clearly articulates when to use AWQ versus alternatives (GPTQ, bitsandbytes), making it easy for a CLI agent to decide invocation. Task knowledge is comprehensive, with complete code examples for loading pre-quantized models, quantizing custom models, multi-GPU deployment, vLLM integration, and various kernel backends. The structure is well organized, with clear sections, comparison tables, and performance benchmarks. The skill addresses a genuinely complex task (4-bit LLM quantization with activation-aware weight protection) that would otherwise require significant research and experimentation for a CLI agent to implement from scratch. Minor caveats: references to advanced-usage.md and troubleshooting.md suggest additional depth not shown here, and the deprecation notice is important context, though it doesn't diminish the skill's current utility. Overall, this is a high-quality skill that meaningfully reduces the token cost and complexity of deploying quantized LLMs.
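To make the quantization workflow the review describes more concrete, the following is a hedged sketch of one-shot 4-bit quantization with the AutoAWQ library (the library the skill's deprecation notice most likely concerns). The model and output paths are placeholders, and the quant_config values are AutoAWQ's commonly used defaults, not settings taken from the skill.

```python
# Sketch of one-shot AWQ quantization with the AutoAWQ library.
# Paths are placeholders; quant_config values are common AutoAWQ defaults.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder source model
quant_path = "mistral-7b-instruct-awq"             # placeholder output dir

quant_config = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # weights quantized in groups of 128
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # GEMM kernel backend
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrates on a small default dataset to identify salient channels,
# then scales and packs the weights to 4 bits.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

A directory produced this way can then be served directly; for example, vLLM accepts it via `LLM(model=quant_path, quantization="awq")`, matching the vLLM integration the review mentions.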