Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
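To make the deployment claim concrete, here is a minimal sketch of querying a locally served vLLM OpenAI-compatible endpoint. The model checkpoint, port, and launch flags are illustrative assumptions, not values taken from the skill itself; the `vllm serve` flags shown in the comment (`--tensor-parallel-size`, `--gpu-memory-utilization`) are standard vLLM options.

```python
# Sketch: query a vLLM OpenAI-compatible server.
# Assumes the server was launched with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --port 8000
# (model name, parallelism degree, and port are illustrative choices)
from openai import OpenAI

# vLLM's server accepts any API key by default; "EMPTY" is the conventional placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing client code can be pointed at it by changing only `base_url` and the model name.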
Rating: 7.6
Installs: 0
Category: AI & LLM
Excellent skill for vLLM deployment with comprehensive workflow coverage. The description clearly indicates when to use the skill (production APIs, throughput optimization, limited GPU memory).

Task knowledge is outstanding, with three complete workflows covering production deployment, batch inference, and quantization, each with concrete code and commands. Structure is very clean, with a logical progression from quick start to workflows to troubleshooting, appropriately delegating deep technical details to reference files. Novelty is strong: deploying production-grade LLM serving with proper configuration, monitoring, and optimization would otherwise require significant research and many tokens for a CLI agent.

Minor improvement areas: the description could mention the batch inference capabilities, and the trade-offs between quantization methods deserve a bit more detail. Overall, this is a well-crafted skill that provides genuine value by consolidating complex deployment knowledge into actionable workflows.
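Since the review singles out the batch-inference workflow, a short offline-inference sketch may help illustrate it. The checkpoint name and sampling parameters below are hypothetical, and `quantization="awq"` presumes an AWQ-quantized checkpoint; this is an assumed setup, not the skill's own code.

```python
# Sketch: offline batch inference with vLLM's LLM class.
# Checkpoint and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one paragraph.",
    "List three benefits of tensor parallelism.",
]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# quantization="awq" requires an AWQ-quantized checkpoint; omit it for full-precision weights.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,
)

# generate() batches the prompts internally via continuous batching.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```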