Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use it for JSON/regex outputs, constrained decoding, and agentic workflows with tool calls, or when you need inference up to 5× faster than vLLM on workloads with heavy prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.
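As a concrete illustration of the structured-output use case named above, here is a minimal sketch of JSON-constrained decoding against an SGLang server's OpenAI-compatible endpoint. It assumes a server already launched with "python -m sglang.launch_server --model-path <model> --port 30000"; the port, model name, schema, and prompt are illustrative assumptions, not taken from the skill itself.

# Minimal sketch: JSON-constrained decoding via SGLang's OpenAI-compatible API.
# Assumes a running SGLang server on localhost:30000; names below are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Schema the server enforces during decoding, so the reply is guaranteed to
# parse as JSON matching it (constrained decoding, not post-hoc validation).
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Give one fact card for Paris."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)

print(json.loads(response.choices[0].message.content))

Because repeated requests like this share their prompt prefix, RadixAttention can reuse cached KV entries across calls, which is where the prefix-sharing speedups cited in the description come from.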
Rating: 7.6 · Installs: 0 · Category: AI & LLM
Exceptional skill for SGLang serving, with comprehensive coverage of structured generation, RadixAttention caching, and production deployment. The description clearly identifies when to use SGLang versus alternatives (vLLM, TensorRT-LLM), and the skill provides complete code examples for JSON/regex/grammar outputs, agent workflows, multi-turn chats, and OpenAI-compatible API usage, along with performance benchmarks showing 5-10× speedups. The structure is excellent, with a concise SKILL.md and detailed reference files. High novelty: RadixAttention prefix caching and constrained decoding are complex features that would cost a CLI agent many tokens to implement from scratch. Minor improvement possible: it could add error-handling patterns and edge-case coverage.