Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
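To illustrate the kind of inference this skill wraps, here is a minimal single-turn VQA sketch using the Hugging Face transformers integration for LLaVA. The checkpoint ID, image URL, and prompt template follow common llava-hf conventions and are assumptions, not excerpts from the skill's SKILL.md.

```python
# Minimal single-turn LLaVA VQA sketch. Checkpoint ID, image URL,
# and prompt template are assumptions based on llava-hf conventions,
# not taken from this skill.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit consumer GPUs
    device_map="auto",          # let accelerate place the weights
)

# Hypothetical image URL; substitute any RGB image.
url = "https://example.com/street_scene.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 checkpoints expect the USER/ASSISTANT template, with an
# <image> placeholder marking where the vision tokens are inserted.
prompt = "USER: <image>\nWhat is happening in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```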
Rating: 8.1
Installs: 0
Category: AI & LLM
Excellent skill documentation for the LLaVA vision-language model. The description clearly explains the capabilities (visual instruction tuning, image conversations, VQA), making it easy for a CLI agent to understand when to invoke the skill. Task knowledge is comprehensive, with complete code examples for loading models, single- and multi-turn conversations, and common use cases. The structure is logical, with clear sections and a helpful comparison table of alternatives. The skill provides meaningful value by packaging complex vision-language inference (multi-step setup, conversation management, quantization) that an agent would otherwise spend significant tokens implementing from scratch. Minor deductions: novelty is moderate, since some simpler vision tasks could be handled by lighter tools, and the structure could benefit from extracting some code patterns into separate files for very large implementations.
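For reference, the two pieces the review highlights, quantized loading and conversation management, can be sketched roughly as follows. The BitsAndBytesConfig settings, the `chat_turn` helper, and the turn-concatenation template are assumptions based on common LLaVA-1.5 usage, not excerpts from the skill under review.

```python
# Rough sketch of 4-bit quantized loading plus multi-turn chat.
# Settings and the chat_turn helper are assumptions drawn from
# common LLaVA-1.5 usage, not from the skill under review.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # 4-bit storage, fp16 compute
)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

def chat_turn(image, history, user_msg, max_new_tokens=200):
    """Append one USER turn, generate, and return (reply, new_history).

    LLaVA-1.5 keeps no chat state; multi-turn conversation is managed
    by replaying the full USER/ASSISTANT transcript in the prompt, with
    the <image> placeholder appearing only in the first turn.
    """
    prompt = history + f"USER: {user_msg} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.decode(out[0], skip_special_tokens=True)
    reply = text.split("ASSISTANT:")[-1].strip()  # keep only the newest answer
    return reply, prompt + f" {reply} "
```

A first call would pass `history=""` with a user message beginning with the `<image>` placeholder; subsequent calls reuse the returned history so the model sees the whole transcript.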
