LLaVA (Large Language and Vision Assistant). Enables visual instruction tuning and image-based conversations. Combines a CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image-understanding tasks. Best suited to conversational image analysis.
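As a concrete illustration of the visual question answering described above, here is a minimal single-image VQA sketch using the Hugging Face `transformers` port of LLaVA. It assumes the community `llava-hf/llava-1.5-7b-hf` checkpoint, a CUDA GPU, and a hypothetical local image path; the prompt template follows the LLaVA-1.5 convention and would need adjusting for other variants.

```python
# Minimal single-image VQA sketch (assumptions: transformers, torch, and
# Pillow installed; CUDA GPU available; llava-hf/llava-1.5-7b-hf checkpoint).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a single consumer GPU
    device_map="auto",
)

# LLaVA-1.5 prompt template: <image> marks where image tokens are spliced in.
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"
image = Image.open("example.jpg")  # hypothetical local image path

# Cast floating-point inputs (pixel values) to fp16 to match the model weights.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Multi-turn chat works the same way: append each ASSISTANT reply and the next USER turn to the prompt string before generating again.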
Rating: 6.4 · Installs: 0 · Category: AI & LLM
Well-structured skill with comprehensive coverage of LLaVA's vision-language capabilities. Excellent task knowledge, including installation, multiple usage patterns (CLI, Python API, web UI), model variants, and practical examples for VQA, captioning, and multi-turn conversations. Clear structure with logical sections and good decision guidance (when to use it versus alternatives). Strong practical detail on quantization, VRAM requirements, and performance benchmarks. However, novelty is limited: the skill is essentially a wrapper around existing open-source tooling with standard installation and usage patterns, and a CLI agent with internet access could reasonably install and use LLaVA by following its official documentation. The skill adds convenience and consolidation but doesn't provide unique capabilities or complex orchestration that would be difficult for a base agent to replicate. Its best value lies in the curated guidance and ready-to-use code patterns rather than in enabling fundamentally new capabilities.
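The review highlights quantization as a way to manage VRAM requirements; a hedged sketch of a 4-bit load via bitsandbytes might look like the following (assumptions: the `bitsandbytes` package is installed and the same `llava-hf/llava-1.5-7b-hf` checkpoint is used; exact memory savings vary by GPU and input size).

```python
# 4-bit quantized load of LLaVA to reduce VRAM needs (assumptions:
# transformers, torch, and bitsandbytes installed; CUDA GPU available).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16, store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
# Generation then proceeds exactly as with the full-precision model.
```

A 7B model quantized to 4-bit typically fits in well under 8 GB of VRAM, though the precise figure depends on sequence length and image resolution.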