Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.
Rating: 7.6 · Installs: 0 · Category: AI & LLM
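As a rough illustration of the multi-backend workflow the description mentions, here is a minimal Python sketch that builds a run config and hands it to the launcher CLI. The config keys, the `nemo-evaluator-launcher run` entry point, and its `--config` flag are assumptions for illustration only, not the platform's documented interface; consult the skill's SKILL.md for the real schema and commands.

```python
import subprocess
import tempfile

import yaml  # pip install pyyaml

# Hypothetical config shape; the real launcher's schema may differ.
config = {
    "backend": "docker",  # or "slurm" / a cloud target, per the description
    "benchmarks": ["mmlu", "gsm8k", "humaneval"],
    "model": {
        "name": "meta-llama/Llama-3.1-8B-Instruct",
        # Assumed OpenAI-compatible serving endpoint for the model under test.
        "endpoint": "http://localhost:8000/v1",
    },
}

# Write the config to a temp YAML file so the CLI can pick it up.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    yaml.safe_dump(config, f)
    config_path = f.name

# "nemo-evaluator-launcher run --config" is an assumed invocation,
# not a verified command line; check SKILL.md for the actual flags.
subprocess.run(["nemo-evaluator-launcher", "run", "--config", config_path], check=True)
```

Switching `backend` between `docker`, `slurm`, and a cloud target is the gist of the multi-backend claim: the same benchmark selection should run unchanged across execution environments.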
Excellent skill with comprehensive workflows for enterprise LLM evaluation. The description clearly covers multi-backend execution across 100+ benchmarks, and a CLI agent can confidently invoke this skill for evaluation tasks. Task knowledge is outstanding with 4 detailed workflows covering standard benchmarks, HPC deployment, model comparison, and safety/VLM evaluation—complete with config examples, CLI commands, and Python API usage. Structure is very clean with a concise main document and references for advanced topics. Novelty is high: orchestrating containerized evaluation across multiple backends (Docker/Slurm/cloud) with 18+ harnesses would require significant tokens and expertise for a CLI agent to accomplish independently. Minor improvement possible: could slightly expand the description to mention safety/VLM capabilities explicitly for better discoverability.
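The review mentions a model-comparison workflow and Python API usage; the sketch below shows one plausible way to aggregate per-run results into a side-by-side comparison. The `results/<model>/<benchmark>.json` layout and the `accuracy` field are hypothetical stand-ins for whatever output format the harnesses actually emit; adapt the paths and keys to the real artifacts.

```python
import json
from pathlib import Path

# Hypothetical result layout: one JSON file per (model, benchmark) run,
# e.g. results/<model>/<benchmark>.json containing an "accuracy" field.
def collect_scores(results_dir: str = "results") -> dict[str, dict[str, float]]:
    scores: dict[str, dict[str, float]] = {}
    for path in Path(results_dir).glob("*/*.json"):
        model, benchmark = path.parent.name, path.stem
        with path.open() as f:
            scores.setdefault(model, {})[benchmark] = json.load(f)["accuracy"]
    return scores

if __name__ == "__main__":
    # Print a simple side-by-side comparison, one row per model.
    for model, by_bench in sorted(collect_scores().items()):
        row = "  ".join(f"{b}={s:.3f}" for b, s in sorted(by_bench.items()))
        print(f"{model}: {row}")
```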