Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Rating: 8.7
Installs: 0
Category: AI & LLM
Excellent skill for LLM benchmarking with comprehensive coverage. The description clearly conveys when to use this skill (benchmarking, model comparison, academic results). The structure is well organized, with four detailed workflows covering standard evaluation, training-progress tracking, model comparison, and vLLM acceleration. Task knowledge is outstanding, with complete command examples, code snippets, troubleshooting guidance, and hardware specs. References are appropriately delegated to separate files (benchmark-guide.md, custom-tasks.md, etc.), keeping SKILL.md focused. Novelty is strong: running standardized academic benchmarks like MMLU requires significant setup and domain knowledge that would otherwise consume many agent tokens, and this skill packages the industry-standard evaluation workflows used by EleutherAI and HuggingFace. The main improvement area is the 'when to use vs. alternatives' section, which could be slightly expanded; overall, this is a highly practical, well-documented skill that provides clear value over a CLI agent alone.
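
For context on the kind of workflow the skill packages, here is a minimal sketch of a standard evaluation run, assuming the lm-evaluation-harness Python API (`lm_eval.simple_evaluate`); the model name and task list are illustrative only, and exact argument names and result layout may differ by harness version.

```python
# Minimal sketch of a standard evaluation run (assumes lm-evaluation-harness's
# Python API; arguments and result structure may vary by version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # example model, not prescribed by the skill
    tasks=["hellaswag", "gsm8k"],                    # example benchmark subset
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) are reported under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```

A vLLM-accelerated run would presumably swap the backend (e.g., `model="vllm"` with the appropriate `model_args`), which is the pattern the skill's fourth workflow covers.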