Evaluates code generation models on HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks using pass@k metrics. Use when benchmarking code models, comparing coding ability, testing multi-language support, or measuring code generation quality. Built on the industry-standard BigCode Project harness used by HuggingFace leaderboards.
Rating: 8.7
Installs: 0
Category: Machine Learning
Excellent skill documentation for code model evaluation. The description clearly covers the capabilities (15+ benchmarks, pass@k metrics, multi-language support), so CLI agents know when to invoke it. Task knowledge is comprehensive: four detailed workflows cover standard benchmarking, multi-language evaluation, instruction-tuned models, and model comparison, each with complete commands and parameters. The structure is well organized, with a quick start, workflows with checklists, troubleshooting, and reference tables. The skill provides meaningful value by orchestrating complex evaluation pipelines (Docker isolation, pass@k sampling strategies, multi-language execution environments, proper instruction formatting) that would otherwise require hundreds of tokens and deep expertise for a CLI agent to replicate. Minor deduction: novelty is moderate, since the skill primarily wraps an existing harness rather than introducing new methodology, though the workflow orchestration and best practices add significant practical value.
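For context on the pass@k numbers these workflows report: the standard metric is the unbiased estimator from the Codex paper (Chen et al., 2021), where n samples are generated per problem, c of them pass all unit tests, and pass@k estimates the probability that at least one of k drawn samples passes. A minimal sketch of that estimator, independent of the skill's own code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: samples that passed all unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Every possible draw of k samples contains at least one passing sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 42 correct -> pass@1, pass@10, pass@100
print([round(pass_at_k(200, 42, k), 3) for k in (1, 10, 100)])
```

pass@1 reduces to the raw fraction c/n, while larger k rewards drawing many diverse samples per problem, which is why the workflows vary the sample count and sampling temperature.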