GPU-accelerated data curation for LLM training with NeMo Curator. Supports text, image, video, and audio. Features fuzzy deduplication (16× faster than CPU-based approaches), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it to prepare high-quality training datasets, clean web data, or deduplicate large corpora.
Rating: 8.1
Installs: 0
Category: AI & LLM
Excellent skill documentation for GPU-accelerated LLM data curation. The description clearly communicates when to use NeMo Curator versus alternatives, and SKILL.md provides comprehensive code examples covering all major operations (filtering, deduplication, PII redaction, multi-modal processing). Task knowledge is strong, with concrete pipelines, performance benchmarks, and real-world cost comparisons (89% savings demonstrated). Structure is good, progressing logically from basics to advanced patterns, though the single file is somewhat lengthy and some content could be split into modules. Novelty is solid: the skill addresses a computationally expensive problem (data curation at TB scale) where GPU acceleration provides 10-16× speedups, meaningfully reducing both time and cost compared to CPU approaches or manual CLI operations. A minor improvement would be a clearer separation of quickstart and advanced patterns.
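To make the filtering and fuzzy deduplication operations named above concrete, here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's API: every function name, threshold, and heuristic below is a hypothetical stand-in chosen for illustration, and the pairwise comparison is far simpler than a GPU pipeline would use at scale.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; larger values give tighter estimates


def passes_quality(text, min_words=5, max_symbol_ratio=0.3):
    """Two illustrative heuristics: minimum word count and maximum symbol ratio."""
    words = re.findall(r"\w+", text)
    if len(words) < min_words:
        return False
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return symbols / len(text) <= max_symbol_ratio


def word_shingles(text, k=3):
    """Set of k-word shingles over the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """MinHash signature: the minimum hash of the shingle set under each salted hash."""
    shingle_set = word_shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            ) for s in shingle_set),
            default=0,
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def curate(docs, threshold=0.8):
    """Drop low-quality documents, then keep the first of each near-duplicate group."""
    kept, kept_sigs = [], []
    for doc in docs:
        if not passes_quality(doc):
            continue
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "short",
    "GPU clusters process terabytes of crawled web text every single day",
]
cleaned = curate(docs)
```

The all-pairs signature comparison here is quadratic in the number of kept documents. Production fuzzy deduplication typically adds locality-sensitive hashing so that only candidate pairs are compared, which is the part that benefits most from GPU parallelism.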
