At ACS Future School we fed Curiosity AI roughly 12.5 GB of internal materials across 1000+ PDFs and DOCX files, then ran evaluations on curated QA sets. On one benchmark built from ~16k question–answer pairs mined from our own data, Curiosity edged out ChatGPT by about 10% overall — with even larger gaps on some subject slices.
That sounds like a victory, but the honest takeaway is narrower: when you control retrieval, training data, and evaluation scope, a specialized stack can beat a general chat API on your own test. It does not mean universal superiority.
What made the comparison meaningful was discipline: the same prompts for both systems, a scoring rule frozen before the run, and a test set large enough to separate real differences from noise. If you're building domain assistants, invest in evaluation harnesses early, not just demos.
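To make that concrete, here is a minimal sketch of the kind of harness this implies: one fixed prompt template, one scoring rule decided up front, and both systems run over the same QA set. The file name `qa_eval.jsonl`, the `exact_match` metric, and the `curiosity_answer`/`chatgpt_answer` stubs are illustrative assumptions, not the exact setup we used.

```python
import json
from typing import Callable, Dict, List


def load_qa(path: str) -> List[Dict[str, str]]:
    """Load a JSONL QA set where each line has 'question' and 'answer' fields."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def exact_match(predicted: str, reference: str) -> float:
    """Frozen scoring rule: case-insensitive exact match.
    Swap in your own metric, but fix it before looking at any model output."""
    return float(predicted.strip().lower() == reference.strip().lower())


def evaluate(answer_fn: Callable[[str], str],
             qa_set: List[Dict[str, str]],
             prompt_template: str) -> float:
    """Score one system over the whole QA set with one fixed prompt template."""
    scores = [
        exact_match(answer_fn(prompt_template.format(question=item["question"])),
                    item["answer"])
        for item in qa_set
    ]
    return sum(scores) / len(scores)


def curiosity_answer(prompt: str) -> str:
    """Placeholder for the domain-tuned system; wire in the real call here."""
    return ""


def chatgpt_answer(prompt: str) -> str:
    """Placeholder for the general chat API; wire in the real call here."""
    return ""


if __name__ == "__main__":
    PROMPT = "Answer concisely.\nQ: {question}\nA:"  # identical prompt for both systems
    qa_set = load_qa("qa_eval.jsonl")                # hypothetical held-out QA file

    for name, answer_fn in [("curiosity", curiosity_answer),
                            ("chatgpt", chatgpt_answer)]:
        print(f"{name}: {evaluate(answer_fn, qa_set, PROMPT):.3f}")
```

The point of the sketch is the discipline, not the metric: both systems see identical prompts, the grader never changes mid-run, and adding a new model is just another entry in the loop rather than a new ad-hoc comparison.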