Autonomous Question Generation from Unstructured Sources

Structured item types, validation, and why slapping an LLM on a PDF is not a product.

  • llm
  • education
  • pydantic
  • rag
  • assessment

For a stretch of time I focused on assessment tooling: authors were writing items by hand, while the source material—docs, slides, notes—was large and always moving. We needed a pipeline that could read unstructured input and produce many kinds of assessment items, not only paragraphs of text.

I was fortunate to work with subject-matter experts who were patient while we got the mechanics wrong a few times. What finally worked for us was treating generation as structured output with validation, not as a single open-ended completion.

Why schemas came first

A model that returns free-form prose is difficult to bank automatically. For our use case we needed concrete shapes: multiple choice with keyed answers, ordering tasks, numeric tolerances, gap-fill with distinct keys, and more—dozens of distinct shapes in our catalog. We used Pydantic models (JSON Schema would have been similar), rejected output that did not validate, and then asked for a targeted repair with a tight prompt rather than simply retrying at a different temperature.

from typing import Callable

from pydantic import BaseModel, NonNegativeInt, ValidationError

class MultipleChoiceItem(BaseModel):
    stem: str
    choices: list[str]
    correct_index: NonNegativeInt
    explanation: str

def validate_or_repair(raw: dict, repair_fn: Callable[[dict], dict]) -> MultipleChoiceItem:
    try:
        return MultipleChoiceItem.model_validate(raw)
    except ValidationError:
        # One repair pass; if this also fails, let the error surface for inspection.
        return MultipleChoiceItem.model_validate(repair_fn(raw))

The repair path was not perfect, but it was inspectable. When something failed, we could see whether the model misunderstood the stem, the choices, or the indexing rule.
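The indexing rule is a good example of a constraint worth pushing into the schema itself, so that a keyed answer pointing outside the choice list fails validation rather than reaching review. A minimal sketch of how that might look with a Pydantic cross-field validator (the validator name here is mine, not from our production models):

```python
from pydantic import BaseModel, NonNegativeInt, ValidationError, model_validator

class MultipleChoiceItem(BaseModel):
    stem: str
    choices: list[str]
    correct_index: NonNegativeInt
    explanation: str

    @model_validator(mode="after")
    def check_index_in_range(self):
        # Reject items whose key points outside the choice list, so the
        # failure is caught at validation time rather than at review time.
        if self.correct_index >= len(self.choices):
            raise ValueError("correct_index out of range for choices")
        return self
```

A rejected item then produces a `ValidationError` naming the exact rule that was violated, which is what made triage tractable for us.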

Grounding in the corpus

Early experiments that skipped retrieval produced confident items about content that was not in the pack we meant to teach. I learned to chunk, embed, and retrieve relevant passages before asking for a question. It did not eliminate errors, but it aligned the system with the documents we had permission to use.
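The retrieval step has a simple shape regardless of the embedding model: score each chunk against the query, keep the top few, and only then prompt for a question. A toy sketch using bag-of-words cosine similarity as a stand-in for a real embedding model (the function names are illustrative, not from our pipeline):

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts. A real pipeline would use
    # a sentence-embedding model, but the retrieval shape is the same.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k; these
    # passages are what the generation prompt is grounded in.
    q = bow_vector(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow_vector(c)), reverse=True)
    return ranked[:k]
```

Swapping the toy vectors for real embeddings changes the scores but not the structure: the generator only ever sees passages that actually came from the corpus.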

Impact on the team’s workflow

Once validation, retrieval, and a few quality checks were in place, the human role shifted for us: less time staring at a blank template, more time on spot review, bias checks, and fit with the syllabus. We measured authoring time before and after in a narrow internal study, and the reduction in mechanical drafting was large enough that subject leads could spend their hours on pedagogy instead of formatting. I mention the numbers cautiously—your corpus and governance will differ—but the direction was consistent.

What I would tell my past self

Spend energy on the eval harness and schema before chasing the cleverest prompt. Generation behaves more like compiler output than like chat: if the language definition is fuzzy, tuning adjectives will not save you.
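In that spirit, the eval harness does not need to be elaborate to be useful: a list of named predicates run over every generated item, returning which item failed which check. A minimal sketch, with check names and thresholds that are illustrative rather than ours:

```python
def check_choice_count(item: dict) -> bool:
    # Hypothetical policy: multiple-choice items carry 3 to 5 options.
    return 3 <= len(item.get("choices", [])) <= 5

def check_key_in_range(item: dict) -> bool:
    return 0 <= item.get("correct_index", -1) < len(item.get("choices", []))

def check_stem_nonempty(item: dict) -> bool:
    return bool(item.get("stem", "").strip())

CHECKS = [check_choice_count, check_key_in_range, check_stem_nonempty]

def run_harness(items: list[dict]) -> list[tuple[int, str]]:
    # Return (item_index, failed_check_name) pairs so a reviewer can see
    # exactly which rule each item broke.
    failures = []
    for i, item in enumerate(items):
        for check in CHECKS:
            if not check(item):
                failures.append((i, check.__name__))
    return failures
```

Because each failure names its check, prompt changes can be judged by which rules start or stop failing rather than by eyeballing output.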

I am sharing this in the spirit of lessons learned on the job, not as a universal recipe. If you are building something similar, I hope the emphasis on types, retrieval, and humane review resonates.