Building LetsGit.IT with an LLM: Quiz Fairness
How we fixed a surprisingly common quiz bug: users could guess the correct answer just by picking the longest option.
Summary
A practical build story: we removed “longest answer wins” bias from the LetsGit.IT quiz without sacrificing full explanations outside the quiz.
Key takeaways
- Measure bias before changing UX.
- Prefer quiz-specific answer variants over trimming.
- Make distractors plausible by removing “absolute” cues.
- Use lightweight generation (trade-off flips) to add subtle wrong options.
- Ship in clean slices with the commit-work skill.
This is the first post in the series about building LetsGit.IT with LLM assistance. The focus here isn’t “AI magic”, but the boring, reliable parts of product engineering: measuring what’s wrong, changing the data model (not just the UI), and shipping in a way that keeps the codebase maintainable.
This chapter is based on the 2025-12-29 session notes (sessions/articles/2025-12-29.md). If you prefer Polish, read the PL version here: /pl/blog/tworzenie-letsgit-it-z-llm-uczciwy-quiz.
The problem: the quiz was solvable by length
LetsGit.IT has a multiple-choice quiz mode: one correct answer and (usually) two distractors. In theory, you win by understanding the topic. In practice, the dataset created a strong bias:
- the correct answer was almost always the longest option
- wrong answers were short, vague, or “obviously wrong” (often with absolute wording)
That’s not a subtle UX issue. It’s a validity issue: the quiz stops measuring knowledge and starts measuring pattern recognition.
Step 1: measure it (don’t argue about it)
Before touching UI or copy, we ran a simple audit over the seed batches (recruitment-assistant/prisma/data/migrations/*.json), checking per question whether the correct answer is strictly the longest option among the quiz candidates.
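A minimal sketch of that audit, as a Bun/TypeScript script; the JSON shape and field names such as answer and wrongAnswers are assumptions about the seed format, not the exact schema:

```ts
// audit-longest-answer.ts — how often is the correct answer strictly the longest option?
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Assumed seed shape; the real migration schema may differ.
type SeedQuestion = {
  slug: string;
  answer: string;         // full correct answer
  wrongAnswers: string[]; // distractors used as quiz candidates
};

const dir = "recruitment-assistant/prisma/data/migrations";
let total = 0;
let longestWins = 0;

for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const questions: SeedQuestion[] = JSON.parse(readFileSync(join(dir, file), "utf8"));
  for (const q of questions) {
    total++;
    // "Strictly longest": every distractor is shorter than the correct answer.
    if (q.wrongAnswers.every((w) => w.length < q.answer.length)) longestWins++;
  }
}

console.log(`${longestWins}/${total} questions are solvable by picking the longest option`);
```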
The headline result from that audit was blunt: 100% of questions across categories (EN/PL) had the correct answer as the longest in the generated quiz choices.
The key lesson is not the specific number. It’s the workflow:
- pick a simple metric that reflects user behavior (“does length predict correctness?”)
- make it cheap to recompute
- use it as a guardrail for future changes
Step 2: the tempting quick fix — and why we rolled it back
The first remediation attempt was a classic “ship now” move: normalize option height and clamp/trim quiz answers in the quiz UI. The idea was simple: if users can’t see the length difference, the bias disappears.
That approach did reduce the visual cue, but it introduced new problems:
- trimming can remove the clause that makes an answer correct
- you end up “hiding meaning” instead of improving content quality
- it conflicts with the product requirement: full answers must remain available outside the quiz (lists, detail pages)
So we rolled it back and switched to a cleaner model.
Step 3: quiz-specific answer variants (short answers)
The better fix is to admit what the quiz needs: short, quiz-optimized answer variants, while keeping full answers intact everywhere else.
We implemented that with a small, explicit layer:
- `recruitment-assistant/src/data/quiz-answers.ts` contains per-question overrides (EN + PL)
- `recruitment-assistant/src/lib/questions.ts` resolves the quiz answer variant at runtime
- the quiz UI consumes the “quiz answer” field; list/detail views still display the full answer
Conceptually:
```ts
type QuizAnswerVariant = {
  questionSlug: string;
  en: { quizAnswer: string };
  pl: { quizAnswer: string };
};
```
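For concreteness, the runtime resolution might look roughly like this, reusing the QuizAnswerVariant type above; the example data, slug, and function name are illustrative, not the actual code in recruitment-assistant/src/lib/questions.ts:

```ts
// Sketch: prefer the quiz-specific short answer when an override exists.
type Locale = "en" | "pl";

const quizAnswerVariants: QuizAnswerVariant[] = [
  {
    questionSlug: "cap-theorem", // illustrative slug, not a real entry
    en: { quizAnswer: "You can only fully get two of consistency, availability, and partition tolerance." },
    pl: { quizAnswer: "W pełni da się mieć tylko dwie z trzech cech: spójność, dostępność, odporność na partycje." },
  },
];

export function resolveQuizAnswer(questionSlug: string, fullAnswer: string, locale: Locale): string {
  const override = quizAnswerVariants.find((v) => v.questionSlug === questionSlug);
  // Fall back to the full answer so questions without an override still render.
  return override?.[locale].quizAnswer ?? fullAnswer;
}
```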
This is intentionally boring:
- it’s explicit (no heuristic trimming at render time)
- it’s testable (a question either has an override or not)
- it keeps the product promise: “short answers in quiz, full answers in detail”
This is also where an LLM is actually useful: generating first drafts of short answers that you can then refine. The important constraint: the override must preserve correctness, not just be shorter.
Step 4: make distractors plausible (without rewriting everything by hand)
Once “correct answer length” stopped being a tell, the next weakness became obvious: the wrong answers weren’t competitive. Two patterns stood out:
- absolute cues (“always”, “never”, “impossible”) that are rarely true in engineering
- distractors that were either too short or semantically unrelated
We ran an automated cleanup pass over wrong answers and rewrote ~2.4k entries to remove the most obvious tells while preserving the intended “wrongness”.
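A sketch of the detection half of that pass; the cue list and data shape are illustrative, and flagged entries still went through review rather than blind replacement:

```ts
// flag-absolute-cues.ts — find wrong answers that give themselves away.
const ABSOLUTE_CUES = [/\balways\b/i, /\bnever\b/i, /\bimpossible\b/i];

type Distractor = { questionSlug: string; text: string };

export function flagObviousDistractors(distractors: Distractor[]): Distractor[] {
  return distractors.filter((d) => ABSOLUTE_CUES.some((cue) => cue.test(d.text)));
}

// Example: the first entry would be flagged for a softer, "close to true" rewrite.
console.log(
  flagObviousDistractors([
    { questionSlug: "http-caching", text: "Caching never helps with dynamic content." },
    { questionSlug: "http-caching", text: "Caching mainly helps when responses rarely change." },
  ])
);
```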
This isn’t about making the quiz mean. It’s about making it useful. A good distractor should be “close to true”, forcing the learner to recall the detail that matters.
“Exclude list” was generated… then postponed
We also generated a list of “exclude candidates”: questions that are too broad for a multiple-choice quiz, flagged mostly by answer length and scope. For now we decided not to exclude anything, because the generated list was too aggressive.
The bigger point: automation can propose a list, but product decisions still need a human pass.
Trade-off flips: cheap, believable wrong options
For some answers, you can generate a plausible wrong option by flipping a trade-off. Examples:
- “higher consistency, higher latency” → “higher consistency, lower latency”
- “fewer network hops, less flexibility” → “fewer network hops, more flexibility”
We added a lightweight runtime step in the quiz option generator: when the correct answer contains an explicit trade-off pattern, generate an “inverted” variant as an extra candidate distractor.
Here’s the simplified flow, sketched below; the flip word list and the “flip only the final clause” rule are illustrative choices, not the exact production generator:
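```ts
// Sketch: flip an explicit trade-off to mint one extra candidate distractor.
const FLIPS: Record<string, string> = {
  higher: "lower",
  lower: "higher",
  more: "less",
  less: "more",
  faster: "slower",
  slower: "faster",
};

const FLIP_PATTERN = /\b(higher|lower|more|less|faster|slower)\b/gi;

export function flipTradeOff(correctAnswer: string): string | null {
  // Flip only the final clause, so "higher consistency, higher latency"
  // becomes "higher consistency, lower latency" instead of inverting everything.
  const clauses = correctAnswer.split(",");
  const last = clauses[clauses.length - 1];
  const flippedLast = last.replace(FLIP_PATTERN, (word) => {
    const replacement = FLIPS[word.toLowerCase()];
    // Preserve capitalization of the original word.
    return word[0] === word[0].toUpperCase()
      ? replacement[0].toUpperCase() + replacement.slice(1)
      : replacement;
  });
  if (flippedLast === last) return null; // no explicit trade-off wording found
  return [...clauses.slice(0, -1), flippedLast].join(",");
}

// flipTradeOff("higher consistency, higher latency")
//   -> "higher consistency, lower latency"
```

When the flip produces a new string, the option generator simply adds it as one more candidate distractor; when it returns null, nothing changes.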
This is not “LLM generation in production”. It’s deterministic string-level augmentation that improves plausibility when the data already contains a trade-off.
Step 5: ship it cleanly (this is where the commit-work skill matters)
Session notes often stop at “we implemented it”. But the last 10% matters:
- make sure the change is reviewable later
- make sure you can bisect issues
- make sure you can revert safely
In Codex CLI, that’s where the commit-work skill fits. The skill is essentially a structured checklist + helper:
- review what changed (and what should not have changed)
- propose a logical split into commits (code vs docs, refactor vs content)
- produce a Conventional Commit message that matches the intent
In this case, the repo history reflects that shape:
```
feat(quiz): improve distractors and quiz answers
docs: update session log
```
A practical “LLM-assisted but still responsible” shipping loop looks like this:
```sh
cd recruitment-assistant
bun lint
bun run test
```
Then:
- stage only the intended files
- run the commit skill to draft a clean message and sanity-check the diff
The meta-lesson: LLMs help, but only if your workflow still enforces boundaries.
What this taught us about building LetsGit.IT with an LLM
- The dataset is part of the product. If it’s biased, the UX will be biased.
- UI tricks are not a substitute for content quality. Trimming “fixes” the symptom, not the underlying bias.
- Prefer explicit contracts over heuristics. `quizAnswer` is a contract; “trim in UI” is a hack.
- Automate passes, then review the edge cases. A bulk rewrite gets you 80%; humans finish the last 20%.
- Treat shipping as a first-class step. Clean commits make future LLM work cheaper (context is clearer).
Next in the series
In the next posts, we’ll go deeper on:
- manual per-category tuning (starting with categories like /en/category/data-structures)
- measuring quiz difficulty and “distractor quality” over time
- keeping EN/PL content aligned without turning content editing into a chore
If you’re building your own interview-prep tool, you can reuse the pattern even if your stack is different: measure bias → design a content contract → augment distractors deterministically → ship in reviewable slices.