The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes caused the mis-pairing:
1) classifier blind to subject pronouns ("yo", "I", etc.)
2) OCR reading the right column before the left on 2-col tables
3) Y-cluster drift on multi-line cells in 4-col layouts
Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
- pair_table      (extract all Spanish↔English pairs)
- reference_only  (Spanish-only conjugation tables: no pairs; UI shows
                   the flat OCR lines as a reference list instead)
- hybrid          (some header pairs + reference content beneath; only
                   the genuine pairs become cards)
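A minimal sketch of how the three classifications could map to cards. The
entry schema (`classification`, `pairs`, `es`, `en`) and the helper name
`cards_for_image` are assumptions for illustration; the real layout of
paired_vocab_llm.json is not shown in this commit.

```python
from typing import Dict, List, Tuple


def cards_for_image(entry: Dict) -> List[Tuple[str, str]]:
    """Return (spanish, english) card pairs for one image entry.

    Hypothetical schema:
        {"classification": str, "pairs": [{"es": str, "en": str}, ...]}
    """
    kind = entry.get("classification")
    if kind == "reference_only":
        # Spanish-only conjugation tables produce no cards; the UI would
        # show the flat OCR lines as a reference list instead.
        return []
    if kind in ("pair_table", "hybrid"):
        # Assumption: for hybrid images the entry already lists only the
        # genuine header pairs, so both kinds are read the same way.
        return [(p["es"], p["en"]) for p in entry.get("pairs", [])]
    raise ValueError(f"unknown classification: {kind!r}")
```

Raising on an unknown classification is a design choice of this sketch: it
surfaces a malformed entry instead of silently dropping cards.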
merge_pdf_into_book.py now picks pair source by priority:
llm-vision → bounding-box OCR → block-alternation heuristic.
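The priority order above can be sketched as a first-match fallback chain.
This is an illustration, not the code in merge_pdf_into_book.py: the name
`pick_pairs`, the lookup signature, and the rule that an empty list counts
as an authoritative answer (so a reference_only image does not fall through
to OCR) are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]
# A source is a (name, lookup) tuple. The lookup returns None when the
# source has no entry for the image, or a (possibly empty) list of pairs.
Source = Tuple[str, Callable[[str], Optional[List[Pair]]]]


def pick_pairs(image_id: str,
               sources: Sequence[Source]) -> Tuple[Optional[str], List[Pair]]:
    """Return (source_name, pairs) from the first source that knows the image."""
    for name, lookup in sources:
        pairs = lookup(image_id)
        if pairs is not None:
            # Assumption: an empty list means "no pairs on this image",
            # so lower-priority sources are not consulted.
            return name, pairs
    return None, []
```

With dict-backed sources, `dict.get` already has the right shape, e.g.
`pick_pairs("img1", [("llm-vision", llm.get), ("bbox-ocr", bbox.get)])`.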
Numbers (across the whole book):
- mis-oriented tables: 114 → 5
- quarantined cards: 250 → 2
- extracted pairs: 2832 → 4569
textbookDataVersion bumped to 13. Per-batch agent outputs are gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/; only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pbxproj references textbook_data.json and textbook_vocab.json as Copy
Bundle Resources, so xcodebuild fails if they're missing. Committing the
generated output keeps the repo self-sufficient — regenerate via
Conjuga/Scripts/textbook/run_pipeline.sh when content changes.
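A minimal sketch of a pre-build check for the two bundled files. The
function name `missing_resources` and the `resource_dir` argument are
hypothetical; the directory the pbxproj actually points at is not stated
here, so the caller has to supply it.

```python
from pathlib import Path
from typing import List

# File names taken from the commit message above.
REQUIRED = ("textbook_data.json", "textbook_vocab.json")


def missing_resources(resource_dir: str) -> List[str]:
    """Return the required file names that are absent from resource_dir."""
    root = Path(resource_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

An empty result means both files are present; a non-empty result is a cue
to rerun Conjuga/Scripts/textbook/run_pipeline.sh before building.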
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>