Fixes #32 — LLM vision pass for vocab pairs, fixes scrambled English/Spanish · f368c24ad6 - Spanish

The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes were collapsing the data:
  1) classifier blind to subject pronouns ("yo", "I", etc.)
  2) right-then-left OCR reads on 2-col tables
  3) Y-cluster drift on multi-line cells in 4-col layouts

Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
  - pair_table       (extract all Spanish↔English pairs)
  - reference_only   (Spanish-only conjugation tables — no pairs, UI shows
                      the flat OCR lines as a reference list instead)
  - hybrid           (some header pairs + reference content beneath; only
                      the genuine pairs become cards)
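
A rough sketch of how one image's entry might be modelled. The three
classification names come from the list above; the class name, field names,
and cards_for helper are hypothetical and not taken from the actual scripts.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Tuple

# The three labels listed in the commit message.
Classification = Literal["pair_table", "reference_only", "hybrid"]


@dataclass
class VocabImageResult:
    """Hypothetical shape for one image's entry in the vision output."""
    classification: Classification
    # (spanish, english) pairs; expected to be empty for reference_only.
    pairs: List[Tuple[str, str]] = field(default_factory=list)


def cards_for(result: VocabImageResult) -> List[Tuple[str, str]]:
    """Return the pairs that should become cards for this image."""
    if result.classification == "reference_only":
        # Spanish-only conjugation tables never produce cards.
        return []
    # pair_table and hybrid both contribute whatever genuine pairs were found.
    return list(result.pairs)
```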

merge_pdf_into_book.py now picks pair source by priority:
  llm-vision → bounding-box OCR → block-alternation heuristic.
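
The priority order can be sketched roughly as below. This is not the code in
merge_pdf_into_book.py: the function name, source keys, and dict layout are
assumptions. The sketch also assumes that an image present in a
higher-priority source wins even with zero pairs, so that a reference_only
result is not overridden by stale OCR pairs.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (spanish, english)

# Highest priority first, mirroring the order in the commit message.
# The key names themselves are hypothetical.
SOURCE_PRIORITY = ("llm_vision", "bbox_ocr", "block_alternation")


def pick_pair_source(
    image_key: str,
    sources: Dict[str, Dict[str, List[Pair]]],
) -> Tuple[Optional[str], List[Pair]]:
    """Return (source_name, pairs) from the first source that knows the image."""
    for name in SOURCE_PRIORITY:
        by_image = sources.get(name, {})
        # Presence of the key decides, not whether the pair list is non-empty.
        if image_key in by_image:
            return name, by_image[image_key]
    return None, []
```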

Numbers (across the whole book):
  - mis-oriented tables: 114 → 5
  - quarantined cards:   250 → 2
  - extracted pairs:     2832 → 4569

textbookDataVersion bumped to 13. Per-batch agent outputs under
Conjuga/Scripts/textbook/paired_vocab_llm/ are gitignored; only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trey T

2026-05-03 18:48:04 -05:00

parent 90aea92fba

commit f368c24ad6

5 changed files with 21072 additions and 9446 deletions

.gitignore

@@ -50,6 +50,7 @@ epub_extract/
 # Scripts are committed; their generated outputs are not.
 Conjuga/Scripts/textbook/*.json
 Conjuga/Scripts/textbook/review.html
+Conjuga/Scripts/textbook/paired_vocab_llm/
 # Note: the app-bundle copies (Conjuga/Conjuga/textbook_{data,vocab}.json)
 # ARE committed so `xcodebuild` works on a fresh clone without first running
 # the pipeline. They're regenerated from the scripts when content changes.