Render textbook vocab as paired Spanish→English grid

Previously the chapter reader showed vocab tables as a flat list of OCR
lines. Because Vision reads columns top-to-bottom, the Spanish column
appeared as one block followed by the English column, making pairings
illegible.

Now every vocab table renders as a 2-column grid with Spanish on the
left and English on the right. Supporting changes:

- New ocr_all_vocab.swift: runs bounding-box OCR over all 931 vocab
  images, clusters lines into rows by Y-coordinate, splits rows at the
  largest X-gap, and detects 2- / 3- / 4-column layouts automatically.
  ~2800 pairs extracted this pass vs ~1100 from the old
  block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present,
  falls back to the heuristic otherwise, and embeds the resulting pairs
  as vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed: a pair is quarantined only when
  both sides are clearly the same language or the pair is clearly
  reversed (English front, Spanish back). "unknown" sides stay, since
  the bbox pipeline already oriented them correctly.
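
For illustration only, the row clustering and largest-X-gap split listed
for ocr_all_vocab.swift could look roughly like the Python sketch below.
It is not the Swift code from this commit. The Line type, the y_tol
threshold, and all function names are hypothetical, boxes are assumed to
be normalized with y growing downward, and only the 2-column case is
covered.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One OCR line with a hypothetical, normalized bounding box."""
    text: str
    x_min: float  # left edge
    x_max: float  # right edge
    y_mid: float  # vertical centre, assumed to grow downward


def cluster_rows(lines, y_tol=0.01):
    """Group lines whose vertical centre is within y_tol of the previous line."""
    rows = []
    for line in sorted(lines, key=lambda l: l.y_mid):
        if rows and abs(line.y_mid - rows[-1][-1].y_mid) <= y_tol:
            rows[-1].append(line)
        else:
            rows.append([line])
    return rows


def split_row(row):
    """Split one row into (left, right) text at its widest horizontal gap."""
    ordered = sorted(row, key=lambda l: l.x_min)
    if len(ordered) < 2:
        return None  # a single fragment cannot form a pair
    gaps = [ordered[i + 1].x_min - ordered[i].x_max
            for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    left = " ".join(l.text for l in ordered[:cut])
    right = " ".join(l.text for l in ordered[cut:])
    return left, right


def extract_pairs(lines, y_tol=0.01):
    """Return (left, right) pairs for every row that has at least two fragments."""
    pairs = []
    for row in cluster_rows(lines, y_tol):
        pair = split_row(row)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

Splitting at the widest gap rather than at a fixed X position keeps the
sketch independent of where the column gutter sits in a given image.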

Textbook card count jumps from 1044 to 3118 active pairs.
textbookDataVersion bumped to 9.
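
As a rough sketch of the relaxed quarantine decision in fix_vocab.py,
the rule reduces to a small predicate over the two language labels. The
helper names classify and should_quarantine are hypothetical, and the
scores are taken as plain numbers because the scoring function itself
is not reproduced here.

```python
def classify(es_score, en_score):
    """Label a side "es", "en", or "unknown" when the two scores tie."""
    if es_score > en_score:
        return "es"
    if en_score > es_score:
        return "en"
    return "unknown"


def should_quarantine(front_lang, back_lang):
    """Quarantine only clear mis-pairs; any "unknown" side is kept."""
    both_same = front_lang == back_lang and front_lang in ("es", "en")
    reversed_pair = front_lang == "en" and back_lang == "es"
    return both_same or reversed_pair
```

Under this predicate a card with front "unknown" and back "en" is kept,
whereas the previous rule (front must be "es" and back must be "en")
would have quarantined it.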

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey T
2026-04-19 15:58:41 -05:00
parent cd491bd695
commit 5f90a01314
9 changed files with 17619 additions and 1148 deletions


@@ -173,14 +173,17 @@ def main() -> None:
     kept_cards.append(card)
     continue
-# Quarantine obvious mis-pairs: both sides same language OR language mismatch
+# Quarantine only clear mis-pairs: both sides EXPLICITLY the wrong
+# language (both Spanish or both English). "unknown" sides stay —
+# the bounding-box pipeline already handled orientation correctly
+# and many valid pairs lack the article/accent markers we classify on.
 fes, fen = language_score(card["front"])
 bes, ben = language_score(card["back"])
 front_lang = "es" if fes > fen else ("en" if fen > fes else "unknown")
 back_lang = "es" if bes > ben else ("en" if ben > bes else "unknown")
-# A good card has front=es, back=en. Anything else when the card is
-# flagged is almost always a column-pairing error.
-if front_lang != "es" or back_lang != "en":
+bothSameLang = (front_lang == "es" and back_lang == "es") or (front_lang == "en" and back_lang == "en")
+reversed_pair = front_lang == "en" and back_lang == "es"
+if bothSameLang or reversed_pair:
     quarantined_cards.append({
         "chapter": ch["chapter"],
         "front": card["front"],