Render textbook vocab as paired Spanish→English grid
Previously the chapter reader showed vocab tables as a flat list of OCR lines. Because Vision reads columns top-to-bottom, the Spanish column appeared as one block followed by the English column, making pairings illegible. Now every vocab table renders as a 2-column grid with Spanish on the left and English on the right.

Supporting changes:

- New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images. It clusters lines into rows by Y-coordinate, splits rows at the largest X-gap, and detects 2-, 3-, and 4-column layouts automatically. ~2800 pairs extracted this pass vs ~1100 from the old block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present, falls back to the heuristic otherwise, and embeds the resulting pairs as vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed: only mis-pairs where both sides are clearly the same language, or where the pair is clearly reversed (English front, Spanish back), are removed. "unknown" sides stay (the bbox pipeline already oriented them correctly).

Textbook card count jumps from 1044 to 3118 active pairs. textbookDataVersion bumped to 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
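The row-clustering and gap-splitting steps described for ocr_all_vocab.swift can be sketched roughly as follows. This is an illustrative Python sketch, not the Swift code from the commit: the `Box` shape, the `y_tol` threshold, and both function names are assumptions, and it assumes a y-axis that increases downward and only handles the 2-column case.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One OCR line with its bounding box (hypothetical shape, not Vision's API)."""
    text: str
    x: float      # left edge
    y: float      # vertical centre, assumed to increase downward
    width: float


def cluster_rows(boxes, y_tol=0.01):
    """Group boxes whose vertical centre is within y_tol of the current row's mean."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1]["mean_y"]) <= y_tol:
            row = rows[-1]
            row["boxes"].append(box)
            row["mean_y"] = sum(b.y for b in row["boxes"]) / len(row["boxes"])
        else:
            rows.append({"mean_y": box.y, "boxes": [box]})
    # Within each row, order the boxes left to right.
    return [sorted(r["boxes"], key=lambda b: b.x) for r in rows]


def split_by_largest_gap(row):
    """Split a left-to-right row into (left_text, right_text) at the widest horizontal gap."""
    if len(row) < 2:
        return None  # a single box cannot form a pair
    gaps = [row[i + 1].x - (row[i].x + row[i].width) for i in range(len(row) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__) + 1
    left = " ".join(b.text for b in row[:cut])
    right = " ".join(b.text for b in row[cut:])
    return left, right


boxes = [
    Box("la casa", 0.10, 0.200, 0.15),
    Box("the house", 0.60, 0.205, 0.20),
    Box("el", 0.10, 0.300, 0.03),
    Box("perro", 0.14, 0.301, 0.08),
    Box("the dog", 0.60, 0.298, 0.15),
]
pairs = [split_by_largest_gap(row) for row in cluster_rows(boxes)]
```

Splitting at the widest gap rather than at a fixed X position keeps multi-word entries such as "el perro" together, because the space between words is much smaller than the gutter between columns.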
@@ -173,14 +173,17 @@ def main() -> None:
                 kept_cards.append(card)
                 continue
 
-            # Quarantine obvious mis-pairs: both sides same language OR language mismatch
+            # Quarantine only clear mis-pairs: both sides EXPLICITLY the wrong
+            # language (both Spanish or both English). "unknown" sides stay —
+            # the bounding-box pipeline already handled orientation correctly
+            # and many valid pairs lack the article/accent markers we classify on.
             fes, fen = language_score(card["front"])
             bes, ben = language_score(card["back"])
             front_lang = "es" if fes > fen else ("en" if fen > fes else "unknown")
             back_lang = "es" if bes > ben else ("en" if ben > bes else "unknown")
-            # A good card has front=es, back=en. Anything else when the card is
-            # flagged is almost always a column-pairing error.
-            if front_lang != "es" or back_lang != "en":
+            bothSameLang = (front_lang == "es" and back_lang == "es") or (front_lang == "en" and back_lang == "en")
+            reversed_pair = front_lang == "en" and back_lang == "es"
+            if bothSameLang or reversed_pair:
                 quarantined_cards.append({
                     "chapter": ch["chapter"],
                     "front": card["front"],
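The relaxed rule in this hunk can be exercised in isolation with a small sketch. The helper names `classify` and `should_quarantine` are hypothetical and do not appear in fix_vocab.py. `language_score` is not shown in the diff, so the scores are passed in directly here instead of being computed.

```python
def classify(es_score, en_score):
    """Label one side 'es', 'en', or 'unknown' from its two scores, as the hunk does inline."""
    if es_score > en_score:
        return "es"
    if en_score > es_score:
        return "en"
    return "unknown"


def should_quarantine(front_lang, back_lang):
    """Return True only for clear mis-pairs: same known language on both sides, or reversed."""
    both_same_lang = front_lang == back_lang and front_lang in ("es", "en")
    reversed_pair = front_lang == "en" and back_lang == "es"
    return both_same_lang or reversed_pair


# Enumerate every combination to see which cards the relaxed rule keeps.
langs = ("es", "en", "unknown")
kept = [(f, b) for f in langs for b in langs if not should_quarantine(f, b)]
```

Under the previous condition (`front_lang != "es" or back_lang != "en"`) only the `("es", "en")` combination was kept. With the relaxed rule, every combination involving an "unknown" side is kept as well.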