The bbox-OCR pipeline mis-paired ~114 vocab tables across the book; the
chapter 7 "Other Idioms" image (issue #32) was the most visible case.
Three failure modes were corrupting the pairings:
1) the language classifier was blind to subject pronouns ("yo", "I", etc.)
2) OCR read the right column before the left on 2-col tables
3) Y-clusters drifted on multi-line cells in 4-col layouts
Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image, and each image gets one
of three classifications:
- pair_table: extract all Spanish↔English pairs
- reference_only: Spanish-only conjugation tables; no pairs, and the UI
  shows the flat OCR lines as a reference list instead
- hybrid: some header pairs with reference content beneath; only the
  genuine pairs become cards
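Rough sketch of how one image entry could map to cards under the three
classifications above. The field names ("classification", "pairs", "es",
"en") are assumptions for illustration; the actual schema of
paired_vocab_llm.json is not spelled out here.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (spanish, english)


def cards_for_image(entry: Dict) -> List[Pair]:
    """Return the pairs that should become cards for one image entry.

    Assumed (hypothetical) entry shape:
      {"classification": "pair_table" | "reference_only" | "hybrid",
       "pairs": [{"es": ..., "en": ...}, ...]}
    """
    kind = entry.get("classification")
    if kind in ("pair_table", "hybrid"):
        # For hybrid images, "pairs" is assumed to hold only the genuine
        # header pairs; the reference content is kept elsewhere.
        return [(p["es"], p["en"]) for p in entry.get("pairs", [])]
    # reference_only (or anything unrecognised) yields no cards.
    return []
```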
merge_pdf_into_book.py now picks pair source by priority:
llm-vision → bounding-box OCR → block-alternation heuristic.
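The priority order can be sketched roughly as below. The source keys and
the function name are hypothetical, not the script's real identifiers.
One assumption worth flagging: the first source that has any entry for
the image wins, even if that entry is empty, so a reference_only result
from the vision pass is not overridden by lower-priority pairs.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (spanish, english)

# Hypothetical source keys, highest priority first.
SOURCE_PRIORITY = ("llm_vision", "bbox_ocr", "heuristic")


def pick_pair_source(
    image_id: str,
    sources: Dict[str, Dict[str, List[Pair]]],
) -> Tuple[Optional[str], List[Pair]]:
    """Return (source_name, pairs) from the highest-priority source
    that contains an entry for image_id, or (None, []) if none does."""
    for name in SOURCE_PRIORITY:
        per_image = sources.get(name, {})
        if image_id in per_image:
            return name, per_image[image_id]
    return None, []
```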
Numbers (across the whole book):
- mis-oriented tables: 114 → 5
- quarantined cards: 250 → 2
- extracted pairs: 2832 → 4569
textbookDataVersion bumped to 13. Per-batch agent outputs are gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/; only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap 24 tense-guide / grammar-note videos to The Language Tutor's
numbered lesson series where a matching lesson exists, filling the two
remaining gaps (ind_preterito_anterior → Lesson 65,
estar-gerund-progressive → Lesson 113). All 32 TLT picks preserved on
this pass.
For the non-TLT slots, prefer BaseLang's beginner lesson series where a
topic-specific video exists: ser-vs-estar, preterite-vs-imperfect,
subjunctive-triggers, object-pronouns, conditional-if-clauses,
tener-expressions, future-vs-ir-a, possessive-adjectives,
irregular-yo-verbs, and stem-changing-verbs.
Retire both Tell Me In Spanish videos (personal-a → castellano4U,
types-of-irregular-verbs → Master IRREGULAR VERBS Complete Lesson).
Generator header note clarifies that "not available on this app" rows
are a transient yt-dlp extraction limit — videos still play when tapped
in the app via the Stream button, which opens youtube.com externally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
youtube_videos.md lists every entry in youtube_videos.json with its
tense-guide / grammar-note id, title, channel, upload date, duration,
views, and likes (where public). Also flags the two topics with no
curated video so the gap is auditable in one place.
generate_videos_markdown.py queries yt-dlp in parallel for each unique
videoId and writes the markdown. Rerun when curation changes. One
current entry (saber-vs-conocer → j87i7MVCvIE) is now marked Private
Video — needs re-curation as a follow-up.
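Minimal sketch of the generator's shape: dedupe the videoIds, fetch
metadata in parallel, emit one markdown row per entry. The metadata
lookup is passed in as a `fetch` callable so the sketch runs without
network access; in the real script that step goes through yt-dlp. The
function name, column set, and metadata keys here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def render_videos_markdown(
    entries: List[Dict[str, str]],
    fetch: Callable[[str], Dict[str, str]],
    max_workers: int = 8,
) -> str:
    """Build a markdown table for entries shaped like
    {"id": <guide/note id>, "videoId": <YouTube id>} (assumed shape)."""
    # Each unique videoId is fetched exactly once, even if several
    # guides share the same video.
    unique_ids = sorted({e["videoId"] for e in entries})
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip lines results up with ids.
        meta = dict(zip(unique_ids, pool.map(fetch, unique_ids)))

    lines = ["| id | title | channel | videoId |", "| --- | --- | --- | --- |"]
    for e in entries:
        m = meta[e["videoId"]]
        lines.append(
            f"| {e['id']} | {m.get('title', '?')} | {m.get('channel', '?')} | {e['videoId']} |"
        )
    return "\n".join(lines) + "\n"
```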
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the chapter reader showed vocab tables as a flat list of OCR
lines. Because Vision reads columns top-to-bottom, the Spanish column
appeared as one block followed by the English column, making pairings
illegible.
Now every vocab table renders as a 2-column grid with Spanish on the
left and English on the right. Supporting changes:
- New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images,
  cluster lines into rows by Y-coordinate, split rows by largest X-gap,
  detect 2- / 3- / 4-column layouts automatically. ~2800 pairs extracted
  this pass vs ~1100 from the old block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present,
  falls back to the heuristic, and embeds the resulting pairs as
  vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed: only mis-pairs where both
  sides are clearly the same language are removed. "unknown" sides
  stay (bbox pipeline already oriented them correctly).
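Sketch of the row-clustering and X-gap split for the 2-column case,
written in Python for brevity (the real pass lives in
ocr_all_vocab.swift). The Box type, the tolerance value, and the function
names are illustrative assumptions; 3- and 4-column detection is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    x: float      # left edge
    y: float      # vertical position of the line
    width: float
    text: str


def cluster_rows(boxes: List[Box], y_tol: float = 0.01) -> List[List[Box]]:
    """Group boxes into rows: a box joins the current row when its y is
    within y_tol of that row's first box. Rows come back sorted by x."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.x) for row in rows]


def split_at_largest_gap(row: List[Box]) -> Tuple[str, str]:
    """Split an x-sorted row into (left, right) text at the widest
    horizontal gap between neighbouring boxes."""
    if len(row) < 2:
        return " ".join(b.text for b in row), ""
    gaps = [row[i + 1].x - (row[i].x + row[i].width) for i in range(len(row) - 1)]
    cut = gaps.index(max(gaps)) + 1
    left = " ".join(b.text for b in row[:cut])
    right = " ".join(b.text for b in row[cut:])
    return left, right
```

Anchoring each row on its first box, rather than chaining from the
previous box, keeps a row from creeping downward across many lines.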
Textbook card count jumps from 1044 to 3118 active pairs.
textbookDataVersion bumped to 9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cards now show "tengo — I have" instead of just "tengo", so learners
see the English meaning alongside the Spanish yo form. Bumps course
data version to 6 to trigger re-seed on next launch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>