The book reader's word lookup used DictionaryService, a verb-conjugation
index plus ~200 hand-typed words. Ordinary nouns like "taza" returned
nothing, and homographs always resolved to the verb (tapping "como" in
"como siempre" gave "comer" because the verb index is checked first).
Add a glossary phase to the books pipeline (build_glossary.py): every
distinct Spanish word is translated once, in its sentence context, by
the same Claude-Code-subagent LLM step the pipeline already uses for
chapter translation. English front matter is excluded by an ES==EN
paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json
and is now part of the pipeline for every book.
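The front-matter heuristic is only named above, not specified. A minimal sketch of one way such a check could work, assuming each chapter is available as (Spanish, English) paragraph pairs; the function name, the 0.5 threshold, and the normalisation are illustrative assumptions, not values taken from build_glossary.py:

```python
def is_english_front_matter(pairs, threshold=0.5):
    """Return True when most paragraphs are identical in ES and EN.

    `pairs` is a list of (spanish, english) paragraph strings. The 0.5
    threshold and the whitespace/case normalisation are assumptions made
    for this sketch.
    """
    if not pairs:
        return False
    same = sum(
        1
        for es, en in pairs
        if " ".join(es.split()).lower() == " ".join(en.split()).lower()
    )
    return same / len(pairs) > threshold
```

A chapter whose "translation" mostly equals its source was never Spanish to begin with, so its words would be skipped when collecting glossary candidates.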
In the app, Book carries the decoded glossary and BookReaderView resolves
each tap automatically through cache -> glossary -> DictionaryService ->
on-device LLM, citing which source answered so a curated glossary hit
reads differently from a best-effort AI guess.
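The resolution order above can be sketched as a first-hit chain that also reports which tier answered. This is a Python illustration of the idea only (the app code is Swift); the function name, the tier labels, and the word-only keys are assumptions, and the real glossary is built from words in sentence context:

```python
def resolve_word(word, sources):
    """Try each (label, lookup) pair in order and return the first hit.

    `lookup` is any callable returning a definition string or None.
    Returns (definition, label), or (None, None) if no tier answered.
    """
    key = word.lower()
    for label, lookup in sources:
        definition = lookup(key)
        if definition:
            return definition, label
    return None, None


# Hypothetical stand-ins for the four tiers named in the commit.
cache = {}
glossary = {"taza": "cup"}
verb_index = {"como": "comer (to eat)"}
sources = [
    ("cache", cache.get),
    ("glossary", glossary.get),
    ("dictionary", verb_index.get),
    ("on-device LLM", lambda w: None),  # stub; the real tier would call a model
]
```

Returning the label alongside the definition is what lets the UI word a glossary hit differently from an AI guess.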
book_olly-vol2.json regenerated with a 3,658-word glossary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings every tense guide and grammar note in the app up to teacher-
handout depth via parallel research subagents drafting against a
shared "thorough" checklist (TL;DR, usages, conjugation table, common
irregulars, mnemonic, top pitfalls, contrast with neighbour topic,
real-world dialogue example).
Tense guides — Conjuga/conjuga_data.json (tenseGuides[].body)
All 19 remaining guides rewritten. The 20th (subj_presente) was
enriched in the prior commit. Each body now ~4-5.5K chars (vs the
500-1500 chars of the pre-pass reference cards), covering:
- All five indicative tenses, both conditionals, both imperatives.
- Full subjunctive set including the archaic futuro / futuro
perfecto, framed honestly with "recognise, don't produce" guidance.
- Per-tense conjugation patterns and the top 5-15 irregular verbs.
- Tense-vs-tense contrasts (preterite↔imperfect, future↔ir-a,
-ra↔-se past subjunctive, etc.).
- Pitfalls that English speakers actually fall into.
Grammar notes — Conjuga/Conjuga/Models/GrammarNote.swift
All 36 notes audited and rewritten where the existing body was
missing one of: explicit mnemonic, contrast pair, pitfalls section,
or coverage of a key sub-topic. None was carried over unchanged: every
note gained at least one of those. Notable additions:
- DOCTOR/PLACE, WEIRDO, ESCAPA, RID, PRODDS, BANGS, RRPIA mnemonics
where missing.
- commands-imperative: nosotros + vosotros forms were entirely
absent; both added with the -d/-os and present-subjunctive rules.
- relative-pronouns: el que/el cual distinction, cuyo, lo que/lo
cual, donde/adonde.
- se-constructions: all 6 uses including the le→se substitution.
- irregular-yo-verbs: impact on subjunctive and negative tú command.
- Plus 5-item pitfalls sections on every note that lacked one.
Tooling — Conjuga/Scripts/guide-enrichment/
- PLAN.md (prior commit) — the audit, checklist, and priority order
that drove this pass.
- apply_drafts.py (new) — reads drafts/out/*.md, swaps tense guides
into the JSON and grammar notes into the Swift source via regex on
the GrammarNote(...) declarations. Handles multi-block `#` comment
headers some agents emitted. drafts/in/ and drafts/out/ are
gitignored — regeneratable from current state.
DataLoader.swift — courseDataVersion 8 → 9 so existing installs re-
seed all guides on next launch.
Verification:
- `swift -frontend -parse` on GrammarNote.swift succeeds (exit 0).
- JSON validates (python3 json.load round-trip).
- Triple-quote count is even (72 = 36 pairs, matching 36 notes).
- Full xcodebuild verify deferred — local SDK install was disrupted
by an Xcode update; will retest as part of the next ad-hoc deploy.
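The two scripted checks in the list above (JSON round-trip and triple-quote parity) can be reproduced with a few lines of Python. A minimal sketch; the function names are hypothetical and the one-pair-per-note rule simply mirrors the 72 = 36 pairs arithmetic:

```python
import json


def check_json_round_trip(text):
    """Parse JSON text, re-serialise it, and confirm it parses back equal."""
    data = json.loads(text)
    return json.loads(json.dumps(data, ensure_ascii=False)) == data


def check_triple_quotes(swift_source, expected_notes):
    """Return True when triple-quote delimiters pair up, one pair per note."""
    count = swift_source.count('"""')
    return count % 2 == 0 and count // 2 == expected_notes
```

An odd count would mean a regex substitution left a multi-line string literal unterminated, which is the failure these checks are meant to catch before a full build.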
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The present-subjunctive guide was surface-level: two numbered usages
and a handful of examples, no mnemonic and no structural trigger cue.
That's the recurring problem with the tense guides — they're reference
cards, not teaching materials.
This commit fixes the immediate gap and lays out a plan to fix the
rest:
Conjuga/conjuga_data.json — subj_presente body expanded from 794 to
3670 chars. Adds the WEIRDO mnemonic with per-letter triggers and
examples (Wishes, Emotions, Impersonal, Recommendations, Doubt,
Ojalá), the ESCAPA adverbial-conjunction set, the "que + change of
subject" structural rule, adjectival clauses with unknown
antecedents, and the future-time-clause rule (cuando / hasta que /
en cuanto).
Scripts/guide-enrichment/PLAN.md (new) — audit of all 20 tense
guides and 36 grammar notes, tier-1/2/3 prioritisation, "thorough"
checklist (TL;DR, usages, conjugation, irregulars, mnemonic,
pitfalls, contrast, dialogue example), research sources, per-topic
workflow, effort estimate.
DataLoader.swift — courseDataVersion 7 → 8 so existing installs
re-seed the new body on next launch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
extract_epub.py was walking <p> only, but every "Vocabulario" section in
the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant
the heading made it through but the entries didn't — 680 vocab lines
across 24 sections in this book were missing from the bundled JSON.
Audit (text-node owner by closest block ancestor) confirmed <li> is the
only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else.
No <h1>-<h6>, tables, or blockquotes in this EPUB at all.
Fix: walk find_all(["p", "li"]) in document order so bullet entries
slot in right after their "Vocabulario" / list heading. Re-extracted
(2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel
Claude Code subagents. translate_chapters.py prompt template now tells
subagents to keep bilingual `palabra = meaning` lines verbatim — both
sides already coexist on the line.
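To illustrate why document order matters, here is a small dependency-free sketch. It swaps in the standard library's html.parser for the find_all call used in the fix, and the class and function names are hypothetical; it also ignores nested blocks such as a <p> inside an <li>:

```python
from html.parser import HTMLParser


class BlockTextExtractor(HTMLParser):
    """Collect the text of every <p> and <li> in document order."""

    BLOCK_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished paragraph strings
        self._buffer = None  # text pieces of the block being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append(text)
            self._buffer = None


def extract_blocks(html):
    """Return the text of all <p>/<li> elements as a flat, ordered list."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Fed `<p>Vocabulario</p><ul><li>taza = cup</li></ul>`, the bullet entry lands directly after its heading instead of being dropped.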
Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds.
Verified in simulator: all 13 chapter row sizes grew (e.g. ch6
18,295→20,951 chars).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New "Books" row in the Practice tab opens a library of bundled bilingual
books. Each chapter renders Spanish paragraph-by-paragraph; tap any
word for a definition sheet (DictionaryService with on-device AI
fallback), or toggle the toolbar button to swap to the pre-computed
English translation inline.
Local-only Book + BookChapter SwiftData models added to the local
container schema (reset version bumped to 5). DataLoader.seedBooks
walks the bundle for `book_*.json` resources, so future books drop in
without touching app code — just bundle a new JSON and bump
bookDataVersion.
First book: Olly Richards' "Spanish Short Stories For Beginners
Vol 2" — 13 chapters, 2,646 paragraphs, bilingual.
Scripts/books/ is the repeatable pipeline for future EPUBs:
extract_epub.py → translate_chapters.py (per-chapter resumable jobs) →
bundle_book.py. Translation is done by parallel Claude Code subagents
reading per-job input files and writing output files — no API key
required, matching the pattern used for the textbook vocab vision
pass. See Scripts/books/README.md for the full how-to.
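The "resumable" part of the translation step can be sketched as a scan for jobs that have an input file but no output yet. This assumes one JSON file per job with the same name in both directories; the directory layout and the function name are illustrative, and the real layout is described in Scripts/books/README.md:

```python
from pathlib import Path


def pending_jobs(in_dir, out_dir):
    """Return input job files that have no matching output file yet.

    Assumes one file per job, named identically in both directories.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return sorted(
        job
        for job in in_dir.glob("*.json")
        if not (out_dir / job.name).exists()
    )
```

Re-running after an interrupted batch then hands only the unfinished jobs back to the subagents.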
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small fixes after the LLM-vision pass:
1. merge_pdf_into_book.py — when the LLM classifies an image as 'hybrid'
but extracts zero pairs (e.g., a conjugation table whose only English
text is on the section header that was excluded by the prompt rules),
respect that decision instead of falling through to the bbox/heuristic
pipeline. Previously: 1 chapter-2 estar conjugation table generated
4 bad pairs from the heuristic fallback.
2. fix_vocab.py language_score — recognize Spanish present-perfect
('he tenido', 'He andado por este pueblo') as Spanish. The classifier
was treating the auxiliary 'he'/'has'/'ha' as English subject pronouns,
producing false-positive mis-orientation flags on 4 chapter-15/20/23
present-perfect example tables.
Result: mis-oriented vocab pairs across the book drop from 5 to 0.
textbookDataVersion bumped to 14.
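The present-perfect rule from item 2 could look roughly like the pattern below. This is a hedged sketch, not the code in fix_vocab.py: the function name and the short list of irregular participles are assumptions, and a regex like this is a heuristic that can still misfire on unusual English text:

```python
import re

# Present-tense forms of "haber" followed by a regular past participle
# (-ado / -ido) or one of a few common irregular participles.
PRESENT_PERFECT = re.compile(
    r"\b(?:he|has|ha|hemos|habéis|han)\s+"
    r"(?:\w+(?:ado|ido)|hecho|dicho|visto|puesto|escrito|vuelto|abierto|roto)\b",
    re.IGNORECASE,
)


def looks_like_spanish_present_perfect(text):
    """Return True when `text` contains haber + past participle."""
    return PRESENT_PERFECT.search(text) is not None
```

Checking the word after the auxiliary is what separates "he tenido" from an English sentence that happens to start with "he".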
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes were behind the bad pairings:
1) classifier blind to subject pronouns ("yo", "I", etc.)
2) right-then-left OCR reads on 2-col tables
3) Y-cluster drift on multi-line cells in 4-col layouts
Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
- pair_table (extract all Spanish↔English pairs)
- reference_only (Spanish-only conjugation tables — no pairs, UI shows
the flat OCR lines as a reference list instead)
- hybrid (some header pairs + reference content beneath; only
the genuine pairs become cards)
merge_pdf_into_book.py now picks pair source by priority:
llm-vision → bounding-box OCR → block-alternation heuristic.
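A sketch of that priority order, assuming each source is a dict keyed by image id (the function name and dict shapes are hypothetical). It treats an image the vision pass classified as answered even when its pair list is empty, which is the behaviour the follow-up hybrid fix settled on:

```python
def pick_pairs(image_id, llm, bbox, heuristic):
    """Return (pairs, source) for one vocab image, by source priority.

    Each argument after `image_id` is a dict keyed by image id. An image
    present in `llm` wins even when its pair list is empty, so an explicit
    "no pairs here" classification is not overridden by a lower tier.
    """
    if image_id in llm:
        return llm[image_id].get("pairs", []), "llm-vision"
    if bbox.get(image_id):
        return bbox[image_id], "bbox-ocr"
    return heuristic.get(image_id, []), "heuristic"
```

Testing for key presence rather than a non-empty list is the design choice that keeps reference-only tables from picking up spurious pairs.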
Numbers (across the whole book):
- mis-oriented tables: 114 → 5
- quarantined cards: 250 → 2
- extracted pairs: 2832 → 4569
textbookDataVersion bumped to 13. Per-batch agent outputs gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/ — only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap 24 tense-guide / grammar-note videos to The Language Tutor's
numbered lesson series where a matching lesson exists, filling the two
remaining gaps (ind_preterito_anterior → Lesson 65,
estar-gerund-progressive → Lesson 113). All 32 TLT picks preserved on
this pass.
For the non-TLT slots, prefer BaseLang's beginner lesson series where a
topic-specific video exists: ser-vs-estar, preterite-vs-imperfect,
subjunctive-triggers, object-pronouns, conditional-if-clauses,
tener-expressions, future-vs-ir-a, possessive-adjectives,
irregular-yo-verbs, and stem-changing-verbs.
Retire both Tell Me In Spanish videos (personal-a → castellano4U,
types-of-irregular-verbs → Master IRREGULAR VERBS Complete Lesson).
Generator header note clarifies that "not available on this app" rows
are a transient yt-dlp extraction limit — videos still play when tapped
in the app via the Stream button, which opens youtube.com externally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
youtube_videos.md lists every entry in youtube_videos.json with its
tense-guide / grammar-note id, title, channel, upload date, duration,
views, and likes (where public). Also flags the two topics with no
curated video so the gap is auditable in one place.
generate_videos_markdown.py queries yt-dlp in parallel for each unique
videoId and writes the markdown. Rerun when curation changes. One
current entry (saber-vs-conocer → j87i7MVCvIE) is now marked Private
Video — needs re-curation as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the chapter reader showed vocab tables as a flat list of OCR
lines — because Vision reads columns top-to-bottom, the Spanish column
appeared as one block followed by the English column, making pairings
illegible.
Now every vocab table renders as a 2-column grid with Spanish on the
left and English on the right. Supporting changes:
- New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images,
cluster lines into rows by Y-coordinate, split rows by largest X-gap,
detect 2- / 3- / 4-column layouts automatically. ~2800 pairs extracted
this pass vs ~1100 from the old block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present,
falls back to the heuristic, embeds the resulting pairs as
vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed — only mis-pairs where both
sides are clearly the same language are removed. "unknown" sides
stay (bbox pipeline already oriented them correctly).
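The row clustering and column split from the first bullet can be sketched in Python for the 2-column case (the actual script is Swift). The sketch assumes each OCR line is a (text, x, y) tuple with normalised coordinates and y growing downward, measures gaps between left edges only, and uses an illustrative 0.01 tolerance; all names are hypothetical:

```python
def cluster_rows(lines, y_tolerance=0.01):
    """Group OCR lines into rows by Y-coordinate.

    `lines` is a list of (text, x, y) tuples. A line joins the current
    row when its y is within `y_tolerance` of that row's first line.
    """
    rows = []
    for line in sorted(lines, key=lambda item: item[2]):
        if rows and abs(line[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    return rows


def split_row(row):
    """Split one row into (left, right) text at its largest X-gap.

    Returns None for rows with fewer than two cells.
    """
    cells = sorted(row, key=lambda item: item[1])
    if len(cells) < 2:
        return None
    gaps = [cells[i + 1][1] - cells[i][1] for i in range(len(cells) - 1)]
    cut = gaps.index(max(gaps)) + 1
    left = " ".join(cell[0] for cell in cells[:cut])
    right = " ".join(cell[0] for cell in cells[cut:])
    return left, right
```

Pairing by geometry rather than reading order is what avoids the "Spanish block, then English block" output of the flat OCR pass.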
Textbook card count jumps from 1044 to 3118 active pairs.
textbookDataVersion bumped to 9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cards now show "tengo — I have" instead of just "tengo", so learners
see the English meaning alongside the Spanish yo form. Bumps course
data version to 6 to trigger re-seed on next launch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>