Commit Graph

13 Commits

Author SHA1 Message Date
Trey T 3ee1563cb0 Books — pre-computed per-book glossary for context-correct word lookup
The book reader's word lookup used DictionaryService, a verb-conjugation
index plus ~200 hand-typed words: ordinary nouns like "taza" returned
nothing, and homographs always lost (tapping "como" in "como siempre"
gave the verb "comer" because the verb index is checked first).

Add a glossary phase to the books pipeline (build_glossary.py): every
distinct Spanish word is translated once, in its sentence context, by
the same Claude-Code-subagent LLM step the pipeline already uses for
chapter translation. English front matter is excluded by an ES==EN
paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json
and is now part of the pipeline for every book.

In the app, Book carries the decoded glossary and BookReaderView resolves
each tap automatically through cache -> glossary -> DictionaryService ->
on-device LLM, citing which source answered so a curated glossary hit
reads differently from a best-effort AI guess.

book_olly-vol2.json regenerated with a 3,658-word glossary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:44:32 -05:00
Trey T 5db4b014a9 Guide — full enrichment pass: 19 tense guides + 36 grammar notes
Brings every tense guide and grammar note in the app up to teacher-
handout depth via parallel research subagents drafting against a
shared "thorough" checklist (TL;DR, usages, conjugation table, common
irregulars, mnemonic, top pitfalls, contrast with neighbour topic,
real-world dialogue example).

Tense guides — Conjuga/conjuga_data.json (tenseGuides[].body)
  All 19 remaining guides rewritten. The 20th (subj_presente) was
  enriched in the prior commit. Each body now ~4-5.5K chars (vs the
  500-1500 chars of the pre-pass reference cards), covering:
  - All five indicative tenses, both conditionals, both imperatives.
  - Full subjunctive set including the archaic futuro / futuro
    perfecto, framed honestly with "recognise, don't produce" guidance.
  - Per-tense conjugation patterns and the top 5-15 irregular verbs.
  - Tense-vs-tense contrasts (preterite↔imperfect, future↔ir-a,
    -ra↔-se past subjunctive, etc.).
  - Pitfalls that English speakers actually make.

Grammar notes — Conjuga/Conjuga/Models/GrammarNote.swift
  All 36 notes audited and rewritten where the existing body was
  missing one of: explicit mnemonic, contrast pair, pitfalls section,
  or coverage of a key sub-topic. None copied verbatim — every note
  got at least one of those slotted in. Notable additions:
  - DOCTOR/PLACE, WEIRDO, ESCAPA, RID, PRODDS, BANGS, RRPIA mnemonics
    where missing.
  - commands-imperative: nosotros + vosotros forms were entirely
    absent; both added with the -d/-os and present-subjunctive rules.
  - relative-pronouns: el que/el cual distinction, cuyo, lo que/lo
    cual, donde/adonde.
  - se-constructions: all 6 uses including the le→se substitution.
  - irregular-yo-verbs: impact on subjunctive and negative tú command.
  - Plus 5-item pitfalls sections on every note that lacked one.

Tooling — Conjuga/Scripts/guide-enrichment/
  - PLAN.md (prior commit) — the audit, checklist, and priority order
    that drove this pass.
  - apply_drafts.py (new) — reads drafts/out/*.md, swaps tense guides
    into the JSON and grammar notes into the Swift source via regex on
    the GrammarNote(...) declarations. Handles multi-block `#` comment
    headers some agents emitted. drafts/in/ and drafts/out/ are
    gitignored — regeneratable from current state.

DataLoader.swift — courseDataVersion 8 → 9 so existing installs re-
seed all guides on next launch.

Verification:
  - `swift -frontend -parse` on GrammarNote.swift succeeds (exit 0).
  - JSON validates (python3 json.load round-trip).
  - Triple-quote count is even (72 = 36 pairs, matching 36 notes).
  - Full xcodebuild verify deferred — local SDK install was disrupted
    by an Xcode update; will retest as part of the next ad-hoc deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:50:35 -05:00
Trey T de446b2301 Guide — enrich present-subjunctive entry with WEIRDO + ESCAPA + plan
The present-subjunctive guide was surface-level: two numbered usages
and a handful of examples, no mnemonic and no structural trigger cue.
That's the recurring problem with the tense guides — they're reference
cards, not teaching materials.

This commit fixes the immediate gap and lays out a plan to fix the
rest:

  Conjuga/conjuga_data.json — subj_presente body expanded from 794 to
  3670 chars. Adds the WEIRDO mnemonic with per-letter triggers and
  examples (Wishes, Emotions, Impersonal, Recommendations, Doubt,
  Ojalá), the ESCAPA adverbial-conjunction set, the "que + change of
  subject" structural rule, adjectival clauses with unknown
  antecedents, and the future-time-clause rule (cuando / hasta que /
  en cuanto).

  Scripts/guide-enrichment/PLAN.md (new) — audit of all 20 tense
  guides and 36 grammar notes, tier-1/2/3 prioritisation, "thorough"
  checklist (TL;DR, usages, conjugation, irregulars, mnemonic,
  pitfalls, contrast, dialogue example), research sources, per-topic
  workflow, effort estimate.

  DataLoader.swift — courseDataVersion 7 → 8 so existing installs
  re-seed the new body on next launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 23:33:09 -05:00
Trey T 05a367fdbe Books — capture <li> vocab bullets the extractor was silently dropping
extract_epub.py was walking <p> only, but every "Vocabulario" section in
the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant
the heading made it through but the entries didn't — 680 vocab lines
across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the
only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else.
No <h1>-<h6>, tables, or blockquotes in this EPUB at all.

Fix: walk find_all(["p", "li"]) in document order so bullet entries
slot in right after their "Vocabulario" / list heading. Re-extracted
(2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel
Claude Code subagents. translate_chapters.py prompt template now tells
subagents to keep bilingual `palabra = meaning` lines verbatim — both
sides already coexist on the line.

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds.
Verified in simulator: all 13 chapter row sizes grew (e.g. ch6
18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:10:34 -05:00
Trey T 09e49bda2c Add Books — read EPUB-imported books in Practice with tap-to-define
New "Books" row in the Practice tab opens a library of bundled bilingual
books. Each chapter renders Spanish paragraph-by-paragraph; tap any
word for a definition sheet (DictionaryService with on-device AI
fallback), or toggle the toolbar button to swap to the pre-computed
English translation inline.

Local-only Book + BookChapter SwiftData models added to the local
container schema (reset version bumped to 5). DataLoader.seedBooks
walks the bundle for `book_*.json` resources, so future books drop in
without touching app code — just bundle a new JSON and bump
bookDataVersion.

First book: Olly Richards' "Spanish Short Stories For Beginners
Vol 2" — 13 chapters, 2,646 paragraphs, bilingual.

Scripts/books/ is the repeatable pipeline for future EPUBs:
extract_epub.py → translate_chapters.py (per-chapter resumable jobs) →
bundle_book.py. Translation is done by parallel Claude Code subagents
reading per-job input files and writing output files — no API key
required, matching the pattern used for the textbook vocab vision
pass. See Scripts/books/README.md for the full how-to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 09:21:44 -05:00
Trey T 05a0cc0d17 Issue #32 cleanup — drop the last 5 mis-oriented vocab pairs
Two small fixes after the LLM-vision pass:

1. merge_pdf_into_book.py — when the LLM classifies an image as 'hybrid'
   but extracts zero pairs (e.g., a conjugation table whose only English
   text is on the section header that was excluded by the prompt rules),
   respect that decision instead of falling through to the bbox/heuristic
   pipeline. Previously: 1 chapter-2 estar conjugation table generated
   4 bad pairs from the heuristic fallback.

2. fix_vocab.py language_score — recognize Spanish present-perfect
   ('he tenido', 'He andado por este pueblo') as Spanish. The classifier
   was treating the auxiliary 'he'/'has'/'ha' as English subject pronouns,
   producing false-positive mis-orientation flags on 4 chapter-15/20/23
   present-perfect example tables.

Result: mis-oriented vocab pairs across the book go from 5 → 0.
textbookDataVersion bumped to 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:52:53 -05:00
Trey T f368c24ad6 Fixes #32 — LLM vision pass for vocab pairs, fixes scrambled English/Spanish
The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes were collapsing the data:
  1) classifier blind to subject pronouns ("yo", "I", etc.)
  2) right-then-left OCR reads on 2-col tables
  3) Y-cluster drift on multi-line cells in 4-col layouts

Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
  - pair_table       (extract all Spanish↔English pairs)
  - reference_only   (Spanish-only conjugation tables — no pairs, UI shows
                      the flat OCR lines as a reference list instead)
  - hybrid           (some header pairs + reference content beneath; only
                      the genuine pairs become cards)

merge_pdf_into_book.py now picks pair source by priority:
  llm-vision → bounding-box OCR → block-alternation heuristic.

Numbers (across the whole book):
  - mis-oriented tables: 114 → 5
  - quarantined cards:   250 → 2
  - extracted pairs:     2832 → 4569

textbookDataVersion bumped to 13. Per-batch agent outputs gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/ — only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:48:04 -05:00
Trey t fcb907718a Re-curate videos toward preferred channels
Swap 24 tense-guide / grammar-note videos to The Language Tutor's
numbered lesson series where a matching lesson exists, filling the two
remaining gaps (ind_preterito_anterior → Lesson 65, estar-gerund-
progressive → Lesson 113). All 32 TLT picks preserved on this pass.

For the non-TLT slots, prefer BaseLang's beginner lesson series where a
topic-specific video exists: ser-vs-estar, preterite-vs-imperfect,
subjunctive-triggers, object-pronouns, conditional-if-clauses,
tener-expressions, future-vs-ir-a, possessive-adjectives,
irregular-yo-verbs, and stem-changing-verbs.

Retire both Tell Me In Spanish videos (personal-a → castellano4U,
types-of-irregular-verbs → Master IRREGULAR VERBS Complete Lesson).

Generator header note clarifies that "not available on this app" rows
are a transient yt-dlp extraction limit — videos still play when tapped
in the app via the Stream button, which opens youtube.com externally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 07:24:53 -05:00
Trey t 9c7033d1b4 Add curated-videos markdown report + generator script
youtube_videos.md lists every entry in youtube_videos.json with its
tense-guide / grammar-note id, title, channel, upload date, duration,
views, and likes (where public). Also flags the two topics with no
curated video so the gap is auditable in one place.

generate_videos_markdown.py queries yt-dlp in parallel for each unique
videoId and writes the markdown. Rerun when curation changes. One
current entry (saber-vs-conocer → j87i7MVCvIE) is now marked Private
Video — needs re-curation as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 07:07:41 -05:00
Trey T 5f90a01314 Render textbook vocab as paired Spanish→English grid
Previously the chapter reader showed vocab tables as a flat list of OCR
lines — because Vision reads columns top-to-bottom, the Spanish column
appeared as one block followed by the English column, making pairings
illegible.

Now every vocab table renders as a 2-column grid with Spanish on the
left and English on the right. Supporting changes:

- New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images,
  cluster lines into rows by Y-coordinate, split rows by largest X-gap,
  detect 2- / 3- / 4-column layouts automatically. ~2800 pairs extracted
  this pass vs ~1100 from the old block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present,
  falls back to the heuristic, embeds the resulting pairs as
  vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed — only mis-pairs where both
  sides are clearly the same language are removed. "unknown" sides
  stay (bbox pipeline already oriented them correctly).

Textbook card count jumps from 1044 → 3118 active pairs.
textbookDataVersion bumped to 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:58:41 -05:00
Trey T 63dfc5e41a Add textbook reader, exercise grading, stem-change toggle, extraction pipeline
Major changes:
- Textbook UI: chapter list, reader, and interactive exercise view (keyboard
  + Apple Pencil) surfaced under the Course tab. 30 chapters, 251 exercises.
- Stem-change conjugation toggle on Week 4 flashcard decks (E-IE, E-I, O-UE).
  Uses existing VerbForm + IrregularSpan data to render highlighted present
  tense conjugations inline.
- Deterministic on-device answer grader with partial credit (correct / close
  for accent-stripped or single-char-typo / wrong). 11 unit tests cover it.
- SharedModels: TextbookChapter (local), TextbookExerciseAttempt (cloud-
  synced), AnswerGrader helpers. Bumped schema.
- DataLoader: textbook seeder (version 8) + refresh helpers that preserve
  LanGo course decks when textbook data is re-seeded.
- Local extraction pipeline in Conjuga/Scripts/textbook/ — XHTML chapter
  parser, answer-key parser, macOS Vision image OCR + PDF page OCR, merger,
  NSSpellChecker validator, language-aware auto-fixer, and repair pass that
  re-pairs quarantined vocab rows using bounding-box coordinates.
- UI test target (ConjugaUITests) with three tests: end-to-end textbook
  flow, all-chapters screenshot audit, and stem-change toggle verification.

Generated textbook content (textbook_data.json, textbook_vocab.json) and
third-party source files are gitignored — re-run Scripts/textbook/run_pipeline.sh
locally to regenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:12:55 -05:00
Trey T 5b69f3b630 Fixes #19 — Add English translations to exceptional yo form flashcards
Cards now show "tengo — I have" instead of just "tengo", so learners
see the English meaning alongside the Spanish yo form. Bumps course
data version to 6 to trigger re-seed on next launch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:02:40 -05:00
Trey t 4b467ec136 Initial commit: Conjuga Spanish conjugation app
Includes SwiftData dual-store architecture (local reference + CloudKit user data),
JSON-based data seeding, 20 tense guides, 20 grammar notes, SRS review system,
course vocabulary, and widget support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:58:33 -05:00