Trey T 7da98d786c Vocab study — noun & adjective flashcards with CEFR level toggles

Add SRS-driven noun and adjective flashcards modeled on the existing verb
flashcard flow:

- SharedModels/Lexeme — catalog of non-verb vocab, frequency-ranked, with
  gender for nouns and optional example sentences. Seeded from a bundled
  vocab_lexemes.json built by Scripts/vocab/build_lexemes.py, which joins
  frequency.csv + es-en.data from a pinned doozan/spanish_data commit
  (CC-BY-SA: hermitdave/FrequencyWords + Wiktionary). 1,449 nouns and 600
  adjectives, each with Wiktionary-sourced gender and (where available)
  an example sentence with English translation.
- LexemeReviewCard + LexemeReviewStore — cloud-synced SM-2 SRS, keyed by
  partOfSpeech + lexemeId + drillMode so future drill modes can coexist.
- LexemeSessionQueue + LexemePool — parallel to VocabSessionQueue; fresh
  cards sort by frequency rank.
- LexemeStudyGroup — cloud-synced resumable session per
  (partOfSpeech, drillMode).
- NounFlashcardPracticeView + AdjectiveFlashcardPracticeView — same flow
  as VocabFlashcardPracticeView: English prompt → tap to reveal Spanish
  → Again/Hard/Good/Easy. Nouns reveal with their article (la taza, el
  problema) so gender is taught alongside meaning, not as a separate
  quiz. Example sentence shown when present.

CEFR-style level toggles:
- LexemeLevel enum (A1/A2/B1/B2/C1+) derived from frequencyRank with
  standard Spanish-frequency-dictionary cutoffs (250/500/1000/2000).
- UserProgress.selectedLexemeLevels — cloud-synced multi-select, defaults
  to A1+A2 on first launch.
- SettingsView gains a "Vocabulary Levels" section with five toggles; the
  existing "Levels" section is renamed "Verb Levels" for clarity.
- Due SRS cards always surface regardless of toggles. Disabling a level
  only stops new cards from that band entering the pool.

PracticeView gets "Nouns" and "Adjectives" rows under "Books".

DataLoader: new lexemeDataVersion gate that re-seeds the Lexeme table
from vocab_lexemes.json independent of book seeding. project.yml lists
the new JSON resource and the existing book_olly-vol2.json (which the
previous build was silently excluding because xcodegen rewrote the
project from project.yml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 20:16:55 -05:00

.gitignore

Add Books — read EPUB-imported books in Practice with tap-to-define

2026-05-11 09:21:44 -05:00

build_glossary.py

Vocab study — noun & adjective flashcards with CEFR level toggles

2026-05-19 20:16:55 -05:00

bundle_book.py

Vocab study — noun & adjective flashcards with CEFR level toggles

2026-05-19 20:16:55 -05:00

extract_epub.py

Books — capture <li> vocab bullets the extractor was silently dropping

2026-05-11 10:10:34 -05:00

README.md

Books — pre-computed per-book glossary for context-correct word lookup

2026-05-18 10:44:32 -05:00

run.sh

Books — pre-computed per-book glossary for context-correct word lookup

2026-05-18 10:44:32 -05:00

translate_chapters.py

Books — capture <li> vocab bullets the extractor was silently dropping

2026-05-11 10:10:34 -05:00

README.md

Books pipeline

Turns any EPUB into a chapter-structured JSON file the app bundles and reads.

TL;DR

cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug

This runs Phase 1 (extract) and Phases 2 and 2b (write the translation and glossary job manifests), then reports how many jobs are still pending. Run those via Claude Code subagents (Phase 2.5 below), then re-run ./run.sh to bundle the final file.

Phases

| Phase | Script | What it does | Output |
| --- | --- | --- | --- |
| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. Resumable: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
| 2b | `build_glossary.py` | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. Resumable the same way. | `build/<slug>/glossary/<jobid>.input.json` + `_pending.txt` |
| 2.5 | Claude Code subagents | Drain both manifests: translate the chapter jobs and the glossary jobs, writing each job's `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/{jobs,glossary}/<jobid>.output.json` |
| 3 | `bundle_book.py` | Merge `chapters.json` + every translation `.output.json` + every glossary `.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |

run.sh chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running run.sh after subagents fill in the outputs gives you the real bundled file.
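The batching and resume behaviour described for Phase 2 can be sketched roughly as follows. This is not the code in `translate_chapters.py`: the `plan_jobs` name, the `{"paragraphs": [...]}` input shape, and the exact batch size of 30 are assumptions for illustration. Only the `chNN_bNN` job IDs and the file names follow the layout documented in this README.

```python
import json
from pathlib import Path

BATCH_SIZE = 30  # the README says "~30-paragraph"; the exact value is an assumption


def plan_jobs(chapters, jobs_dir, batch_size=BATCH_SIZE):
    """Write one input file per batch and return the job IDs still awaiting output.

    `chapters` is assumed to be a list of chapters, each a list of paragraph strings.
    """
    jobs_dir = Path(jobs_dir)
    jobs_dir.mkdir(parents=True, exist_ok=True)
    pending = []
    for ch_num, paragraphs in enumerate(chapters, start=1):
        for batch_num, start in enumerate(range(0, len(paragraphs), batch_size)):
            job_id = f"ch{ch_num:02d}_b{batch_num:02d}"
            payload = {"paragraphs": paragraphs[start:start + batch_size]}
            (jobs_dir / f"{job_id}.input.json").write_text(
                json.dumps(payload, ensure_ascii=False), encoding="utf-8"
            )
            # Resumable: a job whose output file already exists is not queued again.
            if not (jobs_dir / f"{job_id}.output.json").exists():
                pending.append(job_id)
    manifest = "".join(f"{job_id}\n" for job_id in pending)
    (jobs_dir / "_pending.txt").write_text(manifest, encoding="utf-8")
    return pending
```

Because the pending list is recomputed from the files on disk, re-running after a partial translation pass only lists the jobs that still lack an output.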

The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb comer). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.
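As a rough illustration of how the glossary input could be assembled, the sketch below collects each distinct word with one example paragraph and splits the result into batches. The regex tokenizer is only a stand-in (the app's real tokenizer is not shown here), and `collect_words`, `glossary_batches`, and the `word`/`example` keys are hypothetical names.

```python
import re

# Roughly "runs of Unicode letters"; a stand-in for the app's tokenizer (assumption).
WORD_RE = re.compile(r"[^\W\d_]+")


def collect_words(paragraphs):
    """Map each distinct lowercased word to the first paragraph it appears in."""
    examples = {}
    for paragraph in paragraphs:
        for match in WORD_RE.finditer(paragraph):
            examples.setdefault(match.group().lower(), paragraph)
    return examples


def glossary_batches(examples, batch_size=150):
    """Split the distinct words into batches of at most `batch_size` entries."""
    words = sorted(examples)
    return [
        [{"word": w, "example": examples[w]} for w in words[i:i + batch_size]]
        for i in range(0, len(words), batch_size)
    ]
```

Carrying an example paragraph with each word is what lets the translator pick the in-context sense, such as "como" the conjunction rather than a form of comer.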

Adding a new book

Drop the EPUB anywhere on disk.
Run Phase 1+2:
```
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book
```
Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable toc.ncx), extract_epub.py will need a fallback heuristic — see "Open assumptions" below.
Run translations (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

There are two manifests to drain — translation and glossary:
- build/<slug>/jobs/_pending.txt with prompt build/<slug>/jobs/_prompt_template.md
- build/<slug>/glossary/_pending.txt with prompt build/<slug>/glossary/_prompt_template.md
For each pending job ID, hand a subagent the matching prompt with <JOB_INPUT_PATH> / <JOB_OUTPUT_PATH> filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable — interrupted runs just leave the missing job IDs in _pending.txt.

Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.
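One way to prepare that hand-off is sketched below: read a manifest, fill the two placeholders in the prompt template, and cluster the prompts per agent. The `build_agent_batches` helper and the default of 8 jobs per agent are assumptions. How the prompts are actually dispatched to subagents is left out, since that depends on the Claude Code session.

```python
from pathlib import Path


def build_agent_batches(manifest_dir, jobs_per_agent=8):
    """Return lists of filled-in prompts, one list per subagent."""
    manifest_dir = Path(manifest_dir)
    template = (manifest_dir / "_prompt_template.md").read_text(encoding="utf-8")
    job_ids = (manifest_dir / "_pending.txt").read_text(encoding="utf-8").split()
    prompts = [
        template
        .replace("<JOB_INPUT_PATH>", str(manifest_dir / f"{job_id}.input.json"))
        .replace("<JOB_OUTPUT_PATH>", str(manifest_dir / f"{job_id}.output.json"))
        for job_id in job_ids
    ]
    # Cluster into groups of `jobs_per_agent` to keep per-agent context manageable.
    return [prompts[i:i + jobs_per_agent] for i in range(0, len(prompts), jobs_per_agent)]
```

The same function works for both `build/<slug>/jobs` and `build/<slug>/glossary`, since the two manifests share the same file naming.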

Bundle:

./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
# or directly:
python3 bundle_book.py my-book --require-all

--require-all will fail loudly if any job is still missing.
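A completeness check of this kind might look like the sketch below, which compares `.input.json` files against `.output.json` files in both manifest directories. It is an assumed reconstruction of the idea, not the logic in `bundle_book.py`. `missing_outputs` and `require_all` are hypothetical names.

```python
from pathlib import Path


def missing_outputs(manifest_dir):
    """Return the job IDs that have an input file but no output file."""
    manifest_dir = Path(manifest_dir)
    missing = []
    for input_path in sorted(manifest_dir.glob("*.input.json")):
        job_id = input_path.name[: -len(".input.json")]
        if not (manifest_dir / f"{job_id}.output.json").exists():
            missing.append(job_id)
    return missing


def require_all(build_dir):
    """Exit with an error if any translation or glossary job is unfinished."""
    missing = {
        sub: missing_outputs(Path(build_dir) / sub) for sub in ("jobs", "glossary")
    }
    if any(missing.values()):
        raise SystemExit(f"missing outputs: {missing}")
```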

Bump bookDataVersion in DataLoader.swift so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).
Verify the file is bundled in Conjuga.xcodeproj. The script writes book_<slug>.json into Conjuga/Conjuga/ (the app target root, see "File layout" below); if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the xcodeproj Ruby gem.

File layout

Conjuga/Scripts/books/
├── extract_epub.py        # Phase 1
├── translate_chapters.py  # Phase 2
├── build_glossary.py      # Phase 2b
├── bundle_book.py         # Phase 3
├── run.sh                 # Orchestrator
└── build/                 # gitignored
    └── <slug>/
        ├── chapters.json
        ├── jobs/                    # translation jobs
        │   ├── _pending.txt
        │   ├── _prompt_template.md
        │   ├── ch01_b00.input.json
        │   ├── ch01_b00.output.json
        │   └── ...
        └── glossary/                # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
            ├── gloss_b00.input.json
            ├── gloss_b00.output.json
            └── ...

The final output (book_<slug>.json) lives at Conjuga/Conjuga/book_<slug>.json so the iOS app bundle includes it. (Existing textbook_data.json / conjuga_data.json use the same layout — files in the app target root rather than a Resources subgroup.)

Open assumptions

TOC drives chapter boundaries. If an EPUB ships without a usable toc.ncx, or the navMap is too granular (e.g. one navPoint per page), extract_epub.py will need a fallback that groups by <h1> headings in spine order.
Spanish bold tags = inline emphasis. The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
Translation is per-paragraph 1:1. Subagents must preserve paragraph count and order. bundle_book.py will warn + pad/truncate if a job's output array length doesn't match its input — but that's a sign the subagent misbehaved.
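The warn-then-pad/truncate behaviour in the last assumption can be pictured with a small sketch. The `align_translations` name, the empty-string placeholder, and the use of the `warnings` module are assumptions. The real `bundle_book.py` may report mismatches differently.

```python
import warnings


def align_translations(source, translated, job_id, placeholder=""):
    """Force `translated` to the same length as `source`, warning on mismatch."""
    if len(translated) != len(source):
        warnings.warn(
            f"{job_id}: expected {len(source)} paragraphs, got {len(translated)}"
        )
    # Truncate any extras, then pad any shortfall with the placeholder.
    aligned = list(translated[: len(source)])
    aligned += [placeholder] * (len(source) - len(aligned))
    return aligned
```

Padding keeps the paragraph indices aligned so the bundle still loads, but a warning here usually means the job should be re-run rather than shipped.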

Out of scope (intentional)

OCR of vocab image tables (use Scripts/textbook/ if your book is image-heavy).
Exercise extraction (textbook pipeline).
Per-occurrence word sense disambiguation. The glossary has one entry per distinct word, translated in context; a word genuinely used in two senses in the same book gets its dominant sense. The runtime DictionaryService + the on-device LLM remain as fallbacks for anything the glossary misses.
Cover image extraction (covers are derived from a color hash in the app for now).
