Books pipeline
Turns any EPUB into a chapter-structured JSON file the app bundles and reads.
TL;DR
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug
This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells you how many translation jobs are pending. Run those via Claude Code subagents (Phase 2.5 below), then re-run ./run.sh to bundle the final file.
Phases
| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | extract_epub.py | Unzip the EPUB, walk content.opf spine + toc.ncx navMap, group HTML files into chapters, strip HTML→text. | build/<slug>/chapters.json |
| 2 | translate_chapters.py | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. Resumable: jobs whose output file already exists are skipped. | build/<slug>/jobs/<jobid>.input.json + _pending.txt |
| 2b | build_glossary.py | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. Resumable the same way. | build/<slug>/glossary/<jobid>.input.json + _pending.txt |
| 2.5 | Claude Code subagents | Drain both manifests: translate the chapter jobs and the glossary jobs, writing each job's <jobid>.output.json. See "Running translations" below. | build/<slug>/{jobs,glossary}/<jobid>.output.json |
| 3 | bundle_book.py | Merge chapters.json + every translation *.output.json + every glossary *.output.json into the final bundled JSON the app reads. | Conjuga/Conjuga/book_<slug>.json |
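To make the Phase 1 row concrete, here is a minimal sketch of reading spine order from an OPF package document. It is not the code in extract_epub.py: the function name `spine_hrefs` is hypothetical, and the sketch assumes the standard OPF namespace and ignores toc.ncx entirely.

```python
import xml.etree.ElementTree as ET

# Standard OPF package namespace used by EPUB 2 and EPUB 3.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml: str) -> list[str]:
    """Return manifest hrefs in spine (reading) order.

    Hypothetical helper for illustration; the real script may differ.
    """
    root = ET.fromstring(opf_xml)
    # The manifest maps item ids to file paths.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists item ids in reading order; skip dangling references.
    return [
        manifest[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.get("idref") in manifest
    ]
```

Grouping those files into chapters would then depend on the navMap, which this sketch leaves out.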
run.sh chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running run.sh after subagents fill in the outputs gives you the real bundled file.
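The batching and resume behaviour described for Phase 2 can be sketched roughly as follows. This is an illustration, not the contents of translate_chapters.py: the function names are hypothetical, the fixed batch size of 30 stands in for "~30 paragraphs", and the 1-based chapter numbering in the job ID is an assumption inferred from names like ch01_b00.

```python
from pathlib import Path

BATCH_SIZE = 30  # assumption: the table only says "~30 paragraphs"


def split_into_jobs(chapter_index: int, paragraphs: list[str],
                    batch_size: int = BATCH_SIZE) -> list[tuple[str, list[str]]]:
    """Split one chapter into (job_id, paragraphs) batches, e.g. ch01_b00."""
    jobs = []
    for batch_index, start in enumerate(range(0, len(paragraphs), batch_size)):
        job_id = f"ch{chapter_index:02d}_b{batch_index:02d}"
        jobs.append((job_id, paragraphs[start:start + batch_size]))
    return jobs


def pending_job_ids(job_ids: list[str], jobs_dir: Path) -> list[str]:
    """Resume support: keep only jobs with no <jobid>.output.json yet."""
    return [job_id for job_id in job_ids
            if not (jobs_dir / f"{job_id}.output.json").exists()]
```

Under these assumptions a 65-paragraph chapter 1 yields three jobs (ch01_b00, ch01_b01, ch01_b02), and only those without an output file would be listed as pending.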
The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb comer). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.
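As a rough picture of how a glossary manifest could be assembled, the sketch below collects each distinct word with the first paragraph it appears in, then splits the result into batches of 150. The regex tokenizer is only a stand-in for the app's tokenizer, which is not shown here, and using the whole paragraph as the example context is an assumption; both function names are hypothetical.

```python
import re

# Assumption: words are runs of Unicode letters. The app's real tokenizer
# may treat hyphens, apostrophes, or digits differently.
WORD_RE = re.compile(r"[^\W\d_]+")


def collect_glossary_words(paragraphs: list[str]) -> dict[str, str]:
    """Map each distinct lowercased word to the first paragraph containing it."""
    first_seen: dict[str, str] = {}
    for paragraph in paragraphs:
        for match in WORD_RE.finditer(paragraph):
            first_seen.setdefault(match.group().lower(), paragraph)
    return first_seen


def glossary_batches(words: dict[str, str], batch_size: int = 150) -> list[dict[str, str]]:
    """Split the word map into batches of at most batch_size entries."""
    items = list(words.items())
    return [dict(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
```

Note that accented and unaccented forms ("cómo" vs. "como") stay separate entries in this sketch, since they are different words in Spanish.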
Adding a new book
- Drop the EPUB anywhere on disk.

- Run Phase 1+2:

      cd Conjuga/Scripts/books
      ./run.sh /path/to/book.epub --slug my-book

  Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable toc.ncx), extract_epub.py will need a fallback heuristic; see "Open assumptions" below.

- Run translations (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo.

  There are two manifests to drain, translation and glossary:

  - build/<slug>/jobs/_pending.txt with prompt build/<slug>/jobs/_prompt_template.md
  - build/<slug>/glossary/_pending.txt with prompt build/<slug>/glossary/_prompt_template.md

  For each pending job ID, hand a subagent the matching prompt with <JOB_INPUT_PATH> / <JOB_OUTPUT_PATH> filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable: interrupted runs just leave the missing job IDs in _pending.txt.

  Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.

- Bundle:

      ./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
      # or directly:
      python3 bundle_book.py my-book --require-all

  --require-all will fail loudly if any job is still missing.

- Bump bookDataVersion in DataLoader.swift so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).

- Verify the file is bundled in Conjuga.xcodeproj. The script writes book_<slug>.json into Conjuga/Conjuga/; if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the xcodeproj ruby gem.
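The Phase 2.5 hand-off in the steps above (read a manifest, fill the prompt placeholders, group jobs per agent) might look something like the sketch below. All three function names are hypothetical. The sketch assumes _pending.txt holds one job ID per line, which is not stated in this README, and it does not spawn any subagents.

```python
from pathlib import Path


def read_pending(manifest: Path) -> list[str]:
    """Read job IDs from _pending.txt (assumed: one ID per line, blanks ignored)."""
    return [line.strip() for line in manifest.read_text().splitlines() if line.strip()]


def render_prompt(template: str, job_dir: Path, job_id: str) -> str:
    """Fill the <JOB_INPUT_PATH> / <JOB_OUTPUT_PATH> placeholders for one job."""
    return (
        template
        .replace("<JOB_INPUT_PATH>", str(job_dir / f"{job_id}.input.json"))
        .replace("<JOB_OUTPUT_PATH>", str(job_dir / f"{job_id}.output.json"))
    )


def cluster(job_ids: list[str], per_agent: int = 8) -> list[list[str]]:
    """Group job IDs into per-agent batches (8 sits inside the ~5-10 range)."""
    return [job_ids[i:i + per_agent] for i in range(0, len(job_ids), per_agent)]
```

Each inner list from `cluster` would go to one subagent, along with the rendered prompt for every job in it.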
File layout
Conjuga/Scripts/books/
├── extract_epub.py          # Phase 1
├── translate_chapters.py    # Phase 2
├── build_glossary.py        # Phase 2b
├── bundle_book.py           # Phase 3
├── run.sh                   # Orchestrator
└── build/                   # gitignored
    └── <slug>/
        ├── chapters.json
        ├── jobs/            # translation jobs
        │   ├── _pending.txt
        │   ├── _prompt_template.md
        │   ├── ch01_b00.input.json
        │   ├── ch01_b00.output.json
        │   └── ...
        └── glossary/        # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
            ├── gloss_b00.input.json
            ├── gloss_b00.output.json
            └── ...
The final output (book_<slug>.json) lives at Conjuga/Conjuga/book_<slug>.json so the iOS app bundle includes it. (Existing textbook_data.json / conjuga_data.json use the same layout — files in the app target root rather than a Resources subgroup.)
Open assumptions
- TOC drives chapter boundaries. If an EPUB ships without a usable toc.ncx, or the navMap is too granular (e.g. one navPoint per page), extract_epub.py will need a fallback that groups by <h1> headings in spine order.
- Spanish bold tags = inline emphasis. The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
- Translation is per-paragraph 1:1. Subagents must preserve paragraph count and order. bundle_book.py will warn + pad/truncate if a job's output array length doesn't match its input, but that's a sign the subagent misbehaved.
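The warn + pad/truncate behaviour in the last assumption can be pictured with the sketch below. It is an illustration of the idea rather than the code in bundle_book.py: the function name `align_translations` and the empty-string placeholder are assumptions.

```python
import warnings


def align_translations(source: list[str], translated: list[str],
                       job_id: str, placeholder: str = "") -> list[str]:
    """Force the translated list to the same length as the source list.

    Hypothetical helper: warns on a mismatch, pads short outputs with
    a placeholder, and drops any extra trailing entries.
    """
    if len(translated) != len(source):
        warnings.warn(
            f"{job_id}: expected {len(source)} paragraphs, got {len(translated)}"
        )
    # A negative multiplier yields an empty list, so over-long input is not padded.
    padded = list(translated) + [placeholder] * (len(source) - len(translated))
    return padded[:len(source)]
```

Padding keeps the bundle structurally valid, but it cannot tell which paragraph was dropped, so any paragraph after the gap may be paired with the wrong translation. That is why a mismatch is worth re-running rather than accepting.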
Out of scope (intentional)
- OCR of vocab image tables (use Scripts/textbook/ if your book is image-heavy).
- Exercise extraction (textbook pipeline).
- Per-occurrence word sense disambiguation. The glossary has one entry per distinct word, translated in context; a word genuinely used in two senses in the same book gets its dominant sense. The runtime DictionaryService + the on-device LLM remain as fallbacks for anything the glossary misses.
- Cover image extraction (covers are derived from a color hash in the app for now).