Trey T 3ee1563cb0 Books — pre-computed per-book glossary for context-correct word lookup
The book reader's word lookup used DictionaryService, a verb-conjugation
index plus ~200 hand-typed words: ordinary nouns like "taza" returned
nothing, and homographs always lost (tapping "como" in "como siempre"
gave the verb "comer" because the verb index is checked first).

Add a glossary phase to the books pipeline (build_glossary.py): every
distinct Spanish word is translated once, in its sentence context, by
the same Claude-Code-subagent LLM step the pipeline already uses for
chapter translation. English front matter is excluded by an ES==EN
paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json
and is now part of the pipeline for every book.

In the app, Book carries the decoded glossary and BookReaderView resolves
each tap automatically through cache -> glossary -> DictionaryService ->
on-device LLM, citing which source answered so a curated glossary hit
reads differently from a best-effort AI guess.

book_olly-vol2.json regenerated with a 3,658-word glossary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:44:32 -05:00

# Books pipeline
Turns any EPUB into a chapter-structured JSON file the app bundles and reads.
## TL;DR
```bash
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug
```
This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells you how many translation jobs are pending. Run those via Claude Code subagents (Phase 2.5 below), then re-run `./run.sh` to bundle the final file.
## Phases
| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
| 2b | `build_glossary.py` | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. **Resumable** the same way. | `build/<slug>/glossary/<jobid>.input.json` + `_pending.txt` |
| 2.5 | Claude Code subagents | Drain **both** manifests: translate the chapter jobs *and* the glossary jobs, writing each job's `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/{jobs,glossary}/<jobid>.output.json` |
| 3 | `bundle_book.py` | Merge `chapters.json` + every translation `*.output.json` + every glossary `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |
`run.sh` chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb *comer*). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.
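The collection step in Phase 2b can be pictured roughly as below. This is a minimal sketch, not the code in `build_glossary.py`: the regex tokenizer, the function names `collect_words` and `make_batches`, and the `word`/`example` field names are all assumptions made for illustration, and the whole paragraph stands in for the example sentence to keep the sketch short.

```python
import re
from collections import OrderedDict

# Hypothetical tokenizer: runs of Unicode letters. The app's real
# tokenization rules are not shown in this README.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def collect_words(paragraphs):
    """Map each distinct lowercased word to the first paragraph it appears in."""
    examples = OrderedDict()
    for paragraph in paragraphs:
        for match in WORD_RE.finditer(paragraph):
            word = match.group(0).lower()
            # Keep only the first context seen for each word.
            examples.setdefault(word, paragraph)
    return examples


def make_batches(examples, batch_size=150):
    """Split the word list into glossary batches of at most batch_size entries."""
    items = [{"word": w, "example": s} for w, s in examples.items()]
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Keeping one context per word is what produces the "one entry per distinct word" behaviour: a homograph gets whichever sense its stored example shows.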
## Adding a new book
1. **Drop the EPUB** anywhere on disk.
2. **Run Phase 1+2**:
```bash
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book
```
Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable `toc.ncx`), `extract_epub.py` will need a fallback heuristic — see "Open assumptions" below.
3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo. There are **two** manifests to drain, translation and glossary:
- `build/<slug>/jobs/_pending.txt` with prompt `build/<slug>/jobs/_prompt_template.md`
- `build/<slug>/glossary/_pending.txt` with prompt `build/<slug>/glossary/_prompt_template.md`
For each pending job ID, hand a subagent the matching prompt with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.
Cluster jobs into agent batches of ~5 to 10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.
4. **Bundle**:
```bash
./run.sh /path/to/book.epub --slug my-book # re-running pulls in the new outputs
# or directly:
python3 bundle_book.py my-book --require-all
```
`--require-all` will fail loudly if any job is still missing.
5. **Bump `bookDataVersion`** in `DataLoader.swift` so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).
6. **Verify the file is bundled** in `Conjuga.xcodeproj`. The script writes `book_<slug>.json` into `Conjuga/Conjuga/` (the app target root, matching the Phase 3 output path); if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the `xcodeproj` Ruby gem.
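The resumability rule in step 3 (a job is pending until its output file exists) and the clustering of jobs into per-agent batches can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: `pending_job_ids` and `cluster` are hypothetical helpers, and the default `per_agent=8` is an arbitrary value.

```python
from pathlib import Path


def pending_job_ids(job_dir):
    """Return IDs of jobs that have an input file but no output file yet."""
    job_dir = Path(job_dir)
    pending = []
    for input_path in sorted(job_dir.glob("*.input.json")):
        job_id = input_path.name[: -len(".input.json")]
        if not (job_dir / f"{job_id}.output.json").exists():
            pending.append(job_id)
    return pending


def cluster(job_ids, per_agent=8):
    """Group job IDs into chunks, one chunk per subagent."""
    return [job_ids[i:i + per_agent] for i in range(0, len(job_ids), per_agent)]
```

Because the check looks only at the filesystem, an interrupted run needs no cleanup: calling `pending_job_ids` again on `build/<slug>/jobs` or `build/<slug>/glossary` yields whatever is still missing.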
## File layout
```
Conjuga/Scripts/books/
├── extract_epub.py          # Phase 1
├── translate_chapters.py    # Phase 2
├── build_glossary.py        # Phase 2b
├── bundle_book.py           # Phase 3
├── run.sh                   # Orchestrator
└── build/                   # gitignored
    └── <slug>/
        ├── chapters.json
        ├── jobs/            # translation jobs
        │   ├── _pending.txt
        │   ├── _prompt_template.md
        │   ├── ch01_b00.input.json
        │   ├── ch01_b00.output.json
        │   └── ...
        └── glossary/        # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
            ├── gloss_b00.input.json
            ├── gloss_b00.output.json
            └── ...
```
The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json` so the iOS app bundle includes it. (Existing `textbook_data.json` / `conjuga_data.json` use the same layout — files in the app target root rather than a Resources subgroup.)
## Open assumptions
- **TOC drives chapter boundaries.** If an EPUB ships without a usable `toc.ncx`, or the navMap is too granular (e.g. one navPoint per page), `extract_epub.py` will need a fallback that groups by `<h1>` headings in spine order.
- **Spanish bold tags = inline emphasis.** The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
- **Translation is per-paragraph 1:1.** Subagents must preserve paragraph count and order. `bundle_book.py` will warn + pad/truncate if a job's output array length doesn't match its input — but that's a sign the subagent misbehaved.
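
The warn-and-pad/truncate behaviour in the last bullet might look something like the sketch below. It is only an illustration of the idea: the function name `reconcile`, the empty-string placeholder, and the warning text are assumptions, not what `bundle_book.py` actually does.

```python
import warnings


def reconcile(source_paragraphs, translated, placeholder=""):
    """Force the translated list to the same length as the source list."""
    expected = len(source_paragraphs)
    if len(translated) == expected:
        return list(translated)
    warnings.warn(f"expected {expected} paragraphs, got {len(translated)}")
    # Truncate any extras, then pad any shortfall with placeholders.
    fixed = list(translated[:expected])
    fixed += [placeholder] * (expected - len(fixed))
    return fixed
```

Padding keeps the bundle loadable, but it cannot restore alignment if the subagent merged or split paragraphs mid-batch, so a warning here usually means the job should be re-run.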
## Out of scope (intentional)
- OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
- Exercise extraction (textbook pipeline).
- Per-occurrence word sense disambiguation. The glossary has one entry per
distinct word, translated in context; a word genuinely used in two senses in
the same book gets its dominant sense. The runtime `DictionaryService` + the
on-device LLM remain as fallbacks for anything the glossary misses.
- Cover image extraction (covers are derived from a color hash in the app for now).
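
The fallback order described for word lookup (cache, then glossary, then `DictionaryService`, then the on-device LLM, recording which source answered) can be illustrated in a language-neutral way. The app itself is Swift; the Python below is a sketch only, and `lookup`, its parameters, and the source labels are hypothetical names rather than the app's API.

```python
def lookup(word, cache, glossary, dictionary_lookup, llm_lookup):
    """Resolve a word through the fallback chain; return (meaning, source) or None."""
    key = word.lower()
    if key in cache:
        return cache[key]
    sources = [
        ("glossary", glossary.get),        # pre-computed, in-context entries
        ("dictionary", dictionary_lookup), # stand-in for DictionaryService
        ("llm", llm_lookup),               # stand-in for the on-device LLM
    ]
    for source, resolve in sources:
        meaning = resolve(key)
        if meaning is not None:
            # Cache the answer together with the source that produced it.
            cache[key] = (meaning, source)
            return cache[key]
    return None
```

Returning the source alongside the meaning is what lets a reader view present a curated glossary hit differently from a best-effort guess.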