Books — pre-computed per-book glossary for context-correct word lookup

The book reader's word lookup used DictionaryService, a verb-conjugation index plus ~200 hand-typed words: ordinary nouns like "taza" returned nothing, and homographs always lost (tapping "como" in "como siempre" gave the verb "comer" because the verb index is checked first).

Add a glossary phase to the books pipeline (build_glossary.py): every distinct Spanish word is translated once, in its sentence context, by the same Claude-Code-subagent LLM step the pipeline already uses for chapter translation. English front matter is excluded by an ES==EN paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json and is now part of the pipeline for every book.

In the app, Book carries the decoded glossary and BookReaderView resolves each tap automatically through cache -> glossary -> DictionaryService -> on-device LLM, citing which source answered so a curated glossary hit reads differently from a best-effort AI guess.

book_olly-vol2.json regenerated with a 3,658-word glossary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:44:32 -05:00
parent d0582c4ce7
commit 3ee1563cb0
10 changed files with 18669 additions and 24 deletions
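
The lookup order named in the commit message (cache -> glossary -> DictionaryService -> on-device LLM) can be sketched as a simple fallback chain. This is only an illustration, written in Python to match the pipeline scripts; the app's real logic lives in BookReaderView and is not shown in this diff. `resolve_tap`, the source labels, and the sample dictionaries are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A source maps a lowercase word to a translation, or None if it has no answer.
Lookup = Callable[[str], Optional[str]]
Result = Tuple[str, str]  # (translation, name of the source that answered)


def resolve_tap(
    word: str,
    cache: Dict[str, Result],
    sources: List[Tuple[str, Lookup]],
) -> Optional[Result]:
    """Try each source in order and remember which one answered (hypothetical helper)."""
    key = word.lower()
    if key in cache:
        # The cached entry keeps the label of the source that first answered.
        return cache[key]
    for name, lookup in sources:
        translation = lookup(key)
        if translation is not None:
            cache[key] = (translation, name)
            return cache[key]
    return None


# Illustrative data only: the glossary is consulted before the verb index,
# so "como" resolves to the conjunction rather than a form of "comer".
glossary = {"como": "like, as"}
verb_index = {"como": "I eat (comer)", "hablo": "I speak (hablar)"}
sources = [
    ("glossary", glossary.get),
    ("dictionary", verb_index.get),
    ("llm", lambda w: None),  # stand-in for the on-device LLM fallback
]
```

Because the result carries the source label, a caller can render a glossary hit differently from a fallback answer, which is the behavior the commit message describes.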
@@ -17,10 +17,13 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo
 |---|---|---|---|
 | 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
 | 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
-| 2.5 | Claude Code subagents | Read each job's `.input.json`, translate Spanish→English, write `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/jobs/<jobid>.output.json` |
-| 3 | `bundle_book.py` | Merge `chapters.json` + every `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |
+| 2b | `build_glossary.py` | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. **Resumable** the same way. | `build/<slug>/glossary/<jobid>.input.json` + `_pending.txt` |
+| 2.5 | Claude Code subagents | Drain **both** manifests: translate the chapter jobs *and* the glossary jobs, writing each job's `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/{jobs,glossary}/<jobid>.output.json` |
+| 3 | `bundle_book.py` | Merge `chapters.json` + every translation `*.output.json` + every glossary `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |

-`run.sh` chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty `paragraphsEN` placeholders so you can preview app structure before translation completes. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
+`run.sh` chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
+
+The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb *comer*). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.

 ## Adding a new book

@@ -34,7 +37,11 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo

 3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

-   For each pending job ID listed in `build/<slug>/jobs/_pending.txt`, hand a subagent the prompt at `build/<slug>/jobs/_prompt_template.md` with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, translates, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.
+   There are **two** manifests to drain — translation and glossary:
+   - `build/<slug>/jobs/_pending.txt` with prompt `build/<slug>/jobs/_prompt_template.md`
+   - `build/<slug>/glossary/_pending.txt` with prompt `build/<slug>/glossary/_prompt_template.md`
+
+   For each pending job ID, hand a subagent the matching prompt with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.

   Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.

@@ -56,16 +63,23 @@ This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells yo
 Conjuga/Scripts/books/
 ├── extract_epub.py        # Phase 1
 ├── translate_chapters.py  # Phase 2
+├── build_glossary.py      # Phase 2b
 ├── bundle_book.py         # Phase 3
 ├── run.sh                 # Orchestrator
 └── build/                 # gitignored
    └── <slug>/
        ├── chapters.json
-        └── jobs/
+        ├── jobs/                    # translation jobs
+        │   ├── _pending.txt
+        │   ├── _prompt_template.md
+        │   ├── ch01_b00.input.json
+        │   ├── ch01_b00.output.json
+        │   └── ...
+        └── glossary/                # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
-            ├── ch01_b00.input.json
-            ├── ch01_b00.output.json
+            ├── gloss_b00.input.json
+            ├── gloss_b00.output.json
            └── ...
 ```

@@ -81,5 +95,8 @@ The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json

 - OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
 - Exercise extraction (textbook pipeline).
-- Pre-computed per-word annotations (the app uses `DictionaryService.lookup()` at runtime).
+- Per-occurrence word sense disambiguation. The glossary has one entry per
+  distinct word, translated in context; a word genuinely used in two senses in
+  the same book gets its dominant sense. The runtime `DictionaryService` + the
+  on-device LLM remain as fallbacks for anything the glossary misses.
 - Cover image extraction (covers are derived from a color hash in the app for now).
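
The Phase 2b behavior described in the README changes above (collect distinct words with an example, split them into batches, skip jobs whose output already exists) might look roughly like the sketch below. It is not the contents of `build_glossary.py`: the regex tokenizer, the use of the whole paragraph as the example, the JSON field names `word` and `example`, and the function names are all assumptions made for illustration. Only the `gloss_bNN` job IDs, the `.input.json` / `.output.json` / `_pending.txt` file names, and the ~150-word batch size come from the diff.

```python
import json
import re
from pathlib import Path

# Assumed tokenizer: runs of Unicode letters (accents included). The real
# script is said to tokenize "the same way the app does", which is not shown.
WORD_RE = re.compile(r"[^\W\d_]+")


def collect_words(paragraphs):
    """Map each distinct lowercase word to the first paragraph containing it."""
    examples = {}
    for para in paragraphs:
        for match in WORD_RE.finditer(para):
            examples.setdefault(match.group().lower(), para)
    return examples


def write_glossary_jobs(paragraphs, out_dir, batch_size=150):
    """Write gloss_bNN.input.json batches and return the still-pending job IDs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    items = sorted(collect_words(paragraphs).items())
    pending = []
    for start in range(0, len(items), batch_size):
        job_id = f"gloss_b{start // batch_size:02d}"
        batch = [{"word": w, "example": ex} for w, ex in items[start:start + batch_size]]
        (out_dir / f"{job_id}.input.json").write_text(
            json.dumps(batch, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        # Resumable: a job whose output file already exists is not re-queued.
        if not (out_dir / f"{job_id}.output.json").exists():
            pending.append(job_id)
    (out_dir / "_pending.txt").write_text(
        "".join(f"{job_id}\n" for job_id in pending), encoding="utf-8"
    )
    return pending
```

Sorting the words before batching keeps job IDs stable across re-runs on the same text, which is what lets an interrupted run resume from `_pending.txt` in this sketch.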