
# Books pipeline

Turns any EPUB into a chapter-structured JSON file the app bundles and reads.

## TL;DR

```bash
cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug
```

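This runs Phase 1 (extract), Phases 2 and 2b (write the translation and glossary job manifests), and Phase 3 (bundle), then tells you how many jobs are still pending. While jobs are pending, the bundle contains placeholders. Run the pending jobs via Claude Code subagents (Phase 2.5 below), then re-run `./run.sh` to bundle the final file.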

## Phases

| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | `extract_epub.py` | Unzip the EPUB, walk `content.opf` spine + `toc.ncx` navMap, group HTML files into chapters, strip HTML→text. | `build/<slug>/chapters.json` |
| 2 | `translate_chapters.py` | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. **Resumable**: jobs whose output file already exists are skipped. | `build/<slug>/jobs/<jobid>.input.json` + `_pending.txt` |
| 2b | `build_glossary.py` | Tokenize every Spanish paragraph the same way the app does, collect the distinct words with example sentences, split into ~150-word glossary batches. **Resumable** the same way. | `build/<slug>/glossary/<jobid>.input.json` + `_pending.txt` |
| 2.5 | Claude Code subagents | Drain **both** manifests: translate the chapter jobs *and* the glossary jobs, writing each job's `<jobid>.output.json`. See "Running translations" below. | `build/<slug>/{jobs,glossary}/<jobid>.output.json` |
| 3 | `bundle_book.py` | Merge `chapters.json` + every translation `*.output.json` + every glossary `*.output.json` into the final bundled JSON the app reads. | `Conjuga/Conjuga/book_<slug>.json` |

`run.sh` chains 1 → 2 → 2b → 3. If Phase 2 or 2b produces pending jobs, Phase 3 still runs but bundles with placeholders so you can preview app structure before the LLM passes complete. Re-running `run.sh` after subagents fill in the outputs gives you the real bundled file.
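
The resumability rule in the table (a job counts as done once its output file exists) can be sketched as follows. This is a minimal illustration, not the code in `translate_chapters.py`; the function name `pending_jobs` is hypothetical, and only the `<jobid>.input.json` / `<jobid>.output.json` naming is taken from this README.

```python
from pathlib import Path


def pending_jobs(job_dir: Path) -> list[str]:
    """Return IDs of jobs that have an input file but no output file yet."""
    suffix = ".input.json"
    pending = []
    for input_path in sorted(job_dir.glob(f"*{suffix}")):
        job_id = input_path.name[: -len(suffix)]
        # A job is skipped on re-run as soon as its output file exists.
        if not (job_dir / f"{job_id}.output.json").exists():
            pending.append(job_id)
    return pending
```

Calling `pending_jobs(Path("build/my-book/jobs"))` and again on the `glossary/` directory would give the two lists that Phase 2.5 has to drain.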

The glossary is the book reader's primary word-lookup source: every distinct word translated once, in context, so taps are instant, cover the whole book, and don't mis-resolve homographs (e.g. "como" as the conjunction vs. the verb *comer*). This phase is a permanent part of the pipeline — every book imported this way gets a glossary.
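
As a rough sketch of what Phase 2b does, the snippet below collects each distinct word with one example sentence and splits the result into batches. The tokenizer here is an assumption (a plain Unicode-letter regex); the real script is described as tokenizing "the same way the app does", and those rules are not shown in this README. The function names are hypothetical.

```python
import re

# Assumed tokenizer: runs of Unicode letters. The app's real rules may differ.
WORD_RE = re.compile(r"[^\W\d_]+")


def collect_words(paragraphs):
    """Map each distinct lowercased word to the first paragraph that uses it."""
    examples = {}
    for paragraph in paragraphs:
        for match in WORD_RE.finditer(paragraph):
            # setdefault keeps the first example seen for each word.
            examples.setdefault(match.group(0).lower(), paragraph)
    return examples


def glossary_batches(examples, batch_size=150):
    """Split the word -> example mapping into batches of at most batch_size words."""
    items = list(examples.items())
    return [dict(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
```

Keeping an example sentence with each word is what lets the translator pick the in-context sense instead of a dictionary default.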

## Adding a new book

1. **Drop the EPUB** anywhere on disk.
2. **Run Phases 1, 2, and 2b**:
   ```bash
   cd Conjuga/Scripts/books
   ./run.sh /path/to/book.epub --slug my-book
   ```
   Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable `toc.ncx`), `extract_epub.py` will need a fallback heuristic — see "Open assumptions" below.

3. **Run translations** (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo.

   There are **two** manifests to drain — translation and glossary:
   - `build/<slug>/jobs/_pending.txt` with prompt `build/<slug>/jobs/_prompt_template.md`
   - `build/<slug>/glossary/_pending.txt` with prompt `build/<slug>/glossary/_prompt_template.md`

   For each pending job ID, hand a subagent the matching prompt with `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` filled in. The subagent reads the input, produces the translation/glossary, and writes the output. Resumable — interrupted runs just leave the missing job IDs in `_pending.txt`.

   Cluster jobs into agent batches of ~5–10 jobs each to keep per-agent context manageable. ~5 parallel agents is a good throughput target.

4. **Bundle**:
   ```bash
   ./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
   # or directly:
   python3 bundle_book.py my-book --require-all
   ```
   `--require-all` will fail loudly if any job is still missing.

5. **Bump `bookDataVersion`** in `DataLoader.swift` so the in-app store re-seeds the book on next launch. Do this for a new book and any time you re-bundle with new translations.

6. **Verify the file is bundled** in `Conjuga.xcodeproj`. The script writes `book_<slug>.json` into `Conjuga/Conjuga/`, the app target root; if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the `xcodeproj` Ruby gem.
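
The hand-off in step 3 can be sketched in Python as follows. This is an illustration under stated assumptions: `_pending.txt` is assumed to hold one job ID per line, and the helper names are hypothetical. Only the `<JOB_INPUT_PATH>` / `<JOB_OUTPUT_PATH>` placeholders and the file naming come from this README.

```python
from pathlib import Path


def read_pending(manifest: Path) -> list[str]:
    """Read job IDs from a manifest, assuming one ID per line."""
    lines = manifest.read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def fill_prompt(template: str, job_dir: Path, job_id: str) -> str:
    """Substitute the two path placeholders for a single job."""
    return (
        template
        .replace("<JOB_INPUT_PATH>", str(job_dir / f"{job_id}.input.json"))
        .replace("<JOB_OUTPUT_PATH>", str(job_dir / f"{job_id}.output.json"))
    )


def agent_batches(job_ids: list[str], per_agent: int = 8) -> list[list[str]]:
    """Group job IDs into chunks so each subagent handles a handful of jobs."""
    return [job_ids[i:i + per_agent] for i in range(0, len(job_ids), per_agent)]
```

The default of 8 jobs per agent is an arbitrary pick inside the suggested 5–10 range; lower it if individual jobs are large.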

## File layout

```
Conjuga/Scripts/books/
├── extract_epub.py        # Phase 1
├── translate_chapters.py  # Phase 2
├── build_glossary.py      # Phase 2b
├── bundle_book.py         # Phase 3
├── run.sh                 # Orchestrator
└── build/                 # gitignored
    └── <slug>/
        ├── chapters.json
        ├── jobs/                    # translation jobs
        │   ├── _pending.txt
        │   ├── _prompt_template.md
        │   ├── ch01_b00.input.json
        │   ├── ch01_b00.output.json
        │   └── ...
        └── glossary/                # glossary jobs (Phase 2b)
            ├── _pending.txt
            ├── _prompt_template.md
            ├── gloss_b00.input.json
            ├── gloss_b00.output.json
            └── ...
```

The final output (`book_<slug>.json`) lives at `Conjuga/Conjuga/book_<slug>.json` so the iOS app bundle includes it. (Existing `textbook_data.json` / `conjuga_data.json` use the same layout — files in the app target root rather than a Resources subgroup.)

## Open assumptions

- **TOC drives chapter boundaries.** If an EPUB ships without a usable `toc.ncx`, or the navMap is too granular (e.g. one navPoint per page), `extract_epub.py` will need a fallback that groups by `<h1>` headings in spine order.
- **Spanish bold tags = inline emphasis.** The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
- **Translation is per-paragraph 1:1.** Subagents must preserve paragraph count and order. `bundle_book.py` warns and pads or truncates if a job's output array length doesn't match its input, but a mismatch is a sign the subagent misbehaved.
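
The pad/truncate behaviour in the last assumption can be sketched like this. It is a minimal illustration, not the code in `bundle_book.py`: the function name, the empty-string placeholder, and the returned mismatch flag are all assumptions.

```python
def align_translations(source_paragraphs, translated, placeholder=""):
    """Force the translated list to the same length as the source list.

    Returns (aligned, mismatch): mismatch is True when padding or truncation
    was needed, which the caller should surface as a warning.
    """
    expected = len(source_paragraphs)
    mismatch = len(translated) != expected
    # Truncate any extra items, then pad any missing ones with the placeholder.
    aligned = list(translated[:expected])
    aligned += [placeholder] * (expected - len(aligned))
    return aligned, mismatch
```

Padding keeps every source paragraph paired with some entry so the bundle stays structurally valid, but paragraphs after a dropped or merged translation can still be misaligned, so re-running the offending job is the safer fix.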

## Out of scope (intentional)

- OCR of vocab image tables (use `Scripts/textbook/` if your book is image-heavy).
- Exercise extraction (textbook pipeline).
- Per-occurrence word sense disambiguation. The glossary has one entry per
  distinct word, translated in context; a word genuinely used in two senses in
  the same book gets its dominant sense. The runtime `DictionaryService` + the
  on-device LLM remain as fallbacks for anything the glossary misses.
- Cover image extraction (covers are derived from a color hash in the app for now).