Trey T 05a367fdbe Books — capture <li> vocab bullets the extractor was silently dropping
extract_epub.py was walking <p> only, but every "Vocabulario" section in
the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant
the heading made it through but the entries didn't — 680 vocab lines
across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the
only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else.
No <h1>-<h6>, tables, or blockquotes in this EPUB at all.

Fix: walk find_all(["p", "li"]) in document order so bullet entries
slot in right after their "Vocabulario" / list heading. Re-extracted
(2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel
Claude Code subagents. translate_chapters.py prompt template now tells
subagents to keep bilingual `palabra = meaning` lines verbatim — both
sides already coexist on the line.

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds.
Verified in simulator: all 13 chapter row sizes grew (e.g. ch6
18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:10:34 -05:00

Books pipeline

Turns any EPUB into a chapter-structured JSON file the app bundles and reads.

TL;DR

cd Conjuga/Scripts/books
./run.sh /path/to/book.epub --slug my-book-slug

This runs Phase 1 (extract) and Phase 2 (manifest jobs), then stops and tells you how many translation jobs are pending. Run those via Claude Code subagents (Phase 2.5 below), then re-run ./run.sh to bundle the final file.

Phases

| Phase | Script | What it does | Output |
|---|---|---|---|
| 1 | extract_epub.py | Unzip the EPUB, walk content.opf spine + toc.ncx navMap, group HTML files into chapters, strip HTML→text. | build/<slug>/chapters.json |
| 2 | translate_chapters.py | Split each chapter into ~30-paragraph translation batches. Each batch becomes a job with its own input/output file. Resumable: jobs whose output file already exists are skipped. | build/<slug>/jobs/<jobid>.input.json + _pending.txt |
| 2.5 | Claude Code subagents | Read each job's .input.json, translate Spanish→English, write <jobid>.output.json. See "Running translations" below. | build/<slug>/jobs/<jobid>.output.json |
| 3 | bundle_book.py | Merge chapters.json + every *.output.json into the final bundled JSON the app reads. | Conjuga/Conjuga/book_<slug>.json |
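
The spine walk in Phase 1 can be illustrated with the standard library alone. This is a minimal sketch, not the code in extract_epub.py: the function name `spine_hrefs` is hypothetical, and it covers only reading order (no toc.ncx grouping, no HTML stripping). It relies on the standard EPUB layout, where META-INF/container.xml points at the .opf package file.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_hrefs(epub_path):
    """Return the archive paths of the spine documents in reading order."""
    with zipfile.ZipFile(epub_path) as zf:
        # META-INF/container.xml names the package (.opf) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))

    # The manifest maps item ids to hrefs relative to the .opf file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in opf.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    # The spine lists manifest ids in reading order.
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in opf.findall("opf:spine/opf:itemref", NS)
    ]
```

Grouping these files into chapters would still need the navMap from toc.ncx, which this sketch leaves out.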

run.sh chains 1 → 2 → 3. If Phase 2 produces pending jobs, Phase 3 still runs but bundles with empty paragraphsEN placeholders so you can preview app structure before translation completes. Re-running run.sh after subagents fill in the outputs gives you the real bundled file.
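
The batching and resumability described for Phase 2 can be sketched as follows. This is an illustration rather than the contents of translate_chapters.py: `plan_jobs`, the `BATCH_SIZE` constant, and the zero-based batch index are assumptions, and the job ID shape simply follows the `ch01_b00` names shown under "File layout".

```python
from pathlib import Path

BATCH_SIZE = 30  # roughly the ~30-paragraph batch size described above


def plan_jobs(chapters, jobs_dir, batch_size=BATCH_SIZE):
    """Split chapters into batches; return (all_jobs, pending_ids).

    `chapters` is a list of paragraph lists, one list per chapter.
    """
    jobs_dir = Path(jobs_dir)
    jobs, pending = {}, []
    for ch_num, paragraphs in enumerate(chapters, start=1):
        for b_num, start in enumerate(range(0, len(paragraphs), batch_size)):
            job_id = f"ch{ch_num:02d}_b{b_num:02d}"
            jobs[job_id] = paragraphs[start:start + batch_size]
            # Resumable: a job with an existing output file counts as done.
            if not (jobs_dir / f"{job_id}.output.json").exists():
                pending.append(job_id)
    return jobs, pending
```

Because only the presence of `<jobid>.output.json` is checked, re-running after an interrupted translation pass leaves finished jobs alone and reports only the remainder as pending.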

Adding a new book

  1. Drop the EPUB anywhere on disk.

  2. Run Phase 1+2:

    cd Conjuga/Scripts/books
    ./run.sh /path/to/book.epub --slug my-book
    

    Sanity-check the chapter list it prints. If chapter grouping looks wrong (e.g. an EPUB without a usable toc.ncx), extract_epub.py will need a fallback heuristic — see "Open assumptions" below.

  3. Run translations (Phase 2.5). The default approach is to spawn Claude Code subagents from inside a Claude Code session pointed at this repo:

    For each pending job ID listed in build/<slug>/jobs/_pending.txt, hand a subagent the prompt at build/<slug>/jobs/_prompt_template.md with <JOB_INPUT_PATH> / <JOB_OUTPUT_PATH> filled in. The subagent reads the input, translates, and writes the output. Resumable — interrupted runs just leave the missing job IDs in _pending.txt.

    Cluster jobs into agent batches of roughly 5 to 10 jobs each to keep per-agent context manageable. About 5 parallel agents is a good throughput target.

  4. Bundle:

    ./run.sh /path/to/book.epub --slug my-book   # re-running pulls in the new outputs
    # or directly:
    python3 bundle_book.py my-book --require-all
    

    --require-all will fail loudly if any job is still missing.

  5. Bump bookDataVersion in DataLoader.swift so the in-app store re-seeds the new book on next launch (or any time you re-run with new translations).

  6. Verify the file is bundled in Conjuga.xcodeproj. The script writes book_<slug>.json into Conjuga/Conjuga/ (the app target root, alongside textbook_data.json and conjuga_data.json); if that folder is part of a recursive group reference, Xcode picks it up automatically. Otherwise, add it manually or via the xcodeproj Ruby gem.
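
For step 3, preparing the subagent prompts can be sketched like this. It is a hedged illustration, not part of the pipeline: `render_prompts` and `cluster` are hypothetical helpers, and the sketch assumes _pending.txt holds one job ID per line and that the template contains the literal `<JOB_INPUT_PATH>` and `<JOB_OUTPUT_PATH>` placeholders.

```python
from pathlib import Path


def render_prompts(jobs_dir):
    """Return {job_id: prompt} for every job ID still listed as pending."""
    jobs_dir = Path(jobs_dir)
    template = (jobs_dir / "_prompt_template.md").read_text(encoding="utf-8")
    # Assumption: _pending.txt holds one job ID per line.
    pending = (jobs_dir / "_pending.txt").read_text(encoding="utf-8").split()
    return {
        job_id: template
        .replace("<JOB_INPUT_PATH>", str(jobs_dir / f"{job_id}.input.json"))
        .replace("<JOB_OUTPUT_PATH>", str(jobs_dir / f"{job_id}.output.json"))
        for job_id in pending
    }


def cluster(job_ids, per_agent):
    """Group job IDs into fixed-size batches, one batch per subagent."""
    return [job_ids[i:i + per_agent] for i in range(0, len(job_ids), per_agent)]
```

Each inner list from `cluster` would go to one subagent, which works through its prompts in order.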

File layout

Conjuga/Scripts/books/
├── extract_epub.py        # Phase 1
├── translate_chapters.py  # Phase 2
├── bundle_book.py         # Phase 3
├── run.sh                 # Orchestrator
└── build/                 # gitignored
    └── <slug>/
        ├── chapters.json
        └── jobs/
            ├── _pending.txt
            ├── _prompt_template.md
            ├── ch01_b00.input.json
            ├── ch01_b00.output.json
            └── ...

The final output (book_<slug>.json) lives at Conjuga/Conjuga/book_<slug>.json so the iOS app bundle includes it. (Existing textbook_data.json / conjuga_data.json use the same layout — files in the app target root rather than a Resources subgroup.)

Open assumptions

  • TOC drives chapter boundaries. If an EPUB ships without a usable toc.ncx, or the navMap is too granular (e.g. one navPoint per page), extract_epub.py will need a fallback that groups by <h1> headings in spine order.
  • Spanish bold tags = inline emphasis. The Olly Richards books bold vocab hints inside paragraphs. We strip the bold and let the in-app dictionary lookup handle definitions instead. If a future book uses bold for something else (titles, etc.), revisit.
  • Translation is per-paragraph 1:1. Subagents must preserve paragraph count and order. bundle_book.py will warn + pad/truncate if a job's output array length doesn't match its input — but that's a sign the subagent misbehaved.
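
The warn-then-pad/truncate behaviour in the last assumption can be sketched as below. This is not the code in bundle_book.py: `align_translations` is a hypothetical name, and padding with empty strings is an assumption about what the placeholder looks like.

```python
import warnings


def align_translations(source, translated, job_id="?"):
    """Force `translated` to the same length as `source`, warning on mismatch."""
    if len(translated) != len(source):
        warnings.warn(
            f"job {job_id}: expected {len(source)} paragraphs, "
            f"got {len(translated)}"
        )
    # Truncate extras, then pad any shortfall with empty strings (assumption).
    aligned = list(translated[:len(source)])
    aligned += [""] * (len(source) - len(aligned))
    return aligned
```

Padding or truncating keeps the bundle loadable, but it cannot tell which paragraph was dropped or merged, so a mismatched job is usually worth re-running.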

Out of scope (intentional)

  • OCR of vocab image tables (use Scripts/textbook/ if your book is image-heavy).
  • Exercise extraction (textbook pipeline).
  • Pre-computed per-word annotations (the app uses DictionaryService.lookup() at runtime).
  • Cover image extraction (covers are derived from a color hash in the app for now).