Files
Trey T 3ee1563cb0 Books — pre-computed per-book glossary for context-correct word lookup
The book reader's word lookup used DictionaryService, a verb-conjugation
index plus ~200 hand-typed words: ordinary nouns like "taza" returned
nothing, and homographs always lost (tapping "como" in "como siempre"
gave the verb "comer" because the verb index is checked first).

Add a glossary phase to the books pipeline (build_glossary.py): every
distinct Spanish word is translated once, in its sentence context, by
the same Claude-Code-subagent LLM step the pipeline already uses for
chapter translation. English front matter is excluded by an ES==EN
paragraph-ratio heuristic. The glossary is bundled into book_<slug>.json
and is now part of the pipeline for every book.

In the app, Book carries the decoded glossary and BookReaderView resolves
each tap automatically through cache -> glossary -> DictionaryService ->
on-device LLM, citing which source answered so a curated glossary hit
reads differently from a best-effort AI guess.

book_olly-vol2.json regenerated with a 3,658-word glossary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:44:32 -05:00

78 lines
2.6 KiB
Bash
Executable File

#!/usr/bin/env bash
# Orchestrate the books pipeline: EPUB -> chapters.json -> per-chapter job
# manifest -> (translation by Claude Code subagents) -> bundled book_<slug>.json.
#
# This script DOES NOT run the LLM translation pass. After Phase 2 it stops
# and prints how many jobs are pending. Use Claude Code subagents (or a fresh
# session per the README) to fill in build/<slug>/jobs/*.output.json, then
# re-run this script — it will pick up where it left off via Phase 3.
#
# Usage:
# ./run.sh <epub_path> [--slug SLUG] [--batch-size N]
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$HERE"
if [[ $# -lt 1 ]]; then
echo "usage: $0 <epub_path> [--slug SLUG] [--batch-size N]"
exit 2
fi
EPUB="$1"; shift
SLUG=""
BATCH_SIZE="30"
GLOSSARY_BATCH_SIZE="150"
while [[ $# -gt 0 ]]; do
case "$1" in
--slug) SLUG="$2"; shift 2 ;;
--batch-size) BATCH_SIZE="$2"; shift 2 ;;
--glossary-batch-size) GLOSSARY_BATCH_SIZE="$2"; shift 2 ;;
*) echo "unknown option: $1" >&2; exit 2 ;;
esac
done
EPUB_ABS="$(cd "$(dirname "$EPUB")" && pwd)/$(basename "$EPUB")"
echo "=== Phase 1: extract_epub.py ==="
if [[ -n "$SLUG" ]]; then
python3 extract_epub.py "$EPUB_ABS" --slug "$SLUG"
else
python3 extract_epub.py "$EPUB_ABS"
fi
# If --slug wasn't passed, recover the slug from the chapters file just written.
if [[ -z "$SLUG" ]]; then
SLUG=$(python3 -c "import json,glob; p=sorted(glob.glob('build/*/chapters.json'), key=lambda x: -__import__('os').path.getmtime(x))[0]; print(json.load(open(p))['slug'])")
fi
echo
echo "=== Phase 2: translate_chapters.py ==="
python3 translate_chapters.py "$SLUG" --batch-size "$BATCH_SIZE"
PENDING_FILE="build/$SLUG/jobs/_pending.txt"
PENDING_COUNT=$(wc -l < "$PENDING_FILE" | tr -d ' ')
echo
echo "=== Phase 2b: build_glossary.py ==="
python3 build_glossary.py "$SLUG" --batch-size "$GLOSSARY_BATCH_SIZE"
GLOSS_PENDING_FILE="build/$SLUG/glossary/_pending.txt"
GLOSS_PENDING_COUNT=$(wc -l < "$GLOSS_PENDING_FILE" | tr -d ' ')
TOTAL_PENDING=$((PENDING_COUNT + GLOSS_PENDING_COUNT))
echo
echo "=== Phase 3: bundle_book.py ==="
if [[ "$TOTAL_PENDING" -gt 0 ]]; then
echo " $PENDING_COUNT translation job(s) and $GLOSS_PENDING_COUNT glossary job(s) still pending."
echo " Run the Claude Code subagent step (see README.md) for BOTH manifests:"
echo " build/$SLUG/jobs/_pending.txt (translation)"
echo " build/$SLUG/glossary/_pending.txt (glossary)"
echo " then re-run this script. Bundling with placeholders so you can preview now."
python3 bundle_book.py "$SLUG"
else
python3 bundle_book.py "$SLUG" --require-all
fi