Vocab study — noun & adjective flashcards with CEFR level toggles

Add SRS-driven noun and adjective flashcards modeled on the existing verb
flashcard flow:

- SharedModels/Lexeme: catalog of non-verb vocab, frequency-ranked, with
  gender for nouns and optional example sentences. Seeded from a bundled
  vocab_lexemes.json built by Scripts/vocab/build_lexemes.py, which joins
  frequency.csv + es-en.data from a pinned doozan/spanish_data commit
  (CC-BY-SA: hermitdave/FrequencyWords + Wiktionary). 1,449 nouns and 600
  adjectives, each with Wiktionary-sourced gender and (where available) an
  example sentence with English translation.
- LexemeReviewCard + LexemeReviewStore: cloud-synced SM-2 SRS, keyed by
  partOfSpeech + lexemeId + drillMode so future drill modes can coexist.
- LexemeSessionQueue + LexemePool: parallel to VocabSessionQueue; fresh
  cards sort by frequency rank.
- LexemeStudyGroup: cloud-synced resumable session per
  (partOfSpeech, drillMode).
- NounFlashcardPracticeView + AdjectiveFlashcardPracticeView: same flow as
  VocabFlashcardPracticeView (English prompt → tap to reveal Spanish →
  Again/Hard/Good/Easy). Nouns reveal with their article (la taza, el
  problema) so gender is taught alongside meaning, not as a separate quiz.
  Example sentence shown when present.

CEFR-style level toggles:

- LexemeLevel enum (A1/A2/B1/B2/C1+) derived from frequencyRank with
  standard Spanish-frequency-dictionary cutoffs (250/500/1000/2000).
- UserProgress.selectedLexemeLevels: cloud-synced multi-select, defaults to
  A1+A2 on first launch.
- SettingsView gains a "Vocabulary Levels" section with five toggles; the
  existing "Levels" section is renamed "Verb Levels" for clarity.
- Due SRS cards always surface regardless of toggles. Disabling a level
  only stops new cards from that band entering the pool.

PracticeView gets "Nouns" and "Adjectives" rows under "Books".

DataLoader: new lexemeDataVersion gate that re-seeds the Lexeme table from
vocab_lexemes.json independent of book seeding.

project.yml lists the new JSON resource and the existing
book_olly-vol2.json (which the previous build was silently excluding
because xcodegen rewrote the project from project.yml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
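
The level banding and the pool rule described in the commit message can be sketched roughly as below. This is a Python illustration only, not the app's Swift `LexemeLevel` code. The function names, the `is_new` / `is_due` flags, and the choice to treat each cutoff as an inclusive upper bound (rank 250 counts as A1) are assumptions made for the example; only the level names, the cutoffs, and the A1+A2 default come from the message.

```python
from bisect import bisect_left

# Level names and cutoffs as listed in the commit message.
LEVELS = ["A1", "A2", "B1", "B2", "C1+"]
CUTOFFS = [250, 500, 1000, 2000]

# Default selection on first launch, per the commit message.
DEFAULT_SELECTED = {"A1", "A2"}


def level_for_rank(frequency_rank):
    """Map a 1-based frequency rank to a level band.

    Assumption: each cutoff is an inclusive upper bound, so rank 250 is A1
    and rank 251 is A2. The real enum may draw the boundary differently.
    """
    return LEVELS[bisect_left(CUTOFFS, frequency_rank)]


def enters_session(frequency_rank, selected_levels, is_new, is_due):
    """Decide whether a card may appear in a study session (hypothetical helper).

    Cards that already have review history surface whenever they are due,
    whatever the toggles say. The toggles only gate brand-new cards.
    """
    if not is_new:
        return is_due
    return level_for_rank(frequency_rank) in selected_levels
```

With the default selection, a previously studied rank-1500 card (band B2) still appears once it is due, while a new card at the same rank stays out of the pool until B2 is switched on.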
2026-05-19 20:16:55 -05:00
parent ac84b22977
commit 7da98d786c
24 changed files with 1811 additions and 72 deletions
@@ -0,0 +1,62 @@
+# Vocab catalog build
+
+`build_lexemes.py` produces `Conjuga/vocab_lexemes.json`, the bundled catalog
+of frequency-ranked Spanish nouns and adjectives that powers the Noun /
+Adjective flashcard study modes.
+
+## Run
+
+```sh
+python3 build_lexemes.py
+```
+
+Downloads `frequency.csv` + `es-en.data` from a pinned commit of
+[`doozan/spanish_data`](https://github.com/doozan/spanish_data), caches them
+under `.cache/<commit>/`, joins them, and writes the JSON. Re-running is
+fast — only the join step happens after the first download.
+
+Override defaults:
+
+```sh
+python3 build_lexemes.py --max-nouns 3000 --max-adjectives 1000
+python3 build_lexemes.py --output /tmp/vocab.json
+```
+
+## Data sources & attribution
+
+All datasets are CC-licensed; the bundled catalog inherits CC-BY-SA. Credit
+in the app's About screen must read:
+
+> Vocabulary data: Wiktionary (CC-BY-SA), OpenSubtitles via FrequencyWords
+> (CC-BY-SA 3.0).
+
+- **`frequency.csv`** — derived from
+  [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords)
+  (OpenSubtitles corpus), packaged by doozan. License: CC-BY-SA 3.0.
+- **`es-en.data`** — Spanish→English Wiktionary export in the
+  [`enwiktionary_wordlist`](https://github.com/doozan/enwiktionary_wordlist)
+  format. License: CC-BY-SA.
+
+The pinned doozan commit is at the top of `build_lexemes.py`
+(`DOOZAN_COMMIT`). Bump it to refresh; the cache is keyed by commit, so a
+new pin triggers a fresh download instead of reusing old data.
+
+## Output shape
+
+```json
+[
+  {
+    "baseForm": "casa",
+    "english": "house",
+    "partOfSpeech": "noun",
+    "gender": "f",
+    "frequencyRank": 142,
+    "exampleES": "La casa es grande",
+    "exampleEN": "The house is big"
+  },
+  ...
+]
+```
+
+Sorted by `frequencyRank` ascending so the fresh-card path in `LexemePool`
+surfaces the most useful words first.
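
A quick sanity check of the output shape and sort order described in the README might look like the sketch below. It is not part of `build_lexemes.py`. The helper names are hypothetical, and which fields count as required (including `gender` on every noun) is an assumption inferred from the sample entry and the commit message.

```python
import json

# Assumed to be present on every entry; the example fields are treated as optional.
REQUIRED_KEYS = {"baseForm", "english", "partOfSpeech", "frequencyRank"}


def check_catalog(entries):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    previous_rank = None
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing {sorted(missing)}")
            continue
        # Assumption: nouns always carry a gender value.
        if entry["partOfSpeech"] == "noun" and not entry.get("gender"):
            problems.append(f"entry {index}: noun without gender")
        rank = entry["frequencyRank"]
        # The README states the file is sorted by frequencyRank ascending.
        if previous_rank is not None and rank < previous_rank:
            problems.append(f"entry {index}: rank {rank} follows {previous_rank}")
        previous_rank = rank
    return problems


def check_file(path):
    """Load a catalog JSON file and run the same checks on it."""
    with open(path, encoding="utf-8") as handle:
        return check_catalog(json.load(handle))


# First entry is the README sample; the second is made up for illustration.
sample = [
    {"baseForm": "casa", "english": "house", "partOfSpeech": "noun",
     "gender": "f", "frequencyRank": 142,
     "exampleES": "La casa es grande", "exampleEN": "The house is big"},
    {"baseForm": "grande", "english": "big", "partOfSpeech": "adjective",
     "frequencyRank": 160},
]
problems = check_catalog(sample)
```

Calling `check_file("Conjuga/vocab_lexemes.json")` after a build would run the same checks against the real catalog.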