Spanish/Conjuga/Scripts/vocab/README.md
Trey T 7da98d786c Vocab study — noun & adjective flashcards with CEFR level toggles
Add SRS-driven noun and adjective flashcards modeled on the existing verb
flashcard flow:

- SharedModels/Lexeme — catalog of non-verb vocab, frequency-ranked, with
  gender for nouns and optional example sentences. Seeded from a bundled
  vocab_lexemes.json built by Scripts/vocab/build_lexemes.py, which joins
  frequency.csv + es-en.data from a pinned doozan/spanish_data commit
  (CC-BY-SA: hermitdave/FrequencyWords + Wiktionary). 1,449 nouns and 600
  adjectives, each with Wiktionary-sourced gender and (where available)
  an example sentence with English translation.
- LexemeReviewCard + LexemeReviewStore — cloud-synced SM-2 SRS, keyed by
  partOfSpeech + lexemeId + drillMode so future drill modes can coexist.
- LexemeSessionQueue + LexemePool — parallel to VocabSessionQueue; fresh
  cards sort by frequency rank.
- LexemeStudyGroup — cloud-synced resumable session per
  (partOfSpeech, drillMode).
- NounFlashcardPracticeView + AdjectiveFlashcardPracticeView — same flow
  as VocabFlashcardPracticeView: English prompt → tap to reveal Spanish
  → Again/Hard/Good/Easy. Nouns reveal with their article (la taza, el
  problema) so gender is taught alongside meaning, not as a separate
  quiz. Example sentence shown when present.

CEFR-style level toggles:
- LexemeLevel enum (A1/A2/B1/B2/C1+) derived from frequencyRank with
  standard Spanish-frequency-dictionary cutoffs (250/500/1000/2000).
- UserProgress.selectedLexemeLevels — cloud-synced multi-select, defaults
  to A1+A2 on first launch.
- SettingsView gains a "Vocabulary Levels" section with five toggles; the
  existing "Levels" section is renamed "Verb Levels" for clarity.
- Due SRS cards always surface regardless of toggles. Disabling a level
  only stops new cards from that band entering the pool.

PracticeView gets "Nouns" and "Adjectives" rows under "Books".

DataLoader: new lexemeDataVersion gate that re-seeds the Lexeme table
from vocab_lexemes.json independent of book seeding. project.yml lists
the new JSON resource and the existing book_olly-vol2.json (which the
previous build was silently excluding because xcodegen rewrote the
project from project.yml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:16:55 -05:00


Vocab catalog build

build_lexemes.py produces Conjuga/vocab_lexemes.json, the bundled catalog of frequency-ranked Spanish nouns and adjectives that powers the Noun / Adjective flashcard study modes.

Run

python3 build_lexemes.py

Downloads frequency.csv and es-en.data from a pinned commit of doozan/spanish_data, caches them under .cache/<commit>/, joins them, and writes the JSON. Re-running is fast: after the first download, only the join step runs.
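The commit-keyed cache described above can be sketched roughly as follows. This is an illustration rather than the actual code in build_lexemes.py: the `cached_sources` helper, the `fetch` callable, and the placeholder commit value are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable

# Hypothetical placeholder, not the real pinned commit from build_lexemes.py.
DOOZAN_COMMIT = "0000000"
SOURCE_FILES = ("frequency.csv", "es-en.data")


def cached_sources(
    cache_root: Path,
    commit: str,
    fetch: Callable[[str, str], bytes],
) -> dict[str, Path]:
    """Return local paths for the source files, downloading only on a cache miss."""
    cache_dir = cache_root / commit  # mirrors the .cache/<commit>/ layout
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name in SOURCE_FILES:
        target = cache_dir / name
        if not target.exists():
            # fetch(commit, name) is an assumed callable that returns the file bytes.
            target.write_bytes(fetch(commit, name))
        paths[name] = target
    return paths
```

Because the directory name is the commit, a second run with the same commit finds both files on disk and skips the download, while a different commit starts from an empty directory.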

Override defaults:

python3 build_lexemes.py --max-nouns 3000 --max-adjectives 1000
python3 build_lexemes.py --output /tmp/vocab.json
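To illustrate what the two caps control, here is a minimal sketch of a frequency-ordered join. It does not parse the real frequency.csv or es-en.data formats: the inputs are assumed to be already-parsed structures, the `join_lexemes` name is hypothetical, and treating frequencyRank as the position in the overall frequency list is an assumption.

```python
from typing import Iterable


def join_lexemes(
    frequency_rows: Iterable[tuple[str, str]],
    dictionary: dict[tuple[str, str], dict],
    max_nouns: int,
    max_adjectives: int,
) -> list[dict]:
    """Walk words in frequency order and keep the first N nouns and adjectives
    that also have a dictionary entry."""
    limits = {"noun": max_nouns, "adjective": max_adjectives}
    kept = {"noun": 0, "adjective": 0}
    lexemes = []
    for rank, (word, pos) in enumerate(frequency_rows, start=1):
        # Ignore other parts of speech and anything past the cap.
        if pos not in limits or kept[pos] >= limits[pos]:
            continue
        entry = dictionary.get((word, pos))
        if entry is None:
            continue  # no dictionary match: skip rather than guess a gloss
        kept[pos] += 1
        lexemes.append({
            "baseForm": word,
            "english": entry["english"],
            "partOfSpeech": pos,
            "gender": entry.get("gender"),  # assumed absent for adjectives
            "frequencyRank": rank,  # assumption: rank in the overall list
        })
    return lexemes
```

Under this sketch, raising --max-nouns simply lets the walk continue further down the frequency list before nouns stop being collected.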

Data sources & attribution

All datasets are CC-licensed; the bundled catalog inherits CC-BY-SA. Credit in the app's About screen must read:

Vocabulary data: Wiktionary (CC-BY-SA), OpenSubtitles via FrequencyWords (CC-BY-SA 3.0).

  • frequency.csv — derived from hermitdave/FrequencyWords (OpenSubtitles corpus), packaged by doozan. License: CC-BY-SA 3.0.
  • es-en.data — Spanish→English Wiktionary export in the enwiktionary_wordlist format. License: CC-BY-SA.

The pinned doozan commit is at the top of build_lexemes.py (DOOZAN_COMMIT). Bump it to refresh; the cache directory is keyed by commit, so a bump triggers a fresh download instead of reusing stale files.

Output shape

[
  {
    "baseForm": "casa",
    "english": "house",
    "partOfSpeech": "noun",
    "gender": "f",
    "frequencyRank": 142,
    "exampleES": "La casa es grande",
    "exampleEN": "The house is big"
  },
  ...
]

Sorted by frequencyRank ascending so the fresh-card path in LexemePool surfaces the most useful words first.
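A quick sanity check on a generated catalog might look like the sketch below. The `check_catalog` helper and the set of keys treated as required are assumptions for illustration; gender and the example fields are left out of the required set on the assumption that they are optional.

```python
# Assumed minimum set of keys; gender and examples are treated as optional.
REQUIRED_KEYS = {"baseForm", "english", "partOfSpeech", "frequencyRank"}


def check_catalog(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    previous_rank = None
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing {sorted(missing)}")
            continue
        rank = entry["frequencyRank"]
        # The catalog is expected to be in ascending frequencyRank order.
        if previous_rank is not None and rank < previous_rank:
            problems.append(f"entry {index}: frequencyRank {rank} is out of order")
        previous_rank = rank
    return problems
```

To use it, load the file with `json.load` and pass the resulting list to `check_catalog`; any returned strings point at entries worth inspecting before bundling the catalog.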