# Vocab catalog build

`build_lexemes.py` produces `Conjuga/vocab_lexemes.json`, the bundled catalog
of frequency-ranked Spanish nouns and adjectives that powers the Noun /
Adjective flashcard study modes.

## Run

```sh
python3 build_lexemes.py
```

The script downloads `frequency.csv` and `es-en.data` from a pinned commit of
[`doozan/spanish_data`](https://github.com/doozan/spanish_data), caches them
under `.cache/<commit>/`, joins them, and writes the JSON. Re-running is
fast: once the files are cached, only the join step runs.
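
The commit-keyed caching described above could look roughly like this. It is a minimal sketch, not the code in `build_lexemes.py`: the names `fetch_cached` and `http_download`, and the injectable `download` callback, are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def http_download(url: str) -> bytes:
    """Fetch a URL and return the response body."""
    with urlopen(url) as response:
        return response.read()


def fetch_cached(
    url: str,
    commit: str,
    filename: str,
    cache_root: Path = Path(".cache"),
    download: Callable[[str], bytes] = http_download,  # hypothetical hook
) -> Path:
    """Return the cached copy for this commit, downloading only if it is missing."""
    target = cache_root / commit / filename
    if not target.exists():
        # First run for this commit: create .cache/<commit>/ and store the file.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(url))
    return target
```

Because the commit is part of the path, a second call with the same commit never touches the network, while a different commit lands in its own directory.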

Override defaults:

```sh
python3 build_lexemes.py --max-nouns 3000 --max-adjectives 1000
python3 build_lexemes.py --output /tmp/vocab.json
```
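
For orientation, flags like these are typically declared with `argparse`. The sketch below is an assumption about how that might look, not a copy of the script: the function name `parse_args`, the help strings, and the default values are placeholders.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags shown above; defaults here are placeholders."""
    parser = argparse.ArgumentParser(description="Build the vocab lexeme catalog.")
    # None is used to mean "no explicit cap" in this sketch only.
    parser.add_argument("--max-nouns", type=int, default=None,
                        help="maximum number of nouns to keep")
    parser.add_argument("--max-adjectives", type=int, default=None,
                        help="maximum number of adjectives to keep")
    # Placeholder default; the real script writes Conjuga/vocab_lexemes.json.
    parser.add_argument("--output", default="vocab_lexemes.json",
                        help="path of the JSON file to write")
    return parser.parse_args(argv)
```

`argparse` converts the dashes in flag names to underscores, so the values are read as `args.max_nouns`, `args.max_adjectives`, and `args.output`.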

## Data sources & attribution

All datasets are CC-licensed; the bundled catalog inherits CC-BY-SA. Credit
in the app's About screen must read:

> Vocabulary data: Wiktionary (CC-BY-SA), OpenSubtitles via FrequencyWords
> (CC-BY-SA 3.0).

- **`frequency.csv`** — derived from
  [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords)
  (OpenSubtitles corpus), packaged by doozan. License: CC-BY-SA 3.0.
- **`es-en.data`** — Spanish→English Wiktionary export in the
  [`enwiktionary_wordlist`](https://github.com/doozan/enwiktionary_wordlist)
  format. License: CC-BY-SA.

The pinned doozan commit is set at the top of `build_lexemes.py`
(`DOOZAN_COMMIT`). Bump it to refresh the data; the cache directory is keyed
by commit, so a new commit triggers a fresh download instead of reusing the
old files.

## Output shape

```json
[
  {
    "baseForm": "casa",
    "english": "house",
    "partOfSpeech": "noun",
    "gender": "f",
    "frequencyRank": 142,
    "exampleES": "La casa es grande",
    "exampleEN": "The house is big"
  },
  ...
]
```

The array is sorted by `frequencyRank` ascending, so the fresh-card path in
`LexemePool` surfaces the most frequent, and therefore most useful, words first.
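
A quick sanity check of the generated file might look like the following. This is a sketch under assumptions: `check_catalog` and `load_and_check` are hypothetical helpers, and treating `gender` and the example sentences as optional is a guess, not something the build script is known to guarantee.

```python
import json

# Assumed to be present on every entry; the other fields are treated as optional here.
REQUIRED_KEYS = {"baseForm", "english", "partOfSpeech", "frequencyRank"}


def check_catalog(entries):
    """Return a list of problems found in a decoded catalog (empty if none)."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
    # The catalog is expected to be ordered by ascending frequencyRank.
    ranks = [e["frequencyRank"] for e in entries if "frequencyRank" in e]
    if ranks != sorted(ranks):
        problems.append("entries are not sorted by frequencyRank ascending")
    return problems


def load_and_check(path):
    """Read the JSON catalog from disk and run check_catalog on it."""
    with open(path, encoding="utf-8") as fh:
        return check_catalog(json.load(fh))
```

Calling `load_and_check("Conjuga/vocab_lexemes.json")` after a build returns an empty list when every entry has the assumed keys and the ordering holds.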