API Reference
Main API
Transcriber
- class xpfcorpus.Transcriber(language: str, script: str | None = None, *, verify: bool = True, yaml_file: Path | str | None = None, rules_file: Path | str | None = None, verify_file: Path | str | None = None)[source]
High-level grapheme-to-phoneme transcriber.
Supports multiple data sources:
- Package repository (default): bundled JSON data
- External YAML file: custom language definitions
- Legacy format: .rules and .verify files
Examples
# Basic usage - language with default script
>>> es = Transcriber("es")
>>> es.transcribe("ejemplo")
['e', 'x', 'e', 'm', 'p', 'l', 'o']

# Explicit script
>>> tt = Transcriber("tt", "cyrillic")

# External YAML file
>>> custom = Transcriber("custom", yaml_file="my_lang.yaml")

# Legacy format
>>> legacy = Transcriber("test", rules_file="es.rules")
- __init__(language: str, script: str | None = None, *, verify: bool = True, yaml_file: Path | str | None = None, rules_file: Path | str | None = None, verify_file: Path | str | None = None)[source]
Initialize a Transcriber for a language.
- Parameters:
language – Language code (e.g., “es”, “tt”, “aak”). Supports BCP-47 style codes with script and region:
- “es-ES” (region preserved, treated as variant)
- “yi-Latn” (script extracted)
- “tt-cyrillic” (script extracted)
- “zh-Hans-CN” (script extracted, region preserved)
script – Script to use (e.g., “latin”, “cyrillic”). Required for multi-script languages without a default. If provided, overrides any script in the language code.
verify – If True, verify the rules on initialization. Raises VerificationError if verification fails.
yaml_file – Path to an external YAML file (requires PyYAML).
rules_file – Path to a legacy .rules file.
verify_file – Path to a legacy .verify file.
- transcribe(word: str) → list[str][source]
Transcribe a word from graphemes to phonemes.
- Parameters:
word – The word to transcribe.
- Returns:
List of phoneme strings.
- verify() → tuple[bool, list[str]][source]
Verify the loaded rules against the verification data.
- Returns:
Tuple of (all_passed, list_of_error_messages).
- property variant: str | None
The language variant (region code) if a variant file was loaded.
- Returns:
Region code (e.g., “ES”, “MX”) if variant file exists, otherwise None.
Examples
>>> es = Transcriber("es")
>>> es.variant  # None (base language)
>>>
>>> # If es-ES.json exists:
>>> es_es = Transcriber("es-ES")
>>> es_es.variant  # "ES"
>>>
>>> # If es-ES.json doesn't exist (falls back to es.json):
>>> es_es = Transcriber("es-ES")
>>> es_es.variant  # None
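The fallback described in the examples above can be pictured with a small stand-alone sketch. It does not use xpfcorpus; `resolve_variant` and the `available_files` set are hypothetical names, and the only detail taken from the reference is the file naming (`es-ES.json`, `es.json`).

```python
def resolve_variant(language, region, available_files):
    """Pick a data file and report the variant that was actually loaded.

    Hypothetical helper: mirrors the documented behaviour, not the
    library's code.
    """
    if region is not None:
        variant_file = f"{language}-{region}.json"
        if variant_file in available_files:
            return variant_file, region
    # No variant file: fall back to the base language, variant is None.
    return f"{language}.json", None


with_variant = resolve_variant("es", "ES", {"es.json", "es-ES.json"})
without_variant = resolve_variant("es", "ES", {"es.json"})
base_only = resolve_variant("es", None, {"es.json", "es-ES.json"})
```

In this sketch the second element of the result plays the role of the `variant` property: it is a region code only when a matching variant file was found.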
available_languages
Exceptions
Custom exceptions for xpfcorpus.
- exception xpfcorpus.exceptions.XPFCorpusError[source]
Bases: Exception
Base exception for all xpfcorpus errors.
- exception xpfcorpus.exceptions.LanguageNotFoundError(code: str, available: list[str] | None = None)[source]
Bases: XPFCorpusError
Raised when a requested language is not available.
- exception xpfcorpus.exceptions.ScriptNotFoundError(code: str, script: str, available: list[str])[source]
Bases: XPFCorpusError
Raised when a requested script is not available for a language.
- exception xpfcorpus.exceptions.ScriptRequiredError(code: str, available: list[str])[source]
Bases: XPFCorpusError
Raised when a language requires explicit script selection.
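Because every exception above derives from XPFCorpusError, a caller can catch the base class to handle any of them. The sketch below illustrates that with local stand-in classes rather than the real ones, so it runs without the package. Storing `code` and `available` as attributes, and the `pick_script` helper, are assumptions made for the illustration.

```python
class XPFCorpusError(Exception):
    """Stand-in for the documented base exception."""


class ScriptRequiredError(XPFCorpusError):
    """Stand-in; keeping the arguments as attributes is an assumption."""

    def __init__(self, code, available):
        super().__init__(f"{code!r} requires a script; choose one of {available}")
        self.code = code
        self.available = available


def pick_script(code, scripts, default):
    """Hypothetical helper: use the default script or demand a choice."""
    if default is None:
        raise ScriptRequiredError(code, scripts)
    return default


caught = None
try:
    pick_script("tt", ["latin", "cyrillic"], None)
except XPFCorpusError as exc:  # the base class catches the subclass
    caught = exc

chosen = pick_script("es", ["latin"], "latin")
```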
Engine Layer
TranscriptionProcessor
- class xpfcorpus.engine.processor.TranscriptionProcessor(rules: RuleSet, missing: str = '@')[source]
Core transcription engine that converts graphemes to phonemes.
This is a pure transcription class with no I/O operations. The algorithm is adapted from XPF Corpus’s translate04.py.
- __init__(rules: RuleSet, missing: str = '@')[source]
Initialize the processor with a rule set.
- Parameters:
rules – The RuleSet containing all transcription rules.
missing – Character to use for untranscribable graphemes.
- transcribe(word: str) → list[str][source]
Transcribe a word from graphemes to phonemes.
- Parameters:
word – The word to transcribe.
- Returns:
List of phoneme strings.
- verify(entries: list[VerifyEntry], *, stop_on_first: bool = False) → tuple[bool, list[str]][source]
Verify transcription against expected outputs.
- Parameters:
entries – List of VerifyEntry objects with word/phonemes pairs.
stop_on_first – If True, stop at the first failure.
- Returns:
Tuple of (all_passed, list_of_error_messages).
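As a rough illustration of the return shape and of `stop_on_first`, here is a minimal stand-alone sketch of a verification loop. It is not the library's implementation: the `VerifyEntry` stand-in and its field names (`word`, `phonemes`) are assumptions based on the "word/phonemes pairs" wording, and the toy transcriber simply splits a word into characters.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerifyEntry:
    """Hypothetical stand-in: a word and its expected phonemes."""
    word: str
    phonemes: list[str]


def verify(
    transcribe: Callable[[str], list[str]],
    entries: list[VerifyEntry],
    *,
    stop_on_first: bool = False,
) -> tuple[bool, list[str]]:
    """Compare each transcription with its expected output."""
    errors: list[str] = []
    for entry in entries:
        actual = transcribe(entry.word)
        if actual != entry.phonemes:
            errors.append(f"{entry.word}: expected {entry.phonemes}, got {actual}")
            if stop_on_first:
                break  # report only the first mismatch
    return (not errors, errors)


# Toy transcriber: `list` turns "ab" into ["a", "b"].
entries = [
    VerifyEntry("ab", ["a", "b"]),
    VerifyEntry("cd", ["k", "d"]),
    VerifyEntry("ef", ["e", "v"]),
]
all_passed, messages = verify(list, entries)
first_only = verify(list, entries, stop_on_first=True)
```

With these entries the full run collects two messages, while the `stop_on_first=True` run stops after the first one.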
Data Classes
- class xpfcorpus.engine.rules.RuleSet(classes: dict[str, str] = <factory>, pre: dict[str, str] = <factory>, matches: dict[str, str] = <factory>, subs: list[SubRule] = <factory>, ipasubs: list[SubRule] = <factory>, words: dict[str, list[str]] = <factory>)[source]
A complete set of rules for translating a script.
Contains:
- classes: character class definitions for use in other rules
- pre: character-level preprocessing (as a translation table)
- matches: simple character-to-phoneme mappings (no context)
- subs: context-sensitive substitution rules
- ipasubs: post-processing substitution rules on IPA output
- words: whole-word exception mappings
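To give a feel for how some of these layers might interact, the following toy pipeline uses only `words`, `pre` and `matches`, plus a `missing` marker like the one accepted by TranscriptionProcessor. The order in which the layers are applied is an assumption, `subs` and `ipasubs` are left out entirely, and `toy_transcribe` is a hypothetical name, so treat this as a sketch rather than the engine's algorithm.

```python
def toy_transcribe(word, *, words, pre, matches, missing="@"):
    """Hypothetical three-layer pipeline; the layer order is an assumption."""
    # 1. Whole-word exceptions bypass every other rule.
    if word in words:
        return list(words[word])
    # 2. Character-level preprocessing through a translation table.
    cleaned = word.translate(str.maketrans(pre))
    # 3. Context-free lookup; unknown graphemes become the `missing` marker.
    return [matches.get(ch, missing) for ch in cleaned]


rules = {
    "words": {"y": ["i"]},
    "pre": {"\u00ed": "i"},  # map i-acute to plain i
    "matches": {"s": "s", "i": "i"},
}
exception_hit = toy_transcribe("y", **rules)
preprocessed = toy_transcribe("s\u00ed", **rules)
with_gap = toy_transcribe("six", **rules)
```

The last call shows the role of the marker: "x" has no entry in `matches`, so it is emitted as "@".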
- class xpfcorpus.engine.rules.SubRule(sfrom: str, sto: str, weight: float = 1.0, precede: str = '', follow: str = '', _sfrom_re: Pattern | None = None, _precede_re: Pattern | None = None, _follow_re: Pattern | None = None)[source]
A substitution rule with pattern matching and context.
Wraps a regex-based substitution with optional precede/follow context and a weight for rule prioritization.
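One common way to express a substitution with precede/follow context is through regex lookarounds. The sketch below takes that approach with Python's `re` module; whether SubRule compiles its patterns this way is an assumption, `apply_sub` is a hypothetical helper, and rule weights are ignored.

```python
import re


def apply_sub(text: str, sfrom: str, sto: str,
              precede: str = "", follow: str = "") -> str:
    """Replace `sfrom` with `sto` only where the context patterns match."""
    pattern = sfrom
    if precede:
        # Lookbehind: Python's `re` requires a fixed-width pattern here.
        pattern = f"(?<={precede})" + pattern
    if follow:
        # Lookahead: the following context is checked but not consumed.
        pattern = pattern + f"(?={follow})"
    return re.sub(pattern, sto, text)


only_before_front_vowel = apply_sub("cocina", "c", "s", follow="[ei]")
between_vowels = apply_sub("casa", "s", "z", precede="[aeiou]", follow="[aeiou]")
no_context = apply_sub("casa", "a", "o")
```

Because the context is matched with lookarounds, the surrounding characters stay in place and only the `sfrom` span is rewritten.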
- class xpfcorpus.engine.rules.LanguageData(code: str, name: str = '', family: str = '', macroarea: str = '', compromised: dict | bool | None = None, default_script: str | None = None, scripts: dict[str, ScriptData] = <factory>)[source]
Complete data for a language, including all scripts.
A language may have multiple scripts (e.g., tt-latin, tt-cyrillic). If there’s a default_script, that script is used when no script is explicitly specified.
- scripts: dict[str, ScriptData]
- get_script_data(script: str | None = None) → ScriptData[source]
Get the ScriptData for the specified script, or the default.
Raises ValueError if no script is specified and no default exists.
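The default-script lookup can be sketched with a small stand-in class. `ToyLanguage` is hypothetical and stores plain strings where the real class holds ScriptData objects. What happens for an unknown script name is not documented, so the `KeyError` in this sketch is only a side effect of the dictionary lookup.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyLanguage:
    """Hypothetical stand-in for LanguageData."""
    code: str
    default_script: Optional[str] = None
    scripts: dict = field(default_factory=dict)

    def get_script_data(self, script: Optional[str] = None):
        # An explicit script wins; otherwise fall back to the default.
        chosen = script or self.default_script
        if chosen is None:
            raise ValueError(
                f"{self.code}: no script specified and no default exists"
            )
        return self.scripts[chosen]


es = ToyLanguage("es", "latin", {"latin": "es-latin-data"})
tt = ToyLanguage("tt", None, {"latin": "tt-latin-data", "cyrillic": "tt-cyrillic-data"})

default_data = es.get_script_data()
explicit_data = tt.get_script_data("cyrillic")
try:
    tt.get_script_data()
    raised = False
except ValueError:
    raised = True
```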
- class xpfcorpus.engine.rules.ScriptData(rules: RuleSet, verify: list[VerifyEntry] = <factory>)[source]
Data for a single script of a language.
- verify: list[VerifyEntry]
- __init__(rules: RuleSet, verify: list[VerifyEntry] = <factory>) → None
I/O Layer
PackageRepository
- class xpfcorpus.io.repository.PackageRepository[source]
Access bundled language data from the package.
This uses importlib.resources to access JSON files bundled with the package.
- classmethod available_languages() → dict[str, dict][source]
Get a dictionary of available languages.
- Returns:
Dict mapping language codes to metadata, for example:
{
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": null},
    …
}
- Return type:
dict[str, dict]
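A dictionary of this shape is convenient for finding the languages that need an explicit script. The sketch below works on a hand-written sample rather than on the real index, and it assumes that the `null` shown above surfaces as Python `None`. `needs_explicit_script` is a hypothetical helper.

```python
# Sample shaped like the documented return value (None in place of JSON null).
languages = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}


def needs_explicit_script(index):
    """Return the codes whose metadata has no default script."""
    return sorted(code for code, meta in index.items() if meta["default"] is None)


explicit = needs_explicit_script(languages)
script_counts = {code: len(meta["scripts"]) for code, meta in languages.items()}
```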
- classmethod has_language(code: str) → bool[source]
Check if a language is available.
Checks both the index and the filesystem for language files. This allows variants (e.g., es-ES.json) to exist without being in index.json.
- classmethod get_scripts(code: str) → list[str][source]
Get available scripts for a language.
- Raises:
LanguageNotFoundError – If the language is not available.
- classmethod get_default_script(code: str) → str | None[source]
Get the default script for a language.
Returns None if no default is set.
- Raises:
LanguageNotFoundError – If the language is not available.
- classmethod load_language(code: str) → LanguageData[source]
Load language data from the bundled JSON files.
- Parameters:
code – Language code (e.g., “es”, “tt”).
- Returns:
LanguageData object.
- Raises:
LanguageNotFoundError – If the language is not available.
Loaders
- class xpfcorpus.io.json_loader.JSONLoader[source]
Load language data from JSON files.
- classmethod load(path: Path | str) → LanguageData[source]
Load language data from a JSON file.
- Parameters:
path – Path to the JSON file.
- Returns:
LanguageData object.
- classmethod load_string(content: str) → LanguageData[source]
Load language data from a JSON string.
- classmethod from_dict(data: dict[str, Any]) → LanguageData[source]
Convert a dictionary to LanguageData.
Expected structure:
{
    "metadata": {
        "code": "es",
        "name": "Spanish",
        "family": "…",
        "macroarea": "…",
        "compromised": false,
        "default_script": "latin"
    },
    "scripts": {
        "latin": {
            "verify": […],
            "rules": {…}
        }
    }
}
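Before handing a dictionary to a loader it can be useful to check that it follows this layout. The sketch below parses a JSON document of the documented shape and lists any missing keys. `missing_keys` is a hypothetical helper, and it treats every key shown above as required, which may be stricter than the loader itself. The contents of `verify` and `rules` are elided in the reference, so empty placeholders stand in for them.

```python
import json

document = json.loads("""
{
    "metadata": {
        "code": "es",
        "name": "Spanish",
        "compromised": false,
        "default_script": "latin"
    },
    "scripts": {
        "latin": {"verify": [], "rules": {}}
    }
}
""")


def missing_keys(data):
    """List the documented keys that are absent from `data`."""
    problems = [key for key in ("metadata", "scripts") if key not in data]
    for name, script in data.get("scripts", {}).items():
        for key in ("verify", "rules"):
            if key not in script:
                problems.append(f"scripts.{name}.{key}")
    return problems


complete = missing_keys(document)
incomplete = missing_keys({"scripts": {"latin": {"rules": {}}}})
```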
- class xpfcorpus.io.yaml_loader.YAMLLoader[source]
Load language data from YAML files.
Requires PyYAML to be installed. Install with:
    pip install xpfcorpus[yaml]
or:
    pip install pyyaml
- classmethod load(path: Path | str) → LanguageData[source]
Load language data from a YAML file.
- Parameters:
path – Path to the YAML file.
- Returns:
LanguageData object.
- Raises:
ImportError – If PyYAML is not installed.
- classmethod load_string(content: str) → LanguageData[source]
Load language data from a YAML string.
- class xpfcorpus.io.legacy_loader.LegacyLoader[source]
Load language data from .rules and .verify files.
- classmethod load_rules(path: Path | str) → RuleSet[source]
Load rules from a .rules file.
- Parameters:
path – Path to the .rules file.
- Returns:
RuleSet object.
Language Code Parsing
Language code parsing utilities for BCP-47 style codes.
- xpfcorpus.io.language_code.normalize_script(script: str) → str[source]
Normalize script name to a standard form.
Handles both ISO 15924 4-letter codes and common names.
- Parameters:
script – Script name or code (e.g., “Latn”, “latin”, “cyrillic”, “Cyrl”)
- Returns:
Normalized lowercase script name.
Examples
>>> normalize_script("Latn")
'latin'
>>> normalize_script("cyrillic")
'cyrillic'
>>> normalize_script("Syll")
'syllabics'
- xpfcorpus.io.language_code.parse_language_code(code: str, explicit_script: str | None = None) → Tuple[str, str | None, str | None][source]
Parse a language code with optional script and region components.
Supports BCP-47 style codes like:
- “es” → (“es”, None, None)
- “es-ES” → (“es”, None, “ES”) (region preserved)
- “yi-Latn” → (“yi”, “latin”, None) (script extracted)
- “tt-cyrillic” → (“tt”, “cyrillic”, None) (script extracted)
- “zh-Hans-CN” → (“zh”, “hans”, “CN”) (script extracted, region preserved)
If an explicit script is provided, it always takes precedence over any script extracted from the code.
- Parameters:
code – Language code, possibly with script/region subtags.
explicit_script – Optional explicit script that overrides extracted script.
- Returns:
Tuple of (language_code, script_or_none, region_or_none).
Examples
>>> parse_language_code("es")
('es', None, None)
>>> parse_language_code("es-ES")
('es', None, 'ES')
>>> parse_language_code("yi-Latn")
('yi', 'latin', None)
>>> parse_language_code("tt-cyrillic")
('tt', 'cyrillic', None)
>>> parse_language_code("zh-Hans-CN")
('zh', 'hans', 'CN')
>>> parse_language_code("yi-Latn", "hebrew")
('yi', 'hebrew', None)
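The documented results can be reproduced with a short stand-alone sketch, which may help when reasoning about edge cases. It is not the library's code. The rule that a two-letter subtag is a region and anything longer is a script is an assumption, the ISO 15924 table holds only the codes that appear in the examples, and normalizing the explicit script is also assumed. The `sketch_` names are hypothetical.

```python
from typing import Optional, Tuple

# Partial ISO 15924 table: only the codes used in the documented examples.
_ISO_15924 = {"latn": "latin", "cyrl": "cyrillic", "syll": "syllabics"}


def sketch_normalize_script(script: str) -> str:
    """Lowercase the name and expand known four-letter codes."""
    lowered = script.lower()
    return _ISO_15924.get(lowered, lowered)


def sketch_parse(
    code: str, explicit_script: Optional[str] = None
) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a code into (language, script, region)."""
    language, *subtags = code.split("-")
    script: Optional[str] = None
    region: Optional[str] = None
    for tag in subtags:
        # Assumption: two letters mean a region, anything longer a script.
        if len(tag) == 2 and tag.isalpha():
            region = tag.upper()
        else:
            script = sketch_normalize_script(tag)
    if explicit_script is not None:
        # The explicit script overrides whatever the code contained.
        script = sketch_normalize_script(explicit_script)
    return language, script, region


parsed = [sketch_parse(c) for c in ("es", "es-ES", "yi-Latn", "tt-cyrillic", "zh-Hans-CN")]
overridden = sketch_parse("yi-Latn", "hebrew")
```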