Quick Start
This guide walks you through the core features of graphemes++ in under 5 minutes.
1. Segment Text into Graphemes
The Graphemizer class is the primary entry point. It takes a string, normalizes it, and splits it into proper grapheme clusters.
from graphemes_plusplus import Graphemizer
# Tamil text
g = Graphemizer("ஸ்ரீ மதி")
print(g.graphemes) # ['ஸ்ரீ', ' ', 'ம', 'தி']
print(len(g)) # 4
The Graphemizer is also iterable, so you can loop over it directly to visit each grapheme in order.
Why not just list(text)? In Tamil and Sinhala, a single visual character (grapheme) can consist of multiple Unicode code points. For example, ஸ்ரீ is 4 code points but 1 grapheme. Python's list() would split it into 4 separate items, which is linguistically incorrect.
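The code-point count can be checked with the standard library alone. The sketch below uses only Python's built-in `unicodedata` module, not graphemes++, and spells ஸ்ரீ with escapes so the underlying sequence is unambiguous.

```python
import unicodedata

# ஸ்ரீ written as explicit code points: SA + VIRAMA + RA + VOWEL SIGN II
sri = "\u0BB8\u0BCD\u0BB0\u0BC0"

# list() splits on code points, not on graphemes
code_points = list(sri)
print(len(code_points))  # 4

# Show what each of the four items actually is
for ch in code_points:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

None of the four items is a complete visual character on its own, which is why a grapheme-aware segmenter is needed before counting or comparing.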
2. Compute Grapheme-Aware Distances
from graphemes_plusplus import levenshtein, hamming
# Levenshtein distance (edit distance)
print(levenshtein("ஸ்ரீ", "ஸ்ரி")) # 1
# Hamming distance (substitution-only, equal length required)
print(hamming("ஸ்ரீ", "ஸ்ரீ")) # 0
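To see why the unit of comparison matters, here is a minimal sketch of both distances written over generic sequences. It is not the library's implementation: `levenshtein_seq` and `hamming_seq` are hypothetical names, the grapheme lists are segmented by hand, and raising `ValueError` on unequal lengths is an assumption about how the "equal length required" rule might be enforced.

```python
from typing import Sequence


def levenshtein_seq(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of units (graphemes or code points)."""
    previous = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        current = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def hamming_seq(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


sri = "\u0BB8\u0BCD\u0BB0\u0BC0"  # ஸ்ரீ: one grapheme, four code points
ka = "\u0B95"                     # க: one grapheme, one code point

grapheme_distance = levenshtein_seq([sri], [ka])            # hand-segmented graphemes
code_point_distance = levenshtein_seq(list(sri), list(ka))  # raw code points

print(grapheme_distance)         # 1
print(code_point_distance)       # 4
print(hamming_seq([sri], [sri])) # 0
```

Replacing one visual character with another counts as a single edit at the grapheme level, but as four edits when the same strings are compared code point by code point.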
3. Decompose and Compose
Break graphemes into phonetic components and reconstruct them:
from graphemes_plusplus import decompose, compose
# Decompose into consonant + vowel components
decomposed = decompose("கா")
print(decomposed) # க் + ஆ components
# Compose back
original = compose(decomposed)
print(original) # கா
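The idea behind the round trip can be sketched for a single Tamil syllable. This is an illustration only: `decompose_syllable` and `compose_syllable` are hypothetical helpers, the tuple return shape is an assumption (the library's actual return type is not shown above), and the table covers just three vowel signs.

```python
from typing import Tuple

VIRAMA = "\u0BCD"  # Tamil pulli, marks a bare consonant

# Dependent vowel sign -> independent vowel (small illustrative subset)
SIGN_TO_VOWEL = {
    "\u0BBE": "\u0B86",  # ா -> ஆ
    "\u0BBF": "\u0B87",  # ி -> இ
    "\u0BC0": "\u0B88",  # ீ -> ஈ
}
VOWEL_TO_SIGN = {vowel: sign for sign, vowel in SIGN_TO_VOWEL.items()}


def decompose_syllable(syllable: str) -> Tuple[str, str]:
    """Split a consonant + vowel-sign syllable into (bare consonant, vowel)."""
    consonant, sign = syllable[0], syllable[1:]
    return consonant + VIRAMA, SIGN_TO_VOWEL[sign]


def compose_syllable(bare_consonant: str, vowel: str) -> str:
    """Inverse of decompose_syllable: drop the pulli and attach the vowel sign."""
    if not bare_consonant.endswith(VIRAMA):
        raise ValueError("expected a consonant followed by a pulli")
    return bare_consonant[:-1] + VOWEL_TO_SIGN[vowel]


kaa = "\u0B95\u0BBE"  # கா
parts = decompose_syllable(kaa)
print(parts)                              # the pair க் and ஆ
print(compose_syllable(*parts) == kaa)    # True
```

A full implementation would also need the inherent vowel, the remaining vowel signs, and conjuncts such as ஸ்ரீ, none of which this sketch handles.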
4. Normalize a File
Batch-normalize an entire text file, applying Unicode NFC normalization and script-specific fixups:
from graphemes_plusplus.utils import normalize_file
# Auto-generates output filename
output = normalize_file("input.txt")
print(output) # 'input_normalized.txt'
# Or specify a custom output path
output = normalize_file("input.txt", "clean_output.txt")
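For a sense of what the NFC step does, the sketch below reproduces only that part with the standard library. `nfc_normalize_file` is a hypothetical stand-in, not the library function: it assumes UTF-8 input, copies the `_normalized` naming shown above, and performs no script-specific fixups.

```python
import tempfile
import unicodedata
from pathlib import Path
from typing import Optional


def nfc_normalize_file(input_path: str, output_path: Optional[str] = None) -> str:
    """Write an NFC-normalized copy of a UTF-8 text file and return its path."""
    src = Path(input_path)
    if output_path is None:
        # input.txt -> input_normalized.txt, next to the source file
        dst = src.with_name(f"{src.stem}_normalized{src.suffix}")
    else:
        dst = Path(output_path)
    text = src.read_text(encoding="utf-8")
    dst.write_text(unicodedata.normalize("NFC", text), encoding="utf-8")
    return str(dst)


# Demo in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "input.txt"
    # கொ stored as three code points: KA + VOWEL SIGN E + VOWEL SIGN AA
    sample.write_text("\u0B95\u0BC6\u0BBE", encoding="utf-8")
    result_path = nfc_normalize_file(str(sample))
    result_name = Path(result_path).name
    result_text = Path(result_path).read_text(encoding="utf-8")

print(result_name)       # input_normalized.txt
print(len(result_text))  # 2
```

NFC merges the two-part vowel sign into the single code point U+0BCA, so visually identical strings compare equal after normalization.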
5. Evaluate with Metrics
Use grapheme-aware chrF and CER for NLP evaluation:
from graphemes_plusplus.metric import GraphemeCHRF, CER
# chrF score
chrf = GraphemeCHRF()
score = chrf.corpus_score(
    ["வணக்கம் உலகம்"],
    [["வணக்கம் உலகம்"]]
)
print(score) # 100.0
# Character Error Rate
cer = CER("predicted text", "reference text")
print(cer)
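Character Error Rate is conventionally the edit distance between prediction and reference divided by the reference length. The sketch below applies that definition to hand-segmented grapheme lists. `cer_sketch` and `edit_distance` are hypothetical names, and whether the library's `CER` reports a fraction or a percentage is not stated above, so the fraction here is an assumption.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, computed row by row."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        new_row = [i]
        for j, y in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (x != y)))
        row = new_row
    return row[-1]


def cer_sketch(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Edit distance divided by the reference length, both measured in graphemes."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(predicted, reference) / len(reference)


# வணக்கம் segmented by hand into five graphemes
reference = ["வ", "ண", "க்", "க", "ம்"]
# Same word with the fourth grapheme replaced by கா
predicted = reference[:3] + ["கா"] + reference[4:]

print(cer_sketch(predicted, reference))  # 0.2
print(cer_sketch(reference, reference))  # 0.0
```

One wrong grapheme out of five gives 0.2, whereas counting code points would divide by a larger reference length and report a different rate for the same visible error.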
Next Steps
- Explore the full API Reference for detailed documentation
- Read the Tamil or Sinhala specific guides
- Learn about Evaluation Metrics for NLP tasks