Graphemizer¶
The Graphemizer class is the core entry point of graphemes++. It takes raw text, applies normalization, and segments it into linguistically correct grapheme clusters.
Class Definition¶
Module: graphemes_plusplus.graphemizer
Parameters¶
| Parameter | Type | Description |
|---|---|---|
| string | str | The input text to segment into graphemes |
Properties¶
| Property | Type | Description |
|---|---|---|
| graphemes | list[str] | The list of segmented grapheme clusters |
| raw_string | str | The original input string (before normalization) |
Methods¶
__len__() → int¶
Returns the number of grapheme clusters.
__iter__()¶
Makes the Graphemizer instance iterable over its grapheme clusters.
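The two methods above follow Python's standard sized and iterable protocols. The following is a minimal sketch of how such a class might expose them. `GraphemeSequenceSketch` is a hypothetical stand-in, not the library's implementation, and it takes already-segmented clusters as an argument instead of computing them.

```python
class GraphemeSequenceSketch:
    """Hypothetical stand-in for illustration; not the library's Graphemizer."""

    def __init__(self, string: str, graphemes: list[str]) -> None:
        self.raw_string = string    # original input, kept unchanged
        self.graphemes = graphemes  # clusters supplied by the caller in this sketch

    def __len__(self) -> int:
        # Count grapheme clusters, not code points
        return len(self.graphemes)

    def __iter__(self):
        # Delegate iteration to the underlying list of clusters
        return iter(self.graphemes)


# Tamil "க்க": three code points (KA, VIRAMA, KA) forming two clusters
seq = GraphemeSequenceSketch("\u0b95\u0bcd\u0b95", ["\u0b95\u0bcd", "\u0b95"])
code_points = len(seq.raw_string)  # length in code points
clusters = len(seq)                # length in grapheme clusters
as_list = list(seq)                # iteration yields one cluster at a time
```

Counting clusters instead of code points is what makes `len()` agree with what a reader perceives as characters in scripts that use combining signs.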
Processing Pipeline¶
The Graphemizer follows a two-stage pipeline:
Raw Text → Normalizer → GraphemeSplitter → List of Graphemes
- Normalize — Unicode NFC normalization + Tamil/Sinhala-specific character fixups
- Split — Extended grapheme clustering that handles conjuncts and ZWJ sequences
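The two stages can be approximated with the standard library alone. This is a simplified sketch under stated assumptions, not the library's Normalizer or GraphemeSplitter: `normalize`, `split_clusters`, and `graphemize` are hypothetical names, the Tamil/Sinhala-specific fixups are omitted, and the splitter only attaches combining marks and ZWJ-joined characters to the preceding base instead of implementing full extended grapheme clustering.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def normalize(text: str) -> str:
    # Stage 1 (simplified): NFC only; script-specific fixups are not modeled
    return unicodedata.normalize("NFC", text)


def split_clusters(text: str) -> list[str]:
    # Stage 2 (simplified): attach combining marks, ZWJ, and the character
    # that follows a ZWJ to the current cluster
    clusters: list[str] = []
    join_next = False
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (join_next or is_mark or ch == ZWJ):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


def graphemize(text: str) -> list[str]:
    # Run both stages in order: normalize, then split
    return split_clusters(normalize(text))


mixed = graphemize("Hello \u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae\u0bcd!")  # "Hello வணக்கம்!"
composed = graphemize("e\u0301")  # decomposed "é" is composed by NFC first
conjunct = graphemize("\u0dc1\u0dca\u200d\u0dbb\u0dd3")  # Sinhala ශ්‍රී (uses ZWJ)
```

Under these simplified rules the mixed-script string splits into 12 clusters, and the Sinhala ZWJ sequence stays together as a single cluster.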
Examples¶
Tamil Segmentation¶
Sinhala Segmentation¶
Mixed Script¶
>>> from graphemes_plusplus.graphemizer import Graphemizer
>>> g = Graphemizer("Hello வணக்கம்!")
>>> g.graphemes
['H', 'e', 'l', 'l', 'o', ' ', 'வ', 'ண', 'க்', 'க', 'ம்', '!']
Normalization is automatic
The Graphemizer automatically normalizes input text before segmentation. You don't need to pre-process your text unless you want explicit control over the normalization step.
See Also¶
- Normalizer — The normalization component used internally
- GraphemeSplitter — The splitting component used internally
- Distance Functions — Use graphemes for distance computation