vmenan/graphemes_plusplus

Graphemizer

The Graphemizer class is the core entry point of graphemes++. It takes raw text, applies normalization, and segments it into linguistically correct grapheme clusters.

Class Definition

class Graphemizer(string: str)

Module: graphemes_plusplus.graphemizer

Parameters

Parameter	Type	Description
`string`	`str`	The input text to segment into graphemes

Properties

Property	Type	Description
`graphemes`	`list[str]`	The list of segmented grapheme clusters
`raw_string`	`str`	The original input string (before normalization)

Methods

`len() → int`

Returns the number of grapheme clusters.

>>> g = Graphemizer("ஸ்ரீ மதி")
>>> len(g)
4
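
The count of 4 differs from what the built-in `len()` reports on the plain string, because a Python `str` is measured in code points. A minimal sketch using only the standard library (no graphemes++ calls; the escapes spell out the same text as in the example above):

```python
import unicodedata

# "ஸ்ரீ மதி" written as explicit code points so the count is unambiguous.
text = "\u0bb8\u0bcd\u0bb0\u0bc0 \u0bae\u0ba4\u0bbf"

code_points = len(text)  # a plain str counts code points, not clusters

# Combining marks (general category M*) attach to a preceding base character,
# which is one reason the cluster count is lower than the code point count.
marks = [ch for ch in text if unicodedata.category(ch).startswith("M")]

print(code_points)  # 8
print(len(marks))   # 3
```

Here 8 code points collapse into 4 clusters: the three marks join their base characters, and the conjunct ஸ்ரீ additionally merges two consonants into a single cluster.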

`iter()`

Makes the Graphemizer instance iterable over its grapheme clusters.

>>> g = Graphemizer("ஸ்ரீ மதி")
>>> for grapheme in g:
...     print(grapheme)
ஸ்ரீ

ம
தி
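
For readers unfamiliar with how an object supports both `len()` and `for` loops, the sketch below shows the two Python protocol methods involved. `ClusterSequence` is a hypothetical stand-in that wraps an already-segmented list; it is not the library's implementation and performs no segmentation itself.

```python
from typing import Iterator


class ClusterSequence:
    """Hypothetical stand-in: wraps an already-segmented list of clusters."""

    def __init__(self, clusters: list[str]) -> None:
        self.graphemes = list(clusters)

    def __len__(self) -> int:
        # Called by the built-in len().
        return len(self.graphemes)

    def __iter__(self) -> Iterator[str]:
        # Called by for-loops, list(), and other iteration contexts.
        return iter(self.graphemes)


seq = ClusterSequence(["a", "b", "c"])
print(len(seq))   # 3
print(list(seq))  # ['a', 'b', 'c']
```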

Processing Pipeline

The Graphemizer follows a two-stage pipeline:

graph LR
    A[Raw Text] --> B[Normalizer]
    B --> C[GraphemeSplitter]
    C --> D[List of Graphemes]

    style A fill:#7c4dff,color:#fff
    style D fill:#00bfa5,color:#fff

1. Normalize: Unicode NFC normalization plus Tamil/Sinhala-specific character fixups
2. Split: Extended grapheme clustering that handles conjuncts and ZWJ sequences
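
The two stages can be approximated with the standard library alone. This is a simplified sketch, not the library's code: `normalize`, `split_clusters`, and `graphemize` are hypothetical names, the normalizer applies NFC only (no script-specific fixups), and the splitter merely attaches combining marks to the preceding base character, so it does not merge conjuncts or ZWJ sequences.

```python
import unicodedata


def normalize(text: str) -> str:
    # Stage 1 (sketch): NFC only; the script-specific fixups are omitted.
    return unicodedata.normalize("NFC", text)


def split_clusters(text: str) -> list[str]:
    # Stage 2 (sketch): start a new cluster at every non-mark character and
    # append combining marks (general category M*) to the current cluster.
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def graphemize(text: str) -> list[str]:
    # Normalize first, then split, mirroring the pipeline order.
    return split_clusters(normalize(text))


# "கொண்டுவந்து" as explicit code points.
word = "\u0b95\u0bca\u0ba3\u0bcd\u0b9f\u0bc1\u0bb5\u0ba8\u0bcd\u0ba4\u0bc1"
print(len(graphemize(word)))  # 6
```

For this particular word the mark-attachment rule alone happens to yield six clusters; a word containing a conjunct such as ஸ்ரீ would be split into two clusters by this sketch, which is where a fuller splitter is needed.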

Examples

Tamil Segmentation

>>> g = Graphemizer("கொண்டுவந்து")
>>> g.graphemes
['கொ', 'ண்', 'டு', 'வ', 'ந்', 'து']

Sinhala Segmentation

>>> g = Graphemizer("ක්‍රිකට්")
>>> g.graphemes
['ක්‍රි', 'ක', 'ට්']
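
The first cluster stays in one piece because of the zero width joiner (U+200D). The sketch below illustrates one simple way to keep such a sequence together, assuming the input carries a ZWJ between the al-lakuna and the following consonant, as is usual for this conjunct. `split_with_zwj` is a hypothetical helper written for illustration, not the library's splitter, and its rule is deliberately naive.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def split_with_zwj(text: str) -> list[str]:
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # Extend the current cluster for combining marks, for the ZWJ itself,
        # and for the character that immediately follows a ZWJ.
        if clusters and (is_mark or ch == ZWJ or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "ක්‍රිකට්" as explicit code points:
# ka, al-lakuna, ZWJ, ra, vowel sign i, ka, tta, al-lakuna
word = "\u0d9a\u0dca\u200d\u0dbb\u0dd2\u0d9a\u0da7\u0dca"
print(len(split_with_zwj(word)))  # 3
```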

Mixed Script

>>> g = Graphemizer("Hello வணக்கம்!")
>>> g.graphemes
['H', 'e', 'l', 'l', 'o', ' ', 'வ', 'ண', 'க்', 'க', 'ம்', '!']

Note: Normalization is automatic. The Graphemizer normalizes input text before segmentation, so you don't need to pre-process your text unless you want explicit control over the normalization step.
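
To see why normalization matters before segmentation, the NFC part can be reproduced with the standard library's `unicodedata`. This sketch covers NFC only, not the library's Tamil/Sinhala-specific fixups, and the variable names are illustrative.

```python
import unicodedata

# Two encodings of the Tamil syllable "கொ" that render identically.
composed = "\u0b95\u0bca"          # KA + VOWEL SIGN O
decomposed = "\u0b95\u0bc6\u0bbe"  # KA + VOWEL SIGN E + VOWEL SIGN AA

normalized = unicodedata.normalize("NFC", decomposed)

print(composed == decomposed)  # False: different code point sequences
print(composed == normalized)  # True: NFC composes E + AA into O
```

Without this step, the same visible text could produce different code point sequences depending on how it was typed or stored.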

See Also

Normalizer — The normalization component used internally
GraphemeSplitter — The splitting component used internally
Distance Functions — Use graphemes for distance computation