LingPy

This documentation is for version 2.0.dev, which is not released yet.

lingpy.sequence.orthography.GraphemeParser

class lingpy.sequence.orthography.GraphemeParser

Class for Unicode graphemic parsing of Unicode strings.

Notes

This class handles only Unicode grapheme parsing: it applies the Unicode standard grapheme parsing rules, as implemented in the Python regex package by Matthew Barnett, using the “\X” grapheme regular expression match. This match combines a base character with any Combining Diacritical Marks that follow it. The resulting units are called “grapheme clusters” in Unicode parlance.
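To illustrate what grouping combining marks with their base character means, here is a minimal sketch that uses only the standard library. It is a simplified approximation, not the “\X” match itself: it only attaches characters in the Unicode mark categories (Mn, Mc, Me) to the preceding character and ignores the other rules for grapheme clusters. The function name parse_graphemes_sketch is hypothetical and is not part of LingPy.

```python
import unicodedata


def parse_graphemes_sketch(string):
    """Return a space-delimited string of approximate grapheme clusters.

    Hypothetical helper for illustration only; it approximates the
    grapheme match by attaching combining marks to the preceding character.
    """
    clusters = []
    for char in string:
        # General categories Mn, Mc and Me are the Unicode combining marks.
        is_mark = unicodedata.category(char).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the cluster that is currently open.
            clusters[-1] += char
        else:
            # Any other character starts a new cluster.
            clusters.append(char)
    return " ".join(clusters)


# "a" + COMBINING ACUTE ACCENT stays together; "b" becomes its own segment.
print(parse_graphemes_sketch("a\u0301b"))
```

A mark at the very start of the string has no base character to attach to, so this sketch keeps it as its own segment.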

This class is meant for rudimentary parsing, for example to derive a unigram model (segments and their counts) from an input data source.
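A unigram model in this sense can be sketched as a simple count of segments over space-delimited parses. The example below assumes the parses are already available as strings with one space between segments; the function name unigram_counts is hypothetical and is not part of LingPy.

```python
from collections import Counter


def unigram_counts(parsed_lines):
    """Count segments in an iterable of space-delimited parses.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in parsed_lines:
        # Each whitespace-separated token is treated as one segment.
        counts.update(line.split())
    return counts


counts = unigram_counts(["a b a", "b c"])
for segment, count in sorted(counts.items()):
    print(segment, count)
```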

For more elaborate orthographic parsing, see the OrthographyParser and the OrthographyRulesParser.

An additional method (in its infancy) called combine_modifiers handles Unicode Spacing Modifier Letters, which the Unicode Standard does not explicitly combine with their base character. Such groupings are called “tailored grapheme clusters” in Unicode. For more information, see Unicode Standard Annex #29: Unicode Text Segmentation:

http://www.unicode.org/reports/tr29/
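The idea of grouping modifier letters with a preceding base character can be sketched as follows. This is an illustration under stated assumptions, not LingPy's implementation: it assumes that a segment is a modifier when its first character lies in the Spacing Modifier Letters block (U+02B0 to U+02FF), and that every such segment attaches to the segment before it. The function name combine_modifiers_sketch is hypothetical, and the real combine_modifiers method may apply different rules.

```python
def combine_modifiers_sketch(string):
    """Attach modifier-letter segments to the preceding segment.

    Hypothetical helper for illustration only. Input and output are
    space-delimited strings of graphemes.
    """
    result = []
    for grapheme in string.split():
        # Assumption: a segment is a modifier when its first character is
        # in the Spacing Modifier Letters block (U+02B0..U+02FF).
        is_modifier = "\u02b0" <= grapheme[0] <= "\u02ff"
        if is_modifier and result:
            # Join the modifier onto the previous segment.
            result[-1] += grapheme
        else:
            # Base segments, and a modifier with nothing before it,
            # are kept as separate segments.
            result.append(grapheme)
    return " ".join(result)


# "t" followed by MODIFIER LETTER SMALL H (U+02B0) becomes one segment.
print(combine_modifiers_sketch("t \u02b0 a"))
```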

Methods

combine_modifiers(string) Given a string that is space-delimited on Unicode graphemes, group Unicode modifier letters with their preceding base characters.
parse_characters(string) Given a string as input, return a space-delimited string of Unicode characters.
parse_graphemes(string) Given a string as input, return a space-delimited string of Unicode graphemes using the “\X” regular expression.
parse_string_to_graphemes(string) Deprecated function to parse graphemes and return a tuple of (T/F success, tupled parse).
parse_string_to_graphemes_string(string) Deprecated function to parse graphemes and return a tuple of (T/F success, parse).
