tokenize

fun tokenize(input: String, compact: Boolean = false): List<String>

Splits input into a list of strings, with token boundaries determined by the library's opinionated TokenTypes.

Return

the text split into String tokens.

For example:

  • tokenize("ふふフフ") => ["ふふ", "フフ"]

  • tokenize("感じ") => ["感", "じ"]

  • tokenize("truly 私は悲しい") => ["truly", " ", "私", "は", "悲", "しい"]

  • tokenize("truly 私は悲しい", compact = true) => ["truly ", "私は悲しい"]

  • tokenize("5romaji here...!?漢字ひらがなカタ カナ4「SHIO」。!") => ["5", "romaji", " ", "here", "...!?", "漢字", "ひらがな", "カタ", " ", "カナ", "4", "「", "SHIO", "」。!"]

  • tokenize("5romaji here...!?漢字ひらがなカタ カナ4「SHIO」。!", compact = true) => ["5", "romaji here", "...!?", "漢字ひらがなカタ カナ", "4「", "SHIO", "」。!"]
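
The non-compact examples above are consistent with starting a new token whenever the character's script changes. The following is a minimal sketch of that idea, not the library's implementation: Script, scriptOf, and sketchTokenize are hypothetical names, and the Unicode ranges are a simplifying assumption that may not match the library's actual TokenTypes.

```kotlin
// Hypothetical script categories; the library's own TokenTypes may differ.
enum class Script { HIRAGANA, KATAKANA, KANJI, LATIN, DIGIT, SPACE, OTHER }

// Classify a single character by Unicode range (a simplifying assumption).
fun scriptOf(c: Char): Script = when {
    c in '\u3040'..'\u309F' -> Script.HIRAGANA
    c in '\u30A0'..'\u30FF' -> Script.KATAKANA
    c in '\u4E00'..'\u9FFF' -> Script.KANJI
    c in 'a'..'z' || c in 'A'..'Z' -> Script.LATIN
    c.isDigit() -> Script.DIGIT
    c.isWhitespace() -> Script.SPACE
    else -> Script.OTHER
}

// Group consecutive characters that share a script into one token.
fun sketchTokenize(input: String): List<String> {
    val tokens = mutableListOf<String>()
    val current = StringBuilder()
    var previous: Script? = null
    for (c in input) {
        val script = scriptOf(c)
        // A change of script closes the token built so far.
        if (previous != null && script != previous) {
            tokens += current.toString()
            current.clear()
        }
        current.append(c)
        previous = script
    }
    if (current.isNotEmpty()) tokens += current.toString()
    return tokens
}

fun main() {
    println(sketchTokenize("ふふフフ"))
    println(sketchTokenize("感じ"))
    println(sketchTokenize("truly 私は悲しい"))
}
```

Under these assumptions, the three calls in main split their inputs at the same boundaries as the first three examples above.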

Parameters

input

the text to tokenize.

compact

if true, adjacent same-language tokens are combined where possible (spaces with text, kanji with kana, numerals with punctuation). Defaults to false.
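
One way to picture the compact behaviour is to classify characters into coarser groups before splitting. The sketch below is an illustration under stated assumptions, not the library's implementation: CoarseGroup, coarseGroupOf, and sketchCompactTokenize are hypothetical names, and the rule that whitespace joins the preceding group is inferred from the documented examples.

```kotlin
// Hypothetical coarse groups for compact mode; not the library's actual TokenTypes.
enum class CoarseGroup { EN, JA, OTHER }

fun coarseGroupOf(c: Char, previous: CoarseGroup?): CoarseGroup = when {
    // Assumption: whitespace joins whatever group came before it.
    c.isWhitespace() -> previous ?: CoarseGroup.OTHER
    c in 'a'..'z' || c in 'A'..'Z' -> CoarseGroup.EN
    // Hiragana, katakana, and common kanji share one group.
    c in '\u3040'..'\u30FF' || c in '\u4E00'..'\u9FFF' -> CoarseGroup.JA
    // Numerals and punctuation fall through to the same group.
    else -> CoarseGroup.OTHER
}

// Group consecutive characters that share a coarse group into one token.
fun sketchCompactTokenize(input: String): List<String> {
    val tokens = mutableListOf<String>()
    val current = StringBuilder()
    var previous: CoarseGroup? = null
    for (c in input) {
        val group = coarseGroupOf(c, previous)
        if (previous != null && group != previous) {
            tokens += current.toString()
            current.clear()
        }
        current.append(c)
        previous = group
    }
    if (current.isNotEmpty()) tokens += current.toString()
    return tokens
}

fun main() {
    println(sketchCompactTokenize("truly 私は悲しい"))
}
```

With these groups, the trailing space stays attached to "truly" and the kanji and kana merge into a single token, which is the shape of the compact = true example for the same input.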