tokenize
Splits input into a list of strings separated by opinionated TokenTypes.
Return
the text split into String tokens.
For example:
tokenize("ふふフフ")
=>["ふふ", "フフ"]
tokenize("感じ")
=>["感", "じ"]
tokenize("truly 私は悲しい")
=>["truly", " ", "私", "は", "悲", "しい"]
tokenize("truly 私は悲しい", compact = true)
=>["truly ", "私は悲しい"]
tokenize("5romaji here...!?漢字ひらがなカタ カナ4「SHIO」。!")
=>[ "5", "romaji", " ", "here", "...!?", "漢字", "ひらがな", "カタ", " ", "カナ", "4", "「", "SHIO", "」。!"]
tokenize("5romaji here...!?漢字ひらがなカタ カナ4「SHIO」。!", compact = true)
=>[ "5", "romaji here", "...!?", "漢字ひらがなカタ カナ", "4「", "SHIO", "」。!"]
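To illustrate the splitting rule behind these examples, here is a minimal sketch, not the library's actual implementation: classify each character by script, then group consecutive characters of the same type into one token. The `charType` and `simpleTokenize` helpers are hypothetical names introduced for this sketch, and it covers only a few of the token types (the real tokenizer also distinguishes numerals, punctuation, and more):

```javascript
// Classify a single character by Unicode script block (simplified).
function charType(ch) {
  const code = ch.codePointAt(0);
  if (code >= 0x3040 && code <= 0x309f) return "hiragana";
  if (code >= 0x30a0 && code <= 0x30ff) return "katakana";
  if (code >= 0x4e00 && code <= 0x9fff) return "kanji";
  if (/\s/.test(ch)) return "space";
  if (/[a-zA-Z]/.test(ch)) return "en";
  return "other";
}

// Group consecutive characters of the same type into string tokens.
function simpleTokenize(input) {
  const tokens = [];
  for (const ch of input) {
    const type = charType(ch);
    const last = tokens[tokens.length - 1];
    if (last && last.type === type) {
      last.value += ch; // extend the current same-type run
    } else {
      tokens.push({ type, value: ch }); // start a new token
    }
  }
  return tokens.map((t) => t.value);
}

console.log(simpleTokenize("ふふフフ"));         // ["ふふ", "フフ"]
console.log(simpleTokenize("感じ"));             // ["感", "じ"]
console.log(simpleTokenize("truly 私は悲しい")); // ["truly", " ", "私", "は", "悲", "しい"]
```

Note that this grouping alone reproduces the uncompacted behavior; compact mode would add a second pass that merges adjacent tokens of related types (spaces + text, kanji + kana, numeral + punctuation).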
Parameters
input
the text to tokenize.
compact
if true, then many same-language tokens are combined (spaces + text, kanji + kana, numeral + punctuation). Defaults to false.