Thai Web Corpus
THAI TOKENIZER
Paste text below or upload a text file (<2MB). Powered by PyThaiNLP.
keep whitespaces
advanced options
whitespace
  shrink (multiple spaces -> single space)
  shrink also ๆ ("จริง ๆ" -> "จริงๆ")
  keep original length
punctuation
  remove () [] " ' , ; :
  keep original
engine
  newmm (dictionary-based, faster)
  attacut (neural network, slower)
delimiter
  vbar |
  comma ,
  semicolon ;
  tab
  space
custom dictionary
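
The whitespace, punctuation, and delimiter options listed above could be approximated offline with a few lines of Python. This is only a sketch, not the site's actual implementation: the function names `preprocess` and `join_tokens`, their parameters, and the `DELIMITERS` table are hypothetical, and the token list is written by hand to stand in for tokenizer output. In practice the tokens would presumably come from a call such as PyThaiNLP's `word_tokenize(text, engine="newmm")`.

```python
import re

# Hypothetical mapping from the delimiter option names to characters.
DELIMITERS = {"vbar": "|", "comma": ",", "semicolon": ";", "tab": "\t", "space": " "}

# Characters listed under the "punctuation: remove" option.
PUNCTUATION = "()[]\"',;:"


def preprocess(text, shrink=True, shrink_mai_yamok=False, remove_punct=False):
    """Clean raw text before tokenization (assumed order of operations)."""
    if remove_punct:
        # Delete every character listed in PUNCTUATION.
        text = text.translate(str.maketrans("", "", PUNCTUATION))
    if shrink:
        # Collapse runs of two or more spaces into a single space.
        text = re.sub(r" {2,}", " ", text)
    if shrink_mai_yamok:
        # Attach the repetition mark ๆ to the preceding word.
        text = re.sub(r" +ๆ", "ๆ", text)
    return text


def join_tokens(tokens, delimiter="vbar", keep_whitespace=False):
    """Join a token list with the chosen delimiter."""
    if not keep_whitespace:
        # Drop tokens that consist only of whitespace.
        tokens = [t for t in tokens if not t.isspace()]
    return DELIMITERS[delimiter].join(tokens)


if __name__ == "__main__":
    print(preprocess("จริง  ๆ", shrink_mai_yamok=True))
    # Hand-written tokens standing in for tokenizer output.
    print(join_tokens(["กิน", " ", "ข้าว"], delimiter="vbar"))
```

Keeping preprocessing separate from joining means either step can be swapped out without touching the tokenizer call itself.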