Twitter
- 2013-01 ~ 2019-12
- randomly sampled 485,000 tweets from 30,000,000 tweets, with reply mentions (@xxx) removed
- tokens 6,140,618
- word types 129,507
Thairath
- 2009-04-08 ~ 2019-06-25
- randomly sampled 25,000 articles from 842,351 articles (limited to paragraphs whose length is between 500 and 2,000 characters)
- tokens 3,534,308
- word types 60,433
Matichon
- 2016-09-27 ~ 2018-09-08
- randomly sampled 15,731 articles from 143,365 articles (limited to paragraphs whose length is between 500 and 2,000 characters)
- tokens 1,883,713
- word types 43,020
Daily News
- 2009-06-01 ~ 2020-05-18
- randomly sampled 25,000 articles from 669,757 articles (limited to paragraphs whose length is between 500 and 2,000 characters)
- tokens 3,403,007
- word types 54,553
NHK Thai
- 2019-01-03 ~ 2020-11-25
- used all 4,640 articles without extracting paragraphs; the average article length is 903 characters
- tokens 934,674
- word types 16,565
Pantip
- 2013-01-01 ~ 2020-07-17
- randomly sampled 60,000 articles from 110,321 articles (limited to paragraphs whose length is between 100 and 2,000 characters)
- tokens 7,314,588
- word types 90,876
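
The preparation steps listed above (removing reply mentions, filtering paragraphs by length, random sampling, counting tokens and word types) might look roughly like the following sketch. It is not the script used to build these corpora. All function names are hypothetical, the length bounds are assumed to be inclusive, the random seed is arbitrary, and the default whitespace tokenizer is only a placeholder: Thai text has no spaces between words, so a real run would need a Thai word segmenter passed in as `tokenize`.

```python
import random
import re

# Twitter handles consist of ASCII letters, digits, and underscores.
MENTION = re.compile(r"@[A-Za-z0-9_]+\s*")


def strip_mentions(text):
    """Remove reply mentions such as @xxx from a tweet (hypothetical helper)."""
    return MENTION.sub("", text).strip()


def filter_by_length(paragraphs, min_chars, max_chars):
    """Keep paragraphs whose character length lies in [min_chars, max_chars].

    Inclusive bounds are an assumption; the source only says "from ... to ...".
    """
    return [p for p in paragraphs if min_chars <= len(p) <= max_chars]


def sample_items(items, k, seed=0):
    """Draw up to k items without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def count_tokens_and_types(texts, tokenize=str.split):
    """Return (token count, word-type count) over all texts.

    `tokenize` defaults to whitespace splitting as a placeholder only;
    substitute a Thai word segmenter for real data.
    """
    n_tokens = 0
    types = set()
    for text in texts:
        words = tokenize(text)
        n_tokens += len(words)
        types.update(words)
    return n_tokens, len(types)
```

For a news source, the order of operations would presumably be `filter_by_length`, then `sample_items`, then `count_tokens_and_types`; for Twitter, `strip_mentions` would be applied to each tweet before counting.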