José Luis Saorín Ferrer
Tokenization & Coherence Study
Experimental framework for studying how tokenization granularity affects discourse coherence. Compares T5-base (BPE), ByT5-base (byte-level UTF-8), and mT5-base (Unigram) across 6 tasks and ~3,285 stimuli.
https://github.com/joseluissaorin/tokenization-coherence-study
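To make "tokenization granularity" concrete, here is a minimal sketch of how a byte-level scheme splits text compared with a whitespace word count. It is not taken from the repository: the function names `byte_level_ids` and `granularity` are hypothetical, and the offset of 3 assumes a ByT5-style vocabulary that reserves the first three IDs for special tokens.

```python
def byte_level_ids(text: str, offset: int = 3) -> list[int]:
    """Return one ID per UTF-8 byte, shifted by `offset`.

    Assumption: the first `offset` IDs are reserved for special tokens,
    as in a ByT5-style vocabulary. No end-of-sequence ID is appended here.
    """
    return [byte + offset for byte in text.encode("utf-8")]


def granularity(text: str) -> dict[str, float]:
    """Compare sequence length at byte level with a whitespace word count."""
    n_words = len(text.split())
    n_bytes = len(text.encode("utf-8"))
    return {
        "words": n_words,
        "bytes": n_bytes,
        # Guard against empty input so the ratio is always defined.
        "bytes_per_word": n_bytes / max(n_words, 1),
    }


if __name__ == "__main__":
    sample = "José Luis"
    print(byte_level_ids(sample))
    print(granularity(sample))
```

Non-ASCII characters such as "é" occupy two bytes in UTF-8, so byte-level sequences grow faster than subword sequences on accented or non-Latin text. Subword counts for T5-base and mT5-base would need the actual tokenizers and are left out of this sketch.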