José Luis Saorín Ferrer
Tokenization and textuality
Essay and empirical study that reframes tokenization not as neutral preprocessing but as the model’s perceptual apparatus: the threshold that decides which units of meaning exist atomically and which must be reconstructed through costly operations. The theoretical scaffolding comes from Beaugrande & Dressler, Halliday, and Langacker; the empirical validation covers T5, ByT5, and mT5 at reduced scale.
https://github.com/joseluissaorin/tokenization-coherence-study