The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
(Source: "The mind is flat" by Nick Chater)