One popular subword tokenisation algorithm that follows the above approach is Byte Pair Encoding (BPE).
- BPE was originally a data-compression technique: it repeatedly replaces the most frequent pair of adjacent bytes with a single new symbol. The same idea carries over to NLP, where it is used to find an efficient way of representing text.
The main goal of the BPE subword algorithm is to represent your entire text dataset with as few tokens as possible. As with a compression algorithm, you want the representation of your data, whether an image, text, or anything else you are encoding, that uses the least data, or in our case the fewest tokens. In BPE, merging is how we "compress" the text into subword units.
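The merge loop can be sketched in a few lines of Python. This is a toy illustration, not a production tokeniser: the corpus, the `</w>` end-of-word marker, and the fixed three merges are assumptions chosen to keep the example small. Each word starts as a sequence of characters, and on every iteration the most frequent adjacent pair of symbols is merged into one new symbol.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker (</w>).
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(3):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)
# Prints the learned merges in order:
# ('e', 's'), ('es', 't'), ('est', '</w>')
```

After three merges the corpus already contains the subword `est</w>`, which covers the common suffix of "newest" and "widest". Running more iterations keeps growing longer, more frequent subwords, which is exactly the "compression" described above.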
https://blog.floydhub.com/tokenization-nlp/#bpe
https://leimao.github.io/blog/Byte-Pair-Encoding/