THEORY
Tokenization is a fundamental pre-processing step for most natural language processing (NLP) applications. It involves splitting text into smaller units called tokens (e.g., words or word segments) in order to turn an unstructured input string into a sequence of discrete elements that is suitable for a machine learning (ML) model.
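A minimal sketch of this idea, using a naive regex-based word splitter (the function name and regex are illustrative, not from any particular library):

```python
import re

def whitespace_tokenize(text):
    # Lowercase, then split on runs of non-alphanumeric characters,
    # dropping the empty strings re.split leaves at punctuation.
    return [t for t in re.split(r"\W+", text.lower()) if t]

tokens = whitespace_tokenize("Tokenization is a fundamental step!")
print(tokens)  # ['tokenization', 'is', 'a', 'fundamental', 'step']
```

Real tokenizers are far more careful (contractions, punctuation, Unicode), but the output shape is the same: a string in, a sequence of discrete tokens out.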
https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html
A fundamental tokenization approach is to break text into words. However, using this approach, words that are not included in the vocabulary are treated as “unknown”. Modern NLP models address this issue by tokenizing text into subword units, which often retain linguistic meaning (e.g., morphemes).
A morpheme is the smallest meaningful lexical item in a language. The field of linguistic study dedicated to morphemes is called morphology.
So, even though a word may be unknown to the model, individual subword tokens may retain enough information for the model to infer the meaning to some extent.
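A sketch of how subword tokenization handles this, using greedy longest-match-first segmentation loosely following the WordPiece convention (the tiny vocabulary and the `##` continuation marker here are illustrative assumptions):

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # of the remaining characters that is in the vocabulary.
    # Word-internal pieces are marked with a '##' prefix.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no valid segmentation at all
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##ization", "un", "##known"}
print(subword_tokenize("tokenization", vocab))  # ['token', '##ization']
print(subword_tokenize("unknown", vocab))       # ['un', '##known']
```

Even if "tokenization" as a whole was never seen during vocabulary construction, the pieces `token` and `##ization` carry recoverable meaning.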
https://blog.floydhub.com/tokenization-nlp/
Remember, these models have **no knowledge of language**. They cannot learn from text if they understand nothing about its structure; the raw input would just appear as gibberish, and the model would learn nothing.
- It won’t understand where one word starts and another ends.
- It won’t even know what constitutes a word.
Humans get around this by first learning to understand spoken language and then learning to relate speech to written text. A model has no such grounding, so we need to do **two** things to feed our training data of text into a DL model:
- Split the input into smaller chunks: The model doesn’t know anything about the structure of language so we need to break it into chunks, or tokens, before feeding it into the model.
- Represent the input as a vector: We want the model to learn the relationships between the words in a sentence or sequence of text. We do not want to hand-encode grammatical rules into the model, as they would be restrictive and require specialist linguistic knowledge. Instead, we want the model to discover those relationships itself. To do this we encode the tokens as vectors, so the model can store meaning in any of the dimensions of these vectors. The resulting vectors can be used as outputs, since they represent the contextual reference for the words, or fed as inputs into higher-level NLP tasks such as text classification, or reused for transfer learning.
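The two steps above can be sketched together: map each token to a vector via an embedding table. Here the vectors are just randomly initialised (a stand-in for what a real model would learn during training), and the `[UNK]` fallback token is an illustrative assumption:

```python
import random

random.seed(0)

def build_embeddings(vocab, dim=4):
    # One vector per token. These are random here; a trained model
    # would adjust these dimensions so they encode meaning.
    return {tok: [random.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

def vectorize(tokens, emb):
    # Tokens outside the vocabulary fall back to the shared [UNK] vector.
    return [emb.get(t, emb["[UNK]"]) for t in tokens]

vocab = ["the", "model", "learns", "meaning", "[UNK]"]
emb = build_embeddings(vocab)

vectors = vectorize(["the", "model", "learns"], emb)
print(len(vectors), len(vectors[0]))  # 3 4
```

The model never sees the strings, only these vectors, which is why the choice of tokenizer in the first step directly shapes what the model can learn.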
https://medium.com/the-artificial-impostor/nlp-four-ways-to-tokenize-chinese-documents-f349eb6ba3c3
https://towardsdatascience.com/benchmarking-python-nlp-tokenizers-3ac4735100c5