It’s thought that it learn syntactic knowledge at the lower levels of the neural network and then semantic knowledge at the higher levels as they begin to hone in on more specific language domain signals, e.g. medical vs. technical training texts. https://hal.inria.fr/hal-02131630/document
https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
https://www.kaggle.com/code/abhinand05/bert-for-humans-tutorial-baseline