EdBERTa: Pretraining BERT using EdTech data

Unacademy has a lot of textual data available: notes and class presentations, student chat data, quizzes, and test-series question-answer sets. The text is multilingual and cross-lingual, and includes transliterated content covering more than 10 major Indian languages.

Each of these data types enables multiple NLP projects. A few of them are: Chat Priority Detection, Question Difficulty Estimation, Similar/Duplicate Question Detection, EdTech Entity Recognition (an NER task), and Text/Course Quality Evaluation.

These problems can be approached with traditional NLP techniques - word- and token-based classifiers, TF-IDF, and word2vec/doc2vec. But it has been shown over the last few years that pretraining a model on a large corpus and then fine-tuning it for a specific use-case yields the best results.

We encourage you to check out how NLP evolved over the years from simple statistical approaches to transformers here. Also, you can check GLUE benchmarks and leaderboards.

We could directly use pre-trained multilingual models like BERT, RoBERTa, and ALBERT, and fine-tune them for our tasks. But what if we could pre-train our own model and use it instead of these generic language models?

Train from scratch .. not fine-tune?

We took inspiration from some of the BERT variations: CamemBERT (a state-of-the-art language model for French), Med-BERT (pre-trained on a structured EHR dataset of 28,490,650 patients), and TaBERT (pre-trained to learn representations for tabular data).


At Unacademy, we have a good amount of text data: gigabytes of it, billions of words/tokens, spanning chats, transliterated text, code-mixed quizzes, and more. So in our case, pre-training made sense instead of only fine-tuning.

Indian languages contribute less than 5% of the data on which SOTA models are trained, so it did not make sense to fine-tune those models. In future iterations, we could pick up models like MuRIL or IndicBERT to fine-tune instead of training from scratch, as that might reduce the training period.

Also, we customized the architecture for faster inference instead of using the given base architectures.

Data Collection

We collected text data from the Unacademy platform: quizzes, QnA, course content and notes, live class chats, etc. There was no strict need to collect external text data, but we chose to include some external sources as well.


We intentionally skipped Wikipedia & Reddit data, as we had already accumulated many GBs of domain-specific data and we wanted to experiment as quickly as possible.

Once we collected all the data we needed, we lowercased all text - this made sense in our case as chat data is mostly lowercase. Books were converted to pages and then to paragraphs. We also kept all characters, as they cover many math symbols, emojis, etc.
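
A minimal sketch of this preprocessing step (the paths and the paragraph-splitting heuristic below are illustrative, not our exact pipeline):

# Illustrative preprocessing sketch: lowercase everything, keep all
# characters (math symbols, emojis, etc.), and split book pages into
# paragraphs. Paths like "book_pages/" are hypothetical.
from pathlib import Path

def page_to_paragraphs(page_text):
    # Treat blank lines as paragraph boundaries - a simple heuristic.
    return [p.strip().lower() for p in page_text.split("\n\n") if p.strip()]

with open("train.txt", "w", encoding="utf-8") as out:
    for page_file in Path("book_pages/").glob("*.txt"):
        for paragraph in page_to_paragraphs(page_file.read_text(encoding="utf-8")):
            out.write(paragraph + "\n")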

At this point, we had a text file with >200 million lines and >5 billion words to train.

Tokenizer Training

As we have domain-specific data, we opted to train our own sub-word tokenizer instead of reusing the tokenizers from BERT or GPT. We chose to train the tokenizer with SentencePiece, which supports both byte-pair encoding (BPE) and the unigram language model (ULM). To understand more about tokenization & why it is required, check this article.


The installation and training process is very simple and is explained here. We used the following command to train the model, where input is the training text file. We chose a larger vocab_size of 96k in the hope that a large vocabulary might cover some or all EdTech-specific token patterns.


$ spm_train --input=train.txt --model_prefix=edbpe --vocab_size=96000 --character_coverage=1.0 --model_type=bpe
$ spm_encode --model=edbpe.model --output_format=piece < train.txt > train.bpe

This gave us two files, edbpe.model and edbpe.vocab, which we used to create token-encoded text files. We also set aside ~1 million lines for the test and validation files.
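
The same encoding can also be done from Python with the sentencepiece package - a small illustrative sketch using the files produced above:

# Using the trained tokenizer from Python (equivalent to spm_encode above).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="edbpe.model")

pieces = sp.encode("derivative of sin x is cos x", out_type=str)  # sub-word pieces
ids = sp.encode("derivative of sin x is cos x", out_type=int)     # piece ids
print(pieces)
print(sp.decode(ids))  # round-trips back to the original text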

At this point, we had text files with >12 billion tokens to train.

Training Using Fairseq

Fairseq is a sequence modeling toolkit mostly used by researchers in the NLP & speech domains. It is actively maintained by Facebook AI Research and is quite underrated, as HuggingFace is preferred most of the time.

We chose Fairseq over HuggingFace mostly because of good prior experience working with it. It also provides extensive options to control what you train, makes it easy to add or remove architecture code, and supports straightforward task fine-tuning through both the command line & the Python API.


The complete training procedure is explained here in this custom RoBERTa training tutorial. We used the roberta_base architecture but customized it for faster inference while keeping good enough performance. We created 3 variations with different sets of parameters to compare models and their performance. The final chosen parameters and command look like the below. We chose just 4 encoder layers and 16 attention heads, and used --fp16 for faster, more efficient training.


$ fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base \
    --encoder-layers 4 --encoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --encoder-attention-heads 16 \
    --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --batch-size $MAX_SENTENCES --update-freq $UPDATE_FREQ --max-update $TOTAL_UPDATES \
    --log-format simple --log-interval 1 \
    --skip-invalid-size-inputs-valid-test

We used Python 3.7, an RTX 3090, CUDA 11, and the Nvidia NGC PyTorch container for training. It took us a total of 3 weeks to generate all 3 variations of the model. We stopped the training once perplexity went below 4.5. More details about perplexity can be found here.
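
As a quick refresher, perplexity is just the exponentiated average loss. A hedged one-liner, assuming the masked-LM loss is reported in nats (fairseq's logs may use a different log base, so treat this as illustrative):

import math

def perplexity(avg_loss_nats):
    # Perplexity as the exponentiated average cross-entropy loss.
    return math.exp(avg_loss_nats)

print(perplexity(1.5))  # ~4.48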

Now our model is ready to use as-is and also to fine-tune for downstream tasks.

### Load the model in fairseq
from fairseq.models.roberta import RobertaModel

edberta = RobertaModel.from_pretrained('/path/to/edberta_checkpoint_file/',
                             checkpoint_file='edberta_best.pt')

edberta.eval()
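
As a quick check, sentence embeddings can be pulled out of the pretrained encoder. The sketch below is illustrative and makes assumptions - it tokenizes with the edbpe SentencePiece model from earlier and maps pieces through the fairseq dictionary; the exact production wiring may differ:

### Hedged usage sketch: mean-pooled sentence embeddings from EdBERTa
import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='edbpe.model')
vocab = edberta.task.source_dictionary     # dictionary built during preprocessing

def embed(text):
    pieces = sp.encode(text.lower(), out_type=str)
    ids = [vocab.bos()] + [vocab.index(p) for p in pieces] + [vocab.eos()]
    tokens = torch.tensor(ids).unsqueeze(0)
    with torch.no_grad():
        features = edberta.extract_features(tokens)   # (1, seq_len, 512)
    return features.mean(dim=1)                       # mean-pooled sentence vector

vec = embed("what is newton's second law of motion?")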

Designing our own tasks instead of GLUE

It is always good practice to check the performance of a pre-trained model on downstream GLUE tasks. But in our case, we wanted to check how our model performs on EdTech-related tasks, not on traditional GLUE tasks.

So we created 3 specific tasks for downstream fine-tuning.

  1. Course Subject Classification - The task was to classify given book-text sentences into Maths, Psychology, Physics, Chemistry, Geography, or History. We chose F1 as the evaluation metric for this task.
  2. Chat Topic Categorisation - The task was to classify chats into Maths, Physics, Chemistry, Geography, History, Greetings, or Doubt-related chat classes. F1 was again the evaluation metric for this task.
  3. Chat Clustering - The task was to cluster a set of chats and check cluster quality statistically. We defined a custom metric, MCDC (Mean Cosine Distance from Centroid), which works exactly as its name suggests; a small sketch of it is shown right after this list.
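
A minimal sketch of how MCDC can be computed, assuming each chat is already embedded as a vector and assigned to a cluster (our exact implementation may differ):

# Mean Cosine Distance from Centroid (MCDC). Lower is better - it means
# points sit closer to their cluster centroid.
# `embeddings` is an (n, d) NumPy array, `labels` an (n,) array of cluster ids.
import numpy as np
from scipy.spatial.distance import cosine

def mcdc(embeddings, labels):
    distances = []
    for label in np.unique(labels):
        members = embeddings[labels == label]
        centroid = members.mean(axis=0)
        distances.extend(cosine(member, centroid) for member in members)
    return float(np.mean(distances))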

All training data collected for these tasks was verified with the tagging team, and we tried to keep it as accurate as possible. We fine-tuned BERT, ALBERT, RoBERTa, and EdBERTa on these tasks, which took a few hours to train.
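
Conceptually, fine-tuning for the two classification tasks means putting a small classification head on top of the pretrained encoder and training end-to-end. The sketch below is a simplified PyTorch illustration of that idea, not our exact fairseq fine-tuning setup (the class, dimensions, and training objects are assumptions; hidden_dim matches the --encoder-embed-dim used above):

# Simplified fine-tuning sketch: a linear head over the pretrained encoder.
import torch
import torch.nn as nn

class EdBertaClassifier(nn.Module):
    def __init__(self, edberta, hidden_dim=512, num_classes=6):
        super().__init__()
        self.encoder = edberta.model                   # pretrained EdBERTa encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        # features_only=True returns token representations without the LM head
        features, _ = self.encoder(tokens, features_only=True)
        return self.head(features[:, 0, :])            # classify on the <s> token

model = EdBertaClassifier(edberta)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()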

Benchmarks

After fine-tuning and running the final inference, we compiled the results for each task and its metric across these language models.

We got the following encouraging results for EdBERTa -

  • Course Subject Classification → 5.17% improvement over BERT
  • Chat Topic Categorisation → 13.26% improvement over ALBERT
  • Chat Clustering → 28% improvement over RoBERTa (a lower MCDC is preferred here, as it means tighter/closer cluster points)

On all of our defined tasks, EdBERTa performed better than the existing SOTA models.

It is very important to note that this doesn't prove our model is better than SOTA.

It just shows that it performs better for our use-cases and on our EdTech data, which is useful for us, but it may not work well with data or tasks in other domains like health, news, e-commerce, etc.

Final thoughts

In this article, we went through the process of creating a transformer-based language model from scratch and why we chose to do so. This proof-of-concept reinforces the point that if you have a good amount of data and your domain is very niche, pre-training is worth doing instead of just fine-tuning SOTA NLP models.

This model could act as a seed model for all our NLP tasks, as explained in earlier sections. It is already showing good results in the Question Difficulty Detection and Chat Priority Categorisation tasks, where we could get better results quickly. Average inference time for 100 parallel requests is ~12ms on CPU (on instances like c5.xlarge) and ~4ms on GPU (on instances like g4dn.xlarge), which is on par with many NLP APIs in terms of performance.

Swapnil Ashok Jadhav

Sr. Principal Scientist @ Unacademy