Fine-Tuning “LaBSE” for a Sentiment Classification Task

Vinura Dhananjaya
Towards Data Science
6 min read · Jul 19, 2021



Some background

Multilingual language models (let’s call them “MLMs”) have been the trend in NLP in recent times due to their ability to provide multilingual word (or sentence, document, etc.) embeddings within a single model. “Pre-trained” is another term that comes up along with MLMs: it means the models have already been trained on large corpora from different domains, so we do not have to train them again from scratch; instead, we can “fine-tune” them for a desired target task while making use of knowledge transfer (transfer learning) from the pre-trained model. MLMs have been released for public use mainly by tech giants such as Google, Facebook and Baidu, given that they have the resources to train these large models with millions, billions and even trillions of parameters. LaBSE [1] is such a model released by Google, based on the BERT model.

LaBSE, or “Language-Agnostic BERT Sentence Embedding”, was built with a focus on bi-text mining and sentence/embedding similarity tasks. It uses WordPiece tokenization and can produce sentence embeddings (the [CLS] token’s embedding from the model’s final layer represents the sentence embedding) for 109 languages. However, the authors have not reported the model’s performance on other downstream tasks such as classification or Named Entity Recognition (NER), and it has not been used much for that kind of downstream task. The architecture of LaBSE is a ‘dual-encoder’ model (bidirectional dual encoder with additive margin softmax), which means it has two encoder blocks based on the ‘BERT-base’ model’s encoders. The two encoders encode source and target sentences separately, and the resulting embeddings are fed to a scoring function (cosine similarity) to rank their similarity. The training loss function of LaBSE is based on this scoring, namely the additive margin softmax mentioned above.
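To make the embedding-and-similarity idea concrete, here is a minimal sketch (not from the paper; it assumes the TF Hub release of LaBSE and its preprocessing model, which we set up properly in the next section) that embeds a few sentences and scores their cosine similarity:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers ops required by the preprocessing model

# TF Hub handles for the preprocessing model and the LaBSE encoder (version 2)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

# Two pairs of sentences; the Italian ones are translations of the English ones
english = tf.constant(["Puppies are nice.", "The weather is terrible today."])
italian = tf.constant(["I cuccioli sono carini.", "Il tempo è pessimo oggi."])

# [CLS]-based sentence embeddings, L2-normalised so a dot product equals cosine similarity
emb_en = tf.nn.l2_normalize(encoder(preprocessor(english))["pooled_output"], axis=1)
emb_it = tf.nn.l2_normalize(encoder(preprocessor(italian))["pooled_output"], axis=1)

# Translation pairs should score highest, i.e. large values on the diagonal
print(tf.matmul(emb_en, emb_it, transpose_b=True))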

Setting things up

(I will try to keep things simple while including the important points.) The official/original LaBSE model was released to TensorFlow Hub (https://www.tensorflow.org/hub/) by the authors, and I will be using it. The module depends on TensorFlow (2.4.0+ would be great; I am using 2.5.0). There are a few other libraries that are required, and they can be installed using pip.

NOTE: As of now (as far as I am aware), Conda environments do not work with TF Hub models + GPU. If you try such a setup, it will always (silently) fall back to the CPU version of TensorFlow or throw errors. Hence, if a GPU is used, Conda should be out of the equation. (https://github.com/tensorflow/text/issues/644)

First, there are some essential libraries to be installed (I am using an Ubuntu machine):

!pip install tensorflow-hub
!pip install tensorflow-text  # needed for loading universal-sentence-encoder-cmlm/multilingual-preprocess
!pip install tf-models-official

And we can import them,

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization

Obviously, you should import other common libraries as well if needed (numpy, pandas, sklearn), which I will not cover here.

For the classification task we can use any labelled dataset in one of the 109 pre-trained languages (or even an unsupported one, though with some performance degradation). What we will be doing is fine-tuning the model: training the pre-trained model on an additional dataset (a small one, compared to the huge pre-training corpora) so that the model is adapted to our specific classification (or other) task. As a starting point, we can use the IMDb movie reviews dataset (https://ai.stanford.edu/~amaas/data/sentiment/). Hence, our task becomes binary sentiment classification. The dataset consists of 25k training and 25k test examples.

!wget -c https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -O - | tar -xz
# remove the unlabelled reviews; otherwise text_dataset_from_directory treats 'unsup' as a third class
!rm -r aclImdb/train/unsup

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32  # 8, 16
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

We can build input data pipelines with TensorFlow, as above. (If an earlier version of TF is used, this feature might not be entirely, or at least directly, available.) Next, we need to pre-process this data before feeding it into the model we will be building; the pre-processed data can then be encoded/embedded into a vector space. For that, we define the two handles below, using version 2 of LaBSE (https://tfhub.dev/google/LaBSE/2; version 1 was the initial model released to TF Hub).

tfhub_handle_preprocess = "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2"
tfhub_handle_encoder = "https://tfhub.dev/google/LaBSE/2"
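If you are curious, you can peek at what the preprocessing model produces; it is a dictionary of fixed-length token-id tensors that the encoder expects (the keys and the default sequence length of 128 below are what I observed with this preprocessing model):

preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
sample = preprocess_model(tf.constant(["This movie was surprisingly good!"]))

print(sorted(sample.keys()))           # ['input_mask', 'input_type_ids', 'input_word_ids']
print(sample["input_word_ids"].shape)  # (1, 128) with the default sequence length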

Building the model

Next, we can build the model. Below, a function is defined that builds the model with a few specific layers.

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='LaBSE_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, name='classifier')(net)
    return tf.keras.Model(text_input, net)

You can notice the key ‘pooled_output’ in the model, which refers to the [CLS] token representation of the sentence, as mentioned earlier in the post (the other form of output is ‘sequence_output’). The trainable=True flag on the encoder layer means that, while fine-tuning with our dataset, the weights/parameters of the original model are updated as well (which is also called “global fine-tuning”).

encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='LaBSE_encoder')

net = outputs['pooled_output']

If we want to keep the original model’s weights frozen, the approach is called “feature-based fine-tuning” or a “fixed-dimensional” approach (different terms are used in the literature); a sketch of that variant follows below. Furthermore, some additional layers (Dropout and Dense) are added after the encoder, which act as the classifier head of the model. This is the combination found frequently in the literature for a classification task like this.
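For completeness, here is a minimal sketch of that feature-based variant, identical to the function above except that the encoder is frozen with trainable=False:

def build_feature_based_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    # trainable=False keeps the pre-trained LaBSE weights fixed; only the head is trained
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=False, name='LaBSE_encoder')
    net = encoder(encoder_inputs)['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, name='classifier')(net)
    return tf.keras.Model(text_input, net)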

For the optimizer, Adam or AdamW is preferred, and we can set it up as below. Based on the dataset or the task, the learning rate may need to be changed (lowered, in most cases). Hyperparameter optimization tools such as Keras Tuner (https://keras.io/guides/keras_tuner/) or a service like Weights & Biases (https://wandb.ai/site) can also be used to find optimal values for parameters such as the learning rate and the batch size.

# requires: pip install tensorflow-addons
from tensorflow_addons.optimizers import AdamW

step = tf.Variable(0, trainable=False)
# piecewise-constant schedule: 5e-5 for the first 1000 steps, 1e-5 afterwards
schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
    [1000], [5e-5, 1e-5])
lr = 1 * schedule(step)
wd = lambda: 1e-6 * schedule(step)
optimizer = AdamW(learning_rate=lr, weight_decay=wd)
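Alternatively, the official.nlp.optimization module we imported earlier can create an AdamW optimizer with the linear warmup-and-decay schedule commonly used for BERT fine-tuning. A sketch (the learning rate, epoch count and 10% warmup fraction are just illustrative choices):

num_epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * num_epochs
num_warmup_steps = int(0.1 * num_train_steps)

optimizer = optimization.create_optimizer(init_lr=3e-5,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')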

Next, the model can be compiled and then trained with the model.fit() method.

classifier_model = build_classifier_model()
epochs = 5

classifier_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                         optimizer=optimizer,
                         metrics=tf.keras.metrics.BinaryAccuracy(threshold=0.0))
history = classifier_model.fit(train_ds, validation_data=val_ds, epochs=epochs)

The number of epochs could be set to 3, 4 or 5, which is usually sufficient. We could also include techniques such as EarlyStopping. It is not common to use k-fold cross-validation with large models like these; instead, we could run the model multiple times with different random seeds for the input data selection. It is fairly easy to build, run and get results from the model, and the trained model can be saved and used to predict on new data, as is usually done with TensorFlow models; a sketch follows below.
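For example, here is a sketch of adding early stopping, saving the fine-tuned model and predicting on new text (the callback settings and the save path are just illustrative choices):

# Stop when the validation loss stops improving and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,
                                              restore_best_weights=True)
history = classifier_model.fit(train_ds, validation_data=val_ds,
                               epochs=epochs, callbacks=[early_stop])

# Save in the SavedModel format and reload for inference
classifier_model.save('labse_imdb_classifier', include_optimizer=False)
reloaded = tf.saved_model.load('labse_imdb_classifier')

# The model takes raw text and returns a logit; sigmoid maps it to P(positive review)
print(tf.sigmoid(reloaded(tf.constant(['What a wonderful movie!']))))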

In my opinion, LaBSE may be comparatively weak on tasks such as text classification, compared to models like XLM-R, perhaps because it was originally built and trained for bi-text mining and sentence-similarity tasks, and because more fine-tuning data might be required for better results. LaBSE has not been used much for classification tasks in the literature either (to my knowledge; the paper itself has 38 citations on Google Scholar at the time of writing). For the task here, I got accuracy, precision, recall and F1 scores only slightly above 50%. This was done with some arbitrarily chosen hyperparameters, so the results might improve if the hyperparameters were tuned. (Some similar work has been carried out with LaBSE version 1 (https://medium.com/swlh/language-agnostic-text-classification-with-labse-51a4f55dab77), but with a much larger training dataset.)
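For reference, scores like these can be computed over the test split; a minimal sketch using scikit-learn (the thresholding at zero assumes the single-logit classifier head defined above):

from sklearn.metrics import classification_report

# Gather labels and predictions over the test pipeline;
# a logit > 0 corresponds to a sigmoid probability > 0.5, i.e. the positive class
y_true, y_pred = [], []
for texts, labels in test_ds:
    logits = classifier_model(texts, training=False).numpy().flatten()
    y_pred.extend((logits > 0).astype(int))
    y_true.extend(labels.numpy())

# Per-class precision, recall and F1, plus overall accuracy
print(classification_report(y_true, y_pred, target_names=class_names))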

Anyway, I’d like to hear feedback, comments or others’ experiences with this, so feel free to share your thoughts and suggestions. Thanks for reading!

References

[1] Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan and Wei Wang, “Language-agnostic BERT Sentence Embedding” (2020), arXiv:2007.01852v1 [cs.CL]
