Active learning for Green-NLP
I have developed a personal interest in applications of AI to climate and sustainability (“AI-for-good”). This is a relatively novel area: on one hand, AI can be applied to tackle problems like climate change and to reach sustainability goals; on the other hand, green AI focuses on how environmentally friendly the AI tools themselves are. Green NLP can be thought of as a subfield where we use NLP to tackle environmental issues while keeping the tools eco-friendly. In this post, the first aspect, utilizing NLP to make an impact, is in the spotlight. In particular, NLP could be used for applications like automated sustainability reporting or identifying climate-related claims in public discourse (and more).
The main challenge for green NLP is that the domain is relatively new, so annotated data is scarce, and labeling unstructured textual data consumes a lot of resources. We have approached the problem from different perspectives and with different techniques. In this post, I would like to explore “active learning” as one way to deal with this data scarcity: it can help us get good results on modeling tasks with a minimal labeling cost.
Active learning — an intro
Active learning is a semi-supervised learning method. The model starts with a few annotated samples and then, with the help of an oracle (a human expert or an expert system), or on its own, iteratively picks the remaining samples that are most worth labeling, so that the labeling budget is spent efficiently and effectively.
There are plenty of techniques in active learning, and we will use one of them here. In this brief introduction, I will discuss only the techniques that we will be using. (There are a number of articles explaining active learning theory, such as this one.)
Pool-based sampling
In pool-based sampling, we begin with a small labeled dataset and a pool of unlabeled samples to choose from. Usually, the size of the labeled dataset |L| is much smaller than the size of the pool |P| (|L| ≪ |P|).
We then define rules to select samples from this unlabeled pool. These rules are based on measures of uncertainty, i.e., we try to measure how uncertain the model is about its prediction for a specific sample. The expectation is that selecting the most uncertain samples will help the model achieve better results.
The combination of a pool and uncertainty measures to query samples from the pool (uncertainty sampling) is one active learning technique and we will be using it in this post. As this paper summarizes,
“…. the motivation behind uncertainty sampling is to find some unlabeled examples near decision boundaries, and use them to clarify the position of decision boundaries…”
Querying strategies
The querying strategy defines how we select the unlabeled samples to be labeled from the pool; in other words, it is built on the measures of uncertainty mentioned above. In this experiment, we will use uncertainty_sampling, margin_sampling, and entropy_sampling, three such strategies that are already implemented in modAL.
In uncertainty_sampling, we take the sample the model is most uncertain about (i.e., the one with the lowest top-class probability). In an example with a simple linear decision boundary separating two classes, it may appear as below: the samples that lie closest to the boundary are the ones whose class assignment is most uncertain.
In margin_sampling, we consider the margin between the top predicted probabilities. In our case, if the prediction probabilities (for the two classes) for sample-01 are close to each other (e.g., [0.52, 0.48]), the model is not very confident about assigning a class to it. For sample-02, the probabilities could be [0.77, 0.23], meaning the model is more confident in assigning a class (class 0) to it. Hence, we pick sample-01 from the pool to be labeled next. In the graph below, the two bars for each data sample represent the probabilities predicted by the model for each class. Note that sample-01 would also be selected under the previous querying strategy.
The margin_sampling strategy captures a richer notion of uncertainty, which becomes more prominent as the number of output classes increases; in a binary classification scenario like ours, the two strategies behave pretty much the same. The difference between them can be seen more clearly in a three-class scenario.
Another strategy is entropy_sampling. Here, an entropy value is calculated for each sample based on the predicted class probabilities, and we select the sample with the highest entropy. When the probabilities are close to uniform (in other words, the probability values are close to each other and the model is uncertain about the final class), the entropy is high.
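To make the three measures concrete, here is a small, self-contained sketch (not part of the original experiment) that scores a few hypothetical three-class probability vectors with the least-confidence, margin, and entropy criteria, similar to what modAL computes internally.
import numpy as np

# Hypothetical predicted class probabilities for four pool samples (3 classes).
proba = np.array([
    [0.36, 0.33, 0.31],   # almost uniform -> highly uncertain overall
    [0.48, 0.47, 0.05],   # top two classes very close -> smallest margin
    [0.70, 0.20, 0.10],   # fairly confident
    [0.90, 0.07, 0.03],   # very confident
])

# Least confidence: 1 - max probability (higher = more uncertain).
least_confidence = 1.0 - proba.max(axis=1)

# Margin: difference between the two highest probabilities (lower = more uncertain).
sorted_proba = np.sort(proba, axis=1)[:, ::-1]
margin = sorted_proba[:, 0] - sorted_proba[:, 1]

# Entropy: -sum(p * log p) (higher = more uncertain).
entropy = -np.sum(proba * np.log(proba), axis=1)

print(least_confidence)  # sample 0 scores highest
print(margin)            # sample 1 has the smallest margin
print(entropy)           # sample 0 again scores highest
Note how the first two samples swap places depending on the strategy: least confidence and entropy favor the near-uniform sample, while margin sampling favors the sample whose top two classes are almost tied.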
Datasets and model
We will be using a dataset of climate-change-related claims. It is a binary labeled dataset that tells whether a piece of text is a climate-related claim (label 1) or not (label 0).
As the model, we will use a plain “distilbert-base-uncased” model from Hugging Face. (There are a few domain-specific models available as well.)
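As a rough sketch of the setup (the file name and column names here are hypothetical, since the loading step is not shown in this post), the data could be read into text/label arrays like this:
import pandas as pd

# Hypothetical file and column names; the actual data is a binary
# climate-claim dataset with a text field and a 0/1 label.
df = pd.read_csv("climate_claims.csv")

texts = df["text"].tolist()      # the raw text snippets
labels = df["label"].to_numpy()  # 1 = climate-related claim, 0 = not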
Method
Even though our dataset is labeled, we will assume that only 10% of it has been labeled; labeling the remaining 90% would be a hectic task. This scenario replicates a situation where we would use active learning: we have scraped data from some sources (web, PDF files, etc.), but now we need to go through it and label it manually, which is not easy. Active learning is a perfect tool for this dilemma. The original paper that introduced the dataset already used active learning, so here we will use an active learning approach that differs from theirs. To perform active learning with our model, we will use the modAL library. However, to use Hugging Face models with modAL, we need the help of skorch: it provides a nice wrapper around PyTorch models (which Hugging Face models are) so that we get a scikit-learn-compatible interface to them.
(I ran the experiment on Google Colab, and the complete notebook for the experiment can be found here. Below, I include only the essential parts of the code.)
Baseline
First, let’s install the requirements.
!pip install transformers
!pip install evaluate
!pip install modAL
!pip install skorch
skorch provides classes that we can use to wrap Hugging Face transformers. We will also set the random seeds.
import random

import numpy as np
import torch
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling, margin_sampling, entropy_sampling
from skorch import NeuralNetClassifier
from skorch.hf import HuggingfacePretrainedTokenizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from transformers import AutoModelForSequenceClassification, TrainingArguments

seed = 13
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
After installing and importing the required libraries, we set aside a separate test set to evaluate the active learning performance of our model. As the baseline, we first fine-tune the distilbert model on the maximum possible amount of training data (everything left after the test set is separated) and report the accuracy. At this point, our data is split as below (a sketch of the split follows the list).
- X_train — used to train the model. We will later modify this to test our active learning technique.
- X_test — a test dataset that will not be modified from here onward.
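A minimal sketch of that split (assuming texts and labels hold the full dataset, as in the loading sketch earlier; the 80/20 ratio is also an assumption):
from sklearn.model_selection import train_test_split

# Hold out a test set that stays untouched for the rest of the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    np.array(texts), np.array(labels), test_size=0.2,
    stratify=labels, random_state=seed
)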
We can easily fine-tune the model with the Trainer API from Hugging Face. Below are some generic hyperparameters we will use; we will not be doing any hyperparameter optimization.
training_args = TrainingArguments(
    learning_rate=5e-5,
    num_train_epochs=4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=200,
    output_dir="outputs",
    overwrite_output_dir=True,
    evaluation_strategy="epoch")
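The Trainer call itself is not shown above; a sketch of how it could look follows (model, train_dataset, and eval_dataset are assumptions standing in for the fine-tunable model and the tokenized splits, and the accuracy metric comes from the evaluate package installed earlier):
import evaluate
import numpy as np
from transformers import Trainer

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# `model`, `train_dataset`, and `eval_dataset` are assumed to exist:
# the distilbert classification model and the tokenized train/test sets.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()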
Evaluating on the X_test dataset gives us an accuracy of 65.63%. This will be our baseline.
{'eval_loss': 0.8183095455169678,
 'eval_accuracy': 0.65625,
 'eval_runtime': 0.8913,
 'eval_samples_per_second': 430.845,
 'eval_steps_per_second': 53.856,
 'epoch': 3.0}
Active learning
For our active learning experiment, we need an unlabeled dataset. To create one, we can randomly select a set of records from the X_train set we created earlier. We will divide X_train into two parts as below (a sketch of this split follows the list):
- X_initial — labeled dataset (50 randomly selected data points). This acts as the starting point: the model will learn something from these very few labeled samples.
- X_pool — unlabeled dataset (the rest, ~1150 data points, whose labels we pretend not to have). The model will query samples from this pool and have them labeled later.
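A minimal sketch of this split (the exact selection code is not shown here; X_train and y_train are assumed to be numpy arrays):
n_initial = 50

# Randomly pick the small labeled seed set.
initial_idx = np.random.choice(len(X_train), size=n_initial, replace=False)
X_initial, y_initial = X_train[initial_idx], y_train[initial_idx]

# Everything else goes into the pool; we keep y_pool around only to
# simulate the oracle that reveals a label when the learner asks for it.
pool_mask = np.ones(len(X_train), dtype=bool)
pool_mask[initial_idx] = False
X_pool, y_pool = X_train[pool_mask], y_train[pool_mask]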
Next, we create a PyTorch module class that wraps our distilbert model in a torch.nn.Module and returns logits as outputs.
class DistilBertModule(torch.nn.Module):
    def __init__(self, name, num_labels):
        super().__init__()
        self.name = name
        self.num_labels = num_labels
        self.reset_weights()

    def reset_weights(self):
        # (Re)load the pretrained model with a fresh classification head.
        self.bert = AutoModelForSequenceClassification.from_pretrained(
            self.name, num_labels=self.num_labels)

    def forward(self, **kwargs):
        pred = self.bert(**kwargs)
        return pred.logits
Next, we can create a scikit-learn pipeline that contains our distilbert model, preceded by the tokenizer.
TOKENIZER = "distilbert-base-uncased"
PRETRAINED_MODEL = "distilbert-base-uncased"
OPTIMIZER = torch.optim.AdamW
LR = 5e-5
MAX_EPOCHS = 4
CRITERION = torch.nn.CrossEntropyLoss
BATCH_SIZE = 8
DEVICE = 'cuda'

# Create the pipeline: tokenizer followed by the skorch-wrapped classifier.
pipeline = Pipeline([
    ('tokenizer', HuggingfacePretrainedTokenizer(TOKENIZER)),
    ('net', NeuralNetClassifier(
        DistilBertModule,
        module__name=PRETRAINED_MODEL,
        module__num_labels=2,
        optimizer=OPTIMIZER,
        lr=LR,
        max_epochs=MAX_EPOCHS,
        criterion=CRITERION,
        batch_size=BATCH_SIZE,
        iterator_train__shuffle=True,
        device=DEVICE,
    )),
])
Using the created pipeline, we can initialize a learner object with modAL. We will use our small labeled dataset as the starting point for the learner.
learner = ActiveLearner(estimator=pipeline, X_training=X_initial, y_training=y_initial)
We can measure the performance of our model with this small starting dataset, which gives an accuracy of 59.64%.
with torch.inference_mode():
    y_pred = pipeline.predict(X_test)

print(accuracy_score(np.array(y_test), np.array(y_pred)))
Next, we will query data from the X_pool set. In this experiment, we perform 10 queries of 80 samples each, which covers most of the pool. In general, we can keep querying iteratively until we reach a predefined performance level or until we exhaust the pool. In each query, the learner takes a set of data points from the pool, gets them labeled, and uses them to train (improve) the classifier. After each query, we record the accuracy score of the model on the X_test dataset. (Note that the labels of the instances the learner requests are already available, since we had them in the original dataset; you can think of these labels as being provided by an oracle.)
performance_hist = []
n_queries = 10

for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(X_pool, n_instances=80)
    learner.teach(X=X_pool[query_idx], y=y_pool[query_idx], only_new=True)

    # Remove the queried instances from the pool.
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    # Evaluate on the held-out test set after each query.
    y_pred = learner.predict(X_test)
    performance_hist.append(accuracy_score(np.array(y_test), np.array(y_pred)))
Below is a visualization of how the accuracy score changes with each query; the baseline accuracy is indicated by the red horizontal line. With the active learning approach, we get close to the performance observed with the fully labeled dataset (i.e., the fully supervised fine-tuning performance).
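The plotting code is not shown in this post; a minimal matplotlib sketch of such a figure (the 0.6563 baseline value comes from the evaluation above) could look like this:
import matplotlib.pyplot as plt

baseline_accuracy = 0.6563  # fully supervised fine-tuning accuracy from above

plt.plot(range(1, len(performance_hist) + 1), performance_hist,
         marker='o', label='active learner')
plt.axhline(y=baseline_accuracy, color='red', linestyle='--', label='baseline')
plt.xlabel('Query number')
plt.ylabel('Accuracy on X_test')
plt.legend()
plt.show()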
By changing the query_strategy to margin_sampling, we observe similarly good results in this case.
learner = ActiveLearner(estimator=pipeline, X_training=X_initial,
                        y_training=y_initial, query_strategy=margin_sampling)
With entropy_sampling, the results are shown below. In this run, the active learner seems to be marginally surpassing the baseline accuracy at some queries.
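The learner for this run is created the same way, just swapping in entropy_sampling (imported from modAL.uncertainty):
learner = ActiveLearner(estimator=pipeline, X_training=X_initial,
                        y_training=y_initial, query_strategy=entropy_sampling)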
This experiment was done with a small dataset and without much hyperparameter tuning (of either the model or the active learning settings). Even so, active learning appears to be a satisfactory alternative to fully supervised fine-tuning, which requires costly labeling.
Active learning is perhaps an overlooked technique, yet it can be very useful when leveraging NLP in a new domain like Green-NLP.
I hope to explore the topics of active learning and weakly supervised learning further in the future. I'm happy to receive any comments or thoughts as well.