Today, machine learning (ML) is applied to numerous fields, including, but not limited to, Natural Language Processing (NLP), computer-aided diagnosis, optimization, and bioinformatics. A significant proportion of this success is due to a subset of ML called supervised learning. There are three main reasons behind the success of supervised learning (and machine learning generally): 1) availability of massive data; 2) better algorithms; and 3) powerful computational infrastructure, jointly referred to as the AI Trinity. Supervised learning techniques require labeled data, and as data grows into 'big data,' the effort required to label it grows with it. In this article, we discuss active learning, a suite of techniques for intelligent, data-driven annotation.
Labeled data is the primary requirement of supervised learning. Annotating data is expensive because it may require: i) excessive time and manual effort, and ii) costly sensors. Let us get an overview of how expensive labeling is with a few examples.
Let us say we need to convert speech or audio into text for an application such as subtitle generation (a speech-to-text task). We have to annotate multiple audio segments corresponding to the words and phrases in the audio. The following example audio of 6 seconds may require around a minute to annotate manually. Thus, annotating hours of publicly available audio data is impractical.
Similarly, researchers have made efforts to detect COVID-19 from the sound of the human cough [10]. Collecting labels for thousands of human cough samples is a difficult task.
Let us say we want to classify human activities into different categories (shown in Figure 2). Such tasks need expensive sensors to monitor the alignment or motion of body parts. In the end, we must map the sensor data to the various activities, which requires substantial manual effort.
We know that increasing the amount of training (labeled) data generally improves model performance. However, not all samples contribute equally to improving the model. Let us understand this with a few examples.
We will use synthetic two-class data generated from a bivariate normal distribution for this experiment. Now, we will train a Support Vector Classifier (SVC) [8] model on a subset of this dataset (5 data points) and visualize the decision boundary.
(All Figures in this article are interactive. Hover over plots to know more. Click on legend items to hide/show elements)
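This experiment can be sketched in Python roughly as follows; the class means, covariances, sample sizes, and random seed are illustrative assumptions, not the exact data used in the figure:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two-class synthetic data drawn from bivariate normal distributions (illustrative parameters).
X0 = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=100)
X1 = rng.multivariate_normal(mean=[3, 3], cov=[[1, 0], [0, 1]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Train on a tiny labeled subset (5 points, with both classes represented);
# the remaining points act as the unlabeled pool.
train_idx = np.concatenate([rng.choice(100, 3, replace=False),
                            100 + rng.choice(100, 2, replace=False)])
pool_idx = np.setdiff1d(np.arange(len(X)), train_idx)
clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])

# Signed distance to the decision boundary: pool points near zero lie in the confusion area.
distance_to_boundary = np.abs(clf.decision_function(X[pool_idx]))
```

Pool points with a small `distance_to_boundary` correspond to candidates such as B and D in the figure below.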
Support vectors help the SVC model distinguish between the classes. The model fit in the above diagram is not accurate because we have used only a small number of training datapoints. Consider some potential training points A, B, C, and D from the unlabeled datapoints. Datapoints B and D are closer to the confusion area (the region around the decision boundary) than A and C. Thus, B and D would be more informative if added to the training points: for the SVC model, "closer is better" means closer to the confusion area.
Let us consider the MNIST dataset (a well-known public dataset with labeled images of digits $0$ to $9$) for the classification task. We have shown a few examples here.
We will train the Support Vector Classifier (SVC) model on a few random samples of the MNIST dataset. Let us see what our model learns with a set of $50$ data points ($5$ samples for each class). We show the normalized confusion matrix over a test set of $10000$ samples (Figure $4$).
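A rough sketch of this experiment is shown below; fetching MNIST via `fetch_openml("mnist_784")`, the random seed, and the exact train/test split are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

# 5 labeled samples per digit (50 in total) as the initial train set.
rng = np.random.default_rng(0)
train_idx = np.concatenate(
    [rng.choice(np.where(y == d)[0], size=5, replace=False) for d in range(10)]
)
test_idx = rng.choice(np.setdiff1d(np.arange(len(y)), train_idx), size=10000, replace=False)

clf = SVC(kernel="rbf").fit(X[train_idx] / 255.0, y[train_idx])

# Row-normalized confusion matrix over the held-out test samples (Figure 4).
cm = confusion_matrix(y[test_idx], clf.predict(X[test_idx] / 255.0), normalize="true")
```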
We can see from the confusion matrix that some digits are confused more than others. For example, the digit '1' has the least confusion, while the digit '9' is confused with '7' and '4'. Some digits are difficult to distinguish from the model's perspective. Thus, we may need more training examples for such digits to learn them correctly. Now, we will see a regression-based example.
We will consider sine curve data with added noise. We take a few samples (8) as the training points, a few as potential training points, and the rest as test points.
We will fit a GPR (Gaussian Process Regressor) [11] model with a Matern kernel to our dataset. GPR models additionally provide the uncertainty of their predictions: the predictive variance is a measure of the model's uncertainty about its predictions (the predictive mean).
We can observe that the uncertainty (predictive variance) is higher at datapoints far from the training points. Let us consider a set of datapoints A, B, C, and D to see whether they are equally informative to the model.
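A minimal sketch of this setup with scikit-learn is given below; the noise level, number of grid points, and the added `WhiteKernel` term are assumptions made for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)

# Noisy sine curve; a small random subset serves as the training points.
X_all = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_all = np.sin(X_all).ravel() + rng.normal(scale=0.1, size=len(X_all))
train_idx = rng.choice(len(X_all), size=8, replace=False)

gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5) + WhiteKernel(), normalize_y=True)
gpr.fit(X_all[train_idx], y_all[train_idx])

# Predictive mean and standard deviation; the variance (std**2) quantifies uncertainty.
mean, std = gpr.predict(X_all, return_std=True)
```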
From the RMSE and predictive variance, we can say that datapoints A and D are more informative to the model than B and C. Note that adding points to the train set is equivalent to annotating unlabeled data and using it for training. We can either choose these 'good' points intelligently or randomly pick some datapoints and label them. Active learning techniques help us determine the 'good' datapoints, i.e., those likely to improve our model. We will now discuss active learning techniques in detail.
Wikipedia defines active learning as follows:
The diagram below illustrates the general flow of active learning.
As shown in the flow diagram, an ML model selects a few samples from an unlabeled pool or distribution and sends them to an oracle (a human annotator or a data source) for labeling. These samples are chosen intelligently according to certain criteria. For this reason, active learning is also called optimal experimental design [link].
An ML model can also randomly sample datapoints and send them to the oracle for labeling. Random sampling will eventually capture the global distribution of the dataset in the training datapoints. However, active learning aims to improve the model faster by intelligently selecting the datapoints for labeling. Thus, random sampling is an appropriate baseline to compare with active learning.
We have mainly three different scenarios of active learning:
The pool-based sampling scenario is suitable for most real-world applications. Thus, we restrict this article to pool-based sampling.
We can query the datapoints from an unlabeled pool with the following methods:
We will demonstrate each of the above strategies with examples in the subsequent sections.
Uncertainty sampling uses different approaches for classification and regression tasks. We will go through them one by one with examples.
We will fit a Random Forest Classifier model [10] (an ensemble of multiple Decision Tree Classifiers) on a few random samples (50) of the MNIST dataset and visualize the predictions. We will then explain the different ways to perform uncertainty sampling using these predictions.
Above are the model's predicted probabilities for a few random test samples. We can use different uncertainty strategies as follows.
Least confident [16]: In this method, we choose the samples for which the probability of the most probable class is the lowest. In the above example, the model is least confident about sample 1's most probable class, the digit '1'. So, we would choose sample 1 for labeling using this approach.
Margin sampling [17]: In this method, we choose the samples for which the difference between the probability of the most probable class and the second most probable class is the smallest. In the above example, sample 1 has the smallest margin; thus, we would choose sample 1 for labeling using this approach.
Entropy [15]: Entropy can be calculated for $N$ classes using the following equation, where $P(x_i)$ is the predicted probability of the $i^{th}$ class. \begin{equation} H(X) = -\sum\limits_{i=1}^{N}P(x_i)\log_2 P(x_i) \end{equation} Entropy is higher when the probability is spread over many classes. Thus, higher entropy means the model is more confused among the classes. In the above example, sample 2 has the highest entropy in its predictions, so we would choose it for labeling.
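The three strategies reduce to simple scores computed from the predicted class probabilities. A hedged sketch, assuming a fitted classifier with a `predict_proba` method such as the Random Forest above:

```python
import numpy as np

def least_confident(proba):
    # Higher score = lower confidence in the most probable class.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # Smaller margin between the top two classes = more ambiguous sample.
    sorted_proba = np.sort(proba, axis=1)
    return sorted_proba[:, -1] - sorted_proba[:, -2]

def entropy(proba, eps=1e-12):
    # H(X) = -sum_i P(x_i) * log2 P(x_i); higher = probability spread over many classes.
    return -np.sum(proba * np.log2(proba + eps), axis=1)

# proba = clf.predict_proba(X_pool)
# query_idx = np.argmax(least_confident(proba))   # least confident
# query_idx = np.argmin(margin(proba))            # margin sampling
# query_idx = np.argmax(entropy(proba))           # entropy
```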
We will now see the effect of active learning with these strategies on test data (containing 10000 samples). We will continue using the Random Forest Classifier model for this problem. We start with 50 samples as the initial training set and add 100 actively chosen samples over 100 iterations (one per iteration).
The above animation shows F1-scores for individual digits and the overall F1-score across all digits after each iteration. We can see that each strategy, except random sampling, tends to choose more samples of digit classes with a lower F1-score. Margin sampling performs best in terms of F1-score. Margin sampling and the least confident method easily outperform the random baseline, while the entropy method is comparable to it in this case. The figure below compares all strategies.
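The querying loop for such a simulation can be sketched as follows; margin sampling is shown, and names like `X_pool` and `y_pool` are hypothetical placeholders for the unlabeled pool and its oracle-provided labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def active_learning_loop(X_train, y_train, X_pool, y_pool, X_test, y_test, n_iter=100):
    f1_history = []
    for _ in range(n_iter):
        clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
        f1_history.append(f1_score(y_test, clf.predict(X_test), average="macro"))

        # Margin sampling: query the pool sample with the smallest top-two margin.
        sorted_proba = np.sort(clf.predict_proba(X_pool), axis=1)
        q = np.argmin(sorted_proba[:, -1] - sorted_proba[:, -2])

        # The "oracle" reveals the label; move the sample from the pool to the train set.
        X_train = np.vstack([X_train, X_pool[q:q + 1]])
        y_train = np.append(y_train, y_pool[q])
        X_pool = np.delete(X_pool, q, axis=0)
        y_pool = np.delete(y_pool, q)
    return f1_history
```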
Thus far, we have seen uncertainty sampling for classification tasks. Now, let us look at a regression example.
We will consider the sine curve dataset used earlier. We will fit a Gaussian Process Regressor model with a Matern kernel on 8 randomly chosen data points from the noisy sine curve. For regression tasks, the uncertainty measure is the predictive standard deviation or variance. In this example, we take the predictive variance as our measure of uncertainty.
As per the uncertainty criterion, we should label the samples with the highest predictive variance. Next, we compare uncertainty sampling with random sampling over ten iterations, also showing the next sample to query at each iteration.
Animation 2 compares uncertainty sampling with random sampling. We can observe that the samples chosen by uncertainty sampling are more informative to the model and ultimately reduce model uncertainty (variance) and RMSE faster than random sampling.
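One query step of this strategy is simply an argmax over the predictive standard deviation; a sketch, assuming the `gpr` model and data from the earlier snippet:

```python
import numpy as np

def query_most_uncertain(gpr, X_pool):
    # Choose the pool point where the GPR predictive standard deviation is largest.
    _, std = gpr.predict(X_pool, return_std=True)
    return int(np.argmax(std))

# X_pool = np.delete(X_all, train_idx, axis=0)
# q = query_most_uncertain(gpr, X_pool)
# The oracle labels that point; it is added to the train set and the GPR is refitted.
```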
Now, we will discuss the query by committee method.
The query by committee (QBC) approach involves creating a committee of two or more learners or models. Each learner votes on the samples in the pool set, and the samples on which the committee members disagree the most are considered for querying. For classification tasks, we can take the mode of the learners' votes, and in regression settings, we can take the average of their predictions. The central intuition behind QBC is to minimize the *version space*: initially, each model holds a different hypothesis, and these hypotheses converge as we query more samples. A sketch of one such disagreement measure is given below.
We can set up a committee for QBC using the following approaches:
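As a hedged sketch, one common way to quantify this disagreement for classification is vote entropy over the committee members' predicted labels (the committee construction itself is shown in the next section):

```python
import numpy as np

def vote_entropy(committee, X_pool):
    # Stack each member's predicted labels: shape (n_pool_samples, n_members).
    votes = np.stack([model.predict(X_pool) for model in committee], axis=1)
    n_members = votes.shape[1]
    scores = []
    for row in votes:
        _, counts = np.unique(row, return_counts=True)
        p = counts / n_members
        scores.append(-np.sum(p * np.log2(p)))  # higher = more disagreement
    return np.array(scores)

# query_idx = np.argmax(vote_entropy(committee, X_pool))
```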
We will explain the first approach (the same model with different hyperparameters) using an SVC (Support Vector Classifier) model with an RBF kernel on the Iris dataset.
We initially train the models on six samples and actively choose 30 samples from the pool set. We test the model's performance at each iteration on the same test set of 30 samples.
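A minimal sketch of this committee setup is shown below; the specific `C` and `gamma` values, the random seed, and the reuse of the `vote_entropy` helper from the previous sketch are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Initial train set of six samples (two per class, so every class is represented).
train_idx = np.concatenate(
    [rng.choice(np.where(y == c)[0], size=2, replace=False) for c in range(3)]
)
pool_idx = np.setdiff1d(np.arange(len(y)), train_idx)

# Same model (RBF-kernel SVC), different hyperparameters.
committee = [SVC(kernel="rbf", C=C, gamma=gamma)
             for C, gamma in [(0.1, "scale"), (1.0, "scale"), (10.0, 0.1)]]
for model in committee:
    model.fit(X[train_idx], y[train_idx])

# Query the pool point on which the committee disagrees the most (vote entropy from above).
# q = pool_idx[np.argmax(vote_entropy(committee, X[pool_idx]))]
```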
In Animation 3, the boundaries between differently colored regions are the decision boundaries. The points queried by the committee are the points where the learners disagree the most, which can be observed from the plot. Initially, all models learn different decision boundaries for the same data; over the iterations, they converge towards a similar hypothesis and thus start learning similar decision boundaries.
We now compare the overall F1-score of QBC with random sampling. QBC outperforms random sampling most of the time.
Thus far, we have seen various active learning strategies through examples. Now, let us compare the uncertainty sampling methods with query by committee (QBC).
We will use the MNIST dataset to demonstrate the performance of the various sampling techniques. For uncertainty sampling, we will use the Random Forest Classifier. For QBC, we use three different classifiers (Random Forest Classifier, Logistic Regression [14], and Support Vector Classifier). Animation 4 shows the simulation of active learning for 100 iterations, with the F1-score measured on the test set.
Query by committee performs better than uncertainty sampling here. The reason is that uncertainty sampling tends to be biased towards its single learner and may miss important examples that this estimator cannot see [1, 5]. QBC overcomes this problem by taking votes from different models on the same datapoints, or from the same model on different datapoints. Our case uses different models with the same datapoints.
So far in the article, we have queried only one sample at a time. However, we should also consider the time needed to retrain the model on the training set and evaluate it on the pool set after every query. Updating the model after each queried sample is ideal, as the informativeness of the remaining samples is recomputed without adding noise in the form of non-informative samples. Let us assume that each sample's annotation cost is constant, so we can ignore it here. We will now see the effect of selecting $K$ samples at once on model improvement and the overall time taken for the train-test process.
We will use the query by committee strategy to query 30 samples on the Iris dataset. The results shown below are averaged over 50 repetitions of each experiment with different train, validation, and test splits.
We can see that as $K$ increases, the time taken to complete the task decreases, but the macro-averaged F1-score also decreases. From this experiment, we can conclude that $K$ should be chosen to strike a good trade-off between time and performance.
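Selecting a batch is a small change to the query step: instead of a single argmax, we take the top $K$ scores in one shot. A sketch, under the assumption that `scores` holds the per-sample informativeness from any of the strategies above:

```python
import numpy as np

def query_top_k(scores, k):
    # Indices of the K highest-scoring (most informative) pool samples.
    return np.argsort(scores)[-k:]

# batch = query_top_k(vote_entropy(committee, X_pool), k=5)
# The oracle labels all K samples and the model is retrained once per batch; this saves
# retraining time, but the scores are not refreshed within the batch.
```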
There are a few more active learning techniques that we do not cover in detail in this article, but we describe them briefly here:
With this, we end our visual tour of active learning techniques.