Active Learning: A Visual Tour

Zeel B Patel, IIT Gandhinagar, patel_zeel@iitgn.ac.in

Nipun Batra, IIT Gandhinagar, nipun.batra@iitgn.ac.in


Rise of Supervised Learning

Today, machine learning (ML) is applied to numerous fields, including, but not limited to, Natural Language Processing (NLP), computer-aided diagnosis, optimization, and bioinformatics. A significant proportion of this success is due to a subset of ML called supervised learning. There are three main reasons behind the success of supervised learning (and of machine learning in general): 1) the availability of massive data; 2) better algorithms; and 3) powerful computational infrastructure, jointly called the AI Trinity. Supervised learning techniques require labeled data. As data turns into 'Big data,' the effort required to label it becomes increasingly laborious. In this article, we will talk about active learning, a suite of techniques for intelligent, data-driven annotation.

Data Annotation is Expensive

Labeled data is the primary requirement of supervised learning. Annotating data is expensive because it may require: i) excessive time and manual effort, or ii) costly sensors. Let us get an overview of how expensive labeling is with a few examples.

Speech Recognition

Let us say we need to convert speech or audio into text for an application such as subtitle generation (a speech-to-text task). We have to annotate multiple audio segments corresponding to the words and phrases in the audio. The following example audio of 6 seconds may require around a minute to annotate manually. Thus, annotating hours of publicly available audio data is an impractical task.

Similarly, researchers have made efforts to detect COVID-19 from the sound of the human cough [9]. Collecting labels for thousands of human cough samples is a difficult task.

Human Activity Recognition

Let us say we want to classify human activities into different categories (shown in Figure 2). We need expensive sensors to monitor the alignment or motion of human body parts for such tasks. In the end, we need to map the sensor data to the various activities with substantial manual effort.

All the Samples are Not Equally Important

We know that increasing the training (labeled) data increases model performance. However, not all samples contribute equally to improving the model. Let us understand this with a few examples.

SVC Says: Closer is Better

We will use synthetic two-class data generated from a bivariate normal distribution for this experiment. Now, we will train a Support Vector Classifier (SVC) [8] model on a subset of this dataset (5 data points) and visualize the decision boundary.
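To make the setup concrete, here is a minimal sketch of this experiment: two classes drawn from bivariate normal distributions, an SVC fit on a 5-point labeled subset, and the decision function used to find points near the boundary. The class means, covariances, and the particular 5-point subset are illustrative assumptions, not the exact values behind the figure.

```python
# Minimal sketch: synthetic two-class data and an SVC trained on 5 points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class = 100
X0 = rng.multivariate_normal(mean=[-2, 0], cov=np.eye(2), size=n_per_class)  # class 0
X1 = rng.multivariate_normal(mean=[2, 0], cov=np.eye(2), size=n_per_class)   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Train on a tiny labeled subset (5 points, both classes represented);
# the remaining points act as the unlabeled pool.
train_idx = np.array([0, 1, 2, n_per_class, n_per_class + 1])
svc = SVC(kernel="linear").fit(X[train_idx], y[train_idx])

# The decision function is ~0 on the boundary, so pool points with a small
# |decision value| (like B and D in the figure) lie in the confusion area.
pool_idx = np.setdiff1d(np.arange(len(X)), train_idx)
closeness = np.abs(svc.decision_function(X[pool_idx]))
print("pool point closest to the boundary:", pool_idx[np.argmin(closeness)])
```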

(All Figures in this article are interactive. Hover over plots to know more. Click on legend items to hide/show elements)

Support vectors help the SVC model distinguish between the classes. The model fit in the above diagram is not accurate because we have used a small number of train datapoints. Consider some potential train points A, B, C, and D from the unlabeled datapoints. Datapoints B and D are closer to the confusion area than A and C. Thus, B and D are more informative for improving the model if added to the train points. When the SVC model says, "closer is better," it means closer to the confusion area.

Confusion in Digit Classification

Let us consider the MNIST dataset (a well-known public dataset with labeled images of digits $0$ to $9$) for the classification task. We have shown a few examples here.

We will train the Support Vector Classifier (SVC) model on a few random samples of the MNIST dataset. Let us see what our model learns with a set of $50$ data points ($5$ samples for each class). We show the normalized confusion matrix over the test set containing $10000$ samples (Figure 4).
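A sketch of this experiment, assuming MNIST is fetched via scikit-learn's fetch_openml (which downloads the full dataset on first use) and the standard 60,000/10,000 train/test split:

```python
# Sketch: SVC trained on 5 samples per digit, evaluated with a normalized
# confusion matrix on the 10,000-sample MNIST test split.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)
X_train_full, y_train_full = X[:60000], y[:60000]
X_test, y_test = X[60000:] / 255.0, y[60000:]

# 5 random samples per class -> 50 training points.
rng = np.random.default_rng(0)
idx = np.concatenate(
    [rng.choice(np.where(y_train_full == d)[0], size=5, replace=False) for d in range(10)]
)
svc = SVC().fit(X_train_full[idx] / 255.0, y_train_full[idx])

# Rows are true digits, columns are predictions; each row is normalized to sum to 1.
cm = confusion_matrix(y_test, svc.predict(X_test), normalize="true")
print(np.round(cm, 2))
```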

We can see from the confusion matrix that some digits are confused more than others. For example, the digit '1' has the least confusion, while the digit '9' is confused with '7' and '4'. Some digits are difficult to distinguish from the model's perspective. Thus, we may need more training examples for the aforementioned digits to learn them correctly. Now, we will see a regression-based example.

GP Needs 'Good' Data Points

We will consider data from a sine curve with added noise. We take a few samples (8 samples) as the train points, a few as the potential train points, and the rest as the test points.

We will fit a GPR (Gaussian Process Regressor) [11] model with a Matern kernel to our dataset. GPR models additionally provide uncertainty estimates for their predictions. The predictive variance is a measure of the model's uncertainty about its predictions (the predictive mean).
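A minimal sketch of this setup using scikit-learn's GaussianProcessRegressor; the noise level, x-range, and the 8-point training subset are illustrative assumptions.

```python
# Sketch: GPR with a Matern kernel on a noisy sine curve, trained on 8 points.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
X_all = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y_all = np.sin(X_all).ravel() + rng.normal(0, 0.1, size=len(X_all))  # noisy sine data

# 8 randomly chosen training points; the rest act as pool/test points.
train_idx = rng.choice(len(X_all), size=8, replace=False)
gpr = GaussianProcessRegressor(kernel=Matern() + WhiteKernel(), normalize_y=True)
gpr.fit(X_all[train_idx], y_all[train_idx])

# Predictive mean and standard deviation; the (squared) std is the
# predictive variance used as the uncertainty measure.
mean, std = gpr.predict(X_all, return_std=True)
print("max / min predictive std:", std.max(), std.min())
```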

We can observe that the uncertainty (predictive variance) is higher at datapoints distant from the train points. Let us consider a set of datapoints A, B, C, and D to see if they are equally informative to the model.

We can say from RMSE and predictive variance that datapoints A and D are more informative to the model than B and C. Note that adding points to the train set is equivalent to annotating unlabeled data and using them for training. We can either have an intelligent way to choose these 'good' points or randomly choose some datapoints and label them. Active learning techniques can help us determine the 'good' datapoints, which are likely to improve our model. Now, we will discuss active learning techniques in detail.

The Basics of Active Learning

Wikipedia defines active learning as follows:

  • 'Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.'

The diagram below illustrates the general flow of active learning.

As shown in the flow diagram, the ML model selects a few samples from an unlabeled pool or distribution and gives them to the oracle (a human annotator or another data source) for labeling. These samples are chosen intelligently according to certain criteria. Thus, active learning is also called optimal experimental design.

Random Baseline

An ML model can randomly sample datapoints and send them to the oracle for labeling. Random sampling will also eventually capture the global distribution of the dataset in the train datapoints. However, active learning aims to improve the model with fewer labeled samples by intelligently selecting the datapoints for labeling. Thus, random sampling is an appropriate baseline to compare against active learning.

Different Scenarios for Active Learning

There are three main scenarios of active learning:

  1. Membership Query Synthesis [12]: In this scenario, the model has access to an underlying distribution of data points from which it can generate samples. The generated samples are sent to the oracle for labeling.
  2. Stream-Based Selective Sampling [13]: Here, we have a live stream of online data samples, and for each incoming sample the model can choose to query it or discard it based on some criteria. One possible criterion is to use an information measure or a query strategy to decide whether to query the incoming sample [2, 3]. Another way is to maintain several hypotheses that agree on the labeled dataset, forming a region called the version space [2, 4], but disagree on some unlabeled datapoints. Calculating the exact version space is expensive; thus, approximations and other methods are used in practice [2, 5, 6, 7].
  3. Pool-Based Sampling [8]: In this case, we already have a pool of unlabeled samples (we called them potential train points in the prior discussion). Based on some criteria, the model queries a few samples from this pool.

The pool-based sampling scenario is suitable for most real-world applications. Thus, we restrict our article to pool-based sampling only.

Pool-Based Sampling

We can query the datapoints from an unlabeled pool with the following methods:

  1. Uncertainty Sampling [8]: We query the samples based on the model's uncertainty about the predictions.
  2. Query by Committee [5]: In this approach, we create a committee of two or more models. The committee queries the samples on which its members' predictions disagree the most.

We will demonstrate each of the above strategies with examples in the subsequent sections.
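Independent of the query strategy, the pool-based loop itself has a simple structure. Here is a minimal sketch, where score_informativeness is a placeholder for either strategy above, oracle_label stands in for the human annotator, and LogisticRegression is just an example learner; none of these names come from a particular library.

```python
# A minimal pool-based active learning loop (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_train, y_train, X_pool, oracle_label,
                         score_informativeness, n_queries=10):
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        model.fit(X_train, y_train)
        # Score every pool point and query the most informative one.
        scores = score_informativeness(model, X_pool)
        q = int(np.argmax(scores))
        x_new, y_new = X_pool[q], oracle_label(X_pool[q])
        # Move the queried point from the unlabeled pool to the labeled train set.
        X_train = np.vstack([X_train, x_new])
        y_train = np.append(y_train, y_new)
        X_pool = np.delete(X_pool, q, axis=0)
    return model.fit(X_train, y_train)
```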

Uncertainty Sampling

There are different uncertainty sampling approaches for classification and regression tasks. We will go through them one by one with examples here.

Digit Classification with MNIST Dataset

We will fit a Random Forest Classifier model [10] (an ensemble model consisting of multiple Decision Tree Classifiers) on a few random samples (50 samples) of the MNIST dataset and visualize the predictions. We will then explain different ways to perform uncertainty sampling using these predictions.

Above are the model's predicted probabilities for a few random test samples. We can use different uncertainty strategies, as follows.

  1. Least confident [16]: In this method, we choose the samples for which the probability of the most probable class is the lowest. In the above example, the model is least confident about sample 1's most probable class, digit '1'. So, we will choose sample 1 among all for labeling using this approach.

  2. Margin sampling [17]: In this method, we choose the samples for which the difference between the probabilities of the most probable class and the second most probable class is the smallest. In the above example, sample 1 has the smallest margin; thus, we will choose sample 1 for labeling using this approach.

  3. Entropy [15]: Entropy can be calculated for $N$ classes using the following equation, where $P(x_i)$ is the predicted probability of the $i^{th}$ class. \begin{equation} H(X) = -\sum\limits_{i=1}^{N}P(x_i)\log_2 P(x_i) \end{equation} Entropy is likely to be higher if the probability is distributed over all classes. Thus, we can say that if the entropy is higher, the model is more confused among all classes. In the above example, sample 2 has the highest entropy in its predictions, so we can choose it for labeling. The sketch below computes all three scores.
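Here is a small sketch of the three scores, assuming we have a matrix of predicted class probabilities (for example, from predict_proba of the Random Forest Classifier); the function names are our own.

```python
# Sketch: the three uncertainty scores computed from predicted class probabilities.
import numpy as np

def least_confident(probs):
    # 1 - probability of the most probable class; larger = more uncertain.
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Difference between the top two class probabilities; smaller = more uncertain.
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs, eps=1e-12):
    # H(X) = -sum_i P(x_i) log2 P(x_i); larger = more uncertain.
    return -np.sum(probs * np.log2(probs + eps), axis=1)

probs = np.array([[0.40, 0.35, 0.25],   # spread out -> high entropy, low confidence
                  [0.90, 0.05, 0.05]])  # peaked -> low entropy, high confidence
print(least_confident(probs), margin(probs), entropy(probs))
```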

We will now see the effect of active learning with these strategies on the test data (containing 10000 samples). We will continue using the Random Forest Classifier model for this problem. We start with 50 samples as the initial train set and add 100 actively chosen samples over 100 iterations.

The above animation shows F1-scores for individual digits and the overall F1-score across all digits after each iteration. We can see that each strategy, except random sampling, tends to choose more samples of the digit classes having a lower F1-score. Margin sampling performs better than the other strategies in terms of F1-score. Margin sampling and the Least confident method easily outperform the random baseline. The entropy method, in this case, is comparable to the random baseline. The figure below shows a comparison of all strategies.

Thus far, we have seen uncertainty sampling for classification tasks. Now, we will look at a regression example to understand uncertainty sampling further.

Regression on Noisy Sine Curve

We will consider the sine curve dataset we used in the earlier discussion. We will fit a Gaussian Process Regressor model with a Matern kernel on 8 randomly chosen data points from the noisy sine curve dataset. The uncertainty measure for regression tasks is the predictive standard deviation or the predictive variance. In this example, we will take the predictive variance as our measure of uncertainty.

As per the uncertainty criterion, we should label the samples with higher predictive variance. Now, we will show a comparison of uncertainty sampling with random sampling over ten iterations. We also show the next sample to query at each iteration.
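A sketch of one uncertainty-sampling iteration for this regression setting, written as a function so that the assumed inputs (a fitted gpr model and train/pool arrays, as in the GPR sketch above) are explicit:

```python
import numpy as np

def uncertainty_query_step(gpr, X_train, y_train, X_pool, y_pool):
    # One uncertainty-sampling iteration: query the pool point with the largest
    # predictive standard deviation, label it, and retrain.
    _, std = gpr.predict(X_pool, return_std=True)
    q = int(np.argmax(std))                      # most uncertain pool point
    X_train = np.vstack([X_train, X_pool[q]])
    y_train = np.append(y_train, y_pool[q])      # oracle's label for the queried point
    X_pool = np.delete(X_pool, q, axis=0)
    y_pool = np.delete(y_pool, q)
    gpr.fit(X_train, y_train)                    # retrain on the enlarged train set
    return gpr, X_train, y_train, X_pool, y_pool
```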

Animation 2 demonstrates a comparison between uncertainty sampling and random sampling. We can observe that uncertainty sampling-based samples are more informative to the model and ultimately help reduce model uncertainty (variance) and RMSE compared to random sampling.

Now, we will discuss the query by committee method.

Query by Committee (QBC)

The query by committee approach involves creating a committee of two or more learners or models. Each of the learners can vote on samples in the pool set. The samples on which the committee members disagree the most are considered for querying. For classification tasks, we can take the mode of the votes from all learners, and in regression settings, we can take the average of the learners' predictions. The central intuition behind QBC is to minimize the *version space*. Initially, each model holds a different hypothesis; these hypotheses converge as we query more samples.

We can set up a committee for QBC using the following approaches:

  1. Same model with different hyperparameters
  2. Same model with different segments of the dataset
  3. Different models with the same dataset

Classification on Iris Dataset

We will explain the first approach (same model with different hyperparameters) using an SVC (Support Vector Classifier) model with an RBF kernel on the Iris dataset.

We initially train the model on six samples and actively choose 30 samples from the pool set. We will test the model's performance at each iteration on the same test set of 30 samples.
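Here is a minimal sketch of such a committee: three RBF-kernel SVCs that differ only in their hyperparameters (the C and gamma values are illustrative), with vote entropy as the disagreement measure for choosing the next query.

```python
# Sketch: query by committee on Iris with same-model, different-hyperparameter SVCs.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
labeled = np.concatenate([np.where(y == c)[0][:2] for c in np.unique(y)])  # 6 initial points
pool = np.setdiff1d(np.arange(len(X)), labeled)                            # unlabeled pool

# Committee members: same SVC model, different (illustrative) hyperparameters.
committee = [SVC(kernel="rbf", C=C, gamma=gamma)
             for C, gamma in [(0.1, 1.0), (1.0, 0.1), (10.0, 0.01)]]
for model in committee:
    model.fit(X[labeled], y[labeled])

# Vote entropy: entropy of the distribution of committee votes per pool point.
votes = np.stack([model.predict(X[pool]) for model in committee])          # (members, pool)
vote_fracs = np.stack([(votes == c).mean(axis=0) for c in np.unique(y)], axis=1)
vote_entropy = -np.sum(vote_fracs * np.log2(vote_fracs + 1e-12), axis=1)

query_idx = pool[np.argmax(vote_entropy)]   # pool point with the most disagreement
print("next point to query:", query_idx)
```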

The separation boundaries between the different colors in Animation 3 are the decision boundaries. The points queried by the committee are the points on which the learners disagree the most, as can be observed from the plot. We can see that initially, all the models learn different decision boundaries for the same data. Iteratively, they converge to a similar hypothesis and thus start learning similar decision boundaries.

We now show a comparison of the overall F1-score between random sampling and QBC. QBC outperforms random sampling most of the time.

Comparison between Uncertainty Sampling and QBC

Thus far, we have seen and understood various active learning strategies by examples. Now, let us compare the uncertainty sampling methods and query by committee (QBC).

We will use the MNIST dataset to demonstrate the performance of the various sampling techniques. For uncertainty sampling, we will use the Random Forest Classifier. For QBC, let us use three different classifiers (Random Forest Classifier, Logistic Regression [14], and Support Vector Classifier). Animation 4 shows the simulation of active learning for 100 iterations along with the F1-score on the test set.

Query by committee performs better than uncertainty sampling. The reason is that uncertainty sampling tends to be biased towards the actual learner, and it may miss important examples that are not in sight of that estimator [1, 5]. QBC overcomes this problem by taking votes from different models on the same datapoints or from the same model trained on different datapoints. Our case was different models with the same datapoints.

How many samples to query at once?

So far in this article, we have queried only one sample at a time. We should also consider the time required to retrain the model on the train set and evaluate it on the pool set after each query. Indeed, updating the model after each queried sample is ideal, as the informativeness of the samples is updated without adding noise in the form of non-informative samples. Let us assume that each sample's annotation cost is constant, so we can ignore it here. We will now see the effect of selecting $K$ samples at once on model improvement and the overall time taken for the train-test process.
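As a sketch, selecting $K$ samples at once simply means taking the top-K pool points by informativeness score in a single shot before retraining; the scores below are illustrative and could come from any of the strategies above.

```python
# Sketch: batch querying -- pick the K most informative pool points at once.
import numpy as np

def query_top_k(scores, K):
    # Indices of the K highest-scoring pool points (order within the batch is arbitrary).
    return np.argpartition(scores, -K)[-K:]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])   # illustrative informativeness scores
print(query_top_k(scores, K=2))                 # -> indices of the two best points
```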

We will use the query by committee strategy to query 30 samples on the Iris dataset. The mean results of repeating each experiment 50 times with different train, validation, and test splits are shown below.

We can see that, as $K$ increases, the time taken to complete the task decreases. However, the macro-averaged F1-score also decreases with $K$. From this experiment, we can conclude that a good trade-off between time and $K$ should be chosen to achieve optimal results.

Few More Active Learning Strategies

There are a few more active learning techniques that we are not covering in this article, but we describe them briefly here:

  1. Expected model change: Selecting the samples that would cause the most significant change in the model.
  2. Expected error reduction: Selecting the samples likely to reduce the generalization error of the model.
  3. Variance reduction: Selecting samples that may help reduce output variance.

With this, we end our visual tour of active learning techniques.

References

  1. Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
  2. Danka, Tivadar, and Peter Horvath. "modAL: A modular active learning framework for Python." arXiv preprint arXiv:1805.00979, 2018.
  3. I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the International Conference on Machine Learning (ICML), pages 150–157. Morgan Kaufmann, 1995.
  4. T. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.
  5. H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the ACM Workshop on Computational Learning Theory, pages 287–294, 1992.
  6. D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  7. S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 353–360. MIT Press, 2008.
  8. D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. ACM/Springer, 1994.
  9. Imran, Ali, et al. "AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app." arXiv preprint arXiv:2004.01275, 2020.
  10. Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
  11. Carl Edward Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press, 2006.
  12. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
  13. D. Cohn, L. Atlas, R. Ladner, M. El-Sharkawi, R. Marks II, M. Aggoune, and D. Park. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann, 1990.
  14. Yu, Hsiang-Fu, Fang-Lan Huang, and Chih-Jen Lin. "Dual coordinate descent methods for logistic regression and maximum entropy models." Machine Learning 85.1-2 (2011): 41-75.
  15. C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423,623–656, 1948.
  16. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 148–156. Morgan Kaufmann, 1994.
  17. T. Scheffer, C. Decomain, and S. Wrobel. Active hidden Markov models for information extraction. In Proceedings of the International Conference on Advances in Intelligent Data Analysis (CAIDA), pages 309–318. Springer-Verlag, 2001.