What is Active Learning and Why Should You Care?

 

 

 

Disclaimer

This article is a gentle (and almost math-free) introduction to Active Learning. The point is to give you an overview of what Active Learning is, why it is interesting, where it comes from, why it is super exciting, and when it can (or cannot) make your life easier. Hopefully, you will leave this page with at least some of these questions answered.

Also, I could not possibly cover everything there is to say, so here is a bit of a disclaimer: this really is just an overview of the big picture (hence why it is almost math-free). If you want to get into the practical details, I kindly direct you to the references cited throughout this blog.

Lastly, note that the term Active Learning is also used in the field of education and describes a methodology of teaching. Despite the shared name and some overlap in the general motivations, Active Learning in education and Active Learning in machine learning are completely unrelated.

 

So, what is Active Learning?

One thing I learned when I left academia after my PhD is that Active Learning is not a well-known topic outside of the world of Machine Learning research. It may be obvious to some, but the first thing to clear up is that it is not an umbrella term for any kind of Human-in-the-Loop interactive procedure. Active Learning is a well-defined sub-field of Machine Learning that studies how a learning algorithm may interact with some kind of Oracle mechanism. Ironically, it is often illustrated with a Human-in-the-Loop scheme because that makes the idea of an Oracle intuitive to understand. It is, however, a mostly theoretical field of research, and having a human as part of its operating loop is not always the best idea, as we will see later.

Asymmetry of Label Availability

Active Learning tries to address one of the most crippling flaws of supervised learning, namely the asymmetry of cost between collecting an (unlabeled) datapoint and labelling it. Supervised learning is built on the idea that a training set of fully labelled points is available. However, oftentimes collecting a datapoint is not the same as labelling it. A quintessential example of this is image classification. Collecting an unlabeled dataset is as simple as running a script to scrape all the images you can find on the internet. Labelling a dataset, however, typically requires a human eye and is a task with a much more limited throughput. In cases like this, the efficiency of the entire supervised learning pipeline (from gathering data to inference) is bottlenecked by your labelling throughput: you are only using as much data as you can label.

Semi-supervised learning is built on the idea that those unlabeled datapoints might have some use in learning the distribution of labels. In many regards, semi-supervised learning can be thought of as supervised learning augmented with additional, unlabeled, datapoints. In contrast, Active Learning looks at the problem as a budget-allocation task: given a large pool of unlabeled data, which ones are the most interesting to spend time labelling? This can be tackled as a one-shot task where all labels are requested at the same time (so-called batch Active Learning) or an iterative one (regular Active Learning); in this article we will focus on the latter, as it is what people usually refer to when they speak of Active Learning.

Why is it a good idea?

For Active Learning to make sense, it requires that some datapoints are more valuable to label than others. If this is not the case, it would mean that any label is equally useful, and it would therefore make no sense to carefully choose which one to obtain.  

The first thing to understand is that the value of labelling a datapoint is a function of both the type of model you are trying to learn and the labels you have already acquired.

There is a subtle distinction to make between model, hypothesis and learning algorithm. Roughly speaking, a model is the general shell of what your solution should look like (it may be a linear model, a deep learning model, or a random forest model); a hypothesis is a version of your model trained on some data; finally, a learning algorithm is the process you use to construct a hypothesis from a dataset. For instance, in the case of image classification using a deep neural network, your model refers to your network architecture (usually some kind of CNN (Convolutional Neural Network)), your learning algorithm is likely a variation of the gradient descent algorithm (ADAM, SGD, etc.), and your hypothesis is what is referred to as the “trained model”. It is also important to note that your trained model is only one hypothesis, the one you think is best; any setting of parameters (weights, in the case of a neural network) over your model defines a different hypothesis.
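To make the distinction concrete, here is a minimal sketch (an assumption on my part, using scikit-learn and the kind of gaussian-kernel SVM that appears in the figures below rather than a deep network):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # A toy labeled dataset standing in for L (purely illustrative).
    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    # The *model*: the general shell of the solution, here an SVM with a gaussian kernel.
    model = SVC(kernel="rbf", C=1.0, gamma="scale")

    # The *learning algorithm*: the procedure turning a dataset into a hypothesis
    # (for an SVM this is the underlying solver; for a neural network it would be
    # a variant of gradient descent such as SGD or ADAM).
    hypothesis = model.fit(X, y)          # the fitted estimator is one hypothesis

    # Any other parameter setting over the same model defines a *different* hypothesis,
    # e.g. the estimator obtained by training on only part of the data.
    other_hypothesis = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:50], y[:50])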

These distinctions are important for Active Learning because the way it works is by iteratively finding new hypotheses by collecting more labels. Namely, for a label to be considered useful it needs to do two things: 

  • Provide new information about the distribution of labels
  • Contradict your current best hypothesis

Those are the conditions under which acquiring a new label will lead to an improvement of your current best hypothesis. From an information theoretic point of view, a change of hypothesis means we had to update our beliefs, which in turn means the label we just received was carrying meaningful information. In other words, it was valuable.  

In all three figures, we look at how valuable the label (red) of the point in the green box is. The purple line represents the best hypothesis that is consistent with all the datapoints except the one in the green box.

The label provides new information on the distribution of red points, but our hypothesis was already correct.
The label contradicts our current hypothesis but stays consistent with what we observed and thus would only lead to a minor adjustment.
The label both contradicts our hypothesis and provides new information on the distribution of red points; this label leads to the biggest update in our hypothesis.

 

A caveat to this is that, in general, labels are not known in advance. One must weigh how useful a datapoint’s label could potentially be against how likely it is for the datapoint to have a non-informative label. In practice, a good Active Learning algorithm can judge how informative a label is likely to be.

If the green boxed point is labeled blue, it is valuable information, but it is also very unlikely given its proximity to the red cluster.

 

How to: Writing an Active Learning Algorithm

Greek Mythology, Oracle and Query

The mathematical framework for devising and studying Active Learning algorithms revolves around two central ideas: interactivity and expensive labelling. The latter is the very raison d’être of Active Learning; the former comes from the fact that there is no way to “carefully pick which datapoint to label” without some interactivity between your labeling process and your learning algorithm (otherwise, the best you can do is to randomly select the datapoints to label and run a supervised learning algorithm). In other words, an Active Learning algorithm is a procedure to iteratively acquire labels as efficiently as possible.  

The parlance among researchers has typically been to use the joint metaphors of Oracle and Queries. An Oracle is an all-knowing entity that can label any datapoint. The idea behind the Oracle, much like its Greek mythology counterpart, is that the inner workings of the Oracle are unknown and irrelevant to the problem. That is, what matters is that the Oracle is always right and that it is costly to ask the Oracle for an answer. A query refers to the act of asking the Oracle for a label.  

The Active Learning loop as pictured in Burr Settles’ seminal review of the field, “Active Learning Literature Survey”.

 

A pseudocode example

An Active Learning algorithm starts with an unlabeled dataset 𝒰, an empty labeled dataset ℒ, and a supervised learning algorithm 𝒜. For all intents and purposes 𝒰 is as big as we need it to be, and ℒ can be initialized with a few labeled datapoints picked randomly from 𝒰. 𝒜 is usually part of the problem definition because Active Learning is not so much interested in learning from a labeled dataset as in querying the most relevant datapoints to label. However, most Active Learning algorithms are defined with a particular class of learning algorithm in mind.

The core of an Active Learning algorithm is its query function. That is, the function that will select which point in 𝒰 should be presented to the Oracle next. Everything around the query function is usually very standard and only serves the purpose of articulating the query function and the Oracle together.  

Here is a minimal example of what an Active Learning algorithm might be:

    1. Start with an unlabeled dataset 𝒰, labeled dataset ℒ, Oracle 𝒪, learning algorithm 𝒜 and stopping condition C
    2. Let x = Query(𝒰, ℒ, 𝒜) be our query
    3. Get the label y = 𝒪(x) from the Oracle
    4. Update ℒ = ℒ ∪ {(x, y)}
    5. Remove x from the unlabeled dataset: 𝒰 = 𝒰 ∖ {x}
    6. If C is met, return 𝒜(ℒ); otherwise go to 2.

Where C could be budget dependent (for example, a limit on the size of ℒ) or a convergence condition (e.g., until the performance of 𝒜(ℒ) does not significantly improve from one loop iteration to the next).
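To make the loop concrete, here is a minimal Python sketch of it. Everything in it is an assumption for illustration: the oracle, learn and query callables are placeholders to be supplied, datapoints are assumed to be plain tuples of features, and the stopping condition C is reduced to a simple cap on the size of ℒ.

    def active_learning_loop(unlabeled, labeled, oracle, learn, query, budget):
        # unlabeled : the pool U of unlabeled datapoints (a list of feature tuples)
        # labeled   : the labeled dataset L, seeded with a few (x, y) pairs picked at random from U
        # oracle    : callable returning the true label of a datapoint (the Oracle O)
        # learn     : supervised learning algorithm A, mapping a labeled set to a hypothesis
        # query     : query function selecting the next datapoint to label
        # budget    : stopping condition C, here a simple cap on the size of L
        while len(labeled) < budget and unlabeled:    # step 6: stopping condition C
            x = query(unlabeled, labeled, learn)      # step 2: pick the next point to label
            y = oracle(x)                             # step 3: ask the Oracle for its label
            labeled.append((x, y))                    # step 4: L = L u {(x, y)}
            unlabeled.remove(x)                       # step 5: U = U \ {x}
        return learn(labeled)                         # train A on all the labels gathered

Note that, as described above, ℒ is assumed to be seeded with a few randomly labeled points so that the very first call to the query function has something to train on.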

For the query function, a rough pseudocode outline could look like this: 

    1. Start with an unlabeled dataset 𝒰, labeled dataset ℒ and learning algorithm 𝒜
    2. Let h = 𝒜(ℒ) be our best hypothesis
    3. For each point x in 𝒰, compute how confident h is in its predicted label h(x)
    4. Return the point x ∈ 𝒰 with the lowest label confidence from step 3.
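Here is a hedged sketch of such a query function, under the assumption that the hypothesis returned by 𝒜 is a margin classifier (such as the SVMs used in the figures below), in which case “confidence” can be taken as the distance of a point to the decision boundary:

    import numpy as np

    def uncertainty_query(unlabeled, labeled, learn):
        # Step 2: train the current best hypothesis h = A(L) on the labels gathered so far.
        h = learn(labeled)
        # Step 3: confidence of h in its prediction for every unlabeled point. For a
        # margin classifier such as an SVM, a natural proxy is the distance of the
        # point to the decision boundary, i.e. |decision_function(x)|.
        confidence = np.abs(h.decision_function(np.array(unlabeled)))
        # Step 4: return the point the hypothesis is least confident about,
        # i.e. the one closest to the decision boundary.
        return unlabeled[int(np.argmin(confidence))]

With scikit-learn, learn could simply fit an SVC with a gaussian kernel on ℒ; plugging uncertainty_query into the loop sketched above yields the least-confidence behavior visible in the animation below, where queries concentrate near the decision boundary.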

A binarized version of the Fashion-MNIST dataset restricted to two classes: “sneaker” and “ankle boot”. For visualization purposes, the data are projected from 784 dimensions down to 2 dimensions using PCA (Principal Component Analysis). The vignettes represent the original 28×28 pixel greyscale image associated with each datapoint. They are shown here for purely visual purposes and are never made available to the learning algorithm. The red and blue boxes around each vignette correspond to the class of each datapoint, where red is “sneaker” and blue is “ankle boot”. The purple line represents the decision boundary of an SVM (Support Vector Machine) trained on the datapoints shown in the figure with a gaussian kernel and hyper-parameters σ = 0.165 and C = 1.84. Those values have been obtained by running a quick 10-fold cross-validated grid search. The decision boundary depicted here achieves an accuracy score of 0.89 on a withheld test set.
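For readers who would like to reproduce a similar baseline, here is a hedged sketch of that setup with scikit-learn. The OpenML dataset name, the class indices (7 for “sneaker”, 9 for “ankle boot”), the subsample size and the hyper-parameter grid are all my assumptions, and note that scikit-learn parameterizes the gaussian kernel with gamma rather than σ.

    from sklearn.datasets import fetch_openml
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Fetch Fashion-MNIST from OpenML (dataset name is an assumption) and keep only
    # the two classes of interest; OpenML returns the labels as strings.
    X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
    mask = (y == "7") | (y == "9")            # 7 = "sneaker", 9 = "ankle boot"
    X, y = X[mask][:200] / 255.0, y[mask][:200]   # small subsample to keep the sketch fast

    # Project the 784-dimensional images down to 2 dimensions, as in the figure.
    X2 = PCA(n_components=2).fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=0)

    # Quick 10-fold cross-validated grid search over an RBF-kernel SVM.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1, 10]},
                        cv=10)
    grid.fit(X_train, y_train)

    print(grid.best_params_, grid.score(X_test, y_test))   # accuracy on a withheld test set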

 

Visualization of the successive queries made by an Active Learning algorithm (Active-SVM). The queried datapoints are boxed according to their labels (red for “sneaker”, blue for “ankle boot”), representing the fact that their label is known to the learning algorithm; all the other datapoints are unlabeled. The animation shows which datapoints are queried by the Active Learning algorithm and how each query affects the current best hypothesis (purple decision boundary), obtained by running a supervised SVM on the available (i.e., queried) labels. In this example, we simulate a perfect Oracle by simply revealing the true label of each queried datapoint. Note how the queried points are always the ones closest to the decision boundary; this is expected behavior, well described by the mathematical theory behind Active Learning.

A note on supervised learning

From the pseudocode above, you may notice that we are running a supervised algorithm 𝒜 at the end of the main loop (step 6). In fact, Active Learning can be thought of as a way to efficiently gather labeled data for a supervised learning algorithm down the line.  

Thinking of supervised learning in this context gives a very intuitive way to measure the gains of using an Active Learning scheme. Because supervised learning assumes the training set is sampled (and labelled) at random from the underlying data distribution, it is equivalent to running a simplistic query function that returns a point at random in 𝒰.
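In code, that baseline is a one-line query function (a sketch, matching the loop signature used earlier):

    import random

    def random_query(unlabeled, labeled, learn):
        # Ignores the current labels and hypothesis entirely: this is the
        # "collect a dataset at random, then run supervised learning" baseline.
        return random.choice(unlabeled)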

Intuitively we can see how supervised learning would increasingly waste its labelling budget as the number of datapoints satisfying the two “valuable labels” conditions mentioned in the previous section decreases. Active Learning improves supervised learning in that it tries to pinpoint the general area where labels are the most likely to satisfy those two conditions.  

Does it really matter?

Consider this: supervised learning is equivalent to randomly picking which datapoint to label. It means that for a supervised learning algorithm to perform well, it requires a good coverage of your data distribution by labeled points (see covering number and VC dimension for details). The gist of it is that you need to acquire labels randomly until almost every datapoint is either labeled or close to a labeled datapoint. This is because labeling datapoints at random prevents you from taking into account the labels and best hypothesis you already have.

A comparison between a hypothesis from an SVM trained on the full dataset (left) and one trained with the Active-SVM algorithm (right).
Both hypotheses share the same hyper-parameters and achieve the same accuracy of 0.89 on a withheld test set. However, the one trained on the full dataset (left) uses all 106 available labels, whereas the Active Learning one (right) achieves the same performance after only 22 queries.

 

In contrast, Active Learning does precisely that. It tries to identify the datapoints for which your best hypothesis is the most uncertain and asks for their labels specifically. Note that what uncertainty means in a mathematical context falls outside the scope of this article; it is a delicate topic to tackle when dealing with black-box models like we do in machine learning. The big picture is that Active Learning acts, in some fashion, as a bisection method: it iteratively halves the number of unlabeled datapoints that are “uncertain” given your current best hypothesis. Compared to the uniform covering approach of supervised learning (that is, random labeling) this is highly effective. In fact, for a supervised learning model trained with N labeled datapoints, it is known that an Active Learning model with comparable performance can be obtained with only log₂(N) labeled datapoints on average. That is an exponential reduction in labelling budget!
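To put a rough number on that claim (an idealized back-of-the-envelope figure, not a guarantee):

    N = 1,000,000 labeled datapoints (supervised)  →  log₂(10⁶) ≈ 20 well-chosen queries (Active Learning)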

Comparison between the test accuracy achieved by the Active-SVM query scheme (blue) and a random query scheme (red) where the Oracle is queried with a random unlabeled datapoint. Note that the random query scheme is representative of how one would spend a fixed labelling budget when collecting a dataset for supervised learning. The Active Learning query scheme consistently outperforms the random query scheme, with the first few queries leading to better performance gains and maximum accuracy being reached after 25 queries instead of 65 for the random query scheme. The reported accuracy is an average of 100 trials with randomized train-test splits over the modified Fashion-MNIST dataset discussed above. The blue and red bands represent the standard deviation.

 

Why isn’t everybody using Active Learning?

There are several reasons. Mainly, it has to do with the fact that Active Learning is not a very practical setting. Active Learning is restrictive in that it needs to be integrated into your data collection pipeline, which means you need to know what to do with your data while collecting it. In contrast, Supervised Learning decouples data collection from data exploitation, thus allowing delayed processing and/or processing of pre-collected data. Decoupling also means you can collect data with minimal knowledge of what to do with it. With Active Learning, you are pretty much required to have your data pipeline fully integrated from the get-go.

Another, more superficial, reason comes from the textbook example of Active Learning where the Oracle is a human manually labelling your data. This is impractical for several reasons: humans are prone to error, sensitive to repetitive tasks, and subject to tiredness. Typical Active Learning settings fail to account for any of those things, let alone all of them at the same time. Try to interact with an active learning algorithm for more than 10 minutes and you will realize something: while it may make sense from a mathematical point of view, the queries it makes are redundant and feel unintuitive from a human perspective. Ultimately, this has to do with the problem of interpretability in machine learning; a problem that is very much open and one of the main blockers of a wider adoption of machine learning. If you do not understand your model, you do not understand why it makes those specific queries, and it quickly becomes frustrating.  

The first 28 queries made by the Active-SVM algorithm on the binarized Fashion-MNIST dataset. The first few queries show a lot of variability in class and shape, whereas the last row focuses almost entirely on a single type of shoe with a lot of (to a human eye) similar images. Note that, as opposed to previous figures, these queries have been obtained by running the Active-SVM algorithm on the Fashion-MNIST dataset in its original dimensionality of 28×28 pixels to preserve the visual characteristics of each datapoint.

Lastly, Deep Neural Networks, which are by far the most well-known and successful Machine Learning models today, are ill-suited to Active Learning. That is, the state of the art in Active Learning has yet to lay the theoretical groundwork for how it can work with neural networks, despite some promising recent advances.

What is a good Active Learning Setting?

It really depends! But one thing that will make me very strongly consider using Active Learning is having a simple data collection pipeline with a costly, automated labeling step. I want to particularly stress the automated part. Active Learning works best when your Oracle is consistent. Even if you are dealing with an imperfect Oracle, if you can characterize its quirks and reasonably predict its behavior, it will make your life much easier when working with Active Learning.

Another important thing to consider is how easily you can collect unlabeled data. Usually there are two cases where acquiring proper unlabeled data might be difficult: (i) collecting data is not cheap, or (ii) collecting independently and identically distributed (i.i.d.) data is not feasible. The former is self-explanatory: if you cannot collect more unlabeled data than what you can label, Supervised Learning would be the way to go. The second case is a bit less intuitive. Imagine you have an agent playing a game. Letting your agent play the game for days or even months is cheap, but the data you obtain is conditioned on the starting conditions of your experiment (where your agent was, what the state of the world was, etc.). To collect i.i.d. data you would need to reset your experiment regularly and uniformly test all the relevant starting conditions. This is a typical case of hidden collection costs: collecting the data is easy, but ensuring it is statistically meaningful is costly. For cases like this, you are probably better off exploring solutions like reinforcement learning or, more generally, TD-learning.

Does such a setting exist in practice?

What comes to mind is any situation where a reliable labelling solution exists, but with a throughput that is too limited to be used in supervised learning. Some great examples come from physics, where measurement facilities (like a telescope or a particle accelerator) are prohibitively costly to use for mass labeling, or from computationally intensive problems. There is value in learning a statistical model to approximate the results of such problems, if only to decide when to run the full, and expensive, experiment/simulation and when not to. Yet, acquiring the necessary labels to use a supervised learning algorithm may take months of observations and/or computation. This, however, could be reduced to days by using Active Learning. Moreover, the labelling process (solving a complex problem, running some experimental measurement, etc.) may be slow, but it is reliable. In short, it fits the requirements for a good Oracle.

Some of the more Computer Science oriented readers may note that using a statistical model to approximately find an answer to a complex problem sounds a lot like a heuristic, and they would be perfectly right. Active Learning shines precisely in situations where using a heuristic is desirable. In that sense, it is quite natural to use Active Learning as a way to automatically learn robust heuristics. One may even start with a manually defined heuristic (set by some experts or according to some general principles) and refine it by using Active Learning to learn the error said heuristic would make. This may be less exciting than having a human Oracle, but it is certainly more practical. 

A Practical Example with Video Games

The paper “Software Testing by Active Learning for Commercial Games” uses Active Learning to build a heuristic of the probability of scoring a goal in FIFA from a given shooting position. Each position on the field can be seen as a datapoint for which the outcome of the shot (goal or not) is the label. Because the game’s physics simulation is capped at 60 frames per second (fps), generating and labeling a dataset for this is very time consuming.

The authors of the paper used Active Learning to limit the number of simulations they needed. An interesting by-product of their approach is that looking at the queries made by their Active Learning algorithm helped their designers understand where the ideal goal conditions were on the field (the so-called “sweet-spots” in the paper). This helped with balancing the game by using machine learning to identify “sweet-spots” after each design iteration; thanks to Active Learning, this was no longer a prohibitively lengthy process.

In conclusion

That wraps up our quick, almost math-free, overview of what Active Learning is. There is still a ton to talk about, but hopefully this will give you an idea of what Active Learning is all about! There are a lot of variations around Active Learning, and around Oracles especially. There are lazy Oracles (which may decide not to answer), noisy Oracles (which make mistakes), Oracles with variable costs, etc., each with its own theory on how to make the best queries and what the optimal strategies are. I tried to stay as general as possible here, but the world of Active Learning is vast and each setting comes with its own twists.

A follow-up topic I may tackle in a hypothetical future article is how query functions are intricately related to the concept of uncertainty. Interestingly, this very idea eventually loops back to the foundations of machine learning, information theory and compression schemes specifically. One of the most famous Active Learning papers specifically explores this connection and how one of the most studied machine learning models (the Support Vector Machine) naturally yields an almost optimal query function for Active Learning.

 

Author

Ugo Louche joined Eidos-Montréal in 2019 as a Machine Learning Specialist. He obtained his PhD in Machine Learning in 2016 under the supervision of Pr. Liva Ralaivola, where his research activities were mostly focused on Active Learning and Linear Classification. Ugo believes that Machine Learning has the potential to change how we make games, for the better, by allowing developers to focus on creative tasks and relegating menial work to AIs. He is committed to making this change happen, and his work at Eidos-Montréal focuses on all aspects of Machine Learning that can help with game development.
