By default, data is presented to a neural network in a random order. Curriculum learning and anti-curriculum learning instead propose ordering the examples by difficulty: curriculum learning presents easier examples earlier, whereas anti-curriculum learning presents harder examples earlier. This paper performs an empirical study of these ordered learning techniques on an image classification task and concludes that:
This paper may be interesting to you if you:
Although the idea behind curriculum learning and anti-curriculum learning is simple, there are many design choices that result in different curricula. We can define a curriculum through 3 components: a scoring function, a pacing function, and an order.
Before training, each example in the dataset is assigned a score by the scoring function. During training, at each step $t$, the pacing function $g$ determines the size of the dataset. Depending on the order ("curriculum" or "anti-curriculum"), the dataset at step $t$ consists of the $g(t)$ lowest- or highest-scored examples, respectively. A "random" ordering is also allowed, to serve as a baseline.
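As a minimal sketch (mine, not the paper's code), the per-step subset selection might look like this, assuming a precomputed `scores` array where a higher score means a harder example:

```python
import numpy as np

def select_subset(scores, g_t, order):
    """Return the indices of the g_t examples used at the current step.

    scores: per-example difficulty scores (higher = harder), precomputed
    g_t:    dataset size prescribed by the pacing function at step t
    order:  "curriculum", "anti-curriculum", or "random"
    """
    if order == "random":
        return np.random.choice(len(scores), size=g_t, replace=False)
    ranked = np.argsort(scores)          # easiest (lowest score) first
    if order == "anti-curriculum":
        ranked = ranked[::-1]            # hardest first
    return ranked[:g_t]                  # the g(t) easiest/hardest examples
```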
For the scoring function, the paper chooses the c-score by Jiang et al., 2020, which quantifies how well a model can predict an example's label when trained on the dataset without that example. Other ways to score an example would be to use its loss, or the index of the epoch at which the model first predicts it correctly. However, experiments show that these 3 scoring functions are highly correlated on both VGG-11 and ResNet-18, so only the c-score is used.
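For instance, the loss-based alternative could be sketched as follows; this is my illustration, not the c-score computation from Jiang et al., 2020, and it assumes an already-trained classifier and a non-shuffled DataLoader:

```python
import torch

def loss_scores(model, loader, device="cpu"):
    """Score each example by its loss under a trained model."""
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            scores.append(criterion(logits, y.to(device)).cpu())
    return torch.cat(scores)  # higher loss ~ harder example
```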
There are infinitely many valid pacing functions, as all we need is a monotonically non-decreasing function. This paper experiments with 6 families of pacing functions: logarithmic, exponential, step, linear, quadratic, and root. Each family has two important parameters: the fraction of training steps before the full dataset is used ($a$) and the fraction of the dataset used at the beginning of training ($b$).
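For example, the linear and root families might look like the sketch below, where $T$ is the total number of training steps and $N$ the dataset size; the parameter defaults and exact functional forms are illustrative guesses based on the description above, not the paper's implementation:

```python
def linear_pacing(t, T, N, a=0.8, b=0.2):
    """Linear pacing: start from a fraction b of the N examples and grow
    linearly until the full dataset is reached at step a*T."""
    if t >= a * T:
        return N
    return int(N * (b + (1 - b) * t / (a * T)))

def root_pacing(t, T, N, a=0.8, b=0.2):
    """Root pacing: same endpoints, but grows quickly early in training."""
    if t >= a * T:
        return N
    return int(N * (b + (1 - b) * (t / (a * T)) ** 0.5))
```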
To test ordered learning, a ResNet-50 model was trained on the CIFAR10 and CIFAR100 datasets for 100 epochs. Every combination of the 180 pacing functions and the 3 orders (curriculum, anti-curriculum, and random) was tested, and the best of 3 random seeds was used for each combination.
The paper defines 3 baselines against which the runs are evaluated. The standard1 baseline is the mean performance of all 540 runs. The standard2 baseline is the mean of the 180 maxima taken over groups of 3 runs, and represents a hyperparameter sweep. The standard3 baseline is the mean of the top 3 values among the 540 runs.
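Assuming the 540 run accuracies are arranged as 180 groups of 3 (an assumed layout for illustration, not the paper's code), the baselines could be computed like this:

```python
import numpy as np

def compute_baselines(accs):
    """Compute the three baselines from a (180, 3) array of run accuracies."""
    flat = accs.ravel()                    # all 540 runs
    standard1 = flat.mean()                # mean over every run
    standard2 = accs.max(axis=1).mean()    # mean of 180 per-group maxima
    standard3 = np.sort(flat)[-3:].mean()  # mean of the top 3 runs overall
    return standard1, standard2, standard3
```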
Experiments show that all three orderings perform similarly, which suggests that any benefit comes from the dynamic dataset size induced by the pacing function. However, even this benefit is marginal, as it does not significantly outperform the standard2 baseline, which accounts for the large-scale hyperparameter sweep performed.
For the short-time setting, the same experiments are performed but with 1, 5, or 50 epochs (352, 1760, or 17600 steps) instead of 100 epochs (35200 steps). As the total number of steps decreases, curriculum learning shows larger performance gains. The pacing function also seems to help on its own, as all three ordered learning methods show at least comparable performance to the standard3 baseline.
To test ordered learning in the noisy setting, artificial label noise was added by randomly permuting labels. Experiments were run with the same setup but with 20%, 40%, 60%, and 80% label noise, and with recomputed c-scores. Again, curriculum learning clearly outperforms the other methods at all noise levels.
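A minimal sketch of this kind of label noise, assuming the corruption shuffles the labels of a randomly chosen fraction of examples among themselves (the paper's exact procedure may differ in detail):

```python
import numpy as np

def permute_labels(labels, noise_frac, seed=0):
    """Randomly permute the labels of a noise_frac fraction of examples."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    idx = rng.choice(len(noisy), size=int(noise_frac * len(noisy)), replace=False)
    noisy[idx] = noisy[rng.permutation(idx)]  # shuffle labels within the subset
    return noisy
```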
Curriculum learning only helps performance if training time is limited or if the dataset contains noisy labels. This reflects current practice: ordered learning is not standard in supervised image classification, but it is used when training general-purpose language models.
These are some relevant papers that could be interesting to read as well: