Week 1, Lecture 2 - Multi-task learning

In the traditional supervised ML paradigm, you have a labeled dataset, \(\mathscr{D}=\left\{(\mathbf{x}, \mathbf{y})_k\right\}\), and the goal is to minimize some loss function \(\mathscr{L}\) over \(\mathscr{D}\) by finding suitable parameters \(\theta\); this minimization problem is the task:

\[ \min _\theta \mathscr{L}(\theta, \mathscr{D}). \]

Multi-task setup

The setup is the same as above, but instead of optimizing one objective, we want to optimize several (under the same model!) at once.

\[ \text { Solve multiple tasks } \mathscr{T}_1, \cdots, \mathscr{T}_T \text { at once. } \]

Example (credit):

instead of doing this (as usual):

[figure: the usual single-task setup]

we do this:

[figure: the multi-task setup]

Three components:

  • Model
  • Loss (objective)
  • Optimization

Model

We ask, should we share parameters across tasks? Or should the learnable parameters in the architecture be entirely separate for each task? The latter may look something like this (credit: CS 330 slides):

[figure: entirely separate parameters per task]

and the former:

[figure: parameters shared across tasks]

See PCGrad for one way to handle conflicting gradients between tasks when parameters are shared.
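A minimal PyTorch sketch of the shared-parameter design, assuming a simple MLP backbone with one linear head per task (the layer sizes and task count below are made up for illustration):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared backbone with one small head per task (illustrative sizes)."""

    def __init__(self, in_dim=32, hidden_dim=64, num_tasks=3, out_dim=1):
        super().__init__()
        # Parameters shared across all tasks.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific parameters: one head per task.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_tasks)]
        )

    def forward(self, x, task_idx):
        # Shared features, then the head belonging to the requested task.
        z = self.backbone(x)
        return self.heads[task_idx](z)
```

The fully separate design would simply instantiate one such network per task, with no shared backbone.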

Loss

To obtain the overall loss, we simply sum the individual task losses:

\[ \min _\theta \sum_{i=1}^T \mathscr{L}_i\left(\theta, \mathscr{D}_i\right) \]

Of course, some tasks might be more important than others, so we can instead perform a weighted summation:

\[ \min _\theta \sum_{i=1}^T w_i \mathscr{L}_i\left(\theta, \mathscr{D}_i\right) \]

Note that we can update the weights during training (as learnable parameters) or manually determine the weight of each loss beforehand (a hyperparameter).
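As an illustration, here is a hedged sketch of both options; the weight values and task count are placeholders, not anything prescribed in the lecture:

```python
import torch
import torch.nn as nn

# Option 1: fixed weights chosen by hand (a hyperparameter).
task_weights = [1.0, 0.5, 2.0]  # placeholder values, one per task

def weighted_loss(task_losses, weights=task_weights):
    """Weighted sum of per-task scalar losses."""
    return sum(w * l for w, l in zip(weights, task_losses))

# Option 2: weights stored as learnable parameters and passed to the optimizer
# together with the model parameters. Note that, without some extra constraint
# (e.g., uncertainty-based weighting), the optimizer can simply shrink these
# weights toward zero to lower the loss.
log_weights = nn.Parameter(torch.zeros(3))  # log space keeps the effective weights positive

def learned_weighted_loss(task_losses):
    return sum(w * l for w, l in zip(torch.exp(log_weights), task_losses))
```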

Optimization

Optimization is straightforward and is nearly the same as in the supervised setting. The CS 330 slides define the following sequence (a minimal PyTorch sketch follows the list):

  1. Sample mini-batch of tasks \(\mathscr{B} \sim\left\{\mathscr{T}_i\right\}\)
  2. Sample mini-batch datapoints for each task \(\mathscr{D}_i^b \sim \mathscr{D}_i\)
  3. Compute loss on the mini-batch: \(\hat{\mathscr{L}}(\theta, \mathscr{B})=\sum_{\mathscr{T}_k \in \mathscr{B}} \mathscr{L}_k\left(\theta, \mathscr{D}_k^b\right)\)
  4. Backpropagate loss to compute gradient \(\nabla_\theta \hat{\mathscr{L}}\)
  5. Apply gradient with your favorite neural net optimizer
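Assuming the `MultiTaskNet` sketch from the Model section is in scope, the loop above might look roughly like this (the toy datasets, batch sizes, and loss function are all placeholders):

```python
import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy per-task datasets so the sketch runs end to end (shapes match MultiTaskNet).
num_tasks = 3
task_loaders = [
    DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)),
               batch_size=16, shuffle=True)
    for _ in range(num_tasks)
]

model = MultiTaskNet(num_tasks=num_tasks)  # from the Model section sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                   # assume every task is a regression task

for step in range(100):
    # 1. Sample a mini-batch of tasks.
    batch_tasks = random.sample(range(num_tasks), k=2)

    loss = torch.tensor(0.0)
    for i in batch_tasks:
        # 2. Sample a mini-batch of datapoints for this task.
        x, y = next(iter(task_loaders[i]))
        # 3. Accumulate the task's loss into the mini-batch loss.
        loss = loss + criterion(model(x, task_idx=i), y)

    # 4. Backpropagate the loss to compute the gradient.
    opt.zero_grad()
    loss.backward()
    # 5. Apply the gradient with your favorite optimizer.
    opt.step()
```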

Note: one place where multi-task loss & optimization differ from the traditional supervised learning paradigm is that we can weight the tasks differently. That is, we can weight the losses corresponding to different tasks differently, and this is a key hyperparameter to pay attention to in the multi-task setting. Explicitly, we can assign more weight (keeping the total fixed) to a task that we humans deem more important. We may also find, during training, that one task outweighs the others and causes bad performance on them. In that case it may be a good idea to weight the “weaker” tasks more heavily than the “stronger” ones, where by stronger I mean tasks that, under equal weights, would dominate the optimization procedure (the model parameters would be biased towards that task).1

PyTorch Example

Coming soon… see this blog post for an example.

Applications

  • Tesla uses a multi-task “HydraNet” model for inference in its self-driving cars. The vision-only system takes in a tuple of 8 images from the car’s cameras, passes the concatenated feature vector through a shared backbone encoder, and routes each task through its own decoder head. This lets a single network handle depth estimation, lane segmentation, and so on. See here.

Confusions/questions

  • Does the input feature vector need to be the same for each task? (I believe the answer is no)

Footnotes

  1. Question: can we treat the task weights as actual parameters (not hyperparameters) that we update via the optimization process (either explicitly or implicitly)?↩︎