This track requires the identification of 23 action units (AUs). The AUs included in the challenge are: 1, 2, 4, 5, 6, 9, 10, 12, 15, 17, 18, 20, 24, 25, 26, 28, 43, 51, 52, 53, 54, 55, 56.

**Training data**: The EmotioNet database includes 950,000 images with annotated AUs. These were annotated with the algorithm described in [1]. You can train your system using this set. You can also use any other annotated dataset you think appropriate. This dataset has been used to successfully train a variety of classifiers, including several deep networks.

**Optimization data**: We also include 25K images (24,600 to be precise) with *manually* annotated AUs. You may want to use this set to check how well your algorithm works or to optimize its parameters.

**Verification phase**: Participants will have access to a server where they can test their algorithm. Participants will receive a unique access code. Each participant will be able to test their algorithm twice. Comparative results against those of the 2017 EmotioNet Challenge [4] will be provided.

**Challenge phase**: Participants will be able to connect to the server one final time to complete the final test. The test dataset used in this phase will be different from the one used in the verification phase. The results of this phase will be used to compute the final scores of the challenge.

**Evaluation**: Identification of AUs will be measured using two criteria: accuracy and F-scores. Algorithms will then be ranked from first (best) to last based on an ordinal ranking of their mean performance across all AUs. Formally, these criteria are defined as follows.

Accuracy is a measure of closeness to the true value. We will compute the recognition accuracy of each AU, i.e., $accuracy_i$ corresponds to the accuracy of your algorithm in identifying whether an image has AU $i$ active/present, with $i =$ 1, 2, 4, 5, 6, 9, 10, 12, 15, 17, 18, 20, 24, 25, 26, 28, 43, 51, 52, 53, 54, 55, 56. Formally,

\[accuracy_i=\frac{\sum \mbox{true positives}_i+\sum \mbox{true negatives}_i}{\sum \mbox{total population}}\]

where $\mbox{true positives}_i$ are AU $i$ instances correctly identified in the test images, $\mbox{true negatives}_i$ are images correctly labeled as not having AU $i$ active/present, and total population is the total number of test images.

The mean accuracy is $accuracy= m^{-1}\sum accuracy_i$, and the variance is $\sigma^2=m^{-1} \sum(accuracy_i-accuracy)^2$, where $m$ is the number of AUs.
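The per-AU accuracy and its mean and variance can be sketched as follows. This is a minimal illustration, not the official evaluation code; all names are hypothetical, and predictions are assumed to be binary (AU active/inactive) matrices of shape (images, AUs).

```python
import numpy as np

def per_au_accuracy(y_true, y_pred):
    """accuracy_i = (TP_i + TN_i) / total population, per AU column.

    Illustrative sketch: y_true and y_pred are (n_images, n_aus) binary
    arrays; the names are not from the challenge's evaluation code.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    n = y_true.shape[0]                      # total number of test images
    tp = np.sum(y_true & y_pred, axis=0)     # true positives per AU
    tn = np.sum(~y_true & ~y_pred, axis=0)   # true negatives per AU
    return (tp + tn) / n

# Toy example: 4 images, 2 AUs
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 0], [0, 0]]
acc = per_au_accuracy(y_true, y_pred)  # one accuracy value per AU
mean_acc = acc.mean()                  # accuracy = m^-1 * sum accuracy_i
var_acc = acc.var()                    # sigma^2 as defined above
```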

F-scores will also be computed for each AU before taking the mean and standard deviation. The F-score of AU $i$ is given by

\[F_{\beta_i}=(1+\beta^2)\frac{precision_i \cdot recall_i}{\beta^2 precision_i+recall_i}\]

where $precision_i$ is the fraction of images identified as having AU $i$ active that truly have AU $i$ active, $recall_i$ is the number of correct recognitions of AU $i$ over the actual number of images with AU $i$ active, and $\beta$ defines the relative importance of precision versus recall. We will use $\beta=.5,1,2$. $\beta=.5$ gives more importance to precision (useful in applications where false positives are more costly than false negatives), $\beta=2$ emphasizes recall (important in applications where false negatives are unwanted), and $\beta=1$ weighs recall and precision equally. We will compute the mean $F_{\beta}=m^{-1}\sum F_{\beta_i}$.
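The $F_{\beta_i}$ computation can be sketched as below. Again, this is an illustrative sketch rather than the official evaluation code; the names are hypothetical, and the toy inputs are chosen so no AU has zero predicted or actual positives (which would make precision or recall undefined).

```python
import numpy as np

def f_beta_per_au(y_true, y_pred, beta=1.0):
    """F_{beta,i} for each AU column of (n_images, n_aus) binary arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=0)    # true positives per AU
    fp = np.sum(~y_true & y_pred, axis=0)   # false positives per AU
    fn = np.sum(y_true & ~y_pred, axis=0)   # false negatives per AU
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy example: 4 images, 2 AUs
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [1, 1], [1, 0], [0, 0]]
f1 = f_beta_per_au(y_true, y_pred, beta=1.0)
mean_f1 = f1.mean()  # F_beta = m^-1 * sum F_{beta,i}
```

With $\beta=1$ the expression reduces to the familiar harmonic mean of precision and recall; passing `beta=0.5` or `beta=2` yields the precision- and recall-weighted variants used in the evaluation.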

The **final ranking of all participants** will be given by the average of accuracy and $F_1$, i.e., $\mathbf{\mbox{final ranking}}=.5(accuracy+F_1 )$. We may also provide rankings for each of the individual measurements (i.e., $accuracy$ and $F_\beta$, with $\beta=.5,1,2$) as well as the number of times each algorithm wins in the classification of each AU using these evaluations. But only the **final ranking** will be used to order submissions from first to last.
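The final ranking rule above amounts to sorting participants by the average of their mean accuracy and mean $F_1$. A minimal sketch, with made-up participant names and scores for illustration:

```python
def final_score(mean_accuracy, mean_f1):
    """final ranking score = .5 * (accuracy + F_1), as defined above."""
    return 0.5 * (mean_accuracy + mean_f1)

# Hypothetical participants: {name: (mean accuracy, mean F_1)}
scores = {"team_a": (0.90, 0.80), "team_b": (0.85, 0.88)}

# Higher final score ranks first (best)
ranking = sorted(scores, key=lambda p: final_score(*scores[p]), reverse=True)
```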

A variety of additional experiments might be performed on the data to better understand the limitations of current algorithms, but these will have no effect on the final ranking.