
Tidbits

Here I keep notes of small things I learned that do not fit into a larger overarching topic. Any time I learn something interesting that I want to remember, I put that here.

F1 Score

One thing I recently looked deeper into (as part of my flood mapping project) was the math behind the F1 score. I have always known that it was the "harmonic" mean of precision ($\frac{TP}{TP + FP}$) and recall ($\frac{TP}{TP + FN}$), but did not really know anything beyond that.

First, let's define the "harmonic" mean mathematically. One thing to remember is that it is normally used for positive values, typically ratios and rates. For positive reals $x_1, x_2, \ldots, x_n$, it is defined as

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_n}}$$

It is also the reciprocal of the arithmetic mean of the reciprocals:

$$H = \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_n}\right)}$$

In the context of F1, the harmonic mean of precision and recall, we have:

$$F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2TP}{2TP + FN + FP}$$

Like any mean, the harmonic mean of $x_1, x_2, \ldots, x_n$ lies between the smallest and largest value, but it is always less than or equal to the arithmetic mean and is pulled toward the smallest value. A good way of thinking about this is that taking reciprocals turns small values into large ones, so they dominate the arithmetic mean of the reciprocals; taking the reciprocal again converts that effect back into the original units. As a result the harmonic mean is less susceptible to large outliers (and very sensitive to small values).

The reason it is used over the arithmetic mean for balancing precision and recall is that it punishes imbalance more strongly. If you have high precision but low recall because you are too picky, F1 will be low. If you have high recall but low precision because you are overly sensitive, then F1 will also be low. F1 will only be high when both are high and in agreement.
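A quick numeric illustration (the precision/recall values here are made up for the example):

```python
# Harmonic vs. arithmetic mean of an imbalanced (precision, recall) pair
def f1(p, r):
    return 2 / (1 / p + 1 / r)

print((0.95 + 0.10) / 2)   # arithmetic mean: 0.525, looks deceptively decent
print(f1(0.95, 0.10))      # harmonic mean (F1): ~0.18, exposes the poor recall
print(f1(0.60, 0.60))      # balanced pair: 0.60, harmonic == arithmetic when equal
```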

The only drawback is that F1 treats recall and precision as equally important. In cases where recall matters more than precision (or precision more than recall), you want the F-beta score, a generalization of F1 using a weighted harmonic mean. Given $\beta > 0$,

$$F_{\beta} = \frac{(1+\beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + FP + \beta^2 FN}$$

Observe that at $\beta = 1$ the weighting is equal and it reduces to the normal F1. At $\beta = 2$ recall is weighted as twice as important as precision, and at $\beta = 0.5$ precision is weighted as twice as important as recall.
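A small sketch computing F-beta straight from confusion-matrix counts (the counts are made up):

```python
# F-beta from raw TP/FP/FN counts, matching the formula above
def f_beta(tp, fp, fn, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + fp + b2 * fn)

tp, fp, fn = 80, 10, 40               # recall = 0.67, precision = 0.89
print(f_beta(tp, fp, fn, beta=1.0))   # ~0.76, plain F1
print(f_beta(tp, fp, fn, beta=2.0))   # ~0.70, lower because the weak recall counts more
```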

The harmonic mean shows up with rates and ratios. Why? Suppose you drive a car at 60 mph for $d$ miles, and then return at 30 mph for the same $d$ miles. What is the average speed? It is NOT the arithmetic mean $\frac{60+30}{2} = 45$, because that would assume the same amount of time was spent at each speed; you actually spend twice as long at 30 mph. The correct average is the harmonic mean, worked out below.
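Dividing total distance by total time (assuming the same distance $d$ each way):

$$\text{average speed} = \frac{2d}{\frac{d}{60} + \frac{d}{30}} = \frac{2}{\frac{1}{60} + \frac{1}{30}} = 40 \text{ mph}$$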

AUPRC

The metric AUPRC stands for the Area Under the Precision-Recall Curve, and is a useful performance metric in addition to F1 score for capturing the ability to predict positives in an imbalanced dataset. A perfect AUPRC means predicting all the positives correctly without any false positives. AUPRC also captures how the model performs at different thresholds, so it is fine-grained at the level of probabilities (a model with ambivalent probabilities will be distinguished from a model that is more confident), not just the final binary outcomes!

To calculate AUPRC it is important to know about the Precision-Recall Curve, which shows the tradeoff between precision and recall across different decision thresholds. As we change the classification threshold across a list of values $0, 0.1, 0.2, \ldots, 0.9, 1.0$ we get a corresponding confusion matrix, precision, and recall for each, which we can plot as a point on the graph. We don't actually need to try every single probability threshold - for starters you can limit the thresholds to the discrete set of probability values in the model predictions, so if the model only ever predicts $0.01, 0.99$ as probabilities, then you only need to check those two thresholds.
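A minimal sketch of that threshold sweep using scikit-learn's precision_recall_curve (the toy labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # toy ground truth
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.65, 0.55])  # toy predicted probabilities

# One (precision, recall) point per distinct score used as a threshold,
# plus a final (recall=0, precision=1) anchor point with no threshold attached.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"score >= {t:.2f}: precision={p:.2f}, recall={r:.2f}")
```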

A good place to start conceptualizing this curve is the perfect case. A perfect AUPRC of $1.0$ corresponds to a square on the plot, running through the (recall, precision) points (0, 1), (1, 1), and (1, 0):

# subtlety: rec = 0/(0+FN) = 0, but prec = 0/0 is undefined. Convention is prec=1 since no FP mistakes were made (CAVEAT IN THE DETAIL BELOW!)
Threshold = 1     ==> all examples classified negative, so no FP and no TP: rec=0, prec=1 (see above)
# top right corner of the square
0 < Threshold < 1 ==> all examples classified correctly, so no FP and no FN: rec=1, prec=1
# bottom right corner (note a real curve does not always end at prec=0)
Threshold = 0     ==> all examples classified positive, so no FN: rec=1, prec = fraction of positives (plotted down to (1, 0) by convention)
Detail

Although you might think that the PR curve should always start at the top left corner (rec=0, prec=1), in practice it might not - it boils down to the label of the example with the top predicted score. Remember: we start at a threshold of 1 with everything predicted negative, then lower it so that positive predictions crop up. The very first positive prediction happens at the highest predicted score, i.e. the model's top-ranked example, say 0.99999. If that top-scoring example is actually a positive in the ground truth, then we start from the artificial (0, 1) point in the top left corner and proceed to the first real data point with rec>0, prec=1. BUT if the top-scoring example is negative, then the first positive prediction is a FP, so the first data point is rec=0, prec=0, and the curve starts from (0, 0).

Really, (0, 0) is just an artifact added in some cases so the plot closes at zero recall.

Similarly, the (1, 0) point is also an artifact added for plotting.

Trick

PR curves can be thought of as depending only on the ranking of the scores given to each example. List the scores from highest to lowest, then sweep down the list and recalculate the confusion matrix, precision, and recall at each step or cutoff, as in the sketch below.
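A minimal sketch of that ranking sweep (the function and variable names are my own):

```python
import numpy as np

def pr_curve_by_ranking(y_true, y_score):
    """Sweep the examples from highest to lowest score, updating TP/FP cumulatively."""
    order = np.argsort(-np.asarray(y_score))   # indices sorted by descending score
    labels = np.asarray(y_true)[order]
    tp = np.cumsum(labels)                     # TPs among the top k examples
    fp = np.cumsum(1 - labels)                 # FPs among the top k examples
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    return precision, recall

precision, recall = pr_curve_by_ranking([0, 1, 1, 0, 1], [0.2, 0.9, 0.6, 0.5, 0.3])
print(precision, recall)
```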

A PR curve often ends at the lower right because at a threshold of 0 the recall is a perfect 1 but the precision is low.

The AUPRC is calculated as the area beneath the PR curve, plotted with recall on the x-axis and precision on the y-axis, both ranging from 0 to 1. Observe that PR curves do not consider the number of true negatives, because neither precision nor recall uses them. As a result AUPRC is not inflated by a large number of easy negatives, which makes it informative for imbalanced datasets with a large percentage of negative cases.

How do we interpret AUPRC? The baseline is equal to the fraction of positives among total cases. Here's why: say the classifier is no better than random guessing. If you have $N$ total examples and $P$ positive examples, the positive rate is $r = \frac{P}{N}$. Your classifier spits out a ranking of the examples by probability score (which is what the PR curve depends on). If the classifier guesses, the ranking is completely random, so for any cutoff with $M$ examples above it there will be about $rM$ TP and $(1-r)M$ FP, and the precision is always approximately $\frac{rM}{rM + (1-r)M} = \frac{rM}{M} = r$. As for recall, the $N - M$ examples below the cutoff contain about $(N-M)r$ FN, so recall is $\frac{Mr}{Mr + (N-M)r} = \frac{Mr}{Nr} = \frac{M}{N}$, which is just the proportion of positive predictions, growing linearly with the cutoff. This makes sense: with a cutoff halfway down the ranking you've only captured 50% of the positive cases, and at the bottom you've captured 100% of them. Integrating the PR curve, $\text{AUPRC} = \int_{0}^{1} r \, d(\text{recall}) = r \int_{0}^{1} d(\text{recall}) = r$. Hence the baseline is the rate of positives: a class with 12% positives has an AUPRC baseline of 0.12, so an AUPRC of 0.40 is considered good, while a class with 98% positives and an AUPRC of 0.98 is no better than random.
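A quick simulation of that baseline (random scores on a dataset with roughly 12% positives; the sizes and rate are arbitrary choices):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.12).astype(int)  # ~12% positives
y_random = rng.random(100_000)                     # uninformative scores

# Both numbers come out close to 0.12: a random ranker's AUPRC is the positive rate
print(y_true.mean(), average_precision_score(y_true, y_random))
```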

AUPRC math

Why do we care about the tradeoff between recall and precision? A decision threshold of 0.5 might not be the optimal threshold for precision and recall, especially if the class distribution is imbalanced. If positive instances are rare, you might need to lower the threshold to detect them and boost recall.

For my ML experiment, it is fine to keep using F1 at a $0.5$ threshold as the development-time objective, but it is important to inspect the PR curve afterward and find the best threshold and its corresponding F1 score.
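A minimal sketch of picking that threshold from the PR curve (y_true and y_score here are stand-ins for real labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # stand-in labels
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)  # stand-in probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# F1 at every threshold; drop the final point, which has no threshold attached
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best = np.argmax(f1)
print(f"best threshold = {thresholds[best]:.3f}, F1 = {f1[best]:.3f}")
```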

ROC-AUC

The Receiver Operating Characteristic (ROC) curve also represents model performance across thresholds. The curve shows the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The ROC-AUC or area under the ROC curve is the probability that the model, if randomly given a positive and negative example, will rank the positive higher than the negative, irrespective of what the actual threshold is.

ROC-AUC examples

For a binary classifier with random guesses (e.g. a coin flip), the ROC-AUC is $0.5$.
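A small sketch checking the pairwise-ranking interpretation against scikit-learn's roc_auc_score (the toy labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.4 * y_true + 0.6 * rng.random(500), 0, 1)

# P(a random positive scores higher than a random negative), counting ties as half
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(roc_auc_score(y_true, y_score), pairwise)  # the two values match
```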

IoU

Intersection Over Union

Python as an Interpreter

I've felt an urge to branch out of Python because, while it is a simple and useful programming language for machine learning tasks, it is not as useful for building fast applications. I've run into many performance bottlenecks with Python scripts.

CPython is a bytecode interpreter: source code is first compiled to bytecode, and the interpreter loop (a virtual machine) then executes those instructions one at a time.
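A quick way to peek at that bytecode is the standard-library dis module; a minimal sketch:

```python
import dis

def add(a, b):
    return a + b

# Print the bytecode instructions the CPython interpreter loop executes for add()
dis.dis(add)
```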

Codecs

When talking about ffmpeg, a term that comes up a lot is codecs. A codec is a piece of software or an algorithm that encodes raw audio/video into a compressed format for storage or transmission and decodes the compressed format back into a playable form. In ffmpeg, codecs are implemented as modules that handle these encoding and decoding operations. An example of a codec supported by ffmpeg is mp3! In the API, codecs are abstracted as AVCodec (audio/video codec). The choice of codec determines the compression efficiency, quality, and encoding/decoding speed.

Here is a slice of the large list of available codecs you get by running ffmpeg -codecs:

 DEAIL. mp3                  MP3 (MPEG audio layer 3) (decoders: mp3float mp3 ) (encoders: libmp3lame libshine )
 D.AIL. mp3adu               ADU (Application Data Unit) MP3 (MPEG audio layer 3) (decoders: mp3adufloat mp3adu )
 D.AIL. mp3on4               MP3onMP4 (decoders: mp3on4float mp3on4 )
 D.AI.S mp4als               MPEG-4 Audio Lossless Coding (ALS) (decoders: als )
 ..A.L. mpegh_3d_audio       MPEG-H 3D Audio
 D.AIL. musepack7            Musepack SV7 (decoders: mpc7 )
 D.AIL. musepack8            Musepack SV8 (decoders: mpc8 )
 DEAIL. nellymoser           Nellymoser Asao
 DEAIL. opus                 Opus (Opus Interactive Audio Codec) (decoders: opus libopus ) (encoders: opus libopus )
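For instance, to use the libmp3lame encoder listed above, here is a minimal sketch (assuming ffmpeg is on PATH and a local input.wav exists; the filenames and bitrate are arbitrary):

```python
import subprocess

# Re-encode input.wav to MP3 with the libmp3lame encoder at 192 kb/s
subprocess.run(
    ["ffmpeg", "-i", "input.wav", "-c:a", "libmp3lame", "-b:a", "192k", "output.mp3"],
    check=True,
)
```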

Resources