Tidbits
Here I keep notes of small things I learned that do not fit into a larger overarching topic. Any time I learn something interesting that I want to remember, I put that here.
F1 Score
One thing I recently looked deeper into (as part of my flood mapping project) was the math behind the F1 score. I have always known that it was the "harmonic" mean of precision and recall, but did not know anything really beyond that.
First, let's define the harmonic mean mathematically. One thing to remember is that it is normally used for positive values, typically ratios and rates. For positive reals $x_1, x_2, \ldots, x_n$, it is defined as

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

It is also the reciprocal of the arithmetic mean of the reciprocals:

$$H = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i} \right)^{-1}$$

In the context of F1 as a function of precision $P$ and recall $R$, we have that:

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$
The harmonic mean has the property that it is always greater than or equal to the minimum of its values, $\min_i x_i$. However, it tends toward its least value, as opposed to the arithmetic mean, and is therefore less susceptible to the impact of large outliers. A good way of thinking about this is that the arithmetic mean of the reciprocals gives the small values outsized weight, and taking the reciprocal again converts that effect back into the original units. The reason it is used over the arithmetic mean for balancing recall and precision is that it punishes imbalance more strongly. If you have high precision but low recall because you are too picky, F1 will be low. If you have high recall but low precision because you are overly sensitive, then F1 will also be low. F1 will only be high when both are high and in agreement.
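To see the imbalance-punishing behavior with concrete numbers, here is a minimal sketch (plain Python, made-up precision/recall pairs) comparing the arithmetic mean with the harmonic mean:

```python
# Minimal sketch (plain Python): arithmetic vs. harmonic mean of precision/recall.
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):          # this is exactly F1
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# made-up precision/recall pairs: balanced, imbalanced, extremely imbalanced
for p, r in [(0.9, 0.9), (0.9, 0.1), (0.99, 0.01)]:
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic_mean(p, r):.3f}  F1={harmonic_mean(p, r):.3f}")
# P=0.90 R=0.90  arithmetic=0.900  F1=0.900
# P=0.90 R=0.10  arithmetic=0.500  F1=0.180
# P=0.99 R=0.01  arithmetic=0.500  F1=0.020
```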
The only drawback is that F1 treats recall and precision as equally important. In cases where recall is more important than precision, or precision more important than recall, you want to use the F-beta score, a generalization of F1 based on a weighted harmonic mean. Given $\beta > 0$,

$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$$

Observe that at $\beta = 1$ the weighting is equal and it defaults to the normal F1. But if $\beta = 2$ then recall is weighted twice as important as precision, and if $\beta = 0.5$ then precision is weighted twice as important as recall.
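As a quick worked example with made-up numbers $P = 0.5$ and $R = 1.0$ (an overly sensitive classifier), the recall-weighted $F_2$ is forgiving while the precision-weighted $F_{0.5}$ is not:

$$F_1 = \frac{2 \cdot 0.5 \cdot 1.0}{0.5 + 1.0} \approx 0.67, \qquad F_2 = \frac{(1 + 4) \cdot 0.5 \cdot 1.0}{4 \cdot 0.5 + 1.0} \approx 0.83, \qquad F_{0.5} = \frac{(1 + 0.25) \cdot 0.5 \cdot 1.0}{0.25 \cdot 0.5 + 1.0} \approx 0.56$$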
The harmonic mean shows up with rates and ratios. Why? Suppose you drive a car at 60 mph for $d$ miles, and then return at 30 mph for the same $d$ miles. What is the average speed? The average speed is NOT the arithmetic mean of 45 mph, because that assumes the same time was spent at each speed.
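Working it out as total distance over total time, the $d$ cancels and the correct answer is the harmonic mean of the two speeds:

$$\text{average speed} = \frac{2d}{\frac{d}{60} + \frac{d}{30}} = \frac{2}{\frac{1}{60} + \frac{1}{30}} = 40 \text{ mph}$$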
AUPRC
The metric AUPRC stands for the Area Under the Precision-Recall Curve, and is a useful performance metric, in addition to the F1 score, for capturing the ability to predict positives in an imbalanced dataset. A perfect AUPRC means predicting all the positives correctly without any false positives. AUPRC also captures how the model performs at different thresholds - so it is fine-grained at the level of probabilities (a model with ambivalent probabilities will be distinguished from a model that is more confident), not just the actual binary outcomes!
To calculate AUPRC it is important to know about the Precision-Recall Curve, which shows the tradeoff between precision and recall across different decision thresholds. As we change the classification threshold across a list of values we get a corresponding confusion matrix, precision, and recall for each, which we can plot as a point on the graph. We don't actually need to try every single probability threshold - for starters you can limit the thresholds to the discrete set of probability values the model actually predicts, so if the model only ever outputs two distinct probabilities, then you only need to check those two thresholds.
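Here is a minimal sketch of that, assuming scikit-learn is available and using a made-up handful of labels and scores; the cutoffs it evaluates are drawn from the distinct predicted scores:

```python
# Sketch of a PR curve on made-up data, assuming scikit-learn is installed.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0])                # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.8, 0.7])  # predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(thresholds)  # cutoffs actually evaluated (drawn from the distinct scores)
print(precision)   # precision at each cutoff, plus a final artificial 1
print(recall)      # recall at each cutoff, plus a final artificial 0
```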
A good place to start conceptualizing this curve is the perfect case. A perfect AUPRC of 1 corresponds to a square on the (recall, precision) plot from (0, 1) to (1, 1) to (1, 0).
# subtlety: rec = 0/(0 + FN) = 0 is fine, but prec = 0/0 is undefined. Convention is prec=1 since no FP mistakes were made (CAVEAT IN THE DETAIL BELOW!)
Threshold = 1 ==> all examples classified negative, so no FP and no TP: rec=0, prec=1 (see above)
# top right corner of square
0 < Threshold < 1 ==> all examples classified correctly, so no FP, FN: rec=1, prec=1
# bottom right corner (note a real curve doesn't always end at prec=0)
Threshold = 0 ==> all examples classified positive, so no FN: rec=1, prec=0
Although you might think that the AUPRC curve should always start at the top left corner (rec=0, prec=1), it might not in practice - it boils down to the label of the example with the top predicted score. Remember: we start at a threshold of 1 with everything labeled negative and lower it so positive predictions crop up. In practice, the very first positive prediction happens at the highest predicted score, i.e. the model's top ranked example, say 0.99999. If that top scoring example is actually a positive ground truth, then we start from the artificial top left corner point (0, 1) and proceed to the first data point with rec>0, prec=1. BUT if the top scoring example is negative, then the first positive prediction contributes a FP, so the first data point is rec=0, prec=0, and the curve starts from (0, 0).
Really, (0, 0) is just an artifact added in certain cases for the plot to work. Similarly, the (1, 0) point is also an artifact added for plotting.
PR curves can be thought of as simply dependent on the ranking of the scores given to each example. Listing the scores from highest to lowest, you sweep down the list and recalculate the confusion matrix, precision, and recall at each step or cutoff.
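A from-scratch sketch of this ranking view, with made-up labels and (already distinct) scores: sort by score, sweep the cutoff down one example at a time, and recompute precision and recall at each step.

```python
# From-scratch PR points: sort by score, sweep the cutoff down the ranked list.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])                 # made-up ground truth
y_score = np.array([0.95, 0.8, 0.7, 0.45, 0.3, 0.1])  # made-up, distinct scores

order = np.argsort(-y_score)   # indices from highest to lowest score
labels = y_true[order]
total_pos = labels.sum()

tp = fp = 0
for label in labels:           # each step = lowering the threshold past one example
    tp += int(label)
    fp += int(1 - label)
    print(f"recall={tp / total_pos:.2f}  precision={tp / (tp + fp):.2f}")
```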
A PR curve often ends at the lower right because at a threshold of 0 the recall is a perfect 1 but the precision is low.
The AUPRC is calculated as the area beneath the PR curve, which is plotted with recall on the x-axis and precision on the y-axis, both ranging from 0 to 1. Observe that PR curves do not consider the number of true negatives, due to the focus on precision and recall. As a result, AUPRC is unaffected by class imbalance where a large percentage of cases are negative.
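Two closely related ways to turn the curve into a single number, sketched with scikit-learn (the trapezoidal area under the curve, and average precision, a step-wise summary that is often what people report as AUPRC):

```python
# Two closely related summaries of the PR curve, assuming scikit-learn.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.95, 0.8, 0.7, 0.45, 0.3, 0.1])

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                    # trapezoidal area under the PR curve
print(average_precision_score(y_true, y_score))  # average precision, the step-wise variant
```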
How do we interpret AUPRC? With AUPRC the baseline is equal to the fraction of positives among total cases. Here's why: say the classifier is no better than random guessing. If you have $N$ total examples and $P$ of them are positive, the positive rate is $p = P/N$. Your classifier spits out a ranking of the examples by probability score (which is all the PR curve depends on). If the classifier guesses, the ranking is completely random, so for any cutoff with $k$ examples above it there will be roughly $kp$ TP and $k(1-p)$ FP, and the precision is always approximately $p$. As for the recall, with $k$ examples predicted positive it is roughly $kp / P = k/N$, which is just the proportion of examples predicted positive, changing linearly with the cutoff. This makes sense for recall, because with the cutoff halfway down the list you've only captured about 50% of the positive cases, and at the bottom you've captured 100% of them. Integrating the PR curve, $\int_0^1 p \, d(\text{recall}) = p$. Hence we get the baseline being the rate of positives. A class with 12% positives has 0.12 as the AUPRC baseline, so 0.40 AUPRC is considered good. A class with 98% positives and a 0.98 AUPRC is bad.
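A quick empirical sanity check of the baseline claim, simulating a random scorer on a 12%-positive dataset (numpy and scikit-learn assumed):

```python
# Empirical check: a random scorer's AUPRC hovers around the positive rate.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n, pos_rate = 100_000, 0.12
y_true = (rng.random(n) < pos_rate).astype(int)  # ~12% positives
y_score = rng.random(n)                          # scores with no signal at all

print(y_true.mean())                             # ~0.12
print(average_precision_score(y_true, y_score))  # also ~0.12
```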
Why do we care about the tradeoff between recall and precision? With a decision threshold of 0.5 you might not be at the optimal threshold for recall and precision! Especially if the class distribution is imbalanced. If positive instances are rare, you might need to lower the threshold to detect them and boost recall.
For my ML experiment, it is possible to keep using F1 at the default threshold as the development-time objective, but it is important to inspect the PR curve afterwards and find the best threshold and its corresponding F1 score.
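A sketch of that post-hoc threshold search, assuming scikit-learn and some made-up held-out labels and scores:

```python
# Post-hoc threshold search: pick the cutoff that maximizes F1 on held-out data.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                     # made-up labels
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])  # made-up scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])           # skip the final artificial point (rec=0, prec=1)
print(thresholds[best], f1[best])   # best threshold and the F1 it achieves
```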
ROC-AUC
The Receiver Operating Characteristic (ROC) curve also represents model performance across thresholds. The curve shows the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The ROC-AUC or area under the ROC curve is the probability that the model, if randomly given a positive and negative example, will rank the positive higher than the negative, irrespective of what the actual threshold is.
For a binary classifier that guesses randomly (e.g. a coin flip), the ROC-AUC is 0.5.
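A small sketch of the ranking interpretation on simulated data: counting how often a positive outranks a negative should match roc_auc_score (ties aside):

```python
# Checking the ranking interpretation of ROC-AUC on simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
# positives score a bit higher on average than negatives
y_score = np.where(y_true == 1, rng.normal(0.6, 0.2, 2000), rng.normal(0.4, 0.2, 2000))

pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()  # P(positive ranked above negative)
print(pairwise, roc_auc_score(y_true, y_score))  # the two numbers should match
```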
IoU
Intersection Over Union
Python as an Interpreter
I've felt an urge to branch out of Python because, while it is quite a useful and simple programming language for machine learning tasks, it is not as useful for building fast applications. I've run into performance bottlenecks with Python scripts many times.
Python bytecode interpreter.
Codecs
When talking about ffmpeg, a term that comes up a lot is codecs. A codec is software or an algorithm that encodes raw audio/video into a compressed format for storage or transmission, and decodes the compressed format back into the original form. In ffmpeg, codecs are implemented as modules that handle these encoding and decoding operations. An example of a codec supported by ffmpeg is mp3! In the API, codecs are abstracted as AVCodec (Audio Video Codec). Choosing the type of codec determines the compression efficiency, quality, and encoding/decoding speed.
Here is a slice of the large list of available codecs you get by running ffmpeg -codecs:
DEAIL. mp3 MP3 (MPEG audio layer 3) (decoders: mp3float mp3 ) (encoders: libmp3lame libshine )
D.AIL. mp3adu ADU (Application Data Unit) MP3 (MPEG audio layer 3) (decoders: mp3adufloat mp3adu )
D.AIL. mp3on4 MP3onMP4 (decoders: mp3on4float mp3on4 )
D.AI.S mp4als MPEG-4 Audio Lossless Coding (ALS) (decoders: als )
..A.L. mpegh_3d_audio MPEG-H 3D Audio
D.AIL. musepack7 Musepack SV7 (decoders: mpc7 )
D.AIL. musepack8 Musepack SV8 (decoders: mpc8 )
DEAIL. nellymoser Nellymoser Asao
DEAIL. opus Opus (Opus Interactive Audio Codec) (decoders: opus libopus ) (encoders: opus libopus )
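As a usage sketch, the codec is selected with ffmpeg's -c:a (audio) or -c:v (video) option; here is a hypothetical transcode to MP3 driven from Python (file names made up, ffmpeg assumed to be on PATH):

```python
# Hypothetical transcode: pick the libmp3lame encoder from the list above.
# File names are made up; assumes the ffmpeg binary is on PATH.
import subprocess

subprocess.run(
    ["ffmpeg",
     "-i", "input.wav",       # input file
     "-c:a", "libmp3lame",    # audio codec (encoder) to use
     "-b:a", "192k",          # target audio bitrate
     "output.mp3"],
    check=True,
)
```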