Tidbits
Here I keep notes of small things I learned that do not fit into a larger overarching topic. Any time I learn something interesting that I want to remember, I put that here.
F1 Score
One thing I recently looked deeper into (as part of my flood mapping project) was the math behind the F1 score. I have always known that it was the "harmonic" mean of precision ($P$) and recall ($R$), but did not know anything really beyond that.
First, let's define the "harmonic" mean mathematically. One thing to remember is that it is normally used for positive values, typically ratios and rates. For positive reals $x_1, x_2, \dots, x_n$, it is defined as

$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$$
It is also the reciprocal of the arithmetic mean of the reciprocals:

$$H = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i} \right)^{-1}$$
In the context of F1 as the harmonic mean of precision and recall, we have that:

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$
The harmonic mean has the property that it is always greater than or equal to the minimum of its values, $\min_i x_i$. However, the harmonic mean of a list of values is dominated by its smallest values, unlike the arithmetic mean, and is therefore less susceptible to the impact of large outliers. A good way of thinking about this is that taking the arithmetic mean of the reciprocals gives the small values the most weight, and taking the reciprocal again converts that effect back into the original unit. The reason it is used over the arithmetic mean for balancing recall and precision is that it punishes imbalance more strongly. If you have high precision but low recall because you are too picky, F1 will be low. If you have high recall but low precision because you are overly sensitive, then F1 will also be low. F1 will only be high when both are high and in agreement.
The only drawback is that F1 treats recall and precision as equally important. In cases where recall matters more than precision (or precision more than recall), you want to use the F-beta score, which is a generalization of F1 using a weighted harmonic mean. Given $\beta > 0$,

$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$$
Observe that at $\beta = 1$ the weighting is equal and it reduces to the normal F1. But if $\beta = 2$ then recall is weighted twice as important as precision, and if $\beta = 0.5$ then precision is weighted twice as important as recall.
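To make the weighting concrete, here is a minimal Python sketch of the F-beta formula (the function name and example numbers are just for illustration):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.9, 0.3))            # ~0.45: F1 punishes the imbalance (arithmetic mean would say 0.6)
print(f_beta(0.9, 0.3, beta=2.0))  # ~0.35: weighting recall higher drops it further
```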
The harmonic mean shows up in rates and ratios. Why? Suppose you drive a car at 60 mph for $d$ miles, and then return at 30 mph for the same $d$ miles. What is the average speed? The average speed is NOT the arithmetic mean of 45 mph, because that assumes the same time was spent at each speed. Instead, it is the harmonic mean:

$$\frac{2}{\frac{1}{60} + \frac{1}{30}} = 40 \text{ mph}$$

AUPRC
The metric AUPRC stands for the Area Under the Precision-Recall Curve, and is a useful performance metric in addition to the F1 score for capturing the ability to predict positives in an imbalanced dataset. A perfect AUPRC means predicting all the positives correctly without any false positives. AUPRC also captures how the model performs at different thresholds, so it is fine-grained at the level of probabilities (a model with ambivalent probabilities will be distinguished from a model that is more confident), not just the actual binary outcomes!
To calculate AUPRC it is important to know about the Precision-Recall Curve, which shows the tradeoff between precision and recall across different decision thresholds. As we change the classification threshold across a list of values, we get a corresponding confusion matrix, precision, and recall for each, which we can plot as a point on the graph. We don't actually need to try every single probability threshold: for starters you can limit the thresholds to the discrete set of probability values the model actually predicts, so if the model only ever outputs two distinct probabilities, you only need to check those two thresholds.

A good place to start conceptualizing this curve is the perfect case. A perfect AUPRC of 1 corresponds to a square on the plot: the curve runs from (recall=0, precision=1) across to (1, 1) and down to (1, 0).
# subtlety: while rec = 0/(0+FN) = 0, prec = 0/0 is undefined. Convention is prec=1 since no FP mistakes were made (CAVEAT IN THE DETAIL BELOW!)
Threshold = 1 ==> all examples classified negative, so no FP and no TP: rec=0, prec=1 (see above)
# top right corner of square
0 < Threshold < 1 ==> all examples classified correctly, so no FP, FN: rec=1, prec=1
# bottom right corner (note it doesn't always end at prec=0)
Threshold = 0 ==> all examples classified positive, so no FN: rec=1, prec=0
Although you might think that the PR curve should always start at the top left corner (rec=0, prec=1), in practice it might not; it boils down to the label of the example with the top predicted score. Remember: we start with a threshold of 1, with everything labeled negative, and lower it so that positive labels crop up. In practice, the very first positive prediction happens at the highest predicted score, i.e. the model's top ranked example, say 0.99999. If that top scoring example is actually a positive ground truth, then we start from the artificial (0, 1) point in the top left corner and proceed to the first data point with rec>0, prec=1. BUT if the top scoring example is negative, then the first positive prediction is a false positive, so the first data point is rec=0, prec=0, and the curve starts from (0, 0).
Really, (0, 0) is just an artifact added in certain cases to make the plot work.
Similarly the (1, 0) point is also an artifact added for plotting.
PR curves can be thought of as simply dependent on the ranking of the scores given to each example. Listing the scores from highest to lowest, you sweep down the list and recalculate the confusion matrix, precision, and recall at each step or cutoff.
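Here is a small sketch of that sweep, assuming binary ground-truth labels and predicted scores (the helper name and toy numbers are illustrative; ties and the artificial endpoints above are ignored):

```python
import numpy as np

def pr_points(scores, labels):
    """Rank by score, then compute (recall, precision) at every cutoff."""
    order = np.argsort(-np.asarray(scores))  # highest score first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                   # true positives above each cutoff
    fp = np.cumsum(1 - labels)               # false positives above each cutoff
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    return recall, precision

recall, precision = pr_points([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0])
print(list(zip(recall, precision)))  # [(0.5, 1.0), (0.5, 0.5), (1.0, 0.67), (1.0, 0.5)]
```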
A PR curve often ends at the lower right because at a threshold of 0 the recall is a perfect 1 but the precision is low.
The AUPRC is calculated as the area beneath the PR curve, which can be plotted on a graph with recall on the x-axis and precision on the y-axis ranging from 0 to 1. Observe that PR curves do not consider the number of true negatives due to the focus on precision and recall. As a result AUPRC is unaffected by imbalanced dataset classes with a large percentage of negative cases.
How do we interpret AUPRC? With AUPRC the baseline is equal to the fraction of positives among total cases. Here's why: say the classifier was no better than random guessing. If you have $N$ total examples and $P$ positive examples, the positive rate is $P/N$. Your classifier would spit out a ranking of each example by probability score (which is what the PR curve depends on). If the classifier guesses, then the ranking is completely random, and for any cutoff with $k$ examples above it, roughly $k \cdot P/N$ of them will be TP and the rest FP, so the precision is always approximately $P/N$. As for the recall, it will be approximately $(k \cdot P/N)/P = k/N$, which is just the proportion of examples above the cutoff, growing linearly as the cutoff sweeps down. This makes sense for recall, because halfway down the ranking you've only captured about 50% of the positive cases, and at the bottom you've captured 100% of them. Integrating the PR curve, you get a constant precision of $P/N$ over recall from 0 to 1, so the area is $P/N$. Hence we get the baseline being the rate of positives. A class with 12% positives has 0.12 as the AUPRC baseline, so 0.40 AUPRC is considered good. A class with 98% positives and a 0.98 AUPRC is bad.
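A quick way to sanity-check the baseline claim, assuming scikit-learn is available (its average_precision_score is an AUPRC-style summary):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.12).astype(int)  # ~12% positives
y_rand = rng.random(100_000)                       # random scores = random ranking

# A random ranker's AUPRC hovers around the positive rate (~0.12 here).
print(average_precision_score(y_true, y_rand))
```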

Why do we care about the tradeoff between recall and precision? With a decision threshold of 0.5 you might not be at the optimal threshold for recall and precision, especially if the class distribution is imbalanced. If positive instances are rare, you might need to lower the threshold to detect them and boost recall.
For my ML experiment, it is fine to keep using F1 at a fixed threshold (e.g. 0.5) as the development-time objective, but it is important to inspect the PR curve afterwards and find the best threshold and its corresponding F1 score.
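A minimal sketch of that post-hoc threshold search using scikit-learn's precision_recall_curve (function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Scan the PR curve and return the threshold that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The last precision/recall pair has no associated threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

# thr, f1 = best_f1_threshold(y_val, model_probs)  # y_val / model_probs are placeholders
```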
ROC-AUC
The Receiver Operating Characteristic (ROC) curve also represents model performance across thresholds. The curve shows the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis.
The true positive rate is just another name for recall, $TPR = \frac{TP}{TP + FN}$, while the false positive rate is the proportion of all actual negatives incorrectly classified as positive, $FPR = \frac{FP}{FP + TN}$, aka the probability of false alarm. As a result, the curve runs from the bottom left to the top right (unlike the PR curve): starting from a very high threshold you are in the bottom left corner with no false positives (and no true positives), then as you lower the threshold to zero and classify everything as positive, both the TPR and FPR go to 100%, with no false negatives left, ending in the top right corner.
The ROC-AUC, or area under the ROC curve, is the probability that the model, if randomly given a positive and a negative example, will rank the positive higher than the negative, irrespective of what the actual threshold is. The ROC curve also helps you decide what you want in the tradeoff: if false positives are costly, choose a threshold with a lower FPR at the expense of some TPR. Conversely, if false negatives are costly, you can push toward a higher TPR at the cost of a higher FPR.

For a binary classifier with random guesses (e.g. a coin flip), the ROC-AUC is 0.5. Anything below that means you are performing worse than chance.
The disadvantage of the ROC-AUC metric with respect to imbalanced datasets is that it can be misleadingly high. For instance, say you have a very small number of positive cases, ~10 in a dataset of 1000, so most of the data is negative. A naive model can label most of the negatives correctly, so the FPR looks nearly perfect. Say it also has good recall and catches all 10 positives, but does so with extremely low precision, producing 90 false positives along the way. The FPR is still low ($90/990 \approx 0.09$) while the TPR is high ($10/10 = 1$), leading to a high ROC-AUC even though the precision is only $10/100 = 0.1$. The problem is that ROC can look good on imbalanced datasets while hiding misclassification of the minority class. This is unlike AUPRC, which is unaffected by the total number of TN predictions and takes precision into account, which is more relevant when the minority class is rare.
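The scenario above is easy to reproduce, assuming scikit-learn; the scores and counts below just mirror the hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# 1000 examples, 10 positives. The model scores 90 negatives above every positive.
y_true = np.array([0] * 900 + [0] * 90 + [1] * 10)
y_score = np.concatenate([
    np.full(900, 0.1),   # most negatives scored low
    np.full(90, 0.8),    # 90 negatives scored high (the false positives)
    np.full(10, 0.6),    # the 10 positives land below the false positives
])

print(roc_auc_score(y_true, y_score))            # ~0.91, looks healthy
print(average_precision_score(y_true, y_score))  # ~0.1, exposes the poor precision
```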
IoU
Intersection over Union is a similarity metric commonly used in object detection model evaluation. The easiest way to think about it is that, given a predicted area and a ground truth area (i.e. a segmentation mask or bounding box), it is calculated by dividing the area of their intersection by the area of their union: $\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$. A resulting value of 0 indicates no overlap and 1 indicates perfect overlap.
With binary classification, it can be written as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
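A small numpy sketch of the binary-mask version (the function name and toy masks are illustrative):

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """IoU for binary masks: |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection / union) if union > 0 else 1.0  # both empty: call it perfect

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, target))  # 2 overlapping pixels / 4 covered pixels = 0.5
```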
Here's an easy visualization of how the set and the binary classification definitions are equivalent:

hydra ML Configs
hydra works closely with the package omegaconf for DictConfig objects. This allows you to easily import .yaml config files into a python config object and access variables either dictionary style or via object attributes.
One powerful thing I learned in the process about omegaconf is interpolation. It allows you to reference other variables across the entire config tree. For example, if I have paths defined in a group paths and then want to set a weights path in the group train using one of those previously defined paths, I don't need to duplicate the path again; I can interpolate it with ${group.key} dot notation. By default interpolation is absolute, but relative is also possible, e.g. ${..foo}. See this example:
paths:
  base_dir: ${oc.env:PROJECT_ROOT,/lcrc/project/hydrosm/dma}
  data_dir: ${paths.base_dir}/data
  output_dir: ${paths.base_dir}/outputs

train:
  # No need to duplicate paths here!
  weights: ${paths.data_dir}/experiments/unet_wyhy12/model.pth
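Loading and resolving such a config with omegaconf looks roughly like this (the file path is illustrative; with hydra the composed config arrives in the same DictConfig form):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train.yaml")   # illustrative path
print(cfg.train.weights)                     # interpolation resolved on access
print(OmegaConf.to_yaml(cfg, resolve=True))  # or resolve the whole tree at once
```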
Python as an Interpreter
I've felt an urge to branch out of Python because, while it is a simple and useful programming language for machine learning tasks, it is not as useful for building fast applications. I've run into many performance bottlenecks with Python scripts.
Python is a bytecode interpreter: CPython compiles source code to bytecode and then executes it on a virtual machine, which is part of why it is slower than compiled languages.
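The standard library's dis module makes this visible by disassembling a function into the bytecode instructions the virtual machine executes:

```python
import dis

def add(a, b):
    return a + b

# Prints instructions like LOAD_FAST a, LOAD_FAST b, BINARY_ADD
# (BINARY_OP on Python 3.11+), RETURN_VALUE.
dis.dis(add)
```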
IP Addresses
Every machine on a network has an IP address.
Remote Procedure Call (RPC) Server
An RPC server simplifies communication between systems in the client/server model. It allows a client computer to request the execution of programs or procedures on a remote server as if it were happening locally. The RPC server listens for requests from RPC clients, executes code upon request, and sends the results back to the client. One analogy for MCP is that it is similar to RPC as a communication conduit between agents, but may not provide a solution for advanced agent-to-agent coordination.
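As a minimal sketch, Python's built-in XML-RPC modules show the idea: the client calls a remote function as if it were local (the host, port, and function are arbitrary examples):

```python
# Server: expose a function over RPC.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add, "add")
    server.serve_forever()

# Client (run in a separate process):
#   from xmlrpc.client import ServerProxy
#   print(ServerProxy("http://localhost:8000/").add(2, 3))  # executed on the server -> 5
```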
Relational Database Management System (RDBMS)
An RDBMS is software that manages relational databases. An interesting thought: people used to dismiss Salesforce as an RDB wrapper, in that to some degree it puts an interface and logic on top of relational databases, which is obviously overly simplified. In the same way, today's GPT wrappers, or application layers on top of new AI models, can actually provide significant value if done correctly.
Codecs
When talking about ffmpeg, a term that comes up a lot is codecs. A codec is a piece of software or an algorithm that encodes raw audio/video into a compressed format for storage or transmission and decodes the compressed format back into the original form. In ffmpeg, codecs are implemented as modules that handle these encoding and decoding operations. An example of a codec supported by ffmpeg is mp3! In the API, codecs are abstracted as AVCodec (Audio Video Codec). The choice of codec determines the compression efficiency, quality, and encoding/decoding speed.
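For example, transcoding a WAV file to MP3 while explicitly selecting the libmp3lame encoder (the file names are illustrative), here wrapped in Python:

```python
import subprocess

# -c:a picks the audio codec (encoder); -b:a sets the audio bitrate.
subprocess.run(
    ["ffmpeg", "-i", "input.wav", "-c:a", "libmp3lame", "-b:a", "192k", "output.mp3"],
    check=True,
)
```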
Here is a slice of the large list of available codecs you get by running ffmpeg -codecs:
DEAIL. mp3 MP3 (MPEG audio layer 3) (decoders: mp3float mp3 ) (encoders: libmp3lame libshine )
D.AIL. mp3adu ADU (Application Data Unit) MP3 (MPEG audio layer 3) (decoders: mp3adufloat mp3adu )
D.AIL. mp3on4 MP3onMP4 (decoders: mp3on4float mp3on4 )
D.AI.S mp4als MPEG-4 Audio Lossless Coding (ALS) (decoders: als )
..A.L. mpegh_3d_audio MPEG-H 3D Audio
D.AIL. musepack7 Musepack SV7 (decoders: mpc7 )
D.AIL. musepack8 Musepack SV8 (decoders: mpc8 )
DEAIL. nellymoser Nellymoser Asao
DEAIL. opus Opus (Opus Interactive Audio Codec) (decoders: opus libopus ) (encoders: opus libopus )
Strong vs. Weak Scaling
When it comes to scalability on HPCs, there are two important concepts with regards to how computational capacity scales.
Strong scaling is the speedup that can be achieved through increasing the processor resources while the problem size stays constant, and is bound by Amdahl's Law to the portion of code that is not parallelizable.
Weak scaling is the speedup that can be achieved by increasing both the number of processors and the problem size. The difference here is that by scaling the problem size, i.e. adding more processors while keeping the workload per processor constant, we can solve larger problems on a large machine than we could solve as quickly on a smaller one. The limit here is only the number of processors and the maximum problem size. An example of weak scaling is increasing the batch size (the problem size) along with resources in ML experiments.
Just as Amdahl's Law governs strong scaling, Gustafson's Law governs weak scaling. The law assumes the parallel part of a problem scales linearly with resources while the serial part does not increase as the problem size increases. The law claims that

$$S = s + p \times N$$

where $s$ is the proportion of execution time spent on the serial part, $p$ is the proportion spent on the part that can be parallelized ($s + p = 1$), and $N$ is the number of processors. Think of it as the ratio of the time to solve a larger problem (i.e. scaled by $N$) with serial computation relative to having $N$ processors.
Hence, the law claims that scaled speedup increases linearly with respect to number of processors with no upper limit.
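A tiny sketch contrasting the two laws for a hypothetical 10% serial fraction:

```python
def amdahl_speedup(serial: float, n: int) -> float:
    """Strong scaling: fixed problem size, capped near 1/serial."""
    return 1.0 / (serial + (1.0 - serial) / n)

def gustafson_speedup(serial: float, n: int) -> float:
    """Weak scaling: problem grows with n, so speedup keeps growing."""
    return serial + (1.0 - serial) * n

for n in (8, 64, 512):
    print(n, round(amdahl_speedup(0.1, n), 1), round(gustafson_speedup(0.1, n), 1))
# Amdahl saturates near 10x, while Gustafson climbs roughly linearly in n.
```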
In practice, you are often not limited to a set task size. Thus, the sizes of problems scale with number of resources.
By nature, if you have a data-parallel problem where each processor can work on its own set of data separately, it will scale well in the weak sense.
Measure strong scaling on HPC by testing how the overall computation time of a job scales with resources (threads or MPI processes) while the problem size stays fixed.
Measure weak scaling on HPC by testing computational time of a job while increasing both job size and resources. This would have resources on x axis and scaled speedup on y axis.
Scaling Variables
Speedup for strong scaling is the time using one processor divided by the time using $N$ processors:

$$S = \frac{t_1}{t_N}$$

where $t_1$ is the execution time on one processor and $t_N$ is the execution time on $N$ processors.
Efficiency is the ideal time divided by the measured time:

$$E = \frac{t_1}{N \times t_N}$$
With weak scaling, the efficiency is the ratio of time to complete 1 unit of work with 1 processor relative to N work units with N processors.
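Putting the definitions together in a short sketch (the timing numbers are made up):

```python
def strong_speedup(t1: float, tn: float) -> float:
    return t1 / tn                 # time on 1 processor / time on N processors

def strong_efficiency(t1: float, tn: float, n: int) -> float:
    return t1 / (n * tn)           # ideal time t1/N divided by measured time

def weak_efficiency(t1: float, tn: float) -> float:
    return t1 / tn                 # 1 work unit on 1 proc vs N units on N procs

print(strong_speedup(100.0, 30.0))        # ~3.3x, measured on, say, 4 processors
print(strong_efficiency(100.0, 30.0, 4))  # ~0.83
print(weak_efficiency(100.0, 110.0))      # ~0.91
```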
Read more here.
The Bitter Lesson
The essay by Richard Sutton is famous for highlighting the tension between leveraging human knowledge and leveraging computation in the realm of AI. Search and learning, he argues, are what ultimately matter when computation is scaled enormously. Search can be thought of as brute-force trial and error, and learning can take the form of self-play.
The bitter lesson is this: that building methods that mimic how we as humans think does not work in the long run. In the long term the breakthroughs will always come in the form of scaling search and learning. Human centric approaches thus always fall short of more general methods.
Human knowledge pales in comparison to Moore's law.