[POINT COORDINATE COUNTING] Mean Absolute Error

Mean Absolute Error (MAE): The average absolute difference between the predicted count and the actual count across all images. For example, in people counting it tells you how far off your headcount typically is.

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

[POINT COORDINATE COUNTING] Root Mean Squared Error (RMSE)

Similar to MAE, but it squares the errors before averaging. This metric heavily penalizes severe miscounts. If your model usually misses by 1, but occasionally hallucinates 10 extra people, RMSE will spike while MAE might stay relatively stable.
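A minimal sketch of the difference, assuming RMSE is the square root of the mean squared error. The counts below are hypothetical: both runs miss by 1 on most images, and the second run over-counts a single image by 10.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, take the root."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical headcounts for 10 images (illustrative data only).
actual = [20] * 10
steady = [21] * 10          # always off by 1
spiky = [21] * 9 + [30]     # off by 1 nine times, off by 10 once

print("steady:", mae(actual, steady), rmse(actual, steady))
print("spiky: ", mae(actual, spiky), rmse(actual, spiky))
```

In this example the single large miscount roughly doubles MAE (1.0 to 1.9) but more than triples RMSE (1.0 to about 3.3).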

[POINT COORDINATE] Macro Averaging

Calculate the metric for each class independently, then average the class scores. This treats all classes equally, regardless of how many instances of each class exist in the dataset.
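As a small illustration, the sketch below macro-averages a per-class score and compares it with an instance-weighted average. The class names, scores, and instance counts are made up for the example.

```python
def macro_average(per_class_scores):
    """Unweighted mean of the per-class scores: every class counts equally."""
    return sum(per_class_scores.values()) / len(per_class_scores)

def weighted_average(per_class_scores, instances):
    """Mean weighted by instance count, shown only for contrast."""
    total = sum(instances.values())
    return sum(per_class_scores[c] * instances[c] for c in per_class_scores) / total

# Hypothetical per-class precision and class frequencies.
precision_by_class = {"car": 0.90, "bicycle": 0.50}
instances_by_class = {"car": 900, "bicycle": 100}

print(macro_average(precision_by_class))                         # about 0.70
print(weighted_average(precision_by_class, instances_by_class))  # about 0.86
```

The rare class pulls the macro average down much more than the weighted one, which is the point of treating classes equally.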


[POINT COORDINATE] mAP@d

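Mean Average Precision at distance $d$. For each class, calculate the Average Precision (AP), the Area Under the Curve (AUC) of the Precision-Recall curve, counting a prediction as correct only if it falls within distance radius $d$ of a ground truth point. Then average the AP values across classes.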
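One way to compute this is sketched below. It assumes each prediction carries a confidence score (not stated above), that predictions have already been marked TP or FP at the chosen distance $d$, and that the area is taken without interpolation; benchmark implementations such as COCO's use an interpolated curve, so numbers may differ slightly. The function names and example flags are hypothetical.

```python
def average_precision(ranked_is_tp, num_ground_truth):
    """Area under the precision-recall curve for one class.

    ranked_is_tp: TP/FP flags for predictions sorted by descending
    confidence, obtained after matching at one distance threshold d.
    """
    if num_ground_truth == 0:
        return 0.0
    tp = 0
    ap = 0.0
    for rank, is_tp in enumerate(ranked_is_tp, start=1):
        if is_tp:
            tp += 1
            # Recall rises by 1/num_ground_truth at each true positive;
            # weight that step by the precision at this rank.
            ap += (tp / rank) / num_ground_truth
    return ap

def mean_average_precision(per_class):
    """per_class maps class name -> (ranked_is_tp, num_ground_truth)."""
    aps = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(aps) / len(aps)

# Hypothetical matching results at a single distance d.
results = {
    "cell": ([True, False, True], 2),
    "nucleus": ([True, True], 2),
}
print(mean_average_precision(results))
```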

[POINT COORDINATE] mAP@d-[0.01:0.10]

The mean of mAP scores calculated at strict distance intervals (e.g., $d = 0.01, 0.02, \dots, 0.10$). This is the point-based equivalent of the COCO mAP@[0.50:0.95] metric.

[POINT COORDINATE] Point localization error

For all True Positives, what is the average Euclidean distance between the predicted coordinate and the exact ground truth coordinate?
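A minimal sketch, assuming 2D coordinates and that the matched (prediction, ground truth) pairs are already available from the matching step. The function name and sample points are illustrative.

```python
import math

def mean_localization_error(matched_pairs):
    """Average Euclidean distance over true positives.

    matched_pairs: list of ((px, py), (gx, gy)) tuples, one per TP.
    Returns None when there are no true positives (the mean is undefined).
    """
    if not matched_pairs:
        return None
    distances = [math.dist(pred, gt) for pred, gt in matched_pairs]
    return sum(distances) / len(distances)

# Two hypothetical matches with distances 5 and 1.
pairs = [((0.0, 0.0), (3.0, 4.0)), ((1.0, 1.0), (1.0, 2.0))]
print(mean_localization_error(pairs))
```

False positives and false negatives are deliberately excluded, so this number should be read alongside precision and recall rather than on its own.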

[POINT COORDINATE] Precision

Out of all points the model predicted, how many were actually correct?

$$\text{Precision} = \frac{TP}{TP + FP}$$

[POINT COORDINATE] Recall

Out of all the ground truth points that actually exist, how many did the model successfully find?

$$\text{Recall} = \frac{TP}{TP + FN}$$

2:1 structured sparsity


F1 score

Harmonic mean of Precision and Recall. We use it when we want a single metric that balances both, especially when we have an uneven class distribution.

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$F1 = \frac{2TP}{2TP + FP + FN}$$
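The two forms are algebraically equivalent, which a quick check with hypothetical counts illustrates. The zero-denominator fallback of 0.0 is a convention chosen for this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 computed directly from the confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_from_precision_recall(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical counts, for illustration only.
tp, fp, fn = 6, 2, 4
precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.6

print(f1_from_counts(tp, fp, fn))
print(f1_from_precision_recall(precision, recall))
```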

Inception-v3

Inception-v3 is a 48-layer deep convolutional neural network (CNN) for image classification, developed by Google in 2016 to improve accuracy while reducing computational cost.

It also introduced label smoothing as a regularization technique.
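Label smoothing replaces the one-hot training target with a mix of the one-hot vector and a uniform distribution over classes. The sketch below assumes a uniform prior; the function name and the `epsilon=0.1` default are illustrative choices.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution.

    Each class gets epsilon / num_classes; the true class additionally
    keeps (1 - epsilon) of the probability mass.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]

target = smooth_labels(true_class=2, num_classes=4)
print(target)        # true class near 0.925, the others near 0.025
print(sum(target))   # still sums to 1 (up to rounding)
```

Because the target never reaches exactly 1, the model is discouraged from producing extremely confident logits.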

L1 Regularization (Lasso)

A method of penalizing weights in proportion to their L1 norm, the sum of their absolute values (Manhattan distance). It pushes weights toward exactly zero, which encourages sparsity.

$$\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i |w_i|$$

Where $\lambda$ is the regularization strength and $w_i$ are the model weights.
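To see why the L1 penalty drives weights to zero, it helps to look at its gradient. The sketch below computes the penalty and its subgradient for a few hypothetical weights, with the gradient of a squared (L2-style) penalty included only for contrast. The names `lam` and `weights` are illustrative.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l1_subgradient(w, lam):
    """Subgradient of lam * |w|: constant magnitude, however small w is."""
    if w > 0:
        return lam
    if w < 0:
        return -lam
    return 0.0

def l2_gradient(w, lam):
    """Gradient of lam * w**2 (for contrast): shrinks as w shrinks."""
    return 2.0 * lam * w

weights = [0.5, -0.01, 0.0]   # hypothetical weights
lam = 0.1

print(l1_penalty(weights, lam))
for w in weights:
    print(w, l1_subgradient(w, lam), l2_gradient(w, lam))
```

The pull from the L1 term on the tiny weight -0.01 is as strong as on 0.5, so small weights get pushed all the way to zero, whereas a squared penalty's pull fades as the weight approaches zero.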

L2 Regularization (Ridge)

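A method of penalizing weights in proportion to their squared L2 norm, the sum of their squared values (squared Euclidean distance). It shrinks all weights toward zero but rarely makes them exactly zero, so it does not encourage sparsity the way L1 does.

$$\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i w_i^2$$

Where $\lambda$ is the regularization strength and $w_i$ are the model weights.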

Language Model Samplers

| Sampler | How it works | The Flaw |
| --- | --- | --- |
| Top-K | Keeps only the $K$ most likely tokens. | Too rigid; ignores context confidence. |
| Top-P | Keeps the most likely tokens until their cumulative probability hits $P\%$. | Can let in terrible tokens if the model is confused. |
| Min-P | Sets a threshold relative to the top token's probability (e.g., must be at least 10% as likely as the #1 choice). | None obvious; it adapts to both high confidence and high chaos, though the base threshold still needs tuning. |
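The three filters can be sketched over a plain list of token probabilities. This is a simplified illustration, not any particular library's implementation: the function names and the two example distributions are hypothetical, and a real sampler would renormalize the kept probabilities and then draw from them.

```python
def ranked(probs):
    """Token indices sorted from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k(probs, k):
    """Keep the k most likely tokens."""
    return ranked(probs)[:k]

def top_p(probs, p):
    """Keep ranked tokens until their cumulative probability reaches p."""
    kept, total = [], 0.0
    for i in ranked(probs):
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

def min_p(probs, ratio):
    """Keep tokens at least `ratio` times as likely as the top token."""
    threshold = ratio * max(probs)
    return [i for i in ranked(probs) if probs[i] >= threshold]

confident = [0.90, 0.05, 0.03, 0.02]        # model is sure of token 0
confused = [0.30, 0.25, 0.20, 0.15, 0.10]   # model is unsure

for name, probs in (("confident", confident), ("confused", confused)):
    print(name, top_k(probs, 3), top_p(probs, 0.8), min_p(probs, 0.1))
```

On the confident distribution, Top-K with K=3 still admits two unlikely tokens, while Min-P keeps only the top one; on the confused distribution Min-P keeps every candidate.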

Matching Phase (Bipartite Matching)

To calculate precision and recall, we must assign predictions to ground truth points. We cannot just measure the distance of every prediction to every ground truth. We need a 1:1 mapping.

For each prediction, we find the nearest ground truth point.

If the distance is within the allowed tolerance, we consider it a match. If the ground truth point is already matched to a closer prediction, we consider the current prediction a false positive.

Any ground truth point without a match is a false negative.

| Class | Definition |
| --- | --- |
| True Positive (TP) | A prediction that is within distance $d$ of an unmatched ground truth point. |
| False Positive (FP) | A prediction that does not fall within distance $d$ of any ground truth point, or whose nearest ground truth point is already "claimed" by a closer prediction. |
| False Negative (FN) | A ground truth point that has no matched prediction within distance $d$. |
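The procedure above can be sketched as a greedy nearest-neighbour matcher. Note the assumptions: this follows the steps as described (each prediction only considers its single nearest ground truth point, and closer predictions claim first), so it is a greedy approximation rather than an optimal bipartite assignment such as the Hungarian algorithm. The function name and sample points are hypothetical.

```python
import math

def match_points(predictions, ground_truth, d):
    """Greedy 1:1 matching. Returns (tp, fp, fn, matches)."""
    # Nearest ground truth point (distance, pred index, gt index) per prediction.
    nearest = []
    if ground_truth:
        for p_idx, p in enumerate(predictions):
            g_idx = min(
                range(len(ground_truth)),
                key=lambda j: math.dist(p, ground_truth[j]),
            )
            nearest.append((math.dist(p, ground_truth[g_idx]), p_idx, g_idx))

    claimed = set()
    matches = []
    # Process closest pairs first so a closer prediction claims its point.
    for dist, p_idx, g_idx in sorted(nearest):
        if dist <= d and g_idx not in claimed:
            claimed.add(g_idx)
            matches.append((p_idx, g_idx))

    tp = len(matches)
    fp = len(predictions) - tp    # too far away, or target already claimed
    fn = len(ground_truth) - tp   # ground truth points left unmatched
    return tp, fp, fn, matches

gt = [(0.0, 0.0), (10.0, 10.0)]
preds = [(0.5, 0.0), (1.0, 0.0), (20.0, 20.0)]
print(match_points(preds, gt, d=2.0))
```

Here the first prediction claims the point at the origin, the second is a false positive because that point is already taken, and the third is a false positive because it is too far from anything; the point at (10, 10) is a false negative.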

precision (metric)

You’re fishing with a net. You use a wide net, and catch 80 of 100 total fish in a lake. That’s 80% recall. But you also get 80 rocks in your net. That means 50% precision, half of the net’s contents is junk. You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.

$$\text{Precision} = \frac{TP}{TP + FP}$$

recall (metric)

You’re fishing with a net. You use a wide net, and catch 80 of 100 total fish in a lake. That’s 80% recall. But you also get 80 rocks in your net. That means 50% precision, half of the net’s contents is junk. You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.

Also called the True Positive Rate (TPR).

$$\text{Recall} = \frac{TP}{TP + FN}$$
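The fishing example maps directly onto the formulas if fish caught are true positives, rocks caught are false positives, and fish left in the lake are false negatives. A short check of the arithmetic (the zero-denominator fallback is a convention chosen for this sketch):

```python
def precision(tp, fp):
    """Share of the catch that is actually fish."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of the lake's fish that ended up in the net."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Wide net: 80 fish caught, 80 rocks caught, 20 fish missed.
wide = {"tp": 80, "fp": 80, "fn": 20}
# Small net: 20 fish caught, 0 rocks, 80 fish missed.
small = {"tp": 20, "fp": 0, "fn": 80}

for name, net in (("wide", wide), ("small", small)):
    print(name, precision(net["tp"], net["fp"]), recall(net["tp"], net["fn"]))
```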

Skip Connections

In classical backpropagation through hidden layers, we focused heavily on calculating gradients to update our weights, solving for $\frac{\partial L}{\partial W}$. To reach a weight matrix deep in the early layers of the network, the error signal from the loss function must first survive the journey through all the intermediate activations above it.

$$\frac{\partial L}{\partial a^{[l]}} = \sum \left( \dots \cdot W^{[l+2]} \cdot \sigma'(Z^{[l+1]}) \cdot W^{[l+1]} \right)$$

We multiply by the weight matrix $W^{[l]}$ and the activation derivative $\sigma'(Z^{[l]})$ at every single layer. When $\sigma' < 1$, or the initialized weights are small, these terms multiply together and decay exponentially toward zero, causing the vanishing gradient problem.
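A scalar toy example makes the decay concrete. It assumes a sigmoid activation (whose derivative is at most 0.25) and one scalar weight per layer; both are simplifications chosen for illustration.

```python
import math

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backprop_factor(num_layers, weight=1.0, z=0.0):
    """Product of weight * sigma'(z) accumulated over num_layers layers."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_prime(z)
    return factor

for depth in (1, 5, 10, 20):
    print(depth, backprop_factor(depth))
```

Even in the best case for the sigmoid (z = 0, weight 1), ten layers scale the gradient by 0.25 to the tenth power, which is below one millionth.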

When we add skip connections [1], we add the input $a^{[l]}$ as a residual:

$$Z^{[l+1]} = \mathcal{F}(a^{[l]}, W^{[l+1]}) + a^{[l]}$$

$\mathcal{F}$ represents the composite function of all operations. We take the input $a^{[l]}$, pass it through our architecture $\mathcal{F}$, and add the result back to the original input.

During backpropagation, we want to pass the gradient of the loss $L$ back to $a^{[l]}$. Applying the chain rule:

$$\frac{\partial L}{\partial a^{[l]}} = \frac{\partial L}{\partial Z^{[l+1]}} \cdot \frac{\partial Z^{[l+1]}}{\partial a^{[l]}}$$

Now, let's expand the local derivative $\frac{\partial Z^{[l+1]}}{\partial a^{[l]}}$ based on our residual equation:

$$\frac{\partial Z^{[l+1]}}{\partial a^{[l]}} = \frac{\partial}{\partial a^{[l]}} \left( \mathcal{F}(a^{[l]}, W^{[l+1]}) + a^{[l]} \right)$$

$$\frac{\partial Z^{[l+1]}}{\partial a^{[l]}} = \frac{\partial \mathcal{F}}{\partial a^{[l]}} + 1$$

Substitute this back into the chain rule:

$$\frac{\partial L}{\partial a^{[l]}} = \frac{\partial L}{\partial Z^{[l+1]}} \left( \frac{\partial \mathcal{F}}{\partial a^{[l]}} + 1 \right)$$

$$\frac{\partial L}{\partial a^{[l]}} = \frac{\partial L}{\partial Z^{[l+1]}} \frac{\partial \mathcal{F}}{\partial a^{[l]}} + \frac{\partial L}{\partial Z^{[l+1]}}$$

The first term, $\frac{\partial L}{\partial Z^{[l+1]}} \frac{\partial \mathcal{F}}{\partial a^{[l]}}$, represents the standard gradient flowing through the weights of the convolutional layers. In a deep network, this term might still vanish to zero.

The second term, $+\frac{\partial L}{\partial Z^{[l+1]}}$, is the same gradient from the layer above, passing unchanged through the + operator. Because this term is added rather than multiplied, it reaches $a^{[l]}$ without attenuation even when the first term vanishes, and the loss landscape becomes smoother.

Instead of forcing $\mathcal{F}$ to learn an entirely new representation of the data, the skip connection carries $a^{[l]}$ forward unchanged. $\mathcal{F}$ is now responsible only for learning the residual needed to improve the current representation. If a particular layer decides it doesn't need to learn anything new, the optimizer pushes the weights of $\mathcal{F}$ toward zero. The output becomes $Z^{[l+1]} = 0 + a^{[l]} = a^{[l]}$.
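The effect of the extra "+ 1" in the local derivative can be checked with a scalar toy model. The sketch assumes each layer's $\frac{\partial \mathcal{F}}{\partial a^{[l]}}$ is a single small number and stacks 20 such layers; the values are hypothetical.

```python
def gradient_through_stack(local_grads, residual):
    """Multiply per-layer local derivatives from the top of the stack down.

    local_grads: dF/da for each layer (scalars, for illustration).
    residual: if True, each layer contributes (dF/da + 1) instead of dF/da.
    """
    grad = 1.0  # dL/dZ arriving at the top of the stack
    for f in local_grads:
        grad *= (f + 1.0) if residual else f
    return grad

local = [0.01] * 20   # every layer's own path is close to vanishing

print("plain:   ", gradient_through_stack(local, residual=False))
print("residual:", gradient_through_stack(local, residual=True))

# If F learns nothing at all (dF/da = 0), the residual path still passes
# the gradient through unchanged, while the plain path blocks it entirely.
print(gradient_through_stack([0.0] * 20, residual=True))
```

Without skip connections the gradient is effectively zero after 20 layers; with them it stays on the order of 1.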

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition.” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385

Sparsity

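Sparsity is when you force most of the weights or activations in a neural network to be zero, keeping only the most important connections. As a rule of thumb, you can often keep around 98% of the accuracy while removing most of the weights, though the achievable trade-off depends on the model, the task, and the pruning method.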
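One common way to impose sparsity is magnitude pruning: zero out the weights with the smallest absolute values. The sketch below is an illustrative assumption about how sparsity might be applied, not a method stated above; the function name and weights are hypothetical, and real pipelines usually fine-tune after pruning to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.01, -0.6, 0.02, 0.3, -0.001, 0.04]
pruned = magnitude_prune(weights, sparsity=0.75)

print(pruned)
print("fraction zero:", pruned.count(0.0) / len(pruned))
```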