Amortized Inference

Parametric Feedforward Networks

backpropagation

The fundamental algorithm for training artificial neural networks by efficiently calculating gradients of the loss function with respect to network weights using the chain rule.

You can see a fully-derived backpropagation example on both a sigmoid and softmax-output neural network here.

Beam-search decoding

Chain Parsing (CSP Topology)

The real-time translation of Unix shell operators (|, &&, ||, ;) into a simplified, linear architecture of Communicating Sequential Processes (CSP). Rather than the LLM orchestrating multiple independent function calls, Chain Parsing allows the agent to construct an on-the-fly execution graph within a single string constraint. This relies heavily on the Rule of Composition, passing data seamlessly from one process to the next.

See [1], [2] and [3] for more resources on the formalization of the subject.

[1]
C. A. R. Hoare, Communicating Sequential Processes. Prentice-Hall, 1985. [Online]. Available: https://antares.sip.ucm.es/~luis/doctorado06-07/cspbook.pdf
[2]
R. Milner, J. Parrow, and D. Walker, “A Calculus of Mobile Processes,” University of Edinburgh, techreport, 1992. [Online]. Available: https://www.cis.upenn.edu/~stevez/cis670/pdfs/pi-calculus.pdf
[3]
E. S. Raymond, The Art of Unix Programming. Addison-Wesley, 2003. [Online]. Available: https://cdn.nakamotoinstitute.org/docs/taoup.pdf
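To make the topology concrete, here is a minimal, hypothetical Python sketch: the agent emits one shell string, and the shell (not the LLM) wires stdout to stdin at each `|`. The pipeline string is illustrative and assumes a POSIX shell with `printf`, `sort`, and `head` available.

```python
import subprocess

# One string constraint encodes the whole execution graph: three processes
# composed linearly, each passing its output to the next (Rule of Composition).
pipeline = "printf 'banana\\napple\\ncherry\\n' | sort | head -n 1"
result = subprocess.run(pipeline, shell=True, capture_output=True, text=True)
print(result.stdout.strip())  # the lexicographically first line: "apple"
```

The key design point is that the LLM only has to generate one well-formed command string; composition and data flow are delegated to the shell.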

chunking (rag)

https://old.reddit.com/r/Rag/comments/1r47duk/we_benchmarked_7_chunking_strategies_most_best/

Direct Pixel Optimization

If we freeze a model’s weights (e.g., a ResNet10 classifier), feed it a low-resolution image, set input_batch.requires_grad = True, and define a structural loss function, we can optimize the image directly:

x_{t+1} = x_t - \alpha \frac{\partial \mathcal{L}}{\partial x_t}

This is the principle behind Deep Image Prior [1] for image restoration tasks. However, there is a more effective modeling approach for such tasks: Parametric Feedforward Networks.

[1]
D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep Image Prior,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1867–1888, Mar. 2020, doi: 10.1007/s11263-020-01303-4.
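A minimal NumPy sketch of the update rule above, with a fixed random linear map standing in for the frozen network and a least-squares objective standing in for the structural loss (both are illustrative, not the ResNet10 setup):

```python
import numpy as np

# The "frozen network" is a fixed linear map W (never updated); the structural
# loss pulls the network's response W @ x toward a target. Only the input
# pixels x receive gradient updates.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # frozen weights
x = rng.normal(size=8)               # the "image" being optimized directly
target = np.ones(4)

alpha = 0.05
for _ in range(10000):
    residual = W @ x - target        # L = 0.5 * ||W x - target||^2
    grad_x = W.T @ residual          # dL/dx; no weight gradients anywhere
    x -= alpha * grad_x              # x_{t+1} = x_t - alpha * dL/dx_t

print(float(0.5 * np.sum((W @ x - target) ** 2)) < 1e-6)  # loss driven to ~0
```

Note that the weight matrix `W` never changes; the optimization variable is the input itself, exactly mirroring the update rule above.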

Discriminative Region Problem

When a standard classifier ends up only locating the most distinguishing features in order to output a confident classification, omitting other important traits.

Dynamic Masking Augmentation Techniques

A limitation of the Park et al. formalization is that dynamic masking techniques, such as PuzzleMix [1] and AutoMix [2], cannot be formally explained by their framework, as their theorem relies on the mask being stochastic and independent of pixel values. When we add pixel-level awareness, we can no longer approximate the loss function as a straightforward input gradient.

Empirically, though they achieve state-of-the-art results against MixUp and CutMix in selected setups, the overall impact on head-class prediction is small. These techniques also take significantly more compute per batch than stochastic techniques.

[1]
J.-H. Kim, W. Choo, and H. O. Song, “Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup.” 2020. [Online]. Available: https://arxiv.org/abs/2009.06962
[2]
Z. Liu et al., “AutoMix: Unveiling the Power of Mixup for Stronger Classifiers.” 2022. [Online]. Available: https://arxiv.org/abs/2103.13027

Exponential Moving Average

After every training step, we update the shadow model by keeping a large fraction of its past self and adding a small fraction of the new active weights:

\theta_{ema}^{(t)} = \beta \cdot \theta_{ema}^{(t-1)} + (1 - \beta) \cdot \theta^{(t)}
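A minimal sketch of the update in plain Python (the dict-of-floats parameter format and the `beta` value are illustrative):

```python
def ema_update(theta_ema, theta, beta=0.999):
    # theta_ema <- beta * theta_ema + (1 - beta) * theta, per parameter
    return {k: beta * theta_ema[k] + (1.0 - beta) * theta[k] for k in theta_ema}

shadow = {"w": 0.0}
for step in range(1, 4):
    active = {"w": float(step)}  # stand-in for freshly updated weights
    shadow = ema_update(shadow, active, beta=0.9)

print(round(shadow["w"], 3))  # the shadow lags behind the active weights
```

With `beta` close to 1, the shadow model changes slowly and smooths out the noise of individual training steps.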

F1 score

Harmonic mean of Precision and Recall. We use it when we want a single metric that balances both, especially when we have an uneven class distribution.

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}
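As a worked check of the two equivalent forms, using the fishing-net numbers from the precision and recall entries (80 TP, 80 FP, 20 FN):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # 0.5 with the numbers below
    recall = tp / (tp + fn)      # 0.8
    return 2 * precision * recall / (precision + recall)

# harmonic-mean form agrees with 2*TP / (2*TP + FP + FN) = 160 / 260
print(round(f1_score(80, 80, 20), 3))  # 0.615
```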

False Positive Rate (FPR)

\text{FPR} = \frac{FP}{FP + TN}

Generative Image Augmentation

Earlier work has shown that using diffusion models to synthesize and augment data improves classification results on ImageNet [1], [2]. Given the latest developments in image generation models and their capabilities, these techniques have moved toward more sophisticated enhancements, such as background diversification and saliency-aware generation pipelines [3], [4], [5]. Generally, these techniques isolate the ground-truth subject via segmentation or feature extraction, then use the generative model as an augmenter to build the surrounding context and handle the blending.

[1]
S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, “Synthetic Data from Diffusion Models Improves ImageNet Classification.” 2023. [Online]. Available: https://arxiv.org/abs/2304.08466
[2]
B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov, “Effective Data Augmentation With Diffusion Models.” 2025. [Online]. Available: https://arxiv.org/abs/2302.07944
[3]
F. Rahat, M. S. Hossain, M. R. Ahmed, S. K. Jha, and R. Ewetz, “Data Augmentation for Image Classification using Generative AI.” 2024. [Online]. Available: https://arxiv.org/abs/2409.00547
[4]
B. Chen et al., “XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation.” 2025. [Online]. Available: https://arxiv.org/abs/2506.21416
[5]
T. Zhao et al., “Salient Concept-Aware Generative Data Augmentation,” Advances in Neural Information Processing Systems 38 (NeurIPS 2025). 2025.

Greedy Decoding

ImageNet

ImageNet [1] is a large-scale visual dataset designed for object recognition research. It became foundational to modern computer vision, especially after the University of Toronto team’s breakthrough in the 2012 ImageNet Large Scale Visual Recognition Challenge using AlexNet [2]. The original dataset includes 14 million images and 21,841 synsets (classes), though teams often evaluate on the ILSVRC subset benchmark [3], consisting of 1.2 million training images and 1,000 classes.

The original ImageNet images were not exhaustively labeled: an image labeled “dog” might also contain trees, cars, or people, yet each image carries a single label. ILSVRC was essential for adding stricter quality control, as well as bounding boxes and object localization annotations for more nuanced identification.

[1]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009, doi: 10.1109/CVPR.2009.5206848.
[2]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097–1105, 2012.
[3]
O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.

Inception-v3

Inception-v3 is a 48-layer deep convolutional neural network (CNN) for image classification, developed by Google in 2016 to improve accuracy while reducing computational cost.

Introduced label smoothing.

Jetson AGX Orin

An edge AI system-on-module from NVIDIA, used for deploying models on embedded devices.

L1 Regularization (Lasso)

A method of penalizing weights proportional to their L1 norm, the sum of their absolute values (Manhattan Distance). Pushes weights to zero, encourages sparsity.

\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i |w_i|

Where $\lambda$ controls the regularization strength and $w_i$ are the model weights.

L2 Regularization (Ridge)

A method of penalizing weights proportional to their squared L2 norm, the sum of their squared values (Euclidean distance). Shrinks weights toward zero without forcing them to exactly zero, discouraging any single large weight.

\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i w_i^2

Where $\lambda$ controls the regularization strength and $w_i$ are the model weights.

Linear Attention

Standard transformers apply a softmax over the attention scores, which introduces quadratic complexity in sequence length. Modern techniques approximate this computation to reduce the complexity to linear time.

[1]
M. Xu, X. Lin, X. Guo, W. Xu, and W. Cui, “Softmax Linear Attention: Reclaiming Global Competition.” 2026. [Online]. Available: https://arxiv.org/abs/2602.01744
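A NumPy sketch of the kernel-trick flavor of linear attention: a positive feature map replaces the softmax, and associativity lets us compute `phi(Q) @ (phi(K).T @ V)` instead of the quadratic `(phi(Q) @ phi(K).T) @ V`. The `elu(x) + 1` feature map and all shapes here are illustrative, not the method of [1].

```python
import numpy as np

# phi(x) = elu(x) + 1 keeps features positive so the normalizer is well-behaved
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d_v): built once, O(n d d_v)
    z = Qp @ Kp.sum(axis=0) + eps    # per-query normalizer, O(n d)
    return (Qp @ kv) / z[:, None]    # never materializes the (n, n) score matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

The output matches the quadratic formulation exactly; only the order of matrix multiplications changes, turning O(n²) into O(n) in sequence length.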

Linear Scaling Law

Batch size versus learning rate dynamics [1]:

If we multiply the batch size by a factor of $k$, we must also multiply the learning rate by $k$:

\text{lr}_{new} = \text{lr}_{old} \times \frac{N_{new}}{N_{old}}
[1]
P. Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” 2018. [Online]. Available: https://arxiv.org/abs/1706.02677
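A one-line sketch of the rule (the function name is illustrative):

```python
def scale_lr(lr_old, batch_old, batch_new):
    # the learning rate scales by the same factor k as the batch size
    return lr_old * (batch_new / batch_old)

print(scale_lr(0.1, 256, 1024))  # k = 4, so 0.1 -> 0.4
```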

Mixed-Sample Data Augmentation

In 2022, Park et al. provided a unified formal treatment of mixed-sample data augmentation methods (MSDA) [1], demonstrating that such techniques, including CutMix [2] and MixUp [3], were equivalent to designing a spatial decay kernel for gradient regularization. Each technique possesses its own strengths and limitations. For example, CutMix may introduce multi-label noise, erasing a class that only existed in the cutout area or introducing a new class that isn’t actually present in the pasted patch. There are some further static enhancements on these techniques, such as Fourier Mix (FMix) [4], which produces smooth, continuous patches instead of sharp ones, and ResizeMix [5], which resizes the patching image instead of cropping.

The Park et al. formalization demonstrates how MSDA reshapes the loss landscape, mathematically forcing it to learn smoother functions by penalizing erratic changes in its input gradients, weighted by how close the data points are (spatial decay). The core of the theorem relies on the stochastic nature of applicable image augmentation techniques. They used their theorems to formalize Hybrid Mix (HMix) and Gaussian Mix (GMix), intermediate spatial regularization techniques to sit between CutMix and MixUp distributions.

[1]
C. Park, S. Yun, and S. Chun, “A Unified Analysis of Mixed Sample Data Augmentation: A Loss Function Perspective.” 2022. [Online]. Available: https://arxiv.org/abs/2208.09913
[2]
S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.” 2019. [Online]. Available: https://arxiv.org/abs/1905.04899
[3]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” International Conference on Learning Representations (ICLR), 2018.
[4]
E. Harris, A. Marcu, M. Painter, M. Niranjan, A. Prügel-Bennett, and J. Hare, “FMix: Enhancing Mixed Sample Data Augmentation.” 2021. [Online]. Available: https://arxiv.org/abs/2002.12047
[5]
J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang, “ResizeMix: Mixing Data with Preserved Object Information and True Labels.” 2020. [Online]. Available: https://arxiv.org/abs/2012.11101
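A minimal NumPy sketch of the MixUp [3] operation on stand-in images and one-hot labels (CutMix [2] would instead paste a rectangular patch and weight the labels by its area):

```python
import numpy as np

# x1/x2 are stand-in "images", y1/y2 one-hot labels; lambda ~ Beta(alpha, alpha)
rng = np.random.default_rng(0)
x1, x2 = np.full((4, 4), 1.0), np.full((4, 4), 0.0)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

lam = rng.beta(1.0, 1.0)                 # alpha = 1 gives a uniform lambda
x_mix = lam * x1 + (1.0 - lam) * x2      # global per-pixel blend ("ghosting")
y_mix = lam * y1 + (1.0 - lam) * y2      # labels blended by the same lambda

print(np.isclose(x_mix.mean(), lam), np.isclose(y_mix.sum(), 1.0))
```

The blended label keeps total probability mass of 1, which is what lets the pair be trained with an ordinary cross-entropy loss.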

MSDA Regularization

MixUp blends images globally across all pixels, creating semi-transparent ghost images that hamper structural awareness. MixUp-CAM [1] introduces uncertainty regularization constraints to address this:

\mathcal{L}_{all} = \mathcal{L}_{cls}(I^{\prime}, Y^{\prime}) + \lambda_{em}\mathcal{L}_{em}(I^{\prime}) + \lambda_{con}\mathcal{L}_{con}(M)

This formula combines a classification loss, class-wise entropy regularization, and spatial concentration loss to optimize the MixUp procedure, and prevent the response from being too divergent.

CutMix already enforces spatial regularization through its geometry, encouraging feature exploration while preserving spatial integrity. There is some research into techniques like semantic proportioning, making the augmentation differentiable, and using dynamic view-scale crops [2], [3], [4] to incrementally improve augmentations. These gains are incremental and setup-specific, whereas the original CutMix usually suffices and is the common benchmark.

[1]
Y.-T. Chang, Q. Wang, W.-C. Hung, R. Piramuthu, Y.-H. Tsai, and M.-H. Yang, “Mixup-CAM: Weakly-supervised Semantic Segmentation via Uncertainty Regularization.” 2020. [Online]. Available: https://arxiv.org/abs/2008.01201
[2]
S. Huang, X. Wang, and D. Tao, “SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data.” 2020. [Online]. Available: https://arxiv.org/abs/2012.04846
[3]
B. Li et al., “DDAug: Differentiable Data Augmentation for Weakly Supervised Semantic Segmentation,” Trans. Multi., vol. 26, pp. 4764–4775, Jan. 2024, doi: 10.1109/TMM.2023.3326300.
[4]
H. Kim, D. Kim, P. Ahn, S. Suh, H. Cho, and J. Kim, “ContextMix: A context-aware data augmentation method for industrial visual inspection systems,” Engineering Applications of Artificial Intelligence, vol. 131, p. 107842, May 2024, doi: 10.1016/j.engappai.2023.107842.

N-Gram Modeling

Language modeling has evolved from discrete statistical counting to continuous neural representation. Statistical n-gram models, first demonstrated by Claude Shannon in 1948 [1], estimate word probabilities based on raw frequency counts observed in a training corpus. These models inherently suffer from data sparsity, in which sequences not seen during training receive a zero probability. We address this by applying Add-k smoothing, a technique that reassigns probability mass to unseen events to ensure the model remains functional.
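A minimal bigram sketch of Add-k smoothing (the toy corpus and k = 1 are illustrative):

```python
from collections import Counter

def addk_bigram_prob(word, prev, bigrams, unigrams, vocab_size, k=1.0):
    # P(word | prev) = (count(prev, word) + k) / (count(prev) + k * |V|)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # 5 word types

seen = addk_bigram_prob("cat", "the", bigrams, unigrams, V)   # observed bigram
unseen = addk_bigram_prob("sat", "the", bigrams, unigrams, V) # never observed
print(round(seen, 3), round(unseen, 3))  # 0.286 0.143
```

The unseen bigram ("the sat") gets nonzero probability instead of zeroing out the whole sequence, at the cost of slightly discounting observed counts.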

In 2003, Bengio et al. introduced Neural Probabilistic Language Models [2]. They replaced discrete counts with word embeddings, allowing neural models to solve sparsity through interpolation within the embedding space, mapping similar words to nearby vector coordinates. This allows for generalization beyond the specific sequences encountered during training.

[1]
C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[2]
Y. Bengio, R. Ducharme, and P. Vincent, “A Neural Probabilistic Language Model,” in Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, and V. Tresp, Eds., MIT Press, 2000. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf

Nucleus Sampling

Parametric Feedforward Networks

We feed the model a low-resolution input $x_{LR}$, and it outputs a high-resolution prediction $\hat{x}_{HR}$:

\hat{x}_{HR} = \mathcal{F}(x_{LR}, \theta)

During training, we calculate the loss between our prediction and the ground-truth high-res image $y_{HR}$. We backpropagate to update the weights, not the pixels:

\theta_{new} = \theta_{old} - \alpha \frac{\partial \mathcal{L}(\hat{x}_{HR}, y_{HR})}{\partial \theta}
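A NumPy sketch of this training loop, with a single linear layer standing in for $\mathcal{F}$ and a fixed upsampling map generating synthetic ground truth (all names and shapes are illustrative):

```python
import numpy as np

# A linear layer stands in for F(x; theta); A_true generates synthetic
# "high-res" targets from 2-pixel "low-res" inputs.
rng = np.random.default_rng(1)
theta = rng.normal(scale=0.1, size=(4, 2))    # weights get updated, not pixels

x_lr = rng.normal(size=(16, 2))               # batch of low-res inputs
A_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
y_hr = x_lr @ A_true.T                        # ground-truth high-res targets

alpha = 0.1
for _ in range(2000):
    pred = x_lr @ theta.T                     # x_hat_HR = F(x_LR, theta)
    grad = (pred - y_hr).T @ x_lr / len(x_lr) # dL/dtheta, L = 0.5 * mean ||.||^2
    theta -= alpha * grad                     # theta_new = theta_old - alpha * dL/dtheta

print(float(np.mean((x_lr @ theta.T - y_hr) ** 2)) < 1e-8)
```

In contrast to direct pixel optimization, the inputs never change; once trained, the parameters $\theta$ amortize the cost of inference over all future inputs.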

See Amortized Inference.

Perceptual Loss

Perceptual loss for super-resolution [1]:

We extract the activation maps $a^{[l]}$ from deep inside the frozen network (e.g., the output of layer3) and compute the loss between the feature representations, not the pixels:

\mathcal{L}_{perceptual} = \frac{1}{C_l H_l W_l} || \mathcal{E}_{frozen}^{(l)}(\hat{y}) - \mathcal{E}_{frozen}^{(l)}(y) ||_2^2
[1]
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” 2016. [Online]. Available: https://arxiv.org/abs/1603.08155
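A NumPy sketch of the idea, with a fixed random projection standing in for the frozen extractor $\mathcal{E}_{frozen}$ (illustrative only, not the pretrained VGG/ResNet features used in practice):

```python
import numpy as np

# E stands in for the frozen feature extractor's layer-l weights.
rng = np.random.default_rng(0)
E = rng.normal(size=(32, 64))

def perceptual_loss(y_hat, y):
    f_hat, f = E @ y_hat, E @ y              # activation maps of both images
    return float(np.mean((f_hat - f) ** 2))  # MSE in feature space, not pixels

y = rng.normal(size=64)
y_shifted = y + 0.1                          # a slightly different "image"
print(perceptual_loss(y, y), perceptual_loss(y, y_shifted) > 0.0)  # 0.0 True
```

Because the extractor is frozen, gradients flow through `E` back to the prediction, but `E` itself is never updated.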

Poisson Distribution

A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

The probability of observing exactly $k$ events is given by the formula:

P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}

Where $\lambda$ is the mean number of events per interval and $k$ is the observed count.
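A direct implementation of the PMF from the formula above:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = e^{-lambda} * lambda^k / k!
    return exp(-lam) * lam ** k / factorial(k)

# e.g., probability of exactly 2 events when the mean rate is 3 per interval
print(round(poisson_pmf(2, 3.0), 4))  # 0.224
```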

Post-Training

Post-training [1]

[1]
B. Rank et al., “PostTrainBench: Can LLM Agents Automate LLM Post-Training?” 2026. [Online]. Available: https://arxiv.org/abs/2603.08640

precision (metric)

You’re fishing with a net. You use a wide net, and catch 80 of 100 total fish in a lake. That’s 80% recall. But you also get 80 rocks in your net. That means 50% precision, half of the net’s contents are junk. You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.

\text{Precision} = \frac{TP}{TP + FP}

Prompt Engineering

Abstract

In many real-world applications, Large Language Models (LLMs) are used to reason over structured tabular data rather than plain text. Examples include risk assessment, customer churn prediction, and medical triage systems. We investigate whether advanced prompt engineering strategies improve LLM performance when making predictions from structured data. We use LLMs for a tabular reasoning task, design and compare prompt engineering techniques, evaluate predictions using classification metrics, and analyze where LLM reasoning fails. Full code implementation and data are available on GitHub.

Prompting Techniques

We have documented all prompt techniques used:

Results

Table: Results

| LLM | Prompting Method | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| google/gemini-3-flash-preview | Zero-shot | 0.990 | 0.994709 | 0.979167 | 0.986877 |
| google/gemini-3-flash-preview | Five-shot | 0.988 | 1.000000 | 0.968750 | 0.984127 |
| google/gemini-3-flash-preview | CoT | 0.982 | 0.989305 | 0.963542 | 0.976253 |
| google/gemini-3-flash-preview | Self-consistency (3) | 0.986 | 1.000000 | 0.963542 | 0.981432 |
| google/gemini-3-flash-preview | ToT | 0.974 | 0.994475 | 0.937500 | 0.965147 |
| openai/gpt-4.1-nano | Zero-shot | 0.780 | 0.879630 | 0.494792 | 0.633333 |
| openai/gpt-4.1-nano | Five-shot | 0.776 | 0.785714 | 0.572917 | 0.662651 |
| openai/gpt-4.1-nano | CoT | 0.692 | 0.588785 | 0.656250 | 0.620690 |
| openai/gpt-4.1-nano | Self-consistency (3) | 0.758 | 0.669856 | 0.729167 | 0.698254 |
| openai/gpt-4.1-nano | ToT | 0.616 | 0.000000 | 0.000000 | 0.000000 |

: Full comparison results table comparing all prompting techniques.

[Figure: Confusion matrices for Gemini 3 Flash Preview on the Kaggle Titanic dataset]
[Figure: Confusion matrices for GPT 4.1 Nano on the Kaggle Titanic dataset]

Hard Case Analysis

gemini-3-flash-preview

Full reasoning traces for 9 misclassified samples for Gemini 3 Flash are in Appendix \ref{sec:titanic-results-gemini}.

The model performed slightly worse when prompted for greater reasoning, overthinking our system information. The model primarily missed 3rd-class males who survived: passengers with IDs 82, 205, 268, 745, and 401. Passenger 808, an 18-year-old 3rd-class female, was predicted to have survived but did not, a statistical anomaly. The model primarily made these misclassifications due to particular statistical priors. However, the model explicitly hallucinates passenger ID 70 as a “documented historical anomaly” who survived, though this is false.

The Gemini model is substantially more accurate, faster, and more available than other SOTA models. It is likely it was pre-trained on this particular problem. The next highest achiever was openai/gpt-5.2-chat (missed logging it) at 85% accuracy using zero-shot. Because of this, I believe Gemini likely has easily accessible parametric knowledge about this particular dataset, and that our prompts elicited the particular pre-training configuration it was trained on.

gpt-4.1-nano

Full reasoning traces for 9 misclassified samples for GPT 4.1 Nano are in Appendix \ref{sec:titanic-results-nano}.

Our system prompt elicited the model to consider anomalies in survival, and thus some reasoning responses directly contradicted known statistical knowledge about the event. For example, for passengers 55 and 140, higher-class males were predicted to have survived, despite their greater statistical likelihood of dying. The model also underestimated the survival of 3rd-class women traveling alone. Overall, enhanced reasoning degraded model performance; the shorter prompts likely engaged activations with more direct dependence on statistical assumptions.

recall (metric)

You’re fishing with a net. You use a wide net, and catch 80 of 100 total fish in a lake. That’s 80% recall. But you also get 80 rocks in your net. That means 50% precision, half of the net’s contents are junk. You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.

Also just the True Positive Rate (TPR).

\text{Recall} = \frac{TP}{TP + FN}

Regularization

With only 300 samples per class, the dataset is small, making Stochastic Gradient Descent (SGD) inherently unreliable: the sample complexity is too low for SGD to reliably find flat, generalizable minima. We have to find a regularization strategy that mitigates these limitations and prevents sharp minima (overfitting). This forces us into a situation in which we must squeeze out every state-of-the-art regularization technique we can to make the data as meaningful as possible.

SEAM

It has been demonstrated that standard CAMs fail to cover entire objects because they are translation- and transformation-invariant [1]. The localization and segmentation we seek require equivariance, where masks must transform alongside augmentations. Yude Wang et al. introduced the Self-supervised Equivariant Attention Mechanism (SEAM) with an Equivariant Cross Regularization (ECR) loss: take the original image, generate its CAM, and apply a spatial transformation $T$ to it; then apply $T$ to the input, generate the second CAM, and enforce through the loss that these two pathways yield the exact same spatial tensor:

L_{ECR} = || M(T(x)) - T(M(x)) ||_2^2 \tag{1}

By adding this to the standard cross-entropy loss, we force the network to produce stable object localizations rather than peaked activations that shift wildly when the image is augmented.

In the original implementation by Yude Wang et al., rotation and translation failed to produce sufficient supervision, while rescaling produced major gains ($47.43\%$ to $55.41\%$).

[1]
Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation.” 2020. [Online]. Available: https://arxiv.org/abs/2004.04581
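A toy NumPy check of the equivariance constraint in Eq. (1), with a horizontal flip as $T$ and two stand-in "CAM" operators $M$ (both illustrative, not SEAM's network): an elementwise operator commutes with the flip (zero ECR loss), while a position-dependent one does not.

```python
import numpy as np

T = lambda x: x[:, ::-1]                 # spatial transform: flip columns

def ecr_loss(M, x):
    # || M(T(x)) - T(M(x)) ||^2 -- zero iff M is equivariant under T
    return float(np.sum((M(T(x)) - T(M(x))) ** 2))

x = np.arange(12.0).reshape(3, 4)
M_pointwise = lambda a: a ** 2           # elementwise: equivariant under flips
M_biased = lambda a: a * np.arange(4.0)  # position-dependent: breaks under flips

print(ecr_loss(M_pointwise, x), ecr_loss(M_biased, x) > 0.0)  # 0.0 True
```

During training, this nonzero loss term penalizes attention maps that shift or deform when the input is augmented.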

Skip Connections

In classical hidden-layer backprop calculations, we focused heavily on calculating gradients to update our weights, solving $\frac{\partial L}{\partial W}$. To mathematically reach a weight matrix deep in the early layers of the network, the error signal from the loss function must first survive the journey through all the intermediate activations above it.

\frac{\partial L}{\partial a^{[l]}} = \sum \left( \dots \cdot W^{[l+2]} \cdot \sigma'(Z^{[l+1]}) \cdot W^{[l+1]} \right)

We multiply by the weight matrix $W^{[l]}$ and the activation derivative $\sigma'(Z^{[l]})$ at every single layer. When $\sigma' < 1$, or initialized weights are small, terms multiply together and decay exponentially toward zero, causing the vanishing gradient problem.

When we add skip connections [1] (see our ResNet10 classifier), we add the input $a^{[l]}$ as a residual:

Z^{[l+1]} = \mathcal{F}(a^{[l]}, W^{[l+1]}) + a^{[l]}

$\mathcal{F}$ represents the composite function of all operations. We take the input $a^{[l]}$, pass it through our architecture $\mathcal{F}$, and add the result back to the original input.

During backpropagation, we want to pass the gradient of the loss $L$ back to $a^{[l]}$. Applying the chain rule:

\frac{\partial L}{\partial a^{[l]}} = \frac{\partial L}{\partial Z^{[l+1]}} \cdot \frac{\partial Z^{[l+1]}}{\partial a^{[l]}}

Now, let’s expand the local derivative $\frac{\partial Z^{[l+1]}}{\partial a^{[l]}}$ based on our residual equation:

\frac{\partial Z^{[l+1]}}{\partial a^{[l]}} = \frac{\partial}{\partial a^{[l]}} \left( \mathcal{F}(a^{[l]}, W^{[l+1]}) + a^{[l]} \right) = \frac{\partial \mathcal{F}}{\partial a^{[l]}} + 1

Substitute this back into the chain rule:

\frac{\partial L}{\partial a^{[l]}} = \frac{\partial L}{\partial Z^{[l+1]}} \left( \frac{\partial \mathcal{F}}{\partial a^{[l]}} + 1 \right) = \frac{\partial L}{\partial Z^{[l+1]}} \frac{\partial \mathcal{F}}{\partial a^{[l]}} + \frac{\partial L}{\partial Z^{[l+1]}}

The first term, $\frac{\partial L}{\partial Z^{[l+1]}} \frac{\partial \mathcal{F}}{\partial a^{[l]}}$, represents the standard gradient flowing through the weights of the convolutional layers. In a deep network, this term might still vanish to zero.

The second term, $+\frac{\partial L}{\partial Z^{[l+1]}}$, is the same gradient from the layer above, passing completely through the $+$ operator. Because the gradient is added rather than multiplied, the loss landscape becomes smoother.

Instead of forcing $\mathcal{F}$ to learn a new representation of the data, $x$ carries historical information forward. $\mathcal{F}$ is now responsible for learning only the residual needed to improve the current representation. If a particular layer decides it doesn’t need to learn anything new, the optimizer pushes the weights of $\mathcal{F}$ toward zero. The output becomes $Z^{[l+1]} = 0 + x = x$.

[1]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition.” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385
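A numerical sanity check of the "+1" identity path, using a scalar residual block $z = F(a) + a$ with $F(a) = w a^2$ (an illustrative choice of $F$):

```python
# z = F(a) + a with F(a) = w * a**2, so dz/da = dF/da + 1 = 2*w*a + 1
w, a = 0.5, 3.0

def forward(a):
    return w * a ** 2 + a                # residual block: F(a) plus identity

eps = 1e-6
numeric = (forward(a + eps) - forward(a)) / eps
analytic = 2 * w * a + 1                 # the "+1" is the identity path

print(round(numeric, 4), analytic)       # 4.0 4.0
```

Even if $w$ were driven to zero (so $\frac{\partial \mathcal{F}}{\partial a} = 0$), the gradient would remain 1 rather than vanishing, which is exactly the property that protects deep networks.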

Sparse mixture-of-expert models

Sparsely activated MoE Models.

The Single-Tool Hypothesis

An agent design paradigm stating that a unified string-composition interface outperforms a catalog of highly typed, discrete function calls. By collapsing function selection into command syntax, the framework shifts the LLM’s cognitive load away from schema mapping (context-switching) and toward natural language generation (CLI syntax), which is highly represented in its training data. A full report was developed by a backend lead at Manus describing the concept. Smolagents from Hugging Face approaches the same idea with Python.

See Chain Parsing (CSP Topology).

Top-K Sampling

WSOL

Weakly Supervised Object Localization (WSOL) is a task where we try to locate objects in an image using only image-level classification labels rather than detailed bounding boxes or masks.

WSSS

Weakly-Supervised Semantic Segmentation (WSSS) is a task where the data lacks pixel-level labels and masks, so the model must learn from an alternative supervision signal. Both of these domains apply to our problems, and there is significant literature available.