Full Backpropagation Calculation for Neural Networks

Question 1

Forward Pass

Hidden Layer

$$Z_1^{[1]} = W_{10}^{[1]} + W_{11}^{[1]}a_1^{[0]} + W_{12}^{[1]}a_2^{[0]}$$

$$Z_2^{[1]} = W_{20}^{[1]} + W_{21}^{[1]}a_1^{[0]} + W_{22}^{[1]}a_2^{[0]}$$

$$a_1^{[1]} = \sigma(Z_1^{[1]})$$

$$a_2^{[1]} = \sigma(Z_2^{[1]})$$

Output Layer

$$Z^{[2]} = W_0^{[2]} + W_1^{[2]}a_1^{[1]} + W_2^{[2]}a_2^{[1]}$$

$$a^{[2]} = \sigma(Z^{[2]})$$
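The forward pass above can be sketched directly in numpy. The weight values below are hypothetical placeholders for illustration, not values taken from the question:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass for the 2-2-1 sigmoid network above.
    W1 rows are [bias, w_i1, w_i2] for each hidden node i;
    W2 is [bias, w_1, w_2] for the single output node."""
    z1 = W1[:, 0] + W1[:, 1:] @ x   # hidden pre-activations Z_i^[1]
    a1 = sigmoid(z1)                # hidden activations a_i^[1]
    z2 = W2[0] + W2[1:] @ a1        # output pre-activation Z^[2]
    a2 = sigmoid(z2)                # network output a^[2]
    return z1, a1, z2, a2

# Hypothetical inputs and weights (not from the question):
x = np.array([1.0, 0.5])
W1 = np.array([[0.1, 0.3, -0.2],
               [-0.1, 0.4, 0.2]])
W2 = np.array([0.2, 0.5, -0.3])
z1, a1, z2, a2 = forward(x, W1, W2)
```

Note that the bias terms $W_{10}^{[1]}, W_{20}^{[1]}, W_0^{[2]}$ are stored as the first column/entry of each weight array, matching the indexing in the equations.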


Backpropagation

Binary Cross Entropy Loss Derivative

We assume binary cross entropy loss:

$$L(a^{[2]}, y) = -[y \cdot \log(a^{[2]}) + (1 - y) \cdot \log(1 - a^{[2]})]$$

$$\frac{d}{dx} \log(x) = \frac{1}{x}$$

$$\frac{d}{dx} \log(1-x) = -\frac{1}{1-x}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} = -\left[ y \cdot \frac{1}{a^{[2]}} + (1-y) \cdot \left( -\frac{1}{1-a^{[2]}} \right) \right]$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} = -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}}$$

Derivative of Sigmoid Activation

With $a = \sigma(z) = \frac{1}{1 + e^{-z}}$:

$$\frac{da}{dz} = \frac{d}{dz}(1 + e^{-z})^{-1}$$

$$\frac{da}{dz} = -1(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

$$\frac{da}{dz} = \left( \frac{1}{1 + e^{-z}} \right) \cdot \left( \frac{1 + e^{-z} - 1}{1 + e^{-z}} \right)$$

$$\frac{da}{dz} = \sigma(z) \cdot (1 - \sigma(z))$$
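This identity is easy to sanity-check with a central finite difference at an arbitrary point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare d(sigma)/dz against sigma(z)*(1 - sigma(z)) at an arbitrary z.
z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                      # derived formula
```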


Output Layer Weights

We calculate the gradient for each weight by multiplying the loss derivative, the activation derivative, and the logit derivative.

Weight $W_0^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_0^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (1)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = a^{[2]} - y$$

Weight $W_1^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_1^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = (a^{[2]} - y)a_1^{[1]}$$

Weight $W_2^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_2^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = (a^{[2]} - y)a_2^{[1]}$$
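The cancellation that collapses the first two chain-rule factors into $a^{[2]} - y$ can be checked numerically at an arbitrary prediction and label (the values below are arbitrary, chosen only for the check):

```python
# Arbitrary prediction in (0, 1) and a binary label.
a, y = 0.73, 1.0

# Product of the loss derivative and the sigmoid derivative...
chain = (-y / a + (1 - y) / (1 - a)) * (a * (1 - a))

# ...should equal the simplified form a - y.
simplified = a - y
```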


Hidden Layer

For the hidden layer, we must first find how the loss changes with respect to the hidden activations $a_i^{[1]}$.

Hidden Node 1 Weights

First, we find the gradient of the loss with respect to the activation $a_1^{[1]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial a_1^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (W_1^{[2]})$$

Now we apply the chain rule for the weights in Node 1 (shown here for $W_{11}^{[1]}$; the other weights differ only in the final factor):

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{11}^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( a_1^{[1]}(1 - a_1^{[1]}) \right) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = (a^{[2]} - y)W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

Hidden Node 2 Weights

First, we find the gradient of the loss with respect to the activation $a_2^{[1]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial a_2^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (W_2^{[2]})$$

Now we apply the chain rule for the weights in Node 2 (shown here for $W_{21}^{[1]}$):

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{21}^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( a_2^{[1]}(1 - a_2^{[1]}) \right) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = (a^{[2]} - y)W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$


Final Weight Derivations

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = a^{[2]} - y$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = (a^{[2]} - y) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = (a^{[2]} - y) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_2^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_2^{[0]}$$
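These nine formulas can be verified end-to-end with a numerical gradient check. The sketch below uses hypothetical random weights and inputs (not values from the question) and compares each analytic gradient against a central finite difference of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, x, y):
    """Binary cross-entropy loss of the 2-2-1 sigmoid network.
    params = [W10, W11, W12, W20, W21, W22, W0, W1, W2]."""
    W10, W11, W12, W20, W21, W22, W0, W1, W2 = params
    a1 = sigmoid(W10 + W11 * x[0] + W12 * x[1])    # hidden node 1
    a2h = sigmoid(W20 + W21 * x[0] + W22 * x[1])   # hidden node 2
    a = sigmoid(W0 + W1 * a1 + W2 * a2h)           # output a^[2]
    return -(y * np.log(a) + (1 - y) * np.log(1 - a)), a1, a2h, a

def grads(params, x, y):
    """Analytic gradients from the formulas derived above."""
    _, a1, a2h, a = loss(params, x, y)
    W1, W2 = params[7], params[8]
    d = a - y                       # dL/dZ^[2]
    d1 = d * W1 * a1 * (1 - a1)     # shared factor for hidden node 1
    d2 = d * W2 * a2h * (1 - a2h)   # shared factor for hidden node 2
    return np.array([d1, d1 * x[0], d1 * x[1],
                     d2, d2 * x[0], d2 * x[1],
                     d, d * a1, d * a2h])

# Hypothetical weights and input, then a central finite-difference check.
rng = np.random.default_rng(0)
params = rng.normal(size=9)
x, y = np.array([0.6, -0.3]), 1.0
analytic = grads(params, x, y)
eps = 1e-6
numeric = np.zeros(9)
for i in range(9):
    p_plus, p_minus = params.copy(), params.copy()
    p_plus[i] += eps
    p_minus[i] -= eps
    numeric[i] = (loss(p_plus, x, y)[0] - loss(p_minus, x, y)[0]) / (2 * eps)
```

Agreement between `analytic` and `numeric` to within finite-difference error confirms the derivation.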

Question 2

Forward Pass

Hidden Layer

$$Z_1^{[1]} = W_{10}^{[1]} + W_{11}^{[1]}a_1^{[0]} + W_{12}^{[1]}a_2^{[0]}$$

$$Z_2^{[1]} = W_{20}^{[1]} + W_{21}^{[1]}a_1^{[0]} + W_{22}^{[1]}a_2^{[0]}$$

$$a_1^{[1]} = \sigma(Z_1^{[1]})$$

$$a_2^{[1]} = \sigma(Z_2^{[1]})$$

Output Layer

$$Z_1^{[2]} = W_{10}^{[2]} + W_{11}^{[2]}a_1^{[1]} + W_{12}^{[2]}a_2^{[1]}$$

$$Z_2^{[2]} = W_{20}^{[2]} + W_{21}^{[2]}a_1^{[1]} + W_{22}^{[2]}a_2^{[1]}$$

$$a_1^{[2]} = \frac{\exp(Z_1^{[2]})}{\exp(Z_1^{[2]}) + \exp(Z_2^{[2]})}$$

$$a_2^{[2]} = \frac{\exp(Z_2^{[2]})}{\exp(Z_1^{[2]}) + \exp(Z_2^{[2]})}$$
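The softmax output can be sketched as below. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not something the math above requires; it changes nothing because softmax is invariant to adding a constant to all logits:

```python
import numpy as np

def softmax(z):
    # Shift by the max logit to avoid overflow in exp; the output is
    # unchanged since the constant cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, -1.0])  # hypothetical logits Z_1^[2], Z_2^[2]
a = softmax(z)             # outputs a_1^[2], a_2^[2], summing to 1
```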


Backpropagation

Categorical Cross Entropy Loss Derivative

We assume categorical cross entropy loss for the multiple output classes:

$$L(a^{[2]}, y) = -\sum_{k=1}^{2} y_k \cdot \log(a_k^{[2]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_k^{[2]}} = -\frac{y_k}{a_k^{[2]}}$$

Derivative of Softmax Activation

$$a_i^{[2]} = \frac{\exp(Z_i^{[2]})}{\sum_{j=1}^{K} \exp(Z_j^{[2]})}$$

When $i = j$:

$$\frac{\partial a_i^{[2]}}{\partial Z_i^{[2]}} = a_i^{[2]}(1 - a_i^{[2]})$$

When $i \neq j$:

$$\frac{\partial a_i^{[2]}}{\partial Z_j^{[2]}} = -a_i^{[2]}a_j^{[2]}$$

Applying the chain rule to find the gradient of the loss with respect to the logit $Z_i^{[2]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = \sum_{k=1}^{2} \left( \frac{\partial L(a^{[2]}, y)}{\partial a_k^{[2]}} \right) \cdot \left( \frac{\partial a_k^{[2]}}{\partial Z_i^{[2]}} \right)$$

For a specific node $i$ (and letting the other node be $j$):

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} &= \left( -\frac{y_i}{a_i^{[2]}} \right) \cdot a_i^{[2]}(1 - a_i^{[2]}) \\ &\quad + \left( -\frac{y_j}{a_j^{[2]}} \right) \cdot (-a_j^{[2]}a_i^{[2]}) \end{aligned}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = -y_i(1 - a_i^{[2]}) + y_j a_i^{[2]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = -y_i + y_i a_i^{[2]} + y_j a_i^{[2]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = a_i^{[2]}(y_i + y_j) - y_i$$

Since $y$ is a one-hot label, $y_i + y_j = 1$, giving:

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = a_i^{[2]} - y_i$$
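This softmax-plus-cross-entropy gradient can be checked numerically against finite differences of the loss (logits and label below are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def ce_loss(z, y):
    """Categorical cross-entropy of softmax(z) against one-hot y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.2, -0.4])   # hypothetical logits
y = np.array([1.0, 0.0])    # one-hot label
analytic = softmax(z) - y   # the result derived above: a_i - y_i

eps = 1e-6
numeric = np.zeros(2)
for i in range(2):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)
```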


Output Layer Weights

We calculate the gradient for each weight by multiplying the loss derivative with respect to the logit, $\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}}$, by the logit derivative with respect to the weight.

Weights for Output Node 1

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{10}^{[2]}} \right) \\ &= (a_1^{[2]} - y_1) \cdot (1) = a_1^{[2]} - y_1 \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{11}^{[2]}} \right) \\ &= (a_1^{[2]} - y_1) \cdot (a_1^{[1]}) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{12}^{[2]}} \right) \\ &= (a_1^{[2]} - y_1) \cdot (a_2^{[1]}) \end{aligned}$$

Weights for Output Node 2

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{20}^{[2]}} \right) \\ &= (a_2^{[2]} - y_2) \cdot (1) = a_2^{[2]} - y_2 \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{21}^{[2]}} \right) \\ &= (a_2^{[2]} - y_2) \cdot (a_1^{[1]}) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[2]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{22}^{[2]}} \right) \\ &= (a_2^{[2]} - y_2) \cdot (a_2^{[1]}) \end{aligned}$$

Hidden Layer

For the hidden layer, we must first find how the loss changes with respect to the hidden activations $a_i^{[1]}$. Because each hidden node connects to multiple output nodes, we sum the gradients propagating backwards from all paths.

Hidden Node 1 Weights

First, we find the gradient of the loss with respect to the activation $a_1^{[1]}$:

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial a_1^{[1]}} \right) + \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial a_1^{[1]}} \right) \\ &= (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \end{aligned}$$

Now we apply the chain rule for the weights in Hidden Node 1:

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{10}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (1) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{11}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (a_1^{[0]}) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{12}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (a_2^{[0]}) \end{aligned}$$

Hidden Node 2 Weights

First, we find the gradient of the loss with respect to the activation $a_2^{[1]}$:

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial a_2^{[1]}} \right) + \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial a_2^{[1]}} \right) \\ &= (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \end{aligned}$$

Now we apply the chain rule for the weights in Hidden Node 2:

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{20}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (1) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{21}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (a_1^{[0]}) \end{aligned}$$

$$\begin{aligned} \frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} &= \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{22}^{[1]}} \right) \\ &= \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (a_2^{[0]}) \end{aligned}$$

Final Weight Derivations

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[2]}} = a_1^{[2]} - y_1$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[2]}} = (a_1^{[2]} - y_1) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[2]}} = (a_1^{[2]} - y_1) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[2]}} = a_2^{[2]} - y_2$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[2]}} = (a_2^{[2]} - y_2) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[2]}} = (a_2^{[2]} - y_2) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_2^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_2^{[0]}$$
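As with Question 1, the full set of formulas can be confirmed by a numerical gradient check. The sketch below vectorizes the 2-2-2 network with hypothetical random weights (not values from the question) and compares the analytic gradients against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward_loss(W1, W2, x, y):
    """2-2-2 network: sigmoid hidden layer, softmax output, CE loss.
    W1 and W2 rows are [bias, w_i1, w_i2] for node i."""
    a1 = sigmoid(W1[:, 0] + W1[:, 1:] @ x)
    z2 = W2[:, 0] + W2[:, 1:] @ a1
    a2 = softmax(z2)
    return -np.sum(y * np.log(a2)), a1, a2

def grads(W1, W2, x, y):
    """Analytic gradients from the formulas derived above."""
    _, a1, a2 = forward_loss(W1, W2, x, y)
    d2 = a2 - y                                  # dL/dZ_i^[2]
    gW2 = np.column_stack([d2, np.outer(d2, a1)])
    d1 = (W2[:, 1:].T @ d2) * a1 * (1 - a1)      # summed backprop paths
    gW1 = np.column_stack([d1, np.outer(d1, x)])
    return gW1, gW2

# Hypothetical weights and input, then a finite-difference check.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(2, 3))
x, y = np.array([0.5, -0.2]), np.array([0.0, 1.0])
gW1, gW2 = grads(W1, W2, x, y)

eps = 1e-6
num1 = np.zeros_like(W1)
num2 = np.zeros_like(W2)
for M, num in ((W1, num1), (W2, num2)):
    for i in range(2):
        for j in range(3):
            orig = M[i, j]
            M[i, j] = orig + eps
            lp = forward_loss(W1, W2, x, y)[0]
            M[i, j] = orig - eps
            lm = forward_loss(W1, W2, x, y)[0]
            M[i, j] = orig
            num[i, j] = (lp - lm) / (2 * eps)
```

The inner sum `W2[:, 1:].T @ d2` is exactly the bracketed term $(a_1^{[2]} - y_1)W_{1j}^{[2]} + (a_2^{[2]} - y_2)W_{2j}^{[2]}$ from the formulas above.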