Full Backpropagation Calculation for Neural Networks

Question 1
Forward Pass
Hidden Layer
$$Z_1^{[1]} = W_{10}^{[1]} + W_{11}^{[1]}a_1^{[0]} + W_{12}^{[1]}a_2^{[0]}$$

$$Z_2^{[1]} = W_{20}^{[1]} + W_{21}^{[1]}a_1^{[0]} + W_{22}^{[1]}a_2^{[0]}$$

$$a_1^{[1]} = \sigma(Z_1^{[1]})$$

$$a_2^{[1]} = \sigma(Z_2^{[1]})$$
Output Layer

$$Z^{[2]} = W_0^{[2]} + W_1^{[2]}a_1^{[1]} + W_2^{[2]}a_2^{[1]}$$

$$a^{[2]} = \sigma(Z^{[2]})$$
Backpropagation
Binary Cross Entropy Loss Derivative
We assume binary cross entropy loss:
$$L(a^{[2]}, y) = -\left[y \cdot \log(a^{[2]}) + (1 - y) \cdot \log(1 - a^{[2]})\right]$$

$$\frac{d}{dx} \log(x) = \frac{1}{x}$$

$$\frac{d}{dx} \log(1-x) = -\frac{1}{1-x}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} = -\left[ y \cdot \frac{1}{a^{[2]}} + (1-y) \cdot \left( -\frac{1}{1-a^{[2]}} \right) \right]$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} = -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}}$$
Derivative of Sigmoid Activation
With $a = \sigma(z) = (1 + e^{-z})^{-1}$:

$$\frac{da}{dz} = \frac{d}{dz}(1 + e^{-z})^{-1}$$

$$\frac{da}{dz} = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

$$\frac{da}{dz} = \left( \frac{1}{1 + e^{-z}} \right) \cdot \left( \frac{1 + e^{-z} - 1}{1 + e^{-z}} \right)$$

$$\frac{da}{dz} = \sigma(z) \cdot (1 - \sigma(z))$$
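This identity is easy to verify numerically. The sketch below (plain Python, no external libraries; the function names are ad hoc) compares the closed form $\sigma(z)(1 - \sigma(z))$ against a central finite-difference approximation of the derivative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime_identity(z):
    # Closed form derived above: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def sigmoid_prime_numeric(z, h=1e-6):
    # Central finite difference approximation of d(sigma)/dz
    return (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

# The two agree to high precision at several test points
for z in [-2.0, 0.0, 1.5]:
    assert abs(sigmoid_prime_identity(z) - sigmoid_prime_numeric(z)) < 1e-8
```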
Output Layer Weights
We calculate the gradient for each weight by multiplying three factors: the derivative of the loss with respect to the output activation, the derivative of the activation with respect to the logit, and the derivative of the logit with respect to the weight.
Weight $W_0^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_0^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (1)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = a^{[2]} - y$$

Weight $W_1^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_1^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = (a^{[2]} - y)a_1^{[1]}$$

Weight $W_2^{[2]}$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial W_2^{[2]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = (a^{[2]} - y)a_2^{[1]}$$
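The cancellation down to $a^{[2]} - y$ is worth checking numerically. In this minimal Python sketch (the function name is an ad-hoc choice), the loss derivative is multiplied by the sigmoid derivative and compared against $a - y$:

```python
def bce_times_sigmoid_prime(a, y):
    # (dL/da) * (da/dz) for binary cross entropy with a sigmoid output:
    # (-y/a + (1-y)/(1-a)) * a(1-a), which should simplify to a - y
    return (-y / a + (1 - y) / (1 - a)) * (a * (1 - a))

# The simplification holds for any activation in (0, 1) and label in {0, 1}
for a in [0.1, 0.5, 0.9]:
    for y in [0.0, 1.0]:
        assert abs(bce_times_sigmoid_prime(a, y) - (a - y)) < 1e-12
```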
Hidden Layer
For the hidden layer, we must first find how the loss changes with respect to the hidden activations $a_i^{[1]}$.
Hidden Node 1 Weights
First, we find the gradient of the loss with respect to the activation $a_1^{[1]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial a_1^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (W_1^{[2]}) = (a^{[2]} - y)W_1^{[2]}$$

Now we apply the chain rule for the weights in Node 1, taking $W_{11}^{[1]}$ as the example:

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{11}^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( a_1^{[1]}(1 - a_1^{[1]}) \right) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = (a^{[2]} - y)W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

The bias weight $W_{10}^{[1]}$ and the weight $W_{12}^{[1]}$ follow the same pattern, with the final factor $a_1^{[0]}$ replaced by $1$ and $a_2^{[0]}$ respectively.
Hidden Node 2 Weights

First, we find the gradient of the loss with respect to the activation $a_2^{[1]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a^{[2]}} \right) \cdot \left( \frac{\partial a^{[2]}}{\partial Z^{[2]}} \right) \cdot \left( \frac{\partial Z^{[2]}}{\partial a_2^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} = \left( -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \right) \cdot \left( a^{[2]}(1 - a^{[2]}) \right) \cdot (W_2^{[2]}) = (a^{[2]} - y)W_2^{[2]}$$

Now we apply the chain rule for the weights in Node 2, taking $W_{21}^{[1]}$ as the example:

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{21}^{[1]}} \right)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( a_2^{[1]}(1 - a_2^{[1]}) \right) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = (a^{[2]} - y)W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$

The bias weight $W_{20}^{[1]}$ and the weight $W_{22}^{[1]}$ follow the same pattern, with the final factor $a_1^{[0]}$ replaced by $1$ and $a_2^{[0]}$ respectively.
Final Weight Derivations
$$\frac{\partial L(a^{[2]}, y)}{\partial W_0^{[2]}} = a^{[2]} - y$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_1^{[2]}} = (a^{[2]} - y) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_2^{[2]}} = (a^{[2]} - y) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} = (a^{[2]} - y) W_1^{[2]} \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_2^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} = (a^{[2]} - y) W_2^{[2]} \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_2^{[0]}$$
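All nine expressions can be validated at once with a finite-difference gradient check. The sketch below implements the Question 1 network in plain Python; the flattened weight names such as `W11_1` (standing in for $W_{11}^{[1]}$) and the random test values for the weights and inputs are ad-hoc choices for this check, not part of the derivation:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    # Forward pass: 2 inputs -> 2 sigmoid hidden nodes -> 1 sigmoid output
    z1 = w["W10_1"] + w["W11_1"] * x[0] + w["W12_1"] * x[1]
    z2 = w["W20_1"] + w["W21_1"] * x[0] + w["W22_1"] * x[1]
    a1, a2 = sigmoid(z1), sigmoid(z2)
    z_out = w["W0_2"] + w["W1_2"] * a1 + w["W2_2"] * a2
    return a1, a2, sigmoid(z_out)

def loss(w, x, y):
    # Binary cross entropy
    _, _, a_out = forward(w, x)
    return -(y * math.log(a_out) + (1 - y) * math.log(1 - a_out))

def analytic_grads(w, x, y):
    # The closed-form gradients derived above
    a1, a2, a_out = forward(w, x)
    d = a_out - y
    return {
        "W0_2": d, "W1_2": d * a1, "W2_2": d * a2,
        "W10_1": d * w["W1_2"] * a1 * (1 - a1),
        "W11_1": d * w["W1_2"] * a1 * (1 - a1) * x[0],
        "W12_1": d * w["W1_2"] * a1 * (1 - a1) * x[1],
        "W20_1": d * w["W2_2"] * a2 * (1 - a2),
        "W21_1": d * w["W2_2"] * a2 * (1 - a2) * x[0],
        "W22_1": d * w["W2_2"] * a2 * (1 - a2) * x[1],
    }

random.seed(0)
keys = ["W0_2", "W1_2", "W2_2", "W10_1", "W11_1", "W12_1",
        "W20_1", "W21_1", "W22_1"]
w = {k: random.uniform(-1, 1) for k in keys}
x, y, h = (0.3, -0.7), 1.0, 1e-6

# Compare every analytic gradient against a central finite difference
grads = analytic_grads(w, x, y)
for k in keys:
    wp, wm = dict(w), dict(w)
    wp[k] += h
    wm[k] -= h
    numeric = (loss(wp, x, y) - loss(wm, x, y)) / (2 * h)
    assert abs(grads[k] - numeric) < 1e-6
```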
Question 2
Forward Pass
Hidden Layer
$$Z_1^{[1]} = W_{10}^{[1]} + W_{11}^{[1]}a_1^{[0]} + W_{12}^{[1]}a_2^{[0]}$$

$$Z_2^{[1]} = W_{20}^{[1]} + W_{21}^{[1]}a_1^{[0]} + W_{22}^{[1]}a_2^{[0]}$$

$$a_1^{[1]} = \sigma(Z_1^{[1]})$$

$$a_2^{[1]} = \sigma(Z_2^{[1]})$$
Output Layer

$$Z_1^{[2]} = W_{10}^{[2]} + W_{11}^{[2]}a_1^{[1]} + W_{12}^{[2]}a_2^{[1]}$$

$$Z_2^{[2]} = W_{20}^{[2]} + W_{21}^{[2]}a_1^{[1]} + W_{22}^{[2]}a_2^{[1]}$$

$$a_1^{[2]} = \frac{\exp(Z_1^{[2]})}{\exp(Z_1^{[2]}) + \exp(Z_2^{[2]})}$$

$$a_2^{[2]} = \frac{\exp(Z_2^{[2]})}{\exp(Z_1^{[2]}) + \exp(Z_2^{[2]})}$$
Backpropagation
Categorical Cross Entropy Loss Derivative
We assume categorical cross entropy loss for the multiple output classes:
$$L(a^{[2]}, y) = -\sum_{k=1}^{2} y_k \cdot \log(a_k^{[2]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial a_k^{[2]}} = -\frac{y_k}{a_k^{[2]}}$$
Derivative of Softmax Activation
$$a_i^{[2]} = \frac{\exp(Z_i^{[2]})}{\sum_{j=1}^{K} \exp(Z_j^{[2]})}$$

When $i = j$:

$$\frac{\partial a_i^{[2]}}{\partial Z_i^{[2]}} = a_i^{[2]}(1 - a_i^{[2]})$$

When $i \neq j$:

$$\frac{\partial a_i^{[2]}}{\partial Z_j^{[2]}} = -a_i^{[2]}a_j^{[2]}$$
Applying the chain rule to find the gradient of the loss with respect to the logit $Z_i^{[2]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = \sum_{k=1}^{2} \left( \frac{\partial L(a^{[2]}, y)}{\partial a_k^{[2]}} \right) \cdot \left( \frac{\partial a_k^{[2]}}{\partial Z_i^{[2]}} \right)$$
For a specific node $i$ (letting the other node be $j$):

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = \left( -\frac{y_i}{a_i^{[2]}} \right) \cdot a_i^{[2]}(1 - a_i^{[2]}) + \left( -\frac{y_j}{a_j^{[2]}} \right) \cdot (-a_j^{[2]}a_i^{[2]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = -y_i(1 - a_i^{[2]}) + y_j a_i^{[2]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = -y_i + y_i a_i^{[2]} + y_j a_i^{[2]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = a_i^{[2]}(y_i + y_j) - y_i$$
Because the targets are one-hot encoded, $y_i + y_j = 1$, so:

$$\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}} = a_i^{[2]} - y_i$$
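We can confirm $\partial L / \partial Z_i^{[2]} = a_i^{[2]} - y_i$ by differencing the composed softmax-plus-cross-entropy loss numerically. A small Python sketch (the two-class helper functions are ad-hoc names for this check):

```python
import math

def softmax2(z1, z2):
    # Two-class softmax
    e1, e2 = math.exp(z1), math.exp(z2)
    s = e1 + e2
    return e1 / s, e2 / s

def cce(a, y):
    # Categorical cross entropy over two classes
    return -(y[0] * math.log(a[0]) + y[1] * math.log(a[1]))

z, y, h = (0.4, -1.1), (1.0, 0.0), 1e-6
a = softmax2(*z)

# Central finite differences of the composed loss with respect to each logit
num1 = (cce(softmax2(z[0] + h, z[1]), y)
        - cce(softmax2(z[0] - h, z[1]), y)) / (2 * h)
num2 = (cce(softmax2(z[0], z[1] + h), y)
        - cce(softmax2(z[0], z[1] - h), y)) / (2 * h)

# Both match a_i - y_i
assert abs(num1 - (a[0] - y[0])) < 1e-6
assert abs(num2 - (a[1] - y[1])) < 1e-6
```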
Output Layer Weights
We calculate the gradient for each weight by multiplying the derivative of the loss with respect to the logit, $\frac{\partial L(a^{[2]}, y)}{\partial Z_i^{[2]}}$, by the derivative of the logit with respect to the weight.
Weights for Output Node 1
$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{10}^{[2]}} \right) = (a_1^{[2]} - y_1) \cdot (1) = a_1^{[2]} - y_1$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{11}^{[2]}} \right) = (a_1^{[2]} - y_1) \cdot (a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial W_{12}^{[2]}} \right) = (a_1^{[2]} - y_1) \cdot (a_2^{[1]})$$
Weights for Output Node 2
$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{20}^{[2]}} \right) = (a_2^{[2]} - y_2) \cdot (1) = a_2^{[2]} - y_2$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{21}^{[2]}} \right) = (a_2^{[2]} - y_2) \cdot (a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[2]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial W_{22}^{[2]}} \right) = (a_2^{[2]} - y_2) \cdot (a_2^{[1]})$$
Hidden Layer
For the hidden layer, we must first find how the loss changes with respect to the hidden activations $a_i^{[1]}$. Because each hidden node connects to both output nodes, we sum the gradients flowing backward along all paths.
Hidden Node 1 Weights
First, we find the gradient of the loss with respect to the activation $a_1^{[1]}$:

$$\frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial a_1^{[1]}} \right) + \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial a_1^{[1]}} \right) = (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]}$$
Now we apply the chain rule for the weights in Hidden Node 1:
$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{10}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (1)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{11}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_1^{[1]}} \right) \cdot \left( \frac{\partial a_1^{[1]}}{\partial Z_1^{[1]}} \right) \cdot \left( \frac{\partial Z_1^{[1]}}{\partial W_{12}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot (a_2^{[0]})$$
Hidden Node 2 Weights

First, we find the gradient of the loss with respect to the activation $a_2^{[1]}$:
$$\frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_1^{[2]}} \right) \cdot \left( \frac{\partial Z_1^{[2]}}{\partial a_2^{[1]}} \right) + \left( \frac{\partial L(a^{[2]}, y)}{\partial Z_2^{[2]}} \right) \cdot \left( \frac{\partial Z_2^{[2]}}{\partial a_2^{[1]}} \right) = (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]}$$
Now we apply the chain rule for the weights in Hidden Node 2:
$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{20}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (1)$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{21}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (a_1^{[0]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} = \left( \frac{\partial L(a^{[2]}, y)}{\partial a_2^{[1]}} \right) \cdot \left( \frac{\partial a_2^{[1]}}{\partial Z_2^{[1]}} \right) \cdot \left( \frac{\partial Z_2^{[1]}}{\partial W_{22}^{[1]}} \right) = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot (a_2^{[0]})$$
Final Weight Derivations
$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[2]}} = a_1^{[2]} - y_1$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[2]}} = (a_1^{[2]} - y_1) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[2]}} = (a_1^{[2]} - y_1) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[2]}} = a_2^{[2]} - y_2$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[2]}} = (a_2^{[2]} - y_2) a_1^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[2]}} = (a_2^{[2]} - y_2) a_2^{[1]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{10}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{11}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{12}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{11}^{[2]} + (a_2^{[2]} - y_2)W_{21}^{[2]} \right] \cdot a_1^{[1]}(1 - a_1^{[1]}) \cdot a_2^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{20}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]})$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{21}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_1^{[0]}$$

$$\frac{\partial L(a^{[2]}, y)}{\partial W_{22}^{[1]}} = \left[ (a_1^{[2]} - y_1)W_{12}^{[2]} + (a_2^{[2]} - y_2)W_{22}^{[2]} \right] \cdot a_2^{[1]}(1 - a_2^{[1]}) \cdot a_2^{[0]}$$
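As with Question 1, a finite-difference check over every weight confirms the derivations. The Python sketch below uses ad-hoc flattened weight names such as `W11_2` for $W_{11}^{[2]}$ and arbitrary test values; the error signals `d1, d2` are the logit gradients $a_i^{[2]} - y_i$, and `e1, e2` are the bracketed hidden-node terms from above multiplied by the sigmoid derivative:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    # Forward pass: 2 inputs -> 2 sigmoid hidden nodes -> 2-class softmax
    z1 = w["W10_1"] + w["W11_1"] * x[0] + w["W12_1"] * x[1]
    z2 = w["W20_1"] + w["W21_1"] * x[0] + w["W22_1"] * x[1]
    a1, a2 = sigmoid(z1), sigmoid(z2)
    u1 = w["W10_2"] + w["W11_2"] * a1 + w["W12_2"] * a2
    u2 = w["W20_2"] + w["W21_2"] * a1 + w["W22_2"] * a2
    e1, e2 = math.exp(u1), math.exp(u2)
    s = e1 + e2
    return a1, a2, e1 / s, e2 / s

def loss(w, x, y):
    # Categorical cross entropy over the two softmax outputs
    _, _, o1, o2 = forward(w, x)
    return -(y[0] * math.log(o1) + y[1] * math.log(o2))

def analytic_grads(w, x, y):
    # The closed-form gradients derived above
    a1, a2, o1, o2 = forward(w, x)
    d1, d2 = o1 - y[0], o2 - y[1]                        # dL/dZ_i^{[2]}
    e1 = (d1 * w["W11_2"] + d2 * w["W21_2"]) * a1 * (1 - a1)
    e2 = (d1 * w["W12_2"] + d2 * w["W22_2"]) * a2 * (1 - a2)
    return {
        "W10_2": d1, "W11_2": d1 * a1, "W12_2": d1 * a2,
        "W20_2": d2, "W21_2": d2 * a1, "W22_2": d2 * a2,
        "W10_1": e1, "W11_1": e1 * x[0], "W12_1": e1 * x[1],
        "W20_1": e2, "W21_1": e2 * x[0], "W22_1": e2 * x[1],
    }

random.seed(1)
keys = ["W10_1", "W11_1", "W12_1", "W20_1", "W21_1", "W22_1",
        "W10_2", "W11_2", "W12_2", "W20_2", "W21_2", "W22_2"]
w = {k: random.uniform(-1, 1) for k in keys}
x, y, h = (0.5, -0.2), (0.0, 1.0), 1e-6

# Compare every analytic gradient against a central finite difference
grads = analytic_grads(w, x, y)
for k in keys:
    wp, wm = dict(w), dict(w)
    wp[k] += h
    wm[k] -= h
    numeric = (loss(wp, x, y) - loss(wm, x, y)) / (2 * h)
    assert abs(grads[k] - numeric) < 1e-6
```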