Show how these equations can be unfolded in time in order to obtain the back-prop through time algorithm to compute the total gradient dC/dθ, using the local partial derivatives through each instantiation of f, g and L.
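A minimal sketch of that unfolding (my own illustration, assuming a vanilla RNN h_t = tanh(W h_{t-1} + U x_t), outputs y_t = V h_t, and a squared-error cost summed over steps; the names W, U, V and the cost are assumptions, not given in the exercise):

```python
# Sketch of back-prop through time for an assumed vanilla RNN:
#   h_t = tanh(W h_{t-1} + U x_t),  y_t = V h_t,  C = sum_t 1/2 ||y_t - target_t||^2
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, T = 3, 2, 5
W = rng.normal(scale=0.5, size=(n_h, n_h))
U = rng.normal(scale=0.5, size=(n_h, n_x))
V = rng.normal(scale=0.5, size=(1, n_h))
xs = rng.normal(size=(T, n_x))
targets = rng.normal(size=(T, 1))

# Forward pass: unfold the recurrence in time, storing every h_t.
hs = [np.zeros(n_h)]
ys = []
for t in range(T):
    hs.append(np.tanh(W @ hs[-1] + U @ xs[t]))
    ys.append(V @ hs[-1])

# Backward pass: a single sweep from t = T down to 1,
# accumulating dC/dW from the local partial derivatives at each step.
dW = np.zeros_like(W)
dh = np.zeros(n_h)                      # dC/dh_t, carried backwards in time
for t in reversed(range(T)):
    dy = ys[t] - targets[t]             # local dC/dy_t for the squared error
    dh = dh + V.T @ dy                  # contribution of y_t to dC/dh_t
    da = dh * (1.0 - hs[t + 1] ** 2)    # through tanh: dC/da_t
    dW += np.outer(da, hs[t])           # local contribution to dC/dW
    dh = W.T @ da                       # propagate to dC/dh_{t-1}

print("dC/dW via BPTT:\n", dW)
```

The backward sweep visits each time step once, so the cost is linear in T.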
Consider a simple RNN with only 3 hidden units and a sequence of 100 time steps. Show a different decomposition of the gradient, i.e., a way to compute it in principle, which would cost a number of computations proportional to 3^100. Contrast this with the computational cost of the same gradient using back-propagation through time.
Because of the requirement above, the recurrence part turns into an exponentiation in the number of time steps: the operation being exponentiated is the choice among the 3 hidden units at each backprop step through the 3-unit layer. Repeating that factor once per time step gives the 3^100 terms (and h_1 ends up being revisited far more often than h_T)?
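A toy check of the contrast (same assumed tanh recurrence as above, with T kept tiny so the path sum is actually enumerable): the naive decomposition writes each entry of the end-to-end Jacobian dh_T/dh_1 as a sum over every sequence of hidden units visited along the way, roughly 3^T scalar products in total, while BPTT chains the per-step Jacobians once, at a cost linear in T.

```python
# Two decompositions of dh_T/dh_1 for an assumed recurrence h_t = tanh(W h_{t-1}).
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_h, T = 3, 6
W = rng.normal(scale=0.5, size=(n_h, n_h))

# Unfold the recurrence for T steps.
hs = [rng.normal(size=n_h)]
for _ in range(T - 1):
    hs.append(np.tanh(W @ hs[-1]))

# Per-step Jacobians: Js[t] = dh_{t+1}/dh_t = diag(1 - h_{t+1}^2) @ W.
Js = [np.diag(1.0 - hs[t + 1] ** 2) @ W for t in range(T - 1)]

# (a) BPTT-style: chain the T-1 Jacobians once -- cost linear in T.
J_bptt = np.eye(n_h)
for J in Js:
    J_bptt = J @ J_bptt

# (b) Naive path sum: each entry (i, j) sums over every sequence of
# intermediate hidden units, 3**(T-2) products per entry,
# i.e. the 3^100-style blow-up when T = 100.
J_paths = np.zeros((n_h, n_h))
for i in range(n_h):
    for j in range(n_h):
        for mids in itertools.product(range(n_h), repeat=T - 2):
            path = (j,) + mids + (i,)
            term = 1.0
            for t in range(T - 1):
                term *= Js[t][path[t + 1], path[t]]
            J_paths[i, j] += term

print(np.allclose(J_bptt, J_paths))  # True: same quantity, very different cost
```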
(BONUS) A very deep neural network can be compared with a recurrent neural network unfolded in time, except that the weights are shared in one case but not the other. How could that difference have an impact on the tendency of gradients to vanish (or explode) faster in one case compared with the other?
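A rough numerical sketch of why sharing matters (my own toy example, not part of the exercise): back-propagate a vector through T linear layers, once with a single shared matrix (as in an unfolded RNN) and once with an independently drawn matrix per layer (as in a deep feed-forward net). With sharing, the product is the same Jacobian applied T times, so long-run growth or decay is governed by whether that one matrix's largest eigenvalue magnitude is above or below 1; with fresh matrices there is no single spectrum driving every step, and the per-layer factors can partially compensate each other, which is the usual intuition for why the shared case vanishes or explodes more systematically.

```python
# Toy comparison: norm of a back-propagated vector after T linear layers,
# with one shared weight matrix vs. an independent matrix per layer.
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 100
W_shared = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))

v_shared = np.ones(n)
v_indep = np.ones(n)
for _ in range(T):
    v_shared = W_shared.T @ v_shared                        # same Jacobian every step (sharing)
    W_t = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # fresh Jacobian per layer (no sharing)
    v_indep = W_t.T @ v_indep

print("shared weights:      |grad| =", np.linalg.norm(v_shared))
print("independent weights: |grad| =", np.linalg.norm(v_indep))
print("top singular value of shared W:", np.linalg.svd(W_shared, compute_uv=False)[0])
```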