Show how these equations can be unfolded in time in order to obtain the back-prop through time algorithm to compute the total gradient dC/dθ, using the local partial derivatives through each instantiation of f, g and L.
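A minimal sketch of that unfolding (my own illustration, assuming a vanilla RNN h_t = tanh(W h_{t-1} + U x_t), outputs y_t = V h_t, and a squared-error cost summed over steps; the names W, U, V and the cost are assumptions, not given in the exercise):

```python
# Sketch of back-prop through time for an assumed vanilla RNN:
#   h_t = tanh(W h_{t-1} + U x_t),  y_t = V h_t,  C = sum_t 1/2 ||y_t - target_t||^2
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, T = 3, 2, 5
W = rng.normal(scale=0.5, size=(n_h, n_h))
U = rng.normal(scale=0.5, size=(n_h, n_x))
V = rng.normal(scale=0.5, size=(1, n_h))
xs = rng.normal(size=(T, n_x))
targets = rng.normal(size=(T, 1))

# Forward pass: unfold the recurrence in time, storing every h_t.
hs = [np.zeros(n_h)]
ys = []
for t in range(T):
    hs.append(np.tanh(W @ hs[-1] + U @ xs[t]))
    ys.append(V @ hs[-1])

# Backward pass: a single sweep from t = T down to 1,
# accumulating dC/dW from the local partial derivatives at each step.
dW = np.zeros_like(W)
dh = np.zeros(n_h)                      # dC/dh_t, carried backwards in time
for t in reversed(range(T)):
    dy = ys[t] - targets[t]             # local dC/dy_t for the squared error
    dh = dh + V.T @ dy                  # contribution of y_t to dC/dh_t
    da = dh * (1.0 - hs[t + 1] ** 2)    # through tanh: dC/da_t
    dW += np.outer(da, hs[t])           # local contribution to dC/dW
    dh = W.T @ da                       # propagate to dC/dh_{t-1}

print("dC/dW via BPTT:\n", dW)
```

The backward sweep visits each time step once, so the cost is linear in T.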
Consider a simple RNN with only 3 hidden units and a sequence of 100 time steps. Show a different decomposition of the gradient, i.e., a way to compute it in principle, which would cost a number of computations proportional to 3^100. Contrast this with the computational cost of the same gradient using back-propagation through time.
Because of the requirement above, the recurrence part turns into an exponentiation in the number of time steps: the operation being exponentiated is the choice among the 3 hidden units at each backprop step through the 3-unit layer. Repeating that factor once per time step gives the 3^100 terms (and h_1 ends up being revisited far more often than h_T)?
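A toy check of the contrast (same assumed tanh recurrence as above, with T kept tiny so the path sum is actually enumerable): the naive decomposition writes each entry of the end-to-end Jacobian dh_T/dh_1 as a sum over every sequence of hidden units visited along the way, roughly 3^T scalar products in total, while BPTT chains the per-step Jacobians once, at a cost linear in T.

```python
# Two decompositions of dh_T/dh_1 for an assumed recurrence h_t = tanh(W h_{t-1}).
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_h, T = 3, 6
W = rng.normal(scale=0.5, size=(n_h, n_h))

# Unfold the recurrence for T steps.
hs = [rng.normal(size=n_h)]
for _ in range(T - 1):
    hs.append(np.tanh(W @ hs[-1]))

# Per-step Jacobians: Js[t] = dh_{t+1}/dh_t = diag(1 - h_{t+1}^2) @ W.
Js = [np.diag(1.0 - hs[t + 1] ** 2) @ W for t in range(T - 1)]

# (a) BPTT-style: chain the T-1 Jacobians once -- cost linear in T.
J_bptt = np.eye(n_h)
for J in Js:
    J_bptt = J @ J_bptt

# (b) Naive path sum: each entry (i, j) sums over every sequence of
# intermediate hidden units, 3**(T-2) products per entry,
# i.e. the 3^100-style blow-up when T = 100.
J_paths = np.zeros((n_h, n_h))
for i in range(n_h):
    for j in range(n_h):
        for mids in itertools.product(range(n_h), repeat=T - 2):
            path = (j,) + mids + (i,)
            term = 1.0
            for t in range(T - 1):
                term *= Js[t][path[t + 1], path[t]]
            J_paths[i, j] += term

print(np.allclose(J_bptt, J_paths))  # True: same quantity, very different cost
```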
(BONUS) A very deep neural network can be compared with a recurrent neural network unfolded in time, except that the weights are shared in one case but not the other. How could that difference have an impact on the tendency of gradients to vanish (or explode) faster in one case compared with the other?
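A rough numerical sketch of why sharing matters (my own toy example, not part of the exercise): back-propagate a vector through T linear layers, once with a single shared matrix (as in an unfolded RNN) and once with an independently drawn matrix per layer (as in a deep feed-forward net). With sharing, the product is the same Jacobian applied T times, so long-run growth or decay is governed by whether that one matrix's largest eigenvalue magnitude is above or below 1; with fresh matrices there is no single spectrum driving every step, and the per-layer factors can partially compensate each other, which is the usual intuition for why the shared case vanishes or explodes more systematically.

```python
# Toy comparison: norm of a back-propagated vector after T linear layers,
# with one shared weight matrix vs. an independent matrix per layer.
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 100
W_shared = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))

v_shared = np.ones(n)
v_indep = np.ones(n)
for _ in range(T):
    v_shared = W_shared.T @ v_shared                        # same Jacobian every step (sharing)
    W_t = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # fresh Jacobian per layer (no sharing)
    v_indep = W_t.T @ v_indep

print("shared weights:      |grad| =", np.linalg.norm(v_shared))
print("independent weights: |grad| =", np.linalg.norm(v_indep))
print("top singular value of shared W:", np.linalg.svd(W_shared, compute_uv=False)[0])
```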