Hiroki Naganuma


In this series of exercises you will try to explain the possible advantages of attention mechanisms in deep learning, both for handling long-term dependencies and for processing set-like data, as enabled by Transformers.

Second, please try to remind us of the problem with learning long-term dependencies in systems like recurrent networks (of any kind: LSTM, GRU, or vanilla), i.e., there is a state variable s_t updated via some s_t = f(s_{t-1}, x_t), where {x_t}_t is the input sequence. Think about how a change in x_1 can affect s_t as t increases, in terms of Jacobians of f. How is that a problem for gradient-based learning?


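In a vanilla RNN, gradients vanish or explode. The effect of x_1 on s_t passes through a product of t-1 Jacobians of f, so its norm tends to shrink or grow exponentially with t, and the gradient signal from distant inputs is either lost or unstable.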
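A small numerical sketch of the Jacobian argument. It is not part of the exercise; it assumes a toy recurrence s_t = act(W s_{t-1} + U x_t) with random inputs, and the helper name `jacobian_norms` and all sizes are made up for illustration. W is an orthogonal matrix times `scale`, so every singular value of W equals `scale`. The function tracks the spectral norm of d s_t / d s_0, which is the product of one-step Jacobians diag(act'(.)) W.

```python
import numpy as np

def jacobian_norms(scale, steps=50, dim=16, linear=False, seed=0):
    """Spectral norm of d s_t / d s_0 for t = 1..steps.

    Recurrence (assumed for illustration): s_t = act(W s_{t-1} + U x_t),
    with act = tanh, or the identity when linear=True.
    """
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: every singular value of W equals `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    W = scale * q
    U = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    s = np.zeros(dim)
    J = np.eye(dim)  # accumulated Jacobian d s_t / d s_0
    norms = []
    for _ in range(steps):
        pre = W @ s + U @ rng.standard_normal(dim)
        if linear:
            s, slope = pre, np.ones(dim)
        else:
            s = np.tanh(pre)
            slope = 1.0 - s**2  # derivative of tanh at `pre`
        # Chain rule: d s_t / d s_0 = diag(slope) W (d s_{t-1} / d s_0)
        J = (slope[:, None] * W) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

vanishing = jacobian_norms(scale=0.5)               # tanh RNN, small weights
exploding = jacobian_norms(scale=1.5, linear=True)  # linear RNN, large weights
print(f"scale 0.5: {vanishing[0]:.2e} -> {vanishing[-1]:.2e}")
print(f"scale 1.5: {exploding[0]:.2e} -> {exploding[-1]:.2e}")
```

With `scale=0.5` each one-step Jacobian has norm at most 0.5 (tanh has slope at most 1), so the product decays at least as fast as 0.5^t. The exploding case uses a linear recurrence on purpose, because tanh saturation can mask the growth.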

Third, imagine something like the Transformer, which also processes a sequence (like RNNs), but at each layer, for element i of the sequence, we are allowed to use soft-attention to soft-select an input of some MLP from any element j of the sequence at the previous layer. The MLP can have several such inputs, and in a Transformer it also takes element i of the previous layer. Now, explain how this kind of architecture may defeat the long-term dependency problem you outlined in the previous question.


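Attention cut down what used to be extremely slow processing. It is hard for an LSTM to carry all the information from the start of the sequence to the end. Attention shortens this path, and it can also be parallelized.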
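A minimal sketch of the shortened path, assuming single-head scaled dot-product self-attention with random, untrained weights. The MLP from the exercise is left out, and element i of the previous layer enters through a plain residual sum. `self_attention`, `Wq`, `Wk`, `Wv` and the sizes are illustrative names, not taken from the notes. Every position is computed in one matrix product, and the last position receives a direct contribution from the first.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head soft self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (T, T); row i weights every j
    # Residual connection: element i also sees itself from the previous layer.
    return X + A @ V, A

rng = np.random.default_rng(0)
T, d = 50, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, A = self_attention(X, Wq, Wk, Wv)

# Perturb only the first element and look at the last position's output.
X2 = X.copy()
X2[0] += 1.0
out2, _ = self_attention(X2, Wq, Wk, Wv)
print("weight from last position to first:", A[-1, 0])
print("change at last position:", np.abs(out2[-1] - out[-1]).max())
```

The softmax weight A[-1, 0] is strictly positive, so the dependence of the last output on the first input goes through one attention step instead of a chain of T-1 recurrent Jacobians.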

Answer

Reference

RNN

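A slow walkthrough of the "Transformer" that is sweeping the AI world (Day 2): Introduction / Background (AI界を席巻する「Transformer」をゆっくり解説(2日目) ～Introduction / Background編～)

RNN: Recurrent neural network. A model often used for time-series data. Time-series data here covers natural language processing of text, as well as cases where the future is inferred from the past, such as sales or stock prices.

LSTM: Long Short Term Memory. A kind of RNN. Unlike a plain RNN, it can retain and relate information over long spans.

gated RNN: Gated Recurrent Neural Network. An RNN with Gated Recurrent Units (GRU). The GRU is similar to the LSTM: it is designed to mitigate vanishing and exploding gradients during training, which lets it retain and relate information over long spans.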

LSTM

LSTM


How LSTM networks solve the problem of vanishing gradients

Attention


Attention mechanisms and sequence-to-sequence conversion [seq2seq~Transformer] (アテンション機構(attention)と系列対系列変換 [seq2seq~Transformer])

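The papers distinguish two methods. Hard Attention uses argmax to adopt only the vector with the largest attention coefficient as the context vector. Soft Attention computes the context vector as the attention-weighted sum (or average) of all input vectors.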
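The distinction can be sketched in a few lines of NumPy. The scores and value vectors below are made-up numbers, and the function names are illustrative only.

```python
import numpy as np

def soft_attention(scores, values):
    """Context vector as the softmax-weighted sum of all value vectors."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ values

def hard_attention(scores, values):
    """Context vector is the single value vector with the largest score."""
    return values[np.argmax(scores)]

scores = np.array([1.0, 3.0, 0.5])           # made-up attention scores
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])              # made-up input vectors

print("hard:", hard_attention(scores, values))
print("soft:", soft_attention(scores, values))
# Sharpening the scores pushes soft attention toward the hard choice.
print("soft, scores x100:", soft_attention(100 * scores, values))
```

Soft attention is a smooth function of the scores, so it can be trained with ordinary backpropagation. The argmax in hard attention is not differentiable.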

Transformer

What is the Transformer? An explanation of the technology behind AI natural language learning (Transformerとは？AI自然言語学習の技術を解説)

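The Transformer is a deep learning model first introduced in the 2017 natural language processing paper "Attention Is All You Need". Unlike the encoder-decoder models built on CNNs and RNNs that were mainstream until then, it is a network architecture that connects the encoder and decoder using only attention. Put simply, attention is a score that indicates which words to focus on in order to understand the meaning of a word in a sentence, or the mechanism that produces that score.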
Essential NLP knowledge: a thorough explanation of the Transformer! (自然言語処理の必須知識 Transformer を徹底解説！)
Trends in Natural Language Processing at NeurIPS 2019: the easiest to understand