Explain all relationships between the following notions: the bias-variance tradeoff, model capacity, number of training samples, overfitting, underfitting.
The bias-variance tradeoff refers to the trade-off between the bias and variance terms that appear when the expected prediction error is decomposed.
When the model capacity is large, the variance is large: the learned model depends strongly on the particular training data,
which is the overfitting regime.
When the number of training samples is small, only a low-capacity model can be estimated reliably; the resulting bias is large and the model underfits,
which means that it does not learn the structure of the data sufficiently.
The expected generalization error can be decomposed into the sum of three terms: bias, variance, and noise.
Bias: the smaller the gap between the expected value of the model's prediction and the true value, the smaller the bias. It expresses how well the model fits the data; the smaller the bias, the more closely the model fits the data used for training.
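For reference, a minimal sketch of this decomposition for the squared error, assuming data y = f(x) + \varepsilon with \mathbb{E}[\varepsilon] = 0 and \mathrm{Var}(\varepsilon) = \sigma^2, and a model \hat{f} trained on a random dataset D:

\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}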
Suppose you have a training dataset of n observations.
Give one (good) strategy to select the hyperparameters of your model.
Choose the model (hyperparameter values) that minimizes the validation loss, using the hold-out method if n is sufficiently large, or LOOCV otherwise.
Alternatively, use AIC for a regular model (or TIC in the misspecified case).
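A minimal sketch of the hold-out strategy with scikit-learn; the synthetic dataset, the SVC model, the grid of C values, and the 30% split are illustrative assumptions, not part of the original answer:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data; in practice X, y are the n observations mentioned above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold-out: keep a validation split that is never used for fitting.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = [0.01, 0.1, 1.0, 10.0]                      # hyperparameter grid (assumed)
val_errors = []
for C in candidates:
    model = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
    val_errors.append(1.0 - model.score(X_val, y_val))   # validation 0-1 loss

best_C = candidates[int(np.argmin(val_errors))]
print("selected C:", best_C)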
For concreteness, suppose we consider a support vector machine (SVM) classifier with the Gaussian kernel. Give an example of a fairly natural, but bad strategy to choose the kernel bandwidth hyperparameter σ using the data (i.e. a typical rookie mistake). What would go wrong?
Consider tuning the bandwidth hyperparameter σ of the Gaussian kernel in a hard-margin SVM.
A natural but bad strategy is to use the same data for both training and evaluation,
without holding out any of the data (this is the mistake).
If σ is then chosen by grid search so as to minimize this evaluation loss, values of σ close to zero will tend to be selected.
This is overfitting to the training data: because the data used for training and evaluation coincide,
the situation resembles k = 1 in the k-nearest-neighbor (k-NN) method, where the training error is (near) zero but generalization is poor.
Note that if all of the data can be used for training and AIC or TIC is applicable, then minimizing AIC or TIC lets the training data itself serve for the evaluation.
Keep in mind, however, that these criteria carry an estimation error of order 1/n for n data points.
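A minimal sketch of this rookie mistake with scikit-learn; the synthetic data, the grid of σ values, and the use of a very large C to approximate a hard margin are my assumptions:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)                  # toy labels (assumed)

sigmas = [0.01, 0.1, 1.0, 10.0]
train_errors = []
for sigma in sigmas:
    gamma = 1.0 / (2.0 * sigma ** 2)                     # scikit-learn's RBF is exp(-gamma * ||x - x'||^2)
    model = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, y)   # very large C ~ hard margin
    train_errors.append(1.0 - model.score(X, y))         # evaluated on the SAME data used for fitting

# With a tiny sigma the classifier essentially memorizes every point (like 1-NN),
# so the training error gives no signal about generalization and the grid search
# happily settles on a tiny sigma.
print(dict(zip(sigmas, train_errors)))
print("selected sigma:", sigmas[int(np.argmin(train_errors))])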
Write the equation of the soft-argmax operation. Express the corresponding (log-likelihood based) loss incurred when the true class label index is y.
The soft-argmax (softmax) function maps an input vector x to a probability vector \hat{y} = \mathrm{softmax}(x), whose k-th element is
\hat{y}_k = \mathrm{softmax}(x)_k = \frac{\exp(x_k)}{\sum_{j} \exp(x_j)}.
The corresponding log-likelihood-based loss, when the true class label index is y, is
L(x, y) = -\log \hat{y}_y = -x_y + \log \sum_{j} \exp(x_j).
This is the cross-entropy loss between \hat{y} and the one-hot encoding of y.
Indeed, taking the KL divergence between the one-hot target y and \hat{y} gives
\mathrm{KL}(y \,\|\, \hat{y}) = \sum_k y_k \log \frac{y_k}{\hat{y}_k}
= -\sum_k y_k \log \hat{y}_k + \sum_k y_k \log y_k.
When minimizing the KL divergence, only the first term matters, since the second term does not depend on \hat{y}.
This first term is exactly the cross-entropy (log) loss above.
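A minimal numerical check of these formulas; the example logits and the true class index are arbitrary assumptions:

import numpy as np

def softmax(x):
    # Plain (unstabilized) softmax; see the stabilized version further below.
    e = np.exp(x)
    return e / e.sum()

def cross_entropy(x, y):
    # Negative log-likelihood of the true class index y under softmax(x):
    # -log softmax(x)[y] = -x[y] + log sum_j exp(x[j])
    return -x[y] + np.log(np.exp(x).sum())

x = np.array([2.0, -1.0, 0.5])   # example logits (assumed)
y = 0                            # true class index (assumed)
print(softmax(x))
print(cross_entropy(x, y), -np.log(softmax(x)[y]))   # the two values agree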
The straightforward equation for the soft-argmax can lead to numerical instability. Write an equivalent implementation of its computation that avoids numerical problems.
The output of the softmax lies in the range (0, 1), but the intermediate computation exp(x_k) can overflow when the inputs are large.
Letting x_max denote the maximum value of the input x, we replace x by x - x_max before exponentiating to prevent this.
The justification for replacing x with x - x_max is that the softmax is invariant to adding a constant to every input:
\mathrm{softmax}(x - x_{\max})_k = \frac{\exp(x_k - x_{\max})}{\sum_j \exp(x_j - x_{\max})} = \frac{\exp(x_k)\exp(-x_{\max})}{\exp(-x_{\max})\sum_j \exp(x_j)} = \mathrm{softmax}(x)_k.
After the shift every exponent is at most 0, so exp can no longer overflow.
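A minimal sketch of the stabilized computation; the example values are chosen (by me) to be large enough to overflow the naive exp:

import numpy as np

def softmax_stable(x):
    # Shift by the max so every exponent is <= 0; the result is unchanged
    # because softmax is invariant to adding a constant to all inputs.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])   # large logits (assumed) for which np.exp(x) overflows
print(softmax_stable(x))                 # well-defined: approximately [0.090, 0.245, 0.665]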
As an aside, PyTorch's BCE and CE losses differ in the input format they expect in their implementations.
For simplicity, the explanation below uses a single data point, and target denotes the index of the correct class under a one-hot encoding.
As the equation below makes clear, for CE (nn.CrossEntropyLoss) the input is a vector of the model's raw, unnormalized scores (the softmax/log is applied inside the loss), while the target is a scalar index indicating the correct class.
This departs from the textbook formula for the sake of implementation efficiency.
\mathrm{loss}(\mathrm{input}, \mathrm{target}) = -\log \frac{\exp(\mathrm{input}_{\mathrm{target}})}{\sum_j \exp(\mathrm{input}_j)}
= -\mathrm{input}_{\mathrm{target}} + \log \sum_j \exp(\mathrm{input}_j)
BCE, by contrast, takes input and target vectors of the same length.
If you want to do the same kind of explicit normalization with CE, you can instead feed the LogSoftmax output as the input to NLLLoss (computationally less efficient).
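A minimal sketch contrasting the two interfaces; the logits, probabilities, and targets below are arbitrary illustrative values:

import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]])       # shape (batch=1, num_classes=3), raw scores
target_idx = torch.tensor([0])                  # CE target: class index (one scalar per sample)

ce = nn.CrossEntropyLoss()(logits, target_idx)  # softmax/log applied internally

# Equivalent but less efficient: explicit LogSoftmax followed by NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target_idx)

# BCE: input and target are vectors of the same shape (probabilities in [0, 1])
probs = torch.sigmoid(torch.tensor([[0.3, -1.2, 2.0]]))
target_vec = torch.tensor([[1.0, 0.0, 1.0]])
bce = nn.BCELoss()(probs, target_vec)

print(ce.item(), nll.item(), bce.item())        # ce == nll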
Suppose you are a (hardcore subjective) Bayesian modeling independent biased coin flips, represented by the random variables Xi with sample space {0,1}. Your subjective belief is that the likelihood is Xi|θ ∼ Bernouilli(θ) and that the prior is θ ∼ beta(α, β). Suppose though that you are uncertain on the value of the hyperparameters α and β. What would you do as a true Bayesian? Knowing that α > 0 and β > 0 for a beta distribution, what constraints does it put on your suggestion?
Place prior distributions (hyperpriors) on α and β that put mass only on positive values, and compute the posterior distributions of α, β, and θ from the data D = (X1, ..., Xn).
X_i | θ ~ Bernoulli(θ) = Bern(x_i | θ)
θ | α, β ~ Beta(α, β)
α ~ f(α | ω1)
β ~ g(β | ω2)   (f and g must be supported on (0, ∞), e.g. gamma distributions, to respect α > 0 and β > 0)
p(θ, α, β | x_1, ..., x_n, ω1, ω2) = [∏_i Bern(x_i | θ)] Beta(θ | α, β) f(α | ω1) g(β | ω2) / (marginal likelihood)
then take the argmax of p(θ | x_1, ..., x_n, ω1, ω2) with respect to ω1 and ω2.
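A minimal grid-based sketch of this hierarchical model, assuming gamma hyperpriors for f and g (my choice) and a small made-up dataset; instead of the argmax above, the sketch simply sums α and β out to obtain the marginal posterior of θ:

import numpy as np
from scipy import stats

# Illustrative data and hyperprior choices (gamma hyperpriors keep alpha, beta > 0).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])           # observed coin flips (assumed)
alphas = np.linspace(0.1, 10, 80)                # grid over alpha > 0
betas  = np.linspace(0.1, 10, 80)                # grid over beta  > 0
thetas = np.linspace(0.001, 0.999, 200)          # grid over theta in (0, 1)

A, B, T = np.meshgrid(alphas, betas, thetas, indexing="ij")

# Unnormalized joint posterior p(theta, alpha, beta | x) on the grid:
#   prod_i Bern(x_i | theta) * Beta(theta | alpha, beta) * Gamma(alpha) * Gamma(beta)
log_post = (x.sum() * np.log(T) + (len(x) - x.sum()) * np.log(1 - T)
            + stats.beta.logpdf(T, A, B)
            + stats.gamma.logpdf(A, a=2.0, scale=1.0)    # hyperprior on alpha (assumed)
            + stats.gamma.logpdf(B, a=2.0, scale=1.0))   # hyperprior on beta  (assumed)

post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginal posterior of theta: sum alpha and beta out over the grid.
post_theta = post.sum(axis=(0, 1))
print("posterior mean of theta:", (thetas * post_theta).sum())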