Note 1: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
@%wrCnWSYb
Previous
Note did not exist
New Note
Front
The loss gradient is n-dimensional, where each axis corresponds to a parameter of the model. We seek the global minimum, i.e. where the mean loss is smallest.
Back
The loss gradient is n-dimensional, where each axis corresponds to a parameter of the model. We seek the global minimum, i.e. where the mean loss is smallest.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | The loss gradient is {{c1::n-dimensional}}, where each axis corresponds to {{c2::a parameter of the model}}. We seek the {{c3::global minimum}}, i.e. where the {{c3::mean loss is smallest}}. |
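As a quick illustration of this note, a minimal numpy sketch (the synthetic data, step size, and iteration count are illustrative assumptions, not part of the deck): the gradient of the mean squared loss has one component per model parameter, and stepping against it moves toward the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 points, n = 3 parameters
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)                          # one coordinate per axis of the loss surface
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)    # n-dimensional gradient of the mean squared loss
    w -= 0.1 * grad                      # step downhill toward the minimum
print(w)                                 # close to w_true
```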
Note 2: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
Cz3eAYn$g6
Previous
Note did not exist
New Note
Front
The absolute (L1) loss is defined as \(\ell_{\text{abs}}(r) = |r|\). Its main drawback is that it is not differentiable at zero.
Back
The absolute (L1) loss is defined as \(\ell_{\text{abs}}(r) = |r|\). Its main drawback is that it is not differentiable at zero.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | The {{c1::absolute (L1)}} loss is defined as \(\ell_{\text{abs}}(r) = {{c2::|r|}}\). Its main drawback is that it is {{c3::not differentiable at zero}}. |
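A minimal sketch of the drawback (pure illustration): the one-sided difference quotients of \(|r|\) disagree at zero, so no single derivative exists there.

```python
def l1_loss(r):
    return abs(r)

h = 1e-8
right = (l1_loss(0 + h) - l1_loss(0)) / h  # one-sided slope from the right: +1
left = (l1_loss(0) - l1_loss(0 - h)) / h   # one-sided slope from the left: -1
print(right, left)  # the slopes disagree, so |r| is not differentiable at 0
```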
Note 3: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Classic
GUID:
added
Note Type: Horvath Classic
GUID:
Ek0p90pSOj
Previous
Note did not exist
New Note
Front
Why does the Huber loss use \(\delta|r| - \frac{1}{2}\delta^2\) (not just \(\delta|r|\)) for the linear region?
Back
Why does the Huber loss use \(\delta|r| - \frac{1}{2}\delta^2\) (not just \(\delta|r|\)) for the linear region?
To ensure continuous differentiability at \(r = \pm\delta\):
- The slopes must match: \(L_{\text{square}}' = r\), so the abs part gets slope \(\pm\delta\).
- The y-values must match: square gives \(\frac{1}{2}\delta^2\) but \(\delta|\pm\delta| = \delta^2\), so we subtract \(\frac{1}{2}\delta^2\).
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Front | | Why does the Huber loss use \(\delta|r| - \frac{1}{2}\delta^2\) (not just \(\delta|r|\)) for the linear region? |
| Back | | To ensure <b>continuous differentiability</b> at \(r = \pm\delta\):<br><ul><li>The slopes must match: \(L_{\text{square}}' = r\), so the abs part gets slope \(\pm\delta\).</li><li>The y-values must match: square gives \(\frac{1}{2}\delta^2\) but \(\delta|\pm\delta| = \delta^2\), so we subtract \(\frac{1}{2}\delta^2\).</li></ul> |
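A quick numerical check of this argument (the choice \(\delta = 1.5\) is arbitrary): with the \(-\frac{1}{2}\delta^2\) correction, both the values and the slopes of the two pieces agree at \(r = \delta\).

```python
delta, eps = 1.5, 1e-6

square = lambda r: 0.5 * r**2
linear = lambda r: delta * abs(r) - 0.5 * delta**2

# values match at the boundary r = delta
print(square(delta), linear(delta))                 # both 1.125

# slopes match too (finite differences on either side of the boundary)
print((square(delta) - square(delta - eps)) / eps)  # ~1.5, i.e. delta
print((linear(delta + eps) - linear(delta)) / eps)  # ~1.5, i.e. delta
```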
Note 4: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
JHuwTa2Ri*
Previous
Note did not exist
New Note
Front
When computing loss over multiple datapoints, we take the average loss across all points.
Back
When computing loss over multiple datapoints, we take the average loss across all points.
e.g. mean squared error or mean Huber loss
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | When computing loss over {{c1::multiple datapoints}}, we take the {{c2::average loss}} across all points. |
| Extra | | e.g. mean squared error or mean Huber loss |
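For concreteness, a tiny sketch of the aggregation step (the residual values are made up): the per-point losses are averaged, giving e.g. the mean squared error.

```python
import numpy as np

residuals = np.array([0.5, -1.2, 0.3, 2.0])  # r_i = y_i - yhat_i, illustrative values
mse = np.mean(0.5 * residuals**2)            # mean squared loss across all points
mae = np.mean(np.abs(residuals))             # mean absolute loss across all points
print(mse, mae)
```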
Note 5: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Classic
GUID:
added
Note Type: Horvath Classic
GUID:
RJES!%$LBz
Previous
Note did not exist
New Note
Front
Derive the normal equation for MSE regression from first principles.
Back
Derive the normal equation for MSE regression from first principles.
We want \(\hat{\boldsymbol{w}}\) minimizing \(\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\).
Set gradient to zero:
\(\nabla_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{w}}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\hat{\boldsymbol{w}} - \boldsymbol{y}) = 0\)
\(\iff \boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}\)
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Front | | Derive the <b>normal equation</b> for MSE regression from first principles. |
| Back | | We want \(\hat{\boldsymbol{w}}\) minimizing \(\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\).<br><br>Set gradient to zero:<br>\(\nabla_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{w}}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\hat{\boldsymbol{w}} - \boldsymbol{y}) = 0\)<br>\(\iff \boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}\) |
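A minimal numpy sketch of the result (synthetic data, illustrative only): solving \(\boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}\) directly reproduces what a standard least-squares solver finds.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# solve the normal equation X^T X w = X^T y directly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# compare against numpy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```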
Note 6: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Classic
GUID:
added
Note Type: Horvath Classic
GUID:
U1Eb8hej*j
Previous
Note did not exist
New Note
Front
Write the MSE regression objective in matrix form and state the optimal weight vector.
Back
Write the MSE regression objective in matrix form and state the optimal weight vector.
Objective: \(\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\)
Optimal weights: \(\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w} \in \mathbb{R}^d} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\)
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Front | | Write the MSE regression objective in matrix form and state the optimal weight vector. |
| Back | | Objective: \(\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\)<br><br>Optimal weights: \(\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w} \in \mathbb{R}^d} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2\) |
Note 7: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Classic
GUID:
added
Note Type: Horvath Classic
GUID:
UWJy4WStr%
Previous
Note did not exist
New Note
Front
What is a loss function in regression?
Back
What is a loss function in regression?
A function \(\ell(r)\) that characterizes how "bad" a prediction is, where \(r = y - \hat{y}\) is the residual.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Front | | What is a <b>loss function</b> in regression? |
| Back | | A function \(\ell(r)\) that characterizes how "bad" a prediction is, where \(r = y - \hat{y}\) is the residual. |
Note 8: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
g6K#9WP@pG
Previous
Note did not exist
New Note
Front
The advantage of Huber loss over L1 and L2 is that it is less sensitive to outliers than L2 while still being differentiable everywhere (unlike L1).
Back
The advantage of Huber loss over L1 and L2 is that it is less sensitive to outliers than L2 while still being differentiable everywhere (unlike L1).
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | The advantage of Huber loss over L1 and L2 is that it is {{c1::less sensitive to outliers than L2}} while still being {{c2::differentiable everywhere (unlike L1)}}. |
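To make the trade-off concrete, a small illustrative comparison (\(\delta = 1\) chosen arbitrarily): on a large outlier residual the squared loss explodes, while the Huber loss grows only linearly, like L1.

```python
r, delta = 10.0, 1.0                     # an outlier residual, illustrative threshold
l2 = 0.5 * r**2                          # 50.0 -- the squared loss explodes
l1 = abs(r)                              # 10.0
huber = delta * abs(r) - 0.5 * delta**2  # 9.5  -- linear growth, like L1
print(l2, l1, huber)
```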
Note 9: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
k0Ztc^E2Jy
Previous
Note did not exist
New Note
Front
Setting \(\nabla_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{w}}\|_2^2 = 0\) gives \({{c1::2\boldsymbol{X}^\top(\boldsymbol{X}\hat{\boldsymbol{w}} - \boldsymbol{y}) = 0}}\), which simplifies to the normal equation: \({{c2::\boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}}}\).
Back
Setting \(\nabla_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{w}}\|_2^2 = 0\) gives \({{c1::2\boldsymbol{X}^\top(\boldsymbol{X}\hat{\boldsymbol{w}} - \boldsymbol{y}) = 0}}\), which simplifies to the normal equation: \({{c2::\boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}}}\).
At a minimum the gradient must be zero. The factor 2 cancels out, leaving the normal equation.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | Setting \(\nabla_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{w}}\|_2^2 = 0\) gives \({{c1::2\boldsymbol{X}^\top(\boldsymbol{X}\hat{\boldsymbol{w}} - \boldsymbol{y}) = 0}}\), which simplifies to the <b>normal equation</b>: \({{c2::\boldsymbol{X}^\top \boldsymbol{X} \hat{\boldsymbol{w}} = \boldsymbol{X}^\top \boldsymbol{y}}}\). |
| Extra | | At a minimum the gradient must be zero. The factor 2 cancels out, leaving the normal equation. |
Note 10: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
rthT$U8vno
Previous
Note did not exist
New Note
Front
The square (L2) loss is defined as \(\ell(r) = {{c2::\frac{1}{2} r^2}}\). The \(\frac{1}{2}\) factor is included because it makes the derivative clean: \(L_{\text{square}}' = r\).
Back
The square (L2) loss is defined as \(\ell(r) = {{c2::\frac{1}{2} r^2}}\). The \(\frac{1}{2}\) factor is included because it makes the derivative clean: \(L_{\text{square}}' = r\).
Very sensitive to outliers since the residual is squared.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | The {{c1::square (L2)}} loss is defined as \(\ell(r) = {{c2::\frac{1}{2} r^2}}\). The \(\frac{1}{2}\) factor is included because {{c3::it makes the derivative clean: \(L_{\text{square}}' = r\)}}. |
| Extra | | Very sensitive to outliers since the residual is squared. |
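A one-line finite-difference check of the claim (purely illustrative): the derivative of \(\frac{1}{2}r^2\) at any \(r\) is just \(r\), with no stray factor of 2.

```python
loss = lambda r: 0.5 * r**2
h = 1e-6
for r in (-2.0, 0.0, 3.0):
    print(r, (loss(r + h) - loss(r - h)) / (2 * h))  # central difference, approximately r
```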
Note 11: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Classic
GUID:
added
Note Type: Horvath Classic
GUID:
uMZ9&uAV6k
Previous
Note did not exist
New Note
Front
Write the definition of the Huber loss.
Back
Write the definition of the Huber loss.
\[\ell_{\text{huber}}(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta, \\ \delta |r| - \frac{1}{2} \delta^2 & |r| > \delta. \end{cases}\]
Uses square loss for \([-\delta, \delta]\) and absolute loss outside.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Front | | Write the definition of the <b>Huber loss</b>. |
| Back | | \[\ell_{\text{huber}}(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta, \\ \delta |r| - \frac{1}{2} \delta^2 & |r| > \delta. \end{cases}\]<br>Uses square loss for \([-\delta, \delta]\) and absolute loss outside. |
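A direct numpy transcription of this definition (the vectorized form and the default \(\delta\) are implementation choices, not from the deck):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Square loss on [-delta, delta], shifted absolute loss outside."""
    r = np.asarray(r)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)

print(huber_loss([-3.0, -0.5, 0.0, 0.5, 3.0]))  # quadratic near 0, linear in the tails
```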
Note 12: ETH::Electives::IML
Deck: ETH::Electives::IML
Note Type: Horvath Cloze
GUID:
added
Note Type: Horvath Cloze
GUID:
xwS2i87ME6
Previous
Note did not exist
New Note
Front
The asymmetric loss is \(\ell_\tau(r) = {{c1::\tau \max\{r,0\} + (1-\tau)\max\{-r,0\}}}\). A higher \(\tau\) means steeper penalty for under-shooting (positive residual), lower \(\tau\) means steeper penalty for over-shooting.
Back
The asymmetric loss is \(\ell_\tau(r) = {{c1::\tau \max\{r,0\} + (1-\tau)\max\{-r,0\}}}\). A higher \(\tau\) means steeper penalty for under-shooting (positive residual), lower \(\tau\) means steeper penalty for over-shooting.
Like absolute loss, but with two different slopes on each side of the y-axis.
Field-by-field Comparison
| Field | Before | After |
|---|---|---|
| Text | | The asymmetric loss is \(\ell_\tau(r) = {{c1::\tau \max\{r,0\} + (1-\tau)\max\{-r,0\}}}\). A higher \(\tau\) means {{c2::steeper penalty for under-shooting (positive residual)}}, lower \(\tau\) means {{c3::steeper penalty for over-shooting}}. |
| Extra | | Like absolute loss, but with two different slopes on each side of the y-axis. |
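A small numpy sketch of this loss (the \(\tau\) values are illustrative): with \(r = y - \hat{y}\), a high \(\tau\) makes positive residuals expensive and a low \(\tau\) makes negative residuals expensive.

```python
import numpy as np

def asymmetric_loss(r, tau):
    r = np.asarray(r)
    return tau * np.maximum(r, 0) + (1 - tau) * np.maximum(-r, 0)

r = np.array([-1.0, 1.0])           # one negative, one positive residual
print(asymmetric_loss(r, tau=0.9))  # [0.1 0.9] -> positive residual penalized more
print(asymmetric_loss(r, tau=0.1))  # [0.9 0.1] -> negative residual penalized more
```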