Properties of the Softmax Function

Published

January 9, 2024

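Softmax is a function commonly used in deep learning models, typically as the last layer of a classification model. After a vector passes through softmax, its components are all positive and sum to 1, so the result can serve as the output of a classification model and approximate a true probability distribution.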

1 Mathematical Definition

The softmax function is defined as: \[ \sigma(\vec z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad(1)\] where \(\vec z\) is the input vector.

The code below shows a simple implementation of the softmax function.

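import torch

def my_softmax(z, dim):
    # z.shape == (bn, n); normalize along `dim`
    z = torch.exp(z)
    # keepdim=True keeps the reduced axis so the division broadcasts for any `dim`
    z = z / torch.sum(z, dim=dim, keepdim=True)
    return z

tensor_test = torch.rand(10, 100)
# Compare against the built-in implementation
diff = (my_softmax(tensor_test, 0) - torch.softmax(tensor_test, 0)).abs().mean().item()
assert diff < 1e-6, diff
print('Difference: ', diff)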

Difference:  6.180256750809576e-09

2 Overflow and Underflow in Softmax

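In actual computation, softmax easily runs into overflow and underflow. Let \(\vec z\) be the input to softmax. If every value in \(\vec z\) is very negative (tending toward negative infinity), every \(e^{z_j}\) rounds to 0, so the denominator of Equation 1 becomes 0 and the division yields NaN; this is called underflow. Conversely, if \(\vec z\) contains a very large value, \(e^{x}\) grows so fast that the result exceeds the largest representable floating-point number and becomes infinity; this is called overflow.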
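To make the two failure modes concrete, here is a minimal sketch in float32. The inputs 100 and -110 are values picked for illustration only; they are just far enough out that the exponential leaves the representable range.

```python
import torch

# Largest finite float32 value, about 3.4e38; exp(x) exceeds it once x is above roughly 88.7.
print(torch.finfo(torch.float32).max)

big = torch.exp(torch.tensor(100.0))     # true value ~2.7e43 does not fit, so this is inf
small = torch.exp(torch.tensor(-110.0))  # true value ~1.7e-48 is too small, so this is 0

overflow_ratio = big / big        # inf / inf
underflow_ratio = small / small   # 0 / 0
print(big.item(), small.item(), overflow_ratio.item(), underflow_ratio.item())
```

Both ratios come out as NaN, which is why checking for NaN is a reasonable way to detect either problem in a naive softmax.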

The code below shows overflow and underflow occurring in the softmax computation.

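# exp(1e5) is inf in float32, so the division is inf / inf = NaN
out = my_softmax(tensor_test + 1e5, 0)
if torch.any(torch.isnan(out)):
    print('Overflow detected!')

# exp(-1e5) rounds to 0, so the division is 0 / 0 = NaN
out = my_softmax(tensor_test - 1e5, 0)
if torch.any(torch.isnan(out)):
    print('Underflow detected!')

out = torch.softmax(tensor_test + 1e5, 0)
out2 = torch.softmax(tensor_test - 1e5, 0)
if not (
    torch.any(torch.isnan(out)) or
    torch.any(torch.isnan(out2))
):
    print('torch.softmax shows no overflow or underflow.')

Overflow detected!
Underflow detected!
torch.softmax shows no overflow or underflow.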

Although our simple softmax implementation overflows and underflows, torch's softmax has no such problem. How does torch manage this?

2.1 Handling Overflow and Underflow

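The softmax function has the following property: \(\operatorname{softmax}(\vec z + y) = \operatorname{softmax}(\vec z), \forall y \in \mathbb R\). That is, adding the same scalar \(y\) to every component of the input vector leaves the output unchanged.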
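The property holds because the shift contributes a factor \(e^{y}\) to both the numerator and the denominator, where it cancels. A quick numerical check, as a sketch: the shift 3.7 and the vector length 5 are arbitrary choices for illustration, and float64 is used only to keep rounding error small.

```python
import torch

torch.manual_seed(0)
z = torch.rand(5, dtype=torch.float64)
y = 3.7  # arbitrary scalar shift

shifted = torch.softmax(z + y, dim=0)
original = torch.softmax(z, dim=0)

# Largest elementwise gap between the two outputs; expected to be at rounding level.
max_gap = (shifted - original).abs().max().item()
print(max_gap)
```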

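Choosing \(y=-z_k\), where \(k=\arg\max_j z_j\), solves the overflow and underflow problem. Then \(e^{z_j + y}\) attains its maximum over \(j\) at \(j=k\), and that maximum is 1, so overflow cannot occur. Moreover, the denominator is at least 1, so it can never underflow to 0. Individual terms may still round to 0, but that only produces an output of 0 for those components rather than NaN.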

The improved softmax function is as follows:

def my_softmax_2(z, dim):
    # z.shape == (bn, n)

    # Subtract the maximum along `dim` (out of place, so the caller's tensor is not modified)
    z = z - z.max(dim=dim, keepdim=True)[0]

    z = torch.exp(z)
    z = z / torch.sum(z, dim=dim, keepdim=True)
    return z

out = my_softmax_2(tensor_test + 1e5, 0)
if torch.any(torch.isnan(out)):
    print('Overflow detected!')

out = my_softmax_2(tensor_test - 1e5, 0)
if torch.any(torch.isnan(out)):
    print('Underflow detected!')

diff = (my_softmax_2(tensor_test, 0) - torch.softmax(tensor_test, 0)).abs().mean().item()
assert diff < 1e-6, diff
print('Difference: ', diff)

Difference:  9.164214387347158e-10

3 Derivative of the Softmax Function

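Softmax maps a vector to a vector, so the derivative of its output with respect to its input is a Jacobian matrix. For convenience, write \(s_i:=\sigma(\vec z)_i\). Computing the Jacobian requires two cases: \(\frac{\partial s_i}{\partial z_i}\) (the diagonal of the Jacobian), and \(\frac{\partial s_i}{\partial z_k}\) with \(i\neq k\) (the off-diagonal entries).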

In the first case, the derivative is computed as follows: \[ \begin{aligned} s_i &= \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{1}{\sum_{j, j\neq i} e^{z_j - z_i} + 1}\\ \frac{\partial s_i}{\partial z_i} &= \frac{e^{z_i}\sum_j e^{z_j}}{(\sum_j e^{z_j})^2} - \frac{e^{z_i} e^{z_i}}{(\sum_j e^{z_j})^2} \\ &= s_i - s_i^2 \\ &= s_i(1 - s_i), \end{aligned} \]

In the second case, the derivative \(\frac{\partial s_i}{\partial z_k} (i \neq k)\) is computed as follows: \[ \begin{aligned} \frac{\partial s_i}{\partial z_k} &= -\frac{e^{z_i} e^{z_k}}{(\sum_j e^{z_j})^2} \\ &= -s_i s_k \end{aligned} \]

The Jacobian matrix of the softmax function is therefore: \[ \begin{pmatrix} s_1(1-s_1) & -s_1 s_2 & -s_1 s_3 & \cdots & -s_1 s_n \\ -s_2 s_1 & s_2(1-s_2) & -s_2 s_3 & \cdots & -s_2 s_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -s_n s_1 & -s_n s_2 & -s_n s_3 & \cdots & s_n (1 - s_n) \\ \end{pmatrix} \]
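The two cases can be written compactly as \(\operatorname{diag}(\vec s) - \vec s \vec s^{\top}\). The sketch below compares that closed form with the Jacobian autograd computes for torch.softmax. The helper name softmax_jacobian is hypothetical, introduced here for illustration, and the length-4 input is arbitrary.

```python
import torch

def softmax_jacobian(s):
    # Closed form from the derivation: J[i, i] = s_i (1 - s_i), J[i, k] = -s_i s_k,
    # which together equal diag(s) - s s^T.
    return torch.diag(s) - torch.outer(s, s)

torch.manual_seed(0)
z = torch.rand(4, dtype=torch.float64)
s = torch.softmax(z, dim=0)

analytic = softmax_jacobian(s)
# Reference: Jacobian of softmax with respect to its input, computed by autograd.
reference = torch.autograd.functional.jacobian(lambda v: torch.softmax(v, dim=0), z)

jac_gap = (analytic - reference).abs().max().item()
print(jac_gap)
```

Each row of the matrix sums to zero, which is consistent with the outputs always summing to 1.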

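Looking at this matrix, we can see two situations in which it approaches the zero matrix. The first is when the softmax output is one-hot (that is, \(s_i = 1\) for some \(i\) and \(s_j=0\) for all \(j \neq i\)), in which case the Jacobian is the zero matrix. The second is when the output has many dimensions and the values are roughly equal across dimensions, so that \(s_1\approx s_2 \approx \cdots \approx s_n \approx 1/n \approx 0\). During training, both situations lead to vanishing gradients and may hurt training, so they are worth knowing about in advance.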
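A rough numerical illustration of both situations, under assumed inputs: a logit of 20 against zeros stands in for the near one-hot case, 1000 equal logits for the uniform case, and a small spread-out vector serves as a reference. The helper softmax_jacobian is hypothetical and uses the closed form \(\operatorname{diag}(\vec s) - \vec s \vec s^{\top}\).

```python
import torch

def softmax_jacobian(z):
    # J = diag(s) - s s^T, with s = softmax(z)
    s = torch.softmax(z, dim=0)
    return torch.diag(s) - torch.outer(s, s)

# Case 1: one logit dominates, so the output is nearly one-hot.
z_peaked = torch.tensor([20.0, 0.0, 0.0, 0.0], dtype=torch.float64)
# Case 2: many equal logits, so every output equals 1/n.
z_uniform = torch.zeros(1000, dtype=torch.float64)
# Reference: a short, moderately spread input.
z_ref = torch.tensor([1.0, 0.0, -1.0, 0.5], dtype=torch.float64)

# Largest absolute Jacobian entry in each case.
max_peaked = softmax_jacobian(z_peaked).abs().max().item()
max_uniform = softmax_jacobian(z_uniform).abs().max().item()
max_ref = softmax_jacobian(z_ref).abs().max().item()
print(max_peaked, max_uniform, max_ref)
```

In both special cases the largest entry is orders of magnitude smaller than in the reference input, which matches the observation about vanishing gradients.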

4 Closing Remarks

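I remember that in the last semester of my master's degree, I interviewed with a company, a moderately well-known mid-sized firm in China. The interviewer asked about the softmax function, and I answered with the material above. At the end, the interviewer said with a smile: a few years ago, when I asked this same question, hardly anyone could answer it. This year, though, everyone can.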

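Hearing this, I could not help but reflect. Over these few years, the deep learning / AI track really has become more and more competitive. To me, a student preparing for interviews, this question does seem like common knowledge. In a few more years, will it become common knowledge for ordinary undergraduates, or even for primary and secondary school students? Ha.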