Machine Learning: Linear Regression (Supervised Learning)

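I. Univariate Linear Regression
   1. Model Introduction
   2. Model Construction
   3. Model Loss
   4. Solving for the Model Parameters
   5. Pros and Cons of the Model
   6. Code Implementation
II. Multivariate Linear Regression
III. The Normal Equation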

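I am a beginner in computing and have only just started learning machine learning. This article on linear regression was written with the help of various references. Please be kind, and if you find any mistakes, do point them out. Thank you.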

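There are many machine learning algorithms. By the form of learning they fall into three broad classes: supervised learning, unsupervised learning, and semi-supervised learning.

1. Supervised learning: the process of modeling the relationship between a set of features of the data and a set of labels (classes). The training set contains the class information (the data are already labeled), so the correct result is known during learning. A model trained on labeled data can then predict the labels of new data.

Supervised learning covers regression and classification. In regression the labels are continuous values, so the model predicts a continuous value. In classification the labels are discrete values, so the model predicts one of two or more discrete classes.

Supervised learning algorithms: linear regression, logistic regression, k-nearest neighbors (KNN), BP neural networks, naive Bayes, random forests, decision trees, support vector machines.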

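2. Unsupervised learning: modeling the features of data that carry no labels. The training set contains no class information (the data are unlabeled), so the result is not known during learning. Such models identify structure in unlabeled data.

Unsupervised learning covers clustering and dimensionality reduction. Clustering models detect and identify distinct groups in the data; dimensionality reduction models detect and identify low-dimensional structure in high-dimensional data.

Unsupervised learning algorithms: K-means, principal component analysis (PCA), autoencoders, the expectation-maximization (EM) algorithm, Gaussian mixture models, the Apriori algorithm, spectral clustering.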

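3. Semi-supervised learning: sits between supervised and unsupervised learning, and is typically used when the data labels are incomplete.

The essential difference between the three lies in the training set. If the training set contains both input variables ($x$) and output variables ($y$), it is supervised learning. If it contains only input variables ($x$), it is unsupervised learning. If some of the inputs have a corresponding output ($y$) and the rest do not, it is semi-supervised learning.

We now turn to the first and simplest machine learning algorithm: linear regression.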

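I. Univariate Linear Regression

1. Model Introduction

A univariate linear regression model has two variables: one input variable (the feature) and one output variable (the target).

Univariate linear regression finds the relationship between these two variables and expresses it as a straight line. In the example below, linear regression can be used to find a fitted line that describes the relationship between the two variables, and that line can then be used to predict the values we are interested in.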

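2. Model Construction

We therefore build a model for the data above. The plot of the data suggests that a linear regression model fits; its function is usually written as:

$$h_{\theta}(x)=\theta_{0}+\theta_{1}x \tag{1}$$

Here $\theta_0$ and $\theta_1$ are the fitting parameters, $x$ is the feature (input variable), and $h_{\theta}(x)$ is the model output (the prediction). Because there is only one feature, this is univariate linear regression.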
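As a minimal sketch of equation (1), the hypothesis can be written as a plain function. The function name `hypothesis` and the parameter values are illustrative assumptions; they are not fitted to any data set from this article.

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    """Evaluate h_theta(x) = theta0 + theta1 * x for a scalar or an array."""
    return theta0 + theta1 * np.asarray(x, dtype=float)

# Illustrative parameters (assumed values, not fitted to real data)
preds = hypothesis([0.0, 1.0, 2.0], theta0=1.0, theta1=2.0)
print(preds)  # [1. 3. 5.]
```

Fitting the model means choosing `theta0` and `theta1` so that these predictions are as close as possible to the observed targets.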

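3. Model Loss

When we use the linear regression model above there will inevitably be some error: the gap between the value the model predicts and the actual value in the training data. We call this the modeling error. We want to eliminate this error (loss), or at least make it as small as possible, which yields the best model parameters ($\theta_0$, $\theta_1$). This is where the cost function (also called the loss function) comes in:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2} \tag{2}$$

where

$m$: the number of samples;

$x^{(i)}$: the input (feature) of the $i$-th sample;

$y^{(i)}$: the actual (target) value of the $i$-th sample.

The cost function $J(\theta)$ is a function of the parameters $\theta$. Next we use gradient descent to find the $\theta$ that minimizes the cost.

(Note: the squared form of the cost function is chosen because it is convenient for gradient descent.)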

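The Python code that computes the cost function is as follows:

    import numpy as np

    # Cost function J(theta)
    def computeCost(X, y, theta):
        # X: (m, n) np.matrix, y: (m, 1) np.matrix, theta: (1, n) np.matrix
        inner = np.power(((X * theta.T) - y), 2)  # squared error of each sample
        return np.sum(inner) / (2 * len(X))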
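To make the cost formula concrete, here is a small worked example with made-up data. It is a sketch that uses plain `ndarray`s instead of `np.matrix`; the name `compute_cost` and the toy values are assumptions introduced for illustration.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), using plain ndarrays."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Toy data (made up): y = 1 + 2x exactly; the first column of ones pairs with theta_0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

cost_zero = compute_cost(X, y, np.array([0.0, 0.0]))   # (1 + 9 + 25) / (2 * 3)
cost_exact = compute_cost(X, y, np.array([1.0, 2.0]))  # perfect fit, zero cost
print(cost_zero, cost_exact)
```

With all parameters at zero the residuals are simply the targets, so the cost is 35/6; at the true parameters every residual vanishes and the cost is zero.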

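4. Solving for the Model Parameters

To solve for the model parameters we need a learning rate $\alpha$, which is used to compute and update the parameters $\theta$ at each iteration (there are `iters` iterations in total).

$$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta) \tag{3}$$

For a given training set, the resulting expression for $\frac{\partial}{\partial\theta_{j}}J(\theta)$ depends only on $\theta$, and all components of $\theta$ must be updated simultaneously.

$\alpha$ (the learning rate) determines how large a step we take in the direction in which the cost function decreases fastest; `iters` is the number of iterations. We keep iterating until the cost function $J(\theta)$ converges, which gives the parameters we want.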
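For the single-variable model (1) and the cost (2), the partial derivatives used in update rule (3) can be written out explicitly. This is the standard chain-rule result, given here as a worked step:

```latex
\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right),
\qquad
\frac{\partial J}{\partial \theta_1}
  = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
```

The factor 2 produced by differentiating the square cancels the $\frac{1}{2}$ in (2), which is one reason the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.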

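The gradient descent code is as follows:

    import numpy as np

    # Relies on computeCost defined above
    def gradientDescent(X, y, theta, alpha, iters):
        temp = np.matrix(np.zeros(theta.shape))   # holds the updated parameters
        parameters = int(theta.ravel().shape[1])  # number of parameters
        cost = np.zeros(iters)                    # cost after each iteration

        for i in range(iters):
            error = (X * theta.T) - y             # residuals under the current theta

            for j in range(parameters):
                term = np.multiply(error, X[:, j])
                temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

            theta = temp
            cost[i] = computeCost(X, y, theta)

        return theta, cost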
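The same update can also be written in vectorized form, which computes all partial derivatives in one matrix product. The sketch below is an alternative to the loop above, not the article's original code; it assumes plain `ndarray`s, and the function name and toy data are made up for illustration.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iters):
    """Batch gradient descent for linear regression using plain ndarrays."""
    m = len(y)
    theta = theta.astype(float).copy()
    history = np.zeros(iters)
    for i in range(iters):
        residual = X @ theta - y
        grad = X.T @ residual / m      # all partial derivatives at once
        theta -= alpha * grad          # simultaneous update of every theta_j
        new_residual = X @ theta - y
        history[i] = new_residual @ new_residual / (2 * m)
    return theta, history

# Toy data (made up): y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta, history = gradient_descent(X, y, np.zeros(2), alpha=0.1, iters=2000)
print(theta)
```

Because the gradient is computed from the old `theta` before any component changes, the update is simultaneous by construction.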

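5. Pros and Cons of the Model

Pros: 1) The model is fast to build and needs no complicated computation, and it still runs quickly on large data sets;

2) The coefficients give an understanding and interpretation of each variable.

Cons: It is very sensitive to outliers.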
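The sensitivity to outliers can be illustrated with a small experiment. This is a sketch on made-up data, using `np.polyfit` for the least-squares line fit; a single corrupted target is enough to shift the slope noticeably.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# The same data with a single corrupted target value
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0

# Degree-1 least-squares fits; polyfit returns (slope, intercept)
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)
print(slope_clean, slope_out)
```

Because the cost squares each residual, one large error dominates the sum and pulls the whole line toward the outlier.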

6. Code Implementation

1) A from-scratch implementation (NumPy and pandas):

    import seaborn as sns; sns.set()
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    data = pd.read_csv('ex1.csv', names=['a', 'b'])

    # Cost function
    def computeCost(x, y, theta):
        inner = np.power(((x * theta.T) - y), 2)
        return np.sum(inner) / (2 * len(x))

    # Prepend a column of ones so that theta_0 acts as the intercept
    ones = pd.DataFrame({'ones': np.ones(len(data))})
    data = pd.concat([ones, data], axis=1)
    x = np.matrix(data.get(['ones', 'a']))
    y = np.matrix(data.get(['b']))
    theta = np.matrix(np.array([0, 0]))
    J_0 = computeCost(x, y, theta)

    # Gradient descent
    def gradientDescent(X, y, theta, alpha, iters):
        temp = np.matrix(np.zeros(theta.shape))
        parameters = int(theta.ravel().shape[1])
        cost = np.zeros(iters)
        for i in range(iters):
            error = (X * theta.T) - y
            for j in range(parameters):
                term = np.multiply(error, X[:, j])
                temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
            theta = temp
            cost[i] = computeCost(X, y, theta)
        return theta, cost

    alpha = 0.01
    iters = 1000
    g, cost = gradientDescent(x, y, theta, alpha, iters)
    J = computeCost(x, y, g)

    # Predictions of the fitted line, flattened to a 1-D array for plotting
    f = np.asarray(g[0, 0] + (g[0, 1] * x[:, 1])).ravel()

    plt.figure(figsize=(10, 7))
    plt.plot(np.asarray(x)[:, 1], f, 'r', label='PredictedResult')
    plt.scatter(data.a, data.b, label='Training Data')
    plt.xlabel('a')
    plt.ylabel('b')
    plt.title('a vs. b')
    plt.legend()
    plt.show()

    plt.figure(figsize=(10, 7))
    plt.plot(np.arange(iters), cost, 'r')
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.title("Error")
    plt.show()

    print("Cost at the first step: %f" % J_0)
    print("Cost at convergence: %f" % J)
    print("theta_0= %f" % g[0, 0])
    print("theta_1= %f" % g[0, 1])

The results are as follows:

2) Using sklearn:

    from sklearn import linear_model
    from sklearn.model_selection import train_test_split
    import seaborn as sns; sns.set()
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    data = pd.read_csv('ex1.csv', names=['a', 'b'])
    x = np.asarray(data.get('a')).reshape(-1, 1)
    y = data.get('b')

    # Split the data set into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    # The `normalize` argument was removed from LinearRegression in recent
    # scikit-learn releases, so it is not passed here
    clf = linear_model.LinearRegression()
    clf.fit(x_train, y_train)
    theta_1, theta_0 = clf.coef_[0], clf.intercept_  # fitted parameters

    # Visualize the result
    plt.figure(figsize=(10, 7))
    plt.scatter(x_train, y_train, color='blue', label='TrainingSet')
    plt.plot(x_train, clf.predict(x_train), color='red', label='TrainingResult')
    plt.scatter(x_test, y_test, color='green', label='TestSet')
    plt.plot(x_test, clf.predict(x_test), color='yellow', label='TestResult')
    plt.xlabel('a_x')
    plt.ylabel('b_y')
    plt.title('a_x vs. b_y')
    plt.legend()
    plt.show()

    print("theta_1= %f" % theta_1)
    print("theta_0= %f" % theta_0)
    # Score of the model on the test set: higher is better, 1.0 is the best
    print("test score= %f" % clf.score(x_test, y_test))

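The results are as follows:

In the plot, the red line is the model's prediction on the training set and the yellow line is its prediction on the test set. The two lines coincide because both are drawn from the same fitted model. The test-set score (0.9735, i.e. 97.35%) shows that the fit is quite good.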
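For a regressor such as `LinearRegression`, scikit-learn's `score` method returns the coefficient of determination $R^2$ rather than a classification accuracy. The sketch below computes $R^2$ by hand on made-up numbers; the helper name `r2_score_manual` is an assumption introduced here.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)                  # exact predictions
baseline = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(perfect, baseline)
```

A score of 1.0 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.0.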

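II. Multivariate Linear Regression

In linear regression there is sometimes more than one independent (input) variable. For example, the price of a house (Price) may be determined by its area (Area) and its number of rooms (RoomNum). The single feature then becomes several features ($x_1$, $x_2$, $x_3$, …). The data are handled in the same way as in univariate linear regression, except that with several features the data have to be visualized in a higher-dimensional space. With two features the data can be plotted in three dimensions, but with more than two features visualization becomes impractical and we can only report the fitted parameters and the cost function. In multivariate linear regression we may also add one more preprocessing step: feature normalization.

Normalization rescales the data (by some algorithm) into a range that suits your needs. It makes later processing more convenient, removes the effect of features having different orders of magnitude, and helps the program converge faster. For classification problems normalization is generally needed, because such problems care about the values of the variables. For probabilistic models it is not needed, because they do not care about the values of the variables but about their distributions and the conditional probabilities between them.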
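One common form of normalization is z-score standardization: subtract each column's mean and divide by its standard deviation. The sketch below shows it on made-up (Area, RoomNum) rows; the function name `standardize` and the numbers are illustrative assumptions.

```python
import numpy as np

def standardize(features):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0, ddof=1)   # sample std, matching pandas' default
    return (features - mean) / std, mean, std

# Made-up rows of (Area, RoomNum); the two columns differ by orders of magnitude
raw = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])
scaled, mean, std = standardize(raw)
print(scaled.mean(axis=0), scaled.std(axis=0, ddof=1))
```

Keeping `mean` and `std` matters in practice: new inputs must be scaled with the statistics of the training data before being passed to the fitted model.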

In multivariate linear regression the model function has more features, so the cost function becomes:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2} \tag{4}$$

where:

$$h_{\theta}(x)=\sum_{j=0}^{n}\theta_{j}x_{j}=\theta^{T}X=\theta_{0}x_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\dots+\theta_{n}x_{n} \tag{5}$$

$m$: the number of samples

$n$: the number of features (input variables)

$\theta_j$: the fitting parameters

$x_0 = 1$ is a constant term, so that $\theta_0$ acts as the intercept, and

$$X=\begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix} \qquad \theta=\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix}$$
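In code, the hypothesis for all samples at once is a single matrix-vector product. The sketch below assumes a design matrix whose first column is the constant $x_0 = 1$; the numbers are made up for illustration.

```python
import numpy as np

# Made-up design matrix: the first column is x_0 = 1, followed by two features
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0]])
theta = np.array([0.5, 1.0, -2.0])   # assumed parameters (theta_0, theta_1, theta_2)

predictions = X @ theta   # entry i equals theta^T x^(i)
print(predictions)        # [-3.5 -1.5]
```

Each row of `X` is one sample, so the first prediction is 0.5 + 1.0·2 − 2.0·3 = −3.5.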

Gradient descent updates the fitting parameters as:

$$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta) \tag{6}$$

The code is as follows:

1) Using the from-scratch method:

    import seaborn as sns; sns.set()
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Cost function
    def computeCost(x, y, theta):
        inner = np.power(((x * theta.T) - y), 2)
        return np.sum(inner) / (2 * len(x))

    # Gradient descent
    def gradientDescent(X, y, theta, alpha, iters):
        temp = np.matrix(np.zeros(theta.shape))
        parameters = int(theta.ravel().shape[1])
        cost = np.zeros(iters)
        for i in range(iters):
            error = (X * theta.T) - y
            for j in range(parameters):
                term = np.multiply(error, X[:, j])
                temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
            theta = temp
            cost[i] = computeCost(X, y, theta)
        return theta, cost

    data = pd.read_csv('ex2.csv', header=None, names=['Area', 'RoomNum', 'Price'])

    # Normalize the data, then prepend a column of ones
    data1 = (data - data.mean()) / data.std()
    ones = pd.DataFrame({'one': np.ones(len(data1))})
    data1 = pd.concat([ones, data1], axis=1)

    X = data1.get(['one', 'Area', 'RoomNum'])
    x = np.matrix(X)
    Y = data1.get(['Price'])
    y = np.matrix(Y)

    # Initialize the learning rate alpha and the number of iterations
    alpha = 0.01
    iters = 1000
    theta1 = np.matrix(np.array([0, 0, 0]))
    J_0 = computeCost(x, y, theta1)

    # Run gradient descent
    g, cost1 = gradientDescent(x, y, theta1, alpha, iters)
    J = computeCost(x, y, g)

    # Extract the fitted parameters as ndarrays, because the prediction f below
    # needs element-wise products: `*` on np.matrix is a matrix product,
    # while `*` on np.ndarray is element-wise
    X1, X2 = np.meshgrid(np.asarray(X)[:, 1], np.asarray(X)[:, 2])
    a = np.asarray(g)[:, 0]
    b = np.asarray(g)[:, 1]
    c = np.asarray(g)[:, 2]
    f = np.asarray(a + b * X1 + c * X2)

    # 3D scatter plot; scatter takes the data columns directly
    fig = plt.figure()
    axes3d = fig.add_subplot(projection='3d')
    axes3d.scatter(np.asarray(X)[:, 1], np.asarray(X)[:, 2], np.asarray(y).ravel(), c='b', marker='o')
    # plot_surface expects its x, y, z arguments as meshgrid (ndarray) grids
    axes3d.plot_surface(X1, X2, f, shade=False, color='red')
    axes3d.set_xlabel('Area')
    axes3d.set_ylabel('RoomNum')
    axes3d.set_zlabel('Price')
    axes3d.set_title('PredictedResult')

    fig, ax = plt.subplots(figsize=(12, 8))
    ax.plot(np.arange(iters), cost1, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title('Error vs. Training Epoch')
    plt.show()

The results are as follows:

2) Using sklearn:

    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split

    data = pd.read_csv('ex2.csv', names=['Area', 'RoomNum', 'Price'])
    x = data.get(['Area', 'RoomNum'])
    y = data.get('Price')

    # Split the data set into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    reg = linear_model.LinearRegression()
    reg.fit(x_train, y_train)

    # Evaluate the fitted plane on grids built from the training and test inputs
    x0_train, x1_train = np.meshgrid(np.asarray(x_train)[:, 0], np.asarray(x_train)[:, 1])
    y_train_pred = np.array(reg.coef_[0] * x0_train + reg.coef_[1] * x1_train + reg.intercept_)
    x0_test, x1_test = np.meshgrid(np.asarray(x_test)[:, 0], np.asarray(x_test)[:, 1])
    y_test_pred = np.array(reg.coef_[0] * x0_test + reg.coef_[1] * x1_test + reg.intercept_)

    acc_test = reg.score(x_test, y_test)
    acc_train = reg.score(x_train, y_train)

    fig = plt.figure()
    ax3d = fig.add_subplot(projection='3d')
    ax3d.scatter(np.asarray(x_train)[:, 0], np.asarray(x_train)[:, 1], np.asarray(y_train), alpha=0.8, color='blue')
    ax3d.scatter(np.asarray(x_test)[:, 0], np.asarray(x_test)[:, 1], np.asarray(y_test), alpha=0.8, color='green')
    ax3d.plot_surface(x0_train, x1_train, y_train_pred, shade=False, color='red')
    ax3d.plot_surface(x0_test, x1_test, y_test_pred, shade=False, color='yellow')
    ax3d.set_xlabel('area')
    ax3d.set_ylabel('room_num')
    ax3d.set_zlabel('price')
    ax3d.set_title('PredictedResult')
    plt.show()

    print(acc_train)
    print(acc_test)

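The results are as follows:

In the plot, the red plane is the model's prediction over the training inputs and the yellow plane is its prediction over the test inputs. The two nearly coincide because both come from the same fitted model. The scores on the test set and the training set, however, are not very high and there is still some gap between them, so the fit is not very good.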

III. The Normal Equation

Linear regression is mostly solved with gradient descent, but for some problems the normal equation is also a very good approach.

Suppose the predicted result is $y$. Then we have the following (where $J(\theta)$ is given by (4) and $h_{\theta}(x)$ by (5)):

$$y=\theta^{T}X \tag{7}$$

Take the partial derivatives of $J(\theta)$ and set them to zero:

$$\frac{\partial}{\partial\theta_{j}}J(\theta)=0 \tag{8}$$

to find the $\theta$ that minimizes $J(\theta)$. Solving these equations finally gives:

$$\theta=(X^{T}X)^{-1}X^{T}y \tag{9}$$

We need to know the rules for taking derivatives with respect to vectors and matrices (see the article below):

/daaikuaichuan/article/details/80620518

Here we need two of those formulas (the second holds for symmetric $A$, which is the case for $X^{T}X$):

$$\frac{\partial (Ax)}{\partial x}=A^{T} \tag{10}$$

$$\frac{\partial (x^{T}Ax)}{\partial x}=2Ax \tag{11}$$
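Formula (11) can be checked numerically with finite differences. The sketch below uses a random symmetric matrix built as $B^{T}B$ (the same form as $X^{T}X$); the helper name `numerical_gradient` and the random data are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B                      # symmetric, like X^T X
x = rng.normal(size=3)

approx = numerical_gradient(lambda v: v @ A @ v, x)   # gradient of x^T A x
exact = 2 * A @ x                                     # formula (11)
print(np.max(np.abs(approx - exact)))
```

For a non-symmetric $A$ the gradient would be $(A + A^{T})x$ instead, which reduces to $2Ax$ when $A$ is symmetric.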

The derivation is as follows:

Write all variables in vector form ($X$ is an $m \times n$ matrix, $\theta$ is an $n \times 1$ matrix, and $y$ is an $m \times 1$ matrix). The constant factor $\frac{1}{m}$ is dropped because it does not change the minimizer. Then:

$$J(\theta)=\frac{1}{2}(X\theta-y)^{2} \tag{12}$$

$$J(\theta)=\frac{1}{2}(X\theta-y)^{T}(X\theta-y)=\frac{1}{2}(\theta^{T}X^{T}-y^{T})(X\theta-y) \tag{13}$$

$$J(\theta)=\frac{1}{2}(\theta^{T}X^{T}X\theta-\theta^{T}X^{T}y-y^{T}X\theta+y^{T}y) \tag{14}$$

$$\frac{\partial J(\theta)}{\partial\theta}=\frac{1}{2}\left(2X^{T}X\theta-X^{T}y-(y^{T}X)^{T}+0\right) \tag{15}$$

$$\frac{\partial J(\theta)}{\partial\theta}=\frac{1}{2}(2X^{T}X\theta-X^{T}y-X^{T}y)=X^{T}X\theta-X^{T}y \tag{16}$$

Setting

$$\frac{\partial J(\theta)}{\partial\theta}=0 \tag{17}$$

gives:

$$\theta=(X^{T}X)^{-1}X^{T}y \tag{18}$$
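Equation (18) can be evaluated without forming the inverse explicitly, by solving the linear system $X^{T}X\theta = X^{T}y$ instead. The sketch below is one way to do that with NumPy, on made-up data; the function name `normal_equation` is an assumption introduced here.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y without computing the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data: y = 1 + 2x, with a leading column of ones for the intercept
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_solve = normal_equation(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solver, for comparison
print(theta_solve, theta_lstsq)
```

Solving the system is generally preferable numerically to calling `np.linalg.inv`, and `np.linalg.lstsq` still returns a solution when $X^{T}X$ is singular.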

Code:

    import seaborn as sns; sns.set()
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    data = pd.read_csv('ex1.csv', names=['a', 'b'])

    # Prepend a column of ones so that theta_0 acts as the intercept
    ones = pd.DataFrame({'ones': np.ones(len(data))})
    data = pd.concat([ones, data], axis=1)
    x = data.get(['ones', 'a'])
    y = data.get(['b'])
    x = np.matrix(x)
    y = np.matrix(y)

    # Solve with the normal equation
    def normalEqn(X, y):
        theta = np.linalg.inv(X.T @ X) @ X.T @ y  # X.T @ X is equivalent to X.T.dot(X)
        return theta

    theta = normalEqn(x[:, :2], y)
    x1 = np.asarray(x[:, 1])
    y = np.asarray(y)
    theta_0 = np.asarray(theta)[0]
    theta_1 = np.asarray(theta)[1]
    f = theta_0 + theta_1 * x1

    plt.figure(figsize=(10, 7))
    plt.scatter(x1, y, label="TrainingData")
    plt.plot(x1, f, 'r', label="Predicted")
    plt.xlabel('a')
    plt.ylabel('b')
    plt.title('a vs. b')
    plt.legend()
    plt.show()

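Results:

The plot shows that the normal equation also describes the linear regression model well, with the points spread evenly on both sides of the fitted line. The computed parameters $\theta_0$ and $\theta_1$ still differ slightly from those obtained by gradient descent.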
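One plausible reason for such a gap (an assumption, since the run output is not shown here) is that gradient descent was stopped after a fixed number of iterations, before it had fully converged. The sketch below illustrates this on synthetic data; the data, learning rate, and iteration counts are all made up for the demonstration.

```python
import numpy as np

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (X @ theta - y) / len(y)
    return theta

# Synthetic data (made up): a noisy line with unscaled inputs
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 25.0, size=50)
y = -4.0 + 1.2 * x + rng.normal(scale=2.0, size=50)
X = np.column_stack([np.ones_like(x), x])

theta_exact = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equation solution

# Distance between the gradient-descent result and the exact solution
gap_short = np.linalg.norm(run_gd(X, y, 0.001, 1000) - theta_exact)
gap_long = np.linalg.norm(run_gd(X, y, 0.001, 200000) - theta_exact)
print(gap_short, gap_long)
```

With enough iterations the gradient descent result approaches the normal-equation solution, since both minimize the same cost function.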

To close, here is a summary borrowed from Andrew Ng: