机器学习基石作业2

出现noisy时错误的可能性，这题是简单的概率问题，μ是h出错的概率，当y=f(x)时出错：λμ；当otherwise时出错：(1-λ)(1-μ)。

=>λμ+(1-λ)(1-μ)

h的表现和μ无关，λμ+(1-λ)(1-μ)=1-λ-(1-2λ)μ =>λ=0.5

根据公式4(2N)dVC exp(-1/8 ξ2 N)=0.05求得sample size最接近的是460000

1 import matplotlib.pyplot as plt 2 import numpy as np 3 4 t = np.arange(-10,10,0.01) 5 6 def original_vc_bound(n): 7return np.sqrt(((8.0 / n) * (np.log(80) + 50 * np.log(2 * n)))) 8 9 def rademacher_penalty_bound(n):10return 1.0 / n + np.sqrt((2.0 / n) * np.log(20)) + np.sqrt((2.0 * (np.log(2 * n) + 50 * np.log(n))) / n)11 12 def parrondo_vandenBroek(t, n):13return t - np.sqrt(1.0 / n * (2 * t + np.log(120) + 50 * np.log(2 * n)))14 15 def devroye(t, n):16return t - np.sqrt(1.0 / (2 * n) * (4 * t * (1 + t) + np.log(80) + 100 * np.log(n)))17 18 def variant_VC_bound(n):19return np.sqrt(16.0 / n * (np.log(2 / np.sqrt(0.05)) + 50 * np.log(n)))20 21 print original_vc_bound(10000)22 print rademacher_penalty_bound(10000)23 print variant_VC_bound(10000)24 25 plt.subplot(211)26 plt.plot(t,parrondo_vandenBroek(t, 10000))27 plt.subplot(212)28 plt.plot(t, devroye(t, 10000))29 plt.show()30 31 print parrondo_vandenBroek(0.331308785962, 10000)32 print devroye(0.331308785962, 10000)33 34 print parrondo_vandenBroek(0.22, 10000)35 print devroye(0.22, 10000)

parrondo_vandenBroek在图上可以看到大概当x = 0.22时，y < 0

devroye在图书可以看到大概当x = 0.22时，y > 0，x的bound肯定小于0.22

综上Devroye bound smallest

1 import matplotlib.pyplot as plt 2 import numpy as np 3 4 t = np.arange(-10,10,0.01) 5 6 def original_vc_bound(n): 7return np.sqrt(((8.0 / n) * (np.log(80) + 50 * np.log(2 * n)))) 8 9 def rademacher_penalty_bound(n):10return 1.0 / n + np.sqrt((2.0 / n) * np.log(20)) + np.sqrt((2.0 * (np.log(2 * n) + 50 * np.log(n))) / n)11 12 def parrondo_vandenBroek(t, n):13return t - np.sqrt(1.0 / n * (2 * t + np.log(120) + 50 * np.log(2 * n)))14 15 def devroye(t, n):16return t - np.sqrt(1.0 / (2 * n) * (4 * t * (1 + t) + np.log(80) + 100 * np.log(n)))17 18 def variant_VC_bound(n):19return np.sqrt(16.0 / n * (np.log(2 / np.sqrt(0.05)) + 50 * np.log(n)))20 21 print original_vc_bound(5)22 print rademacher_penalty_bound(5)23 print variant_VC_bound(5)24 25 plt.subplot(211)26 plt.plot(t,parrondo_vandenBroek(t, 5))27 plt.subplot(212)28 plt.plot(t, devroye(t, 5))29 plt.show()30 31 print parrondo_vandenBroek(7.04877656418, 5)32 print devroye(7.04877656418, 5)33 34 print parrondo_vandenBroek(5.5, 5)35 print devroye(5.5, 5)

很明显，最tightest的是

parrondo_vandenBroek

此题就是将N个数，拆成positive和negtive,最多只能三部分，排列组合：

C(n-1,2)*2 + C(n-1,1)*2 +2 = n2-n+2

由于已经有了mh（N)所以直接计算break point，n=3时m=8,n=4时m=14 => k=4 => dvc=3

可以把此题看成点在以a为半径的圆和以b为半径的圆之间时为1，其余情况-1

可将所有情况分为2种，N个点全不在a，b之间和其他

=> 在n个点的n+1个空间里选2个+n个点之外的任意空间 => C(n+1,2)+1

degree is D，hypothesis parameters c = (c0;c1;····;cd );

=> dVC = D + 1

把[xi > ti]看成一个d维度的决策树，那么可以选出这样2d个点使得[xi > ti]后，得到决策树的2d个叶子节点v。可以选择

不同的S，使得2d个叶子节点v，得到所有2^2d个结果。即N=2d时可以shatter，N=2d+1时，[xi > ti]后最多只能得到

2d个叶子节点，所有必定有一个重复，所以不管使用什么S，都无法得到2^(2d+1)个结果，即N=2d+1时不可shatter，综上

N=2d+1是break point => dvc =2d

感觉上由于α是随便取的，每个点都有机会取到两种情况所以2n，不知道该如何证明。。。。

这题很明显，等于是求每个函数的复杂度，√N^dvc是O(N^(dvc/2))；mh(n/2)是O((n/2)^dvc)；2dvc；

2i*mh(n-i)由于指数的存在肯定是大于mh(n)所以至少是O(N^(dvc));

由于N ≥ dvc ≥ 2 => O(N^(dvc)) > O(N^(dvc/2))，O((n/2)^dvc)，2dvc

=> 2i*mh(n-i) upper

很明显N是不对的，如果存在成长函数mh(n)=n，那么它的break point就在n=1的时候，那么如果当n=2

时，mh(2)=2，那么其中至少有一个点存在2种情况 => 该点shatter =>矛盾

此时mh(n)=1

intersection，肯定是越交集越小，依题意最小肯定是0，最大肯定也是小于H1~Hk里最小的

union，肯定是越并越大，所以最小也得H1~Hk里最大的，关于为什么不是0老师给了很好的解答

（any single hypothesis is of VC dimension 0，but the union of many hypotheses (which is a hypothesis set) can be of non-zero VC dimension)

upper bound比较麻烦，思考了很久，在论坛里看老师和小伙伴的回帖，有了些启发。我们思考这样一种情况Hk

dvc为n，总共有2n+k-1-(k-1)中情况也就是缺k-1种情况shatter，而这k-1种情况分别是single hypothesis H1~HK-1,

很明显unionH1~Hk的dvc是n+k-1 = sum(dvc)+k-1

并且：不考虑dvc max的H，每个H的情况2dvc_hi ≤ hi ≤2dvc_hi+ti < 2dvc_hi+1

因为x>1时，2x1*2x2 > 2x1+2x2 => union(hi)≤ sum(2dvc_hi+ti) ≤∏(2dvc_hi+1)

=> dvc(union(hi)) ≤dvc(∏(2dvc_hi+1)) = sum(dvc(hi)+1)

=>加上max 后dvc(∏(2dvc_hi+1)) = sum(dvc(hi))+k-1

当Θ>0时：

当s==1:Eout=((1-Θ)*s*0.2+(1-(1-Θ)*s)*0.8+1*0.2)/2=0.5+0.3*(Θ-1)*s

当s==-1:Eout=(-(1-Θ)*s*0.8+(1+(1-Θ)*s)*0.2+1*0.8)/2=0.5+0.3*(Θ-1)*s

由于Θ<0时和Θ>0时是对称的，综上

=> Eout=0.5+0.3*(|Θ|-1)*s

x = [i / 10.0 for i in range(-10, 11)]x.remove(0)y = []error = 0best = 100for i in x:if i <= 0:y.append(-1)else:y.append(1)y[0] = 1y[9] = 1y[19] = -1y[10] = -1for i in range(5000):if i < 2500:s = -1else:s = 1theta = -1 + (i % 2500) / 1250.0for j in range(20):if x[j] > theta:t = 1else:t = -1if t * s * y[j] < 0:error += 1if best > error:best = errorerror = 0print best / 20.0

1 import math 2 x = [i / 10.0 for i in range(-10, 11)] 3 x.remove(0) 4 y = [] 5 error = 0 6 best = 100 7 bestTheta = 0 8 bestS = 0 9 10 for i in x:11if i <= 0:12 y.append(-1)13else:14 y.append(1)15 y[0] = 116 y[9] = 117 y[19] = -118 y[10] = -119 20 for i in range(5000):21if i < 2500:22 s = -123else:24 s = 125theta = -1 + (i % 2500) / 1250.026 27for j in range(20):28 if x[j] > theta:29 t = 130 else:31 t = -132 if t * s * y[j] < 0:33 error += 134if best > error:35 best = error36 bestTheta = theta37 bestS = s38error = 039 40 print 0.5 + 0.3 * (math.fabs(bestTheta) - 1) * bestS

1 import math 2 x1 = [] 3 x2 = [] 4 x3 = [] 5 x4 = [] 6 y = [] 7 file = open("C:/Users/samsung/Desktop/ntumlone_hw1_hw1_18_train.dat") 8 for line in file: 9lines = line.split(' ')10x1.append(float(lines[0]))11x2.append(float(lines[1]))12x3.append(float(lines[2]))13x4.append(float(lines[3].split('\t')[0]))14y.append(int(lines[3].split('\t')[1].split('\n')[0]))15 file.close()16 x = [x1, x2, x3, x4]17 error = 018 best = 1000019 bestTheta = 020 bestS = 021 bestT = 022 for t in range(4):23for i in range(100000):24 if i < 50000:25 s = -126 else:27 s = 128 theta = 0 + (i % 50000) / 50000.029 30 for j in range(500):31 if x[t][j] > theta:32 a = 133 else:34 a = -135 if a * s * y[j] < 0:36 error += 137 if best > error:38 best = error39 bestTheta = theta40 bestS = s41 bestT = t42 error = 043 print bestT44 print bestTheta45 print bestS46 print best / 500.047 48 file = open("C:/Users/samsung/Desktop/ntumlone_hw1_hw1_18_test.dat")49 xtest = []50 ytest = []51 for line in file:52lines = line.split(' ')53xtest.append(float(lines[3].split('\t')[0]))54ytest.append(int(lines[3].split('\t')[1].split('\n')[0]))55 56 file.close()57 for j in range(500):58if xtest[j] > bestTheta:59 a = 160else:61 a = -162if a * bestS * ytest[j] < 0:63 error += 164 print error / 500.0