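UniDT (华院) Paper | A Fast Distributionally Robust Optimization Method for Data-Driven Deep Learning

Author: Shen Xuli (沈旭立)

Venue: ICDM (IEEE International Conference on Data Mining), a top data mining conference ranked CCF-B; this year's acceptance rate was 9.77%.

Acknowledgments: This paper was supported by the UniDT cognitive computing and few-shot learning project and by NSFC project 62176061. We also thank Prof. Gao Shenghua (高盛华) for his guidance and suggestions.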

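Introduction

Open-world datasets, such as long-tailed natural image classification, medical image classification, and face attribute or emotion classification, face the challenge that the training distribution does not match the test distribution. For example, a natural image training set may contain 100 Shiba Inu pictures and 1 picture of a Chinese domestic cat, while the test set contains cats and dogs in equal proportion, mixed with noisy pictures of Labrador Retrievers. Similarly, a medical training set for a rare disease (cerebral arteriovenous malformation) may contain 100 positive samples and 1 negative sample, while the test set contains negative samples with previously unseen morphology. Open-world datasets therefore combine the characteristics of extremely imbalanced learning and open-set learning (classifying unknown classes). Training on such data with conventional methods biases the model toward the heavily weighted classes and makes unseen classes hard to distinguish.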

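Distributionally robust optimization (DRO) allows a model to be trained stably when the test distribution differs from the training distribution. Omitting the regularization term, DRO can be written as: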
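For reference, a common penalized form of DRO that is consistent with the symbols defined in the following paragraph can be written as below. This is a sketch under assumptions: the sign and scaling of the h term and the simplex symbol \Delta_n are not taken from the paper.

```latex
\min_{\theta} \; \max_{p \in \Delta_n} \;
  \sum_{i=1}^{n} p_i \, \zeta(\theta; x_i, y_i) \;-\; h\!\left(p, \tfrac{1}{n}\right),
\qquad
\Delta_n = \Bigl\{ p = (p_1, \dots, p_n) \;\Big|\; \sum_{i=1}^{n} p_i = 1,\; p_i > 0 \Bigr\}
```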

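Here ζ denotes the loss function on the observed data {(x_1, y_1), …, (x_n, y_n)}, θ ∈ R^d is the parameter of the model g, p is a probability vector p = (p_1, p_2, …, p_n) with Σ_{i=1}^n p_i = 1 and p_i > 0, and h(p, 1/n) measures the discrepancy between p and the uniform probability 1/n. To solve this optimization problem, prior work has proposed many stochastic dual methods, which perform well on small-scale data.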

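Data-driven deep learning now dominates computer vision and natural language processing, and traditional DRO methods cannot handle data of this scale and dimensionality. Specifically: (1) solving for p overlooks the O(n) space complexity of numerically optimizing over the n data points of a large dataset, and (2) there is the complexity that comes from d, the dimension of the parameter space, which is especially high for deep neural networks. As a result, measuring the distributional discrepancy of the uncertainty set with a distance between distributions (such as the Wasserstein distance) is constrained by computational cost.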

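We address these issues with the approach from "A Class of Robust Learning Algorithms for the Non-i.i.d. Setting" (一类非 i.i.d. 情形下的鲁棒学习算法). Instead of defining the uncertainty set over n data points, we focus on an uncertainty set partitioned into j subgroups. With subgroups defined by labels, we propose an effective and practical learning method for DRO, replacing the empirical risk with the maximum expected error over the subgroups. However, a solution to this problem is hard to find with gradient descent, and the sampling strategy of ordinary DRO does not carry over directly. To address this, the data of each subgroup are sampled independently from the common distribution within that subgroup. The computational framework is shown in Figure 1.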
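The difference between the empirical risk and the maximum subgroup risk can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the toy loss values, and the "dog"/"cat" labels are all hypothetical.

```python
from collections import defaultdict

def subgroup_risks(losses, labels):
    """Mean loss of each label-defined subgroup."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, label in zip(losses, labels):
        sums[label] += loss
        counts[label] += 1
    return {g: sums[g] / counts[g] for g in sums}

def empirical_risk(losses):
    """Plain average over all samples, dominated by the largest subgroup."""
    return sum(losses) / len(losses)

def worst_subgroup_risk(losses, labels):
    """Maximum of the subgroup mean losses, used in place of the empirical risk."""
    return max(subgroup_risks(losses, labels).values())

# Hypothetical per-sample losses: 100 well-fit majority samples, 1 poorly-fit minority sample.
losses = [0.1] * 100 + [2.0]
labels = ["dog"] * 100 + ["cat"]

erm = empirical_risk(losses)
worst = worst_subgroup_risk(losses, labels)
```

With these toy values the empirical risk stays close to the majority loss of 0.1, while the worst subgroup risk equals the minority loss of 2.0, so only the second objective exposes the poorly fit subgroup.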

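Table I gives a theoretical comparison between our method and representative concurrent DRO work. The contributions of this work are: (1) We validate a way to alleviate the training inefficiency of DRO: for deep learning models with arbitrary subgroup objective functions, it converges faster than current DRO methods. (2) We give a subgroup sampling method and a learning-rate strategy for training deep neural networks. The proposed method achieves better performance and robustness on open-set recognition and imbalanced-data learning.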

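Preliminaries

A. Limitations of Empirical Risk Minimization

In the previous section we noted that a shift between training and test data (that is, an uncertainty set) makes model training difficult. To illustrate the limitation of empirical risk minimization in this setting, we provide a regression example (shown in Figure 2). We construct a dataset with four subgroups: three are sampled from three different spherical Gaussian distributions and the remaining one from a multivariate Gaussian distribution. We run three linear regression experiments, in each of which the model is trained on a constructed dataset that mixes the differently distributed data points with different mixing weights. For example, Figure 2(b.II) shows a data distribution with equal mixing weights, analogous to a common balanced dataset. Figure 2(b.III-IV) breaks this balance by changing the mixing weights.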

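Whatever the mixing weights of the test distribution, the overall data distribution lies in the gray region of Figure 2(a), and we want the trained linear regression model to represent this region. To show the limitation of empirical risk minimization, we train two linear regression models: one with empirical risk minimization and one with the method of this work, weighted distributionally robust optimization (W-DRO). As the mixing weights of the three subgroups are gradually changed, shown as (1)-(3) in Figure 2(a), the regression line trained with empirical risk minimization drifts away from the gray region. We therefore argue that when the training and test distributions differ, the model cannot be optimized directly with empirical risk; a robust algorithm that adapts to any assumption on the data distribution is needed.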
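The drift of empirical risk minimization under changing mixing weights can be reproduced on a much smaller toy problem. The sketch below is an assumption-laden illustration rather than the paper's experiment or the W-DRO algorithm: it uses a one-parameter regression y = a * x, three noise-free subgroups with hypothetical slopes, and a simple ternary search to minimize the maximum subgroup error.

```python
# Hypothetical subgroup slopes; each subgroup follows y = s_g * x without noise.
SLOPES = [1.0, 2.0, 3.0]

def make_group(slope, n):
    """n points whose inputs alternate over {1, 2}, so all groups share the same inputs."""
    return [(x, slope * x) for x in [1.0, 2.0] * (n // 2)]

def mse(a, points):
    return sum((a * x - y) ** 2 for x, y in points) / len(points)

def erm_slope(groups):
    """Closed-form least squares on the pooled data (empirical risk minimization)."""
    pooled = [pt for g in groups for pt in g]
    return sum(x * y for x, y in pooled) / sum(x * x for x, _ in pooled)

def worst_group_slope(groups, lo=0.0, hi=4.0, iters=200):
    """Minimize the maximum subgroup MSE; the max of convex functions is convex,
    so ternary search on the interval [lo, hi] is sufficient here."""
    def objective(a):
        return max(mse(a, g) for g in groups)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

balanced = [make_group(s, 10) for s in SLOPES]
imbalanced = [make_group(s, n) for s, n in zip(SLOPES, [100, 2, 2])]

erm_balanced = erm_slope(balanced)
erm_imbalanced = erm_slope(imbalanced)
robust_balanced = worst_group_slope(balanced)
robust_imbalanced = worst_group_slope(imbalanced)
```

In this toy setting the pooled least-squares slope moves from 2.0 on the balanced data to about 1.06 on the imbalanced data, while the worst-group slope stays near 2.0 in both cases because the maximum subgroup error does not depend on the subgroup sizes.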

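Notably, (X_4, Y_4) represents extreme data, such as industrial defect detection or black swan events in the stock market. Both kinds of events have the properties of imbalanced learning and open-set learning. We use distributionally robust optimization to address the difficulty of training on such extreme data. Distributionally robust optimization therefore adapts well across data patterns and can be applied directly to distribution-agnostic problems.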

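Method

This section extends "A Class of Robust Learning Algorithms for the Non-i.i.d. Setting" to distributionally robust optimization. It details the grouping scheme and discusses the convergence rate of the method (W-DRO). W-DRO can effectively train models with many parameters, such as deep neural networks. See the paper for the full method.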

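Experiments

We validate the effectiveness of W-DRO on large-scale data under uncertainty-set settings by comparing accuracy, convergence, and robustness. Gaussian noise is used as the perturbation that constructs the uncertainty set, and it is added at random to 95% of the training data and 5% of the test data. We examine two cases in which the training and test distributions do not match: an uncertainty set built through open-set learning, and an uncertainty set built from extremely imbalanced data. See the paper for details.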
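The noise-injection step described above might look like the following sketch. Only the 95% and 5% fractions come from the text; the function name, the noise standard deviation sigma, the seeds, and the all-zero toy features are assumptions made for illustration.

```python
import random

def add_gaussian_noise(samples, fraction, sigma, seed=0):
    """Add zero-mean Gaussian noise to a randomly chosen fraction of the samples.

    Returns the perturbed copy and the set of indices that were perturbed.
    """
    rng = random.Random(seed)
    n_noisy = round(fraction * len(samples))
    noisy_idx = set(rng.sample(range(len(samples)), n_noisy))
    perturbed = [
        [v + rng.gauss(0.0, sigma) for v in x] if i in noisy_idx else list(x)
        for i, x in enumerate(samples)
    ]
    return perturbed, noisy_idx

# Hypothetical toy features: 100 training and 100 test vectors of zeros.
train = [[0.0, 0.0, 0.0] for _ in range(100)]
test = [[0.0, 0.0, 0.0] for _ in range(100)]

# sigma=0.1 is an assumed value; the noise scale is not stated in this summary.
noisy_train, train_idx = add_gaussian_noise(train, fraction=0.95, sigma=0.1, seed=0)
noisy_test, test_idx = add_gaussian_noise(test, fraction=0.05, sigma=0.1, seed=1)
```

Perturbing most of the training set but only a small part of the test set is one simple way to make the two distributions differ, which is the mismatch the uncertainty set is meant to capture.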

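Conclusion

In this work we solve the group DRO problem and update the neural network parameters numerically along a descent direction that weights the subgroup gradients. This reweighted descent direction, formed as a convex combination of gradients, gives DRO scalability and training efficiency. We prove that the proposed W-DRO algorithm converges linearly. Experimental results show that the proposed method also achieves better performance and robustness on uncertainty sets with open-set recognition and imbalanced learning settings.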
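A descent direction built as a convex combination of subgroup gradients can be sketched as follows. The weighting rule is not given in this summary, so the softmax over subgroup losses used here is an assumption chosen only because it yields positive weights that sum to one; the function names and toy numbers are likewise hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    """Positive weights that sum to one; larger values receive larger weights."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_direction(group_grads, group_losses, temperature=1.0):
    """Convex combination of subgroup gradients.

    ASSUMPTION: weights come from a softmax over subgroup losses; the paper's
    actual weighting rule may differ.
    """
    weights = softmax(group_losses, temperature)
    dim = len(group_grads[0])
    direction = [
        sum(w * g[k] for w, g in zip(weights, group_grads)) for k in range(dim)
    ]
    return direction, weights

def sgd_step(theta, direction, lr):
    """One gradient-descent update along the reweighted direction."""
    return [t - lr * d for t, d in zip(theta, direction)]

# Hypothetical two-parameter model with two subgroups.
theta = [0.0, 0.0]
group_grads = [[1.0, 0.0], [0.0, 1.0]]
group_losses = [0.1, 2.0]

direction, weights = reweighted_direction(group_grads, group_losses)
theta_next = sgd_step(theta, direction, lr=0.5)
```

Because the weights form a convex combination, the update costs one weighted sum over the subgroup gradients per step, and the subgroup with the larger loss pulls the parameters more strongly.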
