@qqiseeu · 2015-10-29

Numpy with Cython: A Case Study in Optimizing an SGD Implementation


Overview

The example in this post comes from a programming assignment. The assignment asks for an implementation of an ensemble learning algorithm whose main component is training a number of perceptrons. A perceptron here is just a simple linear classifier $f(x) = \mathrm{sign}(w^\top x + b)$. Suppose the training set is

$$D=\{(x_1,y_1),\ldots,(x_N,y_N)\}, \quad x_i\in\mathbb{R}^d,\ y_i\in\{0,1\}$$

Let $\tilde{y}_i = 2y_i - 1$; then the basic update rule for the parameters during training is

$$w^{(t+1)} = w^{(t)} + \eta^{(t)}\,\tilde{y}_i x_i \quad \text{if } \tilde{y}_i\left(x_i^\top w^{(t)} + b^{(t)}\right) < 0$$

This is simply SGD on the hinge loss, where $t$ is the iteration count, $\eta^{(t)}$ is the step size at round $t$, and $x_i$ is a randomly sampled instance. A concrete implementation may tweak this rule in minor ways, but its basic form stays the same.
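As a concrete illustration (a minimal sketch of my own; the numbers are made up), here is one such update on a single misclassified sample in plain NumPy:

```python
import numpy as np

# one training sample with label y in {0, 1}
x = np.array([0.5, -1.2, 3.0])
y = 0
y_signed = 2 * y - 1              # map {0, 1} -> {-1, +1}, i.e. y~ in the rule above

w = np.array([0.2, 0.1, 0.1])     # current weights w(t)
b = 0.0                           # current intercept b(t)
eta = 0.1                         # step size eta(t)

# update only when the sample is misclassified (the margin condition of the rule)
if y_signed * (x.dot(w) + b) < 0:
    w = w + eta * y_signed * x
    b = b + eta * y_signed

print(w, b)                       # [ 0.15  0.22 -0.2 ] -0.1
```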

For scientific computing tasks (especially those involving matrix operations), Python already has excellent numerical libraries: Numpy and Scipy. Numpy implements an efficient (dense) matrix type, numpy.ndarray, while Scipy provides several sparse matrix types, and the two libraries interoperate well. Anyone who has written Matlab knows the golden rule: "never use a loop where a matrix operation will do". The same principle applies to scientific computing in Python, because Numpy's matrix operations are wrappers over underlying C routines that run entirely outside the Python interpreter, and are therefore many times faster than a for loop. (That said, the Python interpreter itself is already quite fast; if Pypy's Numpy support matures, plain for loops may sometimes become acceptable.) But there will always be algorithms that are awkward to express with matrix operations, and the SGD algorithm we implement here is one of them.
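For example (a toy comparison I added, not from the original assignment; absolute timings are machine-dependent), computing all margins at once is dramatically faster than looping over rows:

```python
import numpy as np
from timeit import timeit

rng = np.random.RandomState(0)
X = rng.randn(10000, 100)
w = rng.randn(100)

def margins_loop():
    # Python-level loop: one interpreter round-trip per row
    out = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        out[i] = np.dot(X[i], w)
    return out

def margins_vectorized():
    # a single call into Numpy's C/BLAS code
    return X.dot(w)

print('loop:       %.3fs' % timeit(margins_loop, number=10))
print('vectorized: %.3fs' % timeit(margins_vectorized, number=10))
```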

Why is Python so slow here? There are two main reasons:

- Python is dynamically typed: a variable's type is only resolved at run time, so even a simple addition has to dispatch on the types of its operands every time it executes.
- Everything is a heap-allocated object handled through the interpreter's generic object protocol; nothing is compiled down to native machine instructions.

The Python community offers a convenient solution to this problem: Cython. You can roughly think of Cython as a language sitting between Python and C: it adds C's static type system to Python, letting us glue Python code and C code together seamlessly. Beyond that, Cython can also wrap C/C++ shared libraries and expose Python interfaces to them. Most importantly, manipulating numpy.ndarray from Cython is very convenient: through numpy.ndarray.data you can directly access the underlying C array it wraps.

Step 0: The Original Python Program

Below is the first version of the program. Note that this is pure Python:

```python
# sgd.py
from __future__ import division
import numpy as np
import numpy.random as random


def sgd(X, Y, ep):
    """ Train a simple perceptron using SGD

    Parameters
    ----------
    X: array of shape=(n_samples, n_features) with dtype=np.float64
        Each row contains the feature vector for a training sample instance.
    Y: array of shape=(n_samples, ) with dtype=np.int64 ranging in {0, 1}
        Corresponding label of training instances.
    ep: double
        Desired accuracy, which controls max iteration number.

    Returns
    -------
    w: array of shape=(n_features, ) with dtype=np.float64
        Weights vector of the linear classifier
    b: double
        Intercept term of the linear classifier
    """
    n_samples, n_features = X.shape
    w = random.randn(n_features)
    w_previous = w.copy() + 1
    b = 0.0
    t = 0
    T = 1 / ep ** 2
    idx = np.arange(n_samples)
    while t < T:
        # check stop criterion
        if np.linalg.norm(w_previous - w) <= 0.0001:
            break
        else:
            w_previous[:] = w
        random.shuffle(idx)
        # update weights vector and intercept
        for j in idx:
            lx = 2 * Y[j] - 1
            if (w.dot(X[j].T) + b) * lx < 0:
                step_size = 0.1 / (0.1 + t * ep ** 2)
                w += step_size * X[j] * lx
                b += step_size * lx
            t += 1
    return w, b
```

A few points worth noting here:

- The hot spot is the inner `for j in idx` loop: since `w` changes after every misclassified sample, the loop cannot simply be replaced by a single matrix operation.
- `T = 1 / ep ** 2` bounds the total number of updates, while the `np.linalg.norm(w_previous - w)` check stops early once the weights stabilize.
- The step size decays as `0.1 / (0.1 + t * ep ** 2)`, matching the schedule $\eta^{(t)}$ in the update rule above.
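To get a baseline timing, the function can be exercised on synthetic data along these lines (my own driver script, not part of the assignment):

```python
# bench_sgd.py -- my own driver script, not part of the assignment
import time
import numpy as np
from sgd import sgd

rng = np.random.RandomState(42)
X = rng.randn(5000, 200)
Y = (X.dot(rng.randn(200)) > 0).astype(np.int64)   # linearly separable labels

start = time.time()
w, b = sgd(X, Y, ep=0.05)
print('trained in %.2fs' % (time.time() - start))
print('training accuracy: %.3f' % np.mean((X.dot(w) + b > 0) == Y))
```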

Step 1: Explicitly Declare the Types of All Variables

First, to distinguish the Cython file from a Python file, we rename it to sgd.pyx. At this point we could already compile it with Cython, since Cython's syntax is compatible with Python's:

```
$ cython sgd.pyx
```
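The `cython` command only translates the .pyx file into C; to build an importable extension module, the usual route is a small build script along these lines (a standard sketch, not from the original post; `numpy.get_include()` is needed because the code below cimports numpy):

```python
# setup.py -- standard build-script sketch (not from the original post)
from distutils.core import setup
import numpy
from Cython.Build import cythonize

setup(
    ext_modules=cythonize('sgd.pyx'),
    include_dirs=[numpy.get_include()],   # headers for `cimport numpy`
)
```

Running `python setup.py build_ext --inplace` then produces an importable `sgd` module.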

But compiling alone barely improves performance; we need to add a few more things to help cython generate more efficient C code.

The first thing is to explicitly declare the type of every variable. As noted at the beginning, Python is dynamically typed: each variable's type is determined at run time. If a variable's type is not declared, Cython defaults it to PyObject, the type Cython uses for generic Python objects, and manipulates it exactly the way the Python interpreter would, which is much slower.

```cython
# sgd.pyx
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
import numpy.random as random

DTYPE = np.double
ctypedef np.double_t DTYPE_t
TTYPE = np.int64
ctypedef np.int64_t TTYPE_t


@cython.cdivision(True)
@cython.boundscheck(False)
def sgd(np.ndarray[DTYPE_t, ndim=2] X,
        np.ndarray[TTYPE_t, ndim=1] Y,
        double ep):
    """ Train a simple perceptron using SGD

    Parameters
    ----------
    X: array of shape=(n_samples, n_features) with dtype=np.float64
        Each row contains the feature vector for a training sample instance.
    Y: array of shape=(n_samples, ) with dtype=np.int64 ranging in {0, 1}
        Corresponding label of training instances.
    ep: double
        Desired accuracy, which controls max iteration number.

    Returns
    -------
    w: array of shape=(n_features, ) with dtype=np.float64
        Weights vector of the linear classifier
    b: double
        Intercept term of the linear classifier
    """
    cdef int n_samples = X.shape[0]
    cdef int n_features = X.shape[1]
    cdef np.ndarray[DTYPE_t, ndim=1] w = random.randn(n_features)
    cdef np.ndarray[DTYPE_t, ndim=1] w_previous = w.copy() + 1
    cdef double b = 0.0
    cdef unsigned T = <unsigned>(1 / ep ** 2)
    cdef unsigned t = 0
    cdef long lx
    cdef unsigned j
    cdef double step_size
    cdef np.ndarray[np.uint32_t, ndim=1] idx = np.arange(n_samples, dtype=np.uint32)
    while t < T:
        # check stop criterion
        if np.linalg.norm(w_previous - w) <= 0.0001:
            break
        else:
            w_previous[:] = w
        random.shuffle(idx)
        # update weights vector and intercept
        for j in idx:
            lx = 2 * Y[j] - 1
            if (w.dot(X[j].T) + b) * lx < 0:
                step_size = 0.1 / (0.1 + t * ep ** 2)
                w += step_size * X[j] * lx
                b += step_size * lx
            t += 1
    return w, b
```
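With the typed version compiled (e.g. via the setup.py above), a rough side-by-side timing can be done as follows. This is my own harness, and it assumes the Step 0 file was renamed to sgd_py.py so both modules can coexist; the actual speedup depends heavily on the data shape:

```python
# compare.py -- hypothetical timing harness; assumes the Step 0 file was
# renamed to sgd_py.py so both versions can be imported side by side
import numpy as np
from timeit import timeit

import sgd_py   # pure-Python version (Step 0)
import sgd      # compiled Cython version (Step 1)

rng = np.random.RandomState(0)
X = rng.randn(2000, 100)
Y = (X.dot(rng.randn(100)) > 0).astype(np.int64)

print('python: %.2fs' % timeit(lambda: sgd_py.sgd(X, Y, 0.05), number=3))
print('cython: %.2fs' % timeit(lambda: sgd.sgd(X, Y, 0.05), number=3))
```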

Step 2: Add Support for Sparse Matrices

The previous program assumes the input data lives in a numpy.ndarray, a dense matrix type, which becomes impractical when the data is sparse. scipy.sparse provides sparse matrices in several storage formats; under the hood they all keep the nonzero values in one ndarray and the corresponding indices in a few more ndarrays, differing only in how the index is constructed [4]. Crucially, all of this low-level storage can be manipulated directly!
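For instance, a CSR matrix exposes its three underlying arrays directly (this is standard scipy.sparse behavior, matching the compressed row storage scheme described in [4]):

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 3.0],
                         [4.0, 5.0, 0.0]]))

print(X.data)      # [ 1.  2.  3.  4.  5.]  -- the nonzero values
print(X.indices)   # [0 2 2 0 1]            -- column index of each value
print(X.indptr)    # [0 2 3 5]              -- row i occupies data[indptr[i]:indptr[i+1]]
```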

Following this idea, I implemented two containers, SequentialDataset and WeightVector, modeled on sklearn.utils.seq_dataset and sklearn.utils.weight_vector but modified for my own needs. SequentialDataset wraps the training data (X and Y) and supports sequential access and shuffling; WeightVector wraps the weight vector and supports vector addition, inner products, copying, and so on.

For reasons of space, the implementation of these two containers is not listed here, only briefly described; interested readers can consult the sklearn source [5], which is where the idea comes from. The revised sgd code is listed after the interface sketch below.
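To make the revised code easier to follow, here is a conceptual pure-Python mock of the contract these containers expose. This is my own sketch, not the real implementation: the real versions are cdef classes that operate on raw C pointers and release the GIL.

```python
# Conceptual mock of the two containers -- NOT the real cdef implementation.
import numpy as np

class SequentialDatasetMock:
    """Sequential access to CSR training data (X, Y)."""
    def __init__(self, X_csr, Y):
        self.X, self.Y = X_csr, Y
        self.n_samples = X_csr.shape[0]
        self._order = np.arange(self.n_samples)
        self._pos = 0

    def shuffle(self, seed):
        np.random.RandomState(seed).shuffle(self._order)

    def next(self):
        # The real container writes pointers into the caller's variables;
        # here we just return the row's nonzeros, their indices, and the label.
        i = self._order[self._pos % self.n_samples]
        self._pos += 1
        lo, hi = self.X.indptr[i], self.X.indptr[i + 1]
        return self.X.data[lo:hi], self.X.indices[lo:hi], hi - lo, self.Y[i]

class WeightVectorMock:
    """Dense weight vector supporting sparse dot and axpy updates."""
    def __init__(self, w):
        self.w = w

    def dot(self, x_data, x_ind, xnnz):
        # inner product <w, x> using only x's nonzero entries
        return np.dot(self.w[x_ind[:xnnz]], x_data[:xnnz])

    def add(self, x_data, x_ind, xnnz, c):
        # w += c * x, again touching only x's nonzero entries
        self.w[x_ind[:xnnz]] += c * x_data[:xnnz]

    def norm(self):
        return np.sqrt(np.dot(self.w, self.w))
```

The real sgd code below uses exactly these operations, but through C pointers and without the GIL: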

```cython
# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
from __future__ import division
cimport cython
import numpy as np
cimport numpy as np
from .container cimport SequentialDataset
from .container cimport WeightVector
from .container cimport DTYPE_t, YTYPE_t


def sgd(SequentialDataset dataset,
        np.ndarray[DTYPE_t, ndim=1, mode='c'] weights,
        double intercept,
        double ep,
        bint shuffle, np.uint32_t seed):
    """Train a perceptron-like linear classifier using Stochastic Gradient
    Descent.

    Parameters
    ----------
    dataset: SequentialDataset,
        Training samples.
    weights: array of shape = [n_features] with dtype = np.DTYPE.
        Allocated weight vector.
    intercept: DTYPE,
        Initial intercept value.
    ep: double.
        Error tolerance, controls the iteration numbers. Training time will
        be roughly O(1/min_node_err^2).

    Returns
    -------
    weights : array, shape=[n_features]
        The fitted weight vector.
    intercept : double
        The fitted intercept term.
    """
    # acquire data information
    cdef Py_ssize_t n_samples = dataset.n_samples
    cdef Py_ssize_t n_features = weights.shape[0]
    cdef unsigned int n_iter = max(np.int(1 / ep ** 2), 1)

    # some helper variables:
    # for iteration
    cdef unsigned int n_outer = max(n_iter, n_samples)
    cdef unsigned int n_inner = min(n_iter, n_samples)
    cdef unsigned int t = 0
    cdef np.int64_t j = 0

    # for weight vector
    cdef np.ndarray[DTYPE_t, ndim=1] w_previous = weights.copy() + 1
    cdef WeightVector w = WeightVector(weights)
    cdef WeightVector wp = WeightVector(w_previous)
    cdef DTYPE_t* w_data_ptr = NULL
    cdef np.ndarray[int, ndim=1] w_indices = np.arange(n_features).astype(np.int32)
    cdef int* w_ind_ptr = <int *>w_indices.data
    cdef int wnnz = <int>n_features

    # for sample instance
    cdef DTYPE_t *x_data_ptr = NULL
    cdef int *x_ind_ptr = NULL
    cdef int xnnz
    cdef YTYPE_t y

    # for update
    cdef double step_size = 0
    cdef YTYPE_t lx = 0
    cdef double p = 0.0

    with nogil:
        while t < n_outer:
            if shuffle:
                dataset.shuffle(seed)
            for j in range(n_inner):
                # get next sample
                dataset.next(&x_data_ptr,
                             &x_ind_ptr,
                             &xnnz,
                             &y)
                lx = 2 * y - 1
                # update weights with hinge loss
                p = (w.dot(x_data_ptr, x_ind_ptr, xnnz) + intercept) * lx
                if p < 0:
                    step_size = lx * 0.1 / (0.1 + t * ep ** 2)
                    w.add(x_data_ptr, x_ind_ptr, xnnz, step_size)
                    intercept += step_size
                t += 1
            # check for early stopping
            w_data_ptr = w.w_data_ptr
            wp.add(w_data_ptr, w_ind_ptr, wnnz, -1)
            if t < n_iter and wp.norm() <= 0.0001:
                break
            else:
                wp.assign(w)
    return weights, intercept
```
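Wiring it together would then look roughly like this. This is a hypothetical driver: the constructor of the author's SequentialDataset is not shown in this post, so the CSR-style call below is an assumption modeled on sklearn's CSRDataset.

```python
# Hypothetical usage sketch. The SequentialDataset constructor signature
# is assumed (modeled on sklearn's CSRDataset), not shown in this post.
import numpy as np
import scipy.sparse as sp
from container import SequentialDataset
from sgd import sgd

X = sp.random(10000, 500, density=0.01, format='csr', dtype=np.float64)
Y = (np.random.rand(10000) > 0.5).astype(np.int64)

dataset = SequentialDataset(X.data, X.indptr, X.indices, Y)   # assumed signature

weights = np.zeros(X.shape[1], dtype=np.float64)
w, b = sgd(dataset, weights, intercept=0.0, ep=0.05, shuffle=True, seed=42)
```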

References

[1] Cython Users Guide,
http://docs.cython.org/src/userguide/index.html

[2] Cython Tutorial: Working with Numpy,
http://docs.cython.org/src/tutorial/numpy.html

[3] Cython Documentation: Using Parallelism,
http://docs.cython.org/src/userguide/parallelism.html

[4] Compressed Row Storage,
https://en.wikipedia.org/wiki/Sparse_matrix#Yale

[5] seq_dataset.pyx and weight_vector.pyx,
https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/utils
