Comparison of existing implementations

| functionality | gpu nd array (python interface) | Theano CudaNdarray | GPUmat GPU(single/double) |
|---|---|---|---|
| backend | cuda/opencl | cuda | cuda |
| dtype | float32, {u}int{8,16,32,64}, complex64 (float64 and complex128 possible) | float32 | float32, complex32, float64, complex64 |
| ndim | generic | generic | generic |
| memory layout | generic | generic | generic |
| contiguous transfer to/from gpu | Yes | Yes | Yes |
| non-contiguous transfer to/from gpu | copy if needed | copy if needed | copy if needed |
| ascontiguousarray | Yes | No | No |
| asfortranarray | Yes | No | No |
| copy | Yes | Yes | Yes: clone() |
| zeros | Yes | Yes | Yes |
| empty | Yes | No | Yes: GPUsingle(); setSize(); GPUallocVector() |
| len | Yes | Yes | Yes: length() |
| subtensor(var[…]) | Yes | Yes | Yes |
| subtensor(var[N]) | Yes | Yes | Yes |
| subtensor(var[strides with step]) | Yes | Yes | Yes |
| subtensor(var[strides with neg start/stop/step]) | Yes | Yes | Yes |
| subtensor(var[tuple with mix of slice, integer and numpy.int64]) | Yes | Yes | No |
| elemwise | generic with 1 output, with dimensions collapsing, mixed dtype | as gpu nd array | as gpu nd array |
| elemwise with broadcasting | Yes | Yes | Yes |
| reduction | sum/prod generic for ndim and any combination of reduced axis | sum/prod/min/max only with this pattern: 1, 11, 10, 01, 001, 010, 100, 110, 011, 111, 0011, 0101, 0111, 1011, 1111, pattern 1+ use only 1 block | sum |
| __setitem__ | Yes (with broadcast if necessary) | Theano Op for slice/int/and list of int. | Yes: subsasgn(), assign() |
| reshape | Yes | Yes (copy if not c_contiguous) | Yes: setSize(), reshape() |
| n-dim transpose | Yes (copy when numpy would copy) | Yes (can add dim with shape 1 at the same time) | No |
| dot/gemm | Yes* | Theano op | Yes: times(), GPUtimes() |
| gemv | Yes* | Theano op | ? |

* dot/gemm and gemv need an external BLAS, which is included with CUDA. For the OpenCL back-end you can use clmath, but clmath support isn't good on Mac and Windows.
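The reduction patterns listed for Theano CudaNdarray (1, 11, 10, 01, ...) can be read as one digit per input axis. The sketch below assumes that a `1` marks an axis that is reduced and a `0` marks an axis that is kept, which is how Theano's reduce mask is usually described. The helper names `pattern_to_axes` and `reduced_shape` are hypothetical and exist only for illustration.

```python
def pattern_to_axes(pattern):
    """Return the indices of the axes that the pattern reduces.

    Assumption: '1' means "reduce this axis", '0' means "keep it".
    """
    return tuple(i for i, flag in enumerate(pattern) if flag == "1")


def reduced_shape(shape, pattern):
    """Return the output shape after reducing `shape` with `pattern`."""
    if len(shape) != len(pattern):
        raise ValueError("pattern needs one digit per axis")
    return tuple(dim for dim, flag in zip(shape, pattern) if flag == "0")


# Pattern "011" on a 3-d input reduces the last two axes.
axes = pattern_to_axes("011")
out_shape = reduced_shape((2, 3, 4), "011")
```

Under this reading, a 4-d pattern such as 0110 is not in the supported list, whereas the generic gpu nd array reduction is meant to accept any combination of axes.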

Not done yet, but planned, in gpu nd array:

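| functionality | gpu nd array (python interface) | Theano CudaNdarray | GPUmat GPU(single/double) |
|---|---|---|---|
| ones | No | Theano op only | Yes |
| subtensor with a list of index var[1,2,3,4] (part of numpy advanced indexing) | No | Yes | Yes: slice(A, {[1,2,3,4]}) |
| reduction (max, min, argmax) | No | No | No |
| ger | No | Theano op | ? |
| flatten | No (you can use reshape for this) | Yes | ? |
| random | No | mrg, curand | Yes: GPUrand(), GPUrandn() |
| join | No | Theano op | ? |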
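As a reference for two of the rows above, the following sketch shows the NumPy semantics being compared against: flatten expressed through reshape, and indexing with a list of indices. It uses plain NumPy on the CPU only. It does not show the API of gpu nd array, CudaNdarray or GPUmat, and the variable names are illustrative.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)

# flatten expressed through reshape, as the flatten row suggests.
flat = x.reshape(-1)

# Reshaping a C-contiguous array gives a view on the same memory ...
flat_is_view = np.shares_memory(flat, x)

# ... while a transposed (non-contiguous) array has to be copied.
flat_t = x.T.reshape(-1)
flat_t_is_view = np.shares_memory(flat_t, x)

# Indexing with a list of indices (NumPy advanced indexing) gathers the
# selected elements into a new array.
v = np.arange(10, dtype=np.float32)
picked = v[[1, 2, 3, 4]]
```

This is the same distinction as the "copy if not c_contiguous" note on reshape in the first table: whether a reshape can be a view depends on the memory layout of its input.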

Other Theano ops: CrossentropySoftmaxArgmax1HotWithBias, CrossentropySoftmax1HotWithBiasDx, Softmax, SoftmaxWithBias, DownsampleFactorMax, GpuImages2Neibs, Dot22Scalar, GpuEye, ErfinvGPU

gnumpy functions: as_garray, as_garray_or_scalar, as_numpy_array, tile (the same as numpy?), rand, randn, empty, zeros, ones, seed_rand, outer, concatenate, where, nonzero, support newaxis?, eye, diagflat, tensordot

gnumpy dot: dot(0d,*d), dot(1d,1d), dot(1d,2d), dot(2d,1d), dot(2d,2d), dot(a1.ndim >= 2, a2.ndim >= 2) with reshape and transpose (transpose done by a loop?)

gnumpy reduction: all, any, sum, mean, max, min (prod and std are cpu only)

gnumpy elemwise: abs, exp, isinf, isnan, log, log_1_plus_exp, logistic, negative, sign, sqrt, tanh (log10 is cpu only)

gnumpy.garray methods: as_numpy_array, astype, ravel (calls self.reshape(-1)), item (transfer to cpu), sort (cpu only), reshape_2d, T, transpose, shiftAxesRight, copy, diagflat, diagonal, diag, all_real, isinf, isreal, isnan, isnumber, abs, as_bool, exp, log, log_1_plus_exp, logistic, sigmoid, sign, sqrt, tanh, sum, mean, max, argmax (cpu), argmin (cpu), min, all, any, all2, any2, rand, euclid_norm, dot, where, nonzero

gnumpy.garray special methods: __lt__, __gt__, __le__, __ge__, __ne__, __eq__, __sub__, __div__, __rmul__, __radd__, __rsub__, __rdiv__, __rpow__, __pos__, __neg__, __iadd__, __imul__, __isub__, __idiv__, __imod__, __ipow__, __len__, __getitem__, __iter__, __setitem__
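The entry "dot(a1.ndim >= 2, a2.ndim >= 2) with reshape and transpose" describes reducing an n-dimensional dot to a single 2-d matrix product. A minimal sketch of that idea in plain NumPy follows, assuming numpy.dot semantics (sum over the last axis of the first argument and the second-to-last axis of the second). The function `dot_nd` is a hypothetical name and this is not gnumpy's actual code.

```python
import numpy as np


def dot_nd(a, b):
    """n-d dot built from reshape, one axis move and a single 2-d product."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.ndim < 2 or b.ndim < 2:
        raise ValueError("both inputs need at least 2 dimensions")
    k = a.shape[-1]
    if b.shape[-2] != k:
        raise ValueError("shapes are not aligned")
    # Collapse all leading axes of a into rows: (prod(a.shape[:-1]), k).
    a2 = a.reshape(-1, k)
    # Bring the contracted axis of b to the front, then collapse the rest
    # into columns: (k, prod(remaining axes)).
    b2 = np.moveaxis(b, -2, 0).reshape(k, -1)
    out2 = a2.dot(b2)
    # Restore the n-d output shape used by numpy.dot.
    return out2.reshape(a.shape[:-1] + b.shape[:-2] + b.shape[-1:])


a = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
b = np.arange(120, dtype=np.float64).reshape(5, 4, 6)
result = dot_nd(a, b)
```

The axis move on `b` produces a non-contiguous array, so the reshape that follows has to copy. On a GPU that copy is the transpose step the note above wonders about.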