python - Theano simple linear regression runs on CPU instead of GPU
I created a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the code starts it says "Using gpu device", but (according to the profiler) the operations are all CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, etc.).
I checked the environment variables and THEANO_FLAGS, and everything seems right; I cannot see the catch (especially since the Theano tutorials run correctly on the GPU with the same settings :)).
Here is the code:
# linear regression
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

ts = theano.shared(input_data, "training-set")
e = theano.shared(output_data, "expected")
w1 = theano.shared(numpy.zeros((1, 2)))

o = T.dot(ts, w1.T)
cost = T.mean(T.sqr(e - o.T))
gradient = T.grad(cost=cost, wrt=w1)
update = [[w1, w1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()
- THEANO_FLAGS=cuda.root=/usr/local/cuda
- device=gpu
- floatX=float32
- lib.cnmem=.5
- profile=True
- CUDA_LAUNCH_BLOCKING=1
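For reference, these settings are normally passed as one comma-separated THEANO_FLAGS value (or put in ~/.theanorc); CUDA_LAUNCH_BLOCKING is a separate CUDA environment variable, not a Theano flag. A sketch of the invocation, assuming the script file name from the profiler output (test2.py):

```shell
# Config fragment only (requires a Theano + CUDA installation to actually run):
THEANO_FLAGS='cuda.root=/usr/local/cuda,device=gpu,floatX=float32,lib.cnmem=.5,profile=True' \
CUDA_LAUNCH_BLOCKING=1 python test2.py
```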
Output:
Using gpu device 0: GeForce GT 650M (CNMeM is enabled)

Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
    Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
    Import time 2.320528e-03s
  Time in all call to theano.grad() 8.740902e-03s
  Time since theano import 0.881s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%   71.7%  0.013s  6.59e-06s  Py  2000  2  theano.tensor.basic.Dot
  12.3%   83.9%  0.002s  3.22e-07s  C   7000  7  theano.tensor.elemwise.Elemwise
   5.7%   89.6%  0.001s  3.50e-07s  C   3000  3  theano.tensor.elemwise.DimShuffle
   4.0%   93.6%  0.001s  3.65e-07s  C   2000  2  theano.tensor.subtensor.Subtensor
   3.6%   97.2%  0.001s  3.31e-07s  C   2000  2  theano.compile.ops.Shape_i
   1.7%   98.9%  0.000s  3.06e-07s  C   1000  1  theano.tensor.opt.MakeVector
   1.1%  100.0%  0.000s  2.10e-07s  C   1000  1  theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%   71.7%  0.013s  6.59e-06s  Py  2000  2  dot
   4.0%   75.6%  0.001s  3.65e-07s  C   2000  2  Subtensor{int64}
   3.5%   79.1%  0.001s  6.35e-07s  C   1000  1  InplaceDimShuffle{1,0}
   3.3%   82.4%  0.001s  6.06e-07s  C   1000  1  Elemwise{mul,no_inplace}
   2.4%   84.8%  0.000s  4.38e-07s  C   1000  1  Shape_i{0}
   2.3%   87.1%  0.000s  4.29e-07s  C   1000  1  Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%   89.3%  0.000s  2.08e-07s  C   2000  2  InplaceDimShuffle{x,x}
   1.8%   91.1%  0.000s  3.25e-07s  C   1000  1  Elemwise{Cast{float64}}
   1.7%   92.8%  0.000s  3.06e-07s  C   1000  1  MakeVector{dtype='int64'}
   1.5%   94.3%  0.000s  2.78e-07s  C   1000  1  Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%   95.7%  0.000s  2.53e-07s  C   1000  1  Elemwise{sub}[(0, 1)]
   1.2%   96.9%  0.000s  2.24e-07s  C   1000  1  Shape_i{1}
   1.1%   98.0%  0.000s  2.10e-07s  C   1000  1  Sum{acc_dtype=float64}
   1.1%   99.1%  0.000s  1.98e-07s  C   1000  1  Elemwise{sqr}[(0, 0)]
   0.9%  100.0%  0.000s  1.66e-07s  C   1000  1  Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%   37.8%  0.007s  6.95e-06s  1000   3  dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%   71.7%  0.006s  6.24e-06s  1000  14  dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%   75.1%  0.001s  6.35e-07s  1000   0  InplaceDimShuffle{1,0}(training-set)
   3.3%   78.4%  0.001s  6.06e-07s  1000  11  Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%   81.4%  0.001s  5.58e-07s  1000   8  Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%   83.8%  0.000s  4.38e-07s  1000   2  Shape_i{0}(expected)
   2.3%   86.2%  0.000s  4.29e-07s  1000  12  Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%   87.9%  0.000s  3.25e-07s  1000   6  Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%   89.6%  0.000s  3.06e-07s  1000   4  MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%   91.2%  0.000s  3.03e-07s  1000  10  InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%   92.7%  0.000s  2.78e-07s  1000  16  Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%   94.1%  0.000s  2.53e-07s  1000   5  Elemwise{sub}[(0, 1)](expected, dot.0)
   1.2%   95.3%  0.000s  2.24e-07s  1000   1  Shape_i{1}(expected)
   1.1%   96.5%  0.000s  2.10e-07s  1000  15  Sum{acc_dtype=float64}(Elemwise{sqr}[(0, 0)].0)
   1.1%   97.6%  0.000s  1.98e-07s  1000  13  Elemwise{sqr}[(0, 0)](Elemwise{sub}[(0, 1)].0)
   0.9%   98.5%  0.000s  1.72e-07s  1000   7  Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%   99.4%  0.000s  1.66e-07s  1000  17  Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%  100.0%  0.000s  1.13e-07s  1000   9  InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
As mentioned in the comments, although you have set the allow_input_downcast parameter to True, you also need to make sure the data assigned to the shared variables is in float32. As of Jan 06, 2016, Theano still cannot work with any data type other than float32 for computations on the GPU, as mentioned here in more detail. You have to cast your data to the 'float32' format.
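To see why the cast matters: NumPy builds these matrices with a default 64-bit dtype, and theano.shared stores whatever dtype it is given, regardless of floatX. A minimal check (plain NumPy, independent of Theano):

```python
import numpy

# The integer literals from the question give an integer (64-bit) matrix
# by default, which the GPU back-end cannot compute with.
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
print(input_data.dtype)  # an integer dtype, not float32

# Casting before calling theano.shared yields the GPU-compatible dtype.
input32 = input_data.astype('float32')
print(input32.dtype)  # float32
```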
Therefore, here is the code you need to use:
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

ts = theano.shared(input_data.astype('float32'), "training-set")
e = theano.shared(output_data.astype('float32'), "expected")
w1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))

o = T.dot(ts, w1.T)
cost = T.mean(T.sqr(e - o.T))
gradient = T.grad(cost=cost, wrt=w1)
update = [[w1, w1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)

for i in range(1000):
    train()

train.profile.print_summary()

And here is the profiling result:
  Message: learntheano.py:18
  Time in 1000 calls to Function.__call__: 2.642968e-01s
  Time in Function.fn.__call__: 2.460811e-01s (93.108%)
  Time in thunks: 1.877530e-01s (71.039%)
  Total compile time: 2.483290e+01s
    Number of Apply nodes: 17
    Theano Optimizer time: 2.818849e-01s
    Theano validate time: 3.435850e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
    Import time 1.241469e-02s
  Time in all call to theano.grad() 1.206994e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  34.8%   34.8%  0.065s  3.27e-05s  C  2000  2  theano.sandbox.cuda.blas.GpuGemm
  28.8%   63.5%  0.054s  1.80e-05s  C  3000  3  theano.sandbox.cuda.basic_ops.GpuElemwise
  12.9%   76.4%  0.024s  2.42e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.3%   86.7%  0.019s  1.93e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.GpuFromHost
   7.2%   93.9%  0.014s  1.36e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.HostFromGpu
   1.8%   95.7%  0.003s  1.13e-06s  C  3000  3  theano.sandbox.cuda.basic_ops.GpuDimShuffle
   1.5%   97.2%  0.003s  2.81e-06s  C  1000  1  theano.tensor.elemwise.Elemwise
   1.1%   98.4%  0.002s  1.08e-06s  C  2000  2  theano.compile.ops.Shape_i
   1.1%   99.5%  0.002s  1.02e-06s  C  2000  2  theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%  100.0%  0.001s  9.96e-07s  C  1000  1  theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.3%   25.3%  0.047s  4.74e-05s  C  1000  1  GpuGemm{no_inplace}
  12.9%   38.1%  0.024s  2.42e-05s  C  1000  1  GpuCAReduce{pre=sqr,red=add}{1,1}
  12.8%   51.0%  0.024s  2.41e-05s  C  1000  1  GpuElemwise{mul,no_inplace}
  10.3%   61.3%  0.019s  1.93e-05s  C  1000  1  GpuFromHost
   9.5%   70.8%  0.018s  1.79e-05s  C  1000  1  GpuGemm{inplace}
   8.2%   79.0%  0.015s  1.55e-05s  C  1000  1  GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   7.7%   86.7%  0.014s  1.44e-05s  C  1000  1  GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
   7.2%   93.9%  0.014s  1.36e-05s  C  1000  1  HostFromGpu
   1.5%   95.4%  0.003s  2.81e-06s  C  1000  1  Elemwise{Cast{float32}}
   1.1%   96.5%  0.002s  1.02e-06s  C  2000  2  GpuSubtensor{int64}
   1.0%   97.5%  0.002s  9.00e-07s  C  2000  2  GpuDimShuffle{x,x}
   0.8%   98.3%  0.002s  1.59e-06s  C  1000  1  GpuDimShuffle{1,0}
   0.7%   99.1%  0.001s  1.38e-06s  C  1000  1  Shape_i{0}
   0.5%   99.6%  0.001s  9.96e-07s  C  1000  1  MakeVector
   0.4%  100.0%  0.001s  7.76e-07s  C  1000  1  Shape_i{1}
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  25.3%   25.3%  0.047s  4.74e-05s  1000   3  GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
  12.9%   38.1%  0.024s  2.42e-05s  1000   5  GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
  12.8%   51.0%  0.024s  2.41e-05s  1000  13  GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
  10.3%   61.3%  0.019s  1.93e-05s  1000   7  GpuFromHost(Elemwise{Cast{float32}}.0)
   9.5%   70.8%  0.018s  1.79e-05s  1000  16  GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
   8.2%   79.0%  0.015s  1.55e-05s  1000  12  GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
   7.7%   86.7%  0.014s  1.44e-05s  1000  15  GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
   7.2%   93.9%  0.014s  1.36e-05s  1000  14  HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
   1.5%   95.4%  0.003s  2.81e-06s  1000   6  Elemwise{Cast{float32}}(MakeVector.0)
   0.8%   96.3%  0.002s  1.59e-06s  1000   0  GpuDimShuffle{1,0}(training-set)
   0.7%   97.0%  0.001s  1.38e-06s  1000   2  Shape_i{0}(expected)
   0.7%   97.7%  0.001s  1.30e-06s  1000   8  GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
   0.6%   98.3%  0.001s  1.08e-06s  1000  11  GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   0.5%   98.8%  0.001s  9.96e-07s  1000   4  MakeVector(Shape_i{0}.0, Shape_i{1}.0)
   0.4%   99.2%  0.001s  7.76e-07s  1000   1  Shape_i{1}(expected)
   0.4%   99.6%  0.001s  7.40e-07s  1000   9  GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
   0.4%  100.0%  0.001s  7.25e-07s  1000  10  GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
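As a sanity check on the numbers themselves, the same graph (cost = mean(sqr(e - o.T)) with the 0.0001 update step) can be reproduced in plain NumPy; this is just a sketch to verify the hand-derived gradient and update rule, not part of the Theano answer:

```python
import numpy

# Same data and update rule as the Theano version, in plain NumPy.
x = numpy.array([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]], dtype='float32')
y = numpy.array([[1600, 2100, 1400, 2500, 3200]], dtype='float32')

w1 = numpy.zeros((1, 2), dtype='float32')
n = y.size

for _ in range(1000):
    pred = x.dot(w1.T)                # o = T.dot(ts, w1.T), shape (5, 1)
    err = y - pred.T                  # e - o.T, shape (1, 5)
    cost = numpy.mean(err ** 2)       # T.mean(T.sqr(e - o.T))
    grad = -2.0 * err.dot(x) / n      # hand-derived d(cost)/d(w1), shape (1, 2)
    w1 = w1 - 0.0001 * grad           # the update list from the answer

print(cost)  # should fall well below the initial mean squared output
```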