python - Theano simple linear regression runs on CPU instead of GPU
I created a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the code starts it says "Using gpu device", but (according to the profiler) the operations are all CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, etc.).
I checked the environment variables and THEANO_FLAGS, and everything seems right; I cannot see the catch (especially since the Theano tutorials run correctly on the GPU with the same settings :)).
Here is the code:
# linear regression
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

ts = theano.shared(input_data, "training-set")
e = theano.shared(output_data, "expected")
w1 = theano.shared(numpy.zeros((1, 2)))

o = T.dot(ts, w1.T)
cost = T.mean(T.sqr(e - o.T))
gradient = T.grad(cost=cost, wrt=w1)
update = [[w1, w1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()
- THEANO_FLAGS=cuda.root=/usr/local/cuda
- device=gpu
- floatX=float32
- lib.cnmem=.5
- profile=True
- CUDA_LAUNCH_BLOCKING=1
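For reference, these settings are normally passed as one comma-separated THEANO_FLAGS value (or put in ~/.theanorc); CUDA_LAUNCH_BLOCKING is a separate CUDA environment variable, not a Theano flag. A sketch of the invocation, assuming the script file name from the profiler output (test2.py):

```shell
# Config fragment only (requires a Theano + CUDA installation to actually run):
THEANO_FLAGS='cuda.root=/usr/local/cuda,device=gpu,floatX=float32,lib.cnmem=.5,profile=True' \
CUDA_LAUNCH_BLOCKING=1 python test2.py
```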
Output:
Using gpu device 0: GeForce GT 650M (CNMeM is enabled)

Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
    Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
    Import time 2.320528e-03s
  Time in all call to theano.grad() 8.740902e-03s
  Time since theano import 0.881s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%   71.7%  0.013s  6.59e-06s  Py  2000  2  theano.tensor.basic.Dot
  12.3%   83.9%  0.002s  3.22e-07s  C   7000  7  theano.tensor.elemwise.Elemwise
   5.7%   89.6%  0.001s  3.50e-07s  C   3000  3  theano.tensor.elemwise.DimShuffle
   4.0%   93.6%  0.001s  3.65e-07s  C   2000  2  theano.tensor.subtensor.Subtensor
   3.6%   97.2%  0.001s  3.31e-07s  C   2000  2  theano.compile.ops.Shape_i
   1.7%   98.9%  0.000s  3.06e-07s  C   1000  1  theano.tensor.opt.MakeVector
   1.1%  100.0%  0.000s  2.10e-07s  C   1000  1  theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%   71.7%  0.013s  6.59e-06s  Py  2000  2  dot
   4.0%   75.6%  0.001s  3.65e-07s  C   2000  2  Subtensor{int64}
   3.5%   79.1%  0.001s  6.35e-07s  C   1000  1  InplaceDimShuffle{1,0}
   3.3%   82.4%  0.001s  6.06e-07s  C   1000  1  Elemwise{mul,no_inplace}
   2.4%   84.8%  0.000s  4.38e-07s  C   1000  1  Shape_i{0}
   2.3%   87.1%  0.000s  4.29e-07s  C   1000  1  Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%   89.3%  0.000s  2.08e-07s  C   2000  2  InplaceDimShuffle{x,x}
   1.8%   91.1%  0.000s  3.25e-07s  C   1000  1  Elemwise{Cast{float64}}
   1.7%   92.8%  0.000s  3.06e-07s  C   1000  1  MakeVector{dtype='int64'}
   1.5%   94.3%  0.000s  2.78e-07s  C   1000  1  Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%   95.7%  0.000s  2.53e-07s  C   1000  1  Elemwise{sub}[(0, 1)]
   1.2%   96.9%  0.000s  2.24e-07s  C   1000  1  Shape_i{1}
   1.1%   98.0%  0.000s  2.10e-07s  C   1000  1  Sum{acc_dtype=float64}
   1.1%   99.1%  0.000s  1.98e-07s  C   1000  1  Elemwise{sqr}[(0, 0)]
   0.9%  100.0%  0.000s  1.66e-07s  C   1000  1  Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%   37.8%  0.007s  6.95e-06s  1000   3  dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%   71.7%  0.006s  6.24e-06s  1000  14  dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%   75.1%  0.001s  6.35e-07s  1000   0  InplaceDimShuffle{1,0}(training-set)
   3.3%   78.4%  0.001s  6.06e-07s  1000  11  Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%   81.4%  0.001s  5.58e-07s  1000   8  Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%   83.8%  0.000s  4.38e-07s  1000   2  Shape_i{0}(expected)
   2.3%   86.2%  0.000s  4.29e-07s  1000  12  Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%   87.9%  0.000s  3.25e-07s  1000   6  Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%   89.6%  0.000s  3.06e-07s  1000   4  MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%   91.2%  0.000s  3.03e-07s  1000  10  InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%   92.7%  0.000s  2.78e-07s  1000  16  Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%   94.1%  0.000s  2.53e-07s  1000   5  Elemwise{sub}[(0, 1)](expected, dot.0)
   1.2%   95.3%  0.000s  2.24e-07s  1000   1  Shape_i{1}(expected)
   1.1%   96.5%  0.000s  2.10e-07s  1000  15  Sum{acc_dtype=float64}(Elemwise{sqr}[(0, 0)].0)
   1.1%   97.6%  0.000s  1.98e-07s  1000  13  Elemwise{sqr}[(0, 0)](Elemwise{sub}[(0, 1)].0)
   0.9%   98.5%  0.000s  1.72e-07s  1000   7  Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%   99.4%  0.000s  1.66e-07s  1000  17  Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%  100.0%  0.000s  1.13e-07s  1000   9  InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
As mentioned in the comments, although you have set the allow_input_downcast parameter to True, you also need to make sure the data assigned to the shared variables is in float32. As of Jan 06, 2016, Theano still cannot work with any data type other than float32 for computations on the GPU, as mentioned here in more detail. You have to cast your data to the 'float32' format.
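To see why the cast matters: NumPy builds these matrices with a default 64-bit dtype, and theano.shared stores whatever dtype it is given, regardless of floatX. A minimal check (plain NumPy, independent of Theano):

```python
import numpy

# The integer literals from the question give an integer (64-bit) matrix
# by default, which the GPU back-end cannot compute with.
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
print(input_data.dtype)  # an integer dtype, not float32

# Casting before calling theano.shared yields the GPU-compatible dtype.
input32 = input_data.astype('float32')
print(input32.dtype)  # float32
```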
Therefore, here is the code you need to use:
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

ts = theano.shared(input_data.astype('float32'), "training-set")
e = theano.shared(output_data.astype('float32'), "expected")
w1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))

o = T.dot(ts, w1.T)
cost = T.mean(T.sqr(e - o.T))
gradient = T.grad(cost=cost, wrt=w1)
update = [[w1, w1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)

for i in range(1000):
    train()

train.profile.print_summary()

And here is the profiling result:
  Message: learntheano.py:18
  Time in 1000 calls to Function.__call__: 2.642968e-01s
  Time in Function.fn.__call__: 2.460811e-01s (93.108%)
  Time in thunks: 1.877530e-01s (71.039%)
  Total compile time: 2.483290e+01s
    Number of Apply nodes: 17
    Theano Optimizer time: 2.818849e-01s
    Theano validate time: 3.435850e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
    Import time 1.241469e-02s
  Time in all call to theano.grad() 1.206994e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  34.8%   34.8%  0.065s  3.27e-05s  C  2000  2  theano.sandbox.cuda.blas.GpuGemm
  28.8%   63.5%  0.054s  1.80e-05s  C  3000  3  theano.sandbox.cuda.basic_ops.GpuElemwise
  12.9%   76.4%  0.024s  2.42e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.3%   86.7%  0.019s  1.93e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.GpuFromHost
   7.2%   93.9%  0.014s  1.36e-05s  C  1000  1  theano.sandbox.cuda.basic_ops.HostFromGpu
   1.8%   95.7%  0.003s  1.13e-06s  C  3000  3  theano.sandbox.cuda.basic_ops.GpuDimShuffle
   1.5%   97.2%  0.003s  2.81e-06s  C  1000  1  theano.tensor.elemwise.Elemwise
   1.1%   98.4%  0.002s  1.08e-06s  C  2000  2  theano.compile.ops.Shape_i
   1.1%   99.5%  0.002s  1.02e-06s  C  2000  2  theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%  100.0%  0.001s  9.96e-07s  C  1000  1  theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.3%   25.3%  0.047s  4.74e-05s  C  1000  1  GpuGemm{no_inplace}
  12.9%   38.1%  0.024s  2.42e-05s  C  1000  1  GpuCAReduce{pre=sqr,red=add}{1,1}
  12.8%   51.0%  0.024s  2.41e-05s  C  1000  1  GpuElemwise{mul,no_inplace}
  10.3%   61.3%  0.019s  1.93e-05s  C  1000  1  GpuFromHost
   9.5%   70.8%  0.018s  1.79e-05s  C  1000  1  GpuGemm{inplace}
   8.2%   79.0%  0.015s  1.55e-05s  C  1000  1  GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   7.7%   86.7%  0.014s  1.44e-05s  C  1000  1  GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
   7.2%   93.9%  0.014s  1.36e-05s  C  1000  1  HostFromGpu
   1.5%   95.4%  0.003s  2.81e-06s  C  1000  1  Elemwise{Cast{float32}}
   1.1%   96.5%  0.002s  1.02e-06s  C  2000  2  GpuSubtensor{int64}
   1.0%   97.5%  0.002s  9.00e-07s  C  2000  2  GpuDimShuffle{x,x}
   0.8%   98.3%  0.002s  1.59e-06s  C  1000  1  GpuDimShuffle{1,0}
   0.7%   99.1%  0.001s  1.38e-06s  C  1000  1  Shape_i{0}
   0.5%   99.6%  0.001s  9.96e-07s  C  1000  1  MakeVector
   0.4%  100.0%  0.001s  7.76e-07s  C  1000  1  Shape_i{1}
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  25.3%   25.3%  0.047s  4.74e-05s  1000   3  GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
  12.9%   38.1%  0.024s  2.42e-05s  1000   5  GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
  12.8%   51.0%  0.024s  2.41e-05s  1000  13  GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
  10.3%   61.3%  0.019s  1.93e-05s  1000   7  GpuFromHost(Elemwise{Cast{float32}}.0)
   9.5%   70.8%  0.018s  1.79e-05s  1000  16  GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
   8.2%   79.0%  0.015s  1.55e-05s  1000  12  GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
   7.7%   86.7%  0.014s  1.44e-05s  1000  15  GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
   7.2%   93.9%  0.014s  1.36e-05s  1000  14  HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
   1.5%   95.4%  0.003s  2.81e-06s  1000   6  Elemwise{Cast{float32}}(MakeVector.0)
   0.8%   96.3%  0.002s  1.59e-06s  1000   0  GpuDimShuffle{1,0}(training-set)
   0.7%   97.0%  0.001s  1.38e-06s  1000   2  Shape_i{0}(expected)
   0.7%   97.7%  0.001s  1.30e-06s  1000   8  GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
   0.6%   98.3%  0.001s  1.08e-06s  1000  11  GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   0.5%   98.8%  0.001s  9.96e-07s  1000   4  MakeVector(Shape_i{0}.0, Shape_i{1}.0)
   0.4%   99.2%  0.001s  7.76e-07s  1000   1  Shape_i{1}(expected)
   0.4%   99.6%  0.001s  7.40e-07s  1000   9  GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
   0.4%  100.0%  0.001s  7.25e-07s  1000  10  GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
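As a sanity check on the numbers themselves, the same graph (cost = mean(sqr(e - o.T)) with the 0.0001 update step) can be reproduced in plain NumPy; this is just a sketch to verify the hand-derived gradient and update rule, not part of the Theano answer:

```python
import numpy

# Same data and update rule as the Theano version, in plain NumPy.
x = numpy.array([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]], dtype='float32')
y = numpy.array([[1600, 2100, 1400, 2500, 3200]], dtype='float32')

w1 = numpy.zeros((1, 2), dtype='float32')
n = y.size

for _ in range(1000):
    pred = x.dot(w1.T)                # o = T.dot(ts, w1.T), shape (5, 1)
    err = y - pred.T                  # e - o.T, shape (1, 5)
    cost = numpy.mean(err ** 2)       # T.mean(T.sqr(e - o.T))
    grad = -2.0 * err.dot(x) / n      # hand-derived d(cost)/d(w1), shape (1, 2)
    w1 = w1 - 0.0001 * grad           # the update list from the answer

print(cost)  # should fall well below the initial mean squared output
```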