Configuration, profiling and debugging
For debugging purposes, Theano can print more verbose information and offers different optimization modes:
>>> theano.config.exception_verbosity='high'
>>> theano.config.mode
'Mode'
>>> theano.config.optimizer='fast_compile'
In order for Theano to use the config.optimizer value, the mode has to be set to Mode; otherwise, the value in config.mode will be used:
| config.mode / function mode | config.optimizer (*) | Description |
|---|---|---|
| FAST_RUN | fast_run | Default; best run performance, slow compilation |
| FAST_RUN | None | Disable optimizations |
| FAST_COMPILE | fast_compile | Reduce the number of optimizations, compiles faster |
| None | | Use the default mode, equivalent to FAST_RUN |
| NanGuardMode | | NaNs, Infs, and abnormally big values will raise errors |
| DebugMode | | Self-checks and assertions during compilation |
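For example, with config.mode left at its default value, Mode, you can switch the optimizer directly from your session; the following lines are a minimal sketch of this:

>>> import theano
>>> theano.config.mode
'Mode'
>>> theano.config.optimizer='None'          # disable graph optimizations for debugging
>>> theano.config.optimizer='fast_compile'  # or keep only the cheap optimizations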
The same values as for config.mode can be passed to the mode parameter when compiling the function:
>>> f = theano.function([a,s0], results, updates=updates, mode='FAST_COMPILE')
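To track down numerical issues, the NanGuardMode listed in the table can also be instantiated and passed as the mode; this is a sketch using the mode's standard switches:

>>> from theano.compile.nanguardmode import NanGuardMode
>>> f = theano.function([a,s0], results, updates=updates,
...         mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))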
Disabling optimizations and choosing a high exception verbosity will help you find errors in the computation graph.
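The same debugging settings can also be passed on the command line through the THEANO_FLAGS environment variable, without editing the script (the script name here is just a placeholder):

THEANO_FLAGS="optimizer=None,exception_verbosity=high" python script.py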
For debugging on the GPU, you need to force synchronous execution with the CUDA_LAUNCH_BLOCKING environment variable, since GPU execution is fully asynchronous by default:
CUDA_LAUNCH_BLOCKING=1 python
To find the origin of latencies in your computation graph, Theano provides a profiling mode.
Activate profiling:
>>> theano.config.profile=True
Activate memory profiling:
>>> theano.config.profile_memory=True
Activate profiling of the optimization phase:
>>> theano.config.profile_optimizer=True
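Alternatively, the same profiling flags can be set from the environment when launching your program (again, the script name is a placeholder):

THEANO_FLAGS="profile=True,profile_memory=True,profile_optimizer=True" python script.py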
Or directly during compilation:
>>> f = theano.function([a,s0], results, profile=True)
>>> f.profile.summary()
Function profiling
==================
  Message: <stdin>:1
  Time in 1 calls to Function.__call__: 1.490116e-03s
  Time in Function.fn.__call__: 1.251936e-03s (84.016%)
  Time in thunks: 1.203537e-03s (80.768%)
  Total compile time: 1.720619e-01s
    Number of Apply nodes: 14
    Theano Optimizer time: 1.382768e-01s
       Theano validate time: 1.308680e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.405691e-02s
       Import time 1.272917e-03s
       Node make_thunk time 2.329803e-02s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 520.661s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  58.2%    58.2%   0.001s   7.00e-04s   Py   1   1   theano.scan_module.scan_op.Scan
  27.3%    85.4%   0.000s   1.64e-04s   Py   2   2   theano.sandbox.cuda.basic_ops.GpuFromHost
   6.1%    91.5%   0.000s   7.30e-05s   Py   1   1   theano.sandbox.cuda.basic_ops.HostFromGpu
   5.5%    97.0%   0.000s   6.60e-05s   C    1   1   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   1.1%    98.0%   0.000s   3.22e-06s   C    4   4   theano.tensor.elemwise.Elemwise
   0.7%    98.8%   0.000s   8.82e-06s   C    1   1   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.7%    99.4%   0.000s   7.87e-06s   C    1   1   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   0.3%    99.7%   0.000s   3.81e-06s   C    1   1   theano.compile.ops.Shape_i
   0.3%   100.0%   0.000s   1.55e-06s   C    2   2   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  58.2%    58.2%   0.001s   7.00e-04s   Py   1   1   forall_inplace,gpu,scan_fn}
  27.3%    85.4%   0.000s   1.64e-04s   Py   2   2   GpuFromHost
   6.1%    91.5%   0.000s   7.30e-05s   Py   1   1   HostFromGpu
   5.5%    97.0%   0.000s   6.60e-05s   C    1   1   GpuIncSubtensor{InplaceSet;:int64:}
   0.7%    97.7%   0.000s   8.82e-06s   C    1   1   GpuSubtensor{int64:int64:int16}
   0.7%    98.4%   0.000s   7.87e-06s   C    1   1   GpuAllocEmpty
   0.3%    98.7%   0.000s   4.05e-06s   C    1   1   Elemwise{switch,no_inplace}
   0.3%    99.0%   0.000s   4.05e-06s   C    1   1   Elemwise{le,no_inplace}
   0.3%    99.3%   0.000s   3.81e-06s   C    1   1   Shape_i{0}
   0.3%    99.6%   0.000s   1.55e-06s   C    2   2   ScalarFromTensor
   0.2%    99.8%   0.000s   2.86e-06s   C    1   1   Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}
   0.2%   100.0%   0.000s   1.91e-06s   C    1   1   Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)]
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  58.2%    58.2%   0.001s   7.00e-04s   1   12   forall_inplace,gpu,scan_fn}(TensorConstant{10}, GpuSubtensor{int64:int64:int16}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuFromHost.0)
  21.9%    80.1%   0.000s   2.64e-04s   1    3   GpuFromHost(<TensorType(float32, vector)>)
   6.1%    86.2%   0.000s   7.30e-05s   1   13   HostFromGpu(forall_inplace,gpu,scan_fn}.0)
   5.5%    91.6%   0.000s   6.60e-05s   1    4   GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, CudaNdarrayConstant{[ 0.]}, Constant{1})
   5.3%    97.0%   0.000s   6.41e-05s   1    0   GpuFromHost(s0)
   0.7%    97.7%   0.000s   8.82e-06s   1   11   GpuSubtensor{int64:int64:int16}(GpuFromHost.0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
   0.7%    98.4%   0.000s   7.87e-06s   1    1   GpuAllocEmpty(TensorConstant{10})
   0.3%    98.7%   0.000s   4.05e-06s   1    8   Elemwise{switch,no_inplace}(Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{0})
   0.3%    99.0%   0.000s   4.05e-06s   1    6   Elemwise{le,no_inplace}(Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, TensorConstant{0})
   0.3%    99.3%   0.000s   3.81e-06s   1    2   Shape_i{0}(<TensorType(float32, vector)>)
   0.3%    99.6%   0.000s   3.10e-06s   1   10   ScalarFromTensor(Elemwise{switch,no_inplace}.0)
   0.2%    99.8%   0.000s   2.86e-06s   1    5   Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}(TensorConstant{10}, Shape_i{0}.0)
   0.2%   100.0%   0.000s   1.91e-06s   1    7   Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, Shape_i{0}.0)
   0.0%   100.0%   0.000s   0.00e+00s   1    9   ScalarFromTensor(Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)].0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)