PyTorch 0.3.0 has removed stochastic functions, such as Variable.reinforce(), citing “limited functionality and broad performance implications.”
The new version also brings a number of performance improvements, new layers, support for ONNX, CUDA 9, and cuDNN 7, along with “lots of bug fixes.”
“The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change,” the PyTorch team said.
To replace stochastic functions, they have introduced the torch.distributions package.
So if your previous code looked like this:
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()
This could be the new equivalent code:
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
Some loss functions can now compute per-sample losses in a mini-batch. They take a reduce=False argument that returns the individual loss for each sample instead of a single reduced value:
loss = nn.CrossEntropyLoss(..., reduce=False)
The losses currently supporting reduce=False are MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, and L1Loss.
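As a rough sketch of how this can be used (the shapes, values, and variable names here are illustrative, not from the release notes):
# Per-sample losses via reduce=False, using the 0.3.0-era Variable API
import torch
import torch.nn as nn
from torch.autograd import Variable

loss_fn = nn.CrossEntropyLoss(reduce=False)
logits = Variable(torch.randn(4, 10))               # mini-batch of 4, 10 classes
targets = Variable(torch.LongTensor([1, 0, 4, 9]))  # one class index per sample
per_sample = loss_fn(logits, targets)               # shape (4,): one loss per sample
total = per_sample.mean()                           # reduce manually if needed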
PyTorch has built a low-level profiler to help you identify bottlenecks in your models.
Let us start with an example:
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
... y = x ** 2
... y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
-------------------------------- ---------- ---------
Name CPU time CUDA time
------------------------------- ---------- ---------
PowConstant 142.036us 0.000us
N5torch8autograd9GraphRootE 63.524us 0.000us
PowConstantBackward 184.228us 0.000us
MulConstant 50.288us 0.000us
PowConstant 28.439us 0.000us
Mul 20.154us 0.000us
N5torch8autograd14AccumulateGradE 13.790us 0.000us
N5torch8autograd5CloneE 4.088us 0.000us
The profiler works for both CPU and CUDA models. For CUDA models, you have to run your Python program with a special nvprof prefix. For example:
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>
# in python
>>> with torch.cuda.profiler.profile():
... model(x) # Warmup CUDA memory allocator and profiler
... with torch.autograd.profiler.emit_nvtx():
... model(x)
Then, you can load trace_name.prof
in PyTorch and print a summary profile report.
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
For additional details, see the torch.autograd.profiler documentation.
v0.3.0 has added higher-order gradients support for a range of layers, along with a new optim.SparseAdam optimizer that implements a lazy version of the Adam algorithm suitable for sparse tensors. (In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.)
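A minimal sketch of how SparseAdam fits together with a layer that emits sparse gradients (the sizes and indices are illustrative):
# optim.SparseAdam paired with an nn.Embedding that produces sparse gradients
import torch
import torch.nn as nn
from torch.autograd import Variable

emb = nn.Embedding(1000, 16, sparse=True)  # sparse=True yields sparse gradients
opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)

idx = Variable(torch.LongTensor([3, 7, 42]))
loss = emb(idx).sum()
opt.zero_grad()
loss.backward()  # the gradient only touches rows 3, 7, and 42
opt.step()       # so only those rows' moments are updated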
Upsampling now supports both nearest and linear modes. AdaptivePool*d layers now let you leave some of the target output sizes unspecified and infer them at runtime. For example:
# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))
Other additions include:
- torch.erf and torch.erfinv, which compute the error function and the inverse error function of each element in the Tensor.
- Tensor.put_ and torch.take, similar to numpy.put and numpy.take.
- zeros and zeros_like for sparse Tensors.
- int(torch.Tensor([5])) works now.
- torch.cuda.get_device_name and torch.cuda.get_device_capability, which do what the names say. Example:
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)
- If torch.backends.cudnn.deterministic = True is set, then the cuDNN convolutions use deterministic algorithms (see the sketch after this list).
- torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once.
- torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
- softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension); see the sketch after this list.
- torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable.
- The device_id argument has been replaced with device, to make things consistent.
- torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This is useful when calling torch.autograd.grad in large graphs with lists of inputs / outputs. For example:
x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y])                      # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True)   # works
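A compact sketch exercising a few of the items above; it assumes a CUDA-enabled 0.3.0 build, and the tensors are illustrative:
# cuDNN determinism, per-GPU RNG state, dim-aware softmax, cache release
import torch
import torch.nn.functional as F
from torch.autograd import Variable

torch.backends.cudnn.deterministic = True  # cuDNN picks deterministic algorithms

states = torch.cuda.get_rng_state_all()    # snapshot the RNG state of every GPU
torch.cuda.set_rng_state_all(states)       # ...restore it later for reproducibility

x = Variable(torch.randn(2, 3))
p = F.softmax(x, dim=-1)        # softmax over the last dimension
logp = F.log_softmax(x, dim=1)  # same dimension here, spelled positively

torch.cuda.empty_cache()        # hand cached, unused blocks back to the driver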
Additional improvements include:
- pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding.
- Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...), and you will get a concatenated dataset containing samples from both.
- torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
- zero_() has been added to Variable.
- Variable.shape returns the size of the Tensor (now made consistent with Tensor).
- torch.version.cuda specifies the CUDA version that PyTorch was compiled with.
- random_ is now supported for CUDA.
- torch.save and torch.load now accept a pathlib.Path object, which is a standard Python3 typed filepath object.
- When loading a state_dict into another model (for example to fine-tune a pre-trained network), load_state_dict was strict on matching the key names of the parameters. Now PyTorch provides a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys (see the sketch after this list).
- Added nn.functional.embedding_bag, which is equivalent to nn.EmbeddingBag.
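A brief sketch of the strict=False behavior; the model and the checkpoint file name are placeholders:
# Partially load a checkpoint, ignoring keys that do not match
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 2))
state = torch.load('pretrained.pth')        # may contain extra or missing keys
model.load_state_dict(state, strict=False)  # loads matching keys, skips the rest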
On the performance and interoperability side:
- The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using the ATen library.
- nn.Embedding's renorm option is much faster on the GPU: for an embedding of size 100k x 128 and a batch size of 1024, it is 33x faster.
- torch.arange follows NumPy semantics; for example, torch.arange(10) yields the values 0 through 9.
- DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.dlpack.to_dlpack(x) and torch.utils.dlpack.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.
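A minimal round-trip through DLPack, using the torch.utils.dlpack module:
# torch Tensor -> DLPack capsule -> torch Tensor, with no copy in between
import torch
from torch.utils import dlpack

src = torch.arange(0, 6).view(2, 3)
capsule = dlpack.to_dlpack(src)    # wrap the Tensor as a DLPack capsule
dst = dlpack.from_dlpack(capsule)  # rebuild a torch Tensor from the capsule
dst[0, 0] = 42                     # src sees the change too: memory is shared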
ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, and TensorFlow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.
There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models; the operations supported in this release are listed in that documentation.
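A hedged sketch of what an export looks like with this API (the model, input shape, and file name are illustrative):
# Trace one forward pass of a small ConvNet and write it out as ONNX
import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
dummy_input = Variable(torch.randn(1, 3, 32, 32))  # export runs a trace on this input

torch.onnx.export(model, dummy_input, 'convnet.onnx')  # writes graph + weights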
The release also includes a long list of bug fixes and usability improvements:
- Running PyTorch on an unsupported GPU or CUDA setup now prints a warning to the user.
- Error handling in load_state_dict has been improved.
- Calls such as torch.manual_seed no longer initialize CUDA eagerly (instead, the calls are queued and run when CUDA is initialized).
- If x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]].
- x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
- Tensor constructors given a NumPy array of a mismatched type, such as torch.DoubleTensor(np.array([0,1,2], dtype=np.float32)) or torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)), will now work by making a copy.
- ones_like and zeros_like now create Tensors on the same device as the original Tensor.
- expand and expand_as allow expanding an empty Tensor to another empty Tensor.
- torch.HalfTensor now supports numpy() and torch.from_numpy.
- Added additional size checking for torch.scatter.
- Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor.
- Fixed ZeroDivisionError: float division by zero when printing certain Tensors.
- torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
- Fixed an issue involving contiguous Tensors.
- Fixed symeig on CUDA for large matrices. The bug was that not enough space was being allocated for the workspace, causing some undefined behavior.
- Improved the numerical stability of torch.var and torch.std by using Welford's algorithm.
- The random number generator previously returned uniform samples with inconsistent bounds (an inconsistency in the CPU implementation, plus a cublas bug). Now, all uniform sampled numbers will return within the bounds [0, 1), across all types and devices.
- Fixed torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings).
- Empty index Tensors are now allowed for index_select (instead of erroring out).
- Previously, with eigenvector=False, symeig returned some unknown value for the eigenvectors. Now this is corrected.
- Fixed .type() not converting the indices tensor, and fixed .type() around non-default GPU input.
- When torch.norm returned 0.0, the gradient was NaN. The subgradient at 0.0 is now used, so the gradient is 0.0.
- torch.prod's backward was failing on the GPU due to a type error, fixed.
- torch.optim.lr_scheduler is now imported by default.
- If register_buffer("foo", ...) is called and self.foo already exists, then instead of silently failing, a KeyError is now raised.
- Fixed loading of older checkpoints of RNN/LSTM that were missing _data_ptrs attributes.
- nn.Embedding had a hard error when using the max_norm option. This is fixed now.
- When using the max_norm option, the passed-in indices were written upon (by the underlying implementation). To fix this, a clone of the indices is now passed to the renorm kernel.
- F.affine_grid now can take non-contiguous inputs.
- If a BatchNorm layer has only 1 value per channel in total, it raises an error in training mode.
- Fixed an edge case where -inf was returned; this now correctly returns 0.0.
- Fixed poisson_nll_loss when log_input=False by adding a small epsilon.
- DataParallel now supports keyword arguments, e.g. n = nn.DataParallel(Net()); out = n(input=i).
- Fixed handling of parameters with requires_grad=False in DistributedDataParallel.
- Fixed DistributedDataParallel for models with no buffers (previously raised an incoherent error).
- Fixed __get_state__ to be functional in DistributedDataParallel (it was returning nothing).
- Among other fixes, model_zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib.
To download the source code, visit the PyTorch repository on GitHub.