class: middle, inverse

## Torch for R

### Daniel Falbel
### RStudio, PBC

January 27, 2021

---
class: normal, middle

## Who am I?

.pull-left[
- Daniel Falbel
- Live in São Paulo, Brazil
- Software engineer at RStudio, PBC
- Working on the Multiverse team
]

---
class: normal, middle

## Outline

.pull-left[
- What's Torch?
- The Torch components
- Contributing
- Future work
]

---
class: normal, middle

## What's Torch?

.pull-left[
Torch is an R package with 2 core features:

- Array computation with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
]

---
class: normal, middle

## Why Torch?

.pull-left[
Torch is based on PyTorch, a framework whose popularity is rapidly increasing among deep learning researchers.

We believe others can build on torch's GPU acceleration to implement fast machine learning algorithms through its convenient interface.
]

.pull-right[
Papers with Code [trends](https://paperswithcode.com/trends) section.
]

---
class: normal, middle

## Torch and TensorFlow

.pull-left[
- Torch for R is at an early development stage. TensorFlow is more mature.
- Torch binds to LibTorch (the C++ library), while TensorFlow for R uses the Python implementation via reticulate.
]

.pull-right[
- Torch has a lower-level API. Keras has a very high-level and concise API.
]

---
class: normal, middle

## Implementation

.pull-left[
- Almost all `torch_*` functions are autogenerated from the LibTorch declaration file.
- Most of the neural network modules, optimizers, datasets and dataloaders code is written in R.
]

.pull-right[
]

---
class: normal, middle

## example_00.R

Click [here](https://gist.github.com/dfalbel/10f2fca89dd1e7713be62785435e9064#file-example_00-r) for a link.

---
class: normal, middle

## Torch components

---
class: normal, middle

## Tensors

---
class: normal, middle

## Tensors

The `torch_tensor` is the core data structure in torch.

---
class: normal, middle

## Creating tensors from R objects

.pull-left[
- Tensors can be created from R objects like `numeric` vectors, matrices and arrays.
- Currently only integers, doubles and logicals are supported.
- **Note**: doubles are converted to float, because most operations in torch are optimized for it.
]

.pull-right[
```r
torch_tensor(c(1L, 2L, 3L))
```

```
## torch_tensor
##  1
##  2
##  3
## [ CPULongType{3} ]
```

```r
m <- matrix(c(1,2,3,4), ncol = 2)
torch_tensor(m)
```

```
## torch_tensor
##  1  3
##  2  4
## [ CPUFloatType{2,2} ]
```
]

---
class: normal, middle

## Initialization functions

.pull-left[
- Tensors can also be created with the initialization functions.
- These functions provide a convenient interface for creating multi-dimensional arrays of any size.
- See more info [here](https://torch.mlverse.org/docs/articles/tensor-creation.html#using-creation-functions-1).
]

.pull-right[
```r
# 2x2 matrix, standard normal
torch_randn(2, 2)
```

```
## torch_tensor
## -0.4151  0.8593
## -0.4053 -0.2437
## [ CPUFloatType{2,2} ]
```

```r
# length 3 vector, [0,1] uniform
torch_rand(3)
```

```
## torch_tensor
##  0.8956
##  0.0358
##  0.6335
## [ CPUFloatType{3} ]
```
]
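---
class: normal, middle

## Creation functions: dtype and device (a sketch)

.pull-left[
- A quick, hedged sketch: creation functions also accept `dtype` and `device` arguments, and a tensor can be converted back to an R object with `as.array()`. The arguments shown on the right follow the tensor-creation article; double-check the reference if in doubt.
]

.pull-right[
```r
# a 2x3 integer (long) tensor of zeros
x <- torch_zeros(2, 3, dtype = torch_long())

# a tensor placed explicitly on the CPU
torch_ones(2, 2, device = "cpu")

# back to an ordinary R array
as.array(x)
```
]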
---
class: normal, middle

## Indexing

.pull-left[
- Indexing tensors is supported, but it differs from R indexing in a few cases.
- Negative indexing doesn't remove elements; instead it selects starting from the end.
- See the docs [here](https://torch.mlverse.org/docs/articles/indexing.html).
]

.pull-right[
```r
x <- torch_tensor(1:5)
x[1]
```

```
## torch_tensor
## 1
## [ CPULongType{} ]
```

```r
x[-1]
```

```
## torch_tensor
## 5
## [ CPULongType{} ]
```
]

---
class: normal, middle

## Indexing

.pull-left[
- Interval selection works as expected.
]

.pull-right[
```r
x <- torch_tensor(1:5)
x[1:3]
```

```
## torch_tensor
##  1
##  2
##  3
## [ CPULongType{3} ]
```

```r
x[-3:N]
```

```
## torch_tensor
##  3
##  4
##  5
## [ CPULongType{3} ]
```
]

---
class: normal, middle

## Indexing

.pull-left[
- You can select without specifying all the dimensions.
- Adding new dimensions is also supported.
]

.pull-right[
```r
x <- torch_randn(2,2,3)
x[.., 1]$shape
```

```
## [1] 2 2
```

```r
x[.., 1, drop = FALSE]$shape
```

```
## [1] 2 2 1
```

```r
x[.., newaxis]$shape
```

```
## [1] 2 2 3 1
```
]

---
class: normal, middle

## Indexing

.pull-left[
- Subset assignment is also supported.
]

.pull-right[
```r
x <- torch_tensor(c(1,2,3))
x[1] <- 10
x[2:3] <- c(9, 8)
x
```

```
## torch_tensor
##  10
##   9
##   8
## [ CPUFloatType{3} ]
```
]

---
class: normal, middle

## Accessing attributes

.pull-left[
- Tensor attributes can be accessed using the `$` operator.
- All tensors have a data type (`dtype`), a shape, a device and the `requires_grad` flag.
]

.pull-right[
```r
x <- torch_randn(2,2)
x$shape
```

```
## [1] 2 2
```

```r
x$dtype
```

```
## torch_Float
```

```r
x$device
```

```
## torch_device(type='cpu')
```

```r
x$requires_grad
```

```
## [1] FALSE
```
]

---
class: normal, middle

## Modifying attributes

.pull-left[
- You can change all tensor attributes using the `$to()` method.
- Use named arguments.
]

.pull-right[
```r
x <- torch_tensor(1:5)
x$dtype
```

```
## torch_Long
```

```r
x <- x$to(dtype = torch_float())
x$dtype
```

```
## torch_Float
```

```r
x <- x$to(device = "cuda")
x$device
## torch_device(type='cuda', index=0)
```
]

---
class: normal, middle

## Array computation

---
class: normal, middle

## Array computation

.pull-left[
- torch features a comprehensive tensor computation library.
- More than 200 functions and methods to manipulate tensors.
- Often you can choose between using the method or the function directly.
- Functions have the `torch_*` prefix.
]

.pull-right[
```r
x <- torch_randn(2,3)
torch_mean(x)
```

```
## torch_tensor
## 0.551289
## [ CPUFloatType{} ]
```

```r
x$mean()
```

```
## torch_tensor
## 0.551289
## [ CPUFloatType{} ]
```

```r
torch_sum(x, dim = 2)
```

```
## torch_tensor
##  0.0732
##  3.2346
## [ CPUFloatType{2} ]
```
]
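---
class: normal, middle

## Broadcasting (a quick sketch)

.pull-left[
- A small, hedged example: as in PyTorch, tensors with compatible shapes are broadcast in elementwise operations, so you rarely need to replicate data by hand.
]

.pull-right[
```r
m <- torch_ones(2, 3)          # shape {2,3}
v <- torch_tensor(c(1, 2, 3))  # shape {3}

# v is recycled ("broadcast") across the rows of m,
# yielding a {2,3} tensor whose rows are 2 3 4
m + v
```
]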
---
class: normal, middle

## Other useful functions

.pull-left[
- There are functions for pretty much every math operation you can think of.
- See the full list [here](https://torch.mlverse.org/docs/reference/index.html#section-mathematical-operations-on-tensors).
]

.pull-right[
```r
x <- torch_tensor(c(1, 2))
y <- torch_tensor(c(3, 4))
torch_cat(list(x, y))
```

```
## torch_tensor
##  1
##  2
##  3
##  4
## [ CPUFloatType{4} ]
```

```r
torch_unbind(x, dim = 1)
```

```
## [[1]]
## torch_tensor
## 1
## [ CPUFloatType{} ]
## 
## [[2]]
## torch_tensor
## 2
## [ CPUFloatType{} ]
```
]

---
class: normal, middle

## Methods

.pull-left[
- Tensor methods are accessed using the `$` operator.
- Methods with names ending in `_` operate **in-place**.
- Full list available [here](https://torch.mlverse.org/docs/articles/tensor/index.html).
]

.pull-right[
```r
x <- torch_tensor(c(1,2))
x$mean()
```

```
## torch_tensor
## 1.5
## [ CPUFloatType{} ]
```

```r
y <- x$add_(1L); x
```

```
## torch_tensor
##  2
##  3
## [ CPUFloatType{2} ]
```
]

---
class: normal, middle

## Autograd

---
class: normal, middle

## What's autograd?

.pull-left[
- Autograd can automatically compute exact derivatives of tensor operations.
- It's the core feature that allows torch to be used for training neural networks.
- You need to set `requires_grad = TRUE` if you want torch to track the operations and be able to compute derivatives.
]

.pull-right[
```r
x <- torch_tensor(
  2,
  requires_grad = TRUE
)
y <- x ^ 3

# torch will compute dy/d*
y$backward()

x$grad # 3 * x ^ 2
```

```
## torch_tensor
##  12
## [ CPUFloatType{1} ]
```
]
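---
class: normal, middle

## Autograd in action (a sketch)

.pull-left[
- A minimal, hedged sketch of what autograd buys you: fit `y = 2 * x` by gradient descent, letting `backward()` compute the gradient. The toy data and learning rate are just for illustration; in practice you would use the optimizers shown later.
]

.pull-right[
```r
x <- torch_randn(100, 1)
y <- 2 * x

w <- torch_zeros(1, requires_grad = TRUE)

for (i in 1:100) {
  loss <- torch_mean((x * w - y)^2)
  loss$backward()          # fills w$grad with d(loss)/dw
  with_no_grad({
    w$sub_(0.1 * w$grad)   # gradient step
    w$grad$zero_()
  })
}

w # should approach 2
```
]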
---
class: normal, middle

## Autograd

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Automatic differentiation is really pretty fantastic. There are so many things where I would think “In principle that is differentiable, but there is no way in hell I am going to work it out, so I’ll do something else instead”, but it Just Works with AD.</p>— John Carmack (@ID_AA_Carmack) <a href="https://twitter.com/ID_AA_Carmack/status/1353027631130832896?ref_src=twsrc%5Etfw">January 23, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---
class: normal, middle

## example_01.R

Click [here](https://gist.github.com/dfalbel/10f2fca89dd1e7713be62785435e9064#file-example_01-r) for a link.

---
class: normal, middle

## Extensions

.pull-left[
- The autograd system can be extended if you need to add a function that can't be composed of other torch functions.
- See the documentation [here](https://torch.mlverse.org/docs/articles/extending-autograd.html).
]

.pull-right[
```r
mul_constant <- autograd_function(
  forward = function(ctx, tensor, constant) {
    ctx$save_for_backward(constant = constant)
    tensor * constant
  },
  backward = function(ctx, grad_output) {
    v <- ctx$saved_variables
    list(
      tensor = grad_output * v$constant
    )
  }
)

x <- torch_tensor(1, requires_grad = TRUE)
o <- mul_constant(x, 2)
o$backward()
x$grad
```

```
## torch_tensor
##  2
## [ CPUFloatType{1} ]
```
]

---
class: normal, middle

## Read more

.pull-left[
- Read Sigrid's blog post [introducing autograd](https://blogs.rstudio.com/ai/posts/2020-10-05-torch-network-with-autograd/) for further discussion.
- The white paper describing PyTorch's implementation of automatic differentiation is [here](https://openreview.net/forum?id=BJJsrmfCZ).
]

.pull-right[
]

---
class: normal, middle

## Neural network modules

---
class: normal, middle

## Neural network modules

.pull-left[
- All models and layers are built as `nn_module`s.
- Models and layers are functions that transform input data; they carry 'weights' (parameters) as their state.
- `nn_module`s make it easy to handle the state of a model.
- Modules are also a convenient way to reuse code.
]

.pull-right[
.panelset[
.panel[.panel-name[Without nn_module]
```r
Linear <- function(in_feat, out_feat) {
  w <- torch_randn(in_feat, out_feat)
  b <- torch_zeros(out_feat)
  function(input) {
    torch_mm(input, w) + b
  }
}
linear <- Linear(10, 1)
input <- torch_randn(2, 10)
linear(input)
```

```
## torch_tensor
## -0.2812
##  1.7096
## [ CPUFloatType{2,1} ]
```
]
.panel[.panel-name[With nn_module]
```r
Linear <- nn_module(
  initialize = function(in_feat, out_feat) {
    self$w <- nn_parameter(torch_randn(in_feat, out_feat))
    self$b <- nn_parameter(torch_zeros(out_feat))
  },
  forward = function(input) {
    torch_mm(input, self$w) + self$b
  }
)
linear <- Linear(10, 1)
input <- torch_randn(2, 10)
linear(input)
```

```
## torch_tensor
## -0.1271
## -0.2898
## [ CPUFloatType{2,1} ]
```
]
]
]

---
class: normal, middle

## Handling the state

.pull-left[
- It's easy to access the parameters of the model.
- It's easy to move the model parameters to the 'cuda' device, or back to the 'cpu'.
]

.pull-right[
```r
# list all parameters
str(linear$parameters)
```

```
## List of 2
##  $ w:Float [1:10, 1:1]
##  $ b:Float [1:1]
```

```r
# access parameters individually
linear$w
linear$b

# move the model to the cuda device
model$to(device = "cuda")
```
]
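---
class: normal, middle

## Saving and restoring models (a sketch)

.pull-left[
- A hedged sketch: modules (and tensors) can be written to disk and read back with `torch_save()` / `torch_load()`. The file name is just for illustration; see the serialization guide on the documentation website for the authoritative details.
]

.pull-right[
```r
linear <- Linear(10, 1)

# save the whole module to a file
torch_save(linear, "linear_model.pt")

# ... and load it back later
linear2 <- torch_load("linear_model.pt")
str(linear2$parameters)
```
]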
---
class: normal, middle

## It's all implemented in R

.pull-left[
- All modules in the torch package are implemented this way.
- There are many code examples for you to use and learn from.
]

.pull-right[
(The implementation of the [linear module](https://github.com/mlverse/torch/blob/master/R/nn-linear.R#L59-L84))
]

---
class: normal, middle

## Modules can handle sub-modules

.pull-left[
- Parameters of submodules are correctly tracked.
- Modules are a good abstraction for models, i.e. combinations of other modules (or layers).
- Some modules, like `nn_relu()`, have no parameters. In that case you could also use the `nnf_relu` function in the forward call.
]

.pull-right[
```r
mlp_module <- nn_module(
  initialize = function(in_feat, hidden_feat, out_feat) {
    self$fc1 <- nn_linear(in_feat, hidden_feat)
    self$relu <- nn_relu()
    self$fc2 <- nn_linear(hidden_feat, out_feat)
  },
  forward = function(input) {
    input %>%
      self$fc1() %>%
      self$relu() %>%
      self$fc2()
  }
)

mlp <- mlp_module(10, 20, 1)
str(mlp$parameters)
```

```
## List of 4
##  $ fc1.weight:Float [1:20, 1:10]
##  $ fc1.bias  :Float [1:20]
##  $ fc2.weight:Float [1:1, 1:20]
##  $ fc2.bias  :Float [1:1]
```
]

---
class: normal, middle

## Sequential models

.pull-left[
- You can use `nn_sequential` if the forward function of your model just calls all submodules in order and you don't need an initialize function.
- You can also have sequential models inside `nn_module`s.
]

.pull-right[
```r
mlp <- nn_sequential(
  nn_linear(10, 20),
  nn_relu(),
  nn_linear(20, 1)
)
str(mlp$parameters)
```

```
## List of 4
##  $ 0.weight:Float [1:20, 1:10]
##  $ 0.bias  :Float [1:20]
##  $ 2.weight:Float [1:1, 1:20]
##  $ 2.bias  :Float [1:1]
```
]

---
class: normal, middle

## The functional API

.pull-left[
- Most `nn_modules` use the corresponding functional interface in their implementation. For example, nn_relu uses nnf_relu and nn_conv2d uses nnf_conv2d.
- The functional version is usually the forward method of the `nn_module`.
- You can choose whichever interface you prefer and is better for your use case.
- **Note:** We have almost complete feature parity with PyTorch for the `nnf_*` functions, but not yet for the `nn_*` modules.
]

.pull-right[
```r
input <- torch_tensor(c(-1, 1))
nnf_relu(input)
```

```
## torch_tensor
##  0
##  1
## [ CPUFloatType{2} ]
```

```r
relu <- nn_relu()
relu(input)
```

```
## torch_tensor
##  0
##  1
## [ CPUFloatType{2} ]
```
]

---
class: normal, middle

## Optimizers

---
class: normal, middle

## Optimizers

.pull-left[
- Optimizers are torch's abstraction for defining the optimization step.
- They encapsulate the code responsible for updating the weights of a model.
- They are also implemented in R! See the SGD implementation [here](https://github.com/mlverse/torch/blob/master/R/optim-sgd.R).
- Most optimizers are quite simple to implement, but it gets tricky when the optimizer must store some state, like Adam, SGD with momentum and others.
]

.pull-right[
.panelset[
.panel[.panel-name[Without optim]
```r
parameters <- ...
learning_rate <- 0.001

for (parameter in parameters) {
  # you need to temporarily disable autograd tracking
  # as this operation is not part of the model training.
  with_no_grad({
    parameter$sub_(parameter$grad * learning_rate)
    parameter$grad$zero_()
  })
}
```
]
.panel[.panel-name[With optim]
```r
# the optimizer can keep track of state, etc.
optim <- optim_sgd(parameters, lr = 0.001)

optim$zero_grad()
... # loss backward..
optim$step()
```
]
]
]

---
class: normal, middle

## Optimizers

.pull-left[
- Many optimizers are already implemented thanks to [Krzysztof Joachimiak](https://github.com/krzjoa).
- If you want to contribute, comment [here](https://github.com/mlverse/torch/issues/147).
]

.pull-right[
- optim_sgd
- optim_adam
- optim_adagrad
- optim_adadelta
- optim_asgd
- optim_rmsprop
- optim_rprop
- optim_lbfgs
]

---
class: normal, middle

## Learning rate schedulers

.pull-left[
- Related to the optimizers, we also implemented some learning rate schedulers.
- Varying the learning rate during training is a common technique for faster convergence.
- See a learning rate scheduler in action in [this post](https://blogs.rstudio.com/ai/posts/2020-10-19-torch-image-classification/) by Sigrid.
]

.pull-right[
]
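---
class: normal, middle

## Learning rate schedulers (a sketch)

.pull-left[
- A hedged sketch of the scheduler API: a step scheduler decays the learning rate every `step_size` epochs. The constructor name and arguments (`lr_step`, `step_size`, `gamma`) are assumptions here; check the package reference for the schedulers that are actually available.
]

.pull-right[
```r
optim <- optim_sgd(mlp$parameters, lr = 0.1)

# halve the learning rate every 10 epochs
# (lr_step and its arguments are assumed, see the reference)
scheduler <- lr_step(optim, step_size = 10, gamma = 0.5)

for (epoch in 1:30) {
  # ... train for one epoch ...
  scheduler$step() # update the learning rate
}
```
]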
---
class: normal, middle

## Datasets

---
class: normal, middle

## Datasets

.pull-left[
- Your dataset may not fit completely in RAM, and that's fine, because you usually only need a single batch in RAM at a time.
- You need to provide implementations for 3 methods:
  - `initialize` takes the inputs for the dataset
  - `.getitem` receives an index and returns the corresponding data
  - `.length` returns the number of observations in the dataset
]

.pull-right[
```r
mydataset <- dataset(
  initialize = function(paths_to_imgs, labels) {
    self$paths <- paths_to_imgs
    self$labels <- labels
  },
  .getitem = function(i) {
    img <- jpeg::readJPEG(self$paths[i])
    list(x = img, y = self$labels[i])
  },
  .length = function() {
    length(self$paths)
  }
)

# initialize the dataset
ds <- mydataset(c("hello.jpg", "bye.jpg"), labels = c(0, 1))
ds[1]      # take the item with index 1 of the dataset
length(ds) # returns the length of the dataset
```
]

---
class: normal, middle

## Dataset

.pull-left[
- Your dataset's `initialize` function can do anything you want. A common pattern is to have it download the data and cache it in a local directory.
- It's also common for the initialize function to prepare the data in a format that can be easily consumed by `.getitem`.
- Currently we only support `map`-style datasets. We plan to support other kinds of datasets. Watch [this talk](https://www.youtube.com/watch?v=sCsPzVumtR8&list=PL_lsbAsL_o2BY-RrqVDKDcywKnuUTp-f3&index=6) for other examples and details.
]

.pull-right[
- The `.getitem` method can also do anything you want, including transforming and normalizing examples. For example, in torchvision we implement a large number of transforms that can be used for image data augmentation. These transforms are usually applied in this method.
- Some common uses of the `.getitem` method are: reading data from disk, subsetting data held in RAM, querying a database, and others.
- See a few examples of implemented datasets [here](https://github.com/mlverse/torchvision/blob/main/R/dataset-cifar.R), [here](https://github.com/mlverse/torchdatasets/blob/master/R/bird-species.R) and [here](https://github.com/curso-r/torchaudio/blob/master/R/dataset-speechcommands.R).
]

---
class: normal, middle

## Dataloaders

.pull-left[
- Dataloaders are a convenient way to pull batches from a dataset. They support any kind of dataset and can shuffle the data.
- You can easily iterate over a dataloader with the `enumerate` function.
- You can also use `dataloader_make_iter` and `dataloader_next` to iterate over it manually.
]

.pull-right[
```r
x <- torch_randn(100, 10)
y <- torch_randn(100, 1)
ds <- tensor_dataset(x = x, y = y)

dl <- dataloader(ds, batch_size = 50)

for (batch in enumerate(dl)) {
  str(batch$x)
  str(batch$y)
}
```

```
## Float [1:50, 1:10]
## Float [1:50, 1:1]
## Float [1:50, 1:10]
## Float [1:50, 1:1]
```
]

---
class: normal, middle

## Dataloaders

.pull-left[
- Dataloaders can make use of the `num_workers` argument to load batches in parallel.
- There are 3 main things that affect the performance of parallel dataloaders:
  - the number of workers
  - the time per batch, i.e. the time to run `.getitem` batch-size times
  - the size of the returned tensor
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-26-1.png" width="500px" height="350px" />
]

---
class: normal, middle

## A full toy example
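A hedged sketch putting the pieces together (dataset, dataloader, module and optimizer); the data, architecture and hyperparameters below are purely illustrative:

```r
x <- torch_randn(100, 10)
y <- torch_randn(100, 1)
dl <- dataloader(tensor_dataset(x = x, y = y), batch_size = 25)

model <- nn_sequential(nn_linear(10, 20), nn_relu(), nn_linear(20, 1))
optim <- optim_sgd(model$parameters, lr = 0.01)

for (epoch in 1:10) {
  for (batch in enumerate(dl)) {
    optim$zero_grad()
    loss <- nnf_mse_loss(model(batch$x), batch$y)
    loss$backward() # compute gradients
    optim$step()    # update the weights
  }
}
```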
---
class: normal, middle

## JIT

.pull-left[
- In torch 0.2.0 we added initial support for JIT ('just-in-time') compiling torch programs to TorchScript.
- Currently we only support *tracing* R functions. When tracing, we invoke an R function with example inputs and record all the operations that occur while the function runs.
]

.pull-right[
```r
w <- torch_randn(10, 1)
b <- torch_randn(1)

fn <- function(x) {
  a <- torch_mm(x, w)
  a + b
}

fn(torch_ones(2, 10))
```

```
## torch_tensor
##  1.4917
##  1.4917
## [ CPUFloatType{2,1} ]
```
]

---
class: normal, middle

## JIT

Now we use the `jit_trace` function to compile this R function into TorchScript:

```r
x <- torch_ones(2, 10)
tr_fn <- jit_trace(fn, x)
tr_fn(x)
```

```
## torch_tensor
##  1.4917
##  1.4917
## [ CPUFloatType{2,1} ]
```

```r
tr_fn$graph
```

```
## graph(%0 : Float(2:10, 10:1, requires_grad=0, device=cpu)):
##   %1 : Float(10:1, 1:1, requires_grad=0, device=cpu) = prim::Constant[value= 0.4607 -1.0476  1.0510  0.1441  1.4061 -0.9854 -0.3466 -0.1256  0.0654  1.9389 [ CPUFloatType{10,1} ]]()
##   %2 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::mm(%0, %1)
##   %3 : Float(1:1, requires_grad=0, device=cpu) = prim::Constant[value={-1.06925}]()
##   %4 : int = prim::Constant[value=1]()
##   %5 : Float(2:1, 1:1, requires_grad=0, device=cpu) = aten::add(%2, %3, %4)
##   return (%5)
```

---
class: normal, middle

## JIT

.pull-left[
- We can now save the traced function to disk with the `jit_save` function.
- Then we can reload the saved function in Python, just like any TorchScript program.
- It's also [possible to use](https://community.rstudio.com/t/r-model-serving-using-python-torchserve/94303/2?u=dfalbel) [torchserve](https://github.com/pytorch/serve) for high-performance environments.
]

.pull-right[
```r
jit_save(tr_fn, "linear.pt")
```

```python
import torch
fn = torch.jit.load("linear.pt")
fn(torch.ones(2, 10))
```

```
## tensor([[1.4917],
##         [1.4917]])
```
]

---
class: normal, middle

## Future work

- Currently only functions can be traced from R. You can trace `nn_modules` using some kind of [hack](https://github.com/mlverse/torch/blob/master/tests/testthat/test-trace.R#L27-L70).
- We will support tracing `nn_modules` in the future to enable speedups in training, as well as easy serialization of your model for deployment.

---
class: normal, middle

## Contributing

---
class: normal, middle

## Contributing

.pull-left[
- No matter what your current skills are, you can contribute to torch development.
- We have a few [open issues](https://github.com/mlverse/torch). Feel free to comment if you want to fix one of them! I will help as much as I can :)
- If you think the documentation is not clear or some details are missing, please open an issue! This helps a lot!
]

.pull-right[
- Also open issues for bug reports, feature requests and/or questions.
- If you are planning to add a new feature and don't know how to start, open an issue and we can discuss how to do it.
- You can also contribute extensions to torch!
]

---
class: normal, middle

## Extensions and support for torch

.pull-left[
- [torchvision](https://github.com/mlverse/torchvision) is an extension package for computer vision tasks. It implements many different transformations for image data, datasets and pre-trained models.
- [torchaudio](https://github.com/curso-r/torchaudio) is an extension package for audio-related tasks. It's developed by @athospd and already supports many functions for audio data transformation, datasets and models.
- The [targets](https://github.com/wlandau/targets) package already supports serializing torch models. targets provides function-oriented, Make-like declarative workflows for R.
]

.pull-right[
- We have been working on the [torchdatasets](https://github.com/mlverse/torchdatasets) package, which collects datasets that are useful for examples etc. but don't fit in the other, more specific packages.
- There's also the [tabnet](https://github.com/mlverse/tabnet) package that implements the TabNet model with a tidymodels-like interface.
- [lantern](https://github.com/tidymodels/lantern): a tidymodels interface for fitting multilayer perceptrons and linear models.
- Your idea?
]

---
class: normal, middle

## Future work

.pull-left[
* Better interop with PyTorch, interchanging tensors between both languages at zero cost.
* Improve performance, especially the performance of the dispatcher.
]

.pull-right[
* Better support for JIT tracing `nn_modules`.
* Support for ONNX. Similarly to the JIT, we should be able to trace models and export them to the ONNX format.
]

---
class: normal, middle

## Learn more

- The [Torch for R website](https://torch.mlverse.org) includes many tutorials and links to blog posts.
- The [RStudio AI blog](https://blogs.rstudio.com/ai/) contains many end-to-end examples. This is also where to get news about torch.
- The [Torch book](https://mlverse.github.io/torchbook) is a work-in-progress book about deep learning with torch.
- The [documentation website](https://torch.mlverse.org/docs/) has guides for serialization, indexing, and more.

---
class: middle, center, inverse

## Thanks very much!