Performance Optimization in R: An Example Using Memoise

Author : Vincent Guyader
Tags : development, package, tips
Date :

Context

Whether during the (training sessions) I can provide or the (coaching) sessions I can conduct, there is often a moment when I go on a long rant about the power and usefulness of caching. How awesome it is, not so complicated to set up, and super interesting.

The last time I gave a presentation was to talk about ({targets}), but in this article, we’ll focus on a more “simple” package, yet extremely effective. We’ll talk about {memoise}, which will prove to be indispensable for your analyses or your Shiny applications.

{memoise} is a package that contains a function memoise(), and it is in this function that the main functionality is concentrated.

memoise::memoise() is a function that takes another function as a parameter and returns a function that will have the particularity of having a memory and not redoing the calculations for nothing. An example? Ok.

Demonstration

Let’s build calcul_long and its “memoized” version

calcul_long <- function(x){
  Sys.sleep(3)
  x*3
}
calcul_long_mem <- memoise::memoise(calcul_long)

If I execute calcul_long, then I should always wait 3 seconds for the result.

system.time({  calcul_long(x = 2) })
#>    user  system elapsed 
#>   0.018   0.005   3.000
system.time({  calcul_long(x = 3) })
#>    user  system elapsed 
#>   0.009   0.013   3.000
system.time({  calcul_long(x = 2) })
#>    user  system elapsed 
#>   0.017   0.008   3.000

However, the function calcul_long_mem will only make me wait 3 seconds the first time and will directly return the result the following times (if I ask it exactly the same thing of course).

system.time({  calcul_long_mem(x = 2) })
#>    user  system elapsed 
#>   0.016   0.005   3.001
system.time({  calcul_long_mem(x = 3) })
#>    user  system elapsed 
#>   0.010   0.013   3.001
system.time({  calcul_long_mem(x = 2) })
#>    user  system elapsed 
#>   0.001   0.000   0.000
system.time({  calcul_long_mem(x = 2) })
#>    user  system elapsed 
#>       0       0       0
system.time({  calcul_long_mem(x = 2) })
#>    user  system elapsed 
#>   0.001   0.000   0.000
system.time({  calcul_long_mem(x = 3) })
#>    user  system elapsed 
#>       0       0       0

Convenient, isn’t it?

A Persistent Cache

By default, the memory is temporary, limited to your R session, which is better than nothing, but we can do better and use disk storage to make the memory persistent.

To do this properly, we’ll use rappdirs::user_cache_dir() which will return a persistent cache folder consistently for all operating systems.

dossier_cache <- rappdirs::user_cache_dir("calcul_long_dir")
cd <- cachem::cache_disk(dir =  dossier_cache)
calcul_long_mem2 <- memoise::memoise(calcul_long, cache = cd)

And voilà!

Now calcul_long_mem2 will store all its knowledge in a disk folder, it will be shared between R sessions, today’s and tomorrow’s.

system.time({  calcul_long_mem2(x = 2) })
#>    user  system elapsed 
#>   0.001   0.000   0.001
system.time({  calcul_long_mem2(x = 3) })
#>    user  system elapsed 
#>   0.001   0.000   0.000
system.time({  calcul_long_mem2(x = 2) })
#>    user  system elapsed 
#>       0       0       0

Operation

If we analyze a little what just happened, then we notice that in the cache folder, .rds files accumulate, which contain the function’s memory.


fs::dir_tree(dossier_cache)
#> ~/.cache/calcul_long_dir
#> ├── 000c2eaa9359e329edb0fc4a01a8c0c5.rds
#> ├── 26a9a952a6f2b8c413604e732771d2e5.rds
#> ├── 2d6da9f223e1b062c5681f2e9f942ef8.rds
#> ├── 32c041e836202daa0a80ef4c7e269f15.rds
#> ├── 77252f4a89517c08a4f5f89de0440705.rds
#> ├── 830207659dfd7c44baa0f49d74d443db.rds
#> └── 8717feae48bb1499ca487fd679167cd6.rds

Upon the first execution of a function call, a new .rds is created, while for the subsequent ones, the corresponding rds is read and returned to the user.

Implementation

Here’s an implementation example,

Let’s say we have the function ocr_pdf_ which is capable of analyzing a pdf file to return lots of information, it’s rather slow to run and goes through a paid service. Let’s see how to implement in a package an optimized approach, in order to minimize execution time.


#' extract data from pdf
#'
#' @param file pdf file to parse
#' @param apikey token
#'
#' @return a list
#' @noRd
ocr_pdf_ <- function(file,apikey){
Sys.sleep(1)
  file
}
# memoised `ocr_pdf_` function
#'
#' @param memoise_cache_dir cache directory
#'
#' @return a list
#' @export
#' @importFrom cachem cache_disk
#' @importFrom memoise memoise
#' @importFrom rappdirs user_cache_dir
#' @noRd
ocr_pdf_mem <- function(memoise_cache_dir = getOption(
  "ocr_memoise_dir",
  default = rappdirs::user_cache_dir("ocr")
)
) {
  cd <- cachem::cache_disk(dir =  memoise_cache_dir)
  memoise::memoise(ocr_pdf_,cache = cd)
}
#' extract data from pdf
#' @param file pdf file to parse
#' @param apikey token
#' @export
#' @rdname ocr_pdf
ocr_pdf <- ocr_pdf_mem()
# Clean `ocr_pdf_mem` cache
#'
#' @param memoise_cache_dir cache directory
#'
#' @export
#' @importFrom rappdirs user_cache_dir
#' @rdname ocr_pdf
clean_memoise_cache_ocr <- function(memoise_cache_dir = getOption("ocr_memoise_dir",
                                                                  default = rappdirs::user_cache_dir("ocr"))){
  message("cleaning ",memoise_cache_dir," folder")
  unlink(memoise_cache_dir,recursive = TRUE,force = TRUE)
}

You will notice that the user can use the function ocr_pdf and that it will use a persistent cache, and if necessary the function clean_memoise_cache_ocr will erase this cache. A slightly “finer” user can directly call the function `ocr_pdf_mem` to play with the cache folders.

It’s up to you to be inspired for your use cases.

A More Complex Case

A little treat for the end, a bit more ambitious and exploratory, to give you some ideas.
How to use {memoise} to set up a cache system for Rmd rendering based on rmarkdown::render() ?

At first glance, this is not so trivial since the default behavior of memoise will be problematic:

  • In rmarkdown::render(input = "report.Rmd"), the different inputs do not change as such since the source path of the Rmd will always be the same, even if its content changes.
  • We use the function render() for its “side effect”, meaning that it does not return anything directly to the user and it generates an HTML on the disk.

To make memoise detect a change in the content of the source Rmd file, we will generate the hash of the content of this Rmd and pass it as a parameter to the function. There would be a lot to say about the concept of hash, if you are not familiar with this notion, remember that:

  • Two identical files will have the same hash,
  • The slightest comma added to the Rmd will change the hash,
  • And it is (almost) impossible for two different objects to have the same hash.

To generate this hash, we will use the {digest} package like this digest::digest(readLines("report.Rmd")).

Here’s a proposed implementation:

First, we’ll make sure that the function returns the HTML content to the user (that’s what we’ll save).

rmd_to_html_ <- function(input,hash=NULL
                         ,...
                         ){
  message("rendering")
  a <- rmarkdown::render(input = input,output_file =tempfile(), ...)
  out <- readLines(a)
  invisible(out)
}
file.copy(from = system.file("rmd","site","index.Rmd",package = "rmarkdown"),to="./a.Rmd")
#> [1] FALSE
a <- rmd_to_html_(input = "a.Rmd")
#> rendering
#> 
#> 
#> processing file: a.Rmd
#> 
  |                                                                                                                      
  |                                                                                                                |   0%
  |                                                                                                                      
  |................................................................................................................| 100%
#> output file: a.knit.md
#> /usr/lib/rstudio-server/bin/quarto/bin/tools/pandoc +RTS -K512m -RTS a.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp1j3QTy/file3ecda11c83384d.html --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp1j3QTy/rmarkdown-str3ecda14ba5eeeb.html
#> 
#> Output created: /tmp/Rtmp1j3QTy/file3ecda11c83384d.html
# avec cat on peut refabriquer le fichier
# cat(a,file = "test.html")
# browseURL("test.html")

with cat() we can rebuild the html file

cat(a,file = "test.html")
browseURL("test.html")

we implement a persistent cache

dossier_cache <- rappdirs::user_cache_dir("rmd_dir")
cd <- cachem::cache_disk(dir =  dossier_cache)
rmd_to_html <- memoise::memoise(rmd_to_html_,cache = cd)

we create a user-friendly function


compile_rmd <- function(input, output = tempfile(fileext = ".html")) {
  html_content <-
    rmd_to_html(input = input, hash = digest::digest(readLines(input)))
  cat(html_content, file = output)
  output
}

Tadaa


file.copy(from = system.file("rmd","site","index.Rmd",package = "rmarkdown"),to="./file1.Rmd")
#> [1] FALSE
compile_rmd(input = "file1.Rmd") 
#> [1] "/tmp/Rtmp1j3QTy/file3ecda12abf5b3a.html"
compile_rmd(input = "file1.Rmd") # instantané
#> [1] "/tmp/Rtmp1j3QTy/file3ecda11c2fd776.html"
cat("\n newline \n",file = "file1.Rmd",append = TRUE)
compile_rmd(input = "file1.Rmd") # le fichier a changé donc on relance la compilation
#> rendering
#> 
#> 
#> processing file: file1.Rmd
#> 
  |                                                                                                                      
  |                                                                                                                |   0%
  |                                                                                                                      
  |................................................................................................................| 100%
#> output file: file1.knit.md
#> /usr/lib/rstudio-server/bin/quarto/bin/tools/pandoc +RTS -K512m -RTS file1.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp1j3QTy/file3ecda12505b85e.html --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp1j3QTy/rmarkdown-str3ecda15c2c8b4c.html
#> 
#> Output created: /tmp/Rtmp1j3QTy/file3ecda12505b85e.html
#> [1] "/tmp/Rtmp1j3QTy/file3ecda148755619.html"
compile_rmd(input = "file1.Rmd") # instantané
#> [1] "/tmp/Rtmp1j3QTy/file3ecda167fd745d.html"

Okay, that’s fine, but what about image files that would need to be displayed in the HTML, in case we are not in standalone HTML?

We will rethink all this to manage the handling of physical dependencies, and the simplest will be to move all the files to a folder next to the HTML and adapt the source code.

Here’s a proposed implementation:

compile_rmd2 <-
  function(input, output = tempfile(fileext = ".html")) {
    html_content <-
      rmd_to_html(input = input, hash = digest::digest(readLines(input)))
    # we prepare the destination folder    
    dossier_destination <- dirname(output)
    dir.create(file.path(dossier_destination, "images"),showWarnings = FALSE,recursive = TRUE)
    images_to_handle <-  get_images_to_pimp(
      html_content = html_content,
      dossier_destination = dirname(output),
      folder_name = "images"
    )
    # we move the images
    move_images(images_to_handle)
    # we edit the html
    html_content %>% search_and_replace_in_txt(
      pattern =     images_to_handle %>% map_chr("source"),
      replacement = images_to_handle %>% map_chr("rewrite")
    ) %>%
      cat(file = output)
    output
  }

with the following tools:

#' @importFrom stringr str_replace_all
search_and_replace_in_txt <- function(txt,pattern, replacement){
  if (length(pattern)==0){return(txt)}
     motifs <- setNames(replacement,pattern)
     str_replace_all(string = txt, motifs)
}
#' @importFrom purrr map
#' @importFrom stringr str_subset
#' @importFrom rvest read_html html_nodes html_attr
get_images_to_pimp <- function(html_content,dossier_destination=".",folder_name ="images"){
      read_html(paste(html_content, sep = "", collapse = "")) %>%
      html_nodes("img") %>%
      html_attr("src") %>%
      stringr::str_subset(negate = TRUE, "^data") %>% 
      unique() %>%
      map( ~ list(
        source = .x,
        destination = file.path(dossier_destination, folder_name, basename(.x)),
        rewrite = file.path(".", folder_name, basename(.x))
        ))
}
#' @importFrom fs file_copy
#' @importFrom purrr map
move_images <- function(images_to_handle){
      images_to_handle %>%
      map(~file.copy(.$source,.$destination, overwrite = TRUE))
    }

That’s all for today.

Conclusion

And with all this, I hope you’re going to cache like never before!

header source image : https://alterconf.com/speakers/anjana-vakil


About the author

Vincent Guyader

Vincent Guyader

Codeur fou, formateur et expert logiciel R


Comments


Also read