R data.table and Apple M1: installation on Big Sur with OpenMP multithreading support

Introduction

The official release of R 4.1.0 has brought long-awaited native support for Apple M1 to R/Mac users. While the Rosetta engine provided by Apple does its job very well, we are always eager for more performance and less time spent on our data wrangling tasks. Unfortunately, there is still a lack of information about working in the new environment. A basic installation of data.table on macOS produces the following notification:

**********
This installation of data.table has not detected OpenMP support. 
It should still work but in single-threaded mode. 
If this is a Mac, please ensure you are using R>=3.4.0 and have followed our Mac instructions here:
https://github.com/Rdatatable/data.table/wiki/Installation. 
This warning message should not occur on Windows or Linux. If it does, please file a GitHub issue.
**********

The notification points to the data.table installation guide, which describes several steps required to enable multithreading. The underlying reason is that Apple's toolchain does not provide OpenMP support, so data.table has to be compiled from source on the device, using the command:

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.gitlab.io/data.table")

The guide offers several options for preparing macOS for installation from source. For my Rosetta installation, I prefer to use the compiler: GCC (Official GNU Fortran) ver mainly because of its smaller disk space requirements. Unfortunately, this option turns out to be useless for a native aarch64 installation. After several attempts with the different options in the guide, I came to the conclusion that none of them works for the aarch64 case as described. Moreover, searching the internet for a solution to this case turned up nothing.

Solution

The solution below is experimental; use it at your own risk. I highly recommend keeping the previous R installation alongside the new one, in case compiling other packages (for example stringi or Rcpp) causes unexpected problems.

The llvm option will be used. The steps follow the data.table guide, except for Step 0.

Step 0 (prepare RStudio)

First of all, you need to install the preview release of RStudio, which supports Apple Silicon (aarch64).

Step 1 (100% according to the guidance)

Now, ensure that you have command line tools installed. Do NOT skip this step. It is essential. See https://github.com/Rdatatable/data.table/issues/1692. From the terminal, type:

xcode-select --install

If you get the error message xcode-select: error: command line tools are already installed, use "Software Update" to install updates, then you already have command line tools and can proceed to the next step. Otherwise, follow the onscreen instructions and install them first.

Step 2 (100% according to the guidance)

Then, install homebrew if you have not already. After that, we can install the OpenMP enabled clang from the terminal by typing:

# update: seems like this installs clang with openmp support, 
# as pointed out by @botanize in #1817
brew update && brew install llvm

Note that Homebrew installs the arm versions of packages in a separate location, /opt/homebrew, so we need to reconfigure our build environment accordingly.
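
Before editing any configuration, it can help to confirm from R that the session really is a native arm build and that the Homebrew llvm sits where this guide expects it. The snippet below is only a minimal sketch: the llvm path is the one used later in Makevars, and the variable names are my own.

```r
# Architecture this R build targets: "aarch64" for a native Apple Silicon
# build, "x86_64" for an Intel build running under Rosetta.
arch <- R.version$arch
is_native_arm <- identical(arch, "aarch64")

# Path assumed by this guide for the Homebrew (arm) llvm installation.
llvm_loc <- "/opt/homebrew/opt/llvm"
clang_path <- file.path(llvm_loc, "bin", "clang")
has_brew_clang <- file.exists(clang_path)

message("R build architecture: ", arch)
message("Native arm build: ", is_native_arm)
message("Homebrew clang found at ", clang_path, ": ", has_brew_clang)
```

If the last line reports FALSE on an M1 machine, llvm was probably installed by an Intel Homebrew into a different prefix.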

Step 3 (modified guidance)

Add the following lines to the file ~/.R/Makevars using your favourite text editor. You will likely need to create the .R directory and the Makevars file in it if they do not already exist.

# if you downloaded llvm manually above, replace with your chosen NEW_PATH/clang
LLVM_LOC = /opt/homebrew/opt/llvm
CC=$(LLVM_LOC)/bin/clang -fopenmp
CXX=$(LLVM_LOC)/bin/clang++ -fopenmp
# -O3 should be faster than -O2 (default) level optimisation ..
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++11 -mtune=native -pipe
LDFLAGS=-L/opt/homebrew/opt/gettext/lib -L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
CPPFLAGS=-I/opt/homebrew/opt/gettext/include -I$(LLVM_LOC)/include
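
If the .R directory or the Makevars file is missing, they can also be created from R rather than by hand. This is a minimal sketch: ensure_makevars() is a hypothetical helper name, not part of R or data.table, and the example call points at a temporary directory so that nothing in the real home directory is touched.

```r
# Hypothetical helper: create <home>/.R/Makevars if it does not exist yet
# and return its path. An existing file is left unchanged.
ensure_makevars <- function(home = path.expand("~")) {
  r_dir <- file.path(home, ".R")
  if (!dir.exists(r_dir)) {
    dir.create(r_dir, recursive = TRUE)
  }
  makevars <- file.path(r_dir, "Makevars")
  if (!file.exists(makevars)) {
    file.create(makevars)
  }
  makevars
}

# Dry run against a temporary "home" directory.
tmp_home <- tempfile("home_")
makevars_path <- ensure_makevars(tmp_home)
message("Makevars path: ", makevars_path)
```

Calling ensure_makevars() with no arguments would act on the real ~/.R/Makevars, which can then be opened in an editor to paste in the lines above.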

The only difference between the configuration above and the original one is that the compiler paths point to /opt/homebrew/.... With that, all the necessary configuration is done and the package is ready to be installed from source. Use the following command: install.packages("data.table", type = "source")

Performance tests
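
Once the installation from source has finished, one way to check whether the build picked up OpenMP is to ask data.table how many threads it will use. The sketch below relies on getDTthreads(), which with verbose = TRUE also prints the OpenMP details it detected. The guard around the call and the n_threads variable are my own additions.

```r
# Guarded so the snippet also runs where data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  # Prints OpenMP-related details and returns the number of threads
  # data.table is currently configured to use.
  n_threads <- data.table::getDTthreads(verbose = TRUE)
  if (n_threads > 1L) {
    message("data.table will use ", n_threads, " threads.")
  } else {
    message("data.table is running single-threaded; ",
            "the OpenMP build probably did not succeed.")
  }
} else {
  n_threads <- NA_integer_
  message("data.table is not installed.")
}
```

A result of 1 on a multi-core machine, together with the startup notification quoted in the introduction, suggests the compiler settings in Makevars were not used.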

Is it worth migrating from Rosetta to native support? While there are difficulties with installation and managing several lib directories, the benefits appear to be valuable. Benchmark results are provided below.

Hardware

  1. MacBook Air (M1, 2020) | Memory 16 Gb | Big Sur 11.4

  2. Core(TM) i7-7700 CPU | Memory 44 Gb | Ubuntu 20.04.2 LTS

Ubuntu is running as a virtualized instance on a remote server. Both systems use fast SSDs and were tested with 4-way multithreading enabled. All tests were run in a fresh R session (see also the sessionInfo() output).
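
The post does not show how the thread count was set. A minimal sketch, assuming it was done through data.table's own setDTthreads() rather than through an environment variable:

```r
# Guarded so the snippet also runs where data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  # setDTthreads() returns the previous setting invisibly.
  old_threads <- data.table::setDTthreads(4)
  threads_now <- data.table::getDTthreads()
  message("data.table threads: ", threads_now,
          " (previously ", old_threads, ")")
} else {
  threads_now <- NA_integer_
  message("data.table is not installed.")
}
```

On a machine with fewer than four logical CPUs, or with a build that lacks OpenMP, the reported value will be lower than 4.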

Data generation

I believe that performing simple mathematical calculations using base R and incorporating the most useful functions from the data.table package would be an appropriate profile for emulating real tasks. While this approach may not be comprehensive and entirely objective, it provides a practical assessment based on your specific needs.

library(bench)
library(data.table) 

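# n is not defined in the original listing; 1e8 is an assumed value,
# inferred from the mem_alloc figures reported in the results below
n <- 1e8

smpl0 <- rexp(n = 5e6, rate = 3) # for R base iteration
smpl1 <- data.table(fctr = sample(letters, n, replace = TRUE), num1 = rnorm(n, 3, 4), num2 = sample(1:100, n, replace = TRUE), num3 = runif(n, 0, 100))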
smpl2 <- data.table(fctr = sample(letters, 26), num1 = rnorm(26, 3, 4), num2 = sample(1:100, 26, replace = TRUE), num3 = runif(26, 0, 100))

# base R simple benchmarking 
mark(min_time = .1, min_iterations = 50,
  lapply(smpl0, log),
  purrr::map(smpl0, log), # some additional time needed to attach function
  as.list(log(smpl0)))

# data.table common function usage
mark(min_time = .1, min_iterations = 50, check = FALSE, 
             smpl1[, lapply(.SD, mean), .SDcols = is.numeric],
             smpl1[smpl2, on = "fctr"], 
             uniqueN(smpl1, by = c("fctr", "num2")))

Iteration with base R and purrr package

Comparable Intel hardware running Ubuntu
  expression                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 lapply(smpl, log)         1.2s    1.44s     0.554    38.1MB    0.853    50    77       1.5m
2 purrr::map(smpl, log)    4.89s    6.72s     0.141    38.1MB    0.761    50   270      5.92m
3 as.list(log(smpl))    124.93ms  215.6ms     2.78     76.3MB    1.06     50    19        18s

Rosetta emulation mode
  expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 lapply(smpl0, log)     978.31ms    1.33s     0.569    38.1MB    0.762    50    67      1.47m
2 purrr::map(smpl0, log)    3.84s    5.23s     0.169    38.1MB    0.790    50   234      4.94m
3 as.list(log(smpl0))     95.75ms 132.74ms     3.23     76.3MB    0.968    50    15      15.5s

Rosetta emulation holds up well: its median times are lower than those of the comparable Intel platform, by about 8% for lapply and by more for the other two expressions.

Native arm mode
# A tibble: 3 x 9
  expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 lapply(smpl0, log)     699.88ms 763.53ms     1.15     38.1MB     1.15    50    50     43.44s
2 purrr::map(smpl0, log)    2.56s    3.68s     0.239    38.4MB     1.08    50   226      3.49m
3 as.list(log(smpl0))     67.98ms  76.13ms     6.23     76.3MB     2.24    50    18      8.02s

Native support shows more solid results. Measured in iterations per second, it is roughly twice as fast as Rosetta mode for lapply and as.list (about 1.4 times for purrr::map). Very promising!
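
To make the comparison concrete, the speed-ups can be recomputed from the median column of the three tables above. This is only a small worked example: the median times are copied from the reported output and converted to seconds, and the data frame and column names are my own.

```r
# Median run times in seconds, taken from the three base R tables above.
median_s <- data.frame(
  expression = c("lapply", "purrr::map", "as.list(log())"),
  intel      = c(1.44, 6.72, 0.2156),
  rosetta    = c(1.33, 5.23, 0.13274),
  native     = c(0.76353, 3.68, 0.07613)
)

# A ratio above 1 means the second platform is faster than the first.
median_s$rosetta_vs_intel  <- round(median_s$intel / median_s$rosetta, 2)
median_s$native_vs_rosetta <- round(median_s$rosetta / median_s$native, 2)

print(median_s[, c("expression", "rosetta_vs_intel", "native_vs_rosetta")])
```

By median time the native build comes out about 1.4 to 1.7 times faster than Rosetta, somewhat less than the ratio of iterations per second suggests, because the itr/sec column also accounts for time spent in garbage collection.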

Some wrangling with data.table

Comparable Intel hardware running Ubuntu
# A tibble: 3 x 9
  expression                                            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 smpl1[, lapply(.SD, mean), .SDcols = is.numeric] 485.05ms  487.4ms     2.02      1.8MB   0         50     0     24.74s
2 smpl1[smpl2, on = "fctr"]                           4.58s    5.46s     0.181    5.96GB   0.184     50    51      4.61m
3 uniqueN(smpl1, by = c("fctr", "num2"))           720.81ms    1.33s     0.779   381.5MB   0.0623    50     4      1.07m

Rosetta emulation mode
# A tibble: 3 x 9
  expression                                            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 smpl1[, lapply(.SD, mean), .SDcols = is.numeric]   25.29s   25.85s    0.0365   81.02KB    0        50     0     22.86m
2 smpl1[smpl2, on = "fctr"]                           4.57s    4.96s    0.199     5.96GB    0.199    50    50      4.18m
3 uniqueN(smpl1, by = c("fctr", "num2"))           570.96ms 596.29ms    1.49    381.48MB    0.119    50     4     33.62s

data.table functions show contradictory results:

  • The aggregation function reveals a significant gap between the Intel-based system and Rosetta mode. Rosetta appears to be slower by approximately 50 times. This is an intriguing result and is not easily explained. To validate this observation, I reset the R session and repeated the test several times, yet the results remained consistently similar.

  • Left join seems to be faster for Rosetta, just as it is for the base R level, with an improvement of 5-10%.

  • Calculating unique observations is more than two times faster on Rosetta than on the Intel-based system.

Native arm mode
# A tibble: 3 x 9
  expression                                           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 smpl[, lapply(.SD, mean), .SDcols = is.numeric] 462.24ms    464ms     2.12    81.02KB    0        50     0     23.63s
2 smpl1[smpl2, on = "fctr"]                          4.61s     4.9s     0.205    5.96GB    0.410    50   100      4.07m
3 uniqueN(smpl1, by = c("fctr", "num2"))          518.12ms  577.3ms     1.51   381.48MB    0.121    50     4     33.19s

Interestingly, native mode shows only a slight advantage over Rosetta mode for the left join and the unique count. Aggregation performance, however, improves dramatically and ends up slightly ahead of the Intel-based system (by approximately 5%).

sessionInfo()

Comparable Intel hardware running Ubuntu
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=ru_RU.UTF-8     LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=ru_RU.UTF-8   
 [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.6       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.3     tidyverse_1.3.1  
[10] data.table_1.14.0 bench_1.1.1   

Rosetta emulation mode
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.14.1 bench_1.1.1       forcats_0.5.1     stringr_1.4.0     dplyr_1.0.5       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.1.1     
[10] ggplot2_3.3.3     tidyverse_1.3.1 

Native arm mode
R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-aarch64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.6       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.3     tidyverse_1.3.1  
[10] data.table_1.14.1 bench_1.1.1      

Conclusions

  1. Running R and data.table in native mode is still experimental, and there is little information available on adapting a macOS environment to the new requirements.

  2. Despite being experimental, native mode provides a substantial performance improvement over Rosetta mode for base R, with calculations running up to about twice as fast.

  3. The surprising, drastic slowdown of the data.table aggregation function in Rosetta mode disappears in native mode.

Recommendations

  1. Be prepared for potential issues as running R and data.table in native mode is experimental.

  2. Keep an older R version as a backup plan to ensure continuity.

  3. Consider using RSwitch for seamless switching between the old and new R versions.