Last year I came across the base R function Vectorize(), which vectorizes the action of a non-vectorized function. Let's look at an example.
In one of my current research projects, I need to hash patient IDs to fulfill the requirements of data privacy protection. The digest package contains a function, sha1(), that calculates a hash of an object. Let's see what the function does when we apply it to a column of the mtcars data frame.
First, we write the row names (names of the cars) into a new variable ('NAME'):
library(dplyr)
library(tibble)
data("mtcars")
mtcars <- mtcars %>%
  tibble::rownames_to_column('NAME')
Now, we assume that 'NAME' is the id variable we want to hash:
library(digest)
mtcars <- mtcars %>%
  mutate(HASH = sha1(NAME)) %>%
  select(NAME, HASH, mpg)
head(mtcars)

##                NAME                                     HASH  mpg
## 1         Mazda RX4 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 2     Mazda RX4 Wag cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 3        Datsun 710 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 22.8
## 4    Hornet 4 Drive cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.4
## 5 Hornet Sportabout cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.7
## 6           Valiant cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.1
As we can see, different car names received the same hash. This is not what we want. It happened because the sha1() function is not vectorized: it returns a single hash for the entire column, and mutate() then repeats that one value in every row.
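A quick way to spot this kind of problem is to compare the length of the input with the length of the result. The sketch below uses a hypothetical stand-in, toy_hash(), written in base R so that it runs without the digest package. It is not a real hash function and is not part of digest; it only mimics the behaviour of collapsing a whole vector into one value.

```r
# toy_hash() is a hypothetical stand-in for a non-vectorized hash function:
# it collapses the whole input vector into a single string (not a real hash).
toy_hash <- function(x) {
  codes <- utf8ToInt(paste(x, collapse = ""))
  as.character(sum(codes * seq_along(codes)))
}

ids <- c("Mazda RX4", "Datsun 710", "Valiant")
res <- toy_hash(ids)

length(ids)  # three inputs
length(res)  # but only one result for the whole vector

# data.frame(), like mutate(), silently recycles the single value
df <- data.frame(NAME = ids, HASH = res)
length(unique(df$HASH))  # every row carries the same value
```

If the result is shorter than the input, the function worked on the vector as a whole, and any recycling into a column hides that fact.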
In the final step, we vectorize the sha1() function and apply it once again to the mtcars data frame:
sha1_vectorized <- Vectorize(digest::sha1)
mtcars <- mtcars %>%
  mutate(HASH = sha1_vectorized(NAME)) %>%
  select(NAME, HASH, mpg)
head(mtcars)

##                NAME                                     HASH  mpg
## 1         Mazda RX4 b22967895db5fb044febfaad31d34ccfc95f4440 21.0
## 2     Mazda RX4 Wag 45464747af0f4df66ee253bfef89d4b106cfb713 21.0
## 3        Datsun 710 785ba328b314246358feec3166fafa71bb724793 22.8
## 4    Hornet 4 Drive e1265538639ccf3f772038fe3db16aaaa28a4dd9 21.4
## 5 Hornet Sportabout 0b3f30b312e17c7c610399bf204ea9de2c71b96e 18.7
## 6           Valiant fe5206e3d182bff5748e295f9f78dba99ed0ec7f 18.1
Bingo! The vectorized version of sha1() did the job: every car name now has its own hash.
PS: Vectorizing a function makes it perform the same operation on every entry of a data structure, each time with that entry's own value (see Win-Vector Blog). The non-vectorized sha1() function treats the variable NAME as one object rather than as a collection of values. Thus, it does not hash each entry separately but hashes the vector as a whole.
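To make the element-wise idea concrete, here is a minimal sketch comparing Vectorize() with the same loop written out by hand using vapply(). The function toy_hash() is again a hypothetical base R stand-in (not a real hash and not part of digest), used only so the example runs without extra packages.

```r
# toy_hash() is a hypothetical stand-in (not a real hash): it collapses
# its whole input into one string, like a non-vectorized hash function.
toy_hash <- function(x) {
  codes <- utf8ToInt(paste(x, collapse = ""))
  as.character(sum(codes * seq_along(codes)))
}

ids <- c("Mazda RX4", "Datsun 710", "Valiant")

# Vectorize() returns a wrapper that calls toy_hash() once per element
toy_hash_v <- Vectorize(toy_hash)
via_vectorize <- toy_hash_v(ids)

# The same element-wise call written out by hand
via_vapply <- vapply(ids, toy_hash, character(1))

# Both approaches yield one value per input element
same_values <- identical(unname(via_vectorize), unname(via_vapply))

# With character input, the result carries the input strings as names;
# unname() drops them if a plain vector is preferred
plain <- unname(via_vectorize)
```

The vapply() form also states the expected result type, character(1), so an element that returns something else raises an error instead of passing through silently.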