How to Vectorize a Function in R

Last year I came across the base R function Vectorize(). It wraps a function that only handles a single value so that the function is applied element by element to a vector. Let's give an example.
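Before turning to the real use case, here is a minimal sketch of the idea in base R only. The function describe_sign() is a made-up example (it is not part of any package); it relies on if (), which expects a single condition and therefore cannot handle a whole vector.

```r
# Hypothetical scalar-only function: `if` expects exactly one TRUE/FALSE value
describe_sign <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

# Vectorize() returns a new function that calls describe_sign() once per element
describe_sign_v <- Vectorize(describe_sign)

result <- describe_sign_v(c(-2, 0, 3))
print(result)
```

The wrapped function returns one result per input element, which is exactly the behaviour we will need for hashing below.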

In one of my current research projects, I need to hash patient IDs to meet data privacy requirements. The digest package provides sha1(), a function that calculates a hash of an object. Let's see what the function does when we apply it to a column of the mtcars data frame:

First, we write the row names (names of the cars) into a new variable ('NAME'):

library(dplyr)
library(tibble)

data("mtcars")
mtcars <- mtcars %>%
  tibble::rownames_to_column('NAME')

Now, we assume that 'NAME' is the id variable we want to hash:

library(digest)

mtcars <- mtcars %>%
  mutate(HASH = sha1(NAME)) %>%
  select(NAME, HASH, mpg)

head(mtcars)
##                NAME                                     HASH  mpg
## 1         Mazda RX4 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 2     Mazda RX4 Wag cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 3        Datsun 710 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 22.8
## 4    Hornet 4 Drive cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.4
## 5 Hornet Sportabout cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.7
## 6           Valiant cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.1

As we can see, different car names received the same hash. This is not exactly what we want. It happened because the sha1() function is not vectorized.
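A quick way to spot this kind of problem is to compare the length of the output with the length of the input. The sketch below uses a made-up stand-in, whole_object_id(), instead of sha1(), so it runs in base R without any packages; the assumption is only that the function returns one value for the whole vector.

```r
# Hypothetical stand-in for a function that summarises the whole object at once
whole_object_id <- function(x) paste(x, collapse = "|")

ids <- c("Mazda RX4", "Datsun 710", "Valiant")

# One value comes back for three inputs: a sign the function is not vectorized
n_out <- length(whole_object_id(ids))
print(n_out)

# Assigning that single value to a column silently repeats it in every row
df <- data.frame(NAME = ids)
df$ID <- whole_object_id(df$NAME)
print(df)
```

If the output length is 1 while the input is longer, the single result is recycled across all rows, which matches what we saw in the HASH column.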

In the final step, we vectorize the sha1() function and apply it once again to the mtcars data frame:

sha1_vectorized <- Vectorize(digest::sha1)
mtcars <- mtcars %>%
  mutate(HASH = sha1_vectorized(NAME)) %>%
  select(NAME, HASH, mpg)

head(mtcars)
##                NAME                                     HASH  mpg
## 1         Mazda RX4 b22967895db5fb044febfaad31d34ccfc95f4440 21.0
## 2     Mazda RX4 Wag 45464747af0f4df66ee253bfef89d4b106cfb713 21.0
## 3        Datsun 710 785ba328b314246358feec3166fafa71bb724793 22.8
## 4    Hornet 4 Drive e1265538639ccf3f772038fe3db16aaaa28a4dd9 21.4
## 5 Hornet Sportabout 0b3f30b312e17c7c610399bf204ea9de2c71b96e 18.7
## 6           Valiant fe5206e3d182bff5748e295f9f78dba99ed0ec7f 18.1

Bingo! The vectorized version of sha1() did the job!

PS: Vectorizing a function makes it perform the same operation on every entry of a data structure, each time with that entry's value (see Win-Vector Blog). The non-vectorized sha1() function seems to treat the whole variable NAME as one object. Thus, it does not hash each entry separately but computes a single hash for all elements together, and that one value is then repeated in every row.
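Vectorize() is documented as a wrapper around mapply(), so the same result can also be obtained with an explicit apply-style call. The following sketch compares the two in base R; whole_object_id() is again a made-up stand-in for a non-vectorized function, not the real sha1().

```r
# Hypothetical stand-in for a function that is not vectorized
whole_object_id <- function(x) paste0("id:", paste(x, collapse = "|"))

ids <- c("a", "b")

# Default: for a character input, the result is named after the input values
v1 <- Vectorize(whole_object_id)(ids)

# USE.NAMES = FALSE drops those names
v2 <- Vectorize(whole_object_id, USE.NAMES = FALSE)(ids)

# Equivalent explicit loop; vapply() also checks that each result is one string
v3 <- vapply(ids, whole_object_id, character(1), USE.NAMES = FALSE)

print(v1)
print(identical(v2, v3))
```

Which form to use is mostly a matter of taste: Vectorize() gives a reusable function, while vapply() makes the expected return type explicit.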

About norbert

Biometrician at Clinical Trial Centre, Leipzig University (GER), with degrees in sociology (MA) and public health (MPH).