Intro
Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.
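The idea described above can be sketched in base R before turning to the actual example. This is only a minimal illustration: the toy vector `x`, the threshold of two observations, and all variable names are assumptions made up for this sketch, not part of the original analysis. Here "last" means last in the level order.

```r
# Hypothetical toy vector of categories (not taken from mtcars)
x <- c("a", "a", "a", "b", "b", "c", "d")

counts <- table(x)                    # a = 3, b = 2, c = 1, d = 1
rare   <- names(counts)[counts < 2]   # levels with fewer than 2 observations

# Collapse the rare levels into a single "Others" level
collapsed <- ifelse(x %in% rare, "Others", x)

# Order the levels by decreasing frequency, then move "Others" to the end
freq <- sort(table(collapsed), decreasing = TRUE)
lvls <- c(setdiff(names(freq), "Others"), "Others")
f    <- factor(collapsed, levels = lvls)

print(levels(f))
print(table(f))
```

Building the level vector by hand keeps the sketch free of package dependencies; the rest of the post uses reorder() and relevel() for the same purpose.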
To replicate the R code I’m going to use in this post, four R packages must be loaded:
library(dplyr)       # for data manipulation
library(stringr)     # for string manipulation (str_extract(), str_replace())
library(ggplot2)     # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values
The dataset I’m going to use in this post (mtcars) is part of the datasets package.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
In the first code chunk, we:
- extract the first word of each car name and write it into a new variable called “brand”,
- rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
- calculate the median miles per gallon (mpg) for each car brand.
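df.mtcars <- mtcars %>%
  # first word of each row name, e.g. "Mazda RX4" -> "Mazda"
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         # collapse every brand starting with "M" into "Others"
         brand = str_replace(name, "^M\\w+", 'Others')) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand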
##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"
## [19] "Valiant"  "Volvo"
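To make the two regular expressions easier to follow, here is a small base-R sketch of the same extract-and-replace step. It deliberately swaps the stringr calls for regmatches()/regexpr() and sub() so that it runs without extra packages; the vector `car_names` is a hand-picked subset of the mtcars row names and the variable names are assumptions for illustration only.

```r
# A few row names from mtcars, picked by hand for illustration
car_names <- c("Mazda RX4", "Merc 240D", "Hornet 4 Drive", "Fiat 128", "AMC Javelin")

# Base-R counterpart of str_extract(): keep the first run of word characters
first_word <- regmatches(car_names, regexpr("^\\w+", car_names, perl = TRUE))

# Base-R counterpart of str_replace(): any word starting with "M" becomes "Others"
brand <- sub("^M\\w+", "Others", first_word, perl = TRUE)

print(first_word)
print(brand)
```

Note that the pattern "^M\\w+" matches any brand starting with a capital "M", so it only works as a collapsing rule because those are exactly the brands chosen for the "Others" level here.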
The following code chunk converts the brand variable to a factor and uses the reorder() function to sort its levels by mpg, which stands in for level frequency in this example.
df.mtcars <- df.mtcars %>%
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg))
levels(df.mtcars$brand)
##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari"
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
## [19] "Honda"    "Lotus"
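What reorder() does is easier to see on a tiny example. The sketch below uses a made-up data frame `toy` with hypothetical values (not taken from mtcars); it only relies on base R.

```r
# Hypothetical toy data, one row per brand
toy <- data.frame(
  brand = c("B", "A", "C"),
  mpg   = c(30, 10, 20)
)

toy$brand    <- as.factor(toy$brand)
alphabetical <- levels(toy$brand)          # default: alphabetical level order

# reorder() sorts the levels by a summary of the second argument
# (the mean by default), in increasing order
toy$brand <- reorder(toy$brand, toy$mpg)
by_value  <- levels(toy$brand)

print(alphabetical)
print(by_value)
```

Only the level order changes; the values stored in the column stay the same, which is why the rows of the data frame do not move.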
As we can see, the bar representing the “Others” level is roughly in the middle of the plot.
ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, y = 'Miles per Gallon', title = "Factor variable ordered by level frequency")
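To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as the reference category using the relevel() function. This makes “Others” the first level, and with coord_flip() the first level is drawn at the bottom.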
df.mtcars <- df.mtcars %>%
  mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars$brand)
##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari"
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
## [19] "Honda"    "Lotus"
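The effect of relevel() can be checked in isolation with a small base-R sketch. The factor `brand_toy` and its level names are made up for illustration.

```r
# Hypothetical factor with "Others" sitting in the middle of the level order
brand_toy <- factor(c("A", "Others", "B"), levels = c("A", "Others", "B"))

# relevel() moves the reference level to the front and keeps
# the remaining levels in their existing order
releveled <- relevel(brand_toy, ref = "Others")

print(levels(brand_toy))
print(levels(releveled))
```

Because the remaining levels keep their order, the ordering produced earlier by reorder() survives the call and only “Others” changes position.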
Finally, the bar representing the “Others” level appears at the desired position.
ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, y = 'Miles per Gallon', title = "Factor variable ordered by level frequency and level name")
PS: In both plots, the gghighlight() function of the gghighlight package was used to highlight the desired factor level.
Nice post! Have you checked out Hadley Wickham’s forcats package? It can automatically create “Others” categories and offers helpful functions for ordering and re-ordering factor levels. See fct_lump() to create “Others” automatically, or fct_collapse() for manual specifications.
Thanks a lot, Wolf! 🙂