Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.
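As a minimal sketch of the collapsing step mentioned above, the following base R snippet merges rare levels into a single "Others" level. The vector `x` and the threshold of 2 observations are made up for illustration and are not part of the `mtcars` example used in the rest of this post.

```r
# Hypothetical categorical variable with two rare levels ("c" and "d")
x <- c("a", "a", "a", "b", "b", "b", "c", "d")

# Count observations per level
counts <- table(x)

# Treat levels with fewer than 2 observations as rare (arbitrary threshold)
rare <- names(counts)[counts < 2]

# Collapse all rare levels into one "Others" level
x_collapsed <- ifelse(x %in% rare, "Others", x)
print(table(x_collapsed))
```

The threshold is a judgment call; what counts as "only a few observations" depends on the data at hand.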

To replicate the R code I’m going to use in this post, four R packages must be loaded:

```
library(dplyr)       # for data manipulation
library(stringr)     # for string manipulation (str_extract, str_replace)
library(ggplot2)     # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values
```

The dataset I’m going to use in this post (`mtcars`) is part of the `datasets` package.

```
head(mtcars)
```
```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

In the first code chunk, we:

• extract the first word of each car name and write it into a new variable called “brand”,
• rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
• calculate the median miles per gallon (mpg) for each car brand.
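```
df.mtcars <- mtcars %>%
  # first word of each row name, e.g. "Mazda RX4" -> "Mazda"
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         # brands starting with "M" become "Others"
         brand = str_replace(name, "^M\\w+", "Others")) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand
```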
```
##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"
## [19] "Valiant"  "Volvo"
```
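To make the two string steps easier to follow, here is a small base R sketch that applies the same regular expressions to a handful of row names. It uses `sub()` in place of the `stringr` helpers, so it is an equivalent illustration rather than the code used in this post; the vector `car_names` is a hand-picked subset of the `mtcars` row names.

```r
# A few row names as they appear in mtcars
car_names <- c("Mazda RX4", "Merc 240D", "Maserati Bora", "Fiat 128", "Hornet 4 Drive")

# Keep only the first word of each name
first_word <- sub("^(\\w+).*$", "\\1", car_names)

# Replace every first word starting with "M" by "Others"
brand <- sub("^M\\w+", "Others", first_word)

print(first_word)
print(brand)
```

Note that the pattern `^M\\w+` matches any brand starting with a capital "M", so it would also catch brands that are not meant to be collapsed if the data contained them.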

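The following code chunk reorders the levels of the `brand` variable using the `reorder()` function. In this example, the median mpg per brand takes the place of the level frequency, so the levels are sorted in ascending order of `mpg`.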

```
df.mtcars <- df.mtcars %>%
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg)) # sort levels by ascending mpg
levels(df.mtcars$brand)
```
```
##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari"
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
## [19] "Honda"    "Lotus"
```
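The effect of `reorder()` is easier to see on a tiny example. The sketch below uses a made-up data frame `toy` (not part of the original analysis) with one value per level; `reorder()` sorts the levels by the mean of the second argument within each level, in ascending order.

```r
# Hypothetical data: one mpg value per brand
toy <- data.frame(
  brand = c("A", "B", "C", "Others"),
  mpg   = c(30, 15, 22, 19)
)

# Levels start out in alphabetical order
toy$brand <- as.factor(toy$brand)
print(levels(toy$brand))

# reorder() sorts the levels by mpg (mean per level, ascending)
toy$brand <- reorder(toy$brand, toy$mpg)
print(levels(toy$brand))
```

With only one observation per level, the mean per level is simply the value itself, which is also the situation in the summarized `df.mtcars` data.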

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

```
ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL,
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency")
```

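To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as the reference category using the `relevel()` function. This moves “Others” to the first position of the levels, and because the plot uses `coord_flip()`, the first level is drawn at the bottom.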

```
df.mtcars <- df.mtcars %>%
  mutate(brand = relevel(brand, ref = "Others")) # move "Others" to the first level
levels(df.mtcars$brand)
```
```
##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari"
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"
## [19] "Honda"    "Lotus"
```
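As a quick check of what `relevel()` does, here is a minimal sketch on a made-up factor (the level names are hypothetical). Only the reference level moves to the front; the relative order of all other levels is left as it was.

```r
# Hypothetical factor whose levels are already sorted by some value
brand <- factor(c("A", "B", "C", "Others"),
                levels = c("B", "Others", "C", "A"))
print(levels(brand))

# Move "Others" to the first position; the remaining order is preserved
brand <- relevel(brand, ref = "Others")
print(levels(brand))
```

This is why the two steps are applied in this order: `reorder()` first sorts all levels, and `relevel()` then pulls out the single level that should ignore that ordering.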

Finally, the bar representing the “Others” level appears at the desired position.

```
ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL,
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency and level name")
```

PS: In both plots, the `gghighlight()` function of the `gghighlight` package was used to highlight the desired factor level.