The goal of this blog post is to introduce ggplot2, a powerful R package for plotting, and to explain some of what it can do to create effective visualizations of experimental data. After reading this post, you should be able to adapt the parameters laid out here and write code of your own. Any set of tabular data can be visualized with ggplot2.

General Format

ggplot2 has three basic components: the data set, which is best provided as a data.frame; the aesthetic mappings (not to be confused with coordinates!), which assign columns of the data to visual properties such as x, y, and, for the few geoms that use it, z; and finally the geom, which is the visual representation of the data on the graph. Examples of ggplot2 geoms are bar plots, box-and-whisker plots, histograms, and even shapes with the right data set! It is also important to include the function ggsave to save the output of the code in your choice of format (png is recommended here). The following code chunk shows this:

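# First step towards making a plot
library(ggplot2)

# Column names go inside aes() unquoted; quoted names are treated as literal text, not columns
nice_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value))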

# Save plot here with your filename
ggsave(plot = nice_plot,
  filename = "output/figures/nice_plot.png")

Note that ggsave defaults to the last plot displayed, so to prevent any possible confusion, it is best practice to explicitly name the plot you want to save as a png.
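As a minimal, self-contained sketch of these pieces working together: the data frame demo_data, its column names, and the temporary output file below are made up for illustration, geom_point is only a placeholder geom so the saved image is not empty, and the width, height, and dpi values are arbitrary choices.

```r
library(ggplot2)

# Hypothetical example data: one row per measurement
demo_data <- data.frame(
  dose     = c(1, 2, 3, 4, 5),
  response = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Data + aesthetic mappings + a geom
demo_plot <- ggplot(demo_data, aes(x = dose, y = response)) +
  geom_point()

# Save to a temporary file; width and height are in inches by default
out_file <- file.path(tempdir(), "demo_plot.png")
ggsave(filename = out_file, plot = demo_plot,
       width = 6, height = 4, dpi = 300)
```

Naming the plot in the plot argument means the right figure is saved even if other plots were drawn in between.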

Building the Plot

The first part of the ggplot code takes the following arguments: the dataset you’re working with (a data frame, typically read in from a file such as a .csv) and aes (short for aesthetics), which maps columns of the data to x and y (and z, for the geoms that use it). It will look like this:

# First step towards making a plot
nice_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value))

# Note: the x and y here refer to the columns that go on each axis, not the labels for the axes!

If you do need a z value (contour plots, for example, use one), you can go back to the aes section and map z to a column in your data set. Note that we have not included the geom type yet: it is added as a separate layer, joined to the plot with a +. The great aspect of this is that if you aren’t happy with your first attempt at visualizing your data, you can swap in a different geom without altering your aesthetic mappings.
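The idea of keeping the base plot separate from the geom can be sketched as follows. The data frame and column names here are hypothetical, and the two geoms are picked only to show that the same base can be reused.

```r
library(ggplot2)

# Hypothetical data: two measurements per treatment
demo_data <- data.frame(
  treatment = c("a", "a", "b", "b", "c", "c"),
  yield     = c(4.2, 4.8, 5.9, 6.3, 3.1, 3.6)
)

# Base: data + aesthetic mappings only, no geom yet
base_plot <- ggplot(demo_data, aes(x = treatment, y = yield))

# The same base with two different geoms added as layers
as_points <- base_plot + geom_point()
as_boxes  <- base_plot + geom_boxplot()

# The base has no layers of its own; each variant has exactly one
length(base_plot$layers)
length(as_points$layers)
```

Because base_plot is never modified, you can try as many geoms as you like without retyping the data or the aes mappings.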

Adding the Geoms

1. Bar Plots

There are many types of geoms in ggplot2, and they all follow the same basic syntax. Let us start with a common and very useful type of visualization: the bar graph. There are two geoms for this: geom_bar and geom_col. We will go over both of these in the following code chunk since they share most of their parameters. The default value of each parameter is included and explained below.

# Taking the same data set from the last code chunk
# geom_bar counts the rows itself, so only x is mapped here

nice_plot <- ggplot(nice_data, aes(x = independent_value)) +
  geom_bar(
    mapping = NULL,
    data = NULL,
    stat = "count",
    position = "stack",
    width = NULL,
    na.rm = FALSE,
    orientation = NA,
    show.legend = NA,
    inherit.aes = TRUE)

# geom_col has no stat argument (it always plots the data as is)
# and needs both x and y mapped

mapping holds aesthetics specific to this layer; when left as NULL, the layer uses the aes from the ggplot() call. Likewise, data set to NULL means the layer uses the dataset passed to ggplot(). The stat counts the rows for each x or y value, depending on your preferred orientation, so the result resembles a histogram for categorical data (note that there is a separate geom_histogram() for continuous data). width refers to the width of each bar in the plot; you might have to adjust this depending on how crowded your axis is. A good test is simply whether you can match each bar with its value on the axis. If you cannot, go for a smaller width.

na.rm refers to missing values: if set to TRUE, they are silently removed; if set to FALSE, they are removed with a warning. orientation refers to which axis the bars run along: if set to NA, ggplot2 will figure it out on its own. This parameter can be set to “x” or “y” if that fails for any reason. show.legend controls whether the layer appears in the legend: when set to NA, the layer is included whenever one of its aesthetics, such as fill or colour, is mapped to a column (x and y are described by the axes, not the legend). inherit.aes controls whether the layer uses the aes values from the first section of the code. position is most often “stack”, where bars that share an x value are placed on top of each other, or “dodge”, where they are placed side by side.
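The effect of position and width is easiest to see when a third column is mapped to fill. The following is only a sketch: the data frame demo_counts and its columns are invented for illustration, and 0.6 is an arbitrary width.

```r
library(ggplot2)

# Hypothetical data: counts already summarised per site and year
demo_counts <- data.frame(
  site = rep(c("north", "south"), each = 2),
  year = rep(c("2022", "2023"), times = 2),
  n    = c(12, 15, 8, 11)
)

base_plot <- ggplot(demo_counts, aes(x = site, y = n, fill = year))

# Default: bars that share an x value are stacked on top of each other
stacked <- base_plot + geom_col(position = "stack")

# Side by side instead, with narrower bars
dodged <- base_plot + geom_col(position = "dodge", width = 0.6)

# layer_data() returns the computed bar coordinates for inspection
head(layer_data(dodged))
```

Since fill is mapped to a column, both plots get a legend for year without any change to show.legend.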

The difference between geom_bar and geom_col is in what the bar heights represent: geom_bar makes the height of each bar proportional to the number of cases in each group (like a histogram), while geom_col makes the height represent a value already in the dataset. If you want to count how often each category occurs, use geom_bar. For visualizing the values in your dataset as is, use geom_col.
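To make the distinction concrete, here is a small sketch in which both geoms produce the same bars from two differently shaped versions of the same made-up data. The data frames raw_obs and summary_obs are hypothetical.

```r
library(ggplot2)

# Hypothetical raw data: one row per observed animal
raw_obs <- data.frame(
  species = c("cat", "dog", "dog", "cat", "dog", "bird")
)

# geom_bar counts the rows in each group for you, so only x is mapped
counted_plot <- ggplot(raw_obs, aes(x = species)) +
  geom_bar()

# The same information, summarised beforehand: counts are in the Freq column
summary_obs <- as.data.frame(table(species = raw_obs$species))

# geom_col draws the values exactly as given, so both x and y are mapped
given_plot <- ggplot(summary_obs, aes(x = species, y = Freq)) +
  geom_col()

# Bar heights computed for each plot, in axis order
layer_data(counted_plot)$y
layer_data(given_plot)$y
```

If your data has one row per observation, geom_bar saves you the summarising step; if the numbers are already computed, geom_col plots them directly.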

2. Box and Whisker Plots

Box and whisker plots are especially useful for scientific studies: they let you see not just the data, but also its median, its quartiles, and how widely it is spread. A lot of the parameters here are the same as for geom_col and geom_bar, but geom_boxplot adds parameters for outliers and notches that need further elaboration. Again, let’s do this in a code chunk:

# Default parameters for geom_boxplot

ggplot(nice_data, aes(x = independent_value, y = dependent_value)) +
geom_boxplot(
  mapping = NULL,
  data = NULL,
  stat = "boxplot",
  position = "dodge2",
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  outlier.color = NULL,
  outlier.fill = NULL,
  outlier.shape = 19,
  outlier.size = 1.5,
  outlier.stroke = 0.5,
  outlier.alpha = NULL,
  notch = FALSE,
  notchwidth = 0.5,
  varwidth = FALSE)

The outlier parameters refer to points that fall beyond the whiskers of the box plot: you can customize their color, shape, size, and fill (if applicable) to your liking so they stand out from the rest of your dataset. If you do not want any outliers to show up, set outlier.shape to NA and they will be hidden. Another useful feature is the notch, which marks a roughly 95% confidence interval around the median. If notch is set to TRUE, the sides of the box pinch inward (think of an hourglass shape) and the box is narrowest at the median; notchwidth sets how wide the box is at that narrowest point relative to the full box (0.5 by default). When the notches of two boxes do not overlap, this suggests that their medians differ. varwidth refers to the width of the box: if set to FALSE, it will produce a standard box plot, and if set to TRUE, it will produce a box with the width proportional to the square root of the number of observations (wider if there’s more data).
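A short sketch of these options in use, assuming randomly generated data with two groups of different sizes. The data frame, the group names, and the red outlier color are all illustrative choices.

```r
library(ggplot2)

set.seed(1)  # make the made-up data reproducible

# Hypothetical data: 30 control and 60 treated measurements
demo_data <- data.frame(
  group = rep(c("control", "treated"), times = c(30, 60)),
  value = c(rnorm(30, mean = 10, sd = 2), rnorm(60, mean = 13, sd = 2))
)

base_plot <- ggplot(demo_data, aes(x = group, y = value))

# Notched boxes whose widths scale with sqrt(n); any outliers drawn in red
notched <- base_plot +
  geom_boxplot(notch = TRUE, varwidth = TRUE, outlier.color = "red")

# Same data with outlier points hidden
no_outliers <- base_plot +
  geom_boxplot(outlier.shape = NA)

# Computed statistics per box, including the notch limits
layer_data(notched)[, c("lower", "middle", "upper", "notchlower", "notchupper")]
```

With varwidth = TRUE the treated box is drawn wider than the control box because it holds more observations.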

Summary

This post has given you a sense of what ggplot2 can provide in terms of data visualization, but it is not a complete picture. Bar plots and box-and-whisker plots are often used for scientific research, but they are not the only options out there. Code for other types of plots follows the same grammar shown in this post, so use this information to help you write code when doing your research! Check out the websites cited below for more information about ggplot2!

Sources Cited

  1. https://ggplot2.tidyverse.org/index.html
  2. https://ggplot2.tidyverse.org/reference/geom_bar.html
  3. https://ggplot2.tidyverse.org/reference/geom_boxplot.html