Intro to ggplot2 and its parameters
The goal of this blog post is to introduce the reader to ggplot2, a powerful R package for plotting, and to explain some of what it can do to create effective visualizations of experimental data. By the end of the post, you should be able to use the parameters laid out here in code of your own. Almost any set of tabular data can be summarized and visualized with ggplot2.
General Format
ggplot2 has three basic components: the data set, which is best provided as a data.frame; the aesthetic mapping (not to be confused with the coordinate system!), which assigns columns of the data to x, y, and potentially z; and finally the geom, which is the visual representation of the data on the graph. Examples of ggplot2 geoms are bar plots, box-and-whisker plots, histograms, and even arbitrary shapes with the right data set! It is also important to include the function ggsave to save the output of the code in your choice of format (png is recommended here). The following code chunk shows this:
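# Load the package (install it once with install.packages("ggplot2"))
library(ggplot2)
# First step towards making a plot: column names inside aes() are not quoted
nice_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value))
# Save plot here with your filename
ggsave(plot = nice_plot,
       filename = "output/figures/nice_plot.png")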
Note that by default ggsave saves the last plot you displayed, so to prevent any possible confusion, it is best practice to explicitly name the plot you want to save as a png.
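To make the three components and the save step concrete, here is a minimal end-to-end sketch. The data frame, its column names, and the temporary output path are made up for illustration; swap in your own data and a real file path.

```r
library(ggplot2)

# Hypothetical example data: one row per measurement.
nice_data <- data.frame(
  independent_value = c(1, 2, 3, 4, 5),
  dependent_value   = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# Data + aesthetic mapping + a geom (points, in this case).
nice_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value)) +
  geom_point()

# Write to a temporary file so the sketch runs anywhere;
# replace with a real path such as "output/figures/nice_plot.png".
out_file <- file.path(tempdir(), "nice_plot.png")
ggsave(filename = out_file, plot = nice_plot, width = 6, height = 4, dpi = 300)
```

Passing plot = nice_plot explicitly means the right figure is saved even if another plot was displayed in between.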
Building the Plot
The first part of the ggplot code takes two things: the data set you’re working with (a data frame, for example one read in from a .csv file) and the aes (short for aesthetics) mapping that names the columns for x and y (also z, for geoms that use it, such as geom_contour). It will look like this:
# First step towards making a plot
nice_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value))
# Note: the x and y here are the columns drawn on each axis, not the labels for the axes!
If you do need a z value, you can go back to the aes section and add a z that corresponds to a column in your data set. Note that we have not included the geom type yet: it is added as a separate layer, joined to this code with a +. The great aspect of this is that if you aren’t happy with your first attempt at visualizing your data, you can swap in a different geom without altering your aesthetic mapping.
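As a rough illustration of swapping geoms, the sketch below builds one base object and adds different geoms to it. The data and the object names (base_plot, as_points, and so on) are invented for this example.

```r
library(ggplot2)

# Hypothetical example data.
nice_data <- data.frame(
  independent_value = c(1, 2, 3, 4, 5),
  dependent_value   = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# The base object holds only the data and the aesthetic mapping.
base_plot <- ggplot(nice_data, aes(x = independent_value, y = dependent_value))

# Each geom is added with `+`; the base object itself is not modified,
# so it can be reused for as many attempts as you like.
as_points <- base_plot + geom_point()
as_line   <- base_plot + geom_line()
as_both   <- base_plot + geom_point() + geom_line()
```

Printing as_points, as_line, or as_both draws the same data three different ways without touching the aes call.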
Adding the Geoms
1. Bar Plots
There are many types of geoms in ggplot2, but the syntax for all of them follows the same pattern. Let us start with a common and very useful type of visualization: the bar graph. There are two geoms for this: geom_bar and geom_col. We will go over both of them here since they share most of their parameters. The following code chunk lists each parameter with its default value, and the text after it elaborates on them.
# Taking the same data set from the last code chunk.
# geom_bar() counts rows itself, so only x is mapped.
# geom_col() has no stat argument: it plots the y values as given.
nice_plot <- ggplot(nice_data, aes(x = independent_value)) +
  geom_bar(
    mapping = NULL,
    data = NULL,
    stat = "count",
    position = "stack",
    width = NULL,
    na.rm = FALSE,
    orientation = NA,
    show.legend = NA,
    inherit.aes = TRUE)
mapping and data let a single geom use its own aesthetics or its own data set. If they are left as NULL, the geom uses the aes and the data set from the ggplot call. The stat is the statistical transformation applied before drawing: “count” tallies the number of rows at each x (or y, depending on your orientation), which suits categorical data. For binning continuous data into a histogram there is a separate geom_histogram().
The width refers to the width of each bar in the plot; you might have to adjust this depending on how crowded your axis is. A good test is simply whether you can match each bar with its value on the axis. If you cannot, go for a smaller width. na.rm refers to missing values: if set to TRUE, they are removed silently; if set to FALSE, they are removed with a warning. orientation refers to which axis the bars are built along: if set to NA, ggplot2 will work it out from the aesthetic mapping. This parameter can be set to “x” or “y” if that fails for any reason.
show.legend controls whether the layer appears in the legend: when set to NA, the layer is included only if it maps an aesthetic that needs a legend, such as fill or colour (x and y are described by the axes instead). inherit.aes refers to using the aes values from the first section of the code. position decides what happens when several bars share the same x value: “stack” places them on top of each other, while “dodge” places them side by side.
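The effect of position is easiest to see with a fill aesthetic. The following is a small sketch with made-up data (the counts data frame and its columns are hypothetical); it uses layer_data() to look at the bar heights ggplot2 computes.

```r
library(ggplot2)

# Hypothetical data: 3 treatments x 2 batches, one value per combination.
counts <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 2),
  batch     = rep(c("first", "second"), times = 3),
  value     = c(4, 6, 3, 5, 7, 2)
)

# Mapping fill to batch gives two bars per treatment and a legend for batch.
stacked <- ggplot(counts, aes(x = treatment, y = value, fill = batch)) +
  geom_col(position = "stack", width = 0.6)

dodged <- ggplot(counts, aes(x = treatment, y = value, fill = batch)) +
  geom_col(position = "dodge", width = 0.6)

# layer_data() returns the values ggplot2 computed for drawing.
tallest_stack <- max(layer_data(stacked)$ymax)  # bars for A sit on top of each other
tallest_bar   <- max(layer_data(dodged)$ymax)   # bars sit side by side
```

Stacking is handy for showing totals, while dodging makes it easier to compare the individual values within each group.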
The difference between geom_bar and geom_col is in what sets the height of the bars: geom_bar makes each bar proportional to the number of cases in its group (it counts the rows for you), while geom_col makes each bar proportional to a value that is already in the data set. If you want to see how many observations fall into each category, use geom_bar. For plotting values you have already measured or computed, use geom_col.
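A quick way to see the difference is to draw the same information both ways. In this sketch the observations data frame is invented for illustration: geom_bar gets the raw rows, while geom_col gets a table of counts prepared beforehand with base R.

```r
library(ggplot2)

# Hypothetical raw observations: one row per sampled bird.
observations <- data.frame(
  species = c("finch", "finch", "finch", "wren", "wren", "robin")
)

# geom_bar() counts the rows in each group for you.
bar_plot <- ggplot(observations, aes(x = species)) +
  geom_bar()

# geom_col() expects the heights to already be in the data,
# so tally the rows first (the count column is named Freq).
tallies <- as.data.frame(table(species = observations$species))
col_plot <- ggplot(tallies, aes(x = species, y = Freq)) +
  geom_col()

# Both plots end up with the same bar heights.
bar_heights <- layer_data(bar_plot)$y
col_heights <- layer_data(col_plot)$y
```

If your data already arrives as one row per group with a computed value (a mean, a total, a concentration), geom_col is usually the more direct choice.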
2. Box and Whisker Plots
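Box-and-whisker plots are especially useful for scientific studies: they let you see not just the middle of the data, but also how widely it is spread. The box spans the lower to the upper quartile with a line at the median, and each whisker extends to the most extreme point within 1.5 times the interquartile range of the box. A lot of the parameters here are the same as for geom_col and geom_bar, but geom_boxplot adds parameters for outliers, notches, and box width that need further elaboration. Again, let’s do this in a code chunk: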
# Include default parameters here; x is the grouping column, y the measured values
ggplot(nice_data, aes(x = independent_value, y = dependent_value)) +
  geom_boxplot(
    mapping = NULL,
    data = NULL,
    stat = "boxplot",
    position = "dodge2",
    na.rm = FALSE,
    orientation = NA,
    show.legend = NA,
    inherit.aes = TRUE,
    outlier.color = NULL,
    outlier.fill = NULL,
    outlier.shape = 19,
    outlier.size = 1.5,
    outlier.stroke = 0.5,
    outlier.alpha = NULL,
    notch = FALSE,
    notchwidth = 0.5,
    varwidth = FALSE)
The outlier parameters refer to points that fall beyond the whiskers, that is, more than 1.5 times the interquartile range away from the box: you can customize the color, shape, size, and fill (if applicable) to your liking so they stand out from the rest of your data set. If you do not want any outliers to show up, set outlier.shape to NA; they are only hidden, not removed from the data, so the range of the axis stays the same. Another useful aspect is the notch, which marks a roughly 95% confidence interval around the median; if the notches of two boxes do not overlap, their medians likely differ. If notch is set to TRUE, the box narrows around the median (think of an hourglass shape), and notchwidth sets how wide the box is at the notch relative to its full width. varwidth refers to the width of the box: if set to FALSE, it will produce a standard box plot, and if set to TRUE, it will produce boxes with widths proportional to the square root of the number of observations in each group (wider if there’s more data).
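The sketch below tries these options on simulated data. The measurements data frame, the group names, and the planted value of 25 are all assumptions made for the example, chosen so that one point is clearly an outlier.

```r
library(ggplot2)

set.seed(1)  # reproducible hypothetical data
measurements <- data.frame(
  group = rep(c("control", "treated"), times = c(40, 16)),
  # 40 control values, 15 treated values, plus one planted outlier (25)
  value = c(rnorm(40, mean = 10), rnorm(15, mean = 12), 25)
)

box_plot <- ggplot(measurements, aes(x = group, y = value)) +
  geom_boxplot(
    notch = TRUE,            # narrow the box around each median
    notchwidth = 0.5,        # width at the notch relative to the box body
    varwidth = TRUE,         # box width scales with sqrt(n) in each group
    outlier.colour = "red",
    outlier.shape = 17       # filled triangle
  )

# Same data with the outlier points hidden (they are not removed).
no_outliers <- ggplot(measurements, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA)

# The statistics ggplot2 computed, including the points it flagged.
box_stats <- layer_data(box_plot)
flagged   <- unlist(box_stats$outliers)
```

With small groups ggplot2 may print a message that a notch went outside the hinges; that is a hint that the confidence interval is wide, not an error.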
Summary
This post has given you a sense of what ggplot2 can provide in terms of data visualization, but it is not a complete picture. Bar plots and box-and-whisker plots are often used for scientific research, but they are not the only options out there. The code for other types of plots follows the same grammar shown in this post, so use this information to help you write code when doing your research! Check out the websites cited below for more information about ggplot2!
Sources Cited
- https://ggplot2.tidyverse.org/index.html
- https://ggplot2.tidyverse.org/reference/geom_bar.html
- https://ggplot2.tidyverse.org/reference/geom_boxplot.html