
Topic outline

  • Unit 4: Data Visualization

    Visualization is essential for storytelling with data and for communicating the results of your analysis. Graphs help us efficiently assess prominent features in the data (trends, variability), irregularities (changepoints, outliers), and relationships between variables, and compare those across different samples. R has powerful tools for creating scientific graphs. The commonly used tools fall into two groups: built-in R functions (base-R) and functions from the ggplot2 package, which follows a somewhat different approach known as the grammar of graphics. This unit introduces the syntax for both approaches and shows how to export a publication-quality graph from R.

    Completing this unit should take you approximately 3 hours.

    • Upon successful completion of this unit, you will be able to:

      • recognize the syntax of R graphics in base-R and ggplot2;
      • prepare a publication-quality figure in vector or raster format (export the figure from R); and
      • create graphs of the common types (scatterplot, histogram, boxplot, and time series plot).
    • 4.1: Base-R and ggplot2 Graphics

      There are two main plotting systems in R: the base plotting system (referred to as "base-R graphics" in this course) and the ggplot2 package. Sometimes the lattice package is counted as a third system, but it is outside the scope of this course. Both base-R and ggplot2 have their advantages and disadvantages. Generally, you can produce the same plot with either system, but the syntax will be very different. This section introduces both plotting systems so you can navigate their code comfortably. The introduction may be lengthy, but it prepares us to study specific plot types in the next sections and to create those plots quickly in both base-R and ggplot2.

      • This section introduces base-R graphics. Reading the materials will familiarize you with the options and commands used for plotting. Start coding by calling a high-level function such as plot, then incrementally modify and add code to change the plot's appearance, using the function par to fine-tune the margins and other settings. You will also learn about the R graphics devices used to save plots for publications (do not use the point-and-click interface to save plots from RStudio); these device commands also apply to ggplot2 output.

      • In this short practice exercise, you will implement the high-level function plot. It is convenient for fast checks and does not require installing additional packages. This exercise does not count toward your grade. It is just for practice!
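
      The workflow described above (a high-level plot call, par for the margins, and a graphics device for export) can be sketched as follows. This is a minimal illustration, not the course's own code: the simulated data, the file name, and the figure size are assumptions chosen for the example. It uses the pdf device for vector output; the png device works in the same way for raster output.

      ```r
      # Simulated data (an assumption for illustration only)
      set.seed(1)
      x <- rnorm(50)
      y <- 2 * x + rnorm(50)

      # Open a PDF (vector) graphics device; out_file is a hypothetical path.
      # Width and height are given in inches.
      out_file <- file.path(tempdir(), "base_scatter.pdf")
      pdf(out_file, width = 5, height = 4)

      # Fine-tune the margins: bottom, left, top, right (in lines of text)
      old_par <- par(mar = c(4, 4, 1, 1))

      # The high-level call creates the plot; low-level calls add to it
      plot(x, y, xlab = "x", ylab = "y", pch = 16, las = 1)
      abline(lm(y ~ x), col = "blue", lwd = 2)

      par(old_par)  # restore the previous graphical parameters
      dev.off()     # close the device so the file is written to disk
      ```

      Nothing appears in the plot window while a file device is open; the figure is written to the file when dev.off() is called.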

      • This section introduces ggplot2 graphics. You will see how different the syntax is from base-R graphics. You can think of ggplot2 as creating graphs by combining layers with the "+" sign. The default gray background of a ggplot is not well suited to printed publications and can be replaced by adding a theme layer, for example, + theme_minimal()

        Key Points

        • Use ggplot2 to create plots.

        • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

      • In this exercise, you can practice the implementation of ggplot and compare it to the base-R graphics. This exercise does not count toward your grade. It is just for practice!
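
      As a hedged sketch of the layered syntax, the example below builds a plot from the built-in mtcars data set by adding layers with "+", replaces the gray background with theme_minimal(), and saves the result with the same device commands used for base-R graphics. The choice of data set, labels, and file name are assumptions made for illustration, and the code runs only if the ggplot2 package is installed.

      ```r
      # Hypothetical output path used only for this example
      out_file <- file.path(tempdir(), "gg_scatter.pdf")

      if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)

        # Data and aesthetics first, then geometry, labels, and theme layers
        p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
          geom_point() +
          labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
          theme_minimal()

        # Inside a device block, a ggplot object must be printed explicitly
        pdf(out_file, width = 5, height = 4)
        print(p)
        dev.off()
      }
      ```

      The plot is stored in the object p, so layers can be added or swapped later without rebuilding it from scratch.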

    • 4.2: Creating Histograms

      We use histograms to identify the general position and shape of a distribution, including its center, scale, outliers, and possible multimodality (presence of clusters in the data). This section introduces the base-R and ggplot2 syntax for creating and decorating histograms. An important difference to remember is that the base-R function hist chooses the number of bins with a sensible default method (based on Sturges' formula, which is most suitable for normal-like distributions), while ggplot2 uses 30 bins by default regardless of the data, so you should usually set bins or binwidth yourself. See the respective help files by running ?hist and ?geom_histogram.

      • This video shows an interactive approach to creating histograms in base R, developing your code, and addressing the error messages. You will see more details on the available options in the next sections.

      • Now you will learn the ggplot2 syntax for building and customizing histograms.

      • Here you will see more examples of how to build histograms in base R. Note that when the total counts for two or more samples are different, we can convert the vertical axis to density so the distributions can be easily compared on the same plot.

      • In this exercise, you will practice plotting a histogram for a publication. This exercise does not count toward your grade. It is just for practice!
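
      The sketch below illustrates two points from this section under stated assumptions: how the Sturges rule suggests a bin count, and how two samples of different sizes can be compared on a density scale. The simulated samples, colors, and labels are invented for the example. Note that hist treats the Sturges value as a suggestion and rounds the breakpoints to convenient values, so the final number of bins may differ.

      ```r
      # Two simulated samples of different sizes (illustration only)
      set.seed(42)
      x <- rnorm(200, mean = 10, sd = 2)
      y <- rnorm(500, mean = 12, sd = 2)

      # Sturges' rule: ceiling(log2(n) + 1) bins for a sample of size n
      k <- nclass.Sturges(x)

      # Compute the histograms without drawing them
      hx <- hist(x, plot = FALSE)
      hy <- hist(y, plot = FALSE)

      # Draw both on the density scale so that unequal counts are comparable
      plot(hx, freq = FALSE, col = rgb(0, 0, 1, 0.4),
           xlim = range(hx$breaks, hy$breaks),
           ylim = c(0, max(hx$density, hy$density)),
           main = "", xlab = "Value")
      plot(hy, freq = FALSE, col = rgb(1, 0, 0, 0.4), add = TRUE)
      legend("topright", legend = c("x (n = 200)", "y (n = 500)"),
             fill = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4)), bty = "n")

      # ggplot2 version with an explicit bin count instead of the default 30
      if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)
        p <- ggplot(data.frame(x = x), aes(x = x)) +
          geom_histogram(bins = k) +
          theme_minimal()
        print(p)
      }
      ```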

    • 4.3: Creating Scatterplots

      Scatterplots are commonly used to study relationships between two variables, one measured on the horizontal (x) axis and the other on the vertical (y) axis. A scatterplot shows the strength, direction, and shape of the relationship, as well as clusters and outliers in the data. Adding features to the plot improves its storytelling capabilities. For example, you can color the points using some third variable. This section shows how to implement these features in R.

      • This video demonstrates the steps to create and tailor a scatterplot in the base-R plotting system. Notice the incremental development of the code, adding elements to the plot and checking its view in the plot window. Finally, the code in the video also uses the png command to export the resulting plot for publication.

      • Here we introduce scatterplots in base R. The code is simple, but you should also remember the options that make the plots more informative, like adding colors, legends, and error bars.

      • You will learn the layered syntax of ggplot2 for scatterplots in this section. The section also demonstrates how to add regression lines (compare with the base-R syntax shown in the introductory video).

      • In this exercise, you practice producing scatterplots for a publication. This exercise does not count toward your grade. It is just for practice!
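
      As an illustrative sketch (not taken from the course materials), the code below colors the points of a scatterplot by a third variable, adds a legend and a fitted regression line in base-R, and then shows a roughly equivalent ggplot2 version. The mtcars data set and the color names are assumptions chosen for the example.

      ```r
      # The third variable (number of cylinders) as a factor for coloring
      cyl_f <- factor(mtcars$cyl)
      cols <- c("steelblue", "darkorange", "forestgreen")

      # Base-R: indexing cols by the factor picks one color per level
      plot(mtcars$wt, mtcars$mpg, col = cols[cyl_f], pch = 16, las = 1,
           xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

      # Add a least-squares regression line
      fit <- lm(mpg ~ wt, data = mtcars)
      abline(fit, lwd = 2)

      legend("topright", legend = levels(cyl_f), col = cols, pch = 16,
             title = "Cylinders", bty = "n")

      # ggplot2: color is mapped inside aes(); one line is fitted to all points
      if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)
        p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
          geom_point() +
          geom_smooth(aes(group = 1), method = "lm", formula = y ~ x,
                      se = FALSE, colour = "black") +
          labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
               colour = "Cylinders") +
          theme_minimal()
        print(p)
      }
      ```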

    • 4.4: Creating Boxplots

      You can think about boxplots as scatterplots where one of the variables is a categorical variable, but boxplots are also more than that. Boxplots combine statistical summaries like the quartiles, show the interquartile range (IQR), and tell us about the distribution skewness and outliers. Boxplots provide a more focused view of the data distribution than histograms. This section first explains the statistics represented in a boxplot, then how to create boxplots of different types in R.

      • This video shows how a boxplot is built. You should understand what each of the bars and whiskers means so you can interpret the boxplot.

      • This section introduces the functionality of the base-R function boxplot. Note that for some data formats, the plot function with x being a factor variable will also work.

      • In this section, you will learn the ggplot2 code for producing boxplots. While the syntax and default appearance differ from base-R, the plots serve the same purpose: comparing distributions and identifying outliers. If needed, you can add a few lines of code to make the base-R and ggplot2 graphs look the same. The choice of which plotting system to use is now yours.

      • This quick practice exercise asks you to produce boxplots for a publication. This exercise does not count toward your grade. It is just for practice!
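
      To connect the picture with the statistics behind it, the sketch below draws boxplots with the base-R function boxplot and inspects the values it returns. The built-in iris data set is an assumption chosen for illustration. By default, the whiskers extend to the most extreme observations lying within 1.5 times the box length from the box, and points beyond them are drawn individually as outliers.

      ```r
      # Base-R boxplot using the formula interface; the result is also returned
      b <- boxplot(Sepal.Length ~ Species, data = iris, las = 1,
                   xlab = "Species", ylab = "Sepal length (cm)")

      # One column per group; the rows are: lower whisker, lower hinge,
      # median, upper hinge, upper whisker
      b$stats

      # Observations plotted as outliers and the group each belongs to
      b$out
      b$group

      # With a factor on the x-axis, plot() also produces boxplots
      plot(iris$Species, iris$Sepal.Length)

      # ggplot2 version
      if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)
        p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
          geom_boxplot() +
          labs(y = "Sepal length (cm)") +
          theme_minimal()
        print(p)
      }
      ```

      The hinges reported in b$stats are close to the first and third quartiles but may differ slightly from the output of quantile, which uses a different interpolation rule by default.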

    • 4.5: Creating Time Series Plots

      A time series is a type of variable in which observations are ordered in time. To show this order in a plot, we usually use lines or lines and points rather than just points. Time series plots may show us the general tendency in the data, such as changes in the mean, variance, and periodic patterns. These plots can also be used to detect changepoints and outliers visually.

      • This section is a short introduction to time series plots in R. You can use the analogy with the scatterplots where the horizontal axis is time.

      • If you save the data in the special format ts, the plotting function plot.ts can produce a better-looking x-axis automatically. The ts format adds attributes to your data, such as the start and end times and the frequency. This section shows how to convert an ordinary vector to the ts format and then plot it.

      • ggplot2 can also visualize time series. This section introduces the relevant ggplot2 syntax.

      • In this exercise, you practice plotting a time series for a publication. This exercise does not count toward your grade. It is just for practice!
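
      A minimal sketch of the conversion described above: an ordinary numeric vector is turned into a ts object and plotted, first in base-R and then in ggplot2. The simulated monthly series and its start date are assumptions invented for the example.

      ```r
      # Simulated monthly values covering four years (illustration only)
      set.seed(7)
      values <- cumsum(rnorm(48))

      # Convert the vector to ts format: monthly data starting in January 2020
      z <- ts(values, start = c(2020, 1), frequency = 12)

      # Attributes added by the ts format
      start(z)
      end(z)
      frequency(z)

      # plot() dispatches to plot.ts and labels the time axis automatically
      plot(z, xlab = "Year", ylab = "Value", las = 1)

      # ggplot2 works with data frames, so extract the time index first
      if (requireNamespace("ggplot2", quietly = TRUE)) {
        library(ggplot2)
        df <- data.frame(time = as.numeric(time(z)), value = as.numeric(z))
        p <- ggplot(df, aes(x = time, y = value)) +
          geom_line() +
          labs(x = "Year", y = "Value") +
          theme_minimal()
        print(p)
      }
      ```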

    • Unit 4 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count towards your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.