```r
library(tidyverse)
library(sf)
library(ggrepel)

small_mpg <- mpg |>
  # Only use the first 10 rows
  slice(1:10) |>
  # Make a label column
  mutate(fancy_label = paste0(manufacturer, " ", model, " (", year, ")"))

ggplot(small_mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_label_repel(aes(label = fancy_label), seed = 1234)
```
Weeks 11 and 12 tips and FAQs
Hi everyone!
Great work with your dashboards! Quarto dashboards are a super new feature (like, they were only added in January 2024), and you’re one of the first R classes to make them as part of an assignment. You’re on the cutting edge!
Just a few quick FAQs and tips from the past couple sessions:
Do people use Plotly and Quarto dashboards in the real world?
In the lecture I mentioned that most COVID dashboards were/are made with Shiny or {flexdashboard}. Both are still extremely popular and you’ll see Shiny and {flexdashboard} websites all over the internet.
Organizations have been moving to Quarto too. In November, Idaho launched a new election results dashboard for its general election results and it’s 100% built with Quarto and Plotly.
And now you can all do similar, real-world things too!
I’m bored with ggplot’s default colors and/or viridis—how can I use other color palettes?
There’s a guide to using all sorts of different colors, including how to use fancy scientifically-designed palettes, custom colors from organizations like GSU, and colors from art, music, movies and TV shows, and historical events. You can make gorgeous plots with these different colors!
How many decimal points should I use when I’m working with latitude and longitude?
When you copy/paste geographic coordinates from Google Maps, you’ll get numbers that have a ton of decimal points, even though the website doesn’t show that many. For instance, Google Maps shows 55 Park Place (my office building on campus) with just a handful of decimal points. When you copy those coordinates, though, you get these values, which have a lot more decimal points:

33.755935998360144, -84.38748192406294
Which one is right? 33.75594 or 33.755935998360144?
As you add more decimal points to geographic data, you get more and more precise:

- 34, -84 gets you a little north of Lawrenceville.
- 33.8, -84.4 gets you a little north of Midtown.
- 33.76, -84.39 gets you just outside of Centennial Olympic Park.
- With only three decimal points, 33.756, -84.387 gets you inside 55 Park Place.

Adding more decimal points gets you even more precise, down to the exact inch on the earth. That’s excessive.
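If you want to put a number on it, a quick back-of-the-envelope calculation shows roughly how much ground each decimal point covers. This is just a sketch, assuming one degree of latitude is about 111.32 km (longitude degrees shrink as you move away from the equator, so treat these as rough orders of magnitude):

```r
# Approximate ground distance covered by each decimal point of latitude
meters_per_degree <- 111320  # roughly 111.32 km per degree of latitude

precision <- data.frame(
  decimal_points = 0:6,
  approx_meters = meters_per_degree / 10^(0:6)
)
precision
```

Five decimal points already pins a location down to about a meter, so anything beyond that is overkill for plotting maps.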
Can I use geom_label_repel() with maps?
You learned about the {ggrepel} package in session 9, with its geom_text_repel() and geom_label_repel() functions that make sure none of your labels overlap.
In session 12, you learned about geom_sf_text() and geom_sf_label() for adding text and labels to maps. But what if your map labels overlap, like this?
```r
# Download cb_2022_us_county_5m.zip under "County" from
# https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html
ga_counties <- read_sf("data/cb_2022_us_county_5m/cb_2022_us_county_5m.shp") |>
  filter(STATEFP == 13)

ga_places <- tribble(
  ~city,        ~lat,      ~long,
  "Atlanta",    33.748955, -84.388099,
  "Alpharetta", 34.075318, -84.294105,
  "Duluth",     34.002262, -84.143614
) |>
  st_as_sf(coords = c("long", "lat"), crs = st_crs("EPSG:4326"))

ggplot() +
  geom_sf(data = ga_counties, linewidth = 0.1) +
  geom_sf(data = ga_places) +
  geom_sf_label(data = ga_places, aes(label = city)) +
  theme_void()
```
Unfortunately there’s no such thing as geom_sf_label_repel(). BUT there’s still a way to use geom_label_repel() and geom_text_repel() with maps, with a couple little tweaks:
- You have to map the geometry column in the data to the geometry aesthetic in geom_text/label_repel()
- You have to tell geom_text/label_repel() to use the “sf_coordinates” stat so that it uses the latitude and longitude coordinates for x/y
```r
ggplot() +
  geom_sf(data = ga_counties, linewidth = 0.1) +
  geom_sf(data = ga_places) +
  geom_label_repel(
    data = ga_places,
    aes(label = city, geometry = geometry),
    stat = "sf_coordinates", seed = 1234
  ) +
  theme_void()
```
Can I use a heatmap on a map?
Choropleths are helpful for filling regions of a map (like countries, states, counties, etc.) by a specific value, but you can also fill an entire map area with a gradient too! This is a good approach for dealing with overplotting. See this post to learn how to do it.
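As a tiny proof of concept of the gradient idea (not map-specific), you can use ggplot’s 2D density stat to fill the whole plotting area with a smooth gradient. This sketch uses the built-in faithful dataset to stand in for point coordinates:

```r
library(ggplot2)

# Fill the plot area with a continuous 2D density gradient instead of
# plotting lots of overlapping points
p <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(
    aes(fill = after_stat(density)),
    geom = "raster", contour = FALSE
  ) +
  scale_fill_viridis_c()
p
```

Swap in longitude/latitude columns from your own point data to apply the same idea on top of a map.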
I tried to make a map and countries are missing—why?
Many of you were brave and made a map of refugee counts for mini project 2. That’s fantastic!
If you did, you likely ran into an issue with plotting the countries and getting an incomplete map. Here’s an example with our beloved gapminder data.
```r
library(countrycode)  # For dealing with country names, abbreviations, and codes
library(gapminder)    # Global health and wealth

# Add an ISO country code column to gapminder for joining
gapminder_clean <- gapminder |>
  mutate(ISO3 = countrycode(country, "country.name", "iso3c"))

# Load the world map data from exercise 12
# Download "Admin 0 – Countries" from
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/
world_map <- read_sf("data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp") |>
  filter(ISO_A3 != "ATA") |>  # Bye penguins
  mutate(ISO3 = ADM0_A3)  # Use ADM0_A3 as the main country code column
```
Let’s take just 2007 from gapminder and map life expectancy. To do this we’ll need to combine or join the two datasets. One logical way to do this would be to take gapminder, join the world map data to it, and then plot it:
```r
gapminder_with_map <- gapminder_clean |>
  filter(year == 2007) |>
  left_join(world_map, by = join_by(ISO3))

ggplot() +
  geom_sf(data = gapminder_with_map, aes(fill = lifeExp))
## Error in `geom_sf()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_layer()`:
## ! `stat_sf()` requires the following missing aesthetics: geometry.
```
Oh no, there’s an error! When we joined the map data, the special attributes of the geometry column in world_map got lost. The column is still there, but it won’t automatically plot with geom_sf(). We can fix that by specifying that the column named “geometry” does indeed contain all the geographic data with st_set_geometry():
```r
gapminder_with_map <- gapminder_clean |>
  filter(year == 2007) |>
  left_join(world_map, by = join_by(ISO3)) |>
  # Fix the geometry column
  st_set_geometry("geometry")

ggplot() +
  geom_sf(data = gapminder_with_map, aes(fill = lifeExp)) +
  theme_void()
```
We have a… map? It’s missing a bunch of countries (Russia is the most glaringly obvious hole!). That’s because those countries aren’t in gapminder, so their corresponding maps didn’t come over when using left_join(). We can confirm by counting rows. The original map data has maps for 176 countries. Gapminder has 142 countries in 2007. The combined gapminder_with_map dataset only has 142 rows—we’re not plotting 34 countries, since they’re not in gapminder.
```r
nrow(world_map)
## [1] 176

nrow(gapminder_clean |> filter(year == 2007))
## [1] 142

nrow(gapminder_with_map)
## [1] 142
```
One quick and easy way to fix this is to use two geom_sf() layers: one with the whole world and one with the partial gapminder-only map:
```r
ggplot() +
  geom_sf(data = world_map) +
  geom_sf(data = gapminder_with_map, aes(fill = lifeExp)) +
  theme_void()
```
The better way to fix this is to join the two datasets in a different order—start with the full map data and then add gapminder to it. This maintains the specialness of the geometry column and keeps all the original rows in world_map. Countries that are in the map data but not in gapminder will still be in the final map_with_gapminder data, but they’ll have NA for life expectancy:
```r
map_with_gapminder <- world_map |>
  left_join(filter(gapminder_clean, year == 2007), by = join_by(ISO3))

ggplot() +
  geom_sf(data = map_with_gapminder, aes(fill = lifeExp)) +
  theme_void() +
  # Make the countries with missing data a different color
  scale_fill_gradient(na.value = "grey90")
```
What if we want to facet though? This is just one year—what if we want to show panels for multiple years? This gets a little tricky. The gapminder data has rows for different country/year combinations (Afghanistan 1952, Afghanistan 1957, Albania 1952, etc.), but the world map data only has rows for countries. If we join the gapminder data to the world map data and gapminder has multiple rows for years, there’s no clear place for the gapminder rows to connect with the world map rows. R will try to make it work and repeat world_map rows for each of the repeated years, but it can be unpredictable.
The best approach I’ve found for doing this is to create what I call a “skeleton” data frame that has all the possible combinations of (1) unique countries in the map data and (2) unique years in gapminder (or the refugee data if you’re using that). The expand_grid() function does this automatically. Like, look what happens if we tell it to make rows for every combination of A, B, C and 1, 2, 3—we get A1, A2, A3, B1, B2, and so on:
```r
expand_grid(column_1 = c("A", "B", "C"),
            column_2 = c(1, 2, 3))
## # A tibble: 9 × 2
##   column_1 column_2
##   <chr>       <dbl>
## 1 A               1
## 2 A               2
## 3 A               3
## 4 B               1
## 5 B               2
## 6 B               3
## 7 C               1
## 8 C               2
## 9 C               3
```
We’ll make a similar skeleton with all the countries in the map and all the years we care about in gapminder. We’ll just show two panels—1952 and 2007—so we’ll make a little filtered dataset first. Then we’ll use expand_grid() to make a dataset with all those combinations: Afghanistan 1952, Afghanistan 2007, Albania 1952, Albania 2007, and so on:
```r
gapminder_smaller <- gapminder_clean |>
  filter(year %in% c(1952, 2007))

skeleton <- expand_grid(ISO3 = unique(world_map$ISO3),
                        year = unique(gapminder_smaller$year))
skeleton
## # A tibble: 352 × 2
##    ISO3   year
##    <chr> <int>
##  1 FJI    1952
##  2 FJI    2007
##  3 TZA    1952
##  4 TZA    2007
##  5 SAH    1952
##  6 SAH    2007
##  7 CAN    1952
##  8 CAN    2007
##  9 USA    1952
## 10 USA    2007
## # ℹ 342 more rows
```
Neat, that works. There’s Fiji in 1952 and 2007, Tanzania in 1952 and 2007, and so on. Those are all the possible countries in world_map with all the possible years in gapminder_smaller.
Next we can join in the gapminder data for each country and year, and join in the map data for each country. Notice how the result has the same number of rows as skeleton (352). If a country doesn’t have gapminder data (like Fiji here), it gets NA for lifeExp, pop, and gdpPercap. But it still has map data for both 1952 and 2007, so it’ll show up in a plot.
```r
full_gapminder_map <- skeleton |>
  left_join(gapminder_smaller, by = join_by(ISO3, year)) |>
  left_join(world_map, by = join_by(ISO3)) |>
  # The geometry column lost its magic powers after joining, so add it back
  st_set_geometry("geometry")
full_gapminder_map
## Simple feature collection with 352 features and 175 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -55.61183 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## # A tibble: 352 × 176
##    ISO3   year country  continent lifeExp    pop gdpPercap featurecla scalerank
##    <chr> <int> <fct>    <fct>       <dbl>  <int>     <dbl> <chr>          <int>
##  1 FJI    1952 <NA>     <NA>         NA   NA           NA  Admin-0 c…         1
##  2 FJI    2007 <NA>     <NA>         NA   NA           NA  Admin-0 c…         1
##  3 TZA    1952 Tanzania Africa       41.2 8.32e6      717. Admin-0 c…         1
##  4 TZA    2007 Tanzania Africa       52.5 3.81e7     1107. Admin-0 c…         1
##  5 SAH    1952 <NA>     <NA>         NA   NA           NA  Admin-0 c…         1
##  6 SAH    2007 <NA>     <NA>         NA   NA           NA  Admin-0 c…         1
##  7 CAN    1952 Canada   Americas     68.8 1.48e7    11367. Admin-0 c…         1
##  8 CAN    2007 Canada   Americas     80.7 3.34e7    36319. Admin-0 c…         1
##  9 USA    1952 United … Americas     68.4 1.58e8    13990. Admin-0 c…         1
## 10 USA    2007 United … Americas     78.2 3.01e8    42952. Admin-0 c…         1
## # ℹ 342 more rows
## # ℹ 167 more variables: LABELRANK <int>, SOVEREIGNT <chr>, SOV_A3 <chr>,
## #   ADM0_DIF <int>, LEVEL <int>, TYPE <chr>, TLC <chr>, ADMIN <chr>,
## #   ADM0_A3 <chr>, GEOU_DIF <int>, GEOUNIT <chr>, GU_A3 <chr>, SU_DIF <int>,
## #   SUBUNIT <chr>, SU_A3 <chr>, BRK_DIFF <int>, NAME <chr>, NAME_LONG <chr>,
## #   BRK_A3 <chr>, BRK_NAME <chr>, BRK_GROUP <chr>, ABBREV <chr>, POSTAL <chr>,
## #   FORMAL_EN <chr>, FORMAL_FR <chr>, NAME_CIAWF <chr>, NOTE_ADM0 <chr>, …
```
Now we can plot it and we’ll have consistent countries in each panel:
```r
ggplot() +
  geom_sf(data = full_gapminder_map, aes(fill = lifeExp)) +
  facet_wrap(vars(year), ncol = 1) +
  scale_fill_gradient(na.value = "grey90") +
  theme_void()
```
Perfect!
Saving data that takes a long time to make
In these later sessions, I’ve had you do things with data from different places on the internet. Some of you used osrmRoute() in exercise 12 to create a mapped route between cities. Some of you used {tidygeocoder} to geocode addresses in exercise 12. In past sessions you’ve used WDI() to download data from the World Bank. In exercise 14 you’ll grab books from Project Gutenberg.
When you render a document, R starts with a brand new empty session without any packages or data loaded, and then it runs all your chunks to load packages, load data, and run all your other code. If you have code that grabs data from the internet, it will run every time you render your document. Remember my suggestion to render often? You’ll re-download the data, re-create routes, re-geocode addresses, and so on with every render. This is excessive, slow, and—most especially—bad R etiquette. You don’t want to keep accessing those servers, recalculating things, and redownloading things you don’t need to update.
BUT at the same time, you should care about reproducibility. You want others—and future you—to be able to run your code and create the same plots and tables and get the same data. But you don’t want to do all that excessively and impolitely.
The solution is to be a little tricky with your Quarto document. If you have code that needs to grab something from the internet, put it in a chunk that doesn’t run—use eval: false in its chunk options. Then, in an invisible chunk (with include: false), load the pre-downloaded data manually. I showed this in example 8 (with {WDI}), example 13 (with {tidyquant}), and example 14 (with {gutenbergr}).

Here’s a quick basic example with Project Gutenberg book data. There are two chunks: get-book-fake and load-book-data-real:
```{r}
#| label: get-book-fake
#| eval: false
little_women_raw <- gutenberg_download(514, meta_fields = "title")
```
```{r}
#| label: load-book-data-real
#| include: false
little_women_file <- "data/little_women_raw.csv"
if (file.exists(little_women_file)) {
little_women_raw <- read_csv(little_women_file)
} else {
little_women_raw <- gutenberg_download(514, meta_fields = "title")
write_csv(little_women_raw, little_women_file)
}
```
- The first chunk (get-book-fake) contains the code for downloading data with gutenberg_download(). It will appear in the document, but it will not run since it has eval set to false. It will not try to grab anything from the internet. If someone were to follow along with the code in your document, they could run that code and get the book (good!), and it won’t run repeatedly on your end (also good!).
- The second chunk (load-book-data-real) does a neat little trick. It first checks to see if a CSV file named data/little_women_raw.csv exists. If it does, it’ll just load it with read_csv(). If it doesn’t, it’ll grab data from the internet with gutenberg_download() and then save that as data/little_women_raw.csv. This is really neat stuff. If you’re rendering your document for the first time, you won’t have the Little Women data yet, so the code will connect to Project Gutenberg and get it. The code will then save that data to your computer as a CSV file. The next time you render, R won’t need to connect to the internet again—it’ll load the CSV file instead of grabbing it from Project Gutenberg. You can render as many times as you want—you won’t need to reconnect to any remote servers again.
Again, the general pattern for this is to create two chunks: (1) a fake one that people will see in the document but won’t run, and (2) a real one that will run and load data locally if it exists, but that people won’t see.
In the example above, I saved the Little Women data from Project Gutenberg as a CSV file. That’s fine—a CSV file is plain text, so it can store numbers and text without any problems.
But sometimes you’ll work with slightly more complex types of data. For instance, with geographic data, the magical geometry column contains a whole host of extra metadata, like projection details and multiple points (if it’s something like country boundaries). If you save a data frame with a geometry column as a CSV file, you’ll lose all that data—CSVs can’t store that extra nested metadata.
Similarly, if you have a factor or categorical variable (i.e. something like “Strongly disagree”, “Disagree”, “Agree”, “Strongly agree”), behind the scenes R stores those as numbers (i.e. 1, 2, 3, 4) with labels attached to the numbers (1 = “Strongly disagree”, 2 = “Disagree”, etc.). If you save a data frame with a categorical column like that as a CSV, you’ll lose the factor structure—the labels get written out as plain text, and when you load the CSV again the column is just ordinary text. And if the variable is ordered (i.e. “Strongly disagree” is lower than “Disagree”, etc.), you’ll lose that ordering when saving as CSV.
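Here’s a tiny self-contained illustration of the problem, using base R’s write.csv() and read.csv() (the survey answers are made up):

```r
# An ordered categorical variable: R stores the levels and their order
responses <- data.frame(
  answer = factor(
    c("Agree", "Strongly disagree", "Strongly agree"),
    levels = c("Strongly disagree", "Disagree", "Agree", "Strongly agree"),
    ordered = TRUE
  )
)

csv_file <- tempfile(fileext = ".csv")
write.csv(responses, csv_file, row.names = FALSE)
reloaded_csv <- read.csv(csv_file)

is.ordered(responses$answer)     # TRUE: the ordering exists in memory
is.ordered(reloaded_csv$answer)  # FALSE: the CSV round trip made it plain text

# .rds keeps the full factor structure intact
rds_file <- tempfile(fileext = ".rds")
saveRDS(responses, rds_file)
reloaded_rds <- readRDS(rds_file)
is.ordered(reloaded_rds$answer)  # TRUE
```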
The safest way to save intermediate files like this is to actually not use CSV, but instead use a special kind of file called .rds, which lets you take an entire object from your Environment panel and save it as a file. The .rds file will keep all the extra metadata and attributes (i.e. the projection details and nested points inside a geometry column, factor labels and ordering for categorical variables, and so on).

So instead of saving that Little Women book as a CSV file, the better approach is to use saveRDS() and readRDS() to store it and load it as an .rds file, like this:
```{r}
#| label: get-book-fake
#| eval: false
little_women_raw <- gutenberg_download(514, meta_fields = "title")
```
```{r}
#| label: load-book-data-real
#| include: false
little_women_file <- "data/little_women_raw.rds"
if (file.exists(little_women_file)) {
little_women_raw <- readRDS(little_women_file)
} else {
little_women_raw <- gutenberg_download(514, meta_fields = "title")
saveRDS(little_women_raw, little_women_file)
}
```
If you want to be super cool, check out the {targets} package, which is like the professional version of this approach to caching data. {targets} lets you keep track of all your different objects and it will only re-run stuff if absolutely necessary.
For instance, imagine that in a document you load data, clean it, and plot it. Standard stuff. There’s a linear relationship between all this—the raw data leads to the clean data, which leads to a plot. If you change code in your plot, the data cleaning and loading code didn’t change, so there’s no real reason to need to re-run it. If you change your data cleaning code, your downstream plot will be affected and its code would need to be re-run.
{targets} keeps track of all these dependencies and re-runs code only when there are upstream changes. It’s great for plots and models that take a long time to run, or for grabbing data from the internet.
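For a taste of what that looks like, a minimal _targets.R pipeline file might look something like this. This is just a sketch: the file path, the value column, and the target names are hypothetical placeholders, not from the course materials:

```r
# _targets.R: defines the pipeline; run it with targets::tar_make()
library(targets)
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))

list(
  # Track the raw file itself so edits to it trigger downstream re-runs
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read_csv(raw_file)),
  tar_target(clean_data, filter(raw_data, !is.na(value))),
  tar_target(value_plot, ggplot(clean_data, aes(x = year, y = value)) + geom_point())
)
```

After the first tar_make(), changing only the plotting code rebuilds just value_plot; raw_data and clean_data stay cached.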
The best way to learn {targets} is to play with their short little walkthrough tutorial here, which has you make a simple document that loads data, builds a regression model, and makes a plot.
I use {targets} for all my projects (including this course website!) and it makes life a ton easier for any kind of project that involves more than one .qmd file or R script (see this for an example). I highly recommend checking it out.