Week 1 tips and FAQs

FAQs

Posted

Wednesday September 4, 2024 at 11:25 PM

Hi everyone!

I just finished grading Exercise 1 and you’re doing great work!

I have just a few important FAQs and tips I’ve been giving as feedback to many of you:

What’s the difference between R Markdown and Quarto?

In the videos, you’ll see examples of using R Markdown (i.e. files with an .Rmd extension), but in the instructions for the exercises and the answer keys on iCollege, I have you use Quarto (i.e. files with a .qmd extension) instead. Why?

Quarto Markdown is regular Markdown with R code and output sprinkled in. You can do everything you can with regular Markdown, but you can incorporate graphs, tables, and other code output directly in your document. You can create HTML, PDF, and Word documents, PowerPoint and HTML presentations, websites, books, and even interactive dashboards with Quarto. This whole course website is created with Quarto.

Quarto’s predecessor was R Markdown and worked exclusively with R (though there are ways to use other languages in document). Quarto is essentially “R Markdown 2.0,” but it is designed to be language agnostic. You can use R, Python, Julia, Observable JS, and even Stata code all in the same document. It is magical.

The core idea behind R Markdown—that you can intersperse text with code chunks—is the same in Quarto, and the syntax is the same too. Everything R Markdown-related in the videos should work just the same in Quarto.

See here for some of the cool things you’ll be able to do with Quarto documents.

Rendered document format for iCollege

Quarto is great and wonderful in part because you can write a document in one .qmd file and have it magically turn into an HTML file, a Word file, or a PDF (or if you want to get extra fancy later, a slideshow, a dashboard, or even a full website).

iCollege doesn’t like HTML files, though. It won’t show images that get uploaded because of weird server restrictions or something. So when you submit your exercises, make sure you render as PDF or Word.

I’d recommend rendering as HTML often as you work on the exercise. Rendering as PDF takes a few extra seconds, and rendering as Word is a hassle because Word gets mad if you have a previous version of the document open when rendering. HTML is pretty instantaneous. When I work on Quarto files, I put a browser window on one of my monitors and RStudio on the other and render and re-render often to HTML while working. Once I’m done and the document all works and the images, tables, text, etc. are all working, I’ll render as PDF or Word or whatever final format I want.

Why does it matter where we put files like `cars.csv`?

Oh man, this is one of the trickiest parts of the class, and it’s all Google’s fault.

Google, Apple, and the (over)simplification of file systems

There’s a bizarre (and well documented!) generational blip of people having a more intuitive understanding of files and folders. Before the 1980s, people learning how to use computers had to learn how files and folders worked because it was a brand new metaphor for thinking about items on computers. From the 1990s–2000s, sorting and moving and copying files and folders was a standard part of working with computers.

Then, in the 2010s, Google created Google Drive to simplify file management. All your Google documents are in one big easily searchable bucket of files, and while it’s possible to create folders on Google Drive, they actually discourage it. If you went to K–12 in the 2010s, you likely used Google Classroom, and many of those installations actually had Google Drive folders disabled so it was impossible to even create folder structures.

Your phone also completely hides the file system on purpose—Apple and Google wanted to simplify how things are stored on phones, so when you take a picture, that file goes somewhere, but you don’t need to know where.

As a result, people who learned how to use computers in the 1990–2000s tend to understand things like nested folders, absolute and relative paths, file extensions, and so on. As a gross oversimplification, Millennials and younger Gen X tend to use folders; Boomers/older Gen X and Gen Z do not.

It’s like one of those goofy internet generational tests: Do you put all your school-related files in carefully nested folders like Documents/GSU/Fall 2024/PMAP 8551/Assignment 1? You’re probably a Millennial. Do you put all your school-related files in one big Documents/GSU folder, or just in your Documents folder, or just wherever your browser decides to stick downloaded stuff? You’re probably Gen Z.

R (and every other programming language, and all computers in general) really cares about where your files are stored on your computer. Programs on your computer can’t read your mind—they only know where files are if you reference them with good consistent names.

In exercise 1, I had you create a folder on your computer somewhere and then create another folder in there named “data”, and you were supposed to put cars.csv inside that. You should have had a folder structure like this:

When you downloaded cars.csv your browser put that file in your Downloads folder (on macOS that’s at /Users/yourname/Downloads/; on Windows that’s at C:\User\yourname\Downloads\.) If you keep that file in Downloads, and you try running this:

read_csv("cars.csv")

…it won’t work because that file is actually in your Downloads folder and R is looking for it in the same folder as your .qmd file. You need to move that file out of Downloads.

(Technically you don’t have to; you could do read_csv("/Users/yourusername/Downloads/cars.csv"), but that’s BAD because if you ever move that file out of Downloads, your code will break.)

I had you create a folder named “data” to hold cars.csv, but technically you don’t have to do that. If you move cars.csv to the same folder as your .qmd file, it will be fine—your code just needs to say read_csv("cars.csv"). It’s just generally good practice to store all the data files in their own folder, for the sake of organization. You can name folders whatever you want too—like, you could put it in a folder in your project named “my-neat-data-stuff”, and then load it with read_csv("my-neat-data-stuff/cars.csv"). That’s fine too.

How does `group_by()` work? Why did we have to group by `class` and not `cty` and `class`?

A bunch of you did something like this for the last part of Exercise 1:

cars |>
  group_by(cty, class) |>
  summarize(avg_mpg = mean(cty))

While that technically gave you an answer, it’s wrong.

As you saw in the Primers, group_by() puts your data into groups behind the scenes, and summarize() collapses those groups into single values.

For instance, if you group by class, R will put all the 2seaters together, all the compacts together, all the suvs together, and so on. Then when you use summarize(), it will calculate the average for each of those groups:

mpg |> 
  group_by(class) |> 
  summarize(
    avg_mpg = mean(cty),
    num_cars = n()
  )
## # A tibble: 7 × 3
##   class      avg_mpg num_cars
##   <chr>        <dbl>    <int>
## 1 2seater       15.4        5
## 2 compact       20.1       47
## 3 midsize       18.8       41
## 4 minivan       15.8       11
## 5 pickup        13         33
## 6 subcompact    20.4       35
## 7 suv           13.5       62

There are 5 two-seater cars with an average MPG of 15.4; 47 compact cars with an average MPG of 20.13; and so on.

If you group by class and cty, R will create groups of all the unique combinations of car class and city MPG: all the compact cars with 18 MPG, all the compact cars with 20 MPG, all the midsize cars with 15 MPG, and so on. Then when you use summarize(), it will calculate the average for each of those groups:

mpg |> 
  group_by(class, cty) |> 
  summarize(
    avg_mpg = mean(cty),
    num_cars = n()
  )
## # A tibble: 61 × 4
## # Groups:   class [7]
##    class     cty avg_mpg num_cars
##    <chr>   <int>   <dbl>    <int>
##  1 2seater    15      15        3
##  2 2seater    16      16        2
##  3 compact    15      15        2
##  4 compact    16      16        3
##  5 compact    17      17        4
##  6 compact    18      18        7
##  7 compact    19      19        5
##  8 compact    20      20        5
##  9 compact    21      21       12
## 10 compact    22      22        3
## # ℹ 51 more rows

The average MPG for the 3 two-seater cars with 15 MPG is, unsurprisingly, 15 MPG, since every car in that group has a mileage of 15 MPG.

Check out these animations to help with the intuition of grouping, like these:

group_by(cat1)
group_by(cat2)

Can we use ChatGPT? Can you even tell if we do?

Yes, I can typically tell. I’d highly recommend not using it. See this for why.

Do you have any tips for remembering all the different functions we’re learning? There are so many!

Yes! In RStudio, go to Help > Cheat Sheets and you’ll be able to access a bunch of 2-page PDF cheatsheets for the main things that we’ll cover in this class. Posit has dozens of others available, too, if you click on “Browse Cheat Sheets…” at the bottom of that menu.