16  Next Steps

Now that you’ve got a handle on Quarto, what else is worth learning? Here are some of my recommendations.

16.1 Learn how to use git and GitHub

git is a version control system. Not sure what a version control system is? No worries, let me explain. If you’ve ever named a document something like:

Final
Final 2
Really final

Relevant PhD comics link

Or even if you have something like:

  • 2018-10-10-document.qmd
  • 2018-10-11-document.qmd

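These are all ways of managing versions by hand. A version control system like git does this job for you: it records each change to your files along with a message describing it, so you can return to any earlier version without keeping renamed copies.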

To learn git and GitHub, I’d highly recommend Happy Git with R by Jenny Bryan, the STAT 545 TAs, and Jim Hester.

16.2 Learn how to make reproducible examples

(See https://github.com/njtierney/reprex-essentials for more examples)

(The following is an excerpt from my blog post, “How to get good at R”)

When you run into a problem or an error and can’t work out the answer after some tinkering, it can be worthwhile spending some time constructing a small example of the code that breaks. This takes a bit of time, and could be its own little blog post. It takes practice. But in the process of reducing the problem down to its core components, I can often solve the problem myself. It’s kind of like when you describe a problem you are working on to someone and, in talking about it, you arrive at a solution.

There is a great R package by Jenny Bryan, called reprex, that helps you create these reproducible examples. I’ve written about the reprex package here.
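As a rough sketch of what using reprex looks like, the call below wraps a tiny, self-contained problem in `reprex()`. The `scores` vector is a made-up example and not from the blog post, and this assumes the reprex package (and the pandoc it relies on for rendering) is installed. The code runs in a fresh R session, so anything the example needs has to be inside the braces.

```r
library(reprex)

# A deliberately tiny problem: summing character strings fails.
# `scores` is a hypothetical object used only for illustration.
out <- reprex(
  {
    scores <- c("1", "2", "3")
    sum(scores)
  },
  venue = "gh",      # format the result as GitHub-flavoured markdown
  advertise = FALSE  # leave off the "Created by reprex" footer
)

# `out` holds the rendered code and its output, ready to paste into an issue
out
```

In an interactive session the rendered result is normally also placed on your clipboard, so you can paste it straight into a GitHub issue or forum post.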

For the purposes of illustration, let’s briefly pare down a small example using the somewhat large diamonds dataset:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Let’s say we had a few steps involved in summarising the diamonds data:

diamonds %>%
  mutate(
    price_per_carat = price / carat
  ) %>% 
  group_by(
    cut
  ) %>%
  summarise(
    price_mean = mean(price_per_carat),
    price_sd = sd(price_per_carat),
    mean_color = mean(color)
  )
Warning: There were 5 warnings in `summarise()`.
The first warning was:
ℹ In argument: `mean_color = mean(color)`.
ℹ In group 1: `cut = Fair`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
# A tibble: 5 × 4
  cut       price_mean price_sd mean_color
  <ord>          <dbl>    <dbl>      <dbl>
1 Fair           3767.    1540.         NA
2 Good           3860.    1830.         NA
3 Very Good      4014.    2037.         NA
4 Premium        4223.    2035.         NA
5 Ideal          3920.    2043.         NA

The warning gives us a clue that the problem is in the mean_color line, so let’s try running just that line:

diamonds %>%
  mutate(
    mean_color = mean(color)
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `mean_color = mean(color)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z mean_color
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43         NA
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31         NA
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31         NA
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63         NA
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75         NA
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48         NA
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47         NA
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53         NA
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49         NA
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39         NA
# ℹ 53,930 more rows

We still get the same warning, so what if we just run:

mean(diamonds$color)
Warning in mean.default(diamonds$color): argument is not numeric or logical:
returning NA
[1] NA

OK, same warning. What is in color?

head(diamonds$color)
[1] E E E I J J
Levels: D < E < F < G < H < I < J

Does it really make sense to take the mean of some letters? Ah, of course not! color is an ordered factor with levels D to J, and mean() only works on numeric or logical values.
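Having pinned the problem down, one possible way forward is to confirm the type of color and then swap the mean for a summary that makes sense for a factor. The sketch below is just one option and not part of the original post: `most_common()` and `top_color` are hypothetical names, and reporting the most frequent color per cut is an assumption about what a useful replacement might be.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Confirm the diagnosis: color is an ordered factor, not a number
class(diamonds$color)
is.numeric(diamonds$color)

# Counting the levels is a summary that does make sense for a factor
table(diamonds$color)

# Hypothetical helper: the most frequent level of a factor, as a string
most_common <- function(x) names(which.max(table(x)))

# Same pipeline as before, with mean(color) replaced by the helper
diamonds_summary <- diamonds %>%
  mutate(price_per_carat = price / carat) %>%
  group_by(cut) %>%
  summarise(
    price_mean = mean(price_per_carat),
    price_sd = sd(price_per_carat),
    top_color = most_common(color)
  )

diamonds_summary
```

With this change the pipeline should run without the "argument is not numeric or logical" warning, because nothing is asked to average a factor any more.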