In spreadsheet, pivot tables or dynamic tables are powerful tools for
summarizing data in different ways. We can create these tables using the
group_by and summarize functions from the
tidyverse package.
Let’s take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our dataset.
We can use group_by() to group our data by
species and summarize() to calculate the
number of cases and the average body_mass_g for each
grouping.
penguins %>% 
  group_by(species) %>% 
  summarise(count = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))## # A tibble: 3 × 3
##   species   count mean_body_mass
##   <chr>     <int>          <dbl>
## 1 Adelie      152          3701.
## 2 Chinstrap    68          3733.
## 3 Gentoo      124          5076.
The function n() returns the number of cases and the
function mean computes the average. The parameter
na.rm = TRUE tells the function to remove non available
values (NA) before performing the calculation.
Question: What if you don’t use
group_by()beforesummarise()? Let’s try it and discuss what’s going on.
penguins %>% 
  summarise(count = n(),
            mean_body_mass = mean(body_mass_g, na.rm = TRUE))## # A tibble: 1 × 2
##   count mean_body_mass
##   <int>          <dbl>
## 1   344          4202.
So if we don’t group_by first, we will get a single summary statistic (n and mean in this case) for the whole dataset.
Now, we want the average of the body mass in kilograms and with two
decimals. We can change our summarise() call to get that
calculation.
penguins %>% 
  group_by(species) %>% 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))## # A tibble: 3 × 3
##   species   count mean_body_mass
##   <chr>     <int>          <dbl>
## 1 Adelie      152           3.7 
## 2 Chinstrap    68           3.73
## 3 Gentoo      124           5.08
First we divide body mass by 1000 to transform from grams to
kilograms and then we use the function round() to get only
two decimals.
Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.
So let’s add a variable assignment to that first line:
summary <- penguins %>% 
  group_by(species) %>% 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))To make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.
We can do that with the kable function.
summary %>% 
  knitr::kable(col.names = c("Specie", "Count", "Mean Body Mass"),
               align = "lcc",
               caption = "Summary of penguins on the data set by specie.")| Specie | Count | Mean Body Mass | 
|---|---|---|
| Adelie | 152 | 3.70 | 
| Chinstrap | 68 | 3.73 | 
| Gentoo | 124 | 5.08 | 
We can do a lot more with other functions (or “verbs”) of the tidyverse. Let’s see a couple more
There are a lot of times in which we want to operate beyond the
original data. This is often done by adding a column. We do this with
the mutate() function.
We can add a new variable (called prop) representing the
proportion of penguins in each species with by using
mutate().
penguins %>% 
  group_by(species) %>% 
  summarise(count = n(),
            mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2)) %>%
  mutate(prop = count/nrow(penguins))## # A tibble: 3 × 4
##   species   count mean_body_mass  prop
##   <chr>     <int>          <dbl> <dbl>
## 1 Adelie      152           3.7  0.442
## 2 Chinstrap    68           3.73 0.198
## 3 Gentoo      124           5.08 0.360
In our report we also make plot by species. To do this we need to
keep only the data of the species we want to analyse. We use the
filter() function to filter data frames, this function
returns a new data frame with only the rows that satisfy some
conditions.
We can use logical operators (>, <,
>=, <=, ==) to create
conditions from numeric variables These operators are also useful for
dates and text strings. Look at the code to see a condition for text
type of data.
penguins %>% 
  filter(species == "Gentoo")## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <chr>, year <dbl>
Activity: Write code to filter all the penguins in the Torgersen island.
Activity: Now write the code to filter all the penguins with flipper lenght greater than 200