In spreadsheet, pivot tables or dynamic tables are powerful tools for
summarizing data in different ways. We can create these tables using the
group_by
and summarize
functions from the
tidyverse package.
Let’s take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our dataset.
We can use group_by()
to group our data by
species and summarize()
to calculate the
number of cases and the average body_mass_g for each
grouping.
%>%
penguins group_by(species) %>%
summarise(count = n(),
mean_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 3 × 3
## species count mean_body_mass
## <chr> <int> <dbl>
## 1 Adelie 152 3701.
## 2 Chinstrap 68 3733.
## 3 Gentoo 124 5076.
The function n()
returns the number of cases and the
function mean
computes the average. The parameter
na.rm = TRUE
tells the function to remove non available
values (NA
) before performing the calculation.
Question: What if you don’t use
group_by()
beforesummarise()
? Let’s try it and discuss what’s going on.
%>%
penguins summarise(count = n(),
mean_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 2
## count mean_body_mass
## <int> <dbl>
## 1 344 4202.
So if we don’t group_by first, we will get a single summary statistic (n and mean in this case) for the whole dataset.
Now, we want the average of the body mass in kilograms and with two
decimals. We can change our summarise()
call to get that
calculation.
%>%
penguins group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))
## # A tibble: 3 × 3
## species count mean_body_mass
## <chr> <int> <dbl>
## 1 Adelie 152 3.7
## 2 Chinstrap 68 3.73
## 3 Gentoo 124 5.08
First we divide body mass by 1000 to transform from grams to
kilograms and then we use the function round()
to get only
two decimals.
Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.
So let’s add a variable assignment to that first line:
<- penguins %>%
summary group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))
To make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.
We can do that with the kable
function.
%>%
summary ::kable(col.names = c("Specie", "Count", "Mean Body Mass"),
knitralign = "lcc",
caption = "Summary of penguins on the data set by specie.")
Specie | Count | Mean Body Mass |
---|---|---|
Adelie | 152 | 3.70 |
Chinstrap | 68 | 3.73 |
Gentoo | 124 | 5.08 |
We can do a lot more with other functions (or “verbs”) of the tidyverse. Let’s see a couple more
There are a lot of times in which we want to operate beyond the
original data. This is often done by adding a column. We do this with
the mutate()
function.
We can add a new variable (called prop
) representing the
proportion of penguins in each species with by using
mutate()
.
%>%
penguins group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2)) %>%
mutate(prop = count/nrow(penguins))
## # A tibble: 3 × 4
## species count mean_body_mass prop
## <chr> <int> <dbl> <dbl>
## 1 Adelie 152 3.7 0.442
## 2 Chinstrap 68 3.73 0.198
## 3 Gentoo 124 5.08 0.360
In our report we also make plot by species. To do this we need to
keep only the data of the species we want to analyse. We use the
filter()
function to filter data frames, this function
returns a new data frame with only the rows that satisfy some
conditions.
We can use logical operators (>
, <
,
>=
, <=
, ==
) to create
conditions from numeric variables These operators are also useful for
dates and text strings. Look at the code to see a condition for text
type of data.
%>%
penguins filter(species == "Gentoo")
## # A tibble: 124 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gentoo Biscoe 46.1 13.2 211 4500
## 2 Gentoo Biscoe 50 16.3 230 5700
## 3 Gentoo Biscoe 48.7 14.1 210 4450
## 4 Gentoo Biscoe 50 15.2 218 5700
## 5 Gentoo Biscoe 47.6 14.5 215 5400
## 6 Gentoo Biscoe 46.5 13.5 210 4550
## 7 Gentoo Biscoe 45.4 14.6 211 4800
## 8 Gentoo Biscoe 46.7 15.3 219 5200
## 9 Gentoo Biscoe 43.3 13.4 209 4400
## 10 Gentoo Biscoe 46.8 15.4 215 5150
## # ℹ 114 more rows
## # ℹ 2 more variables: sex <chr>, year <dbl>
Activity: Write code to filter all the penguins in the Torgersen island.
Activity: Now write the code to filter all the penguins with flipper lenght greater than 200