It’s time to present the data set we are using. The Palmer Penguins data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The data set includes several characteristics from Adelie, Chinstrap and Gentoo penguins. You can read more about it on the palmerpenguins Documentation.
These data are available in R by installing the palmerpenguins package, but because we want to learn how to read data into R, we are gonig to read them from csv and xls files.
We’ll start by loading the tidyverse package, which
gives us access to dozens of functions to work with. For know we’ll use
the read_csv()
function to read a csv file that is stored
in the data directory.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<- read_csv("data/penguins.csv") penguins
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In Excel or Google Sheets, data are stored in the spreadsheet and
organized in cells. In R, they are stored in objects. When we read a csv
file, the data goes directly to the penguins
data frame and
it’s ready to be used. In the Environment panel we can see the
penguins
object, if we click that object the data will open
in a new tab fro us to take a look.
This view is the most similar to the one we have in a spreadsheet. We
can get to this panel by running View(penguins)
on the
console There are several other function to look at our data. Let’s use
one of them
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
This output is different and give us information about the type of data in each column (or variable).
Sometimes our data is not so friendly and we need to give more information to the function to be able to read the data properly. You can find these options by looking into the function’s documentation.
Go ahead and write
?read_csv()
on the console. What is the name of the option to change the default delimitator?
What about xls files? For that we’ll need differnet R package,
readxl that is already installed on the RStudio Cloud
project. In this case the function is called
read_excel()
library(readxl)
<- read_excel("data/penguins.xlsx") penguins_xls
And that’s it, we’ve read an xls file. Of course, we sometimes have
to work with files with multiple sheets or data that is no very
organised. This functions comes with several options or arguments to
read specific sheets (sheet = <name of the sheet>
) or
a specific range (range = "C1:E7"
) and others.
Now that we have the data read into R, it’s time to do some analysis.