7 Loading Data and Data Cleaning
Screening and cleaning your data to identify and fix potential errors (missing values, typos, measurement errors, etc.) is an important step before conducting any analyses. This is known as Quality Assurance/Quality Control, or QA/QC. This section gives an overview of the steps that should be taken to properly screen your data and introduces some functions that come in handy when cleaning data. If you have a small dataset that won't be updated often, screening and cleaning may be easiest in Microsoft Excel by sorting and filtering your data columns. However, we recommend performing your data cleaning in R. This has the advantage that every change made to the raw dataset is recorded in a reproducible script, which is especially useful when working with large datasets, when you want to quickly modify a step of your cleaning process, or when you receive additional data.
7.1 Loading Data
Throughout this section and the following fisheries analysis and plotting sections, we'll use a common data set. First, create a new folder in your working directory called "_data". Next, download the following file onto your computer: right click on the link below and save it in your "_data" folder.
Right click and save-link-as to download landings data
Let's read this dataset into R and determine its structure. The landings_data data frame is from a fishery-dependent landing site survey. The species included in this data set is Caesio cuning, a yellowtail fusilier. We'll be able to use these data to create length-frequency histograms that describe the size structure of the population, as well as trends in catch and CPUE.
We can look at the raw data just by typing landings_data.
# Load the packages used in this section: readr (read_csv), dplyr, and lubridate
library(tidyverse)
library(lubridate)

landings_data <- read_csv("_data/sample_landings_data_raw.csv")
landings_data
## # A tibble: 7,214 x 8
## yy dat trip effort gr sp l_cm w_cm
## <int> <chr> <int> <int> <chr> <chr> <dbl> <dbl>
## 1 2003 4/30/03 1 10 Trap Caesoi cunning 36 1089.1402
## 2 2003 4/30/03 1 10 trap Caesio cuning 29 565.3879
## 3 2003 4/30/03 1 10 Trap Caesio cuning 34 915.8276
## 4 2003 4/30/03 1 10 Trap Caesio cuning 36 1089.1402
## 5 2003 4/30/03 1 10 Trap Caesio cuning 34 915.8276
## 6 2003 4/30/03 1 10 Trap Caesoi cunning 28 508.3185
## 7 2003 4/30/03 1 10 Trap Caesio cuning 30 626.6000
## 8 2003 4/30/03 1 10 Trap Caesio cuning 27 455.2443
## 9 2003 4/30/03 1 10 Trap Caesio cuning 33 836.5681
## 10 2003 4/30/03 1 10 Trap Caesio cuning 35 999.9688
## # ... with 7,204 more rows
You'll first notice that R calls this data frame a tibble, which is just the tidyverse's clean version of a data frame. This format is used automatically when you read in data with read_csv, which we always recommend. We can see that there are 7,214 individual fish catch observations (rows) in our data frame and 8 variables (columns). The columns include the year and date when the measurement was collected, the fishing trip ID, how many hours were fished on each trip, what gear was used, the species, the length of the fish, and the weight of the fish.
7.2 Data Structure
First, let's give our columns more descriptive headings. We can rename columns using the rename function from the dplyr package. Let's also convert the Date variable to a date format using the mdy function from the lubridate package. We start by taking the landings data frame we loaded into R and working through a series of "pipes", designated by the %>% operator, which progressively analyze the data from one step to the next. Essentially, the output of one line is fed into the input of the next line.
# Start with the landings_data data frame
landings_data <- landings_data %>%
# Rename the columns
rename(Year = yy,
Date = dat,
Trip_ID = trip,
Effort_Hours = effort,
Gear = gr,
Species = sp,
Length_cm = l_cm,
Weight_g = w_cm) %>%
# Turn the date column into a date format that R recognizes
mutate(Date = mdy(Date))
landings_data
## # A tibble: 7,214 x 8
## Year Date Trip_ID Effort_Hours Gear Species Length_cm
## <int> <date> <int> <int> <chr> <chr> <dbl>
## 1 2003 2003-04-30 1 10 Trap Caesoi cunning 36
## 2 2003 2003-04-30 1 10 trap Caesio cuning 29
## 3 2003 2003-04-30 1 10 Trap Caesio cuning 34
## 4 2003 2003-04-30 1 10 Trap Caesio cuning 36
## 5 2003 2003-04-30 1 10 Trap Caesio cuning 34
## 6 2003 2003-04-30 1 10 Trap Caesoi cunning 28
## 7 2003 2003-04-30 1 10 Trap Caesio cuning 30
## 8 2003 2003-04-30 1 10 Trap Caesio cuning 27
## 9 2003 2003-04-30 1 10 Trap Caesio cuning 33
## 10 2003 2003-04-30 1 10 Trap Caesio cuning 35
## # ... with 7,204 more rows, and 1 more variables: Weight_g <dbl>
7.3 Missing values
Next, let's check our data frame for missing values by subsetting the observations (rows) that have missing values, using the complete.cases function and the logical negation operator !.
landings_data[!complete.cases(landings_data),]
## # A tibble: 3 x 8
## Year Date Trip_ID Effort_Hours Gear Species Length_cm
## <int> <date> <int> <int> <chr> <chr> <dbl>
## 1 2003 2003-05-01 10 10 <NA> Caesio cuning 19.000
## 2 2003 2003-05-01 10 10 Handline Caesio cuning 19.000
## 3 2004 2004-12-18 NA 9 Trap Caesio cuning 20.104
## # ... with 1 more variables: Weight_g <dbl>
There are 3 rows in our data frame with missing values. If we want to remove observations with missing data from our dataset, we can use the na.omit function, which will remove any rows with missing values:
landings_data <- na.omit(landings_data)
landings_data
## # A tibble: 7,211 x 8
## Year Date Trip_ID Effort_Hours Gear Species Length_cm
## <int> <date> <int> <int> <chr> <chr> <dbl>
## 1 2003 2003-04-30 1 10 Trap Caesoi cunning 36
## 2 2003 2003-04-30 1 10 trap Caesio cuning 29
## 3 2003 2003-04-30 1 10 Trap Caesio cuning 34
## 4 2003 2003-04-30 1 10 Trap Caesio cuning 36
## 5 2003 2003-04-30 1 10 Trap Caesio cuning 34
## 6 2003 2003-04-30 1 10 Trap Caesoi cunning 28
## 7 2003 2003-04-30 1 10 Trap Caesio cuning 30
## 8 2003 2003-04-30 1 10 Trap Caesio cuning 27
## 9 2003 2003-04-30 1 10 Trap Caesio cuning 33
## 10 2003 2003-04-30 1 10 Trap Caesio cuning 35
## # ... with 7,201 more rows, and 1 more variables: Weight_g <dbl>
Checking the data structure again, we can see that the 3 rows containing NA values have been removed from our data frame. You may not always wish to remove NA values from a dataset, for example if you still want to keep the other values in those observations. Even if you keep observations with NA values in the dataset, it is still good practice to identify NAs and know where they occur, to ensure they don't create problems during analyses.
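Even without removing rows, you can audit where the NAs occur. Here is a minimal sketch using only base R, with a small hypothetical data frame standing in for the landings data:

```r
# Small hypothetical data frame standing in for landings_data
toy <- data.frame(
  Gear      = c("trap", NA, "handline", "trap"),
  Length_cm = c(29, 19, NA, 34)
)

# Count the NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Identify which rows contain at least one NA
rows_with_na <- which(!complete.cases(toy))
rows_with_na
```

Counting NAs per column is often more informative than dropping rows outright, because it shows whether missingness is concentrated in one variable.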
7.4 Typos
We can check for typos by using the unique function, which will tell us all of the unique values found within a particular column. As an example, let's look at the Gear variable.
unique(landings_data$Gear)
## [1] "Trap" "trap" "Muroami" "Handline" "Gillnet" "Trolling"
## [7] "Speargun"
The Gear variable has 7 unique values; however, we know there should only be 6 gears present in the dataset. We can see that "trap" appears twice because capitalization was inconsistent: the lowercase 't' causes R to read it as a distinct gear type. We can fix this by making all of the values in the Gear variable consistently lowercase using the tolower function. Alternatively, we could change them all to uppercase using the toupper function.
landings_data <- landings_data %>%
mutate(Gear = tolower(Gear))
unique(landings_data$Gear)
## [1] "trap" "muroami" "handline" "gillnet" "trolling" "speargun"
Now we have the correct number (6) of unique gears in our dataset.
Now, let's check the Species variable for typos:
unique(landings_data$Species)
## [1] "Caesoi cunning" "Caesio cuning"
The Species variable shows 2 unique values, but we know there should only be one species in our dataset. It appears there is a spelling error in one of our species names. We can check how many times each of the 2 spellings occurs by using the nrow function on a filtered subset of the data for each of the two Species values:
landings_data %>%
filter(Species == "Caesoi cunning") %>%
nrow()
## [1] 2
landings_data %>%
filter(Species == "Caesio cuning") %>%
nrow()
## [1] 7209
It looks like "Caesoi cunning" is the typo, because it only appears twice in our dataset, while "Caesio cuning" appears 7,209 times. We can fix this by replacing the misspelled Species value with the correctly spelled one, using mutate and replace.
landings_data <- landings_data %>%
mutate(Species = replace(Species,Species == "Caesoi cunning", "Caesio cuning"))
unique(landings_data$Species)
## [1] "Caesio cuning"
Now we have only one value in our Species variable, which is correct. The unique values of all categorical columns (i.e., gear type, species name, etc.) should be examined during the data screening and cleaning process.
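That check can be applied to every categorical column in a single pass. Here is a minimal sketch using only base R, again with a small hypothetical data frame standing in for the landings data:

```r
# Small hypothetical data frame standing in for landings_data
toy <- data.frame(
  Gear    = c("trap", "Trap", "handline"),
  Species = c("Caesio cuning", "Caesoi cunning", "Caesio cuning"),
  stringsAsFactors = FALSE
)

# Find the character columns, then list the unique values of each
char_cols <- sapply(toy, is.character)
lapply(toy[char_cols], unique)
```

Scanning this output makes inconsistent capitalization and misspellings easy to spot across all categorical variables at once.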
7.5 Errors
Errors in numeric/integer values may be caused by typos during data entry or by an error during the data collection process (for example, maybe the scale was broken or not zeroed out before weighing). To look at the range and distribution of a numeric variable, the summary function can be used.
summary(landings_data$Length_cm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 25.00 25.81 27.00 2400.00
It looks like we have a maximum Length_cm value that is orders of magnitude higher than the mean and median values. Visualizing numeric data is another great way to screen continuous data and identify outliers that may be caused by errors in the dataset:
plot(landings_data$Length_cm)
We can clearly see there is an outlier in our data (upper left corner of the plot). We are not sure how this error occurred, but we know that this value is not correct. In fact, we know that the maximum possible size of our species is 100 cm, so a measurement or typo error must have occurred for any Length_cm values over 100 cm. We can remove these erroneous data by keeping only the observations with values under 100 cm (the species' maximum size) using the filter function:
landings_data <- landings_data %>%
filter(Length_cm < 100)
plot(landings_data$Length_cm)
Now our data contains only length observations within the plausible range for our species. This process of plotting and examining should be conducted for each numeric variable before conducting any analyses, to identify outliers and remove erroneous data. In this example, we will skip this step for the Weight_g and Effort_Hours columns, although you may wish to do it as a learning exercise on your own.
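As a sketch of what that exercise looks like, here is the same summary/plot/cutoff pattern applied to a small hypothetical weight vector (the values and the 20 kg cutoff are made up for illustration):

```r
# Hypothetical weights in grams standing in for landings_data$Weight_g;
# the last value is deliberately implausible
weight_g <- c(565, 915, 1089, 508, 626, 455000)

summary(weight_g)  # the max is far beyond the 3rd quartile
plot(weight_g)     # the outlier stands out immediately

# Keep only biologically plausible weights (hypothetical 20 kg cutoff)
cleaned <- weight_g[weight_g < 20000]
length(cleaned)
```

The appropriate cutoff depends on what is biologically plausible for the species and gear, so it should be chosen from domain knowledge rather than from the data alone.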
7.6 Saving clean data
Now that we have completed our data cleaning and screening, let’s examine the structure of our data frame again:
landings_data
## # A tibble: 7,208 x 8
## Year Date Trip_ID Effort_Hours Gear Species Length_cm
## <int> <date> <int> <int> <chr> <chr> <dbl>
## 1 2003 2003-04-30 1 10 trap Caesio cuning 36
## 2 2003 2003-04-30 1 10 trap Caesio cuning 29
## 3 2003 2003-04-30 1 10 trap Caesio cuning 34
## 4 2003 2003-04-30 1 10 trap Caesio cuning 36
## 5 2003 2003-04-30 1 10 trap Caesio cuning 34
## 6 2003 2003-04-30 1 10 trap Caesio cuning 28
## 7 2003 2003-04-30 1 10 trap Caesio cuning 30
## 8 2003 2003-04-30 1 10 trap Caesio cuning 27
## 9 2003 2003-04-30 1 10 trap Caesio cuning 33
## 10 2003 2003-04-30 1 10 trap Caesio cuning 35
## # ... with 7,198 more rows, and 1 more variables: Weight_g <dbl>
We now have 7,208 observations and 8 variables, with each variable being the correct data type. Comparing this to our raw dataset, we can see that we removed 6 observations (3 had missing values and 3 had erroneous length values). This script may come in handy if, for example, we realize that the maximum size of our species is actually 200 cm (not 100 cm). Because we have documented our cleaning process, we know that our dataset does not include any length observations over 100 cm, and we can easily go back to this script, change the 100 to a 200, and rerun it. If we receive more data, we can also simply re-run this script, and all data cleaning steps will be performed again automatically.
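One way to make such a change even easier is to keep cutoffs as named parameters near the top of the script rather than hard-coding them inside the filter call. A minimal sketch with hypothetical values and a small stand-in data frame:

```r
# Cleaning parameters collected in one place; edit here and re-run the script
max_length_cm <- 100  # species maximum length; change to 200 if revised

# Small hypothetical data frame standing in for landings_data
toy <- data.frame(Length_cm = c(36, 29, 2400, 34))

# Apply the cutoff (base R subsetting; filter(Length_cm < max_length_cm)
# does the same thing in dplyr)
toy_clean <- toy[toy$Length_cm < max_length_cm, , drop = FALSE]
nrow(toy_clean)
```

Keeping such constants in one place means a revised assumption requires a single edit, and every downstream step picks it up on the next run.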
We can save this dataset under a new name so that we have a copy of both the raw and the clean data. Now we are ready to summarize and analyze our clean dataset.
write_csv(landings_data,"_data/sample_landings_data_clean.csv")