Ingesting legacy data into MERMAID • mermaidr

Do you have a lot of existing data and are just starting to use MERMAID? You might be wondering on how to import those data into MERMAID. You can import (ingest) your legacy data into MERMAID by using the mermaidr package. More info on the ingestion workflow with mermaidr can be seen on the ingestion documentation.

Here we provide an example on how to import your fish data that uses the fish belt method into MERMAID using mermaidr. Please note that you need to prepare your project in the app first before starting this process. Detail on how to set a new project can be seen here. For this example, we have already prepared a project in the MERMAID app named MERMAID reef survey.

We have also prepared a video to walk you through the ingestion process. You can download the fishbelt and sites data for that example to follow along.

The steps for importing legacy fish belt data are:

Download the fish belt MERMAID template
Reformat your data to match the template
Address errors and warnings
Import (ingest) data to MERMAID

1. Download the fish belt MERMAID template

Before downloading the MERMAID template, make sure that you already have the mermaidr package installed. You can install the package from GitHub using the code below:

remotes::install_github("data-mermaid/mermaidr")

In this example, we are going to also use the tidyverse package. The first step is to load the packages and access your MERMAID projects by using mermaid_search_my_projects() with the project name.

At this point, a browser window will open for you to authenticate by logging into the MERMAID app. Once you’ve logged in, you can close the browser and come back to R. Your login credentials will last for a day and after it expires you will need to log in again.

library(mermaidr)
library(tidyverse)

reef_survey <- mermaid_search_my_projects("MERMAID reef survey")

The next step is to get the fish belt MERMAID template and options using mermaid_import_get_template_and_options() and save it to a file called fishbelt_mermaid_template.xlsx.

fish_template_and_options <- mermaid_import_get_template_and_options(
  reef_survey,
  "fishbelt",
  "fishbelt_mermaid_template.xlsx"
)

## ✔ Import template and field options written to fishbelt_mermaid_template.xlsx

Fish belt MERMAID template

The XLSX file consists of the fish belt MERMAID template and the options that are tailored based on your project. For example, the site options that are accepted are all the sites that you have added in the project through the MERMAID app.

Site options

You can also preview the template in R:

fish_template_and_options[["Template"]]

## # A tibble: 0 × 22
## # ℹ 22 variables: Site * <chr>, Management * <chr>,
## #   Sample date: Year * <chr>, Sample date: Month * <chr>,
## #   Sample date: Day * <chr>, Sample time <chr>, Depth * <chr>,
## #   Transect number * <chr>, Transect label <chr>,
## #   Transect length surveyed * <chr>, Width * <chr>, Fish size bin * <chr>,
## #   Reef slope <chr>, Visibility <chr>, Current <chr>, Relative depth <chr>,
## #   Tide <chr>, Sample unit notes <chr>, Observer emails * <chr>, …

Or if you want to investigate the column names in R, you can use the code below:

names(fish_template_and_options)

##  [1] "Template"                   "Site *"                    
##  [3] "Management *"               "Sample date: Year *"       
##  [5] "Sample date: Month *"       "Sample date: Day *"        
##  [7] "Sample time"                "Depth *"                   
##  [9] "Transect number *"          "Transect label"            
## [11] "Transect length surveyed *" "Width *"                   
## [13] "Fish size bin *"            "Reef slope"                
## [15] "Visibility"                 "Current"                   
## [17] "Relative depth"             "Tide"                      
## [19] "Sample unit notes"          "Observer emails *"         
## [21] "Fish name *"                "Size *"                    
## [23] "Count *"

All column names with asterisk shows that it is mandatory. You can leave the columns without the asterisk blank if the data is not available.

Investigate all the available options for each column before adjusting the format. For example, to look at the options for site in R:

fish_template_and_options[["Site *"]]

## $required
## [1] TRUE
## 
## $help_text
## [1] "A unique name of a site where data was collected. Every site must be defined before ingestion and set up in the project in the web app."
## 
## $choices
## # A tibble: 6 × 1
##   value    
##   <chr>    
## 1 1206     
## 2 1208     
## 3 1209     
## 4 1210     
## 5 1211     
## 6 SAV Point

2. Reformat your data to match the template

After downloading the fish belt MERMAID template, you’re ready to start reformatting your data. In this example, we are using a fish belt width of 5 m for fish size 10-34 cm, and 20 m for fish size bigger than 34 cm. We also identify the fish size up to the closest cm.

The first step is to read in your data. We have prepared our data in a CSV file and stored it in our project directory.

We read our fish belt data (stored in fishbelt.csv file) and our sampling event data (stored in sites.csv file), and save it to two different objects in R.

fishbelt_data <- read_csv("fishbelt.csv")
sites_data <- read_csv("sites.csv")

The MERMAID ingestion template requires the sample event data and the observation data to be combined. Therefore, we need to add the sites data set to fish observation data. Look at the available columns in the site and fishbelt data sets:

sites_data

## # A tibble: 2 × 8
##   SiteID Zone     Year Month   Day Reef_slope visibility current 
##    <dbl> <chr>   <dbl> <dbl> <dbl> <chr>           <dbl> <chr>   
## 1   1208 Control  2021     3    15 slope              15 low/none
## 2   1206 Control  2021     3    15 slope              15 low/none

fishbelt_data

## # A tibble: 10 × 8
##    SiteID Depth Transect_length Width Transect_number `Fish species`   Size_cm
##     <dbl> <dbl>           <dbl> <dbl>           <dbl> <chr>              <dbl>
##  1   1208     5               5     5               1 Parupeneus mult…      15
##  2   1208     5               5    20               1 Acanthurus bari…      37
##  3   1208     5               5    20               1 Platax teira          37
##  4   1208     5               5    20               1 Bolbometopon mu…     120
##  5   1208     5               5    20               1 Parupeneus indi…      36
##  6   1208     5               5     5               1 Acanthurus aura…      17
##  7   1208     5               5    20               1 Plectorhinchus …      36
##  8   1208     5               5     5               1 Zanclus cornutus      10
##  9   1208     5               5    20               1 Variola albimar…      36
## 10   1208     5               5    20               1 Lutjanus sp.          36
## # ℹ 1 more variable: Abundance <dbl>

Then combine the site data to fishbelt data, joining on the SiteID identifier.

fishbelt_data <- fishbelt_data %>%
  left_join(sites_data, by = "SiteID")

Looking at the available data, there are still two mandatory fields that are not yet added (Fish size bin and Observer emails) and one mandatory field that need to be adjusted ( Fishbelt Width). We need to look at the options allowed for these columns:

fish_template_and_options[["Width *"]]

## $required
## [1] TRUE
## 
## $help_text
## [1] "Width of fish belt transect, in meters. See relevant tab on ingestion template for choices."
## 
## $choices
## # A tibble: 9 × 1
##   value                                        
##   <chr>                                        
## 1 10m                                          
## 2 1m                                           
## 3 20m                                          
## 4 2m                                           
## 5 5m                                           
## 6 Mixed: < 10cm @ 2m, >= 10cm @ 5m             
## 7 Mixed: >=10 cm & <35 cm @ 5 m, >=35 cm @ 20 m
## 8 Mixed: < 20cm @ 2m, >= 20cm @ 4m             
## 9 Mixed: < 20cm @ 4m, >= 20cm @ 8m

fish_template_and_options[["Fish size bin *"]]

## $required
## [1] TRUE
## 
## $help_text
## [1] "Name of bin scheme used to estimate fish size for the transect. See relevant tab on ingestion template for choices. Choose 1 cm if the fish size recorded does not use bins."
## 
## $choices
## # A tibble: 5 × 1
##   value    
##   <chr>    
## 1 1        
## 2 10       
## 3 5        
## 4 AGRRA    
## 5 WCS India

fish_template_and_options[["Observer emails *"]]

## $required
## [1] TRUE
## 
## $help_text
## [1] "Comma-separated list of emails of sample unit observers (e.g. 'me@example.com,you@example.com')."
## 
## $choices
## # A tibble: 1 × 1
##   value            
##   <chr>            
## 1 email@mermaid.org

Now that we know the available options, we are going to add them manually using the code below:

fishbelt_data <- fishbelt_data %>%
  mutate(
    `Width *` = "Mixed: >=10 cm & <35 cm @ 5 m, >=35 cm @ 20 m",
    `Fish size bin *` = 1,
    `Observer emails *` = "email@mermaid.org"
  )

The visibility data are not the same as the options accepted by MERMAID:

fishbelt_data %>%
  distinct(visibility)

## # A tibble: 1 × 1
##   visibility
##        <dbl>
## 1         15

Let’s check how it is formatted in template and implement the option into our data:

fish_template_and_options[["Visibility"]][["choices"]]

## # A tibble: 4 × 1
##   value           
##   <chr>           
## 1 <1m - bad       
## 2 1-5m - poor     
## 3 5-10m - fair    
## 4 >10m - excellent

fishbelt_data <- fishbelt_data %>%
  mutate(visibility = case_when(
    visibility == 1 ~ "<1m - bad",
    visibility == 5 ~ "1-5m - poor",
    visibility > 5 & visibility <= 10 ~ "5-10m",
    visibility >= 10 ~ ">10m - excellent"
  ))

You might already noticed that the column names in our dataset are not the same with the template, and you might be wondering if it should be the same. The answer is yes, they must have exactly the same name but not necessarily the same order. Lets rename and reorder the columns. First check the names in the template and in our data:

names(fish_template_and_options[["Template"]])

##  [1] "Site *"                     "Management *"              
##  [3] "Sample date: Year *"        "Sample date: Month *"      
##  [5] "Sample date: Day *"         "Sample time"               
##  [7] "Depth *"                    "Transect number *"         
##  [9] "Transect label"             "Transect length surveyed *"
## [11] "Width *"                    "Fish size bin *"           
## [13] "Reef slope"                 "Visibility"                
## [15] "Current"                    "Relative depth"            
## [17] "Tide"                       "Sample unit notes"         
## [19] "Observer emails *"          "Fish name *"               
## [21] "Size *"                     "Count *"

names(fishbelt_data)

##  [1] "SiteID"            "Depth"             "Transect_length"  
##  [4] "Width"             "Transect_number"   "Fish species"     
##  [7] "Size_cm"           "Abundance"         "Zone"             
## [10] "Year"              "Month"             "Day"              
## [13] "Reef_slope"        "visibility"        "current"          
## [16] "Width *"           "Fish size bin *"   "Observer emails *"

Don’t worry if you don’t have any of Sample time, Transect label, Relative depth, Tide, Sample unit notes information, because none of them are required. It will not prevent you from importing (ingesting) your data into MERMAID. However, if you have those data, it is strongly recommended to add them as well.

Next step is to rename and reorder the columns in one step:

fishbelt_data <- fishbelt_data %>%
  select(
    `Site *` = SiteID,
    `Management *` = Zone,
    `Sample date: Year *` = Year,
    `Sample date: Month *` = Month,
    `Sample date: Day *` = Day,
    `Depth *` = Depth,
    `Transect number *` = Transect_number,
    `Transect length surveyed *` = Transect_length,
    `Width *`,
    `Fish size bin *`,
    `Reef slope` = Reef_slope,
    `Visibility` = visibility,
    `Current` = current,
    `Observer emails *`,
    `Fish name *` = `Fish species`,
    `Size *` = Size_cm,
    `Count *` = Abundance
  )

fishbelt_data

## # A tibble: 10 × 17
##    `Site *` `Management *` `Sample date: Year *` `Sample date: Month *`
##       <dbl> <chr>                          <dbl>                  <dbl>
##  1     1208 Control                         2021                      3
##  2     1208 Control                         2021                      3
##  3     1208 Control                         2021                      3
##  4     1208 Control                         2021                      3
##  5     1208 Control                         2021                      3
##  6     1208 Control                         2021                      3
##  7     1208 Control                         2021                      3
##  8     1208 Control                         2021                      3
##  9     1208 Control                         2021                      3
## 10     1208 Control                         2021                      3
## # ℹ 13 more variables: `Sample date: Day *` <dbl>, `Depth *` <dbl>,
## #   `Transect number *` <dbl>, `Transect length surveyed *` <dbl>,
## #   `Width *` <chr>, `Fish size bin *` <dbl>, `Reef slope` <chr>,
## #   Visibility <chr>, Current <chr>, `Observer emails *` <chr>,
## #   `Fish name *` <chr>, `Size *` <dbl>, `Count *` <dbl>

3. Address errors and warnings

After reformatting our data, we’re going to clean our data using mermaid_import_check_options() We need to check the columns one by one to ensure the data are accepted by the MERMAID template. The code below is to check our data that we’ve reformatted against the fish belt template we’ve downloaded. If the data matches, then a check mark will appear:

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Site *")

## ✔ All values of `Site *` match

## # A tibble: 1 × 3
##   data_value closest_choice match
##   <chr>      <chr>          <lgl>
## 1 1208       1208           TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Management *")

## ✔ All values of `Management *` match

## # A tibble: 1 × 3
##   data_value closest_choice match
##   <chr>      <chr>          <lgl>
## 1 Control    Control        TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Sample date: Year *")

## ✔ Any value is allowed for `Sample date: Year *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Sample date: Month *")

## ✔ Any value is allowed for `Sample date: Month *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Sample date: Day *")

## ✔ Any value is allowed for `Sample date: Day *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Depth *")

## ✔ Any value is allowed for `Depth *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Transect number *")

## ✔ Any value is allowed for `Transect number *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Transect length surveyed *")

## ✔ Any value is allowed for `Transect length surveyed *` - no checking to be
## done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Width *")

## ✔ All values of `Width *` match

## # A tibble: 1 × 3
##   data_value                                    closest_choice           match
##   <chr>                                         <chr>                    <lgl>
## 1 Mixed: >=10 cm & <35 cm @ 5 m, >=35 cm @ 20 m Mixed: >=10 cm & <35 cm… TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Fish size bin *")

## ✔ All values of `Fish size bin *` match

## # A tibble: 1 × 3
##   data_value closest_choice match
##   <chr>      <chr>          <lgl>
## 1 1          1              TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Reef slope")

## ✔ All values of `Reef slope` match

## # A tibble: 1 × 3
##   data_value closest_choice match
##   <chr>      <chr>          <lgl>
## 1 slope      slope          TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Visibility")

## ✔ All values of `Visibility` match

## # A tibble: 1 × 3
##   data_value       closest_choice   match
##   <chr>            <chr>            <lgl>
## 1 >10m - excellent >10m - excellent TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Current")

## ✔ All values of `Current` match

## # A tibble: 1 × 3
##   data_value closest_choice match
##   <chr>      <chr>          <lgl>
## 1 low/none   low/none       TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Observer emails *")

## ✔ All values of `Observer emails *` match

## # A tibble: 1 × 3
##   data_value        closest_choice    match
##   <chr>             <chr>             <lgl>
## 1 email@mermaid.org email@mermaid.org TRUE

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Fish name *")

## • Some errors in values of `Fish name *` - please check table below

## # A tibble: 10 × 3
##    data_value                    closest_choice                match
##    <chr>                         <chr>                         <lgl>
##  1 Parupeneus multifaskiatus     Parupeneus multifasciatus     FALSE
##  2 Parupeneus indikus            Parupeneus indicus            FALSE
##  3 Lutjanus sp.                  Lutjanus                      FALSE
##  4 Acanthurus bariene            Acanthurus bariene            TRUE 
##  5 Platax teira                  Platax teira                  TRUE 
##  6 Bolbometopon muricatum        Bolbometopon muricatum        TRUE 
##  7 Acanthurus auranticavus       Acanthurus auranticavus       TRUE 
##  8 Plectorhinchus chaetodonoides Plectorhinchus chaetodonoides TRUE 
##  9 Zanclus cornutus              Zanclus cornutus              TRUE 
## 10 Variola albimarginata         Variola albimarginata         TRUE

There are issues in the Fish name that need to be fixed, marked by the FALSE note under the match column. MERMAID also provides the closest choice to help us with data cleaning. We need to fix theses issues to be able to ingest our data:

fishbelt_data <- fishbelt_data %>%
  mutate(`Fish name *` = case_when(
    `Fish name *` == "Parupeneus multifaskiatus" ~ "Parupeneus multifasciatus",
    `Fish name *` == "Parupeneus indikus" ~ "Parupeneus indicus",
    `Fish name *` == "Lutjanus sp." ~ "Lutjanus",
    TRUE ~ `Fish name *`
  ))

After fixing the issues, lets check again the values to makes sure we receive a check mark to move forward with the process.

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Fish name *")

## ✔ All values of `Fish name *` match

## # A tibble: 10 × 3
##    data_value                    closest_choice                match
##    <chr>                         <chr>                         <lgl>
##  1 Parupeneus multifasciatus     Parupeneus multifasciatus     TRUE 
##  2 Acanthurus bariene            Acanthurus bariene            TRUE 
##  3 Platax teira                  Platax teira                  TRUE 
##  4 Bolbometopon muricatum        Bolbometopon muricatum        TRUE 
##  5 Parupeneus indicus            Parupeneus indicus            TRUE 
##  6 Acanthurus auranticavus       Acanthurus auranticavus       TRUE 
##  7 Plectorhinchus chaetodonoides Plectorhinchus chaetodonoides TRUE 
##  8 Zanclus cornutus              Zanclus cornutus              TRUE 
##  9 Variola albimarginata         Variola albimarginata         TRUE 
## 10 Lutjanus                      Lutjanus                      TRUE

And finish checking the columns.

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Size *")

## ✔ Any value is allowed for `Size *` - no checking to be done

mermaid_import_check_options(fishbelt_data, fish_template_and_options, "Count *")

## ✔ Any value is allowed for `Count *` - no checking to be done

Once we got the check marks, we have the cleaned version of our data that is ready to be ingested. Let’s save the cleaned fishbelt data:

write_csv(fishbelt_data, "fishbelt_clean.csv")

4. Import (ingest) data to MERMAID

Once we have our cleaned MERMAID formatted data, the next step is ingesting the data. We do one “dry run” before actually ingesting to check the data once more:

mermaid_import_project_data(
  fishbelt_data,
  reef_survey,
  method = "fishbelt",
  dryrun = TRUE
)

## Records successfully checked! To import, please run the function again with `dryrun = FALSE`.

Once we got the message Records successfully checked!, change the dryrun option to FALSE to start ingesting your data into MERMAID. Once you got the message Records successfully imported! Please review in Collect., you can head to the Collecting Page in your project in the MERMAID app and start validating and submitting each transect.

mermaid_import_project_data(
  fishbelt_data,
  reef_survey,
  method = "fishbelt",
  dryrun = FALSE
)

## Records successfully imported! Please review in Collect.