MERMAID Fishbelt Data Quality Checks

This code performs data quality checks on MERMAID projects with fish belt data, including observer comparisons, outlier detection, and shark observation analysis.

Getting example fishbelt data from MERMAID

Loads the necessary R packages and gets example MERMAID projects with fish belt data. This analysis can be used for data checks on any project. However, in this example I use two separate projects in order to show trends in observers and taxa:

Observer comparison project: A project with multiple observers to show observer effect analysis
Shark data project: A project with shark observations for the shark analysis section

Since the code relies on all three scales of MERMAID data (observations, sample units, and sample events), it is necessary to either select MERMAID projects whose permissions are set to public for fish belt or to use projects for which you are a member.

Note: This step requires authentication even if you are accessing public projects. This means you need to create a free MERMAID account if you don’t already have one.

Show the code

rm(list = ls()) #remove past stored objects
options(scipen = 999) #turn off scientific notation

#### Load packages ####
library(mermaidr)
library(tidyverse)
library(plotly)
library(DT)
library(ggplot2)
library(ggpubr)
library(knitr)
library(rfishbase)

#### Option 1: Get data from public projects ####

# Find a project with many observers for observer comparisons
observer_projects <- mermaid_get_summary_sampleevents() %>% 
  filter(data_policy_beltfish == "public" & 
           !is.na(beltfish_biomass_kgha_avg) &
           !country %in% c("Indonesia", "Philippines")) %>%
  group_by(project_id, project, tags, project_notes, country) %>% 
  dplyr::summarise(
    NumSites = length(site),
    TotalSampleUnits = sum(beltfish_sample_unit_count),
    NumObservers = n_distinct(observers)
  ) %>% 
  arrange(desc(NumObservers), desc(TotalSampleUnits))

# Select project with most observers
observer_project_id <- observer_projects$project_id[1]
observer_project_name <- observer_projects$project[1]

# Get the observer comparison data
observer_data <- mermaid_get_project_data(
  project = observer_project_id,
  method = "fishbelt",
  data = "all"
)

# Find a project with shark observations
# Get all public projects and check for shark families
shark_families <- c(
  "Carcharhinidae","Sphyrnidae","Alopiidae","Lamnidae","Rhincodontidae",
  "Galeocerdonidae","Ginglymostomatidae","Hemiscylliidae","Heterodontidae",
  "Hexanchidae","Odontaspididae","Orectolobidae","Parascylliidae",
  "Scyliorhinidae","Squalidae","Stegostomatidae","Triakidae"
)

# This is a simplified approach - you may need to manually specify a project
# that you know has shark data
shark_project_id <- observer_project_id  # Start with same project
shark_project_name <- observer_project_name

# Try to find a project with sharks (this may take time, so commented out)
# You can manually specify a project ID here if you know one with shark data
shark_project_id <- "55ac964c-0228-42da-8061-5983339ecb9f"
shark_project_name <- observer_projects$project[observer_projects$project_id == shark_project_id]

shark_data <- mermaid_get_project_data(
  project = shark_project_id,
  method = "fishbelt",
  data = "all"
)

# Check if shark data actually has sharks, if not use observer data
has_sharks <- any(shark_data$observations$fish_family %in% shark_families)

if(!has_sharks) {
  cat("Note: Selected shark project does not contain shark observations.\n")
  cat("Using observer project data for all analyses.\n\n")
  shark_data <- observer_data
  shark_project_id <- observer_project_id
  shark_project_name <- observer_project_name
}


# #### Option 2: Get data from projects for which you are a member ####
# my_projects <- mermaid_get_my_projects()
# 
# # Specify which project to use for observer comparisons
# observer_project_id <- my_projects$project_id[1]  # Change index as needed
# observer_data <- mermaid_get_project_data(
#   project = observer_project_id,
#   method = "fishbelt",
#   data = "all"
# )
# 
# # Specify which project to use for shark data (can be same or different)
# shark_project_id <- my_projects$project_id[2]  # Change index as needed
# shark_data <- mermaid_get_project_data(
#   project = shark_project_id,
#   method = "fishbelt",
#   data = "all"
# )


# Extract the three levels for observer comparisons
observations_data <- observer_data$observations
sampleunits_data <- observer_data$sampleunits
sampleevents_data <- observer_data$sampleevents

# Extract data for shark analysis (will be used later)
shark_observations_data <- shark_data$observations
shark_sampleunits_data <- shark_data$sampleunits
shark_sampleevents_data <- shark_data$sampleevents

Anonymize Observers

Show the code

# Create anonymization mapping
create_observer_mapping <- function(observer_string) {
  # Split all observer strings and get unique observers
  all_observers <- observer_string %>%
    str_split(", ") %>%
    unlist() %>%
    unique() %>%
    sort()
  
  # Create mapping with Observer A, B, C, etc.
  mapping <- tibble(
    original = all_observers,
    anonymous = paste0("Observer ", LETTERS[1:length(all_observers)])
  )
  
  return(mapping)
}

# Get all unique observers from sample events
observer_mapping <- create_observer_mapping(sampleevents_data$observers)

# Function to anonymize observer strings
anonymize_observers <- function(observer_string, mapping) {
  # Split the string
  observers <- str_split(observer_string, ", ")[[1]]
  
  # Map each observer
  anonymous_observers <- sapply(observers, function(obs) {
    mapping$anonymous[mapping$original == obs]
  })
  
  # Recombine
  paste(anonymous_observers, collapse = ", ")
}

# Apply anonymization to all data frames
observations_data <- observations_data %>%
  mutate(observers = sapply(observers, anonymize_observers, mapping = observer_mapping))

sampleunits_data <- sampleunits_data %>%
  mutate(observers = sapply(observers, anonymize_observers, mapping = observer_mapping))

sampleevents_data <- sampleevents_data %>%
  mutate(observers = sapply(observers, anonymize_observers, mapping = observer_mapping))

# Display message
cat(paste0("<span style='color: green;'>✓ Observer names anonymized: ", 
           nrow(observer_mapping), " unique observers mapped to letters A-", 
           LETTERS[nrow(observer_mapping)], "</span>\n"))

✓ Observer names anonymized: 3 unique observers mapped to letters A-C

Project information

Show the code

# Assemble project information message for observer comparison project
project_info <- paste0(
  "### Observer Comparison Project\n\n",
  "**Project:** ", observer_project_name, "\n\n",
  "**Project ID:** ", observer_project_id, "\n\n"
)

# Add data extracted for observer project
project_info <- paste0(project_info,
                       "**Data extracted:**\n\n", 
                       "- Observations: ", nrow(observations_data), " records\n",
                       "- Sample Units: ", nrow(sampleunits_data), " transects\n",
                       "- Sample Events: ", nrow(sampleevents_data), " sites\n",
                       "- Unique observers: ", length(unique(sampleevents_data$observers)), "\n",
                       "\n")

# Add shark project info if different
if(shark_project_id != observer_project_id) {
  project_info <- paste0(project_info,
                         "### Shark Data Project\n\n",
                         "**Project:** ", shark_project_name, "\n\n",
                         "**Project ID:** ", shark_project_id, "\n\n",
                         "**Data extracted:**\n\n", 
                         "- Observations: ", nrow(shark_observations_data), " records\n",
                         "- Sample Units: ", nrow(shark_sampleunits_data), " transects\n",
                         "- Sample Events: ", nrow(shark_sampleevents_data), " sites\n",
                         "\n")
} else {
  project_info <- paste0(project_info,
                         "*Note: Same project used for shark analysis*\n\n")
}

# Display the assembled message
cat(project_info)

Observer Comparison Project

Project: Northern Belize Coastal Complex

Project ID: d2225edc-0dbb-4c10-8cc7-e7d6dfaf149f

Data extracted:

Observations: 2187 records
Sample Units: 176 transects
Sample Events: 47 sites
Unique observers: 7

Shark Data Project

Project: SERF 2.0_Nick Graham_Chagos_Outer

Project ID: 55ac964c-0228-42da-8061-5983339ecb9f

Data extracted:

Observations: 8986 records
Sample Units: 149 transects
Sample Events: 40 sites

Observer Comparisons

Comparing fish biomass and abundance patterns among observers to identify potential observer effects.

Note: Many transects have multiple observers listed. The analyses below show patterns for each observer across all transects they participated in, but systematic differences could reflect either individual observer effects or the effects of observer pairings.

Biomass by Observer

Show the code

# First, aggregate observations to transect level (sum biomass within transects)
transect_biomass <- observations_data %>%
  group_by(sample_unit_id, observers, site, transect_number) %>%
  summarise(
    total_biomass = sum(biomass_kgha, na.rm = TRUE),
    total_abundance = sum(count, na.rm = TRUE),
    n_observations = n(),
    .groups = 'drop'
  )

# Then split observers and calculate statistics across transects
observer_biomass <- transect_biomass %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_biomass = mean(total_biomass, na.rm = TRUE),
    median_biomass = median(total_biomass, na.rm = TRUE),
    q025 = quantile(total_biomass, 0.025, na.rm = TRUE),
    q975 = quantile(total_biomass, 0.975, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_biomass))

kable(observer_biomass,
      digits = 2,
      col.names = c("Observer", "N Transects", "Mean Biomass (kg/ha)", 
                    "Median Biomass", "2.5% Quantile", "97.5% Quantile"),
      caption = "Fish Biomass Summary by Observer (transect-level totals)")

Fish Biomass Summary by Observer (transect-level totals)
Observer	N Transects	Mean Biomass (kg/ha)	Median Biomass	2.5% Quantile	97.5% Quantile
Observer C	31	286.46	107.42	3.95	1755.24
Observer B	47	144.63	65.78	2.84	562.14
Observer A	141	144.21	72.80	3.08	799.10

Statistical Test: Biomass Differences Among Observers

Show the code

# Prepare data for statistical test
plot_data <- transect_biomass %>%
  separate_rows(observers, sep = ", ")

# Filter to observers with at least 5 transects for meaningful comparison
observers_to_test <- plot_data %>%
  count(observers) %>%
  filter(n >= 5) %>%
  pull(observers)

test_data <- plot_data %>%
  filter(observers %in% observers_to_test)

# Perform Kruskal-Wallis test (non-parametric ANOVA)
if(length(observers_to_test) >= 2) {
  kw_test <- kruskal.test(total_biomass ~ observers, data = test_data)
  
  cat("**Kruskal-Wallis Test for Biomass Differences Among Observers**\n\n")
  cat("Test statistic (chi-squared):", round(kw_test$statistic, 3), "\n\n")
  cat("Degrees of freedom:", kw_test$parameter, "\n\n")
  cat("P-value:", format.pval(kw_test$p.value, digits = 3), "\n\n")
  
  if(kw_test$p.value < 0.05) {
    cat("**Result:** Significant differences detected among observers (p < 0.05)\n\n")
  } else {
    cat("**Result:** No significant differences detected among observers (p ≥ 0.05)\n\n")
  }
} else {
  cat("Insufficient observers (with ≥5 transects) for statistical testing.\n\n")
}

Kruskal-Wallis Test for Biomass Differences Among Observers

Test statistic (chi-squared): 1.076

Degrees of freedom: 2

P-value: 0.584

Result: No significant differences detected among observers (p ≥ 0.05)

Interpretation note: If significant differences are found among observers, this warrants further investigation. However, these results should be interpreted with caution, as observers may survey in different locations and at different times. Therefore, observed differences could reflect spatial or temporal variation in fish communities rather than true observer effects.

Biomass Distribution by Observer

Show the code

# Create subtitle text based on statistical test results
if(exists("kw_test") && length(observers_to_test) >= 2) {
  # Determine significance
  if(kw_test$p.value < 0.05) {
    sig_text <- "Significant differences detected among observers"
  } else {
    sig_text <- "No significant differences detected among observers"
  }
  
  subtitle_text <- paste0(
    "Observers with ≥5 transects shown; total biomass per transect\n",
    sig_text, " (Kruskal-Wallis: χ² = ", round(kw_test$statistic, 2),
    ", p = ", format.pval(kw_test$p.value, digits = 3, eps = 0.001), ")"
  )
} else {
  subtitle_text <- "Observers with ≥5 transects shown; total biomass per transect"
}

ggplot(test_data, 
       aes(x = reorder(observers, total_biomass, FUN = median), 
           y = total_biomass)) +
  geom_boxplot(fill = "#69b3a2", alpha = 0.7) +
  coord_flip() +
  labs(
    title = "Fish Biomass Distribution by Observer",
    subtitle = subtitle_text,
    x = "Observer",
    y = "Total Biomass (kg/ha)"
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 10, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5, color = "gray30")
  )

Abundance By Observer

Show the code

# Calculate abundance summary by observer using transect-level totals
observer_abundance <- transect_biomass %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_abundance = mean(total_abundance, na.rm = TRUE),
    median_abundance = median(total_abundance, na.rm = TRUE),
    q025 = quantile(total_abundance, 0.025, na.rm = TRUE),
    q975 = quantile(total_abundance, 0.975, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_abundance))

kable(observer_abundance,
      digits = 2,
      col.names = c("Observer", "N Transects", "Mean Abundance", 
                    "Median Abundance", "2.5% Quantile", "97.5% Quantile"),
      caption = "Fish Abundance Summary by Observer (transect-level totals)")

Fish Abundance Summary by Observer (transect-level totals)
Observer	N Transects	Mean Abundance	Median Abundance	2.5% Quantile	97.5% Quantile
Observer C	31	38.97	34	11.0	99.25
Observer B	47	27.70	27	4.3	55.85
Observer A	141	26.23	24	5.5	63.50

Taxonomic Identification by Observer

Show the code

# First calculate taxa per transect for each observer
taxa_per_transect <- observations_data %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers, sample_unit_id) %>%
  summarise(
    n_taxa_transect = n_distinct(fish_taxon),
    n_genera_transect = n_distinct(fish_genus),
    n_families_transect = n_distinct(fish_family),
    .groups = 'drop'
  )

# Then calculate summary statistics across transects for each observer
observer_taxonomy <- taxa_per_transect %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_taxa_per_transect = mean(n_taxa_transect, na.rm = TRUE),
    median_taxa_per_transect = median(n_taxa_transect, na.rm = TRUE),
    q025_taxa = quantile(n_taxa_transect, 0.025, na.rm = TRUE),
    q975_taxa = quantile(n_taxa_transect, 0.975, na.rm = TRUE),
    mean_genera_per_transect = mean(n_genera_transect, na.rm = TRUE),
    mean_families_per_transect = mean(n_families_transect, na.rm = TRUE),
    .groups = 'drop'
  )

# Also get overall unique taxa counts across all transects
overall_taxonomy <- observations_data %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    total_unique_taxa = n_distinct(fish_taxon),
    total_unique_genera = n_distinct(fish_genus),
    total_unique_families = n_distinct(fish_family),
    n_observations = n(),
    .groups = 'drop'
  )

# Combine the two summaries
observer_taxonomy <- observer_taxonomy %>%
  left_join(overall_taxonomy, by = "observers") %>%
  arrange(desc(mean_taxa_per_transect))

kable(observer_taxonomy,
      digits = 2,
      col.names = c("Observer", "N Transects", 
                    "Mean Taxa/Transect", "Median Taxa/Transect",
                    "2.5% Quantile", "97.5% Quantile",
                    "Mean Genera/Transect", "Mean Families/Transect",
                    "Total Unique Taxa", "Total Unique Genera", 
                    "Total Unique Families", "N Observations"),
      caption = "Taxonomic Identification Summary by Observer")

Taxonomic Identification Summary by Observer
Observer	N Transects	Mean Taxa/Transect	Median Taxa/Transect	2.5% Quantile	97.5% Quantile	Mean Genera/Transect	Mean Families/Transect	Total Unique Taxa	Total Unique Genera	Total Unique Families	N Observations
Observer C	31	8.84	9	2.75	14	6.77	5.26	52	30	17	429
Observer A	141	7.56	8	2.50	13	5.84	4.31	72	38	21	1730
Observer B	47	7.49	8	2.00	13	5.74	4.32	50	29	16	561

Taxonomic Resolution Comparison

Show the code

# Filter to observers with at least 5 transects
observer_taxonomy_filtered <- observer_taxonomy %>%
  filter(n_transects >= 5)

ggplot(observer_taxonomy_filtered,
       aes(x = reorder(observers, mean_taxa_per_transect),
           y = mean_taxa_per_transect)) +
  geom_col(fill = "#4287f5", alpha = 0.7, color = "black", linewidth = 0.3) +
  geom_errorbar(aes(ymin = q025_taxa, ymax = q975_taxa),
                width = 0.3, linewidth = 0.5) +
  coord_flip() +
  labs(
    title = "Taxonomic Identification by Observer",
    subtitle = "Observers with ≥5 transects shown; error bars show 95% quantiles (2.5% - 97.5%)",
    x = "Observer",
    y = "Mean Unique Taxa per Transect"
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 10, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray30")
  )

Observer Pairing Patterns

Show the code

# Identify common observer pairings using sample events
observer_pairs <- sampleevents_data %>%
  filter(str_detect(observers, ",")) %>%  # Only multi-observer sample events
  count(observers, sort = TRUE) %>%
  head(10)

kable(observer_pairs,
      col.names = c("Observer Pair", "N Sample Events"),
      caption = "Top 10 Most Common Observer Pairings (at sample event level)")

Top 10 Most Common Observer Pairings (at sample event level)
Observer Pair	N Sample Events
Observer A, Observer B	12
Observer C, Observer A	2
Observer C, Observer A, Observer B	2
Observer C, Observer B	2

Outlier Detection

High Biomass Observations

Identifying sample events with unusually high fish biomass (top 10%).

Show the code

# Calculate 90th percentile
biomass_90th <- quantile(sampleevents_data$biomass_kgha_avg, 
                         probs = 0.90, na.rm = TRUE)

high_biomass_sites <- sampleevents_data %>%
  filter(biomass_kgha_avg >= biomass_90th) %>%
  select(site, sample_date, observers, biomass_kgha_avg, 
         sample_unit_count, depth_avg) %>%
  arrange(desc(biomass_kgha_avg))

kable(high_biomass_sites,
      digits = 1,
      col.names = c("Site", "Date", "Observer", "Biomass (kg/ha)", 
                    "N Transects", "Depth (m)"),
      caption = paste0("Sites with High Biomass (≥ ", 
                      round(biomass_90th, 1), " kg/ha - 90th percentile)"))

Sites with High Biomass (≥ 251.9 kg/ha - 90th percentile)
Site	Date	Observer	Biomass (kg/ha)	N Transects	Depth (m)
BCMR 33	2022-11-17	Observer C, Observer B	1217.1	4	2.0
BZCCCB02	2021-10-28	Observer A	681.3	4	5.5
BCMR 06	2021-10-21	Observer A	520.7	2	1.2
BCMR 46	2022-11-17	Observer C, Observer B	364.4	4	12.0
BCMR 22	2021-10-20	Observer A	271.7	4	3.5

Outlier Detection - high or low

Identifying sample events that have unusually high or low fish biomass.

Show the code

# Identify outliers using different methods for high and low values
# Low outliers: static threshold (biologically meaningful low biomass)
low_threshold <- 50  # kg/ha - unusually low biomass
# High outliers: IQR method
Q1 <- quantile(sampleevents_data$biomass_kgha_avg, 0.25, na.rm = TRUE)
Q3 <- quantile(sampleevents_data$biomass_kgha_avg, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
upper_bound <- Q3 + 1.5 * IQR_val

outliers <- sampleevents_data %>%
  filter(biomass_kgha_avg < low_threshold | biomass_kgha_avg > upper_bound) %>%
  mutate(outlier_type = ifelse(biomass_kgha_avg < low_threshold, 
                                "Low", "High")) %>%
  select(site, sample_date, biomass_kgha_avg, outlier_type, observers) %>%
  arrange(outlier_type, biomass_kgha_avg)

kable(outliers,
      digits = 1,
      col.names = c("Site", "Date", "Biomass (kg/ha)", 
                    "Outlier Type", "Observer"),
      caption = paste0("Outlier Sites (<", low_threshold, 
                      " kg/ha or >", round(upper_bound, 1), " kg/ha)"))

Outlier Sites (<50 kg/ha or >388.8 kg/ha)
Site	Date	Biomass (kg/ha)	Outlier Type	Observer
BCMR 06	2021-10-21	520.7	High	Observer A
BZCCCB02	2021-10-28	681.3	High	Observer A
BCMR 33	2022-11-17	1217.1	High	Observer C, Observer B
BCMR 29	2022-11-17	19.6	Low	Observer C, Observer A, Observer B
BCMR 39	2021-10-19	33.4	Low	Observer A
BCMR - Alleys	2022-11-18	36.7	Low	Observer C
HCMR-B-GUZ-01	2021-10-22	39.4	Low	Observer A
NHCMR-B-GUZ-22	2021-10-25	39.8	Low	Observer A, Observer B
BCMR 33	2021-10-19	41.6	Low	Observer A
NHCMR-F-GUZ-06	2021-10-26	45.0	Low	Observer A, Observer B
HCMR-F-NTZ-04	2021-10-22	47.8	Low	Observer A
BCMR - Alleys	2021-10-20	48.4	Low	Observer A

Statistical Test: Site Outlier Detection

Show the code

# Calculate z-scores for each site
site_biomass_stats <- sampleevents_data %>%
  mutate(
    z_score = (biomass_kgha_avg - mean(biomass_kgha_avg, na.rm = TRUE)) / 
               sd(biomass_kgha_avg, na.rm = TRUE),
    is_outlier = abs(z_score) > 2  # Standard threshold: |z| > 2
  )

# Count outliers
n_outliers <- sum(site_biomass_stats$is_outlier, na.rm = TRUE)
pct_outliers <- 100 * n_outliers / nrow(site_biomass_stats)

# Create the text output
output_text <- paste0(
  "**Z-Score Based Outlier Detection**\n\n",
  "Sites with |z-score| > 2: ", n_outliers, "\n\n",
  "Percentage of sites flagged as outliers: ", round(pct_outliers, 1), "%\n\n"
)

cat(output_text)

Z-Score Based Outlier Detection

Sites with |z-score| > 2: 2

Percentage of sites flagged as outliers: 4.3%

Show the code

# Show the outlier sites
if(n_outliers > 0) {
  outlier_table <- site_biomass_stats %>%
    filter(is_outlier) %>%
    select(site, sample_date, biomass_kgha_avg, z_score, observers) %>%
    arrange(desc(abs(z_score)))
  
  kable(outlier_table,
        digits = 2,
        col.names = c("Site", "Date", "Biomass (kg/ha)", "Z-Score", "Observer"),
        caption = "Sites Identified as Statistical Outliers (|z-score| > 2)")
}

Sites Identified as Statistical Outliers (|z-score| > 2)
Site	Date	Biomass (kg/ha)	Z-Score	Observer
BCMR 33	2022-11-17	1217.07	5.27	Observer C, Observer B
BZCCCB02	2021-10-28	681.31	2.58	Observer A

Interpretation note: Sites identified as statistical outliers warrant further investigation. However, these may represent true ecological variation (e.g., protected areas, unique habitat features) rather than data errors. Differences could also reflect spatial or temporal variation in environmental conditions or fish community structure. Field notes and environmental context should be consulted before concluding these are erroneous data points.

Interactive Scatter Plot - Biomass vs Abundance

Show the code

# Calculate mean abundance per sample event
se_with_abundance <- sampleunits_data %>%
  group_by(sample_event_id) %>%
  summarise(mean_abundance = mean(total_abundance, na.rm = TRUE)) %>%
  right_join(sampleevents_data, by = "sample_event_id") %>%
  # Join with outlier detection results from z-score test
  left_join(
    site_biomass_stats %>% select(site, sample_date, is_outlier, z_score),
    by = c("site", "sample_date")
  ) %>%
  mutate(
    # Only flag as outlier if z-score > 2 (high biomass only)
    is_high_outlier = !is.na(z_score) & z_score > 2,
    outlier_status = ifelse(is_high_outlier, "High Outlier", "Normal")
  )

# Calculate upper biomass threshold based on z-scores
mean_biomass <- mean(sampleevents_data$biomass_kgha_avg, na.rm = TRUE)
sd_biomass <- sd(sampleevents_data$biomass_kgha_avg, na.rm = TRUE)
upper_biomass_threshold <- mean_biomass + 2 * sd_biomass

# Get x-axis range for the line
x_range <- range(se_with_abundance$mean_abundance, na.rm = TRUE)

# Create interactive scatter plot
scatter_plot <- plot_ly(
  data = se_with_abundance,
  x = ~mean_abundance,
  y = ~biomass_kgha_avg,
  type = "scatter",
  mode = "markers",
  text = ~paste("Site:", site,
                "<br>Biomass:", round(biomass_kgha_avg, 1), "kg/ha",
                "<br>Abundance:", round(mean_abundance, 1),
                "<br>Observer:", observers,
                "<br>Status:", outlier_status),
  hoverinfo = "text",
  marker = list(
    size = 10,
    color = ~ifelse(is_high_outlier, "#d13823", "#277d1d"),  # Red for high outliers, green for normal
    line = list(color = "white", width = 1)
  ),
  name = "Sample Events",
  showlegend = TRUE
) %>%
  # Add horizontal reference line for upper biomass threshold only
  add_segments(
    x = x_range[1],
    xend = x_range[2],
    y = upper_biomass_threshold,
    yend = upper_biomass_threshold,
    inherit = FALSE,              # don't inherit marker mapping from the main plot
    mode = "lines",               # lines only (no markers)
    marker = list(size = 0, opacity = 0),  # belt-and-suspenders: hide any marker glyphs
    line = list(color = "red", dash = "dash", width = 2),
    name = "Upper threshold (z = 2)",
    showlegend = TRUE,
    hoverinfo = "text",
    text = paste("Upper threshold:", round(upper_biomass_threshold, 1), "kg/ha")
  ) %>%
  layout(
    title = "Fish Biomass vs Abundance (High Outliers Highlighted)",
    xaxis = list(title = "Mean Fish Abundance (count per transect)"),
    yaxis = list(title = "Fish Biomass (kg/ha)"),
    hovermode = "closest"
  )

scatter_plot

Note: Red points indicate sites with unusually high biomass (z-score > 2), falling above the dashed red threshold line.

Shark Observations

Shark Presence and Abundance

Show the code

# Identify shark families (Carcharhinidae, Sphyrnidae, etc.)
shark_families <- c(
  "Carcharhinidae","Sphyrnidae","Alopiidae","Lamnidae","Rhincodontidae",
  "Galeocerdonidae","Ginglymostomatidae","Hemiscylliidae","Heterodontidae",
  "Hexanchidae","Odontaspididae","Orectolobidae","Parascylliidae",
  "Scyliorhinidae","Squalidae","Stegostomatidae","Triakidae"
)

# Use shark-specific data for this analysis
shark_obs <- shark_observations_data %>%
  filter(fish_family %in% shark_families)

if(nrow(shark_obs) > 0) {
  # Summarize shark observations by site
  shark_summary <- shark_obs %>%
    group_by(site, fish_family, fish_taxon) %>%
    summarise(
      total_count = sum(count, na.rm = TRUE),
      total_biomass = sum(biomass_kgha, na.rm = TRUE),
      n_observations = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(total_count))
  
  kable(shark_summary,
        digits = 2,
        col.names = c("Site", "Family", "Species", "Total Count", 
                      "Total Biomass (kg/ha)", "N Obs"),
        caption = "Shark Observations Summary")
  
  # Sites with sharks
  cat("\n\nTotal sites surveyed:", 
      length(unique(shark_observations_data$site)), "\n\n")
  cat("Sites with shark observations:", 
      length(unique(shark_obs$site)), "\n\n")
  cat("Percentage of sites with sharks:", 
      round(100 * length(unique(shark_obs$site)) / 
              length(unique(shark_observations_data$site)), 1), 
      "%\n\n")
} else {
  cat("No shark observations in the dataset.")
}

Total sites surveyed: 13

Sites with shark observations: 10

Percentage of sites with sharks: 76.9 %

Individual Shark Weight Validation

Note - If observations are flagged here it warrants further investigation but does not necessarily mean there is an error. Maximum weight data is pulled from Fishbase here to compare with the observations but there is less maximum weight data available than maximum length (the latter of which is what is used in the MERMAID Collect app to test for larger than expected observations).

Show the code

if(nrow(shark_obs) > 0) {
  
  # Get individual shark observations with calculated weights
  individual_sharks <- shark_obs %>%
    mutate(
      # Remove "m" from transect_width and convert to numeric
      transect_width_m = as.numeric(str_replace(transect_width, "m", "")),
      
      # Calculate transect area in hectares
      # transect_length and transect_width_m are in meters
      transect_area_m2 = transect_length * transect_width_m,
      transect_area_ha = transect_area_m2 / 10000,
      
      # biomass_kgha is the total biomass for this observation per hectare
      # To get total biomass for the actual transect area:
      total_biomass_kg = biomass_kgha * transect_area_ha,
      
      # Divide by count to get individual weight
      individual_weight_kg = total_biomass_kg / count
    ) %>%
    select(site, sample_date, fish_taxon, fish_family, size, count, 
           biomass_kgha, transect_length, transect_width_m, 
           transect_area_ha, individual_weight_kg)
  
  # Get unique species names for FishBase lookup
  shark_species <- unique(shark_obs$fish_taxon)
  
  # Attempt to get max weights from FishBase
  cat("Retrieving maximum weights from FishBase...\n\n")
  
  # Initialize results list
  max_weights_list <- list()
  
  for(species in shark_species) {
    # Try to get weight data from FishBase
    tryCatch({
      # Get species data
      species_data <- species(species, fields = c("Species", "Weight"))
      
      if(!is.null(species_data) && nrow(species_data) > 0) {
        if(!is.na(species_data$Weight) && species_data$Weight > 0) {
          max_weights_list[[species]] <- tibble(
            fish_taxon = species,
            # FishBase Weight is in grams, convert to kg
            max_weight_kg = species_data$Weight / 1000
          )
        }
      }
    }, error = function(e) {
      # Silently continue if species not found
      NULL
    })
  }
  
  # Combine results
  if(length(max_weights_list) > 0) {
    max_weights <- bind_rows(max_weights_list)
    
    # Join with observations
    shark_weight_check <- individual_sharks %>%
      left_join(max_weights, by = "fish_taxon") %>%
      mutate(
        exceeds_max = !is.na(max_weight_kg) & individual_weight_kg > max_weight_kg,
        weight_ratio = individual_weight_kg / max_weight_kg
      )
    
    # Check for any that exceed published max
    exceeding_sharks <- shark_weight_check %>%
      filter(exceeds_max) %>%
      arrange(desc(weight_ratio))
    
    if(nrow(exceeding_sharks) > 0) {
      cat("**Warning:** Some individual shark observations exceed published maximum weights\n\n")
      
      kable(exceeding_sharks %>%
              select(site, sample_date, fish_taxon, size, count, individual_weight_kg, 
                     max_weight_kg, weight_ratio),
            digits = 2,
            col.names = c("Site", "Date", "Species", "Size (cm)", "Count",
                         "Observed Weight (kg)", "Max Published Weight (kg)", 
                         "Ratio (Obs/Max)"),
            caption = "Shark Observations Exceeding Published Maximum Weights")
    } else {
      cat("<span style='color: green;'>✓ All individual shark weights are within published maximum ranges</span>\n\n")
    }
    
    # Summary table of all sharks with available max weights
    shark_comparison <- shark_weight_check %>%
      filter(!is.na(max_weight_kg)) %>%
      group_by(fish_taxon) %>%
      summarise(
        n_obs = n(),
        mean_observed_kg = mean(individual_weight_kg, na.rm = TRUE),
        max_observed_kg = max(individual_weight_kg, na.rm = TRUE),
        published_max_kg = first(max_weight_kg),
        n_exceeding = sum(exceeds_max, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      arrange(desc(n_obs))
    
    kable(shark_comparison,
          digits = 2,
          col.names = c("Species", "N Observations", "Mean Weight (kg)", 
                       "Max Observed (kg)", "Published Max (kg)", "N Exceeding"),
          caption = "Comparison of Observed vs Published Maximum Shark Weights")
    
  } else {
    cat("No maximum weight data available from FishBase for the observed shark species.\n")
    cat("This could be due to taxonomic naming differences or missing FishBase data.\n\n")
    
    # Still show the observations
    kable(individual_sharks,
          digits = 2,
          col.names = c("Site", "Date", "Species", "Family", "Size (cm)", 
                       "Count", "Biomass (kg/ha)", "Transect Length (m)", 
                       "Transect Width (m)", "Transect Area (ha)", 
                       "Individual Weight (kg)"),
          caption = "Individual Shark Observations (FishBase validation unavailable)")
  }
  
} else {
  cat("No shark observations in the dataset.")
}

Retrieving maximum weights from FishBase…

Warning: Some individual shark observations exceed published maximum weights

Comparison of Observed vs Published Maximum Shark Weights
Species	N Observations	Mean Weight (kg)	Max Observed (kg)	Published Max (kg)	N Exceeding
Carcharhinus amblyrhynchos	17	70.45	194.09	33.70	16
Triaenodon obesus	10	18.45	30.05	18.25	5
Carcharhinus albimarginatus	1	19.66	19.66	162.20	0

Visualization of Observed & Published Maximum Shark Weights

Show the code

if(exists("shark_weight_check") && nrow(shark_weight_check) > 0) {
  
  # Get the species ordering based on max observed weight
  species_order <- shark_weight_check %>%
    group_by(fish_taxon) %>%
    summarise(max_obs = max(individual_weight_kg, na.rm = TRUE)) %>%
    arrange(max_obs) %>%
    pull(fish_taxon)
  
  # Apply this ordering to the data
  shark_weight_check <- shark_weight_check %>%
    mutate(fish_taxon = factor(fish_taxon, levels = species_order))
  
  # Create plot using geom_errorbar instead of geom_linerange
  p <- ggplot(shark_weight_check, aes(x = individual_weight_kg, y = fish_taxon)) +
    # Add points for observed weights first
    geom_point(aes(color = exceeds_max), 
               size = 3, alpha = 0.7) +
    # Add vertical lines for published max weights using geom_vline with faceting trick
    geom_point(data = shark_weight_check %>% 
                 filter(!is.na(max_weight_kg)) %>%
                 distinct(fish_taxon, max_weight_kg),
               aes(x = max_weight_kg, y = fish_taxon),
               shape = "|", size = 10, color = "red", stroke = 2) +
    scale_color_manual(values = c("TRUE" = "#d13823", "FALSE" = "#277d1d"),
                       labels = c("TRUE" = "Exceeds max", "FALSE" = "Within range"),
                       na.value = "gray50",
                       name = "Status") +
    labs(
      title = "Individual Shark Weights vs Published Maximum Weights",
      subtitle = paste0(
        "Red vertical marks show published maximum weights from FishBase\n",
        "Published weights available for ", 
        length(unique(shark_weight_check$fish_taxon[!is.na(shark_weight_check$max_weight_kg)])),
        " of ", length(unique(shark_weight_check$fish_taxon)), " species"
      ),
      x = "Individual Weight (kg)",
      y = "Species"
    ) +
    theme_classic() +
    theme(
      axis.text.y = element_text(size = 9, colour = "black", face = "italic"),
      axis.text.x = element_text(size = 10, colour = "black"),
      axis.title = element_text(size = 12, colour = "black"),
      plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(size = 10, hjust = 0.5, color = "gray30"),
      legend.position = "bottom"
    )
  
  print(p)
  
} else {
  cat("No shark weight comparison data available to plot.")
}

Shark Biomass by Site

Show the code

if(nrow(shark_obs) > 0) {
  # Get all shark family columns from sample events data
  shark_family_cols <- grep("^biomass_kgha_fish_family_avg_", 
                            names(shark_sampleevents_data), 
                            value = TRUE)
  
  # Filter to only shark families (convert to lowercase for matching)
  shark_families_lower <- tolower(shark_families)
  shark_cols <- shark_family_cols[sapply(shark_family_cols, function(col) {
    family_name <- sub("biomass_kgha_fish_family_avg_", "", col)
    any(grepl(family_name, shark_families_lower, ignore.case = TRUE))
  })]
  
  # Calculate shark and total biomass per sample event
  shark_biomass_summary <- shark_sampleevents_data %>%
    mutate(
      # Sum all shark family columns
      shark_biomass = rowSums(select(., all_of(shark_cols)), na.rm = TRUE),
      # Use total biomass
      total_biomass = biomass_kgha_avg,
      # Calculate other fish biomass
      other_biomass = total_biomass - shark_biomass,
      # Calculate percentage
      shark_percent = 100 * shark_biomass / total_biomass,
      # Create label combining site and date
      site_date = paste0(site, " (", format(as.Date(sample_date), "%Y-%m-%d"), ")")
    ) %>%
    # Filter to only sample events with sharks
    filter(shark_biomass > 0) %>%
    select(site_date, shark_biomass, other_biomass, total_biomass, shark_percent)
  
  # Prepare data for stacked bars
  shark_biomass_stacked <- shark_biomass_summary %>%
    pivot_longer(cols = c(shark_biomass, other_biomass),
                 names_to = "biomass_type",
                 values_to = "biomass") %>%
    mutate(biomass_type = factor(biomass_type,
                                  levels = c("other_biomass", "shark_biomass"),
                                  labels = c("Other fish", "Sharks")))
  
  # Plot
  ggplot(shark_biomass_stacked, 
         aes(x = reorder(site_date, total_biomass), y = biomass, fill = biomass_type)) +
    geom_col(position = "stack", alpha = 0.8) +
    # Add percentage labels at the end of bars
    geom_text(data = shark_biomass_summary,
              aes(x = site_date, y = total_biomass, 
                  label = paste0(round(shark_percent, 1), "%"),
                  fill = NULL),
              hjust = -0.1, size = 3, color = "black") +
    scale_fill_manual(values = c("Other fish" = "#7fbc41", "Sharks" = "#2c7bb6")) +
    coord_flip() +
    labs(
      title = "Shark and Total Fish Biomass by Sample Event (Events with Sharks)",
      subtitle = "Percentage shows shark contribution to total fish biomass",
      x = "Site (Date)",
      y = "Biomass (kg/ha)",
      fill = "Category"
    ) +
    theme_classic() +
    theme(
      axis.text = element_text(size = 9, colour = "black"),
      axis.title = element_text(size = 12, colour = "black"),
      plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray30"),
      legend.position = "bottom"
    ) +
    # Expand x-axis limits to make room for percentage labels
    scale_y_continuous(expand = expansion(mult = c(0, 0.15)))
} else {
  cat("No shark observations in the dataset.")
}

Data Quality Summary

Show the code

# Count unique families
n_families <- length(unique(observations_data$fish_family))

# Assemble the data quality summary message with markdown
summary_msg <- paste0(
  "#### Sample Coverage\n\n",
  "- **Total sample events:** ", nrow(sampleevents_data), "\n",
  "- **Total transects:** ", nrow(sampleunits_data), "\n",
  "- **Total observations:** ", nrow(observations_data), "\n",
  "- **Unique sites:** ", length(unique(sampleevents_data$site)), "\n",
  "- **Unique observers:** ", nrow(observer_mapping), "\n\n",
  
  "#### Taxonomic Diversity\n\n",
  "- **Fish families:** ", n_families, "\n",
  "- **Fish genera:** ", length(unique(observations_data$fish_genus)), "\n",
  "- **Fish species:** ", length(unique(observations_data$fish_taxon)), "\n\n",
  
  "#### Biomass Statistics (kg/ha)\n\n",
  "- **Mean:** ", round(mean(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **Median:** ", round(median(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **SD:** ", round(sd(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **Range:** ", round(min(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), " - ",
             round(max(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n\n",
  
  "#### Data Quality Flags\n\n",
  "- **High biomass outliers (>90th percentile):** ", nrow(high_biomass_sites), "\n",
  "- **Statistical outliers (IQR method):** ", nrow(outliers), "\n",
  "- **Sites with shark observations:** ", 
    ifelse(exists("shark_obs") && nrow(shark_obs) > 0,
           length(unique(shark_obs$site)), 0), "\n"
)

# Display the assembled summary
cat(summary_msg)

Sample Coverage

Total sample events: 47
Total transects: 176
Total observations: 2187
Unique sites: 36
Unique observers: 3

Taxonomic Diversity

Fish families: 21
Fish genera: 41
Fish species: 79

Biomass Statistics (kg/ha)

Mean: 168
Median: 113.6
SD: 199.1
Range: 19.6 - 1217.1

Data Quality Flags

High biomass outliers (>90th percentile): 5
Statistical outliers (IQR method): 12
Sites with shark observations: 10

Notes and Recommendations

Based on this data quality check:

Observer Anonymization: Observer names have been anonymized to letters for sharing this document publicly.
Observer Consistency: Statistical tests have been performed to assess systematic differences between observers. Any detected differences warrant further investigation, but should be interpreted cautiously as they may reflect spatial or temporal variation rather than true observer effects.
Outliers: Sites identified as statistical outliers should be investigated to determine if they represent true ecological variation or potential data issues. Consider environmental context and field notes.
Shark Observations: Individual shark weights have been validated against published maximum weights from FishBase where available. Any observations exceeding published maxima should be carefully reviewed for potential data entry errors in size or abundance.
Data Integrity: For any flagged issues, consult field datasheets and consider re-validation of measurements before making corrections to the database.