MERMAID Fishbelt Data Quality Checks

Author

Iain R. Caldwell

Published

January 7, 2026


This code performs data quality checks on MERMAID projects with fish belt data, including observer comparisons, outlier detection, family diversity analysis, and biomass composition.


Getting example fishbelt data from MERMAID

Loads the necessary R packages and gets an example MERMAID project with fish belt data. Since the code relies on all three scales of MERMAID data (observations, sample units, and sample events), it is necessary to either select a MERMAID project whose permissions are set to public for fish belt or to use a project for which you are a member. Here I am exporting a public project so anyone can run the code - in this case a project tagged with “WCS”. However, I also provide some example code (hashed out) that can be used to get a project for which you are a member.

Note: This step requires authentication even if you are accessing a public project. This means you need to create a free MERMAID account if you don’t already have one.

Show the code
rm(list = ls()) #remove past stored objects
options(scipen = 999) #turn off scientific notation

#### Load packages ####
library(mermaidr)
library(tidyverse)
library(plotly)
library(DT)
library(ggplot2)
library(ggpubr)
library(knitr)

#### Get data from an example project with public fishbelt data ####
allPublicFisbeltProjects <- mermaid_get_summary_sampleevents() %>% 
  filter(data_policy_beltfish == "public" & 
           !is.na(beltfish_biomass_kgha_avg) &
           !country %in% c("Indonesia", "Philippines") &
           grepl(pattern = "Blue Ventures", x = tags)) %>%
  group_by(project_id, project, tags, project_notes) %>% 
  dplyr::summarise(NumSites = length(site),
                   TotalSampleUnits = sum(beltfish_sample_unit_count)) %>% 
  arrange(desc(TotalSampleUnits))

targetProject <- allPublicFisbeltProjects[1,] #choose first project as example

targetFishAllData <- 
  mermaid_get_project_data(project = allPublicFisbeltProjects$project_id[1],
                           method = "fishbelt",
                           data = "all")

# #### Get data from a project for which you are a member ####
# my_projects <- mermaid_get_my_projects()
# 
# #The next line gets the first project - change the number to get a different one
# targetProject <- my_projects[1,] 
# 
# # Get all data levels
# targetFishAllData <- targetProject %>% 
#   mermaid_get_project_data(method = "fishbelt", data = "all")

# Extract the three levels
observations_data <- targetFishAllData$observations
sampleunits_data <- targetFishAllData$sampleunits
sampleevents_data <- targetFishAllData$sampleevents

Project information

Show the code
# Assemble project information message
project_info <- paste0(
  "**Project analyzed:** ", allPublicFisbeltProjects$project[1], "\n\n",
  "**Project ID:** ", allPublicFisbeltProjects$project_id[1], "\n\n"
)

# Add tags if available
if (!is.na(allPublicFisbeltProjects$tags[1]) && allPublicFisbeltProjects$tags[1] != "") {
  project_info <- paste0(project_info, "**Tags:** ", allPublicFisbeltProjects$tags[1], "\n\n")
}

# Add project notes if available
if (!is.na(allPublicFisbeltProjects$project_notes[1]) && allPublicFisbeltProjects$project_notes[1] != "") {
  project_info <- paste0(project_info, "**Project notes:** ", allPublicFisbeltProjects$project_notes[1], "\n\n")
}

# Add data extracted
project_info <- paste0(project_info,
                       "**Data extracted:**\n\n", 
                       "- Observations: ", nrow(observations_data), " records\n",
                       "- Sample Units: ", nrow(sampleunits_data), " transects\n",
                       "- Sample Events: ", nrow(sampleevents_data), " sites\n",
                       "\n")

# Display the assembled message
cat(project_info)

Project analyzed: Northern Belize Coastal Complex

Project ID: d2225edc-0dbb-4c10-8cc7-e7d6dfaf149f

Tags: Blue Ventures (BV)

Data extracted:

  • Observations: 2187 records
  • Sample Units: 176 transects
  • Sample Events: 47 sites

Observer Comparisons

Comparing fish biomass and abundance patterns among observers to identify potential observer effects.

Note: Many transects have multiple observers listed. The analyses below show patterns for each observer across all transects they participated in, but systematic differences could reflect either individual observer effects or the effects of observer pairings.

Biomass by Observer

Show the code
# First, aggregate observations to transect level (sum biomass within transects)
transect_biomass <- observations_data %>%
  group_by(sample_unit_id, observers, site, transect_number) %>%
  summarise(
    total_biomass = sum(biomass_kgha, na.rm = TRUE),
    total_abundance = sum(count, na.rm = TRUE),
    n_observations = n(),
    .groups = 'drop'
  )

# Then split observers and calculate statistics across transects
observer_biomass <- transect_biomass %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_biomass = mean(total_biomass, na.rm = TRUE),
    median_biomass = median(total_biomass, na.rm = TRUE),
    q025 = quantile(total_biomass, 0.025, na.rm = TRUE),
    q975 = quantile(total_biomass, 0.975, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_biomass))

kable(observer_biomass,
      digits = 2,
      col.names = c("Observer", "N Transects", "Mean Biomass (kg/ha)", 
                    "Median Biomass", "2.5% Quantile", "97.5% Quantile"),
      caption = "Fish Biomass Summary by Observer (transect-level totals)")
Fish Biomass Summary by Observer (transect-level totals)
Observer N Transects Mean Biomass (kg/ha) Median Biomass 2.5% Quantile 97.5% Quantile
Joshua Borland 31 286.46 107.42 3.95 1755.24
Darling Ortega 47 144.63 65.78 2.84 562.14
Alexander Navarro 141 144.21 72.80 3.08 799.10

Biomass Distribution by Observer

Show the code
# Prepare data for plotting - split observers
plot_data <- transect_biomass %>%
  separate_rows(observers, sep = ", ")

# Filter to observers with at least 5 transects for meaningful comparison
observers_to_plot <- plot_data %>%
  count(observers) %>%
  filter(n >= 5) %>%
  pull(observers)

plot_data_filtered <- plot_data %>%
  filter(observers %in% observers_to_plot)

ggplot(plot_data_filtered, 
       aes(x = reorder(observers, total_biomass, FUN = median), 
           y = total_biomass)) +
  geom_boxplot(fill = "#69b3a2", alpha = 0.7) +
  coord_flip() +
  labs(
    title = "Fish Biomass Distribution by Observer",
    subtitle = "Observers with ≥5 transects shown; total biomass per transect",
    x = "Observer",
    y = "Total Biomass (kg/ha)"
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 10, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray30")
  )

Abundance By Observer

Show the code
# Calculate abundance summary by observer using transect-level totals
observer_abundance <- transect_biomass %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_abundance = mean(total_abundance, na.rm = TRUE),
    median_abundance = median(total_abundance, na.rm = TRUE),
    q025 = quantile(total_abundance, 0.025, na.rm = TRUE),
    q975 = quantile(total_abundance, 0.975, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_abundance))

kable(observer_abundance,
      digits = 2,
      col.names = c("Observer", "N Transects", "Mean Abundance", 
                    "Median Abundance", "2.5% Quantile", "97.5% Quantile"),
      caption = "Fish Abundance Summary by Observer (transect-level totals)")
Fish Abundance Summary by Observer (transect-level totals)
Observer N Transects Mean Abundance Median Abundance 2.5% Quantile 97.5% Quantile
Joshua Borland 31 38.97 34 11.0 99.25
Darling Ortega 47 27.70 27 4.3 55.85
Alexander Navarro 141 26.23 24 5.5 63.50

Taxonomic Identification by Observer

Show the code
# First calculate taxa per transect for each observer
taxa_per_transect <- observations_data %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers, sample_unit_id) %>%
  summarise(
    n_taxa_transect = n_distinct(fish_taxon),
    n_genera_transect = n_distinct(fish_genus),
    n_families_transect = n_distinct(fish_family),
    .groups = 'drop'
  )

# Then calculate summary statistics across transects for each observer
observer_taxonomy <- taxa_per_transect %>%
  group_by(observers) %>%
  summarise(
    n_transects = n(),
    mean_taxa_per_transect = mean(n_taxa_transect, na.rm = TRUE),
    median_taxa_per_transect = median(n_taxa_transect, na.rm = TRUE),
    q025_taxa = quantile(n_taxa_transect, 0.025, na.rm = TRUE),
    q975_taxa = quantile(n_taxa_transect, 0.975, na.rm = TRUE),
    mean_genera_per_transect = mean(n_genera_transect, na.rm = TRUE),
    mean_families_per_transect = mean(n_families_transect, na.rm = TRUE),
    .groups = 'drop'
  )

# Also get overall unique taxa counts across all transects
overall_taxonomy <- observations_data %>%
  separate_rows(observers, sep = ", ") %>%
  group_by(observers) %>%
  summarise(
    total_unique_taxa = n_distinct(fish_taxon),
    total_unique_genera = n_distinct(fish_genus),
    total_unique_families = n_distinct(fish_family),
    n_observations = n(),
    .groups = 'drop'
  )

# Combine the two summaries
observer_taxonomy <- observer_taxonomy %>%
  left_join(overall_taxonomy, by = "observers") %>%
  arrange(desc(mean_taxa_per_transect))

kable(observer_taxonomy,
      digits = 2,
      col.names = c("Observer", "N Transects", 
                    "Mean Taxa/Transect", "Median Taxa/Transect",
                    "2.5% Quantile", "97.5% Quantile",
                    "Mean Genera/Transect", "Mean Families/Transect",
                    "Total Unique Taxa", "Total Unique Genera", 
                    "Total Unique Families", "N Observations"),
      caption = "Taxonomic Identification Summary by Observer")
Taxonomic Identification Summary by Observer
Observer N Transects Mean Taxa/Transect Median Taxa/Transect 2.5% Quantile 97.5% Quantile Mean Genera/Transect Mean Families/Transect Total Unique Taxa Total Unique Genera Total Unique Families N Observations
Joshua Borland 31 8.84 9 2.75 14 6.77 5.26 52 30 17 429
Alexander Navarro 141 7.56 8 2.50 13 5.84 4.31 72 38 21 1730
Darling Ortega 47 7.49 8 2.00 13 5.74 4.32 50 29 16 561

Taxonomic Resolution Comparison

Show the code
# Filter to observers with at least 5 transects
observer_taxonomy_filtered <- observer_taxonomy %>%
  filter(n_transects >= 5)

ggplot(observer_taxonomy_filtered,
       aes(x = reorder(observers, mean_taxa_per_transect),
           y = mean_taxa_per_transect)) +
  geom_col(fill = "#4287f5", alpha = 0.7, color = "black", linewidth = 0.3) +
  geom_errorbar(aes(ymin = q025_taxa, ymax = q975_taxa),
                width = 0.3, linewidth = 0.5) +
  coord_flip() +
  labs(
    title = "Taxonomic Identification Ability by Observer",
    subtitle = "Observers with ≥5 transects shown; error bars show 95% quantiles (2.5% - 97.5%)",
    x = "Observer",
    y = "Mean Unique Taxa per Transect"
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 10, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray30")
  )

Observer Pairing Patterns

Show the code
# Identify common observer pairings using sample events
observer_pairs <- sampleevents_data %>%
  filter(str_detect(observers, ",")) %>%  # Only multi-observer sample events
  count(observers, sort = TRUE) %>%
  head(10)

kable(observer_pairs,
      col.names = c("Observer Pair", "N Sample Events"),
      caption = "Top 10 Most Common Observer Pairings (at sample event level)")
Top 10 Most Common Observer Pairings (at sample event level)
Observer Pair N Sample Events
Alexander Navarro, Darling Ortega 12
Joshua Borland, Alexander Navarro 2
Joshua Borland, Alexander Navarro, Darling Ortega 2
Joshua Borland, Darling Ortega 2

Outlier Detection

High Biomass Observations

Identifying sites with unusually high fish biomass (top 10%).

Show the code
# Calculate 90th percentile
biomass_90th <- quantile(sampleevents_data$biomass_kgha_avg, 
                         probs = 0.90, na.rm = TRUE)

high_biomass_sites <- sampleevents_data %>%
  filter(biomass_kgha_avg >= biomass_90th) %>%
  select(site, sample_date, observers, biomass_kgha_avg, 
         sample_unit_count, depth_avg) %>%
  arrange(desc(biomass_kgha_avg))

kable(high_biomass_sites,
      digits = 1,
      col.names = c("Site", "Date", "Observer", "Biomass (kg/ha)", 
                    "N Transects", "Depth (m)"),
      caption = paste0("Sites with High Biomass (≥ ", 
                      round(biomass_90th, 1), " kg/ha - 90th percentile)"))
Sites with High Biomass (≥ 251.9 kg/ha - 90th percentile)
Site Date Observer Biomass (kg/ha) N Transects Depth (m)
BCMR 33 2022-11-17 Joshua Borland, Darling Ortega 1217.1 4 2.0
BZCCCB02 2021-10-28 Alexander Navarro 681.3 4 5.5
BCMR 06 2021-10-21 Alexander Navarro 520.7 2 1.2
BCMR 46 2022-11-17 Joshua Borland, Darling Ortega 364.4 4 12.0
BCMR 22 2021-10-20 Alexander Navarro 271.7 4 3.5

Interactive Scatter Plot - Biomass vs Abundance

Show the code
# Calculate mean abundance per sample event
se_with_abundance <- sampleunits_data %>%
  group_by(sample_event_id) %>%
  summarise(mean_abundance = mean(total_abundance, na.rm = TRUE)) %>%
  right_join(sampleevents_data, by = "sample_event_id")

# Create interactive scatter plot
scatter_plot <- plot_ly(
  data = se_with_abundance,
  x = ~mean_abundance,
  y = ~biomass_kgha_avg,
  type = "scatter",
  mode = "markers",
  text = ~paste("Site:", site,
                "<br>Biomass:", round(biomass_kgha_avg, 1), "kg/ha",
                "<br>Abundance:", round(mean_abundance, 1),
                "<br>Observer:", observers),
  hoverinfo = "text",
  marker = list(
    size = 10,
    color = ~biomass_kgha_avg,
    colorscale = list(c(0, "#d13823"), c(0.5, "#f3a224"), c(1, "#277d1d")),
    colorbar = list(title = "Biomass<br>(kg/ha)")
  )
) %>%
  layout(
    title = "Fish Biomass vs Abundance",
    xaxis = list(title = "Mean Fish Abundance (count per transect)"),
    yaxis = list(title = "Fish Biomass (kg/ha)"),
    hovermode = "closest"
  )

scatter_plot

Outlier Sites Table

Show the code
# Identify outliers using different methods for high and low values
# Low outliers: static threshold (biologically meaningful low biomass)
low_threshold <- 50  # kg/ha - unusually low biomass
# High outliers: IQR method
Q1 <- quantile(sampleevents_data$biomass_kgha_avg, 0.25, na.rm = TRUE)
Q3 <- quantile(sampleevents_data$biomass_kgha_avg, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
upper_bound <- Q3 + 1.5 * IQR_val

outliers <- sampleevents_data %>%
  filter(biomass_kgha_avg < low_threshold | biomass_kgha_avg > upper_bound) %>%
  mutate(outlier_type = ifelse(biomass_kgha_avg < low_threshold, 
                                "Low", "High")) %>%
  select(site, sample_date, biomass_kgha_avg, outlier_type, observers) %>%
  arrange(outlier_type, biomass_kgha_avg)

kable(outliers,
      digits = 1,
      col.names = c("Site", "Date", "Biomass (kg/ha)", 
                    "Outlier Type", "Observer"),
      caption = paste0("Outlier Sites (<", low_threshold, 
                      " kg/ha or >", round(upper_bound, 1), " kg/ha)"))
Outlier Sites (<50 kg/ha or >388.8 kg/ha)
Site Date Biomass (kg/ha) Outlier Type Observer
BCMR 06 2021-10-21 520.7 High Alexander Navarro
BZCCCB02 2021-10-28 681.3 High Alexander Navarro
BCMR 33 2022-11-17 1217.1 High Joshua Borland, Darling Ortega
BCMR 29 2022-11-17 19.6 Low Joshua Borland, Alexander Navarro, Darling Ortega
BCMR 39 2021-10-19 33.4 Low Alexander Navarro
BCMR - Alleys 2022-11-18 36.7 Low Joshua Borland
HCMR-B-GUZ-01 2021-10-22 39.4 Low Alexander Navarro
NHCMR-B-GUZ-22 2021-10-25 39.8 Low Alexander Navarro, Darling Ortega
BCMR 33 2021-10-19 41.6 Low Alexander Navarro
NHCMR-F-GUZ-06 2021-10-26 45.0 Low Alexander Navarro, Darling Ortega
HCMR-F-NTZ-04 2021-10-22 47.8 Low Alexander Navarro
BCMR - Alleys 2021-10-20 48.4 Low Alexander Navarro

Shark Observations

Shark Presence and Abundance

Show the code
# Identify shark families (Carcharhinidae, Sphyrnidae, etc.)
shark_families <- c("Carcharhinidae", "Sphyrnidae", "Alopiidae", 
                    "Lamnidae", "Rhincodontidae")

shark_obs <- observations_data %>%
  filter(fish_family %in% shark_families)

if(nrow(shark_obs) > 0) {
  # Summarize shark observations by site
  shark_summary <- shark_obs %>%
    group_by(site, fish_family, fish_taxon) %>%
    summarise(
      total_count = sum(count, na.rm = TRUE),
      total_biomass = sum(biomass_kgha, na.rm = TRUE),
      n_observations = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(total_count))
  
  kable(shark_summary,
        digits = 2,
        col.names = c("Site", "Family", "Species", "Total Count", 
                      "Total Biomass (kg/ha)", "N Obs"),
        caption = "Shark Observations Summary")
  
  # Sites with sharks
  cat("\n\nTotal sites surveyed:", 
      length(unique(observations_data$site)), "\n")
  cat("Sites with shark observations:", 
      length(unique(shark_obs$site)), "\n")
  cat("Percentage of sites with sharks:", 
      round(100 * length(unique(shark_obs$site)) / 
              length(unique(observations_data$site)), 1), "%\n")
} else {
  cat("No shark observations in the dataset.")
}
No shark observations in the dataset.

Shark Biomass by Site

Show the code
if(nrow(shark_obs) > 0) {
  # Calculate shark biomass per sample event
  shark_biomass_summary <- observations_data %>%
    mutate(is_shark = fish_family %in% shark_families) %>%
    group_by(sample_event_id, site, is_shark) %>%
    summarise(biomass = sum(biomass_kgha, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = is_shark, values_from = biomass, 
                values_fill = 0) %>%
    rename(shark_biomass = `TRUE`, other_biomass = `FALSE`) %>%
    mutate(total_biomass = shark_biomass + other_biomass,
           shark_percent = 100 * shark_biomass / total_biomass)
  
  # Plot
  ggplot(shark_biomass_summary, 
         aes(x = reorder(site, shark_biomass), y = shark_biomass)) +
    geom_col(fill = "#2c7bb6", alpha = 0.8) +
    coord_flip() +
    labs(
      title = "Shark Biomass by Site",
      x = "Site",
      y = "Shark Biomass (kg/ha)"
    ) +
    theme_classic() +
    theme(
      axis.text = element_text(size = 10, colour = "black"),
      axis.title = element_text(size = 12, colour = "black"),
      plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
    )
} else {
  cat("No shark observations in the dataset.")
}
No shark observations in the dataset.

Fish Family Diversity

Number of Fish Families

Show the code
# Count unique families
n_families <- length(unique(observations_data$fish_family))

cat("Total number of fish families observed:", n_families, "\n\n")
Total number of fish families observed: 21 
Show the code
# Families by site
families_by_site <- observations_data %>%
  group_by(site) %>%
  summarise(
    n_families = n_distinct(fish_family),
    n_genera = n_distinct(fish_genus),
    n_species = n_distinct(fish_taxon),
    total_observations = n()
  ) %>%
  arrange(desc(n_families))

kable(families_by_site,
      col.names = c("Site", "N Families", "N Genera", 
                    "N Species", "Total Obs"),
      caption = "Taxonomic Diversity by Site")
Taxonomic Diversity by Site
Site N Families N Genera N Species Total Obs
BCMR 46 14 22 33 138
BCMR - Alleys 13 20 30 108
BCMR 31 13 18 28 100
BCMR 39 13 20 31 117
BCMR 06 11 16 27 83
BCMR 22 11 15 22 74
BCMR BR 19 11 14 22 87
CCMR-B-NTZ-10 10 14 23 72
HCMR-B-GUZ-03 10 15 21 59
BCMR 33 9 14 26 117
CCMR-B-GUZ-10 9 11 16 54
CCMR-F-GUZ-03 9 12 16 43
CCMR-F-GUZ-08 9 14 17 52
CCMR-F-NTZ-08 9 13 17 43
HCMR-B-NTZ-04 9 14 22 57
HCMR-F-NTZ-03 9 11 16 51
HCMR-F-NTZ-04 9 10 12 36
BCMR 07 8 13 21 77
BCMR 24 8 12 21 74
BCMR 29 8 13 21 69
BZHCCB02 8 11 20 49
CCMR-F-NTZ-01 8 11 17 54
HCMR-F-GUZ-02 8 11 14 43
HCMR-F-GUZ-03 8 11 16 43
NHCMR-B-GUZ-22 8 12 16 43
BZCCCB02 7 8 18 60
NHCMR-F-GUZ-13 7 11 19 42
NHCMR-F-NTZ-03 7 12 14 40
HCMR-B-NTZ-02 6 8 12 26
NHCMR-B-GUZ-30 6 9 15 39
NHCMR-B-NTZ-08 6 9 17 51
NHCMR-F-GUZ-06 6 8 13 36
HCMR-B-GUZ-01 5 7 12 41
NHCMR-B-GUZ-18 5 8 13 31
NHCMR-B-NTZ-07 4 8 13 32
CCMR-B-NTZ-09 3 6 12 46

Family Richness Distribution

Show the code
ggplot(families_by_site, aes(x = n_families)) +
  geom_histogram(binwidth = 2, fill = "#4575b4", 
                 color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Fish Family Richness Across Sites",
    x = "Number of Fish Families",
    y = "Number of Sites"
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 11, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  )

Complete Family List

Show the code
family_list <- observations_data %>%
  group_by(fish_family) %>%
  summarise(
    n_species = n_distinct(fish_taxon),
    total_count = sum(count, na.rm = TRUE)
  ) %>%
  arrange(fish_family)

datatable(family_list,
          colnames = c("Family", "N Species", "N Observations", "Total Count"),
          caption = "Complete List of Fish Families",
          options = list(pageLength = 15, autoWidth = TRUE))

Biomass Composition by Family

Top Families by Biomass

Identifying which fish families contribute most to total biomass.

Show the code
# Extract all family-specific biomass columns from sample events
family_cols <- grep("^biomass_kgha_fish_family_avg_", 
                    names(sampleevents_data), value = TRUE)

# Calculate true mean biomass for each family (replacing NAs with 0)
family_biomass <- sampleevents_data %>%
  select(all_of(family_cols)) %>%
  pivot_longer(cols = everything(),
               names_to = "fish_family",
               values_to = "biomass",
               names_prefix = "biomass_kgha_fish_family_avg_") %>%
  # Replace NAs with 0 to get true mean (including sites where family is absent)
  mutate(fish_family = str_to_title(fish_family),
         biomass = replace_na(biomass, 0)) %>%
  group_by(fish_family) %>%
  summarise(
    mean_biomass = mean(biomass, na.rm = TRUE),
    sd_biomass = sd(biomass, na.rm = TRUE),
    n_sites_present = sum(biomass > 0),
    n_sites_total = n()
  ) %>%
  mutate(
    percent_sites = 100 * n_sites_present / n_sites_total,
    # Calculate contribution to overall mean biomass across all families
    percent_total = 100 * mean_biomass / sum(mean_biomass)
  ) %>%
  arrange(desc(mean_biomass))

# Top 15 families
top_families <- head(family_biomass, 15)

kable(top_families,
      digits = 2,
      col.names = c("Family", "Mean Biomass (kg/ha)", "SD", 
                    "Sites Present", "Total Sites", "% Sites Present", 
                    "% of Total Biomass"),
      caption = "Top 15 Fish Families by Mean Biomass (true mean including zeros)")
Top 15 Fish Families by Mean Biomass (true mean including zeros)
Family Mean Biomass (kg/ha) SD Sites Present Total Sites % Sites Present % of Total Biomass
Scaridae 46.75 82.86 45 47 95.74 27.82
Haemulidae 41.91 67.25 40 47 85.11 24.94
Acanthuridae 31.27 155.34 39 47 82.98 18.61
Pomacentridae 14.81 14.99 47 47 100.00 8.81
Labridae 9.92 10.05 47 47 100.00 5.90
Lutjanidae 8.92 15.60 23 47 48.94 5.31
Pomacanthidae 5.66 19.74 15 47 31.91 3.37
Chaetodontidae 1.93 4.62 21 47 44.68 1.15
Carangidae 1.90 7.49 7 47 14.89 1.13
Holocentridae 1.43 3.40 12 47 25.53 0.85
Diodontidae 1.20 8.23 1 47 2.13 0.71
Epinephelidae 0.84 3.80 3 47 6.38 0.50
Serranidae 0.53 1.22 13 47 27.66 0.31
Aulostomidae 0.37 2.56 1 47 2.13 0.22
Balistidae 0.29 1.39 2 47 4.26 0.17

Biomass Proportion - Top Families

Show the code
# Prepare data for pie chart - top 10 plus "Other"
top10_families <- head(family_biomass, 10)
other_biomass <- sum(family_biomass$mean_biomass[11:nrow(family_biomass)])

pie_data <- top10_families %>%
  select(fish_family, mean_biomass) %>%
  bind_rows(tibble(fish_family = "Other families", 
                   mean_biomass = other_biomass))

# Create pie chart
plot_ly(pie_data, 
        labels = ~fish_family, 
        values = ~mean_biomass,
        type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        hoverinfo = 'text',
        text = ~paste(fish_family, '<br>',
                     round(mean_biomass, 1), 'kg/ha'),
        marker = list(line = list(color = '#FFFFFF', width = 2))) %>%
  layout(title = "Fish Biomass Composition by Family (Top 10 - Mean Biomass)",
         showlegend = TRUE,
         legend = list(orientation = 'v', 
                      x = 1.1, y = 0.5))

Biomass Composition by Trophic Group

Show the code
# Get trophic group biomass from sample events
trophic_cols <- grep("biomass_kgha_trophic_group_avg_", 
                     names(sampleevents_data), value = TRUE)

trophic_biomass <- sampleevents_data %>%
  select(site, all_of(trophic_cols)) %>%
  pivot_longer(cols = all_of(trophic_cols),
               names_to = "trophic_group",
               values_to = "biomass",
               names_prefix = "biomass_kgha_trophic_group_avg_") %>%
  # Replace NAs with 0 to get true mean (including sites where group is absent)
  mutate(biomass = replace_na(biomass, 0)) %>%
  group_by(trophic_group) %>%
  summarise(
    mean_biomass = mean(biomass, na.rm = TRUE),
    q025 = quantile(biomass, 0.025, na.rm = TRUE),
    q975 = quantile(biomass, 0.975, na.rm = TRUE),
    n_sites_present = sum(biomass > 0),
    n_sites_total = n()
  ) %>%
  mutate(
    trophic_group = case_when(
      trophic_group == "planktivore" ~ "Planktivore",
      trophic_group == "herbivore_macroalgae" ~ "Herbivore (macroalgae)",
      trophic_group == "herbivore_detritivore" ~ "Herbivore (detritivore)",
      trophic_group == "invertivore_sessile" ~ "Invertivore (sessile)",
      trophic_group == "invertivore_mobile" ~ "Invertivore (mobile)",
      trophic_group == "omnivore" ~ "Omnivore",
      trophic_group == "piscivore" ~ "Piscivore",
      TRUE ~ trophic_group
    ),
    percent_sites = 100 * n_sites_present / n_sites_total,
    percent_total = 100 * mean_biomass / sum(mean_biomass)
  ) %>%
  arrange(desc(mean_biomass))

kable(trophic_biomass,
      digits = 2,
      col.names = c("Trophic Group", "Mean Biomass (kg/ha)", 
                    "2.5% Quantile", "97.5% Quantile",
                    "Sites Present", "Total Sites", "% Sites Present",
                    "% of Total Biomass"),
      caption = "Fish Biomass by Trophic Group (true mean including zeros)")
Fish Biomass by Trophic Group (true mean including zeros)
Trophic Group Mean Biomass (kg/ha) 2.5% Quantile 97.5% Quantile Sites Present Total Sites % Sites Present % of Total Biomass
Herbivore (detritivore) 58.78 3.24 264.55 47 47 100.00 34.98
Invertivore (mobile) 45.54 0.58 168.86 47 47 100.00 27.10
Herbivore (macroalgae) 29.00 0.00 40.78 33 47 70.21 17.26
Piscivore 16.88 0.00 61.84 43 47 91.49 10.04
Invertivore (sessile) 7.30 0.00 46.51 27 47 57.45 4.34
Omnivore 5.35 0.00 14.27 45 47 95.74 3.19
Planktivore 5.19 0.03 18.71 45 47 95.74 3.09

Trophic Group Composition Bar Plot

Show the code
ggplot(trophic_biomass, 
       aes(x = reorder(trophic_group, mean_biomass), 
           y = mean_biomass, fill = trophic_group)) +
  geom_col(alpha = 0.8, color = "black", linewidth = 0.3) +
  geom_errorbar(aes(ymin = q025, ymax = q975),
                width = 0.3, linewidth = 0.5) +
  coord_flip() +
  labs(
    title = "Mean Fish Biomass by Trophic Group",
    subtitle = "Error bars show 95% quantiles (2.5% - 97.5%)",
    x = "Trophic Group",
    y = "Mean Biomass (kg/ha)"
  ) +
  theme_classic() +
  theme(
    legend.position = "none",
    axis.text = element_text(size = 11, colour = "black"),
    axis.title = element_text(size = 12, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray30")
  ) +
  scale_fill_brewer(palette = "Set3")


Data Quality Summary

Show the code
# Assemble the data quality summary message with markdown
summary_msg <- paste0(
  "#### Sample Coverage\n\n",
  "- **Total sample events:** ", nrow(sampleevents_data), "\n",
  "- **Total transects:** ", nrow(sampleunits_data), "\n",
  "- **Total observations:** ", nrow(observations_data), "\n",
  "- **Unique sites:** ", length(unique(sampleevents_data$site)), "\n",
  "- **Unique observers:** ", length(unique(sampleevents_data$observers)), "\n\n",
  
  "#### Taxonomic Diversity\n\n",
  "- **Fish families:** ", n_families, "\n",
  "- **Fish genera:** ", length(unique(observations_data$fish_genus)), "\n",
  "- **Fish species:** ", length(unique(observations_data$fish_taxon)), "\n\n",
  
  "#### Biomass Statistics (kg/ha)\n\n",
  "- **Mean:** ", round(mean(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **Median:** ", round(median(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **SD:** ", round(sd(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n",
  "- **Range:** ", round(min(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), " - ",
             round(max(sampleevents_data$biomass_kgha_avg, na.rm = TRUE), 1), "\n\n",
  
  "#### Data Quality Flags\n\n",
  "- **High biomass outliers (>90th percentile):** ", nrow(high_biomass_sites), "\n",
  "- **Statistical outliers (IQR method):** ", nrow(outliers), "\n",
  "- **Sites with shark observations:** ", 
    ifelse(exists("shark_obs") && nrow(shark_obs) > 0,
           length(unique(shark_obs$site)), 0), "\n"
)

# Display the assembled summary
cat(summary_msg)

Sample Coverage

  • Total sample events: 47
  • Total transects: 176
  • Total observations: 2187
  • Unique sites: 36
  • Unique observers: 7

Taxonomic Diversity

  • Fish families: 21
  • Fish genera: 41
  • Fish species: 79

Biomass Statistics (kg/ha)

  • Mean: 168
  • Median: 113.6
  • SD: 199.1
  • Range: 19.6 - 1217.1

Data Quality Flags

  • High biomass outliers (>90th percentile): 5
  • Statistical outliers (IQR method): 12
  • Sites with shark observations: 0

Notes and Recommendations

Based on this data quality check:

  1. Observer Consistency: Review any systematic differences between observers in biomass or abundance estimates.

  2. Outliers: Investigate outlier sites to determine if they represent true ecological variation or potential data entry errors.

  3. Shark Observations: Consider the ecological significance of sites with shark presence as potential indicators of reef health.

  4. Family Diversity: Compare family richness across sites in relation to management regimes or environmental conditions.

  5. Biomass Composition: The dominant families identified should align with expected community structure for the region.

 

Powered by

Logo