The following is a short tutorial showing how MERMAID images and their associated annotations can be accessed from the S3 bucket using Python code, with a final step of visualizing them together.
Accessing MERMAID annotations
To work with MERMAID open data (images and annotations), first open the mermaid_confirmed_annotations.parquet file with a library such as duckdb.
import duckdb

con = duckdb.connect()
con.install_extension('httpfs')  # only needed once
con.load_extension('httpfs')

# Configure S3 (for public buckets you only need region)
con.execute("SET s3_region='us-east-1'")
con.execute("SET s3_access_key_id=''")
con.execute("SET s3_secret_access_key=''")
con.execute("SET s3_session_token=''")

s3_url = "s3://coral-reef-training/mermaid/mermaid_confirmed_annotations.parquet"
df_annotations = con.execute(f"SELECT * FROM read_parquet('{s3_url}')").df()
df_images = df_annotations[["image_id", "region_id", "region_name"]].drop_duplicates("image_id")

print(
    f"Loaded {len(df_annotations):,} annotations across "
    f"{len(df_images):,} images from "
    f"{df_images['region_id'].nunique():,} unique geographic realms."
)
Loaded 49,950 annotations across 1,998 images from 2 unique geographic realms.
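The counts printed above work out to 25 annotations per image on average (49,950 / 1,998). A quick way to check how annotations are distributed is to group by image_id. The following is a minimal sketch that runs on a small synthetic stand-in for df_annotations, so it works without S3 access; the IDs and region names in it are hypothetical, and only the three column names (image_id, region_id, region_name) come from the code above.

```python
import pandas as pd

# Synthetic stand-in for df_annotations (hypothetical IDs and region names);
# only the column names match the real table.
df_annotations = pd.DataFrame(
    {
        "image_id": ["img-a", "img-a", "img-a", "img-b", "img-b", "img-c"],
        "region_id": ["r1", "r1", "r1", "r1", "r1", "r2"],
        "region_name": ["Region One"] * 5 + ["Region Two"],
    }
)

# Number of annotation rows for each image
per_image = df_annotations.groupby("image_id").size().rename("n_annotations")

# Number of distinct images in each region
images_per_region = (
    df_annotations.drop_duplicates("image_id")
    .groupby("region_name")
    .size()
    .rename("n_images")
)

print(per_image)
print(images_per_region)
```

With the real data, the same two groupby calls can be applied directly to the df_annotations loaded from the Parquet file.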
Getting an image
To fetch an image, look up its image_id in the DataFrame, download the file from S3, and select the matching annotation rows:
import boto3
from botocore.config import Config
from botocore import UNSIGNED
import io
from PIL import Image

# Anonymous S3 client (public bucket; no credentials needed)
s3 = boto3.client("s3", region_name="us-east-1", config=Config(signature_version=UNSIGNED))


def get_image_s3(image_id: str, bucket: str = "coral-reef-training", thumbnail: bool = False) -> Image.Image:
    """Download an image (or its thumbnail) from the bucket and open it with PIL."""
    key = f"mermaid/{image_id}_thumbnail.png" if thumbnail else f"mermaid/{image_id}.png"
    resp = s3.get_object(Bucket=bucket, Key=key)
    return Image.open(io.BytesIO(resp["Body"].read()))


# Example (uses df_images/df_annotations loaded in the previous section)
idx = 0
# Positional lookup: index labels have gaps after drop_duplicates, so .loc[idx] can fail for idx > 0
image_id = df_images.iloc[idx]["image_id"]
image = get_image_s3(image_id, thumbnail=False).convert("RGB")

# Get the annotations for the associated image
annotations = df_annotations.loc[df_annotations["image_id"] == image_id]

# Display inline (in a notebook or Quarto document)
from IPython.display import display
display(image)
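The key layout used above (mermaid/{image_id}.png, with a _thumbnail suffix for thumbnails) can also be turned into a plain HTTPS URL, which is handy for sharing a link or fetching without boto3. This is a sketch only: image_key and image_url are hypothetical helper names, the URL follows the standard S3 virtual-hosted style, and it assumes the bucket allows anonymous HTTP GET requests, which the tutorial does not confirm.

```python
from urllib.parse import quote

BUCKET = "coral-reef-training"
REGION = "us-east-1"


def image_key(image_id: str, thumbnail: bool = False) -> str:
    """Build the object key for an image, mirroring the layout used by get_image_s3."""
    suffix = "_thumbnail" if thumbnail else ""
    return f"mermaid/{image_id}{suffix}.png"


def image_url(image_id: str, thumbnail: bool = False) -> str:
    """Virtual-hosted-style HTTPS URL for the same object.

    Assumption: the bucket permits anonymous GET over HTTPS.
    """
    key = quote(image_key(image_id, thumbnail))  # percent-encode, keeping "/" intact
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{key}"


example_id = "example-image-id"  # hypothetical placeholder, not a real image_id
print(image_key(example_id))
print(image_url(example_id, thumbnail=True))
```

In practice, example_id would be replaced by a value taken from the image_id column of df_images.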