
Revolutionizing marine mammal research with AI-powered photo identification

DataCrunch Content Team

Happywhale, a pioneering research collaboration and citizen science platform, has been using DataCrunch infrastructure since 2022 to identify individual whales and dolphins globally, processing millions of photos and transforming marine conservation research. Starting with classic pattern recognition, Happywhale evolved to implement advanced machine learning algorithms from a Kaggle competition, achieving breakthrough accuracy in whale identification. The platform now hosts the majority of the world’s humpback whale photo ID data and is expanding to other marine species including dolphins and seals.

In the North Pacific alone, Happywhale has recorded nearly every living humpback whale (approx. 90%) and supports continuous tracking of long-term trends, including detectable population declines tied to climate-driven food shortages (approx. 7,000 whales lost from 2012 to 2021 and a 20% population decline from 2014 to 2016).

With the DataCrunch Cloud Platform, Happywhale identifies individual whales within seconds of upload, serving researchers and citizen scientists worldwide.

About Happywhale

Happywhale is a research collaboration and citizen science web platform that uses AI-powered image recognition to identify individual marine mammals through photo-ID - particularly humpback whale flukes (tail images). Founded in 2015 by marine biologist Ted Cheeseman and developer Ken Southerland, the platform has revolutionized how researchers track and study whale populations globally.

Since then, Happywhale has built a strong track record.

Happywhale serves two key audiences: citizen scientists who submit photos to track individual whales, and researchers conducting population studies and conservation work. The platform has become an essential service for marine research laboratories worldwide.

How It Works

Happywhale's AI-powered system identifies individual whales by analyzing the unique characteristics of their flukes (tail fins, used for whale matching) and dorsal fins (used for dolphin matching) - essentially nature's fingerprints. The algorithm analyzes the shape, patterns, and features of an individual humpback whale's tail, and the platform builds a profile of each individual and how it travels around the world.

Humpback Whale Oscar

Each whale's fluke carries distinct identifying features in its shape and markings.
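A minimal sketch of how this kind of feature-based matching could work, assuming each fluke photo has already been reduced to a numeric feature vector by an upstream model. The vectors, whale IDs, and similarity threshold below are illustrative placeholders, not Happywhale's actual pipeline:

```python
from math import sqrt
from typing import Dict, List, Optional

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def match_fluke(query: List[float],
                catalog: Dict[str, List[float]],
                threshold: float = 0.8) -> Optional[str]:
    """Return the ID of the most similar catalogued fluke, or None
    if no catalogue entry clears the similarity threshold."""
    best_id = max(catalog, key=lambda wid: cosine(query, catalog[wid]))
    return best_id if cosine(query, catalog[best_id]) >= threshold else None

# Hypothetical catalogue of two known whales.
catalog = {"whale-A": [1.0, 0.0, 0.0], "whale-B": [0.0, 1.0, 0.0]}
```

In practice the catalogue holds millions of entries, so a production system would use an approximate nearest-neighbor index rather than this linear scan; the sketch only conveys the matching idea.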

“We basically broke the bottleneck of pattern recognition. What took an hour per photo now happens virtually instantaneously. This allowed our collaboration to expand to a global scale. We now have most of the world’s humpback whale photo ID data in one place.” – Ted Cheeseman, Co-founder & Director at Happywhale


The Challenge

Before Happywhale’s AI solution, researchers had to spend over 50 minutes manually matching each whale photo with existing data — a tedious process that limited their work to their own data sets. A crowdsourced identification platform, however, could allow researchers worldwide to collaborate and share information, greatly expanding the scale and impact of whale research. With whales migrating thousands of miles across ocean basins, tracking individuals was nearly impossible without a faster, more accurate system. The team needed infrastructure that could handle continuous global usage while maintaining sub-10-second response times for field researchers.

Starting with SIFT-based pattern recognition in 2015, Happywhale transitioned to machine learning in 2018–2019. The breakthrough came from implementing a Kaggle competition winner’s algorithm, which Ken Southerland refactored to run an order of magnitude faster while maintaining accuracy.

Photo Identification on DataCrunch

The application runs Node.js and Python servers on DataCrunch GPU instances, processing identification requests every few minutes from users worldwide. Feature sets and trained models are stored in memory using an LRU cache for optimal performance, with S3 buckets handling image storage.
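The in-memory caching pattern described above can be sketched with Python's standard library; the loader name and its return value are hypothetical stand-ins for Happywhale's actual model and feature-set loading:

```python
from functools import lru_cache

@lru_cache(maxsize=8)  # keep only the most recently used species feature sets resident
def load_feature_set(species: str) -> dict:
    # Hypothetical loader: in production this would fetch the trained
    # model's feature vectors from object storage (e.g. an S3 bucket).
    return {"species": species, "vectors": []}

# The first call pays the storage cost; repeats are served from memory.
first = load_feature_set("humpback")
again = load_feature_set("humpback")
assert first is again  # same cached object, no reload
```

An LRU (least-recently-used) policy fits this workload well: popular species stay hot in memory while rarely requested ones are evicted, bounding memory use without manual tuning.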

“We initially used a traditional hyperscaler, which made things ridiculously complex and left me trying to figure everything out on my own. With DataCrunch, it’s literally ‘set it and forget it’. I got our application running on a GPU instance quickly and easily, and any question I had was instantly answered by an infrastructure expert.” – Ken Southerland, Co-founder & Lead Developer at Happywhale


The DataCrunch Cloud Platform equips Happywhale with dedicated GPU instances, straightforward deployment, and direct access to infrastructure experts.

Researchers can now photograph a whale, upload via mobile, and receive identification within seconds. This enables critical decisions in the field, such as whether it’s necessary to collect genetic samples, reducing disturbance to animals and optimizing research permits. Citizen scientists and whale watching hobbyists can see instantly if the whale they are photographing has been identified and track its sightings, and even adopt and name the whale if previously unnamed.

Vision for the Future

Ted Cheeseman envisions a future where every image uploaded to Happywhale is automatically analyzed without human intervention. “My main goal would be to have every image that comes into Happywhale get a ‘Hey, what species do I see in this?’” he explains. The system would automatically detect whether there is anything identifiable in the photo - be it a whale fluke, dorsal fin, or seal - and route it through the appropriate species-specific algorithm (inference endpoint). This automation would eliminate the current manual step where users must specify which algorithm to use, enabling further scaling. It would transform Happywhale into a truly intelligent platform capable of processing images from virtually anywhere, instantly identifying not just whales but multiple marine species from a single unified interface.
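The detect-then-dispatch flow Cheeseman describes could be sketched as follows; the species labels, handler names, and detector are hypothetical placeholders, not Happywhale's actual API:

```python
from typing import Callable, Dict

# Hypothetical per-species inference endpoints (illustrative names only).
def identify_humpback_fluke(image_id: str) -> str:
    return f"humpback match for {image_id}"

def identify_dolphin_fin(image_id: str) -> str:
    return f"dolphin match for {image_id}"

ROUTES: Dict[str, Callable[[str], str]] = {
    "humpback_fluke": identify_humpback_fluke,
    "dolphin_dorsal": identify_dolphin_fin,
}

def route_image(image_id: str, detect: Callable[[str], str]) -> str:
    """Ask a detector what the photo contains, then dispatch to the
    matching species-specific algorithm instead of asking the user."""
    label = detect(image_id)
    handler = ROUTES.get(label)
    if handler is None:
        return "nothing identifiable detected"
    return handler(image_id)
```

Adding a new species then reduces to registering one more entry in the routing table, which is what makes this design scale across marine mammals.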

The implications for Ken and the DataCrunch team are that GPU, CPU, and storage resources must scale automatically to accommodate an exponential increase in requests to process species-specific feature sets. Model training and re-training workloads must also be seamlessly integrated into the infrastructure. The team is currently exploring containerization to increase scalability and application portability while minimizing GPU costs.

Next Steps