Dolphin Data Analysis: A Predictive Approach

Chong Xiang (John) Ju, Lily McAboy, Luke Monsma, Professor Stacy DeRuiter, Dr. Andreas Fahlman

Our Vision


Our project aims to develop a machine learning model that diagnoses respiratory illnesses in dolphins, advancing animal conservation efforts. Existing models rely on data from voluntarily beached dolphins, which limits their applicability to the broader population and cannot diagnose specific diseases such as pneumonia. To address these limitations, our model will analyze in-water measurements, enabling assessments of captive dolphins, involuntarily beached dolphins, and temporarily captured wild dolphins. Using Python and the Scikit-Learn library, we plan to build a transparent and efficient supervised classification model hosted as a Hugging Face app, offering real-time respiratory health evaluations. The model will prioritize generalizability and robustness. Guided by Dr. Andreas Fahlman and Professor DeRuiter, our team will draw on our experience in machine learning, data analytics, and neural networks to deliver a solution that lets veterinarians promptly assess and treat dolphins in a variety of environments.

2024 Fall Progress


Over the semester, the primary focus was on revisiting the original logistic regression model, adapting it to new data and refining its methodology. The process began with exploratory graphs to visualize the data, which helped clarify the rationale behind the original feature selection. The initial logistic regression model was built with the glmmTMB package in R, replicating the features from the referenced study; however, its accuracy was lower than the original study's (78.8% versus 88.4%). Efforts have since concentrated on improving the modeling process, particularly on reworking feature selection with the tidymodels package. The "leave-one-animal-out" validation scheme remains a work in progress, so an interim logistic regression model is being used to demonstrate results until the refined approach is finalized.
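The original modeling was done in R, but the leave-one-animal-out idea carries over directly to the project's Python stack: each fold holds out every breath from one animal, so accuracy reflects performance on unseen individuals. A minimal sketch using Scikit-Learn's LeaveOneGroupOut with synthetic stand-in data (the feature count, labels, and animal IDs here are hypothetical, not the study's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                          # four hypothetical breath features
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # synthetic healthy/ill label
groups = rng.integers(0, 20, size=n)                 # animal ID for each breath

# One fold per animal: all of that animal's breaths are held out together
logo = LeaveOneGroupOut()
model = make_pipeline(StandardScaler(), LogisticRegression())

accs = []
for train_idx, test_idx in logo.split(X, y, groups):
    model.fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))

print(f"mean leave-one-animal-out accuracy: {np.mean(accs):.3f}")
```

Grouping folds by animal rather than by row is what prevents breaths from the same dolphin from leaking between training and testing.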

In addition to logistic regression, a neural-network binary classifier was developed in PyTorch. The model used fully connected layers with ReLU activation functions and was trained for 100 epochs with the Adam optimizer. Validation was performed by splitting the dataset into training (80%) and testing (20%) sets and repeating the process 30 times, yielding an average accuracy of 0.4917. Remaining challenges include a high type II error rate and suboptimal accuracy (0.67) when relying only on the four variables from the logistic regression model. Other improvements included renaming dataset variables to make them more intuitive and easier to interpret. Principal Component Analysis (PCA) was also employed to reduce dimensionality while retaining key information. Although PCA-derived components have not yet been incorporated into the models, they are expected to improve future analyses and predictive accuracy.
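The PyTorch classifier described above can be sketched as follows. This is a minimal reconstruction on synthetic data, not the project's actual code: the layer widths, learning rate, and feature shapes are assumptions; only the fully connected ReLU layers, the 100 Adam epochs, and the 80/20 split come from the description (one repetition of the 30 is shown).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for the breath measurements (shapes are hypothetical)
X = torch.randn(300, 4)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float().unsqueeze(1)

# Fully connected layers with ReLU activations
model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1),  # raw logit; BCEWithLogitsLoss applies the sigmoid
)

# One 80/20 train/test split (the report repeats this 30 times)
perm = torch.randperm(len(X))
split = int(0.8 * len(X))
tr, te = perm[:split], perm[split:]

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):  # 100 epochs with the Adam optimizer
    opt.zero_grad()
    loss = loss_fn(model(X[tr]), y[tr])
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (model(X[te]) > 0).float()   # logit > 0 means class 1
    acc = (preds == y[te]).float().mean().item()
print(f"test accuracy: {acc:.3f}")
```

Averaging this accuracy over 30 random splits reproduces the validation procedure described above.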

2025 Spring Progress


The original summary statistics we received included variables such as body weight, peak inspiratory and expiratory flow, inspiratory volume, and duration of breath intake, along with the categorical variable 'sex', which was later one-hot encoded into the dataset. To streamline the model architecture and reduce computational complexity, the existing PyTorch model was replaced with a Scikit-Learn approach. The updated model used a multi-layer perceptron (MLPClassifier), a type of neural network that optimizes the log-loss function, for binary classification. To improve performance, the classifier was initialized with a warm-start strategy, which allowed it to retain learned parameters across training iterations rather than resetting at every cycle. To ensure data quality and avoid data leakage, two validation strategies were introduced. The first, Group Shuffle Split (GSS), partitioned the data so that no dolphin seen in training also appears in the testing set. The second was a cross-validation technique called Leave One Group Out (LOGO): on each training cycle, one dolphin in the training set was held out, and after each round the model was retrained on the remaining dolphins with a different dolphin left out. This helped the model generalize across individual dolphins.
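The pipeline above can be sketched with Scikit-Learn directly. This is a simplified illustration on synthetic data, not the project's code: the hidden-layer sizes, feature count, and animal IDs are assumptions; the warm start, GroupShuffleSplit, and LeaveOneGroupOut pieces are the techniques named in the report.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 5))                  # hypothetical breath features
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)
dolphins = rng.integers(0, 30, size=n)       # animal ID per breath

# GSS: no dolphin appears in both the training and testing sets
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, dolphins))

scaler = StandardScaler().fit(X[train_idx])
X_train, y_train = scaler.transform(X[train_idx]), y[train_idx]
X_test, y_test = scaler.transform(X[test_idx]), y[test_idx]

# warm_start=True keeps learned weights across successive fit() calls
clf = MLPClassifier(hidden_layer_sizes=(16, 8), warm_start=True,
                    max_iter=200, random_state=0)

# LOGO within the training set: each round drops one dolphin, then refits
logo = LeaveOneGroupOut()
for fit_idx, _ in logo.split(X_train, y_train, dolphins[train_idx]):
    clf.fit(X_train[fit_idx], y_train[fit_idx])

print(f"train acc: {clf.score(X_train, y_train):.2f}, "
      f"test acc: {clf.score(X_test, y_test):.2f}")
```

Because the split is grouped by animal ID, the final test score measures generalization to dolphins the model has never seen, which is the property the report relies on.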

The MLP classifier achieved a training accuracy of 99% and a testing accuracy of 94%. Because of the steps taken to avoid data leakage, we have strong evidence that the model did not overfit: the dataset was split into 94 unique dolphins for training and 24 unique dolphins for testing, and no animal that appears in the testing set was ever seen during training.

The Team


Chong Xiang (John) Ju

My name is John Ju. I'm majoring in Data Science with a cognate in accounting. I come from Beijing, China. In my free time, I enjoy listening to music and playing games with my friends.

Lily McAboy

My name is Lily McAboy and I am double majoring in Computer Science and Data Science with a cognate track in engineering and a minor in math. I am from Zeeland, Michigan. In my free time, I love to crochet and cook!

Luke Monsma

My name is Luke Monsma and I am majoring in Data Science with a cognate in biology. I'm from Warsaw, Indiana, and I like to spend my time playing games or going on walks.