Group project: Detecting Diabetic Retinopathy with Image Classification/CNN

Trevor O'Donnell
9 min read · Sep 19, 2021

Background and problem statement

Diabetic Retinopathy

Diabetic retinopathy is a complication of diabetes that affects the eyes. It is caused by damage to the blood vessels in the tissue at the back of the eye (the retina). Patients typically develop diabetic retinopathy after they have had diabetes for three to five years: DR is generally diagnosed within three to five years of type-1 diabetes onset, but it may already be present when type-2 diabetes is diagnosed. Poorly controlled blood sugar increases the risk of developing diabetic retinopathy. Mild cases of DR may be treated with careful diabetes management, but advanced cases may require laser treatment or surgery.

Background

Diabetic retinopathy is a leading cause of blindness among working-age adults, afflicting millions of people annually. It is prevalent in rural parts of India, which are often remote and technologically isolated, making medical screenings for the ailment much more difficult to conduct. Historically, medical technicians have been sent to these remote locations to screen for DR by photographing the retina and then manually assessing each image as a basis for diagnosis.

These professionals look for a variety of identifying characteristics visibly present on the retina, such as hemorrhages, aneurysms, and abnormal growth of blood vessels, among others. All of these can indicate whether DR is present and how severe it is.

This method of relying on trained professionals to classify each image is laborious, inefficient, and costly. Automating parts of the process could save resources and drastically expand the reach of screening.

An illustration of various physical manifestations of diabetic retinopathy

Problem Statement

Our goal is to expedite DR detection by building a multi-class classification model trained on thousands of retinal images collected in rural India. The model will automatically predict whether a patient has DR and how severe it is on a scale from zero to four, with severity ascending:

  • 0: No DR
  • 1: Mild
  • 2: Moderate
  • 3: Severe
  • 4: Proliferative DR

Premodeling

Metadata

The data used in this repository was retrieved from the APTOS 2019 Blindness Detection competition, hosted on Kaggle.com. It was compiled over several years and released in July 2019. It contains four folders: two for the training images and training labels, and two for the testing images and testing labels. Images are .png files, and the labels are .csv files. The training images folder contained 3,662 images, each 640 × 480 pixels with three color channels (RGB). The images were gathered from multiple clinics using a variety of cameras over an extended period of time, which introduces further variation.

An example of a retinal image from the dataset

Loading

For this multi-outcome image classification problem, we will focus solely on implementing and tuning convolutional neural networks (CNNs), as they are the standard tool for this type of task. I am probably unqualified to describe the mechanisms of how a CNN works, exactly. However, Simran Bansari has done an excellent job here.

To load our data, we used a for-loop to populate two lists, based on code provided by our instructors at General Assembly. It uses pyplot's imread and NumPy's resize to read in each image and scale it down from 640 × 480 pixels to 128 × 128 pixels. We needed our image file names to match the names in the labels file, so we used a list comprehension to strip the .png suffix. Finally, so the data could be fed into our neural networks, we rescaled the images by dividing each pixel value by 255, the maximum value of an 8-bit channel.

Loop used to read images into Python-compatible data.
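The original loop isn't reproduced in this post, so here is a minimal sketch of the same idea. The folder layout, the `labels` mapping, and the use of Pillow for resizing are assumptions for illustration (the article mentions pyplot's imread and NumPy's resize):

```python
import os
import numpy as np
from PIL import Image  # Pillow stands in here for the original imread/resize calls

def load_and_scale(image_dir, labels, size=(128, 128)):
    """Read each labeled .png, shrink it to `size`, and rescale pixels to [0, 1]."""
    X, y = [], []
    for file_name in sorted(os.listdir(image_dir)):
        if not file_name.endswith(".png"):
            continue
        image_id = file_name[:-4]  # strip the ".png" suffix to match the labels
        img = Image.open(os.path.join(image_dir, file_name)).resize(size)
        # 255 is the maximum value of an 8-bit channel
        X.append(np.asarray(img)[..., :3] / 255.0)
        y.append(labels[image_id])
    return np.array(X), np.array(y)
```

In the real pipeline, the `labels` dictionary (image id to severity 0–4) would be built from the training .csv file.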

Our initial target variable value counts will provide us with our baseline scores for our five possible outcomes, as follows:

  • 0: 49.49%
  • 1: 10.10%
  • 2: 27.28%
  • 3: 5.27%
  • 4: 8.06%

We then train-test split our training data, with 75% going to training and 25% to testing, stratifying on the target variable because the labels are imbalanced. Finally, we used utils.to_categorical to one-hot encode our target variable, creating a dummy array that can be used in our multi-class CNN.
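In code, that split-and-encode step might look like the following (the toy arrays and variable names are illustrative, not our actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Toy stand-ins for the real image array and 0-4 severity labels
X = np.random.rand(40, 128, 128, 3)
y = np.repeat([0, 1, 2, 3, 4], 8)

# 75/25 split, stratified so the class proportions survive the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# One-hot encode the five severity classes for categorical cross-entropy
y_train_cat = to_categorical(y_train, num_classes=5)
y_test_cat = to_categorical(y_test, num_classes=5)
```

Stratification matters here: with roughly half the images in class 0, an unstratified split could leave the rarer classes badly underrepresented in one side of the split.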

Approach

As a three-person team, we split up the task of testing different models with a common goal of maximizing our model's test metric: the quadratic weighted Cohen's kappa. For the details of each model that was tested, there are folders in our GitHub repository for each team member with their model explanations and corresponding code.

My CNNs

GA structure

This CNN model structure comes from the instruction at General Assembly, as part of their introduction to convolutional neural networks. It consists of two convolutional layers and two pooling layers (with standard kernel and pool sizes) and a small number of image filters. After a flattening layer, there are two hidden dense layers, followed by a five-node output layer for the five outcome classes. ReLU activation functions are used throughout, with the exception of the output layer, which uses softmax, as is standard for multi-class classification models such as this.

The CNN is then compiled using categorical cross-entropy as its loss function and Adam as its optimizer, with accuracy tracked as an additional evaluation metric.

Early stopping is instantiated as a callback, with changes in the validation loss over a five epoch period determining whether the network will continue or cease to run.

The CNN model is then fit on the training data, with the testing data used for validation; the model is set to run for up to 100 epochs, with the previously instantiated early stopping passed as a callback.
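Since the original code isn't shown in the post, here is a minimal sketch of a GA-style network of this shape. The exact filter counts (16, 32) and hidden-layer sizes (64, 32) are assumptions, and tiny random arrays stand in for the real retina images:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Two conv/pool pairs with a small number of filters, then two hidden
# dense layers and a five-node softmax output, one node per DR class.
model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(5, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Stop training if validation loss fails to improve for five epochs
early_stop = EarlyStopping(monitor="val_loss", patience=5)

# Toy random data in place of the real train/test image arrays
X_train = np.random.rand(16, 128, 128, 3)
y_train = np.eye(5)[np.random.randint(0, 5, 16)]
X_test = np.random.rand(8, 128, 128, 3)
y_test = np.eye(5)[np.random.randint(0, 5, 8)]

history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    epochs=100, callbacks=[early_stop], verbose=0)
```

With random data the callback fires quickly; on the real images, training continues until validation loss plateaus for five consecutive epochs or the 100-epoch cap is reached.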

We create our target predictions based on the model fit on our training predictors. Both the actual target variable values and the predicted variable values are used to calculate the quadratic weighted kappa (QWK).
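Computing the metric itself is a one-liner with scikit-learn, assuming the one-hot predictions have already been converted back to class labels with argmax (the toy arrays here are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([0, 0, 1, 2, 3, 4, 2, 0])
y_pred = np.array([0, 0, 2, 2, 3, 3, 2, 0])  # two near-miss errors

# Quadratic weighting penalizes a miss of two classes four times as
# heavily as a miss of one class.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```

Perfect agreement yields a QWK of 1.0, while chance-level agreement yields 0.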

Our QWK score for this model was 0.6677.

Given that an ideal QWK score is 1.0, this model is currently at a D, maybe D+, status; in other words, not good. In addition, with a training accuracy score around 0.9 and a testing accuracy score of approximately 0.7, our model is not terribly accurate, and is certainly overfit.

James Le structure

This CNN structure was provided by James Le via an article he published on Medium. He developed it for a ten-outcome image classification model intended to identify which item of clothing an image depicted. The visual characteristics that distinguish degrees of diabetic retinopathy are inherently different from those that distinguish clothing items: DR severity is associated with elements such as shape, line quality, contrast, and varying manifestations of noise within the images themselves. Despite this, as an experiment, we tested its performance in a completely different context.

The sheer volume of its code leads one to believe it is a winning recipe. There are four convolutional layers, with pooling layers placed after the second and fourth, and the number of filters on each convolutional layer ranging from 32 to 128. The kernel size and pool size remain within the standard range of values for these parameters. We are doubling the number of convolutional layers and, separately, increasing the number of filters at exponential increments; we should already expect a significant difference in our outcomes, one way or another.

Once we add our flattening layer, we add two hidden layers: the first with a whopping 512 nodes, and the second one with 128 nodes, which is narrowly whittled down to an output layer of five nodes for our corresponding class outcomes.

The changes we have made to the model through adding layers and filters in the convolutional and pooling layers are further compounded by another exponentially incremental increase in nodes used in the hidden layers of our CNN.

Additionally, its heavy use of various regularization methods makes it an attractive candidate to fix our previous model's overfit accuracy scores. On top of the early-stopping callback used in the first model, this CNN employs batch normalization at all convolutional layers as well as all hidden layers. Similarly, dropout regularization has been applied to the same layers (sparing the input layer, as there are no inputs to drop). All of this is promising, and perhaps minimizing variance through regularization could simultaneously boost our currently underwhelming quadratic weighted kappa score of 0.67.

Hopefully, increasing the general complexity of the model by adding more filters, nodes, and layers can increase its predictive power and accuracy, while the multi-faceted regularization approach mitigates high variance and overfitting.
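A sketch of a network of this shape follows. The filter progression (32 to 128), pooling placement, and hidden-layer sizes (512, 128) come from the description above; the specific dropout rates are assumptions, not James Le's exact values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)

# Four conv layers with filters growing 32 -> 128, pooling after the
# second and fourth, and batch normalization plus dropout throughout.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    BatchNormalization(),
    Conv2D(64, (3, 3), activation="relu"),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),  # rate assumed for this sketch
    Conv2D(128, (3, 3), activation="relu"),
    BatchNormalization(),
    Conv2D(128, (3, 3), activation="relu"),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(512, activation="relu"),
    BatchNormalization(),
    Dropout(0.5),
    Dense(128, activation="relu"),
    BatchNormalization(),
    Dropout(0.5),
    Dense(5, activation="softmax"),
])

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```

The compile step, early stopping, and fit call are unchanged from the first model.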

Our QWK score for this model was 0.5944, with training accuracy around 0.72 and testing accuracy at about 0.60.

Alas, our second model failed us by all measurements: overall accuracy is down, variance is still high, and discernible overfitting is still present. Worst of all, the score for our prime evaluation metric, the quadratic weighted kappa, has dropped even lower, providing little confidence in its overall efficacy.

All Models

Luckily, I had team members who have much more powerful computers than I do, and they provided much better models. Let’s take a look!

Production Model Evaluation

Nick’s EfficientNet model scored the highest quadratic weighted kappa (0.895), so we chose it as our production model; this type of neural network was ubiquitous among the highest-scoring entries on this Kaggle competition’s leaderboard. EfficientNet’s compound scaling method scales all dimensions of the network (width, depth, and input resolution) uniformly, distinguishing it from the arbitrary scaling of typical CNNs. Furthermore, by increasing the image size beyond the initial 128 × 128 resize, we were able to achieve an even better QWK.
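Nick's exact variant and classification head aren't shown in this post; a sketch of how such a model can be assembled in Keras might look like the following. EfficientNetB0, the 224 × 224 input size, and the pooling head are assumptions, and `weights=None` is used here only to skip the usual pretrained-weight download (in practice `weights="imagenet"` is the common starting point):

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Input

# EfficientNet backbone plus a small five-class classification head
inputs = Input(shape=(224, 224, 3))
backbone = EfficientNetB0(include_top=False, weights=None,
                          input_tensor=inputs)
x = GlobalAveragePooling2D()(backbone.output)
outputs = Dense(5, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```

The larger input resolution is part of the point: EfficientNet's compound scaling ties resolution to width and depth rather than treating it as an afterthought.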

Accuracy / Overfit

In terms of plain accuracy, our production model (Nick’s EfficientNet) did not do very well, with a 0.5841 training accuracy score and a 0.5328 testing accuracy score. Scores like these are concerning and raise questions about the overall efficacy of our model, considering our testing accuracy only narrowly beats our baseline accuracy of 0.4949. Additionally, aside from poor predictive power, the discrepancy between training and testing scores suggests that our model is overfit and suffers from high variance.

QWK

Equation to calculate the quadratic weighted Kappa

The quadratic weighted kappa is a metric used to measure the agreement between two “raters”. It is calculated between the expected (known) scores and the predicted scores, and it gives extra penalty to predicted outcomes that fall further from the actual outcomes.
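For reference, with N ordinal classes the quadratic weighted kappa is commonly written as follows, where O is the observed agreement (confusion) matrix, E the agreement expected by chance, and w the quadratic weight matrix:

```latex
\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}},
\qquad
w_{i,j} = \frac{(i-j)^2}{(N-1)^2}
```

Because the weight grows with the square of the distance between classes, predicting class 4 for a true class 0 costs far more than predicting class 1.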

A confusion matrix displays the relationship between the true classifications and the predicted classification outcomes.

In our case, the EfficientNet model predicts class 0 (No DR) well, at 96%, but struggles with classes 1, 3, and 4, which it tends to misclassify as class 2. Our model only ever misclassifies outcomes within a two-class range, which is better than if its range were larger, but this proclivity toward class 2 still hurts our quadratic weighted kappa score.
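The matrix itself comes straight from scikit-learn. A toy example with the same five labels (the counts below are made up to illustrate the class-2 pull, not our model's actual results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 2, 2, 3, 4])
y_pred = np.array([0, 0, 0, 2, 2, 2, 2, 2])  # classes 1, 3, 4 pulled toward 2

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
```

Reading down column 2 immediately shows which true classes are being conflated with the moderate-DR class.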

Conclusion

In the end, we created a model that is not perfect, but may be useful — especially depending on which metric it is being evaluated by. When working in a group, we learned clear communication is essential, especially when confirming that all team members are working from the same dataset.

We learned that hyperparameter tuning a CNN is not necessarily as useful as modifying the images themselves, such as augmenting them rotationally or resizing them as large as possible.
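A minimal sketch of that kind of augmentation with Keras follows; the rotation range and flip settings are assumed values, not the ones we tuned:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations and flips; a retina image has no canonical "up",
# so these transforms add variety without changing the label.
augmenter = ImageDataGenerator(rotation_range=30, horizontal_flip=True,
                               vertical_flip=True)

X = np.random.rand(8, 128, 128, 3)  # toy stand-in for retina images
y = np.zeros(8)
batch_X, batch_y = next(augmenter.flow(X, y, batch_size=8))
```

Feeding `augmenter.flow(...)` directly to `model.fit` applies a fresh random transform every epoch, so the network rarely sees the exact same image twice.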

Looking forward, we posit that we could have achieved a better quadratic weighted kappa score with more processing power. With greater computational capacity, we could perform large-scale augmentation and/or feed our images into our CNNs without scaling them down, since downsizing can harm models by diminishing nuance, ultimately blunting the graphic information. Finally, there is a small body of research on the web on applying KMeans clustering to image classification, and we would be curious to see how well it would perform on this dataset.

Trevor O'Donnell

Grew up in San Francisco; still here. Used to make music and art. Studied business. Managing operations before COVID. Moving on to data! Geo. and demo. nerd.