Dog Breed Classification — Udacity Capstone project using Convolutional Neural Network

Apr 24, 2021

Convolutional Neural Network

A convolutional neural network (CNN) is a type of artificial neural network commonly used for image classification and designed to process pixel data. A CNN is composed of multiple layers of artificial neurons: it is trained on a large set of labeled images, passes each image through its hidden layers, and outputs a predicted class. This article walks through an example of applying a CNN to identify dog breeds from input images.

Introduction

The Akita may be my favorite breed after seeing it in movies; Akitas seem nice, loyal, and friendly. They are also easy to identify thanks to their thick coat, pointed ears, and body structure, but I sometimes fail to tell an Akita from a Siberian Husky because the two breeds share similar features.

How good are you at identifying a dog breed?

I think it is harder than it looks, even for humans, and I expect it to be a challenging task for a CNN to classify breeds accurately given the minute differences between their features.

Dataset

The dataset provided by Udacity for this task contains 8,351 samples representing 133 dog breeds. The training, validation, and test sets are distributed as shown below:

Overview

The strategy laid out for solving this problem, given in the notebook provided by Udacity, is as follows:

Step 0: Import Datasets
Step 1: Detect Humans
Step 2: Detect Dogs
Step 3: Create a CNN to Classify Dog Breeds (from Scratch)
Step 4: Use a CNN to Classify Dog Breeds (using Transfer Learning)
Step 5: Create a CNN to Classify Dog Breeds (using Transfer Learning)
Step 6: Write your Algorithm
Step 7: Test Your Algorithm

This project uses OpenCV's implementation of Haar feature-based cascade classifiers to detect human faces in images (Step 1). OpenCV provides many pre-trained face detectors, stored as XML files on GitHub. We downloaded one of these detectors and stored it in the haarcascades directory.

In Step 2, the project uses a ResNet-50 model pre-trained on the ImageNet dataset to detect dogs in images.
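One way to turn an ImageNet classifier into a dog detector is to check whether its top prediction falls among the dog classes. The sketch below assumes the 1000-class ImageNet ordering used by Keras, in which dog breeds occupy indices 151 through 268; that range comes from the Udacity notebook rather than from this article, and the function name is hypothetical.

```python
import numpy as np

# Assumption: in the 1000-class ImageNet label ordering used by Keras,
# dog breeds occupy the contiguous index range 151..268 (inclusive).
DOG_CLASS_MIN, DOG_CLASS_MAX = 151, 268


def is_dog_prediction(class_probabilities):
    """Return True if the most likely ImageNet class is a dog breed.

    class_probabilities holds the 1000 class scores for a single image,
    for example the output of a ResNet-50 prediction.
    """
    probs = np.asarray(class_probabilities).reshape(-1)
    if probs.size != 1000:
        raise ValueError("expected 1000 ImageNet class scores for one image")
    top_class = int(np.argmax(probs))
    return DOG_CLASS_MIN <= top_class <= DOG_CLASS_MAX
```

In the project, the input would be the output of the ResNet-50 model's prediction on a preprocessed image tensor.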

Data Pre-process

With TensorFlow as the backend (the Keras default), a CNN expects a 4D array, called a tensor, as input. The four dimensions are (nb_samples, rows, columns, channels). A preliminary step is therefore to convert each image into such a tensor.

The path_to_tensor function takes the image file at file_path and resizes it to 224 x 224 pixels. The resulting 4D tensor has shape (1, 224, 224, 3): 1 sample, 224 x 224 pixels, and 3 channels (RGB).

The pixel values of the resized images are then divided by 255 so that they lie between 0 and 1.
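The preprocessing described above can be sketched as follows. This is an illustration rather than the notebook's code: it assumes Pillow and NumPy in place of the Keras image utilities, and `rescale` is a hypothetical helper name.

```python
import numpy as np
from PIL import Image


def path_to_tensor(file_path, size=(224, 224)):
    """Load an image and return a float32 tensor of shape (1, 224, 224, 3)."""
    img = Image.open(file_path).convert("RGB")   # force 3 colour channels
    img = img.resize(size)                       # resize to 224 x 224 pixels
    arr = np.asarray(img, dtype=np.float32)      # shape (224, 224, 3)
    return np.expand_dims(arr, axis=0)           # add the sample dimension


def rescale(tensor):
    """Map pixel values from the range [0, 255] to [0, 1]."""
    return tensor / 255.0
```

Tensors for several images can be stacked with `np.vstack` to form a batch of shape (nb_samples, 224, 224, 3).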

CNN Architecture

The CNN is built with the Keras library, which is quite straightforward to use. The network in this project has five convolutional layers with 16, 32, 64, 96, and 128 filters. The first Conv2D layer takes an input shape of 224 x 224 x 3. The output layer has 133 nodes, one for each possible dog breed.

Dropout layers are added to reduce overfitting. This architecture achieved 12.6% accuracy after 30 epochs with batch_size = 20.
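A network with this shape might look like the sketch below. Only the filter counts, the input shape, the 133 outputs, and the use of dropout come from the article; the kernel size, pooling, dropout rate, optimizer, and the function name `build_scratch_cnn` are assumptions made for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_scratch_cnn(num_classes=133, input_shape=(224, 224, 3)):
    """Five Conv2D blocks (16 to 128 filters) followed by a softmax classifier."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for filters in (16, 32, 64, 96, 128):
        # Assumed settings: 3x3 kernels, ReLU, and 2x2 max pooling per block.
        model.add(layers.Conv2D(filters, kernel_size=3, padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
    model.add(layers.Dropout(0.3))                 # assumed dropout rate
    model.add(layers.GlobalAveragePooling2D())     # one value per feature map
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="rmsprop",             # assumed optimizer
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_scratch_cnn()
model.summary()
```

Training would then be a call to `model.fit` on the image tensors and one-hot breed labels with `epochs=30` and `batch_size=20`, the settings reported above.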

Transfer Learning

Transfer learning starts from a pretrained network: a saved model that was trained on a large image-classification dataset such as ImageNet. Because such datasets are general enough, the saved weights can be reused to improve accuracy on a new image classification problem.

I used ResNet-50 bottleneck features for transfer learning. Udacity provided these precomputed features for the ResNet-50 model that is available in Keras.

Bottleneck features come from taking a pre-trained model (in our case ResNet-50), chopping off the top classification layer, and then using this “chopped” ResNet-50 as the first stage of our model.

Each image in the dataset is now represented by its bottleneck features. The new model takes inputs with the shape of train_resnet50 and is trained further on the dataset. The final layer has 133 nodes to match the number of output classes.

The model built on ResNet-50 bottleneck features achieved a test accuracy of 79.6% after 20 epochs with batch_size = 20.

The next step is to write a function that takes an image path as input and returns the dog breed predicted by the model.
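A small classifier on top of bottleneck features could be sketched like this. The random array is only a placeholder for the features Udacity provides, the (1, 1, 2048) feature shape is an assumption, and the pooling layer and optimizer are illustrative choices rather than the article's confirmed settings.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder standing in for the precomputed bottleneck features.
# Assumption: each image maps to a (1, 1, 2048) ResNet-50 feature map.
train_resnet50 = np.random.rand(8, 1, 1, 2048).astype("float32")

resnet50_model = keras.Sequential([
    keras.Input(shape=train_resnet50.shape[1:]),
    layers.GlobalAveragePooling2D(),              # collapse to a 2048-vector
    layers.Dense(133, activation="softmax"),      # one node per dog breed
])
resnet50_model.compile(optimizer="rmsprop",       # assumed optimizer
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])

# Each row holds 133 class probabilities for one image.
probabilities = resnet50_model.predict(train_resnet50, verbose=0)
```

Because the heavy convolutional work is already captured in the features, only the final dense layer has to be trained, which is why this approach trains quickly.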

Implementing the Human/Dog Detector

This section demonstrates Steps 6 and 7 of the project strategy. First, we write an algorithm that accepts a file path to an image and determines whether the image contains a human, a dog, or neither.

Then,

if a dog is detected in the image, return the predicted breed.
if a human is detected in the image, return the resembling dog breed.
if neither is detected in the image, provide output that indicates an error.
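The three rules above can be sketched as a single function. The detectors and the breed predictor are passed in as callables; the names `classify_image`, `is_dog`, `is_human`, and `predict_breed` are hypothetical stand-ins, not the notebook's own identifiers.

```python
def classify_image(img_path, is_dog, is_human, predict_breed):
    """Return a message describing what was found in the image.

    is_dog, is_human and predict_breed are callables that take an image
    path; they are hypothetical stand-ins for the detectors and the model.
    """
    if is_dog(img_path):
        return f"Dog detected. Predicted breed: {predict_breed(img_path)}"
    if is_human(img_path):
        return f"Human detected. Resembling dog breed: {predict_breed(img_path)}"
    return "Error: neither a dog nor a human was detected."
```

Passing the detectors as arguments keeps the decision logic easy to test with simple stubs before the real models are plugged in.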

To test the algorithm, we used 7 sample images: 2 of humans, 4 of dogs, and 1 of neither. A few of them are shown below:

Reflection

The final prediction model achieved an accuracy of about 79% in classifying dog breeds. The architecture used in this project is simple and there is still room to improve. My suggestions are:

1.) Improve the face_detector algorithm. I would build a new neural network using transfer learning.

2.) Fine-tune the hyperparameters.

3.) Improve the program so that it captures images from a video and generates predictions. This can also be extended to capture images from a webcam and predict on the fly.
