How To Build a Neural Network to Translate Sign Language into English

The author selected Code Org to receive a donation as part of the Write for DOnations program.

Introduction

Computer vision is a subfield of computer science that aims to extract a higher-order understanding from images and videos. This powers technologies such as fun video chat filters, your mobile device’s face authenticator, and self-driving cars.

In this tutorial, you’ll use computer vision to build an American Sign Language translator for your webcam. As you work through the tutorial, you’ll use OpenCV, a computer-vision library, PyTorch to build a deep neural network, and onnx to export your neural network. You’ll also apply the following concepts as you build a computer-vision application:

  • You’ll use the same three-step method as used in the How To Apply Computer Vision to Build an Emotion-Based Dog Filter tutorial: preprocess a dataset, train a model, and evaluate the model.
  • You’ll also expand each of these steps: employ data augmentation to address rotated or non-centered hands, change learning rate schedules to improve model accuracy, and export models for faster inference speed.
  • Along the way, you’ll also explore related concepts in machine learning.

By the end of this tutorial, you’ll have both an American Sign Language translator and foundational deep learning know-how. You can also access the complete source code for this project.

Prerequisites

To complete this tutorial, you will need the following:

  • A local development environment for Python 3 with at least 1GB of RAM. You can follow How to Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
  • A working webcam to do real-time image detection.
  • (Recommended) Work through How To Apply Computer Vision to Build an Emotion-Based Dog Filter; that tutorial is not explicitly used here, but the same ideas are reinforced and built upon.

Step 1 — Creating the Project and Installing Dependencies

Let’s create a workspace for this project and install the dependencies we’ll need.

On Linux distributions, start by updating your system package manager, then install the python3-venv package:

  1. apt-get update
  2. apt-get upgrade
  3. apt-get install python3-venv

We’ll call our workspace SignLanguage:

  1. mkdir ~/SignLanguage

Navigate to the SignLanguage directory:

  1. cd ~/SignLanguage

Then create a new virtual environment for the project:

  1. python3 -m venv signlanguage

Activate your environment:

  1. source signlanguage/bin/activate

Then install PyTorch, a deep-learning framework for Python that we’ll use in this tutorial.

On macOS, install PyTorch with the following command:

  1. python -m pip install torch==1.2.0 torchvision==0.4.0

On Linux and Windows, use the following command for a CPU-only build, which installs both torch and torchvision:

  1. pip install torch==1.2.0+cpu torchvision==0.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Now install prepackaged binaries for OpenCV, numpy, onnx, and onnxruntime, which are libraries for computer vision, linear algebra, AI model exporting, and AI model execution, respectively. OpenCV offers utilities such as image rotations, and numpy offers linear algebra utilities such as matrix inversion:

  1. python -m pip install opencv-python==3.4.3.18 numpy==1.14.5 onnx==1.6.0 onnxruntime==1.0.0

On Linux distributions, you will need to install libSM.so:

  1. apt-get install libsm6 libxext6 libxrender-dev

With the dependencies installed, let’s build the first version of our sign language translator: a sign language classifier.

Step 2 — Preparing the Sign Language Classification Dataset

In these next three sections, you’ll build a sign language classifier using a neural network. Your goal is to produce a model that accepts a picture of a hand as input and outputs a letter.

The following three steps are required to build a machine learning classification model:

  1. Preprocess the data: Apply one-hot encoding to your labels and wrap your data in PyTorch Tensors. Train your model on augmented data to prepare it for “unusual” input, like an off-center or rotated hand.
  2. Specify and train the model: Set up a neural network using PyTorch. Define training hyper-parameters, such as how long to train for, and run stochastic gradient descent. You’ll also vary one particular hyper-parameter, the learning rate schedule, to boost model accuracy.
  3. Run a prediction using the model: Evaluate the neural network on your validation data to understand its accuracy. Then, export the model to a format called ONNX for faster inference speeds.

In this section of the tutorial, you will accomplish step 1 of 3. You will download the data, create a Dataset object to iterate over your data, and finally apply data augmentation. At the end of this step, you will have a programmatic way of accessing images and labels in your dataset to feed to your model.

First, download the dataset to your current working directory:

Note: On macOS, wget is not available by default. To install it, install Homebrew by following this DigitalOcean tutorial, then run brew install wget.

  1. wget https://assets.digitalocean.com/articles/signlanguage_data/sign-language-mnist.tar.gz

Extract the archive, which contains a data/ directory:

  1. tar -xzf sign-language-mnist.tar.gz

Create a new file, named step_2_dataset.py:

  1. nano step_2_dataset.py

As before, import the necessary utilities and create the class that will hold your data. For data processing here, you will create the train and test datasets. You’ll implement PyTorch’s Dataset interface, allowing you to load and use PyTorch’s built-in data pipeline for your sign language classification dataset:

step_2_dataset.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import numpy as np
import torch

import csv


class SignLanguageMNIST(Dataset):
    """Sign Language classification dataset.

    Utility for loading Sign Language dataset into PyTorch. Dataset posted on
    Kaggle in 2017, by an unnamed author with username `tecperson`:
    https://www.kaggle.com/datamunge/sign-language-mnist

    Each sample is 1 x 1 x 28 x 28, and each label is a scalar.
    """
    pass

Delete the pass placeholder in the SignLanguageMNIST class. In its place, add a method to generate a label mapping:

step_2_dataset.py

    @staticmethod
    def get_label_mapping():
        """
        We map all labels to [0, 23]. This mapping from dataset labels [0, 23]
        to letter indices [0, 25] is returned below.
        """
        mapping = list(range(25))
        mapping.pop(9)
        return mapping

In the original dataset, labels are letter indices ranging from 0 to 25. However, the letters J (9) and Z (25) are excluded, because signing them requires motion. This means there are only 24 valid label values. So that the set of all label values starting from 0 is contiguous, we map all labels to [0, 23]. This mapping between the contiguous labels [0, 23] and the original letter indices is provided by the get_label_mapping method.
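For instance, the dataset label 10 (the letter K) becomes the contiguous class 9. Here is a minimal sketch you can run in a Python shell (not part of the tutorial files) to see the mapping in action:

mapping = list(range(25))  # letter indices 0..24 (Z, index 25, never appears)
mapping.pop(9)             # drop J, which requires motion

print(len(mapping))        # 24 contiguous classes
print(mapping.index(10))   # dataset label 10 (K) -> contiguous class 9
print(mapping[9])          # contiguous class 9 -> letter index 10 (K)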

Next, add a method to extract labels and samples from a CSV file. The following assumes that each line starts with the label and is then followed by 784 pixel values. These 784 pixel values represent a 28x28 image:

step_2_dataset.py

    @staticmethod
    def read_label_samples_from_csv(path: str):
        """
        Assumes first column in CSV is the label and subsequent 28^2 values
        are image pixel values 0-255.
        """
        mapping = SignLanguageMNIST.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip header
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples

For an explanation of how these 784 values represent an image, see Build an Emotion-Based Dog Filter, Step 4.

Note that each line in the csv.reader iterable is a list of strings; the int and map(int, ...) invocations cast all strings to integers. Directly beneath our static method, add a function that will initialize our data holder:

step_2_dataset.py

    def __init__(self,
            path: str="data/sign_mnist_train.csv",
            mean: List[float]=[0.485],
            std: List[float]=[0.229]):
        """
        Args:
            path: Path to `.csv` file containing `label`, `pixel0`, `pixel1`...
        """
        labels, samples = SignLanguageMNIST.read_label_samples_from_csv(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std

This function starts by loading the samples and labels. Then it wraps the data in NumPy arrays. The mean and standard deviation information will be explained shortly, in the discussion of __getitem__ below.

Directly after the __init__ function, add a __len__ function. The Dataset requires this method to determine when to stop iterating over data:

step_2_dataset.py

...
    def __len__(self):
        return len(self._labels)

Finally, add a __getitem__ method, which returns a dictionary containing the sample and the label:

step_2_dataset.py

    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.2)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])
        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }

You use a technique called data augmentation, where samples are perturbed during training, to increase the model’s robustness to these perturbations. In particular, you randomly zoom in on the image by varying amounts and at different locations, via RandomResizedCrop. Note that zooming in should not affect the final sign language class; thus, the label is not transformed. You additionally normalize the inputs: ToTensor first rescales pixel values from [0, 255] to [0, 1], and Normalize then standardizes them using the dataset _mean and _std.
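To make the normalization concrete, here is the arithmetic that ToTensor and Normalize perform on a single pixel (a worked sketch, not part of the tutorial files):

p = 200 / 255.            # ToTensor: rescale a raw pixel of 200 to [0, 1], ~0.784
z = (p - 0.485) / 0.229   # Normalize: subtract the mean, divide by the std, ~1.307
print(p, z)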

Your completed SignLanguageMNIST class will look like the following:

step_2_dataset.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torchvision.transforms as transforms
import torch.nn as nn
import numpy as np
import torch

from typing import List

import csv


class SignLanguageMNIST(Dataset):
    """Sign Language classification dataset.

    Utility for loading Sign Language dataset into PyTorch. Dataset posted on
    Kaggle in 2017, by an unnamed author with username `tecperson`:
    https://www.kaggle.com/datamunge/sign-language-mnist

    Each sample is 1 x 1 x 28 x 28, and each label is a scalar.
    """

    @staticmethod
    def get_label_mapping():
        """
        We map all labels to [0, 23]. This mapping from dataset labels [0, 23]
        to letter indices [0, 25] is returned below.
        """
        mapping = list(range(25))
        mapping.pop(9)
        return mapping

    @staticmethod
    def read_label_samples_from_csv(path: str):
        """
        Assumes first column in CSV is the label and subsequent 28^2 values
        are image pixel values 0-255.
        """
        mapping = SignLanguageMNIST.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip header
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples

    def __init__(self,
            path: str="data/sign_mnist_train.csv",
            mean: List[float]=[0.485],
            std: List[float]=[0.229]):
        """
        Args:
            path: Path to `.csv` file containing `label`, `pixel0`, `pixel1`...
        """
        labels, samples = SignLanguageMNIST.read_label_samples_from_csv(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.2)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])
        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }

As before, you will now verify our dataset utility functions by loading the SignLanguageMNIST dataset. Add the following code to the end of your file after the SignLanguageMNIST class:

step_2_dataset.py

def get_train_test_loaders(batch_size=32):
    trainset = SignLanguageMNIST('data/sign_mnist_train.csv')
    trainloader = torch.utils.data.DataLoader(trainset,
        batch_size=batch_size,
        shuffle=True)

    testset = SignLanguageMNIST('data/sign_mnist_test.csv')
    testloader = torch.utils.data.DataLoader(testset,
        batch_size=batch_size,
        shuffle=False)
    return trainloader, testloader

This code initializes the dataset using the SignLanguageMNIST class. Then for the train and validation sets, it wraps the dataset in a DataLoader. This will translate the dataset into an iterable to use later.
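To picture what these loaders yield, each iteration produces a dictionary of batched tensors. Here is a minimal sketch (run temporarily at the bottom of step_2_dataset.py, not part of the tutorial files):

trainloader, _ = get_train_test_loaders(batch_size=32)
batch = next(iter(trainloader))
print(batch['image'].shape)  # torch.Size([32, 1, 28, 28]): 32 grayscale images
print(batch['label'].shape)  # torch.Size([32, 1]): one class label per image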

Now you’ll verify that the dataset utilities are functioning. Create a sample dataset loader using DataLoader and print the first element of that loader. Add the following to the end of your file:

step_2_dataset.py

if __name__ == '__main__':
    loader, _ = get_train_test_loaders(2)
    print(next(iter(loader)))

You can check that your file matches the step_2_dataset.py file in this repository. Exit your editor and run the script with the following:

  1. python step_2_dataset.py

This outputs the following pair of tensors. The data pipeline outputs two samples and two labels, indicating that it is up and ready to go:

Output

{'image': tensor([[[[ 0.4337,  0.5022,  0.5707,  ...,  0.9988,  0.9646,  0.9646],
          [ 0.4851,  0.5536,  0.6049,  ...,  1.0502,  1.0159,  0.9988],
          [ 0.5364,  0.6049,  0.6392,  ...,  1.0844,  1.0844,  1.0673],
          ...,
          [-0.5253, -0.4739, -0.4054,  ...,  0.9474,  1.2557,  1.2385],
          [-0.3369, -0.3369, -0.3369,  ...,  0.0569,  1.3584,  1.3242],
          [-0.3712, -0.3369, -0.3198,  ...,  0.5364,  0.5364,  1.4783]]],

        [[[ 0.2111,  0.2796,  0.3481,  ...,  0.2453, -0.1314, -0.2342],
          [ 0.2624,  0.3309,  0.3652,  ..., -0.3883, -0.0629, -0.4568],
          [ 0.3309,  0.3823,  0.4337,  ..., -0.4054, -0.0458, -1.0048],
          ...,
          [ 1.3242,  1.3584,  1.3927,  ..., -0.4054, -0.4568,  0.0227],
          [ 1.3242,  1.3927,  1.4612,  ..., -0.1657, -0.6281, -0.0287],
          [ 1.3242,  1.3927,  1.4440,  ..., -0.4397, -0.6452, -0.2856]]]]),
 'label': tensor([[24.],
        [11.]])}

You’ve now verified that your data pipeline works. This concludes the first step—preprocessing your data—which now includes data augmentation for increased model robustness. Next you will define the neural network and optimizer.

Step 3 — Building and Training the Sign Language Classifier Using Deep Learning

With a functioning data pipeline, you will now define a model and train it on the data. In particular, you will build a neural network with six layers, define a loss, an optimizer, and finally, optimize the loss function for your neural network predictions. At the end of this step, you will have a working sign language classifier.

Create a new file called step_3_train.py:

  1. nano step_3_train.py

Import the necessary utilities:

step_3_train.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch

from step_2_dataset import get_train_test_loaders

Define a PyTorch neural network that includes three convolutional layers, followed by three fully connected layers. Add this to the end of your existing script:

step_3_train.py

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
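If you are wondering where the 16 * 5 * 5 in fc1 comes from: each 3x3 convolution (with no padding) shrinks each spatial side by 2, and each 2x2 max-pool halves it. Here is a quick sketch of the arithmetic, runnable temporarily at the bottom of step_3_train.py (not part of the tutorial files):

# input:          1 x 28 x 28
# conv1 (3x3):    6 x 26 x 26
# conv2 (3x3):    6 x 24 x 24
# pool  (2x2):    6 x 12 x 12
# conv3 (3x3):   16 x 10 x 10
# pool  (2x2):   16 x  5 x  5  -> flattened: 16 * 5 * 5 = 400
net = Net()
x = torch.randn(1, 1, 28, 28)  # one dummy grayscale image
print(net(x).shape)            # torch.Size([1, 24]): one score per letter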

Now initialize the neural network, define a loss function, and define optimization hyperparameters by adding the following code to the end of the script:

step_3_train.py

def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

Finally, you’ll train for two epochs:

step_3_train.py

def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(2):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
    torch.save(net.state_dict(), "checkpoint.pth")

You define an epoch to be an iteration of training where every training sample has been used exactly once. At the end of the main function, the model parameters will be saved to a file called "checkpoint.pth".
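To make “epoch” concrete: the training split holds roughly 27,000 samples, so with the default batch size of 32, one epoch corresponds to roughly 860 gradient updates. A quick sketch to check this yourself (not part of the tutorial files):

from step_2_dataset import get_train_test_loaders

trainloader, _ = get_train_test_loaders(batch_size=32)
print(len(trainloader.dataset))  # number of training samples
print(len(trainloader))          # number of batches, i.e. updates per epoch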

Add the following code to the end of your script to extract image and label from the dataset loader and then wrap each in a PyTorch Variable:

step_3_train.py

def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))

This code will also run the forward pass and then backpropagate through the loss and neural network.

At the end of your file, add the following to invoke the main function:

step_3_train.py

if __name__ == '__main__':
    main()

Double-check that your file matches the following:

step_3_train.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch

from step_2_dataset import get_train_test_loaders


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(2):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
    torch.save(net.state_dict(), "checkpoint.pth")


def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))


if __name__ == '__main__':
    main()

Save and exit. Then, launch our proof-of-concept training by running:

  1. python step_3_train.py

You’ll see output akin to the following as the neural network trains:

Output

[0,     0] loss: 3.208171
[0,   100] loss: 3.211070
[0,   200] loss: 3.192235
[0,   300] loss: 2.943867
[0,   400] loss: 2.569440
[0,   500] loss: 2.243283
[0,   600] loss: 1.986425
[0,   700] loss: 1.768090
[0,   800] loss: 1.587308
[1,     0] loss: 0.254097
[1,   100] loss: 0.208116
[1,   200] loss: 0.196270
[1,   300] loss: 0.183676
[1,   400] loss: 0.169824
[1,   500] loss: 0.157704
[1,   600] loss: 0.151408
[1,   700] loss: 0.136470
[1,   800] loss: 0.123326

To obtain lower loss, you could increase the number of epochs to 5, 10, or even 20. However, after a certain period of training time, the network loss will cease to decrease with increased training time. To sidestep this issue, as training time increases, you will introduce a learning rate schedule, which decreases learning rate over time. To understand why this works, see Distill’s visualization at “Why Momentum Really Works”.
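To see what such a schedule does in isolation, you can step a StepLR scheduler and print the learning rate at each epoch; with step_size=10 and gamma=0.1, the rate drops by a factor of 10 after epoch 10. A minimal sketch (not part of the tutorial files):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)  # any dummy model works for this illustration
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(12):
    print(epoch, optimizer.param_groups[0]['lr'])  # 0.01 for epochs 0-9, then 0.001
    scheduler.step()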

Amend your main function with the following two lines, defining a scheduler and invoking scheduler.step. Furthermore, change the number of epochs to 12:

step_3_train.py

def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    trainloader, _ = get_train_test_loaders()
    for epoch in range(12):  # loop over the dataset multiple times
        train(net, criterion, optimizer, trainloader, epoch)
        scheduler.step()
    torch.save(net.state_dict(), "checkpoint.pth")

Check that your file matches the step 3 file in this repository. Training will run for around 5 minutes. Your output will resemble the following:

Output

[0,     0] loss: 3.208171
[0,   100] loss: 3.211070
[0,   200] loss: 3.192235
[0,   300] loss: 2.943867
[0,   400] loss: 2.569440
[0,   500] loss: 2.243283
[0,   600] loss: 1.986425
[0,   700] loss: 1.768090
[0,   800] loss: 1.587308
...
[11,     0] loss: 0.000302
[11,   100] loss: 0.007548
[11,   200] loss: 0.009005
[11,   300] loss: 0.008193
[11,   400] loss: 0.007694
[11,   500] loss: 0.008509
[11,   600] loss: 0.008039
[11,   700] loss: 0.007524
[11,   800] loss: 0.007608

The final loss obtained is 0.007608, nearly three orders of magnitude smaller than the starting loss of 3.208171. This concludes the second step of our workflow, where we set up and train the neural network. With that said, as small as this loss value is, it has little meaning on its own. To put the model’s performance in perspective, we will compute its accuracy: the percentage of images the model correctly classified.

Step 4 — Evaluating the Sign Language Classifier

You will now evaluate your sign language classifier by computing its accuracy on the validation set, a set of images the model did not see during training. This will provide a better sense of model performance than the final loss value did. Furthermore, you will add utilities to save our trained model at the end of training and load our pre-trained model when performing inference.

Create a new file, called step_4_evaluate.py.

  1. nano step_4_evaluate.py

Import the necessary utilities:

step_4_evaluate.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import numpy as np

import onnx
import onnxruntime as ort

from step_2_dataset import get_train_test_loaders
from step_3_train import Net

Next, define a utility to evaluate the neural network’s performance. The following function compares the neural network’s predicted letter to the true letter, for a single image:

step_4_evaluate.py

def evaluate(outputs: Variable, labels: Variable) -> float:
    """Evaluate neural network outputs against non-one-hotted labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))

outputs is a list of class probabilities for each sample. For example, outputs for a single sample may be [0.1, 0.3, 0.4, 0.2]. labels is a list of label classes. For example, the label class may be 3.

Y = ... converts the labels into a NumPy array. Next, Yhat = np.argmax(...) converts the outputs class probabilities into predicted classes. For example, the list of class probabilities [0.1, 0.3, 0.4, 0.2] would yield the predicted class 2, because the index 2 value of 0.4 is the largest value.

Since both Y and Yhat are now classes, you can compare them. Yhat == Y checks if the predicted class matches the label class, and np.sum(...) is a trick that computes the number of truth-y values. In other words, np.sum will output the number of samples that were classified correctly.
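Here is the same computation on a tiny hand-made batch (a worked sketch, not part of the tutorial files):

import numpy as np

outputs = np.array([[0.1, 0.3, 0.4, 0.2],   # argmax: class 2
                    [0.7, 0.1, 0.1, 0.1]])  # argmax: class 0
labels = np.array([2, 3])

Yhat = np.argmax(outputs, axis=1)  # array([2, 0])
print(np.sum(Yhat == labels))      # 1: one of the two samples is correct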

Add the second function batch_evaluate, which applies the first function evaluate to all images:

step_4_evaluate.py

def batch_evaluate(
        net: Net,
        dataloader: torch.utils.data.DataLoader) -> float:
    """Evaluate neural network in batches, if dataset is too large."""
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n

batch is a group of images stored as a single tensor. First, you increment the total number of images you’re evaluating (n) by the number of images in this batch. Next, you run inference on the neural network with this batch of images, outputs = net(...). The type check if isinstance(...) converts the outputs to a NumPy array if needed. Finally, you use evaluate to compute the number of correctly-classified samples. At the end of the function, you compute the fraction of samples you correctly classified, score / n.

Finally, add the following script to leverage the preceding utilities:

step_4_evaluate.py

def validate():
    trainloader, testloader = get_train_test_loaders()
    net = Net().float().eval()

    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()

This loads a pretrained neural network and evaluates its performance on the provided sign language dataset. Specifically, the script here outputs accuracy on the images you used for training and a separate set of images you put aside for testing purposes, called the validation set.

You will next export the PyTorch model to an ONNX binary. This binary file can then be used in production to run inference with your model. Most importantly, the code running this binary does not need a copy of the original network definition. At the end of the validate function, add the following:

step_4_evaluate.py

    trainloader, testloader = get_train_test_loaders(1)

    # export to onnx
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check exported model
    model = onnx.load(fname)
    onnx.checker.check_model(model)  # check model is well-formed

    # create runnable session with exported model
    ort_session = ort.InferenceSession(fname)
    net = lambda inp: ort_session.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)

This exports the ONNX model, checks the exported model, and then runs inference with the exported model. Double-check that your file matches the step 4 file in this repository:

step_4_evaluate.py

from torch.utils.data import Dataset
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import numpy as np

import onnx
import onnxruntime as ort

from step_2_dataset import get_train_test_loaders
from step_3_train import Net


def evaluate(outputs: Variable, labels: Variable) -> float:
    """Evaluate neural network outputs against non-one-hotted labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))


def batch_evaluate(
        net: Net,
        dataloader: torch.utils.data.DataLoader) -> float:
    """Evaluate neural network in batches, if dataset is too large."""
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n


def validate():
    trainloader, testloader = get_train_test_loaders()
    net = Net().float().eval()

    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)

    trainloader, testloader = get_train_test_loaders(1)

    # export to onnx
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check exported model
    model = onnx.load(fname)
    onnx.checker.check_model(model)  # check model is well-formed

    # create runnable session with exported model
    ort_session = ort.InferenceSession(fname)
    net = lambda inp: ort_session.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = batch_evaluate(net, trainloader) * 100.
    print('Training accuracy: %.1f' % train_acc)
    test_acc = batch_evaluate(net, testloader) * 100.
    print('Validation accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()
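As an optional extra sanity check, you can confirm numerically that the exported ONNX session agrees with the original PyTorch model on a single input. The following is a standalone sketch, not part of the tutorial files; it assumes checkpoint.pth and signlanguage.onnx already exist in your working directory:

import numpy as np
import torch
import onnxruntime as ort

from step_3_train import Net

net = Net().float().eval()
net.load_state_dict(torch.load("checkpoint.pth"))

ort_session = ort.InferenceSession("signlanguage.onnx")

dummy = torch.randn(1, 1, 28, 28)
torch_out = net(dummy).detach().numpy()
onnx_out = ort_session.run(None, {'input': dummy.numpy()})[0]

# the two runtimes should agree to within floating-point tolerance
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("PyTorch and ONNX outputs match")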

To use and evaluate the checkpoint from the last step, run the following:

  1. python step_4_evaluate.py

This will yield output similar to the following, affirming that your exported model not only works, but also agrees with your original PyTorch model:

Output

========== PyTorch ==========
Training accuracy: 99.9
Validation accuracy: 97.4
========== ONNX ==========
Training accuracy: 99.9
Validation accuracy: 97.4

Your neural network attains a train accuracy of 99.9% and a 97.4% validation accuracy. This gap between train and validation accuracy indicates your model is overfitting. This means that instead of learning generalizable patterns, your model has memorized the training data. To understand the implications and causes of overfitting, see Understanding Bias-Variance Tradeoffs.

At this point, we have completed a sign language classifier. In essence, our model correctly disambiguates between signs almost all the time. This is a reasonably good model, so we move on to the final stage of our application: using this sign language classifier in a real-time webcam application.

Step 5 — Linking the Camera Feed

Your next objective is to link the computer’s camera to your sign language classifier. You will collect camera input, classify the displayed sign language, and then report the classified sign back to the user.

Now create a Python script for the sign language translator. Create the file step_5_camera.py using nano or your favorite text editor:

  1. nano step_5_camera.py

Add the following code into the file:

step_5_camera.py

"""Test for sign language classification"""import cv2import numpy as npimport onnxruntime as ortdef main(): passif __name__ == '__main__': main()

This code imports OpenCV, which contains your image utilities, and the ONNX runtime, which is all you need to run inference with your model. The rest of the code is typical Python program boilerplate.

Now replace pass in the main function with the following code, which initializes a sign language classifier using the parameters you trained previously. Additionally add a mapping from indices to letters and image statistics:

step_5_camera.py

def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")
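Why are the mean and standard deviation multiplied by 255? During training, ToTensor rescaled pixels to [0, 1] before Normalize was applied, but webcam frames arrive with values in [0, 255]. Scaling the statistics by 255 produces the same normalized value either way, as this quick sketch shows (not part of the tutorial files):

p = 200.  # a raw pixel value in [0, 255]

training_style = (p / 255. - 0.485) / 0.229         # rescale to [0, 1], then normalize
camera_style = (p - 0.485 * 255.) / (0.229 * 255.)  # normalize raw values directly

print(training_style, camera_style)  # identical up to floating point, ~1.307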

You will use elements of this test script from the official OpenCV documentation. Specifically, you will update the body of the main function. Start by initializing a VideoCapture object that is set to capture live feed from your computer’s camera. Place this at the end of the main function:

step_5_camera.py

def main():
    ...
    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")

    cap = cv2.VideoCapture(0)

Then add a while loop, which reads from the camera at every timestep:

step_5_camera.py

def main():
    ...
    cap = cv2.VideoCapture(0)
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()

Write a utility function that takes the center crop for the camera frame. Place this function before main:

step_5_camera.py

def center_crop(frame):
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        frame = frame[start: start + w]
    else:
        frame = frame[:, start: start + h]
    return frame
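For example, a typical 480x640 landscape frame is cropped to its central 480x480 square. A quick sketch with a dummy frame (not part of the tutorial files):

import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy landscape webcam frame
print(center_crop(frame).shape)  # (480, 480, 3): 80 columns trimmed from each side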

Next, take the center crop for the camera frame, convert to grayscale, normalize, and resize to 28x28. Place this inside the while loop within the main function:

step_5_camera.py

def main():
    ...
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()

        # preprocess data
        frame = center_crop(frame)
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        x = cv2.resize(frame, (28, 28))
        x = (x - mean) / std

Still within the while loop, run inference with the ONNX runtime. Convert the outputs to a class index, then to a letter:

step_5_camera.py

        ...
        x = (x - mean) / std

        x = x.reshape(1, 1, 28, 28).astype(np.float32)
        y = ort_session.run(None, {'input': x})[0]

        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index)]

Display the predicted letter inside the frame, and display the frame back to the user:

step_5_camera.py

        ...
        letter = index_to_letter[int(index)]

        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 255, 0), thickness=2)
        cv2.imshow("Sign Language Translator", frame)

At the end of the while loop, add this code to check whether the user hits the q character and, if so, quit the application. The cv2.waitKey(1) call halts the program for 1 millisecond while it checks for a keypress. Add the following:

step_5_camera.py

        ...
        cv2.imshow("Sign Language Translator", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

Finally, release the capture and close all windows. Place this outside of the while loop to end the main function.

step_5_camera.py

...
    while True:
        ...
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

Double-check your file matches the following or this repository:

step_5_camera.py

import cv2
import numpy as np
import onnxruntime as ort


def center_crop(frame):
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        return frame[start: start + w]
    return frame[:, start: start + h]


def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create runnable session with exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")

    cap = cv2.VideoCapture(0)
    while True:
        # Capture frame-by-frame
        ret, frame = cap.read()

        # preprocess data
        frame = center_crop(frame)
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        x = cv2.resize(frame, (28, 28))
        x = (x - mean) / std

        x = x.reshape(1, 1, 28, 28).astype(np.float32)
        y = ort_session.run(None, {'input': x})[0]

        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index)]

        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 255, 0), thickness=2)
        cv2.imshow("Sign Language Translator", frame)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()


if __name__ == '__main__':
    main()

Exit your file and run the script.

  1. python step_5_camera.py

Once the script runs, a window will pop up with your live webcam feed. The predicted sign language letter will be shown in the top left. Hold up your hand and make your favorite sign to see your classifier in action. Here are some sample results showing the letters L and D.

[Sample results: the webcam translator predicting the letter L and the letter D]

While testing, note that the background needs to be fairly clear for this translator to work. This is an unfortunate consequence of the dataset’s cleanliness. Had the dataset included images of hand signs with miscellaneous backgrounds, the network would be robust to noisy backgrounds. However, the dataset features blank backgrounds and nicely centered hands. As a result, this webcam translator works best when your hand is likewise centered and placed against a blank background.

This concludes the sign language translator application.

Conclusion

In this tutorial, you built an American Sign Language translator using computer vision and a machine learning model. In particular, you saw new aspects of training a machine learning model: data augmentation for model robustness, learning rate schedules for lower loss, and exporting AI models to ONNX for production use. This culminated in a real-time computer vision application that translates sign language into letters using a pipeline you built. Combatting the brittleness of the final classifier can be tackled with any or all of the following methods. For further exploration, try the following topics to improve your application:

  • Generalization: This isn’t a sub-topic within computer vision; rather, it’s a constant problem throughout all of machine learning. See Understanding Bias-Variance Tradeoffs.
  • Domain Adaptation: Say your model is trained in domain A (for example, sunny environments). Can you adapt the model to domain B (for example, cloudy environments) quickly?
  • Adversarial Examples: Say an adversary is designing images intentionally to fool your model. How can you design such images? How can you combat such images?