Handwriting to Text Conversion using Time Distributed CNN and LSTM with CTC Loss Function
An approach to Optical Character Recognition (OCR) for handwritten character-to-text conversion using the deep learning framework Keras.
This post is part of the project we did as the capstone for the Indian School of Business's Co' Summer 2019, sponsored by our industry partner.
Introduction:
The motivation for this project comes from the fact that many people still prefer to write on paper with a pen; if that writing is digitized, it adds a lot of value, both for long-term storage and for use by downstream systems for analysis. One such example is in the field of medicine: in India, most doctors still write patient prescriptions on paper, and our sponsor needed to digitize them so they could be put into a patient health records system. Through this project, we demonstrate how to train a handwriting recognition system with labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) on manually labeled text-line data from a specific doctor's prescription dataset, and we propose an incremental training procedure that covers the rest of the data as it becomes available, so that the model can adapt to an individual doctor's handwriting with the highest possible accuracy.
Segments of the project:
This project consists of multiple segments, as listed below:
- Labeling data using image annotation tools
- Raw image into line segmented image
- Splitting the images into different frames
- Time Distributed Convolution Recurrent Neural Network (CRNN)
- Implementation of the Connectionist Temporal Classification (CTC) loss function
- Nearest word prediction using Levenshtein distance (also known as edit distance)
Section 1: Labeling of data using image annotation tools
This is the first and most time-consuming step of the project. We tried and evaluated multiple tools for this purpose and found Oxford's VGG Image Annotator (VIA) best for our purpose. This tool allows us to select areas in the prescription by drawing bounding boxes around the words and entering an annotation corresponding to each selected area. Once the annotation is done, we can download it in CSV format for further processing in Python. The downloaded file contains information like the region of annotation in x and y coordinates, the file name, and the region attributes, which hold the actual annotation of the word itself. This is what a sample annotation looks like:
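Once downloaded, the CSV can be parsed in Python before segmentation. Here is a minimal sketch, assuming VIA's standard export columns; the file name and the annotation key `"text"` are hypothetical:

```python
import json
import pandas as pd

# Load the annotation CSV exported from the VGG Image Annotator
annotations = pd.read_csv("via_annotations.csv")  # hypothetical file name

def parse_row(row):
    """Extract one bounding box and its transcription from a VIA export row."""
    shape = json.loads(row["region_shape_attributes"])  # {"name": "rect", "x": ..., ...}
    attrs = json.loads(row["region_attributes"])        # e.g. {"text": "..."}
    return {"filename": row["filename"],
            "x": shape["x"], "y": shape["y"],
            "w": shape["width"], "h": shape["height"],
            "label": attrs.get("text", "")}

boxes = [parse_row(row) for _, row in annotations.iterrows()]
```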
Section 2: Raw image into line segmented image
This is performed with the help of the OpenCV library in Python. The raw image is read at its original resolution and then converted to grayscale; for the input to the handwriting recognition system, it does not matter whether we give the image in grayscale or in three channels. Once this is done, the contours in the image are identified. Contours are helpful in identifying the boundaries of regions of pixels having the same color or intensity. After bounding boxes are drawn around the contours, we crop the image to extract the individual line-segmented images.
In the process of getting proper line-segmented data, we ignore all bounding boxes whose height is below a certain threshold. By experimentation, we found that a threshold of 64 pixels gives us the desired output.
It is also very important to preserve the order of the lines obtained. For this, we sort the contours based on the starting pixel coordinates of their bounding boxes, so that the lines are read from the top-left corner to the bottom-right corner.
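Putting these steps together, a minimal sketch of the segmentation in OpenCV might look like the following. The Otsu binarization step and all of the names here are our assumptions (contour detection needs a binary image; the original pipeline may differ in detail):

```python
import cv2

MIN_LINE_HEIGHT = 64  # bounding boxes shorter than this are discarded

def segment_lines(image_path):
    """Crop a raw prescription image into individual line images, in reading order."""
    image = cv2.imread(image_path)                  # read at original resolution
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # one channel is sufficient
    # Binarize (Otsu) so that connected ink regions form contours
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # OpenCV 4.x returns (contours, hierarchy)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]        # (x, y, w, h)
    boxes = [b for b in boxes if b[3] >= MIN_LINE_HEIGHT]  # drop short boxes
    boxes.sort(key=lambda b: (b[1], b[0]))                 # top-left to bottom-right
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```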
Section 3: Splitting the images into different frames
We use the Keras TimeDistributed wrapper to feed the CNN layers. It takes the different frames of the input and processes them frame by frame. Frames of 64 x 64 pixels are obtained from the input image with a stride of 4 between subsequent frames. The sliding window moves in the direction of the writing, which is from left to right. The TimeDistributed wrapper is helpful in preserving the temporal sequence of the frames of images that we get from the input. Essentially, the input and the output will look like this:
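In code, the sliding-window step might look like the sketch below. It assumes each line image has already been scaled to a height of 64 pixels and is at least 64 pixels wide; the function name is ours:

```python
import numpy as np

FRAME_SIZE = 64  # each frame is 64 x 64 pixels
STRIDE = 4       # horizontal shift between subsequent frames

def image_to_frames(line_image):
    """Slice a (64, width) line image into overlapping 64 x 64 frames, left to right."""
    _, width = line_image.shape
    frames = [line_image[:, x:x + FRAME_SIZE]
              for x in range(0, width - FRAME_SIZE + 1, STRIDE)]
    # Shape (timesteps, 64, 64, 1) -- the sequence axis that TimeDistributed expects
    return np.expand_dims(np.stack(frames), axis=-1)
```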
Section 4: Time Distributed Convolution Recurrent Neural Network (CRNN)
The core of this project is a deep convolutional recurrent neural network (CRNN) inspired by the VGG16 architecture. We use a stack of 13 convolution layers followed by three bidirectional LSTMs with 256 units in each direction. Max pooling is applied after some of the convolution layers, for a total of 5 max-pooling layers. To introduce non-linearity in the convolution layers, the ReLU activation function is used. The weights of the convolution layers are initialized with he-normal, and batch normalization layers are used after all of the convolution layers to normalize the activations, which helps the network converge faster. The dropout rate for training is set to 0.25 in each of the LSTM layers. There is a dense layer between the output of the CNN and the input of the LSTMs, which is very effective in reducing the number of parameters used for training.
We set the total number of output classes applicable to our problem to 91 (A-Z, a-z, 0-9, all standard special characters on an English keyboard, plus one additional class for unknown). Hence the last layer of the neural network has 91 units for each LSTM output frame. A softmax activation is used in the final layer to obtain the class with the highest probability in each frame.
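The sketch below shows how such an architecture might be assembled in Keras. The exact filter counts, the block layout, and the size of the dense bottleneck (128 here) are our assumptions based on VGG16; only the overall shape (13 TimeDistributed convolutions, 5 max-pooling layers, 3 bidirectional LSTMs of 256 units, 91-way softmax) comes from the description above:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 91  # A-Z, a-z, 0-9, special characters, plus one "unknown" class

def conv_block(x, filters, pool):
    """VGG-style layer: convolution -> batch norm -> ReLU, optional max pooling."""
    x = layers.TimeDistributed(
        layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal"))(x)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    x = layers.TimeDistributed(layers.Activation("relu"))(x)
    if pool:
        x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    return x

# Input: a variable-length sequence of 64 x 64 single-channel frames
inputs = layers.Input(shape=(None, 64, 64, 1))
x = inputs
# 13 convolution layers in 5 VGG16-style blocks, each block ending in max pooling
for filters, n_convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    for i in range(n_convs):
        x = conv_block(x, filters, pool=(i == n_convs - 1))
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.TimeDistributed(layers.Dense(128))(x)  # bottleneck to cut parameters
for _ in range(3):
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True,
                                         dropout=0.25))(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)
```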
Section 5: Implementation of the Connectionist Temporal Classification (CTC) loss function
The objective function used for training is the Connectionist Temporal Classification (CTC) loss. While other loss functions optimize a single objective, the CTC loss is specially designed to optimize both the length of the predicted sequence and the classes of the predicted sequence, since the input images vary in length. The convolution filters and the LSTM weights are learned jointly within the back-propagation procedure. The Adam optimizer is used for training with an initial learning rate of 0.001.
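Keras does not ship CTC as a standard compiled loss, so a common pattern, and a reasonable guess at what our wiring looked like, is to wrap `K.ctc_batch_cost` in a `Lambda` layer. This sketch continues from the model sketch above (`inputs` and `outputs`):

```python
from tensorflow.keras import backend as K
from tensorflow.keras import layers, models, optimizers

def ctc_loss(args):
    """Wrap Keras's built-in CTC cost so it can be used as a layer."""
    y_pred, labels, input_length, label_length = args
    # ctc_batch_cost reserves the last class index as the CTC blank
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

labels = layers.Input(name="labels", shape=(None,), dtype="int32")
input_length = layers.Input(name="input_length", shape=(1,), dtype="int32")
label_length = layers.Input(name="label_length", shape=(1,), dtype="int32")
loss_out = layers.Lambda(ctc_loss, name="ctc")(
    [outputs, labels, input_length, label_length])

# The Lambda layer already computes the loss, so compile with a pass-through loss
training_model = models.Model([inputs, labels, input_length, label_length],
                              loss_out)
training_model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                       loss=lambda y_true, y_pred: y_pred)
```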
The models were trained on GPU servers from Google Cloud. The system configuration was 32 GB of RAM, an 8-core processor, and 4 Tesla K80 GPUs.
Complete architecture:
Section 6: Nearest word prediction using Levenshtein distance
The metric used to track the performance of the network is the Levenshtein distance (also known as edit distance), a commonly used string metric that measures the difference between the observed sequence and the predicted sequence as the minimum number of single-character edits needed to turn one into the other.
For this, we built a vocabulary of all the words in the input. The CTC output does not always give the entire word correctly, because the frames fed through the TimeDistributed wrapper overlap: since a single letter can appear in multiple frames, the raw output contains repeated characters. So we use edit distance as the metric to find the word in our entire corpus that matches closest to the word given as output.
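A minimal sketch of this lookup, using the standard two-row dynamic-programming formulation of Levenshtein distance (the function names are ours):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def nearest_word(predicted, vocabulary):
    """Return the vocabulary word closest to the raw CTC output."""
    return min(vocabulary, key=lambda word: levenshtein(predicted, word))
```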
Conclusion and further work:
Putting all the pieces together, the model achieved an accuracy of about 0.75 on the test dataset and about 0.80 on the training dataset. The line images are fed to the CRNN line by line, and the output of the CTC layer is matched with the closest word in our corpus using the edit distance metric, so that the prediction can be checked against the actual words.
Furthermore, we wanted to experiment with a ResNet architecture for the CNN layers instead of the VGG16 architecture initially chosen for the project, but retraining from scratch would have been very time- and resource-consuming, and since the capstone was ending, we did not have the opportunity to do so.
As for the practical use of the project, we wanted to integrate it with an inscription pad: you place a piece of paper on the pad and write, and in the backend our system automatically fragments the whole page into lines and the lines into words, feeds them to the CRNN architecture, and automatically outputs the words corresponding to the handwritten letters.
Link to GitHub repo:
Authored by: