Natural Language Processing on Hinglish

Sagarika Raje
4 min read · Apr 16, 2021

Hinglish is a hybrid mix of Hindi and English, switching between the two within conversations, individual sentences, and even single words. An example: “nahi mei nahi aa sakta”, which translates to “no, I cannot come.” It is gaining popularity as a way of speaking that signals you are modern, yet locally grounded.

India is a diverse country, so one naturally sees both ends of the spectrum: on one side, a Hindi-speaking local who reads and understands the Devanagari script, and on the other, tourists from abroad who may not understand the language at all. Since the local Indian markets, which are a huge attraction for foreign tourists, are made up largely of vendors who read only Devanagari, we saw the need for Hinglish as a medium where these two extremes can meet.

Our program aims to do two things:

  • Translate a Hinglish word into both Hindi and English
  • Classify whether a typed word is English or Hindi written in the Latin script

PART 1

To tackle the first, we built a sequence-to-sequence architecture using TensorFlow.

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. words in Hinglish) to sequences in another domain (e.g. the same words translated to Hindi).

Seq2Seq is an encoder-decoder approach to machine translation and language processing that maps an input sequence to an output sequence, optionally with an attention mechanism. The idea is to use two RNNs that work together: special start and end tokens mark the boundaries of each sequence, and the model learns to predict the next output step from what it has seen so far.

A typical sequence-to-sequence model has two parts: an encoder and a decoder. The two parts are effectively separate neural networks combined into one larger network.

Broadly, the task of the encoder network is to understand the input sequence and create a lower-dimensional representation of it. This representation is then passed to the decoder network, which generates a sequence of its own that represents the output.

We built one seq2seq model to translate Hinglish (Hindi words written in the Latin script) to Hindi, and a second seq2seq model to translate Hindi to English. Chaining the two translates Hinglish into English.
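
To make the architecture concrete, here is a minimal character-level sketch of the Hinglish-to-Hindi step in TensorFlow/Keras. The word pairs, latent size, and training settings below are illustrative placeholders, not our actual corpus or hyperparameters.

import numpy as np
import tensorflow as tf

# Toy parallel word pairs (Hinglish in the Latin script -> Hindi in Devanagari).
pairs = [("nahi", "नहीं"), ("ghar", "घर"), ("paani", "पानी")]

# "\t" marks start-of-word and "\n" marks end-of-word for the decoder.
sources = [s for s, _ in pairs]
targets = ["\t" + t + "\n" for _, t in pairs]

src_chars = sorted({c for s in sources for c in s})
tgt_chars = sorted({c for t in targets for c in t})
src_idx = {c: i for i, c in enumerate(src_chars)}
tgt_idx = {c: i for i, c in enumerate(tgt_chars)}
max_src = max(len(s) for s in sources)
max_tgt = max(len(t) for t in targets)

# One-hot encode encoder inputs, decoder inputs and (shifted) decoder targets.
enc_in = np.zeros((len(pairs), max_src, len(src_chars)), dtype="float32")
dec_in = np.zeros((len(pairs), max_tgt, len(tgt_chars)), dtype="float32")
dec_out = np.zeros((len(pairs), max_tgt, len(tgt_chars)), dtype="float32")
for i, (s, t) in enumerate(zip(sources, targets)):
    for j, c in enumerate(s):
        enc_in[i, j, src_idx[c]] = 1.0
    for j, c in enumerate(t):
        dec_in[i, j, tgt_idx[c]] = 1.0
        if j > 0:
            dec_out[i, j - 1, tgt_idx[c]] = 1.0

latent_dim = 64  # size of the encoder's compressed representation

# Encoder: reads the Hinglish word and keeps only its final LSTM states.
encoder_inputs = tf.keras.Input(shape=(None, len(src_chars)))
_, state_h, state_c = tf.keras.layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generates the Hindi word, conditioned on the encoder states.
decoder_inputs = tf.keras.Input(shape=(None, len(tgt_chars)))
decoder_hidden = tf.keras.layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = tf.keras.layers.Dense(len(tgt_chars), activation="softmax")(decoder_hidden)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.fit([enc_in, dec_in], dec_out, batch_size=2, epochs=10)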

We chose this linear (chained) approach rather than translating Hinglish directly to English. Even so, we faced a number of issues, mainly because the corpus of Hinglish words is limited, which was a major setback when translating uncommon words to Hindi and English. Also, since the corpus consisted of individual words rather than sentences, we could not translate Hinglish sentences into proper English or Hindi. Working with TensorFlow was also a challenge because of outdated libraries and the reduced usage of tensorflow-core, which made debugging difficult. Lastly, Hinglish is an informal, improvised language with no fixed spelling for any word, which increased the scope for error during translation.

English follows a Subject-Verb-Object (SVO) structure, while Hindi is a Subject-Object-Verb (SOV) language: “I eat an apple” becomes “main seb khaata hoon”, literally “I apple eat”. Hindi is also morphologically richer than English. In general, these divergences make the translation process difficult and error-prone. Hindi poses further challenges when translating to English: (1) the lack of articles in Hindi makes translations imprecise, and (2) many English words have multiple context-dependent meanings.

Our output looked something like this

PART 2

The second part of the project was to classify whether a given word is English or Hindi written in the Latin script.

To build this classification model, we used an LSTM. Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections. It can process not only single data points (such as images) but also entire sequences of data (such as speech or video).

Our LSTM model has three layers, with ReLU and sigmoid as activation functions and binary cross-entropy as the loss function. After training for 20 epochs, we reached an accuracy of 83.4%.
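
As a rough sketch of this classifier, the model below stacks an embedding, an LSTM, a ReLU hidden layer, and a sigmoid output, trained with binary cross-entropy. The toy words, layer sizes, and character encoding are assumptions for illustration, not our actual dataset or exact configuration.

import numpy as np
import tensorflow as tf

# Toy labelled words: 0 = English, 1 = Hindi written in the Latin script.
words = ["hello", "come", "water", "nahi", "paani", "ghar"]
labels = np.array([0, 0, 0, 1, 1, 1], dtype="float32")

# Encode each word as a padded sequence of character indices (1-based; 0 = padding).
chars = sorted({c for w in words for c in w})
char_idx = {c: i + 1 for i, c in enumerate(chars)}
max_len = max(len(w) for w in words)
x = np.zeros((len(words), max_len), dtype="int32")
for i, w in enumerate(words):
    for j, c in enumerate(w):
        x[i, j] = char_idx[c]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(chars) + 1, output_dim=16, mask_zero=True),
    tf.keras.layers.LSTM(32),                       # summarises the character sequence
    tf.keras.layers.Dense(16, activation="relu"),   # ReLU hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid")  # probability that the word is Hinglish
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=20, batch_size=2)

print(model.predict(x))  # probabilities close to 1 indicate Hinglish words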

We then deployed the model with Flask, exposed publicly using ngrok. You can check out our code here.
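
A minimal sketch of that deployment is shown below, assuming the trained classifier was saved as "classifier.h5" and that the character encoding mirrors what was used during training; the route name, payload format, and helper are hypothetical.

import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("classifier.h5")  # hypothetical saved model file

CHAR_IDX = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
MAX_LEN = 12  # placeholder; must match the sequence length used during training

def encode_word(word):
    # Placeholder encoder: map characters to indices and zero-pad to MAX_LEN.
    x = np.zeros((1, MAX_LEN), dtype="int32")
    for j, c in enumerate(word.lower()[:MAX_LEN]):
        x[0, j] = CHAR_IDX.get(c, 0)
    return x

@app.route("/classify", methods=["POST"])
def classify():
    word = request.json.get("word", "")
    prob = float(model.predict(encode_word(word))[0][0])
    label = "hinglish" if prob > 0.5 else "english"
    return jsonify({"word": word, "label": label, "probability": prob})

if __name__ == "__main__":
    # Running "ngrok http 5000" alongside this exposes the app on a temporary public URL.
    app.run(port=5000)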

Project Owners -

Sagarika Raje (raje.sagarika@gmail.com)

Kopal Sharma (kopalsharma2000@gmail.com)
