Language Detection using NLP
Introduction
Every machine learning enthusiast has the desire to create or work on a cool project, right? It is not sufficient to simply comprehend the theory; you must work on projects, try to implement them, and learn from them. Additionally, focusing on a narrow field like NLP opens up a world of possibilities and research questions. In this article we want to share a fantastic project with you: a Language Detection model built using Natural Language Processing. It will walk you through a practical application of ML. So, let's not delay any longer.
It is a model built on NLP that identifies the language of a given sentence. As an illustration, suppose we give it the English sentence "What are you doing right now?". Because the sentence is in English, the model identifies the language and returns the result "English".
About the dataset
We’re utilizing the Language Detection dataset, which includes text information for a variety of languages, including English, Dutch, Spanish, Japanese, Russian, Italian, Malayalam, Hindi, Tamil, and Telugu.
We must build a model that can predict the language of a given text. Prediction systems of this kind are frequently used for machine translation, in robots, and in electronic devices such as mobile phones and laptops.
Implementation
Importing libraries and dataset
So, let's go: first we import all of the necessary libraries. The languages dataset has two columns: "Text", which is a sentence, and "Language", its target feature; 22 different languages are included. After loading, we examine the data's shape: the dataset has 22,000 rows and 2 columns.
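In code, this loading step might look like the sketch below. The filename "Language Detection.csv" is an assumed name, and the tiny stand-in DataFrame is only for illustration; in the real project the frame comes from the downloaded dataset:

```python
import pandas as pd

# In the real project the data is loaded from the downloaded CSV, e.g.:
#   data = pd.read_csv("Language Detection.csv")   # assumed filename
# A tiny stand-in frame with the same two columns illustrates the checks:
data = pd.DataFrame({
    "Text": ["Nature is beautiful", "La nature est belle", "自然は美しい"],
    "Language": ["English", "French", "Japanese"],
})

print(data.shape)    # (rows, columns); the full dataset has 22,000 rows and 2 columns
print(data.head())
```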
Check the count of each language
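Counting how many samples each language has is a one-liner with pandas; the small frame here stands in for the full loaded dataset:

```python
import pandas as pd

# Stand-in for the loaded dataset; in the project this is the full CSV.
data = pd.DataFrame({
    "Text": ["hello", "bonjour", "hi there", "salut", "good day"],
    "Language": ["English", "French", "English", "French", "English"],
})

counts = data["Language"].value_counts()
print(counts)    # English: 3, French: 2
```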
Text Preprocessing
This dataset was generated by scraping Wikipedia, so it contains many unwanted symbols, numbers, and other noise that would degrade the accuracy of our model. Therefore, text preprocessing techniques are applied.
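A typical cleaning step strips symbols and digits and lowercases the text. The regular expression below is a sketch of one common approach, not necessarily the exact pattern used in the project:

```python
import re

def clean_text(text):
    # remove punctuation, symbols, and digits, then lowercase;
    # non-Latin scripts (Hindi, Japanese, ...) are left untouched
    text = re.sub(r"[!@#$(),\n\"%^*?:;~`0-9\[\]]", " ", text)
    return text.lower()

print(clean_text("Hello, World!! 42"))
```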
Bag of Words
As is common knowledge, both the input and the output features must be in numerical form. Therefore, we build a Bag of Words model with CountVectorizer, which converts the text into numerical vectors.
Label Encoding
Our output variable, the name of the language, is a categorical variable. Because the model needs numerical targets for training, we perform label encoding on it. LabelEncoder from sklearn is imported for this procedure. Applied to our target feature, "Language", it converts the language column into integers, assigning each language a number such as 0, 1, 2, and so on.
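LabelEncoder assigns each class an integer in alphabetical order of the class names; a minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

languages = ["English", "French", "English", "Japanese"]
le = LabelEncoder()
y = le.fit_transform(languages)

print(le.classes_)  # ['English' 'French' 'Japanese'] -> codes 0, 1, 2
print(y)            # [0 1 0 2]
```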
Train Test Splitting
With the input and output variables preprocessed, the next step is to split the data into a training set for fitting the model and a test set for evaluating it.
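The split itself is one call to scikit-learn's train_test_split. The 20% test fraction below is an assumption for illustration, since the article does not state the exact ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy feature matrix and labels, standing in for the vectorized text
# and the label-encoded languages
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```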
Model Training and Prediction
The process of creating the model is almost complete. We build it with the Naive Bayes algorithm, train it on the training set, and then predict the labels for the test set.
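For count features, the multinomial variant of Naive Bayes is the usual choice; here is a self-contained sketch with a four-sentence toy corpus (the real project trains on the full vectorized dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["hello how are you", "good morning friend",
         "bonjour mon ami", "comment allez vous"]
labels = ["English", "English", "French", "French"]

cv = CountVectorizer()
X_train = cv.fit_transform(texts)

model = MultinomialNB()
model.fit(X_train, labels)

# predict on unseen text vectorized with the SAME fitted vectorizer
pred = model.predict(cv.transform(["bonjour comment"]))
print(pred)  # ['French']
```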
Model Evaluation
It's time to evaluate the model by comparing the predicted output with the actual test labels. The accuracy comes out to 92%, and we can also inspect the confusion matrix.
Confusion Matrix
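Evaluation boils down to two scikit-learn calls; the toy labels below are illustrative, not the article's actual test results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["English", "French", "English", "French", "English"]
y_pred = ["English", "French", "French", "French", "English"]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["English", "French"])
print(acc)  # 0.8 on this toy data
print(cm)   # rows = true class, columns = predicted class
```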
Predicting with some more data
Now let’s test the model prediction using text in different languages.
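Wrapping the fitted vectorizer, label encoder, and classifier in a small helper gives a convenient prediction function. Everything below is an end-to-end sketch on a toy corpus; the function name predict_language is our own:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

texts = ["hello how are you", "good morning friend",
         "bonjour mon ami", "comment allez vous"]
labels = ["English", "English", "French", "French"]

le = LabelEncoder()
y = le.fit_transform(labels)
cv = CountVectorizer()
X = cv.fit_transform(texts)
model = MultinomialNB().fit(X, y)

def predict_language(sentence):
    # vectorize with the fitted vectorizer, then map the code back to a name
    code = model.predict(cv.transform([sentence]))[0]
    return le.inverse_transform([code])[0]

print(predict_language("bonjour mon ami"))  # French
```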
Conclusion
We hope you now have a better understanding of how projects like this are put together, and that this article has given you a basic blueprint of an NLP program. The data must first be examined, then preprocessed as necessary. A bag of words model is one way of representing your text data, and text extraction and vectorization are crucial steps for making accurate predictions in NLP. For text classification problems like this one, Naive Bayes consistently proves to be a strong model, leading to accurate results.
Authors:
Vaishnavi Gawale
Priyanshi Lalit
Nishant Tekale
Shivansh Rastogi