A Self Case Study on Quora Insincere Question Classification
Note: In this post I explain how I approached the problem end to end, from exploratory data analysis to model building with deep learning. I hope you enjoy it, and please give it a clap!
Quora is an American question-and-answer website where people ask questions and connect with others who contribute unique insights and quality answers. It empowers people to learn from each other.
Handling harmful content is a major problem for websites today. Quora faces the same toxicity in the form of insincere questions posted every day: questions founded on false premises, questions that demean people and their choices, or questions that intend to make a statement rather than look for helpful answers. Identifying these harmful questions will improve online conversations and create a more productive environment for users.
Quora Insincere Question Classification was a Kaggle competition hosted by Quora to develop a model that identifies and flags insincere questions.
- Data Overview
- Evaluation Metric
- Exploratory Data Analysis
- Data Pre-processing
- Model Building
As discussed above, I will develop a model that identifies insincere questions in the question text corpus provided by Quora. It is a binary classification model that outputs the label “1” for insincere questions and “0” otherwise.
Kaggle provides four files: train.csv, test.csv, sample_submission.csv, and embeddings.zip.
1. train.csv has three columns:
   - qid: unique question identifier (unique id associated with the question)
   - question_text: Quora question text (the question asked on the platform)
   - target: a question labeled “insincere” has a value of 1, otherwise 0
2. test.csv has only the qids and the question text.
3. sample_submission.csv has the unique qids of the test data and a submission column where we set the output label for submission.
4. embeddings.zip contains four pre-trained embedding files.
You can download the full dataset here.
The competition's evaluation metric is the F1-score.
F1-Score is the harmonic mean of precision and recall. Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances whereas recall (also known as sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved.
TP = True Positive (number of positive-class samples classified as positive)
TN = True Negative (number of negative-class samples classified as negative)
FP = False Positive (number of negative-class samples classified as positive)
FN = False Negative (number of positive-class samples classified as negative)
Note: For a highly imbalanced dataset, F1-score is a better evaluation metric than plain accuracy. It balances precision and recall on the positive class, so a model cannot score well simply by predicting the majority class.
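To make the definitions above concrete, here is a minimal sketch that computes precision, recall, and F1-score directly from the TP, FP, and FN counts. The function name `f1_from_counts` and the example counts are my own illustration and are not taken from the competition data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score given true positives, false positives, and false negatives."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic:
# precision = 60 / 80 = 0.75, recall = 60 / 100 = 0.60.
example_f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(round(example_f1, 4))
```

Note that TN does not appear anywhere in the formula, which is why F1-score is not inflated by the large number of sincere questions.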
Exploratory data analysis:
First, we have to load the dataset for further analysis.
Here we can see that the training dataset doesn’t have any missing values. Also, all the qids and question texts are unique.
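The checks described above can be sketched with pandas roughly as follows. The three rows are a made-up stand-in for train.csv so that the snippet runs on its own; with the real data you would read the file with `pd.read_csv("train.csv")` instead.

```python
import io

import pandas as pd

# Tiny made-up stand-in for train.csv (same three columns as the real file).
sample_csv = io.StringIO(
    "qid,question_text,target\n"
    "a1,How do I learn Python?,0\n"
    "a2,What is the capital of France?,0\n"
    "a3,Why are some people so rude?,1\n"
)
train = pd.read_csv(sample_csv)

# Count missing values per column.
missing_per_column = train.isnull().sum()

# Check that ids and question texts are not duplicated.
qids_unique = train["qid"].is_unique
questions_unique = train["question_text"].is_unique

print(missing_per_column)
print(qids_unique, questions_unique)
```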
The pie chart shows that only 6.2% of the total questions belong to the insincere category, which shows the dataset is highly imbalanced.
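The class share behind such a pie chart can be computed in one line with `value_counts(normalize=True)`. The labels below are hypothetical and only mimic the imbalance; on the real data the same call would be applied to `train["target"]`.

```python
import pandas as pd

# Hypothetical labels standing in for train["target"]: 15 sincere, 1 insincere.
target = pd.Series([0] * 15 + [1], name="target")

# Fraction of rows per class.
class_share = target.value_counts(normalize=True)
insincere_pct = 100 * class_share.loc[1]

print(f"Insincere questions: {insincere_pct:.2f}%")
```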
Word Cloud Analysis:
For Sincere Questions,
For Insincere Questions,
From the word clouds, we can see that words like Muslim, black, Chinese, American, etc. appear prominently in the insincere category, which helps distinguish the two target labels.
The objective of pre-processing is to clean the text corpus so that it is closer to our pre-trained embedding. Both the share of unique words (the vocabulary) and the share of total words in the corpus that are found in the embedding should be as high as possible.
The pre-processing is based on the GloVe embedding, as it outperformed all the others.
First, we check how much of the text corpus the embedding covers.
The word coverage of the text corpus with respect to the embedding is low. Only 33% of the unique words are present in the embedding; our job here is to increase that share as much as possible.
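A coverage check of this kind can be sketched as below. This is my own minimal version, not necessarily the code used for the numbers above: `build_vocab` and `check_coverage` are illustrative names, tokenization is a plain whitespace split, and `embedding_index` stands in for the set of words that have a GloVe vector.

```python
from collections import Counter


def build_vocab(sentences):
    """Count how often each whitespace-separated token occurs in the corpus."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab


def check_coverage(vocab, embedding_index):
    """Return (unique-word coverage, total-word coverage, out-of-vocabulary words)."""
    known_unique = sum(1 for word in vocab if word in embedding_index)
    known_total = sum(count for word, count in vocab.items() if word in embedding_index)
    # Out-of-vocabulary words, most frequent first: these are the ones worth fixing.
    oov = sorted(
        ((word, count) for word, count in vocab.items() if word not in embedding_index),
        key=lambda pair: -pair[1],
    )
    unique_coverage = known_unique / max(len(vocab), 1)
    total_coverage = known_total / max(sum(vocab.values()), 1)
    return unique_coverage, total_coverage, oov


# Made-up questions and a made-up embedding vocabulary, for illustration only.
questions = ["What is 2+2 ?", "What isn't a prime number?"]
embedding_index = {"What", "is", "a", "prime", "number", "?"}

vocab = build_vocab(questions)
unique_cov, total_cov, oov_words = check_coverage(vocab, embedding_index)
print(f"unique: {unique_cov:.1%}, total: {total_cov:.1%}")
print(oov_words)
```

Looking at the most frequent out-of-vocabulary words (here tokens such as "isn't" and "number?") is what suggests which cleaning steps will pay off.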
1. Replace math symbols with their literal meaning
The text has many questions related to mathematical problems (including symbols and formulas). To keep the essence of these questions, we replace many symbols and variables with their respective meanings, so that the text still reads as a valid question.
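As a rough sketch of this step, the snippet below swaps a few symbols for words. The `MATH_SYMBOLS` mapping is hypothetical and deliberately tiny; the mapping actually used for the results in this post is larger and is not reproduced here.

```python
import re

# Hypothetical symbol-to-word mapping, for illustration only.
MATH_SYMBOLS = {
    "√": " square root ",
    "π": " pi ",
    "∞": " infinity ",
    "×": " times ",
    "÷": " divided by ",
}


def replace_math_symbols(text: str) -> str:
    """Replace each known math symbol with its literal meaning."""
    for symbol, meaning in MATH_SYMBOLS.items():
        text = text.replace(symbol, meaning)
    # Collapse the extra spaces introduced by the padded replacements.
    return re.sub(r"\s+", " ", text).strip()


cleaned = replace_math_symbols("What is √2 × π?")
print(cleaned)
```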
2. Remove unwanted characters and symbols
We remove these characters and symbols because they are not present in the embedding and don’t add much value to the question text.
3. Removal of Contractions
We keep every contraction that is present in the embedding and expand (de-contract) the others to create a clean text corpus.
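One way to sketch this idea: expand a contraction only when it has no vector of its own. Both `CONTRACTIONS` and `KEEP` below are hypothetical examples; in practice `KEEP` would be derived from the embedding vocabulary and the mapping would be much longer.

```python
import re

# Hypothetical mapping from contraction to expanded form.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "what's": "what is",
}
# Hypothetical set of contractions that already exist in the embedding.
KEEP = {"don't"}

CONTRACTION_PATTERN = re.compile(r"\b[A-Za-z]+'[A-Za-z]+\b")


def expand_contractions(text: str) -> str:
    """Expand known contractions unless the embedding already covers them."""
    # Normalize curly apostrophes so the lookup works on both styles.
    text = text.replace("’", "'")

    def replace(match):
        word = match.group(0)
        if word.lower() in KEEP:
            return word
        # Unknown contractions are left untouched.
        return CONTRACTIONS.get(word.lower(), word)

    return CONTRACTION_PATTERN.sub(replace, text)


expanded = expand_contractions("Why can't I say what's wrong if you don't ask?")
print(expanded)
```

This simple version lowercases the replacement, so a sentence-initial "Can't" becomes "cannot"; handling capitalization is left out to keep the sketch short.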
4. Keep all punctuation marks present in the embedding
5. Replace the vocabulary words with their correct form present in the embedding
6. Replace misspelled and unknown words with their correct spelling and meaning
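Steps 5 and 6 both come down to a dictionary lookup applied with a single regular expression. The sketch below assumes a hand-built `MISSPELLINGS` dictionary; the three entries are hypothetical examples, not the list used for the coverage numbers in this post.

```python
import re

# Hypothetical corrections: key = form found in the corpus, value = form in the embedding.
MISSPELLINGS = {
    "colour": "color",
    "qoura": "Quora",
    "demonitisation": "demonetization",
}

# One alternation pattern covering every known misspelling, matched as whole words.
MISSPELLING_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in MISSPELLINGS) + r")\b",
    flags=re.IGNORECASE,
)


def fix_spelling(text: str) -> str:
    """Replace each known misspelling with its corrected form."""
    return MISSPELLING_PATTERN.sub(lambda m: MISSPELLINGS[m.group(0).lower()], text)


fixed = fix_spelling("Is Qoura affected by demonitisation?")
print(fixed)
```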
After performing the above steps, the result is much better: we now cover almost 75% of the training words and 85% of the test words, as shown below.
The percentage can be improved further with other pre-processing techniques, such as stemming and lemmatization. We can also replace company names with a description of what they do (for example, OnePlus becomes "mobile phone company") so that the text carries more meaning.
Model Building:
I tried both machine learning and deep learning approaches for this classification problem. In my experiments, the deep learning models clearly outperformed all the classical machine learning approaches. Here I am going to discuss the model that yielded a better F1-score than any other model.
Bidirectional RNN (LSTM/GRU) with attention Layer
What is RNN?
A Recurrent Neural Network (RNN) is a type of neural network in which the output from previous steps is fed as input to the current step. RNN-based models are widely used in NLP tasks because they can remember the previous words in a sentence when predicting the next one. The most important feature of an RNN is its hidden state, which remembers some information about a sequence.
A unidirectional RNN only stores information about the past, because the only inputs it has seen are from the past. A bidirectional RNN processes the input in two directions, one from past to future and the other from future to past. By combining the two hidden states, we can preserve information from both the past and the future at any point in time.
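The idea of running the sequence in both directions can be illustrated with a few lines of NumPy. This is a sketch with a plain tanh RNN cell and random weights, not an LSTM and not the trained model; all names and sizes are made up for the example.

```python
import numpy as np


def rnn_pass(inputs, w_x, w_h):
    """Run a plain tanh RNN over inputs of shape (time_steps, input_dim)."""
    hidden = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:
        # The new hidden state depends on the current input and the previous state.
        hidden = np.tanh(x_t @ w_x + hidden @ w_h)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
time_steps, input_dim, units = 5, 3, 4
inputs = rng.normal(size=(time_steps, input_dim))

# Each direction has its own weights.
fw_x, fw_h = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))
bw_x, bw_h = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))

# Forward direction: past to future.
forward_states = rnn_pass(inputs, fw_x, fw_h)
# Backward direction: feed the reversed sequence, then flip back to the original order.
backward_states = rnn_pass(inputs[::-1], bw_x, bw_h)[::-1]

# At every time step, combine what was seen before it and what comes after it.
bidirectional_states = np.concatenate([forward_states, backward_states], axis=-1)
print(bidirectional_states.shape)  # (5, 8)
```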
I have used a bidirectional LSTM layer for the model building. To know more about the LSTM layer and its architecture, please check out here.
What is Attention?
Attention models, or attention mechanisms, are input processing techniques for neural networks that allow the network to focus on specific aspects of a complex input, one at a time, until the entire dataset is categorized. The goal is to break down complicated tasks into smaller areas of attention that are processed sequentially, similar to how the human mind solves a new problem by dividing it into simpler tasks and solving them one by one.
Attention helps overcome a limitation of RNNs, which tend to perform poorly as sentences get longer, by letting the model retain the semantic meaning of the whole text.
Here I have used Bahdanau attention, also known as additive attention, as it performs a linear combination of the encoder states and the decoder states.
Before the pseudocode, here is some notation.
- FC = Fully connected (dense) layer
- EO = Encoder output
- H = hidden state
- X = input to the decoder
- score = FC(tanh(FC(EO) + FC(H)))
- attention weights = softmax(score, axis = 1). Softmax by default is applied on the last axis, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, hidden_size). max_length is the length of our input. Since we are trying to assign a weight to each input, softmax should be applied on that axis.
- context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis as 1.
- embedding output = the input to the decoder, X, passed through an embedding layer.
- merged vector = concat(embedding output, context vector)
Finally, the merged vector is given to a dense layer to get the feature vectors. Again, the dense layer output is passed to a softmax layer to get the prediction.
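The first three lines of the pseudocode can be traced in NumPy as follows. This is a sketch under a few assumptions of mine: the FC layers are plain matrix multiplications without biases, the weights are random, and the last FC has a single output unit, so in this sketch the score has shape (batch_size, max_length, 1).

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def additive_attention(eo, h, w1, w2, v):
    """eo: (batch, max_length, hidden), h: (batch, hidden)."""
    # Add a time axis to H so it broadcasts over every position of EO.
    h_with_time_axis = h[:, np.newaxis, :]                 # (batch, 1, hidden)
    # score = FC(tanh(FC(EO) + FC(H)))
    score = np.tanh(eo @ w1 + h_with_time_axis @ w2) @ v   # (batch, max_length, 1)
    # One weight per input position, so softmax runs over axis 1.
    attention_weights = softmax(score, axis=1)
    # Weighted sum of the encoder outputs over the time axis.
    context_vector = (attention_weights * eo).sum(axis=1)  # (batch, hidden)
    return context_vector, attention_weights


rng = np.random.default_rng(0)
batch_size, max_length, hidden_size, attention_units = 2, 6, 8, 10
eo = rng.normal(size=(batch_size, max_length, hidden_size))
h = rng.normal(size=(batch_size, hidden_size))
w1 = rng.normal(size=(hidden_size, attention_units))
w2 = rng.normal(size=(hidden_size, attention_units))
v = rng.normal(size=(attention_units, 1))

context_vector, attention_weights = additive_attention(eo, h, w1, w2, v)
print(context_vector.shape, attention_weights.shape)  # (2, 8) (2, 6, 1)
```

The attention weights sum to 1 across the positions of each question, which is what lets them be read as "how much each word contributes" to the context vector.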
Below is the final model which I used to get the output.
1. Here the input layer is fed to the embedding layer, whose weights are obtained from the GloVe embedding.
2. The embedding layer output is given to a bidirectional LSTM layer with 64 units. From it we obtain the LSTM output along with the forward and backward hidden and cell states.
3. The attention layer takes two inputs, the LSTM output and the hidden state. Here we pass both of them to an attention layer with 10 units.
4. The context vector is then fed to a dense layer of size 64, which produces the feature vector.
5. The feature vector is then fed to a dense layer with sigmoid activation to get the output label.
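The five steps above could look roughly like this in Keras. Treat it as a sketch rather than the exact model behind the reported scores: the sequence length, vocabulary size, ReLU activation, frozen embedding, and the choice to concatenate the forward and backward hidden states are my assumptions, and a random matrix stands in for the GloVe weights.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers


class BahdanauAttention(layers.Layer):
    """Additive attention: score = V(tanh(W1(features) + W2(hidden)))."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.w1 = layers.Dense(units)
        self.w2 = layers.Dense(units)
        self.v = layers.Dense(1)

    def call(self, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = self.v(tf.nn.tanh(self.w1(features) + self.w2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights


def build_model(max_len, embedding_matrix):
    vocab_size, embed_dim = embedding_matrix.shape
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    # Step 1: embedding layer initialised from the pre-trained matrix (frozen here by assumption).
    x = layers.Embedding(
        vocab_size,
        embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )(inputs)
    # Step 2: bidirectional LSTM returning the sequence output plus both directions' states.
    lstm_out, fwd_h, fwd_c, bwd_h, bwd_c = layers.Bidirectional(
        layers.LSTM(64, return_sequences=True, return_state=True)
    )(x)
    # Assumption: the hidden state given to attention is the two directions concatenated.
    state_h = layers.Concatenate()([fwd_h, bwd_h])
    # Step 3: attention layer with 10 units.
    context_vector, _ = BahdanauAttention(10)(lstm_out, state_h)
    # Step 4: dense layer of size 64 (the ReLU activation is an assumption).
    features = layers.Dense(64, activation="relu")(context_vector)
    # Step 5: sigmoid output for the binary label.
    outputs = layers.Dense(1, activation="sigmoid")(features)
    return Model(inputs, outputs)


# Random stand-in for the GloVe matrix: 1000 words, 300 dimensions.
embedding_matrix = np.random.default_rng(0).normal(size=(1000, 300)).astype("float32")
model = build_model(max_len=20, embedding_matrix=embedding_matrix)
model.compile(optimizer="adam", loss="binary_crossentropy")

predictions = model(tf.zeros((2, 20), dtype=tf.int32))
print(predictions.shape)  # (2, 1)
```

Because the output is a sigmoid probability, the final 0/1 label still depends on a decision threshold, which is usually tuned on validation data when the metric is F1-score.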
After the above process, the F1-scores obtained are shown below.
Private Score : 0.68211
Public Score : 0.67010
I conclude the case study with the scores obtained above. I ran the task on more than 25 ML and DL models, and the results obtained are shown below.
The full implementation of the above process can be found here in my GitHub repository. Please check it out; suggestions are welcome.