Sentiment Classification and Lie Detection (NLP)
Background and Goal
There are many machine learning solutions for detecting whether a person is lying. The goal of this mini project is to build sentiment classification and lie detection models using reviews of hotels in the United States. The detailed objectives are as follows.
- To select the best features to build the models using gain ratio scores.
- To understand the difference between sentiment classification and lie detection, and to determine which model performs better, using Multinomial Naive Bayes, SVM, Decision Tree, and Random Forest models.
Process and Conclusion
1. Data Preprocessing & Text Preprocessing
This part prepares and explores the given data set (reviews of US hotels) for the machine learning steps that follow.
- EDA: The data set has 35,437 reviews with the following three attributes.
  - `is_positive`: whether the review is positive or negative (positive = ‘y’, negative = ‘n’)
  - `Reviewer_score`: the review score on a 10-point scale
  - `review`: the actual review text (string values)
- Handling missing values: investigating and handling missing values
- Value transformation: converting the values of `is_positive` to numeric values
- Text Preprocessing: defining a function that implements text preprocessing using the following tools from the NLTK library.
- Sentence Tokenizer
- Stopwords
- Regexp Tokenizer
- WordNet Lemmatizer
- Porter Stemmer
2. Vectorization and Feature Selection
- This part vectorizes the text data using count vectorization and TF-IDF vectorization, then selects the best features by calculating gain ratio scores for the modeling step.
3. Data Modeling
- This part builds the sentiment classification and lie detection models on the data preprocessed in the previous parts.
- Sentiment Classification: MultinomialNB (Naive Bayes), SVM (Support Vector Machine)
- Lie Detection: Decision Tree, Random Forest
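A compact way to set up all four classifiers is with scikit-learn pipelines, as sketched below. The toy corpus, labels, and pipeline pairings are illustrative assumptions, not the project's actual data or tuning.

```python
# Illustrative setup for the four classifiers named above; the toy corpus
# and vectorizer pairings are assumptions for the sketch.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

train_reviews = ["great clean room", "rude staff dirty room",
                 "great location and staff", "dirty room and rude staff"]
train_labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

models = {
    "NB + TF-IDF":   make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM + counts":  make_pipeline(CountVectorizer(), LinearSVC()),
    "Decision Tree": make_pipeline(CountVectorizer(),
                                   DecisionTreeClassifier(random_state=0)),
    "Random Forest": make_pipeline(CountVectorizer(),
                                   RandomForestClassifier(random_state=0)),
}

for name, model in models.items():
    model.fit(train_reviews, train_labels)
    print(name, model.predict(["great staff", "dirty and rude"]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leakage when evaluating.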
4. Evaluating the Models
- This part compares the results of each algorithm and considers the next steps for improving the models.
- According to the above results, the sentiment classification models using Multinomial NB with TF-IDF (C-2) and SVM with count vectorization (D-1) have the best accuracy score, about 0.969, which suggests the models predict the labels correctly.
- However, the ROC AUC scores of these models are the worst among the models used in this research, about 0.5. Their confusion matrices likewise show the weakest ability to separate the data into positive and negative classes.
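The gap between high accuracy and a ~0.5 ROC AUC is exactly what a majority-class predictor produces on imbalanced labels. The toy numbers below are invented to illustrate that effect, not taken from the project's results.

```python
# Why accuracy alone can mislead: a model that always predicts the majority
# class scores high accuracy but shows no separation in ROC AUC or the
# confusion matrix. The toy labels and scores here are invented.
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

y_true  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]                 # 9 positive, 1 negative
y_pred  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]                 # always predicts positive
y_score = [0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.6, 0.7, 0.8, 0.8]

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks strong
print(roc_auc_score(y_true, y_score))     # 0.5 -- no real class separation
print(confusion_matrix(y_true, y_pred))   # negatives are never predicted
```

ROC AUC is computed from the scores, not the hard predictions, so it exposes the lack of ranking ability that the 0.969 accuracy hides.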
5. Conclusion
- Feature Selection using two vectorization techniques: count vectorization and TF-IDF
- Count vectorization produced results focused on sentiment words: its top 5 features are positive, negative, great, rude, and dirty, all emotional words. TF-IDF, however, focused on categories (the conditions we weigh when choosing a hotel) rather than emotions, because four of its top 5 features are room, staff, hotel, and location.
- Sentiment Classification and Lie Detection Models
- When using the features selected in this research, sentiment classification with the MultinomialNB model detects whether a review is a lie better than the dedicated lie detection models do. However, all models in this research scored poorly on distinguishing positive from negative values, even though their accuracy scores are good. For further improvement, we should consider how to obtain a balanced confusion matrix from this data set.
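One hedged direction for that follow-up is to reweight the classes during training so errors on the rare class cost more. The sketch below uses scikit-learn's `class_weight="balanced"` option on a toy imbalanced corpus of my own invention; resampling (over- or under-sampling) would be an alternative.

```python
# Sketch of class reweighting as one way toward a more balanced confusion
# matrix; the imbalanced toy corpus is an assumption for illustration.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# 8 positive reviews vs. 2 negative -- an imbalance like the one observed.
reviews = ["great clean room friendly staff"] * 8 + ["dirty room rude staff"] * 2
labels = [1] * 8 + [0] * 2

# class_weight='balanced' scales each class inversely to its frequency,
# so mistakes on the rare (negative) class are penalized more heavily.
clf = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(class_weight="balanced", random_state=0),
)
clf.fit(reviews, labels)
print(clf.predict(["dirty and rude", "great staff"]))
```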
This post is licensed under CC BY 4.0 by the author.