Sentiment analysis
All of the code used for this sentiment analysis tool can be found in this GitHub repository.
Introduction
This sentiment analysis tool was a little project with which I wanted to explore the applications of language models, specifically the application of n-gram models. My goal was to construct this tool from the ground up to gain a better understanding of the underlying theory, without relying on existing state-of-the-art tools (such as natural language processing or machine learning libraries) in the process.
The code I have written allows for the classification of a given sentence or text snippet as either positive or negative.
Language models
An n-gram model stores the frequencies of sequences that occur in some data; when applied to language, the units of these sequences can be words or characters. Here n denotes the length of the sequences in a specific model. For example, a 2-gram model (also called a bigram model) stores the frequencies of all two-word sequences in a given corpus or dataset.
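To make this concrete, here is a minimal sketch of how a bigram model can be built by counting word pairs. The function name, boundary markers, and toy corpus are purely illustrative and not taken from the repository.

```python
from collections import Counter

def build_bigram_counts(sentences):
    """Count word bigrams in a list of tokenized sentences (illustrative sketch)."""
    counts = Counter()
    for tokens in sentences:
        # Pad with boundary markers so the first and last words also occur in a bigram.
        padded = ["<s>"] + tokens + ["</s>"]
        for first, second in zip(padded, padded[1:]):
            counts[(first, second)] += 1
    return counts

# Toy "positive" corpus of tokenized review snippets.
positive_corpus = [
    "a wonderful film".split(),
    "a wonderful moving story".split(),
]
positive_bigrams = build_bigram_counts(positive_corpus)
print(positive_bigrams[("a", "wonderful")])  # 2
```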
For this sentiment analysis task, two n-gram models must be constructed: one describing texts of positive sentiment and one describing texts of negative sentiment. When a sentence is to be classified, the probability of that sentence is computed under both the positive language model and the negative language model, and the model which maximizes this probability is selected as the sentiment of the sentence.
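As a rough illustration of this decision rule, the sketch below scores a tokenized sentence under a bigram model and compares the scores from the two models. The add-one smoothing, function name, and parameters are assumptions made for the example and may differ from what the repository actually does.

```python
import math

def sentence_log_prob(tokens, bigram_counts, unigram_counts, vocab_size):
    """Log-probability of a tokenized sentence under a bigram model.
    Add-one smoothing (an assumption for this sketch) keeps unseen bigrams
    from producing a zero probability."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for first, second in zip(padded, padded[1:]):
        numerator = bigram_counts[(first, second)] + 1
        denominator = unigram_counts[first] + vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob

# Hypothetical usage, given bigram/unigram Counters built per sentiment:
# score_pos = sentence_log_prob(tokens, pos_bigrams, pos_unigrams, vocab_size)
# score_neg = sentence_log_prob(tokens, neg_bigrams, neg_unigrams, vocab_size)
# predicted = "positive" if score_pos > score_neg else "negative"
```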
Corpora
The data from which the two language models were constructed originate from reviews on the website Rotten Tomatoes, and can be found here.
Classification
In order to classify a given sentence, I have implemented a Naive Bayes classifier, which is a simple classification algorithm that utilizes Bayes' Theorem:
P(Ck|x) = P(x|Ck)P(Ck) / P(x)
In the formula above, Ck denotes class k (which can be positive or negative in this case) and x denotes the sentence that must be classified. With this information, we can see that in order to compute the posterior (the probability of class k given sentence x), we need to multiply the likelihood (the probability of the sentence given the class) by the prior (the probability of the class), and then divide by the evidence (the probability of the sentence). Since we are only interested in the class k that maximizes the posterior, we can omit the denominator on the right-hand side of the formula: it is constant across the different classes, so it does not affect which class maximizes the posterior.
P(Ck|x) ∝ P(x|Ck)P(Ck)
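Building on the sentence_log_prob sketch above, classification then amounts to picking the class with the largest log P(x|Ck) + log P(Ck), working in log space to avoid numerical underflow. The models and priors data structures below are hypothetical and only serve to illustrate the idea, not the repository's actual interface.

```python
import math

def classify(tokens, models, priors):
    """Return the class k that maximizes log P(x|Ck) + log P(Ck),
    i.e. the (log of the) numerator of Bayes' Theorem."""
    best_label, best_score = None, float("-inf")
    for label, model in models.items():
        score = math.log(priors[label]) + sentence_log_prob(
            tokens, model["bigrams"], model["unigrams"], model["vocab_size"]
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```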
Preliminary results
Currently, the classifier achieves a precision of 0.701 and a recall of 0.709.
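For reference, these two metrics follow the standard definitions in terms of true positives, false positives, and false negatives; the small helper below is only an illustrative sketch, not the evaluation code from the repository.

```python
def precision_recall(predicted, gold, positive="positive"):
    """Precision and recall for the positive class (standard definitions)."""
    tp = sum(p == positive and g == positive for p, g in zip(predicted, gold))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, gold))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```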
Future changes
Naturally, the preliminary results are not amazing, which is why I might try out some different approaches in the future, for instance: applying Kneser-Ney smoothing to the n-gram models; trying out a different type of classifier; using Word2Vec; using different datasets for training/testing; etc.