Methodology

Extracting and preparing the data

The original training dataset for this model was scraped from SkyTrax; we used an updated version that can be found here. Of the 20 columns in this CSV file, we chose the 'content' and 'recommended' columns as our reviews and labels, respectively, for training the model.
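The column selection can be sketched in plain Python as follows. This is a minimal illustration rather than the project's actual loading code: the sample rows, the `airline_name` column, the `load_reviews` helper, and the assumption that 'recommended' holds 0 or 1 are all hypothetical.

```python
import csv
import io

# Hypothetical two-row sample standing in for the SkyTrax CSV file;
# only the 'content' and 'recommended' column names come from the text above.
SAMPLE_CSV = """airline_name,content,recommended
Example Air,"Friendly crew, on time.",1
Example Air,"Lost my luggage, rude staff.",0
"""

def load_reviews(csv_file):
    """Return (review_text, label) pairs from the two chosen columns."""
    pairs = []
    for row in csv.DictReader(csv_file):
        text = row["content"].strip()
        if not text:
            continue  # skip rows without a review body
        # Assumption: 'recommended' is stored as 0 or 1.
        pairs.append((text, int(row["recommended"])))
    return pairs

reviews = load_reviews(io.StringIO(SAMPLE_CSV))
```

Passing an open file handle instead of the `io.StringIO` wrapper would read the same two columns from a file on disk.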

For testing, we streamed data through the Twitter API for 2 weeks and gathered about 5 GB of data in JSON format, consisting of about 410,000 tweets for 24 airlines.
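A rough sketch of how such a dump might be read is shown below. It assumes one JSON object per line with the tweet body in a `text` field; the airline names in `AIRLINES` and the `parse_tweets` helper are hypothetical, since the actual list of 24 airlines and the matching rule are not given here.

```python
import json

# Hypothetical airline handles; the real list of 24 airlines is not given here.
AIRLINES = ["exampleair", "sampleairways"]

def parse_tweets(lines):
    """Yield (airline, text) for each tweet that mentions a tracked airline."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # streamed dumps may contain blank lines
        record = json.loads(line)
        text = record.get("text")
        if text is None:
            continue  # not a tweet (for example, a delete notice)
        lowered = text.lower()
        for airline in AIRLINES:
            if airline in lowered:
                yield airline, text

sample = [
    '{"text": "Great flight with @ExampleAir today!"}',
    "",
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "Nothing about flying here"}',
]
tweets = list(parse_tweets(sample))
```

Because `parse_tweets` is a generator, it can be fed an open file object and will process a multi-gigabyte dump line by line without loading it into memory.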

Building the model

Our model was built and trained using the Support Vector Machine (SVM) implementation in Spark MLlib. SVMs are supervised machine learning algorithms that can be used for linear classification of data. In other words, given labeled training data, the algorithm outputs an optimal hyperplane that categorizes new examples. The Spark MLlib implementation lets us train the model in parallel across the cluster, which improves efficiency on large datasets.

The linear SVM implementation in Spark MLlib is a standard method for large-scale binary classification. By default, linear SVMs are trained with L2 regularization. The loss function is the hinge loss, where x is the feature vector, w is the weight vector, and y ∈ {+1, −1} is the label:
L(w; x, y) := max{0, 1 − y w^T x}
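As a small worked example, the loss above can be evaluated in plain Python. The weight and feature values below are made up for illustration, and the helper names `dot` and `hinge_loss` are not part of Spark MLlib.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w^T x) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * dot(w, x))

w = [1.0, -0.5]
x = [2.0, 1.0]                      # w^T x = 1.0*2.0 - 0.5*1.0 = 1.5

loss_correct = hinge_loss(w, x, +1)  # max(0, 1 - 1.5) = 0.0
loss_wrong = hinge_loss(w, x, -1)    # max(0, 1 + 1.5) = 2.5
```

The loss is zero once an example is on the correct side of the hyperplane with a margin of at least 1, and grows linearly the further it lies on the wrong side.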

Model optimization

The SVM model was trained using Stochastic Gradient Descent (SGD) as the optimization method. To further improve the model, we also used the Word2Vec feature of Spark MLlib. After repeated experiments with different parameter values, the best settings we found were 200 iterations for SGD and an embedding size of 128 for Word2Vec, resulting in a validation accuracy of about 68 percent.
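To make the role of SGD concrete, here is a minimal single-machine sketch that minimizes the hinge loss with an L2 penalty. It is not Spark MLlib's implementation: the function `train_svm_sgd`, the step size, the regularization strength, the decaying step schedule, and the toy data are all assumptions chosen for illustration. Only the 200 iterations come from the text above.

```python
import random

def train_svm_sgd(data, dim, iterations=200, step=0.1, reg=0.01, seed=0):
    """Fit w by SGD on hinge loss plus (reg/2)*||w||^2; labels are -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for t in range(1, iterations + 1):
        x, y = rng.choice(data)          # one random example per step
        lr = step / (t ** 0.5)           # decaying step size (sketch choice)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        for i in range(dim):
            grad = reg * w[i]            # gradient of the L2 term
            if margin < 1:               # hinge loss is active for this example
                grad -= y * x[i]
            w[i] -= lr * grad
    return w

# Toy, linearly separable data: (features, label).
toy = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -0.5], -1)]
w = train_svm_sgd(toy, dim=2)
```

Each step looks at a single example, which is what makes the method cheap per iteration. A distributed implementation would instead aggregate gradients computed over partitions of the data.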

NOTE: Word2Vec produces word embeddings, that is, it computes distributed vector representations of words.
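A linear classifier needs one fixed-length vector per review, so the word vectors have to be combined somehow. One common approach is to average them. The sketch below assumes that approach (the text above does not say how the vectors were combined) and uses made-up 3-dimensional embeddings instead of the 128-dimensional ones. The names `average_embedding` and `toy_embeddings` are hypothetical.

```python
def average_embedding(tokens, embeddings, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # out-of-vocabulary word, ignored in this sketch
        known += 1
        for i, v in enumerate(vec):
            total[i] += v
    if known == 0:
        return total
    return [v / known for v in total]

# Made-up embeddings for illustration only.
toy_embeddings = {"great": [1.0, 0.0, 2.0], "flight": [3.0, 2.0, 0.0]}
features = average_embedding("great flight today".split(), toy_embeddings, size=3)
```

Here "today" has no embedding and is skipped, so `features` is the mean of the two known word vectors.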

Predicting the Twitter sentiments

The data obtained from the Twitter stream is processed in parallel on our cluster and fed into the trained SVM model for prediction. The model makes its prediction from the value of w^T x: by default in Spark MLlib, a sentence is assigned to the positive class if w^T x ≥ 0 and to the negative class otherwise.
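The decision rule can be sketched as follows, assuming a threshold of 0 on w^T x (the default in Spark MLlib's SVM model) and the 1/0 encoding for positive/negative labels. The `predict` helper and the weight values are illustrative only.

```python
def predict(w, x, threshold=0.0):
    """Return 1 (positive) if w^T x >= threshold, else 0 (negative)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= threshold else 0

w = [1.0, -0.5]                      # illustrative weights, not trained values
positive = predict(w, [2.0, 1.0])    # score = 1.5
negative = predict(w, [-2.0, 1.0])   # score = -2.5
```

Raising the threshold makes the classifier more conservative about labeling a tweet as positive, which can be useful when the two classes are imbalanced.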

Ranking based on predictions

Storing the results

Visualization