Methodology

The original training dataset for this model was scraped from SkyTrax; we used an updated version that can be found here. Of the 20 columns in this CSV file, we chose the 'content' and 'recommended' columns as the review text and labels, respectively, for training the model.
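The column selection described above can be sketched in plain Python. This is only an illustration: the inline sample data, the `load_reviews` helper, and the assumption that 'recommended' is encoded as 0 or 1 are hypothetical and not taken from the actual dataset.

```python
import csv
import io

# Hypothetical stand-in for the SkyTrax CSV; the real file has 20 columns.
SAMPLE_CSV = """airline_name,content,recommended
Example Air,"Friendly crew, on time.",1
Example Air,"Lost my luggage.",0
Example Air,,1
"""

def load_reviews(handle):
    """Return (review_text, label) pairs from the 'content' and 'recommended' columns."""
    pairs = []
    for row in csv.DictReader(handle):
        text = (row.get("content") or "").strip()
        label = (row.get("recommended") or "").strip()
        # Assumption: labels are stored as "0" or "1"; skip incomplete rows.
        if not text or label not in {"0", "1"}:
            continue
        pairs.append((text, int(label)))
    return pairs

reviews = load_reviews(io.StringIO(SAMPLE_CSV))
```

With the real file, the same function would be called on an open file handle instead of the in-memory sample.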

For testing, we streamed data through the Twitter API for 2 weeks and gathered about 5 GB of data in JSON format, consisting of about 410,000 tweets for 24 airlines.
Our model was built and trained using the Support Vector Machine (SVM) implementation in Spark MLlib. Support Vector Machines are supervised machine learning algorithms that can be used for linear classification of data. In other words, given labeled training data, the algorithm outputs an optimal hyperplane that categorizes new examples. The Spark MLlib implementation lets us train the model in parallel across the cluster, which improves efficiency on large datasets.

The linear SVM implementation in Spark MLlib is a standard method for large-scale binary classification. By default, linear SVMs are trained with L2 regularization. The loss function is the hinge loss:
$$L(\mathbf{w}; \mathbf{x}, y) := \max\{0,\ 1 - y\,\mathbf{w}^T \mathbf{x}\}$$
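As a small worked example of this loss, the sketch below evaluates it for three points, assuming labels y in {-1, +1} as in the formula. The `hinge_loss` helper and the toy numbers are illustrative only.

```python
def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w^T x) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

w = [0.5, -1.0]
confident = hinge_loss(w, [4.0, 0.0], +1)  # margin 2.0: correct and outside the margin
inside = hinge_loss(w, [1.0, 0.0], +1)     # margin 0.5: correct but inside the margin
wrong = hinge_loss(w, [0.0, 1.0], +1)      # margin -1.0: misclassified
```

The loss is zero only when a point is classified correctly with a margin of at least 1; it grows linearly as the point moves to the wrong side of the hyperplane.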
The SVM model was trained using Stochastic Gradient Descent (SGD) as the optimizer. To further improve the model, we also used the Word2Vec feature of Spark MLlib. After repeated experiments with different parameter values, the best configuration was 200 SGD iterations with a Word2Vec embedding size of 128, which gave a validation accuracy of about 68 percent.
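To make the training step more concrete, here is a minimal single-machine sketch of SGD on the L2-regularized hinge loss. It is not Spark's implementation: the function name, the step-size schedule, the regularization value, the absence of a bias term, and the toy data are all assumptions chosen for illustration. Only the 200 iterations come from the text above.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_svm_sgd(data, num_iterations=200, step_size=1.0, reg_param=0.01, seed=0):
    """Train a linear SVM without bias; data is a list of (features, label), label in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for t in range(1, num_iterations + 1):
        x, y = rng.choice(data)          # one random example per step
        eta = step_size / (t ** 0.5)     # decaying step size (assumed schedule)
        active = y * dot(w, x) < 1.0     # hinge subgradient is nonzero only inside the margin
        for j in range(dim):
            grad_j = reg_param * w[j] - (y * x[j] if active else 0.0)
            w[j] -= eta * grad_j
    return w

# Toy, linearly separable data (hypothetical).
toy = [([2.0, 1.0], +1), ([1.5, 2.0], +1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
weights = train_svm_sgd(toy)
predictions = [1 if dot(weights, x) >= 0 else -1 for x, _ in toy]
```

In a distributed setting the gradient would be computed over a sampled batch spread across the cluster rather than a single example, but the update rule has the same shape.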

NOTE: Word2Vec produces word embeddings, that is, distributed vector representations of words.
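A linear classifier needs one fixed-length vector per review, so word embeddings have to be combined somehow. The text does not say how this was done; one common approach, shown here purely as an assumption, is to average the vectors of the words in a sentence. The `sentence_vector` helper and the 3-dimensional toy embeddings are hypothetical (the experiments above used 128 dimensions).

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[j] for vec in known) / len(known) for j in range(dim)]

# Toy embeddings for illustration only.
toy_embeddings = {"great": [1.0, 0.0, 2.0], "flight": [3.0, 2.0, 0.0]}
features = sentence_vector("great flight today".split(), toy_embeddings, dim=3)
unknown = sentence_vector(["delayed"], toy_embeddings, dim=3)
```

Words missing from the vocabulary ("today" in this example) are simply skipped, so they do not affect the average.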
The data obtained from Twitter streaming is then processed in parallel on our cluster and fed into the trained SVM model for prediction. The SVM model makes predictions based on the value of $\mathbf{w}^T \mathbf{x}$ and classifies each sentence as belonging to the positive or negative class.
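The decision rule can be sketched as a comparison of w^T x against a threshold. The `predict` helper, the threshold of 0, and the 1/0 output encoding are assumptions for illustration, not a copy of the model's code.

```python
def predict(w, x, threshold=0.0):
    """Return 1 (positive class) if w^T x exceeds the threshold, else 0 (negative class)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > threshold else 0

w = [0.5, -1.0]
positive = predict(w, [4.0, 0.0])  # w^T x = 2.0
negative = predict(w, [0.0, 1.0])  # w^T x = -1.0
```

Raising the threshold makes the classifier more conservative about labeling a tweet as positive.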
The score for each airline is calculated by taking the average over the positive reviews obtained from our model for that airline. The airline with the highest score has the best reviews from its customers on Twitter.
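The exact scoring formula is not spelled out above. One plausible reading, used here only as an assumption, is that an airline's score is the fraction of its tweets that the model classifies as positive. The `rank_airlines` helper and the sample data are hypothetical.

```python
from collections import defaultdict

def rank_airlines(predictions):
    """Rank airlines by share of positive predictions; input is (airline, label) pairs, label 1 or 0."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for airline, label in predictions:
        totals[airline] += 1
        positives[airline] += label
    scores = {airline: positives[airline] / totals[airline] for airline in totals}
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sample = [
    ("Airline A", 1), ("Airline A", 0), ("Airline A", 1),
    ("Airline B", 1), ("Airline B", 1),
    ("Airline C", 0),
]
ranking = rank_airlines(sample)
```

Under this reading, an airline with few tweets can rank first on a handful of positive predictions, so a minimum tweet count per airline may be worth considering.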
The results obtained after calculating the scores and ranking the airlines are stored in a Cassandra table, which is connected to the desktop version of Tableau.
The desktop version of Tableau is used to build the charts and graphs. These charts are then published to the public version of Tableau and finally embedded into the Conclusions page.