Project Summary

Gathering the data: 3 points

Gathering the testing data initially involved scraping Facebook, but the Facebook API did not let us collect the data required for our experiments. We instead gathered data through the Twitter API, which provided about 5 GB of data.

ETL: 2 points

The training dataset originally obtained from SkyTrax included junk columns in its CSV file. We had to clean this data and build RDDs to provide as input to our Word2Vec model. For testing, the data was obtained in JSON format through Twitter, and the processed data was fed into our model for prediction.
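The cleaning step described above can be sketched in plain Python. This is only an illustration, not the project's actual ETL code: the column names (`content`, `recommended`) and the sample rows are hypothetical, since the SkyTrax schema is not given in this summary. It keeps the review text and label, drops the junk columns, and skips rows with missing values.

```python
import csv
import io

# Hypothetical column names; the real SkyTrax schema is not listed in this summary.
TEXT_COLUMN = "content"
LABEL_COLUMN = "recommended"


def clean_reviews(csv_text):
    """Return (tokens, label) pairs, ignoring junk columns and incomplete rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        text = (row.get(TEXT_COLUMN) or "").strip()
        label = (row.get(LABEL_COLUMN) or "").strip()
        # Skip rows with no review text or with a label that is not 0/1.
        if not text or label not in {"0", "1"}:
            continue
        # Lowercase and split on whitespace to get a token list per review.
        cleaned.append((text.lower().split(), int(label)))
    return cleaned


# Made-up sample data; "airline_name" and "link" stand in for junk columns.
SAMPLE = (
    "airline_name,link,content,recommended\n"
    "Air Example,/reviews/1,Great crew and smooth flight,1\n"
    "Air Example,/reviews/2,,0\n"
    "Air Example,/reviews/3,Lost my luggage,0\n"
)

rows = clean_reviews(SAMPLE)
```

In a Spark job, a list of token sequences like this could then be parallelized into an RDD before being passed to Word2Vec.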

Problem: 3 points

The problem involves performing sentiment analysis on reviews written by airline customers about their experience. With our best-tuned model, we obtained a validation accuracy of 70 percent using a training dataset of about 41,000 reviews.
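For reference, validation accuracy is the fraction of held-out reviews whose predicted label matches the true label. The sketch below shows one way to hold out a validation set and compute that fraction. The 80/20 split, the seed, and the function names are assumptions for illustration; the summary does not say how the validation set was chosen.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle the examples and hold out a fraction for validation.

    The 0.2 fraction and the seed are illustrative assumptions.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)
```

With 10 held-out reviews and 7 correct predictions, `accuracy` returns 0.7, which corresponds to the 70 percent figure reported above.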

Algorithms: 3 points

The two main algorithms used in our model are the Word2Vec and Support Vector Machine implementations from Spark MLlib.
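To show how these two pieces can fit together, here is a minimal plain-Python stand-in. It is not the Spark MLlib code used in the project. It assumes each review is represented by the average of its word vectors (one common choice; the summary does not say how reviews were vectorized), and it trains a linear SVM by sub-gradient descent on the hinge loss. The two-dimensional word vectors are made up by hand for the example, whereas a real Word2Vec model learns them from the corpus.

```python
# Hand-made toy vectors standing in for learned Word2Vec embeddings (assumption).
WORD_VECTORS = {
    "great": [1.0, 0.2],
    "smooth": [0.8, 0.1],
    "lost": [-0.9, 0.3],
    "rude": [-1.0, 0.1],
}


def average_vector(tokens, word_vectors):
    """Represent a review as the mean of its known word vectors."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(features, labels, epochs=100, step=0.1, reg=0.01):
    """Linear SVM via sub-gradient descent on the L2-regularized hinge loss.

    Labels are 0/1 and are mapped to -1/+1 internally.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            sign = 1.0 if y == 1 else -1.0
            margin = sign * (dot(weights, x) + bias)
            # Shrink the weights (gradient of the L2 penalty).
            weights = [w * (1.0 - step * reg) for w in weights]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                weights = [w + step * sign * xi for w, xi in zip(weights, x)]
                bias += step * sign
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (positive sentiment) if the score is above zero, else 0."""
    return 1 if dot(weights, x) + bias > 0 else 0


# Tiny made-up training set: token lists with 0/1 sentiment labels.
reviews = [(["great", "smooth"], 1), (["lost", "rude"], 0), (["great"], 1), (["rude"], 0)]
features = [average_vector(tokens, WORD_VECTORS) for tokens, _ in reviews]
labels = [label for _, label in reviews]
weights, bias = train_linear_svm(features, labels)
```

A new review is scored the same way: tokenize it, average its word vectors with `average_vector`, and pass the result to `predict`. Words missing from the vocabulary, such as "flight" here, are simply ignored.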