Hurdles

Extracting live social media data

Our initial plan was to obtain reviews from each airline's Facebook profile page. However, gathering this data dynamically proved to be a major challenge even though the data is public. We tried several Python libraries, Scrapy, regular expressions, and the Facebook APIs, but could not obtain the required dataset. Manual extraction would have been too time-consuming and tedious.

The solution was to switch to extracting the reviews from the airlines' Twitter accounts, using their handles and the Twitter APIs. With this approach, we gathered about 5 GB of data from 24 airlines in 2 weeks.

Twitter data can also be collected using Spark Streaming, but it offers APIs for this only in Java and Scala. We chose to write a standalone Python program instead, since the amount of data collected was small enough to be handled by a single machine.
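A standalone collector of this kind can be sketched as follows. This is only an illustration under assumptions, not the program we ran: the function name `collect_tweets`, the one-file-per-handle JSON Lines layout, and the `fetch` callable are all hypothetical. In practice `fetch` would wrap whichever Twitter API client is used; it is passed in here so the sketch stays independent of any particular client library.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def collect_tweets(
    handles: Iterable[str],
    fetch: Callable[[str], Iterable[dict]],
    out_dir: Path,
) -> Dict[str, int]:
    """Append the tweets fetched for each handle to <out_dir>/<handle>.jsonl.

    `fetch` is a hypothetical callable returning tweet dicts for one handle.
    Returns the number of tweets written per handle.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    counts: Dict[str, int] = {}
    for handle in handles:
        # One file per airline, named after the handle without the "@".
        path = out_dir / f"{handle.lstrip('@')}.jsonl"
        written = 0
        # Append mode lets repeated runs accumulate data over several days.
        with path.open("a", encoding="utf-8") as f:
            for tweet in fetch(handle):
                f.write(json.dumps(tweet, ensure_ascii=False) + "\n")
                written += 1
        counts[handle] = written
    return counts
```

Writing one JSON object per line keeps the output easy to load later, for example with Spark's JSON reader, without holding the whole dataset in memory during collection.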

Choosing the right model

While choosing a model for sentiment analysis, one of the main challenges was the small number of machine learning algorithms available in Spark MLlib. On further research, we found that this is a very new field and that many machine learning algorithms are still being developed for Spark. Aiming for high accuracy, we tried to implement a convolutional neural network in TensorFlow by following the instructions in [1], but ran into a problem: even though TensorFlow has a distributed version, it cannot work with Spark. We also found a very useful library called tensorspark [2], which fills the gap left by the lack of a system that integrates TensorFlow with Spark, although it does not perform as well as distributed TensorFlow. Even though the tensorspark project is very new, we were able to run it on the cluster for a few examples. However, the library requires many changes inside the TensorFlow model, which would have made integrating our model with the library highly time-consuming. As a result, we decided to run our experiments with the SVM and Word2Vec models that are available in Spark MLlib.
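To make the combination of the two models concrete, the sketch below shows one common way word embeddings and a linear SVM fit together: the word vectors of a tweet are averaged into a single feature vector, and the SVM classifies that vector by the sign of a weighted sum. It is a pure-Python illustration, not the Spark MLlib API and not our experimental code; the function names, the toy embeddings, and the weights are made up for the example.

```python
from typing import Dict, List, Sequence


def average_vector(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the embeddings of the tokens that have one.

    Returns a zero vector when no token is in the vocabulary.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        for i, value in enumerate(vec):
            total[i] += value
        known += 1
    if known == 0:
        return total
    return [value / known for value in total]


def svm_predict(features: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Linear SVM decision rule: class 1 if w.x + b >= 0, else class 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


# Toy 2-dimensional embeddings and SVM parameters (illustrative values only).
toy_embeddings = {
    "great": [1.0, 0.0],
    "flight": [0.0, 1.0],
    "delayed": [-1.0, 0.0],
}
toy_weights = [1.0, 0.0]
toy_bias = 0.0

positive = svm_predict(average_vector(["great", "flight"], toy_embeddings, 2), toy_weights, toy_bias)
negative = svm_predict(average_vector(["delayed", "flight"], toy_embeddings, 2), toy_weights, toy_bias)
```

In a real pipeline the embeddings would be learned by Word2Vec from the tweet corpus and the weights and bias would be fitted by the SVM on labeled examples.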

[1] http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

[2] https://github.com/adatao/tensorspark