Data

Project Summary

Getting the data: Acquiring/gathering/downloading - 2 points

The data was gathered from the Yelp Dataset Challenge.

ETL: Extract-Transform-Load work and cleaning the data set - 3 points

The data was initially in JSON format; only the necessary fields were extracted and stored in Cassandra tables. The records contained information about all the businesses, which had to be transformed into vectors for the various machine learning algorithms.
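As an illustration of the extraction step, here is a minimal sketch in plain Python that keeps only selected fields from line-delimited JSON. The field names in `KEEP_FIELDS` and the sample records are assumptions made for the example, not the fields actually stored, and the write to Cassandra is left out.

```python
import json

# Hypothetical subset of fields to keep; the original does not list them.
KEEP_FIELDS = ("business_id", "name", "city", "stars")

def extract_fields(json_lines, fields=KEEP_FIELDS):
    """Parse one JSON object per line and keep only the selected fields."""
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing fields become None so every row has the same columns.
        rows.append({f: record.get(f) for f in fields})
    return rows

# Made-up sample records, for illustration only.
sample = [
    '{"business_id": "b1", "name": "Cafe A", "city": "Vancouver", "stars": 4.5, "extra": 1}',
    '{"business_id": "b2", "name": "Diner B", "stars": 3.0}',
]
rows = extract_fields(sample)
```

Each resulting row has the same set of keys, which makes it straightforward to map onto a fixed table schema.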

Problem: Work on defining problem itself and motivation for the analysis - 1 point

The problem involved building a user profile through sentiment analysis of the user's tweets and using collaborative filtering to find results that match their tastes. The next step was to filter those results by the user's current location and the current day.
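The final filtering step could look roughly like the sketch below, which keeps only businesses that are open on a given day and lie within a radius of the user. The record layout (`lat`, `lon`, `open_days`), the 5 km default radius, and the sample data are all assumptions for illustration; the project's actual filtering logic is not described here.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_nearby_open(businesses, lat, lon, day, max_km=5.0):
    """Keep businesses open on `day` and within `max_km` of (lat, lon)."""
    return [
        b for b in businesses
        if day in b["open_days"]
        and haversine_km(lat, lon, b["lat"], b["lon"]) <= max_km
    ]

# Made-up recommendations, for illustration only.
recommended = [
    {"name": "A", "lat": 49.285, "lon": -123.11, "open_days": {"Monday", "Tuesday"}},
    {"name": "B", "lat": 49.28, "lon": -123.12, "open_days": {"Saturday"}},
    {"name": "C", "lat": 43.65, "lon": -79.38, "open_days": {"Monday"}},
]
nearby = filter_nearby_open(recommended, 49.28, -123.12, "Monday")
names = [b["name"] for b in nearby]
```

Here "B" is dropped because it is closed on the requested day and "C" because it is too far away.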

Algorithmic work: Work on the algorithms needed to work with the data, including integrating data mining and machine learning techniques - 4 points

The algorithms used include TF-IDF, Bayes' theorem, and ALS (alternating least squares).
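To make the TF-IDF weighting concrete, here is a minimal sketch in plain Python. It assumes raw term counts and an unsmoothed idf of log(N / df); libraries often use smoothed or normalized variants, so the exact numbers produced in the project may differ. The sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Made-up review snippets, for illustration only.
docs = ["great pizza great service", "slow service", "great coffee"]
vectors = tf_idf(docs)
```

A term that appears in only one document, such as "pizza", gets a higher idf than a term shared across documents, such as "service".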

Bigness/parallelization: Efficiency of the analysis on a cluster, and scalability to larger data sets - 2 points

Parallelization techniques such as broadcast and caching were used. The collaborative filtering step takes into account 2.7M reviews, utilising the cluster efficiently for all computation.

UI: User interface to the results, possibly including web or data exploration frontends - 3 points

The UI has been implemented as a website based on HTML5, Bootstrap, and CSS in order to visualize the results and showcase information about our project.

Visualization: Visualization of analysis results - 3 points

The visualization has been implemented in Tableau by connecting it to Spark Cassandra dynamically, and the resulting charts were integrated into the HTML website.

Technologies: New technologies learned as part of doing the project - 2 points

Some of the new technologies learnt as part of this project are Spark MLlib, Tableau, HTML5 and Bootstrap CSS.