Project Summary

  1. Getting the data: Acquiring/gathering/downloading - 2 points
  2. The data was gathered from the Yelp Dataset Challenge.

  3. ETL: Extract-Transform-Load work and cleaning the data set - 3 points
  4. The data was initially in JSON format, and only the necessary fields were stored in Cassandra tables. These tables held information on all the businesses, which had to be transformed into vectors for the various machine learning algorithms.
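As a rough illustration of the extraction step, the sketch below keeps only a few fields from each JSON record before it would be written to a table. The field names (`business_id`, `name`, `categories`, `stars`) and the helper `extract_fields` are assumptions made for illustration; the project's actual schema is not listed here, and the write to Cassandra is left out.

```python
import json

# Hypothetical subset of fields to keep; the real table schema may differ.
KEEP_FIELDS = ("business_id", "name", "categories", "stars")

def extract_fields(json_line, keep=KEEP_FIELDS):
    """Parse one JSON record and keep only the selected fields."""
    record = json.loads(json_line)
    # Missing fields become None so that every row has the same columns.
    return {field: record.get(field) for field in keep}

# Made-up sample record, not taken from the dataset.
sample = '{"business_id": "b1", "name": "Cafe", "stars": 4.5, "city": "Phoenix", "categories": ["Coffee"]}'
row = extract_fields(sample)
print(row)
```

Dropping unused fields this early keeps the stored rows small, which matters once the same step runs over every record in the data set.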

  5. Problem: Work on defining problem itself and motivation for the analysis - 1 point
  6. The problem involved building a user profile through sentiment analysis of the user's tweets and using collaborative filtering to get results that match the user's taste. The next step was to filter the results based on the user's current location and the day.
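The final filtering step can be sketched as follows. This is a minimal example under stated assumptions: the record layout (`lat`, `lon`, `open_days`), the 5 km radius, and the function names are hypothetical, and the haversine formula is used here only as one reasonable way to measure distance.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def filter_recommendations(businesses, user_lat, user_lon, day, max_km=5.0):
    """Keep businesses open on `day` and within `max_km` of the user."""
    return [
        b for b in businesses
        if day in b["open_days"]
        and haversine_km(user_lat, user_lon, b["lat"], b["lon"]) <= max_km
    ]

# Made-up candidates standing in for collaborative filtering output.
candidates = [
    {"name": "A", "lat": 33.45, "lon": -112.07, "open_days": {"Monday", "Tuesday"}},
    {"name": "B", "lat": 33.45, "lon": -112.07, "open_days": {"Sunday"}},
    {"name": "C", "lat": 36.17, "lon": -115.14, "open_days": {"Monday"}},
]
nearby = filter_recommendations(candidates, 33.45, -112.07, "Monday")
print([b["name"] for b in nearby])
```

Here "B" is dropped because it is closed on the requested day and "C" because it is far outside the radius.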

  7. Algorithmic work: Work on the algorithms needed to work with the data, including integrating data mining and machine learning techniques - 4 points
  8. The algorithms used include TF-IDF, Bayes' theorem, and ALS (alternating least squares).
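To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It uses one common variant (term frequency normalized by the most frequent term in the document, and IDF as log2 of N over document frequency); the project itself may have used a different variant or a library implementation, and the sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            # TF normalized by the most frequent term; IDF = log2(N / df).
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented, already tokenized review snippets.
docs = [
    ["great", "coffee", "great", "service"],
    ["slow", "service"],
    ["great", "pizza"],
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "coffee" ends up with a higher weight than "service" even though both occur once, because "coffee" appears in fewer documents overall.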

  9. Bigness/parallelization: Efficiency of the analysis on a cluster, and scalability to larger data sets - 2 points
  10. The parallelization techniques used include broadcast variables and caching. Collaborative filtering takes into account 2.7M reviews, utilising the cluster efficiently for all computation.

  11. UI: User interface to the results, possibly including web or data exploration frontends - 3 points
  12. The UI has been implemented as a website based on HTML5, Bootstrap and CSS in order to visualize the results and showcase information about our project.

  13. Visualization: Visualization of analysis results - 3 points
  14. The visualization has been implemented in Tableau by connecting it dynamically to Spark Cassandra, and the resulting charts were integrated into the HTML website.

  15. Technologies: New technologies learned as part of doing the project - 2 points
  16. Some of the new technologies learnt as part of this project are Spark MLlib, Tableau, HTML5 and Bootstrap CSS.