This was the final project for my Data Mining class during my graduate studies. We did not have much time, so I wanted to be sure we picked a project that was doable in that small timeframe. My teammate and I had a few ideas, but we decided on analyzing trends in movies. The basic model was doable in the time we had, and it left a lot of room for expansion if we had more time.
Most of the work on this project was gathering the data. I found three different data sources. A lot of their data either overlapped or conflicted, so there was a lot of picking and choosing which data points we were going to use. Many of the features were particularly difficult because we had to organize them in a form the model could read. We mostly did this through one-hot encoding, ending up with about 2000 features. We took some additional approaches for certain features, particularly the release day of the movies.
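To give a feel for what this kind of encoding looks like, here is a minimal sketch in plain Python. It is not the project's actual code: the records, the field names (`genre`, `release_date`), and the helper functions `one_hot` and `release_features` are all made up for illustration, and splitting the release date into a month and a weekday is only one assumed way of handling the release day.

```python
from datetime import date

# Toy records; field names and values are hypothetical, not the project's data.
movies = [
    {"title": "A", "genre": "Drama", "release_date": "2015-07-03"},
    {"title": "B", "genre": "Comedy", "release_date": "2016-12-23"},
    {"title": "C", "genre": "Drama", "release_date": "2017-03-10"},
]


def one_hot(records, field):
    """Return, for each record, one 0/1 column per distinct value of `field`."""
    categories = sorted({r[field] for r in records})
    return [
        {f"{field}_{c}": int(r[field] == c) for c in categories}
        for r in records
    ]


def release_features(iso_date):
    """Split an ISO date string into numeric month and weekday features."""
    d = date.fromisoformat(iso_date)
    # weekday() counts from Monday = 0 to Sunday = 6.
    return {"release_month": d.month, "release_weekday": d.weekday()}


# Combine the encoded columns into one flat feature row per movie.
rows = []
for movie, genre_cols in zip(movies, one_hot(movies, "genre")):
    row = {"title": movie["title"]}
    row.update(genre_cols)
    row.update(release_features(movie["release_date"]))
    rows.append(row)
```

Each distinct category value becomes its own 0/1 column, which is how a handful of categorical fields with many possible values (cast, crew, genres, and so on) can quickly grow into thousands of features.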
The model ended up being very accurate for two of its three targets. It was easily able to estimate how many people would rate a movie and the movie's average rating. The total revenue predictions, however, had some big outliers.
We found evidence of bias in the data. Among the most influential features, the share of female cast and crew was not significant when determining vote count and rating, but it was among the top influencers when determining revenue. This suggests that movies with a higher percentage of female cast and crew match the quality of other movies but do not make as much money. This is speculation, but it leads me to believe that women are not given as many opportunities as men on high-grossing films.