The Birth of the motion picture camera in the late 18th century gave birth to the most powerful form of entertainment available: Cinema. Movies have been able to entertain audiences from the emergence of a single second of horse racing in the 1890s to the introduction of sound in the 1920s to the birth of color in the 1930s to create 3D Movies in early 2010.
Cinema had a humble background in terms of design, direction and acting (especially due to its very short time in its early days) but since then, the film industry around the world has been blessed with creative talent that could be directors, screenwriters, actors, sound designers and cinematographers. It has also spread across a wide range of genres from romance to comedy to science fiction and horror.
As with all children born in the last century, movies are a marvel. I was addicted to it. And I have always wanted to know more about the mysterious world of cinema. In this booklet, I will try and find some information about using data. For us, we have a data set of approximately 45000 metadata movies collected from TMDB. Using this data, we will try and answer various questions that I have always had about movies.
In this notebook, I will also try and build the following:
- A Regressor that can, to a certain degree, predict the revenue of a particular movie.
- A Classifier that identifies if a movie will be a hit or will make the producers lose money.
Importing Libraries and Loading Our Data
%matplotlib inline
from IPython.display import Image, HTML
import json
import datetime
import ast
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor
from wordcloud import WordCloud, STOPWORDS
import plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
warnings.filterwarnings('ignore')
plotly.tools.set_credentials_file(username='geekycode', api_key=[api_key])
sns.set_style('whitegrid')
sns.set(font_scale=1.25)
pd.set_option('display.max_colwidth', 50)
df = pd.read_csv('../input/movies_metadata.csv')
df.head().transpose()
df = pd.read_csv('../input/movies_metadata.csv')
df.head().transpose()
0 | 1 | 2 | 3 | ||
---|---|---|---|---|---|
adult | False | False | False | False | False |
belongs_to_collection | {‘id’: 10194, ‘name’: ‘Toy Story Collection’, … | NaN | {‘id’: 119050, ‘name’: ‘Grumpy Old Men Collect… | NaN | {‘id’: 96871, ‘name’: ‘Father of the Bride Col… |
budget | 30000000 | 65000000 | 0 | 16000000 | 0 |
genres | [{‘id’: 16, ‘name’: ‘Animation’}, {‘id’: 35, ‘… | [{‘id’: 12, ‘name’: ‘Adventure’}, {‘id’: 14, ‘… | [{‘id’: 10749, ‘name’: ‘Romance’}, {‘id’: 35, … | [{‘id’: 35, ‘name’: ‘Comedy’}, {‘id’: 18, ‘nam… | [{‘id’: 35, ‘name’: ‘Comedy’}] |
homepage | http://toystory.disney.com/toy-story | NaN | NaN | NaN | NaN |
id | 862 | 8844 | 15602 | 31357 | 11862 |
imdb_id | tt0114709 | tt0113497 | tt0113228 | tt0114885 | tt0113041 |
original_language | en | en | en | en | en |
original_title | Toy Story | Jumanji | Grumpier Old Men | Waiting to Exhale | Father of the Bride Part II |
overview | Led by Woody, Andy’s toys live happily in his … | When siblings Judy and Peter discover an encha… | A family wedding reignites the ancient feud be… | Cheated on, mistreated and stepped on, the wom… | Just when George Banks has recovered from his … |
popularity | 21.9469 | 17.0155 | 11.7129 | 3.85949 | 8.38752 |
poster_path | /rhIRbceoE9lR4veEXuwCC2wARtG.jpg | /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg | /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg | /16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg | /e64sOI48hQXyru7naBFyssKFxVd.jpg |
production_companies | [{‘name’: ‘Pixar Animation Studios’, ‘id’: 3}] | [{‘name’: ‘TriStar Pictures’, ‘id’: 559}, {‘na… | [{‘name’: ‘Warner Bros.’, ‘id’: 6194}, {‘name’… | [{‘name’: ‘Twentieth Century Fox Film Corporat… | [{‘name’: ‘Sandollar Productions’, ‘id’: 5842}… |
production_countries | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… |
release_date | 1995-10-30 | 1995-12-15 | 1995-12-22 | 1995-12-22 | 1995-02-10 |
revenue | 3.73554e+08 | 2.62797e+08 | 0 | 8.14522e+07 | 7.65789e+07 |
runtime | 81 | 104 | 101 | 127 | 106 |
spoken_languages | [{‘iso_639_1’: ‘en’, ‘name’: ‘English’}] | [{‘iso_639_1’: ‘en’, ‘name’: ‘English’}, {‘iso… | [{‘iso_639_1’: ‘en’, ‘name’: ‘English’}] | [{‘iso_639_1’: ‘en’, ‘name’: ‘English’}] | [{‘iso_639_1’: ‘en’, ‘name’: ‘English’}] |
status | Released | Released | Released | Released | Released |
tagline | NaN | Roll the dice and unleash the excitement! | Still Yelling. Still Fighting. Still Ready for… | Friends are the people who let you be yourself… | Just When His World Is Back To Normal… He’s … |
title | Toy Story | Jumanji | Grumpier Old Men | Waiting to Exhale | Father of the Bride Part II |
video | False | False | False | False | False |
vote_average | 7.7 | 6.9 | 6.5 | 6.1 | 5.7 |
vote_count | 5415 | 2413 | 92 | 34 | 173 |
Understanding the Data Set
The above database was obtained through the TMDB API. Movies available on this database are compatible with movies in the latest Movie Lens Full Data collection covering 26 million ratings out of 45,000 movies from 27,000 users. Let’s take a look at the features that are available to us.
df.columns
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'], dtype='object')
Features
- adult: Indicates if the movie is X-Rated or Adult.
- belongs_to_collection: A stringified dictionary that gives information on the movie series the particular film belongs to.
- budget: The budget of the movie in dollars.
- genres: A stringified list of dictionaries that list out all the genres associated with the movie.
- homepage: The Official Homepage of the move.
- id: The ID of the move.
- imdb_id: The IMDB ID of the movie.
- original_language: The language in which the movie was originally shot in.
- original_title: The original title of the movie.
- overview: A brief blurb of the movie.
- popularity: The Popularity Score assigned by TMDB.
- poster_path: The URL of the poster image.
- production_companies: A stringified list of production companies involved with the making of the movie.
- production_countries: A stringified list of countries where the movie was shot/produced in.
- release_date: Theatrical Release Date of the movie.
- revenue: The total revenue of the movie in dollars.
- runtime: The runtime of the movie in minutes.
- spoken_languages: A stringified list of spoken languages in the film.
- status: The status of the movie (Released, To Be Released, Announced, etc.)
- tagline: The tagline of the movie.
- title: The Official Title of the movie.
- video: Indicates if there is a video present of the movie with TMDB.
- vote_average: The average rating of the movie.
- vote_count: The number of votes by users, as counted by TMDB.
df.shape
(45466, 24)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 45466 entries, 0 to 45465 Data columns (total 24 columns): adult 45466 non-null object belongs_to_collection 4494 non-null object budget 45466 non-null object genres 45466 non-null object homepage 7782 non-null object id 45466 non-null object imdb_id 45449 non-null object original_language 45455 non-null object original_title 45466 non-null object overview 44512 non-null object popularity 45461 non-null object poster_path 45080 non-null object production_companies 45463 non-null object production_countries 45463 non-null object release_date 45379 non-null object revenue 45460 non-null float64 runtime 45203 non-null float64 spoken_languages 45460 non-null object status 45379 non-null object tagline 20412 non-null object title 45460 non-null object video 45460 non-null object vote_average 45460 non-null float64 vote_count 45460 non-null float64 dtypes: float64(4), object(20) memory usage: 8.3+ MB
There are a total of 45,466 movies with 24 features. Most of the features have very few NaN values (apart from homepage and tagline). We will attempt at cleaning this dataset to a form suitable for analysis in the next section.
Data Wrangling
The data that was originally obtained was in the form of a JSON File. This was converted manually into a CSV file to arrive at an input that could be loaded into a Pandas DataFrame effortlessly. In other words, the dataset we have in our hands is already relatively clean. We will however attempt at learning more about our features and performing appropriate wrangling steps to arrive at a form that is more suitable for analysis.
Let us start by removing the features that are not useful to us.
df = df.drop(['imdb_id'], axis=1)
df[df['original_title'] != df['title']][['title', 'original_title']].head()
title | original_title |
---|---|
The City of Lost Children | La Cité des Enfants Perdus |
Shanghai Triad | 摇啊摇,摇到外婆桥 |
Wings of Courage | Guillaumet, les ailes du courage |
The Postman | Il postino |
The Confessional | Le confessional |
The original title refers to the title of the movie in the original language in which the movie was shot. Thus, I would prefer to use the translated word, Shortened in this analysis and therefore, I will discard the real titles completely. We will be able to find out if the movie is a foreign language film by looking at the language – original language so that no material information is lost in doing so.
df = df.drop('original_title', axis=1)
df[df['revenue'] == 0].shape
(38052, 22)
linkcode
We see that the majority of the movies have recorded revenue of 0. This indicates that we do not have information about the total revenue for these movies. Although this forms the majority of the movies available to us, we will still use revenue as an extremely important feature going forward from the remaining 7000 moves.
df['revenue'] = df['revenue'].replace(0, np.nan)
The budget feature has some unclean values that make Pandas assign it as a generic object. We proceed to convert this into a numeric variable and replace all the non-numeric values with NaN. Finally, as with budget, we will convert all the values of 0 with NaN to indicate the absence of information regarding the budget.
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['budget'] = df['budget'].replace(0, np.nan)
df[df['budget'].isnull()].shape
(36576, 22)
As we move forward trying to answer certain questions, we will have to construct several features suitable for that particular query. For now, we will construct two very important features:
- year: The year in which the movie was released.
- return: The ratio of revenue to budget.
The returning feature is more comprehensive as it will give us a more accurate picture of the film’s financial success. At the moment, our data cannot judge whether a $ 200 million budget movie that earned $ 100 million did better than a $ 50,000 budget film that took $ 200,000. This feature will be able to capture that information.
Refund value> 1 may indicate a gain while refund value <1 would indicate a loss.
df['return'] = df['revenue'] / df['budget']
df[df['return'].isnull()].shape
(40085, 23)
linkcode
We have close to 5000 movies for which we have data on revenue and budget ratio. This is close to 10% of the entire dataset. Although this may seem small, this is enough to perform very useful analysis and discover interesting insights about the world of movies.
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)
df['adult'].value_counts()
False 45454 True 9 - Written by Ørnås 1 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all. 1 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche. 1 Name: adult, dtype: int64
There are close to 0 adult movies in this dataset. The adult feature, therefore, is not of much use to us and can be safely dropped.
df = df.drop('adult', axis=1)
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df['poster_path'] = "<img src='" + base_poster_url + df['poster_path'] + "' style='height:100px;'>"
Exploratory Data Analysis
Title and Overview Wordclouds
Are there certain words that figure more often in Movie Titles and Movie Blurbs? I suspect there are some words that are considered more potent and considered more worthy of a title. Let us find out!
df['title'] = df['title'].astype('str')
df['overview'] = df['overview'].astype('str')
title_corpus = ' '.join(df['title'])
overview_corpus = ' '.join(df['overview'])
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

The word Love is the most commonly used word in movie titles. Girl, Day, and Man are also among the most commonly occurring words. I think this encapsulates the idea of the ubiquitous presence of romance in movies pretty well.
overview_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(overview_corpus)
plt.figure(figsize=(16,8))
plt.imshow(overview_wordcloud)
plt.axis('off')
plt.show()

Life is the most commonly used word in Movie titles. One and Find are also popular in Movie Blurbs. Together with Love, Man, and Girl, these word clouds give us a pretty good idea of the most popular themes present in movies.
Production Countries
The Full MovieLens Dataset consists of movies that are overwhelmingly in the English language (more than 31000). However, these movies may have been shot in various locations around the world. It would be interesting to see which countries serve as the most popular destinations for shooting movies by filmmakers, especially those in the United States of America and the United Kingdom.
df['production_countries'] = df['production_countries'].fillna('[]').apply(ast.literal_eval)
df['production_countries'] = df['production_countries'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
s = df.apply(lambda x: pd.Series(x['production_countries']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'countries'
con_df = df.drop('production_countries', axis=1).join(s)
con_df = pd.DataFrame(con_df['countries'].value_counts())
con_df['country'] = con_df.index
con_df.columns = ['num_movies', 'country']
con_df = con_df.reset_index().drop('index', axis=1)
con_df.head(10)
num_movies | ||
---|---|---|
0 | 21153 | United States of America |
1 | 4094 | United Kingdom |
2 | 3940 | France |
3 | 2254 | Germany |
4 | 2169 | Italy |
5 | 1765 | Canada |
6 | 1648 | Japan |
7 | 964 | Spain |
8 | 912 | Russia |
9 | 828 | India |
con_df = con_df[con_df['country'] != 'United States of America']
data = [ dict(
type = 'choropleth',
locations = con_df['country'],
locationmode = 'country names',
z = con_df['num_movies'],
text = con_df['country'],
colorscale = [[0,'rgb(255, 255, 255)'],[1,'rgb(255, 0, 0)']],
autocolorscale = False,
reversescale = False,
marker = dict(
line = dict (
color = 'rgb(180,180,180)',
width = 0.5
) ),
colorbar = dict(
autotick = False,
tickprefix = '',
title = 'Production Countries'),
) ]
layout = dict(
title = 'Production Countries for the MovieLens Movies (Apart from US)',
geo = dict(
showframe = False,
showcoastlines = False,
projection = dict(
type = 'Mercator'
)
)
)
fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )

Unsurprisingly, the United States is a very popular movie production destination as our data set usually contains English movies. Europe is also the most famous region with the UK, France, Germany, and Italy in the top 5. Japan and India are Asian countries best known when it comes to movie production.
Franchise movies
Now let’s take a brief look at the Franchise movies. I was looking forward to finding the longest-running franchises and the most successful franchises among many other things. Let’s combine our data to find out!
df_fran = df[df['belongs_to_collection'].notnull()]
df_fran['belongs_to_collection'] = df_fran['belongs_to_collection'].apply(ast.literal_eval).apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)
df_fran = df_fran[df_fran['belongs_to_collection'].notnull()]
fran_pivot = df_fran.pivot_table(index='belongs_to_collection', values='revenue', aggfunc={'revenue': ['mean', 'sum', 'count']}).reset_index()
Highest Grossing Movie Franchises
fran_pivot.sort_values('sum', ascending=False).head(10)
belongs_to_collection | count | mean | sum | |
---|---|---|---|---|
552 | Harry Potter Collection | 8 | 9.634209e+08 | 7.707367e+09 |
1160 | Star Wars Collection | 8 | 9.293118e+08 | 7.434495e+09 |
646 | James Bond Collection | 26 | 2.733450e+08 | 7.106970e+09 |
1317 | The Fast and the Furious Collection | 8 | 6.406373e+08 | 5.125099e+09 |
968 | Pirates of the Caribbean Collection | 5 | 9.043154e+08 | 4.521577e+09 |
1550 | Transformers Collection | 5 | 8.732202e+08 | 4.366101e+09 |
325 | Despicable Me Collection | 4 | 9.227676e+08 | 3.691070e+09 |
1491 | The Twilight Collection | 5 | 6.684215e+08 | 3.342107e+09 |
610 | Ice Age Collection | 5 | 6.433417e+08 | 3.216709e+09 |
666 | Jurassic Park Collection | 4 | 7.578710e+08 | 3.031484e+09 |
The Harry Potter Franchise is the most successful movie franchise raking in more than 7.707 billion dollars from 8 movies. The Star Wars Movies come in a close second with a 7.403 billion dollars from 8 movies too. James Bond is third but the franchise has significantly more movies compared to the others in the list and therefore, a much smaller average gross.linkcode
Most Successful Movie Franchises (by Average Gross)
We will use the average gross per movie to gauge the success of a movie franchise. However, this is not a very potent metric as the revenues in this dataset have not been adjusted for inflation. Therefore, revenue statistics will tend to strongly favor franchises in the recent times.In [29]:
fran_pivot.sort_values('mean', ascending=False).head(10)
belongs_to_collection | count | mean | sum | |
---|---|---|---|---|
112 | Avatar Collection | 1 | 2.787965e+09 | 2.787965e+09 |
1245 | The Avengers Collection | 2 | 1.462481e+09 | 2.924962e+09 |
479 | Frozen Collection | 1 | 1.274219e+09 | 1.274219e+09 |
446 | Finding Nemo Collection | 2 | 9.844532e+08 | 1.968906e+09 |
1352 | The Hobbit Collection | 3 | 9.785078e+08 | 2.935523e+09 |
1388 | The Lord of the Rings Collection | 3 | 9.721816e+08 | 2.916545e+09 |
552 | Harry Potter Collection | 8 | 9.634209e+08 | 7.707367e+09 |
1160 | Star Wars Collection | 8 | 9.293118e+08 | 7.434495e+09 |
325 | Despicable Me Collection | 4 | 9.227676e+08 | 3.691070e+09 |
968 | Pirates of the Caribbean Collection | 5 | 9.043154e+08 | 4.521577e+09 |
The Avatar Collection, although just consisting of one movie at the moment, is the most successful franchise of all time with the sole movie raking in close to 3 billion dollars. The Harry Potter franchise is still the most successful franchise with at least 5 movies.
Longest Running Franchises
Finally, in this subsection, let us take a look at the franchises which have stood the test of time and have managed to deliver the largest number of movies under a single banner. This metric is potent in the way that it isn’t affected by inflation. However, this does not imply that successful movie franchises tend to have more movies. Some franchises, such as Harry Potter, have a predefined storyline and it wouldn’t make sense to produce more movies despite its enormous success.In [30]:
fran_pivot.sort_values('count', ascending=False).head(10)
belongs_to_collection | count | mean | sum | |
---|---|---|---|---|
646 | James Bond Collection | 26 | 2.733450e+08 | 7.106970e+09 |
473 | Friday the 13th Collection | 12 | 3.874155e+07 | 4.648985e+08 |
976 | Pokémon Collection | 11 | 6.348189e+07 | 6.983008e+08 |
552 | Harry Potter Collection | 8 | 9.634209e+08 | 7.707367e+09 |
540 | Halloween Collection | 8 | 3.089601e+07 | 2.471681e+08 |
29 | A Nightmare on Elm Street Collection | 8 | 4.544894e+07 | 3.635916e+08 |
1317 | The Fast and the Furious Collection | 8 | 6.406373e+08 | 5.125099e+09 |
1432 | The Pink Panther (Original) Collection | 8 | 2.055978e+07 | 1.644782e+08 |
1160 | Star Wars Collection | 8 | 9.293118e+08 | 7.434495e+09 |
977 | Police Academy Collection | 7 | 4.352046e+07 | 3.046432e+08 |
The James Bond Movies is the largest franchise ever with over 26 movies released under the banner. Friday the 13th and Pokemon come in at a distant second and third with 12 and 11 movies respectively.
Production Companies
df['production_companies'] = df['production_companies'].fillna('[]').apply(ast.literal_eval)
df['production_companies'] = df['production_companies'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
s = df.apply(lambda x: pd.Series(x['production_companies']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'companies'
com_df = df.drop('production_companies', axis=1).join(s)
com_sum = pd.DataFrame(com_df.groupby('companies')['revenue'].sum().sort_values(ascending=False))
com_sum.columns = ['Total']
com_mean = pd.DataFrame(com_df.groupby('companies')['revenue'].mean().sort_values(ascending=False))
com_mean.columns = ['Average']
com_count = pd.DataFrame(com_df.groupby('companies')['revenue'].count().sort_values(ascending=False))
com_count.columns = ['Number']
com_pivot = pd.concat((com_sum, com_mean, com_count), axis=1)
Highest Earning Production Companies
Let us find out which production companies have earned the most money from the movie-making business.
com_pivot.sort_values('Total', ascending=False).head(10)
Total | Average | Number | |
---|---|---|---|
Warner Bros. | 6.352519e+10 | 1.293792e+08 | 491 |
Universal Pictures | 5.525919e+10 | 1.193503e+08 | 463 |
Paramount Pictures | 4.880819e+10 | 1.235650e+08 | 395 |
Twentieth Century Fox Film Corporation | 4.768775e+10 | 1.398468e+08 | 341 |
Walt Disney Pictures | 4.083727e+10 | 2.778046e+08 | 147 |
Columbia Pictures | 3.227974e+10 | 1.367785e+08 | 236 |
New Line Cinema | 2.217339e+10 | 1.119868e+08 | 198 |
Amblin Entertainment | 1.734372e+10 | 2.550547e+08 | 68 |
DreamWorks SKG | 1.547575e+10 | 1.984071e+08 | 78 |
Dune Entertainment | 1.500379e+10 | 2.419966e+08 | 62 |
Warner Bros is the highest-earning production company of all time earning a staggering 63.5 billion dollars from close to 500 movies. Universal Pictures and Paramount Pictures are the second and the third highest-earning companies with 55 billion dollars and 48 billion dollars in revenue respectively.
Most Succesful Production Companies
Which production companies produce the most successful movies on average? Let us find out. We will only consider those companies that have made at least 15 movies.
com_pivot[com_pivot['Number'] >= 15].sort_values('Average', ascending=False).head(10)
Total | Average | Number | |
---|---|---|---|
Pixar Animation Studios | 1.118853e+10 | 6.215852e+08 | 18 |
Marvel Studios | 1.169964e+10 | 6.157703e+08 | 19 |
Revolution Sun Studios | 8.120339e+09 | 5.413559e+08 | 15 |
Lucasfilm | 9.898421e+09 | 4.499282e+08 | 22 |
DreamWorks Animation | 1.370752e+10 | 4.031622e+08 | 34 |
DC Entertainment | 6.212609e+09 | 3.882880e+08 | 16 |
Dentsu | 6.853205e+09 | 3.807336e+08 | 18 |
Jerry Bruckheimer Films | 8.957441e+09 | 3.732267e+08 | 24 |
Marvel Enterprises | 6.538067e+09 | 3.441088e+08 | 19 |
Legendary Pictures | 1.346866e+10 | 3.367166e+08 | 40 |
Pixar Animation Studios has produced the most successful movies, on average. This is not surprising considering the amazing array of movies that it has produced in the last few decades: Up, Finding Nemo, Inside Out, Wall-E, Ratatouille, the Toy Story Franchise, Cars Franchise, etc. Marvel Studios with an average gross of 615 million dollars comes in second with movies such as Iron Man and The Avengers under its banner.
Original Language
In this section, let us look at the languages of the movies in our dataset. From the production countries, we have already deduced that the majority of the movies in the dataset are English. Let us see what the other major languages represented are.
df['original_language'].drop_duplicates().shape[0]
93
lang_df = pd.DataFrame(df['original_language'].value_counts())
lang_df['language'] = lang_df.index
lang_df.columns = ['number', 'language']
lang_df.head()
number | language | |
---|---|---|
en | 32269 | en |
fr | 2438 | fr |
it | 1529 | it |
ja | 1350 | ja |
de | 1080 | de |
There are over 93 languages represented in our dataset. As we had expected, English language films form the overwhelming majority. French and Italian movies come at a very distant second and third respectively. Let us represent the most popular languages (apart from English) in the form of a bar plot.In [39]:
plt.figure(figsize=(12,5))
sns.barplot(x='language', y='number', data=lang_df.iloc[1:11])
plt.show()

As mentioned earlier, French and Italian are the most commonly occurring languages after English. Japanese and Hindi form the majority as far as Asian Languages are concerned.
Popularity, Vote Average and Vote Count
In this section, we will work with metrics provided to us by TMDB users. We will try to gain a deeper understanding of the popularity, vote average and vote count features and try and deduce any relationships between them as well as other numeric features such as budget and revenue.In [40]:
def clean_numeric(x):
try:
return float(x)
except:
return np.nan
In [41]:
df['popularity'] = df['popularity'].apply(clean_numeric).astype('float')
df['vote_count'] = df['vote_count'].apply(clean_numeric).astype('float')
df['vote_average'] = df['vote_average'].apply(clean_numeric).astype('float')
Let us examine the summary statistics and the distribution of each feature one by one.In [42]:
df['popularity'].describe()
Out[42]:
count 45460.000000 mean 2.921478 std 6.005414 min 0.000000 25% 0.385948 50% 1.127685 75% 3.678902 max 547.488298 Name: popularity, dtype: float64
In [43]:
sns.distplot(df['popularity'].fillna(df['popularity'].median()))
plt.show()

In [44]:
df['popularity'].plot(logy=True, kind='hist')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa02bb20908>

The Popularity score seems to be an extremely skewed quentity with a mean of only 2.9 but maximum values reaching as high as 547, which is almost 1800% greater than the mean. However, as can be seen from the distribution plot, almost all movies have a popularity score less than 10 (the 75th percentile is at 3.678902).
Most Popular Movies by Popularity Score
In [45]:
df[['title', 'popularity', 'year']].sort_values('popularity', ascending=False).head(10)
Out[45]:
title | popularity | year | |
---|---|---|---|
30700 | Minions | 547.488298 | 2015 |
33356 | Wonder Woman | 294.337037 | 2017 |
42222 | Beauty and the Beast | 287.253654 | 2017 |
43644 | Baby Driver | 228.032744 | 2017 |
24455 | Big Hero 6 | 213.849907 | 2014 |
26564 | Deadpool | 187.860492 | 2016 |
26566 | Guardians of the Galaxy Vol. 2 | 185.330992 | 2017 |
14551 | Avatar | 185.070892 | 2009 |
24351 | John Wick | 183.870374 | 2014 |
23675 | Gone Girl | 154.801009 | 2014 |
Minions are the most popular movie by the TMDB Popularity Score. Wonder Woman and Beauty and the Beast, two extremely successful woman-centric movies come in second and third respectively.In [46]:
df['vote_count'].describe()
Out[46]:
count 45460.000000 mean 109.897338 std 491.310374 min 0.000000 25% 3.000000 50% 10.000000 75% 34.000000 max 14075.000000 Name: vote_count, dtype: float64
As with popularity scores, the distribution of vote counts is extremely skewed with the median vote count standing at a paltry 10 votes. The most votes a single movie has got stands at 14,075. TMDB Votes, therefore, are not as potent and suggestive as its IMDB Counterpart. Nevertheless, let us check which the most voted on movies on the website are.
Most Voted on Movies
In [47]:
df[['title', 'vote_count', 'year']].sort_values('vote_count', ascending=False).head(10)
Out[47]:
title | vote_count | year | |
---|---|---|---|
15480 | Inception | 14075.0 | 2010 |
12481 | The Dark Knight | 12269.0 | 2008 |
14551 | Avatar | 12114.0 | 2009 |
17818 | The Avengers | 12000.0 | 2012 |
26564 | Deadpool | 11444.0 | 2016 |
22879 | Interstellar | 11187.0 | 2014 |
20051 | Django Unchained | 10297.0 | 2012 |
23753 | Guardians of the Galaxy | 10014.0 | 2014 |
2843 | Fight Club | 9678.0 | 1999 |
18244 | The Hunger Games | 9634.0 | 2012 |
Inception and The Dark Knight, two critically acclaimed and commercially successful Christopher Nolan movies figure at the top of our chart.
In [48]:
df['vote_average'] = df['vote_average'].replace(0, np.nan)
df['vote_average'].describe()
Out[48]:
count 42462.000000 mean 6.014877 std 1.256208 min 0.500000 25% 5.300000 50% 6.100000 75% 6.900000 max 10.000000 Name: vote_average, dtype: float64
In [49]:
sns.distplot(df['vote_average'].fillna(df['vote_average'].median()))
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa02b325ef0>

It appears that TMDB Users are extremely strict in their ratings. The mean rating is only a 5.6 on a scale of 10. Half the movies have a rating of less than or equal to 6. Let us check what the most critically acclaimed movies as per TMDB are. We will only consider those movies that have more than 2000 votes (similar to IMDB’s criteria of 5000 votes in selecting its top 250).
Most Critically Acclaimed Movies
In [50]:
df[df['vote_count'] > 2000][['title', 'vote_average', 'vote_count' ,'year']].sort_values('vote_average', ascending=False).head(10)
Out[50]:
title | vote_average | vote_count | year | |
---|---|---|---|---|
314 | The Shawshank Redemption | 8.5 | 8358.0 | 1994 |
834 | The Godfather | 8.5 | 6024.0 | 1972 |
2211 | Life Is Beautiful | 8.3 | 3643.0 | 1997 |
5481 | Spirited Away | 8.3 | 3968.0 | 2001 |
1152 | One Flew Over the Cuckoo’s Nest | 8.3 | 3001.0 | 1975 |
1176 | Psycho | 8.3 | 2405.0 | 1960 |
2843 | Fight Club | 8.3 | 9678.0 | 1999 |
1178 | The Godfather: Part II | 8.3 | 3418.0 | 1974 |
12481 | The Dark Knight | 8.3 | 12269.0 | 2008 |
292 | Pulp Fiction | 8.3 | 8670.0 | 1994 |
The Shawshank Redemption and The Godfather are the two most critically acclaimed movies in the TMDB Database. Interestingly, they are the top 2 movies in IMDB’s Top 250 Movies list too. They have a rating of over 9 on IMDB as compared to their 8.5 TMDB Scores.
Do popularity and vote average share a tangible relationship? In other words, is there a strong positive correlation between these two quantities? Let us visualize their relationship in the form of a scatterplot.In [51]:
sns.jointplot(x='vote_average', y='popularity', data=df)
Out[51]:
<seaborn.axisgrid.JointGrid at 0x7f9feb159358>

Surprisingly, the Pearson Coefficient of the two aforementioned quantities is a measly 0.097 which suggests that there is no tangible correlation. In other words, popularity and vote average and independent quantities. It would be interesting to discover how TMDB assigns numerical popularity scores to its movies.In [52]:
sns.jointplot(x='vote_average', y='vote_count', data=df)
Out[52]:
<seaborn.axisgrid.JointGrid at 0x7f9feb73d588>

There is a very small correlation between Vote Count and Vote Average. A large number of votes on a particular movie does not necessarily imply that the movie is good.
Movie Release Dates
Release Dates can often play a very important role in determining the success and the revenue generated by a particular movie. In this section, we will try and gain insights about release dates in terms of years, months and days of the week.
We have already constructed the year feature in our preliminary data wrangling step. Let us now extract the month and day too for each movie with a release date.
In [53]:
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
In [54]:
def get_month(x):
try:
return month_order[int(str(x).split('-')[1]) - 1]
except:
return np.nan
In [55]:
def get_day(x):
try:
year, month, day = (int(i) for i <strong>in</strong> x.split('-'))
answer = datetime.date(year, month, day).weekday()
return day_order[answer]
except:
return np.nan
In [56]:
df['day'] = df['release_date'].apply(get_day)
df['month'] = df['release_date'].apply(get_month)
With these features in hand, let us now check the most popular and most successful months and days.In [57]:
plt.figure(figsize=(12,6))
plt.title("Number of Movies released in a particular month.")
sns.countplot(x='month', data=df, order=month_order)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9feb705f28>

It appears that January is the most popular month when it comes to movie releases. In Hollywood circles, this is also known as the dump month when sub-par movies are released by the dozen.
In which months do blockbuster movies tend to release? To answer this question, we will consider all movies that have made in excess of 100 million dollars and calculate the average gross for each month.
In [58]:
month_mean = pd.DataFrame(df[df['revenue'] > 1e8].groupby('month')['revenue'].mean())
month_mean['mon'] = month_mean.index
plt.figure(figsize=(12,6))
plt.title("Average Gross by the Month for Blockbuster Movies")
sns.barplot(x='mon', y='revenue', data=month_mean, order=month_order)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ff8bb64a8>

We see that the months of April, May and June have the highest average gross among high grossing movies. This can be attributed to the fact that blockbuster movies are usually released in the summer when the kids are out of school and the parents are on vacation and therefore, the audience is more likely to spend their disposable income on entertainment.
Do some months tend to be more successful than others? Let us visualise the boxplot between the return and the months.In [59]:
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15, 8))
sns.boxplot(x='month', y='return', data=df[df['return'].notnull()], palette="muted", ax =ax, order=month_order)
ax.set_ylim([0, 12])
Out[59]:
(0, 12)

The months of June and July tend to yield the highest median returns. September is the least successful months on the aforementioned metrics. Again, the success of June and July movies can be attributed to them being summer months and times of vacation. September usually denotes the beginning of the school/college semester and hence a slight reduction in the consumption of movies.
Let us now have a look at the most popular days as we did for months.In [60]:
plt.figure(figsize=(10,5))
plt.title("Number of Movies released on a particular day.")
sns.countplot(x='day', data=df, order=day_order)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ff88f7a90>

Friday is clearly the most popular day for movie releases. This is understandable considering the fact that it usually denotes the beginning of the weekend. Sunday and Monday are the least popular days and this can be attributed to the same aforementioned reason.
Number of Movies by the year
The Dataset of 45,000 movies available to us does not represent the entire corpus of movies released since the inception of cinema. However, it is reasonable to assume that it does include almost every major film released in Hollywood as well as other major film industries across the world (such as Bollywood in India). With this assumption in mind, let us take a look at the number of movies produced by the year.In [61]:
year_count = df.groupby('year')['title'].count()
plt.figure(figsize=(18,5))
year_count.plot()
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa0285302e8>

We notice that there is a sharp rise in the number of movies starting the 1990s decade. However, we will not look too much into this as it is entirely possible that recent movies were oversampled for the purposes of this dataset.
Next, let us take a look at the earliest movies represented in the dataset.
Earliest Movies Represented
In [62]:
df[df['year'] != 'NaT'][['title', 'year']].sort_values('year').head(10)
Out[62]:
title | year | |
---|---|---|
34940 | Passage of Venus | 1874 |
34937 | Sallie Gardner at a Gallop | 1878 |
41602 | Buffalo Running | 1883 |
34933 | Man Walking Around a Corner | 1887 |
34934 | Accordion Player | 1888 |
34938 | Traffic Crossing Leeds Bridge | 1888 |
34936 | Monkeyshines, No. 2 | 1890 |
34939 | London’s Trafalgar Square | 1890 |
34935 | Monkeyshines, No. 1 | 1890 |
41194 | Mosquinha | 1890 |
The oldest movie, Passage of Venus, was a series of photographs of the transit of the planet Venus across the Sun in 1874. They were taken to Japan by the French astronomer Pierre Janssen using his ‘photographic revolver’. This is also the oldest movie on both IMDB and TMDB.
Finally, in this section, let us construct a heatmap to indicate movie releases by month and year for all movies released in this century. This will give us a good idea of the hot and cold months for movie buffs.
In [63]:
months = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
In [64]:
df_21 = df.copy()
df_21['year'] = df_21[df_21['year'] != 'NaT']['year'].astype(int)
df_21 = df_21[df_21['year'] >=2000]
hmap_21 = pd.pivot_table(data=df_21, index='month', columns='year', aggfunc='count', values='title')
hmap_21 = hmap_21.fillna(0)
In [65]:
sns.set(font_scale=1)
f, ax = plt.subplots(figsize=(16, 8))
sns.heatmap(hmap_21, annot=True, linewidths=.5, ax=ax, fmt='n', yticklabels=month_order)
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa0285a0a58>

In [66]:
sns.set(font_scale=1.25)
Movie Status
Although not entirely relevant to our analysis of movies, gathering information on the various kinds of movies based on their status of release can provide us with an interesting insight into the nature of the movies present in our dataset. My preliminary hunch was that almost every movie has the Released status. Let’s find out.
In [67]:
df['status'].value_counts()
Out[67]:
Released 45014 Rumored 230 Post Production 98 In Production 20 Planned 15 Canceled 2 Name: status, dtype: int64
Almost every movie is indeed released. However, it is interesting to see that MovieLens has user ratings for movies that are still in the planning, production and post production stage. We might take this information into account while building our collaborative filtering recommendation engine.
Spoken Languages
Does the number of spoken languages influence the success of a movie? To do this, we will convert our spoken_languages feature to a numeric feature denoting the number of languages spoken in that film.In [68]:
df['spoken_languages'] = df['spoken_languages'].fillna('[]').apply(ast.literal_eval).apply(lambda x: len(x) if isinstance(x, list) else np.nan)
In [69]:
df['spoken_languages'].value_counts()
Out[69]:
1 33736 2 5371 0 3835 3 1705 4 550 5 178 6 62 7 14 8 6 9 5 19 1 13 1 12 1 10 1 Name: spoken_languages, dtype: int64
Most movies have just one language spoken in the entire duration of the film. 19 is the higest number of languages spoken in a film. Let us take a look at all the films with more than 10 spoken languages.
In [70]:
df[df['spoken_languages'] >= 10][['title', 'year', 'spoken_languages']].sort_values('spoken_languages', ascending=False)
Out[70]:
title | year | spoken_languages | |
---|---|---|---|
22235 | Visions of Europe | 2004 | 19 |
35288 | The Testaments | 2000 | 13 |
14093 | To Each His Own Cinema | 2007 | 12 |
8789 | The Adventures of Picasso | 1978 | 10 |
The movie with the most number of languages, Visions of Europe is actually a collection of 25 short films by 25 different European directors. This explains the sheer diversity of the movie in terms of language.In [71]:
sns.jointplot(x="spoken_languages", y="return", data=df, stat_func=stats.spearmanr, color="m")
Out[71]:
<seaborn.axisgrid.JointGrid at 0x7fa028628908>

The Spearman Coefficient is 0.018 indicating no correlation between the two quantities.
Runtime
From its humble beginnings of 1 minute slient, black & white clips to epic 3 hour visual delights, movies have a come a long way in terms of runtime. In this section, let us try and gain some additional insights about the nature of movie lengths and their evolution over time.In [72]:
df['runtime'].describe()
Out[72]:
count 45203.000000 mean 94.128199 std 38.407810 min 0.000000 25% 85.000000 50% 95.000000 75% 107.000000 max 1256.000000 Name: runtime, dtype: float64
The average length of a movie is about 1 hour and 30 minutes. The longest movie on record in this dataset is a staggering 1256 minutes (or 20 hours) long.In [73]:
df['runtime'] = df['runtime'].astype('float')
We are aware that most movies are less than 5 hours (or 300 minutes) long. Let us plot a distribution of these mainstream movies.In [74]:
plt.figure(figsize=(12,6))
sns.distplot(df[(df['runtime'] < 300) & (df['runtime'] > 0)]['runtime'])
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa0289abda0>

Is there any meaningful relationship between runtime and return? Let us find out!In [75]:
df_mat = df[(df['return'].notnull()) & (df['runtime'] > 0) & (df['return'] < 10)]
sns.jointplot('return', 'runtime', data=df_mat)
plt.show()

There seems to be relationship between the two quantities. The duration of a movie is independent of its success. However, I have a feeling this might not be the case with duration and budget. A longer movie should entail a higher budget. Let us find out if this is really the case.In [76]:
df_mat = df[(df['budget'].notnull()) & (df['runtime'] > 0)]
sns.jointplot('budget', 'runtime', data=df_mat)
plt.show()

The two quantities have a much weaker correlation than I had expected. In retrospect, the genre of the movie tends to have a much greater impact on budget. A 3 hour art film will cost significantly lesser than a 90 minute Sci-Fi movie.
Next, I’d like to see the average lengths of movies through time, right from the 1890s to the 2017s. It would be interesting to see the trends in what filmmakers adjudged would be the appropriate length of a movie at that time.In [77]:
plt.figure(figsize=(18,5))
year_runtime = df[df['year'] != 'NaT'].groupby('year')['runtime'].mean()
plt.plot(year_runtime.index, year_runtime)
plt.xticks(np.arange(1874, 2024, 10.0))
plt.show()

We notice that films started hitting the 60 minute mark as early as 1914. Starting 1924, films started having the traiditonal 90 minute duration and has remained more or less constant ever since.
Finally in this section, let us see the longest and the shortest movies of all time (with respect to the movies in the dataset).
Shortest Movies
In [78]:
df[df['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime').head(10)
Out[78]:
runtime | title | year | |
---|---|---|---|
35146 | 1.0 | La Vague | 1891 |
36577 | 1.0 | Champs de Mars | 1900 |
36576 | 1.0 | Palace of Electricity | 1900 |
44965 | 1.0 | Luis Martinetti, Contortionist | 1894 |
42357 | 1.0 | A Gathering of Cats | 2007 |
36575 | 1.0 | Eiffel Tower from Trocadero Palace | 1900 |
36574 | 1.0 | Panorama of Eiffel Tower | 1900 |
44714 | 1.0 | The Infernal Caldron | 1903 |
19244 | 1.0 | The Kiss | 1896 |
44646 | 1.0 | The Vanishing Lady | 1896 |
We see that every movie in this list except A Gathering of Cats were filmed in the late 1890s and the beginning of the 20th century. All these movies were one minute long.
Longest Movies
In [79]:
df[df['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime', ascending=False).head(10)
Out[79]:
runtime | title | year | |
---|---|---|---|
24178 | 1256.0 | Centennial | 1978 |
40938 | 1140.0 | Baseball | 1994 |
19965 | 1140.0 | Jazz | 2001 |
13767 | 931.0 | Berlin Alexanderplatz | 1980 |
13953 | 925.0 | Heimat: A Chronicle of Germany | 1984 |
27855 | 900.0 | The Story of Film: An Odyssey | 2011 |
26682 | 877.0 | Taken | 2002 |
19158 | 874.0 | The War | 2007 |
34667 | 840.0 | The Roosevelts: An Intimate History | 2014 |
34732 | 840.0 | Seventeen Moments in Spring | 1973 |
We notice that almost all the entries in the above chart are actually miniseries and hence, do not count as feature length films. We cannot gather too much insight from this list of longest movies as there is no way of distinguishing feature length films from TV Mini Series from our dataset (except, of course, by doing it manually).
Budget
Let us now turn our attention to budget. We expect budgets to be a skewed quantity and also heavily influenced by inflation. Nevertheless, it would be interesting to gather as much insights as possible from this quantity as budget is often a critical feature in predicting movie revenue and success. As a start, let us gather the summary statistics for our budget.In [80]:
df['budget'].describe()
Out[80]:
count 8.890000e+03 mean 2.160428e+07 std 3.431063e+07 min 1.000000e+00 25% 2.000000e+06 50% 8.000000e+06 75% 2.500000e+07 max 3.800000e+08 Name: budget, dtype: float64
The mean budget of a film is 21.6 million dollars whereas the median budget is far smaller at 8 million dollars. This strongly suggests the mean being influenced by outliers.In [81]:
sns.distplot(df[df['budget'].notnull()]['budget'])
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa0284b3ef0>

In [82]:
df['budget'].plot(logy=True, kind='hist')
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ff3d65588>

The distribution of movie budgets shows an exponential decay. More than 75% of the movies have a budget smaller than 25 million dollars. Next, let us take a look at the most expensive movies of all time and the revenue & returns that they generated.
Most Expensive Movies of all Time
In [83]:
df[df['budget'].notnull()][['title', 'budget', 'revenue', 'return', 'year']].sort_values('budget', ascending=False).head(10)
Out[83]:
title | budget | revenue | return | year | |
---|---|---|---|---|---|
17124 | Pirates of the Caribbean: On Stranger Tides | 380000000.0 | 1.045714e+09 | 2.751878 | 2011 |
11827 | Pirates of the Caribbean: At World’s End | 300000000.0 | 9.610000e+08 | 3.203333 | 2007 |
26558 | Avengers: Age of Ultron | 280000000.0 | 1.405404e+09 | 5.019299 | 2015 |
11067 | Superman Returns | 270000000.0 | 3.910812e+08 | 1.448449 | 2006 |
44842 | Transformers: The Last Knight | 260000000.0 | 6.049421e+08 | 2.326701 | 2017 |
16130 | Tangled | 260000000.0 | 5.917949e+08 | 2.276134 | 2010 |
18685 | John Carter | 260000000.0 | 2.841391e+08 | 1.092843 | 2012 |
11780 | Spider-Man 3 | 258000000.0 | 8.908716e+08 | 3.452991 | 2007 |
21175 | The Lone Ranger | 255000000.0 | 8.928991e+07 | 0.350157 | 2013 |
22059 | The Hobbit: The Desolation of Smaug | 250000000.0 | 9.584000e+08 | 3.833600 | 2013 |
Two Pirates of the Carribean films occupy the top spots in this list with a staggering budget of over 300 million dollars. All the top 10 most expensive films made a profit on their investment except for The Lone Ranger which managed to recoup less than 35% of its investment, taking in a paltry 90 million dollars on a 255 million dollar budget.
How strong a correlation does the budget hold with the revenue? A stronger correlation would directly imply more accurate forecasts.In [84]:
sns.jointplot(x='budget',y='revenue',data=df[df['return'].notnull()])
Out[84]:
<seaborn.axisgrid.JointGrid at 0x7f9ff3d43eb8>

The pearson r value of 0.73 between the two quantities indicates a very strong correlation.
Revenue
The final numeric feature we will explore is the revenue. The revenue is probably the most important numeric quantity associated with a movie. We will try to predict the revenue for movies given a set of features in a later section. The treatment of revenue will be very similar to that of budget and we will once again begin by studying the summary statistics.In [85]:
df['revenue'].describe()
Out[85]:
count 7.408000e+03 mean 6.878739e+07 std 1.464203e+08 min 1.000000e+00 25% 2.400000e+06 50% 1.682272e+07 75% 6.722707e+07 max 2.787965e+09 Name: revenue, dtype: float64
The mean gross of a movie is 68.7 million dollars whereas the median gross is much lower at 16.8 million dollars, suggesting the skewed nature of revenue. The lowest revenue generated by a movie is just 1 dollar whereas the highest grossing movie of all time has raked in an astonishing *2.78 billion dollars.In [86]:
sns.distplot(df[df['revenue'].notnull()]['revenue'])
Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa02b107160>

The distribution of revenue undergoes exponential decay just like budget. We also found that the two quantities were strongly correlated. Let us now take a look at the highest and least grossing movies of all time.
Highest Grossing Films of All Time
In [87]:
gross_top = df[['poster_path', 'title', 'budget', 'revenue', 'year']].sort_values('revenue', ascending=False).head(10)
pd.set_option('display.max_colwidth', 100)
HTML(gross_top.to_html(escape=False))
Out[87]:
poster_path | title | budget | revenue | year | |
---|---|---|---|---|---|
14551 | ![]() | Avatar | 237000000.0 | 2.787965e+09 | 2009 |
26555 | ![]() | Star Wars: The Force Awakens | 245000000.0 | 2.068224e+09 | 2015 |
pd.set_option('display.max_colwidth', 50)
These figures have not been adjusted for inflation. Therefore, we see a disproportionate number of movies from very recent times in the top 10 list. To get an understanding of the revenue garnered by movies, let us plot the maximum revenue through the years.In [89]:
plt.figure(figsize=(18,5))
year_revenue = df[(df['revenue'].notnull()) & (df['year'] != 'NaT')].groupby('year')['revenue'].max()
plt.plot(year_revenue.index, year_revenue)
plt.xticks(np.arange(1874, 2024, 10.0))
plt.show()

As can be seen from the figure, the maximum gross has steadily risen over the years. The world of movies broke the 1 billion dollar mark in 1997 with the release of Titanic. It took another 12 years to break the 2 billion dollar mark with Avatar. Both these movies were directed by James Cameron.
Returns
We will not look too much into returns for the time being. Let us just check the least and the most successful movies of all time. To do this, we will only consider those movies which have a budget greater than 5 million dollars.
Most Successful Movies
In [90]:
df[(df['return'].notnull()) & (df['budget'] > 5e6)][['title', 'budget', 'revenue', 'return', 'year']].sort_values('return', ascending=False).head(10)
Out[90]:
title | budget | revenue | return | year | |
---|---|---|---|---|---|
1065 | E.T. the Extra-Terrestrial | 10500000.0 | 792965326.0 | 75.520507 | 1982 |
256 | Star Wars | 11000000.0 | 775398007.0 | 70.490728 | 1977 |
1338 | Jaws | 7000000.0 | 470654000.0 | 67.236286 | 1975 |
1888 | The Exorcist | 8000000.0 | 441306145.0 | 55.163268 | 1973 |
352 | Four Weddings and a Funeral | 6000000.0 | 254700832.0 | 42.450139 | 1994 |
834 | The Godfather | 6000000.0 | 245066411.0 | 40.844402 | 1972 |
4492 | Look Who’s Talking | 7500000.0 | 296000000.0 | 39.466667 | 1989 |
24258 | Annabelle | 6500000.0 | 255273813.0 | 39.272894 | 2014 |
1056 | Dirty Dancing | 6000000.0 | 213954274.0 | 35.659046 | 1987 |
1006 | The Sound of Music | 8200000.0 | 286214286.0 | 34.904181 | 1965 |
Worst Box Office Disasters
In [91]:
df[(df['return'].notnull()) & (df['budget'] > 5e6) & (df['revenue'] > 10000)][['title', 'budget', 'revenue', 'return', 'year']].sort_values('return').head(10)
Out[91]:
title | budget | revenue | return | year | |
---|---|---|---|---|---|
11159 | Chaos | 20000000.0 | 10289.0 | 0.000514 | 2005 |
19027 | 5 Days of War | 20000000.0 | 17479.0 | 0.000874 | 2011 |
21034 | Special Forces | 10000000.0 | 10759.0 | 0.001076 | 2011 |
25732 | Foodfight! | 65000000.0 | 73706.0 | 0.001134 | 2012 |
38388 | Term Life | 16500000.0 | 21256.0 | 0.001288 | 2016 |
19505 | Laurence Anyways | 9500000.0 | 12250.0 | 0.001289 | 2012 |
12038 | The Good Night | 15000000.0 | 20380.0 | 0.001359 | 2007 |
3966 | Cherry 2000 | 10000000.0 | 14000.0 | 0.001400 | 1987 |
22097 | Twice Born | 13000000.0 | 18295.0 | 0.001407 | 2012 |
5651 | All The Queen’s Men | 15000000.0 | 23000.0 | 0.001533 | 2001 |
With these analyses in place, we are in a good position to construct our correlation matrix.In [92]:
df['year'] = df['year'].replace('NaT', np.nan)
In [93]:
df['year'] = df['year'].apply(clean_numeric)
In [94]:
sns.set(font_scale=1)
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
plt.figure(figsize=(9,9))
ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, annot=True)

In [95]:
sns.set(font_scale=1.25)
Genres
In [96]:
df['genres'] = df['genres'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i <strong>in</strong> x] if isinstance(x, list) else [])
In [97]:
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
In [98]:
gen_df = df.drop('genres', axis=1).join(s)
In [99]:
gen_df['genre'].value_counts().shape[0]
Out[99]:
32
TMDB defines 32 different genres for our set of 45,000 movies. Let us now have a look at the most commonly occuring genres in movies.In [100]:
pop_gen = pd.DataFrame(gen_df['genre'].value_counts()).reset_index()
pop_gen.columns = ['genre', 'movies']
pop_gen.head(10)
Out[100]:
genre | movies | |
---|---|---|
0 | Drama | 20265 |
1 | Comedy | 13182 |
2 | Thriller | 7624 |
3 | Romance | 6735 |
4 | Action | 6596 |
5 | Horror | 4673 |
6 | Crime | 4307 |
7 | Documentary | 3932 |
8 | Adventure | 3496 |
9 | Science Fiction | 3049 |
In [101]:
plt.figure(figsize=(18,8))
sns.barplot(x='genre', y='movies', data=pop_gen.head(15))
plt.show()

Drama is the most commonly occurring genre with almost half the movies identifying itself as a drama film. Comedy comes in at a distant second with 25% of the movies having adequate doses of humor. Other major genres represented in the top 10 are Action, Horror, Crime, Mystery, Science Fiction, Animation and Fantasy.
The next question I want to answer is the trends in the share of genres of movies across the world. Has the demand for Science Fiction movies increased? Do certain years have a disproportionate share of Animation Movies? Let’s find out!
We will only be looking at trends starting 2000. We will consider only those themes that appear in the top 15 most popular genres. We will exclude Documentaries, Family and Foreign Movies from our analysis.
In [102]:
genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror', 'Crime', 'Adventure', 'Science Fiction', 'Mystery', 'Fantasy', 'Mystery', 'Animation']
In [103]:
pop_gen_movies = gen_df[(gen_df['genre'].isin(genres)) & (gen_df['year'] >= 2000) & (gen_df['year'] <= 2017)]
ctab = pd.crosstab([pop_gen_movies['year']], pop_gen_movies['genre']).apply(lambda x: x/x.sum(), axis=1)
ctab[genres].plot(kind='bar', stacked=True, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("Stacked Bar Chart of Movie Proportions by Genre")
plt.show()

In [104]:
ctab[genres].plot(kind='line', stacked=False, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

The proportion of movies of each genre has remained fairly constant since the beginning of this century except for Drama. The proportion of drama films has fallen by over 5%. Thriller movies have enjoyed a slight increase in their share.
One question that I have always had is that if some genres are particularly more successful than others. For example, we should expect Science Fiction and Fantasy Movies to bring in more revenue than other genres but when normalized with their budget, do they prove to be as successful? We will visualize two violin plots to answer this question. One will be genres versus the revenue while the other will be versus returns.In [105]:
violin_genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror', 'Crime', 'Science Fiction', 'Fantasy', 'Animation']
violin_movies = gen_df[(gen_df['genre'].isin(violin_genres))]
In [106]:
plt.figure(figsize=(18,8))
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15, 8))
sns.boxplot(x='genre', y='revenue', data=violin_movies, palette="muted", ax =ax)
ax.set_ylim([0, 3e8])
plt.show()
<matplotlib.figure.Figure at 0x7f9fee80fa90>

Animation movies has the largest 25-75 range as well as the median revenue among all the genres plotted. Fantasy and Science Fiction have the second and third highest median revenue respectively.In [107]:
plt.figure(figsize=(18,8))
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15, 8))
sns.boxplot(x='genre', y='return', data=violin_movies, palette="muted", ax =ax)
ax.set_ylim([0, 10])
plt.show()
<matplotlib.figure.Figure at 0x7f9ff26f3ba8>

From the boxplot, it seems like Animation Movies tend to yield the highest returns on average. Horror Movies also tend to be a good bet. This is partially due to the nature of Horror movies being low budget compared to Fantasy Movies but being capable of generating very high revenues relative to its budget.
Cast and Crew
Let us now take a look at the cast and crew of our movies. We do not have these details with us in our main dataset. However, we have a separate file consisting of the full cast and crew credits of all the Movielens Movies. Let us take a look at the data of this credit.
In [108]:
credits_df = pd.read_csv('../input/credits.csv')
credits_df.head()
Out[108]:
cast | crew | id | |
---|---|---|---|
0 | [{‘cast_id’: 14, ‘character’: ‘Woody (voice)’,… | [{‘credit_id’: ’52fe4284c3a36847f8024f49′, ‘de… | 862 |
1 | [{‘cast_id’: 1, ‘character’: ‘Alan Parrish’, ‘… | [{‘credit_id’: ’52fe44bfc3a36847f80a7cd1′, ‘de… | 8844 |
2 | [{‘cast_id’: 2, ‘character’: ‘Max Goldman’, ‘c… | [{‘credit_id’: ’52fe466a9251416c75077a89′, ‘de… | 15602 |
3 | [{‘cast_id’: 1, ‘character’: “Savannah ‘Vannah… | [{‘credit_id’: ’52fe44779251416c91011acb’, ‘de… | 31357 |
4 | [{‘cast_id’: 1, ‘character’: ‘George Banks’, ‘… | [{‘credit_id’: ’52fe44959251416c75039ed7′, ‘de… | 11862 |
Credits Dataset
- cast: A stringified list of dictionaries consisting of cast names and the corresponding characters they played.
- crew: A stringified list of dictionaries consisting of crew names and the function they performed.
- id: The TMDB ID of movie.
We need to perform a left join of our original movies metadata dataframe with the credits dataframe on the TMDB Movie ID. Before we are able to perform this join, we need to make sure that the ID column of our main dataframe is clean and of type integer. To do this, let us try to perform an integer conversion of our IDs and if an exception is raised,we will replace the ID with NaN. We will then proceed to drop these rows from our dataframe.
In [109]:
def convert_int(x):
try:
return int(x)
except:
return np.nan
In [110]:
df['id'] = df['id'].apply(convert_int)
In [111]:
df[df['id'].isnull()]
Out[111]:
belongs_to_collection | budget | genres | homepage | id | original_language | overview | popularity | poster_path | production_companies | … | status | tagline | title | video | vote_average | vote_count | return | year | day | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19730 | 0.065736 | NaN | [Carousel Productions, Vision View Entertainme… | [{‘iso_3166_1’: ‘CA’, ‘name’: ‘Canada’}, {‘iso… | NaN | 104.0 | Released | NaN | <img src=’http://image.tmdb.org/t/p/w185/Midni… | [] | … | NaN | NaN | nan | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
29503 | 1.931659 | NaN | [Aniplex, GoHands, BROSTA TV, Mardock Scramble… | [{‘iso_3166_1’: ‘US’, ‘name’: ‘United States o… | NaN | 68.0 | Released | NaN | <img src=’http://image.tmdb.org/t/p/w185/Mardo… | [] | … | NaN | NaN | nan | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35587 | 2.185485 | NaN | [Odyssey Media, Pulser Productions, Rogue Stat… | [{‘iso_3166_1’: ‘CA’, ‘name’: ‘Canada’}] | NaN | 82.0 | Released | NaN | <img src=’http://image.tmdb.org/t/p/w185/Avala… | [] | … | NaN | NaN | nan | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 25 columnsIn [112]:
df = df.drop([19730, 29503, 35587])
In [113]:
df['id'] = df['id'].astype('int')
In [114]:
df = df.merge(credits_df, on='id')
df.shape
Out[114]:
(45538, 27)
In [115]:
df['cast'] = df['cast'].apply(ast.literal_eval)
df['crew'] = df['crew'].apply(ast.literal_eval)
In [116]:
df['cast_size'] = df['cast'].apply(lambda x: len(x))
df['crew_size'] = df['crew'].apply(lambda x: len(x))
In [117]:
df['cast'] = df['cast'].apply(lambda x: [i['name'] for i <strong>in</strong> x] if isinstance(x, list) else [])
In [118]:
def get_director(x):
for i <strong>in</strong> x:
if i['job'] == 'Director':
return i['name']
return np.nan
In [119]:
df['director'] = df['crew'].apply(get_director)
In [120]:
s = df.apply(lambda x: pd.Series(x['cast']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'actor'
cast_df = df.drop('cast', axis=1).join(s)
Let us now take a look at the actors and the directors who have raked in the most amount of money with their movies.
Actors with the Highest Total Revenue
In [121]:
sns.set_style('whitegrid')
plt.title('Actors with the Highest Total Revenue')
cast_df.groupby('actor')['revenue'].sum().sort_values(ascending=False).head(10).plot(kind='bar')
plt.show()

Directors with the Highest Total Revenue
In [122]:
plt.title('Directors with the Highest Total Revenue')
df.groupby('director')['revenue'].sum().sort_values(ascending=False).head(10).plot(kind='bar', colormap='autumn')
plt.show()

For average revenues, we will consider only actors and directors who have acted and directed in at least 5 movies respectively.
In [123]:
actor_list = cast_df.groupby('actor')['revenue'].count().sort_values(ascending=False)
actor_list = list(actor_list[actor_list >= 5].index)
director_list = df.groupby('director')['revenue'].count().sort_values(ascending=False)
director_list = list(director_list[director_list >= 5].index)
Actors with Highest Average Revenue
In [124]:
plt.title("Actors with Highest Average Revenue")
cast_df[cast_df['actor'].isin(actor_list)].groupby('actor')['revenue'].mean().sort_values(ascending=False).head(10).plot(kind='bar', colormap='Greens_r')
plt.show()

Directors with Highest Average Revenue
In [125]:
plt.title("Directors with Highest Average Revenue")
df[df['director'].isin(director_list)].groupby('director')['revenue'].mean().sort_values(ascending=False).head(10).plot(kind='bar', colormap='autumn_r')
plt.show()

Which actors and directors are the safest bet? For this, we will consider the average return brought in by a particular director or actor. We will only consider those movies that have raked in at least 10 million dollars. Also, we will only consider actors and directors that have worked in at least 5 films.
Most Successful Actors
In [126]:
success_df = cast_df[(cast_df['return'].notnull()) & (cast_df['revenue'] > 1e7) & (cast_df['actor'].isin(actor_list))]
pd.DataFrame(success_df.groupby('actor')['return'].mean().sort_values(ascending=False).head(10))
Out[126]:
return | |
---|---|
actor | |
Jami Gertz | 3.099099e+06 |
Donna Mitchell | 2.479289e+06 |
Andrew McCarthy | 2.479280e+06 |
Nicholas Pryor | 1.770914e+06 |
James Spader | 1.549551e+06 |
Michael Bowen | 1.033038e+06 |
Robert Downey Jr. | 5.165189e+05 |
Martin Kove | 3.395458e+05 |
William Zabka | 3.395441e+05 |
Brad Pitt | 2.817390e+05 |
Most Successful Directors
In [127]:
success_df = df[(df['return'].notnull()) & (df['revenue'] > 1e7) & (df['director'].isin(director_list))]
pd.DataFrame(success_df.groupby('director')['return'].mean().sort_values(ascending=False).head(10))
Out[127]:
return | |
---|---|
director | |
John G. Avildsen | 169822.750468 |
George A. Romero | 60.656935 |
George Lucas | 54.328446 |
Martha Coolidge | 49.553131 |
Davis Guggenheim | 46.243000 |
George Miller | 39.076975 |
James Wan | 27.659835 |
Guy Hamilton | 25.779057 |
John Badham | 25.433532 |
John Carpenter | 25.075058 |
John G. Avildsen has an unnaturally high return. All the other directors in the list are nowhere close to the level of return that he has gained. Let us have a look at his movies.In [128]:
df[(df['director'] == 'John G. Avildsen') & (df['return'].notnull())][['title', 'budget', 'revenue', 'return', 'year']]
Out[128]:
title | budget | revenue | return | year | |
---|---|---|---|---|---|
1853 | Rocky | 1000000.0 | 117235147.0 | 1.172351e+02 | 1976.0 |
2307 | Rocky V | 42000000.0 | 119946358.0 | 2.855866e+00 | 1990.0 |
2315 | The Karate Kid | 8000000.0 | 90815558.0 | 1.135194e+01 | 1984.0 |
2316 | The Karate Kid, Part II | 113.0 | 115103979.0 | 1.018619e+06 | 1986.0 |
5226 | Joe | 106000.0 | 19319254.0 | 1.822571e+02 | 1970.0 |
5658 | Neighbors | 8500000.0 | 29916207.0 | 3.519554e+00 | 1981.0 |
The Karate Kid, Part II has a budget of only 113 dollars. This seems like an anomaly since official figures state that the movie cost 13 million dollars. So, although he has directed amazing movies, he does not belong to this list.
We will end our Exploratory Data Analysis over here. Let us use some of the insights we gained in this section and build some useful predictive models.
Regression: Predicting Movie Revenues
In this section, I will attempt at building a regression model to predict movie revenues. Since this isn’t one of the main goals of this project, I will not spend too much time feature engineering or hyperparamater tuning my model.
Predicting Movie Revenues is an extremely popular problem in Machine Learning which has created a huge amount of literature. Most of the models proposed in these papers use far more potent features than what we possess at the moment. These include Facebook Page Likes, Information on Tweets about the Movie, YouTube Trailer Reaction (Views, Likes, Dislikes, etc.), Movie Rating (MPCAA, CBIFC) among many others.
For a great project on Movie Box Office Predictions, check out Aarti Annaswamy’s excellent project here: Movies Capstone
To compensate for the lack of these features, we are going to cheat a little. We will be using TMDB’s Popularity Score and Vote Average as our features in our model to assign a nuerical value to popularity. However, it must be kept in mind that these metrics will not be available when predicting movie revenues in the real world, when the movie has not been released yet.In [129]:
rgf = df[df['return'].notnull()]
rgf.shape
Out[129]:
(5393, 30)
We have 5393 records in our training set. Let us take a look at the features we possess and remove the ones which are unnecessary.In [130]:
rgf.columns
Out[130]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'original_language', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'return', 'year', 'day', 'month', 'cast', 'crew', 'cast_size', 'crew_size', 'director'], dtype='object')
In [131]:
rgf = rgf.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'video', 'return', 'crew'], axis=1)
We will perform the following feature engineering tasks:
- belongs_to_collection will be turned into a Boolean variable. 1 indicates a movie is a part of collection whereas 0 indicates it is not.
- genres will be converted into number of genres.
- homepage will be converted into a Boolean variable that will indicate if a movie has a homepage or not.
- original_language will be replaced by a feature called is_foreign to denote if a particular film is in English or a Foreign Language.
- production_companies will be replaced with just the number of production companies collaborating to make the movie.
- production_countries will be replaced with the number of countries the film was shot in.
- day will be converted into a binary feature to indicate if the film was released on a Friday.
- month will be converted into a variable that indicates if the month was a holiday season.
In [132]:
s = rgf.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_rgf = rgf.drop('genres', axis=1).join(s)
genres_train = gen_rgf['genre'].drop_duplicates()
In [133]:
def feature_engineering(df):
df['belongs_to_collection'] = df['belongs_to_collection'].apply(lambda x: 0 if x == np.nan else 1)
for genre <strong>in</strong> genres_train:
df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre <strong>in</strong> x else 0)
df['genres'] = df['genres'].apply(lambda x: len(x))
df['homepage'] = df['homepage'].apply(lambda x: 0 if x == np.nan else 1)
df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
df = df.drop('original_language', axis=1)
df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
df = df.drop('day', axis=1)
df['is_Holiday'] = df['month'].apply(lambda x: 1 if x <strong>in</strong> ['Apr', 'May', 'Jun', 'Nov'] else 0)
df = df.drop('month', axis=1)
df = df.drop(['title', 'cast', 'director'], axis=1)
df = pd.get_dummies(df, prefix='is')
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
return df
In [134]:
X, y = rgf.drop('revenue', axis=1), rgf['revenue']
In [135]:
X = feature_engineering(X)
In [136]:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25)
In [137]:
X.shape
Out[137]:
(5393, 38)
In [138]:
reg = GradientBoostingRegressor()
reg.fit(train_X, train_y)
reg.score(test_X, test_y)
Out[138]:
0.74557803984046966
We get a Coefficient of Determination of 0.78 which is a pretty score for the basic model that we have built. Let us compare our model’s score to a Dummy Regressor.In [139]:
dummy = DummyRegressor()
dummy.fit(train_X, train_y)
dummy.score(test_X, test_y)
Out[139]:
-0.0021536902472349961
We see that our model performs far more superiorly than the Dummy Regressor. Finally, let us plot the feature importances in the form of a bar plot to deduce which features were the most significant in our making predictions.In [140]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,12))
sns.barplot(x=reg.feature_importances_, y=X.columns)
Out[140]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f96aa4358>

We notice that vote_count, a feature we cheated with, is the most important feature to our Gradient Boosting Model. This goes on to show the improtance of popularity metrics in determining the revenue of a movie. Budget was the second most important feature followed by Popularity (Literally, a popularity metric) and Crew Size.
Classification: Predicting Movie Sucess
What factors influence whether a movie will be able to recoup its investments? I will try to answer this question by building a binary classifier that predicts if a movie will make a profit or a loss. As with our regression model, we will cheat a little and use features that may not be available to us in the real world for the lack of other useful popularity metrics.
We have already performed extensive analysis of our data and haven’t done a lot with respect to determining factors that make a movie a success. We shall attempt at doing that in this section and follow it up by building our model.In [141]:
cls = df[df['return'].notnull()]
cls.shape
Out[141]:
(5393, 30)
In [142]:
cls.columns
Out[142]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'original_language', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'return', 'year', 'day', 'month', 'cast', 'crew', 'cast_size', 'crew_size', 'director'], dtype='object')
In [143]:
cls = cls.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'revenue'], axis=1)
Let us convert our return feature into a binary variable that will serve as our classes: 0 indicating a flop and 1 indicating a hit.In [144]:
cls['return'] = cls['return'].apply(lambda x: 1 if x >=1 else 0)
In [145]:
cls['return'].value_counts()
Out[145]:
1 3784 0 1609 Name: return, dtype: int64
Our classes seem to be fairly balanced. We do not need to apply any additional methods to deal with the imbalance of classes. Let us now turn our attention to our features.
In [146]:
cls['belongs_to_collection'] = cls['belongs_to_collection'].fillna('').apply(lambda x: 0 if x == '' else 1)
In [147]:
sns.set(style="whitegrid")
g = sns.PairGrid(data=cls, x_vars=['belongs_to_collection'], y_vars='return', size=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))
Out[147]:
<seaborn.axisgrid.PairGrid at 0x7f9f9763be10>

It seems that movies that belong to a franchise have a higher probability of being a success.In [148]:
cls['homepage'] = cls['homepage'].fillna('').apply(lambda x: 0 if x == '' else 1)
g = sns.PairGrid(data=cls, x_vars=['homepage'], y_vars='return', size=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))
Out[148]:
<seaborn.axisgrid.PairGrid at 0x7f9f970a1ac8>

We see that with homepages, there is not a very huge difference in probability. To avoid the curse of dimensionality, we will eliminate this feature as it is not very useful.In [149]:
s = cls.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_cls = cls.drop('genres', axis=1).join(s)
In [150]:
ctab = pd.crosstab([gen_cls['genre']], gen_cls['return'], dropna=False).apply(lambda x: x/x.sum(), axis=1)
ctab.plot(kind='bar', stacked=True, legend=False)
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f95f4b470>

We find that TV Movies have a 0% Failure Rate but that is most probably because they are extremely few in numner. Foreign Films have a higher rate of failure than average. Since there isn’t anything drastic about a particular genre, we will proceed with one hot encoding all genres.In [151]:
cls.columns
Out[151]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'popularity', 'production_companies', 'production_countries', 'runtime', 'spoken_languages', 'title', 'video', 'vote_average', 'vote_count', 'return', 'year', 'day', 'month', 'cast', 'crew', 'cast_size', 'crew_size', 'director'], dtype='object')
In [152]:
def classification_engineering(df):
for genre <strong>in</strong> genres_train:
df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre <strong>in</strong> x else 0)
df['genres'] = df['genres'].apply(lambda x: len(x))
df = df.drop('homepage', axis=1)
df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
df = df.drop('original_language', axis=1)
df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
df = df.drop('day', axis=1)
df['is_Holiday'] = df['month'].apply(lambda x: 1 if x <strong>in</strong> ['Apr', 'May', 'Jun', 'Nov'] else 0)
df = df.drop('month', axis=1)
df = df.drop(['title', 'cast', 'director'], axis=1)
<em>#df = pd.get_dummies(df, prefix='is')</em>
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
df = df.drop('crew', axis=1)
return df
In [153]:
cls = classification_engineering(cls)
In [154]:
cls.columns
Out[154]:
Index(['belongs_to_collection', 'budget', 'genres', 'popularity',
'production_companies', 'production_countries', 'runtime',
'spoken_languages', 'video', 'vote_average', 'vote_count', 'return',
'year', 'cast_size', 'crew_size', 'is_Animation', 'is_Comedy',
'is_Family', 'is_Adventure', 'is_Fantasy', 'is_Drama', 'is_Romance',
'is_Action', 'is_Crime', 'is_Thriller', 'is_History',
'is_Science Fiction', 'is_Mystery', 'is_Horror', 'is_War', 'is_Foreign',
'is_Documentary', 'is_Western', 'is_Music', 'is_nan', 'is_TV Movie',
'is_english', 'is_Friday', 'is_Holiday'],
dtype='object')
In [155]:
X, y = cls.drop('return', axis=1), cls['return']
In [156]:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25, stratify=y)
In [157]:
clf = GradientBoostingClassifier()
clf.fit(train_X, train_y)
clf.score(test_X, test_y)
Out[157]:
0.79318013343217197
Our basic Gradient Boosting Classifier has an accuracy of 80%. Again, this model can be improved upon through hyperparameter tuning and more advanced feature engineering but since this is not the main objective of this project, I will skip this.In [158]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(train_X, train_y)
dummy.score(test_X, test_y)
Out[158]:
0.70200148257968864
In [159]:
plt.figure(figsize=(10,12))
sns.barplot(x=clf.feature_importances_, y=X.columns)
Out[159]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9f975120f0>

We see that Vote Count is once again the most significant feature identified by our Classifier. Other important features include Budget, Popularity and Year. With this, we will conclude our discussion on the classification model and move on to the main part of the project.
In the next notebook, I will try to build a Hybrid Recommendation System based on popularity, content and collaborative filtering. This will use the MovieLens dataset in addition to the TMDB Movies Metadata that we have used here.
Next -> Movie Recommender Systems
Happy Learning 🙂
Important Notice for college students
If you’re a college student and have skills in programming languages, Want to earn through blogging? Mail us at geekycomail@gmail.com
For more Programming related blogs Visit Us Geekycodes . Follow us on Instagram.