Project: Investigate a Dataset - The Movie Database (TMDb)¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

For this investigation, I chose to look into the TMDb movie data set. This data set contains movie information such as cast and genre, and also financial information regarding the movie's budget and revenue. It also includes adjusted budget and revenue values in terms of 2010 dollars, accounting for inflation over time. These columns are called 'budget_adj' and 'revenue_adj' respectively.

The questions that I plan to explore will revolve around traits that relate to the revenue of the movies, particulary the adjusted revenue in terms of 2010 dollars since movies across multiple release years will be explored.

import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline

# Change floats from displaying in scientific notation
pd.options.display.float_format = '{:20,.2f}'.format

Data Wrangling¶

Here's a look at the first few rows of the data:¶

df = pd.read_csv('tmdb-movies.csv')
df.head()

Here's a look at the statistical overviews:¶

df.describe()

Here's a look at the column data types and counts:¶

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data Cleaning¶

First let's check for duplicates.¶

df[df.duplicated() == True].shape

(1, 21)

We have one duplicate. The following cell will remove it.¶

df.drop_duplicates(inplace=True)

Check the duplicate was removed.¶

df[df.duplicated() == True].shape

(0, 21)

One of the columns of interest is 'genres', and there are some movies for which this data is missing.¶

df[df['genres'].isnull()].shape

(23, 21)

These rows will be removed in the following cell.¶

df = df.dropna(subset = ['genres'])

Check that these rows were removed.¶

df[df['genres'].isnull()].shape

(0, 21)

There are also rows which have a value of 0 for either the 'budget_adj' or the 'revenue_adj' columns.¶

df.query("(budget_adj == 0) | (revenue_adj == 0)").shape

(6988, 21)

The next cell will remove those rows.¶

df = df[df['budget_adj'] > 0]
df = df[df['revenue_adj'] > 0]

Check that these rows were removed.¶

df.query("(budget_adj == 0) | (revenue_adj == 0)").shape

(0, 21)

Exploratory Data Analysis¶

Research Question 1: How does the movie's budget relate to revenue?¶

The obvious thing to look at regarding revenue would be the other monetary value in the data set, the budget. Specifically, does a higher budget correlate with a higher revenue?

Here's a plot showing the relationship between 'budget_adj' and 'revenue_adj':¶

ax = df.plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue',figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

There is a lot of density of values when the 'budget_adj' is less than 100,000,000. The following 2 plots will separate the visuals at this value to get a closer view.¶

Plot for 'budget_adj' <= 100,000,000:¶

ax = df[df['budget_adj'] <= 100000000].plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue (Adjusted Budget <= 100,000,000)', figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Plot for 'budget_adj' > 100,000,000:¶

ax = df[df['budget_adj'] > 100000000].plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue (Adjusted Budget > 100,000,000)', figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Research Question 2: How does the movie's genre relate to revenue?¶

Specifically, are there certain genres that tend to generate more revenue than others?

Since a movie can have multiple genres, let's first get all the unique genre names.¶

genres_set = set()

for genres in df['genres'].iteritems():
    for genre in genres[1].split('|'):
        genres_set.add(genre)

genres_set = sorted(genres_set)

genres_set

['Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western']

Now let's add a new column for each genre, with a value of True or False depending on if the movie is categorized as that specific genre.¶

for genre in genres_set:
    df[genre] = df['genres'].str.contains(genre)

# Adjust the column slice so we can see some of the added individual genre columns
df.iloc[:, 13:].head()

With the added columns showing individual genres for each movie, we can now examine each genre and the average adjusted revenue for each of them.¶

The next cell creates a new data frame with the average adjusted revenue for each genre, and sorts it by the amount.¶

# This dictionary will be used to create a new data frame
genre_rev_adj_means_dict = {
    'mean_revenue_adj': {}
}

# For each genre and associated mean adjusted revenue, add a key/value to the 'mean_revenue_adj' column
for genre in genres_set:
    mean_rev_adj = df[df[genre]]['revenue_adj'].mean()
    genre_rev_adj_means_dict['mean_revenue_adj'][genre] = mean_rev_adj

# Create and sort the data frame
df_mean_rev_adj_genre = pd.DataFrame.from_dict(genre_rev_adj_means_dict)
df_mean_rev_adj_genre.sort_values(by=['mean_revenue_adj'], inplace=True)

df_mean_rev_adj_genre

Here's a plot of the data:¶

ax = df_mean_rev_adj_genre.plot(title='Average Adjusted Revenue By Genre', kind='bar', figsize=(16,8), legend=False)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Research Question 3: How has movie revenue changed over time?¶

Specifically, is there a trend of revenue increasing or decreasing over time?

Let's first get all the unique release years.¶

release_year_set = sorted(set(df['release_year']))
release_year_set

[1960,
 1961,
 1962,
 1963,
 1964,
 1965,
 1966,
 1967,
 1968,
 1969,
 1970,
 1971,
 1972,
 1973,
 1974,
 1975,
 1976,
 1977,
 1978,
 1979,
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015]

# This dictionary will be used to create a new data frame
release_year_rev_adj_means_dict = {
    'mean_revenue_adj': {}
}

# For each release year and associated mean adjusted revenue, add a key/value to the 'mean_revenue_adj' column
for year in release_year_set:
    mean_rev_adj = df[df['release_year'] == year]['revenue_adj'].mean()
    release_year_rev_adj_means_dict['mean_revenue_adj'][year] = mean_rev_adj

# Create the data frame
df_mean_rev_adj_release_year = pd.DataFrame.from_dict(release_year_rev_adj_means_dict)

df_mean_rev_adj_release_year

Here's a plot of the data:¶

ax = df_mean_rev_adj_release_year.plot(title='Average Adjusted Revenue By Release Year', kind='bar', figsize=(16,8), legend=False)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Conclusions¶

Research Question 1: How does the movie's budget relate to revenue?¶

There does not appear to be any correlation between budget and revenue.

Research Question 2: How does the movie's genre relate to revenue?¶

There does appear to be a trend that some movie genres tend to generate a higher revenue, with the Animation and Adventure genres being the top revenue generators.
Foreign and Documentary genres show as being the lowest revenue generators.

Research Question 3: How has movie revenue changed over time?¶

Since the 1960s revenues have fluctuated, with the 1960s and 1970s being the period of having the highest revenues.
From the 1980s and later there have been some relatively small fluctuations but overall have been relatively steady.

Limitations Of The Data Set¶

The TMDb data set has been used for the analysis and worked with traits such as popularity, revenue and runtime. Analysis is limited to only the provided data. For example, the data set does not confirm that the director of every movie is listed.
There is no normalization or exchanges rate or currency conversion considered during this analysis, and is limited to the numerical values of revenue.
Dropping missing or Null values from variables of our interest might skew our analysis and could show unintentional bias towards relationships being analyzed. This could also cause increased disparities of representation for certain groups of data, in addition to any already present in the data set.
It should be noted that a movie can represent multiple genres. In the genre analysis, a movie is taken into account for each genre it represents.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.99	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.50	2015	137,999,939.28	1,392,445,892.52
1	76341	tt1392190	28.42	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.10	2015	137,999,939.28	348,161,292.49
2	262500	tt2908446	13.11	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.30	2015	101,199,955.47	271,619,025.41
3	140607	tt2488496	11.17	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.50	2015	183,999,919.04	1,902,723,129.80
4	168259	tt2820852	9.34	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.30	2015	174,799,923.09	1,385,748,801.47

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00	10,866.00
mean	66,064.18	0.65	14,625,701.09	39,823,319.79	102.07	217.39	5.97	2,001.32	17,551,039.82	51,364,363.25
std	92,130.14	1.00	30,913,213.83	117,003,486.58	31.38	575.62	0.94	12.81	34,306,155.72	144,632,485.04
min	5.00	0.00	0.00	0.00	0.00	10.00	1.50	1,960.00	0.00	0.00
25%	10,596.25	0.21	0.00	0.00	90.00	17.00	5.40	1,995.00	0.00	0.00
50%	20,669.00	0.38	0.00	0.00	99.00	38.00	6.00	2,006.00	0.00	0.00
75%	75,610.00	0.71	15,000,000.00	24,000,000.00	111.00	145.75	6.60	2,011.00	20,853,251.08	33,697,095.72
max	417,859.00	32.99	425,000,000.00	2,781,505,847.00	900.00	9,767.00	9.20	2,015.00	425,000,000.00	2,827,123,750.41

	mean_revenue_adj
Foreign	12,733,777.80
Documentary	24,806,165.83
TV Movie	58,389,103.04
Horror	81,406,555.10
Drama	101,378,466.61
Crime	110,395,135.21
Romance	113,519,845.24
Mystery	113,621,019.76
Comedy	121,308,528.54
History	121,661,724.41
Thriller	128,170,894.62
Music	134,566,015.89
Western	135,674,767.39
War	155,898,111.71
Action	195,387,938.30
Science Fiction	202,153,142.41
Family	243,791,030.52
Fantasy	249,992,751.60
Adventure	271,407,469.11
Animation	290,957,382.26

	mean_revenue_adj
1960	190,229,933.36
1961	246,362,189.86
1962	212,474,021.01
1963	209,483,843.96
1964	330,034,445.30
1965	634,036,929.51
1966	113,852,464.34
1967	367,986,932.20
1968	154,788,904.49
1969	341,359,887.15
1970	280,804,105.16
1971	155,875,175.16
1972	313,460,405.45
1973	456,333,938.47
1974	244,983,310.81
1975	380,559,321.50
1976	191,566,135.30
1977	356,399,689.35
1978	230,306,198.66
1979	221,779,304.02
1980	152,102,897.33
1981	128,862,902.47
1982	187,346,635.33
1983	140,560,952.02
1984	120,269,366.22
1985	111,898,314.33
1986	95,095,905.22
1987	115,039,557.71
1988	101,151,226.02
1989	157,685,421.25
1990	146,470,597.12
1991	139,715,251.56
1992	149,138,582.01
1993	129,489,752.90
1994	154,779,714.66
1995	145,367,113.45
1996	124,785,188.12
1997	157,106,357.60
1998	134,149,850.75
1999	128,095,156.22
2000	130,810,063.87
2001	134,835,978.66
2002	137,552,814.44
2003	144,573,765.03
2004	129,468,349.64
2005	110,220,544.61
2006	100,105,715.62
2007	118,810,961.44
2008	114,990,278.01
2009	126,215,513.51
2010	122,496,407.17
2011	114,029,684.03
2012	145,373,726.67
2013	126,627,975.74
2014	134,367,474.95
2015	150,666,739.81