Project: Investigate a Dataset - The Movie Database (TMDb)

Table of Contents

Introduction

For this investigation, I chose to look into the TMDb movie data set. This data set contains movie information such as cast and genre, and also financial information regarding the movie's budget and revenue. It also includes adjusted budget and revenue values in terms of 2010 dollars, accounting for inflation over time. These columns are called 'budget_adj' and 'revenue_adj' respectively.

The questions that I plan to explore will revolve around traits that relate to the revenue of the movies, particulary the adjusted revenue in terms of 2010 dollars since movies across multiple release years will be explored.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline

# Change floats from displaying in scientific notation
pd.options.display.float_format = '{:20,.2f}'.format

Data Wrangling

Here's a look at the first few rows of the data:

In [2]:
df = pd.read_csv('tmdb-movies.csv')
df.head()
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.99 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.50 2015 137,999,939.28 1,392,445,892.52
1 76341 tt1392190 28.42 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.10 2015 137,999,939.28 348,161,292.49
2 262500 tt2908446 13.11 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.30 2015 101,199,955.47 271,619,025.41
3 140607 tt2488496 11.17 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.50 2015 183,999,919.04 1,902,723,129.80
4 168259 tt2820852 9.34 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.30 2015 174,799,923.09 1,385,748,801.47

5 rows × 21 columns

Here's a look at the statistical overviews:

In [3]:
df.describe()
Out[3]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00 10,866.00
mean 66,064.18 0.65 14,625,701.09 39,823,319.79 102.07 217.39 5.97 2,001.32 17,551,039.82 51,364,363.25
std 92,130.14 1.00 30,913,213.83 117,003,486.58 31.38 575.62 0.94 12.81 34,306,155.72 144,632,485.04
min 5.00 0.00 0.00 0.00 0.00 10.00 1.50 1,960.00 0.00 0.00
25% 10,596.25 0.21 0.00 0.00 90.00 17.00 5.40 1,995.00 0.00 0.00
50% 20,669.00 0.38 0.00 0.00 99.00 38.00 6.00 2,006.00 0.00 0.00
75% 75,610.00 0.71 15,000,000.00 24,000,000.00 111.00 145.75 6.60 2,011.00 20,853,251.08 33,697,095.72
max 417,859.00 32.99 425,000,000.00 2,781,505,847.00 900.00 9,767.00 9.20 2,015.00 425,000,000.00 2,827,123,750.41

Here's a look at the column data types and counts:

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data Cleaning

First let's check for duplicates.

In [5]:
df[df.duplicated() == True].shape
Out[5]:
(1, 21)

We have one duplicate. The following cell will remove it.

In [6]:
df.drop_duplicates(inplace=True)

Check the duplicate was removed.

In [7]:
df[df.duplicated() == True].shape
Out[7]:
(0, 21)

One of the columns of interest is 'genres', and there are some movies for which this data is missing.

In [8]:
df[df['genres'].isnull()].shape
Out[8]:
(23, 21)

These rows will be removed in the following cell.

In [9]:
df = df.dropna(subset = ['genres'])

Check that these rows were removed.

In [10]:
df[df['genres'].isnull()].shape
Out[10]:
(0, 21)

There are also rows which have a value of 0 for either the 'budget_adj' or the 'revenue_adj' columns.

In [11]:
df.query("(budget_adj == 0) | (revenue_adj == 0)").shape
Out[11]:
(6988, 21)

The next cell will remove those rows.

In [12]:
df = df[df['budget_adj'] > 0]
df = df[df['revenue_adj'] > 0]

Check that these rows were removed.

In [13]:
df.query("(budget_adj == 0) | (revenue_adj == 0)").shape
Out[13]:
(0, 21)

Exploratory Data Analysis

Research Question 1: How does the movie's budget relate to revenue?

The obvious thing to look at regarding revenue would be the other monetary value in the data set, the budget. Specifically, does a higher budget correlate with a higher revenue?

Here's a plot showing the relationship between 'budget_adj' and 'revenue_adj':

In [14]:
ax = df.plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue',figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

There is a lot of density of values when the 'budget_adj' is less than 100,000,000. The following 2 plots will separate the visuals at this value to get a closer view.

Plot for 'budget_adj' <= 100,000,000:

In [15]:
ax = df[df['budget_adj'] <= 100000000].plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue (Adjusted Budget <= 100,000,000)', figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Plot for 'budget_adj' > 100,000,000:

In [16]:
ax = df[df['budget_adj'] > 100000000].plot(x='budget_adj', y='revenue_adj', kind='scatter', title='Adjusted Budget and Adjusted Revenue (Adjusted Budget > 100,000,000)', figsize=(16,8))
ax.set_xticklabels(['{:,}'.format(int(x)) for x in ax.get_xticks().tolist()])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Research Question 2: How does the movie's genre relate to revenue?

Specifically, are there certain genres that tend to generate more revenue than others?

Since a movie can have multiple genres, let's first get all the unique genre names.

In [17]:
genres_set = set()

for genres in df['genres'].iteritems():
    for genre in genres[1].split('|'):
        genres_set.add(genre)

genres_set = sorted(genres_set)

genres_set
Out[17]:
['Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western']

Now let's add a new column for each genre, with a value of True or False depending on if the movie is categorized as that specific genre.

In [18]:
for genre in genres_set:
    df[genre] = df['genres'].str.contains(genre)

# Adjust the column slice so we can see some of the added individual genre columns
df.iloc[:, 13:].head()
Out[18]:
genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj Action Adventure ... History Horror Music Mystery Romance Science Fiction TV Movie Thriller War Western
0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.50 2015 137,999,939.28 1,392,445,892.52 True True ... False False False False False True False True False False
1 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.10 2015 137,999,939.28 348,161,292.49 True True ... False False False False False True False True False False
2 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.30 2015 101,199,955.47 271,619,025.41 False True ... False False False False False True False True False False
3 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.50 2015 183,999,919.04 1,902,723,129.80 True True ... False False False False False True False False False False
4 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.30 2015 174,799,923.09 1,385,748,801.47 True False ... False False False False False False False True False False

5 rows × 28 columns

With the added columns showing individual genres for each movie, we can now examine each genre and the average adjusted revenue for each of them.

The next cell creates a new data frame with the average adjusted revenue for each genre, and sorts it by the amount.

In [19]:
# This dictionary will be used to create a new data frame
genre_rev_adj_means_dict = {
    'mean_revenue_adj': {}
}

# For each genre and associated mean adjusted revenue, add a key/value to the 'mean_revenue_adj' column
for genre in genres_set:
    mean_rev_adj = df[df[genre]]['revenue_adj'].mean()
    genre_rev_adj_means_dict['mean_revenue_adj'][genre] = mean_rev_adj

# Create and sort the data frame
df_mean_rev_adj_genre = pd.DataFrame.from_dict(genre_rev_adj_means_dict)
df_mean_rev_adj_genre.sort_values(by=['mean_revenue_adj'], inplace=True)

df_mean_rev_adj_genre
Out[19]:
mean_revenue_adj
Foreign 12,733,777.80
Documentary 24,806,165.83
TV Movie 58,389,103.04
Horror 81,406,555.10
Drama 101,378,466.61
Crime 110,395,135.21
Romance 113,519,845.24
Mystery 113,621,019.76
Comedy 121,308,528.54
History 121,661,724.41
Thriller 128,170,894.62
Music 134,566,015.89
Western 135,674,767.39
War 155,898,111.71
Action 195,387,938.30
Science Fiction 202,153,142.41
Family 243,791,030.52
Fantasy 249,992,751.60
Adventure 271,407,469.11
Animation 290,957,382.26

Here's a plot of the data:

In [20]:
ax = df_mean_rev_adj_genre.plot(title='Average Adjusted Revenue By Genre', kind='bar', figsize=(16,8), legend=False)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Research Question 3: How has movie revenue changed over time?

Specifically, is there a trend of revenue increasing or decreasing over time?

Let's first get all the unique release years.

In [21]:
release_year_set = sorted(set(df['release_year']))
release_year_set
Out[21]:
[1960,
 1961,
 1962,
 1963,
 1964,
 1965,
 1966,
 1967,
 1968,
 1969,
 1970,
 1971,
 1972,
 1973,
 1974,
 1975,
 1976,
 1977,
 1978,
 1979,
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015]
In [22]:
# This dictionary will be used to create a new data frame
release_year_rev_adj_means_dict = {
    'mean_revenue_adj': {}
}

# For each release year and associated mean adjusted revenue, add a key/value to the 'mean_revenue_adj' column
for year in release_year_set:
    mean_rev_adj = df[df['release_year'] == year]['revenue_adj'].mean()
    release_year_rev_adj_means_dict['mean_revenue_adj'][year] = mean_rev_adj

# Create the data frame
df_mean_rev_adj_release_year = pd.DataFrame.from_dict(release_year_rev_adj_means_dict)

df_mean_rev_adj_release_year
Out[22]:
mean_revenue_adj
1960 190,229,933.36
1961 246,362,189.86
1962 212,474,021.01
1963 209,483,843.96
1964 330,034,445.30
1965 634,036,929.51
1966 113,852,464.34
1967 367,986,932.20
1968 154,788,904.49
1969 341,359,887.15
1970 280,804,105.16
1971 155,875,175.16
1972 313,460,405.45
1973 456,333,938.47
1974 244,983,310.81
1975 380,559,321.50
1976 191,566,135.30
1977 356,399,689.35
1978 230,306,198.66
1979 221,779,304.02
1980 152,102,897.33
1981 128,862,902.47
1982 187,346,635.33
1983 140,560,952.02
1984 120,269,366.22
1985 111,898,314.33
1986 95,095,905.22
1987 115,039,557.71
1988 101,151,226.02
1989 157,685,421.25
1990 146,470,597.12
1991 139,715,251.56
1992 149,138,582.01
1993 129,489,752.90
1994 154,779,714.66
1995 145,367,113.45
1996 124,785,188.12
1997 157,106,357.60
1998 134,149,850.75
1999 128,095,156.22
2000 130,810,063.87
2001 134,835,978.66
2002 137,552,814.44
2003 144,573,765.03
2004 129,468,349.64
2005 110,220,544.61
2006 100,105,715.62
2007 118,810,961.44
2008 114,990,278.01
2009 126,215,513.51
2010 122,496,407.17
2011 114,029,684.03
2012 145,373,726.67
2013 126,627,975.74
2014 134,367,474.95
2015 150,666,739.81

Here's a plot of the data:

In [23]:
ax = df_mean_rev_adj_release_year.plot(title='Average Adjusted Revenue By Release Year', kind='bar', figsize=(16,8), legend=False)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()]);

Conclusions

Research Question 1: How does the movie's budget relate to revenue?

  • There does not appear to be any correlation between budget and revenue.

Research Question 2: How does the movie's genre relate to revenue?

  • There does appear to be a trend that some movie genres tend to generate a higher revenue, with the Animation and Adventure genres being the top revenue generators.
  • Foreign and Documentary genres show as being the lowest revenue generators.

Research Question 3: How has movie revenue changed over time?

  • Since the 1960s revenues have fluctuated, with the 1960s and 1970s being the period of having the highest revenues.
  • From the 1980s and later there have been some relatively small fluctuations but overall have been relatively steady.

Limitations Of The Data Set

  • The TMDb data set has been used for the analysis and worked with traits such as popularity, revenue and runtime. Analysis is limited to only the provided data. For example, the data set does not confirm that the director of every movie is listed.
  • There is no normalization or exchanges rate or currency conversion considered during this analysis, and is limited to the numerical values of revenue.
  • Dropping missing or Null values from variables of our interest might skew our analysis and could show unintentional bias towards relationships being analyzed. This could also cause increased disparities of representation for certain groups of data, in addition to any already present in the data set.
  • It should be noted that a movie can represent multiple genres. In the genre analysis, a movie is taken into account for each genre it represents.