OscaRobber

Analysis of data hosted on IMDB


Project maintained by polakowo Hosted on GitHub Pages

Table of contents

Below you will find the table of contents. To jump to a section, click on its header.

Goal

Explore the IMDB dataset and user reviews to look behind the scenes of the film industry.

The introduction video for our project can be seen here.

SocialGraphs2016 from polakowo on Vimeo.

Dataset

IMDB

To serve users who want local access to their data, IMDb offers its data set as plain text files that can be downloaded for free. The data comes as compressed archives of plain text files with the .list extension, downloaded from http://www.imdb.com/interfaces using an FTP client. The fetched data comprises 52 archives and 4 folders, about 12.63GB in total. The downloaded files have a format that is not easily readable by most of the computational tools covered in the course, so data preparation played a major role in this project and required a lot of effort. Since the files from ftp://ftp.fu-berlin.de/pub/misc/movies/database/ contain different numbers of columns and rows, and furthermore have different headers, they first had to be translated into a SQL database using the third-party tool http://imdbpy.sourceforge.net. But because SQL does not offer the infrastructure required to easily access various attributes (and has many other downsides, see the Explainer Notebook), we built a pipeline to transfer the most important data into a NoSQL database. Properties of the resulting database are listed below.

User reviews

For text analysis we used the Large Movie Review Dataset v1.0 fetched from http://ai.stanford.edu/~amaas/data/sentiment/. The core dataset contains 50,000 labeled documents, as well as unlabeled documents for unsupervised learning. In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. We used the unlabeled data, as we are not interested in the provided sentiments but want to compute sentiments ourselves. User reviews are keyed by IMDB ids rather than by the primary keys of the SQL database, hence we need to map each IMDB id to the corresponding primary key to be able to compare sentiments with other attributes such as gross, release year, and more. As the database itself holds no information on IMDB ids, we queried OMDb to find them.

OMDB

The OMDb API is a free web service for obtaining movie information. We fetched 60,000 JSON documents from https://www.omdbapi.com. Each JSON document contains the following fields.

Website Fields
IMDB Year, Rated, Released, Runtime, Genre, Director, Writer, Actors, Plot, Language, Country, Awards, Poster, imdbRating, imdbVotes, imdbID, Type, DVD, BoxOffice, Production, Website
Rotten Tomatoes tomatoMeter, tomatoReviews, tomatoFresh, tomatoRotten, tomatoConsensus, tomatoUserMeter, tomatoUserRating, tomatoUserReviews, tomatoURL

Summary

The final collections stored in MongoDB are listed below.

Collections Count Size (MB) Storage (MB)
names_map 2,851,082 172 80
prod_countries 951,419 44 18
release_years 933,715 46 14
titles_map 933,715 59 28
genres 886,236 46 22
directors 853,946 39 19
languages 848,085 39 13
cast 725,011 81 62
prod_companies 634,875 30 14
runtimes 613,963 30 14
votes 280,444 12 6
imdb_ratings 280,444 13 6
birth_years 278,288 12 6
keywords 267,438 138 129
birth_places 250,989 11 4
prod_companies_map 222,116 14 7
plots 218,348 148 109
heights 149,667 7 2
keywords_map 147,421 9 5
budgets 101,221 4 2
omdb_files 64,176 57 25
gross 15,568 0 0
user_reviews 7,091 64 45
birth_places_map 385 0 0
languages_map 342 0 0
prod_countries_map 241 0 0
genres_map 28 0 0

Hint: Many collections encode string values as integers to save space; the collections ending in "_map" map those integers back to the original strings. Don't be scared if you see something like that, it's actually pretty awesome!
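For example, decoding an encoded collection boils down to one dictionary lookup per document. The field names below ("movie_id", "genre", "_id", "value") are illustrative only, not the project's actual schema:

```python
# Sketch: decoding integer-encoded attribute values via a "_map" collection.
# Field and collection names are assumptions, not the project's actual schema.

genres = [{"movie_id": 1, "genre": 0}, {"movie_id": 2, "genre": 3}]
genres_map = [{"_id": 0, "value": "Sci-Fi"}, {"_id": 3, "value": "Horror"}]

# Build a lookup table once, then decode every document.
lookup = {doc["_id"]: doc["value"] for doc in genres_map}
decoded = [{**doc, "genre": lookup[doc["genre"]]} for doc in genres]

print(decoded)  # the integer codes are replaced by the original strings
```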

Using filtering and intelligent mapping, we were able to shrink the total size of the IMDB data from 12GB to 1GB.

You can download the complete MongoDB database here. Run mongod --dbpath data/db to start a MongoDB instance on top of the downloaded data directory.


Network analysis

Now we will perform an analysis on the network of science fiction movies. We will derive the most important network properties, test assortativity towards various attributes and uncover hidden communities.

Network representation

The collaboration graph of movies and actors is called the co-stardom network. The bipartite graph Bip=(M,A,E) has two node sets, M (movies) and A (actors), and the edges in E only connect nodes from opposite sets: a movie M is connected to an actor A if A played in M. Hence there are no direct M - M or A - A links.
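As a minimal sketch of this structure (the movie and actor names are invented), the two node sets and the cross-set edges can be represented with plain dictionaries:

```python
# Minimal sketch of the bipartite co-stardom network using plain dicts.
# Movie and actor names are invented for illustration.
edges = [("Movie A", "Actor 1"), ("Movie A", "Actor 2"), ("Movie B", "Actor 2")]

cast = {}         # movie -> set of actors
filmography = {}  # actor -> set of movies
for movie, actor in edges:
    cast.setdefault(movie, set()).add(actor)
    filmography.setdefault(actor, set()).add(movie)

# Every edge crosses between the two sets: no M - M or A - A links exist.
print(cast["Movie A"])         # the actors of Movie A
print(filmography["Actor 2"])  # the movies of Actor 2
```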

The following properties refer to the bipartite network of science fiction movies.

Nodes Links Components Largest component Clustering coef. Probability
227,155 273,936 6,594 186,639 0.0 0.000011

          Min degree Max degree Median degree Mean degree
Overall   1          448        1             2.4
Movies    0          448        0             1.2
Actors    0          48         1             1.2

The sample of this network is shown as graph below.

Sample of bipartite network

Hint: Curious about the graph design? Check out the code of NetworkHelper.draw_graph_attr.

Movie-to-movie network

Bipartite network projection is an extensively used method for compressing information about bipartite networks (read here). To directly show the relation structure among one particular set of nodes, bipartite networks are usually compressed by one-mode projection. This means that the resulting network contains nodes from only one of the two sets, and two M (or, alternatively, A) nodes are connected only when they have at least one common neighboring A (or, alternatively, M) node.
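A one-mode projection onto the movie set can be sketched in a few lines (toy data; a real implementation would use a graph library):

```python
# Sketch: one-mode projection of a bipartite network onto the movie set.
# Two movies are linked when they share at least one actor. Toy data only.
from itertools import combinations

cast = {
    "Movie A": {"Actor 1", "Actor 2"},
    "Movie B": {"Actor 2"},
    "Movie C": {"Actor 3"},
}

projection = set()
for m1, m2 in combinations(sorted(cast), 2):
    if cast[m1] & cast[m2]:  # at least one common actor
        projection.add((m1, m2))

print(projection)  # {('Movie A', 'Movie B')}
```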

The first projection is the network of movies that are connected if they have an actor in common. Its parameters are shown in the following table.

Nodes Links Clustering coef. Probability
13,989 127,209 0.36 0.0013
Min degree Max degree Median degree Mean degree
1 343 6 18.18

The sample of this network is shown as graph below.

Sample of movie network

Actor-to-actor network

The second projection is the network of actors that are connected if they played in the same movie. Its parameters are shown in the following table.

Nodes Links Clustering coef. Probability
172,650 5,111,268 0.91 0.00034
Min degree Max degree Median degree Mean degree
1 3052 36 59.21
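The mean degrees and link probabilities in the two projection tables can be cross-checked directly from the node and link counts, since mean degree = 2L/N and link probability (density) = L / (N(N-1)/2):

```python
# Cross-check of the projection statistics reported above:
# mean degree = 2L/N, density = L / (N*(N-1)/2).

def mean_degree(n_nodes, n_links):
    return 2 * n_links / n_nodes

def density(n_nodes, n_links):
    return n_links / (n_nodes * (n_nodes - 1) / 2)

# Movie projection: 13,989 nodes, 127,209 links
print(round(mean_degree(13_989, 127_209), 2))  # 18.19
print(round(density(13_989, 127_209), 4))      # 0.0013

# Actor projection: 172,650 nodes, 5,111,268 links
print(round(mean_degree(172_650, 5_111_268), 2))  # 59.21
print(round(density(172_650, 5_111_268), 5))      # 0.00034
```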

The sample of this network is shown as graph below.

Sample of actor network

Statistics

Below we list the most interesting insights gained by analyzing the co-stardom network. If you want to get into the details, look into the Explainer Notebook, where every analysis step is explained from A to Z. The main discoveries in this part refer entirely to the science fiction genre.

172,650 actors played in 13,989 movies.

The biggest cast belongs to Angry Video Game Nerd: The Movie (2014) with 448 actors. Here, IMDB lists even the normally uncredited extras in the cast.

Angry Video Game Nerd: The Movie (2014)

Actor Kōichi Yamadera (1961) played in 48 science fiction films.

Kōichi Yamadera (1961)

On average, an actor plays in only 1.2 movies, and vice versa.

82% of movies and actors are connected, meaning that if you become friends with a well-known actor, then through a chain of actor-to-actor relationships you can reach 82% of all actors on the planet.

The probability that a particular actor plays in a particular movie is 0.0014%.

The probability that an actor plays in two random movies is 0.13%.

The probability that a movie hires two random actors is 0.034%.

An actor knows 59 actors on average.

Stan Lee (1922) knows 3,052 actors in total.

Stan Lee (1922)

A movie is connected through an actor to 18 movies on average.

Any two actors (or movies) can be reached from each other in at most 14 steps.

Degrees of the co-stardom network follow a power-law distribution: the most common number of movies an actor plays in is 1, while the most common number of actors a movie hires is 2.

The friendship paradox holds: most actors know fewer actors than their co-actors do on average.

Centrality

In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph.

We are going to explore the most central nodes according to a few different centrality definitions.

Betweenness and closeness centrality

Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. Closeness centrality measures the mean distance from a node to other nodes.
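As an illustration of the closeness definition, here is a small BFS-based sketch on a toy path graph (not the project's actual code):

```python
# Sketch of closeness centrality: BFS distances from each node,
# closeness = (n - 1) / sum of distances. Toy path graph a-b-c-d.
from collections import deque

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

def closeness(graph, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search for shortest path lengths
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return (len(graph) - 1) / sum(dist.values())

# Central nodes of the path score higher than the endpoints.
print(closeness(graph, "b"))  # 3 / (1+0+1+2) = 0.75
print(closeness(graph, "a"))  # 3 / (0+1+2+3) = 0.5
```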

If you want to become friends with the most actors on Earth, first befriend the casts of Iron Man Three (2013), Transformers: Dark of the Moon (2011), or Captain America: Civil War (2016), as they are the most central hubs in the co-stardom network of science fiction movies.

Iron Man Three (2013) Transformers: Dark of the Moon (2011) Captain America: Civil War (2016)

Degree centrality

Degree centrality is the most basic method of defining centrality, basing the centrality only on the number of neighbours a node has.

Actors who played in Kaze no tani no Naushika (1984) also played in 343 different movies, which is a record.

Kaze no tani no Naushika (1984)

Eigenvector centrality

The eigenvector centrality thesis reads: A node is important if it is linked to by other important nodes.

Actors who played in Kidô senshi Gandamu (1981) also played in highly connected movies like Kaze no tani no Naushika (1984).

Kidô senshi Gandamu (1981)

Attributes

Now we move on to observe different attributes (of movies as well as of actors). This step of the analysis helps us uncover the real structure of the network and how it has grown. We will check the assortativity of nodes with respect to attributes, and also answer whether attributes form good communities across the network.

The science fiction genre experienced its main gain in popularity in the early 2000s, when movie production technology evolved rapidly, and it is going through its peak right now.

The most common runtimes are around 10-30 min and 80-120 min.

Most movies are low-budget. The larger a movie's budget, the more it spends on cast and crew, and the more bankable actors it hires, actors who also play in many other pictures.

The more a movie invests, the more it earns in return.

Most actors of sci-fi films were born between 1950 and 1980. Actors born in later years are younger and therefore less in demand than more experienced actors. We also see a trend that science fiction is becoming more and more popular, with many new actors joining the genre.

The height attribute, in contrast, follows a normal (Gaussian) distribution and isn't correlated with any other attribute.

Assortativity

Assortative mixing is the preference for a network's nodes to attach to others that are similar in some way. The higher the assortativity coefficient for an attribute, the more the nodes prefer to attach to other nodes with a similar attribute value.

By knowing assortativity we can closely approximate the behavior of the network.
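For numeric attributes such as degree, the assortativity coefficient is the Pearson correlation of the attribute values at the two ends of every edge. A small sketch on a toy star graph, where high-degree nodes attach only to low-degree ones, yielding perfect disassortativity:

```python
# Sketch of degree assortativity: the Pearson correlation between the
# degrees at both ends of every edge (each edge counted in both directions).
import math

edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]  # a star graph

degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

xs, ys = [], []
for u, v in edges:  # count each edge in both directions
    xs += [degree[u], degree[v]]
    ys += [degree[v], degree[u]]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

print(round(r, 2))  # -1.0: in a star, hubs attach only to leaves
```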

Assortativity towards movie attributes

Assortativity Attribute Coefficient
Numeric Release year 0.81
Attribute Language 0.67
Attribute Production country 0.66
Numeric Degree 0.33
Numeric Budget 0.28
Numeric Runtime 0.22
Numeric Votes 0.20
Numeric Gross 0.12
Attribute Production company 0.06
Attribute Director 0.04


Assortativity towards actor attributes

Assortativity Attribute Coefficient
Numeric Birth year 0.74
Attribute Birth place 0.40
Numeric Degree 0.33
Numeric Height 0.02


Communities

Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.

Next we will see whether attributes form good communities, and will uncover some new ones.

Modularity of movie attributes

Attribute Communities Modularity
Production country 70 0.34
Release year 10 (rescaled) 0.25
Language 54 0.23
Budget 10 (rescaled) 0.10
Runtime 7 (rescaled) 0.09
Production company 6,265 0.06
Director 10,409 0.04
Gross 8 (rescaled) 0.02
Votes 9 (rescaled) 0.02

Hint: Numeric attributes like release year may span a very wide range of values, which hurts performance. We discovered a correlation between the maximal value on the scale and the running time: if some values exceed ~1000, the algorithm hangs, hence some attributes have been rescaled to [1,10]. The modularity hasn't suffered.

Production countries, release years, and languages are not good communities; for example, movies made in the USA are interconnected only moderately, while they share many connections with movies made outside the USA. Filmmaking is an international industry.

Modularity of actor attributes

Attribute Communities Modularity
Birth place 205 0.24
Birth year 10 (rescaled) 0.18
Height 5 (rescaled) 0.01

None of the actor-related attributes form good communities. For birth place this means that actors very commonly act with international co-actors, while for birth year it means that teens commonly play alongside adults, and vice versa.

Next we illustrate the modularity of the birth place attribute by drawing Danish actors and their co-actors.

Sample of the network of Danish actors

We see that only a few Danes acted together with other Danes. Hence, birth places fail to represent communities in the actor-to-actor network.

Modularity of best partition

This is the partition with the highest modularity, i.e. the top-level partition of the dendrogram generated by the Louvain algorithm.

Network Communities Modularity
Movies projection 83 0.60
Actor projection 163 0.84

The communities detected by the algorithm are much better than those given by attributes.
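The modularity Q of a given partition compares the fraction of intra-community links with the fraction expected at random; a small sketch on a toy graph of two triangles (not the project's actual code):

```python
# Sketch: modularity Q of a given partition,
# Q = sum over communities c of (L_c / L - (d_c / (2*L))**2),
# where L_c = intra-community links and d_c = total degree of community c.

edges = [("a", "b"), ("b", "c"), ("a", "c"),   # triangle 1
         ("d", "e"), ("e", "f"), ("d", "f"),   # triangle 2
         ("c", "d")]                           # bridge between them
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

L = len(edges)
intra = {}    # community -> links inside it
deg_sum = {}  # community -> summed degree of its nodes
for u, v in edges:
    if partition[u] == partition[v]:
        intra[partition[u]] = intra.get(partition[u], 0) + 1
    deg_sum[partition[u]] = deg_sum.get(partition[u], 0) + 1
    deg_sum[partition[v]] = deg_sum.get(partition[v], 0) + 1

Q = sum(intra.get(c, 0) / L - (deg_sum[c] / (2 * L)) ** 2 for c in deg_sum)
print(round(Q, 3))  # 0.357: the two triangles are a good partition
```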

Let us get a feel for the newly calculated modularity by drawing a random sample of the graph.

Best partition by Louvain algorithm

Each coloured node (and edge) belongs to community 0, with the colour depending on degree. All grey nodes (and edges) belong to other communities. We can see an almost perfect partition: the number of links within nodes of the same community (the strong boundaries of the coloured sections) is higher than the number of links to other communities.


Text analysis

Now we are going to dig a little deeper into the contents of movies, to answer whether there is a correlation between the keywords provided by IMDB and the plots, and whether actors tend to act in movies with similar content.

Important: As the same calculations are performed for both keywords and plots, we display the analysis of keywords only. For further information see the Explainer Notebook.

Keywords

First of all we will explore the keywords provided by IMDB.

Most common

Below we identify the most common terms in the film industry and visualize them with a word cloud, where the importance of each tag is indicated by font size and colour.

The most common keywords

The most common term is independent-film, a feature film that is produced mostly or completely outside of the major film studio system.

Films based on a novel are also popular.

By genre

Next we group keywords by genre to discover the differences between genres.

Most common keywords by genre

More general terms like murder, female-nudity, sex and death describe what most films are about: murder and sex.

Keywords provided by IMDB correlate highly with the contents of entire genres. The most common words of science fiction are alien, outer-space, monster, and robot. Horror films can mainly be described by the terms murder, death, and blood. The terms murder and police top the crime genre. The term murder is also very frequent in dramas. War films commonly show WWII and Nazis.

Top 250 movies

Further, we display the top keywords of the top 250 movies by rating, budget and gross, using a word cloud.

Keywords of top 250

The most profitable movies are biased towards particular terms.

Explosion is one of the most frequent words for the most expensive movies. The more money a production company has, the more it invests in special effects, and the more explosions it shows. Among the top 250 films by gross, blockbusters are the most prominent. The box-office-hit keyword (describing a highly grossing picture) is also the most common in the top 250 films by budget. All those keywords are less common among the movies with the highest IMDB ratings, though, where independent-film and mockumentary (parody) top the list.

The most successful and the top-rated films seem to differ a lot; they even mirror opposite trends: the keyword independent-film is almost absent from the keyword sets of the most expensive and most profitable movies, and conversely, the term box-office-hit does not appear among the top-rated films. We make the same observation when we look at the word love. Quite a lot of top-rated movies are about love, but only a handful of the most successful films are (these are mainly blockbusters with breathtaking explosions); that is, the top-rated films engage the feelings of viewers and trigger sympathy.

Similarity

Next we want to test the similarity of film contents based on keywords. We do this by vectorization.

Vectorization is the general process of turning a collection of text documents into numerical feature vectors.

Documents in our case are sets of keywords, described by keyword occurrences while completely ignoring the relative positions of the keywords in the set. The shape of the resulting bag-of-words matrix is shown below.

Movies Vocabulary
5,955 32,660
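A minimal sketch of such vectorization and of a cosine-similarity comparison between two keyword sets (toy documents; the project's pipeline presumably used a library vectorizer):

```python
# Minimal bag-of-words vectorization and cosine similarity between two
# keyword sets. Toy documents only; not the project's actual pipeline.
import math
from collections import Counter

doc_a = "alien outer-space robot alien"
doc_b = "alien robot monster"

# Bag-of-words vectors: keyword -> occurrence count, positions ignored.
vec_a, vec_b = Counter(doc_a.split()), Counter(doc_b.split())

dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
norm = math.sqrt(sum(c * c for c in vec_a.values())) \
     * math.sqrt(sum(c * c for c in vec_b.values()))
similarity = dot / norm

print(round(similarity, 2))  # 0.71
```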

To get a feel for how the values in the similarity matrix are distributed, let us visualize their distribution in a plot.

Most movies - 73% - aren't similar with regard to keywords, meaning that keywords are perhaps too concrete for a similarity measure. But, having seen those keywords anywhere, you will be able to identify the movie right away.

Assortativity Attribute Coefficient
Numeric Average similarity 0.50

Actors tend to play in movies described by similar keywords (though only moderately).

There is no single attribute correlated with similarity. The willingness to act in movies of similar content depends on a decision of the actor rather than on the movie. This decision does not depend on year, gross, budget, or other measures: it is an outcome of the actor's complex psychology.

Keywords vs. plots

We expect keywords and plots to be somewhat correlated, as plots describe the same contents in a different, more human-readable way. Below we use a scatterplot to plot similarity on keywords against similarity on plot keywords.

Plots and keywords are weakly correlated.

User reviews

We will apply sentiment analysis to identify and extract subjective information in user reviews.

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.

Sentiments of user reviews follow a normal distribution.

Sentiment analysis produces many errors; e.g., the user review classified as the second most positive is in fact negative. This is because the LabMT wordlist contains only individually labeled words, not compounds, so the algorithm cannot detect negation or sarcasm.
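The wordlist approach can be sketched as follows; the scores below are invented for illustration and are not the actual LabMT values:

```python
# Sketch of LabMT-style sentiment scoring: average the happiness score of
# every known word in a review. The scores below are invented for
# illustration; they are NOT the actual LabMT values.
scores = {"great": 7.5, "terrible": 2.0, "movie": 5.5, "not": 4.0}

def sentiment(text):
    words = [scores[w] for w in text.lower().split() if w in scores]
    return sum(words) / len(words) if words else None

print(sentiment("A great movie"))         # (7.5 + 5.5) / 2 = 6.5
print(sentiment("Not a terrible movie"))  # negation is invisible to the wordlist
```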

We also cannot discover any correlation between sentiment and release year, runtime, budget, or gross. Good and bad films are made every year, regardless of runtime, budget or gross. Reviews are always a matter of taste.

Terms

Below the most common terms across user reviews are displayed.

Most common terms in user reviews

The most common keywords are movie, film, good, and time. The 3 most frequent bigrams are (ever, seen), (special, effects), and (even, though); the most common trigrams are (movie, ever, seen), (worst, movie, ever), and (movies, ever, seen), all of which contain the bigram (ever, seen). Expressions like these indicate that the movie made a big impression on the viewer.
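Such n-gram counts can be produced in a few lines with collections.Counter; the tokens below are a toy example:

```python
# Sketch: counting bigrams and trigrams with collections.Counter,
# as used for the "(worst, movie, ever)"-style statistics. Toy tokens.
from collections import Counter

tokens = "worst movie ever seen worst movie ever".split()

def ngrams(tokens, n):
    # zip over n shifted copies of the token list yields all n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams.most_common(1))   # [(('worst', 'movie'), 2)]
print(trigrams.most_common(1))  # [(('worst', 'movie', 'ever'), 2)]
```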

Adjectives

Below are shown the most common adjectives in the top 250 most negative and most positive user reviews.

Negative and positive keywords

Not surprising: the adjectives bad and good top the lists. These are the expressions users love most.

In general, the sentiments are not precise. Negations may cause problems when classifying data. Extending the wordlist with compounds may be a solution, though a very difficult one to implement.


Further information

A lot of information was skipped to shrink the size of this webpage; you are welcome to continue reading in the Explainer Notebook.

You may also be interested in trying things out yourself: clone the repository, download and import the MongoDB data, and enjoy.