Moviegoer — Prototype in Action: Turning Movies into Emotional Data | by Tim Lee | Nov, 2020
Film is the key to teaching emotion to AI. As I argued in a previous post, machines will learn emotions by “watching” movies. By turning cinema into structured data, we can unlock a trove of emotional data which can then be used to train AI models. Moviegoer is a tool that can do this automatically, turning movies into self-labeling data. The prototype is complete, and can automatically identify scenes, discover key dialogue, and track characters and their emotions throughout a film.
The Moviegoer prototype is freely available, so anyone can replicate the results. The examples below were taken straight from the prototype, with no pre-processing or cleaning of the input movie data required.
The prototype is the first step toward unlocking emotional data from movies, and you can read my previous post for a better idea of the data we’re looking to find. Here are just a few of Moviegoer’s capabilities:
Without any structure, a film is just a collection of a few thousand frames of images and a very long audio track. Conversations bleed into one another, characters appear and disappear without reason, and we teleport from one location to the next. We can begin to organize a film by dividing it into individual scenes.
We have an algorithm to identify scenes, partitioning them by identifying their first and last frames. Currently, we’re focused on a specific type of scene: the two-character dialogue scene. These scenes are the basic building blocks of cinema: two characters speaking to each other, with no distractions, purely advancing the plot.
Let’s take a look at Lost in Translation (2003), a famously quiet film, light on dialogue. This is the first scene we’ve identified, which is also the first time that our characters Bob and Charlotte have a conversation. It doesn’t occur until 31 minutes into the film — again, the film is sparse dialogue-wise.
In modern filmmaking, two-character dialogue scenes follow a very distinct pattern. Character A speaks, then Character B, then back to A, then to B, etc. The film cuts back and forth between these two shots, the Anchor shots. There’s a little more magic to the algorithm, including the identification of Cutaway shots, such as the two-shot where they’re both sitting at the bar. We’ve now discovered a handful of scenes, which we’ll be using for Plot and Character analyses.
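The A/B/A/B pattern described above can be sketched in a few lines. This is a simplified illustration and not the prototype's actual algorithm: it assumes each shot has already been assigned a cluster label (the `shot_labels` input is hypothetical), it ignores Cutaway shots entirely, and the `min_shots` threshold is an arbitrary choice.

```python
from typing import List, Tuple


def find_dialogue_scenes(shot_labels: List[str], min_shots: int = 4) -> List[Tuple[int, int]]:
    """Return (first, last) shot indices of runs that alternate between two labels.

    shot_labels is a hypothetical input: one label per shot, where shots that
    look alike (e.g. the same character in the same framing) share a label.
    """
    scenes = []
    i = 0
    n = len(shot_labels)
    while i < n - 1:
        if shot_labels[i] == shot_labels[i + 1]:
            i += 1
            continue
        # Two different shots: extend the run while it keeps alternating A, B, A, B...
        j = i + 2
        while j < n and shot_labels[j] == shot_labels[j - 2]:
            j += 1
        if j - i >= min_shots:
            scenes.append((i, j - 1))
            i = j
        else:
            i += 1
    return scenes


labels = ["X", "A", "B", "A", "B", "A", "C", "D", "E", "D", "E", "F"]
print(find_dialogue_scenes(labels))
```

In this toy input, the two alternating runs (A/B and D/E) stand in for pairs of Anchor shots. Handling Cutaways would need a tolerance for the occasional shot that breaks the pattern.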
To understand a film, a machine needs to comprehend the significance of the various happenings of a film. This can most effectively be accomplished by analyzing the dialogue and identifying key points of emotional expression.
With the boundaries of the scene identified above, we can isolate its dialogue and analyze it as a single, cohesive conversation. Below, we’ve identified some important pieces of dialogue and events, and mapped them back to their frames.
This is visualized to convey the scene’s scope, but it’s important to note that this image is mostly just for our interpretation — Moviegoer doesn’t actually need to “see” these frames.
We conduct an NLP analysis on the dialogue to try to understand what’s happening. In particular, we isolate every Directed Question: a question that addresses the other person as “you”. These types of questions usually elicit a personal response from the other character. Below is a mapping of every frame of interest:
- First and Last Frames
- Icebreaker and Kicker (First, Last Three Lines of Dialogue)
- Directed Questions and Responses
This scene had 6 Directed Questions. Since this scene was the first time the characters spoke to one another, they were getting to know each other by asking personal questions.
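As a rough illustration of the Directed Question idea, here is a minimal sketch using only regular expressions. It assumes a Directed Question is any sentence that ends in a question mark and contains the word “you”; the prototype uses a fuller NLP pipeline, so treat this as an approximation. The sample lines are invented for the example and are not taken from the film.

```python
import re
from typing import List

# Split after sentence-ending punctuation; a crude stand-in for a real sentence segmenter.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
YOU = re.compile(r"\byou\b", re.IGNORECASE)


def directed_questions(subtitle_lines: List[str]) -> List[str]:
    """Return the sentences that are questions addressed to "you"."""
    found = []
    for line in subtitle_lines:
        for sentence in SENTENCE_SPLIT.split(line.strip()):
            if sentence.endswith("?") and YOU.search(sentence):
                found.append(sentence)
    return found


# Invented sample dialogue, for illustration only.
sample = [
    "What do you do?",
    "I'm a writer. Where is the bar?",
    "So, are you staying long?",
]
print(directed_questions(sample))
```

Pairing each question with the line that follows it would give a first approximation of the Responses in the list above.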
Emotional Analysis, at the Scene Level
With individual scenes identified, we can analyze the emotional content within each scene. In this scene, Bob and Charlotte reconcile after a fight. It’s a quiet scene, even for a quiet film. It has a very slow conversation cadence of 8 sentences per minute (vs. the film’s baseline of 15 sentences per minute). The emotional impact comes not from the dialogue, but from the characters’ facial features as they look at each other in silence.
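Conversation cadence is simple arithmetic: the number of sentences divided by the scene's length in minutes, compared against the same figure for the whole film. The function names below are hypothetical, and the sketch assumes sentence counts and timestamps are already available from the subtitles.

```python
def sentences_per_minute(sentence_count: int, start_seconds: float, end_seconds: float) -> float:
    """Cadence of a span of film, in sentences per minute."""
    duration_minutes = (end_seconds - start_seconds) / 60
    if duration_minutes <= 0:
        raise ValueError("end_seconds must be greater than start_seconds")
    return sentence_count / duration_minutes


def relative_to_baseline(scene_rate: float, film_rate: float) -> float:
    """Ratio of a scene's cadence to the film's baseline cadence."""
    return scene_rate / film_rate


# Using the figures quoted above: 8 sentences per minute against a baseline of 15.
print(round(relative_to_baseline(8, 15), 2))
```

A ratio well below 1 flags a scene that is unusually quiet for its film, which is a more useful signal than the raw rate.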
We can calculate each character’s Primary Emotion by measuring their facial emotion in every frame of the scene in which they appear, and then picking the most common emotion. Charlotte, sad about their impending separation, has a Sad face in almost 40% of her frames. Bob, played by the notoriously deadpan Bill Murray, has a Neutral look for the majority of the scene.
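Picking the most common emotion is a counting exercise. The sketch below assumes a facial-emotion model has already produced one label per frame for a character, with `None` where no face was found; the labels shown are toy values, not output from the prototype.

```python
from collections import Counter
from typing import List, Optional, Tuple


def primary_emotion(frame_emotions: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most common emotion and the share of labeled frames it covers."""
    counts = Counter(e for e in frame_emotions if e is not None)
    if not counts:
        return None, 0.0
    emotion, count = counts.most_common(1)[0]
    return emotion, count / sum(counts.values())


# Toy labels for one character in one scene.
labels = ["sad", "neutral", "sad", "happy", "sad", None]
print(primary_emotion(labels))
```

Returning the share alongside the label keeps the distinction between a character who is Sad in 40% of frames and one who is Sad in 90% of them.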
Next, we take a look at a scene from Plus One (2019), a romantic comedy. Two-character dialogue scenes full of sharp dialogue are a staple of rom-coms. We’ve identified 18 of these — let’s take a look at Scene 17.
It has twice as much profanity as the film’s average, indicating that it might be a dramatic scene. Profanity is one possible measure of drama, and we can compare indicators like it against the film’s baseline to find the most dramatic scenes. We could also, of course, find scenes with the most Sad faces.
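Comparing a scene against the film's baseline might look like the following sketch. The word list is a deliberately tiny placeholder, and measuring profanity per sentence is an assumption made here; the prototype may normalize differently.

```python
import re
from typing import List, Optional

# Placeholder vocabulary; a real list would be much longer.
PROFANITY = {"damn", "hell"}


def profanity_rate(sentences: List[str]) -> float:
    """Average number of profane words per sentence."""
    if not sentences:
        return 0.0
    hits = sum(
        1
        for sentence in sentences
        for word in re.findall(r"[a-z']+", sentence.lower())
        if word in PROFANITY
    )
    return hits / len(sentences)


def ratio_to_baseline(scene_sentences: List[str], film_sentences: List[str]) -> Optional[float]:
    """How many times more profane the scene is than the film overall."""
    baseline = profanity_rate(film_sentences)
    if baseline == 0:
        return None  # no profanity in the film, so the ratio is undefined
    return profanity_rate(scene_sentences) / baseline


# Invented sample sentences, for illustration only.
film = ["Damn it.", "Hello there.", "What the hell?", "Nice day.", "Okay.", "Sure.", "Fine.", "Right."]
scene = ["Damn it.", "What the hell?", "Okay.", "Sure."]
print(ratio_to_baseline(scene, film))
```

The same ratio-to-baseline pattern applies to any other per-scene indicator, such as the share of Sad faces.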
Also of note are the First-Person Declarations we identified. These are sentences where a character declares something, with one’s self as the subject. (It’s easier understood by looking at the examples above.)
A film conveys its emotional responses through its characters. Since we’ll eventually want to determine what causes characters’ emotions to change, we need to track characters and their emotions throughout the film.
Finding Characters’ Scenes
Since we’ve previously identified scenes, as well as the facial identities of their participants, we can search for all scenes in which a character appears. We discovered 18 scenes in Plus One, and Alice was found as a participant in 13 of them.
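Once each scene carries a list of participants, looking up a character's scenes is a simple filter. The data structure below (a mapping from scene number to a set of character names) is an assumption for illustration, and the scene memberships are toy values rather than the prototype's actual output.

```python
from typing import Dict, List, Set


def scenes_with(character: str, scene_participants: Dict[int, Set[str]]) -> List[int]:
    """Return the scene numbers, in order, in which the character participates."""
    return [
        scene_id
        for scene_id, people in sorted(scene_participants.items())
        if character in people
    ]


# Toy data: which characters were recognized in which scene.
toy_scenes = {
    1: {"Alice", "Ben"},
    2: {"Ben", "Matt"},
    3: {"Alice", "Ben"},
}
print(scenes_with("Alice", toy_scenes))
```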
We can guess Ben’s demographic information with facial recognition models. We were able to guess that he is white, male and 32 years old. The actor playing Ben, Jack Quaid, is 28 years old, so this was a pretty good guess.
We also want to plot Ben’s emotions through the film. We count up the times he appears “Sad” and “Angry”, and group those into “Upset”. We can then plot these Upset emotions across the film. This plot roughly tracks with the traditional three-act structure — lots of drama at the film’s climax, before culminating in a happy ending.
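The grouping and counting step can be sketched as follows. It assumes we have a list of (timestamp, emotion) observations for the character and splits the film into equal-length segments; both the input format and the number of segments are assumptions made for this example.

```python
from typing import List, Tuple

UPSET = {"sad", "angry"}


def upset_counts_by_segment(
    observations: List[Tuple[float, str]],
    film_length_seconds: float,
    segments: int = 10,
) -> List[int]:
    """Count Upset (Sad or Angry) observations in each equal-length segment of the film."""
    counts = [0] * segments
    for timestamp, emotion in observations:
        # Clamp so an observation at the very last second lands in the final segment.
        index = min(int(timestamp / film_length_seconds * segments), segments - 1)
        if emotion in UPSET:
            counts[index] += 1
    return counts


# Toy observations: (seconds into the film, detected emotion).
toy = [(10, "happy"), (30, "sad"), (55, "angry"), (70, "neutral"), (95, "sad"), (100, "angry")]
print(upset_counts_by_segment(toy, film_length_seconds=100, segments=4))
```

The resulting list can be passed straight to a bar or line chart, for example with Matplotlib, to get a plot of Upset emotion over the course of the film.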
A film is more than just dialogue. There are many style features meant to influence the emotional impact of a particular scene. Below are three types of features for which we can look. Though they don’t quite have a definitive meaning, we can still infer information from each.
Every movie frame can be broken down into its RGB values, or additive color components. Each pixel in a frame has a red, green, and blue value, and they can be averaged into three values representing the entire image. These three values tend to be relatively balanced, but we can look for frames where they aren’t: images that skew toward one of the primary additive colors red, green, or blue; or images that lack one of the primary colors, skewing toward the secondary colors yellow, cyan, or magenta.
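Here is a minimal sketch of the averaging and skew check. It is written in plain Python so it runs anywhere; in practice NumPy would do the averaging far faster. The `threshold` of 1.5 is an arbitrary assumption, not a value from the prototype.

```python
from typing import List, Optional, Tuple

PRIMARIES = ("red", "green", "blue")
# A frame lacking one primary skews toward the mix of the other two.
MISSING_TO_SECONDARY = {"red": "cyan", "green": "magenta", "blue": "yellow"}


def mean_rgb(pixels: List[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Average the red, green, and blue values over all pixels in a frame."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def color_skew(rgb: Tuple[float, float, float], threshold: float = 1.5) -> Optional[str]:
    """Name the color a frame skews toward, or None if it is roughly balanced."""
    for i, name in enumerate(PRIMARIES):
        others = [rgb[j] for j in range(3) if j != i]
        if rgb[i] > threshold * max(others):
            return name  # one primary dominates both others
    for i, name in enumerate(PRIMARIES):
        others = [rgb[j] for j in range(3) if j != i]
        if rgb[i] * threshold < min(others):
            return MISSING_TO_SECONDARY[name]  # one primary is largely absent
    return None


print(color_skew(mean_rgb([(200, 40, 40), (220, 60, 20)])))
print(color_skew((20, 180, 200)))
print(color_skew((120, 125, 118)))
```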
These color images may be the result of creative lighting, or just the context of the scene (e.g. underwater or containing fire). The three most prominent examples from the high school comedy Booksmart (2019), are all from dialogue-free set pieces: a dream sequence dance with a crush, a karaoke party, and an underwater chase.
Non-Conforming Aspect Ratios
Certain shots of a film might be displayed in an aspect ratio different than the rest of the film. For example, a more widescreen aspect ratio might be used to show a “film within a film”, or a more square ratio for an “old-timey” flashback. In Booksmart, all the frames with non-conforming aspect ratios are used to display footage seen on characters’ phones.
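One way to detect a non-conforming aspect ratio is to measure the region of the frame that is not black bars. The sketch below assumes the bars are near-black and that the frame is available as rows of grayscale brightness values; this approach, the `black_threshold`, and the `tolerance` are all assumptions for illustration rather than the prototype's confirmed method.

```python
from typing import List, Optional


def content_aspect_ratio(gray_frame: List[List[int]], black_threshold: int = 16) -> Optional[float]:
    """Width/height of the area inside any black letterbox or pillarbox bars."""
    rows = [y for y, row in enumerate(gray_frame) if max(row) > black_threshold]
    width = len(gray_frame[0])
    cols = [x for x in range(width) if max(row[x] for row in gray_frame) > black_threshold]
    if not rows or not cols:
        return None  # the frame is entirely black
    content_height = rows[-1] - rows[0] + 1
    content_width = cols[-1] - cols[0] + 1
    return content_width / content_height


def is_non_conforming(ratio: float, film_ratio: float, tolerance: float = 0.05) -> bool:
    """True if the frame's ratio differs from the film's usual ratio by more than the tolerance."""
    return abs(ratio - film_ratio) / film_ratio > tolerance


# Toy 16x8 frame with black bars above and below a bright 16x4 band.
letterboxed = [[0] * 16 if y in (0, 1, 6, 7) else [200] * 16 for y in range(8)]
print(content_aspect_ratio(letterboxed))
```

Dark scenes can fool a brightness threshold, so a practical version would likely confirm the ratio over several consecutive frames.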
Long takes are shots that are held for an extended period of time without a cut: think of the action sequences from Children of Men (2006). A long take builds tension and suspense, and long takes aren’t just for action scenes: they’re effective for dialogue as well.
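Given the timestamps at which each shot begins, finding long takes comes down to measuring the gaps between cuts. The sketch assumes the cut times have already been detected, and the 30-second cutoff is an arbitrary assumption; what counts as “long” could also be set relative to the film's average shot length.

```python
from typing import List, Tuple


def shot_durations(shot_starts: List[float], film_end: float) -> List[Tuple[float, float]]:
    """Return (start, duration) in seconds for each shot."""
    boundaries = sorted(shot_starts) + [film_end]
    return [(start, end - start) for start, end in zip(boundaries, boundaries[1:])]


def long_takes(shot_starts: List[float], film_end: float, min_seconds: float = 30.0) -> List[Tuple[float, float]]:
    """Return the shots held for at least min_seconds."""
    return [
        (start, duration)
        for start, duration in shot_durations(shot_starts, film_end)
        if duration >= min_seconds
    ]


# Toy cut list: each value is the second at which a new shot begins.
print(long_takes([0, 4, 9, 55, 58], film_end=100))
```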
Ford v Ferrari (2019), a motorsport drama, is filled with racing scenes that use short, fast shots to emphasize speed and white-knuckle action. But it also uses long takes effectively. Here are three examples: a monologue about the challenges of endurance racing, a driver’s conversation with his son about the mythical perfect lap, and a pre-credits ride (drive) into the sunset. Long takes were used to emphasize the importance of the monologue and conversation; we infer that the dialogue content is of particular importance to the characters.
This was a small example, but we’ve demonstrated we can programmatically collect multiple streams of structured data. For example, if we take a given line of dialogue, we may choose to explore it in the following dimensions:
- Conversation context (sentences before/after)
- Facial expression (of both speaker and non-speaker)
- Voice tone
All of these will have some correlation with the line of dialogue. Since movies mimic human interactions, there is some “truth” in how these data streams interact with one another. I’ll explore this concept of “self-labeling data” in a future post.
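To make the idea of multiple data streams concrete, here is one hypothetical way to bundle the dimensions listed above into a single record per line of dialogue. The class and field names are invented for this sketch and are not taken from the Moviegoer repo.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueRecord:
    """One line of dialogue plus the data streams that surround it."""
    text: str
    context_before: List[str] = field(default_factory=list)
    context_after: List[str] = field(default_factory=list)
    speaker_expression: Optional[str] = None
    listener_expression: Optional[str] = None
    voice_tone: Optional[str] = None


def build_record(sentences: List[str], index: int, window: int = 2) -> DialogueRecord:
    """Attach up to `window` sentences of context on each side of the chosen line."""
    return DialogueRecord(
        text=sentences[index],
        context_before=sentences[max(0, index - window):index],
        context_after=sentences[index + 1:index + 1 + window],
    )


record = build_record(["a", "b", "c", "d", "e"], index=2, window=1)
print(record.context_before, record.text, record.context_after)
```

The expression and tone fields are left empty here; they would be filled in by the facial and audio analyses for the frames the line spans.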
The prototype is complete, but we’re just getting started. I invite you to look at the repo and poke around the code, replicate the results, or even make your own contributions.
Moviegoer wasn’t built from scratch. It was built upon many Python libraries and community products, and of course, a love of movies.
This project is for non-commercial, research purposes.
As of the prototype’s creation, these tools and libraries were in use:
- Python 3.7 (PyCharm 2020.2)
- Jupyter Notebook 6.1 (Anaconda 2020.02)
- TensorFlow 2.3 (Docker Community 19.03)
- Pandas 1.1
- NumPy 1.19
- Matplotlib 3.2
- Scikit-Learn 0.22
- face_recognition 1.3
- deepface 0.0.26
- OpenCV 4.2
- PyTesseract 0.3
- SpaCy 2.3
- pysrt 1.1
- pyAudioAnalysis 0.3
- Librosa 0.7
Movies pictured in screenshots for visualization purposes, under fair use:
- Lost in Translation. Directed by Sofia Coppola, performances by Scarlett Johansson and Bill Murray, Focus Features, 2003.
- Plus One. Directed by Jeff Chan and Andrew Rhymer, performances by Maya Erskine and Jack Quaid, Red Hour Productions, 2019.
- Booksmart. Directed by Olivia Wilde, performances by Kaitlyn Dever and Beanie Feldstein, Annapurna Pictures, 2019.
- Ford v Ferrari. Directed by James Mangold, performances by Christian Bale and Matt Damon, Chernin Entertainment, 2019.