It’s possible to find subtitles for almost every movie or TV series, and there are stock images of almost anything imaginable. Wouldn’t it be fun to connect these two things and make a sort of filmstrip with a stock image for every caption from the subtitles?

TLDR: the result is silly:

To have something to play with, I chose the subtitles for Bob’s Burgers – The Deepening. First, we need to parse them with pycaption:

from pycaption import SRTReader

lang = 'en-US'
path = ''

def read_subtitles(path, lang):
    with open(path) as f:
        data = f.read()
        return SRTReader().read(data, lang=lang)
subtitles = read_subtitles(path, lang)
captions = subtitles.get_captions(lang)
>>> captions
['00:00:04.745 --> 00:00:06.746\nShh.', '00:00:10.166 --> 00:00:20.484\n...
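To make it clearer what the parser has to deal with, here is a minimal stdlib-only sketch of reading SRT blocks. It is not how pycaption works internally: `parse_srt` and the sample text are made up for illustration, and the sketch assumes well-formed input with blocks separated by blank lines.

```python
import re

# Hypothetical helper, not part of pycaption: splits SRT text into
# (start, end, text) tuples. Assumes well-formed, blank-line-separated blocks.
TIMING = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})')

def parse_srt(data):
    cues = []
    for block in re.split(r'\n\s*\n', data.strip()):
        lines = block.splitlines()
        # Find the timing line; everything after it is the caption text.
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                text = '\n'.join(lines[i + 1:])
                cues.append((match.group(1), match.group(2), text))
                break
    return cues

# Made-up sample in SRT format (index line, timing line, text).
sample = """1
00:00:04,745 --> 00:00:06,746
Shh.

2
00:00:10,166 --> 00:00:20,484
<i>That shark is ruining</i>
"""

cues = parse_srt(sample)
```

Note that the caption text can still contain markup such as `<i>`, so it needs cleaning before any word counting.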

Since a lot of subtitles contain HTML, it’s important to remove the tags before further processing. That’s very easy to do with lxml:

import lxml.html

def to_text(raw_text):
    return lxml.html.document_fromstring(raw_text).text_content()
>>> to_text('<i>That shark is ruining</i>')
'That shark is ruining'
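If installing lxml is not an option, roughly the same thing can be sketched with the standard library. `to_text_stdlib` is a hypothetical stand-in, not something the original code uses, and it has only been thought through for simple inline tags like the ones found in subtitles.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def to_text_stdlib(raw_text):
    # Hypothetical stand-in for to_text() that needs no third-party packages.
    parser = _TextExtractor()
    parser.feed(raw_text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

stripped = to_text_stdlib('<i>That shark is ruining</i>')
empty = to_text_stdlib('')
```

This version returns an empty string for an empty caption, which is convenient for the emptiness check done later when building slides.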

To find the most significant words in the text, we need to tokenize it, lemmatize it (replace the different forms of a word with a common form) and remove stop words. It’s easy to do with NLTK:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def tokenize_lemmatize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token.lower())
                  for token in tokens if token.isalpha()]
    stop_words = set(stopwords.words("english"))
    return [lemma for lemma in lemmatized if lemma not in stop_words]
>>> tokenize_lemmatize('That shark is ruining')
['shark', 'ruining']
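To show what each of the three steps contributes, here is a toy version that needs no NLTK data. The stop word set and the lemma table are tiny hand-written assumptions, nowhere near what NLTK ships, so treat it as an illustration of the pipeline rather than a replacement.

```python
import re

# Tiny illustrative tables, NOT the NLTK data: real stop word lists and
# WordNet cover far more words.
TOY_STOP_WORDS = {'that', 'is', 'the', 'a', 'with', 'his', 'it', 'he'}
TOY_LEMMAS = {'sharks': 'shark', 'teeth': 'tooth'}

def toy_tokenize_lemmatize(text):
    # 1. Tokenize: keep runs of letters only, lowercased.
    tokens = re.findall(r'[a-z]+', text.lower())
    # 2. Lemmatize: map known inflected forms to a common form.
    lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]
    # 3. Drop stop words.
    return [lemma for lemma in lemmas if lemma not in TOY_STOP_WORDS]

result = toy_tokenize_lemmatize('He really chewed it... with his shark teeth.')
```

Here `result` ends up as `['really', 'chewed', 'shark', 'tooth']`: punctuation is gone, "teeth" is folded into "tooth", and the filler words are dropped.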

After that we can combine the previous two functions and find the most frequently used words:

from collections import Counter

def get_most_popular(captions):
    full_text = '\n'.join(to_text(caption.get_text()) for caption in captions)
    tokens = tokenize_lemmatize(full_text)
    return Counter(tokens)
most_popular = get_most_popular(captions)
>>> most_popular
Counter({'shark': 68, 'oh': 32, 'bob': 29, 'yeah': 25, 'right': 20,...

It’s not the best way to find the most important words, but it kind of works.

After that it’s straightforward to extract keywords from a single caption:

def get_keywords(most_popular, text, n=2):
    tokens = sorted(tokenize_lemmatize(text), key=lambda x: -most_popular[x])
    return tokens[:n]
>>> captions[127].get_text()
'Teddy, what is wrong with you?'
>>> get_keywords(most_popular, to_text(captions[127].get_text()))
['teddy', 'wrong']
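Two details make this ranking work without extra checks, and they are easier to see on a toy example. The counts below are made up, and `rank_tokens` is a hypothetical variant of `get_keywords` that takes already-processed tokens so it can run without NLTK.

```python
from collections import Counter

# Made-up counts for illustration; the real ones come from get_most_popular().
toy_counts = Counter({'shark': 68, 'bob': 29, 'teddy': 12, 'wrong': 5})

def rank_tokens(counts, tokens, n=2):
    # Same idea as get_keywords(), but taking already-processed tokens.
    return sorted(tokens, key=lambda token: -counts[token])[:n]

# A Counter returns 0 for missing keys, so unseen words sort last
# instead of raising KeyError.
ranked = rank_tokens(toy_counts, ['wrong', 'banana', 'teddy', 'shark'])

# sorted() is stable, so tokens with equal counts keep their original order.
tied = rank_tokens(toy_counts, ['pear', 'apple'])
```

So `ranked` is `['shark', 'teddy']`, and `tied` stays `['pear', 'apple']` because both words have a count of zero.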

The next step is to find a stock image for those keywords. There aren’t that many properly working and documented stock image services, so I chose the Shutterstock API. It’s limited to 250 requests per hour, but that’s enough to play with.

From their API we only need /images/search. We will search for the most popular photo:

import requests

# Key and secret of your app
stock_key = ''
stock_secret = ''

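def get_stock_image_url(query):
    response = requests.get(
        # Shutterstock API v2 search endpoint (assumed base URL)
        'https://api.shutterstock.com/v2/images/search',
        params={
            'query': query,
            'sort': 'popular',
            'view': 'minimal',
            'safe': 'false',
            'per_page': '1',
            'image_type': 'photo',
        },
        auth=(stock_key, stock_secret),
    )
    data = response.json()
    try:
        # Take the preview URL of the first (most popular) result
        return data['data'][0]['assets']['preview']['url']
    except (IndexError, KeyError):
        # No results, or an error response without the expected fields
        return None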
>>> get_stock_image_url('teddy wrong')

The image looks relevant:

teddy wrong
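With only 250 requests per hour, it can help not to ask twice for the same keywords, since many captions produce identical queries. Below is a minimal sketch using `functools.lru_cache`. `cached_lookup` and `fake_fetch` are hypothetical names, and the fake fetcher only stands in for `get_stock_image_url` so the example runs without network access.

```python
from functools import lru_cache

def cached_lookup(fetch, maxsize=1024):
    # Wraps any query -> URL function so that each distinct query
    # triggers at most one real call while it stays in the cache.
    return lru_cache(maxsize=maxsize)(fetch)

# Stand-in for get_stock_image_url() so the sketch runs offline.
calls = []

def fake_fetch(query):
    calls.append(query)
    return f'https://example.com/{query.replace(" ", "-")}.jpg'

lookup = cached_lookup(fake_fetch)
first = lookup('teddy wrong')
second = lookup('teddy wrong')  # served from the cache, no new call
third = lookup('shark teeth')
```

With the real function this would be something like `get_stock_image_url = cached_lookup(get_stock_image_url)`. Keep in mind that `None` results are cached too, so a query that failed once will not be retried.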

Now we can create a proper card from a caption:

def make_slide(most_popular, caption):
    text = to_text(caption.get_text())
    if not text:
        return None

    keywords = get_keywords(most_popular, text)
    query = ' '.join(keywords)
    if not query:
        return None

    stock_image = get_stock_image_url(query)
    if not stock_image:
        return None

    return text, stock_image
>>> make_slide(most_popular, captions[132])
('He really chewed it...\nwith his shark teeth.', '')

The image is kind of relevant:

He really chewed it...with his shark teeth.

After that we can select the captions we want to put in our filmstrip and generate HTML like the one in the TLDR section:

output_path = 'burgers.html'
start_slide = 98
end_slide = 200

def make_html_output(slides):
    html = '<html><head><link rel="stylesheet" href="./style.css"></head><body>'
    for (text, stock_image) in slides:
        # The caption markup is assumed to be a plain <span>
        html += f'''<div class="box">
            <img src="{stock_image}" />
            <span>{text}</span>
        </div>'''
    html += '</body></html>'
    return html

interesting_slides = [make_slide(most_popular, caption)
                      for caption in captions[start_slide:end_slide]]
interesting_slides = [slide for slide in interesting_slides if slide]

with open(output_path, 'w') as f:
    output = make_html_output(interesting_slides)
    f.write(output)

And the result - burgers.html.
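One thing worth considering when generating the HTML: caption text can contain characters like `&` or `<`, and image URLs often contain `&`, so escaping them keeps the markup valid. Here is a small sketch with `html.escape`. `render_slide` is a hypothetical helper, and the `<span>` wrapper around the caption is an assumption about the markup.

```python
from html import escape

def render_slide(text, stock_image):
    # Hypothetical helper; the <span> wrapper for the caption is an assumption.
    # escape() turns &, <, > and quotes into entities.
    safe_text = escape(text).replace('\n', '<br>')
    safe_url = escape(stock_image, quote=True)
    return (f'<div class="box">'
            f'<img src="{safe_url}" />'
            f'<span>{safe_text}</span>'
            f'</div>')

slide = render_slide('Tom & Jerry <3\nsecond line',
                     'https://example.com/a.jpg?x=1&y=2')
```

Line breaks inside a caption are turned into `<br>` here so that two-line captions stay on two lines in the browser.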

Another example, even worse and a bit NSFW: It’s Always Sunny in Philadelphia – Charlie Catches a Leprechaun.

Gist with the sources.

