Analysing the trip to South America with a bit of image recognition

Back in September, I had a three-week trip to South America. While planning the trip I used a sort of data mining to select the most optimal flights, and it worked well. To continue following the data-driven approach (more buzzwords), I’ve decided to analyze the data I’ve collected during the trip.

Unfortunately, as I was traveling without a local SIM card and almost without internet, I couldn’t use Google Location History as in the fun research about the commute. But at least I have tweets and a lot of photos.

At first, I’ve reused old code (more internal linking) and extracted information about flights from tweets:

all_tweets = pd.DataFrame(
    [(tweet.text, tweet.created_at) for tweet in get_tweets()],  # get_tweets available in the gist
    columns=['text', 'created_at'])

tweets_in_dates = all_tweets[
    (all_tweets.created_at > datetime(2018, 9, 8)) & (all_tweets.created_at < datetime(2018, 9, 30))]

flights_tweets = tweets_in_dates[tweets_in_dates.text.str.upper() == tweets_in_dates.text]

flights = flights_tweets.assign(start=lambda df: df.text.str.split('✈').str[0],
                                finish=lambda df: df.text.str.split('✈').str[-1]) \
                        .sort_values('created_at')[['start', 'finish', 'created_at']]
>>> flights
   start finish          created_at
19  AMS    LIS 2018-09-08 05:00:32
18  LIS    GIG 2018-09-08 11:34:14
17  SDU    EZE 2018-09-12 23:29:52
16  EZE    SCL 2018-09-16 17:30:01
15  SCL    LIM 2018-09-19 16:54:13
14  LIM    MEX 2018-09-22 20:43:42
13  MEX    CUN 2018-09-25 19:29:04
11  CUN    MAN 2018-09-29 20:16:11

Then I’ve found a JSON dump with airports, made a little hack replacing Ezeiza with Buenos-Aires, and derived cities with lengths of stay from the flights:

flights = flights.assign(
    start=flights.start.apply(lambda code: iata_to_city[re.sub(r'\W+', '', code)]),  # Removes leftovers of emojis, iata_to_city available in the gist
    finish=flights.finish.apply(lambda code: iata_to_city[re.sub(r'\W+', '', code)]))
cities = flights.assign(
    city=flights.start,
    spent=flights.created_at - flights.created_at.shift(1),
    arrived=flights.created_at.shift(1),
)[['city', 'spent', 'arrived']]
cities = cities.assign(left=cities.arrived + cities.spent)[cities.spent.dt.days > 0]
>>> cities
              city           spent             arrived                left
17  Rio De Janeiro 4 days 11:55:38 2018-09-08 11:34:14 2018-09-12 23:29:52
16  Buenos-Aires   3 days 18:00:09 2018-09-12 23:29:52 2018-09-16 17:30:01
15  Santiago       2 days 23:24:12 2018-09-16 17:30:01 2018-09-19 16:54:13
14  Lima           3 days 03:49:29 2018-09-19 16:54:13 2018-09-22 20:43:42
13  Mexico City    2 days 22:45:22 2018-09-22 20:43:42 2018-09-25 19:29:04
11  Cancun         4 days 00:47:07 2018-09-25 19:29:04 2018-09-29 20:16:11
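The iata_to_city mapping used above is available in the gist; a minimal sketch of how such a mapping can be built from an airports JSON dump (the "iata" and "city" field names are assumptions about the dump's format) might look like:

```python
# Hypothetical shape of the airports dump entries
airports = [
    {"iata": "GIG", "city": "Rio De Janeiro"},
    {"iata": "EZE", "city": "Ezeiza"},
    {"iata": "SCL", "city": "Santiago"},
]

iata_to_city = {airport["iata"]: airport["city"]
                for airport in airports
                if airport.get("iata")}
# the little hack: Ezeiza is just the airport of Buenos-Aires
iata_to_city["EZE"] = "Buenos-Aires"
```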

>>> cities.plot(x="city", y="spent", kind="bar",
                legend=False, title='Cities') \
          .yaxis.set_major_formatter(formatter)  # Ugly hack for timedelta formatting, more in the gist
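The formatter itself is in the gist; my guess at such a hack is that it converts the nanoseconds pandas uses to represent timedeltas back into days (the exact label format is an assumption), with the function wrapped in matplotlib’s FuncFormatter before being passed to set_major_formatter:

```python
# pandas stores timedeltas as nanoseconds, so the y ticks arrive
# as huge integers; format them back into days for the axis labels
NS_PER_DAY = 10 ** 9 * 60 * 60 * 24

def format_timedelta_tick(value, position=None):
    return f'{value / NS_PER_DAY:.1f} days'

print(format_timedelta_tick(NS_PER_DAY * 4.5))  # '4.5 days'
```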


Now it’s time to work with photos. I’ve downloaded all photos from Google Photos, parsed creation dates from Exif, and “joined” them with cities by creation date:

raw_photos = pd.DataFrame(list(read_photos()), columns=['name', 'created_at'])  # read_photos available in the gist

photos_cities = raw_photos.assign(key=0).merge(cities.assign(key=0), how='outer')
photos = photos_cities[
    (photos_cities.created_at >= photos_cities.arrived)
    & (photos_cities.created_at <= photos_cities.left)]
>>> photos.head()
                          name          created_at  key            city           spent             arrived                left
1   photos/20180913_183207.jpg 2018-09-13 18:32:07  0    Buenos-Aires   3 days 18:00:09 2018-09-12 23:29:52 2018-09-16 17:30:01
6   photos/20180909_141137.jpg 2018-09-09 14:11:36  0    Rio De Janeiro 4 days 11:55:38 2018-09-08 11:34:14 2018-09-12 23:29:52
14  photos/20180917_162240.jpg 2018-09-17 16:22:40  0    Santiago       2 days 23:24:12 2018-09-16 17:30:01 2018-09-19 16:54:13
22  photos/20180923_161707.jpg 2018-09-23 16:17:07  0    Mexico City    2 days 22:45:22 2018-09-22 20:43:42 2018-09-25 19:29:04
26  photos/20180917_111251.jpg 2018-09-17 11:12:51  0    Santiago       2 days 23:24:12 2018-09-16 17:30:01 2018-09-19 16:54:13
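The read_photos helper is available in the gist; its core job is parsing Exif’s DateTimeOriginal strings, which can be sketched with just the stdlib (extracting the raw Exif tag itself would be done with e.g. Pillow):

```python
from datetime import datetime

# Exif stores dates as colon-separated strings like '2018:09:13 18:32:07'
def parse_exif_datetime(value):
    return datetime.strptime(value, '%Y:%m:%d %H:%M:%S')

print(parse_exif_datetime('2018:09:13 18:32:07'))
```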

After that I’ve counted the number of photos by city:

photos_by_city = photos \
    .groupby(by='city') \
    .agg({'name': 'count'}) \
    .rename(columns={'name': 'photos'}) \
    .reset_index()
>>> photos_by_city
             city  photos
0  Buenos-Aires    193
1  Cancun          292
2  Lima            295
3  Mexico City     256
4  Rio De Janeiro  422
5  Santiago        267
>>> photos_by_city.plot(x='city', y='photos', kind="bar",
                        title='Photos by city', legend=False)


Let’s go a bit deeper and use image recognition. To not reinvent the wheel, I’ve used a slightly modified version of the TensorFlow ImageNet tutorial example and found what’s on each photo:

tags = \
    .apply(lambda name: classify_image.run_inference_on_image(name, 1)[0])  # classify_image from the modified example

tagged_photos = photos.copy()
tagged_photos[['tag', 'score']] = tags.apply(pd.Series)
tagged_photos['tag'] = tagged_photos.tag.apply(lambda tag: tag.split(', ')[0])
>>> tagged_photos.head()
                          name          created_at  key            city           spent             arrived                left       tag     score
1   photos/20180913_183207.jpg 2018-09-13 18:32:07  0    Buenos-Aires   3 days 18:00:09 2018-09-12 23:29:52 2018-09-16 17:30:01  cinema    0.164415
6   photos/20180909_141137.jpg 2018-09-09 14:11:36  0    Rio De Janeiro 4 days 11:55:38 2018-09-08 11:34:14 2018-09-12 23:29:52  pedestal  0.667128
14  photos/20180917_162240.jpg 2018-09-17 16:22:40  0    Santiago       2 days 23:24:12 2018-09-16 17:30:01 2018-09-19 16:54:13  cinema    0.225404
22  photos/20180923_161707.jpg 2018-09-23 16:17:07  0    Mexico City    2 days 22:45:22 2018-09-22 20:43:42 2018-09-25 19:29:04  obelisk   0.775244
26  photos/20180917_111251.jpg 2018-09-17 11:12:51  0    Santiago       2 days 23:24:12 2018-09-16 17:30:01 2018-09-19 16:54:13  seashore  0.24720

So now it’s possible to find things that I’ve taken photos of the most:

photos_by_tag = tagged_photos \
    .groupby(by='tag') \
    .agg({'name': 'count'}) \
    .rename(columns={'name': 'photos'}) \
    .reset_index() \
    .sort_values('photos', ascending=False) \
    .head(10)
>>> photos_by_tag
            tag  photos
107  seashore    276   
76   monastery   142   
64   lakeside    116   
86   palace      115   
3    alp         86    
81   obelisk     72    
101  promontory  50    
105  sandbar     49    
17   bell cote   43    
39   cliff       42
>>> photos_by_tag.plot(x='tag', y='photos', kind='bar',
                       legend=False, title='Popular tags')

Popular tags

Then I was able to find what I was taking photos of by city:

popular_tags = photos_by_tag.head(5).tag
popular_tagged = tagged_photos[tagged_photos.tag.isin(popular_tags)]
not_popular_tagged = tagged_photos[~tagged_photos.tag.isin(popular_tags)].assign(tag='other')
by_tag_city = popular_tagged \
    .append(not_popular_tagged) \
    .groupby(by=['city', 'tag']) \
    .count()['name'] \
    .unstack(fill_value=0)
>>> by_tag_city
tag             alp  lakeside  monastery  other  palace  seashore
Buenos-Aires    5    1         24         123    30      10      
Cancun          0    19        6          153    4       110     
Lima            0    25        42         136    38      54      
Mexico City     7    9         26         197    5       12      
Rio De Janeiro  73   45        17         212    4       71      
Santiago        1    17        27         169    34      19     
>>> by_tag_city.plot(kind='bar', stacked=True)

Tags by city

Although the most common thing on this plot is “other”, it’s still fun.

Gist with full sources.

Holden Karau, Rachel Warren: High Performance Spark

book cover white Recently I’ve started to use Spark more and more, so I’ve decided to read something about it. High Performance Spark by Holden Karau and Rachel Warren looked like an interesting book, and I already had it from some HIB bundle. The book is quite short, but it covers a lot of topics. It has a lot of techniques to make Spark faster and avoid common bottlenecks, with explanations that sometimes even go down to Spark internals. Although I’m mostly using PySpark and almost everything in the book is in Scala, it was still useful as the API is mostly the same.

Some parts of High Performance Spark read like documentation for config keys and params, but most of the book is ok.

Video from subtitles or Bob's Burgers to The Simpsons with TensorFlow

Bob's Burgers to The Simpsons

Back in June I’ve played a bit with subtitles and tried to generate a filmstrip; it wasn’t that successful, but it was fun. So I decided to go deeper and generate a video from subtitles. The main idea is to get phrases from a part of some video, get the most similar phrases from another video, and generate something.

As the “enemy” I’ve decided to use a part from Bob’s Burgers Tina-rannosaurus Wrecks episode:

As the source, I’ve decided to use The Simpsons, as they have a lot of episodes and Simpsons Already Did It whatever. I somehow have 671 episodes and managed to get perfectly matching subtitles for 452 of them.

TLDR: It was fun, but the result is meh at best:

Initially, I was planning to use Friends and Seinfeld but the result was even worse.

As the first step I’ve parsed subtitles (boring, available in the gist) and created a mapping from phrases and “captions” (subtitles parts with timing and additional data) and a list of phrases from all available subtitles:

data_text2captions = defaultdict(lambda: [])
for season in root.glob('*'):
    if season.is_dir():
        for subtitles in season.glob('*.srt'):
            for caption in read_subtitles(subtitles.as_posix()):
                data_text2captions[caption.text].append(caption)

data_texts = [*data_text2captions]
>>> data_text2captions["That's just a dog in a spacesuit!"]
[Caption(path='The Simpsons S13E06 She of Little', start=127795000, length=2544000, text="That's just a dog in a spacesuit!")]
>>> data_texts[0]
'Give up, Mr. Simpson! We know you have the Olympic torch!'

After that I’ve found subtitles for the Bob’s Burgers episode, manually selected captions from the part of the episode that I used as the “enemy”, and processed them in a similar way:

play = [*read_subtitles('')][1:54]
play_text2captions = defaultdict(lambda: [])
for caption in play:
    play_text2captions[caption.text].append(caption)

play_texts = [*play_text2captions]
>>> play_text2captions['Who raised you?']
[Caption(path='', start=118605000, length=1202000, text='Who raised you?')]
>>> play_texts[0]
"Wow, still can't believe this sale."

Then I’ve generated vectors for all phrases with TensorFlow’s Universal Sentence Encoder and used cosine similarity to find the most similar phrases:

module_url = ""
embed = hub.Module(module_url)

vec_a = tf.placeholder(tf.float32, shape=None)
vec_b = tf.placeholder(tf.float32, shape=None)

normalized_a = tf.nn.l2_normalize(vec_a, axis=1)
normalized_b = tf.nn.l2_normalize(vec_b, axis=1)
sim_scores = -tf.acos(tf.reduce_sum(tf.multiply(normalized_a, normalized_b), axis=1))

def get_similarity_score(text_vec_a, text_vec_b):
    emba, embb, scores =
        [normalized_a, normalized_b, sim_scores],
        feed_dict={
            vec_a: text_vec_a,
            vec_b: text_vec_b,
        })
    return scores

def get_most_similar_text(vec_a, data_vectors):
    scores = get_similarity_score([vec_a] * len(data_texts), data_vectors)
    return data_texts[sorted(enumerate(scores), key=lambda score: -score[1])[3][0]]

with tf.Session() as session:[tf.global_variables_initializer(), tf.tables_initializer()])
    data_vecs, play_vecs =[embed(data_texts), embed(play_texts)])
    data_vecs = np.array(data_vecs).tolist()
    play_vecs = np.array(play_vecs).tolist()

    similar_texts = {play_text: get_most_similar_text(play_vecs[n], data_vecs)
                     for n, play_text in enumerate(play_texts)}
>>> similar_texts['Is that legal?']
"- [Gasping] - Uh, isn't that illegal?"
>>> similar_texts['(chuckling): Okay, okay.']
'[ Laughing Continues ] All right. Okay.'

Looks kind of relevant, right? Unfortunately only phrase by phrase.
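The sim_scores above are -acos(cosine similarity), i.e. negated angular distance, so identical phrases score 0 and unrelated ones go toward -π. A dependency-free sketch of the same scoring for two plain vectors:

```python
import math

def angular_score(a, b):
    # -acos(cosine similarity): 0.0 for identical directions, -pi for opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    # clamp to avoid math domain errors from float rounding
    return -math.acos(max(-1.0, min(1.0, dot / norm)))

print(angular_score([1.0, 2.0], [1.0, 2.0]))  # 0.0 or extremely close to it
print(angular_score([1.0, 0.0], [0.0, 1.0]))  # about -1.5708, a right angle
```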

After that, I’ve cut parts of The Simpsons episodes for the matching phrases. This part was a bit complicated, because without forcing re-encoding (with the same codec) and setting a framerate (close to that of most of the videos) it was producing unplayable videos:

def generate_parts():
    for n, caption in enumerate(play):
        similar = similar_texts[caption.text]
        similar_caption = sorted(
            key=lambda maybe_similar: abs(caption.length - maybe_similar.length))[0]

        yield Part(
            video=similar_caption.path.replace('.srt', '.mp4'),
            start=to_timestamp(similar_caption.start),  # to_timestamp available in the gist
            end=to_timestamp(similar_caption.length),
            output=f'part_{n}.mp4')

parts = [*generate_parts()]
for part in parts:
    call(['ffmpeg', '-y', '-i',,
          '-ss', part.start, '-t', part.end,
          '-c:v', 'libx264', '-c:a', 'aac', '-strict', 'experimental',
          '-vf', 'fps=30',
          '-b:a', '128k', part.output])
>>> parts[0]
Part(video='The Simpsons S09E22 Trash of the Titans.mp4', start='0:00:31.531', end='0:00:03.003', output='part_0.mp4')
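Judging by the Caption examples earlier, start and length are stored in microseconds, so the h:mm:ss.mmm strings in Part could come from a helper like this (a guess; the real one is in the gist):

```python
def to_timestamp(microseconds):
    # 31531000 microseconds -> '0:00:31.531', a format FFmpeg's -ss accepts
    seconds, micro = divmod(microseconds, 10 ** 6)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f'{hours}:{minutes:02}:{seconds:02}.{micro // 1000:03}'

print(to_timestamp(31531000))  # '0:00:31.531'
```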

And at the end I’ve generated a special file for the FFmpeg concat and concatenated the generated parts (also with re-encoding):

concat = '\n'.join(f"file '{part.output}'" for part in parts) + '\n'
with open('concat.txt', 'w') as f:
    f.write(concat)
cat concat.txt | head -n 5
file 'parts/part_0.mp4'
file 'parts/part_1.mp4'
file 'parts/part_2.mp4'
file 'parts/part_3.mp4'
file 'parts/part_4.mp4'
call(['ffmpeg', '-y', '-safe', '0', '-f', 'concat', '-i', 'concat.txt',
      '-c:v', 'libx264', '-c:a', 'aac', '-strict', 'experimental',
      '-vf', 'fps=30', 'output.mp4'])

Although the result is kind of meh, it was fun, so I’m going to try to do that again with a bigger dataset, even though working with FFmpeg wasn’t fun at all.

Gist with full sources.

Christina Wodtke: Introduction to OKRs

book cover For a better understanding of OKRs, I’ve decided to read Introduction to OKRs by Christina Wodtke. It’s a very short book, but it explains why and how to set OKRs, and how to keep track of them. The book isn’t deep or anything, but it contains almost everything I wanted to know about OKRs.

In contrast with some longer books, it’s very nice to read something that’s not trying to repeat the same content ten times all over the book.

Camille Fournier: The Manager's Path

book cover As sometimes I need to take the role of a manager of a small team, I decided to read something about that. So I chose The Manager’s Path by Camille Fournier. The book covers different stages of a manager’s growth with advice and related stories, and it’s interesting to read; for me it was kind of leisure reading.

Although most parts of the book were way over my level, it was fun to read about different roles and the expectations from people working in those roles. I’ve probably even understood the difference between VP and CTO.

Steven Bird, Ewan Klein, and Edward Loper: Natural Language Processing with Python

book cover white Recently I’ve noticed that my NLP knowledge isn’t that good. So I decided to read Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper, and it’s a nice book. It explains core concepts of NLP and how, where and why to use NLTK. After each chapter, it has lots of exercises. I’ve tried to do most of them and spent a bit too much time on the book.

Some parts of the book are too basic though; it even has parts about using Python data structures.

How I was planning a trip to South America with JavaScript, Python and Google Flights abuse

I was planning a trip to South America for a while. As I have flexible dates and want to visit a few places, it was very hard to find proper flights. So I decided to try to automate everything.

I’ve already done something similar before with Clojure and Chrome, but it was only for a single flight and doesn’t work anymore.

Parsing flights information

Apparently, there’s no open API for getting information about flights. But as Google Flights can show a calendar with prices for two months of dates, I decided to use it:

Calendar with prices

So I’ve generated every possible combination of interesting destinations in South America plus flights to and from Amsterdam, and simulated user interaction: changing the destination inputs and opening/closing the calendar. At the end, I wrote the results as JSON in a new tab. The whole code isn’t that interesting and is available in the gist. From a high level it looks like:

const getFlightsData = async ([from, to]) => {
  await setDestination(FROM, from);
  await setDestination(TO, to);

  const prices = await getPrices();

  return[date, price]) => ({
    date, price, from, to,
  }));
};

const collectData = async () => {
  let result = [];
  for (let flight of getAllPossibleFlights()) {
    const flightsData = await getFlightsData(flight);
    result = result.concat(flightsData);
  }
  return result;
};

const win ='');

  (data) => win.document.write(JSON.stringify(data)),
  (error) => console.error("Can't get flights", error),
);
In action:

I’ve run it twice to have separate data for flights with and without stops, and just saved the result to JSON files with content like:

[{"date":"2018-07-05","price":476,"from":"Rio de Janeiro","to":"Montevideo"},
{"date":"2018-07-06","price":470,"from":"Rio de Janeiro","to":"Montevideo"},
{"date":"2018-07-07","price":476,"from":"Rio de Janeiro","to":"Montevideo"},

Although it mostly works, in some rare cases it looks like Google Flights has some sort of anti-parsing protection and shows “random” prices.

Selecting the best trips

In the previous part, I’ve parsed 10110 flights with stops and 6422 non-stop flights; it’s impossible to use a brute-force algorithm here (I’ve tried). As reading data from JSON isn’t interesting, I’ll skip that part.

At first, I’ve built an index of from destination → day → to destination:

from_id2day_number2to_id2flight = defaultdict(
    lambda: defaultdict(
        lambda: {}))
for flight in flights:
    from_id2day_number2to_id2flight[flight.from_id] \
        [flight.day_number][flight.to_id] = flight
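To illustrate the index shape with hypothetical flight tuples (the real Flight type comes from the JSON-reading part that was skipped):

```python
from collections import defaultdict, namedtuple

# Hypothetical flights just for the illustration
Flight = namedtuple('Flight', 'from_id, day_number, to_id, price')
sample_flights = [Flight('AMS', 0, 'GIG', 498), Flight('GIG', 3, 'EZE', 196)]

index = defaultdict(lambda: defaultdict(lambda: {}))
for flight in sample_flights:
    index[flight.from_id][flight.day_number][flight.to_id] = flight

# O(1) lookup; chained .get calls return None when there is no such flight
print(index['AMS'][0]['GIG'].price)  # 498
print(index.get('AMS', {}).get(1, {}).get('GIG'))  # None
```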

Created a recursive generator that creates all possible trips:

def _generate_trips(can_visit, can_travel, can_spent, current_id,
                    current_day, trip_flights):
    # The last flight is to the home city, the end of the trip
    if trip_flights[-1].to_id == home_city_id:
        yield Trip(
            price=sum(flight.price for flight in trip_flights),
            flights=trip_flights)
        return

    # Everything visited, or no vacation days left, or no money left
    if not can_visit or can_travel < MIN_STAY or can_spent == 0:
        return

    # The minimal amount of cities visited, can start "thinking" about going home
    if len(trip_flights) >= MIN_VISITED and home_city_id not in can_visit:
        can_visit = can_visit.union({home_city_id})

    for to_id in can_visit:
        can_visit_next = can_visit.difference({to_id})
        for stay in range(MIN_STAY, min(MAX_STAY, can_travel) + 1):
            current_day_next = current_day + stay
            flight_next = from_id2day_number2to_id2flight \
                .get(current_id, {}).get(current_day_next, {}).get(to_id)
            if not flight_next:
                continue

            can_spent_next = can_spent - flight_next.price
            if can_spent_next < 0:
                continue

            yield from _generate_trips(
                can_visit_next, can_travel - stay, can_spent_next, to_id,
                current_day_next, trip_flights + [flight_next])

As the algorithm is easy to parallelize, I’ve made it possible to run it with multiprocessing.Pool’s imap_unordered, pre-sorting each chunk for a future merge sort:

def _generator_stage(params):
    return sorted(_generate_trips(*params), key=itemgetter(0))

Then I generated the initial flights and the rest of the trips in parallel:

def generate_trips():
    # besides can_visit and can_spent, the params also include can_travel,
    # the start city and day, and the initial flight; full version in the gist
    generators_params = [(
        city_ids.difference({start_id, home_city_id}),
        MAX_TRIP_PRICE - from_id2day_number2to_id2flight[home_city_id][start_day][start_id].price,
        ...)
        for start_day in range((MAX_START - MIN_START).days)
        for start_id in from_id2day_number2to_id2flight[home_city_id][start_day].keys()]

    with Pool(cpu_count() * 2) as pool:
        for stage_result in pool.imap_unordered(_generator_stage, generators_params):
            yield stage_result

And sorted everything with heapq.merge:

trips = [*merge(*generate_trips(), key=itemgetter(0))]

Looks like a solution to a job interview question.
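Each _generator_stage returns a list already sorted by price (the first tuple field), which is exactly what heapq.merge needs to lazily produce one fully sorted stream:

```python
from heapq import merge
from operator import itemgetter

# two pre-sorted chunks, as produced by different worker processes
chunk_a = [(99, 'trip a'), (476, 'trip c')]
chunk_b = [(120, 'trip b'), (498, 'trip d')]

trips = [*merge(chunk_a, chunk_b, key=itemgetter(0))]
print(trips[0])  # (99, 'trip a'), the cheapest trip first
```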

Without optimizations, it was taking more than an hour and consumed almost all RAM (apparently typing.NamedTuple isn’t memory efficient with multiprocessing at all), but the current implementation takes 1 minute 22 seconds on my laptop.

As the last step I’ve saved the results in CSV (the code isn’t interesting and is available in the gist), like:

price,days,cities,start city,start date,end city,end date,details
1373,15,4,La Paz,2018-09-15,Buenos Aires,2018-09-30,Amsterdam -> La Paz 2018-09-15 498 & La Paz -> Santiago 2018-09-18 196 & Santiago -> Montevideo 2018-09-23 99 & Montevideo -> Buenos Aires 2018-09-26 120 & Buenos Aires -> Amsterdam 2018-09-30 460
1373,15,4,La Paz,2018-09-15,Buenos Aires,2018-09-30,Amsterdam -> La Paz 2018-09-15 498 & La Paz -> Santiago 2018-09-18 196 & Santiago -> Montevideo 2018-09-23 99 & Montevideo -> Buenos Aires 2018-09-27 120 & Buenos Aires -> Amsterdam 2018-09-30 460
1373,15,4,La Paz,2018-09-15,Buenos Aires,2018-09-30,Amsterdam -> La Paz 2018-09-15 498 & La Paz -> Santiago 2018-09-20 196 & Santiago -> Montevideo 2018-09-23 99 & Montevideo -> Buenos Aires 2018-09-26 120 & Buenos Aires -> Amsterdam 2018-09-30 460
1373,15,4,La Paz,2018-09-15,Buenos Aires,2018-09-30,Amsterdam -> La Paz 2018-09-15 498 & La Paz -> Santiago 2018-09-20 196 & Santiago -> Montevideo 2018-09-23 99 & Montevideo -> Buenos Aires 2018-09-27 120 & Buenos Aires -> Amsterdam 2018-09-30 460

Gist with sources.

Filmstrip from subtitles and stock images

It’s possible to find subtitles for almost every movie or TV series, and there are also stock images of nearly anything imaginable. Wouldn’t it be fun to connect these two things and make a sort of filmstrip with a stock image for every caption from the subtitles?

TLDR: the result is silly:

For the subtitles to play with I chose subtitles for Bob’s Burgers – The Deeping. At first, we need to parse it with pycaption:

from pycaption import SRTReader

lang = 'en-US'
path = ''

def read_subtitles(path, lang):
    with open(path) as f:
        data =
        return SRTReader().read(data, lang=lang)
subtitles = read_subtitles(path, lang)
captions = subtitles.get_captions(lang)
>>> captions
['00:00:04.745 --> 00:00:06.746\nShh.', '00:00:10.166 --> 00:00:20.484\n...

As a lot of subtitles contain HTML tags, it’s important to remove them before further processing; that’s very easy to do with lxml:

import lxml.html

def to_text(raw_text):
    return lxml.html.document_fromstring(raw_text).text_content()
>>> to_text('<i>That shark is ruining</i>')
'That shark is ruining'

To find the most significant words in the text we need to tokenize it, lemmatize it (replace every different form of a word with a common form) and remove stop words. That’s easy to do with NLTK:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def tokenize_lemmatize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token.lower())
                  for token in tokens if token.isalpha()]
    stop_words = set(stopwords.words("english"))
    return [lemma for lemma in lemmatized if lemma not in stop_words]
>>> tokenize_lemmatize('That shark is ruining')
['shark', 'ruining']

And after that we can just combine the previous two functions and find the most frequently used words:

from collections import Counter

def get_most_popular(captions):
    full_text = '\n'.join(to_text(caption.get_text()) for caption in captions)
    tokens = tokenize_lemmatize(full_text)
    return Counter(tokens)
most_popular = get_most_popular(captions)
>>> most_popular
Counter({'shark': 68, 'oh': 32, 'bob': 29, 'yeah': 25, 'right': 20,...

It’s not the best way to find the most important words, but it kind of works.

After that it’s straightforward to extract keywords from a single caption:

def get_keywords(most_popular, text, n=2):
    tokens = sorted(tokenize_lemmatize(text), key=lambda x: -most_popular[x])
    return tokens[:n]
>>> captions[127].get_text()
'Teddy, what is wrong with you?'
>>> get_keywords(most_popular, to_text(captions[127].get_text()))
['teddy', 'wrong']

The next step is to find a stock image for those keywords. There aren’t that many properly working and documented stock image services, so I chose to use the Shutterstock API. It’s limited to 250 requests per hour, but it’s enough to play with.

From their API we only need to use /images/search. We will search for the most popular photo:

import requests

# Key and secret of your app
stock_key = ''
stock_secret = ''

def get_stock_image_url(query):
    response = requests.get(
        '',
        params={
            'query': query,
            'sort': 'popular',
            'view': 'minimal',
            'safe': 'false',
            'per_page': '1',
            'image_type': 'photo',
        },
        auth=(stock_key, stock_secret),
    )
    data = response.json()
    try:
        return data['data'][0]['assets']['preview']['url']
    except (IndexError, KeyError):
        return None
>>> get_stock_image_url('teddy wrong')

The image looks relevant:

teddy wrong

Now we can create a proper card from a caption:

def make_slide(most_popular, caption):
    text = to_text(caption.get_text())
    if not text:
        return None

    keywords = get_keywords(most_popular, text)
    query = ' '.join(keywords)
    if not query:
        return None

    stock_image = get_stock_image_url(query)
    if not stock_image:
        return None

    return text, stock_image
>>> make_slide(most_popular, captions[132])
('He really chewed it...\nwith his shark teeth.', '')

The image is kind of relevant:

He really chewed it...with his shark teeth.

After that we can select captions that we want to put in our filmstrip and generate html like the one in the TLDR section:

output_path = 'burgers.html'
start_slide = 98
end_slide = 200

def make_html_output(slides):
    html = '<html><head><link rel="stylesheet" href="./style.css"></head><body>'
    for (text, stock_image) in slides:
        html += f'''<div class="box">
            <img src="{stock_image}" />
            <span>{text}</span>
        </div>'''
    html += '</body></html>'
    return html

interesting_slides = [make_slide(most_popular, caption)
                      for caption in captions[start_slide:end_slide]]
interesting_slides = [slide for slide in interesting_slides if slide]

with open(output_path, 'w') as f:
    output = make_html_output(interesting_slides)
    f.write(output)

And the result - burgers.html.

Another example, even worse and a bit NSFW: It’s Always Sunny in Philadelphia – Charlie Catches a Leprechaun.

Gist with the sources.

Analyzing commute with Google Location History and Python

Map with commute

The article was updated 17.05.2018 with wind direction data.

Since I moved to Amsterdam I’ve been biking to work almost every morning. And as Google is always tracking my phone’s location, I thought it might be interesting to do something with that data.

First of all, I’ve downloaded the Location History data dump in JSON from the Download your data page. The format of the dump is very simple: it’s a dict with a locations key that contains a lot of entries like this, in descending order by date:

    "timestampMs" : "1525120611682",
    "latitudeE7" : 523508799,
    "longitudeE7" : 488938179,
    "accuracy" : 15,
    "altitude" : 49,
    "verticalAccuracy" : 2

It’s very easy to parse it with Python:

import json
from datetime import datetime
from collections import namedtuple

Point = namedtuple('Point', 'latitude, longitude, datetime')

def read_points():
    with open('data.json') as f:
        data = json.load(f)

    for point in data['locations']:
        yield Point(
            point['latitudeE7'] / 10 ** 7,
            point['longitudeE7'] / 10 ** 7,
            datetime.fromtimestamp(int(point['timestampMs']) / 1000))
points = read_points()
>>> [*points]
[Point(latitude=52.350879, longitude=4.893817, datetime=datetime.datetime(2018, 4, 30, 22, 36, 51, 682000)), ...]

As I moved to my current place in November, it’s safe to ignore all entries before that:

from itertools import takewhile

from_date = datetime(2017, 11, 1)

after_move = takewhile(lambda point: point.datetime >= from_date, points)
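This works only because the dump is ordered newest-first: takewhile keeps yielding points until it meets the first one older than from_date and then stops, never scanning years of older history:

```python
from datetime import datetime
from itertools import takewhile

# points arrive in descending order, newest first
dates = [datetime(2018, 4, 30), datetime(2017, 12, 1), datetime(2017, 10, 1)]
recent = [*takewhile(lambda d: d >= datetime(2017, 11, 1), dates)]
print(recent)  # the 2017-10-01 entry and everything older is dropped
```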

And weekends:

work_days = (point for point in after_move
             if point.datetime.weekday() < 5)

Usually, I’m heading to work between 9 am and 10 am, but as the Netherlands switches between summer and winter time, it’s safer to treat everything between 7 am and 12 pm as possible commute time:

from_hour = 7
to_hour = 12

commute_time = (point for point in work_days
                if from_hour <= point.datetime.hour < to_hour)

Then I grouped everything by date:

from itertools import groupby

by_days = groupby(commute_time, key=lambda point:
>>> [(day, [*vals]) for day, vals in by_days]
[(, 4, 27),
 [Point(latitude=52.350879, longitude=4.893817, datetime=datetime.datetime(2018, 4, 27, 11, 58, 17, 189000)), ...]),

After that, I selected the last point at home and the first point at work for every day. A point is considered to be at home or at work if its distance from home or work is smaller than 50 meters. The distance can easily be calculated with geopy:

from geopy.distance import geodesic

home = (52.350879, 4.893817)  # not really =)
work = (52.3657573, 4.8980648)

max_distance = 0.050

def last_at_home(points):
    result = None
    for point in points:
        if geodesic(home, point[:2]).km <= max_distance:
            result = point
    return result

def first_at_work(points, after):
    for point in points:
        if point.datetime > after.datetime and geodesic(work, point[:2]).km <= max_distance:
            return point

Commute = namedtuple('Commute', 'day, start, end, took')

def get_commute():
    for day, points in by_days:
        points = [*points][::-1]

        start = last_at_home(points)
        if start is None:
            continue

        end = first_at_work(points, start)
        if end is None:
            continue

        yield Commute(
            day, start.datetime, end.datetime, end.datetime - start.datetime)

commutes = [*get_commute()][::-1]
>>> commutes
[Commute(, 11, 2), start=datetime.datetime(2017, 11, 2, 9, 39, 13, 219000), end=datetime.datetime(2017, 11, 2, 9, 52, 53, 295000), took=datetime.timedelta(0, 820, 76000)), ...]
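geopy’s geodesic is the accurate way to measure those distances; for a rough idea of what it computes, a dependency-free haversine sketch gives nearly the same answer at such short distances:

```python
import math

def haversine_km(a, b):
    # great-circle distance between two (latitude, longitude) points in km
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

home = (52.350879, 4.893817)
work = (52.3657573, 4.8980648)
print(haversine_km(home, work))  # about 1.68 km
```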

Now it’s easy to plot a graph of daily commute with matplotlib:

from matplotlib import pyplot

fig, ax = pyplot.subplots()
ax.plot([ for commute in commutes],
        [commute.took.total_seconds() / 60 for commute in commutes])

ax.set(xlabel='day', ylabel='commute (minutes)',
       title='Daily commute')

It’s easy to spot days when I had appointments in the morning:

Daily commute

Then I thought that it might be interesting to look for correlations between temperature and commute time, and between wind speed and commute time. I found a dump of daily weather data that KNMI provides; the nearest meteorological station is at Schiphol airport, but I guess it’s close enough. The data is in an easy-to-parse format:

# STN,YYYYMMDD,DDVEC,FHVEC,   FG,  FHX, FHXH,  FHN, FHNH,  FXX, FXXH,   TG,   TN,  TNH,   TX,  TXH, T10N,T10NH,   SQ,   SP,    Q,   DR,   RH,  RHX, RHXH,   PG,   PX,  PXH,   PN,  PNH,  VVN, VVNH,  VVX, VVXH,   NG,   UG,   UX,  UXH,   UN,  UNH, EV24

  240,19510101,  188,   77,   87,  195,   18,   41,   24,     ,     ,   12,  -13,    1,   26,   20,     ,     ,     ,     ,     ,     ,     ,     ,     , 9891, 9957,     , 9837,     ,     ,     ,     ,     ,    7,   90,   98,    6,   73,   20,     
  240,19510102,  153,   41,   41,   82,    4,   10,   21,     ,     ,   13,    7,    4,   18,   19,     ,     ,     ,     ,     ,     ,     ,     ,     , 9876, 9923,     , 9853,     ,     ,     ,     ,     ,    8,   93,   98,    9,   88,    1,     

I only used FG (Daily mean wind speed in 0.1 m/s), TG (Daily mean temperature in 0.1 degrees Celsius) and DDVEC (Vector mean wind direction in degrees):

from collections import namedtuple

from dateutil.parser import parse

Weather = namedtuple('Weather', 'windspeed, temperature, wind_direction')

def read_weather():
    result = {}

    with open('weather.txt') as f:
        for line in f.readlines():
            if not line.startswith(' '):
                continue

            data = [part.strip() for part in line.split(',')]
            result[parse(data[1]).date()] = Weather(
                int(data[4]) / 10,   # FG, daily mean wind speed
                int(data[11]) / 10,  # TG, daily mean temperature
                int(data[2]))        # DDVEC, vector mean wind direction

    return result

weather = read_weather()
>>> weather
{datetime.date(1951, 1, 1): Weather(windspeed=8.7, temperature=1.2, wind_direction=188),
 datetime.date(1951, 1, 2): Weather(windspeed=4.1, temperature=1.3, wind_direction=153),
 datetime.date(1951, 1, 3): Weather(windspeed=2.1, temperature=0.3, wind_direction=203),
 ...}
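The weather dict is keyed by date, so matching it against the commutes is a plain lookup; a sketch with hypothetical stand-in data in place of the real parsed records:

```python
from collections import namedtuple
from datetime import date, timedelta

Weather = namedtuple('Weather', 'windspeed, temperature, wind_direction')
Commute = namedtuple('Commute', 'day, start, end, took')

# Hypothetical stand-ins for the parsed weather and commute data:
weather = {date(2017, 11, 2): Weather(8.7, 1.2, 188)}
commutes = [Commute(date(2017, 11, 2), None, None, timedelta(minutes=14))]

# Pair each commute (in minutes) with that day's temperature,
# skipping days without a weather record:
pairs = [(c.took.total_seconds() / 60, weather[c.day].temperature)
         for c in commutes if c.day in weather]
```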

Before doing that, I've excluded spikes from days when I had appointments:

normalized = [commute for commute in commutes
              if commute.took.total_seconds() < 60 * 20]

Then I created a scatter plot of temperature and commute:

fig, ax = pyplot.subplots()
ax.scatter([commute.took.total_seconds() / 60 for commute in normalized],
           [weather[commute.day].temperature for commute in normalized])
ax.set(xlabel='Commute time', ylabel='Temperature',
       title='Commute and weather')

A correlation is slightly visible: on cold days the commute is a bit faster:

Commute and temperature

With wind speed the code is almost the same:

fig, ax = pyplot.subplots()
ax.scatter([commute.took.total_seconds() / 60 for commute in normalized],
           [weather[commute.day].windspeed for commute in normalized])
ax.set(xlabel='Commute time', ylabel='Wind speed',
       title='Commute and wind')

And the correlation is more visible: the commute is slower with stronger wind:

Commute and wind
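The visual impression can be double-checked numerically with a Pearson correlation coefficient via `np.corrcoef`; a sketch with made-up numbers standing in for the real commute and wind data:

```python
import numpy as np

# Hypothetical stand-ins: commute times in minutes and the matching
# daily mean wind speeds, roughly increasing together.
minutes = np.array([12.0, 13.5, 15.0, 16.5, 18.0])
windspeed = np.array([2.0, 3.0, 4.5, 5.5, 7.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the coefficient.
r = np.corrcoef(minutes, windspeed)[0, 1]
```

A value of `r` close to 1 means commute time grows with wind speed; close to 0 means no linear relation.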

Then I’ve tried to combine the previous two graphs in one 3d graph:

from mpl_toolkits.mplot3d import Axes3D

fig, ax = pyplot.subplots(subplot_kw={'projection': '3d'})
ax.scatter([weather[commute.day].temperature for commute in normalized],
           [weather[commute.day].windspeed for commute in normalized],
           [commute.took.total_seconds() / 60 for commute in normalized])
ax.set(xlabel='Temperature', ylabel='Wind speed', zlabel='Commute time',
       title='Commute and weather')

And the result didn’t give me anything:

Commute and weather in 3d

After that I thought that it would be interesting to look at a possible correlation of commute time, wind speed, and direction:

import numpy as np
from matplotlib import cm

colors = iter(cm.Reds(np.linspace(0, 1, len(normalized))))

fig, ax = pyplot.subplots()

for commute in sorted(normalized, key=lambda commute: commute.took.total_seconds() / 60):
    ax.scatter(weather[commute.day].windspeed,
               weather[commute.day].wind_direction,
               color=next(colors))

ax.set(xlabel='Wind speed', ylabel='Wind direction',
       title='Commute and wind')


The longer the commute, the redder the dot:

Commute, wind speed and direction

Wind direction doesn’t look that well on that plot, so I’ve tried to use polar plot:

colors = iter(cm.Reds(np.linspace(0, 1, len(normalized))))

fig, ax = pyplot.subplots(subplot_kw={'projection': 'polar'})

for commute in sorted(normalized, key=lambda commute: commute.took.total_seconds() / 60):
    ax.scatter(np.radians(weather[commute.day].wind_direction),
               weather[commute.day].windspeed,
               color=next(colors))

ax.set(title='Commute and wind')


And it’s much more “readable”:

Polar commute, wind speed and direction
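One detail worth noting: matplotlib's polar axes take the angle in radians, while the KNMI `DDVEC` wind direction is reported in degrees, so the direction has to be converted before plotting:

```python
import numpy as np

# DDVEC from the first weather record is 188 degrees;
# np.radians converts it to the radians a polar axis expects.
theta = np.radians(188)
```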

Gist with sources.

Alan A. A. Donovan,‎ Brian W. Kernighan: The Go Programming Language

book cover white Somehow I've started to use Go more and more, so I decided to read something fundamental about it. The Go Programming Language by Alan A. A. Donovan and Brian W. Kernighan looked like the right choice. The book covers every aspect of the language, from very basic things like statements and variables to reflection and interop with C. It also covers some best practices, and things that are better avoided.

Although I’m not a big fan of Go and it’s best practices, the book at least helped me to understand why it’s done that way a bit better.