Robert C. Martin: Clean Architecture



book cover

Recently I decided to read Clean Architecture by Robert C. Martin, as I'm still trying to read more about system architecture and I've had a good experience with the Clean books in the past. And I kind of enjoyed it.

The book is about software architecture in general but is mostly focused on "one piece of software" architecture. It nicely explains a few ways of properly separating an application into independent parts without going too deep into enterprise-ish practices and patterns. The book also shows a bunch of examples of how to abstract away implementation details, but it also warns not to overdo it and over-engineer things.

Although sometimes it feels too basic and covers a lot of general knowledge, it’s a nice and useful book to read.

Mixing React Hooks with classes or making functional components more structured



Hook inside a class*

Not so long ago, Hooks were introduced in React. They allow us to have decoupled and reusable logic for component state and effects, and they make dependency injection easier. But API-wise it looks a bit like a step back from class-based components into sort of jQuery territory, with tons of nested functions.

So I thought that it might be nice to try to mix both approaches.

TLDR: it’s possible to make it nice and declarative, but it requires metaprogramming and wouldn’t work in some browsers.

The hookable class adventure

Let’s assume that we have a simple counter component:

export default ({ initialCount }) => {
  const theme = useContext(Theme);

  const [current, setCurrent] = useState(initialCount);
  const [clicked, setClicked] = useState();

  const onClick = useCallback(() => {
    setCurrent(current + 1);
    setClicked(true);
  }, [current]);

  useEffect(() => {
    console.log("did mount");

    return () => {
      console.log("did unmount");
    };
  });

  return (
    <div>
      <p>
        Value: <span style={{ color: theme }}>{current}</span>
      </p>
      {clicked && <p>You already clicked!</p>}
      <p>Initial value: {initialCount}</p>
      <button onClick={onClick}>Increase</button>
    </div>
  );
};

As a first step towards classes, I made a simple decorator that creates a function which instantiates a class and calls its render method (the only similarity with the original class-based component interface).

const asFunctional = cls => props => new cls(props).render();

As we can initialize attributes on a class body, it’s already safe to move useContext to it:

class Counter {
  theme = useContext(Theme);

  constructor({ initialCount }) {
    this.initialCount = initialCount;
  }

  render() {
    const [current, setCurrent] = useState(this.initialCount);
    const [clicked, setClicked] = useState();

    const onClick = useCallback(() => {
      setCurrent(current + 1);
      setClicked(true);
    }, [current]);

    useEffect(() => {
      console.log("did mount");

      return () => {
        console.log("did unmount");
      };
    });

    return (
      <div>
        <p>
          Value: <span style={{ color: this.theme }}>{current}</span>
        </p>
        {clicked && <p>You already clicked!</p>}
        <p>Initial value: {this.initialCount}</p>
        <button onClick={onClick}>Increase</button>
      </div>
    );
  }
}

export default asFunctional(Counter);

Manual assignment of props as attributes looks ugly, and useState inside render is even worse. It would be nice to be able to declare them on the class body. Unfortunately, the decorators that could help with that aren't here yet, but we can use a bit of Proxy magic by making a base class that intercepts attribute assignment and injects values for props and descriptors for the state:

const prop = () => ({ __isPropGetter: true });

const asDescriptor = ([val, setVal]) => ({
  __isDescriptor: true,
  get: () => val,
  set: newVal => setVal(newVal),
});

const Hookable = function(props) {
  return new Proxy(this, {
    set: (obj, name, val) => {
      if (val && val.__isPropGetter) {
        obj[name] = props[name];
      } else if (val && val.__isDescriptor) {
        Object.defineProperty(obj, name, val);
      } else {
        obj[name] = val;
      }
      return true;
    },
  });
};

So now we can have descriptors for the state, and when a state attribute's value is changed, the corresponding set... function is called automatically. As we don't need to keep the state in a closure, it's safe to move the onClick callback to the class body:

class Counter extends Hookable {
  initialCount = prop();

  theme = useContext(Theme);

  current = asDescriptor(useState(this.initialCount));
  clicked = asDescriptor(useState());

  onClick = useCallback(() => {
    this.current += 1;
    this.clicked = true;
  }, [this.current]);

  render() {
    useEffect(() => {
      console.log("did mount");

      return () => {
        console.log("did unmount");
      };
    });

    return (
      <div>
        <p>
          Value: <span style={{ color: this.theme }}>{this.current}</span>
        </p>
        {this.clicked && <p>You already clicked!</p>}
        <p>Initial value: {this.initialCount}</p>
        <button onClick={this.onClick}>Increase</button>
      </div>
    );
  }
}

export default asFunctional(Counter);

The only not-so-fancy part left is useEffect inside render. In the Python world, a similar problem with the context manager API is solved by the contextmanager decorator, which transforms generators into context managers. I tried the same approach with effects:

const fromGenerator = (hook, genFn, deps) => fn => {
  const gen = genFn();
  hook(() => {
    gen.next();

    return () => {
      gen.next();
    };
  }, deps);

  return fn;
};

The magical end result

As a result, we have render with only JSX and almost no nested functions in our component:

class Counter extends Hookable {
  initialCount = prop();

  theme = useContext(Theme);

  current = asDescriptor(useState(this.initialCount));
  clicked = asDescriptor(useState());

  onClick = useCallback(() => {
    this.current += 1;
    this.clicked = true;
  }, [this.current]);

  withLogging = fromGenerator(
    useEffect,
    function*() {
      console.log("did mount");
      yield;
      console.log("did unmount");
    },
    [],
  );

  render = this.withLogging(() => (
    <div>
      <p>
        Value:{" "}
        <span style={{ color: this.theme }}>{this.current}</span>
      </p>
      {this.clicked && <p>You already clicked!</p>}
      <p>Initial value: {this.initialCount}</p>
      <button onClick={this.onClick}>Increase</button>
    </div>
  ));
}

export default asFunctional(Counter);

And it even works:

To my eyes, the end result looks better and more readable, but the magic inside isn't free:

  • it doesn’t work in Internet Explorer
  • the machinery around Proxy might be slow
  • it’s impossible to make it properly typed with TypeScript or Flow
  • metaprogramming could make things unnecessarily more complicated

So I guess something in the middle (functor-like approach?) might be useful for real applications.

Gist with source code.

* hero image contains a photo of a classroom in Alaska

Eben Hewitt: Technology Strategy Patterns



book cover white

Not so long ago I wanted to read something about large-scale architecture and mistakenly decided that Technology Strategy Patterns by Eben Hewitt was about that. It's not, but it was still interesting to read. The book provides a bunch of nice patterns for delivering information to product people and stakeholders. It covers a few ways of doing long-range planning and analysis, with some examples from the real world. It even has some PowerPoint and Excel advice.

Although the book was somehow useful for me and I've learned a bunch of new stuff, I'm definitely not its target audience.

Finding the cheapest flights for a multi-leg trip with Amadeus API and Python



An old plane

This summer I'm planning a trip that will include Moscow, Irkutsk, Beijing, Shanghai, and Tokyo. As I'm flexible on dates, I've decided to try to find the cheapest flights with the shortest duration. I've tried to do this twice before by parsing Google Flights; it was successful, but I don't want to update those hackish scripts and want to try something a bit saner.

So I chose to try the Amadeus API. It was a bit painful to use: some endpoints were randomly failing with 500, and a signed agreement is needed to use real data. But overall it was at least better than parsing Google Flights, and the whole adventure fit inside the free request quota.

TLDR: jupyter notebook with the whole adventure
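
All requests below go through a small call_api helper; the full version lives in the notebook. A minimal sketch of it, assuming the Amadeus test environment and the OAuth2 client-credentials flow (the AMADEUS_KEY/AMADEUS_SECRET environment variables are placeholders), might look like this:

import os
import requests

API_URL = 'https://test.api.amadeus.com/v1'  # assumed test environment base URL


def get_token():
    # Client-credentials flow; AMADEUS_KEY/AMADEUS_SECRET are assumed env vars
    response = requests.post(f'{API_URL}/security/oauth2/token',
                             data={'grant_type': 'client_credentials',
                                   'client_id': os.environ['AMADEUS_KEY'],
                                   'client_secret': os.environ['AMADEUS_SECRET']})
    return response.json()['access_token']


def call_api(path, **params):
    # Every endpoint used in this post is a plain GET with query parameters
    response = requests.get(f'{API_URL}{path}',
                            params=params,
                            headers={'Authorization': f'Bearer {get_token()}'})
    return response.json()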

Restrictions

I'm flexible, but with boundaries: I'll be able to start between the 10th and the 20th of July and travel for no longer than 21 days:

min_start = date(2019, 7, 10)
max_start = date(2019, 7, 20)
max_days = 21

I mostly don't want to have multi-segment flights, and I know how many days I want to spend at each destination:

places_df = pd.DataFrame([('Amsterdam', 'NL', 0, (max_start - min_start).days, True),  # for enabling tentative start date
                          ('Moscow', 'RU', 3, 5, True),
                          ('Irkutsk', 'RU', 7, 10, True),
                          ('Beijing', 'CN', 3, 5, True),
                          ('Shanghai', 'CN', 3, 5, True),
                          ('Tokyo', 'JP', 3, 5, False),
                          ('Amsterdam', 'NL', 0, 0, True)],  # the final destination
                         columns=['city', 'cc', 'min_days', 'max_days', 'only_direct'])

# Cumulative sums of min/max days per place give bounds on the day of departure
places_df['min_day_of_dep'] = places_df.min_days.rolling(min_periods=1, window=len(places_df)).sum()
places_df['max_day_of_dep'] = places_df.max_days.rolling(min_periods=1, window=len(places_df)).sum()

places_df
city cc min_days max_days only_direct min_day_of_dep max_day_of_dep
0 Amsterdam NL 0 10 True 0.0 10.0
1 Moscow RU 3 5 True 3.0 15.0
2 Irkutsk RU 7 10 True 10.0 25.0
3 Beijing CN 3 5 True 13.0 30.0
4 Shanghai CN 3 5 True 16.0 35.0
5 Tokyo JP 3 5 False 19.0 40.0
6 Amsterdam NL 0 0 True 19.0 40.0

Airports

A lot of big cities have more than one airport, and usually some airports are for low-cost carriers and some for pricier flights. But most importantly, the API expects IATA codes to return prices for dates. So I needed to get the IATA codes of the airports in the cities I'll travel through, which is possible with just a request to /reference-data/locations:

def get_iata(city, cc):
    response = call_api('/reference-data/locations',  # full code in the notebook
                        keyword=city,
                        countryCode=cc,
                        subType='AIRPORT')

    return [result['iataCode'] for result in response['data']]

get_iata('Moscow', 'RU')
['DME', 'SVO', 'VKO']

With that function, I was able to get IATA codes for all destinations and get all possible routes with a bit of pandas magic:

places_df['iata'] = places_df.apply(lambda place: get_iata(place['city'], place['cc']), axis=1)

routes_df = places_df.assign(dest_iata=places_df.iloc[1:].reset_index().iata)
routes_df['routes'] = routes_df.apply(
    lambda row: [*product(row['iata'], row['dest_iata'])] if isinstance(row['dest_iata'], list) else [],
    axis=1)

routes_df = routes_df.routes \
    .apply(pd.Series) \
    .merge(routes_df, right_index=True, left_index=True) \
    .drop(['routes', 'min_days', 'max_days', 'iata', 'dest_iata'], axis=1) \
    .melt(id_vars=['city', 'cc', 'min_day_of_dep', 'max_day_of_dep', 'only_direct'], value_name="route") \
    .drop('variable', axis=1) \
    .dropna()

routes_df['origin'] = routes_df.route.apply(lambda route: route[0])
routes_df['destination'] = routes_df.route.apply(lambda route: route[1])
routes_df = routes_df \
    .drop('route', axis=1) \
    .rename(columns={'city': 'origin_city',
                     'cc': 'origin_cc'})

routes_df.head(10)
origin_city origin_cc min_day_of_dep max_day_of_dep only_direct origin destination
0 Amsterdam NL 0.0 10.0 True AMS DME
1 Moscow RU 3.0 15.0 True DME IKT
2 Irkutsk RU 10.0 25.0 True IKT PEK
3 Beijing CN 13.0 30.0 True PEK PVG
4 Shanghai CN 16.0 35.0 True PVG HND
5 Tokyo JP 19.0 40.0 False HND AMS
7 Amsterdam NL 0.0 10.0 True AMS SVO
8 Moscow RU 3.0 15.0 True SVO IKT
9 Irkutsk RU 10.0 25.0 True IKT NAY
10 Beijing CN 13.0 30.0 True PEK SHA

To better understand the complexity of the problem, I drew an ugly graph of the possible flight routes:

The ugly graph with airports
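
The drawing code is only in the notebook; a rough sketch of how such a graph could be built with networkx (an assumption, the notebook may do it differently):

import matplotlib.pyplot as plt
import networkx as nx

# Build a directed graph from the origin/destination pairs and draw it
graph = nx.DiGraph()
graph.add_edges_from(zip(routes_df.origin, routes_df.destination))

plt.figure(figsize=(10, 10))
nx.draw_networkx(graph, node_size=300, font_size=8)
plt.show()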

Prices and dates

After that I’ve calculated all possible dates for flights:

route_dates_df = routes_df.assign(
    dates=routes_df.apply(lambda row: [min_start + timedelta(days=days)
                                       for days in range(int(row.min_day_of_dep), int(row.max_day_of_dep) + 1)],
                          axis=1))

route_dates_df = route_dates_df.dates \
    .apply(pd.Series) \
    .merge(route_dates_df, right_index=True, left_index=True) \
    .drop(['dates', 'min_day_of_dep', 'max_day_of_dep'], axis=1) \
    .melt(id_vars=['origin_city', 'origin_cc', 'origin', 'destination', 'only_direct'], value_name="date") \
    .drop('variable', axis=1) \
    .dropna()

valid_routes_df = route_dates_df[route_dates_df.date <= max_start + timedelta(days=max_days)]

valid_routes_df.head(10)
origin_city origin_cc origin destination only_direct date
0 Amsterdam NL AMS DME True 2019-07-10
1 Moscow RU DME IKT True 2019-07-13
2 Irkutsk RU IKT PEK True 2019-07-20
3 Beijing CN PEK PVG True 2019-07-23
4 Shanghai CN PVG HND True 2019-07-26
5 Tokyo JP HND AMS False 2019-07-29
6 Amsterdam NL AMS SVO True 2019-07-10
7 Moscow RU SVO IKT True 2019-07-13
8 Irkutsk RU IKT NAY True 2019-07-20
9 Beijing CN PEK SHA True 2019-07-23

Eventually, I've got 363 possible route-date combinations and used /shopping/flight-offers to get prices. As the endpoint has a quota of 2000 free requests, I was able to mess everything up a few times and still haven't reached it:

def get_prices(origin, destination, date, only_direct):
    response = call_api('/shopping/flight-offers',
                         origin=origin,
                         destination=destination,
                         nonStop='true' if only_direct else 'false',
                         departureDate=date.strftime("%Y-%m-%d"))

    if 'data' not in response:
        print(response)
        return []

    return [(origin, destination, date,
             Decimal(offer_item['price']['total']),
             parse_date(offer_item['services'][0]['segments'][0]['flightSegment']['departure']['at']),
             parse_date(offer_item['services'][0]['segments'][-1]['flightSegment']['arrival']['at']),
             len(offer_item['services'][0]['segments']))
            for flight in response['data']
            for offer_item in flight['offerItems']]

get_prices('IKT', 'PEK', date(2019, 7, 20), True)[:5]
[('IKT',
  'PEK',
  datetime.date(2019, 7, 20),
  Decimal('209.11'),
  datetime.datetime(2019, 7, 20, 1, 50, tzinfo=tzoffset(None, 28800)),
  datetime.datetime(2019, 7, 20, 4, 40, tzinfo=tzoffset(None, 28800)),
  1),
 ('IKT',
  'PEK',
  datetime.date(2019, 7, 20),
  Decimal('262.98'),
  datetime.datetime(2019, 7, 20, 15, 15, tzinfo=tzoffset(None, 28800)),
  datetime.datetime(2019, 7, 20, 18, 5, tzinfo=tzoffset(None, 28800)),
  1)]

Then I’ve fetched flights for the whole set of dates, assigned useful metadata like origin/destination cities and duration of the flights, and removed flights pricier than €800:

prices_df = pd.DataFrame([price
                          for route in valid_routes_df.to_dict('records')
                          for price in get_prices(route['origin'], route['destination'], route['date'], route['only_direct'])],
                         columns=['origin', 'destination', 'date', 'price', 'departure_at', 'arrival_at', 'segments'])

airport_to_city = dict(zip(routes_df.origin, routes_df.origin_city))

prices_with_city_df = prices_df \
    .assign(duration=prices_df.arrival_at - prices_df.departure_at,
            origin_city=prices_df.origin.apply(airport_to_city.__getitem__),
            destination_city=prices_df.destination.apply(airport_to_city.__getitem__))
prices_with_city_df['route'] = prices_with_city_df.origin_city + " ✈️ " + prices_with_city_df.destination_city

valid_prices_with_city_df = prices_with_city_df[prices_with_city_df.price <= 800]

valid_prices_with_city_df.head()
origin destination date price departure_at arrival_at segments duration origin_city destination_city route
0 DME IKT 2019-07-13 257.40 2019-07-13 21:40:00+03:00 2019-07-14 08:25:00+08:00 1 05:45:00 Moscow Irkutsk Moscow✈️Irkutsk
1 DME IKT 2019-07-13 257.40 2019-07-13 23:00:00+03:00 2019-07-14 09:45:00+08:00 1 05:45:00 Moscow Irkutsk Moscow✈️Irkutsk
2 DME IKT 2019-07-13 254.32 2019-07-13 19:55:00+03:00 2019-07-14 06:25:00+08:00 1 05:30:00 Moscow Irkutsk Moscow✈️Irkutsk
3 DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow✈️Irkutsk
4 IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk✈️Beijing

To have a brief overview of the prices, I've made a scatterplot. If I were a machine learning algorithm, I would exclude Tokyo from the adventure:

Scatterplot with prices and duration
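
The plotting code is also only in the notebook; one possible way to get a similar overview (an assumption) is a plain matplotlib scatter of price against duration per route:

import matplotlib.pyplot as plt

# Price vs flight duration (in hours), one color per route
plot_df = valid_prices_with_city_df.assign(
    duration_hours=valid_prices_with_city_df.duration.dt.total_seconds() / 3600)

fig, ax = plt.subplots(figsize=(12, 8))
for route, route_df in plot_df.groupby('route'):
    ax.scatter(route_df.price, route_df.duration_hours, label=route, alpha=0.6)

ax.set_xlabel('price, €')
ax.set_ylabel('duration, hours')
ax.legend()
plt.show()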

Itinerary

At this stage I've got all the data I want, so I can begin building the itinerary. I've calculated all possible city/date combinations of flights. Job interview questions prepared me for that:

next_flight_origin_city = dict(zip(places_df.city.iloc[:-2], places_df.city.iloc[1:-1]))
place_min_days = dict(zip(places_df.city.iloc[:-1], places_df.min_days.iloc[:-1]))
place_max_days = dict(zip(places_df.city.iloc[:-1], places_df.max_days.iloc[:-1]))

def build_itinerary(place, date):
    if place is None:
        return

    next_place = next_flight_origin_city.get(place)

    for days in range(place_min_days[place], place_max_days[place] + 1):
        flight_date = date + timedelta(days=days)
        for rest_flights in build_itinerary(next_place, flight_date):
            yield [(place, flight_date), *rest_flights]

        if next_place is None:
            yield [(place, flight_date)]

itinerary = [*build_itinerary('Amsterdam', min_start)]
itinerary[:3]
[[('Amsterdam', datetime.date(2019, 7, 10)),
  ('Moscow', datetime.date(2019, 7, 13)),
  ('Irkutsk', datetime.date(2019, 7, 20)),
  ('Beijing', datetime.date(2019, 7, 23)),
  ('Shanghai', datetime.date(2019, 7, 26)),
  ('Tokyo', datetime.date(2019, 7, 29))],
 [('Amsterdam', datetime.date(2019, 7, 10)),
  ('Moscow', datetime.date(2019, 7, 13)),
  ('Irkutsk', datetime.date(2019, 7, 20)),
  ('Beijing', datetime.date(2019, 7, 23)),
  ('Shanghai', datetime.date(2019, 7, 26)),
  ('Tokyo', datetime.date(2019, 7, 30))],
 [('Amsterdam', datetime.date(2019, 7, 10)),
  ('Moscow', datetime.date(2019, 7, 13)),
  ('Irkutsk', datetime.date(2019, 7, 20)),
  ('Beijing', datetime.date(2019, 7, 23)),
  ('Shanghai', datetime.date(2019, 7, 26)),
  ('Tokyo', datetime.date(2019, 7, 31))]]

And then I've found all flights for those dates. As the number of possible flight combinations didn't fit in my RAM, I was selecting the n_cheapest flights at each stage. The code is slow and ugly, but it worked:

def find_flights(prices_with_city_df, itinerary_route, n_cheapest):
    result_df = None
    for place, date in itinerary_route:
        place_df = prices_with_city_df \
            [(prices_with_city_df.origin_city == place) & (prices_with_city_df.date == date)] \
            .sort_values('price', ascending=True) \
            .head(n_cheapest) \
            .add_prefix(f'{place}_')

        if result_df is None:
            result_df = place_df
        else:
            result_df = result_df \
                .assign(key=1) \
                .merge(place_df.assign(key=1), on="key") \
                .drop("key", axis=1)

            result_df['total_price'] = reduce(operator.add, (
                result_df[column] for column in result_df.columns
                if 'price' in column and column != 'total_price'
            ))

            result_df = result_df \
                .sort_values('total_price', ascending=True) \
                .head(n_cheapest)

    result_df['total_flights_duration'] = reduce(operator.add, (
        result_df[column] for column in result_df.columns
        if 'duration' in column
    ))

    return result_df[['total_price', 'total_flights_duration'] + [
        column for column in result_df.columns
        if 'total_' not in column]]

find_flights(prices_with_city_df, itinerary[0], 100).head(5)
total_price total_flights_duration Amsterdam_origin Amsterdam_destination Amsterdam_date Amsterdam_price Amsterdam_departure_at Amsterdam_arrival_at Amsterdam_segments Amsterdam_duration Amsterdam_origin_city Amsterdam_destination_city Amsterdam_route Moscow_origin Moscow_destination Moscow_date Moscow_price Moscow_departure_at Moscow_arrival_at Moscow_segments Moscow_duration Moscow_origin_city Moscow_destination_city Moscow_route Irkutsk_origin Irkutsk_destination Irkutsk_date Irkutsk_price Irkutsk_departure_at Irkutsk_arrival_at Irkutsk_segments Irkutsk_duration Irkutsk_origin_city Irkutsk_destination_city Irkutsk_route Beijing_origin Beijing_destination Beijing_date Beijing_price Beijing_departure_at Beijing_arrival_at Beijing_segments Beijing_duration Beijing_origin_city Beijing_destination_city Beijing_route Shanghai_origin Shanghai_destination Shanghai_date Shanghai_price Shanghai_departure_at Shanghai_arrival_at Shanghai_segments Shanghai_duration Shanghai_origin_city Shanghai_destination_city Shanghai_route Tokyo_origin Tokyo_destination Tokyo_date Tokyo_price Tokyo_departure_at Tokyo_arrival_at Tokyo_segments Tokyo_duration Tokyo_origin_city Tokyo_destination_city Tokyo_route
0 1901.41 1 days 20:45:00 AMS SVO 2019-07-10 203.07 2019-07-10 21:15:00+02:00 2019-07-11 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-23 171.64 2019-07-23 11:30:00+08:00 2019-07-23 14:00:00+08:00 1 02:30:00 Beijing Shanghai Beijing ✈️ Shanghai SHA NRT 2019-07-26 394.07 2019-07-26 14:35:00+08:00 2019-07-26 18:15:00+09:00 1 02:40:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-29 696.12 2019-07-29 17:55:00+09:00 2019-07-30 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
2800 1901.41 1 days 20:30:00 AMS SVO 2019-07-10 203.07 2019-07-10 11:50:00+02:00 2019-07-10 16:05:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-23 171.64 2019-07-23 21:30:00+08:00 2019-07-23 23:45:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-26 394.07 2019-07-26 14:35:00+08:00 2019-07-26 18:15:00+09:00 1 02:40:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-29 696.12 2019-07-29 17:55:00+09:00 2019-07-30 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
2900 1901.41 1 days 20:30:00 AMS SVO 2019-07-10 203.07 2019-07-10 21:15:00+02:00 2019-07-11 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-23 171.64 2019-07-23 10:00:00+08:00 2019-07-23 12:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai SHA NRT 2019-07-26 394.07 2019-07-26 14:35:00+08:00 2019-07-26 18:15:00+09:00 1 02:40:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-29 696.12 2019-07-29 17:55:00+09:00 2019-07-30 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
3000 1901.41 1 days 20:30:00 AMS SVO 2019-07-10 203.07 2019-07-10 21:15:00+02:00 2019-07-11 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-23 171.64 2019-07-23 10:00:00+08:00 2019-07-23 12:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-26 394.07 2019-07-26 14:35:00+08:00 2019-07-26 18:15:00+09:00 1 02:40:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-29 696.12 2019-07-29 17:55:00+09:00 2019-07-30 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
3100 1901.41 1 days 20:30:00 AMS SVO 2019-07-10 203.07 2019-07-10 21:15:00+02:00 2019-07-11 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-13 227.40 2019-07-13 18:30:00+03:00 2019-07-14 05:15:00+08:00 1 05:45:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-20 209.11 2019-07-20 01:50:00+08:00 2019-07-20 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-23 171.64 2019-07-23 17:00:00+08:00 2019-07-23 19:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai SHA NRT 2019-07-26 394.07 2019-07-26 14:35:00+08:00 2019-07-26 18:15:00+09:00 1 02:40:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-29 696.12 2019-07-29 17:55:00+09:00 2019-07-30 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam

So now it’s easy to get the cheapest flights by calling the function for all possible itineraries:

itinerary_df = reduce(pd.DataFrame.append, (find_flights(prices_with_city_df, itinerary_route, 10)
                                           for itinerary_route in itinerary))

itinerary_df \
    .sort_values(['total_price', 'total_flights_duration']) \
    .head(10)
total_price total_flights_duration Amsterdam_origin Amsterdam_destination Amsterdam_date Amsterdam_price Amsterdam_departure_at Amsterdam_arrival_at Amsterdam_segments Amsterdam_duration Amsterdam_origin_city Amsterdam_destination_city Amsterdam_route Moscow_origin Moscow_destination Moscow_date Moscow_price Moscow_departure_at Moscow_arrival_at Moscow_segments Moscow_duration Moscow_origin_city Moscow_destination_city Moscow_route Irkutsk_origin Irkutsk_destination Irkutsk_date Irkutsk_price Irkutsk_departure_at Irkutsk_arrival_at Irkutsk_segments Irkutsk_duration Irkutsk_origin_city Irkutsk_destination_city Irkutsk_route Beijing_origin Beijing_destination Beijing_date Beijing_price Beijing_departure_at Beijing_arrival_at Beijing_segments Beijing_duration Beijing_origin_city Beijing_destination_city Beijing_route Shanghai_origin Shanghai_destination Shanghai_date Shanghai_price Shanghai_departure_at Shanghai_arrival_at Shanghai_segments Shanghai_duration Shanghai_origin_city Shanghai_destination_city Shanghai_route Tokyo_origin Tokyo_destination Tokyo_date Tokyo_price Tokyo_departure_at Tokyo_arrival_at Tokyo_segments Tokyo_duration Tokyo_origin_city Tokyo_destination_city Tokyo_route
10 1817.04 1 days 19:50:00 AMS SVO 2019-07-11 203.07 2019-07-11 21:15:00+02:00 2019-07-12 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 21:50:00+08:00 2019-07-25 23:55:00+08:00 1 02:05:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
40 1817.04 1 days 19:50:00 AMS SVO 2019-07-11 203.07 2019-07-11 21:15:00+02:00 2019-07-12 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 21:50:00+08:00 2019-07-25 23:55:00+08:00 1 02:05:00 Beijing Shanghai Beijing ✈️ Shanghai SHA NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
0 1817.04 1 days 20:00:00 AMS SVO 2019-07-10 203.07 2019-07-10 21:15:00+02:00 2019-07-11 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 13:00:00+08:00 2019-07-25 15:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
70 1817.04 1 days 20:00:00 AMS SVO 2019-07-10 203.07 2019-07-10 11:50:00+02:00 2019-07-10 16:05:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 18:00:00+08:00 2019-07-25 20:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
60 1817.04 1 days 20:00:00 AMS SVO 2019-07-10 203.07 2019-07-10 11:50:00+02:00 2019-07-10 16:05:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 13:00:00+08:00 2019-07-25 15:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
0 1817.04 1 days 20:00:00 AMS SVO 2019-07-11 203.07 2019-07-11 21:15:00+02:00 2019-07-12 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 13:00:00+08:00 2019-07-25 15:15:00+08:00 1 02:15:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
10 1817.04 1 days 20:05:00 AMS SVO 2019-07-10 203.07 2019-07-10 11:50:00+02:00 2019-07-10 16:05:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 12:00:00+08:00 2019-07-25 14:20:00+08:00 1 02:20:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
40 1817.04 1 days 20:05:00 AMS SVO 2019-07-10 203.07 2019-07-10 11:50:00+02:00 2019-07-10 16:05:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 12:00:00+08:00 2019-07-25 14:20:00+08:00 1 02:20:00 Beijing Shanghai Beijing ✈️ Shanghai SHA NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
70 1817.04 1 days 20:05:00 AMS SVO 2019-07-11 203.07 2019-07-11 21:15:00+02:00 2019-07-12 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 18:30:00+08:00 2019-07-25 20:50:00+08:00 1 02:20:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam
60 1817.04 1 days 20:05:00 AMS SVO 2019-07-11 203.07 2019-07-11 21:15:00+02:00 2019-07-12 01:30:00+03:00 1 03:15:00 Amsterdam Moscow Amsterdam ✈️ Moscow DME IKT 2019-07-15 198.03 2019-07-15 18:35:00+03:00 2019-07-16 05:05:00+08:00 1 05:30:00 Moscow Irkutsk Moscow ✈️ Irkutsk IKT PEK 2019-07-22 154.11 2019-07-22 01:50:00+08:00 2019-07-22 04:40:00+08:00 1 02:50:00 Irkutsk Beijing Irkutsk ✈️ Beijing PEK SHA 2019-07-25 171.64 2019-07-25 11:00:00+08:00 2019-07-25 13:20:00+08:00 1 02:20:00 Beijing Shanghai Beijing ✈️ Shanghai PVG NRT 2019-07-28 394.07 2019-07-28 17:15:00+08:00 2019-07-28 20:40:00+09:00 1 02:25:00 Shanghai Tokyo Shanghai ✈️ Tokyo NRT AMS 2019-07-31 696.12 2019-07-31 17:55:00+09:00 2019-08-01 14:40:00+02:00 2 1 days 03:45:00 Tokyo Amsterdam Tokyo ✈️ Amsterdam

Conclusion

I was able to find the cheapest flights with the minimal duration, and the resulting prices were almost the same as on Google Flights.

As a side-note I’ll probably reconsider my trip.


Gene Kim, Kevin Behr, George Spafford: The Phoenix Project



book cover white

Not so long ago I was recommended to read The Phoenix Project by Gene Kim, Kevin Behr and George Spafford, and I have a bit of mixed feelings about the book.

The first part of the book is entertaining, as it shows how everything is bad and becoming even worse. It sounds like something from real life and reminds me of some work situations.

The second part is still interesting, as it describes how they’re solving their problems in an iterative way, without going overboard.

The last part is kind of meh: too cheerful and happy, as if DevOps is going to solve all possible problems and everything magically works.

Sarah Guido, Andreas C. Müller: Introduction to Machine Learning with Python



book cover white

Recently I've started to play data scientist a bit more and found that I don't know the machine learning basics all that well, so I've decided to read Introduction to Machine Learning with Python by Sarah Guido and Andreas C. Müller. It's a nice book: it explains fundamental concepts without going too deep and without requiring much of a math background. The book has a lot of simple synthetic and kind-of-real-life examples and covers a few popular use cases.

Although in some chapters it felt like reading a Jupyter notebook, I've enjoyed the example with "this is how we get ants".

Extracting popular topics from subreddits



Continuing to play with Reddit data, I thought that it might be fun to extract discussed topics from subreddits. My idea was: get comments from a subreddit, extract ngrams, calculate counts of ngrams, normalize the counts, and subtract from them the normalized counts of ngrams from a neutral set of comments.

Small-scale

To prove the idea on a smaller scale, I've fetched titles, texts, and the first three levels of comments from the top 1000 r/all posts (full code is available in a gist), as it should have a lot of texts from different subreddits:

get_subreddit_df('all').head()
id subreddit post_id kind text created score
0 7mjw12_title all 7mjw12 title My cab driver tonight was so excited to share ... 1.514459e+09 307861
1 7mjw12_selftext all 7mjw12 selftext 1.514459e+09 307861
2 7mjw12_comment_druihai all 7mjw12 comment I want to make good humored inappropriate joke... 1.514460e+09 18336
3 7mjw12_comment_drulrp0 all 7mjw12 comment Me too! It came out of nowhere- he was pretty ... 1.514464e+09 8853
4 7mjw12_comment_druluji all 7mjw12 comment Well, you got him to the top of Reddit, litera... 1.514464e+09 4749

Then I lemmatized the texts, extracted all 1-3 word ngrams, and counted them:

df = get_tokens_df(subreddit)  # Full code is in gist
df.head()
token amount
0 cab 84
1 driver 1165
2 tonight 360
3 excited 245
4 share 1793

Then I’ve normalized counts:

df['amount_norm'] = (df.amount - df.amount.mean()) / (df.amount.max() - df.amount.min())
df.head()
token amount amount_norm
0 automate 493 0.043316
1 boring 108 0.009353
2 stuff 1158 0.101979
3 python 11177 0.985800
4 tinder 29 0.002384
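
The diffing itself is done by a diff_tokens helper from the gist; a minimal sketch of what it presumably does, judging by the columns in the output below (an assumption):

def diff_tokens(subreddit_df, neutral_df):
    # Compare a subreddit's normalized ngram counts with the "neutral" ones
    merged_df = subreddit_df.merge(neutral_df, on='token', how='left',
                                   suffixes=('', '_neutral')).fillna(0)
    return merged_df \
        .assign(amount_diff=merged_df.amount - merged_df.amount_neutral,
                amount_norm_diff=merged_df.amount_norm - merged_df.amount_norm_neutral) \
        [['token', 'amount_diff', 'amount_norm_diff']] \
        .sort_values('amount_norm_diff', ascending=False)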

And as the last step, I've calculated the diff and got the top 5 ngrams against texts from the top 1000 posts of a few random subreddits. It seems to be working for r/linux:

diff_tokens(tokens_dfs['linux'], tokens_dfs['all']).head()
token amount_diff amount_norm_diff
5807 kde 3060.0 1.082134
2543 debian 1820.0 1.048817
48794 coc 1058.0 1.028343
9962 systemd 925.0 1.024769
11588 gentoo 878.0 1.023506

Also looks ok on r/personalfinance:

diff_tokens(tokens_dfs['personalfinance'], tokens_dfs['all']).head()
token amount_diff amount_norm_diff
78063 vanguard 1513.0 1.017727
18396 etf 1035.0 1.012113
119206 checking account 732.0 1.008555
60873 direct deposit 690.0 1.008061
200917 joint account 679.0 1.007932

And kind of funny with r/drunk:

diff_tokens(tokens_dfs['drunk'], tokens_dfs['all']).head()
token amount_diff amount_norm_diff
515158 honk honk honk 144.0 1.019149
41088 pbr 130.0 1.017247
49701 mo dog 129.0 1.017112
93763 cheap beer 74.0 1.009641
124756 birthday dude 61.0 1.007875

Seems to be working on this scale.

A bit larger scale

As the next iteration, I've decided to try the idea on three months of comments, which I was able to download as dumps from pushshift.io.

Shaping the data

And it’s kind of a lot of data, even compressed:

$ du -sh raw_data/*
11G	raw_data/RC_2018-08.xz
10G	raw_data/RC_2018-09.xz
11G	raw_data/RC_2018-10.xz

Pandas basically doesn't work on that scale, and unfortunately, I don't have a personal Hadoop cluster. So I've reinvented the wheel a bit:

Reddit comments → Reddit comments with ngrams → Ngrams partitioned by subreddit and day → Counted partitioned ngrams

The raw data is stored in line-delimited JSON, like:

$ xzcat raw_data/RC_2018-10.xz | head -n 2
{"archived":false,"author":"TistedLogic","author_created_utc":1312615878,"author_flair_background_color":null,"author_flair_css_class":null,"author_flair_richtext":[],"author_flair_template_id":null,"author_flair_text":null,"author_flair_text_color":null,"author_flair_type":"text","author_fullname":"t2_5mk6v","author_patreon_flair":false,"body":"Is it still r\/BoneAppleTea worthy if it's the opposite?","can_gild":true,"can_mod_post":false,"collapsed":false,"collapsed_reason":null,"controversiality":0,"created_utc":1538352000,"distinguished":null,"edited":false,"gilded":0,"gildings":{"gid_1":0,"gid_2":0,"gid_3":0},"id":"e6xucdd","is_submitter":false,"link_id":"t3_9ka1hp","no_follow":true,"parent_id":"t1_e6xu13x","permalink":"\/r\/Unexpected\/comments\/9ka1hp\/jesus_fking_woah\/e6xucdd\/","removal_reason":null,"retrieved_on":1539714091,"score":2,"send_replies":true,"stickied":false,"subreddit":"Unexpected","subreddit_id":"t5_2w67q","subreddit_name_prefixed":"r\/Unexpected","subreddit_type":"public"}
{"archived":false,"author":"misssaladfingers","author_created_utc":1536864574,"author_flair_background_color":null,"author_flair_css_class":null,"author_flair_richtext":[],"author_flair_template_id":null,"author_flair_text":null,"author_flair_text_color":null,"author_flair_type":"text","author_fullname":"t2_27d914lh","author_patreon_flair":false,"body":"I've tried and it's hit and miss. When it's good I feel more rested even though I've not slept well but sometimes it doesn't work","can_gild":true,"can_mod_post":false,"collapsed":false,"collapsed_reason":null,"controversiality":0,"created_utc":1538352000,"distinguished":null,"edited":false,"gilded":0,"gildings":{"gid_1":0,"gid_2":0,"gid_3":0},"id":"e6xucde","is_submitter":false,"link_id":"t3_9k8bp4","no_follow":true,"parent_id":"t1_e6xu9sk","permalink":"\/r\/insomnia\/comments\/9k8bp4\/melatonin\/e6xucde\/","removal_reason":null,"retrieved_on":1539714091,"score":1,"send_replies":true,"stickied":false,"subreddit":"insomnia","subreddit_id":"t5_2qh3g","subreddit_name_prefixed":"r\/insomnia","subreddit_type":"public"}

The first script, add_ngrams.py, reads lines of raw data from stdin, adds 1-3 word lemmatized ngrams, and writes JSON lines to stdout. As the amount of data is huge, I've gzipped the output. It took around an hour to process a month's worth of comments on a 12-CPU machine; spawning more processes didn't help, as the whole thing is quite CPU intensive.
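
The script itself is in the gist; a minimal sketch of the idea, assuming spaCy for lemmatization (the actual script may differ):

import json
import sys

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


def to_ngrams(tokens, max_n=3):
    # All 1-3 word ngrams from the lemmatized tokens
    return [' '.join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


for line in sys.stdin:
    comment = json.loads(line)
    lemmas = [token.lemma_.lower() for token in nlp(comment['body'])
              if not token.is_stop and not token.is_punct and not token.is_space]
    comment['ngrams'] = to_ngrams(lemmas)
    sys.stdout.write(json.dumps(comment) + '\n')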

$ xzcat raw_data/RC_2018-10.xz | python3.7 add_ngrams.py | gzip > with_ngrams/2018-10.gz
$ zcat with_ngrams/2018-10.gz | head -n 2
{"archived": false, "author": "TistedLogic", "author_created_utc": 1312615878, "author_flair_background_color": null, "author_flair_css_class": null, "author_flair_richtext": [], "author_flair_template_id": null, "author_flair_text": null, "author_flair_text_color": null, "author_flair_type": "text", "author_fullname": "t2_5mk6v", "author_patreon_flair": false, "body": "Is it still r/BoneAppleTea worthy if it's the opposite?", "can_gild": true, "can_mod_post": false, "collapsed": false, "collapsed_reason": null, "controversiality": 0, "created_utc": 1538352000, "distinguished": null, "edited": false, "gilded": 0, "gildings": {"gid_1": 0, "gid_2": 0, "gid_3": 0}, "id": "e6xucdd", "is_submitter": false, "link_id": "t3_9ka1hp", "no_follow": true, "parent_id": "t1_e6xu13x", "permalink": "/r/Unexpected/comments/9ka1hp/jesus_fking_woah/e6xucdd/", "removal_reason": null, "retrieved_on": 1539714091, "score": 2, "send_replies": true, "stickied": false, "subreddit": "Unexpected", "subreddit_id": "t5_2w67q", "subreddit_name_prefixed": "r/Unexpected", "subreddit_type": "public", "ngrams": ["still", "r/boneappletea", "worthy", "'s", "opposite", "still r/boneappletea", "r/boneappletea worthy", "worthy 's", "'s opposite", "still r/boneappletea worthy", "r/boneappletea worthy 's", "worthy 's opposite"]}
{"archived": false, "author": "1-2-3RightMeow", "author_created_utc": 1515801270, "author_flair_background_color": null, "author_flair_css_class": null, "author_flair_richtext": [], "author_flair_template_id": null, "author_flair_text": null, "author_flair_text_color": null, "author_flair_type": "text", "author_fullname": "t2_rrwodxc", "author_patreon_flair": false, "body": "Nice! I\u2019m going out for dinner with him right and I\u2019ll check when I get home. I\u2019m very interested to read that", "can_gild": true, "can_mod_post": false, "collapsed": false, "collapsed_reason": null, "controversiality": 0, "created_utc": 1538352000, "distinguished": null, "edited": false, "gilded": 0, "gildings": {"gid_1": 0, "gid_2": 0, "gid_3": 0}, "id": "e6xucdp", "is_submitter": true, "link_id": "t3_9k9x6m", "no_follow": false, "parent_id": "t1_e6xsm3n", "permalink": "/r/Glitch_in_the_Matrix/comments/9k9x6m/my_boyfriend_and_i_lost_10_hours/e6xucdp/", "removal_reason": null, "retrieved_on": 1539714092, "score": 42, "send_replies": true, "stickied": false, "subreddit": "Glitch_in_the_Matrix", "subreddit_id": "t5_2tcwa", "subreddit_name_prefixed": "r/Glitch_in_the_Matrix", "subreddit_type": "public", "ngrams": ["nice", "go", "dinner", "right", "check", "get", "home", "interested", "read", "nice go", "go dinner", "dinner right", "right check", "check get", "get home", "home interested", "interested read", "nice go dinner", "go dinner right", "dinner right check", "right check get", "check get home", "get home interested", "home interested read"]}

The next script partition.py reads stdin and writes files like 2018-10-10_AskReddit with just ngrams to a folder passed as an argument.

$ zcat with_ngrams/2018-10.gz | python3.7 partition.py partitions
$ cat partitions/2018-10-10_AskReddit | head -n 5
"gt"
"money"
"go"
"administration"
"building"

For three months of comments it created a lot of files:

$ ls partitions | wc -l
2715472

After that I’ve counted ngrams in partitions with group_count.py:

$ python3.7 group_count.py partitions counted
$ cat counted/2018-10-10_AskReddit | head -n 5
["gt", 7010]
["money", 3648]
["go", 25812]
["administration", 108]
["building", 573]

As r/all isn't a real subreddit and it's not possible to get it from the dump, I've chosen r/AskReddit as the source of "neutral" ngrams. For that I've calculated the aggregated counts of ngrams with aggreage_whole.py:

$ python3.7 aggreage_whole.py AskReddit > aggregated/askreddit_whole.json
$ cat aggregated/askreddit_whole.json | head -n 5
[["trick", 26691], ["people", 1638951], ["take", 844834], ["zammy", 10], ["wine", 17315], ["trick people", 515], ["people take", 10336], ["take zammy", 2], ["zammy wine", 2], ["trick people take", 4], ["people take zammy", 2]...

Playing with the data

First of all, I've read the "neutral" ngrams, removed ngrams that appeared fewer than 100 times (as otherwise it didn't fit in RAM), and calculated the normalized count:

whole_askreddit_df = pd.read_json('aggregated/askreddit_whole.json', orient='values')
whole_askreddit_df = whole_askreddit_df.rename(columns={0: 'ngram', 1: 'amount'})
whole_askreddit_df = whole_askreddit_df[whole_askreddit_df.amount > 99]
whole_askreddit_df['amount_norm'] = norm(whole_askreddit_df.amount)
ngram amount amount_norm
0 trick 26691 0.008026
1 people 1638951 0.492943
2 take 844834 0.254098
4 wine 17315 0.005206
5 trick people 515 0.000153
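
The norm helper comes from the gist; presumably it's the same normalization as in the small-scale experiment above (an assumption):

def norm(series):
    # Same mean/range normalization as before
    return (series - series.mean()) / (series.max() - series.min())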

To be sure that the idea is still valid, I’ve randomly checked r/television for 10th October:

television_10_10_df = pd \
    .read_json('counted/2018-10-10_television', lines=True) \
    .rename(columns={0: 'ngram', 1: 'amount'})
television_10_10_df['amount_norm'] = norm(television_10_10_df.amount)
television_10_10_df = television_10_10_df.merge(whole_askreddit_df, how='left', on='ngram', suffixes=('_left', '_right'))
television_10_10_df['diff'] = television_10_10_df.amount_norm_left - television_10_10_df.amount_norm_right
television_10_10_df \
    .sort_values('diff', ascending=False) \
    .head()
ngram amount_left amount_norm_left amount_right amount_norm_right diff
13 show 1299 0.699950 319715.0 0.096158 0.603792
32 season 963 0.518525 65229.0 0.019617 0.498908
19 character 514 0.276084 101931.0 0.030656 0.245428
4 episode 408 0.218849 81729.0 0.024580 0.194269
35 watch 534 0.286883 320204.0 0.096306 0.190578

And just for fun, limiting to trigrams:

television_10_10_df\
    [television_10_10_df.ngram.str.count(' ') >= 2] \
    .sort_values('diff', ascending=False) \
    .head()
ngram amount_left amount_norm_left amount_right amount_norm_right diff
11615 better call saul 15 0.006646 1033.0 0.000309 0.006337
36287 would make sense 11 0.004486 2098.0 0.000629 0.003857
7242 ca n't wait 12 0.005026 5396.0 0.001621 0.003405
86021 innocent proven guilty 9 0.003406 1106.0 0.000331 0.003075
151 watch first episode 8 0.002866 463.0 0.000137 0.002728

Seems to be ok. As the next step, I've decided to get the top 50 discussed topics for every available day:

r_television_by_day = diff_n_by_day(  # in the gist
    50, whole_askreddit_df, 'television', '2018-08-01', '2018-10-31',
    exclude=['r/television'],
)
r_television_by_day[r_television_by_day.date == "2018-10-05"].head()
ngram amount_left amount_norm_left amount_right amount_norm_right diff date
3 show 906 0.725002 319715.0 0.096158 0.628844 2018-10-05
8 season 549 0.438485 65229.0 0.019617 0.418868 2018-10-05
249 character 334 0.265933 101931.0 0.030656 0.235277 2018-10-05
1635 episode 322 0.256302 81729.0 0.024580 0.231723 2018-10-05
418 watch 402 0.320508 320204.0 0.096306 0.224202 2018-10-05
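
diff_n_by_day also lives in the gist; a rough sketch of what it might do, mirroring the r/television example above (an assumption):

import os


def diff_n_by_day(n, neutral_df, subreddit, date_from, date_to, exclude=()):
    # Per-day top-n ngrams of a subreddit compared with the "neutral" counts
    day_dfs = []
    for date in pd.date_range(date_from, date_to):
        path = f"counted/{date.strftime('%Y-%m-%d')}_{subreddit}"
        if not os.path.exists(path):
            continue

        day_df = pd.read_json(path, lines=True) \
            .rename(columns={0: 'ngram', 1: 'amount'})
        day_df = day_df[~day_df.ngram.isin(exclude)]
        day_df['amount_norm'] = norm(day_df.amount)
        day_df = day_df.merge(neutral_df, how='left', on='ngram',
                              suffixes=('_left', '_right'))
        day_df['diff'] = day_df.amount_norm_left - day_df.amount_norm_right
        day_df['date'] = date
        day_dfs.append(day_df.sort_values('diff', ascending=False).head(n))

    return pd.concat(day_dfs)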

Then I thought that it might be fun to get overall top topics from daily top topics and make a weekly heatmap with seaborn:

r_television_by_day_top_topics = r_television_by_day \
    .groupby('ngram') \
    .sum()['diff'] \
    .reset_index() \
    .sort_values('diff', ascending=False)

r_television_by_day_top_topics.head()
ngram diff
916 show 57.649622
887 season 37.241199
352 episode 22.752369
1077 watch 21.202295
207 character 15.599798
r_television_only_top_df = r_television_by_day \
    [['date', 'ngram', 'diff']] \
    [r_television_by_day.ngram.isin(r_television_by_day_top_topics.ngram.head(10))] \
    .groupby([pd.Grouper(key='date', freq='W-MON'), 'ngram']) \
    .mean() \
    .reset_index() \
    .sort_values('date')

pivot = r_television_only_top_df \
    .pivot(index='ngram', columns='date', values='diff') \
    .fillna(-1)

sns.heatmap(pivot, xticklabels=r_television_only_top_df.date.dt.week.unique())

r/television by week

And it was quite boring, so I've decided to try a weekday heatmap, but it wasn't any better, as the topics were the same:

weekday_heatmap(r_television_by_day, 'r/television weekday')  # in the gist

r/television by weekday

Heatmaps for r/programming are also boring:

r_programming_by_day = diff_n_by_day(  # in the gist
    50, whole_askreddit_df, 'programming', '2018-08-01', '2018-10-31',
    exclude=['gt', 'use', 'write'],  # selected manually
)
weekly_heatmap(r_programming_by_day, 'r/programming')

r/programming

Although the heatmap by weekday is a bit different:

weekday_heatmap(r_programming_by_day, 'r/programming by weekday')

r/programming by weekday

Another popular subreddit – r/sports:

r_sports_by_day = diff_n_by_day(
    50, whole_askreddit_df, 'sports', '2018-08-01', '2018-10-31',
    exclude=['r/sports'],
)
weekly_heatmap(r_sports_by_day, 'r/sports')

r/sports

weekday_heatmap(r_sports_by_day, 'r/sports by weekday')

r/sports weekday

As the last subreddit for giggles – r/drunk:

r_drunk_by_day = diff_n_by_day(50, whole_askreddit_df, 'drunk', '2018-08-01', '2018-10-31')
weekly_heatmap(r_drunk_by_day, 'r/drunk')

r/drunk

weekday_heatmap(r_drunk_by_day, "r/drunk by weekday")

r/drunk weekday

Conclusion

The idea kind of works for generic topics of subreddits, but can’t be used for finding trends.

Gist with everything.

Larry Gonick, Woollcott Smith: The Cartoon Guide to Statistics



book cover

Recently I've noticed that I'm lacking some basics in statistics and was recommended to read The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith. The format of the book is a bit silly, but it covers a lot of topics and explains things in an easy-to-understand way. The book has a lot of images and somewhat related stories for the topics it explains.

Although I’m not a big fan of the book format, the book seems to be nice.

Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne: The Site Reliability Workbook



book cover white

More than two years ago I read the SRE Book, and now I've finally found time to read The Site Reliability Workbook by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne. This book is more interesting, as it's less idealistic and contains a lot of cases from real life. The book has examples of correct and wrong SLOs, explains how to properly implement alerts based on your error budget, and even covers the human part of SRE a bit.

Overall, the SRE Workbook is one of the best books I've read recently, but that might be because I've been doing related things at work for the last few weeks.

Measuring community opinion: subreddits reactions to a link



As everyone knows, a lot of subreddits are opinionated, so I thought it might be interesting to measure the opinions of different subreddits. Not trying to start a holy war, I've specifically decided to ignore r/worldnews and similar subreddits, and chose a semi-random topic – "Apu reportedly being written out of The Simpsons".

For accessing the Reddit API I've decided to use praw, because it already implements all the OAuth-related stuff and is almost the same as the REST API.
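
The reddit client used below is a regular praw instance; setting it up looks roughly like this (the client id and secret come from a registered Reddit app and are placeholders here):

import praw

reddit = praw.Reddit(client_id='...',
                     client_secret='...',
                     user_agent='measuring community opinion')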

As a first step I've found all posts with that URL and populated a pandas DataFrame:

[*posts] = reddit.subreddit('all').search(f"url:{url}", limit=1000)

posts_df = pd.DataFrame(
    [(post.id, post.subreddit.display_name, post.title, post.score,
      datetime.utcfromtimestamp(post.created_utc), post.url,
      post.num_comments, post.upvote_ratio)
     for post in posts],
    columns=['id', 'subreddit', 'title', 'score', 'created',
             'url', 'num_comments', 'upvote_ratio'])

posts_df.head()
       id         subreddit                                                                            title  score             created                                                                              url  num_comments  upvote_ratio
0  9rmz0o        television                                            Apu to be written out of The Simpsons   1455 2018-10-26 17:49:00  https://www.indiewire.com/2018/10/simpsons-drop-apu-character-adi-shankar-12...          1802          0.88
1  9rnu73        GamerGhazi                                 Apu reportedly being written out of The Simpsons     73 2018-10-26 19:30:39  https://www.indiewire.com/2018/10/simpsons-drop-apu-character-adi-shankar-12...            95          0.83
2  9roen1  worstepisodeever                                                     The Simpsons Writing Out Apu     14 2018-10-26 20:38:21  https://www.indiewire.com/2018/10/simpsons-drop-apu-character-adi-shankar-12...            22          0.94
3  9rq7ov          ABCDesis  The Simpsons Is Eliminating Apu, But Producer Adi Shankar Found the Perfec...     26 2018-10-27 00:40:28  https://www.indiewire.com/2018/10/simpsons-drop-apu-character-adi-shankar-12...            11          0.84
4  9rnd6y         doughboys                                            Apu to be written out of The Simpsons     24 2018-10-26 18:34:58  https://www.indiewire.com/2018/10/simpsons-drop-apu-character-adi-shankar-12...             9          0.87

The easiest metric for opinion is upvote ratio:

posts_df[['subreddit', 'upvote_ratio']] \
    .groupby('subreddit') \
    .mean()['upvote_ratio'] \
    .reset_index() \
    .plot(kind='barh', x='subreddit', y='upvote_ratio',
          title='Upvote ratio', legend=False) \
    .xaxis \
    .set_major_formatter(FuncFormatter(lambda x, _: '{:.1f}%'.format(x * 100)))

But it doesn't tell us anything:

Upvote ratio

The most straightforward metric to measure is score:

posts_df[['subreddit', 'score']] \
    .groupby('subreddit') \
    .sum()['score'] \
    .reset_index() \
    .plot(kind='barh', x='subreddit', y='score', title='Score', legend=False)

Score by subreddit

A second obvious metric is the number of comments:

posts_df[['subreddit', 'num_comments']] \
    .groupby('subreddit') \
    .sum()['num_comments'] \
    .reset_index() \
    .plot(kind='barh', x='subreddit', y='num_comments',
          title='Number of comments', legend=False)

Number of comments

As absolute numbers can't tell us anything about the opinion of a subreddit, I've decided to calculate the normalized score and number of comments using data from the last 1000 posts of each subreddit:

def normalize(post):
    # Fetch the last 1000 posts of the post's subreddit as a baseline
    [*subreddit_posts] = reddit.subreddit(post.subreddit.display_name).new(limit=1000)
    subreddit_posts_df = pd.DataFrame([(post.id, post.score, post.num_comments)
                                       for post in subreddit_posts],
                                      columns=('id', 'score', 'num_comments'))

    # Center on the subreddit mean and scale by the subreddit's range
    norm_score = ((post.score - subreddit_posts_df.score.mean())
                  / (subreddit_posts_df.score.max() - subreddit_posts_df.score.min()))
    norm_num_comments = ((post.num_comments - subreddit_posts_df.num_comments.mean())
                         / (subreddit_posts_df.num_comments.max() - subreddit_posts_df.num_comments.min()))

    return norm_score, norm_num_comments

normalized_vals = pd \
    .DataFrame([normalize(post) for post in posts],
               columns=['norm_score', 'norm_num_comments']) \
    .fillna(0)

posts_df[['norm_score', 'norm_num_comments']] = normalized_vals

And looked at the popularity of the link based on those numbers:

posts_df[['subreddit', 'norm_score', 'norm_num_comments']] \
    .groupby('subreddit') \
    .sum()[['norm_score', 'norm_num_comments']] \
    .reset_index() \
    .rename(columns={'norm_score': 'Normalized score',
                     'norm_num_comments': 'Normalized number of comments'}) \
    .plot(kind='barh', x='subreddit', title='Normalized popularity')

Normalized popularity

As a link can be shared in different subreddits under different titles with totally different sentiments, it seemed interesting to do sentiment analysis on the titles:

sid = SentimentIntensityAnalyzer()

posts_sentiments = posts_df.title.apply(sid.polarity_scores).apply(pd.Series)
posts_df = posts_df.assign(title_neg=posts_sentiments.neg,
                           title_neu=posts_sentiments.neu,
                           title_pos=posts_sentiments.pos,
                           title_compound=posts_sentiments['compound'])

And noticed that people are using the same title almost every time:

posts_df[['subreddit', 'title_neg', 'title_neu', 'title_pos', 'title_compound']] \
    .groupby('subreddit') \
    .sum()[['title_neg', 'title_neu', 'title_pos', 'title_compound']] \
    .reset_index() \
    .rename(columns={'title_neg': 'Title negativity',
                     'title_pos': 'Title positivity',
                     'title_neu': 'Title neutrality',
                     'title_compound': 'Title sentiment'}) \
    .plot(kind='barh', x='subreddit', title='Title sentiments', legend=True)

Title sentiments

Title sentiment isn’t that interesting, but it might be much more interesting for comments. I’ve decided to handle only root comments, as replies to comments might be totally unrelated to the post subject, and they make everything more complicated. For the comments analysis I’ve bucketed them into five buckets by compound value, and calculated the mean normalized score and the percentage of comments in each bucket:

# handle_post_comments is huge and available in the gist
posts_comments_df = pd \
    .concat([handle_post_comments(post) for post in posts]) \
    .fillna(0)
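
Very roughly, handle_post_comments might look something like the sketch below; the bucket thresholds, the normalization and the reduced set of columns are my assumptions, the real implementation is in the gist:

def handle_post_comments(post):
    # Keep only root comments; drop the "load more comments" stubs
    post.comments.replace_more(limit=0)
    comments_df = pd.DataFrame(
        [(comment.score, sid.polarity_scores(comment.body)['compound'])
         for comment in post.comments],
        columns=['score', 'compound'])

    # Normalize comment scores within the post: center on the mean, scale by the range
    score_range = comments_df.score.max() - comments_df.score.min()
    comments_df['norm_score'] = (comments_df.score - comments_df.score.mean()) / score_range

    # Five sentiment buckets by compound value (thresholds are assumed)
    bucket_names = ['neg_neg', 'neg_neu', 'neu_neu', 'pos_neu', 'pos_pos']
    buckets = pd.cut(comments_df.compound,
                     bins=[-1, -0.6, -0.2, 0.2, 0.6, 1],
                     labels=bucket_names, include_lowest=True)

    row = {'root_comments_post_id': post.id}
    total = len(comments_df)
    for bucket in bucket_names:
        in_bucket = comments_df[buckets == bucket]
        row[f'root_comments_{bucket}_amount'] = len(in_bucket)
        row[f'root_comments_{bucket}_norm_score'] = in_bucket.norm_score.mean()
        row[f'root_comments_{bucket}_percent'] = len(in_bucket) / total if total else 0

    return pd.DataFrame([row])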

posts_comments_df.head()
      key  root_comments_key  root_comments_neg_neg_amount  root_comments_neg_neg_norm_score  root_comments_neg_neg_percent  root_comments_neg_neu_amount  root_comments_neg_neu_norm_score  root_comments_neg_neu_percent  root_comments_neu_neu_amount  root_comments_neu_neu_norm_score  root_comments_neu_neu_percent  root_comments_pos_neu_amount  root_comments_pos_neu_norm_score  root_comments_pos_neu_percent  root_comments_pos_pos_amount  root_comments_pos_pos_norm_score  root_comments_pos_pos_percent root_comments_post_id
0  9rmz0o                  0                          87.0                         -0.005139                       0.175758                          98.0                          0.019201                       0.197980                         141.0                         -0.007125                       0.284848                          90.0                         -0.010092                       0.181818                            79                          0.006054                       0.159596                9rmz0o
0  9rnu73                  0                          12.0                          0.048172                       0.134831                          15.0                         -0.061331                       0.168539                          35.0                         -0.010538                       0.393258                          13.0                         -0.015762                       0.146067                            14                          0.065402                       0.157303                9rnu73
0  9roen1                  0                           9.0                         -0.094921                       0.450000                           1.0                          0.025714                       0.050000                           5.0                          0.048571                       0.250000                           0.0                          0.000000                       0.000000                             5                          0.117143                       0.250000                9roen1
0  9rq7ov                  0                           1.0                          0.476471                       0.100000                           2.0                         -0.523529                       0.200000                           0.0                          0.000000                       0.000000                           1.0                         -0.229412                       0.100000                             6                          0.133333                       0.600000                9rq7ov
0  9rnd6y                  0                           0.0                          0.000000                       0.000000                           0.0                          0.000000                       0.000000                           0.0                          0.000000                       0.000000                           5.0                         -0.027778                       0.555556                             4                          0.034722                       0.444444                9rnd6y
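
The following snippets use posts_with_comments_df, which isn’t constructed above; presumably it’s posts_df joined with posts_comments_df on the post id, something like this (the join columns are my guess, the actual code is in the gist):

# Assumed join of posts with their per-bucket comment stats
posts_with_comments_df = posts_df.merge(
    posts_comments_df, left_on='id', right_on='root_comments_post_id')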

So now we can get the percentage of comments in each sentiment bucket:

percent_columns = ['root_comments_neg_neg_percent',
                   'root_comments_neg_neu_percent', 'root_comments_neu_neu_percent',
                   'root_comments_pos_neu_percent', 'root_comments_pos_pos_percent']

posts_with_comments_df[['subreddit'] + percent_columns] \
    .groupby('subreddit') \
    .mean()[percent_columns] \
    .reset_index() \
    .rename(columns={column: column[13:-7].replace('_', ' ')
                     for column in percent_columns}) \
    .plot(kind='bar', x='subreddit', legend=True,
          title='Percent of comments by sentiments buckets') \
    .yaxis \
    .set_major_formatter(FuncFormatter(lambda y, _: '{:.1f}%'.format(y * 100)))

It’s easy to spot that comments on less popular subreddits are more opinionated:

Comments sentiments

The same can be spotted with mean normalized scores:

norm_score_columns = ['root_comments_neg_neg_norm_score',
                      'root_comments_neg_neu_norm_score',
                      'root_comments_neu_neu_norm_score',
                      'root_comments_pos_neu_norm_score',
                      'root_comments_pos_pos_norm_score']

posts_with_comments_df[['subreddit'] + norm_score_columns] \
    .groupby('subreddit') \
    .mean()[norm_score_columns] \
    .reset_index() \
    .rename(columns={column: column[13:-10].replace('_', ' ')
                     for column in norm_score_columns}) \
    .plot(kind='bar', x='subreddit', legend=True,
          title='Mean normalized score of comments by sentiments buckets')

Comments normalized score

Although those plots are fun even with that link, it’s more fun with something more controversial. I’ve picked one of the recent posts from r/worldnews, and it’s easy to notice that different subreddits present the news in different ways:

Hot title sentiment

And comments are rated differently: some subreddits are more neutral, some definitely not:

Hot title sentiment

Gist with full source code.