Systems design for Advanced Beginners

You’ve started yet another company with your good friend, Steve Steveington. It’s an online marketplace where people can buy and sell things and where no one asks too many questions. It’s basically a rip-off of Craigslist, but with Steve’s name instead of Craig’s.

You’re going to be responsible for building the entire Steveslist technical platform, including all of its websites, mobile apps, databases, and other infrastructure. You’re excited, but also very nervous. You figure that you can probably cobble together a small website, since you’ve done that a few times before as part of your previous entertaining-if-morally-questionable escapades with the Stevester. But you have no idea how to even start building out all of the other infrastructure and tools that you assume lie behind large, successful online platforms.

You are in desperate need of a detailed yet concise overview of how real companies do this. How do they store their data? How do their different applications talk to each other? How do they scale their systems to work for millions of users? How do they keep them secure? How do they make sure nothing goes wrong? What are APIs, webhooks and client libraries, when you really get down to it?


You send a quick WhatsApp to your other good friend, Kate Kateberry, to see if she can help. You’ve worked together very effectively in the past, and she has decades of experience creating these types of systems at Silicon Valley’s biggest and most controversial companies.

She instantly accepts your job offer. You had actually only been ringing for some rough guidance and a good gossip, but you nonetheless instantly accept her acceptance. No point looking a gift horse in the mouth, even when you don’t have any money to pay her. Kate proposes that her first day be 5 weeks ago in order to help her smooth over some accounting irregularities. She can come into the office sometime next week. You feel encouraged and threatened by her eagerness.


Kate bounces into your offices in the 19th Century Literature section of the San Francisco Public Library. “OK let’s do this!” she shouts quietly. “What have we got so far? How are all our systems set up? What’s the plan?” You lean back in your chair and close your laptop, which was not turned on because you had left your charger at home. You steeple your fingers in a manner that you hope can be described as “thoughtful”.

“Let me flip that question around, Kate. What do you think the plan should be?”

Kate takes a deep breath and paints an extremely detailed vision of the Steveslist platform five years into the future and the infrastructure that will power it.


Read Kate’s in-depth vision in full on my blog. It’s long and detailed and covers topics from databases to webhooks, and by the end you’ll understand the infrastructure behind almost all modern online companies.

Read it now.

Rob

Programming Videos for Advanced Beginners

The onset of a global pandemic hasn't left me as much time as I would like for writing blog posts, but if you're looking for a great way to spend 2 or 3 hours of quarantine time then have a watch of Programming Videos for Advanced Beginners, a 9-part video series I published last year. We design and build the game "Battleships" in the command line, and learn a lot about writing clean, modular code along the way.

Stay safe.

PFAB #11: Separating logic and data

Feedback? Suggestions? Please let me know, I’d love to hear from you.

You can also read this post on my blog.

This week on Programming Feedback for Advanced Beginners we're going to start analyzing a brand new project. It's an interactive website called Eat/Play/Drink, created by PFAB reader Sophia Li. Eat/Play/Drink presents the user with a curated map of places to eat, play, and drink, and allows the user to filter the map by category.

As with almost every program that we analyze in PFAB, Sophia's program is already entirely functional and fit for purpose. You can even view it in action at https://sophiali.dev/eat-play-drink. The changes that we're going to suggest would only start to become particularly important if the codebase grew larger or was being worked on by multiple programmers. For a small, personal side-project the changes should therefore be considered somewhat optional, but they're still good opportunities for practice and learning.

Over the coming editions of PFAB we're going to look at how Sophia could restructure her code so as to make it more "defensive". This means anticipating the types of bugs that future programmers could accidentally introduce (Sophia-in-a-month counts as a future programmer!), and programming in a style that mitigates or prevents these bugs altogether.

But first I want to start our journey by discussing a design choice that Sophia made that I particularly like: enforcing the separation of logic and data.

Separating logic and data

Let's start with a quick tour of the code (you can also read it in full on GitHub).

Eat/Play/Drink is built using HTML and JavaScript. If you aren't familiar with these technologies then don't worry - the concepts that we're going to talk about are applicable to all other languages too. Roughly speaking, HTML is how you describe what a website looks like, and JavaScript is how you dynamically update it (for example, expanding a menu when a user clicks on a button).

HTML looks like this:

<!-- A simple website with an image and a button -->
<html>
  <body>
    <img id="main-pic" src="/images/photo-of-me.jpg" />
    <button id="btn-change-pic" value="Change the pic!" />
  </body>
</html>

JavaScript looks like this:

// When the user clicks on the button, change the image
$("#btn-change-pic").click(function() {
  $("#main-pic").attr("src", "/images/photo-of-you.jpg");
});

The HTML for Eat/Play/Drink lives in a file called index.html. We're not going to talk about this file much this week. The JavaScript lives in a file called app.js. It is responsible for creating Google Maps markers and "info boxes", and for showing and hiding these components whenever a user clicks on a filter button.

At the top of app.js is an array called markers. This array contains details about each place on the Eat/Play/Drink map, including a place's co-ordinates, category, and description:

const markers = [
  {
    coords: { lat: 37.7638, lng: -122.469 },
    content:
      "<h3>San Tung</h3><p>Chinese restaurant with delicious dry fried chicken wings.</p><a href='https://goo.gl/maps/wiqva8qzW9bBoJRJ6' target='_blank'>Get directions</a>",
    category: "eat"
  },
  // ...etc...
];

I really like this array. It encapsulates all of the website's data, keeping it completely isolated from the logic that is responsible for displaying it. This separation is valuable, for two reasons.

First, it's almost always good to keep different components of your code separate from each other, regardless of what those components are. As we've talked about at length in previous PFABs, code that is well-separated into discrete components will, as a natural, inevitable consequence, tend to be easier to understand and work with.

Second, separating logic and data allows us to add new places to the Eat/Play/Drink app without needing to go anywhere near the code that is responsible for displaying those new places. Instead, we can add our new places to the markers array using a simple, well-defined format. We can hit save and refresh the page, and the new markers and info windows are immediately available to be clicked on.

As well as improving our lives in the short-term, separating logic and data opens up extra possibilities for the future too.

Future benefits of separating data and logic

We can make data updates even simpler

With only a little extra work, we could make it even easier for anyone to update our app's code with new places, even if they aren't a programmer. Non-programmers could probably already add a new element to the markers array in a pinch, but it might be intimidating to require them to edit a full JavaScript file that also contains functions, loops, if-statements, and so on. To alleviate this, we could move our places data to an entirely separate file (perhaps called places.json). This file would contain nothing but data, formatted as JSON. JSON stands for JavaScript Object Notation, and is a simple format for representing data. For example:

{
  "places": [
    {
      "coords": { "lat": 37.7638, "lng": -122.469 },
      "content":
        "<h3>San Tung</h3><p>Chinese restaurant with delicious dry fried chicken wings.</p><a href='https://goo.gl/maps/wiqva8qzW9bBoJRJ6' target='_blank'>Get directions</a>",
      "category": "eat"
    }
    // ...etc...
  ]
}

JSON files only represent data, and don't contain any logic whatsoever. app.js would read and load the data from the JSON file, and then use it to perform all the presentation and other heavy lifting, just as it does today.

+-----------+    +------+    +----------+
|places.json+--->+app.js+--->+index.html|
+-----------+    +------+    +----------+
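To make this concrete, here's a minimal sketch of what that loading step in app.js might look like, assuming the browser's built-in fetch API and a hypothetical createMarker function that wraps the existing marker-building logic:

// Sketch only: load the data file, then hand each place to the
// existing presentation code. createMarker is a stand-in name for
// whatever function ends up owning the Google Maps marker logic.
fetch("places.json")
  .then(response => response.json())
  .then(data => {
    data.places.forEach(place => createMarker(place));
  });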

Changing the code in this way would further hide the complexity that turns raw data into maps and markers. Adding new places by appending a few lines to a simple config file should be much less daunting than editing a file of JavaScript.

Next, let's take this principle even further.

We can make E/P/D reusable in different cities

Eat/Play/Drink is a simple application that is deliberately and sensibly laser-focussed on things to do in San Francisco. But what if Sophia's cousin, who lives in Austin (assume that Sophia has a cousin and that this cousin lives in Austin), wanted to make an Austin version of Eat/Play/Drink? He could copy and paste our source code, but in order to localize it for use in Austin he would need to edit a long list of specific files and lines. This would be fiddly and error-prone. On top of this, if we made improvements to the core Eat/Play/Drink codebase, he would have to manually copy these changes over and make sure that they didn't accidentally overwrite his Austin-focussed updates to the code.

Sophia's hypothetical cousin is probably very intelligent, and could probably figure out what to do given enough time. However, the more complex the app gets, the harder this migration will become. We should therefore go one step better. It would not take us much work to refactor the Eat/Play/Drink codebase into a generic, customizable framework that could trivially be reused by anyone to create an app to display their favorite places in their own city.

To do this, we would need to move all of the San-Francisco-specific pieces of the application into a config file (similar to places.json, described above), and have the rest of the application read all of its data from this file. Adapting the app for Austin should then require only replacing the San Francisco config file with one for Austin, with no changes to the core code required. This means that if Sophia were to make improvements to the core code, her cousin would be able to pull them in to his website without stomping over any of his Austin-based changes. Since all of his location-specific data is stored in the config file, updating the core code is entirely safe.

Which parts of the app are currently San-Francisco-specific? The places on the map (which we discussed above) are, of course. Add to this the co-ordinates where the map should be centred, the initial zoom-level of the map, the title of the home page, and the description text above the map. We also have to decide whether to stipulate that our app can only be used with the existing categories of eat, play, and drink, or whether to allow someone repurposing the app to choose their own categories, such as run, bike, and swim. If we made the latter choice, we would need to specify the categories in the config file too so that our app could generate the appropriate filter buttons.

The config file might look something like this:

{
    "title": "EAT PLAY DRINK",
    "description": "A list of the best places to eat, play, and drink in San Francisco, curated by a San Franciso native. Choose from the buttons below to filter your search.",
    "categories": ["eat", "play", "drink"],
    "map": {
        "center": { "lat": 37.7749, "lng": -122.4194 },
        "zoom": 12
    },
    "places": [
        {
            "coords": { "lat": 37.7638, "lng": -122.469 },
            "content":
            "<h3>San Tung</h3><p>Chinese restaurant with delicious dry fried chicken wings.</p><a href='https://goo.gl/maps/wiqva8qzW9bBoJRJ6' target='_blank'>Get directions</a>",
            "category": "eat"
        },
        // ...etc...
    ]
}

After moving all the city-specific data into a config file, we would update app.js to read all of this extra data from it. We would also need to move responsibility for some pieces of data from index.html into app.js. For example, index.html currently has the description of the Eat/Play/Drink app hardcoded into it. We would instead need to make app.js responsible for dynamically inserting the description that it reads from the config file.
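As a rough sketch (the element IDs and the createMarker helper here are my own inventions, not Sophia's actual code), app.js might end up looking something like this:

// Sketch: read the whole config file, then let it drive both the page
// text and the map setup. The element IDs are invented for this example.
fetch("config.json")
  .then(response => response.json())
  .then(config => {
    document.getElementById("title").textContent = config.title;
    document.getElementById("description").textContent = config.description;

    const map = new google.maps.Map(document.getElementById("map"), {
      center: config.map.center,
      zoom: config.map.zoom,
    });

    config.places.forEach(place => createMarker(map, place));
  });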

Now that we've seen how to extract more data into configuration files, let's consider ways in which we could enhance the markers data structure that we already have. I'm not actually certain that this next suggestion is a good idea, but it's definitely educational so let's discuss it anyway.

Extracting extra fields

You may notice that the content field of each element in the markers array always follows the same structure. For example:

<h3>Manna</h3><p>Hole-in-the-wall family-owned
Korean restaurant with the best tofu soup.</p>
<a href='https://goo.gl/maps/TkAFCrGrswH76hmT8'
target='_blank'>Get directions</a>

This and all other content values contain a heading, a description, and a link to directions. Can we exploit this consistency to improve our code? Maybe.

To answer this question, let's start by making sure we understand the possible problems we are trying to solve. Suppose that we wanted to change the Get directions link to instead say simply Directions. At the moment we'd have to go through each place's content field and manually update its value with the new link text. Or suppose that we wanted to allow someone not familiar with programming to add and update our data (as above). This person could probably copy-and-paste an example description and deduce that the words between <h3> and </h3> are probably a heading. However, this is a lot of unnecessary thinking to require of a person who only wants to make a note of a cool new coffee shop. There's also a significant risk that they might accidentally delete a closing </p> tag or make some other typo. Come to think of it there's a significant risk that I would accidentally delete a closing </p> tag too.

To solve these problems, we could extract each component of the content field into its own field, and write code to join these new fields together into a fully rendered block. So in our markers data structure, instead of this:

{
  content: "<h3>Manna</h3><p>Hole-in-the-wall family-owned Korean restaurant with the best tofu soup.</p><a href='https://goo.gl/maps/TkAFCrGrswH76hmT8' target='_blank'>Get directions</a>",
  // ...etc...
}

We would write this:

{
  name: "Manna",
  description: "Hole-in-the-wall family-owned Korean restaurant with the best tofu soup.",
  directions_url: "https://goo.gl/maps/TkAFCrGrswH76hmT8",
  // ...etc...
}

We would then write code to join the above parameters and turn them into the same fully-rendered output (<h3>Manna</h3><p>...etc...) that was previously stored wholesale in the content field.
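That joining code might look something like the following sketch (the function name is invented for illustration):

// Sketch: rebuild the info window HTML from the separate fields, so that
// presentation decisions (h3 vs h2, link text) live in exactly one place.
function buildContent(place) {
  return (
    "<h3>" + place.name + "</h3>" +
    "<p>" + place.description + "</p>" +
    "<a href='" + place.directions_url + "' target='_blank'>Get directions</a>"
  );
}

With this in place, changing "Get directions" to "Directions" becomes a one-line edit, no matter how many places we have.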

This approach has several advantages. First, it further separates data from logic and presentation. Currently the markers array is responsible for raw information, but also some elements of presentation, such as whether the title should be displayed in <h3> or <h2> tags. However, markers should ideally contain only raw information, and the components of our program responsible for displaying this data (i.e. the rest of the app) should be responsible for making cosmetic choices.

Second, as discussed above, it makes adding and editing data simpler. It means that editors don't have to know anything about HTML, and makes it harder for them to make silly typos.

However, the approach is not without its tradeoffs. By standardizing on a fixed structure made up of a set of fixed parameters, we make it easy to create info windows that follow this fixed structure. However, we also make it harder to create info windows that don't. Currently the structure of an info window is to start with a title, then one paragraph of description, and finish with a link to directions. If you want two paragraphs of description, or an image, or don't want a link to any directions, then you're out of luck. You'll have to update the code that turns our parameters into info window text so that it also accepts additional, optional parameters, such as description_2, image_url, hide_directions_link, and so on. If there aren't many of these extra customizations then this might be a reasonable solution. But if there are a lot of them and you end up with twenty different configurable fields that sometimes interact with each other in surprising and complex ways (what should happen if hide_directions_link and directions_url are both set?) then you might find yourself wishing you had simply stuck with verbose, custom HTML instead of trying to make your life "easier".

In situations like this I usually start off with whichever approach requires writing the least amount of code. Then I add specialized features only when I feel that I understand the problem that I'm trying to solve. One way to decide when to invest in writing more code is to wait until the problems that you think it will solve have really become problems. For example, I've claimed that splitting out the content field into multiple sub-fields will reduce the number of silly mistakes and bugs in the info window content. I've also claimed that it will make it easier to update the appearance of every info window at once. But how often have you actually introduced silly mistakes and bugs as a result of messing up HTML? And how often have you had to update the appearance of every info window at once? Was it really that annoying?

Don't solve problems that you don't have.

Trying to be too clever

Suppose that we did decide to refactor our data file in this way. We split out the content field into separate fields for name, description, and directions_url. At this point we notice that every directions_url is of the form https://goo.gl/maps/$UNIQUE_ID, and that we could potentially save a few characters by replacing the directions_url field with one called directions_google_id, which contains only the unique ID in the goo.gl/maps URL. For example:

{
  name: "Manna",
  description: "Hole-in-the-wall family-owned Korean restaurant with the best tofu soup.",
  directions_google_id: "TkAFCrGrswH76hmT8",
  // ...etc...
}

If we went down this path then our info-window-building code would be responsible for turning TkAFCrGrswH76hmT8 into https://goo.gl/maps/TkAFCrGrswH76hmT8 before inserting it into the directions link.
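In code, that responsibility might amount to a one-line helper like this (the function name is made up for illustration):

// Hypothetical helper: rebuild the full directions URL from the stored ID.
function directionsUrl(place) {
  return "https://goo.gl/maps/" + place.directions_google_id;
}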

Making this change would certainly reduce the number of characters in our config file, and would indisputably reduce duplication of the string https://goo.gl/maps/. However, in my opinion these are very small benefits that don't nearly outweigh the problems that it introduces.

For one thing, this change would make our config file less clear to read and update. "Create a goo.gl link but then chop off everything except the random characters at the end" is a more complicated instruction than "create a goo.gl link and copy it". Worse, it would unnecessarily encode in our program the assumption that we always want to get our directions from Google Maps. Since we are still early in the lifecycle of the Eat/Play/Drink app, this could easily change, and we want to stay as flexible as possible. Suppose that we decided that for some places, Bing Maps gave better directions. We could add yet another field called directions_bing_id and write even more code to turn this into a Bing Maps URL, but now we're starting to really add a lot more complexity for a nebulous gain. Wanting to save keystrokes is a good instinct, but isn't always a good idea.

There are no hard rules about how data should be decomposed into parameters. It's even easy to imagine further constraints that would make the above change a good idea. For example, maybe we're being paid by Google to showcase Google Maps, and maybe it's better for us to work with IDs instead of full URLs because this is the form in which Google sends us click statistics. You have to consider the tradeoffs for each individual case, and if in doubt err towards doing nothing.


Separating out your data and your logic can make your code cleaner and your life easier, and we haven't even begun to fully exhaust the benefits. Suppose that the Eat/Play/Drink website keeps growing and adding more and more places. Eventually it will reach the point where storing all of our location data in JavaScript files, or even in separate config files, becomes unfeasible. Once this happens, we will want to switch to storing our data in a purpose-built database. This change will require a significant rearchitecting of our code, but having kept a clear separation between code and data will make the change substantially more straightforward. This is because we will only have to change the pieces of our code that deal with loading data from a data source. We can rip out the code that loads data from files and replace it with code that loads data from a database, and the code that is responsible for formatting and presenting the data won't have to know or care what we have done. So long as this code gets passed data that describes places, it doesn't much care where that data came from.
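For example, if all of our data loading already goes through a single function, then the file-to-database switch only touches that one function. A hypothetical sketch (the URLs and function name are invented for illustration):

// Today: load places from a static JSON file.
async function loadPlaces() {
  const response = await fetch("/places.json");
  // Later, perhaps: const response = await fetch("/api/places");
  const data = await response.json();
  return data.places;
}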

So next time you spot the opportunity to draw lines between your data and your logic, remember what Sophia did and give it a go.

Until next time:

PFAB 10: First-class functions and dependency injection

In the previous edition of PFAB we began looking at a program that analyzes WhatsApp message logs, written by Adarsh Rao. We discussed some of the tradeoffs between taking a batch approach - where we load and process many records at once - and a streaming approach - where we load one record at a time, fully process it, and only then load and process the next record. We focussed in particular on how to write a clean batch pipeline. This week we're going to look at how to make a modular streaming pipeline by using first-class functions and dependency injection.

This edition contains advanced, challenging material - this series isn't called "Programming Feedback for Advanced Beginners" for nothing. If after reading this post you think "OK, that's nice and all but I have no idea how to use any of it in my own code" then that's fine and normal. My hope is that the next time you come across some code written in this style then you'll think "oh I see what they're doing here." Then the time after you'll start to really see what's going on, and eventually you'll be writing this type of code yourself in your sleep.

If you haven't read the previous PFAB then make sure to read it before going any further. This project is written in Ruby, but should be broadly understandable even if you haven't come across Ruby before.

Moaning about our un-modular streaming code

At the end of last week's PFAB, we were attempting to update our streaming code to deal with the fact that WhatsApp messages can sometimes spread over multiple lines in a log file, like so:

7/28/19, 11:07 PM - Kimi: Here's an idea for the next time you return. 
Sample Message that is split over two-lines. 
7/30/19, 11:03 PM - Gaurav: More messages

We were lamenting the fact that our updated streaming code was starting to look quite gnarly and un-modular. Our code that parses log lines into messages was becoming deeply entwined with our code that processes messages once they had been parsed. Look at the call to update_stats buried deep inside our log-parsing code:

stats = {}
current_message = nil

# File.foreach reads the given file and runs
# the given "block" of code on each line in turn.
File.foreach("samplelog.txt") do |l|
  # Assume that we've written a function called
  # parse_raw_log_line that parses out the date, time,
  # message, and whether this line is the start of a new
  # message.
  parsed_line = parse_raw_log_line(l)

  # If this line is a new message, add the previous
  # message to our stats and reset current_message
  # to be this line.
  if parsed_line[:start_of_new_message]
    if !current_message.nil?
      stats = update_stats(current_message, stats)
    end
    current_message = parsed_line

  # If this line is not a new message, it must be a
  # continuation of the previous line's message. We
  # therefore add its contents to the current message.
  else
    current_message[:contents] += parsed_line[:contents]
  end
end

# We need to manually add the final message to our stats,
# since it won't be added by the above loop (urgh)
stats = update_stats(current_message, stats)

By contrast, the code for our batch version of the program remained nicely separated, with all of its logic wrapped up inside well-separated functions:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

I promised that we would refactor our streaming code so that it looked as aesthetically pleasing as our batch code. Let's see how.

Start from the end

When refactoring code to make it cleaner and tidier, I like to start from the end. How do I want my code to look to another programmer who is using it?

For the batch case this is easy. Our code is a pipeline with clearly separated steps - load data, parse data, analyze data. We want the results of one step to be passed directly into the next, and we want the code for each step to be entirely separate and isolated in different functions. When I was sketching out my version of the batch program, the first thing that I wrote was the high-level structure of the program:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

This helped me see what functions I would need to write, the arguments that each function would need to accept, and the values that each function would need to return. With this chunked-up specification in hand, the rest of the program was just a matter of filling in the pre-designed blanks.

Designing our streaming code

However, I found that coming up with this kind of high-level layout for streaming code was rather harder. Have a quick think about how you might do it. Don't worry if you can't come up with anything.

The reason that streaming is harder is that we don't have the simple, sequential execution that we had in the batch case. We have to read a line, parse it, check if we have finished composing a new message, add it to our stats if so, then go back, read another line, add it to our stats, and so on. We can't simply assemble a line of functions that feed data into each other.

The root cause of my discontent with our current streaming code is that our update_stats function is buried deep inside our code that parses log lines. This troubles me because if we wanted to reuse our log-parsing code to parse a different log file and perform a different action on each message (for example, write it to a message database), we would have to go inside our log-parsing code and conduct deep invasive surgery to change the call to update_stats. By contrast, to make a similar update to our batch code we would just have to write a new write_to_db function and swap it in for our calculate_stats function, leaving the code that parses logs from files untouched.

Some plausible but unsatisfactory solutions to this problem include:

  • Having two almost-duplicate copies of the same function, called (for example) parse_log_file_and_calculate_stats and parse_log_file_and_copy_to_database. However, copy-pasting code is always a shame, and it's particularly unfortunate that we'll have to make a new function for every single new processing action that we may want to perform in the future too.

  • A slightly better alternative would be to pull our log-parsing code into a function called (say) parse_log_file, and to pass a string into parse_log_file that describes the processing action that we want to perform. Inside parse_log_file we would use a long if-statement to match the string to a function. For example:

def parse_log_file(filename, processing_action)
  # <lots of parsing code goes here>

  case processing_action
  when 'update_stats'
    stats = update_stats(message, stats)
  when 'write_to_db'
    write_to_db(message)
  # ...etc...
  end

  # <lots more parsing code goes here>
end

However, this approach still has the disadvantage of requiring us to update parse_log_file every time we want to add a new processing action.

The solution to this conundrum is to use first-class functions and dependency injection.

First-class functions

First-class functions are a feature of many (but not all) programming languages. If a language has first-class functions, this means that you can pass its functions around as variables. To demonstrate what this means and why it matters, let's switch for a second from Ruby to Python, where the syntax for first-class functions is simpler:

def say_hello():
  print("HELLO!")

x = say_hello
x()
# => prints "HELLO!"

In Python, the line x = say_hello means "assign the function say_hello to the variable x". It is very different to the more familiar x = say_hello() (note the () after say_hello), which means "evaluate the function say_hello, then assign the result to the variable x."

After executing x = say_hello, x points to the say_hello function. This means that on the next line, x can be called like any other function by writing x(). This executes the code inside say_hello, printing "HELLO!".

Dependency injection

Variables that point to functions can be passed around like any other type of variable. This allows you to pass around instructions and logic without having to immediately execute it. Passing around logic in this way is known as dependency injection, because you're injecting logic that the receiving function depends on. Take a look at the following snippet, which demonstrates a function that takes another function as an argument, and then runs that function twice:

def say_hello():
  print("HELLO!")

def run_function_twice(my_function):
  """run_function_twice takes another function
  as an argument and runs that function twice"""
  my_function()
  my_function()

x = say_hello
run_function_twice(x)
# => prints "HELLO!" "HELLO!"

# Or simply:
run_function_twice(say_hello)
# => prints "HELLO!" "HELLO!"

Translating first-class functions into Ruby

(If you don't care about the specifics of Ruby and only care about the general principles of first-class functions, you can safely skip this section)

Strictly speaking, Ruby doesn't have first-class functions. You can't assign a method to a variable in the same way that you can in Python:

def say_hello()
  puts("HELLO!")
end

# In Python this would point `x` at the
# `say_hello` function:
x = say_hello
# But in Ruby it executes `say_hello`
# immediately and prints "HELLO!"

However, even though Ruby doesn't have true first-class functions, it does have a concept called a Proc, which is a different, slightly more verbose way of achieving the same thing (you don't need to worry too much about the details of Procs unless you're interested and want to look them up). In order to perform dependency injection and produce the same behavior as our Python example above, we need to:

  • Wrap the function we want to pass around inside Proc.new

  • Invoke the Proc using .call(), instead of simply ()

For example:

def say_hello()
  puts("HELLO!")
end

def run_proc_twice(my_proc)
  my_proc.call()
  my_proc.call()
end

x = Proc.new { say_hello() }
x.call()
# => prints "HELLO!"

run_proc_twice(x)
# => prints "HELLO!" "HELLO!"

Don't worry too much about the syntactic specifics for now - if the general idea of dependency injection and passing around logic as variables makes sense then you're doing great. If it doesn't then please do let me know.

How does any of this help us?

Recall the second band-aid solution that we proposed to our problem: pass parse_log_file a string describing the processing action that we want to perform, and then use a big if-statement inside parse_log_file to pick out and execute the logic that corresponds to this string.

This solution was OK, but how about we skip out the middleman? Instead of passing in a string that tells our function which logic to run, how about we pass in the logic itself in the form of a Proc?

At the top-level, this might look like this:

update_stats_proc = Proc.new { |m| update_stats(m) }
parse_log_file(fname, update_stats_proc)

The parse_log_file function now accepts 2 arguments - a filename, and a Proc containing the logic that parse_log_file should perform whenever it finishes parsing a message. This means that if we want to add new functionality to our system, for example to parse messages and add them to a database, we can reuse the existing parse_log_file method and pass in a different Proc. Excitingly, we won't have to update a single line inside parse_log_file.

write_to_db_proc = Proc.new { |m| write_to_db(m) }
parse_log_file(fname, write_to_db_proc)

Side-note: Ruby has a more elegant "block" syntax that we can use to write the same logic in a more stylish way:

parse_log_file(fname) do |m|
  update_stats(m)
end

(Google "ruby block" and "ruby yield" for more details)
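If you'd like a tiny, self-contained illustration of blocks and yield before we carry on (the names here are made up for the example and aren't part of Adarsh's program):

def each_greeting
  # Passes each greeting to the block supplied by the caller.
  ["HELLO!", "GOODBYE!"].each do |greeting|
    yield(greeting)
  end
end

each_greeting do |g|
  puts(g)
end
# => prints "HELLO!" then "GOODBYE!"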

Now that we've fleshed out how we want our program to look, all that remains is to implement the individual pieces that we've designed. We want parse_log_file to be a function that takes 2 arguments: a filename and a message-processing Proc. It should read the given file, parse WhatsApp messages from it, and invoke the message-processing Proc on every message as it is parsed.

Something like this ought to do the trick:

def parse_log_file(filename, message_processing_proc)
  current_message = nil
  File.foreach(filename) do |l|
    parsed_line = parse_log_line(l)

    if parsed_line[:new_message]
      if !current_message.nil?
        # If the line starts a new message, invoke
        # message_processing_proc on current_message
        # to process it.
        message_processing_proc.call(current_message)
      end
      current_message = parsed_line
    else
      current_message[:body] += parsed_line[:body]
    end
  end

  # Manually invoke message_processing_proc on the
  # final message.
  message_processing_proc.call(current_message)
end

# Use a global variable so that update_stats can see and update the
# count (a plain local variable wouldn't be visible inside the method).
$message_count = 0
def update_stats(message)
  $message_count += 1
end

update_stats_proc = Proc.new { |m| update_stats(m) }
parse_log_file(fname, update_stats_proc)

And there we have it - a streaming pipeline with the different steps of the pipeline as cleanly separated as they were in the batch pipeline.

Future work

We can use first-class functions and dependency injection to make our code even more powerful and more modular.

Configurable batch sizes

In the previous edition of PFAB, we mentioned that it is possible to combine streaming and batch approaches. We've so far thought of "streaming" as loading and processing one record at a time, and "batch" as loading and processing every record at once. However, we can also load and process "some" records at a time, in batches of a configurable size.

There may be interesting tensions at play in our choice of batch size. We might want to keep our batches small to reduce the amount of memory that we use (see last week's PFAB). But we might also want to keep our batches large to reduce the number of separate times we have to load data from our data source. In between these competing priorities there is likely a happy middle-ground. Where? That all depends.

Let's once again take our top-down approach, and consider how we would like our batch-size code to look to other programmers who are using it. I would like it if our users simply passed in a batch_size parameter to our existing functions, and then we take care of the rest:

parse_log_file(fname, batch_size) do |m|
  update_stats(m)
end

parse_log_file's contract then becomes "you give me a filename, a message-processing function, and a batch size. I'll load messages from your file in batches of batch_size, and run each record through the message-processing function."

Go back to the previous definition of parse_log_file and think about how you might update it to work with a batch-size.

More modularity

The parse_log_file function is currently responsible for both:

  • Loading log data from a file

  • Parsing log data into messages

For our current purposes, it's not a problem that parse_log_file has two jobs. However, suppose that we wanted to extend our system to work with additional data sources. In this expanded system, users can load their log data from not just a file, but also from a database or a website. The data in these new sources is formatted in the same way in each new source type as it is in the log file. We therefore want to be able to reuse our log-parsing code, and just want to be able to configure the source from which the logs are originally loaded.

We could write a band-aid solution to this problem, similar to those proposed for our earlier problems. We could make multiple functions called parse_log_database, parse_log_website, and parse_log_file, although this would require us to duplicate some code. Or we could pass a source_type argument into parse_log_file that indicated whether the function should load its data from a database, website, or file.

However, I would prefer to create additional modular building blocks that users of our code can string together themselves, in the same way that users can already string together parsing and processing. I'd like our streaming code to look like this:

load_data_from_db(dbname) do |line|
  parse_and_transform_line(line) do |message|
    update_stats(message)
  end
end

The new component in the mix is load_data_from_db. Recall how our line-parsing function (now called parse_and_transform_line) works - it parses lines one-by-one, and passes each line into a user-provided Proc for further processing. In the same way, load_data_from_db loads lines from a database, and passes each line into a user-provided Proc for further processing.

This allows us to use the same line-parsing function, whether we're reading the original data from databases or websites. At the same time, we can seamlessly swap out the piece of the pipeline that deals with loading data. In fact, now every step of our pipeline can be replaced and customized, independent of all the other steps. For example, if we wanted to load our data from a website instead of a database, and to translate the messages into French instead of calculating statistics, we could swap out the first and last components of our pipeline and write:

load_data_from_website(url) do |line|
  parse_and_transform_line(line) do |message|
    translate_message(message, 'french')
  end
end

Once again, implementing these postulated functions is left as a difficult exercise for another day or an especially motivated reader. This process is nonetheless a good example of the power of top-down design.

Should you actually write code like this?

Probably not. At least, not yet.

I've been somewhat less than honest in my presentation. The top-level code that I've sketched out is all very clean and delightful, but the lunch that it gives you is not free. Despite my hand-waving, actually implementing all of those methods will get complex quickly.

If you're writing a library that will be used by many other programmers then this is just part of your job. It's worth stuffing as much complexity as you can inside your library in order to make it as easy as possible for people on the outside to use. But if you're writing a small project that is currently only worked on by you and maybe one or two other people, embellishments like this might be more trouble than they're worth.

In particular, one of the biggest risks in erecting a framework inside a project is that it might turn out to be the wrong one. Adding new functionality to a framework that isn't designed to handle it is much harder than adding functionality to a project with no framework at all.

For example, all of the above code has assumed that message records are processed one at a time, in a complete vacuum. This sounds like a reasonable assumption. But suppose that Adarsh (the original author of this program) wanted to be able to measure the time difference between messages. This will require us to be able to peek back at past messages in order to read their timestamps, which is not a feature that our current framework has contemplated. Adding it to our now-opinionated framework will be substantially harder than adding it to a more freeform, unstructured program.

Don't undervalue flexibility. Every clever flourish that you add to your code makes some use-cases easier to handle, but probably also makes others much harder. When you know for sure(ish) the types of tasks that your system will and won't need to handle, then you can start to make assumptions and specialize. Until then, keep your options open, even if it makes your code a little more ragged around the edges for the time being.

Keep reading:

PFAB #9: Batch vs Stream processing

This post is part of my "Programming Feedback for Advanced Beginners" series, which helps you make the leap from knowing syntax to writing clean, elegant code. Subscribe now to receive PFAB in your inbox, every fortnight, entirely free.

(You can also read this post on my blog)

How many WhatsApp messages do you think you've exchanged with your best friend? Adarsh Rao, PFAB reader, wanted to find out. He downloaded his WhatsApp message logs and wrote a program to analyze his conversations. He found that he and his best friend had sent each other a respectable 110,000 words. "We've essentially written 2 novels between us if my Google search for 'average novel length' is at all accurate," he notes.

Adarsh's program already works perfectly (here's the code). But he wants to know how he can make it cleaner and tidier, and I've got a couple of suggestions for him. This week, I'd like to talk about the high-level structure of Adarsh's program. In particular I'd like to discuss a common tradeoff in data processing: batch or streaming? Today we'll learn what these words mean, how we can use them to construct data processing pipelines, and how to evaluate the tradeoffs between different ways of solving problems.

Batch or streaming?

A WhatsApp message log file looks like this:

7/28/19, 10:56 PM - Kimi: Hi. This is a business enquiry requesting a picture of your Dog.
7/28/19, 10:56 PM - Gaurav: Hi, yes
7/28/19, 10:57 PM - Gaurav: Thank you for getting in touch with us
7/28/19, 10:57 PM - Gaurav: We are assigning a representative to fulfil your request
7/28/19, 10:57 PM - Gaurav: Please hold

You can export your own WhatsApp chat history by using WhatsApp's "export" feature. In order to analyze a log line from his file, Adarsh needs to:

  1. Load a log line from the file (we'll start by assuming that each log line corresponds to a single message, although we'll see later that this is not always the case)

  2. Parse the line to pull out the name, date, time, and contents of the message (a rough sketch of this step appears just after this list)

  3. Calculate some statistics about the message - how long is its body, roughly how long did it take to type, etc?

  4. Combine these individual message statistics into aggregate statistics about the total number, average length, etc. of messages in the conversation
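Here's the rough sketch of step 2 promised above. The regex and field names are my own illustration rather than Adarsh's actual code; the important detail is that lines without a timestamp prefix are flagged as continuations of the previous message:

LOG_LINE_REGEX = %r{\A(\d+/\d+/\d+), (\d+:\d+ [AP]M) - ([^:]+): (.*)\z}

def parse_raw_log_line(line)
  match = LOG_LINE_REGEX.match(line.chomp)
  if match
    {
      date: match[1],
      time: match[2],
      name: match[3],
      contents: match[4],
      start_of_new_message: true
    }
  else
    # No timestamp prefix, so this must be a continuation
    # of the previous message.
    { contents: line.chomp, start_of_new_message: false }
  end
end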

Adarsh has a choice to make for how he processes his log lines: batch or streaming?

In a batch approach, Adarsh would perform each of these 4 steps on a large number of lines in his log file - possibly even all of them - before moving onto the next step. He would load all log lines from the log file. Then he would parse all of the messages into their name, date, time and contents. Then he would calculate statistics for every parsed message, and then finally combine them all into aggregate statistics.

+-------+       +-------+       +-------+
|       |       |       |       |       |
|       |       |       |       |       |
|       |       |       |       |       |
| Load  +------>+ Parse +------>+Analyze|
|       |       |       |       |       |
|       |       |       |       |       |
|       |       |       |       |       |
+-------+       +-------+       +-------+

In a streaming approach, Adarsh would do the opposite. He would load one line from the input file at a time. After loading a single line, he would immediately parse that line into its name, date, time, and contents. Then he would calculate statistics for just that line, and fold those statistics into his running aggregate statistics. Then he would go back to the start and start fully processing the next line.

+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+
+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+
+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+

As is so often the case, neither batch nor streaming is intrinsically better than the other, and the appropriate approach depends entirely on the context. For example, you're more likely to prefer a streaming system when you're processing new data in realtime as it comes in, and are more likely to prefer a batch system for background systems that need to process big chunks of data at once. As we'll see next week, you often want to take a hybrid approach that gives you the best of both worlds.

Let's look at some of the advantages, disadvantages, and general properties of batch and streaming pipelines.

Batch

A big advantage of a batch approach is that (in my opinion) the code is usually simpler and easier to read and write. Each step in the pipeline can be pulled out into its own function, and the resulting code largely explains itself. For example, in Ruby:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

Each step in this pipeline can easily be unit-tested individually:

# (these variables would be filled in with actual data)
raw_log = "..."
expected_parsed_messages = [{...}, {...}]
actual_parsed_messages = parse_raw_log(raw_log)

if actual_parsed_messages == expected_parsed_messages
  puts("TEST PASSED!")
else
  puts("TEST FAILED!")
end

A second advantage of a batch approach is that it requires fewer trips to the original source of the data. This is not particularly relevant for our situation, since reading data from a file is a relatively fast operation that we don't need to worry about optimizing until our system reaches a much larger scale. However, suppose that we were instead loading our raw log data from a database. Suppose also that every query to this database takes at least 1 second to execute, since the program has to go through the niceties of establishing a new connection to the database, as well as sending each query on a roundtrip to the database and back. A full streaming approach, where we repeatedly query for a single record at a time (the equivalent of reading a single line from a file at a time) would make us pay this fixed overhead of 1 second for every single record, making our program very inefficient. By contrast, a batch process, in which we query for and process big chunks of data all at once, would allow us to amortize the fixed query cost over a large number of rows.
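A quick back-of-the-envelope illustration of that amortization, using the hypothetical 1-second-per-query figure from above:

records        = 1_000
query_overhead = 1.0 # seconds of fixed cost per query

one_at_a_time = records * query_overhead                 # 1 record per query
in_batches    = (records / 100.0).ceil * query_overhead  # 100 records per query

puts "one record per query:  #{one_at_a_time}s of overhead" # => 1000.0s
puts "100 records per query: #{in_batches}s of overhead"    # => 10.0s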

We don't even have to load and process every row at once in order to use a batch approach. Maybe we process our data by loading it in batches of 10, 100, 1000, however many records. In fact, when you think about it, streaming is just batch with a batch size of 1 (woah). We'll blur these lines even more next week, but for this week we'll stick with the pure dichotomy between batch and streaming.

One disadvantage of a batch approach is that it typically uses more RAM, or memory, to run than a streaming approach does. To see why, look again at the schematic batch pseudocode I laid out earlier:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

When this (or any other) program runs, it stores its internal state and data in memory, including the data that is stored in variables. In this program, we start by reading the entire log file all at once. We assign the result of this read to a variable called raw_log. If the log file is 10MB in size then it's reasonable to estimate that this variable will take up roughly 10MB of memory.

Next, we parse all of these raw log lines and assign the result to another variable, parsed_messages. Since the data in parsed_messages is essentially the same as that in raw_log but in a different form, parsed_messages probably takes up about the same amount of memory again as raw_log. We're therefore using at least 20MB of memory to process a 10MB file. This isn't the end of the world, but it is a shame.

Let's therefore see how streaming works and how it is able to use less memory.

Streaming

A pseudo-code sketch for a streaming approach might look something like this:

stats = {}
File.open("samplelog.txt").each_line do |l|
  message = parse_raw_log_line(l)
  stats = add_message_to_stats(message, stats)
end

Ruby's each_line method reads the open file one line at a time, and passes each line into the block (the code between do and end) to be processed. each_line doesn't read another line from the file until the block has finished executing for the previous one.

Inside the block we use parse_raw_log_line to parse the single line into a single message, and then use add_message_to_stats to add the message into our aggregate stats-so-far. Once the block has finished executing, the Ruby interpreter (the program that executes our Ruby code) is able to throw away (or garbage collect) the data for both the raw line and processed message, since it can see that the program won't reference them again. This means that the Ruby interpreter can reuse the piece of memory in which they were stored to store future values instead.

Our streaming program therefore requires much less memory to run than our batch one. It doesn't need to simultaneously hold in memory the entire raw contents of the file, plus the entire processed output. Instead, at any one time it only needs to store a single raw line and a single processed message.

Note that my claims of memory optimization are a high-level simplification, and the exact behavior of the Ruby interpreter is complex and hard to predict. However, we can still say with confidence that the streaming program will use less memory than the batch one, even if we can't say exactly how much less without measuring.

Code simplicity

"You said that streaming code is often more complex than batch code, but that streaming code snippet you wrote just now didn't look too bad," I hear you say. No, but that was on easy mode. As I mentioned briefly at the start of this post, WhatsApp message logs aren't arranged neatly, one per line. Instead, if a message body contains a newline, the message logs will contain a newline too.

7/28/19, 11:07 PM - Gaurav: I'm going to write a message split over many lines.
This is a new line but the same message.
This is another new line.
I'm out of control.
7/30/19, 11:03 PM - Kimi: Well I never.

This unfortunate fact messes up our nice simple rule of "one log line per message", and makes both our batch and streaming approaches more complex. Now when we're processing a log line, we don't immediately know whether it is the end of the message or whether the message will be continued on the next line. We only know that a message has ended when we process the next log line and find that a new message has started. Then we have to go back to the message that we were previously assembling, and add it to our stats and do whatever other processing we want to do on it, before resuming our parsing of our new message.

(It may be possible to simplify this logic somewhat by going through the file backwards. Why this might help is left as an exercise for the reader)

Nonetheless, it is entirely possible to incorporate this new logic into a streaming approach. Here's a reasonable, if still somewhat gnarly, attempt:

stats = {}
current_message = nil
File.readlines("samplelog.txt") do |l|
  # This method defined elsewhere
  parsed_line = parse_raw_log_line(l)

  # If this line is a new message, add the previous
  # message to our stats and reset current_message
  # to be this line.
  if parsed_line[:start_of_new_message]
    if !current_message.nil?
      stats = add_message_to_stats(current_message, stats)
    end
    current_message = parsed_line

  # If this line is not a new message, it must be a
  # continuation of the previous line's message. We
  # therefore add its contents to the current message.
  else
    current_message[:contents] += parsed_line[:contents]
  end
end

# We need to manually add the final message to our stats,
# since it won't be added by the above loop (urgh)
stats = add_message_to_stats(current_message, stats)

I'm starting to get nervous. Previously the sections of our code that parse messages and calculate message statistics were entirely separate. This made them easy to understand and to write unit tests for. However, in this new version of our code, they have become deeply entwined. add_message_to_stats is now buried inside a double-nested if-statement inside the code that parses messages. It's now hard to assess at a glance whether our code is doing approximately the right thing, and hard to write neat unit tests that just test message parsing.

As we will see next week, it is in fact possible to write streaming code that handles multi-line messages without any sacrifice of modularity. It just requires a bit more work and imagination.

By contrast, updating our batch code to handle multi-line logs is easy. As a reminder, here's our current, high-level batch structure again:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

In order to get this code to handle multi-line messages we will have to make roughly the same updates that we made to our streaming code. However, this time we can make all of the updates inside our parse_raw_log function. The interface of the function won't need to change at all - it will still accept the full raw log text, and still return an array of parsed messages. The only thing that will change is the internal logic that does the parsing. The rest of our program, in particular the calculate_stats function, won't have to know or care that anything has changed.
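For the curious, here is a hedged sketch of how parse_raw_log's internals might change (parse_raw_log_line is the same helper assumed elsewhere; callers of parse_raw_log don't change at all):

def parse_raw_log(raw_log)
  messages = []
  raw_log.each_line do |l|
    parsed_line = parse_raw_log_line(l)
    if parsed_line[:start_of_new_message]
      messages << parsed_line
    else
      # Continuation line: glue its contents onto the most recent message.
      messages.last[:contents] += parsed_line[:contents]
    end
  end
  messages
end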

So what should Adarsh do?

If I were writing this program, I would start with a batch approach. The code is simpler and easier to get up and running, and even though it may be somewhat less memory-efficient, I don't believe that this actually matters. I assume that we're processing the WhatsApp message logs for a single person with, at most, a few hundred thousand messages. I also assume that we're running the program on a modern, relatively powerful computer. If both of these assumptions are correct then the fact that our code is slightly inefficient won't cause us any problems. Our computer will have plenty of memory to spare, and finishing our calculations in 28.5 seconds rather than 28.3 seconds won't materially affect our lives. In this situation I am happy to make my code somewhat less efficient if this also makes it easier to write, read, and understand.

However, suppose that this script continues to evolve. Eventually it becomes the backbone of a system inside a squillion-dollar company, responsible for processing and analyzing billions of message logs from millions of users. Suddenly resource efficiency becomes critical. Speeding up the code and cutting down its memory usage could save us hundreds of hours and thousands of dollars of computation time. Faced with this new set of priorities, we may well want to switch to a more efficient streaming approach, or (more likely) a batch-streaming hybrid. So how could we modify our streaming code so as to make it as pleasant to work with as possible?

Find out in next week's edition of Programming Feedback for Advanced Beginners.
