PFAB #9: Batch vs Stream processing

This post is part of my "Programming Feedback for Advanced Beginners" series, which helps you make the leap from knowing syntax to writing clean, elegant code. Subscribe now to receive PFAB in your inbox, every fortnight, entirely free.

(You can also read this post on my blog)

How many WhatsApp messages do you think you've exchanged with your best friend? Adarsh Rao, PFAB reader, wanted to find out. He downloaded his WhatsApp message logs and wrote a program to analyze his conversations. He found that he and his best friend had sent each other a respectable 110,000 words. "We've essentially written 2 novels between us if my Google search for 'average novel length' is at all accurate," he notes.

Adarsh's program already works perfectly (here's the code). But he wants to know how he can make it cleaner and tidier, and I've got a couple of suggestions for him. This week, I'd like to talk about the high-level structure of Adarsh's program. In particular I'd like to discuss a common tradeoff in data processing: batch or streaming? Today we'll learn what these words mean, how we can use them to construct data processing pipelines, and how to evaluate the tradeoffs between different ways of solving problems.

Batch or streaming?

A WhatsApp message log file looks like this:

7/28/19, 10:56 PM - Kimi: Hi. This is a business enquiry requesting a picture of your Dog.
7/28/19, 10:56 PM - Guarav: Hi, yes
7/28/19, 10:57 PM - Guarav: Thank you for getting in touch with us
7/28/19, 10:57 PM - Guarav: We are assigning a representative to fulfil your request
7/28/19, 10:57 PM - Guarav: Please hold

You can export your own WhatsApp chat history by using WhatsApp's "export" feature. In order to analyze a log line from his file, Adarsh needs to:

  1. Load a log line from the file (we'll start by assuming that each log line corresponds to a single message, although we'll see later that this is not always the case)

  2. Parse the line to pull out the name, date, time, and contents of the message

  3. Calculate some statistics about the message - how long is its body, roughly how long did it take to type, etc?

  4. Combine these individual message statistics into aggregate statistics about the total number, average length, etc. of messages in the conversation
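Step 2 of this pipeline can be sketched with a regular expression. Here's an illustrative version in Ruby (the language used for the examples in this post); the method name, field names, and the exact regex are assumptions based on the US-style export format shown above, which varies by locale:

```ruby
# Illustrative sketch of step 2: pull the date, time, name, and
# contents out of a single log line. The regex assumes the
# US-style export format shown above.
LOG_LINE_REGEX = %r{\A(\d{1,2}/\d{1,2}/\d{2}), (\d{1,2}:\d{2} [AP]M) - ([^:]+): (.*)\z}

def parse_raw_log_line(line)
  match = LOG_LINE_REGEX.match(line.chomp)
  return nil if match.nil? # not the start of a message

  { date: match[1], time: match[2], name: match[3], contents: match[4] }
end
```

Returning nil for a non-matching line will come in handy later, when we discover that not every line in the file starts a new message.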

Adarsh has a choice to make for how he processes his log lines: batch or streaming?

In a batch approach, Adarsh would perform each of these 4 steps on a large number of lines in his log file - possibly even all of them - before moving onto the next step. He would load all log lines from the log file. Then he would parse all of the messages into their name, date, time and contents. Then he would calculate statistics for every parsed message, and then finally combine them all into aggregate statistics.

+-------+       +-------+       +-------+
|       |       |       |       |       |
|       |       |       |       |       |
|       |       |       |       |       |
| Load  +------>+ Parse +------>+Analyze|
|       |       |       |       |       |
|       |       |       |       |       |
|       |       |       |       |       |
+-------+       +-------+       +-------+

In a streaming approach, Adarsh would do the opposite. He would load one line from the input file at a time. After loading a single line, he would immediately parse it into its name, date, time, and contents. Then he would calculate statistics for just that line and fold them into his running aggregate statistics. Then he would go back and fully process the next line.

+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+
+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+
+-------+       +-------+       +-------+
| Load  +------>+ Parse +------>+Analyze|
+-------+       +-------+       +-------+

As is so often the case, neither batch nor streaming is intrinsically better than the other; the appropriate approach depends entirely on the context. For example, you're more likely to prefer a streaming system when you're processing new data in realtime as it comes in, and more likely to prefer a batch system for background jobs that need to process big chunks of data at once. As we'll see next week, you often want to take a hybrid approach that gives you the best of both worlds.

Let's look at some of the advantages, disadvantages, and general properties of batch and streaming pipelines.

Batch

A big advantage of a batch approach is that (in my opinion) the code is usually simpler and easier to read and write. Each step in the pipeline can be pulled out into its own function, and the resulting code largely explains itself. For example, in Ruby:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

Each step in this pipeline can easily be unit-tested individually:

# (these variables would be filled in with actual data)
raw_log = "..."
expected_parsed_messages = [{...}, {...}]
actual_parsed_messages = parse_raw_log(raw_log)

if actual_parsed_messages == expected_parsed_messages
  puts("TEST PASSED!")
else
  puts("TEST FAILED!")
end

A second advantage of a batch approach is that it requires fewer trips to the original source of the data. This is not particularly relevant for our situation, since reading data from a file is a relatively fast operation that we don't need to worry about optimizing until our system reaches a much larger scale. However, suppose that we were instead loading our raw log data from a database. Suppose also that every query to this database takes at least 1 second to execute, since the program has to go through the niceties of establishing a new connection to the database, as well as sending each query on a roundtrip to the database and back. A full streaming approach, where we repeatedly query for a single record at a time (the equivalent of reading a single line from a file at a time), would make us pay this fixed overhead of 1 second for every single record, making our program very inefficient. By contrast, a batch process, in which we query for and process big chunks of data all at once, would allow us to amortize the fixed query cost over a large number of rows.

We don't even have to load and process every row at once in order to use a batch approach. Maybe we process our data by loading it in batches of 10, 100, or 1,000 records - however many we like. In fact, when you think about it, streaming is just batch with a batch size of 1 (woah). We'll blur these lines even more next week, but for this week we'll stick with the pure dichotomy between batch and streaming.
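To make that concrete, here's a sketch of a middle-ground batch size using Ruby's each_slice. The statistics helper is a stand-in invented for this example:

```ruby
BATCH_SIZE = 1000

# Stand-in for the real statistics step: count messages and words.
def add_lines_to_stats(lines, stats)
  stats[:messages] = stats.fetch(:messages, 0) + lines.length
  stats[:words] = stats.fetch(:words, 0) + lines.sum { |l| l.split.length }
  stats
end

stats = {}
# In the real program the lines would come from
# File.foreach("samplelog.txt") rather than an in-memory array.
lines = ["Hi, yes", "Please hold"]
lines.each_slice(BATCH_SIZE) do |batch|
  stats = add_lines_to_stats(batch, stats)
end
```

Set BATCH_SIZE to 1 and this is a streaming program; set it to infinity and it's a pure batch one.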

One disadvantage of a batch approach is that it typically uses more RAM, or memory, to run than a streaming approach does. To see why, look again at the schematic batch pseudocode I laid out earlier:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

When this (or any other) program runs, it stores its internal state and data in memory, including the data that is stored in variables. In this program, we start by reading the entire log file all at once. We assign the result of this read to a variable called raw_log. If the log file is 10MB in size then it's reasonable to estimate that this variable will take up roughly 10MB of memory.

Next, we parse all of these raw log lines and assign the result to another variable, parsed_messages. Since the data in parsed_messages is essentially the same as that in raw_log but in a different form, parsed_messages probably takes up about the same amount of memory again as raw_log. We're therefore using at least 20MB of memory to process a 10MB file. This isn't the end of the world, but it is a shame.

Let's therefore see how streaming works and how it is able to use less memory.

Streaming

A pseudo-code sketch for a streaming approach might look something like this:

stats = {}
File.open("samplelog.txt").each_line do |l|
  message = parse_raw_log_line(l)
  stats = add_message_to_stats(message, stats)
end

Ruby's each_line method reads the open file one line at a time, and passes each line into the block (the code between do and end) to be processed. each_line doesn't read another line from the file until the block has finished executing for the previous one.

Inside the block we use parse_raw_log_line to parse the single line into a single message, and then use add_message_to_stats to add the message into our aggregate stats-so-far. Once the block has finished executing, the Ruby interpreter (the program that executes our Ruby code) is able to throw away (or garbage collect) the data for both the raw line and processed message, since it can see that the program won't reference them again. This means that the Ruby interpreter can reuse the piece of memory in which they were stored to store future values instead.

Our streaming program therefore requires much less memory to run than our batch one. It doesn't need to simultaneously hold in memory the entire raw contents of the file, plus the entire processed output. Instead, at any one time it only needs to store a single raw line and a single processed message.

Note that my claims of memory optimization are a high-level simplification, and the exact behavior of the Ruby interpreter is complex and hard to predict. However, we can still say with confidence that the streaming program will use less memory than the batch one, even if we can't say exactly how much less without measuring.

Code simplicity

"You said that streaming code is often more complex than batch code, but that streaming code snippet you wrote just now didn't look too bad," I hear you say. No, but that was on easy mode. As I mentioned briefly at the start of this post, WhatsApp message logs aren't arranged neatly, one per line. Instead, if a message body contains a newline, the message logs will contain a newline too.

7/28/19, 11:07 PM - Guarav: I'm going to write a message split over many lines.
This is a new line but the same message.
This is another new line.
I'm out of control.
7/30/19, 11:03 PM - Kimi: Well I never.

This unfortunate fact messes up our nice simple rule of "one log line per message", and makes both our batch and streaming approaches more complex. Now when we're processing a log line, we don't immediately know whether it is the end of the message or whether the message will be continued on the next line. We only know that a message has ended when we process the next log line and find that a new message has started. Then we have to go back to the message that we were previously assembling, and add it to our stats and do whatever other processing we want to do on it, before resuming our parsing of our new message.

(It may be possible to simplify this logic somewhat by going through the file backwards. Why this might help is left as an exercise for the reader)

Nonetheless, it is entirely possible to incorporate this new logic into a streaming approach. Here's a reasonable, if still somewhat gnarly, attempt:

stats = {}
current_message = nil
File.open("samplelog.txt").each_line do |l|
  # This method defined elsewhere
  parsed_line = parse_raw_log_line(l)

  # If this line is a new message, add the previous
  # message to our stats and reset current_message
  # to be this line.
  if parsed_line[:start_of_new_message]
    if !current_message.nil?
      stats = add_message_to_stats(current_message, stats)
    end
    current_message = parsed_line

  # If this line is not a new message, it must be a
  # continuation of the previous line's message. We
  # therefore add its contents to the current message.
  else
    current_message[:contents] += parsed_line[:contents]
  end
end

# We need to manually add the final message to our stats,
# since it won't be added by the above loop (urgh)
stats = add_message_to_stats(current_message, stats)

I'm starting to get nervous. Previously the sections of our code that parse messages and calculate message statistics were entirely separate. This made them easy to understand and to write unit tests for. However, in this new version of our code, they have become deeply entwined. add_message_to_stats is now buried inside a double-nested if-statement inside the code that parses messages. It's now hard to assess at a glance whether our code is doing approximately the right thing, and hard to write neat unit tests that just test message parsing.

As we will see next week, it is in fact possible to write streaming code that handles multi-line messages without any sacrifice of modularity. It just requires a bit more work and imagination.

By contrast, updating our batch code to handle multi-line logs is easy. As a reminder, here's our current, high-level batch structure again:

raw_log = File.read("samplelog.txt")
parsed_messages = parse_raw_log(raw_log)
message_stats = calculate_stats(parsed_messages)

In order to get this code to handle multi-line messages we will have to make roughly the same updates that we made to our streaming code. However, this time we can make all of the updates inside our parse_raw_log function. The function's interface won't need to change at all - it will still accept the raw log contents, and still return an array of parsed messages. The only thing that will change is the internal logic that does the parsing. The rest of our program, in particular the calculate_stats function, won't have to know or care that anything has changed.
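For instance, the parsing function could first glue continuation lines back onto the line that started each message, and then parse each reassembled message exactly as before. A sketch (the "starts with a date" test is a simplifying assumption about the export format):

```ruby
# Glue continuation lines onto the message that started them, so
# that each element of the returned array is one complete message.
def group_multiline_messages(raw_lines)
  grouped = []
  raw_lines.each do |line|
    if line =~ %r{\A\d{1,2}/\d{1,2}/\d{2},}
      grouped << line.chomp # start of a new message
    elsif !grouped.empty?
      grouped[-1] += "\n" + line.chomp # continuation of the previous one
    end
  end
  grouped
end
```

After this pre-processing step, the one-message-per-element assumption holds again, and the rest of the parsing code doesn't have to change.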

So what should Adarsh do?

If I were writing this program, I would start with a batch approach. The code is simpler and easier to get up and running, and even though it may be somewhat less memory-efficient, I don't believe that this actually matters. I assume that we're processing the WhatsApp message logs for a single person with, at most, a few hundred thousand messages. I also assume that we're running the program on a modern, relatively powerful computer. If both of these assumptions are correct then the fact that our code is slightly inefficient won't cause us any problems. Our computer will have plenty of memory to spare, and finishing our calculations in 28.5 seconds rather than 28.3 seconds won't materially affect our lives. In this situation I am happy to make my code somewhat less efficient if this also makes it easier to write, read, and understand.

However, suppose that this script continues to evolve. Eventually it becomes the backbone of a system inside a squillion-dollar company, responsible for processing and analyzing billions of message logs from millions of users. Suddenly resource efficiency becomes critical. Speeding up the code and cutting down its memory usage could save us hundreds of hours and thousands of dollars of computation time. Faced with this new set of priorities, we may well want to switch to a more efficient streaming approach, or (more likely) a batch-streaming hybrid. So how could we modify our streaming code so as to make it as pleasant to work with as possible?

Find out in next week's edition of Programming Feedback for Advanced Beginners.

PFAB #8: Input validation - tradeoffs between convenience and surprise


In the previous installment of Programming Feedback for Advanced Beginners we began looking at Gianni Perez's breadth-first search library. We decided not to analyze its internal workings, instead examining it from the outside to see how it looked and felt to external users. This week we're going to continue to study Gianni's library, and look at how it performs complex input validations.

(Read the previous PFAB before this one if you haven't already)

Input validation

Programs and functions validate their inputs in order to make sure that they don't do anything weird.

This might be in order to help them better serve their users. A user who inputs an email address of 3180 18th Street is not going to be able to receive any emails from your service. Your service should therefore validate that their email address looks sensible, and complain if it doesn't.

Input validations are also a critical tool for preventing security vulnerabilities. It makes mathematical (if obtuse) sense for you to ask your bank to transfer me -$1000. Logically, you transferring me -$1000 is the same as me transferring you +$1000. But your bank should still validate that requested transfer amounts are greater than zero, and reject any requests that fail this validation.

Many input validations are trivial, and simply check whether an input makes any sense at all. number_of_cakes_ordered must be greater than or equal to zero. email_address must look something like x@y.z (although the details can get gnarly). However, sometimes input validations are murkier and the appropriate way to handle invalid data is less clear. It might be possible, with a little imagination, for us to infer what the user meant, but it's not always clear that this is a good idea. In such situations, should we be sticklers or swashbucklers? As always, it depends.


Gianni's breadth-first search code faces a particularly subtle input validation conundrum. Last week we saw that his shortest_path function takes in a graph and two nodes, and returns the shortest path between these nodes. However, it expects the graph to be passed in in a very specific format: a dictionary in which the keys are the names of the nodes in the graph, and the values are lists of the neighboring nodes that can be reached from each one. For example:

{
    '1': ['2', '3'], // from node 1 you can step to node 2 or 3
    '2': ['1', '4'], // from node 2 you can step to node 1 or 4
    // ...etc...
}

It may or may not be sensible for Gianni to add validation that every node in the input graph has an entry in this data structure. But what should his validation do if a node is a dead end, and so has no neighbors? In this situation we have two main options. First, we can continue to validate, and require that the graph data structure explicitly records the fact that such nodes have no neighbors by using an empty list. For example:

{
    '1': [],
    // ...etc...
}

Or, second, we can assume that a node's absence implies that it has no neighbors, and not raise an exception if a node is missing.
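The difference between the two options boils down to a single line of code. Here's a sketch using hypothetical helper functions, not Gianni's actual code:

```python
def neighbors_strict(graph, node):
    # Option 1: every node must have an explicit entry, even if
    # its neighbor list is empty.
    if node not in graph:
        raise KeyError(f"node {node!r} has no entry in the graph")
    return graph[node]

def neighbors_lenient(graph, node):
    # Option 2: a missing node is assumed to be a dead end.
    return graph.get(node, [])
```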

This is a complex, abstract problem to understand, so let's start by looking at a similar but stripped-down example. After we've finished, we'll apply the lessons within it to Gianni's breadth-first search code.

Validating test scores

Suppose that we are writing a program to analyze a school's test scores. We have a function called analyze_test_scores that takes in a dataset of scores and uses them to calculate some statistics, like the mean, median, mode, and standard deviation:

def analyze_test_scores(test_scores):
    # Do some analysis and return some statistics

This function expects test_scores to be a nested dictionary in which the top-level key is a student's name. It expects the nested key to be the name of the test, and the nested value to be the student's score in that test. For example:

{
    'rob': {
        'algebra': 100,
        'history': 100
    },
    'sally': {
        'algebra': 12,
        'history': 8,
    },
    // etc...
}

However, it may be that not every student took every test. There are two ways for us to account for this possibility in our data structure. The most verbose and explicit way is to require that every student have a key-value pair for every subject, but if a student didn't take a test then the value should be None (or nil, or null, or however our language represents "nothing"). For example:

{
    'rob': {
        'algebra': 100,
        'geography': None,
        'history': 100
    },
    'sally': {
        'algebra': 12,
        'geography': 7,
        'history': None,
    },
    // etc...
}

If we take this approach, our analyze_test_scores function would need to validate that every student has an entry for every subject, and throw an exception if they don't. One useful property of this approach is that it safeguards against typos in the subject names. Maybe jerome doesn't have a score for science but does inadvertently have one for sceince. In this situation our zealous validation code would spot that something was wrong and loudly throw an exception.

Alternatively we could be more relaxed. We could allow users to pass in datasets with missing subject scores, and assume that any student without a key-value for a subject didn't take the test. This would be more convenient to use, but more open to silly typos. It would probably take us much longer to realize that jerome had only taken a test for sceince but not science.
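A sketch of what the strict check might look like, given a list of subjects that every student is expected to have an entry for (the function name and signature are invented for this example):

```python
def validate_test_scores(test_scores, subjects):
    # Strict validation: every student needs an entry (possibly
    # None) for every subject, and no entries for unknown subjects.
    for student, scores in test_scores.items():
        missing = set(subjects) - set(scores)
        unknown = set(scores) - set(subjects)
        if missing or unknown:
            raise ValueError(
                f"{student}: missing {sorted(missing)}, unknown {sorted(unknown)}"
            )
```

This is the version that catches jerome's typo: sceince shows up as an unknown subject and science as a missing one.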

Which of these two approaches is best? It depends.

Trade offs between convenience and surprise

Whenever you're writing code that will be used by another programmer (you-in-a-month's-time counts as another programmer), put yourself in the mind of the person using your code. This person doesn't want to have to do much work, and they don't want to be confused. They want to maximize convenience and minimize surprise.

However, these goals are often in tension with each other. The most convenient code is that which requires no instruction and can figure out everything magically and perfectly. For example:

output = analyze_test_scores(
    "The scores from last week, or maybe the week before? I forget."
)
print(format_output(
    output,
    structure="The way that Frankie did it the other day, that was really cool."
))

But with interpretation and inference comes great scope for silly, silent, surprising mistakes. In our test scores example, giving users the freedom to pass in partially filled-in datasets might make their life easier 95% of the time. But the other 5% of the time it saddles them with secretive bugs or - even worse - wrecks their results in ways that they'll never notice.

Can we have the best of both worlds? Can we make our library both convenient and unsurprising? By using techniques that I'll call "configurable validation" and "loader functions" we can certainly try.

Configurable validation

We could give our users the flexibility to decide how much validation they want us to apply by adding an optional strict_validation parameter to our analyze_test_scores method. If this parameter is set to True then we throw an exception if data is missing. If it is set to False then we silently assume that a missing key means that a test wasn't taken.

Before you read any further - in languages that support optional parameters, should the default value of strict_validation be True or False if the user doesn't give us a value?

# If test_data is missing test scores, this will throw an exception
stats = analyze_test_scores(test_data, strict_validation=True)

# This will silently fill in the blanks
stats = analyze_test_scores(test_data, strict_validation=False)

# What should this do?
stats = analyze_test_scores(test_data)

Answer - I think that the default value of strict_validation should be True. This ensures that users of our code who don't read our documentation don't accidentally run code with lax validation. Only users who know what they are doing and understand the risks involved will pass in strict_validation=False.
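Putting that together, the function might look something like this. This is a sketch, not the real analyze_test_scores; the statistics are pared down to just a count and a mean:

```python
def analyze_test_scores(test_scores, strict_validation=True):
    # Every subject that appears anywhere in the dataset.
    subjects = {subj for scores in test_scores.values() for subj in scores}
    if strict_validation:
        for student, scores in test_scores.items():
            missing = subjects - set(scores)
            if missing:
                raise ValueError(f"{student} is missing entries for {sorted(missing)}")
    # A missing or None score means the test wasn't taken.
    taken = [score for scores in test_scores.values()
             for score in scores.values() if score is not None]
    return {"count": len(taken), "mean": sum(taken) / len(taken)}
```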

Loader functions

Our library could also provide functions that load test result data on behalf of our users. For example, suppose that users retrieved their data by loading it from a specially-structured file. We could provide a function for loading, processing, and validating this data called load_test_scores_from_file. This function might look something like this:

def load_test_scores_from_file(filepath):
    raw_data = open(filepath).read()

    # ...
    # Massage the raw data into the format that the
    # rest of our library expects it, including
    # `None`s for missing test scores.
    # ...

    return formatted_data

We could have this load_test_scores_from_file function be responsible for making the same validations and decisions that analyze_test_scores currently makes. If data is missing from the source file, this function could be responsible for either raising an exception or filling in the blanks with Nones. It could even have its own strict_validation flag that it uses to decide how fussy to be.

This approach has several pleasing properties. It validates data immediately, at its source, and allows the rest of our program to assume that the data it receives is in the correct format. This simplifies communication between the different components of our program. They're now passing around a dataset that is known to be properly-formatted, rather than a dataset that may or may not be properly-formatted that they have to constantly check and decide how to handle.

In addition, validating data when it's loaded will likely make debugging easier. If a program raises an "invalid data!" exception while it's loading a dataset then it's clear that your problem is with your dataset itself. If it instead raises an exception several steps later, when the data is being processed, it's rather more work to trace back the flow of logic to where the data originally came from.

Other data sources

We could also provide loader functions for loading data from other sources. We could provide functions called load_test_scores_from_db and load_test_scores_from_api that perform the exact same operations as load_test_scores_from_file, but load their raw data from different places. These functions would be responsible for formatting and validating the data, and for returning a dataset in exactly the same form as load_test_scores_from_file does.

We can provide standard loader functions for as many standard data sources as we have time and patience for. However, if our user wants to load their own data from their own custom source (for example, a custom database or spreadsheet), then we can't give them very much direct support. That said, we can still provide a standalone validation function, called something like preprocess_test_scores. Our user can call this function from their own data-loading code, allowing them to at least be confident that their custom-loaded data is in the right form:

data = load_data_from_my_custom_data_store()
processed_data = preprocess_test_scores(data)

stats = analyze_test_scores(processed_data)

Users can bring their own data from their own sources, while also using our standardized validation and formatting functions to ensure that their data is in the right format.

A fully-featured library would likely provide all of these tools. It would provide standardized, full-service functions for loading data from standard sources (like files, databases, and APIs), but also provide the building blocks for users to bring their own data from their own sources. In the real world, time and energy are finite, so any library that you write should begin with the highest leverage, most useful tools. You can add additional, edge-case gadgets as and when they are needed.

How does this apply to breadth-first search?

Gianni's breadth-first search conundrum is almost exactly analogous to that of our test scores. Remember, we're trying to decide whether we should require users to give us fully-filled-out dictionaries that contain an entry for every node, or whether we should assume that a missing node means that the node has no neighbors rather than that the user has made an error.

We have exactly the same options as with our test scores. We can fill in the blanks on behalf of our user when they pass their dictionary into shortest_path. We can fill in the blanks when they load their data. We can give them optional parameters that they can use to allow them to choose the blank-handling behavior they prefer. Same decisions, different contexts.

In conclusion

Think about how your library looks and feels to use. In particular, think about the tradeoffs between convenience and surprise. On their own these types of micro-decisions won't make or break your project, and neither will they get you fired or promoted. But they do all add up, and they are good practice for similar, larger decisions that do make a bigger difference.

Next time we're going to look at a brand new program that parses WhatsApp message logs. Don't miss it. Until then:

PFAB#7: How to write a library


Welcome to week 7 of Programming Feedback for Advanced Beginners. In this series I review a program sent to me by one of my readers. I highlight the things that I like and discuss the things that I think could be better. Most of all, I suggest small and big changes that the author could make in order to bring their program up to the next level.

(To receive all future PFABs as soon as they’re published, subscribe by email or follow me on Twitter. For the chance to have your code analyzed and featured in a future PFAB, go here)

This week we're going to peruse a program sent to me by Gianni Perez, a security analyst from the US of A. He spends most of his days searching for exploits and vulnerabilities, not building and maintaining large software projects, but he'll frequently throw together quick scripts to automate parts of his work. He says that he's always looking to improve his programming skills, and he asked me to take a look at a small project he recently finished.

Gianni's program is an implementation of the breadth-first search algorithm for finding the shortest path from A to B. A variation of breadth-first search (called Dijkstra's algorithm) is used by car and train journey-planners to find you the best route from here to there (and show you adverts at the same time).

Breadth-first search is a staple of computer science undergraduate courses, although it's relatively rare that a real person has to implement it outside of a university lab or an ill-conceived whiteboard job interview. Nonetheless, as we'll see, doing so can still be a satisfying and educational exercise.

Don't worry if the words "computer science" or "algorithm" make you nervous. We're going to be able to make intelligent observations about Gianni's code without having to know anything about the internals of breadth-first search. This is because we're not going to analyze whether Gianni has implemented the algorithm cleanly or even correctly (although as far as I can tell he has). Instead, we're going to pretend that we're considering whether to use his code as a library in order to perform breadth-first search in our own journey-planning project. We'll evaluate his code from the perspective of potential consumers, and discuss how to make library code more welcoming to new users.

Let's start with an optional paragraph or two summarizing what goes on inside a breadth-first search.

(You may find it useful to open Gianni's code to reference while you read this post)

Breadth-first search

The goal of breadth-first search is to find the shortest path through a network from a source to a destination. Such a network is known as a graph. A graph is composed of vertices - the nodes of the graph - and edges - the lines joining those nodes together. Suppose that we were trying to find the shortest route from city A to city B along the highways. Cities would be our vertices, the highways joining them would be our edges, and the overall highway network would be our graph.

In breadth-first search you start at the source vertex and walk simultaneously along every possible path away from it, until one of these paths reaches your desired destination. The "breadth" in breadth-first search comes from the fact that you are trying every possible path simultaneously, instead of following one path at a time all the way to its end.

To perform the algorithm you start at the source vertex. For example, suppose you were trying to find the shortest path from A to E in the following graph. You would start at A:

           +---+
     +-----+ A +-----+
     |     +---+     |
     |               |
     +               +
+---+B+---+          C
|         |          +
|         |          |
+         +          |
D         E          +
          +---------+F

Next you simultaneously take a step from this source vertex to each one of its neighbors, and store a new path in your program for each one. If any of your paths have landed on your destination vertex then you declare victory and return that path as the shortest path. If not then you keep going.

           +---+
     +-----+ A +-----+
     |     +---+     |
     |               |
   +-+-+           +-+-+
+--+ B +--+        | C |
|  +---+  |        +-+-+
|         |          |
+         +          |
D         E          +
          +---------+F

In the next iteration through the algorithm you extend each path by one step to each of its new neighbors, creating and keeping track of further new paths where necessary. You repeat this process in each subsequent iteration too, stepping from the vertex at the end of each of your paths to each new neighbor that hasn't yet been touched by another path. You iterate until one of your paths steps onto your destination vertex.

            +---+
       xxxxxxx A +-----+
       x     +---+     |
       x               |
     +-x-+           +-+-+
  +--+ B xxxx        | C |
  |  +---+  x        +-+-+
  |         x          |
+-+-+     +-x-+        |
| D |     | E |      +-+-+
+---+     +---+------+ F |
                     +---+

Shortest path: A => B => E

At this point you know for certain that you have found the shortest path between your source and destination vertices. It's possible and indeed likely that there are other paths available between these two vertices. However, since you are tracing out every possible path simultaneously and only extending them by one hop at a time, those other paths are guaranteed to take more hops to trace out than the one you have found.
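The procedure above can be sketched in a few lines of Python. To be clear, this is a generic illustration of the algorithm, not Gianni's actual code - his function names and data structures may well differ:

```python
from collections import deque

def shortest_path(graph, source, destination):
    """Breadth-first search over a graph given as a dict mapping
    each vertex to a list of its neighbors."""
    # Each queue entry is the full path walked so far.
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        vertex = path[-1]
        if vertex == destination:
            return path  # the first path to arrive is a shortest one
        # Extend the path by one step to each untouched neighbor.
        for neighbor in graph[vertex]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists

# The example graph from the diagrams above.
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E'],
}
print(shortest_path(graph, 'A', 'E'))  # ['A', 'B', 'E']
```

The queue ensures that all paths of length 1 are examined before any path of length 2, and so on, which is what makes the first path to reach the destination a shortest one.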

If this doesn't fully make sense then you can either Google "breadth first search", or continue reading this post without worrying too much. As potential users of this library we don't need to know anything about how it works internally. Do you know how your favorite programming language actually works? Me neither, but we're still both able to write programs with it just fine.

How does a library look to its users?

As users of Gianni's library, we only care whether its code is correct, reasonably efficient, and easy to use. We don't care at all whether its insides are neat or messy or well-commented or opaque, because we're never going to see any of these innards ourselves. All we're going to see are the names of the methods that the library exposes to us, and the forms of input and output that they expect.

To illustrate the difference between a usable and a useless library, here's a simple example. Which of these functionally-equivalent chart-drawing libraries would you rather work with?

Number 1:

import ch

ch.do(
    data,
    None,
    None,
    None,
    'x=Date|y=Sales Volume',
    'YES',
    5)

Or number 2:

import chartlib

chartlib.draw_line_chart(
    data,
    x_axis_label="Date",
    y_axis_label="Sales Volume",
    show_gridlines=True,
    font_size=5)

Both libraries produce the same charts. Both libraries may even contain almost exactly the same code. But whereas library number 1 looks like an obfuscated nightmare to use, the second looks like a simple pleasure. The difference is in their interfaces - how the user interacts with them.

When analyzing Gianni's code through this lens, I noted a few subtle ways in which its functions' interfaces could be made more usable. In particular, I noticed some small but significant improvements that could be made to the form in which the functions return their results.

What should a function return?

One of the library's most important functions is called shortest_path. This function takes in 3 arguments: a graph, a source vertex, and a destination vertex. As you might expect, it finds and returns the shortest path through the graph from the source to the destination. This is exactly the kind of function that is well-suited to being performed by a library. When writing our journey-planner product, we want to handle all of the parts of the code that are specific to our business - rendering the output maps, finding the user's location, showing them lots of adverts, and so on. But there's no reason for us to re-implement a core, generic computer science algorithm; much better to outsource this type of work to a dedicated, battle-tested algorithmic library like Gianni's.

However, from our outside perspective, shortest_path comes with a substantial irritation. Its output is a single string of the names of the vertices on the shortest path, joined together with ASCII arrows. For example, "10->5->2->15". Suppose that we wanted to use this output to draw a route on a map, and to display this map in a webpage to our user (along with lots and lots of advertisements). In order to display each vertex on our map individually, we'd have to split the string returned by shortest_path (in the form "10->5->2->15") back into its component vertices. This is a perfectly doable task, but also an annoying and unnecessary one.

Instead, I would much prefer it if shortest_path returned a list of the vertices on the shortest path (eg. ['10','5','2','15']). This would give us maximum flexibility to decide how we want to present the results to our customers. shortest_path's job should only be to calculate data; it shouldn't make any assumptions about how the caller wants to display it.
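To see the difference from the caller's side, compare the two interfaces (the string form is how the library behaves today; the list form is the proposed change):

```python
# With the string interface, every caller must first parse the
# result back apart before doing anything useful with it:
path_string = "10->5->2->15"
vertices = path_string.split("->")  # ['10', '5', '2', '15']

# With a list interface, the raw data is immediately usable, and a
# caller who *wants* the arrow format can rebuild it in one line:
path = ['10', '5', '2', '15']
print("->".join(path))  # 10->5->2->15
```

Going from a list to a formatted string is a one-liner; going the other way forces every caller to reverse-engineer the library's display format.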

As a rule of thumb, separate out data calculation from data formatting wherever possible. Have your calculation components return their results as raw lists, dictionaries and objects, and avoid doing anything that smells like "display logic". Pass this output into a second component that knows nothing about how to calculate anything, but knows everything about how to make data look good.

This applies even for code that you never intend to be used by anyone else. Separating out responsibilities for calculation and formatting allows you to completely change how your data is displayed without having to change anything about how this data is produced (and vice versa).
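A minimal sketch of this separation might look like the following (the function names here are hypothetical, chosen just to illustrate the split):

```python
def calculate_shortest_path(graph, source, destination):
    """Returns the shortest path as a raw list of vertex names.
    (The breadth-first search itself is elided; this stand-in
    result is for illustration only.)"""
    return ['A', 'B', 'E']

def format_path(path):
    """All display logic lives here, away from the calculation."""
    return " => ".join(path)

path = calculate_shortest_path({}, 'A', 'E')
print(format_path(path))  # A => B => E
```

If we later decide to render the path as an HTML list or draw it on a map, only format_path (or its replacement) needs to change; the calculation code is untouched.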

+---------+
|Calculate|
+----+----+
     |
     v
+----+----+
| Format  |
+----+----+
     |
     v
+----+----+
| Output  |
+---------+

I've sketched out how I would change shortest_path and the way in which it is used on GitHub.

A second example

I spotted an almost identical problem in a second function called bfs_traversal. This function's job is to find and return the shortest distance from a source vertex to every other vertex in the graph. The original version of this function returned its output in a fancy display format called a PrettyTable. I don't know anything about PrettyTable - from some quick Googling it looks like it's a library that can be used to print cleanly structured tables to the terminal. But as a user of Gianni's breadth-first search library I don't want to know anything about PrettyTable. I just want to receive the data that I asked for in a raw, sensible structure, and then I'll take care of formatting and outputting it as I see fit.

I looked further at the code for bfs_traversal, and noticed that it already stores all of its interim data in exactly the kind of raw, dictionary structure that I actually want it to give back to me. The only problem is that the function then tries to be too helpful and do my formatting for me. Fixing this simply requires the removal of the PrettyTable code.

This code:

def bfs_traversal(graph, source_vertex):
    distances = {}
    # <do a load of calculations>
    return convert_to_pretty_table(distances)

should become this code:

def bfs_traversal(graph, source_vertex):
    distances = {}
    # <do a load of calculations>
    return distances

Conclusion

We've seen how we can analyze the exterior of a piece of code from a user's perspective, without having to know anything about what goes on inside. This idea is not specific to programming; we can (and do) critique the user interfaces of complex gadgets without needing to know how they work inside. We might not care to learn how to build a smartphone, but we can still give useful feedback on how one feels to use. Think about how your code might look to someone on the outside who doesn't care how it works and just wants it to be simple and intuitive to work with.

Next week we'll continue to look at Gianni's library, and suggest some more ways in which he could make it more convenient to use for a journey-planner-programmer in a hurry.

Until we meet again!
