Friday, May 9, 2014

The operation queue: A content editing pattern for JavaScript

I recently gave a tech talk for the Khan Academy dev team describing the latest iteration of our infrastructure for saving edits from our content editors to the server. While this seems like a simple problem (doesn’t Backbone already handle that stuff?) there are complexities when you want to support multiple simultaneous users and make the UI fast and responsive. I think the pattern that we gravitated to is a generally useful one, and this was validated by a recent blog post describing an almost identical solution to the same problems by the Dropbox team, which you can read here (they even called it an operation queue!). If you’re writing a client-side system for making changes to global state that need to be persisted to a server, this may help you skip some of the less-optimal steps along the path to content syncing nirvana.

Viewing content on Khan Academy (as an admin)

Editing content on Khan Academy

Responsiveness

First, some definitions. What does responsiveness mean in an editing interface?

No save buttons: The user doesn’t have to remember to click a button to apply the changes. Once the user toggles the UI to “edit” mode, all the fields are edit-in-place where possible and everything autosaves as soon as it is edited. (There is always some internal opposition to autosave on the grounds of “users expect a Save button”, but Google Docs has proven it can be done effectively.)

Instant feedback: If something is wrong we tell the user immediately, rather than returning an error minutes or hours later when they try to publish their changes.

No interruptions: The user can continue working after making changes, without having to wait for the data to get saved to the server. The user should only see a progress bar when they navigate to a different piece of content.

First pass: The Backbone Way

We started migrating over to Backbone as a JavaScript framework in early 2012. Backbone’s models are a straightforward way to handle synchronizing with the server and listening for property changes, and we are still using them for that purpose. Here is the naive way to implement an editor using Backbone’s views and models:
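
Stripped down to its essence, the naive version looks roughly like this (a sketch; the view and field names are illustrative, not our actual editor code):

```javascript
// A minimal sketch of the naive approach: write to the model and save
// immediately, straight from the view's change handler.
var VideoModel = Backbone.Model.extend({
    urlRoot: "/api/videos"
});

var VideoEditorView = Backbone.View.extend({
    events: {
        "change .title-input": "onTitleChanged"
    },

    onTitleChanged: function(e) {
        this.model.set("title", $(e.target).val());
        this.model.save();   // fires an AJAX request right away
    }
});
```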


This has some issues:
  1. Every time the editor calls save(), it must wait for that save to complete before calling it again, since multiple AJAX requests can arrive out of order and changes will be lost.
  2. If a save fails for whatever reason, the editor must handle that in every place that can trigger a save().
  3. The model’s internal state is only updated either when save() is called or when the save succeeds and the server state is updated. In either case the UI can be temporarily out of sync with the model state. This makes it unsafe to listen for model changes and update the UI since the user may have unsaved changes.

Second pass: Autosave

To address some of these issues, I designed a new system that would be capable of watching the UI for changes and automatically queuing up those changes to be sent to the server in the background:


This new Autosave component would receive a request to update a field whenever the UI registered a change, and add it to a queue of attributes to save for the model. It would encapsulate calling save() on each model that had changes queued, and be smart about merging attribute changes that were queued to reduce the number of AJAX calls issued.
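
A stripped-down sketch of that component (the real one handles more cases; ErrorBar here is a stand-in for whatever surfaces errors to the user):

```javascript
var Autosave = {
    _pending: {},    // model cid -> {model, attrs} waiting to be saved
    _saving: false,

    // Called by the UI whenever a field changes.
    enqueue: function(model, attrs) {
        var queued = this._pending[model.cid] || {model: model, attrs: {}};
        // Merging here means several quick edits collapse into one AJAX call.
        _.extend(queued.attrs, attrs);
        this._pending[model.cid] = queued;
        this._flush();
    },

    _flush: function() {
        if (this._saving) {
            return;              // only one save in flight at a time
        }
        var cid = _.keys(this._pending)[0];
        if (!cid) {
            return;              // nothing left to save
        }
        var queued = this._pending[cid];
        delete this._pending[cid];

        var self = this;
        self._saving = true;
        queued.model.save(queued.attrs, {patch: true})
            .fail(function(xhr) {
                // Validation errors from the server end up in one place
                // for all editors (ErrorBar is a stand-in for that UI).
                ErrorBar.show(xhr);
            })
            .always(function() {
                self._saving = false;
                self._flush();   // keep draining the queue
            });
    }
};
```

The UI then calls something like Autosave.enqueue(video, {title: newTitle}) from its change handlers and never touches save() directly.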

For error handling, in order to allow the user to continue working I chose to make the system optimistic and assume that changes being saved would succeed. If there was a validation error, it would be caught by the queue and shown to the user, and the user could clear the error by making further edits that were successfully saved. All this was handled in one place for all editors, and things were pretty good.

Warning: Collision imminent!

The problems begin when we have multiple users editing the same content at the same time, or one user has a tab open for a long time with stale data and then comes back and edits it. This leads to conflicts and data loss.

Here is a simple example, where an item (whose only content in this case is a list of children, initially just [ A ]) is edited by both Alice and Bob at the same time:


Alice adds B to the item list that just contained A and saves the new list [ A, B ] to the server. Since Bob’s item model is not updated to include the new child B, he sends up his own list with just A and C, and now Alice’s change has been reverted without either Alice or Bob being notified.

Now, we could require Bob’s Autosave component to automatically fetch the latest version of the item before overwriting it, but that would neither be transactionally secure (it would require multiple AJAX calls, and Alice could make her changes in between them) nor make the merge straightforward: Bob has children A & C and the server has A & B, and now we have to write some sort of client-side merge algorithm. And, as stated before, we cannot easily update Bob’s UI once we’ve received changes from the server, since that might erase changes he has made that have not yet been saved!

Barring a complete rewrite, the first step is to detect when Bob is about to overwrite Alice’s changes. It just so happens that we create a unique revision ID after each edit, which is stored with the item, and the client knows which revision ID it has for each piece of content. So when save() is issued, the revision ID that is sent up is the latest one the client received before making the changes being saved. The server can easily compare this revision to the latest one it has, and return an error if they do not match:


Now when Bob sends up his change to the item, he includes in the JSON representation the revision ID that he thinks is current, in this case revision 1. The server looks at the latest revision - Alice’s - and sees that it is revision 2, so it returns a special HTTP status code: 409 Conflict. This doesn’t make Bob especially happy, since he knows his changes cannot be saved. At this point he has to reload the page and make his changes again. But the key improvement is that someone is made aware that data is lost, and since it’s the person who just made the change, it is easy enough for him to redo his work.
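
On the client, the band-aid amounts to sending the last-seen revision along with the save and reacting to the 409 (again a simplified sketch; ErrorBar and the revision_id field name are illustrative):

```javascript
// Band-aid sketch: send the last revision we saw and detect a conflict.
model.save(attrs, {
    patch: true,
    attrs: _.extend({}, attrs, {revision_id: model.get("revision_id")}),
    error: function(model, xhr) {
        if (xhr.status === 409) {
            // Someone else saved a newer revision first. At this stage all
            // we can do is ask the user to reload and redo the edit.
            ErrorBar.show("This item was changed by someone else. " +
                          "Please reload the page and try again.");
        }
    }
});
```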

Ideal? No, not really. But the correct solution requires a bit bigger change.

OperationQueue

At this point I had to give up the band-aid solutions and look at a bigger change. What we want is to be able to pull down the latest version and play back Bob’s changes on top of Alice’s so that no work is lost. So rather than storing the attribute values directly, we create an “operation” - a function + data - that can be applied to a model at any point to make the desired change.

It’s a subtle change:


For most fields like “title” the operation will be (setAttribute, {title: X}), or “set ‘title’ to ‘X’” with ‘title’ and ‘X’ being stored in the operation’s data and setAttribute being a generic function. But for adding or removing items from a list, we would store the child being added/removed and the index, rather than the actual list.
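
In code, an operation is nothing more than a plain function plus the data it needs (a sketch):

```javascript
// An operation is a generic function plus the data it needs, so it can be
// replayed against whatever the model looks like later.
var setAttribute = function(model, data) {
    model.set(data);                    // e.g. data = {title: "New title"}
};

var addChild = function(model, data) {
    var children = _.clone(model.get("children") || []);
    children.splice(data.index, 0, data.childId);
    model.set("children", children);
};

// Instead of queueing {children: [...]}, the editor queues something like
//   {fn: addChild, data: {childId: "B", index: 1}}
```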

This is how the conflict plays out now:


Now Alice and Bob no longer enqueue changes to the child list directly, but rather enqueue two operations: (addChild, “B”) and (addChild, “C”). Alice’s operation runs first and adds B to [ A ] to make [ A, B ], and saves. Bob adds C to [ A ] to make [ A, C ], but his save fails immediately. The client then fetches Alice’s new version [ A, B ], replays the failed operation on top of it to make [ A, B, C ], and attempts to save again. This one succeeds, and now both Alice’s and Bob’s changes have been saved. No one has lost any work!
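
The retry path can be sketched roughly like this (simplified; the real queue also tracks which operations have already been saved, and the revision_id field is the same illustrative name as above):

```javascript
// Simplified sketch of the retry path in the operation queue.
function saveWithRetry(model, operations) {
    // Apply the queued operations to the latest known model state.
    _.each(operations, function(op) {
        op.fn(model, op.data);
    });

    return model.save(null, {
        // Send up the revision we based our changes on so the server can
        // detect a conflict.
        attrs: _.extend(model.toJSON(), {revision_id: model.get("revision_id")})
    }).fail(function(xhr) {
        if (xhr.status === 409) {
            // Conflict: fetch the latest revision, then replay the same
            // operations on top of it and try again.
            model.fetch().done(function() {
                saveWithRetry(model, operations);
            });
        }
    });
}
```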

Now, since we’ve updated the model during the course of the save, Bob has to be shown the new values in case he wants to make further changes. This was a problem before, since Bob could already have made more changes and queued them up to be saved, and we don’t want to revert them in the UI during a refresh. Luckily, having those changes as operations means we can now calculate what the UI should look like at any given moment by taking the model state (which corresponds to the latest server state) and applying any operations in the queue that have not yet been saved. This is what the getUIAttributes() method does in the above diagram. This gives Bob a consistent view with all the changes he’s made, even though the model underneath is always in sync with the server:


Even though the UI has been refreshed when the item was reloaded from the server, Bob’s changes remain.
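
A sketch of the getUIAttributes() idea: the last-known server state plus any unsaved operations.

```javascript
// The UI renders the last-known server state plus any unsaved operations.
function getUIAttributes(model, unsavedOperations) {
    // Work on a throwaway copy so the real model stays in sync with the server.
    var scratch = model.clone();
    _.each(unsavedOperations, function(op) {
        op.fn(scratch, op.data);
    });
    return scratch.toJSON();
}
```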

It is important to understand that we only get as much protection from data loss as the operations are constructed to give: adding children is safe, but if two users change the title of an item in different ways, the first change will still be overwritten unless the operation does something smarter internally (like a text diff). Check boxes, combo boxes, and numeric fields cannot be merged and will always overwrite other changes. The best way to deal with this is to synchronize changes between all clients as soon as they are made on the server, and at least we now have a refresh path on the client that makes that fairly easy.

Conclusion

If you want to give this system a try yourself, feel free to check out the source code and use it in your own projects. It's fairly well documented and should be really easy to get up and running. Let me know how it works for you!

So far it seems like the saving infrastructure outlined here is pretty close to the ideal one for handling multiple users editing structured data asynchronously. We’ve had fewer reports of lost data and productivity among our content creators continues to increase as the tools get better. Now we can focus on making the editors themselves more intuitive and user-friendly. As we like to say at Khan Academy: Onward!

Friday, January 3, 2014

Implementation: A new content store for Khan Academy

This article provides more implementation details for our versioned content store. To see the motivating challenges and overall design for building a versioned content store in App Engine, see the companion blog post.

The truth is, once I had a rough design for a Git-alike content store, the main challenges were all implementation details and attention to backward-compatibility (I had to add the new versioning system incrementally without any downtime in the editing tools or - God forbid - for site users). The simplicity owed a lot to the Git storage model (storing versions of entities labeled by their SHA-1 hash), which aligns neatly with the way App Engine’s High Replication Datastore likes to store and access data. There were just a few issues to work out, and I’ll list them here.

< architecture >


The simplest way to implement a Git-like store in App Engine is to create a db/ndb Model class for the object you’d like to store, with the properties it needs and a method for creating new revisions of that model. Unlike traditional entities, which are overwritten whenever a change is made, in this case you create a completely new entity (a “revision”) on every change. This might sound wasteful compared to storing diffs, but the invariant that revisions are immutable makes the implementation easier and enables easy caching of revisions. This is one example where we rely on App Engine’s scalability to make our lives easier; compared to the hundreds of millions of user-generated data items, the number of entities here will be relatively small. If this keeps you up at night, you can always prune orphaned revisions later.

One decision we made fairly early on was to keep editing models (revisions) separate from the live models that the site uses. The primary reason for this was that we had live entities already (Video and Exercise), and finding all the places where we fetch them by key would have been an onerous and error-prone task. This choice turned out to have some other advantages as well. So the inheritance tree looks like this:

BaseVideo is a plain Python class that stores the DB properties and methods shared by the editing version (VideoRevision) and the run-time published version (Video). Common functionality for working with live content and revisions lives in VersionedContent and BaseRevision, respectively. In our case, we could not make BaseVideo a PolyModel (as that would have changed the kind and therefore invalidated all the existing keys), so we had to introduce a metaclass to allow the subclasses to share DB properties. This enables us to add properties to VersionedContent and BaseRevision that will be inherited by subclasses. I will use Video/VideoRevision as my examples from now on, but everything stated applies equally to our other content types.

As in Git, the (key) ID of the VideoRevision is a hex-encoded SHA-1 hash of the contents of the object, which in this case is a JSON representation of the entity’s fields. There is also a string property which references the Video’s key ID, so we can track history and keep references to an object across revisions. When we create a VideoRevision for a new video, we generate a random ID, ensuring that a new Video entity will be created at publish time. Note that there may be many VideoRevision entities for a single video (tracking historical changes) but there is only ever one Video entity, preserving the current ability of published entities to reference each other by datastore key. The Video also contains the key ID of its latest VideoRevision that has been published; that is, the fields of both should be the same.
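
A heavily simplified sketch of that layout (the metaclass and the Base* layers are omitted, and the property names are illustrative, not our real schema):

```python
import hashlib
import json

from google.appengine.ext import ndb


class VideoRevision(ndb.Model):
    # Key ID = SHA-1 of the JSON-serialized fields (computed by the caller).
    content_id = ndb.StringProperty()   # stable ID shared by all revisions
    title = ndb.StringProperty()
    youtube_id = ndb.StringProperty()

    def revision_fields(self):
        return {"content_id": self.content_id,
                "title": self.title,
                "youtube_id": self.youtube_id}


class Video(ndb.Model):
    # The single live entity; its key ID is the stable content_id.
    edit_revision_id = ndb.StringProperty()   # latest published revision
    title = ndb.StringProperty()
    youtube_id = ndb.StringProperty()
    canonical_url = ndb.StringProperty()      # denormalized at publish time


def revision_id_for(fields):
    """Content-addressed ID: hex SHA-1 of a canonical JSON dump."""
    payload = json.dumps(fields, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```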

This means that to find the latest revision of an object, we need a table of content ID → revision ID. This is the “stage” (or sandbox), which represents the current editing state.

So, to make an edit to an object:
  1. Look up the latest edit version of the object in the stage (this is a get-by-key, which is very efficient)
  2. Apply the requested changes
  3. Compute a new revision ID from the updated properties
  4. Create the new revision entity with the revision ID as its key ID and put it into the datastore
  5. Update the stage to point to the new revision ID
Once the content author is done making changes, they can create a commit, which is just a snapshot of the stage at a particular moment in time, freezing the revision IDs to specific values. The commit contains an author ID and commit message, and it references the previous “head” commit, forming a chain of changes that can be used to recover the entire history of the content. The commit becomes the new “head” commit and is automatically queued up to be published to the site.
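
Here is the edit-and-commit flow above as a sketch; Stage and Commit stand in for however the content ID → revision ID tables are actually stored:

```python
def edit_video(stage, content_id, changes):
    # 1. Look up the latest revision for this content ID in the stage.
    old_rev = ndb.Key(VideoRevision, stage.revision_ids[content_id]).get()

    # 2. Apply the requested changes to a copy of its fields.
    fields = old_rev.revision_fields()
    fields.update(changes)

    # 3. Compute the new revision ID from the updated properties.
    new_rev_id = revision_id_for(fields)

    # 4. Create the new, immutable revision entity.
    VideoRevision(id=new_rev_id, **fields).put()

    # 5. Point the stage at the new revision.
    stage.revision_ids[content_id] = new_rev_id
    stage.put()
    return new_rev_id


def commit_stage(stage, author_id, message, head_commit):
    # A commit freezes a copy of the stage's table and chains to the
    # previous head commit.
    return Commit(revision_ids=dict(stage.revision_ids),
                  author_id=author_id,
                  message=message,
                  parent_id=head_commit.key.id())
```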

This is what the whole setup looks like after a commit:

And here is what it looks like after a second commit, where three of the four entities have been changed:


Having the stage and the commits both be tables of revision IDs means that doing a diff is very efficient: just look for entries that differ between the two tables, fetch only those revisions, and diff their properties. This makes it easy to recompute values or invalidate caches based on just the content that has changed, without having to compare old and new properties for every piece of content.
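
In sketch form, that diff is just a dictionary comparison:

```python
# A diff between two commits is a comparison of their revision-ID tables.
def changed_content_ids(old_commit, new_commit):
    return [content_id
            for content_id, rev_id in new_commit.revision_ids.items()
            if old_commit.revision_ids.get(content_id) != rev_id]
```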

< publishing >


At its core, publishing a commit to the site involves identifying which content revisions have changed (by comparing the revision IDs in the commits’ tables), fetching those revisions, and copying their fields onto the corresponding live entities, creating any entities that don’t already exist. Then the “currently published commit” setting is updated, which instantaneously changes the version of the entities that all the user-facing handlers look at when rendering the site, and you’re done. The process works equally well for rolling back to an earlier commit.
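
A sketch of that publish loop, reusing the changed_content_ids sketch from above (Setting stands in for our global settings mechanism):

```python
def publish(old_commit, new_commit):
    for content_id in changed_content_ids(old_commit, new_commit):
        rev_id = new_commit.revision_ids[content_id]
        revision = ndb.Key(VideoRevision, rev_id).get()

        # Create the live entity if it doesn't exist yet, then copy fields.
        video = Video.get_by_id(content_id) or Video(id=content_id)
        fields = revision.revision_fields()
        fields.pop("content_id")          # the live key already encodes this
        video.populate(edit_revision_id=rev_id, **fields)
        video.put()

    # Changing this setting instantly switches which commit the site serves.
    Setting.set("published_commit_id", new_commit.key.id())
```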

Publishing is also a great place for denormalizing data. Since Video and VideoRevision are separate models, we can add properties to just Video (such as the canonical URL) that are calculated and saved on the entity during publish. We can also pre-warm some caches that are invalidated on each publish, so that users never see a cache miss.

Separating live entities from editing entities does add some complexity to the system, but after publish we can now reference a Video by its key (which is stable) or we can run datastore queries on its indexed fields, neither of which we could have done with just the revisions.

< sync / merge >


Because of the simplicity of the versioning system, if I want to import the latest copy of the topic tree to my local dev datastore, all I need to do is:
  1. Download the latest commit from the live site
  2. Make a list of the revision entities that I don't have
  3. Download the revisions in bulk and push them directly into my datastore
  4. Set the downloaded commit as the “head” commit
From there I can do a normal publish and everything should behave identically to the way it does on live. If I make local changes, I can run the same process on the live site to pull the changes back up.
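
As a sketch, with fetch_remote_* standing in for the HTTP API the live site exposes for bulk-downloading commits and revisions, and Setting as before:

```python
def sync_from_live():
    head = fetch_remote_head_commit()                       # step 1
    keys = [ndb.Key(VideoRevision, rev_id)
            for rev_id in head.revision_ids.values()]
    missing = [key.id() for key, entity in zip(keys, ndb.get_multi(keys))
               if entity is None]                           # step 2
    ndb.put_multi(fetch_remote_revisions(missing))          # step 3
    Setting.set("head_commit_id", head.key.id())            # step 4
```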

It is possible that the commit that has been synced is not a descendant of the local head commit (there have been changes both locally and remotely since the last sync). In this case we can create a “merge” commit, which finds the common ancestor and then performs an automatic three-way merge between the two branches. The algorithm is trivial when no entity has been modified in both branches, and even when one has, it’s still possible to do a field-by-field merge, which should cover a majority of cases. This allows us to copy all our content to a new datastore, make some changes, publish and preview them, and then merge those changes back to the live site automatically.
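
The per-entity merge rule can be sketched like this (merge_fields is a hypothetical stand-in for the field-by-field fallback):

```python
def merge_revision_ids(base, local, remote):
    merged = dict(remote.revision_ids)
    for content_id, local_rev in local.revision_ids.items():
        base_rev = base.revision_ids.get(content_id)
        remote_rev = remote.revision_ids.get(content_id)
        if remote_rev == base_rev:
            # Only changed (or created) locally: take the local revision.
            merged[content_id] = local_rev
        elif local_rev not in (base_rev, remote_rev):
            # Changed in both branches: merge field by field.
            merged[content_id] = merge_fields(base_rev, local_rev, remote_rev)
    return merged
```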

< tl;dr >


I wanted to go into some detail to give you an idea of the design decisions and trade-offs we made on the road to having a flexible, extensible versioning system that works efficiently on App Engine. Some of these decisions are specific to our particular situation, but a lot of these ideas could be generally useful in building a CMS on a NoSQL-type platform like Google’s.

If you find yourself using these patterns please drop me a line and let me know how it’s working for you.

Reinventing the wheel: A new content store for Khan Academy

Over the past two years, I've been working largely behind the scenes at Khan Academy on the infrastructure the content team uses to upload and publish content (videos, exercises, articles, and interactive programs) to the site. Most of the changes I've made over the past year are not directly visible to users of the site but without them we could not produce the quality and quantity of lessons we need to provide a "world-class education for anyone, anywhere". One of our strengths as a company is knowing when to hack something together and when to invest in flexible and extensible systems, and I would like to share the solution that we've come up with in case others find it useful.

< context >


Creative solutions for cramped spaces

When I first arrived at Khan Academy's humble office (devs huddled around one long table, with the implementations team occupying the single conference room), the content situation was this: playlists of videos were downloaded directly from Sal Khan's YouTube account, and there was a single editor for the exercises on the knowledge map, changes to which would show up on the live site immediately upon saving. These entities (Video, VideoPlaylist, and Exercise) were the basis for everything on the site. There was no versioning, no organization, and no direct editing - if Sal wanted to fix a typo in a video title he had to do it in YouTube, and only he had access.

Fast forward a year. We have content from a half dozen content creators teaching math, science, and art history. We've added the concept of a topic tree to organize the content, which any developer can edit, and a conceptually simple versioning system: edits are made to an "editing" copy of the tree, and when the version is published, a full copy of it is made that becomes the new editing version. All runtime code instantly switches to using the newly "live" version via a global setting. The system works fairly well, and we are able to build new functionality on top of it: topic pages, power mode for exercises in a topic, and tutorial navigation.

However, last fall it was already becoming clear that the system just wouldn't scale. As we brought on more content creators, coordination became more and more of an issue: hitting publish at the wrong time could push out someone else's in-progress changes, and there was no way to see who was currently making edits or who had edited something in the past. As a stopgap measure, a HipChat room was set up to coordinate publishes. As the number of topics in the topic tree grew, publish times ballooned to an unreasonable 45 minutes (owing partly to the need to duplicate all the topic entities and update their child keys), during which time no editing could happen. Rolling content back was a difficult, manual process. Furthermore, many errors were caught only at publish time, allowing one author's simple oversight to block others from getting their changes out.

< solution >


Experienced developers tend to prefer incremental improvements over rewriting from scratch, but in this case after some discussion we decided to re-architect the system to one that could fulfill not only our current needs, but our aspirations. We want the best and the brightest teachers to share their knowledge on our platform, whether it's Andover teaching calculus or the Getty and MoMA inspiring a generation of art students. Having to coordinate between creators is a bottleneck, and having to wait an hour for content to appear on the site is untenable. At the same time, our growing dev team is adding features at an ever-increasing rate, and they need something stable and flexible to build on.

When looking at various CMS storage and versioning designs, I tried to keep the primary user in mind. Since the infrastructure should always be invisible to content creators, the primary users are in fact the developers who are building features on top of it. When it comes to versioning and deployment, developers are used to code versioning systems, so I opted to start with a design based entirely on Git, a popular distributed revision control system.

In the context of content management on App Engine, the Git model has distinct advantages:

  1. Git's storage model is basically a key value store, with the SHA-1 hash of the data as the key. This maps really well to App Engine's datastore API.
  2. Git stores references to a snapshot of each object on each commit, so we never have to apply a diff or walk the history to see how something looked at a given point in time, but the hash-based reference structure means we don't have to duplicate objects that haven't changed between commits.
  3. Using hashes as keys means that changes made on one machine can be easily merged with changes from another, which is critical in a distributed environment. For instance, adding a new object generates a random key that cannot collide with any other new object.
  4. By design, it is easy to pull and merge changes between copies of the repository. This makes operations such as syncing a development copy of the site with production as easy as copying over any new keys and setting the head commit pointer.
  5. Also by design, calculating a diff between any two commits is easy - just compare hashes for each object. This means that publish can incrementally update only objects that have changed, speeding up the process considerably. This works equally well for rolling back to an earlier version.
  6. The Git content storage filesystem model is really simple to understand and implement.

I didn't copy Git's design wholesale, nor did I actually expose a Git-compatible API (although it would be really cool to someday be able to check out our content as a repository; it would give us access to a whole bunch of useful tools). However I did find that having a fully working design to crib from made implementation much easier and helped me explain the inner workings to other developers.

So far I've been very happy with the way this system has functioned. It is flexible, so we can implement various versioning or permissions schemes on top of it. Different content types use this system in slightly different ways, and that’s fine. Building helpful tools such as diff viewers and remote syncing on top of this infrastructure is really easy, and we could get as fancy as we want to support branching & merging, pull requests from third parties, etc. Most importantly, developers other than me can jump in and create their own versioned entities without a lot of help from me, and get their code working in a very short time, eliminating a dependency on me when implementing new features.

This new architecture also enabled me to reach the goal I had set for myself: publishes now complete in about a minute. This has had a profound impact on how it feels to author content for the site.


If you’re curious about the details of the implementation, I will go into sordid detail in a companion blog post.