Friday, January 3, 2014

Implementation: A new content store for Khan Academy

This article provides more implementation details for our versioned content store. For the motivating challenges and the overall design, see the companion blog post.

The truth is, once I had a rough design for a Git-like content store, the main challenges were all implementation details and attention to backward compatibility (I had to add the new versioning system incrementally, without any downtime in the editing tools or - God forbid - for site users). The simplicity owed a lot to the Git storage model (storing versions of entities labeled by their SHA-1 hash), which aligns neatly with the way App Engine’s High Replication Datastore likes to store and access data. There were just a few issues to work out, and I’ll list them here.

< architecture >


The simplest way to implement a Git-like store in App Engine is to create a db/ndb Model class with the properties you’d like to store and a method for creating new revisions of that model. Unlike a traditional entity, which is overwritten whenever a change is made, here you create a completely new entity (a “revision”) on every change. This might sound wasteful compared to storing diffs, but the invariant that revisions are immutable simplifies the implementation and makes revisions easy to cache. This is one example where we rely on App Engine’s scalability to make our lives easier: compared to the hundreds of millions of user-generated data items, the number of revision entities is relatively small. If this keeps you up at night, you can always prune orphaned revisions later.
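As a rough sketch (the model name and properties here are illustrative, not our actual schema), such a revision-on-write model might look like this:

import hashlib
import json

from google.appengine.ext import ndb


class VideoRevision(ndb.Model):
    # The stable content ID shared by every revision of one video.
    content_id = ndb.StringProperty()
    title = ndb.StringProperty()
    youtube_id = ndb.StringProperty()

    def with_changes(self, **changes):
        """Create, store, and return a new immutable revision."""
        fields = self.to_dict()
        fields.update(changes)
        # The key ID is the SHA-1 hash of the JSON-encoded fields.
        rev_id = hashlib.sha1(
            json.dumps(fields, sort_keys=True).encode('utf-8')).hexdigest()
        revision = VideoRevision(id=rev_id, **fields)
        revision.put()
        return revision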

One decision we made fairly early on was to keep the editing models (revisions) separate from the live models that the site uses. The primary reason was that we had live entities already (Video and Exercise), and finding all the places where we fetch them by key would have been an onerous and error-prone task. This choice turned out to have some other advantages as well. So the inheritance tree ended up as follows.

BaseVideo is a plain Python class that stores the common DB properties and methods shared by the editing version (VideoRevision) and the run-time published version (Video). Common functionality for working with live content and revisions lives in VersionedContent and BaseRevision, respectively. In our case, we could not make BaseVideo a PolyModel (that would have changed the kind and therefore invalidated all the existing keys), so we introduced a metaclass that lets the subclasses share DB properties. This also lets us add properties to VersionedContent and BaseRevision that are inherited by their subclasses. I will use Video/VideoRevision as my examples from now on, but everything stated applies equally to our other content types.
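In code, the hierarchy is roughly this shape (a sketch that elides the real DB properties and the property-sharing metaclass):

from google.appengine.ext import ndb


class BaseVideo(object):
    """Plain Python class: DB properties and methods shared by both
    the editing and the published variants."""


class VersionedContent(ndb.Model):
    """Common functionality for live, published content types."""


class BaseRevision(ndb.Model):
    """Common functionality for editing revisions."""


class Video(VersionedContent, BaseVideo):
    """The single live entity per video; its key is stable."""


class VideoRevision(BaseRevision, BaseVideo):
    """One immutable entity per revision of a video."""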

As in Git, the (key) ID of the VideoRevision is a hex-encoded SHA-1 hash of the contents of the object, which in this case is a JSON representation of the entity’s fields. There is also a string property which references the Video’s key ID, so we can track history and keep references to an object across revisions. When we create a VideoRevision for a new video, we generate a random ID, ensuring that a new Video entity will be created at publish time. Note that there may be many VideoRevision entities for a single video (tracking historical changes) but there is only ever one Video entity, preserving the current ability of published entities to reference each other by datastore key. The Video also contains the key ID of its latest VideoRevision that has been published; that is, the fields of both should be the same.

This means that to find the latest revision of an object, we need a table of content ID → revision ID. This is the “stage” (or sandbox), which represents the current editing state.

So, to make an edit to an object (see the sketch after this list):
  1. Look up the latest edit version of the object in the stage (this is a get-by-key, which is very efficient)
  2. Apply the requested changes
  3. Compute a new revision ID from the updated properties
  4. Create the new revision entity with the revision ID as its key ID and put it into the datastore
  5. Update the stage to point to the new revision ID
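Concretely, the edit flow might look like this (a sketch reusing the VideoRevision model above; Stage is a hypothetical singleton entity holding the content ID → revision ID table):

import json

from google.appengine.ext import ndb


class Stage(ndb.Model):
    # The current editing state: content_id -> revision_id, as JSON.
    table_json = ndb.TextProperty(default='{}')


def edit_video(content_id, **changes):
    stage = Stage.get_or_insert('stage')
    table = json.loads(stage.table_json)

    # 1. Look up the latest revision in the stage (a cheap get-by-key).
    current = VideoRevision.get_by_id(table[content_id])

    # 2-4. Apply the changes, compute the new SHA-1 revision ID, and
    # store the new immutable revision (see with_changes above).
    new_revision = current.with_changes(**changes)

    # 5. Point the stage at the new revision.
    table[content_id] = new_revision.key.id()
    stage.table_json = json.dumps(table)
    stage.put()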
Once the content author is done making changes, they can create a commit, which is just a snapshot of the stage at a particular moment in time, freezing the revision IDs to specific values. The commit contains an author ID and commit message, and it references the previous “head” commit, forming a chain of changes that can be used to recover the entire history of the content. The commit becomes the new “head” commit and is automatically queued up to be published to the site.
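A commit, then, needs little more than a frozen copy of the stage’s table plus some metadata. In sketch form (get_head_commit_id and set_head_commit_id are hypothetical helpers for wherever the head pointer is stored, e.g. a global settings entity):

import hashlib

from google.appengine.ext import ndb


class Commit(ndb.Model):
    table_json = ndb.TextProperty()   # frozen content_id -> revision_id
    author_id = ndb.StringProperty()
    message = ndb.TextProperty()
    parent_id = ndb.StringProperty()  # the previous "head" commit
    created = ndb.DateTimeProperty(auto_now_add=True)


def commit_stage(author_id, message):
    stage = Stage.get_or_insert('stage')
    parent_id = get_head_commit_id()
    # As in Git, key the commit by a hash of what it contains.
    commit_id = hashlib.sha1(
        (stage.table_json + message + (parent_id or ''))
        .encode('utf-8')).hexdigest()
    commit = Commit(id=commit_id, table_json=stage.table_json,
                    author_id=author_id, message=message,
                    parent_id=parent_id)
    commit.put()
    set_head_commit_id(commit_id)  # this commit becomes the new head
    return commit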

[Diagram: the whole setup after a commit]

[Diagram: the setup after a second commit, in which three of the four entities have changed]


Having the snapshot and commits be tables of revision IDs means that doing a diff is very efficient: just look for entries that differ between the two tables, fetch only those revisions, and diff their properties. This makes it easy to recompute values or invalidate caches based on just the content that has changed without having to compare old and new properties for every piece of content.
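The diff helper itself is only a few lines (a sketch over the decoded tables):

def changed_content_ids(table_a, table_b):
    """Content IDs whose revision ID differs between two commit tables."""
    return [cid for cid in set(table_a) | set(table_b)
            if table_a.get(cid) != table_b.get(cid)]

For example, with the (shortened) tables of two successive commits, only the entity that actually changed comes back:

changed_content_ids({'v1': 'f3a9', 'v2': '07bd'},
                    {'v1': '21aa', 'v2': '07bd'})  # ==> ['v1']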

< publishing >


At its core, publishing a commit to the site involves identifying which content revisions have changed (by comparing the revision IDs in the commits’ tables), fetching those revisions, and copying their fields onto the corresponding live entities, creating any entities that don’t already exist. Then the “currently published commit” setting is updated, which instantaneously changes the version of the entities that all the user-facing handlers look at when rendering the site, and you’re done. The process works equally well for rolling back to an earlier commit.
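In sketch form, reusing the diff helper above and assuming Video defines the same properties as VideoRevision (set_published_commit_id is a hypothetical helper for the global setting):

import json

from google.appengine.ext import ndb


def publish(new_commit, live_commit):
    new_table = json.loads(new_commit.table_json)
    live_table = json.loads(live_commit.table_json)

    # Fetch only the revisions that changed since the last publish
    # (deletion handling is elided here).
    changed = changed_content_ids(new_table, live_table)
    revisions = ndb.get_multi(
        [ndb.Key(VideoRevision, new_table[cid])
         for cid in changed if cid in new_table])

    for revision in revisions:
        # Create the live entity if it doesn't exist yet, then copy
        # the revision's fields onto it. Denormalized properties
        # (e.g. a canonical URL) would be computed and saved here too.
        video = (Video.get_by_id(revision.content_id)
                 or Video(id=revision.content_id))
        fields = revision.to_dict()
        fields.pop('content_id', None)
        video.populate(**fields)
        video.put()

    # Flip the global pointer; all user-facing handlers switch to the
    # new commit instantly.
    set_published_commit_id(new_commit.key.id())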

Publishing is also a great place for denormalizing data. Since Video and VideoRevision are separate models, we can add properties to just Video (such as the canonical URL) that are calculated and saved on the entity during publish. We can also pre-warm some caches that are invalidated on each publish, so that users never see a cache miss.

Separating live entities from editing entities does add some complexity to the system, but after publish we can now reference a Video by its key (which is stable) or we can run datastore queries on its indexed fields, neither of which we could have done with just the revisions.

< sync / merge >


Because of the simplicity of the versioning system, if I want to import the latest copy of the topic tree into my local dev datastore, all I need to do is (sketched in code after the list):
  1. Download the latest commit from the live site
  2. Make a list of the revision entities that I don't have
  3. Download the revisions in bulk and push them directly into my datastore
  4. Set the downloaded commit as the “head” commit
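In sketch form, reusing the models above (fetch_json and fetch_revisions are hypothetical HTTP helpers; the transport details are elided):

import json

from google.appengine.ext import ndb


def sync_from_live():
    # 1. Download the latest commit from the live site.
    head = fetch_json('https://<live-site>/api/head_commit')
    table = head['table']  # content_id -> revision_id

    # 2. Make a list of the revision entities we don't have locally.
    keys = [ndb.Key(VideoRevision, rev_id) for rev_id in table.values()]
    missing = [k for k, e in zip(keys, ndb.get_multi(keys)) if e is None]

    # 3. Download those revisions in bulk, straight into the datastore.
    ndb.put_multi(fetch_revisions([k.id() for k in missing]))

    # 4. Record the downloaded commit as the new head.
    Commit(id=head['id'], table_json=json.dumps(table),
           author_id=head['author_id'], message=head['message'],
           parent_id=head['parent_id']).put()
    set_head_commit_id(head['id'])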
From there I can do a normal publish and everything should behave identically to the way it does on live. If I make local changes, I can run the same process on the live site to pull the changes back up.

It is possible that the commit that has been synced is not a descendant of the local head commit (there have been changes both locally and remotely since the last sync). In this case we can create a “merge” commit, which finds the common ancestor and performs an automatic three-way merge between the two heads. The algorithm is trivial when no entity has been modified in both branches, and even when one has, a field-by-field merge is usually possible, so this covers the vast majority of cases. This allows us to copy all our content to a new datastore, make some changes, publish and preview them, and then merge those changes back to the live site automatically.
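The table-level part of that merge is short (a sketch; merge_fields, the field-by-field fallback for entities modified on both sides, is left as a hypothetical helper):

def merge_tables(base, ours, theirs):
    """Three-way merge of content_id -> revision_id tables, where
    base is the common ancestor commit's table and ours/theirs are
    the tables of the two head commits."""
    merged = {}
    for cid in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(cid), ours.get(cid), theirs.get(cid)
        if o == b:              # untouched (or deleted) on our side
            winner = t
        elif t in (b, o):       # untouched on their side, or same change
            winner = o
        else:
            # Modified on both sides: merge the two revisions field by
            # field (not shown here).
            winner = merge_fields(cid, b, o, t)
        if winner is not None:  # None means the entity was deleted
            merged[cid] = winner
    return merged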

< tl;dr >


I wanted to go into some detail to give you an idea of the design decisions and trade-offs we made on the road to having a flexible, extensible versioning system that works efficiently on App Engine. Some of these decisions are specific to our particular situation, but a lot of these ideas could be generally useful in building a CMS on a NoSQL-type platform like Google’s.

If you find yourself using these patterns please drop me a line and let me know how it’s working for you.

Reinventing the wheel: A new content store for Khan Academy

Over the past two years, I've been working largely behind the scenes at Khan Academy on the infrastructure the content team uses to upload and publish content (videos, exercises, articles, and interactive programs) to the site. Most of the changes I've made over the past year are not directly visible to users of the site, but without them we could not produce the quality and quantity of lessons we need to provide a "world-class education for anyone, anywhere". One of our strengths as a company is knowing when to hack something together and when to invest in flexible and extensible systems, and I would like to share the solution we've come up with in case others find it useful.

< context >


[Photo: Creative solutions for cramped spaces]
When I first arrived at Khan Academy's humble office (devs huddled around one long table, with the implementations team occupying the single conference room), the content situation was this: playlists of videos were downloaded directly from Sal Khan's YouTube account, and there was a single editor for the exercises on the knowledge map, whose changes would show up on the live site immediately upon saving. These entities (Video, VideoPlaylist, and Exercise) were the basis for everything on the site. There was no versioning, no organization, and no direct editing - if Sal wanted to fix a typo in a video title, he had to do it in YouTube, and only he had access.

Fast forward a year. We have content from a half dozen content creators teaching math, science, and art history. We've added the concept of a topic tree, which any developer can edit, to organize the content, along with a conceptually simple versioning system: edits are made to an "editing" copy of the tree, and when the version is published, a full copy of it is made that becomes the new editing version. All runtime code instantly switches to the newly "live" version via a global setting. The system works fairly well, and we are able to build new functionality on top of it: topic pages, power mode for exercises in a topic, and tutorial navigation.

However, last fall it was already becoming clear that the system just wouldn't scale. As we brought on more content creators, coordination became more and more of an issue: hitting publish at the wrong time could push out someone else's in-progress changes, and there was no way to see who was currently making edits or who had edited something in the past. As a stopgap measure, a HipChat room was set up to coordinate publishes. As the number of topics in the topic tree grew, publish times ballooned to an unreasonable 45 minutes (owing partly to the need to duplicate all the topic entities and update their child keys), during which time no editing could happen. Rolling content back was a difficult, manual process. Furthermore, many errors were caught only at publish time, allowing one author’s simple oversight to block others from getting their changes out.

< solution >


Experienced developers tend to prefer incremental improvements over rewriting from scratch, but in this case, after some discussion, we decided to re-architect the system into one that could fulfill not only our current needs but our aspirations. We want the best and the brightest teachers to share their knowledge on our platform, whether it's Andover teaching calculus or the Getty and MoMA inspiring a generation of art students. Having to coordinate between creators is a bottleneck, and having to wait an hour for content to appear on the site is untenable. At the same time, our growing dev team is adding features at an ever-increasing rate, and they need something stable and flexible to build on.

When looking at various CMS storage and versioning designs, I tried to keep the primary user in mind. Since the infrastructure should always be invisible to content creators, the primary users are in fact the developers who are building features on top of it. When it comes to versioning and deployment, developers are used to code versioning systems, so I opted to start with a design based entirely on Git, a popular distributed revision control system.

In the context of content management on App Engine, the Git model has distinct advantages:

  1. Git's storage model is basically a key-value store, with the SHA-1 hash of the data as the key. This maps really well to App Engine's datastore API.
  2. Git stores references to a snapshot of each object in each commit, so we never have to apply a diff or walk the history to see how something looked at a given point in time, but the hash-based reference structure means we don't have to duplicate objects that haven't changed between commits.
  3. Using hashes as keys means that changes made on one machine can be easily merged with changes from another, which is critical in a distributed environment. For instance, adding a new object generates a random key that cannot collide with any other new object.
  4. By design, it is easy to pull and merge changes between copies of the repository. This makes operations such as syncing a development copy of the site with production as easy as copying over any new keys and setting the head commit pointer.
  5. Also by design, calculating a diff between any two commits is easy - just compare hashes for each object. This means that publish can incrementally update only objects that have changed, speeding up the process considerably. This works equally well for rolling back to an earlier version.
  6. Git's content-storage model is really simple to understand and implement.

I didn't copy Git's design wholesale, nor did I actually expose a Git-compatible API (although it would be really cool to someday be able to check out our content as a repository; it would give us access to a whole bunch of useful tools). However I did find that having a fully working design to crib from made implementation much easier and helped me explain the inner workings to other developers.

So far I've been very happy with the way this system has functioned. It is flexible, so we can implement various versioning or permissions schemes on top of it. Different content types use this system in slightly different ways, and that’s fine. Building helpful tools such as diff viewers and remote syncing on top of this infrastructure is easy, and we could get as fancy as we want in supporting branching & merging, pull requests from third parties, and so on. Most importantly, other developers can jump in and create their own versioned entities and get their code working in a very short time without much help, eliminating a dependency on me when implementing new features.

This new architecture also enabled me to reach the goal I had set for myself: publishes now complete in about a minute. This has had a profound impact on how it feels to author content for the site.


If you’re curious about the details of the implementation, I will go into sordid detail in a companion blog post.