thunk juice

all quiet on us-west-1

marshmallow-select: alpha junkware to crash your servers


This is to announce the publication of my package marshmallow-select, formerly part of the Distribute corporate repo, but since moved to marshmallow-code.

As the sqlalchemy docs cogently point out, the biggest problem people usually have with ORM frameworks is the accidental introduction of N+1 query bugs. These are

…a common side effect of the lazy load pattern, whereby an application wishes to iterate through a related attribute or collection on each member of a result set of objects, where that attribute or collection is set to be loaded via the lazy load pattern. The net result is that a SELECT statement is emitted to load the initial result set of parent objects; then, as the application iterates through each member, an additional SELECT statement is emitted for each member in order to load the related attribute or collection for that member. The end result is that for a result set of N parent objects, there will be N + 1 SELECT statements emitted.

The solution offered by sqlalchemy is to declare relationships as 'raiseload'. This means that while the relationship in question is not loaded by default, if it is loaded at all, it must be loaded eagerly: that is, via a join. Failing to eagerly load the relationship raises an error. Within the confines of sqlalchemy itself, I agree that there really isn’t any better solution. So without wishing to cast aspersions on the illustrious authors of sqlalchemy, I still say it’s an extremely unsatisfying solution for two reasons.

First, N+1 query problems are only really problematic for large N. When you fetch the list of all User records for display on your webapp frontend, each one firing off an extra query (especially sychronously at serialization time) is a big problem. When you fetch a single User object to display a more detailed version, such as a profile view, the fact that you might execute 3 or 4 queries instead of 1 is not the end of the world. But your user detail view suddenly crashing for everyone in production as a side effect of your attempts to optimize the user list view might be somewhat more serious. This can be avoid with good unit and integration testing, if you do them. We all have limited time budgets we can’t afford to fritter away on extra unit tests (or open-source software libraries, or blog posts). And if I had a nickel for every mediocre unit test I’ve seen which didn’t actually catch the issue it was supposed to catch, I’d have, I dunno, busfare maybe? Still, SF MUNI’s actually not cheap.

Second, it not only requires extra risk, but extra work fiddling with query parameters. You have to go through and manually set all the relationship options to joinload every time you want to fetch them. This sucks. Can we do better?

I think so. If you’re like me (except, statistically speaking, less handsome and charming), you’re probably fetching these objects as part of a web application with the ultimate goal of serializing them and sending them so some kind of frontend single-page app. You probably also realize that actually serializing the objects by hand is a terrible idea,[1] and a much better idea is to use a library like marshmallow (or possibly marshmallow-sqlalchemy) to do the work for you. Such a library is used to define a schema for (among other things) controlling the serialization of your objects.

You probably also realize that since various views want various different subsets of the total fields of an object, it makes sense to define pared-down schemas for a particular view (possibly via marshmallow’s fields or exclude metaclass options). These pared-down subschemas define a subset of the possible fields to be fetched for an object, and represent exactly what you want to send back. So what if you told sqlalchemy to fetch exactly what was needed to populate a given schema?

This is the idea of marshmallow-select, so called because it uses the marshmallow schema to generate (ultimately, via sqlalchemy) the body of the SQL SELECT clause. Rather than manually modifying the query options to exclude certain unneeded fields or eagerly load certain related objects, marshmallow-select will use your schema to walk the query tree, pulling in everything you intend to use and defer-ing or noload-ing the rest.

But don’t take my word for it. Get yourself a copy and try it out. NOTE: The management is not responsible for any lost data, production downtime, accidental death, etc. resulting from the use of our software, although we do take a kind of perverse pride in it.


  1. The guy who built the backend api server I’ve taken over did not realize this, but never mind. ↩︎

Posted May 25, 2017