Why i don’t love schemas

Once, long ago, i thought that schemas were great and that they could provide much value. Schemas hold the promise of declaratively and unambiguously defining message formats. Many improvements would be “easy” once such a definition was available: automated low-code validations, automagically generated high quality, verifiably correct documentation, a clearer understanding of inputs, etc. I tried fiercely to create good schemas, using ever better schema languages (RelaxNG compact syntax, anyone?) and ever more “sophisticated” tools to capture the potential benefits of schemas.

At every turn i found the returns on investment disappointing. Schema languages are hard for humans to write and even harder to interpret. They are usually unable to express many real world constraints (you can only have that radio with this trim package, you can only have that engine if you are not in California, etc). High quality documentation is super hard even with a schema. Generation from a schema solves the easy part of documentation; explaining why the reader should care is the hard part. Optimizing a non-dominant factor in a process doesn’t actually move the needle all that much. Automated low-code validations often turned out to hamper evolution far more than they caught real errors.

I wasn’t alone in my disappointment. The whole industry noticed that XML, with its schemas and general complexity, was holding us back. Schemas ossified API clients and servers to the point that evolving systems was challenging, bordering on impossible. Understanding the implications of all the code generated from schemas was unrealistic for mere mortals. Schemas became so complex that it was impossible to generate human interpretable documentation from them. Instead people just passed around megs of XSD files and called it “documentation”.

JSON emerged primarily as a reaction to the complexity of XML. However, XML didn’t start out complex; it accreted complexity over time. Unsurprisingly, the cycle is repeating. JSON Schema is a thing and seems to be gaining popularity. It is probably a fool’s errand to try stemming the tide but i’m going to try anyway.

doomed to watch everyone else repeat history

What would it take for schemas to be net positive? Only a couple of really hard things.

Humans first

Schema languages should be human readable first, and computer readable second. Modern parser generators make purpose built languages easy to implement. A human centered language turns schemas from something that is only understandable by an anointed few into a powerful, empowering tool for the entire industry. RelaxNG compact syntax (mentioned above) is a good place to look for inspiration. The XML community never adopted a human readable syntax and it contributed to the ultimate failure of XML as a technology. Hopefully we in the JSON community can do better.

Avoid the One True Schema cul de sac

This one is more of a cultural and educational issue than a technical one. This principle hinges on two realizations:

  1. A message is only valuable if at least one consumer can achieve some goal with that message.
  2. Consumers want to maximize the number of goals achieved.

However, consumers don’t necessarily have the same goals as each other. In fact, any two consumers are likely to have highly divergent goals. Therefore, they are likely to have highly divergent needs in messages. A message with which one consumer can achieve a goal may be useless to another. Therefore, designing a schema to be used by more than one consumer is an infinite-variable optimization problem. Your task is to minimize the difference between the set of valid messages and the set of actually processable messages for every possible consumer! (See Appendix A for more details.) A losing proposition if there ever was one.

To mitigate this, schema languages should provide first class support for “personalizing” existing schemas. A consumer should be able to a) declare that it only cares about some subset of the properties defined in a producer’s schema and b) that it will accept and use properties not declared in a producer’s schema. This would allow consumers to finely tune their individual schemas to their specific needs. This would improve evolvability by reducing incidental coupling, increase the clarity of documentation by hiding all irrelevant parts of the producer’s schema, and improve automated validation by ignoring irrelevant parts of messages.
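As a rough sketch of what such a personalized, consumer-side schema could look like with today’s tools, here is a minimal example using Python and the jsonschema library. The property names and URIs are hypothetical; the point is that the consumer’s schema names only the properties it actually uses and explicitly tolerates anything else the producer may send.

# A consumer-side schema that validates only what this consumer needs
# from an invoice message and ignores everything else.
from jsonschema import validate

consumer_schema = {
    "type": "object",
    # Only these two properties matter to this (hypothetical) consumer.
    "required": ["purchase_date", "customer_uri"],
    "properties": {
        "purchase_date": {"type": "string"},
        "customer_uri": {"type": "string"},
    },
    # Undeclared producer properties are acceptable, not errors.
    "additionalProperties": True,
}

message = {
    "purchase_date": "2012-10-29T04:00Z",
    "customer_uri": "http://example.com/custs/42",
    "billing_address_uri": "http://example.com/addrs/24",  # irrelevant here, ignored
}

validate(instance=message, schema=consumer_schema)  # raises only if the message is unusable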

We as a community should also educate people in the dangers of The One True Schema pattern.

Conclusion

Designing schemas for humans and avoiding the One True Schema are both really hard. And, unfortunately, i doubt our ability to reach consensus and execute on them. Given that, i think most message producers and basically all message consumers are better off avoiding schemas for anything other than documents.


Appendix A: Wherein i get sorta mathy about why consumers shouldn’t share schemas unless they have the same goals

I don’t know if this will help other people but it did help me clarify my thinking on this issue.

M = set of all messages

mC = {m | m ∈ M, m contains info needed by consumer C}

A particular consumer, C, needs messages to contain certain information to achieve its goal.

mV = {m | m ∈ M, m is valid against schema S}

For any particular schema there is some subset of all messages that are valid.

mC = lim mV as S->perfectSchemaFor(C)

As the schema approaches perfection for consumer C the set of valid messages approaches the set of messages actually processable by the consumer.

mC ≠ mV
mC ⊄ mV
mC ⊅ mV

In practice, however, there is always some mis-match between the set of valid messages and the set of messages actually processable by the consumer. Some technically invalid messages contain enough information for the consumer to achieve its goal. Some technically valid messages will contain insufficient information. The mis-match may be due to bugs, a lack of expressiveness in the schema language or just poor design. The job of a schema designer is to minimize the mis-match.

Now consider a second consumer of these messages.

mD = {m | m ∈ M, m contains info needed by consumer D}

A particular consumer, D, needs messages to contain certain information to achieve its goal.

mC ≠ mD (in general)

The information needed by consumer D will, in general, be different from the information needed by consumer C. Therefore, the set of messages processable by C will, in general, not equal the set of messages processable by D.

perfectSchemaFor(C) ≠ perfectSchemaFor(D)

This is the kicker. The perfect schema for consumer C is, in general, different from the perfect schema for any other consumer. Minimizing the difference between mV and mC will tend to increase the difference between mV and mD.

Would someone please think of the client developers?!?

It seems that most APIs — particularly internal ones — are not designed for ease of use but rather to be easy to implement. No one would expect a human facing product designed that way to be successful. We should not expect APIs to be any different.

Web APIs are products in their own right. That means all those rules for building great products, like understanding your users and their use cases, apply. APIs are not just high latency, bandwidth hogging database connections. Rather an API should expose an application and the business value it provides. This means understanding what clients want to accomplish and then affording those uses in easy, intuitive ways.

Communication with users is the key to designing a great API. As with other types of products, it is often necessary to build the first version of an API before there are any developers using it. We are on shaky ground until our design is validated by actual clients. As soon as there are actual, or even potential, client developers, listening to them and integrating their feedback should be priority number one.

Listening doesn’t mean reflexively implementing every whim of users — users are not always right about the details — but by understanding what they are trying to accomplish we as API designers can build systems that afford those goals with a minimum of effort on the part of client developers. Facilitating that value creation should be our main goal as API designers.

Embedding

Designing the messages (or representations, i’ll use the terms interchangeably) is the most important part of API design. If you get the messages right, everything else will flow naturally from them. Of course, there are trade offs that must be made when designing messages. One of those trade offs is how much data to put in each message. If messages are too small, clients must make too many calls to be performant. If they are too big, generating, transferring and parsing them will be excessively slow.

Any entity worth discussing should have a URI of its very own. That is, it should be a resource. This means that we often (read: almost always) end up with a lot of resources that don’t really have much data directly. The usual pattern is that they have a small number of properties and then link to a bunch of other resources. For example, consider an invoice: a few properties like purchase date, etc., and then links to the customer, billing address, shipping address, and line items. The line items would, in turn, link to a product.

We often bulk up the representations of these lightweight resources by directly embedding representations of the other resources to which they link. This tends to reduce the number of requests needed because those embedded representations don’t need to be requested explicitly. This approach has substantial downsides, at least if implemented naively. Consider the following representation of an invoice with embedded representations.

 
{"purchase_date" : "2012-10-29T4:00Z",
 "customer"      :     
   {"uri" : "http://example.com/custs/42",
    "name": "Peter Williams",
    //...
   },
 "billing_address" :     
   {"uri"   : "http://example.com/addrs/24",
    "line1" : "123 Main St",
    //...
   },
 // etc, etc
 "line_items" :
   [{"uri"     : "http://example.com/li/84",
     "quantity": 3,
     "product" :         
       {"uri" : "...",
        "name": "Blue widget",
        "desc": "..."
        }
    },
    // other line items here
   ] 
} 

This approach is very appealing. All the data needed to display or operate on an invoice is right there at our fingertips, which nicely manages the number of requests that need to be made. The data is also arranged in a logical way that makes sense to our human brains.

For all of its upsides, the downsides to this approach are substantial. The biggest issue, to my mind, is that it limits our ability to evolve this message over time. By directly embedding the line item and product data, for example, we are signalling that they are fundamentally part of this representation. Clients will implement code assuming those embedded resources are always there. That means we can never remove them without breaking clients.

There are many reasons we might want to remove those embedded representations. We might start seeing invoices with a lot of line items, thereby resulting in excessively large messages. We might add a lot of properties to products and make the messages too large that way. We might move products to a different database and find that looking them all up takes too long. These are just a few of the innumerable reasons that we might want to change our minds about embedding.

How small is too small?

Given that removing a property from a representation is a breaking change, are there ways to design representations that reduce the possibility that we will need to remove properties in the future? The only real way is to make representations as small as possible. We will never need to remove a property that was never added in the first place. We already discussed how messages that are too small can result in excessive numbers of requests, but is that really true?

Applying the yagni principle is in order when thinking about embedding. Embedding is easy to do and very super extremely hard to undo. It should be avoided until it is absolutely necessary. We will know it is absolutely necessary when, and only when, we have empirical evidence showing that now is the time. This will happen quite rarely in practice. Even when we have empirical evidence that our request volume is too high, solutions other than embedding are usually a better choice. Caching, in particular, can ameliorate most of the load problems we are likely to encounter. The fastest way to get a representation is not to embed it into another message that is passed over the wire but to fetch it out of a local cache and avoid the network altogether.

Embedding one representation inside another is an optimization. Be sure it is not premature before proceeding.

sometimes – not often, but sometimes – i like the idea of embedding

Annoyingly, sometimes optimizations really are required. In those situations where we have clear empirical evidence that the current approach produces too many requests, we have already implemented caching, and we cannot think of another way to solve the problem, embedding can be useful. Even in these situations embedding should not be done hierarchically as in the example above. Rather we should sequester the embedded representations off to the side so that it is clear to clients that they are an optimization. If we can signal that clients should not assume they will always be embedded, all the better.

The following is an example of how this might be accomplished using our previous example.

 
{"purchase_date"       : "2012-10-29T4:00Z",
 "customer_uri"        : "http://example.com/custs/42",
 "billing_address_uri" : "http://example.com/addrs/24",
 "shipping_address_uri": "http://example.com/addrs/24",
 "line_item_uris"      :
   ["http://example.com/li/84",
    "http://example.com/li/85"],
 "embedded":
   [{"uri" : "http://example.com/custs/42",
     "name": "Peter Williams",
     //...
    },
    {"uri"   : "http://example.com/addrs/24",
     "line1" : "123 Main St",
     //...
    },
    {"uri"     : "http://example.com/li/84",
     "quantity": 3,
     "product_uri" : "http://example.com/prods/12"
    },
    {"uri" : "http://example.com/prods/12",
     "name": "Blue widget",
     "desc": "..."
    },
    // and so on and so forth
   ] 
} 

The _uri and _uris properties are links. A client looks for the relationship it needs and then checks the embedded section for a representation with the required uri. If it finds one, a network communication has been avoided; if not, it can make a request to get the needed data. This approach clearly identifies representations that are embedded as an optimization and makes it easy for clients to avoid relying on that optimization to behave correctly.
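To make the lookup concrete, here is a minimal sketch in Python of the client-side behavior described above, using the structure from the example and the requests library; the function name is made up.

# Resolve a link: prefer an embedded copy, fall back to a network request.
import requests

def resolve(invoice, uri):
    for rep in invoice.get("embedded", []):
        if rep.get("uri") == uri:
            return rep  # embedded copy found, no network round trip needed
    return requests.get(uri).json()  # not embedded, fetch it explicitly

# Usage: follow the customer link from an invoice representation.
# customer = resolve(invoice, invoice["customer_uri"])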

This flat embedding is the approach taken by both HAL and Collection+JSON (albeit with some slightly different nuances). I suspect that the developers of both of those formats have experienced first hand the pains of having representations get too big but not being able to easily reduce their size without breaking clients. If one of those formats works for you, use it; they have already solved a lot of these problems.

Other considerations

Avoiding hierarchical embedding also makes documenting your representations easier. With the sidecar style you can keep each representation to a bare minimum size and only have to document one “profile” of representation for each flavor of resource you have. With this approach there is no difference between the representation of a customer when it is embedded vs when it is the root representation.

HTML is domain specific

The partisans of generic media types sometimes hold up HTML as an example of how much can be accomplished without domain specific media types. HTML doesn’t have application/business specific semantics and the whole human facing web uses it, so machine clients should be able to use a generic media type too. There is just one flaw with this logic. HTML is domain specific in the extreme. HTML provides strong semantics for defining document oriented user interfaces. There is nothing generic about HTML.

In the HTML ecosystem, the generic format is SGML. Nobody uses SGML out of the box because it is too generic. Instead, various SGML applications, such as HTML, are created with the appropriate domain semantics to be useful. HTML would not have been very successful if it had just defined links via the a element (which is all you need to have hypermedia semantics) and left it up to individual web sites to define what various other elements meant.

The programs we use on the WWW almost exclusively use the strongly domain specific semantics of HTML. Browsers, for example, render HTML to the screen based on the specified semantics. We have screen readers which adapt HTML — which is fundamentally visually oriented — for use by the visually impaired. We have search engines which analyze link patterns and human readable text to provide good indexing. We have super smart browsers which can often fill in forms for us. They can do these things because of the clear, domain specific semantics of HTML.

Programs don’t, generally, try to drive the human facing web to accomplish specific application/business goals because the business semantics are hidden in the prose, lists and labels. Anyone who has tried is familiar with the fragility of web scraping. These semantics, and therefore any capabilities based on them, are unavailable to machine clients of the HTML based web because the media type does not specify those semantics. Media types which target machine clients should bear this in mind.

Push or pull?

The question of how to communicate events often comes up when designing APIs and multi-component systems. The question basically boils down to this: should events be pushed to interested parties as they occur, or should interested parties poll for new data?

The short answer: interested parties should poll for new data.

The longer answer is, of course, it depends.

Polling is the only approach that scales to internet levels. For the smaller scales of internal multi-component systems the answer is much less clear cut. It is clear that a push approach can be implemented in such environments using either web hooks or XMPP. Such approaches often appear to be simpler and more efficient than the pull equivalents, and they are definitely lower latency.

The appearance of simplicity is an illusion, unfortunately. Event propagation using a push is only easy if you are willing to give up a lot of reliability and predictability. It is easy to say “when an event occurs i will just POST it to the registered URI(s)”. That would be easy, but the world is rarely that simple. What if the receiving server is down or unreachable? Are you going to retry? If so, how many times? If not, is that level of message loss acceptable to all the interested parties? If the receiving system is very slow, will that cause a back-log in the sending system? If a lot of events happen in a very short period of time, can the receiving system handle the load?
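To make those questions concrete, here is a hedged sketch (in Python, with the requests library) of what even a “simple” web hook push ends up needing. Every constant in it, the timeout, the retry count, the back-off, is a hypothetical policy choice, and each one is exactly the kind of reliability decision asked about above.

# Push-based event delivery with a naive retry policy.
import time
import requests

def deliver_event(event, subscriber_uri, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            resp = requests.post(subscriber_uri, json=event, timeout=5)
            if resp.status_code < 300:
                return True  # delivered
        except requests.RequestException:
            pass  # subscriber down or unreachable
        time.sleep(2 ** attempt)  # back-off; blocking here can back up the sender
    return False  # giving up: is this level of message loss acceptable?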

The efficiency benefits of a push approach are real, but not nearly as significant as they first appear. HTTP’s conditional request mechanism provides, when used effectively, a way to reduce the cost of polling to quite low levels.
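As a rough sketch of what cheap polling can look like, here is a conditional request loop in Python, assuming the API returns an ETag header; the feed URI, handler, and polling interval are all hypothetical. A 304 Not Modified response carries no body, so polls that find nothing new cost very little.

# Low-cost polling using HTTP conditional requests (If-None-Match / ETag).
import time
import requests

feed_uri = "http://example.com/events"  # hypothetical event feed

def process_events(events):
    for event in events:  # hypothetical handler for newly observed events
        print(event)

etag = None
while True:
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(feed_uri, headers=headers)
    if resp.status_code != 304:          # 304 means nothing new and no body
        etag = resp.headers.get("ETag")  # remember the validator for next time
        process_events(resp.json())
    time.sleep(30)  # polling interval is a latency vs. efficiency trade off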

Pull is cool

APIs should be built around pulling data unless there is a particular functional concern that makes pull not work (e.g. low message latency being very important). Any push approach will have all the complexities of a pull approach (to handle reliability issues) combined with much less predictable behavior, because its performance will depend on one or more other systems’ ability to handle the event notification workload.