Really unique UUIDs

Recently we encountered a problem with duplicate time UUIDs while loading a lot of data into Cassandra. Duplicates are not normally a problem with UUIDs, but occasionally you need to generate time UUIDs from a low resolution clock and/or load a lot of data really fast. In these situations you can overwhelm the ability of “correct” (to the spec) implementations of time UUID generation to create truly unique IDs.

The problem is that version 1 UUIDs use a lot of their bits to store the MAC address, leaving only enough space for a 100 nanosecond resolution timestamp and a small sequence number. That works fine if a bunch of different machines are generating UUIDs fairly slowly, but if you have a small number of machines generating them as fast as possible it is just not good enough.

The solution we came up with was to replace the MAC address and clock sequence in the UUID with a large random number. Doing so adds a lot of entropy to the ID while maintaining its time UUID structure. Building UUIDs requires some bit twiddling but is not difficult.

The exact structure of a time UUID is described in RFC 4122. Creating a UUID in Scala (and Java) requires two longs, the most significant bits and the least significant bits. The most significant bits contain the timestamp and a version identifier. The least significant bits usually contain the MAC address and a clock sequence, but we want something a bit more unique.

new UUID(timeAndVersionFields(time), randomClockSeqAndNodeFields)

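timeAndVersionFields constructs the time and version fields of the UUID from the timestamp; we just have to rearrange the bits of the time a little. Note that RFC 4122 timestamps count 100 nanosecond intervals since 15 October 1582, so a time in milliseconds since the Unix epoch has to be converted to that scale before it is passed in: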

def timeAndVersionFields(time: Long) = {
  var msb: Long = 0L
  msb |= (0x00000000ffffffffL & time) << 32
  msb |= (0x0000ffff00000000L & time) >>> 16
  msb |= (0x0fff000000000000L & time) >>> 48
  msb |= 0x0000000000001000L // identify as a version 1 uuid
  msb
}

That is the same as every other time UUID generator. For the least significant bits we need to do things a little differently.

val rand = new Random() // one shared generator; a def here would build a new Random on every call

def randomClockSeqAndNodeFields = {
  var lsb: Long = 0
  lsb |= 0x8000000000000000L // variant (2 bits)
  lsb |= ( rand.synchronized { rand.nextLong } & 0x3FFFFFFFFFFFFFFFL)
  lsb
}

randomClockSeqAndNodeFields generates a big random number and uses that for the fields that normally contain the MAC address and clock sequence. This provides a great deal of protection against duplicate UUIDs even when generating many time UUIDs very quickly on a small number of machines.
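Putting the pieces together, here is a minimal self-contained sketch of the whole generator. The object name, the uuidTimestamp helper and the use of SecureRandom are my assumptions for illustration, not part of the original code; the offset constant is the number of 100 nanosecond intervals between the UUID epoch (15 October 1582) and the Unix epoch.

```scala
import java.security.SecureRandom
import java.util.UUID

// Hypothetical wrapper object; the names here are illustrative only.
object RandomNodeTimeUuid {
  // 100 ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
  private val UuidEpochOffset = 0x01B21DD213814000L

  // Assumption: SecureRandom swapped in for Random; one shared instance.
  private val rand = new SecureRandom()

  // Convert Unix milliseconds to the 60 bit timestamp RFC 4122 expects.
  def uuidTimestamp(unixMillis: Long): Long =
    unixMillis * 10000L + UuidEpochOffset

  def timeAndVersionFields(time: Long): Long = {
    var msb = 0L
    msb |= (0x00000000ffffffffL & time) << 32  // time_low
    msb |= (0x0000ffff00000000L & time) >>> 16 // time_mid
    msb |= (0x0fff000000000000L & time) >>> 48 // time_hi
    msb |= 0x0000000000001000L                 // version 1
    msb
  }

  // Variant bits "10" followed by 62 random bits in place of
  // the clock sequence and MAC address.
  def randomClockSeqAndNodeFields: Long =
    0x8000000000000000L | (rand.nextLong() & 0x3FFFFFFFFFFFFFFFL)

  def fromUnixMillis(unixMillis: Long): UUID =
    new UUID(timeAndVersionFields(uuidTimestamp(unixMillis)),
             randomClockSeqAndNodeFields)
}
```

Two calls to RandomNodeTimeUuid.fromUnixMillis with the same millisecond still differ in their low 62 bits, which is the whole point of the exercise.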

The data modeling training at #CassandraSummit validated most of our choices. Not sure if that makes me happy or sad.

Fun Scala fact #173

Tail recursion optimization combined with implicit functions makes non-obvious infinite loops both possible and actually infinite. Who needs the crutch of stack overflow exceptions?

Good advice for the semantic web community

Ruben Verborgh has a piece on how to move the semantic web forward.

[…] if you hosted a Web application, would you offer (even read-only) direct SQL access to your database? Of course you wouldn’t; this would pose a serious threat to the stability of your server. And, it’s not needed: you design your HTTP interface such that all data can be easily accessed—but you decide how!

We’ve been thinking of such HTTP interfaces that are handy to query Linked Data datasets. So the server still decides how clients access data—just like on the Web for humans or applications—but this time in RDF. We designed one such interface that consists of basic Linked Data Fragments, which offer triple-pattern-based access to a dataset. Servers can easily generate such fragments, and clients can use them to solve more complex queries themselves. So simple servers, smart clients.

Preach it, brother!

Announcing HalClient (for ruby)

HalClient is yet another ruby client library for HAL based web APIs. The goal is to provide an easy to use set of abstractions on top of HAL without completely hiding the HAL based API underneath. The areas of complication that HalClient seeks to simplify are

  • CURIE links
  • regular vs embedded links
  • templated links
  • working with RFC6573 collections

Unlike many other ruby HAL libraries HalClient does not attempt to abstract HAL away in favor of domain objects. Domain objects are great but HalClient leaves that to the application code.

CURIEs

CURIEd links are often misunderstood by new users of HAL. Dealing with them is not hard but it requires care to do correctly. Failure to implement CURIE support correctly will result in future breakage as services make minor syntactic changes to how they encode links. HalClient’s approach is to treat CURIEs as a purely over-the-wire encoding choice. Looking up links in HalClient is always done using the full link relation. This insulates clients from future changes by the server to the namespaces in the HAL representations.
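To make the idea concrete, here is a small sketch of expanding a CURIE'd rel into its full link relation. It is not HalClient's implementation; the Curies object and its expand function are hypothetical names, and the only thing taken from HAL is the convention that a curie template carries a {rel} token.

```scala
// Hypothetical helper, not part of HalClient.
object Curies {
  // `curies` maps a prefix (e.g. "ex") to its URI template
  // (e.g. "http://example.com/rels/{rel}").
  def expand(rel: String, curies: Map[String, String]): String =
    rel.split(":", 2) match {
      case Array(prefix, reference) if curies.contains(prefix) =>
        curies(prefix).replace("{rel}", reference)
      case _ =>
        rel // already a full relation, or a prefix we know nothing about
    }
}
```

A client that only ever compares expanded relations keeps working if the server renames the prefix from ex to something else tomorrow.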

From the client perspective there is very little difference between embedded resources and remote links. The only difference is that dereferencing a remote link will take a lot longer. Servers are allowed to move links from the _links section to the _embedded section with impunity. Servers are also allowed to put half of the targets of a particular rel in the _links section and the other half in the _embedded section. These choices are all semantically equivalent and therefore should not affect clients' ability to function.

HalClient facilitates this by providing a single way to navigate links. The #related(rel) method provides a set of representations for all the resources linked via the specified relationship, regardless of the section in which the links are specified. Clients don’t have to worry about the details or what idiosyncratic choices the server may be making today.
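The principle is easy to sketch outside of Ruby. The following is an illustration only, not HalClient's code: Link, Resource, related and the fetch function are all hypothetical names, and a real client would parse JSON rather than build case classes by hand.

```scala
// Hypothetical model of a HAL resource; not HalClient's API.
object HalNavigation {
  final case class Link(href: String)
  final case class Resource(
    href: String,
    links: Map[String, List[Link]] = Map.empty,       // the _links section
    embedded: Map[String, List[Resource]] = Map.empty // the _embedded section
  )

  // Everything related via `rel`, whichever section it lives in.
  // Embedded representations are used as-is; plain links are
  // dereferenced with the caller-supplied `fetch`.
  def related(res: Resource, rel: String, fetch: String => Resource): List[Resource] =
    res.embedded.getOrElse(rel, Nil) ++
      res.links.getOrElse(rel, Nil).map(link => fetch(link.href))
}
```

The caller gets the same list whether the server embedded all, some or none of the targets; only the number of network requests changes.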

Templated links are a powerful feature of HAL but they can be a little challenging to work with in a uniform way. HalClient’s philosophy is that the template itself is rarely of interest. Therefore the #related method takes, as a second argument, a set of options with which to expand the template. The resulting full URL is used to instantiate a new representation. This removes the burden of template management from the client and allows clients to treat templated links very similarly to normal links.

RFC6573 collections

Collections are a part of almost every application. HalClient provides built in support for collections implemented using the standard item, next, prev link relationships. The result is a Ruby Enumerable that can be used just like your favorite collections. The collection is lazily evaluated so it can be used even for very large collections.
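For the curious, lazy evaluation over next links looks roughly like the sketch below. This is not how HalClient is written; Page, items and fetch are hypothetical names, and a page is reduced to its item targets plus an optional next URL.

```scala
// Hypothetical page model; not HalClient's API.
object HalCollections {
  final case class Page(items: List[String], next: Option[String])

  // Iterate over every item in the collection, fetching the page behind
  // each `next` link only when the current page has been used up.
  def items(firstUrl: String, fetch: String => Page): Iterator[String] =
    new Iterator[String] {
      private var nextUrl: Option[String] = Some(firstUrl)
      private var current: Iterator[String] = Iterator.empty

      def hasNext: Boolean = {
        // Skip over empty pages until an item or the end is found.
        while (!current.hasNext && nextUrl.isDefined) {
          val page = fetch(nextUrl.get)
          current = page.items.iterator
          nextUrl = page.next
        }
        current.hasNext
      }

      def next(): String =
        if (hasNext) current.next()
        else throw new NoSuchElementException("no more items")
    }
}
```

Taking the first few items touches only the first page, which is why this style copes with very large collections.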

Conclusion

If you are using HAL based web APIs I strongly encourage you to use a client library of some sort. The amount of resilience you will gain, and the amount of minutiae you will save yourself from, will be well worth it. The Ruby community has a nice suite of HAL libraries whose level of abstraction ranges from ActiveRecord style ORM to very thin veneers over JSON parsing. HalClient tries to be somewhere in the middle, exposing the API as designed while providing helpful, commonly needed functionality to make the client application simpler and easier to implement and understand.

HalClient is under active development so expect to see even more functionality over time. Feedback and pull requests are, of course, greatly desired. We’d love to have your help and insight.

Rails vs Node.js

That title got your attention, didn’t it? While trying to make this decision myself recently I really wished some agile manifesto style value statements existed for these two platforms. Now that I have my first production deploy of a Node.js app I’m going to give it a stab.

The Ruby on Rails community prefers:
  • Speed of development over runtime performance
  • Clarity of intent over clarity of implementation
  • Ease of getting started over ease of personal library choice
  • Freedom to customize over ease of debugging
The Express/Node.js community prefers:
  • Runtime scalability over speed of development
  • Less code over covering every use case
  • Freedom of library choice over ease of getting started

I am not suggesting that these communities don’t care about the things on the right, just that they care more about the things on the left. When faced with a tradeoff between the two values they will most often optimize for the value on the left. Of course not everyone in these communities will agree with these values and this is a great thing. A loyal opposition is invaluable because it keeps a community from going off the deep end. It is also possible that I am wrong and there is not even general consensus about some of these. There is a particular risk of that with the Express/Node.js principles as I am quite new to that community.

Some of these values spring, I think, from the respective languages used by the platforms. For example, clarity of intent and speed of development are strong values of the Ruby language and that mentality has made its way into Rails also. On the JavaScript side freedom and simple base constructs are strong values of the language. Is it that the language we are writing in influences how we think or that people who prefer certain values choose a language that reflects their values? One supposes that once the linguists settle that whole linguistic relativity thing we might have an answer to this question.

For what it is worth we chose to use Node.js. We had a problem domain almost perfectly suited for Node.js (IO bound and limited business logic on the server side) and we wanted to try out something new. I think the latter argument was actually the more powerful for our team.

Would someone please think of the client developers?!?

It seems that most APIs — particularly internal ones — are not designed for ease of use but rather to be easy to implement. No one would expect a human facing product designed that way to be successful. We should not expect APIs to be any different.

Web APIs are products in their own right. That means all those rules for building great products, like understanding your users and their use cases, apply. APIs are not just high latency, bandwidth hogging database connections. Rather an API should expose an application and the business value it provides. This means understanding what clients want to accomplish and then affording those uses in easy, intuitive ways.

Communication with the users is the key to designing a great API. As with other types of products, it is often necessary to build the first version of an API before there are any developers using it. We are on shaky ground until our design is validated by actual clients. As soon as there are actual, or even potential, client developers, listening to them and integrating their feedback should be priority number one.

Listening doesn’t mean reflexively implementing every whim of users — users are not always right about the details — but by understanding what they are trying to accomplish we as API designers can build systems that afford those goals with a minimum of effort on the part of client developers. Facilitating that value creation should be our main goal as API designers.

Zero dot versions

Dear library developers, please knock that shit off immediately.

We all seem to accept the wisdom of semantic versioning these days (thank goodness). Somehow, though, it has not occurred to many library developers that locking the first slot of the version to 0 means you give up all those benefits. Incrementing the first slot is how clients are informed of incompatible changes. If you never change the first slot you necessarily stop communicating this information.

If you have released your library to the rest of the world it should not have a ‘0.’ version. Period. If you think most people probably should not be using your library, add a pre-release tag to the version. If you want to tell the world that the API is likely to change, use the docs/readme; that is why it exists. Or you could skip telling us altogether because everybody already knows the API is likely to change. That is why we came up with semantic versioning in the first place.
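Here is the information loss expressed as a sketch. The SemVer object, the Version class and promisesCompatibility are hypothetical names of mine; the rule they encode is the one from the semantic versioning spec, where a major version of zero means anything may change at any time.

```scala
// Hypothetical helper for illustration; pre-release tags are ignored.
object SemVer {
  final case class Version(major: Int, minor: Int, patch: Int)

  // Does the version scheme promise that code written against `from`
  // keeps working with `to`?
  def promisesCompatibility(from: Version, to: Version): Boolean =
    from.major > 0 &&            // 0.y.z promises nothing at all
      from.major == to.major &&  // a major bump announces breakage
      (to.minor > from.minor ||
        (to.minor == from.minor && to.patch >= from.patch))
}
```

Going from 1.2.3 to 1.3.0 carries a promise; going from 0.2.3 to 0.3.0 carries none, no matter how careful the author actually was.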

Embedding

Designing the messages (or representations, I’ll use the terms interchangeably) is the most important part of API design. If you get the messages right everything else will flow naturally from them. Of course, there are trade offs that must be made when designing messages. One of those trade offs is how much data to put in each message. If they are too small, clients must make too many calls to be performant. If they are too big, generating, transferring and parsing the messages will be excessively slow.

Any entity worth discussing should have a URI of its very own. That is, it should be a resource. This means that we often (read: almost always) end up with a lot of resources that don’t really have much data directly. The usual pattern is that they have a small number of properties and then link to a bunch of other resources. For example, consider an invoice: a few properties like purchase date, etc. and then links to the customer, billing address, shipping address, and line items. The line items would, in turn, link to a product.

We often bulk up the representations of these lightweight resources by directly embedding representations of the other resources to which they link. This tends to reduce the number of requests needed because those embedded representations don’t need to be requested explicitly. This approach has substantial downsides, at least if implemented naively. Consider the following representation of an invoice with embedded representations.

 
{"purchase_date" : "2012-10-29T04:00Z",
 "customer"      :     
   {"uri" : "http://example.com/custs/42",
    "name": "Peter Williams",
    //...
   },
 "billing_address" :     
   {"uri"   : "http://example.com/addrs/24",
    "line1" : "123 Main St",
    //...
   },
 // etc, etc
 "line_items" :
   [{"uri"     : "http://example.com/li/84",
     "quantity": 3,
     "product" :         
       {"uri" : "...",
        "name": "Blue widget",
        "desc": "..."
       }
    },
    // other line items here
   ] 
} 

This approach is very appealing. All the data needed to display or operate on an invoice is right there at our fingertips, which nicely manages the number of requests that need to be made. The data is also arranged in a logical way that makes sense to our human brains.

For all of its upsides, the downsides to this approach are substantial. The biggest issue, to my mind, is that it limits our ability to evolve this message over time. By directly embedding the line item and product data, for example, we are signalling that they are fundamentally part of this representation. Clients will implement code assuming those embedded resources are always there. That means we can never remove them without breaking clients.

There are many reasons we might want to remove those embedded representations. We might start seeing invoices with a lot of line items, thereby resulting in excessively large messages. We might add a lot of properties to products and make the messages too large that way. We might move products to a different database and find that looking them all up takes too long. These are just a few of the innumerable reasons that we might want to change our minds about embedding.

How small is too small?

Given that removing a property from a representation is a breaking change, are there ways to design representations that reduce the possibility that we will need to remove properties in the future? The only real way is to make representations as small as possible. We will never need to remove a property that was never added in the first place. We already discussed how messages that are too small can result in excessive numbers of requests, but is that really true?

Applying the YAGNI principle is in order when thinking about embedding. Embedding is easy to do and very super extremely hard to undo. It should be avoided until it is absolutely necessary. We will know it is absolutely necessary when, and only when, we have empirical evidence showing that now is the time. This will happen quite rarely in practice. Even when we have empirical evidence that our request volume is too high, solutions other than embedding are usually a better choice. Caching, in particular, can ameliorate most of the load problems we are likely to encounter. The fastest way to get a representation is not to embed it into another message that is passed over the wire but to fetch it out of a local cache and avoid the network altogether.

Embedding one representation inside another is an optimization. Be sure it is not premature before proceeding.

sometimes – not often, but sometimes – i like the idea of embedding

Annoyingly, sometimes optimizations really are required. In those situations where we have clear empirical evidence that the current approach produces too many requests, we have already implemented caching and we cannot think of another way to solve the problem, embedding can be useful. Even in these situations embedding should not be done hierarchically as in the example above. Rather we should sequester the embedded representations off to the side so that it is clear to clients that they are an optimization. If we can signal that clients should not assume they will always be embedded, all the better.

The following is an example of how this might be accomplished using our previous example.

 
{"purchase_date"       : "2012-10-29T04:00Z",
 "customer_uri"        : "http://example.com/custs/42",
 "billing_address_uri" : "http://example.com/addrs/24",
 "shipping_address_uri": "http://example.com/addrs/24",
 "line_item_uris"      :
   ["http://example.com/li/84",
    "http://example.com/li/85"],
 "embedded":
   [{"uri" : "http://example.com/custs/42",
     "name": "Peter Williams",
     //...
    },
    {"uri"   : "http://example.com/addrs/24",
     "line1" : "123 Main St",
     //...
    },
    {"uri"     : "http://example.com/li/84",
     "quantity": 3,
     "product_uri" : "http://example.com/prods/12"
    },
    {"uri" : "http://example.com/prods/12",
     "name": "Blue widget",
     "desc": "..."
    },
    // and so on and so forth
   ] 
} 

The _uri and _uris properties are links. A client looks for the relationship it needs and then first looks for a representation in the embedded section with the required uri. If it finds one then a network communication has been avoided; if not, it can make a request to get the needed data. This approach clearly identifies representations that are embedded as an optimization and makes it easy for clients to avoid relying on that optimization to behave correctly.
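The client side of that lookup is tiny. Below is a rough sketch under some loud assumptions: representations are flattened to Map[String, String] for brevity, and Sidecar, resolve and fetch are hypothetical names rather than anything from a real library.

```scala
// Hypothetical client-side helper for the sidecar style of embedding.
object Sidecar {
  type Representation = Map[String, String]

  // Look in the embedded section first; fall back to the network
  // (the caller-supplied `fetch`) only when the uri is not there.
  def resolve(uri: String,
              embedded: Seq[Representation],
              fetch: String => Representation): Representation =
    embedded.find(_.get("uri").contains(uri)).getOrElse(fetch(uri))
}
```

Because the fallback is always present, the server can stop embedding a given resource at any time and the client merely gets slower instead of breaking.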

This flat embedding is the approach taken by both HAL and Collection+JSON (albeit with some slightly different nuances). I suspect that the developers of both of those formats have experienced first hand the pains of having representations getting too big but not being able to easily reduce their size without breaking clients. If one of those formats works for you, use it; they have already solved a lot of these problems.

Other considerations

Avoiding hierarchical embedding also makes documenting your representations easier. With the sidecar style you can keep each representation to a bare minimum size and only have to document one “profile” of representation for each flavor of resource you have. With this approach there is no difference between the representation of a customer when it is embedded vs when it is the root representation.