This makes me sad:


iex(10)> Δz = 1
** (SyntaxError) iex:10: unexpected token: "Δ" (column 1, codepoint U+0394)

Really unique UUIDs

Recently we encountered a problem with duplicate time UUIDs while loading a lot of data into Cassandra. Duplicates are not normally a problem with UUIDs, but occasionally you need to generate time UUIDs from a low resolution clock and/or load a lot of data really fast. In these situations you can overwhelm the ability of “correct” (to the spec) implementations of time UUID generation to create truly unique IDs.

The problem is that version 1 UUIDs use a lot of their bits to store the MAC address, leaving only enough space for a 100 nanosecond resolution timestamp and a small sequence number. That works great if a bunch of different machines are generating UUIDs fairly slowly, but if you have a small number of machines generating them as fast as possible it is just not good enough.

The solution we came up with was to replace the MAC address and clock sequence in the UUID with a large random number. Doing so adds a lot of entropy to the ID while maintaining its time UUID structure. Building UUIDs requires some bit twiddling but is not difficult.

The exact structure of a time UUID is described in RFC 4122. Creating a UUID in Scala (and Java) requires two longs: the most significant bits and the least significant bits. The most significant bits contain the timestamp and a version identifier. The least significant bits usually contain the MAC address and a clock sequence, but we want something a bit more unique.

new UUID(timeAndVersionFields(time), randomClockSeqAndNodeFields)

timeAndVersionFields constructs the time and version fields of the UUID from the milliseconds since the Unix epoch; we just have to rearrange the bits of the time a little:

def timeAndVersionFields(time: Long) = {
  var msb: Long = 0L
  msb |= (0x00000000ffffffffL & time) << 32  // low 32 bits of the time -> time_low
  msb |= (0x0000ffff00000000L & time) >>> 16 // next 16 bits -> time_mid
  msb |= (0x0fff000000000000L & time) >>> 48 // high 12 bits -> time_hi
  msb |= 0x0000000000001000L // identify as a version 1 uuid
  msb
}

That is the same as every other time UUID generator. For the least significant bits we need to do things a little differently.

val rand = new Random() // a val, not a def, so every caller shares (and synchronizes on) one Random

def randomClockSeqAndNodeFields = {
  var lsb: Long = 0
  lsb |= 0x8000000000000000L // variant (2 bits)
  lsb |= ( rand.synchronized { rand.nextLong } & 0x3FFFFFFFFFFFFFFFL) // 62 random bits
  lsb
}

randomClockSeqAndNodeFields generates a big random number and uses it for the fields that normally contain the MAC address and clock sequence. This provides a great deal of protection against duplicate UUIDs, even when generating many time UUIDs very quickly on a small number of machines.
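Putting the pieces together is just a matter of handing the two longs to java.util.UUID. This is a self-contained sketch of that; the `randomTimeUuid` wrapper name is introduced here for illustration and is not part of the original post:

```scala
import java.util.{Random, UUID}

val rand = new Random() // shared so all callers synchronize on the same instance

def timeAndVersionFields(time: Long): Long = {
  var msb = 0L
  msb |= (0x00000000ffffffffL & time) << 32  // time_low
  msb |= (0x0000ffff00000000L & time) >>> 16 // time_mid
  msb |= (0x0fff000000000000L & time) >>> 48 // time_hi
  msb |= 0x0000000000001000L                 // version 1
  msb
}

def randomClockSeqAndNodeFields: Long = {
  var lsb = 0L
  lsb |= 0x8000000000000000L // RFC 4122 variant bits "10"
  lsb |= rand.synchronized { rand.nextLong } & 0x3fffffffffffffffL // 62 random bits
  lsb
}

// Hypothetical wrapper: time is milliseconds since the Unix epoch, as in the post.
def randomTimeUuid(timeMillis: Long): UUID =
  new UUID(timeAndVersionFields(timeMillis), randomClockSeqAndNodeFields)

val u = randomTimeUuid(System.currentTimeMillis)
assert(u.version == 1) // parses as a version 1 (time) UUID
assert(u.variant == 2) // IETF variant
```

Two IDs generated in the same millisecond still differ (with overwhelming probability) because of the 62 random bits in the least significant half.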

The data modeling training at #CassandraSummit validated most of our choices. Not sure if that makes me happy or sad.

Fun Scala fact #173

Tail recursion optimization combined with implicit functions makes non-obvious infinite loops both possible and actually infinite. Who needs the crutch of stack overflow exceptions?
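Roughly what that looks like (a sketch; the names are invented). `@tailrec` makes the compiler guarantee the self-call is compiled to a loop, and the implicit conversion is what hides the loop at the call site:

```scala
import scala.annotation.tailrec
import scala.language.implicitConversions

// Deep self-recursion that would normally blow the stack...
@tailrec
def spin(n: Long): Long = if (n == 0) n else spin(n - 1)

spin(10000000L) // ten million calls' worth of recursion, no StackOverflowError

// Now make such a function an implicit conversion:
@tailrec
implicit def intToString(i: Int): String = intToString(i)

// val s: String = 42 // uncommenting this hangs forever: the compiler quietly
//                    // inserts intToString(42), which loops instead of overflowing
```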

Good advice for the semantic web community

Ruben Verborgh has a piece on how to move the semantic web forward.

[…] if you hosted a Web application, would you offer (even read-only) direct SQL access to your database? Of course you wouldn’t; this would pose a serious threat to the stability of your server. And, it’s not needed: you design your HTTP interface such that all data can be easily accessed—but you decide how!

We’ve been thinking of such HTTP interfaces that are handy to query Linked Data datasets. So the server still decides how clients access data—just like on the Web for humans or applications—but this time in RDF. We designed one such interface that consists of basic Linked Data Fragments, which offer triple-pattern-based access to a dataset. Servers can easily generate such fragments, and clients can use them to solve more complex queries themselves. So simple servers, smart clients.

Preach it, brother!

Announcing HalClient (for ruby)

HalClient is yet another Ruby client library for HAL based web APIs. The goal is to provide an easy to use set of abstractions on top of HAL without completely hiding the HAL based API underneath. The areas of complication that HalClient seeks to simplify are:

  • CURIE links
  • regular vs embedded links
  • templated links
  • working with RFC 6573 collections

Unlike many other Ruby HAL libraries, HalClient does not attempt to abstract HAL away in favor of domain objects. Domain objects are great, but HalClient leaves that to the application code.

CURIEs

CURIEd links are often misunderstood by new users of HAL. Dealing with them is not hard but it requires care to do correctly. Failure to implement CURIE support correctly will result in future breakage as services make minor syntactic changes to how they encode links. HalClient’s approach is to treat CURIEs as a purely over-the-wire encoding choice. Looking up links in HalClient is always done using the full link relation. This insulates clients from future changes by the server to the namespaces in the HAL representations.
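For example, a HAL document might declare a CURIE prefix and use it in a link relation name (the `ex` prefix and the URLs here are invented for illustration):

```json
{
  "_links": {
    "curies": [
      { "name": "ex", "href": "http://example.com/rels/{rel}", "templated": true }
    ],
    "ex:comments": { "href": "/posts/1/comments" }
  }
}
```

HalClient always looks this link up by the expanded relation, http://example.com/rels/comments, so the client keeps working even if the server renames or drops the ex prefix tomorrow.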

Regular vs embedded links

From the client perspective there is very little difference between embedded resources and remote links. The only difference is that dereferencing a remote link will take a lot longer. Servers are allowed to move links from the _links section to the _embedded section with impunity. Servers are also allowed to put half of the targets of a particular rel in the _links section and the other half in the _embedded section. These choices are all semantically equivalent and therefore should not affect clients' ability to function.

HalClient facilitates this by providing a single way to navigate links. The #related(rel) method provides a set of representations for all the resources linked via the specified relationship, regardless of which section the links are specified in. Clients don't have to worry about the details or what idiosyncratic choices the server may be making today.

Templated links

Templated links are a powerful feature of HAL but they can be a little challenging to work with in a uniform way. HalClient's philosophy is that the template itself is rarely of interest. Therefore the #related method takes, as a second argument, a set of options with which to expand the template. The resulting full URL is used to instantiate a new representation. This removes the burden of template management from the client and allows clients to treat templated links very similarly to normal links.

RFC6573 collections

Collections are a part of almost every application. HalClient provides built-in support for collections implemented using the standard item, next, and prev link relationships. The result is a Ruby Enumerable that can be used just like your favorite collections. The collection is lazily evaluated so it can be used even for very large collections.

Conclusion

If you are using HAL based web APIs I strongly encourage you to use a client library of some sort. The amount of resilience you will gain, and the amount of minutiae you will save yourself from, will be well worth it. The Ruby community has a nice suite of HAL libraries whose level of abstraction ranges from ActiveRecord style ORM to very thin veneers over JSON parsing. HalClient tries to be somewhere in the middle, exposing the API as designed while providing helpful, commonly needed functionality that makes client applications simpler and easier to implement and understand.

HalClient is under active development so expect to see even more functionality over time. Feedback and pull requests are, of course, greatly desired. We’d love to have your help and insight.

Rails vs Node.js

That title got your attention, didn't it? While trying to make this decision myself recently I really wished some agile manifesto style value statements existed for these two platforms. Now that I have my first production deploy of a Node.js app, I'm going to give it a stab.

The Ruby on Rails community prefers:
  • Speed of development over runtime performance
  • Clarity of intent over clarity of implementation
  • Ease of getting started over ease of personal library choice
  • Freedom to customize over ease of debugging
The Express/Node.js community prefers:
  • Runtime scalability over speed of development
  • Less code over covering every use case
  • Freedom of library choice over ease of getting started

I am not suggesting that these communities don't care about the things on the right, just that they care more about the things on the left. When faced with a tradeoff between the two values they will most often optimize for the value on the left. Of course not everyone in these communities will agree with these values, and this is a great thing. A loyal opposition is invaluable because it keeps the community from going off the deep end. It is also possible that I am wrong and there is not even general consensus about some of these. There is a particular risk of that with the Express/Node.js principles as I am quite new to that community.

Some of these values spring, I think, from the respective languages used by the platforms. For example, clarity of intent and speed of development are strong values of the Ruby language, and that mentality has made its way into Rails also. On the JavaScript side, freedom and simple base constructs are strong values of the language. Is it that the language we are writing in influences how we think, or that people who prefer certain values choose a language that reflects their values? One supposes that once the linguists settle that whole linguistic relativity thing we might have an answer to this question.

For what it is worth we chose to use Node.js. We had a problem domain almost perfectly suited for Node.js (IO bound and limited business logic on the server side) and we wanted to try out something new. I think the latter argument was actually the more powerful for our team.

Would someone please think of the client developers?!?

It seems that most APIs — particularly internal ones — are not designed for ease of use but rather to be easy to implement. No one would expect a human facing product designed that way to be successful. We should not expect APIs to be any different.

Web APIs are products in their own right. That means all those rules for building great products, like understanding your users and their use cases, apply. APIs are not just high latency, bandwidth hogging database connections. Rather an API should expose an application and the business value it provides. This means understanding what clients want to accomplish and then affording those uses in easy, intuitive ways.

Communication with users is the key to designing a great API. As with other types of products, it is often necessary to build the first version of an API before there are any developers using it. We are on shaky ground until our design is validated by actual clients. As soon as there are actual, or even potential, client developers, listening to and integrating their feedback should be priority number one.

Listening doesn’t mean reflexively implementing every whim of users — users are not always right about the details — but by understanding what they are trying to accomplish we as API designers can build systems that afford those goals with a minimum of effort on the part of client developers. Facilitating that value creation should be our main goal as API designers.

Zero dot versions

Dear library developers, please knock that shit off immediately.

We all seem to accept the wisdom of semantic versioning these days (thank goodness). Somehow, though, it has not occurred to many library developers that locking the first slot of the version to 0 means you give up all those benefits. Incrementing the first slot is how clients are informed of incompatible changes. If you never change the first slot you necessarily stop communicating this information.

If you have released your library to the rest of the world it should not have a ‘0.’ version. Period. If you think most people probably should not be using your library, add a pre-release tag to the version. If you want to tell the world that the API is likely to change, use the docs/readme; that is why it exists. Or you could skip telling us altogether, because everybody already knows the API is likely to change. That is why we came up with semantic versioning in the first place.