Announcing HalClient (for ruby)

HalClient is yet another Ruby client library for HAL-based web APIs. The goal is to provide an easy-to-use set of abstractions on top of HAL without completely hiding the HAL-based API underneath. The areas of complication that HalClient seeks to simplify are

  • CURIE links
  • regular vs embedded links
  • templated links
  • working with RFC6573 collections

Unlike many other Ruby HAL libraries, HalClient does not attempt to abstract HAL away in favor of domain objects. Domain objects are great, but HalClient leaves them to the application code.

CURIEs

CURIEd links are often misunderstood by new users of HAL. Dealing with them is not hard but it requires care to do correctly. Failure to implement CURIE support correctly will result in future breakage as services make minor syntactic changes to how they encode links. HalClient’s approach is to treat CURIEs as a purely over-the-wire encoding choice. Looking up links in HalClient is always done using the full link relation. This insulates clients from future changes by the server to the namespaces in the HAL representations.
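The decoding step can be sketched in plain Ruby: expand each CURIEd rel in a _links section into its full link relation, so lookups never depend on the prefix the server happened to choose. This is only an illustration of the idea (HalClient does the equivalent internally); the `expand_curies` helper and the sample document are hypothetical.

```ruby
require 'json'

# Expand CURIEd link relations in a HAL _links section into full link
# relations, so lookups can always use the full relation regardless of
# the prefixes the server chose. A simplified sketch of the idea.
def expand_curies(links)
  curies = (links['curies'] || []).each_with_object({}) do |c, h|
    h[c['name']] = c['href']
  end

  links.each_with_object({}) do |(rel, target), expanded|
    next if rel == 'curies'
    prefix, rest = rel.split(':', 2)
    full_rel = if rest && curies[prefix]
                 curies[prefix].sub('{rel}', rest)
               else
                 rel
               end
    expanded[full_rel] = target
  end
end

links = JSON.parse(<<~HAL)
  { "curies": [{ "name": "ex", "href": "http://example.com/rels/{rel}", "templated": true }],
    "ex:widget": { "href": "/widgets/1" } }
HAL

expand_curies(links)
# => {"http://example.com/rels/widget"=>{"href"=>"/widgets/1"}}
```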

Regular vs embedded links

From the client perspective there is very little difference between embedded resources and remote links. The only difference is that dereferencing a remote link takes a lot longer. Servers are allowed to move links from the _links section to the _embedded section with impunity. Servers are also allowed to put half of the targets of a particular rel in the _links section and the other half in the _embedded section. These choices are all semantically equivalent and therefore should not affect clients’ ability to function.

HalClient facilitates this by providing a single way to navigate links. The #related(rel) method provides a set of representations for all the resources linked via the specified relationship, regardless of which section the links are specified in. Clients don’t have to worry about the details or whatever idiosyncratic choices the server may be making today.
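A rough sketch of that behavior in plain Ruby: collect the targets of a rel from both sections and return them as one collection. The `related` helper and the sample document here are illustrative only; HalClient’s real implementation also resolves hrefs into representations.

```ruby
require 'json'

# Gather every target of a rel from both the _links and _embedded
# sections, treating the two as semantically equivalent. A sketch of
# the idea behind #related, not HalClient's actual code.
def related(doc, rel)
  as_list = ->(x) { x.is_a?(Array) ? x : [x].compact }
  as_list.(doc.fetch('_links', {})[rel]) +
    as_list.(doc.fetch('_embedded', {})[rel])
end

doc = JSON.parse(<<~HAL)
  { "_links":    { "item": [{ "href": "/items/1" }] },
    "_embedded": { "item": [{ "name": "item 2" }] } }
HAL

related(doc, 'item').size  # => 2
```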

Templated links

Templated links are a powerful feature of HAL, but they can be a little challenging to work with in a uniform way. HalClient’s philosophy is that the template itself is rarely of interest. Therefore the #related method takes, as a second argument, a set of options with which to expand the template. The resulting fully expanded URL is used to instantiate a new representation. This removes the burden of template management from the client and allows clients to treat templated links very much like normal links.
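The expansion step can be illustrated with a toy expander for simple {var} templates. Real HAL templates follow RFC 6570, which defines many more operator forms; `expand_template` here is a hypothetical helper, not HalClient’s API.

```ruby
require 'erb'

# Expand a simple {var}-style URI template with a hash of options,
# roughly what happens when #related is given a templated link plus
# expansion options. Only plain {var} substitution is handled.
def expand_template(template, opts)
  template.gsub(/\{(\w+)\}/) { ERB::Util.url_encode(opts.fetch($1.to_sym).to_s) }
end

expand_template('http://example.com/users/{id}/posts/{post_id}', id: 42, post_id: 7)
# => "http://example.com/users/42/posts/7"
```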

RFC6573 collections

Collections are a part of almost every application. HalClient provides built-in support for collections implemented using the standard item, next, and prev link relationships. The result is a Ruby Enumerable that can be used just like your favorite collections. The collection is lazily evaluated, so it can be used even for very large collections.
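The lazy behavior can be illustrated with a small Enumerable that walks pages on demand, fetching each page only when iteration reaches it. This is a sketch of the idea, not HalClient’s implementation; `PagedCollection` is a hypothetical class.

```ruby
# A lazily evaluated Enumerable over a paged, next-linked collection.
# Pages are fetched only as iteration reaches them, so even a very
# large collection can be walked without loading it all up front.
class PagedCollection
  include Enumerable

  def initialize(first_page, &fetch_next)
    @first_page = first_page
    @fetch_next = fetch_next  # returns the next page, or nil at the end
  end

  def each
    page = @first_page
    while page
      page[:items].each { |item| yield item }
      page = @fetch_next.call(page)
    end
  end
end

pages = [{ items: [1, 2], next: 1 }, { items: [3], next: nil }]
coll  = PagedCollection.new(pages[0]) { |page| page[:next] && pages[page[:next]] }

coll.to_a  # => [1, 2, 3]
```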

Conclusion

If you are using HAL-based web APIs I strongly encourage you to use a client library of some sort. The amount of resilience you will gain, and the amount of minutiae you will save yourself from, will be well worth it. The Ruby community has a nice suite of HAL libraries whose level of abstraction ranges from ActiveRecord-style ORM to very thin veneers over JSON parsing. HalClient tries to be somewhere in the middle: exposing the API as designed, while providing helpful, commonly needed functionality that makes the client application simpler and easier to implement and understand.

HalClient is under active development so expect to see even more functionality over time. Feedback and pull requests are, of course, greatly desired. We’d love to have your help and insight.

Resque-multi-step

I’ve been developing with asynchronous jobs quite a bit lately.1 There is only one reason to do work asynchronously: it takes too long to do it synchronously.

Fortunately, it turns out that many of these very large workloads are embarrassingly parallel problems. And look, you have several (dozen) workers just waiting to do your bidding. It just makes sense to break these large blocks of work into many smaller chunks so that they can be processed in parallel.

Breaking a task up into many small parts comes with some issues. Any task that takes long enough to run asynchronously is probably going to take long enough that you need to track its progress.

Also, the problem is probably not completely parallelizable. Most problems seem to have a large portion of easily parallelized work followed by a bit of work that can only happen after all the parallel work is complete.

These patterns show up often enough that I have gotten tired of repeating myself. Hence was born resque-multi-step. Resque-multi-step is a Resque plugin that provides compound job support, complete with progress tracking, error handling, and a completely serial finalization sequence.

Example

Say you want to reindex all the posts in a blog. However, committing Solr after each post would be excessively slow. (Trust me, it really is.)

Resque::Plugins::MultiStepTask.create("reindex-#{blog.name}") do |task|
  blog.posts.each do |post|
    task.add_job ReindexWithoutCommit, post
  end
  
  task.add_finalization_job CommitSolr
end

This reindexes all the posts in parallel. Any available worker will pick up a job to reindex a specific blog post. Once all those reindex jobs have completed, the finalization job is executed.

If you have more than one finalization job, they are executed serially in the order they were added to the task.

Administrivia

If these issues sound familiar, give resque-multi-step a try. It is available as a gem, so installing it is just

gem install resque-multi-step

If you want to contribute, head on over to the GitHub project and hack away. If you come up with something useful, I’ll integrate it posthaste.


  1. resque-fairly was one of the first public outcomes of such work. The fair scheduling it provides is the basis for this effort.

resque-fairly

I have been using Resque quite a bit recently. It is a really nice asynchronous job system based on Redis.

Resque checks the queues for jobs to process in a fixed order (alphabetic order, to be precise). This turns out to be a problem if you want predictable handling times for jobs. For example, consider a system which has queues aaa and zzz. If you add 100 jobs to aaa and 1 job to zzz, the job on zzz will wait a long time before being processed.

This problem is easily solved by checking the queues in random order. Over time, any particular queue will be checked early often enough that a few deep queues will not starve the other queues in the system.
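A quick simulation shows why random ordering is fair: over many reservations, each queue ends up being checked first about equally often. This is just an illustration of the idea, not resque-fairly’s code.

```ruby
# Simulate checking three queues in random order. Over many
# reservations each queue comes up first roughly a third of the time,
# so a few deep queues cannot starve the others.
queues = %w[aaa mmm zzz]

first_checked = Hash.new(0)
10_000.times { first_checked[queues.shuffle.first] += 1 }

# Every queue gets checked first regularly, with high probability.
first_checked.values.all? { |count| count > 2_500 }
```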

resque-fairly is a Resque plugin which provides that behavior. Just install the gem, add require 'resque-fairly', and Resque will handle queues with approximate fairness.

Is ruby immature?

A friend of mine recently described why he feels Ruby is immature. I, of course, disagree with him. There is much in Ruby that could be improved, but the issues he raised are a) intentional design choices or b) weaknesses in specific applications built in Ruby. Neither of those can fairly be described as immaturity in the language, or in the community using the language.

Set

Mr. Jones’ main example regards the Set class. In practice Set is a rarely used class in Ruby. I suspect it exists primarily for historical and completeness reasons. It is rather rare to see idiomatic Ruby that utilizes Set.1

This is possible because Array provides a rather complete implementation of the basic set operations. Rubyists are very accustomed to using arrays, so it is more common to use the set operators on arrays than to convert an array into a set.
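For example, the familiar set operations are right there on Array:

```ruby
# Array supports the basic set operations directly, which is a big
# part of why idiomatic Ruby rarely reaches for Set.
a = [1, 2, 3, 4]
b = [3, 4, 5]

a & b  # => [3, 4]             intersection
a | b  # => [1, 2, 3, 4, 5]    union
a - b  # => [1, 2]             difference
```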

The set operations on Array do not have the same performance characteristics Mr. Jones found with Set. For example,

$ time ruby -rpp -e 'pp (1..10_000_000).to_a & (1..10).to_a'
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

real	0m10.152s
user	0m6.592s
sys	0m3.515s

$ time ruby -rpp -e 'pp (1..10).to_a & (1..10_000_000).to_a'
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

real	0m12.410s
user	0m8.397s
sys	0m3.860s

Order still matters, but very much less. (That is on 1.8.6, the only version I have handy at the moment. I am sure that 1.9, or even 1.8.7, would be quite a bit faster.)

Libraries in low traffic areas don’t get the effort that highly used libraries do, in any language. Even though Set is part of the standard library, it definitely counts as a low traffic area. Hence, it has never been optimized for large numbers of items. This is appropriate because, as we learned from Rob Pike, “n is usually small”. The benefit of handling large sets performantly is not worth the additional complexity for a low traffic library.

nil

In his other example Mr. Jones implies that the fact that nil is a real object is disadvantageous. On this count he is simply incorrect. Having nil be an object allows significant reductions in the number of special cases that must exist. This reduction in special cases often results in less code, and it always results in less cognitive load.

Consider the #try method in Ruby. While not my favorite implementation of this concept, it is still a powerful idiom for removing clutter from code.

#try executes the specified method on the receiver, unless the receiver is nil. When the receiver is nil it does nothing. This allows code to take a best effort approach to performing non-critical operations. For example2,

def remove_email(email)
  emails.find_by_email(email).try(:destroy)
end

This is implemented as follows:

module Kernel
  def try(method, *args, &block)
    send(method, *args, &block)
  end
end

class NilClass
  def try(*args)
    # do nothing
  end
end

You could implement something like #try in a system that has a non-object “no value” mechanism. It would be less elegant and less clear, though. (It would probably be less performant, too, because method calls tend to be optimized rather aggressively.) Having nil be an object like everything else is one less primitive concept that the code and the programmer must keep in mind.

Mr. Jones does bring up the issue of nil.id returning 4 and that value being used as a foreign key in the database. This is not a problem I see very often, but it can happen.

This is definitely not a problem with Ruby. Rather, it results from an unfortunate choice of naming convention in Rails. Rails uses id as the name of the primary key column for database tables. This results in an #id method being created, which overrides the #id provided by Ruby itself for all objects. If Rails had chosen to call the primary key column something that did not conflict with an existing Ruby core method – say pk – we would not be having this discussion.

In general

Mr. Jones asserts that “ruby is rife with happy path coding”. I disagree with his characterization. The Ruby community has a strong bias towards producing working, if incomplete, code and iterating on that code to improve it. This “simplest thing that could work” approach does result in the occasional misstep and suboptimal implementation. In return you get to use a lot of new stuff more quickly, and when there are problems they are easier to fix because the code is simpler.

The Ruby community has strongly embraced the small-pieces-loosely-joined approach. This is only accelerating the innovation in Ruby. Gems have lowered the friction of distributing and installing components to previously unimaginable levels. This has allowed many libraries that would have been too small to be worth releasing in the past to come into existence.

Rack, with its middleware concept, is an example of the Ruby community taking much of the Unix philosophy and turning it up to 11. While Rails has much historical baggage, even it is moving to a much more modular architecture with the upcoming 3.0 release.

Following these principles does result in some rough edges occasionally, but the benefits are worth the trade. The 80% solution is how Unix succeeded. An 80% solution today is better than a 100% solution 3 months from now (as long as you can improve it when needed). We always have releases to get to, after all.


  1. I, on the other hand, use Set rather more than the average Rubyist. Set is a rather performant way of producing collections without duplicate entries.

  2. Shamelessly copied from Chris Wanstrath.

Managing oss contributions with Git and Ruby Gems

Once you start using opensource at your day job you are going to want to improve it. Many improvements are going to be generally useful and should be contributed back to the community. A few of these changes may be quite specific and of no value to the community at large.

Changes that are generally useful should be contributed back to the community. This will help the community and help you. Every change to an opensource project you maintain raises the cost of keeping your modified variant current. Once a change you need is in the mainline project your company no longer has to maintain it alone.

Regardless of the generality of the changes, you are going to want to put them in production quickly. Waiting for a change you’ve made to be integrated and released by the opensource project before putting it in production is probably not an option. Whatever problem caused you to make the change needs solving, and soon. It could take weeks or months before even a good change is merged into the mainline of an opensource project.

Ruby, gems and git

A distributed version control system is key to the approach I use. Git is my preferred tool, but any DVCS would work. GitHub really makes this a lot easier than it would be otherwise.

I work mostly in Ruby, so I am going to describe that workflow. Gems, particularly with the introduction of the new rubygems.org, really lower the bar for releasing Ruby software. The low effort required to do so can really make managing your corporate opensource contributions easier. A similar approach could be made to work for other release mechanisms.

What you will need

Making changes

Before getting started on the actual change there are some setup steps to perform: creating a version of the project specific to your company, and a repo that allows you to publish the changes back to the community. These steps only need to be done once per opensource project.

Create a public repo for your changes

  1. Fork the canonical repo of the opensource project into your Github account.

  2. Clone your fork onto your machine.

    $ git clone {private URI of your repo}

Create a company specific version of the project

  1. Fork the canonical repo of the opensource project into your company’s GitHub account.

  2. Add yourself as a contributor to your company’s repo.

  3. Add a ‘foocorp’ remote to your local repo pointing to your company’s fork of the opensource project.

    $ git remote add foocorp {private URI of your company's repo on GitHub}
  4. Create a ‘foocorp-stable’ branch.

    $ git checkout -b 'foocorp-stable'
  5. On the ‘foocorp-stable’ branch, change the name of the gem to ‘foocorp-projname’.

    $ (edit gemspec or gemspec generator)
    $ git commit -m "Company specific Gem name"
    $ git push foocorp foocorp-stable

You have now created a version of the project whose gem has your company’s name prepended. This will be useful later as a way to release the changes you need before they have been integrated into the opensource project. However, this change exists only on your company’s stable branch. This branch will never be integrated into the opensource project.

Making the change

  1. Create a feature branch for your change in your local repo.

    $ git checkout -b 'super-feature'
  2. Implement the wicked new feature/fix the bug.

    $ (do work)
    $ git commit -m "my feature is super"
  3. Push the feature branch to your GitHub repo.

    $ git push origin super-feature
  4. Push the feature branch to your company’s GitHub repo.

    $ git push foocorp super-feature
  5. Merge your feature branch into the ‘foocorp-stable’ branch.

    $ git checkout foocorp-stable
    $ git merge super-feature
  6. Push the ‘foocorp-stable’ branch to your company’s GitHub repo.

    $ git push foocorp foocorp-stable
  7. Bump the version number as appropriate.

  8. Build the gem from the ‘foocorp-stable’ branch.1

    $ rake build
  9. Push the gem to rubygems.org.

    $ gem push pkg/{gem file}
  10. Change your application to require the ‘foocorp-projname’ gem instead of ‘projname’.

  11. Send a pull request to the opensource project for your feature branch.

The end result is that you have a published gem with the changes you need to support your application. This gem can be installed using the normal gem command. Your new gem boasts a name that keeps it from being confused with the original. And the changes you implemented are available to the opensource project for the benefit of the community at large.

Once your changes have been integrated into the opensource project and released you can revert your application to depend on the canonical variant rather than your custom version.

Q & A

Why use two separate repos on GitHub (yours and your company’s) to manage changes?

Your company’s GitHub account will probably have its email address set up to point to a distribution list. Getting a change integrated into an opensource project can take some back and forth. By default, responses to a pull request go to the email of the account that sent the pull request. This means that your whole team will be getting these emails. As the original author of the change it is your responsibility to shepherd it through the integration, preferably without barraging the rest of your team with emails they don’t care about.

Why create a feature branch?

Because it is a lot easier for maintainers to merge a feature branch containing a limited cohesive set of changes. It is a little bit more of a pain for you but your changes will get integrated faster and more reliably. The opensource maintainers will thank you.

What about non-generic changes?

Follow the same process described above, except don’t send the pull request. If you ever want changes from the mainline opensource project you will need to merge them into your company’s stable branch explicitly. However, this is pretty easy to do with Git.

What if the opensource project adds a change I want before my changes are integrated?

Just merge the opensource project’s release tag (or any commit-ish, for that matter) containing the change you want into your company’s stable branch. You can do this as many times as needed.

Why not modify the version of the gem rather than the name?

You could distinguish your custom gem by appending a suffix to the version, for example ‘1.2.3.foocorp’. However, doing so would prevent you from pushing your gem to rubygems.org because someone else already owns that gem name. It also prevents rational versioning of your gem. The versioning issue is important because you might want to make multiple independent changes to your gem.

Conclusion

Using the technique described above you can very effectively manage changes to opensource projects that are required by your applications. Contributing your changes back does require maintaining and merging multiple code streams. This can be somewhat convoluted at times, but DVCSs allow a much more efficient approach than has ever been possible before.


  1. This example assumes you are using jeweler. If the opensource project is not using jeweler, build the gem in whatever way the project supports.

Cucumber


I have been working pretty extensively with Cucumber for the last couple of weeks. In short, it is killer. You should be using it.

Having just RSpec/unit tests results in a lot of ugly trade-offs between verifying the design and implementation of the parts (or units) vs the system as a whole. Using Cucumber completely absolves RSpec specs and unit tests of any responsibility for proving the system works. That allows you to use RSpec/unit tests as tools to improve the design, and reliability, of individual parts of the system without losing confidence in the overall system’s ability to function acceptably.

If you are using Emacs I highly recommend cucumber.el. It has excellent support for editing Gherkin files, and key bindings to execute scenarios and view the output without ever having to leave the comfort of Emacs.

Configuration files

If you are using a dynamic interpreted language, please do not use YAML1, or any other simple data serialization language, for configuration files.

Strictly speaking configuration is just data, of course, so you can use a data serialization language to represent your application’s, or library’s, configuration. In some environments, say static compiled languages, using a data serialization language for configuration makes a lot of sense. Creating your own custom configuration language from scratch is probably going to be more trouble than it is worth, unless you have really complex configurations to express.

On the other hand, if you have an easily available, highly readable, Turing-strength language, why wouldn’t you use it? If the language your application is written in supports eval, you should probably use the app language to express the configuration. You will end up with a simpler and more powerful configuration system.

Let’s look at a common configuration as an example. In Rails the database connection configuration looks like this.

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  
test:
  adapter: mysql
  database: my_db_test
  username: my_app
  password:

production:
  adapter: mysql
  database: my_db_production
  username: my_app
  password:
  host: db1.my_org.invalid

That is not horrible, but it does include a fair bit of repetition. This approach starts getting ugly when you need more dynamic configurations. For example, RPM-type distros use /tmp/mysql.sock as the domain socket for the MySQL database; Debian-type distros use /var/run/mysqld/mysqld.sock. So you end up with a config file that looks like this

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  socket: <%= File.exist?('/var/run/mysqld/mysqld.sock') ?
                '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock' %>

⋮

You always end up needing non-declarative bits in your configuration. Everyone realizes this at some point. So much so that the most common pattern for YAML configuration files in Ruby is to support ERB as a way to embed Ruby.
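The two-pass load is short enough to show in full: render the file with ERB, then hand the result to the YAML parser. This is essentially how Rails loads database.yml when it contains ERB tags.

```ruby
require 'erb'
require 'yaml'

# Multi-pass configuration loading: render through ERB first, then
# parse the result as YAML. The config string here is inline for the
# sake of a self-contained example.
raw = <<~YAML
  development:
    adapter: mysql
    socket: <%= File.exist?('/var/run/mysqld/mysqld.sock') ?
                  '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock' %>
YAML

config = YAML.load(ERB.new(raw).result)
config['development']['adapter']  # => "mysql"
```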

Rather than implement multi-pass configuration file loading, you could just have configuration files be pure Ruby that produces a hash structure like the YAML+ERB version does.

databases = {
  :development => {
    :adapter  => :mysql,
    :database => 'my_app_development',
    :username => 'my_app',
    :password => '',
    :socket   => File.exist?('/var/run/mysqld/mysqld.sock') ?
                   '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'
  },
  ⋮
}

That is nothing to write home about but it is dead simple to implement. Simpler even than the YAML+ERB approach. And it is superior to the YAML+ERB version because it is more powerful and extensible.

Rather than trying to map configurations onto a set of name-value pairs, I prefer to create a small DSL for the configuration. The result of adapting the language to the configuration that needs to be expressed is significantly more pleasant and DRY than the hash-oriented approaches. Consider the following

database {
  adapter   = :mysql
  database  = "my_db_#{RAILS_ENV}"
  user_name = 'my_app'
  password  = ''
  socket    = File.exist?('/var/run/mysqld/mysqld.sock') ?
                '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'

  env('production') {
    host = 'db1.my_org.invalid'
  }
}

That is a lot clearer, simpler, and less repetitive. In addition, users can do anything they need to because they have access to the full power of the host language. This DSL would be a little more difficult to implement than the hash-oriented approach, but not much. For that additional bit of effort you get a huge improvement in usability and power. Your users are worth that effort.
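For the curious, here is one way such a DSL could be implemented, using instance_eval and method_missing. One caveat: a bare `adapter = :mysql` inside a block is parsed by Ruby as a local variable assignment, so a working version needs an explicit `self.` (or a method-call style like `adapter :mysql`). All names below (`ConfigBlock`, `env`) are illustrative, not from a real library.

```ruby
# ConfigBlock collects `self.name = value` calls into a settings hash
# via method_missing; env applies its block only when the named
# environment matches the active one.
class ConfigBlock
  attr_reader :settings

  def initialize(environment, &block)
    @environment = environment
    @settings = {}
    instance_eval(&block)
  end

  # Turn `self.foo = bar` into a settings entry.
  def method_missing(name, *args)
    return super unless name.to_s.end_with?('=')
    @settings[name.to_s.chomp('=').to_sym] = args.first
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?('=') || super
  end

  # Merge nested settings only for the matching environment.
  def env(name, &block)
    instance_eval(&block) if name == @environment
  end
end

config = ConfigBlock.new('production') do
  self.database = 'my_db_production'
  env('production')  { self.host = 'db1.my_org.invalid' }
  env('development') { self.host = 'localhost' }
end

config.settings  # => {:database=>"my_db_production", :host=>"db1.my_org.invalid"}
```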


  1. For those who are unfamiliar with it, YAML is a nice, easy to read data serialization format. It is quite similar to JSON. More limited than XML, but a lot simpler to use if you are just doing data serialization.

Nucleic Teams

A nucleic team is one with a small core group of permanent employees, usually just 1 to 3 people, that is supplemented as needed by contractors. The core of a nucleic team is too small to do the anticipated work, even during slow periods of development. The core team’s job is twofold: first, it implements stories that are particularly complicated, risky, or architecturally important; second, it manages a group of contractors by creating statements of work, doing code reviews, etc.

The nucleic structure should provide a lot of advantages from a business standpoint. You get many of the benefits of having an in-house development team: developers that have the time and incentives to become domain experts; a consistent group of people with which all the stakeholders can build a rapport; a group of people that work together long enough to build the shared vision it takes to create systems with conceptual integrity.1

Those advantages are combined with the advantages of a pure contracting team, at least in principle. The primary advantage of pure contracting is that you can scale the development organization, both up and down, rapidly and cost-effectively. Many organizations with in-house development teams end up having to maintain a sub-optimally sized team. Workloads and cash flow tend to vary over time. It takes a long time to find and hire skilled developers, and once you do, it really sucks to have to lay people off, whether for lack of work or lack of money. Resizing development teams is so costly and disruptive that most organizations pick a team size that is larger than optimal for the slow/lean times but smaller than optimal for the plentiful times.

Risks

This structure is not without its risks, though. Finding talented contractors is not easy. Contractors, by their very nature, cannot be relied on when planning beyond their current contract. Most importantly, though, contracting usually has an incentive structure that favors short term productivity. All of these can threaten the long term success of a project if not managed correctly.

To counteract the risks inherent in contract workers, the core team must be committed to the business, highly talented, and fully empowered by the executive team to aggressively manage the contractors. The core team members must be highly skilled software developers, of course, but this role requires expertise in areas that are significantly different from traditional software development. The ability to read and understand other people’s code rapidly is of huge importance. As is the ability to communicate to both the business and the contractors what functionality is needed. The core team also needs to be able to communicate much more subtle, squishy things like the architectural vision and development standards.

The core team will not be as productive at cutting code as they might be used to. The core team role is not primarily one of coding. A significant risk is that the members of the core team might find that they do not like the facilitation and maintainership role nearly as much as cutting code. It is necessary to set the expectations of candidates for the core team appropriately. One other risk is that the core team will get so bogged down in facilitation and maintainership tasks that they actually stop cutting code. The “non-coding architect” is a recipe for disaster, and should be avoided at all costs.

While this team structure has much going for it, it will be challenging to make work in practice.

Origins

I think this team structure is developing in the Rails community out of necessity rather than preference. Rails is a highly productive environment. That can make it a competitive advantage for organizations that use it. However, the talent pool for Ruby and Rails is rather small. Additionally, many of the people who are highly skilled at Rails prefer to work as contractors. The percentage of the Rails talent pool that prefers to be independent seems quite high in comparison to any other community I know of.

This raises a problem for organizations that would like to create an in-house development team using Rails. Most of the talent would rather not work for you, or anyone for that matter. However, if you can build a small core team to manage the development and hold the institutional knowledge for the project, you can utilize the huge talent that exists in the Rails contractor community to drive the project to completion.

I am not sure if this structure and the reasons behind it are good, or bad, for the Rails community as a whole. The nucleic team model might turn out to be a competitive advantage in itself because it embodies the benefits of both internal and external development teams. On the other hand, it is bound to be a bit off-putting for organizations that are not used to it.


  1. See Mythical Man Month by Fred Brooks for more details on the importance of conceptual integrity.

Will code for food

I am on the job market again.

If you need some help exposing your application’s functionality via REST web APIs, we should talk. If you need some assistance developing, deploying, and managing a Ruby application, I have the skills you need. If you need someone who can learn your code base and your business processes, I can do that. If you need someone to drive improvements to your software development process, I won’t disappoint you.

My resume has more details on my skills and accomplishments. Feel free to pass it around to anyone you think might be interested.

Announcing Resourceful

Resourceful has its initial (0.2) release today.

Resourceful is a sophisticated HTTP client library for Ruby. It will (when it is complete, at least) provide a simple API for fully utilizing the amazing goodness that is HTTP.

It is already tasty, though. The 0.2 release provides

  • fully compliant HTTP caching
  • a framework for implementing cache managers (memory based cache manager provided)
  • fully compliant transparent redirection handling (with hooks for overriding the default behavior)
  • pluggable HTTP authentication handling (Basic provided)

Introduction

The API is strongly influenced by our successful experiences with REST. Each URI is represented by a Resource object. The Resource objects act as proxies for the conceptual resources. Resources expose the basic set of HTTP verbs: get, put, post, delete. For example, to get a representation of some resource you do this

require 'resourceful'
http = Resourceful::HttpAccessor.new
resp = http.resource('http://rubyforge.org').get
puts resp.body

If you want to post a form you do this

require 'resourceful'
http = Resourceful::HttpAccessor.new
resp = http.resource('http://rubyforge.org').post("name=Peter&hobbies=programming,diy", :content_type => "application/x-www-form-urlencoded")
puts resp.code

All non-2xx responses are either handled transparently (e.g. by following redirects) or the method will raise an UnsuccessfulHttpRequestError.

Conclusion

If you need a decent HTTP library in Ruby, come on over and check it out. If you see something you want, or want fixed, feel free to branch the git repo and do it. Paul and I would love some more contributors and will welcome you with open arms.