An explanation of blame and guilt

I just read Blame Turned Upside Down by Robert Wright. I don’t think i have ever felt quite so uncomfortable while reading as i did during that essay. Its thesis is easy to apply to daily life and it quite handily explains my own reactions, and those of others, to many situations i have experienced. It is profoundly discomforting that people’s behaviors and feelings are dominated by a set of ancient heuristics. Particularly when that person is me.

Push or pull?

The question of how to communicate events often comes up when designing APIs and multi-component systems. The question basically boils down to this: should events be pushed to interested parties as they occur, or should interested parties poll for new data?

The short answer: interested parties should poll for new data.

The longer answer is, of course, it depends.

Polling is the only approach that scales to internet levels. For the smaller scales of internal multi-component systems the answer is much less clear cut. It is clear that a push approach can be implemented in such environments using either web hooks or XMPP. Such approaches often appear to be simpler and more efficient than the pull equivalents, and they are definitely lower latency.

The appearance of simplicity is an illusion, unfortunately. Event propagation using push is only easy if you are willing to give up a lot of reliability and predictability. It is easy to say “when an event occurs i will just POST it to the registered URI(s)”. That would be easy, but the world is rarely that simple. What if the receiving server is down or unreachable? Are you going to retry, and if so how many times? If not, is that level of message loss acceptable to all the interested parties? If the receiving system is very slow, will that cause a backlog in the sending system? If a lot of events happen in a very short period of time, can the receiving system handle the load?

The efficiency benefits of a push approach are real, but not nearly as significant as they first appear. HTTP’s conditional request mechanism provides, when used effectively, a way to reduce the cost of polling to quite low levels.
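As a rough sketch of how conditional requests keep polling cheap, the poller below remembers the ETag of the last response and sends it back in an If-None-Match header, so an unchanged resource costs only a bodiless 304 Not Modified reply. The class name, the injectable transport and the URL in the usage comment are assumptions made for illustration. A real client would also want timeouts and some form of back-off.

```ruby
require 'net/http'
require 'uri'

# Hypothetical poller: fetches a resource only when it has changed.
class ConditionalPoller
  # transport is any callable that takes a request and returns a response.
  # It defaults to a real HTTP round trip and exists so the logic can be
  # exercised without a network.
  def initialize(url, transport: nil)
    @uri = URI(url)
    @etag = nil
    @transport = transport || method(:fetch)
  end

  # Returns the new body, or nil when the server reports no change.
  def poll
    request = Net::HTTP::Get.new(@uri)
    request['If-None-Match'] = @etag if @etag
    response = @transport.call(request)

    case response.code.to_i
    when 304
      nil
    when 200..299
      @etag = response['ETag'] # remember the validator for the next poll
      response.body
    else
      raise "unexpected response status: #{response.code}"
    end
  end

  private

  def fetch(request)
    Net::HTTP.start(@uri.host, @uri.port, use_ssl: @uri.scheme == 'https') do |http|
      http.request(request)
    end
  end
end

# Usage (assumed URL):
#   poller = ConditionalPoller.new('https://api.example.invalid/events')
#   body = poller.poll   # nil means nothing new since the last poll
```

The server still has to answer every poll, but it can do so without regenerating or transmitting the representation, which is where most of the cost usually lies.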

Pull is cool

APIs should be built around pulling data unless there is a particular functional concern that makes pull unworkable (e.g. low message latency being very important). Any push approach will have all the complexities of a pull approach (to handle reliability issues) combined with much less predictable behavior, because its performance will depend on the ability of one or more other systems to handle the event notification workload.

Instant Legacy Status

[Image: MS Great Plains API diagram]

That is what you gain when you allow more than one module of software to access any single database. Any integration database confers legacy status on all modules that access it.

For the sake of this discussion we will define a module as a unit of software that is highly cohesive, logically independent and implemented by a single team. A module is usually a single code base. Practically speaking, module boundaries usually depend on Conway’s Law. However, disciplined teams can gain some advantage from sub-dividing their problem space into multiple modules.

Pardon me while i digress…

I recently attended Mountain West RubyConf. While there i had the pleasure of hearing a brilliant talk by Jim Weirich about connascence in software. (Here are some other write ups of the same talk, but from different conferences.) A unit of software, A, is connascent with another, B, if there exists some change to B that would necessitate a change in A to preserve the overall correctness of the system.

Connascence has various flavors. Some of these forms cause more maintenance problems than others. Connascence of Name is linking code by name. For example, calling some method foo. Calling methods by name is a form of connascence but one which we regularly accept. Connascence of Algorithm, in contrast, is the linking of code by requiring that both pieces use the same algorithm. When we see this in practice we generally run from the room screaming “DRY”.
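To make the two flavors concrete, here is a small sketch. All of the class and module names are made up for illustration, and SHA-256 merely stands in for whatever algorithm two pieces of code might need to share.

```ruby
require 'digest'

# Connascence of Name: callers depend only on the name `checksum`.
module Checksums
  def self.checksum(payload)
    Digest::SHA256.hexdigest(payload)
  end
end

# Connascence of Algorithm: Writer and Reader each hard-code SHA-256.
# Change the algorithm in one and the other silently stops agreeing.
class Writer
  def checksum(payload)
    Digest::SHA256.hexdigest(payload)
  end
end

class Reader
  def valid?(payload, checksum)
    Digest::SHA256.hexdigest(payload) == checksum
  end
end

# The DRY fix: the reader now shares only a name with Checksums, so the
# algorithm can change in exactly one place.
class DryReader
  def valid?(payload, checksum)
    Checksums.checksum(payload) == checksum
  end
end
```

The duplicated version works today. The problem only shows up on the day someone changes one copy of the algorithm and not the other.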

Many of the practices we like in software, such as DRY and the Law of Demeter, reduce connascence. Likewise, many of the things we regard as code smells result in increased connascence. Lower levels, and degrees, of connascence are desirable because they reduce the effort required to change software. Mr. Weirich posited that connascence might be a basis for a unified theory of software quality. It is definitely the most comprehensive framework i am aware of for thinking, and communicating, about the quality of software.

Back on point

It is clear that a database makes all the modules that access it connascent with one another. Many changes to an application will require changes to the database; any changes to the database will require changes to the other modules to maintain overall correctness. All changes to the database require that the correctness of all accessing modules be verified and such verification is often non-trivial.

The integration database is particularly problematic because it forces a high degree of a variety of forms of connascence. It causes Connascence of Name. All modules involved need to agree on the names of the columns, tables, database, schema and host. Right off the bat you should start to wonder about this approach. The modules are weakly connascent at many points.

If the data model is normalized it may well avoid introducing significant Connascence of Meaning. If, on the other hand, there are enumerable values without look-up tables; logically boolean fields stored as numeric types; or a wide variety of other common practices you will have introduced some Connascence of Meaning. Connascence of Meaning requires that the units agree on how to infer meaning from a datum (e.g., in this context 1 means true). It is a stronger form of connascence than Connascence of Name.

While we are on this point, remember that if your mad data modeling skills did manage to avoid significant Connascence of Meaning, you did it by adding significant amounts of Connascence of Name. That is what a look-up table is.

So far not so good, but that was the good part. The really disastrous part is that by integrating at the database level you are required to introduce staggering levels of Connascence of Algorithm. Connascence of Algorithm is a very strong form of connascence. Allowing one module to interact with another’s data storage means that both modules have to understand (read: share the same algorithm for) the business rules of the accessed module. If the business rules change in a module, the possibility exists that any other module that accesses the data might now operate incorrectly.
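A toy illustration of that failure mode follows. The table rows, the module names and the “on hold” rule are all hypothetical. The point is only that a second module reading the table directly carries its own copy of the owning module’s business rule.

```ruby
# Rows as they might sit in a shared orders table (hypothetical schema).
ORDERS = [
  { id: 1, paid: true,  cancelled: false, on_hold: false },
  { id: 2, paid: true,  cancelled: false, on_hold: true  },
  { id: 3, paid: false, cancelled: false, on_hold: false }
].freeze

# The owning module. Its rule was recently extended to exclude held orders.
module Orders
  def self.shippable?(row)
    row[:paid] && !row[:cancelled] && !row[:on_hold]
  end
end

# A second module that reads the table directly, using its own copy of
# the old rule. Nothing forces it to notice the change above.
module Shipping
  def self.shippable_ids(rows)
    rows.select { |row| row[:paid] && !row[:cancelled] }.map { |row| row[:id] }
  end
end

owner_view    = ORDERS.select { |row| Orders.shippable?(row) }.map { |row| row[:id] }
shipping_view = Shipping.shippable_ids(ORDERS)

puts "owner: #{owner_view.inspect}, shipping: #{shipping_view.inspect}"
# prints: owner: [1], shipping: [1, 2]
```

Order 2 gets shipped even though the module that owns it says it should not be. Had the shipping module asked the orders module instead of the table, the rule would have lived in one place.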

Only application databases need apply

I fall squarely on the side of not using databases as integration vectors. The forms and degree of connascence that such an approach necessitates make it a recipe for disaster. Such a high level of connascence will raise the cost of change (read: improvement) in all the modules involved. Application systems tend to get more integrated over time, so the cost of improvement rises rapidly in systems that use integration databases.

After a while the cost of change will become so high that only improvements that provide huge value to the business will be worth the cost. For weak, undisciplined teams this will happen very rapidly. For strong, smart and highly disciplined teams it will take a bit longer, but it will happen. Once you allow more than one module to access your app’s database forever will it dominate your destiny.

Encfs

Security is a thing at my new job. We follow PCI security standards. We take great care never to have our customers’ sensitive information on unsecured machines. We make efforts to stay aware of the security risks in our environments.

With that in mind, i decided that storing all my work related files on my laptop in clear text was suboptimal. I have thought this same thing at pretty much every job, but i have never done anything about it before.

After a bit of research i settled on EncFS as the best mechanism to encrypt my work related data. It is an encrypted file system. Unlike most of the other encrypted file systems for Linux, EncFS does not require reserving large amounts of disk space up front. EncFS effectively lets you make a directory on an existing file system in which all the files will be encrypted. It is very easy to set up and use.

In addition to normal sorts of files, all my test and development databases need to be encrypted, also. Although these databases don’t contain any customer data, they do contain some information that my employer would prefer not be public knowledge. When the database server starts it cannot access the database files until i provide the encryption password. Fortunately, PostgreSQL is totally bad ass. It will happily start up even if it is unable to access some of the configured table spaces.1 As soon as the encrypted file system is mounted, the databases that reside in the encrypted directory instantly become available. The encryption layer does not even affect performance noticeably.2

One thing i did think is a little weak is that encrypted file systems don’t get unmounted when the computer is put to sleep. No worries. A tiny script in /etc/pm/sleep.d to unmount the file system is all it takes to rectify that situation.
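Such a hook might look something like the sketch below. pm-utils runs the executables in /etc/pm/sleep.d with the power event (suspend, hibernate, resume or thaw) as the first argument, and since EncFS is a FUSE file system it can be unmounted with fusermount -u. The file name and the mount point are assumptions, so adjust them to your own setup. Note that the unmount will fail if something, such as a running database server, still has files open on the encrypted file system.

```ruby
#!/usr/bin/env ruby
# Hypothetical hook, e.g. /etc/pm/sleep.d/50-encfs-unmount (must be executable).

# Assumed mount point of the decrypted view of the EncFS directory.
MOUNT_POINT = '/home/me/work'

# Returns the command to run for a given pm-utils event, or nil when the
# event (resume, thaw, anything else) needs no action.
def command_for(event, mount_point = MOUNT_POINT)
  case event
  when 'suspend', 'hibernate'
    ['fusermount', '-u', mount_point]
  end
end

if __FILE__ == $PROGRAM_NAME
  command = command_for(ARGV.first)
  system(*command) if command
end
```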

Now if someone steals my laptop the only thing they will be able to access are my family photos. That is a pretty nice feeling. Even better, it turned out to be very easy.


  1. To allow the postgres user to access the encrypted file system you do need to mount it with the --public option.

  2. This is light duty, single user, performance we are talking about. I wouldn’t suggest this setup for a heavy load production environment, but in development it is no sweat.

The demise of law in the USA

When there’s no law, there’s no bread – Benjamin Franklin

I am not sure which way the causality goes in that statement. I fear that it is a self-reinforcing cycle where a slip on either side causes a slip on the other. I fear that because we in the USA are already well on our way.

First day at ID Watchdog

Today is my first day on the job at ID Watchdog.

After the normal pleasantries of getting a desk, etc, i had the pleasure of overhearing a conversation among some ops people in the kitchen. Apparently, a not so nice person provided the police with the identity of one of our clients rather than his own, a fact which the ID Watchdog system detected. Now the ops team is working with law enforcement and the judicial system to clear up the confusion. A standard part of the service we provide, apparently.

I am psyched to be working at a place that helps people in such concrete ways.

Configuration files

If you are using a dynamic interpreted language please do not use YAML1, or any other simple data serialization language, for configuration files.

Strictly speaking configuration is just data, of course, so you can use a data serialization language to represent your application’s, or library’s, configuration. In some environments, like static compiled languages say, using a data serialization language for configuration makes a lot of sense. Creating your own custom configuration language from scratch is probably going to be more trouble than it is worth, unless you have really complex configurations to express.

On the other hand, if you have an easily available, highly readable, Turing strength language why wouldn’t you use it? If the language that your application is written in supports eval you should probably use the app language to express the configuration. You will end up with a simpler and more powerful configuration system.

Let’s look at a common configuration as an example. In Rails the database connection configuration looks like this.

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  
test:
  adapter: mysql
  database: my_db_test
  username: my_app
  password:

production:
  adapter: mysql
  database: my_db_production
  username: my_app
  password:
  host: db1.my_org.invalid

That is not horrible, but it does include a fair bit of repetition. This approach starts getting ugly when you need to have more dynamic configurations. For example, RPM type distros use /tmp/mysql.sock as the domain socket for the MySQL database. Debian type distros use /var/run/mysqld/mysqld.sock. So you end up with a config file that looks like this

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  socket: <%= File.exist?('/var/run/mysqld/mysqld.sock') ?
                '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock' %>

⋮

You always end up needing non-declarative bits in your configuration. Everyone realizes this at some point. So much so that the most common pattern for YAML configuration files in Ruby is for them to support ERB as a way to embed Ruby.

Rather than implement multi-pass configuration file loading you could just have configuration files be pure Ruby but produce a hash structure like the YAML+ERB version does.

databases = {
  :development => {
    :adapter  => :mysql,
    :database => 'my_db_development',
    :username => 'my_app',
    :password => '',
    :socket   => File.exist?('/var/run/mysqld/mysqld.sock') ?
                   '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'
  },
  ⋮
}

That is nothing to write home about but it is dead simple to implement. Simpler even than the YAML+ERB approach. And it is superior to the YAML+ERB version because it is more powerful and extensible.
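To show just how little machinery this takes, here is a minimal sketch of a loader. The load_ruby_config helper is hypothetical, not something Rails provides. It assumes the configuration file’s last expression is the hash, and that you trust whoever writes the file, since it is evaluated as ordinary Ruby. The temporary file stands in for a real config file.

```ruby
require 'tempfile'

# Hypothetical helper: evaluates a Ruby config file and returns the value
# of its last expression. The fresh Object keeps the file's code from
# running with the caller's self.
def load_ruby_config(path)
  Object.new.instance_eval(File.read(path), path)
end

# What a pure Ruby database config file might contain.
CONFIG_SOURCE = <<~'RUBY'
  socket = File.exist?('/var/run/mysqld/mysqld.sock') ?
             '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'

  {
    :development => {
      :adapter  => :mysql,
      :database => 'my_db_development',
      :username => 'my_app',
      :password => '',
      :socket   => socket
    }
  }
RUBY

config = Tempfile.create(['database', '.rb']) do |file|
  file.write(CONFIG_SOURCE)
  file.flush
  load_ruby_config(file.path)
end

puts config[:development][:database]
```

Passing the path as the second argument to instance_eval means that errors in the config file are reported with the right file name.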

Rather than trying to map configurations onto a set of name value pairs, i prefer to create a small DSL for the configuration. The result of adapting the language to the configuration that needs to be expressed is significantly more pleasant and DRY than the hash oriented approaches. Consider the following.

database {
  adapter  :mysql
  database "my_db_#{RAILS_ENV}"
  username 'my_app'
  password ''
  socket   File.exist?('/var/run/mysqld/mysqld.sock') ?
             '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'

  env('production') {
    host 'db1.my_org.invalid'
  }
}

That is a lot clearer, simpler and less repetitive. In addition, users can do anything they need to because they have access to the full power of the host language. This DSL would be a little more difficult to implement than the hash oriented approach, but not much. For that additional bit of effort you get a huge improvement in usability and power. Your users are worth that effort.
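For a sense of the effort involved, here is one possible sketch of such a DSL built on instance_eval. It assumes the settings are written as method calls (adapter :mysql) rather than assignments, because an assignment inside a Ruby block only creates a local variable. The DatabaseConfig class, its build method and the list of settings are all made up for illustration, and the environment is passed in explicitly rather than read from RAILS_ENV.

```ruby
# Hypothetical DSL implementation: each setting is a method that records
# its argument, and env blocks only run for the matching environment.
class DatabaseConfig
  SETTINGS = %i[adapter database username password socket host].freeze

  def self.build(current_env, &block)
    config = new(current_env)
    config.instance_eval(&block)
    config.to_h
  end

  def initialize(current_env)
    @current_env = current_env
    @settings = {}
  end

  SETTINGS.each do |name|
    define_method(name) { |value| @settings[name] = value }
  end

  # Settings inside the block override the defaults for that environment.
  def env(name, &block)
    instance_eval(&block) if name == @current_env
  end

  def to_h
    @settings.dup
  end
end

rails_env = 'production' # stand-in for RAILS_ENV

config = DatabaseConfig.build(rails_env) do
  adapter  :mysql
  database "my_db_#{rails_env}"
  username 'my_app'
  password ''

  env('production') do
    host 'db1.my_org.invalid'
  end
end

puts config.inspect
```

Because the block is plain Ruby, local variables, conditionals and method calls all work inside it without any extra support from the DSL.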


  1. For those who are unfamiliar with YAML it is a nice, easy to read data serialization format. It is quite similar to JSON. More limited than XML but a lot simpler to use if you are just doing data serialization.

Unemployment Insurance Fees

I just received my unemployment insurance debit card, the default payment mechanism in Colorado. After dutifully reading the fee schedule i am appalled. If you are out of work it is quite likely that you need to make every penny count. But the debit card carries an obscene level of fees and usage charges.

Here are just a few of the most egregious ones.

  • 10 cent surcharge on every purchase1
  • 75 cent charge each time a transaction is denied due to insufficient funds2
  • 50 cents for an ATM balance inquiry

I am ashamed that the Colorado Department of Labor willingly subjects unemployed workers to this sort of treatment. It is particularly shocking because the debit card is strongly preferred as a payment method for unemployment insurance. The only other option is direct deposit, but it is really talked down in the sign-up process.

The debit card issuer obviously needs to be paid, but surely we can do better than allowing Chase Bank to nickel and dime claimants every time they use the money they are entitled to. Since a simple check is not an option, the least the Department of Labor could do is pay the card issuer directly for their services, rather than transferring that cost to unemployed workers in the form of obscene usage fees.


  1. You do get 2 free POS signature transactions each month. That barely even counts as a fig leaf.

  2. The old “don’t kick a man while he is down” adage is not given much weight in the banking industry, apparently.

Mountain West RubyConf 2009

I am going to Mountain West RubyConf this weekend. I am very excited. Last year this was a great conference and the schedule looks great this year too. If you are going to be there too let me know. One of the great things about these conferences is all the great people you get to meet, so hopefully i’ll see you there.

Nucleic Teams

A nucleic team is one with a small core group of permanent employees, usually just 1 to 3 people, that is supplemented as needed by contractors. The core in a nucleic team is too small to do the anticipated work, even during slow periods of development. The core team’s job is twofold. First, it implements stories that are particularly complicated, risky or architecturally important. Second, it manages a group of contractors by creating statements of work, doing code reviews, etc.

The nucleic structure should provide a lot of advantages from a business standpoint. You get many of the benefits of having an in-house development team. Advantages like developers that have the time and incentives to become domain experts. A consistent group of people with which all the stakeholders can build a rapport. A group of people that work together long enough to build the shared vision it takes to create systems with conceptual integrity.1

Those advantages are combined with the advantages of a pure contracting team, at least in principle. The primary advantage of pure contracting is that you can scale the development organization, both up and down, rapidly and cost effectively. Many organizations with in-house development teams end up having to maintain a sub-optimally sized development team. Work loads and cash flow tend to vary a bit over time. It takes a long time to find and hire skilled developers. Once you do, it really sucks to have to lay people off, either because of the lack of work or lack of money. Resizing development teams is so costly and disruptive that most organizations tend to pick a team size that is larger than optimal for the slow/lean times but less than optimal for the plentiful times.

Risks

This structure is not without its risks, though. Finding talented contractors is not easy. Contractors, by their very nature, cannot be relied on when planning beyond their current contract. Most importantly, though, contracting usually has an incentive structure that favors short term productivity. All of these can threaten the long term success of a project if not managed correctly.

To counteract the risks inherent in contract workers the core team must be committed to the business, highly talented and fully empowered by the executive team to aggressively manage the contractors. The core team members must be highly skilled software developers, of course, but this role requires expertise in areas that are significantly different from traditional software development. The ability to read and understand other people’s code rapidly is of huge importance. As is the ability to communicate to both the business and the contractors what functionality is needed. The core team also needs to be able to communicate much more subtle, squishy, things like the architectural vision and development standards.

The core team will not be as productive at cutting code as they might be used to. The core team role is not primarily one of coding. A significant risk is that the members of the core team might find that they do not like the facilitation and maintainership role nearly as much as cutting code. It is necessary to set the expectations of candidates for the core team appropriately. One other risk is that the core team will get so bogged down in facilitation and maintainership tasks that they actually stop cutting code. The “non-coding architect” is a recipe for disaster, and should be avoided at all costs.

While this team structure has much going for it, it will be challenging to make work in practice.

Origins

I think this team structure is developing in the Rails community out of necessity, rather than preference. Rails is a highly productive environment. That can make it a competitive advantage for organizations that use it. However, the talent pool for Ruby and Rails is rather small. Additionally, many of the people who are highly skilled at Rails prefer to work as contractors. The percentage of the Rails talent pool that prefers to be independent seems quite high by comparison to any other community i know of.

This raises a problem for organizations that would like to create an in-house development team using Rails. Most of the talent would rather not work for you, or anyone for that matter. However, if you can build a small core team to manage the development and hold the institutional knowledge for the project you can utilize the huge talent that exists in the Rails contractor community to drive the project to completion.

I am not sure if this structure and the reason behind it are good, or bad, for the Rails community as a whole. The nucleic team model might turn out to be a competitive advantage in itself because it embodies the benefits of both internal and external development teams. On the other hand, it is bound to be a bit off-putting for organizations that are not used to it.


  1. See The Mythical Man-Month by Fred Brooks for more details on the importance of conceptual integrity.