The Health Care Debate

Cunning realist expresses my thought about the health care debate rather concisely.

… in the United States we are getting near an era of civil unrest with hussein Obama pushing Socialism/Nazism on the people …

During the past few weeks I’ve alternated between alarm and uninterested contempt. I’m still not sure which is more appropriate.

I, too, am not sure which i should feel. The level of the public debate has certainly been depressing. Alonzo Fyfe has recently done quite a lot of excellent work detailing just how depressing.

Cucumber


I have been working pretty extensively with Cucumber for the last couple of weeks. In short, it is killer. You should be using it.

Having just RSpec/unit tests results in a lot of ugly trade-offs between verifying the design and implementation of the parts (or units) vs. the system as a whole. Using Cucumber completely absolves RSpec specs and unit tests of any responsibility for proving the system works. That allows you to use RSpec/unit tests as tools to improve the design, and reliability, of individual parts of the system without losing confidence in the overall system’s ability to function acceptably.

If you are using Emacs i highly recommend cucumber.el. It has excellent support for editing Gherkin files and key bindings to execute scenarios, etc and view the output without ever having to leave the comfort of Emacs.

Reverting changes in git

Note to self: If you need to revert commits that have already been pushed, or otherwise merged with any other repo or branch, use git revert.

git reset is exclusively for undoing commits in your local working tree that have not seen the light of day. Attempting to use git reset to undo changes that exist in multiple trees (changes that have been pushed or merged into another branch or repo) will result in pain and suffering.

An explanation of blame and guilt

I just read Blame Turned Upside Down by Robert Wright. I don’t think i have ever felt quite so uncomfortable while reading as i did during that essay. Its thesis is easy to apply to daily life and it quite handily explains the reactions of myself, and others, to many situations i have experienced. It is profoundly discomforting that people’s behaviors and feelings are dominated by a set of ancient heuristics. Particularly when that person is me.

Push or pull?

The question of how to communicate events often comes up when designing APIs and multi-component systems. The question basically boils down to this: should events be pushed to interested parties as they occur, or should interested parties poll for new data?

The short answer: interested parties should poll for new data.

The longer answer is, of course, it depends.

Polling is the only approach that scales to internet levels. For the smaller scales of internal multi-component systems the answer is much less clear cut. It is clear that a push approach can be implemented in such environments using either web hooks or XMPP. Such approaches often appear to be simpler and more efficient than the pull equivalents, and they are definitely lower latency.

The appearance of simplicity is an illusion, unfortunately. Event propagation using a push is only easy if you are willing to give up a lot of reliability and predictability. It is easy to say “when an event occurs i will just POST it to registered URI(s)”. That would be easy, but the world is rarely that simple. What if the receiving server is down or unreachable? Are you going to retry and, if so, how many times? If not, is that level of message loss acceptable to all the interested parties? If the receiving system is very slow, will that cause a back-log in the sending system? If a lot of events happen in a very short period of time, can the receiving system handle the load?

The efficiency benefits of a push approach are real, but not nearly as significant as they first appear. HTTP’s conditional request mechanism provides, when used effectively, a way to reduce the cost of polling to quite low levels.
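As a rough sketch of what conditional polling can look like, assuming the event feed returns an ETag header and honors If-None-Match: the ConditionalPoller class, its fetcher parameter and the feed URL below are hypothetical names made up for illustration, not part of any library.

```ruby
require 'net/http'
require 'uri'

# Hypothetical poller: remembers the last ETag and sends it back so the
# server can answer "304 Not Modified" with no body when nothing changed.
class ConditionalPoller
  # fetcher is any callable taking (uri, headers) and returning
  # [status, response_headers, body]; it defaults to a real HTTP GET.
  def initialize(url, fetcher: nil)
    @uri = URI(url)
    @etag = nil
    @fetcher = fetcher || method(:http_get)
  end

  # Returns the new body, or nil when the feed has not changed.
  def poll
    headers = {}
    headers['If-None-Match'] = @etag if @etag
    status, response_headers, body = @fetcher.call(@uri, headers)
    return nil if status == 304

    @etag = response_headers['etag']
    body
  end

  private

  def http_get(uri, headers)
    response = Net::HTTP.start(uri.host, uri.port,
                               use_ssl: uri.scheme == 'https') do |http|
      http.get(uri.request_uri, headers)
    end
    [response.code.to_i, { 'etag' => response['ETag'] }, response.body]
  end
end
```

A consumer would call poll on whatever schedule suits it, for example ConditionalPoller.new('http://feed.example.invalid/events').poll, and do nothing when it returns nil. An unchanged feed then costs one small request and an empty response.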

Pull is cool

APIs should be built around pulling data unless there is a particular functional concern that makes pull not work (e.g. low message latency being very important). Any push approach will have all the complexities of a pull approach (to handle reliability issues) combined with a lot less predictable behavior, because its performance will depend on the ability of one or more other systems to handle the event notification work load.

Instant Legacy Status

MS Great Plain API diagram

That is what you gain when you allow more than one module of software to access any single database. Any integration database confers legacy status on all modules that access it.

For the sake of this discussion we will define a module as a unit of software that is highly cohesive, logically independent and implemented by a single team. A module is usually a single code base. Practically speaking, module boundaries usually depend on Conway’s Law. However, disciplined teams can gain some advantage from sub-dividing their problem space into multiple modules.

Pardon me while i digress…

I recently attended Mountain West RubyConf. While there i had the pleasure of hearing a brilliant talk by Jim Weirich about connascence in software. (Here are some other write-ups of the same talk, but from different conferences.) A unit of software, A, is connascent with another, B, if there exists some change to B that would necessitate a change in A to preserve the overall correctness of the system.

Connascence has various flavors. Some of these forms cause more maintenance problems than others. Connascence of Name is linking code by name. For example, calling some method foo. Calling methods by name is a form of connascence but one which we regularly accept. Connascence of Algorithm, in contrast, is the linking of code by requiring both pieces use the same algorithm. When we see this in practice we generally run from the room screaming “DRY”.
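To make the difference concrete, here is a small illustrative sketch. The Invoice, MessageWriter and MessageReader classes are invented for this example and are not taken from the talk.

```ruby
require 'digest'

# Connascence of Name: callers only need to agree on the name `total`.
class Invoice
  def initialize(amounts)
    @amounts = amounts
  end

  def total
    @amounts.sum
  end
end

# Connascence of Algorithm: the writer and the reader must both use the
# same checksum algorithm. Change one to MD5 and the other breaks.
class MessageWriter
  def write(payload)
    "#{Digest::SHA256.hexdigest(payload)}:#{payload}"
  end
end

class MessageReader
  def read(message)
    checksum, payload = message.split(':', 2)
    unless checksum == Digest::SHA256.hexdigest(payload)
      raise ArgumentError, 'checksum mismatch'
    end

    payload
  end
end
```

Renaming total is a mechanical search and replace. Changing the checksum requires finding every piece of code that silently relies on the same algorithm, which is why the second form hurts more.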

Many of the practices we like in software – DRY, Law of Demeter, etc – reduce connascence. Likewise, many of the things we regard as code smells result in increased connascence. Lower levels, and degrees, of connascence are desirable because they reduce the effort required to change software. Mr. Weirich posited that connascence might be a basis for a unified theory of software quality. It is definitely the most comprehensive framework i am aware of for thinking, and communicating, about the quality of software.

Back on point

It is clear that a database makes all the modules that access it connascent with one another. Many changes to an application will require changes to the database; any changes to the database will require changes to the other modules to maintain overall correctness. All changes to the database require that the correctness of all accessing modules be verified and such verification is often non-trivial.

The integration database is particularly problematic because it forces a high degree of a variety of forms of connascence. It causes Connascence of Name. All modules involved need to agree on the names of the columns, tables, database, schema and host. Right off the bat you should start to wonder about this approach. The modules are weakly connascent at many points.

If the data model is normalized it may well avoid introducing significant Connascence of Meaning. If, on the other hand, there are enumerable values without look-up tables; logically boolean fields stored as numeric types; or a wide variety of other common practices you will have introduced some Connascence of Meaning. Connascence of Meaning requires that the units agree on how to infer meaning from a datum (e.g., in this context 1 means true). It is a stronger form of connascence than Connascence of Name.
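A minimal sketch of that situation, assuming a shared table with a numeric active flag and a numeric status code. The row, the modules and the code values are all hypothetical.

```ruby
# Hypothetical row from a shared table: `active` is a boolean stored as a
# number and `status` is an enumeration with no look-up table.
ROW = { id: 42, active: 0, status: 2 }.freeze

# One module knows that status 2 means "suspended".
module Billing
  STATUS_SUSPENDED = 2

  def self.suspended?(row)
    row[:status] == STATUS_SUSPENDED
  end
end

# Another module has to know, independently, that 1 means true.
module Reporting
  def self.active?(row)
    row[:active] == 1
  end

  # A module that does not share the convention gets it wrong:
  # in Ruby the integer 0 is truthy.
  def self.naive_active?(row)
    row[:active] ? true : false
  end
end
```

Nothing in the schema records what 0, 1 or 2 mean, so every accessing module carries its own copy of that knowledge.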

While we are on this point, remember that if your mad data modeling skills did manage to avoid significant Connascence of Meaning you did it by adding significant amounts of Connascence of Name. That is what a lookup-table is.

So far not so good, but that was the good part. The really disastrous part is that by integrating at the database level you are required to introduce staggering levels of Connascence of Algorithm. Connascence of Algorithm is a very strong form of connascence. Allowing one module to interact with another’s data storage means that both modules have to understand (read: share the same algorithm) the business rules of the accessed module. If the business rules change in a module, the possibility exists that any other module that accesses the data might now operate incorrectly.

Only application databases need apply

I fall squarely on the side of not using databases as integration vectors. The forms and degree of connascence that such an approach necessitates make it a recipe for disaster. Such a high level of connascence will raise the costs of change (read: improvement) in all the modules involved. Application systems tend to get more integrated over time so the cost of improvement rises rapidly in systems that use integration databases.

After a while the cost of change will become so high that only improvements that provide huge value to the business will be worth the cost. For weak, undisciplined teams this will happen very rapidly. For strong, smart and highly disciplined teams it will take a bit longer, but it will happen. Once you allow more than one module to access your app’s database, forever will it dominate your destiny.

Encfs

Security is a thing at my new job. We follow PCI security standards. We take great care never to have our customers’ sensitive information on unsecured machines. We make efforts to stay aware of the security risks in our environments.

With that in mind, i decided that storing all my work related files on my laptop in clear text was suboptimal. I have thought this same thing at pretty much every job, but i have never done anything about it before.

After a bit of research i settled on EncFS as the best mechanism to encrypt my work related data. It is an encrypted file system. Unlike most of the other encrypted file systems for Linux, EncFS does not require reserving large amounts of disk space up front. EncFS effectively lets you make a directory on an existing file system in which all the files will be encrypted. It is very easy to set up and use.

In addition to normal sorts of files, all my test and development databases need to be encrypted, also. Although these databases don’t contain any customer data, they do contain some information that my employer would prefer not be public knowledge. When the database server starts it cannot access the database files until i provide the encryption password. Fortunately, PostgreSQL is totally bad ass. It will happily start up even if it is unable to access some of the configured table spaces.1 As soon as the encrypted file system is mounted, the databases that reside in the encrypted directory instantly become available. The encryption layer does not even affect performance noticeably.2

One thing i did think is a little weak is that encrypted file systems don’t get unmounted when the computer is put to sleep. No worries. A tiny script in /etc/pm/sleep.d to unmount the file system is all it takes to rectify that situation.

Now if someone steals my laptop the only thing they will be able to access are my family photos. That is a pretty nice feeling. Even better, it turned out to be very easy.


  1. To allow the postgres user to access the encrypted file system you do need to mount it with the --public option.

  2. This is light duty, single user, performance we are talking about. I wouldn’t suggest this setup for a heavy load production environment, but in development it is no sweat.

The demise of law in the USA

When there’s no law, there’s no bread – Benjamin Franklin

I am not sure which way the causality goes in that statement. I fear that it is a self-reinforcing cycle where a slip on either side causes a slip on the other. I fear that because we in the USA are already well on our way.

First day at ID Watchdog

Today is my first day on the job at ID Watchdog.

After the normal pleasantries of getting a desk, etc, i had the pleasure of overhearing a conversation of some ops people in the kitchen. Apparently, a not so nice person provided the police with the identity of one of our clients rather than his own. A fact which the ID Watchdog system detected. Now the ops team is working with law enforcement and the judicial system to clear up the confusion. A standard part of the service we provide, apparently.

I am psyched to be working at a place that helps people in such concrete ways.

Configuration files

If you are using a dynamic interpreted language please do not use YAML1, or any other simple data serialization language, for configuration files.

Strictly speaking configuration is just data, of course, so you can use a data serialization language to represent your application’s, or library’s, configuration. In some environments, like static compiled languages say, using a data serialization language for configuration makes a lot of sense. Creating your own custom configuration language from scratch is probably going to be more trouble than it is worth, unless you have really complex configurations to express.

On the other hand, if you have an easily available, highly readable, Turing strength language why wouldn’t you use it? If the language that your application is written in supports eval you should probably use the app language to express the configuration. You will end up with a simpler and more powerful configuration system.

Let’s look at a common configuration as an example. In Rails the database connection configuration looks like this.

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  
test:
  adapter: mysql
  database: my_db_test
  username: my_app
  password:

production:
  adapter: mysql
  database: my_db_production
  username: my_app
  password:
  host: db1.my_org.invalid

That is not horrible, but it does include a fair bit of repetition. This approach starts getting ugly when you need to have more dynamic configurations. For example, RPM type distros use /tmp/mysql.sock as the domain socket for the MySQL database. Debian type distros use /var/run/mysqld/mysqld.sock. So you end up with a config file that looks like this

development:
  adapter: mysql
  database: my_db_development
  username: my_app
  password:
  socket: <%= File.exist?('/var/run/mysqld/mysqld.sock') ?
                '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock' %>

⋮

You always end up needing non-declarative bits in your configuration. Everyone realizes this at some point. So much so that the most common pattern for YAML configuration files in Ruby is for them to support ERB as a way to embed Ruby.

Rather than implement multi-pass configuration file loading you could just have configuration files be pure Ruby but produce a hash structure like the YAML+ERB version does.

databases = {
  :development => {
    :adapter  => 'mysql',
    :database => 'my_db_development',
    :username => 'my_app',
    :password => '',
    :socket   => File.exist?('/var/run/mysqld/mysqld.sock') ?
                   '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'
  },
  ⋮
}

That is nothing to write home about but it is dead simple to implement. Simpler even than the YAML+ERB approach. And it is superior to the YAML+ERB version because it is more powerful and extensible.
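For a sense of how little code the loading side needs, here is a minimal sketch. It assumes the configuration file's last expression is the hash; load_config is a hypothetical helper, not part of Rails.

```ruby
# Hypothetical loader for a pure-Ruby configuration file.
# Assumes the last expression in the file evaluates to a Hash.
def load_config(path)
  # Evaluate the file against a throwaway object so that it cannot see
  # the loader's own local variables; the path is passed along so that
  # error messages point at the configuration file.
  config = Object.new.instance_eval(File.read(path), path)
  raise TypeError, "#{path} must evaluate to a Hash" unless config.is_a?(Hash)

  config
end
```

An assignment such as databases = { ... } evaluates to the hash itself, so a file written in that style works unchanged. Keep in mind that this executes arbitrary Ruby, so it only suits configuration files you trust as much as the application code.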

Rather than trying to map configurations onto a set of name value pairs, i prefer to create a small DSL for the configuration. The result of adapting the language to the configuration that needs to be expressed is significantly more pleasant and DRY than the hash oriented approaches. Consider the following.

database {
  # Settings are method calls; `adapter = ...` inside a block would only
  # create a block-local variable.
  adapter  'mysql'
  database "my_db_#{RAILS_ENV}"
  username 'my_app'
  password ''
  socket   File.exist?('/var/run/mysqld/mysqld.sock') ?
             '/var/run/mysqld/mysqld.sock' : '/tmp/mysql.sock'

  env('production') {
    host 'db1.my_org.invalid'
  }
}

That is a lot clearer, simpler and less repetitive. In addition, users can do anything they need to because they have access to the full power of the host language. This DSL would be a little more difficult to implement than the hash oriented approach, but not much. For that additional bit of effort you get a huge improvement in usability and power. Your users are worth that effort.
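One possible way to implement such a DSL is sketched below. It assumes settings are written as method calls (adapter 'mysql') rather than assignments, because an assignment inside a Ruby block only creates a local variable. The DatabaseConfig class, the load_database_config helper and the environment method are all hypothetical names.

```ruby
# Hypothetical evaluator for a small database configuration DSL.
class DatabaseConfig
  attr_reader :environment

  def initialize(environment)
    @environment = environment
    @settings = {}
  end

  # Simple settings just record their value.
  %i[adapter username password socket host].each do |name|
    define_method(name) { |value| @settings[name] = value }
  end

  # `database { ... }` opens the section; `database 'name'` sets the name.
  def database(value = nil, &block)
    if block
      instance_eval(&block)
    else
      @settings[:database] = value
    end
  end

  # Settings in the block apply only to the named environment.
  def env(name, &block)
    instance_eval(&block) if name == @environment
  end

  def to_h
    @settings.dup
  end
end

# Evaluates DSL source text and returns the resulting settings hash.
def load_database_config(source, environment)
  config = DatabaseConfig.new(environment)
  config.instance_eval(source)
  config.to_h
end

EXAMPLE_CONFIG = <<~'CONFIG'
  database {
    adapter  'mysql'
    database "my_db_#{environment}"
    username 'my_app'

    env('production') {
      host 'db1.my_org.invalid'
    }
  }
CONFIG
```

Calling load_database_config(EXAMPLE_CONFIG, 'production') yields a hash that includes the host entry, while any other environment gets the same hash without it. The whole evaluator is a few dozen lines, which is the "not much" extra effort in question.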


  1. For those who are unfamiliar with YAML it is a nice, easy to read data serialization format. It is quite similar to JSON. More limited than XML but a lot simpler to use if you are just doing data serialization.