30 Jul 2009 • Software Development
Note to self: If you need to revert commits that have already been pushed, or otherwise merged into any other repo or branch, use git revert. git reset is exclusively for undoing commits in your local working tree that have not seen the light of day. Attempting to use git reset to undo changes that exist in multiple trees (changes that have been pushed or merged into another branch or repo) will result in pain and suffering.
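Roughly, the difference looks like this (the commit hash and branch name below are made up):

```sh
# Safe for shared history: git revert creates a new commit that undoes the
# bad one, and that new commit can be pushed like any other.
git revert abc1234
git push origin master

# Only safe for commits nobody else has seen: git reset rewrites the local
# branch to discard the last commit entirely.
git reset --hard HEAD~1
```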
08 Jun 2009 • Miscellaneous
I just read Blame Turned Upside Down1 by Robert Wright. I don’t think i have ever felt quite so uncomfortable while reading as i did during that essay. Its thesis is easy to apply to daily life and it quite handily explains the reactions of myself, and others, to many situations i have experienced. It is profoundly discomforting that people’s behaviors and feelings are dominated by a set of ancient heuristics. Particularly when that person is me.
21 May 2009 • Software Development
The question of how to communicate events often comes up when designing APIs and multi-component systems. The question basically boils down to this: should events be pushed to interested parties as they occur, or should interested parties poll for new data?
The short answer: interested parties should poll for new data.
The longer answer is, of course, it depends.
Polling is the only approach that scales to internet levels. For the smaller scales of internal multi-component systems the answer is much less clear cut. It is clear that a push approach can be implemented in such environments using either web hooks or XMPP. Such approaches often appear to be simpler and more efficient than the pull equivalents, and they are definitely lower latency.
The appearance of simplicity is an illusion, unfortunately. Event propagation using a push is only easy if you are willing to give up a lot of reliability and predictability. It is easy to say “when an event occurs i will just POST it to the registered URI(s)”. That would be easy, but the world is rarely that simple. What if the receiving server is down or unreachable? Are you going to retry, and if so how many times? If not, is that level of message loss acceptable to all the interested parties? If the receiving system is very slow, will that cause a back-log in the sending system? If a lot of events happen in a very short period of time, can the receiving system handle the load?
The efficiency benefits of a push approach are real, but not nearly as significant as they first appear. HTTP’s conditional request mechanism provides, when used effectively, a way to reduce the cost of polling to quite low levels.
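As a rough sketch of what cheap polling can look like (the URI and ETag value below are made up), the client hangs on to the validator from its last poll and the server answers 304 Not Modified when nothing has changed:

```sh
# First poll: fetch the event feed and note the validator the server returns.
curl -is https://api.example.com/events
#   HTTP/1.1 200 OK
#   ETag: "a7f3c2"
#   ...event data...

# Later polls: hand the validator back. If nothing has changed the server
# answers 304 Not Modified with no body, which costs both sides very little.
curl -is -H 'If-None-Match: "a7f3c2"' https://api.example.com/events
#   HTTP/1.1 304 Not Modified
```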
Pull is cool
APIs should be built around pulling data unless there is a particular functional concern that makes pull unworkable (e.g. low message latency being very important). Any push approach will have all the complexities of a pull approach (to handle reliability issues), combined with a lot less predictable behavior, because its performance will depend on one or more other systems' ability to handle the event notification workload.
13 Apr 2009 • Software Development
Any integration database confers legacy status on all modules that access it. That is what you gain when you allow more than one module of software to access any single database.
For the sake of this discussion we will define a module as a unit of software that is highly cohesive, logically independent and implemented by a single team. A module is usually a single code base. Practically speaking, module boundaries usually depend on Conway’s Law. However, disciplined teams can gain some advantage from sub-dividing their problem space into multiple modules.
Pardon me while i digress…
I recently attended Mountain West RubyConf. While there i had the pleasure of hearing a brilliant talk by Jim Weirich about connascence in software. (Here are some other write-ups of the same talk, but from different conferences.) A unit of software, A, is connascent with another, B, if there exists some change to B that would necessitate a change in A to preserve the overall correctness of the system.
Connascence has various flavors. Some of these forms cause more maintenance problems than others. Connascence of Name is linking code by name. For example, calling some method foo. Calling methods by name is a form of connascence, but one which we regularly accept. Connascence of Algorithm, in contrast, is the linking of code by requiring both pieces to use the same algorithm. When we see this in practice we generally run from the room screaming “DRY”.
Many of the practices we like in software – DRY, Law of Demeter, etc – reduce connascence. Likewise, many of the things we regard as code smells result in increased connascence. Lower levels, and degrees, of connascence are desirable because they reduce the effort required to change software. Mr Weirich posited that connascence might be a basis for a unified theory of software quality. It is definitely the most comprehensive framework i am aware of for thinking, and communicating, about the quality of software.
Back on point
It is clear that a database makes all the modules that access it connascent with one another. Many changes to an application will require changes to the database; any changes to the database will require changes to the other modules to maintain overall correctness. All changes to the database require that the correctness of all accessing modules be verified and such verification is often non-trivial.
The integration database is particularly problematic because it forces a high degree of a variety of forms of connascence. It causes Connascence of Name. All modules involved need to agree on the names of the columns, tables, database, schema and host. Right off the bat you should start to wonder about this approach. The modules are weakly connascent at many points.
If the data model is normalized it may well avoid introducing significant Connascence of Meaning. If, on the other hand, there are enumerated values without look-up tables; logically boolean fields stored as numeric types; or a wide variety of other common practices, you will have introduced some Connascence of Meaning. Connascence of Meaning requires that the units agree on how to infer meaning from a datum (e.g., in this context 1 means true). It is a stronger form of connascence than Connascence of Name.
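To make that concrete, here is a contrived sketch (the database, table and column names are made up) of two modules, owned by different teams, that both have to “know” that 1 means true in the accounts.active column:

```sh
# Module A: a billing job maintained by one team.
active=$(psql -tA -c "SELECT active FROM accounts WHERE id = 42" billing)
if [ "$active" = "1" ]; then   # the "1 means true" rule lives here...
    echo "invoice account 42"
fi

# Module B: a reporting job maintained by another team. The same rule is
# repeated here, and both copies have to change together.
psql -tA -c "SELECT count(*) FROM accounts WHERE active = 1" billing
```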
While we are on this point, remember that if your mad data modeling skills did manage to avoid significant Connascence of Meaning you did it by adding significant amounts of Connascence of Name. That is what a look-up table is.
So far not so good, but that was the good part. The really disastrous part is that by integrating at the database level you are required to introduce staggering levels of Connascence of Algorithm. Connascence of Algorithm is a very strong form of connascence. Allowing one module to interact with another’s data storage means that both modules have to understand (read: share the same algorithm for) the business rules of the accessed module. If the business rules change in a module, the possibility exists that any other module that accesses the data might now operate incorrectly.
I fall squarely on the side of not using databases as integration vectors. The forms and degree of connascence that such an approach necessitates make it a recipe for disaster. Such high levels of connascence will raise the cost of change (read: improvement) in all the modules involved. Application systems tend to get more integrated over time, so the cost of improvement rises rapidly in systems that use integration databases.
After a while the cost of change will become so high that only improvements that provide huge value to the business will be worth the cost. For weak, undisciplined teams this will happen very rapidly. For strong, smart and highly disciplined teams it will take a bit longer, but it will happen. Once you allow more than one module to access your app’s database, forever will it dominate your destiny.
06 Apr 2009 • Software Development
Security is a thing at my new job. We follow PCI security standards. We take great care never to have our customers’ sensitive information on unsecured machines. We make efforts to stay aware of the security risks in our environments.
With that in mind, i decided that storing all my work related files on my laptops in clear text was suboptimal. I have thought this same thing at pretty much every job, but i have never done anything about it before.
After a bit of research i settled on EncFS as the best mechanism to encrypt my work related data. It is an encrypted file system. Unlike most of the other encrypted file systems for Linux, EncFS does not require reserving large amounts of disk space up front. EncFS effectively lets you make a directory on an existing file system in which all the files will be encrypted. It is very easy to set up and use.
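A rough sketch of the setup, with made-up directory names:

```sh
# Create (on first run) or mount the encrypted directory. Everything written
# to ~/work is transparently encrypted into ~/.work.encrypted.
encfs ~/.work.encrypted ~/work

# Unmount it when done; only the encrypted files remain on disk.
fusermount -u ~/work
```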
In addition to the normal sorts of files, all my test and development databases need to be encrypted too. These databases don’t contain any customer data, but they do contain some information that my employer would prefer not be public knowledge. When the database server starts it cannot access the database files until i provide the encryption password. Fortunately, PostgreSQL is totally bad ass. It will happily start up even if it is unable to access some of the configured table spaces.1 As soon as the encrypted file system is mounted, the databases that reside in the encrypted directory instantly become available. The encryption layer does not even affect performance noticeably.2
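For the curious, pointing PostgreSQL at the encrypted directory is just an ordinary tablespace. A sketch with made-up paths and names (the server’s OS user also needs permission to read the FUSE mount):

```sh
# A directory inside the EncFS mount to hold the database files.
mkdir -p ~/work/pgdata

# Create a tablespace backed by that directory, then put the development
# database in it. Names and paths here are illustrative.
psql -d postgres -c "CREATE TABLESPACE encrypted LOCATION '/home/me/work/pgdata'"
psql -d postgres -c "CREATE DATABASE myapp_development TABLESPACE encrypted"
```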
One thing i did think was a little weak is that the encrypted file system does not get unmounted when the computer is put to sleep. No worries. A tiny script in /etc/pm/sleep.d to unmount the file system is all it takes to rectify that situation.
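The script amounts to something like this (the file name and mount point are made up):

```sh
#!/bin/sh
# /etc/pm/sleep.d/05_unmount_encfs -- pm-utils runs every hook in this
# directory with an argument naming the power event.
case "$1" in
    suspend|hibernate)
        # Unmount the EncFS mount so the plaintext view is gone while asleep.
        fusermount -u /home/me/work || true
        ;;
esac
```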
Now if someone steals my laptop the only thing they will be able to access are my family photos. That is a pretty nice feeling. Even better, it turned out to be very easy.