Why i don’t love schemas

Once, long ago, i thought that schemas were great and that they could provide much value. Schemas hold the promise of declaratively and unambiguously defining message formats. Many improvements would be “easy” once such a definition was available: automated low-code validations, automagically created high quality, verifiably correct documentation, a clearer understanding of inputs, etc. I tried fiercely to create good schemas, using ever better schema languages (RelaxNG compact syntax, anyone?) and ever more “sophisticated” tools to capture the potential benefits of schemas.

At every turn i found the return on investment disappointing. Schema languages are hard for humans to write and even harder to interpret. They are usually unable to express many real world constraints (you can only have that radio with this trim package, you can only have that engine if you’re not in California, etc). High quality documentation is super hard even with a schema. Generation from a schema solves the easy part of documentation; explaining why the reader should care is the hard part, and optimizing a non-dominant factor in a process doesn’t actually move the needle all that much. Automated low-code validations often turned out to hamper evolution far more than they caught real errors.

I wasn’t alone in my disappointment. The whole industry noticed that XML, with its schemas and general complexity, was holding us back. Schemas ossified API clients and servers to the point that evolving systems was challenging, bordering on impossible. Understanding the implications of all the code generated from schemas was unrealistic for mere mortals. Schemas became so complex that it was impossible to generate human interpretable documentation from them. Instead people just passed around megs of XSD files and called it “documentation”.

JSON emerged primarily as a reaction to the complexity of XML. However, XML didn’t start out complex; it accreted that complexity over time. Unsurprisingly, the cycle is repeating. JSON Schema is a thing and seems to be gaining popularity. It is probably a fool’s errand to try stemming the tide, but i’m going to try anyway.

doomed to watch everyone else repeat history

What would it take for schemas to be net positive? Only a couple of really hard things.

Humans first

Schema languages should be human readable first and computer readable second. Modern parser generators make purpose built languages easy to implement. A human centered language turns schemas from something understandable only by an anointed few into a powerful, empowering tool for the entire industry. RelaxNG compact syntax (linked above) is a good place to look for inspiration. The XML community never adopted a human readable syntax, and that contributed to the ultimate failure of XML as a technology. Hopefully we in the JSON community can do better.

Avoid the One True Schema cul de sac

This one is more of a cultural and educational issue than a technical one. This principle hinges on two realizations:

  1. A message is only valuable if at least one consumer can achieve some goal with that message.
  2. Consumers want to maximize the number of goals achieved.

However, consumers don’t necessarily have the same goals as each other. In fact, any two consumers are likely to have highly divergent goals, and therefore highly divergent needs in messages. A message with which one consumer can achieve a goal may be useless to another. Therefore, designing a schema to be used by more than one consumer is an infinite-variable optimization problem. Your task is to minimize the difference between the set of valid messages and the set of actually processable messages for every possible consumer! (See appendix A for more details.) A losing proposition if there ever was one.

To mitigate this, schema languages should provide first class support for “personalizing” existing schemas. A consumer should be able to a) declare that it only cares about some subset of the properties defined in a producer’s schema and b) declare that it will accept and use properties not declared in a producer’s schema. This would allow consumers to finely tune their individual schemas to their specific needs. It would improve evolvability by reducing incidental coupling, increase the clarity of documentation by hiding irrelevant parts of the producer’s schema, and improve automated validation by ignoring irrelevant parts of messages.
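As a sketch of what such personalization might buy us, here is a toy illustration in Python. The schema shape, property names, and helper functions are invented for this example; no real schema language works exactly this way.

```python
# Hypothetical producer schema: only the "required" keyword is modeled here.
producer_schema = {
    "required": ["id", "name", "price", "warehouse_code"],
}

def personalize(schema, cares_about):
    # Keep only the properties this consumer actually uses; everything else,
    # including properties the producer never declared, is simply ignored.
    return {"required": [p for p in schema["required"] if p in cares_about]}

def valid_for(schema, message):
    # A message is "valid enough" for a consumer if its required subset is present.
    return all(prop in message for prop in schema["required"])

# A pricing consumer only needs the id and the price.
pricing_schema = personalize(producer_schema, cares_about={"id", "price"})

msg = {"id": 42, "price": 9.99, "color": "red"}  # missing "name", extra "color"
print(valid_for(pricing_schema, msg))   # True: irrelevant omissions and additions don't matter
print(valid_for(producer_schema, msg))  # False: the One True Schema rejects it
```

The same message is rejected by the producer’s full schema but perfectly usable by the consumer that personalized it, which is exactly the coupling reduction argued for above.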

We as a community should also educate people about the dangers of The One True Schema pattern.


Designing schemas for humans and avoiding the One True Schema are both really hard. And, unfortunately, i doubt our ability to reach consensus and execute on them. Given that, i think most message producers and basically all message consumers are better off avoiding schemas for anything other than documents.

Appendix A: Wherein i get sorta mathy about why consumers shouldn’t share schemas unless they have the same goals

I don’t know if this will help other people but it did help me clarify my thinking on this issue.

M = set of all messages

mC = {m | m ∈ M, m contains info needed by consumer C}

A particular consumer, C, needs messages to contain certain information to achieve its goal.

mV = {m | m ∈ M, m is valid against schema S}

For any particular schema there is some subset of all messages that are valid.

mC = lim mV as S->perfectSchemaFor(C)

As the schema approaches perfection for consumer C the set of valid messages approaches the set of messages actually processable by the consumer.

mC ≠ mV
mC ⊄ mV
mC ⊅ mV

In practice, however, there is always some mis-match between the set of valid messages and the set of messages actually processable by the consumer. Some technically invalid messages contain enough information for the consumer to achieve its goal. Some technically valid messages will contain insufficient information. The mis-match may be due to bugs, a lack of expressiveness in the schema language, or just poor design. The job of a schema designer is to minimize the mis-match.

Now consider a second consumer of these messages.

mD = {m | m ∈ M, m contains info needed by consumer D}

A particular consumer, D, needs messages to contain certain information to achieve its goal.

mC ≠ mD (in general)

The information needed by consumer D will, in general, be different from the information needed by consumer C. Therefore, the set of messages processable by C will, in general, not equal the set of messages processable by D.

perfectSchemaFor(C) ≠ perfectSchemaFor(D)

This is the kicker. The perfect schema for consumer C is, in general, different from the perfect schema for any other consumer. Minimizing the difference between mV and mC will tend to increase the difference between mV and mD.

supervisor child startup in elixir

Fact: Supervisors fully serialize the startup of their children. Each child’s init/1 function runs to completion before the supervisor starts the next child.

For example, given modules ModuleA and ModuleB that implement both the init/1 callback and this supervisor:

Supervisor.start_link([
  ModuleA,
  ModuleB
], strategy: :one_for_one)

The ModuleA child process will be started and its init/1 function will run to completion before the ModuleB child process starts. This is true regardless of the strategy used.

so what?

This fact can often simplify application startup and error handling. For example, consider a situation where several processes need a token to interact with an API. The application must acquire this token on startup. If the token expires or the server revokes it, the application must acquire a brand new token and distribute it to all the processes.

We can implement that behavior simply and easily using the startup serialization and an appropriate crash handling strategy of a supervisor.

defmodule MyApi.TokenManager do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def get_token, do: GenServer.call(__MODULE__, :get_token)

  def init(_) do
    # Runs to completion before any later sibling starts.
    token = fetch_token_from_api!()
    {:ok, token}
  end

  def handle_call(:get_token, _from, token) do
    {:reply, token, token}
  end
end

defmodule MyApi.ThingDoer do
  use GenServer

  def init(state), do: {:ok, state}

  def handle_call(:do_thing, _from, state) do
    token = MyApi.TokenManager.get_token()

    # do stuff with token; crash if it doesn't work
    {:reply, :ok, state}
  end
end

defmodule MyApi.OtherThingDoer do
  use GenServer

  def init(state), do: {:ok, state}

  def handle_call(:do_other_thing, _from, state) do
    token = MyApi.TokenManager.get_token()

    # do stuff with token; crash if it doesn't work
    {:reply, :ok, state}
  end
end

Supervisor.start_link([
  MyApi.TokenManager,
  MyApi.ThingDoer,
  MyApi.OtherThingDoer
], strategy: :one_for_all)

In this example MyApi.TokenManager.init/1 acquires the token before returning. That means the token is ready by the time MyApi.ThingDoer and MyApi.OtherThingDoer start. If at any point the API server revokes the token, or it expires, the next thing doer to try to use it can just crash. That crash will cause the supervisor to shut down the remaining children and restart them all, beginning with MyApi.TokenManager, which will acquire a new, working token.

With this approach MyApi.ThingDoer and MyApi.OtherThingDoer don’t need any specific error handling code around token management. The removal of situation-specific error handling logic makes them simpler and more reliable.

hypermedia format manifesto

Through my work developing and consuming APIs i have come to value:

  • evolvability over message and implementation simplicity
  • self describing messages over reduced message sizes
  • standards over bespoke solutions
  • human readability over client simplicity
  • uniformity over flexibility

I value the things on the right, but i value the things on the left more.

evolvability over message and implementation simplicity

APIs, and the producers and consumers of APIs, must be able to evolve over time. Evolvability inherently means that messages and implementations will be more complex. Designers and implementers must have forward compatibility in mind at all times. This forward compatibility mindset produces features that add value only after months, years, or even decades of life. Having those features is more complex than not, but the return on those investments is worth the cost.

self describing messages over reduced message sizes

Embedding all the information needed to interpret a message simplifies client implementation and improves evolvability. However, embedding all that information necessarily increases the size of the message. For most APIs the additional data transfer is just not important enough to give up the benefits of self-describing messages.
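To make the trade-off concrete, here is a toy comparison in Python. The field names, units, and profile URL are invented for illustration.

```python
# A terse message assumes out-of-band knowledge: what is "t"? in what unit?
terse = {"t": 21.4, "ts": 1700000000}

# A self-describing message carries what a client needs to interpret it.
self_describing = {
    "temperature": {"value": 21.4, "unit": "celsius"},
    "observed_at": "2023-11-14T22:13:20Z",
    "profile": "https://example.com/profiles/observation",  # how to read this doc
}

# The consumer can check the unit instead of silently assuming it.
reading = self_describing["temperature"]
if reading["unit"] == "celsius":
    fahrenheit = reading["value"] * 9 / 5 + 32
else:
    fahrenheit = reading["value"]
print(round(fahrenheit, 1))  # 70.5
```

The self-describing form is bigger on the wire, but a client can validate its assumptions at runtime instead of breaking silently when the producer changes units or field meanings.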

standards over bespoke solutions

Standards allow reuse of code and of knowledge. Standards often encode hard-won practical experience about what works and what doesn’t. However, standard solutions often don’t fit as well as purpose-designed solutions to specific problems.

human readability over client simplicity

It is important that APIs be understandable by mere mortals. An average developer should be able to easily understand and explore an API without specialized tools. Achieving human readability while also honoring the other values often means that clients must become more complicated.

uniformity over flexibility

There should be a small number of ways to express a particular message. This makes consumer and producer implementations simpler. However, this means that existing APIs will likely be non-conformant. It also means that some messages will be less intuitive and human readable.

why now

There has been a fair bit of discussion in the HTTP APIs hypermedia channel (get an invite) lately about hypermedia formats (particularly those of the JSON variety). Personally, i find all of the existing options wanting. I’m not sure the world needs yet another JSON based hypermedia format, but the discussion did prompt me to try to articulate what i value in a format. The format is blatantly stolen from the Agile Manifesto.

How i judge software engineers

My kid’s teachers routinely provide rubrics for assignments. At first blush, rubrics are tools to make grading an assessment easier. They are effective in that role. They turn out to be at least as effective at communicating expectations. Readers of a rubric can quickly and easily determine what is important.

Recently i developed a rubric to help me judge the performance of engineers (it was review season yet again). The engineers on my team have appreciated the transparency and clarity this tool provides. Hopefully, it will be helpful to others.

The rubric is divided into two sections: one about results and the other about behavior. Both of these are important. Good, pro-social behavior is just as important in engineers as strong results.


Each criterion below is scored from 3 (best) to 0 (worst).

Continuous improvement
  3: always leaves code substantially better than they found it; manages scope of refactors to match time available
  2: often improves code as part of normal work
  1: sometimes improves code as part of normal work; often has to abandon refactors due to time constraints
  0: rarely improves existing code
  3: communicates intentions early and clearly; often provides material support to teammates (developers, QA, on-call person, etc); regularly creates & maintains runbooks (particularly when on call); keeps stakeholders informed (particularly when on call)
  2: communicates intentions after it is hard to change course; regularly supports teammates; occasionally maintains runbooks
  1: rarely communicates intentions; sometimes supports teammates; doesn’t maintain runbooks
  0: never communicates intentions; never supports teammates; never creates or improves runbook entries
Production support
  3: prioritizes concurrent incidents correctly; uses critical thinking and problem-solving skills to resolve issues quickly; handles lower priority issues (eg, jenkins nodes down) when there are no higher priority incidents
  2: prioritizes concurrent incidents correctly; uses critical thinking and problem-solving skills to resolve issues; handles lower priority issues (eg, jenkins nodes down) when there are no higher priority incidents
  1: prioritizes concurrent incidents correctly; ignores lower priority issues (eg, jenkins nodes down) even when there are no high priority incidents (works stories while on call); relies too heavily on others (rather than using critical thinking and problem-solving skills); bad attitude
  0: incorrectly prioritizes concurrent incidents; always ignores lower priority issues (jenkins nodes down); doesn’t communicate incident status to stakeholders; relies on others to resolve issues (throws it over the wall)
Code Quality
  3: functional; well factored; documentation on classes/modules and public methods/functions; PRs require trivial changes
  2: functional; well factored; poorly documented; PRs require minor changes
  1: functional; poorly factored; undocumented; PRs require some rework
  0: buggy; poorly factored; undocumented; PRs require substantial rework
  3: usually delivers more stories per sprint than the average engineer; usually delivers more points per sprint than the average engineer
  2: occasionally delivers more stories/points per sprint than the average engineer
  1: usually delivers fewer stories/points per sprint than the average engineer
  0: delivers substantially fewer stories/points per sprint than the average engineer
  3: public contracts well tested; key scenarios have acceptance tests; tests are independent of current implementation
  2: public contract partially tested; key scenarios have acceptance tests; tests are independent of current implementation
  1: public contract partially tested; tests dependent on current implementation; acceptance tests check too many edge cases
  0: no unit or functional tests
  3: often reviews PRs; feedback is substantive and useful; reviews show understanding of the PR’s intent and the code that interacts with it
  2: regularly reviews PRs; advice would materially improve PRs; reviews show understanding of the PR’s intent
  1: sometimes reviews PRs; reviews are superficial; reviews are hard to understand
  0: never reviews PRs
Product & domain knowledge
  3: understands the domain; understands most of the supported features of the product; understands some of the historical features of the product
  2: understands the domain; understands many of the supported features of the product
  1: some knowledge of the domain; limited knowledge of product features
  0: no knowledge of the utility and grid edge domain; no knowledge of the product
Personal goals
  3: goals are SMART; goals drive achievement of team and corporate goals in material ways; achieves goals on time
  2: goals are SMART; goals weakly support team and corporate goals; achieves goals
  1: goals have no relation to team and corporate goals; achieves goals
  0: goals are vague or unattainable; goals work against team and corporate goals; doesn’t achieve goals


  3: polite and engaging even when under stress (eg, when on call); accepts setbacks and moves forward
  2: normally polite but brusque when under stress; accepts setbacks and moves forward
  1: normally polite but rude when under stress
  0: rude; dismissive
  3: courageous in all aspects of work every day; strives for greatness even when difficult; occasionally fails spectacularly
  2: often courageous in most aspects of work; occasionally fails
  1: sometimes courageous in some aspects of work; rarely fails
  0: often timid; usually takes least risky (and least rewarding) approach
  3: highly motivated to succeed; accepts challenges and new responsibilities
  2: motivated to succeed; grudgingly accepts new challenges and responsibilities
  1: resists new challenges and responsibilities
  0: lacks motivation; rejects all new challenges and responsibilities
Strategic thinking
  3: plans for 6 month – 2 year horizon
  2: plans for 3 – 6 month horizon
  1: plans for 1 – 3 month horizon
  0: no planning
  3: earns trust of others; reliably meets commitments; honest
  2: occasionally fails to meet commitments; sometimes fails to gain the trust of others
  1: fails to meet commitments; occasionally misconstrues the facts
  0: often misconstrues the facts; widely distrusted
  3: seeks out learning opportunities; applies lessons learned to enhance success; elicits relevant experiences from others
  2: interested in learning; applies lessons learned to enhance success
  1: learns when pushed; resists change
  0: uninterested in learning; resists change
  3: improves productivity and morale of pairs; pairs most of the time
  2: pairs effectively; pairs most of the time
  1: pairs ineffectively; pairs some of the time
  0: reduces productivity and morale of pairs; rarely pairs

* Optional. If your team pairs as a matter of course this row is very important. If your team works as individuals then ignore this row.

Web API best practices — forward compatibility patterns

Designing APIs and clients for forward compatibility is the best way to achieve evolvability and maintainability in a microservice architecture. Wikipedia defines forward compatibility as “a design characteristic that allows a system to accept input intended for a later version of itself.” Designing APIs and clients to be forward compatible allows each side of that relationship to change independently of the other. Forward compatibility is an exercise in reducing connascence. We want a system in which we can modify any component (service) without breaking the current behavior of the system. Without such independence any microservice architecture is doomed to collapse under its own maintenance cost.

Change is inevitable. Any successful architecture must accept, even welcome, change. Building avenues for adaptation into our designs allows our systems to survive and thrive in the face of change. This set of best practices has proven to provide superior forward and backward compatibility. As described here these practices are specific to HTTP and web APIs but the general ideas apply to any application protocol.

Forward compatibility is largely the responsibility of clients or document consumers. However, each forward compatible pattern is supported by backward compatibility in the servers or document producers. The following list of patterns describes the responsibilities of both parties. Compatibility will suffer if either side fails to uphold their end of the contract.


Redirection

Clients must follow HTTP redirects.

URLs change over time. Companies rebrand their products, get acquired, change their name, etc. Developers relocate services to more optimal components. New features sometimes require changes to URLs. Redirection allows servers to change URLs without breaking existing clients by providing runtime discovery of the new URLs.
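The client-side rule can be sketched as follows. The transport is stubbed out so the example is self-contained, and the URLs are hypothetical; a real client would lean on an HTTP library that follows redirects for it.

```python
MOVED = {"/old/widgets": "/v2/widgets"}  # hypothetical relocated URL

def http_get(url):
    # Stand-in for a real HTTP GET: returns (status, headers, body).
    if url in MOVED:
        return 301, {"Location": MOVED[url]}, None
    return 200, {}, f"representation of {url}"

def get_following_redirects(url, limit=5):
    # Follow 3xx responses by re-requesting the Location header, with a hop
    # limit so a misconfigured redirect loop cannot hang the client.
    for _ in range(limit):
        status, headers, body = http_get(url)
        if status in (301, 302, 307, 308):
            url = headers["Location"]
            continue
        return url, body
    raise RuntimeError("too many redirects")

final_url, body = get_following_redirects("/old/widgets")
print(final_url)  # /v2/widgets
```

A client written this way keeps working when the server relocates a resource, which is the whole point of the pattern.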

Supporting backward compatibility

A URL, once given to a client, represents a contract. To honor that contract, servers must provide redirection from existing URLs when they change. Even in hypermedia APIs clients will bookmark URLs, so hypermedia is no excuse for breaking existing URLs.

Must ignore semantics

Document consumers must ignore any features of documents they don’t understand.

Document formats change over time. New features require new properties and links. New information illuminates our understanding of cardinalities and datatypes. Ignoring unrecognized document features allows producers to add new features without fear of breaking clients.
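A minimal sketch of a must-ignore consumer, with hypothetical field names:

```python
def consume_order(message):
    # Touch only the properties this consumer understands; any unrecognized
    # properties in the message pass through harmlessly.
    return {"order_id": message["order_id"], "total": message["total"]}

# An old-style message and a newer one with properties this consumer predates.
v1 = {"order_id": 7, "total": 19.95}
v2 = {"order_id": 7, "total": 19.95, "loyalty_points": 12, "gift_wrap": True}

print(consume_order(v1) == consume_order(v2))  # True: new fields don't break it
```

Because the consumer never enumerates the full message, the producer is free to add `loyalty_points` and `gift_wrap` without coordinating a client release.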

Supporting backward compatibility

Document producers must not remove existing document features, nor change their syntax or semantics. Once a document feature has been released it is sacrosanct. Idiosyncrasy is better than carnage.

Hypermedia APIs

  • Clients must discover available behavior at runtime.
  • Clients must handle the absence of links gracefully.
  • Clients must not construct URLs.

Available behavior changes over time. We discover new business rules. We realize that existing features are ill-conceived. Hypermedia APIs provide a mechanism for clients to discover supported behavior by embedding links to behavior in API responses.
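A sketch of runtime link discovery, using a HAL-style `_links` shape; the link relations and document are invented for the example:

```python
def find_link(doc, rel):
    # Look the relation up at runtime instead of hard-coding or constructing a URL.
    return doc.get("_links", {}).get(rel, {}).get("href")

account = {
    "balance": 100,
    "_links": {
        "self": {"href": "/accounts/1"},
        "deposits": {"href": "/accounts/1/deposits"},
        # note: no "withdrawals" link; perhaps the account is frozen
    },
}

deposit_url = find_link(account, "deposits")
withdraw_url = find_link(account, "withdrawals")

if withdraw_url is None:
    # Absence of a link means the behavior is unavailable right now; a
    # well-behaved client disables the action rather than guessing a URL.
    print("withdrawals unavailable")
```

The server can add, remove, or relocate behavior per response, and the client adapts without redeployment.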

Supporting backward compatibility

The API must be defined using a hypermedia format. The exact format doesn’t matter. (HAL, JSON API, Hydra, HTML, etc are all good.) What does matter is that all behavior of the API must be discoverable at runtime.

Server content negotiation

Clients must inform the server of what document formats they can process via the Accept header.

Document formats have a half life. Sometimes the idiosyncrasies build to the point that a new document format is worth the effort. Sometimes you realize that a particular representation is too bulky or too terse. Server driven content negotiation allows servers to introduce new, improved representations of existing resources while continuing to support existing clients.
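A simplified sketch of the server side of the negotiation. It ignores quality values (`q=`) and media-type parameters, which a real implementation must handle; the supported formats are hypothetical.

```python
SUPPORTED = ["application/hal+json", "application/json"]  # preferred first

def negotiate(accept_header):
    # Pick the first representation the server supports that the client accepts.
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in SUPPORTED:
        if media_type in accepted or "*/*" in accepted:
            return media_type
    return None  # would translate to a 406 Not Acceptable response

print(negotiate("application/json"))                        # application/json
print(negotiate("application/hal+json, application/json"))  # application/hal+json
print(negotiate("text/html"))                               # None
```

Old clients keep requesting (and receiving) the old representation while new clients get the improved one from the same URLs.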

Supporting backward compatibility

Servers must continue honoring requests for existing document formats. Document formats may be phased out once they are no longer requested by any clients.


This set of best practices and patterns allows the construction of sophisticated distributed systems that can evolve and be maintained with reasonable effort. Without these patterns it is almost inevitable that a microservice architecture will end up in dependency hell.

How to componentize a monolith

Micro-service architectures are all the rage these days. Let’s say, totally hypothetically, that you already have a large code base that has all the pathologies we have come to expect from monoliths. You may think something like “we should break this behemoth up into a collection of components, each of which is comprehensible to mere humans.” That is a great goal, but how do you get from a huge, highly coupled monolith to a constellation of cooperating components? Most “efforts” to modularize monolithic software are abandoned at about the time someone seriously considers that question. That is, before the effort even starts. It is easy to become overwhelmed by the immensity of the task and the absence of obvious starting points, and to just give up without a fight.

I have led several successful decomposition efforts over my career. Each of these efforts shared some common patterns which, i believe, are the keys to success. These patterns are:

  • commitment
  • componentize based on data locality
  • extract auth first
  • new features in new components
  • fracture along existing seams


Commitment

Decomposing a significant monolith will take time. It took tens or hundreds of man years to build the monolith. Breaking it up is going to take a while. Accept it. Plan for it. Remind yourself, and others, of this fact regularly. Patience is the only way to succeed. Some days, or weeks, or months, it will seem like you are never going to reach your goal. In the moment, it will look like you are barely making a dent, but if you have patience and persistence you will, over time, extract significant functionality into easier to maintain components.

Most people overestimate what they can do in one year and underestimate what they can do in ten years. – Bill Gates

You will never get done, though. Like any interesting project there is always more to do. Strive for progress, not perfection. The goal is not completion. Rather it is winning this, and the next, round of the repeating game. Winning this round implies shipping new features. Winning the next round implies leaving the architecture better than you found it.

All of the above applies to the entire development team, including management, not just the individual who initiates the decomposition. If management isn’t on board it will be difficult, or impossible, to sustain the effort needed to make real progress. If the engineering team as a whole is not committed, progress will stop with the loss of a key team member. The goal is to build an organization that will continuously chip away at the monolith.

Componentize based on data

Perhaps the most ubiquitous concern for any service architecture is performance. Poor performance results in higher compute costs, reduced throughput, and increased latencies. Performance is always an issue with services because the architectural style is based on inter-process communication (IPC). IPC requires IO, and IO is way slower than memory access. The harsh reality is that it is harder to keep a service architecture performant.

One approach that can mitigate performance concerns is to co-locate the data and code involved in important operations in the same component. In practice, this means designing the component boundaries such that the component that implements a particularly important service is also the system of record for the data used by that service. Obviously, data locality must be balanced against componentization if we are to break up a monolith. Consider how important any particular operation is: services that are depended upon by many other services, or that are used in time critical parts of the application, should be co-located with the data they use. Services that are used less frequently, or in less critical places, can use data from other services.

This is just a rule of thumb. Any real system will have situations that simply don’t allow for the preferred data locality for all operations. Engineering is all about trade offs. In these situations caching becomes critical. HTTP, my preferred service protocol, provides sophisticated support for caching. Spend time thinking about cache lifespans and invalidation when designing the resources of an API. Doing so will allow clients to more effectively and safely cache the data exposed. That will, in turn, improve the perceived performance and reduce the compute needs of the system. No operation is faster than the one you avoid altogether. A well designed caching strategy allows avoiding many operations.
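As a sketch of the idea, here is a toy client-side cache keyed by URL that honors a max-age, in the spirit of HTTP's Cache-Control. The fetch function and timestamps are stand-ins, not a real HTTP cache.

```python
import time

cache = {}        # url -> (expires_at, body)
calls = {"n": 0}  # counts trips to the origin server

def fetch(url):
    calls["n"] += 1
    return f"copy of {url}"  # stand-in for a real HTTP GET

def cached_get(url, max_age=60, now=None):
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry and entry[0] > now:
        return entry[1]  # the fastest operation is the one avoided altogether
    body = fetch(url)
    cache[url] = (now + max_age, body)
    return body

a = cached_get("/rates", now=0)    # miss: fetches from the origin
b = cached_get("/rates", now=30)   # hit: within max_age, no fetch
c = cached_get("/rates", now=120)  # expired: fetches again
print(calls["n"])  # 2
```

The second request never touches the origin at all, which is exactly the compute and latency win a well designed cache lifespan buys.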

Extract auth first

The very first thing to extract into a separate component is authorization and authentication. Authorization is needed by basically everything (though it doesn’t always require a separate component1), including all the services that will be extracted from the monolith. Authorization and authentication have good, widely supported standards. Such standards and tools improve the chances of success in this first foray and, as we all know, early successes breed confidence and increase commitment.

Service architectures should use OAuth2. It is secure, widely implemented, and well supported in most technology stacks. The auth component will perform the authorization server role. All other components will be resource servers and/or clients.

The authorization request & grant portion of the OAuth2 flow should be short circuited for internal components. There is no need to ask the user to explicitly grant authorization to a mere implementation detail of the overall system. Such a short circuiting is usually accomplished by keeping track of which clients are internal components. When an internal component requests authorization the authorization server simply grants it immediately (after authenticating the user, of course).
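A sketch of that short circuit. The client registry and response shapes are invented for illustration; this is not a real OAuth2 implementation.

```python
INTERNAL_CLIENTS = {"billing-service", "report-generator"}  # hypothetical ids

def handle_authorization_request(client_id, user):
    if user is None:
        return {"error": "authentication_required"}  # users always authenticate
    if client_id in INTERNAL_CLIENTS:
        # Internal components are implementation details of the system, so
        # don't ask the user to "allow" them; grant authorization immediately.
        return {"granted": True, "consent_screen_shown": False}
    # Third-party clients still go through the normal explicit consent step.
    return {"granted": False, "consent_screen_shown": True}

print(handle_authorization_request("billing-service", user="alice"))
print(handle_authorization_request("some-mobile-app", user="alice"))
```

The user still authenticates exactly once; only the consent screen is skipped, and only for clients the authorization server knows to be internal.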

The combination of simplicity and existing standards make auth a great first service to extract.

New features

Once you have extracted authorization and authentication it is time to take this show on the road. My preferred second target is a totally new feature. Building a new feature as a service has several advantages. First, new features are inherently less coupled to the existing code. This makes them easier to implement in a separate component. Second, new features are usually lower risk politically. Changing the way an existing feature works will usually raise some concerns, but fewer people have a vested interest in hypothetical features. Third, it sets a precedent to build on later. If new features can be implemented outside the monolith then it can become policy that new features are always implemented outside the monolith. In this way we can effectively slow the monolith’s growth.

Implementing a new feature outside the monolith may require exposing existing capabilities of the monolith as a service. This is okay. As with most refactoring, we will sometimes have to make things worse before we can make them better. In this situation, implement a service in the monolith to expose the necessary functionality and use it from the new component.

Do not under any circumstances let the new component use the same database tables2 as the monolith. Doing so will only couple the two codebases in a way that makes both harder to maintain.

Existing seams

Once auth is a separate component and at least one new feature has been implemented outside the monolith, we can extract some existing functionality. The key is to understand the existing code base well enough to see seams that could be used to cleave off a bit of functionality. This functionality will likely be the best factored and designed part of the monolith. It is ironic that the easiest part to remove is the best part, but c’est la vie.

In most “mature programming environments” there will be multiple plausible candidates for extraction. It is important to have a list of candidates ready. Begin by taking every opportunity to tighten the encapsulation and reduce the coupling of extraction candidates. Every bit of this sort of refactoring done ahead of time improves the chances of successful extraction. Even if the section of code is never extracted, this work improves the maintainability of the code, so it is useful regardless.

Chance favors the prepared, as they say. Once you have the hit list, every feature request should be viewed as a potential extraction opportunity. When possible, expand the scope of feature requests to include component extraction before implementation of new functionality. Feature requests that allow for such expansion are rare, so don’t become discouraged if you don’t run across one right away. It will happen, but only if you are constantly vigilant.

Once you have built a consensus around extraction as part of a larger feature, begin by further tightening the encapsulation of the section to be extracted. Once the interface is solid, re-implement the feature in a new component. Once that component is functional, replace the implementation in the monolith with a client to the new component that provides the same interface as the original implementation.
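The final replacement step can be sketched like this. The service name and interface are hypothetical; the point is that callers inside the monolith never notice the swap.

```python
class InvoiceService:
    # Original in-process implementation inside the monolith (being retired).
    def total_for(self, customer_id):
        return 100  # stand-in for the old business logic and database queries

class InvoiceServiceClient:
    # Same interface as InvoiceService, but delegates to the new component
    # over the network. Existing callers in the monolith don't change at all.
    def __init__(self, http_get):
        self._http_get = http_get

    def total_for(self, customer_id):
        return self._http_get(f"/invoices/total?customer={customer_id}")

def fake_component(url):
    return 100  # stand-in for the extracted component's response

old = InvoiceService()
new = InvoiceServiceClient(fake_component)
print(old.total_for(7) == new.total_for(7))  # True: callers can't tell the difference
```

Because the client honors the original interface, the swap can ship behind a flag and be rolled back without touching any call sites.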

Basically, component extraction is just normal, every day incremental refactoring, but in the large.


With commitment, these few simple rules of thumb, and a lot of effort, any team can componentize a monolith. Get buy in from your entire team for the effort. Organize components around central data interactions to optimize performance and clarity. Extract authentication and authorization first because most other components will depend on it. Create a precedent around building new features outside of the monolith. Finally, extract an existing feature from the monolith. No monolith lasts forever.

  1. Capability based authorization is my preferred scheme, if it can be applied easily. In practice capability based authorization usually means generating unguessable URLs for individual resources. Simply knowing the URL implies that you are authorized to access the resource. Think of it like a door lock: if you have the key you get in, the lock doesn’t know anything about who should have access, or why. Such an approach reduces the inter-component communication and improves performance. This makes it a highly attractive option, however, it is not always practical.
  2. In some cases views can be used to define a maintainable interface between the monolith’s database and new components. Specifically, the monolith would define views in a new schema that was specifically for external components to access. These views need unit tests, etc, to ensure they are maintained as the monolith’s datamodel evolves. In general, this is more trouble than it is worth, and sets the wrong precedent to boot, but there are times when throughput, bandwidth, or latency concerns make it a reasonable choice.
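The capability based scheme from footnote 1 can be sketched in a few lines. This is only an illustration, assuming an HMAC over the resource path with a per-deployment secret; the host name and paths are invented.

```python
import hmac
import hashlib
import secrets

SECRET = secrets.token_bytes(32)  # per-deployment signing key

def capability_url(resource_path: str) -> str:
    # The URL embeds an unguessable MAC; knowing the URL implies authorization,
    # like holding the key to a door lock.
    tag = hmac.new(SECRET, resource_path.encode(), hashlib.sha256).hexdigest()
    return f"https://example.test{resource_path}?cap={tag}"

def authorized(resource_path: str, cap: str) -> bool:
    expected = hmac.new(SECRET, resource_path.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, cap)
```

Note how authorization requires no call to another component: the MAC check is local, which is exactly why this approach reduces inter-component communication.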

Services Questions

I recently had a colleague ask me several questions about service oriented architectures and breaking monoliths apart. This is an area in which i have a good deal of experience so i decided to publish my answers here.

What is a “service”?

A “service” is a discrete bit of functionality exposed via a well defined interface (usually a standardized format over a standardized network protocol) that can be utilized by clients that are unknown and/or unanticipated at the time of the service’s implementation. Due to the well defined interface, clients of a service do not need to understand how the service is implemented. This style of software architecture exists to overcome the diseconomies of scale suffered by large software systems.

How has the services landscape changed in the last 5-10 years?

In the mid-2000s it became clear that WS-*, the dominant service technology at the time, was dramatically over complicated and led to systems that were expensive to maintain. WS-*’s protocol independence combined with the RPC style adopted by most practitioners meant that clients generally ended up being very tightly coupled with a particular service implementation.

As WS-*’s deficiencies became clear, REST style architectures gained popularity. They reduce complexity and coupling by utilizing an application protocol such as HTTP, and often using simpler message formats such as JSON. The application protocol provides a uniform interface for all services which reduces the coupling between client and service.

Microservices are a relatively recent variant of service oriented architectures. As the name suggests, the main thrust of microservices is the size of the components that implement them. The services themselves could be message queue based or REST style APIs. The rise of devops, with its automation around deployment and operations, makes deploying a large number of very small components practical.

Message queue based architectures are experiencing a bit of a resurgence in recent years. Similar architectures were popular in the early 2000’s but were largely abandoned in favor of WS-*. Queue based architectures often provide throughput and latency advantages over REST style architectures at the expense of visibility and testability.

What do modern production services architectures look like?

It depends on the application’s goals. Systems that need high throughput, low latency and extreme scalability tend to be message queue based event driven architectures. Systems that are looking for ease of integration with external clients (such as those developed by third parties) tend to be resource oriented REST APIs with fixed URL structures and limited runtime discoverability. Systems that are seeking long term maintainability and evolvability tend to be hypermedia oriented REST APIs. Each style has certain strengths and weaknesses. It is important to pick the right one for the application.

How granular is a service?

I would distinguish a service from a component. Services should be small, encapsulating a single, discrete bit of functionality. If a client wants to perform an action, that action is an excellent candidate for being a service. A component, on the other hand, is a pile of code that implements one or more services. A component should be small enough that you could imagine re-writing it from scratch in a different technology. However, there is a certain fixed overhead for each component. Finding the balance between component size and the number of components is an important part of designing service architectures.

What is the process to start breaking down a large existing application?

Generally organizations start by extracting authentication and authorization. It is an area that is fairly well standardized (OAuth, SAML, etc) and also necessary to support the development of other services. Once authentication and authorization are extracted from the monolith, another bit of functionality is chosen for implementation outside of the monolith. This process is repeated until the monolith is modularized enough to meet the organizational goals. The features implemented outside of the monolith are often new functionality, but to really break apart a monolith you eventually have to target some of the existing features.

Starting small and proceeding incrementally are the keys to success. Small successes make it easier to build consensus around the approach. Once consensus is reached, existing customer facing features can be extracted more readily.

What are the organization / team structure impacts?

It is generally superior to construct a (rough, high level) design of the services needed and to form vertically integrated teams to implement business features across the set of components. Empowering all teams to create new components and, where appropriate, new services, increases the likelihood of success and shortens the time to implement a service oriented architecture.

Alternatively, component structures can follow the organizational structure (Conway’s law), resulting in one component per group in the organization. This approach can be practical in organizations where vertically integrated teams are politically unacceptable.

What are needed / helpful tools? To run services? To manage services?

  • Service discovery. Without it an inordinate amount of configuration is required for each component to interact with the ecosystem of services.
  • Instrumentation and monitoring. Without this it is impossible to detect and remediate issues that arise from the integration of an ecosystem of services.
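To make the service discovery point concrete, here is a deliberately minimal in-memory registry sketch. Real systems use tools such as Consul, etcd, ZooKeeper, or DNS-based discovery; the class and service names below are invented for illustration.

```python
import random

class ServiceRegistry:
    """Toy registry: components register their addresses under a service
    name, and clients look up a live instance instead of being configured
    with every address by hand."""
    def __init__(self):
        self._services = {}

    def register(self, name: str, address: str):
        self._services.setdefault(name, []).append(address)

    def lookup(self, name: str) -> str:
        instances = self._services.get(name)
        if not instances:
            raise LookupError(f"no instances of {name}")
        # Picking a random instance doubles as trivial client-side
        # load balancing.
        return random.choice(instances)
```

The point of the sketch is the shape of the interaction: without some registry like this, every component needs hand-maintained configuration for every service it talks to.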

How do companies document their services, interfaces and versions?

There are not any popular tools for creating documentation for hypermedia and message queue based APIs. General guidance on writing documentation can be found, but for the most part you are on your own.

For more resource oriented APIs, tools like Swagger and API Blueprint can be helpful.

How can we speed developer ramp up in a service architecture?

To speed developer ramp up it is important to have well maintained scripts to build a sandbox environment including all services. Preferably a single script that can deploy components to virtual machines or Docker containers on the developer’s machine. Additionally it is important to maintain searchable documentation about the API so that developers can find information about existing services.
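A single-command sandbox might be as simple as a Docker Compose file. The sketch below is illustrative only; the service names and images are invented, and a real setup would need volumes, databases, and so on.

```yaml
# Hypothetical docker-compose.yml for a developer sandbox.
# `docker compose up` brings up the whole ecosystem locally.
services:
  auth:
    image: example/auth-service:latest
    ports: ["8081:8080"]
  invoices:
    image: example/invoice-service:latest
    environment:
      AUTH_URL: http://auth:8080   # service discovery via compose DNS
  monolith:
    image: example/monolith:latest
    depends_on: [auth, invoices]
    ports: ["8080:8080"]
```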

How to deploy new versions of the ecosystem and its components?

Envisioning an ecosystem of services as a single deployable unit is contrary to the goals of service oriented architectures. Rather each component should be versioned and deployed independently. This allows the components to change, organically, as needed. This approach increases costs on the marketing, product management and configuration management fronts. The benefits gained on the development side by avoiding the diseconomies of scale are worth it.

How to manage API compatibility changes?

A service oriented architecture makes breaking API changes more damaging. It is often difficult to know all the clients using a particular API. The more successful an API, the harder, and more time consuming, it is to find and fix all clients. This difficulty leads to the conclusion that all changes should be made in a backwards compatible way. When breaking changes are unavoidable (which is rare) server-driven content negotiation can enable components to fulfill requests for both the old API and the new one. Once all clients are transitioned to the new API the old API can be removed. Analytics on the old API can help identify clients that still need to transition and determine when the old API is no longer needed.
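The content negotiation trick can be sketched with versioned media types. This is one common approach, not the only one; the `vnd.acme` media types and the handler logic are invented for this example.

```python
# Each API version gets its own media type; the server picks a
# representation based on what the client's Accept header asks for.
HANDLERS = {
    "application/vnd.acme.order.v1+json":
        lambda order: {"id": order["id"]},                          # old shape
    "application/vnd.acme.order.v2+json":
        lambda order: {"id": order["id"], "status": order["status"]},  # new shape
}
DEFAULT = "application/vnd.acme.order.v2+json"

def negotiate(accept_header: str):
    # Naive matching: serve the first acceptable media type we know,
    # falling back to the current version. A real server would honor
    # q-values and wildcards per RFC 7231.
    for media_type in (t.strip().split(";")[0] for t in accept_header.split(",")):
        if media_type in HANDLERS:
            return media_type, HANDLERS[media_type]
    return DEFAULT, HANDLERS[DEFAULT]
```

Old clients keep sending the v1 media type and keep working; counting how often the v1 handler fires is exactly the analytics that tells you when it is safe to delete.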

How to version the ecosystem and its components?

This is a marketing and project management problem more than anything. The ecosystem will not have a single version number. However, when a certain set of business meaningful features have been completed it is convenient for marketing to declare a new “version”. Such declarations are of little consequence on the development side so they should be done whenever desirable.