Continuous Delivery Is a Journey – Part 3

In the first part I described why I think that continuous delivery is important for an adequate developer experience, and in the second part I drew a rough picture of how we implemented it in a product development with five teams. Now it is time to discuss the biggest impact – and the biggest benefits – regarding the development of the product itself.

Why do more and more companies, technical and non-technical people, want to change towards an agile organisation? Maybe because the decision makers have understood that waterfall is rarely fit for purpose? There are a lot of motives – besides the rather poor one "because everybody else does it" – and I think there are two intertwined reasons: the speed at which the digital world changes and the ever-increasing complexity of the businesses we try to automate.

Companies and people have finally started to accept that they don't know what their customers need. They have started to feel that the customer – and the market – has become more and more demanding regarding the quality of the solutions they get. This means that until Skynet is born (sorry, I couldn't resist 😁) we software developers, product owners, UX designers, etc. have to decide which solution is best to solve the problems of that specific business, and we have to decide fast.

We have to deliver fast, get feedback fast, and learn and adapt even faster. We have to do all this without downtime, without breaking existing features and – very important for most of us – without getting a heart attack every time we deploy to production.

IMHO these are the most important reasons why every product development team should invest in CI/CD.

The last missing piece of the jigsaw, the one which allows us to deliver features fast (and continuously) without disturbing anybody and without losing control over how and when features are released, is called the feature toggle.

A feature toggle[1] (also feature switch, feature flag, feature flipper, conditional feature, etc.) is a technique in software development that attempts to provide an alternative to maintaining multiple source-code branches (known as feature branches), such that a feature can be tested even before it is completed and ready for release. Feature toggle is used to hide, enable or disable the feature during run time. For example, during the development process, a developer can enable the feature for testing and disable it for other users.[2]

Wikipedia

The concept is really simple: a feature stays hidden until somebody or something decides that it is allowed to be used.

// Hide or show the entry point of a feature depending on its toggle state.
// config is assumed to be the application's feature configuration service.
function useNewFeature(featureId) {
  const element = document.getElementById(featureId);
  const feature = config.getFeature(featureId);
  element.style.display = feature.isEnabled ? 'block' : 'none';
}

As you can see, implementing a feature toggle is really that simple. Adopting the concept will need some effort though:

  • Strive for only one toggle (one if) per feature. At the beginning it will be hard or even impossible to achieve this, but it is very important to define it as a mid-term goal. Having only one toggle per feature means the code is highly decoupled and very well structured.
  • Place this (main) toggle at the entry point (a button, a new form, a new API endpoint), the first interaction point with the user (person or machine). In the disabled state it should hide this entry point.
  • The enabled state of the toggle should lead to new services (in a microservice world), new arguments or new functions, all of them implementing the behavior for feature.enabled == true. This will lead to code duplication: yes, and that is totally OK. I look at it as a very careful refactoring without changing the initial code. Implementing a new feature should not break or eliminate existing features. The tests (all kinds of them) should be organized similarly: in different files, as duplicated versions, implemented for each state – see the sketch after this list.
the different states of the toggle lead to clearly separated paths
  • Through the toggle you gain real freedom to make mistakes or to just build the wrong feature. At the same time you can always enable the feature and show it to the product owner or the stakeholders. This means the feedback loop is reduced to a minimum.
  • This freedom has a price of course: after the feature is implemented, the feedback is collected and the decision to enable the feature has been made, the source code must be cleaned up – all code for feature.enabled == false must be removed. This is why it is so important to create the different paths in such a way that the risk of introducing a bug is virtually zero. We want to reduce workload, not increase it.
  • Toggles don’t have to be temporary; business toggles (e.g. premium features or a “maintenance mode”) can stay forever. It is important to define beforehand what kind of toggle is needed, because business toggles will always remain part of your source code. The default value for this kind of toggle should be false.
  • The default value for temporary toggles should be true: activated during development and deactivated on production.
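
A minimal sketch of what the “clearly separated paths” can look like – the names config, renderNewCheckout and renderOldCheckout are hypothetical, the point is only that the main toggle routes into one of two independent implementations:

// Minimal sketch: one main toggle per feature, routing into clearly separated paths.
// config, renderNewCheckout and renderOldCheckout are hypothetical placeholders.
function renderCheckout() {
  const feature = config.getFeature('new-checkout');
  if (feature.isEnabled) {
    // feature.enabled == true: the new, duplicated implementation
    return renderNewCheckout();
  }
  // feature.enabled == false: the existing implementation stays untouched
  return renderOldCheckout();
}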

One piece of advice regarding tooling: start small; a config map in Kubernetes, a database table or a JSON file somewhere will suffice. Later on new requirements will appear, like notifying the client UI when a toggle changes or allowing the product owner to decide when a feature will be enabled. That will be the moment to think about the next steps, but for now it is more important to adopt this workflow, adopt the mindset and discipline to keep the source code clean, learn the techniques for organizing the code base and ENJOY HAVING CONTROL over the impact of deployments, feature decisions and stress!
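
For the start, a toggle “store” can be as simple as the following sketch, assuming a hypothetical JSON file named feature-toggles.json with entries like { "new-checkout": true }:

const fs = require('fs');

// Read the toggle states from a hypothetical JSON file, e.g. { "new-checkout": true }.
function loadToggles(path = 'feature-toggles.json') {
  return JSON.parse(fs.readFileSync(path, 'utf8'));
}

// A toggle that is not configured counts as disabled here;
// adjust the defaults per environment as described in the list above.
function isEnabled(featureId, toggles = loadToggles()) {
  return toggles[featureId] === true;
}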

That’s it, I have shared all of my thoughts regarding this subject: your journey of delivering continuously can start (or be continued 😉) now.

p.s. It is time for the one sentence about feature branches:
Feature toggles will never work with feature branches. Period. This means you have to decide: move to trunk-based development or forget continuous delivery.

p.p.s. For most languages there are feature toggle libraries, frameworks, even platforms; it is not necessary to write a new one. There are libraries for different levels of complexity in how the state can be calculated (like account state, persons, roles, time settings) – just pick one.

Update:

As pointed out by Gergely on Twitter, Martin Fowler’s blog has a very good article describing extensively the different kinds of feature toggles and the power of this technique: Feature Toggles (aka Feature Flags)

Continuous Delivery Is a Journey – Part 2

After describing the context a little bit in part one, it is time to look at the individual steps the source code must pass through in order to be delivered to the customers. (I’m sorry, but it is quite a long part 🙄)

The very first step starts with pushing all the current commits to master (if you work with feature branches you will probably encounter a new level of self-made complexity which I don’t intend to discuss here).

This action triggers the first checks and quality gates like licence validation and unit tests. If all checks are “green” the new version of the software will be saved to the repository manager and will be tagged as “latest”.

Successful push leads to a new version of my service/pkg/docker image

At this moment continuous integration is done, but the features are far from being used by any customer. I have the first feedback that I didn’t break any tests or other basic constraints, but that’s all – nobody can use the features because they are not deployed anywhere yet.

Well, let Jenkins execute the next step: deployment to the Kubernetes environment called integration (a.k.a. development).

Continuous delivery to the first environment including the execution of first acceptance tests

At this moment all my changes are tested to see whether they work together with the features my colleagues have already integrated and whether the new features are evolving in the right direction (or are done and ready for acceptance).

This is not bad, but what if I want to be sure that I didn’t break the “platform”, what if I don’t want to disturb everybody else working on the same product because I made some mistakes – but I still want to be human and therefore allowed to make mistakes 😉? This means that the behavioral and structural changes introduced by my commits should be tested before they land on integration.

This must obviously be a different set of tests. They should verify that the whole system (composed of a few microservices, each with its own data persistence, and one or more UI apps) works as expected, is resilient, is secure, etc.

At this point the power of Kubernetes (k8s) and ksonnet came as a huge help. Having k8s in place (and having the infrastructure as code), it is almost a no-brainer to set up a new environment, wire up the individual systems in isolation and execute the system tests against it. This needs not only the k8s part as code but also the resources deployed and running on it. With ksonnet, every service, deployment, ingress configuration (which manages external access to the services in a cluster) or config map can be defined and configured as code. ksonnet not only supports deploying to different environments but also offers the possibility to compare them. ksonnet is not the only tool offering these capabilities; it is important to choose the one that fits, and it is even more important to invest the time and effort to configure everything as code. This is a must-have in order to achieve real automation and continuous deployment!

https://twitter.com/YellowBrickC/status/1105081259934605312
Good developer experience also means simplified continuous deployment

I will not include any ksonnet examples here; the project has great documentation. What is important to realize is the opportunity offered by such an approach: if everything is code, then every change can be checked in. Everything checked in can be observed/monitored, can trigger pipelines and/or events, can be reverted, can be commented on and – the feature that helped us in our solution – can be tagged.

What happens in continuous delivery? A change in the VCS triggers the pipeline, the fitting version of the source code is loaded (either as source code like ksonnet files, or as a package or Docker image), the configured quality gate checks are verified (the runtime environment is wired up, the specs with the referenced version are executed) and in case of success the artifact is tagged as “thumbs up” and promoted to the next environment. We started to do this manually to gather enough experience to automate the process.

Deploy manually the latest resources from integration to the review stage
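
To illustrate the flow described above, here is a heavily simplified sketch of that promotion step; runQualityGates, tagArtifact and deployTo are hypothetical placeholders, not a real API:

// Heavily simplified sketch of "verify the quality gates, then tag and promote".
// runQualityGates, tagArtifact and deployTo are hypothetical placeholders.
async function promote(artifact, fromEnv, toEnv) {
  // wire up the runtime environment and execute the specs for this version
  const gatesPassed = await runQualityGates(artifact, fromEnv);
  if (!gatesPassed) {
    throw new Error(`quality gates failed for ${artifact.version} on ${fromEnv}`);
  }
  await tagArtifact(artifact, 'thumbs up'); // mark this version as promotable
  await deployTo(toEnv, artifact);          // e.g. integration -> review
}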

If you have all this working, you have finished the part with the biggest effort. Now it is time to automate and generalize the individual steps. After continuous integration the only changes occur in the ksonnet repo (all other source code changes have been made before), which I call the deployment repo here.

Roll out, test and eventually roll back the system ready for review

I think this post is already too long. In the next part (I think it will be the last one) I would like to write about the last essential method – how to deploy to production without annoying anybody (no secret here, this is what feature toggles were invented for 😉) – and about some open questions and decisions we encountered on our journey.

Every graphic was created with PlantUML – thank you very much!

to be continued …

Continuous Delivery Is a Journey – Part 1

Last year my colleagues and I had the pleasure of spending two days with @hamvocke and @diegopeleteiro from @thoughtworks reviewing the platform we had created. One essential part of our discussions was about CI/CD, described like this: “think about continuous delivery as a journey. Imagine every git push lands on production. This is your target, this is what your CD should enable.”

Even if (or maybe because) this thought scared the hell out of us, it became our vision for the next few months, because we saw the great opportunities we would gain if we were able to work this way.

Let me describe the context we were working in:

  • Four business teams, 100% self-organized, owning 1…n Self-contained Systems, creating microservices running as Docker containers orchestrated with Kubernetes, hosted on AWS.
  • Boundaries (as in Domain Driven Design) defined based on the business we were in.
  • Each team having full ownership and full accountability for their part of business (represented by the SCS).
  • Basic heuristics regarding source code organisation: “share nothing” about business logic, “share everything” about utility functions (in OSS manner), about experiences you made, about the lessons you learned, about the errors you made.
  • Ensuring the code quality and the software quality is 100% team responsibility.
  • You build it, you run it.
  • One Platform-as-a-Service team to enable these business teams to deliver features fast.
  • GitLab as VCS, Jenkins as build server, Nexus as package repository
  • Trunk-based development, no cherry picking, “roll fast forward” over roll back.
Teams
4 Business Teams + 1 Platform-as-a-Service Team = One Product

The architecture we chose was meant to support our organisation: independent teams able to work and deliver features fast and independently. They should decide themselves when and what they deploy. In order to achieve this we defined a few rules regarding inter-system communication. The most important ones are:

  • Event-driven architecture: no synchronous communication, only asynchronous communication via the Domain Event Bus
  • Non-blocking systems: every SCS must remain (reduced) functional even if all the other systems are down

We had only a couple of exceptions to these rules. For example, authentication doesn’t really make sense in an asynchronous manner.
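
To give an impression of what these two rules mean in code, here is a minimal sketch; eventBus, ordersDb and the event names are hypothetical placeholders, not our real API:

// Rule 1: communicate only asynchronously via the Domain Event Bus.
async function placeOrder(order) {
  await ordersDb.insert(order);             // the SCS owns its own persistence
  eventBus.publish('order-placed', order);  // fire-and-forget domain event
}

// Rule 2: stay (reduced) functional even if other systems are down –
// consume their events and keep a local replica instead of calling them directly.
eventBus.subscribe('customer-updated', async (customer) => {
  await ordersDb.upsertCustomerReplica(customer);
});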

Working in self-organized, independent teams is a really cool thing. But

with great power there must also come great responsibility

Uncle Ben to his nephew

Even though we set some guardrails regarding the overall architecture, the teams still had ownership of the internal architecture decisions. Since at the beginning we didn’t have continuous delivery in place, every team was solely responsible for deploying its systems. Due to the missing automation we were not only predestined to make human errors, we were also blind to the couplings between our services. (And of course we spent a lot of time doing stuff manually instead of letting Jenkins or GitLab or some other tool do this stuff for us 🤔)

One example: every one of our systems had at least one React app and a GraphQL API as the main communication (read/write/subscribe) channel. One of the best things about GraphQL is the possibility to include the GraphQL schema in the React app and this way have the API interface definition included in the client application.

Is this not cool? It can be. Or it can lead to some very smelly behavior, to really tight coupling and to the inability to deploy the app and the API independently. And as my friend @etiennedi says: “If two services cannot be deployed independently they aren’t two services!”

This was the first lesson we learned on this journey: if you don’t have a CD pipeline, you will most probably hide the flaws of your design.

One can surely ask “what is the problem with manual deployment?” – nothing, if you only have a few services to handle, if everyone in your team knows about these couplings and dependencies and is able to execute the very precise deployment steps to minimize downtime. But otherwise? This method doesn’t scale, this method is not very professional – and the biggest problem: it ignores the possibilities offered by Kubernetes to safely roll out, take down or scale everything you have built.

Having an automated, standardized CD pipeline as described at the beginning – with the goal that every commit lands on production within a few seconds – forces everyone to think about the consequences of their commits, to write backwards-compatible code and to become a more thoughtful developer.

to be continued …