Continuous Delivery Is a Journey – Part 1

Last year my colleagues and I had the pleasure to spend 2 days with @hamvocke and @diegopeleteiro from @thoughtworks reviewing the platform we created. One essential part of our discussions was about CI/CD described like this: “think about continuous delivery as a journey. Imagine every git push lands on production. This is your target, this is what your CD should enable.”

Even if (or maybe because) this thought scared the hell out of us, it became our vision for the next few months because we saw great opportunities we would gain if we would be able to work this way.

Let me describe the context we were working:

  • Four business teams, 100% self-organized, owning 1…n Self-contained Systems, creating microservices running as Docker containers orchestrated with Kubernetes, hosted on AWS.
  • Boundaries (as in Domain Driven Design) defined based on the business we were in.
  • Each team having full ownership and full accountability for their part of business (represented by the SCS).
  • Basic heuristics regarding source code organisation: “share nothing” about business logic, “share everything” about utility functions (in OSS manner), about experiences you made, about the lessons you learned, about the errors you made.
  • Ensuring the code quality and the software quality is 100% team responsibility.
  • You build it, you run it.
  • One Platform-as-a-service team to enable this business teams to deliver features fast.
  • Gitlab as VS, Jenkins as build server, Nexus as package repository
  • Trunk-based development, no cherry picking, “roll fast forward” over roll back.
Teams
4 Business Teams + 1 Platform-as-a-Service Team = One Product

The architecture we have chosen was meant to support our organisation: independent teams able to work and deliver features fast and independently. They should decide themselves when and what they deploy. In order to achieve this we defined a few rules regarding inter-system communication. The most important ones are:

  • Event-driven Architecture: no synchronous communication only asynchronous via the Domain Event Bus
  • Non-blocking systems: every SCS must remain (reduced) functional even if all the other systems are down

We had only a couple of exceptions for these rules. As an example: authentication doesn’t really make sense in asynchronous manner.

Working in self-organized, independent teams is a really cool thing. But

with great power there must also come great responsibility

Uncle Ben to his nephew

Even though we set some guards regarding the overall architecture, the teams still had the ownership for the internal architecture decisions. As at the beginning we didn’t have continuous delivery in place every team was alone responsible for deploying his systems. Due the missing automation we were not only predestined to make human errors but we were also blind for the couplings between our services. (And we spent of course a lot of time doing stuff manually instead of letting Jenkins or Gitlab or some other tool doing this stuff for us 🤔 )

One example: every one of our systems had at least one React App and a GraphQL API as the main communication (read/write/subscribe) channel. One of the best things about GraphQL is the possibility to include the GraphQL-schema in the react App and this way having the API Interface definition included in the client application.

Is this not cool? It can be. Or it can lead to some very smelly behavior, to a real tight coupling and to inability to deploy the App and the API independently. And just like my friend @etiennedi says: “If two services cannot be deployed independently they aren’t two services!”

This was the first lesson we have learned on this journey: If you don’t have a CD pipeline you will most probably hide the flaws of your design.

One can surely ask “what is the problem with manual deployment?” – nothing, if you have only a few services to handle, if every one in your team knows about these couplings and dependencies and is able to execute the very precise deployment steps to minimize the downtime. But otherwise? This method doesn’t scale, this method is not very professional – and the biggest problem: this method ignores the possibilities offered by Kubernetes to safely roll out, take down, or scale everything what you have built.

Having an automated, standardized CD pipeline as described at the beginning – with the goal that every commit will land on production in a few seconds – having this in place forces everyone to think about the consequences of his/hers commit, to write backwards compatible code, to become a more considered developer.

to be continued …

Base your decisions on heuristics and not on gut feeling

As a developer we tackle very often problems which can be solved in various ways. It is ok not to know how to solve a problem. The real question is: how to decide which way to go 😯

In this situations often I rather have a feeling as a concrete logical reason for my decisions. This gut feelings are in most cases correct – but this fact doesn’t help me if I want to discuss it with others. It is not enough to KNOW something. If you are not a nerd from the 80’s (working alone in a den) it is crucial to be able to formulate and explain and share your thoughts leading to those decisions.

Finally I found a solution for this problem as I saw the session of Mathias Verraes about Design Heuristics held by the KanDDDinsky.

The biggest take away seems to be a no-brainer but it makes a huge difference: formulate and visualize your heuristics so that you can talk about concrete ideas instead of having to memorize everything what was said – or what you think it was said.

Using this methodology …

  • … unfounded opinions like “I think this is good and this is bad” won’t be discussed. The question is, why is something good or bad.
  • … loop backs to the same subjects are avoided (to something already discussed)
  • … the participants can see all criteria at once
  • … the participants can weight the heuristics and so to find the probably best solution

What is necessary for this method? Actually nothing but a whiteboard and/or some stickies. And maybe to take some time beforehand to list your design heuristics. These are mine (for now):

  • Is this a solution for my problem?
  • Do I have to build it or can I buy it?
  • Can it be rolled out without breaking neither my features as everything else out of my control?
  • Breaks any architecture rules, any clean code rules? Do I have a valid reason to break these rules?
  • Can lead to security leaks?
  • Is it over engineered?
  • Is it much to simple, does it feel like a short cut?
  • If it is a short cut, can be corrected in the near future without having to throw away everything? = Is my short cut implemented driving my code in the right direction, but in more shallow way?
  • Does this solution introduce a new stack = a new unknown complexity?
  • Is it fast enough (for now and the near future)?
  • … to be continued πŸ™‚

The video for the talk can be found here. It was a workshop disguised as a talk (thanks again Mathias!!), we could have have continued for another hour if it weren’t for the cold beer waiting πŸ™‚

Event Storming with Specifications by Example

Event Storming is a technique defined and refined by Alberto Brandolini (@ziobrando). I fully agree the statement about this method, Event Storming is for now “The smartest approach to collaboration beyond silo boundaries”

I don’t want to explain what Event Storming is, the concept is present in the IT world for a few years already and there are a lot of articles or videos explaining the basics. What I want to emphasize is WHY do we need to learn and apply this technique:

The knowledge of the product experts may differ from the assumption of the developers
KanDDDinsky 2018 – Kenny Baas-Schwegler

On the 18-19.10.2018 I had the opportunity to not only hear a great talk about Event Storming but also to be part of a 2 hours long hands-on session, all this powered by Kandddinsky (for me the best conference I visited this year) and by @kenny_baas (and @use case driven and @brunoboucard). In the last few years I participated on a few Event Storming sessions, mostly on community events, twice at cleverbridge but this time it was different. Maybe ES is like Unit Testing, you have to exercise and reflect about what went well and what must be improved. Anyway this time I learned and observed a few rules and principles new for me and their effects on the outcome. This is what I want to share here.

  1. You need a facilitator.
    Both ES sessions I was part at cleverbridge have ended with frustration. All participants were willing to try it out but we had nobody to keep the chaos under control. Because as Kenny said “There will be chaos, this is guaranteed.” But this is OK, we – devs, product owners, sales people, etc. – have to learn fast to understand each other without learning the job of the “other party” or writing a glossary (I tried that already and didn’t helped 😐 ). Also we need somebody being able to feel and steer the dynamics in the room.


    The tweets were written during a discussion about who could be a good facilitator. You can read the whole thread on Twitter if you like. Another good article summarizing the first impressions of @mathiasverraes as facilitator is this one.

  2. Explain AND visualize the rules beforehand.
    I skip for now the basics like the necessity of a very long free wall and that the events should visualize the business process evolving in time.
    This are the additional rules I learned in the hands-on session:

      1. no dev-talk! The developer is per se a species able to transform EVERYTHING in patterns and techniques and tables and columns and this ability is not helpful if one wants to know if we can solve a problem together. By using dev-speech the discussion will be driven to the technical “solvability” based on the current technical constraints like architecture. With ES we want to create or deepen our ubiquitous language , and this surely not includes the word “Message Bus”  πŸ˜‰
      2. Every discussion should happen on the board. There will be a lot of discussions and we tend to talk a lot about opinions and feelings. This won’t happen if we keep discussing about the business processes and events which are visualized in front of us – on the board.
      3. No discussions regarding persons not in the room. Discussing about what we think other people would mind are not productive and cannot lead to real results. Do not waste time with it, time is too short anyway.
      4. Open questions occurring during the storming should not be discussed (see the point above) but marked prominently with a red sticky. Do not waste time
      5. Do not discuss about everything, look for the money! The most important goal is to generate benefit and not to create the most beautiful design!

Tips for the Storming:

  • “one person, one sharpie, one set of stickies”: everybody has important things to say, nobody should stay away from the board and the discussions.
  • start with describing the business process, business rules, eventual consistent business decisions aka policies, other constraints you – or the product owner whom the business “belongs” – would like to model, and write the most important information somewhere visible for everybody.
  • explain how ES works: every business relevant event should be placed on a time line and should be formulated in the past tense. Business relevant is everything somebody (Kibana is not a person, sorry πŸ˜‰ ) would like know about.
  • explain the rules and the legend (you need a color legend to be able to read the results later).
  • give the participants time (we had 15 minutes) to write every business event they think it is important to know about on orange stickies. Also write the business rules (the wide dark red ones) and the product decisions (the wide pink ones) on stickies and put them there where they are applied. The rules before the event, the policies after one event happened.
  • start putting the stickies on the wall, throw away the duplicates, discuss and maybe reformulate the rest. After you are done try to tell the story based on what you can read on the wand. After this read the stickies from the end to the start. With these methods you should be able to discover if you have gaps or used wrong assumptions by modelling the process you wanted to describe.
  • mark known processes (like “manual process”) with the same stickies as the policies and do not waste time discussing it further.
  • start to discuss the open questions. Almost always there are different ways to answer this questions and if you cannot decide in a few seconds than postpone it. But as default: decide to create the event and measure how often happens so that later on you can make the right decision!
    Event Storming – measure now, decide later


    Another good article for this topic is this one from @thinkb4coding

At this point we could have continued with the process to find aggregates and bounded contexts but we didn’t. Instead we switched the methodology to Specifications by Example – in my opinion a really good idea!

Specifications
Event Storming enhanced with Specifications by Example

We prioritized the rules and policies and for the most important ones we defined examples – just like we are doing it if we discuss a feature and try to find the algorithm.

Example: in our ticket reservation business we had a rule saying “no overbooking, one ticket per seat”. In order to find the algorithm we defined different examples:

  • 4 tickets should be reserved and there are 5 tickets left
  • 4 tickets should be reserved and there are 3 tickets left
  • 4 tickets should be reserved and all tickets are already reserved.

With this last step we can verify if our ideas and assumptions will work out and we can gain even more insights about the business rules and business policies we defined – and all this not as developer writing if-else blocks but together with the other stake holders. At the same time the non-techie people would understand in the future what impact these rules and decisions have on the product we build together. The side-effect having the specifications already defined is also a great benefit as these are the acceptance tests which will be built by the developer and read and used by the product owner.

More about the example and the results can you read on the blog of Kenny Baas-Schwegler.

I hope I covered everything and have succeeded to reproduce the most important learning of the 2 days ( I tend to oversee things thinking “it is obvious”). If not: feel free to ask, I will be happy to answer πŸ™‚

Happy Storming!

Update: we had our first event storming and it was good!

Unfortunately we didn’t get to define the examples (not enough time). Most of the rules described above were accepted really well (explain the rules, create a legend for the stickies, flag everything out of scope as Open Question). Where I as facilitator need more training is by keeping the discussion ON the board and not beside. I have also a few new takeaways:

  • the PO describes his feature and gives answers, but he doesn’t write stickies. The main goal ist to share his vision. This means, he should test us if we understood the same vision. As bonus he should complete his understanding of the feature trough the questions which appear during the storming.
  • one color means one action/meaning. We had policies and processes on the same red stickies and this was misleading.
  • if you have a really complex domain
    (like e-commerce for SaaS products in our case) or really complex features start with one happy path example. Define this example and create the event “stream” with this example. At the end you should still add the other, not so happy-path examples.