Here, I am going to discuss some “Release Safety Practices”. You need to make sure that the CI/CD pipeline you use to deploy changes is designed and set up in a way that gives your team the control it needs to succeed every time.
The determinism of an automated pipeline
Here I want to refer to the high-level qualities of an automated pipeline. To provide maximum value, the automation that deploys or releases a software package must be predictable, reproducible, and consistent. We expect these three words to become part of our culture as we progress on our DevOps journey.
Why does this help?
We all know that the weakest link in a release process is the human. Carrying out the steps of a CI/CD pipeline successfully requires a lot of attention to detail, which is something humans are pretty bad at when the steps are done manually. We rely on scripts, workflows, and programs to automate those steps and achieve our goals. However, care must be taken in defining that automation to make sure that the final outcome meets expectations every single time.
To help us design and implement automation, here are three important aspects that must be taken into consideration during the design, coding, and testing phases.
Predictability
- It defines the degree to which we can predict the final outcome of the automation.
- Scripts and code must detect and handle error conditions, performing retries or alternate logic to generate the expected result (e.g., properly managing request-limit errors from the AWS APIs).
- Predictability is closely related to complexity, as stated in the definition of predictability on Wikipedia: “Limitations on predictability could be caused by factors such as a lack of information or excessive complexity.”
Reproducibility
- Defined as the ability to always provide the same outcome when executed multiple times.
- We need our automation to generate the same result every time, not most of the time.
- It is a key characteristic of one of our core values: “Reproducibility and replicability together are among the main beliefs of the scientific method” (from Reproducibility on Wikipedia).
Consistency
- Consistency refers to the state of the cloud resources and pipelines at each and every step of the automation (from provisioning to deployment, to release).
- To be consistent, a pipeline must put an environment in a single possible state. Idempotence is important to ensure that automation will always bring your resources to that single end state.
- It is tightly related to predictability, but it focuses on the resources we can manage or control.
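The bullet above about detecting error conditions and retrying can be sketched as follows. This is a minimal illustration, not a reference implementation: `ThrottlingError` and `call_with_retries` are hypothetical names standing in for whatever rate-limit error your cloud SDK raises and for your own wrapper. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a cloud API rate-limit error."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on ThrottlingError with exponential backoff.

    Returns the operation's result. If every attempt is throttled, the last
    ThrottlingError is re-raised so the caller never sees a silent failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottlingError:
            if attempt == max_attempts:
                raise
            # Jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Only the error that is known to be transient is retried; any other exception propagates immediately, which keeps the outcome predictable.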
This guideline can be widely applied to any type of script, infrastructure-as-code template, or workflow built to automate something. As we increase the amount of automation, we must think about predictability, reproducibility, and consistency to make sure we are raising the bar and increasing the maturity of our automation. In addition, post-incident action items should always target the faulty process or automation in order to improve it. We must have faith in our automation and be confident that when we perform a task, the end result is good and we can move on to something else without fear of causing an incident that would result in revenue loss.
How can you implement determinism in an automated pipeline?
This guideline is not something to be implemented as is. It is more of a thought process, a strategic analysis to be performed when putting automation in place. The key point is to keep it in mind at each phase of the automation lifecycle:
- At design: Think about the resources involved in the automation and the interactions with other components of your system.
- At development/test: Identify error scenarios, to improve the robustness/resilience of the automation.
- During operation: While running in production, keep visibility on logs and metrics to learn about edge cases and exceptional situations, and feed them into your continuous improvement funnel.
Another important point, with regard to predictability and consistency, is that the outcome of any automation must be binary: either it fully succeeded (with full consistency) or it failed with an error message. Partial success gives a false impression of predictability and can have a negative impact on customers and suppliers.
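One way to get a binary outcome is to pair every step with an undo action and unwind on failure. The sketch below assumes each step can be expressed as a `(name, do, undo)` tuple; `run_all_or_nothing` and `AutomationFailed` are hypothetical names used for illustration, and the sketch does not handle the case where an undo action itself fails.

```python
class AutomationFailed(Exception):
    """Raised when the run did not fully succeed."""


def run_all_or_nothing(steps):
    """Execute (name, do, undo) steps in order.

    If a step fails, undo the already completed steps in reverse order and
    raise AutomationFailed naming the failing step. There is no partial success.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            for _, _, undo_done in reversed(completed):
                undo_done()
            raise AutomationFailed(f"step '{name}' failed: {exc}") from exc
        completed.append((name, do, undo))
```

The caller sees either a normal return (everything applied) or an exception with a message (nothing left applied).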
Case Studies and References
Here are some examples that highlight when and how those qualities are important.
Launching a new instance
The cloud environment provides us the ability to scale out services on demand. We leverage this compute elasticity concept through auto-scaling groups that allow us to automatically launch new instances of our services to increase capacity. However, care must be taken to make sure that each new launch will result in a new service instance running and servicing traffic.
- Predictability: The new instance must complete the launch with the expected outcome that the service will be running and healthy.
- Reproducibility: Each scale-out activity must reach the same outcome: a running service that serves traffic. To that end, it is important to minimize external dependencies such as git or yum repositories and, to some extent, S3 buckets, so that an external failure cannot affect an instance launch.
- Consistency: Of course, spinning up new service instances is not only about serving traffic: observability is also important. Each new instance must be in the expected state with regard to everything it relies on. That includes the AWS resources related to its configuration, like security groups and subnet settings, and the software components installed along with the service, like Splunk or haystack agents, without forgetting proper CloudWatch alert thresholds and PagerDuty webhook configuration.
- The acceptance criteria for launching a new service instance is not limited to the successful processing of client traffic. It must also include other aspects like monitoring, alerting and audit.
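The acceptance criteria above can be expressed as a set of named checks that must all pass before an instance is considered launched. This is a sketch under assumptions: `evaluate_launch` and `accept_instance` are hypothetical helpers, and the checks themselves (service health, agents running, alerts configured) are left as callables you would supply.

```python
def evaluate_launch(checks):
    """Run every named check and return the names of those that failed.

    `checks` maps a check name to a zero-argument callable returning True
    when the criterion is met. A check that raises counts as a failure.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def accept_instance(checks):
    """Return True only if every acceptance criterion passes; raise otherwise."""
    failed = evaluate_launch(checks)
    if failed:
        raise RuntimeError("instance rejected, failed checks: " + ", ".join(failed))
    return True
```

All checks are evaluated even after the first failure, so the error message lists everything that is missing instead of only the first problem.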
Creating AWS resources
With the practice of infrastructure as code (IaC), we build trust and confidence in our automation. However, it can take some time for a given piece of automation to be mature enough to deliver the outcome needed. A common mechanism to implement IaC is to use tools like AWS CloudFormation or Terraform by HashiCorp, which provide imperative (scripts) and declarative (templates) tools to create/update all your cloud resources.
- Predictability: The cloud is an ephemeral environment and your automation can fail for multiple reasons. Always put validation in your scripts that the expected resource was really created/updated and is available (avoid blind assumptions).
- Reproducibility: The templates or scripts should be able to support multiple environments and any region particularities (different instance types, different capacities to provision, different configuration, …).
- Consistency: The hardest part of IaC is managing and keeping the state of your cloud resources as your needs evolve (we often forget that our infrastructure evolves). Always update resources through the same IaC tool to ensure predictability and reproducibility, and to save yourself headaches later on. IaC must be idempotent: regardless of the initial state of your environment, the outcome is always the same.
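The idempotence requirement can be illustrated with a toy model in which the environment is a dictionary of resource names to configurations. This is not how CloudFormation or Terraform work internally; `plan_changes` and `apply_changes` are hypothetical functions that only show the property: whatever the starting state, applying the desired state yields the same end state, and applying it again changes nothing.

```python
def plan_changes(current, desired):
    """Compute the actions needed to turn `current` into `desired`.

    Both arguments map a resource name to its configuration.
    Returns a list of (action, name) tuples in a stable order.
    """
    actions = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_changes(current, desired):
    """Return the new state after applying the plan; `current` is not mutated."""
    new_state = dict(current)
    for action, name in plan_changes(current, desired):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state
```

An empty plan on the second run is a cheap signal that the automation has converged.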
A concrete example would be the creation of a MongoDB or Cassandra cluster. Once it has completed successfully, we want:
- The expected number of EC2 instances to be running and healthy.
- The software package (MongoDB or Cassandra) properly configured and running without error, listening on the port for communication.
- The cluster of nodes properly communicating with each other.
Having automation that validates all those aspects, each and every time, will give confidence in the automation.
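Part of that validation can be sketched with the standard library alone: polling a condition until it holds, and checking that a node accepts TCP connections on its service port. `port_is_open` and `wait_until` are hypothetical helper names, and the timeout and interval defaults are arbitrary. Checking instance counts and cluster membership would need your cloud SDK and database client, which are not shown here.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(check, timeout=300.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True; raise TimeoutError if it never does."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            raise TimeoutError("condition not met within %.0f seconds" % timeout)
        sleep(interval)
```

A call such as `wait_until(lambda: port_is_open(node_host, node_port))`, where `node_host` and `node_port` are placeholders for one cluster node, either returns once the node is listening or fails loudly, which matches the binary outcome described earlier.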
Performing software build rollback
When rolling back a software build to a previous version, it is important that the outcome meets the qualities described in this best practice:
- Predictability: The previous version should have 100% of the live traffic on it. Relying on a DNS update to route traffic back to an old release is known to be unpredictable, as DNS resolution relies on multiple layers of caches and depends on consumer code and connection pool strategies.
- Reproducibility: A successful rollback should bring back the last known-good version each and every time.
- Consistency: The state of the service, the database, and the configuration should all be back to their previous state, as should all the cloud resources involved in the rollback (compute, ELB, DBs, DNS, …).
We want rollbacks to be our safeguard in case anything goes wrong when performing a change. The determinism of fast rollback is a key piece in protecting travelers and giving us the liberty to fail.
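The consistency check after a rollback can be sketched as a comparison between a snapshot taken before the deployment and the state observed after the rollback. How the snapshot is collected is environment-specific and not shown; `rollback_drift` and `verify_rollback` are hypothetical helpers, and the dictionary keys used with them (version, traffic share, load balancer) are whatever you choose to record.

```python
def rollback_drift(snapshot_before, state_after):
    """Return the sorted keys whose values differ between the two states.

    Keys present in only one of the two dictionaries also count as drift.
    """
    missing = object()
    keys = set(snapshot_before) | set(state_after)
    return sorted(
        key for key in keys
        if snapshot_before.get(key, missing) != state_after.get(key, missing)
    )


def verify_rollback(snapshot_before, state_after):
    """Return True if the rollback restored the snapshot exactly; raise otherwise."""
    drift = rollback_drift(snapshot_before, state_after)
    if drift:
        raise RuntimeError("rollback incomplete, drift in: " + ", ".join(drift))
    return True
```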
Here is an example of automation that left an inconsistent state:
- When a deployment routed traffic to a deleted ELB: in this case, the release operation took for granted that the ELB was there and available. However, between the deploy and release operations, the target ELB was deleted by a PEAR policy related to cloud cost reduction.
Here, we can conclude that to make the release process safe and hassle-free, we need to perform a deep analysis at each step: design, implementation, testing, and deployment. We need to be sure that the release process is automated in a way we can rely on, so that releases stay consistent over the long run without a single failure.