Infrastructure Testing in a DevOps Team's Workflow
The World of DevOps
Take everything I write in this article and on my blog in general with a grain of salt. These are my thoughts and reflections from the past 5 or so years working in Agile DevOps teams across different industries.
This article is part of a bigger DevOps Series. There, I will do my best to describe the steps I think you need to take, and the order to take them in, to become a DevOps, Cloud or Site Reliability Engineer, along with the best practices for CI/CD, IaC, Monitoring and other "DevOps" tools (DevOps is not about the tools, it's about the approach). The ordering of the articles will be chaotic and will depend on the questions I receive from people who are starting their journey in this realm of IT.
tl;dr Do not focus on infrastructure testing in the beginning.
Before Automated Tests
You should do 3 things before even starting to think about automated tests - parse, lint and dry-run your code. That will make sure that your code can run and do what you expect it to in the first place.
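The three checks can be chained into a single gate that stops at the first failure. Below is a minimal sketch, assuming Terraform is the tool in use: `terraform validate` stands in for parsing, `terraform fmt -check` for linting (a dedicated linter would do more) and `terraform plan` for the dry run. The names `CHECKS`, `run_command` and `run_checks` are hypothetical and exist only for this illustration.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Assumed Terraform commands for each stage; swap in your tool's equivalents.
CHECKS: List[Tuple[str, List[str]]] = [
    ("parse", ["terraform", "validate"]),
    ("lint", ["terraform", "fmt", "-check"]),
    ("dry-run", ["terraform", "plan", "-input=false"]),
]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_checks(
    checks: Sequence[Tuple[str, List[str]]] = CHECKS,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run the checks in order and stop at the first one that fails.

    Returns the names of the stages that passed.
    """
    passed = []
    for name, cmd in checks:
        if runner(cmd) != 0:
            raise RuntimeError(f"{name} check failed: {' '.join(cmd)}")
        passed.append(name)
    return passed
```

Calling `run_checks()` from the directory that holds the infrastructure code runs all three stages. The `runner` parameter is there so the gate itself can be exercised without Terraform installed.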
The Approach to Infrastructure Testing
...and to testing in general. I'm a firm believer that a Minimum Viable Product (MVP) should be developed vertically. Here's what I mean:
The same applies to infrastructure testing, even though it is slightly different. You see, it's pretty simple when we're testing software - we have unit tests and it's simple to identify the testable pieces of software.
Unit testing in an infrastructure setting is more of an integration test: you are testing how different parts of the code (infrastructure building blocks) work together. Yevgeniy Brikman does a great job of describing how infrastructure testing should work, and what the different types of infrastructure tests are, in his QCon talk from 2019.
I'll make one statement - do not (really, please don't) test separate Terraform resources or Ansible modules. The tool works as it should. Trust me, you don't have to ensure that Terraform's google_compute_instance or Ansible's gcp_compute_instance will create a GCP Compute Instance. That does not bring any value, it just generates code management overhead and eventually makes the whole testing effort obsolete.
You shouldn't approach code coverage in the same way you approach it for software - it's not about what percentage of infrastructure code lines are covered by the tests, it is about whether you are testing things that bring value to the overall robustness of infrastructure.
What you should test instead is that, after deploying your unit of infrastructure, it behaves as you expect it to. For instance, when you deploy a database instance, you should probably test that the database exists by querying it. Likewise, test that a web service returns a 200 status code, or that a Kubernetes object was created and is running/available.
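A behavioural check of this kind can be very small. The sketch below covers only the web service case, using Python's standard library. The helper names `http_returns_200` and `wait_until` are hypothetical, and the retry counts are arbitrary assumptions: freshly deployed infrastructure often needs a moment before it answers.

```python
import time
import urllib.request
from typing import Callable


def http_returns_200(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and 4xx/5xx responses,
        # which urllib raises as HTTPError (a subclass of OSError).
        return False


def wait_until(
    check: Callable[[], bool], attempts: int = 5, delay_seconds: float = 2.0
) -> bool:
    """Call check until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

A test for a deployed service would then be something like `wait_until(lambda: http_returns_200(service_url))`, where `service_url` comes from your deployment output. The same `wait_until` wrapper works for a database query or a Kubernetes status lookup, as long as the check returns a boolean.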
You generally don't want to perform infrastructure tests inside your production or even development environments. For infrastructure teams, all environments are customer-facing: the customer may be external (fun society corp., for instance) or internal (developer teams, or any other team that uses a non-production environment).
I'll elaborate on that - I have always believed in and rooted for the idea that a DevOps team's production starts in the development environments of the project. You should push a finalised, robust version of your infrastructure for anyone in the project, not just for external customers that pay money. The more stable your developers' workflow is, the faster their development cycles become: they implement features faster, the project evolves faster, everybody wins.
Ideally, developers should be able to spin up a dynamic environment as they need it but that's a completely different realm.
This is where the infrastructure sandbox comes in.
A sandbox environment is a fully isolated copy of the infrastructure you're running in production environments. In my opinion, it is a must-have for any infrastructure team. It is usually a separate subscription if we're talking about Azure, a separate project if we use GCP, etc.
It doesn't have to run all the time accumulating cost (in fact, it should not, more on that in a second). You should have a sandbox version of your CI/CD pipeline that deploys your infrastructure, performs tests during deployment of each building block (unit) of your infrastructure and, ideally, runs some smoke tests at the end to ensure the application is responding. It then tears the sandbox environment down and, heck, automatically promotes the infrastructure code to development environments for developers to test, because why not - automation baby!
Note that you should have a rollback procedure for this, in case the deployment goes sideways for whatever reason. The world of infrastructure is full of timeouts, outages and timeouts (yes, twice).
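The sandbox pipeline flow described above can be sketched as a single function. This is an illustration of the ordering only, not a real pipeline definition: `run_sandbox_pipeline` and all of its callables are hypothetical placeholders for whatever your CI/CD tool actually runs, and the rollback procedure is left out.

```python
from typing import Callable, Iterable, Tuple

# One building block: (name, deploy function, test function).
Unit = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_sandbox_pipeline(
    units: Iterable[Unit],
    smoke_test: Callable[[], bool],
    tear_down: Callable[[], None],
    promote: Callable[[], None],
) -> bool:
    """Deploy and test each unit, smoke test, always tear down, promote on success."""
    succeeded = False
    try:
        for name, deploy, test in units:
            deploy()
            if not test():
                print(f"test failed for unit: {name}")
                return False
        succeeded = smoke_test()
    finally:
        # Runs on success, on failure and on exceptions,
        # so the sandbox never lingers and accumulates cost.
        tear_down()
    if succeeded:
        promote()
    return succeeded
```

The `finally` block is the important design choice here: a failed test or a timeout halfway through must still remove the sandbox, and promotion only happens after a clean run.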
Sandbox. Should. Not. Run. Constantly. Period.
There are 2 reasons - first, the cost overhead of infrastructure (cost optimisation, very important) and second, maybe even more important - you can't make sure that your infrastructure code is robust by only making incremental changes to your environments. Infrastructure has to be deployed, redeployed and undeployed as often as possible.
Run a cron job to clean the sandbox environment every week and automatically deploy a fresh sandbox during the night before a new work week.
Add a counter to your sandbox deployments from the master branch (or develop if you're using GitFlow, or whatever your team's development process uses). Every 10th deployment gets a bonus - complete deletion of the sandbox environment and a fresh deployment.
These relatively simple steps will drastically improve the reliability of your IaC and the trust the team has in it, which will increase the pace at which you're able to develop infrastructure code.
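The every-10th-deployment rule comes down to one modulo check. A minimal sketch follows, assuming the counter starts at 1 and is stored somewhere your pipeline can read it (a CI build number would do). The function name `needs_full_rebuild` is hypothetical.

```python
def needs_full_rebuild(deployment_number: int, every: int = 10) -> bool:
    """Return True when this deployment should wipe and redeploy the sandbox.

    deployment_number is assumed to start at 1 and grow by one per deployment.
    """
    if deployment_number < 1 or every < 1:
        raise ValueError("deployment_number and every must be positive")
    return deployment_number % every == 0
```

The pipeline would call this before deploying and, when it returns `True`, delete the sandbox completely before the fresh deployment.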
Namespacing a resource is the process of adding a unique identifier of some sort to the resource being tested or deployed.
All cloud providers have some resources that require a unique name. When you are testing your IaC in parallel, or when your project grows to a certain size, you should namespace resources.
Actually, you should strive to introduce resource namespacing when you write your first line of infrastructure as code, but since this is an article on infrastructure testing, that is out of scope for now.
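Namespacing can be as simple as appending a short unique suffix to every resource name. Below is a minimal sketch. The function name `namespaced` is hypothetical, and the naming rules (lowercase letters, digits and hyphens) and the 63-character limit are assumptions for illustration only: real limits differ per provider and per resource type.

```python
import re
import uuid


def namespaced(base_name: str, unique_id: str = "", max_length: int = 63) -> str:
    """Append a unique suffix to base_name, e.g. a branch name or build id.

    Without unique_id, a random 6-character suffix is generated.
    """
    raw_suffix = unique_id or uuid.uuid4().hex[:6]
    suffix = re.sub(r"[^a-z0-9]", "", raw_suffix.lower())
    if not suffix:
        raise ValueError("unique_id must contain at least one letter or digit")
    base = re.sub(r"[^a-z0-9-]+", "-", base_name.lower()).strip("-")
    # Trim the base, not the suffix, so the unique part always survives.
    base = base[: max_length - len(suffix) - 1].rstrip("-")
    return f"{base}-{suffix}"
```

Two test runs that deploy the same code in parallel then get different names for the same logical resource, so they cannot collide.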
Pester - more of an imperative testing framework based on PowerShell.
Terratest - personal favorite. A Go-based infrastructure testing framework that covers a bunch of tools. It can also both deploy and destroy the infrastructure on its own, whether the test succeeds or fails.
I will go into the details of Terratest in a separate article.