There are plenty of best practices out there, but some are essential and universal. I'll try to summarize them as a “top 10”. Of course, this is somewhat subjective.
Pin your versions as specifically as possible
Software gets updates, and updates might break things or include changes you don't expect. Of course, there are concepts like semantic-release out there, but they are not always a guarantee: not every version is truly immutable. For example, git tags are not as stable as they might seem; they can be moved or deleted after the fact.
So pin your versions as explicitly as possible. For example, NEVER use the Docker latest tag for anything except testing or tools that you just run locally. The real version behind the latest tag can change without further notice.
So, for example: for git dependencies, pin commits if possible; for Docker images, be as explicit as you can. Some even pin the image digest instead of a tag.
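As a sketch, the three levels of pinning could even be enforced with a tiny pre-deploy check. The function name, messages, and the dummy digest below are my own illustration, not a standard tool:

```shell
#!/bin/sh
# Classify a container image reference by how strongly it is pinned.
check_image_ref() {
  case "$1" in
    *@sha256:*) echo "ok: pinned by digest" ;;            # immutable content address
    *:latest)   echo "reject: mutable latest tag" ;;      # can change without notice
    *:*)        echo "warn: version tag (tags can be re-pushed)" ;;
    *)          echo "reject: no tag at all (implies latest)" ;;
  esac
}

check_image_ref "nginx:latest"            # rejected
check_image_ref "nginx:1.25.3"            # tolerated, with a warning
check_image_ref "nginx@sha256:0123abc"    # dummy digest, accepted
```

The same idea applies to git dependencies: where possible, reference an exact commit SHA rather than a branch or a tag.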
Automate what you can
This one might seem obvious, but automation is key: everything that can be automated should be. Work state-driven whenever possible; tools like Ansible or Terraform can do the heavy lifting for you.
Automation allows you to rebuild from scratch when necessary or to repair after outages, which happens far more often than you might think.
For example, when your datacenter burns down, you can spin up the same configuration in the cloud. Or if someone bricks your jumphost, just remove it and let automation rebuild the rest.
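The state-driven idea can be sketched in plain shell: describe the desired state and only act when reality differs. The paths and the config line below are made up for illustration; tools like Ansible and Terraform apply the same principle at scale:

```shell
#!/bin/sh
# Idempotent helpers: running them a second time changes nothing.

ensure_dir() {
  # Create the directory only if it does not exist yet.
  [ -d "$1" ] || { mkdir -p "$1" && echo "created $1"; }
}

ensure_line() {
  # Append the line only if the file does not already contain it exactly.
  grep -qxF "$2" "$1" 2>/dev/null || { echo "$2" >> "$1" && echo "updated $1"; }
}

ensure_dir /tmp/demo-app
ensure_line /tmp/demo-app/app.conf "port=8080"
```

Run it twice: the second run prints nothing, because the desired state is already reached. That is exactly what makes re-running automation after an outage safe.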
Document everything
Documentation is important; I cannot repeat this enough. Everything that can be documented should be. And most essential of all: keep it up to date and consistent.
It is best to document close to the actual infrastructure or code; a good strategy is to maintain documentation in the project itself instead of in an external system. This way you don't forget about it and can update it right away.
Keep It Simple
Good solutions are mostly the simple ones. Of course, complex problems sometimes require complex solutions, but taking a step back and reducing complexity frequently helps.
Sometimes writing a custom tool is not the best option and a shell script does the job. In other situations you are better off reusing existing tooling like Terraform instead of deploying manually through a tangle of shell scripts.
Refactor when needed
Refactoring is not exclusive to programming; infrastructure sometimes needs a refresh too. Maybe there is something new now that lets you do things more easily or better.
Build everything with refactoring in mind, and refactor whenever it is suitable. Start with a simple solution and improve it when you hit limitations.
Don't save on hardware
Of course, this is a bit provocative, and you shouldn't throw all your money at hardware. The point is: don't save on hardware where it matters. For example, never be stingy with staging. I once saw a test stage running on a single instance while the other stages had multiple nodes. This might seem like a good way to save a few bucks, but it can hide clustering problems that you will only catch in a later phase of the development process.
The same applies to CI/CD: there are moments when developers have to wait for their pipelines to finish, and that becomes a problem when they are blocked until then. So don't only optimize the pipeline itself, also give it the resources it needs!
Monitor what matters
It is embarrassing when developers or customers notify you about a critical outage or issue you didn't even know existed in the first place.
One can monitor every single piece of hardware and software, but in most cases this just creates noise and makes the monitoring itself less useful. You might have heard of “pager fatigue”, which describes exactly this situation.
So monitor carefully: every alert should indicate a real issue or something that will become one in the near future. For example, when a disk is almost running out of space, you should take care of it before it does.
Your service is losing instances like crazy? This is something that requires further investigation.
But honestly, nobody cares if a server runs at a load of 2 for 10 minutes; as long as the application stays responsive, no one will notice.
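The disk-space example above can be sketched as a tiny check script. The 90% threshold and the message format are arbitrary choices for illustration, not a recommendation:

```shell
#!/bin/sh
# Alert before the disk fills up; stay silent otherwise.
THRESHOLD=90

check_disk() {
  usage=$1  # current usage in percent
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ALERT: disk at ${usage}%, act before it runs full"
  else
    echo "OK: disk at ${usage}%"
  fi
}

# In a real setup you would feed it live data from cron, e.g.:
#   check_disk "$(df -P / | awk 'NR==2 {sub("%","",$5); print $5}')"
check_disk 95
```

The shape is what matters: the alert fires only when action is actually needed, matching the rule that every alert should indicate a real issue.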
Keep calm
This might seem obvious as well: when production is going to hell, of course it's not a situation where you feel comfortable. But it's not your mother who's dying! So keep calm, take breaks when it takes longer, and most importantly: don't let others stress you.
Keeping calm is almost a superpower these days, and it is a skill you should improve over time. It took a while for me to realize how critical it is. When I first got into this world, I was almost shocked at how calm my colleague was when all our production services went down because of a serious infrastructure problem. The first thing he did was get a coffee and call a colleague to ask whether he had time. He stayed calm, took a few minutes, came back with a good solution, and we were back online shortly after.
The same applies to breaks. Our brains don't work well if we don't give them “space”; sometimes it takes hours, days, or even weeks to fix something. And never in the history of operations has it helped to hammer away at an issue for 10 hours straight. So take your health seriously and listen to your brain when it screams for a break!
Don't be (too) ideological
Tools change, services change, people change. This is the most natural thing you will encounter. Usually things just have to work, and no one cares whether you use tool A or tool B to reach the goal. So choose the right tool for the job. This applies to almost everything.
Of course, there is a thin line between choosing the appropriate tool and hacking things together until they barely work. That is a huge difference, and you should be careful not to cross the line.
Measure, don't guess
Do you believe it is slow? Do you think it crashes frequently? Measure it!
Uncertainty should not influence your decisions: if you don't have information about something, gather it. This goes hand in hand with monitoring. Most importantly, collect metrics for everything. It is better to have data you don't need yet than to have to guess what went wrong.
Once everything is set up properly, emitting metrics should be as simple as flipping a light switch. You might ask: why should I collect so much information? The thing is that you will often need to look back in time at historical behavior, and without the relevant data you will have a hard time figuring out what the hell was going on.
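A minimal sketch of the “record it now, look back later” idea: append timestamped samples somewhere durable. The file path, metric names, and line format below are made up; in practice you would use a real metrics system instead of a flat file:

```shell
#!/bin/sh
# Append timestamped samples so historical behavior can be inspected later.
METRICS_FILE=/tmp/demo-metrics.log

emit_metric() {
  # Format: <name> <value> <epoch-seconds>, one sample per line.
  echo "$1 $2 $(date +%s)" >> "$METRICS_FILE"
}

emit_metric http_requests_total 42
emit_metric disk_used_percent 73
```

Emitting a sample costs almost nothing, but having weeks of samples is exactly what lets you answer “what was going on last Tuesday?”.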
These are some of the most important things I noticed and learned during my career in DevOps. Think I missed something, or are you interested in more like this? Please let me know in the comments.