Automation 3 step framework, considerations, and expectations


Intro

Automation can be complex. How do we do it well? In this article, we're going to establish a system: a basic three-step framework to bring structure to your automation. After that, we're going to go through an automation example so you can see what this looks like in real life. It's a bit of a read, but I believe there are some concepts here that can truly set clear expectations and give you an understanding of automation, its challenges, and its benefits.

Basic concept first, prove it before you do it

I can't tell you how many times a "critical issue" was brought to my desk to be solved with automation because it "happens all of the time". This isn't a slight against anyone who brings in issues to be automated, but it is a recognition of the human psyche, and of our responsibility as leaders or engineers to read through it. If a client is very upset about an issue and that issue has happened more than once, then as far as the person bringing it to you is concerned, "the sky is falling" and "it is happening all of the time". This is dangerous. If this issue turns out to affect one person at one company and has happened twice ever, but now we've spent days or weeks automating it...what have we accomplished? Whoever asked us to do it sure is happy, but almost literally anything else would have been a more valuable use of our time.

This is to say that we take all input seriously, but we also respectfully take it as input, not the truth. Numbers are the truth. We take that same scenario and, this is important, we acknowledge that it sounds like a real problem. Then, we let them know that we have to be careful where we put our automation time and effort, so we need to understand how often it's happening. "Can you provide me with a list of tickets where this has happened previously?" is a question I have asked probably thousands of times, and guess what? Almost no one will produce it for you. Why? Because they don't have it, or don't have the time to get it. Now to be fair, them not having it (or not having the time to get it) and this not being a real issue are different things, but there's an important point here: if no one is measuring it, no one can quantify it, and therefore no one actually knows where our pain points are, and no one knows what we should automate, yet we're positive it's time for automation...rough.

1: Dashboarding: We need to quantify the problem

You'll see me talk about dashboarding in a ton of articles, and that's because I believe dashboarding to be one of the most important, if not the most important, pieces in any given business. If we don't know what's happening in the business, we don't know what to work on in the business. Taking the example above, we need to see if we can extract some information that can help us: "when that kind of issue comes in, do you set a specific ticket category or type or status?", "is it usually for one client or many? which client(s)?", "do you have a standard way you've been naming these tickets?". We're just trying to find some way to locate them in a dashboarding service so we can go quantify how often this thing is happening. I've been shocked in both directions many times. Sometimes the smallest, seemingly insignificant thing turned out to account for literally hundreds of tickets a month, and with a day or two of automation it could be solved. Who knew?! Well, no one, because no one was measuring it. Then of course I've seen the opposite, where I could only find one or maybe two examples of an issue happening in the entire history of the business.
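
If your PSA or dashboarding tool can at least export tickets, even a quick script can stand in for a dashboard while you're quantifying. Here's a minimal sketch in Python, assuming a hypothetical CSV export with created, category, client, and summary columns (the file name and column names are placeholders, not any real system's schema):

```python
import pandas as pd

# Hypothetical ticket export from your PSA; adjust the file name and
# column names to whatever your system actually produces.
tickets = pd.read_csv("tickets_export.csv", parse_dates=["created"])

# Narrow down to the issue we're trying to quantify, e.g. by category
# or by a keyword in the ticket summary.
suspects = tickets[
    (tickets["category"] == "Service Down")
    | tickets["summary"].str.contains("service stopped", case=False, na=False)
]

# How often is this really happening, and for whom?
print(suspects.groupby(suspects["created"].dt.to_period("M")).size())
print(suspects["client"].value_counts().head(10))
```

If a count like this shows hundreds of tickets a month, you have your business case; if it shows two tickets ever, you just saved yourself weeks of development.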

What are the real issues? Where does our time yield high returns? If I create xx automation, it saves xx time or tickets on average per month. You should literally not even write a single line of code until this measuring phase is completed. Once we have proven we have a problem to solve, now we can start discussing how we solve it.

2: Process: If there's no manual process, we can't automate it

This is basic but missed almost 100% of the time. We do not want to automate anything until there is a known, documented, tested process for solving it manually. If any of those conditions is not met, we should absolutely, under no circumstances, begin automating anything. Why? Automating is taking the steps that would be performed manually and creating them in code. This means that if no one has a set way they do it manually, we have no set way for how we should develop it. If it's pushed to be developed anyway, we're creating and testing our process through development. This is expensive and time consuming. Don't do it.

Instead, take a step back, solidify the best possible way to solve it with manual processes, document it well, and then USE IT. Use it for a month or two. You'll likely find in that use of the process that there are nuances: sometimes it's different depending on the client, and actually, this whole part over here doesn't work how we imagined, so we need to rework that whole thing because it makes more sense to "fill in the blank". This is how all process creation goes. It's fine, but imagine that same discovery process after weeks or months of development for each iteration...it's terrible.

3: Structure: All automation should have two primary functions

  1. Verification: All automations should be checking for the current state of whatever we're looking at. Since we're going to reference services for most of this article just as an example, we'll stick with that theme. Our simplest first verification for a service could be: "Is the service running?", and our verification returns truthy or falsy. From there we can get a lot more complex, and often we should, where it's actually going to check a large set of conditions to ensure the service and surrounding conditions all meet expected results. If they do not, it should just tell us "it's not what it should be" (false), queueing us up to know it's time to take action.
  2. Action: The action is what it sounds like: something is wrong, and we're going to do something about it. Doing something about it may just be raising a ticket to the help desk, or it may be a complex remediation script to fix the conditions that were found to be outside of the expected result. If the action includes any kind of remediation, then as soon as the action is complete, it returns to our step 1 of "Verification". If Verification does not return "Okay, looks good now!" (true) this time, then we know our remediation failed and now we need to raise a ticket for someone to follow up.
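
To make that structure concrete with the service theme, here's a minimal sketch in Python. It assumes a Windows endpoint with the psutil package installed; the service name and the raise_ticket helper are hypothetical stand-ins for whatever your RMM/PSA actually provides:

```python
import subprocess
import time

import psutil  # assumed to be installed on the endpoint

SERVICE_NAME = "SomeCriticalService"  # hypothetical service name


def verify(name: str) -> bool:
    """Verification: return True only if the service is in the expected state."""
    try:
        svc = psutil.win_service_get(name)
    except psutil.NoSuchProcess:
        return False  # the service doesn't even exist, so definitely not okay
    # Simplest possible check; a real verification should look at far more
    # conditions (start type, machine uptime, dependencies, time in state...).
    return svc.status() == "running"


def action(name: str) -> None:
    """Action: attempt remediation, here just starting the service."""
    # psutil can't start services, so shell out to the built-in `sc` command.
    subprocess.run(["sc", "start", name], check=False)
    time.sleep(30)  # give the service a moment before we re-verify


def raise_ticket(name: str) -> None:
    """Placeholder for however your PSA/RMM opens a ticket."""
    print(f"TICKET: {name} still failing verification after remediation")


if not verify(SERVICE_NAME):       # 1. verify
    action(SERVICE_NAME)           # 2. act
    if not verify(SERVICE_NAME):   # back to 1: re-verify after the action
        raise_ticket(SERVICE_NAME)
```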

These are the crucial pillars of all automation that you should always follow. If we don't have methods of verification and instead we just blast out "fixes" and "automations" whenever "something is wrong", we're likely not really doing anything.

Okay got it, 3 steps, what now?

Now that we know we have to quantify the problem, ensure we have a manual process for whatever we want to automate, and we know the structure our automations should be created in, what does this all look like in real life? Is the automation part complex? Doesn't the RMM have a lot of automation built for us? Let's get into an example.

Windows service example

One of the earliest things I tackled in my automation career was monitoring for services that weren't running but should be. Simple enough, right? If it's not in the Running state and you want it to be, start it! Done! The RMM even has this built in, perfect! Kind of...the next several years slowly taught me that every adventure into automation has twists and pitfalls that you really can't see until you're further in and everything is breaking.

What are some ways this service concept isn't as simple as "start service if it's stopped"?

  • Uptime: The machine just booted and the services are still in the process of starting, so you don't have a problem, you have a machine starting itself up and starting those services with it. This means we need to account for total uptime before we automate any service tickets / actions, to ensure we've given the services an adequate amount of time to start.
  • Updates: if the application that is associated with a service is being updated, guess what? The service gets stopped during the update. If the update takes longer than your interval of checking for stopped services and your automation forces the application's services to start back up mid-update, you can actually break the update to the app every single time you try to update it.
  • Techs: If a help desk tech is troubleshooting an application and your monitoring is so aggressive that every time the tech stops the service to troubleshoot it's immediately started back up, that tech isn't actually ruling out the problems they think they are. Again, this likely implies your window of monitoring for a service not running is too short. But because this possibility exists, it means that you need to educate your help desk team about all automation like this that exists, and provide a way for them to pause it during their troubleshooting. Your RMM likely includes a "maintenance mode" that should block all monitoring / scripts from running, but your team is likely not using it. Make it part of their SOP to put every machine they work on into maintenance mode as step 1, before connecting to the machine that needs help. The person doing the automation is most of the time the only one that could know these nuances exist, so it's your responsibility to get them out in the open to be added to SOPs!
  • Dependencies: If you're just trying to start stopped services, you're likely missing the fact that a dependency has to be running for that service to start. If you don't have logic to first verify dependencies are running, you're really not automating much here. Let's find all dependencies that are not started, get those started, then start the service that we had originally found to be stopped.
  • More to a service than stopped/running state: Services have more states than Stopped and Running, yet all service monitoring I've ever seen in the wild is only worried about these two. This means that we're only really monitoring for the two states that imply things are fine. What do I mean? 99% of the time, a service just starts when it's supposed to, and it just stops when it's supposed to. Windows has the concept of Automatic startup, so most of the time, if your automation is starting a service, the startup type was probably just wrong and we should fix that instead of starting it with a monitor / script. Furthermore, if a service that's supposed to be running is stopping all the time, that doesn't just "happen for no reason". A service stopping is a sign that something is wrong. If we just keep blasting it with a "start service!" every time it's stopped for eternity, we're actually making ourselves miss the fact that something is broken because the problem is "being automated". When a service is actually broken in some way, it's almost never just going to be in the Stopped state; it's almost always going to be stuck in Starting or Stopping. Let's go through those:
    • Starting state: Sometimes a service takes a while to start, and it doesn't necessarily imply there's a problem...maybe it just really does take a long time. If we're watching for a service in any state other than Running and we just blast it with "START START START" but it's already starting, we're not "automating" anything, and we're certainly not helping it go faster. This means we need to measure how long the service has been in the Starting state and determine if we believe that to be a problem, then we can design action around that result if we so choose, whether that's stopping it and retrying, or just raising a ticket that something is going on. If it truly is stuck in the Starting state for more than the xx amount of time we define, then yes, maybe we have a broken service that needs help and we should raise a ticket.
    • Stopping state: Almost everyone messes this one up (accidentally). The most common reason for a service being stuck in the Stopping state is a SQL service. SQL is designed to hold as much data in memory as possible for performance, meaning when you stop that SQL service it could still have data to commit to disk. At the very least, it's likely in the middle of write operations to the database, and force stopping it could result in half-committed data, half-deleted data, half-updated data, you name it. It can turn into a mess quickly. If you just force stop that service because it's "stuck", you could actually be causing database corruption. Usually when I say that, it's followed up by people saying they "never force SQL services to stop" and "they know better", until we start talking about the commands they issue to reboot machines, like shutdown /r /f and Restart-Computer -Force...guilty? These commands instruct the OS to force stop all services, including SQL. Oops. Like a service in the Starting state, we need to define what it means for a service in the Stopping state to have a problem. This may include checking to see if it's a SQL service and giving it more grace if it is, or just putting a blanket amount of time on it. Either way, it means we need to measure the time we give it to successfully stop, draw a line where we think something is wrong, and take our action, whether that's force stopping the service or raising a ticket.
  • Measure how many times a service has problems: This is critical. If we have a service we've started or restarted 2,000 times in the last week, something is wrong. How do we know if this is even happening? Can you prove right now that it isn't? Chances are, no, and chances are your RMM doesn't provide a simple path to quantify it. A rough sketch pulling several of these checks together follows this list.
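
To pull several of these nuances together, here's a rough Python sketch of the decision logic only (no destructive actions), again assuming psutil on a Windows endpoint. The thresholds, and the idea of persisting a "first seen in this state" timestamp somewhere (a local file, an RMM custom field), are assumptions you'd tune and build out for your own environment:

```python
import time

import psutil  # assumed available; this is a sketch, not a drop-in monitor

MIN_UPTIME_SECONDS = 15 * 60    # don't act right after boot
MAX_PENDING_SECONDS = 10 * 60   # how long we tolerate Starting/Stopping


def uptime_seconds() -> float:
    return time.time() - psutil.boot_time()


def check_service(name: str) -> str:
    """Return a verdict describing what (if anything) should happen next."""
    if uptime_seconds() < MIN_UPTIME_SECONDS:
        return "machine just booted, give the services time to start"

    try:
        svc = psutil.win_service_get(name)
    except psutil.NoSuchProcess:
        return "service not installed, raise a ticket"

    status = svc.status()          # running, stopped, start_pending, stop_pending...
    start_type = svc.start_type()  # automatic, manual, disabled

    if status == "running":
        return "ok"

    if status == "stopped" and start_type != "automatic":
        # Fix the startup type (and ask why it was wrong) rather than
        # starting the service from a monitor forever.
        return "stopped and not set to automatic, fix the startup type"

    if status in ("start_pending", "stop_pending"):
        # A real implementation would record when this state was first seen
        # and compare against MAX_PENDING_SECONDS before calling it stuck;
        # SQL services deserve extra grace while stopping.
        return "pending state, measure how long before calling it stuck"

    if status == "stopped":
        # Check and start dependencies first, then start the service, and
        # log the restart somewhere you can count later, because 2,000
        # restarts a week is a problem being hidden, not solved.
        return "stopped, check dependencies, start it, log the restart"

    return f"unexpected state: {status}, raise a ticket"
```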

Hindsight is 20/20

Why did I start working on service monitoring in the first place? I learned a lot and have a lot of takeaways, but...what was the gain? Full transparency, not much. I'm grateful for what I learned, but "stopped or hung services" did not represent any majority or large trend in the tickets my MSP was working through. I dove in because a remote access service we used kept stopping and we couldn't remote in, so naturally, I should fix that, sounds serious...so I did. I spent hours, days, probably even weeks making sure it perfectly accounted for all of the nuances. Do you know what happened during those weeks I was creating this? The remote tool released a new update addressing a "service stopping issue" that resolved the very issue I was automating...did that stop me? No. Full steam ahead! I can apply this to anything!

It's true, I could apply it to anything...but was any other service on fire and causing hundreds of tickets and/or hours? Was that the most important, most impactful thing I could spend my time on? Not even close. But I had no idea; my intentions were pure...I just had tunnel vision to improve what I could see was broken. Again, what I learned is cool, but was it truly impactful?

Wrap up

There's a really good chance that the person or team that handles your automation is putting in hours, days, or even weeks on issues brought to them as "critical", "recurring", you name it, that in reality don't represent anything close to what actually moves the needle for your business. How do we help them? Introduce these frameworks. Require measurement: quantify before we move forward. Even once we know the issue is real, enforce having a manual process before we start automating. Then finally, once we're actually creating the automation, let's value verification and reporting as much as the fix itself.

I hope this was helpful to your business in some way, and I hope to hear of some awesome things that you've quantified and automated!