Maybe it’s happened to you.
Your job scheduler is suddenly unreachable via the network. Or maybe the agent where your critical job is executing has a hardware failure. Maybe the server downtime is planned, but the manual tasks required to get your job stream back on track consume hours of your valuable time.
Next time, you tell yourself, you’ll be more prepared.
In the latest “No-Stress Job Scheduling,” Jared Dahl, manager of the Automate development team, talks about the features an enterprise job scheduler should have to minimize the time and effort it takes to recover from downtime.
Tools for When an Agent Is Down
Whether your agent downtime is planned or unplanned, you don’t want it to cause too much extra work for you. The first tool your enterprise scheduler should have to save you time and trouble is a job queue. The job queue allows jobs to stack up when resources aren’t available for them to run. If the agent is unavailable, the jobs will queue on the server.
Another important function for a workload automation tool is agent failover, which allows another agent to take the first agent’s place when the first agent is unavailable. Your enterprise job scheduler needs to let the two agents operate as a single logical agent. If your enterprise job scheduler allows agent groups, you can arrange your agents in a specific order so that when one is down or missing, the job runs on the next agent.
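To make the agent group idea concrete, here is a minimal sketch in Python. It is not how any particular scheduler implements failover; the function name `pick_agent` and the agent names are hypothetical, and it assumes the scheduler already knows which agents are currently online.

```python
from typing import Optional, Sequence, Set


def pick_agent(agent_group: Sequence[str], online: Set[str]) -> Optional[str]:
    """Return the first agent, in the group's preferred order, that is online.

    `agent_group` and `online` are hypothetical inputs for illustration;
    a real scheduler would track agent status itself.
    """
    for agent in agent_group:
        if agent in online:
            return agent
    # No agent in the group is reachable; the job should stay queued.
    return None


# Hypothetical agent names, listed in preferred order.
group = ["agent-primary", "agent-backup-1", "agent-backup-2"]
chosen = pick_agent(group, online={"agent-backup-1", "agent-backup-2"})
```

Here the primary agent is offline, so the job would go to the next agent in the list. If every agent in the group is down, the function returns `None` and the job would wait in the queue.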
When meeting your SLAs depends on certain jobs, it’s essential that you are alerted quickly if they fail to run. For this, you need an enterprise scheduler with a job monitor function. For example, you can set up a job monitor that recognizes when a job has been stuck in the queue for too long. You will then receive a notification so you know that a manual reboot will be necessary.
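The queue monitor described above comes down to comparing how long each job has waited against a threshold. The following Python sketch illustrates that check under simple assumptions; `find_stuck_jobs`, the job names, and the 30-minute limit are all hypothetical, and a real product would send the notification itself.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stuck_jobs(
    queued_at: Dict[str, datetime],
    now: datetime,
    max_wait: timedelta,
) -> List[str]:
    """Return the names of jobs that have waited in the queue longer than max_wait."""
    return sorted(
        job for job, enqueued in queued_at.items() if now - enqueued > max_wait
    )


# Hypothetical queue contents: job name -> time the job entered the queue.
queue = {
    "nightly-backup": datetime(2024, 1, 1, 1, 0),
    "payroll-export": datetime(2024, 1, 1, 1, 50),
}
stuck = find_stuck_jobs(
    queue,
    now=datetime(2024, 1, 1, 2, 0),
    max_wait=timedelta(minutes=30),
)
for job in stuck:
    # Stand-in for the email or alert a scheduler would send.
    print(f"ALERT: {job} has been queued for more than 30 minutes")
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.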
For high-traffic or especially important agents, you should be using job scheduling software that sends you an email immediately if the agents go offline.
Tools for When the Server Is Down
When a server goes down, things get a bit more complicated. During server downtime, it’s important that work can continue on the agents. When the server comes back online, each agent will report the status of its jobs in progress.
The next feature for dealing with server downtime is downtime processing. When the server comes back online, a downtime processing function allows the job scheduler to calculate which jobs were missed and offer options for how to deal with the missed work. A good workload automation tool will offer the choice to not run the job, or to run it when the server is back up, or to place it on hold until the operator can make a decision.
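As a rough illustration of downtime processing, the Python sketch below works out which scheduled runs fell inside an outage window and then applies one of the three choices mentioned above. It assumes a job that runs at a fixed interval; the names `MissedJobPolicy`, `missed_runs`, and `recovery_action` are hypothetical and do not come from any specific scheduler.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import List


class MissedJobPolicy(Enum):
    """The three hypothetical choices for a job that was missed during downtime."""
    SKIP = "do not run"
    RUN_ON_RECOVERY = "run when the server is back up"
    HOLD = "hold for operator decision"


def missed_runs(
    first_run: datetime,
    interval: timedelta,
    down_at: datetime,
    up_at: datetime,
) -> List[datetime]:
    """Return scheduled run times that fall in the outage window [down_at, up_at)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    run = first_run
    # Advance to the first scheduled run at or after the outage began.
    while run < down_at:
        run += interval
    missed = []
    while run < up_at:
        missed.append(run)
        run += interval
    return missed


def recovery_action(missed: List[datetime], policy: MissedJobPolicy) -> str:
    """Describe what to do with a job, given its missed runs and its policy."""
    if not missed:
        return "nothing missed"
    return policy.value


# An hourly job; the server was down from 02:30 to 05:15.
missed = missed_runs(
    first_run=datetime(2024, 1, 1, 0, 0),
    interval=timedelta(hours=1),
    down_at=datetime(2024, 1, 1, 2, 30),
    up_at=datetime(2024, 1, 1, 5, 15),
)
action = recovery_action(missed, MissedJobPolicy.HOLD)
```

In this example the 03:00, 04:00, and 05:00 runs were missed, and the job is placed on hold until an operator decides what to do.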
As always, notification features are crucial. Generally, notifications go through your enterprise server, so if that server is down, it can’t alert you to the problem. The best option is to have notifications configured through a failover server.
Finally, you want to be able to fail over to a high-availability (HA) system. This will usually be a system that has a backup copy of your entire database and your job history and logs. Ideally, the agents will automatically switch over when they see the server isn’t available.
Next time you face unplanned downtime, will you find yourself manually kicking off hundreds of jobs when the server comes back up? Or will you be prepared? Make sure you get an enterprise job scheduler with the tools you need to recover quickly from IT outages.
Start your free trial of Automate Schedule today to find out how you can minimize risk and effort in the case of unexpected downtime.