There's a lot of pressure on technology leaders right now to move applications to the cloud, and to move them quickly. But this rush to the cloud can leave Cloud Architects with some difficult choices. A quick move can result in explosive run costs when an application isn't properly refactored to run in the cloud. On the other hand, taking the time to refactor an application before moving it to the cloud can turn into an expensive and time-consuming activity in and of itself. Either way, a poorly-managed cloud migration can turn into an expensive proposition for any enterprise.
In this paper, we’ll discuss six ways to manage the cost of running systems in the cloud. First, we’ll look at a migration approach, Lift and Shift, which is arguably the fastest and simplest way to move to the cloud, but requires diligent cost-management after the move. Refactoring your applications to consume cloud services more efficiently before moving to the cloud is the second option we’ll look at. Then, we’ll discuss a more diligent variant of Lift and Shift, called Lift, Trim, and Shift. We’ll also look at a lengthier approach to migration, optimization, and how it can save you money. Once the migration approaches have been reviewed, we’ll look at approaches to managing costs for systems already in the cloud, including reserving instances, negotiating volume discounts, and reclaiming or reselling unused capacity for financial benefit.
But before we dive into how to manage cloud costs, let’s take a moment and consider the two fundamental financial arguments for moving to the cloud. First, an enterprise wants to recapture capital expenditures to reinvest in other areas of the business. Perhaps upgrading a manufacturing facility, or expanding into a new geography would serve the business better than sinking that same money into new computer hardware. Freeing up capital is a big financial benefit of moving to the cloud.
The second fundamental financial argument for moving to the cloud comes from converting what used to be depreciated assets into pure operating expenses. Instead of making a huge capital outlay and then depreciating those assets over the next three years, moving to the cloud converts those dollars into a monthly operating expense.
So, freeing up capital for investment elsewhere and converting your spend from CAPEX to OPEX are the two fundamental and defensible financial benefits of moving to the cloud. If you try to argue that the cloud is less expensive, however, you’re likely to lose that argument. Depending on your current on-premise infrastructure, and how much of that capacity you actually consume on average, your annual costs in the cloud, left unmanaged, could run anywhere from 20 to 400 percent higher than comparable on-premise infrastructure. That means that simply replicating your current infrastructure in the cloud is a mistake; you have to switch from a “design for max capacity” mindset to a service-consumption model: elastic, scalable, and resilient.
Lift and Shift
Under intense pressure to move to the cloud quickly, many IT leaders opt for a Lift and Shift approach because it’s a relatively quick and simple migration method. However, it also introduces the potential for huge run costs after the migration. Put simply, you select services from your cloud provider that are equivalent to your current architecture components, and configure a copy of your current architecture. Then, you migrate your system over to the cloud, essentially lifting the workload and shifting it to the cloud: Lift and Shift. All the major cloud service providers offer tools to easily configure a replacement for your current on-premise system(s). However, since most on-premise infrastructures are designed to handle maximum workloads (e.g., Black Friday), they end up being very inefficient in the cloud, where the model is aimed at using only the services you need, when you need them. So a simple Lift and Shift approach, however quick it may be, can leave you paying for all the unused capacity you just architected and moved. An alternative to Lift and Shift is refactoring your applications for the cloud before you move.
Refactoring
Any discussion of refactoring starts by answering the obvious question: what runs well in the cloud? Chances are, your on-premise systems are not optimized for cloud architectures. Systems that have been architected to leverage microservices, making them elastic, scalable, and resilient, are ideal for running in the cloud. To be considered elastic, an application should start new instances quickly to meet spikes in user demand. Scalable architectures use stateless services so that they can quickly expand or contract in response to massive changes in demand. Finally, to be considered resilient, an application needs to handle the coming and going of instances in a dynamic environment. So, what if your system doesn’t possess these characteristics?
If you've done any research on moving to the cloud, you probably already know that if your applications are not architected to leverage microservices, you need to look at refactoring. If you’re not familiar with the term “refactoring”, it’s essentially the same thing as “re-engineering”. Simply put, anytime an organization wants to extend the lifespan of a system by re-engineering a non-functional component, that’s refactoring. In the case of moving to the cloud, you want to extend the lifespan of a system by basing it on microservices optimized for the cloud. By re-engineering the non-functional parts of your application, you can rest assured that you get the most cost-effective implementation right off the bat. Just remember that while refactoring will extend the lifespan of your applications and make them run more efficiently in the cloud, it also takes considerable time and effort to do properly.
Typically, refactoring in place requires a lot of upfront analysis to first understand how cloud services are consumed. This is where an experienced Cloud Architect can help tremendously. Once you understand the available cloud services and how to consume them, you’ll redesign the non-functional guts of your application to consume those services in the most efficient manner. Keep in mind that a single cloud provider offers hundreds of services, and each service has dozens of options. There’s a lot of complexity to wade through. Obviously, then, this approach also requires a lot of testing to prove that the efficiencies are realized, and that no functionality is changed or damaged. It’s easy to see how this approach can be both costly and time consuming.
So, while you are refactoring in place, at a minimum, you will need development and testing environments built on your cloud platform. How else can you consume cloud services if you aren’t on the cloud? Even though your production environment may not be on the cloud, your lower-level environments required for the refactoring work do reside on the cloud. Run those environments without managing the costs, and you’ll find yourself right where you didn’t want to be: laboring under massive cloud costs. It almost defeats the purpose of taking the approach of refactoring in place before moving to the cloud because you still have to diligently manage your cloud services while at the same time incurring huge refactoring costs.
Something to consider is that unless your system has features that make it a strong candidate for refactoring, replacement may be a better option. If refactoring is like remodeling a home while you’re living in it, replacement is simply buying a new home. Replacement is still a costly endeavor, although perhaps not as costly, time-consuming, or frustrating as remodeling around yourself. Either way, you have a lot of upfront work to do before realizing any of the benefits of moving to the cloud.
Lift, Trim, and Shift
At this point, you may be feeling a little conflicted, thinking you’re either going to pay huge cloud costs after a quick Lift and Shift move, or take a long time to do an expensive refactor up front. Thankfully, there is a more informed variant of pure Lift and Shift that uses capacity planning techniques and tools to help refine the configuration of your cloud services while you move to the cloud. Done right, this approach, called Lift, Trim, and Shift, can reduce your run costs by as much as a factor of six in some instances. With Lift, Trim, and Shift you can move quickly and still realize the financial benefits of moving to the cloud.
Using a Lift, Trim, and Shift (LTS) approach, you can still refactor your applications over time, as necessary. LTS has the added advantage of using capacity planning tools and methods early in your move to the cloud. By considering cycles in your business demand in conjunction with resource utilization, you can significantly reduce your ongoing run costs. LTS still provides a quick method for bulk migrations; it just adds some analysis up front, in the form of capacity planning techniques and tools. Capacity planning has been around a long time, and tools like Vityl Capacity Management can help you with your LTS migration. We’ll cover how to select a capacity management tool for your move to the cloud a little later. LTS adds a level of diligence that isn’t present in a typical Lift and Shift approach. So, if you’re under pressure to move quickly, it’s a great approach for migrating to the cloud. Whichever approach you choose, the more opportunity you have to optimize your cloud solution, the better.
Optimization
The general process flow for optimization is fairly simple:
- Redefine requirements
- Design the cloud architecture
- Build the infrastructure
- Migrate the application
- Test the application
- Manage the cloud
If the process seems familiar, it is. Start by investing time in defining the requirements for running your system in the cloud. Chances are, the business requirements for your system have evolved since they were first defined, which means your Service Level Agreements (SLAs) probably need some refinement as well. Invest plenty of time revisiting any existing SLAs and requirements for performance, availability, and recovery. Engage directly with your customer and revalidate these requirements as your first step.
A Cloud Architect and a Solution Architect should be closely involved in both the definition of requirements and the design of your cloud infrastructure. A Solution Architect who thoroughly understands the existing application architecture and the business requirements is important. Just remember that design in the cloud can be a complex exercise because of all the configuration options for the myriad of services available in the cloud, and it needs to be a thorough design that includes a clear reference architecture. Because of these complexities, having an experienced Cloud Architect is critical.
Once the requirements are well-defined and signed off on by your customer, security, network, and any enterprise architecture groups in your organization, it’s time to move on to designing your system for the cloud. The first thing this step requires is accurate data regarding capacity, usage, and growth to inform you about factors like:
- Over- or under-utilization of resources due to seasonality in demand (e.g., Black Friday)
- Trending, especially past growth patterns that could be extrapolated forward (e.g., data storage)
- Sales and Marketing growth projections (these likely won’t come from systems monitoring but directly from the business)
Once you’ve gathered data on current capacities and usage, and coupled that with past growth trends and future growth predictions, you’ll want to pick a capacity planning approach that best suits the needs of your organization. The right capacity planning approach will help you design for, and refine, the non-functional requirements you defined earlier around performance, availability, recovery times, etc.
The easiest, and probably least useful, capacity planning method for the purposes of designing a cloud infrastructure is to design for larger resource usage based on historical thresholds being reached. For instance, if over the past two years half of the storage capacity of your on-premise system had been used, you might design for double your current storage capacity in the cloud. But in the cloud, where you pay for what you consume, this kind of approach quickly gets expensive. In the cloud, excess resources mean wasted money. Remember, the whole advantage of using capacity planning as part of moving to the cloud is to arrive at the most cost-efficient design possible.
The second, and only slightly more useful, capacity planning method for designing in the cloud is to look at usage trends over time. An example of this might be looking at CPU usage over a period of time to predict future CPU needs. The problem with this approach is that it assumes whatever is being measured has increased or decreased at a steady rate and will continue to do so; it doesn’t account for seasonality. Like threshold modeling, it’s a legacy way of thinking based solely on planning for expansion. Neither of these approaches helps with optimizing for the cloud, which is more concerned with right-sizing than with planning only for growth. In the cloud, there are always more resources to consume. The challenge with designing for a move to the cloud is to ensure you don’t design something that includes wasted resources.
Thresholds and trends are perfect lagging indicators of what has already happened, but they fail to provide solid insights into what will occur, and that’s what you’re ultimately trying to design for in the cloud. Gone are the days of padding a design to make sure you never run out of capacity. Cost-effective design for the cloud demands an approach that considers multiple scenarios, and that’s where a capacity modeling approach is useful.
Capacity modeling allows you to predict system performance under various situations and workloads, use those predictions to identify the potential risks of certain configurations, and provide recommended alternatives. Unlike using thresholds, modeling is adaptive, freeing you from the constraints of system-defined thresholds. And unlike trending, modeling isn’t bound to historical data; it allows you to try different scenarios.
But capacity modeling isn’t simple or easy; if you want it done quickly and accurately, you’ll need a good capacity planning tool. There are plenty of products that promise capacity planning in the cloud, but many fall short of what you’ll need to be successful. To start with, understand that capacity planning and performance management are two different things. Performance management will let you know after, or right before, something bad happens. This is great and necessary for managing your cloud resources, but remember that we are talking about right-sizing your move to the cloud. Although defining your SLAs is a critical piece of that design, managing those SLAs comes later.
At a minimum, look for a tool that provides trending and complex computations. Remember, however, that while the historical data used in trending is useful, trending alone isn’t enough. You’ll also want a tool that considers the myriad configuration options of all the cloud services you’ll be consuming, how your systems consume those services, and how interaction occurs across multiple workloads. You’ll also want a tool that provides simulation modeling based on queuing theory, which can simulate incoming workloads on a queuing network model of your system in the cloud. While this can be a very accurate approach, simulation modeling requires a lot of detailed setup to be reliable. If you want an approach that’s comparably accurate and takes considerably less time to set up, look for a tool that provides analytic modeling. Analytic modeling also considers queuing, but rather than simulating workloads, it calculates all the factors leading to design decisions. The only limitation of analytic modeling is that it’s highly dependent on choosing the right data as input.
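To give a flavor of the kind of calculation an analytic modeling tool performs, here is a minimal sketch of a classic M/M/c queuing model in Python. The arrival rate, service rate, and candidate instance counts are assumed values; a real capacity management product accounts for far more factors than this:

```python
# Analytic queuing sketch: an M/M/c model estimates per-server utilization
# and average queue wait for a pool of identical instances.
from math import factorial

def mmc_wait_time(arrival_rate, service_rate, servers):
    """Average time a request spends waiting in queue (Erlang C formula),
    in the same time units as the rates."""
    rho = arrival_rate / (servers * service_rate)  # per-server utilization
    if rho >= 1:
        raise ValueError("system is unstable: utilization >= 100%")
    a = arrival_rate / service_rate                # offered load in Erlangs
    # Probability an arriving request has to wait (Erlang C).
    blocked = a**servers / (factorial(servers) * (1 - rho))
    p_wait = blocked / (sum(a**k / factorial(k) for k in range(servers)) + blocked)
    return p_wait / (servers * service_rate - arrival_rate)

# Compare candidate instance counts for an assumed 90 requests/sec,
# with each instance serving 10 requests/sec.
for c in (10, 12, 14):
    w = mmc_wait_time(arrival_rate=90, service_rate=10, servers=c)
    print(f"{c} instances: avg queue wait {w * 1000:.1f} ms")
```

Running scenarios like this is how modeling answers “what happens if demand doubles?” without waiting for history to accumulate, which neither thresholds nor trends can do.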
There are tools, like Vityl Capacity Management, that provide all these capabilities, as well as automation, which will enable you to adjust to the cloud much more easily and accurately. To avoid early cost nightmares, make sure that you choose a tool that can provide good simulation and analytic modeling as part of your Lift, Trim, and Shift approach to moving systems to the cloud. Remember, capacity planning in the cloud isn’t about planning for peak usage, it’s about optimizing for zero-waste of expensive cloud services.
Reserved Instances
Once you’ve completed your move to the cloud, are satisfied that you’ve optimized resource consumption using solid capacity planning techniques and tools like Vityl Capacity Management, and have refactored your application to make the most efficient use of cloud resources, it’s time to look at reserving your instances.
Let’s look at reserved instance (RI) offerings from AWS and Azure, the leaders in the cloud computing market. Both AWS and Azure offer the ability to realize significant savings by committing to one- to three-year terms on your compute: up to 75% and 72% off their on-demand models, respectively, as of this writing. And both AWS and Azure make the process relatively easy. Using their online configuration tools, you simply choose your compute, your region, and the term you want to commit to. You also have different payment options, from paying for the entire RI up front to paying month-to-month.
To better understand how RIs work, let’s dig into AWS RIs a little deeper. AWS EC2 RIs come in three different flavors: Standard, Convertible, and Scheduled. Standard RIs offer the biggest discount (up to 75% off on-demand) and are your best option when you are confident that your usage will be steady over time. Convertible RIs offer a discount of up to 54% off on-demand and allow you to convert to RIs of equal or greater value; they are also best suited to steady-state usage. Scheduled RIs allow you to spin up resources within certain time frames and are best used for predictable fractions of a time period. For instance, a test environment that only runs Monday through Friday, 8 a.m. to 5 p.m. would be ideal as a Scheduled RI.
The most cost-effective approach to managing your cloud resources is to spread them across the various RI types, and even leave some as on-demand. For instance, once an application has been well-tuned with capacity planning and refactoring, go ahead and commit to a Standard RI, but only for the production environment. Keep the lower-level environments on Scheduled RIs, or even on-demand, and simply turn them off when they aren’t being used. Unlike a production environment, lower-level environments are seldom used 24x7. For sandbox environments, limit developers to a spend threshold, and then shut resources down when they’ve hit their budget. In the cloud, resources can be quickly shut off and turned back on; that’s part of the agility offered by cloud computing.
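A back-of-the-envelope sketch of that mix makes the savings concrete. The hourly rate below is an assumption, and the discount is the “up to 75%” Standard RI figure cited above; real prices vary by instance type, region, and term:

```python
# Rough annual cost comparison for the mixed-RI strategy described above.
# The on-demand rate and discount are illustrative assumptions.
HOURS_PER_YEAR = 8760
on_demand_hourly = 0.20          # assumed on-demand price, $/hour
standard_ri_discount = 0.75      # "up to 75% off" Standard RI discount

# Production: runs 24x7, committed to a Standard RI.
prod_on_demand = on_demand_hourly * HOURS_PER_YEAR
prod_standard_ri = prod_on_demand * (1 - standard_ri_discount)

# Test environment: only needed Mon-Fri, 8 a.m. to 5 p.m.
# (45 of the 168 hours in a week), so pay for just that fraction.
test_fraction = (5 * 9) / (7 * 24)
test_scheduled = on_demand_hourly * HOURS_PER_YEAR * test_fraction

print(f"Production, on-demand 24x7: ${prod_on_demand:,.0f}/yr")
print(f"Production, Standard RI:    ${prod_standard_ri:,.0f}/yr")
print(f"Test, business hours only:  ${test_scheduled:,.0f}/yr")
```

Even in this toy example, running the test environment only during business hours cuts its cost by nearly three quarters versus leaving it on 24x7, which is why turning off idle lower-level environments matters as much as reserving production capacity.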
In short, consider a balance between on-demand and reserve instances that reflects your usage. While committing to a Standard RI for a well-tuned production instance can create tremendous value, development, test, and sandbox environments may be more cost-effective as Scheduled RIs or even as on-demand resources.
Enterprise Discounts
We’ve covered several options for managing the cost of cloud computing to this point: refactoring before moving, using capacity planning techniques and tools in support of a Lift, Trim, and Shift (LTS) approach, and balancing reserved instances (RIs) with on-demand models depending on your environment usage. Before we wrap up, there are two other approaches to managing cloud costs that are worth a word or two: discount programs, and reclamation and resale.
If you're looking at moving hundreds or thousands of workloads to the cloud, then another option for managing your cloud costs is negotiating a consumption-based discount with your Cloud Service Provider (CSP). Typically, a CSP will want a multi-year commitment combined with a multi-million dollar spend commitment, but a well-negotiated discount program can not only address infrastructure resource usage, but also provide training discounts as well. A good discount program can net you an annual rebate of ten to eleven percent or more.
Taking the time to create a three- to five-year cloud roadmap will help you understand what kind of spend you can commit to when negotiating an enterprise discount. Just keep in mind that it will be necessary to continually refine your usage projections. A good capacity planning and management tool like Vityl Capacity Management can help here as well: it can help you build your cloud roadmap with what-if scenarios so that you can feel confident committing to a spend projection when negotiating a consumption-based discount with your cloud provider. Once your contract is in place, being able to monitor and project further usage can help you manage those annual usage milestones.
Reclamation and Resale
Despite doing all the right things, it is possible to find yourself with excess capacity. Unused capacity is a huge waste because you’re paying for resources and then just throwing them away. There are two options for managing unused capacity: reclaiming it and reselling it.
Using a good capacity planning and management tool, like Vityl Capacity Management, you can not only identify where excess capacity exists, but also plan to shift that capacity to other systems in your cloud portfolio. Using flexible options like Convertible RIs, you can shift capacity where it’s needed, when it’s needed, and make more efficient use of your compute.
If you find you just cannot use excess capacity elsewhere in your portfolio, AWS and Azure both offer marketplaces where you can sell unused capacity. The AWS Reserved Instance Marketplace, for instance, allows you to list your Standard RIs for sale. While the Standard RIs you list are identical to those purchased directly from AWS, they’re usually priced lower and for shorter terms. AWS charges a 12% seller’s fee to list your Standard RIs, and you’ll need to have owned the RI for at least 30 days and already have made a payment on it before you can sell it. Think of it as a marketplace for cloud buyers’ remorse: not an alternative to the more diligent approaches we’ve already discussed, but an out when you find an albatross hanging around your neck.
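The arithmetic on a marketplace sale is simple: the 12% fee comes off the upfront price. The listing price below is an assumption for illustration:

```python
# Net proceeds from listing a Standard RI on the AWS Reserved Instance
# Marketplace. The listing price is an assumed figure; the 12% seller's
# fee is the rate cited above.
seller_fee = 0.12
listing_price = 1200.00   # assumed upfront price for the remaining term

net_proceeds = listing_price * (1 - seller_fee)
print(f"Listing at ${listing_price:,.2f} nets ${net_proceeds:,.2f} after fees")
```

Between the fee and the below-retail pricing buyers expect, you’ll recover only part of what you paid, which is why resale is an escape hatch rather than a strategy.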
Conclusion
Like so many other IT leaders, you may be experiencing pressure to move to the cloud quickly. But without understanding how to manage your capacity, a quick move could end up being an expensive move. We’ve looked at six options for managing those costs, regardless of how quickly you need to move. Whether you have the time and money to pursue a refactoring approach, or you need to manage costs down for a move you’ve already been forced to make, you have options. Every approach we’ve discussed assumes you have a solid capacity planning approach in place, and a good capacity management tool like Vityl Capacity Management to help you manage cloud capacity regardless of your cloud provider.
Migrating to the cloud doesn't need to end up being a rude financial surprise. Given appropriate capacity planning and attention to detail, your cloud migration can reap significant financial benefits.
Are you ready to create cloud capacity plans?
Vityl Capacity Management is built to help you identify problems and create cloud capacity plans (as well as capacity plans for the rest of your hybrid IT environment: physical, virtual, container). But don’t just take our word for it. Try it for yourself.