
Repairing the Crown

Should you ever have to repair a 1950-something Crown range, you'll find they did something smart. I had to figure it out for myself, but it's smart.

Phillips-head screws are for holding things together (like the sides of the oven doors), and slotted screws are for attaching things onto other things (like hinges). They provided a clear visual signal about what each screw did. Makes me feel like whoever came up with this idea would have enjoyed data visualization.

And yes, I now have two working oven door handles.


How do you measure an elephant?

A couple of years ago, Massimo Ferrari and I created the most extensive and thorough financial evaluation of OpenStack, which we called Elephant in the Room. We talked about it a lot, met a lot of customers doing amazing things, and received a lot of nice press coverage. Pulling together this type of research is a lot of work, and the hope was it would do more than help a few customers. The hope was it would help change the conversation we, as an industry, are having around cloud. That’s ambitious, I know, but I’m an optimist and was convinced we needed a better understanding of the financial implications of our technology choices.

Using some quick R¹ to run statistical tests on Google Trends results for the search phrase “cloud tco,” I found a sustained 42% increase in that search phrase following our blog post. I don’t know whether our blog post and talks around the world caused it, but the stats are significant (p << 0.01), and it sure is a heck of a coincidence.

¹ Code and data here: https://github.com/emorisse/ARIMAelephant
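The linked repo does the real work with an ARIMA intervention model in R. As a rough Python sketch of the underlying idea only, here is a before/after step test on synthetic weekly data (the series, break point, and simple difference-in-means are illustrative stand-ins, not the actual analysis):

```python
import random
import statistics

def step_increase(series, break_idx):
    """Percent change in mean search interest after the break point."""
    before, after = series[:break_idx], series[break_idx:]
    return (statistics.mean(after) / statistics.mean(before) - 1) * 100

def welch_t(series, break_idx):
    """Welch's t statistic for the before/after difference in means."""
    a, b = series[:break_idx], series[break_idx:]
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (statistics.mean(b) - statistics.mean(a)) / (va + vb) ** 0.5

# Synthetic weekly "search interest" with a sustained step after week 52.
random.seed(7)
series = [50 + random.gauss(0, 3) for _ in range(52)] + \
         [71 + random.gauss(0, 3) for _ in range(52)]

print(f"step: {step_increase(series, 52):.0f}%")  # roughly +42%
print(f"t:    {welch_t(series, 52):.1f}")         # very large t (p << 0.01)
```

A real interrupted time-series test also has to account for trend and autocorrelation, which is why the repo uses ARIMA rather than a plain difference in means.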


Despite the guest, Dr. Iain McGilchrist, explicitly rejecting the metaphor that the brain is like a computer, I can’t help but think about the process of building and incorporating machine learning models.

Psychiatrist and author Iain McGilchrist talks about his book, The Master and His Emissary, with EconTalk host Russ Roberts. McGilchrist argues we have misunderstood the purpose and effect of the divided brain. The left side is focused, concrete, and confident, while the right side is about the integration of ourselves with the complexity of the world around us. McGilchrist uses this distinction to analyze the history of western civilization. This is a wide-ranging conversation that includes discussions of poetry, philosophy, and economics.


In IT operations, we need to know when something isn’t working. But, humans are just bad at identifying anomalies over time.

FIGURE 1: Rapidly decreasing accuracy, after Mackworth & Taylor 1963.

A typical person’s ability to identify an anomaly that we know to look for can drop by more than half in the first 30 minutes on duty ((Jane F. Mackworth, Vigilance and Attention, Penguin Books, 1970)). If that’s not bad enough, when unaided by technology, it can take us up to four times as long to recognize one ((N. H. Mackworth, “The breakdown of vigilance during prolonged visual search,” Quarterly Journal of Experimental Psychology, vol. 1, pp. 6-21, 1948)). We’re actually terrible at this elementary IT requirement of identifying when things go wrong, and that’s before we get to the ugly case of looking for problems we don’t expect. Combining automation and anomaly detection powered by machine learning (ML) may be the only chance we have to successfully identify and respond to the rising swell of data in IT.

In this blog post, we’ll talk about how the biology of the human brain impacts IT operations, how we can augment our teams with ML applications, and finish with two concrete examples of these applications: one offered as a service today by Red Hat, and another which (as far as I can tell) is a novel approach to assisting Root Cause Analysis with ML.

Monitoring is a Human Problem, and We Can’t Fix It Alone

Our brains are great at recognizing patterns ((Jeff Hawkins’ “On Intelligence” is a great and accessible read on the topic.)). We’re so good that sometimes we see them where there aren’t any. If you’ve ever seen a cloud that looked like a cat, or a rock that looked like your cousin, you know this is just regular human brain stuff. Our brains get used to emerging patterns very quickly through a physiological process called habituation: our brains come to expect the pattern. It helps us spend fewer cycles understanding what’s going on around us. In fact, when what’s going on around us isn’t radically changing, habituation reduces the attention we pay to the “signals,” in Signal Detection Theory lingo, from the pattern.

In the case of IT monitoring, we’re inundated with “unwanted signals” – signals that indicate everything is OK and can be ignored by operators. These unwanted signals play a valuable role in letting the monitoring systems know that the services (and the monitoring solution itself) are performing as designed, but they are detrimental to human processing. Eventually, the human brain adjusts to receiving a large number of signals, which becomes the expected pattern. We then pay less attention to whether a signal means OK or PROBLEM. This habituation requires more effort from us over time to identify exceptions to the expected, and makes us slower at recognizing those exceptions too. That’s long-winded, so let’s use an example:

Construction starts hammering away next door. It’s very loud, so of course you notice immediately. Over the rest of the day, you grow used to (habituated to) the noise of the hammering. When it stops for the evening, it takes a minute, but you notice that the hammering has stopped. It’s not immediate like when the hammering started. If you’d like to try out your own attention skills, here’s a 60-second selective attention test from Daniel Simons.

This predestined loss in attention, the vigilance decrement, is magnified when we’re looking for rare problems – in IT, like those that cause unplanned downtime.

Work Smarter, Not Harder.

I hate that phrase. Said out loud, it’s too often a cop-out. It means: not enough budget, not enough headcount – go do impossible again. In other words, keep working harder. Do you remember the good old days when our teams and budgets grew at the same rate as the work we had to get done? Me either.

IT ops teams are asked to support ever-larger environments (more containers than VMs, more functions than containers, etc.), and also more types of things (application frameworks, development languages, etc.). This growth in scale and complexity makes support an increasingly daunting effort. So, we’re left with that despicable phrase: work smarter, not harder. When it comes to preventing errors, and especially in the world of overwhelming data that we live in, we need a systematic change to monitoring. Research shows that, rather than only relying on operators’ attention, a systematic approach can be superior for creating highly reliable operations.

With the velocity of complexity in IT, we clearly need that new systemic approach. We need a different approach to scale IT operations by accounting for natural human variability. Our customers often use automation to build quality into IT processes. That helps, and we’ve seen spectacular results. But, if we want the next big jump in improvement, automation is only half of the solution. Since we can only respond when we find an anomaly, how can we do better about recognizing them if we can’t even keep up with the incoming data?

As soon as ops sits down to a shift, their capability of finding something odd quickly decreases across almost every dimension: they miss more, they wrongly call good things bad, they grow less confident in their decisions, and it takes them longer to decide.

Enter artificial intelligence. The availability of machine learning based anomaly detection is the start of a new way to support operations. Through machine learning, operations can learn to provide a higher level of service through identifying and eliminating more anomalies, and more rare anomalies, earlier and with greater accuracy.
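To make the idea concrete, here is a toy sketch of the kind of statistical anomaly detection these systems build on: flag any metric reading that deviates sharply from its recent history. Real ML-based services use far richer models; the metric values and thresholds here are made up.

```python
from collections import deque
import statistics

def anomalies(stream, window=30, threshold=4.0):
    """Flag points that sit more than `threshold` standard deviations
    from the mean of the trailing window -- a toy anomaly detector."""
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(stream):
        if len(recent) == window:
            mu = statistics.mean(recent)
            sigma = statistics.stdev(recent) or 1e-9
            if abs(x - mu) / sigma > threshold:
                flagged.append(i)
        recent.append(x)
    return flagged

# A steady latency metric with one spike a habituated human might miss.
metric = [100.0 + (i % 5) for i in range(120)]
metric[90] = 160.0
print(anomalies(metric))  # [90]
```

Unlike a human operator, this check applies the same scrutiny at minute 300 of the shift as at minute one, which is the whole point.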

Finding the anomalies is the first step, but that alone won’t solve the problem. You have to know what the anomaly means. Machine learning and advanced analytics can help with that, too. Let’s go through two examples of locating anomalies and helping provide information about what’s going on: one Red Hat provides as SaaS, and another you can build for yourself.

Red Hat Insights

A couple of years ago, we released Red Hat Insights, a predictive service which identifies anomalies, helps you understand the causes, and helps automate fixes before the causes become problems. If you subscribe to Insights, it uses a tiny ((Less than 5% of the data you provide for a single support case.)) bit of metadata to identify the causes of pending outages in real-time. With the data from well over 15 years of resolving support cases, we are able to train Insights to provide both descriptive explanations of the problems and prescriptive remedies. To take it a step further and make operators’ lives a little easier, we recently extended Insights with the ability to remediate identified problems with automation. As more and more customers use Insights for risk mitigation and automated issue resolution, the additional information enables Insights to become smarter every day, and enables more informed actions by operations.

Connect automation with machine learning to identify and resolve problems before you know to look for them.

Use machine learning to help us diagnose software.

Red Hat Insights provides exact and automatable actions to resolve the complex interactions that lead to downtime. We can also use machine learning to assist in identifying other types of software problems, and reduce the time required to discover the root cause by narrowing down where to look first to a few educated predictions – without having to pore over logs by hand. We can use machine learning to aid operators in root cause analysis by suggesting a possible dependency chain that led to the breakdown – a diagnostic map.

Applications and platforms responsible for the deployment and management of many things (VMs, containers, microservices, functions, etc.) are increasingly providing maps of the things under their control in order to provide operators with context. The example below shows the topology of container interactions in a Kubernetes cluster on Red Hat’s container platform, OpenShift. This works well for platforms that create the topologies, but what about trying to determine the topology for applications we don’t know or control?

FIGURE 2: CloudForms managing a Kubernetes Cluster in OpenShift

Turning one minute of laptop CPU into a diagnostic map.

System logs on Linux (and *nix-based cousins) are great sources of information for what isn’t working well, but any entry rarely provides much context outside the program or subsystem that generated it. In today’s world of massively interconnected systems, unless an operator already has experience with the observed problem, any log entry is rarely enough information to understand its root cause. However, even when we’ve never seen the problem before, we can use machine learning to build a diagnostic map and help us narrow down where to look first for root causes. Here’s an example.

FIGURE 3: Diagnostic Map

Figure 3 represents part of a machine learning-derived diagnostic map of Linux programs, built from log entries in syslog. Each circle is a program that logged events in syslog. The arrows suggest an influence relationship: the program at the tail of an arrow impacts the behavior of the program at its head. Now you have a picture of the entire system of events leading up to the problematic behavior that brought you to the logs in the first place: you have context.
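The post doesn't spell out the algorithm behind the map, but one simple way to suggest influence edges is temporal precedence: if program A's log events consistently arrive shortly before program B's, propose an A → B edge. This toy sketch assumes syslog has already been parsed into time-sorted (timestamp, program) pairs; the program names and thresholds are hypothetical, and a real system would use something stronger, such as Granger-style causality tests.

```python
from collections import defaultdict

def influence_map(events, window=2.0, min_count=3):
    """Toy influence inference: if program A logs shortly before program B
    often enough, suggest an A -> B edge for the diagnostic map.
    `events` is a time-sorted list of (timestamp_seconds, program) pairs."""
    counts = defaultdict(int)
    for i, (t_a, prog_a) in enumerate(events):
        for t_b, prog_b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted; nothing later is in the window
            if prog_b != prog_a:
                counts[(prog_a, prog_b)] += 1
    return {edge for edge, n in counts.items() if n >= min_count}

# Hypothetical syslog excerpt: kernel I/O errors repeatedly followed by
# database complaints, then application timeouts.
log = []
for burst in range(3):
    t = burst * 60.0
    log += [(t, "kernel"), (t + 0.5, "postgres"), (t + 1.0, "myapp")]

print(influence_map(log))
```

On this fabricated log the sketch proposes kernel → postgres and postgres → myapp edges, pointing the operator at the kernel errors first.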

Diagnostic maps can reduce cognitive overload, and identify what’s important.

With more deployment types, frameworks, and rapidly evolving applications, the interdependencies of things we support are exploding in number. The best IT operators can debug only some of these problems quickly. However, when we use ML to aid in identifying problems and generate diagnosis maps, we can help reduce time to resolution across problems we haven’t seen before.

Not only are graphs like this valuable as a troubleshooting tool, they can also be tied into monitoring systems to help operators identify and prioritize the right alerts. When something big happens, like a cloud region going down, we’re flooded with alerts. In this case, getting an alert from your monitoring systems that every application, every container, and every VM is down doesn’t add any new information to help resolve the problem. However, each and every alert takes cognitive effort to process and decide whether it’s important. In alert floods like this, the human brain becomes overwhelmed and stops processing new alerts. If you’ve ever felt overwhelmed by the amount of email in your inbox, that’s a small version of the same principle.

With an understanding of dependencies, you can gate alerts: you don’t need any more alerts about applications being down if the VM they’re running on is also down. But knowing that a new VM is down may still be essential.

Artificial Intelligence (AI) is a rapidly evolving field, and its use in IT operations even more so. But it’s much more than academic, and we’re beginning to see emerging market categories of use. Red Hat already uses it to offer Insights, the service that identifies and resolves infrastructure issues before your teams know about them. We’ve also seen an emerging example of using AI to assist in root cause analysis. The field is just getting started, and these are just two of many exciting directions it may evolve in.

We’ve seen that it’s essentially impossible for people to watch for anomalies at any level approaching business critical; humans just aren’t wired for it. The good news is machine learning is good at it: both finding anomalies, and helping your teams figure out where to look to solve them. And, you’re not alone in this need.

If you have a substantial investment in any software, call those vendors and ask what tools they have to help your teams identify, diagnose, and solve problems with their software. If you’re feeling like you want to push a little harder, ask what they’re doing to help solve problems where their software is only one piece of the puzzle.

Erich Morisse
Director, Management Strategy


A few months ago, for our own internal use, we started a project to calculate what it costs to run an OpenStack-based private cloud. More specifically, the total cost of ownership (TCO) over the years of its useful life. We found the exercise to be complex and time consuming, as we had to gather all of the inputs, decide on assumptions, vet the model and inputs, etc. So, in addition to results, we’re offering up a few lessons we learned along the way, and hopefully can save you a scar or three when you want to create your own TCO model.

Ultimately, we wanted answers to three layers of cost:

  1. What is the most cost effective method for acquiring and running OpenStack?
  2. How does OpenStack compare financially to non-OpenStack alternatives?
  3. How should we prioritize technical improvements to provide financial improvements?

Following an exhaustive survey of cloud TCO research, none of the cost models we could get our hands on were complete enough for our needs: some did not break out costs by year, some did not include all of the relevant costs, and none addressed potential economies of scale. We needed a realistic, objective, and holistic view, not hand-picked marketing results, and we found a few suggestions that helped us get there, whatever the technology.

Since we could not find anything both comprehensive and transparent, we created our own, and used the opportunity to go a few steps further by adding additional dimensions: the full accounting impact across cash flow, income statement, and balance sheet. The additional complexity made the model harder to understand and consume. Further, we needed the model to not only spit out projections, but to be a reliable way to compare options and support decision making throughout the life of a cloud as options and assumptions change. So, we decided to create a tool rather than just a TCO report, for easy comparisons and conversations with financial teams and lines of business.

To help us view the data objectively, we relied as much as possible on industry data. Making assumptions was inevitable, since not all of the required data is available, but we made as few as possible and verified the model and results with a number of reputable and trusted organizations and individuals in both finance and IT.

What is the most cost effective method for acquiring and running OpenStack?

If you’re considering OpenStack, or even running it already, we imagine you’re asking yourself a question like: “I have a smart team, why can’t we just support the upstream code ourselves?” As Red Hat sells commercially supported open source software, we can talk all day about the value of supported open source, including the direct impact on OpenStack, but we also want to address the direct costs, the line items in your budget. To get to these costs and answer that question, we shaped the model to analyze two different acquisition and operation methods for OpenStack:

  • Self-supported upstream OpenStack
  • Commercially supported OpenStack

As the model shows, the self-supported upstream use of OpenStack, with the least expensive software acquisition cost, ends up the most expensive, which may seem counter-intuitive. Why? Because of the cost of people and operations.

All of the costs of a dedicated team* running the cloud (the salaries, hiring, training, loaded costs, benefits, raises, etc.), regardless of the underlying technology, are a large chunk of the total costs. With a commercially supported OpenStack distribution, you only need to staff the operations of your cloud, rather than also the software engineers, QA team, etc., needed to support the code itself. We expect that you would need to hire fewer people as your cloud grows, and that the savings would exceed the incremental cost of the software subscription. Your alternative is this:


Taking our analysis a step further, we also explored the financial impact of increasing the level of automation in an OpenStack cloud with a Cloud Management Platform (CMP). Why? Because most companies’ experience shows** that managing complex systems usually doesn’t go according to plan. However, if automation is appropriately implemented, it can lower the TCO of any complex system.

CMP is a term coined by Gartner to describe a class of software encompassing many of the overlaid operations we think of in a mature cloud: self-service, service catalogs, chargeback, automation, orchestration, etc. In some respects, a CMP is a complement to any cloud infrastructure engine, like OpenStack, necessary to provide enterprise-level capabilities.

Our model shows coupling a CMP with OpenStack for automation can be significantly less expensive than either using and supporting upstream code, or using a commercial distribution. Why? As with the commercial distribution, our model shows that you would need to hire fewer people as your cloud grows, and the savings can potentially dwarf the incremental software subscription cost. The combined costs are drawn from Red Hat Cloud Infrastructure, which includes the Red Hat CloudForms CMP and Red Hat Enterprise Linux OpenStack Platform.


One of the sets of industry data we used, to help create an unbiased model, came from an organization named Computer Economics, Inc. They study IT staffing ratios, and all kinds of similar things. They found that the average organization, with the average amount of automation, supports 53 operating system instances (mix of physical and virtual) per system administrator. They also found, that the average organization, with a high level of automation supports 100 instances per admin.

So, in our scenario, with the cloud expected to double in size next (and every) year, you have a few options. You can double your cloud staff (good luck with that), double the load on your administrators (and watch them leave for new jobs), or invest in IT automation.

The aforementioned study shows that high levels of automation can nearly double the number of OS instances supported. While automation can reduce the cost curve for hiring, and make your cloud admins’ lives easier, we’re in a financial discussion. Automation only makes financial sense if it lowers the cost per VM. Which is exactly what we found:
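A rough sketch of what those staffing ratios mean for the doubling cloud in our scenario, using the Computer Economics figures quoted above (the starting instance count is a made-up example):

```python
import math

def admins_needed(instances, per_admin):
    """Admin headcount at a given instances-per-admin staffing ratio."""
    return math.ceil(instances / per_admin)

# Computer Economics averages cited above: 53 OS instances per admin at
# average automation, 100 at high automation.
instances = 500  # hypothetical starting size
for year in range(1, 5):
    avg = admins_needed(instances, 53)
    high = admins_needed(instances, 100)
    print(f"year {year}: {instances:5d} instances -> "
          f"{avg} admins (avg automation) vs {high} (high automation)")
    instances *= 2  # the cloud doubles every year, as in the scenario
```

By year four the gap in headcount is large enough that, once loaded personnel costs are applied, the automation investment pays for itself; that is the comparison the per-VM cost figures below make precise.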


In order to compare the costs and advantages of automation more closely, we looked inward (it was an internal study, after all). We compared the fully loaded costs (hardware, software, and people) for one VM on our commercial distribution of OpenStack, Red Hat Enterprise Linux OpenStack Platform (RHELOSP), with those of Red Hat Cloud Infrastructure (RHCI), which includes both RHELOSP and our CMP, Red Hat CloudForms.

Looking at the waterfall chart above, we start with the fully loaded cost of $5,340 per VM for RHELOSP, and want to compare the similarly loaded costs for RHCI. The RHCI software costs an additional $53 per VM under these density assumptions, which increases the cost to $5,393. Next, we factor in the $1,229 in savings through automation from hiring fewer people as your cloud grows, and we see a loaded cost of $4,164 per VM for RHCI. Under our model, using a CMP with OpenStack resulted in savings of roughly $1,200 per VM.
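The waterfall arithmetic is simple enough to sanity-check directly, using only the per-VM figures quoted above:

```python
# Reproducing the waterfall arithmetic from the model, using the per-VM
# figures quoted in the text (all other inputs held equal).
rhelosp_per_vm = 5340             # fully loaded cost per VM, RHELOSP alone
cmp_software_per_vm = 53          # incremental RHCI/CloudForms software cost
automation_savings_per_vm = 1229  # fewer hires as the cloud grows

rhci_before_savings = rhelosp_per_vm + cmp_software_per_vm
rhci_per_vm = rhci_before_savings - automation_savings_per_vm

print(rhci_before_savings)            # 5393
print(rhci_per_vm)                    # 4164
print(rhelosp_per_vm - rhci_per_vm)   # 1176 -- net savings after CMP cost
```

Note the net figure is the $1,229 gross automation savings minus the $53 CMP software cost.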

Moving from just an average level of automation to a high level of automation, our model showed a significant improvement in costs as you grow, that the extra cost of automation can be dwarfed by the potential savings. High automation is only moving from the median to the 75th percentile, so our model shows that there’s a lot of headroom for improvement above and beyond even what we show.

At $1,200+ savings per-VM per-year, automation has the potential to quickly add up to millions in savings once you’ve reached even moderate scale.

That kind of benefit is one of the many reasons why Red Hat recently acquired Ansible. And given that Ansible is easy to use, Ansible tools can not only improve the TCO through automation, but also help customers achieve those savings faster.


How do OpenStack and non-OpenStack compare financially?

As we said, we wanted the model to also be useful for comparing different market alternatives, but in order for the comparison to be useful, it needed to be apples-to-apples. Competitive private-cloud technology available on the market at the time of our research provided much more than just the cloud infrastructure engine, so we decided to compare OpenStack plus a CMP against commercial bundles made of a hypervisor plus a CMP, which is what Red Hat customers and prospects ask us to do most of the time.

In the model, we conservatively assume that the level of automation is exactly the same. If you have data you are willing to share which supports or refutes this, please let us know.

As we expected, the model showed us that an OpenStack-based private cloud, even augmented by a CMP, costs less than a non-OpenStack-based counterpart. The model shows savings of $500 per VM, increasing to $700 and more as the number of VMs and the maturity of the cloud grow over time.


However, the question is: is the $500-700+ in savings per-VM worth the risk of bringing in a new technology? To find the financial answer, we had to consider how these savings add up.


As the chart shows, by the time you have even a moderately sized cloud, the total annual cost savings of OpenStack with a CMP can exceed two million dollars. We are aware that it’s common business practice to apply discounts to retail prices, but to keep the comparison as objective as possible, we referred to the list prices disclosed by every vendor we evaluated in our research. Because our competitors were not keen on sharing their discount rates, the only objective comparison we could make was at these list prices. We estimate that a small portion of these savings comes through increased VM density (which we’ll talk about later), but the majority is in software costs.

With this in mind, if you take a look at these numbers, and think about the software discounts you’ve negotiated with your vendors, you’ll have a reasonable idea of what this would look like for you. And as a reminder, these are just for the exponential growth model starting from a small base. We’ll wager there are any number of you reading this who have already well exceeded these quantities and are accumulating savings even faster than we show here.

We also recommend looking at the total costs over the life of a project. In fact, when we look at the accumulated savings over the life of your private cloud, we notice something rather striking.


Our model showed that it really doesn’t matter what your discount level is: if you plan on any production scale, OpenStack with a CMP can potentially save you millions of dollars over the life of your private cloud.

How should we prioritize technical improvements to provide financial improvements?

In order to move from one-time decisions to deliberate on-going improvements, you need the “why” of the model as well as the outputs. By the time we finished building and vetting our TCO model, we made a number of interesting, and sometimes surprising, discoveries:

  • Cost per VM is the most important financial metric
  • The hardware impact on total spend is marginal
  • Lowering VM cost will increase usage and total costs
  • Track all of your costs to prioritize efforts
  • Considering the timeframe increases model accuracy
  • The cloud growth curve doesn’t affect the TCO

Cost per VM is the most important financial metric

For most of this post, we’ve been focusing on cost per VM. Despite their necessity in budgeting, total costs are simply not instructive. Here’s an example of the total annual costs over six years, for one of the many private cloud scenarios we considered:


A typical approach in TCO calculations is looking at the annual costs, but this metric alone isn’t particularly helpful in the analysis of a private cloud, with or without OpenStack. In private clouds, we can’t get away from the fact that we are providing a service, and what our Lines of Business or customers consume is a unit, like a VM or container. Hence, we believe that it’s much more significant to look at the annual per-VM cost.


In the same scenario we showed with the rapidly increasing total costs, the per-VM cost has dropped by more than half from the first year to the third. That dramatic improvement is impossible to see in the total costs curve. Without accounting for the per-VM costs, you’d miss that the total costs are increasing because of more usage, while you’re getting more for your dollar every year. Increasing growth while increasing cost efficiency is a good problem to have.

In other words, we recommend using VM Cost as your main metric because it shows how good you are at reducing the cost of what you provide. Total Cost does not distinguish between cost improvement and usage growth.
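A toy illustration of why the two metrics tell different stories; the yearly figures here are made up, and only the shape matters:

```python
# Toy illustration: total cost and per-VM cost move in opposite directions.
vms         = [200,  400,  800]   # usage doubles each year (hypothetical)
cost_per_vm = [5000, 3500, 2400]  # unit cost falls as the cloud matures

for year, (n, unit) in enumerate(zip(vms, cost_per_vm), start=1):
    print(f"year {year}: total ${n * unit:>9,} | per-VM ${unit:,}")
# Total spend climbs every year even though each VM costs less than half
# of what it did in year one -- growth, not waste, drives the total curve.
```

Judged on Total Cost alone this cloud looks like a runaway expense; judged on VM Cost it is a cost-reduction success.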

The hardware impact on total spend is marginal

We’ve woven in analysis of two of the three main cost components related to acquiring and running OpenStack, and financially comparing OpenStack and non-OpenStack alternatives. Our model shows that the selection of private cloud software choices has the potential to save you millions of dollars. The investment in automation similarly shows the potential to save additional millions of dollars. Either or both of these can save an organization a lot of money, despite the additional expenses. But, so far, we’ve only hinted at hardware costs.

Some of our readers may be surprised at the results: hardware is a large and easily identifiable cost, so if you can cut the amount of hardware, in theory you can save a lot of money. Our model suggests that’s not really the case.


We asked the model how costs change across a large range of VM densities: 10, 15, 20, and 30 VMs per server, with no other changes. The numbers show very little difference in costs even across this large range of densities.

If we start with an average density of say 15 VMs per server and (unrealistically) double it to 30, we see a savings of around $350 per VM. Not a trivial amount, and one that adds up quickly at scale, but these amounts are before the costs of any software and the effort to make this monumental jump in efficiency.

If we make a more realistic (but still really big) stretch to a ⅓ increase in density, from 15 VMs per server up to 20 VMs per server, the model indicates $175 in savings per VM before the cost of software and effort. This is tiny compared to the $1,200 or more in savings per VM through automation in the same scenarios.
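A back-of-envelope check on those density numbers. The annual hardware cost per server used here ($10,500) is our own assumption, chosen because it reproduces both quoted savings figures; it is not a number from the model itself:

```python
# Back-of-envelope check on the density savings quoted above.
SERVER_COST = 10_500  # assumed annual hardware cost per server (see lead-in)

def hw_cost_per_vm(density):
    """Hardware cost per VM at a given VMs-per-server density."""
    return SERVER_COST / density

for density in (10, 15, 20, 30):
    print(f"{density:2d} VMs/server -> ${hw_cost_per_vm(density):,.0f}/VM")

print(hw_cost_per_vm(15) - hw_cost_per_vm(30))  # 350.0 -- doubling density
print(hw_cost_per_vm(15) - hw_cost_per_vm(20))  # 175.0 -- the 1/3 increase
```

The diminishing returns are built into the arithmetic: per-VM hardware cost falls as 1/density, so each additional VM squeezed onto a server saves less than the last.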

Never neglect your hardware costs, but don’t start there for cost improvements; it’s unlikely to provide the biggest bang for your buck.

Lowering VM costs will increase usage and total costs

Our model shows that the more you lower the VM costs for the same service, the more you will increase your total costs. There’s a direct causal effect: the less expensive this service is, the more people want to use it.

Here’s a different example from our industry, to further prove our point. 1943 saw the beginning of construction of the ENIAC, the first electronic general-purpose computer, which cost about $500,000. In 2015 dollars, that’s well over $6,000,000. Today, servers cost less than 1/100th of that, and we buy millions of them every year. We now spend much, much more on IT than the first IT organizations did supporting those early giant beasts and, yet, our unit costs are significantly lower.

Based on this awareness, we looked at the market numbers for consumption of servers and VMs from IDC, and ran some calculations: for every 1% you reduce your VM cost, you should expect to see a 1.2% increase in total cost, due to a 2.24% increase in consumption. That seems counterintuitive, but the increase in total costs is due to your success. You’ve reduced the costs to your customers, so they’re buying more. Once again, your reduction in VM cost is directly increasing the demand for the services of your cloud.
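The elasticity arithmetic in that paragraph is worth making explicit: total cost is unit cost times consumption, so the two percentage changes multiply.

```python
# The elasticity arithmetic from the paragraph above: a 1% cut in per-VM
# cost drives a 2.24% rise in consumption (the IDC-derived figure), so
# total cost = unit cost x consumption rises by about 1.2%.
unit_cost_change = -0.01      # you reduce VM cost by 1%
consumption_change = 0.0224   # demand response at that price cut

total_cost_change = (1 + unit_cost_change) * (1 + consumption_change) - 1
print(f"{total_cost_change:+.2%}")  # roughly +1.2%
```

Run the same multiplication with a larger cut and the effect compounds: cheaper VMs reliably mean a bigger, not smaller, total bill.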

IT, and in particular IT components like servers and VMs have “elastic demand curves,” broadly meaning that reducing prices leads to greater utilization and greater total cost. If increased efficiency causing higher total costs comes as a surprise to you, you’re not the only one.

Track all of your costs to prioritize efforts

Tracking the costs of as many components as possible enables you to prioritize improvements over time, even as your cloud matures, your staff gets better and better at running it, and demands change from your customers. In order to build a tool around our TCO model, we had to decide on what costs we want to track and model together. Our model accounts for all hardware, software, and personnel required to operate a private cloud. Each and every one of them is a potential lever in affecting how your costs change over time.


The levers built into the model include: VM density affecting hardware spend, IT automation for personnel costs, and software choices for software costs. Between the three of these, the model addresses all of the major costs of acquiring and operating a private cloud, with the exception of data center facilities. With the low impact on costs of hardware and changes to density, we assumed that datacenter facility costs will largely be the same across technologies and were not a focus of this model. However, should you have great data center cost information you’d like to contribute, please let us know, as we strive to increase the completeness and accuracy of our model.

The model suggests IT automation should be the first item on your to-do list.
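To make the three levers concrete, here is a toy sketch with invented numbers; the actual model and its cost categories are more detailed than this:

```python
# A toy per-VM cost model (a sketch, not the authors' actual model):
# VM density is the hardware lever, VMs-per-admin is the automation
# lever, and the per-VM license is the software lever.
import math

def annual_cost_per_vm(vms, density, server_cost, vms_per_admin,
                       admin_salary, license_per_vm):
    hardware = math.ceil(vms / density) * server_cost
    personnel = math.ceil(vms / vms_per_admin) * admin_salary
    software = vms * license_per_vm
    return (hardware + personnel + software) / vms

base = dict(vms=1000, density=20, server_cost=10_000,
            vms_per_admin=100, admin_salary=150_000, license_per_vm=300)

print(annual_cost_per_vm(**base))                            # baseline
print(annual_cost_per_vm(**{**base, "density": 40}))         # hardware lever
print(annual_cost_per_vm(**{**base, "vms_per_admin": 200}))  # automation lever
```

With these made-up inputs the automation lever happens to save the most per VM, which echoes the suggestion above; the ordering for your cloud depends entirely on your own numbers.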

Considering the timeframe increases model accuracy

Even though building a cloud can be quick, getting the most from its operation is a journey: staff will learn along the way, corporate functions will have to adjust, and business demands for new technologies and faster IT response will only increase.

Per-VM costs are inseparable from timing. You’re buying hardware, hiring people, buying software, suffering talent loss, refreshing hardware, and buying still more to support growth. All of these costs can, and usually do, hit your budget differently every year. If you’re buying software licenses, you have a large upfront cost plus ongoing maintenance. Your staff gets promoted, gets raises, and sometimes takes new jobs, all of which affect salary, hiring, and training costs. Some you can plan for, some you can’t.

Put another way: if, next year, you provide the exact same quality of service, to the exact same customers, in the exact same quantity, with the exact same technology, there’s still a very real chance your costs will not be the same as they are this year.
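A minimal sketch of that point, with entirely made-up numbers: the same 100 VMs and the same service every year, yet a different bill every year because of an upfront license, annual raises, and a hardware refresh.

```python
# Invented numbers: same 100 VMs and same service every year, but the
# per-VM cost still moves because of how the spend lands on the budget.
VMS = 100
salary, maintenance = 300_000, 40_000
license_upfront, server_refresh = 200_000, 150_000

per_vm_costs = []
for year in range(1, 7):
    cost = salary * 1.03 ** (year - 1) + maintenance  # 3% annual raises
    if year == 1:
        cost += license_upfront   # upfront software purchase
    if year == 4:
        cost += server_refresh    # hardware refresh cycle
    per_vm_costs.append(round(cost / VMS, 2))

print(per_vm_costs)  # spikes in years 1 and 4, drift in between
```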

We’re showing costs and cost changes over six years, but we modelled out to ten to find out when the costs start flattening out.

If you want your TCO model to be a tool for ongoing decision making, you need to not only look at costs, but how costs change over time.

The cloud growth curve doesn’t affect the TCO

One of the nice things about creating a flexible model is that it allows you to try all sorts of hypotheses and inputs. While absolute costs depend on the success and speed of your private cloud adoption, one of our surprising discoveries is that relative costs are not dependent upon your adoption curve. None of the advice the model provides is affected by the growth curve.

This means IT organizations can get started even when they’re unsure of how quickly their private cloud is going to take off. This also makes the particular growth model we discussed here a lot less important. Our examples have VM count doubling every year, which is the most common customer story you hear during IT conference keynotes, but the advice is equally applicable no matter what your particular growth curve is.
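A small sketch of why this can happen: if total cost is roughly linear in VM count (my simplifying assumption here, not a claim about the model’s internals), the ratio between two scenarios’ costs is independent of the growth curve.

```python
# Assumption (mine): total cost is roughly linear in VM count. Then the
# ratio of two scenarios' total costs doesn't depend on the growth curve.

def total_cost(vms_per_year, cost_per_vm):
    return sum(vms * cost_per_vm for vms in vms_per_year)

doubling = [100 * 2 ** y for y in range(6)]  # the keynote growth story
linear = [100 * (y + 1) for y in range(6)]   # a much tamer rollout

ratios = [total_cost(g, 80) / total_cost(g, 100) for g in (doubling, linear)]
print(ratios)  # the same 0.8 ratio under both growth curves
```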

Technical conversations with Lines of Business (LOBs) are frustrating for both sides: they often can’t provide the information you need to put together a thoughtful architecture and plan, and, for any number of reasons, you can’t provide accurate costs and changes to costs over time. With a good TCO model, these conversations get considerably easier for both sides of the table: you can model different scenarios, provide ranges of pricing, and help your LOBs work through priorities. Invest the required time in an accurate TCO model, and you’ll not only make these conversations easier, but you’ll have the tools in place to add financial input into your designs even as the services you provide change over time.

If you’re interested in expanding on what we’ve built, please let us know.

Erich Morisse
Management Strategy Director

Massimo Ferrari
Management Strategy Director

* If you think that you can run a cloud by leveraging existing IT Ops, think again. Research published by Gartner shows that not creating a dedicated team is one of the primary reasons for the failure of cloud projects: Climbing the Cloud Orchestration Curve

** Velocity 2012: Richard Cook, “How Complex Systems Fail”
The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
Complex Adaptive Systems: 1 Complexity Theory
What You Should Know About Megaprojects and Why: An Overview


When Luck Is Your Strategy

M.R.D. Foot tells a story of an underground agent forced to transport a B-2 wireless set (radio) through a railway station in which German forces were conducting random checks of luggage and personnel. The radio the agent was carrying was of a distinctive size and shape, and thus easily recognizable to alert police forces. The underground operative, realizing the precariousness of his situation, initiated a cunning security measure he presumed would reduce his risks.

He reached a big terminus by train, carrying only his B2 in its little case; saw a boy of about twelve struggling with a big one; and said genially (in the local language) “Let’s change loads, shall we?” He took care to go through in front of the boy; there was no trouble. Round the first corner, they changed cases back. The boy said “It’s as well they didn’t stop you; mine’s full of revolvers.”

Foot, MRD, Resistance, (New York, McGraw-Hill Book Company, 1977), as quoted in: Underground Management: An Examination Of World War II Resistance Movements by Christian E. Christenson


MP3 and Entropy

If you’re like me, with a wife nursing in the other room and an infant distracted by any noise or movement, then your mind naturally drifts to the topic of entropy[1]. I saw something on “TV”[2] about music, and began wondering about whether different musical styles had different entropy.

Slashing around with a little Perl code, I hoped to look across my music collection and see what there was to see. Sadly, I don’t have the decoding libraries necessary[3], so instead I just looked at the mp3 files themselves. While most genres did show a good range, they are also relatively distinct, and definitely distinct at the extremes.


While I can’t draw any conclusions about how efficient mp3 compression is by genre, we can safely say that the further compressibility of an mp3 file is affected by its genre. I.e., there is room for genre-specific compression algorithms. Which only makes sense, right? If you know more about the structure, you should be able to make better choices.
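If you want to repeat the experiment, a minimal Python version of it (rather than the original Perl) is just Shannon entropy over the raw bytes of each file:

```python
# Shannon entropy of raw bytes, in bits per byte: 8.0 means the bytes
# look uniformly random (little room left to compress); lower values
# mean more residual structure.
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# e.g., byte_entropy(open("song.mp3", "rb").read())
print(byte_entropy(bytes(range(256))))  # 8.0: maximally disordered bytes
```

Like the original experiment, this treats the mp3 container as opaque bytes; no decoding libraries required.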

  1. Entropy is the measure of disorder in a closed system, with the scale ranging from completely random to completely static. It is used as a model in information science for a number of things, including compression: entropy can be used as a measure of how much information is provided by the message, rather than the structure of the message. For example, the patterns of letters in words in the English language follow rules. U follows q, i before e…, etc. The structure defined by these imprecise rules, in English, dictates about half of the letters in any given word. Ths s wh y cn rd wrds wth mssng vwls. The thinking is, should you agree on what the rules are, you can communicate with just the symbols that convey additional meaning. Compression looks to take advantage of this by eliminating as much of the structure as possible, and keeping just the additional information provided.

    Mathematically, entropy is often expressed as the amount of information per symbol, averaged across structural and informational symbols. English has an entropy of around 0.5, so each letter conveys half a letter of information; the other half is structure[4]. ↩

  2. Probably Hulu?  ↩
  3. Darn you, Apple.  ↩
  4. Apparently, this makes English great for crossword puzzles. ↩

Red Hat is the Best IT Software Company!?

Friday night, Red Hat was honored as the “Best IT Software Company,” alongside luminaries from other industries including Elon Musk, LeVar Burton, and Jane Goodall, by the World Technology Network.

Rodney Brooks, co-founder of iRobot and winner of the individual award for IT Hardware, made the joke that it was nice to have a hardware category, because we software guys never give hardware credit. I can think of a few jokes in the other direction, but he does have a point: none of us will be successful without each other. That’s why the evening was so much fun…

The room was full of customers and even some partners getting recognition for their work. These are the folks who have made us successful. I can’t wait to see what’s next!


November 15, 2014 at 12:57 PM

This award was also personally rewarding, as I had to leave my very jet-lagged wife at home with our very jet-lagged and cranky 4-month-old, after a week away with my mother. So, it’s probably good for me that we won.

I know you’re dying, so here’s a goofy photo of me in my tuxedo…


The Baby Measureur

R Code for Our Kid

Not too long ago, a tiny, screaming, pooping, extraordinarily amazing data manufacturing machine came into my life. Long accustomed to taking subtle cues from my wife, I wasn’t surprised by his arrival, so I had plenty of time to prepare my optimal workflow for consuming baby data. Basically, I just installed the Baby Connect apps on all of our devices.1

Baby Connect syncs feeding, diaper, health, and all sorts of other data across multiple devices. So, when I change a diaper, I record it and can get credit for it. They also provide a number of graphs so you can see changes in the input/output of your bundle. I wanted something that would point me to changes. Fortunately, in a stroke of genius, they also allow you to download the data in CSV format from their website.2 So, with no sleep, a month of paternity leave3, and ready access to data, I started putting together some R code looking for patterns through cluster analysis.4
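The actual analysis uses model-based clustering in R (the code lives in the repo linked at the end of this post); as a rough stdlib-Python analogue of the same idea, plain k-means over (hour-of-day, duration) feeding records looks like this. The sample records are invented.

```python
# A rough analogue of the clustering step: the R code uses model-based
# clustering (mclust); this is plain k-means, just to show the shape
# of the analysis. Feeding records below are made up.
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl))
                   if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Invented (hour-of-day, minutes) feedings: two short, two long.
feedings = [(3, 5), (9, 6), (15, 30), (21, 31)]
centers, clusters = kmeans(feedings, k=2)
print(sorted(c[1] for c in centers))  # short vs long feeding durations
```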

Feeding the Beast

For the month and a half this kid has been living with us, model-based clustering identified five clusters of feedings when measured across the datetime, time of day, and duration of feedings.

Feeding Duration
For the first week and a half, my kiddo ate either long or short. For the next two weeks, the variation in feeding duration came down enough to be considered a single cluster. For the following two-and-a-half weeks, the variation decreased further. The difference in feeding duration over the first fortnight is particularly noticeable in the graph below.
Feeding Duration

You’ll note I’ve discussed four clusters. The fifth has a single entry (Aug 19th, just before 5am). I have no idea what that’s about.

Long and short of it: if my kid’s like yours, you will definitely see changes to eating patterns over the first weeks.

Making Diaper Changing Cool Again

Running similar tests over the diaper data, I calculate three clusters, and again see them largely grouped chronologically.
Diaper Timing
In the first, my boy went whenever he damn well pleased. In the second, for a week, there’s a noticeable dropoff in the quantity of diapers. In the third, quantity picks up again, but we also see the introduction of a small kindness: fewer changes after 9pm. Yes, interested parties, my boy is thankfully starting to fall into sleep patterns as well as sleep more. But what’s going on in that middle cluster? For that, I look at the reasons for diaper changes.
Diaper Changes by Type
This graph requires explanation (and simplification). Aside from boredom and performance art, there are two main reasons I change diapers. These two reasons are often, but not always, concurrent. This graph looks at those two reasons and tests whether they are concurrent: yes on top, no on the bottom. The Y-axis is otherwise irrelevant; the vertical jitter is there only so the points are readable, rather than all falling in a boring line.

What we see here is that my data producer had ~2/3 exclusive diapers in his first two weeks. Then mostly double diapers, for a week. And now, about an even split. Note the shift to longer feedings during the same week (second graph); this coincided with a growth spurt, though I can only tell by looking at my calendar.

What’s Next

Please, jump in. Take a look at the code. Use the code. Provide ideas, patches, comments.

GitHub: babyconnectR

  1. Thank you, Gunnar
  2. I’d prefer a way to download it all at once, but by month isn’t so bad. 
  3. Thank you, Red Hat
  4. Most of the measures didn’t show me much, but I’ve added them all to the github repo as it could be an artifact of the data. 

More here: http://qz.com/242905/indian-parliament-to-hire-men-to-dress-up-as-langurs-to-scare-away-monkeys/