Top 4 Searches on HWKU

I accept that the topics I post about can be rather eclectic. However, I did not see this coming. Presented without further ado, the top four searches leading readers to the site, according to WordPress:

  • transnational corporation network structure
  • tiny tiny rss reader api
  • number of brand conversation
  • 18-24 year old men demographics

I think Duolingo is messing with me again.

Cloud and Assembly Lines – Choose the Right Model

I’m at Red Hat Summit this week talking about cloud with customers and partners, and it occurs to me that one of the common metaphors isn’t quite right. The problem with the “Assembly Line” metaphor is that everyone thinks of the 1908 Ford (“any color you want, as long as it’s black”). And that’s actually a lousy example. There was zero flexibility in product output, and the only automation beyond individual parts was the well-defined hand-off during assembly. Don’t underestimate the power of those elements, but that’s nothing compared to what we can do today.

The right model is Chevrolet’s: build knowing the products you need tomorrow are different from the ones you need today. Build knowing you will change your process while it’s still running. It’s no wonder that, once this was implemented, Chevy beat industry leader Ford to market by a full year while continuing to serve its current customers, and took the lion’s share of the entire car market.

If your cloud isn’t open and changeable, your competitors will out-innovate you and take your market.

{ photo from an excellent slide show on 100 years of assembly lines at Chevrolet and GM: “100 Years of Chevrolet Assembly Lines,” Assembly Magazine, 2011-10-27, http://www.assemblymag.com/articles/89625-100-years-of-chevrolet-assembly-lines }

[ update: corrected link to Red Hat Summit keynote streaming ]

Generating Reports in R – Suggestions?

I would like to programmatically generate a report using R. The contents are mostly graphs and tables. I have a working system, but it has too many pieces. When I hand this off to someone else, it becomes immediately fragile.

Isn’t there a better way? Here are my elements:

  • R script: a collection of functions to manipulate the data, both interactively and within the report
  • R script: a wrapper around the above functions that calls the knit() function (from the knitr package) to generate the report (sketched below)
  • R/LaTeX: the report template
  • bash: a script to tie it all together and clean up leftovers

That’s four languages. Ugly.
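
For illustration, here’s a minimal sketch of folding the wrapper and the bash glue into one R script. The file name report.Rnw is an assumption standing in for my actual template:

render_report <- function(template = "report.Rnw") {
   library(knitr)
   tex <- knit(template)                      # run the R chunks, emit LaTeX
   system2("pdflatex", c("-interaction=nonstopmode", tex))   # compile to PDF
   # clean up leftovers, as the bash script did
   unlink(c(tex, sub("\\.tex$", ".aux", tex), sub("\\.tex$", ".log", tex)))
   sub("\\.tex$", ".pdf", tex)                # return the PDF file name
}

That would cut the list to two languages (R and LaTeX), at the cost of shelling out to pdflatex from R.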

OpenShift.com – Now With R and rpy2

A couple of weeks ago, I announced successfully installing and running R/rpy2 on OpenShift.com.

Now you can grab the installation process and bits for yourself* through GitHub.

http://github.com/emorisse/ROpenShift

*I’d prefer (and will be thankful for) commits, hacks, advice, and ideas over code branches.

Calculating Conditional Entropy in R


conditionalEntropy <- function(graph) {
   # graph is a 2- or 3-column data frame of edges: from, to[, weight]
   if (ncol(graph) == 2) {
      names(graph) <- c("from","to")
      graph$weight <- 1                 # unweighted: each edge counts once
   } else if (ncol(graph) == 3) {
      names(graph) <- c("from","to","weight")
   } else {
      stop("graph must be a 2- or 3-column data frame")
   }
   # count the distinct edges; rle() only collapses *adjacent* duplicates,
   # so unique() is the safe way to count on unsorted data
   n.edges <- length(unique(paste(graph$from, graph$to)))
   total <- sum(graph$weight)
   p <- graph$weight / total            # probability mass of each edge
   entropy <- data.frame(H = 0, Hmax = 0)
   entropy$H <- -sum(p * log2(p))       # observed entropy, in bits
   entropy$Hmax <- log2(n.edges * (n.edges - 1))  # maximum possible entropy
   return(entropy)
}
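
A quick usage sketch with a made-up, three-edge weighted graph, just to show the calling convention (the edge list is hypothetical):

edges <- data.frame(from   = c("a", "a", "b"),
                    to     = c("b", "c", "c"),
                    weight = c(2, 1, 1))
conditionalEntropy(edges)
# p = (0.5, 0.25, 0.25), so H = 1.5 bits; Hmax = log2(3 * 2), about 2.585 bits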


The Lambert Effect – Subtleties in Cloud Modeling

{ figure: three models, comparing the three regression fits }

After you’ve done all of the hard work of creating the perfect model that fits your data comes the hard part: does it make sense? Have you overfitted your data? Are the results confirming or surprising? If surprising, is that because there’s a real surprise, or because your model is broken?

Here’s an example: iterating on the same CloudForms data as in the past few posts, we have subtle variations on the relationship between CPU and memory usage, shown through linear regressions in R:

  • Grey dashed: the relationship across all servers/VMs and data points, without accounting for per-server variance; it says that, generally, more CPU usage indicates more memory consumed.
  • Blue dashed: lets the intercept, but not the slope, vary per server (using factor() in lm()); it reinforces the CPU/memory relationship, but suggests it’s not as strong as the previous model.
  • Black: varies both slope and intercept by server/VM, with lmer().
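
For concreteness, here’s a minimal sketch of the three fits. The data frame d and its columns cpu, mem, and server are assumptions standing in for the actual CloudForms extract:

library(lme4)   # provides lmer() for mixed-effects models

# d is a hypothetical data frame with columns cpu, mem, and server
m.pooled    <- lm(mem ~ cpu, data = d)                         # grey dashed
m.intercept <- lm(mem ~ cpu + factor(server), data = d)        # blue dashed
m.mixed     <- lmer(mem ~ cpu + (1 + cpu | server), data = d)  # black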

So what’s the best model? Good question; I’m looking for input. I’d like a model that I can generalize to new VMs, which suggests one of the two less tightly fitted models.
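
One point in favor of the mixed model: lme4 can still score servers it has never seen by falling back to the population-level fixed effects. A sketch, assuming the m.mixed fit above and a hypothetical new.vms data frame:

# unseen server levels fall back to the fixed-effects (population) fit
predict(m.mixed, newdata = new.vms, allow.new.levels = TRUE)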

Many thanks to Edwin Lambert who, many years ago, beat into my skull that understanding, not numbers, is the goal.