analysis

Calculating Conditional Entropy in R

May 1, 2013 erich

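conditionalEntropy <- function(graph) {
   # graph is a 2- or 3-column data frame: from, to, and an optional weight
   if (ncol(graph) == 2) {
      names(graph) <- c("from", "to")
      graph$weight <- 1   # unweighted edges all count equally
   } else if (ncol(graph) == 3) {
      names(graph) <- c("from", "to", "weight")
   }
   # number of runs of identical from/to pairs; this equals the number of
   # distinct edges when identical rows sit next to each other
   # (named n rather than max, to avoid shadowing base::max)
   n <- length(rle(paste(graph$from, graph$to))$values)
   # share of the total weight carried by each row
   p <- graph$weight / sum(graph$weight)
   entropy <- data.frame(H = 0, Hmax = 0)
   entropy$H <- -sum(p * log2(p))
   entropy$Hmax <- log2(n * (n - 1))
   return(entropy)
}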
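As a quick check of the arithmetic, here is a small worked example that applies the same H and Hmax expressions by hand. The three-edge graph and its weights are made up for illustration, and the distinct-pair count uses unique() as a stand-in for the run-length count in the function.

```r
# Hypothetical weighted edge list: three directed edges
edges <- data.frame(
  from   = c("a", "b", "c"),
  to     = c("b", "c", "a"),
  weight = c(2, 1, 1)
)

# Each edge's share of the total weight
p <- edges$weight / sum(edges$weight)   # 0.50 0.25 0.25

# Shannon entropy in bits
H <- -sum(p * log2(p))                  # 1.5

# Count of distinct from/to pairs, then the same Hmax expression as the function
n <- length(unique(paste(edges$from, edges$to)))
Hmax <- log2(n * (n - 1))               # log2(6), about 2.585

print(data.frame(H = H, Hmax = Hmax))
```

With weights of 2, 1 and 1 the shares are 1/2, 1/4 and 1/4, so H = 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5 bits.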

Analyzing Cloud Performance with CloudForms and R

April 15, 2013 erich

CloudForms by Red Hat has extensive reporting and predictive analysis built into the product. But what if you already have a reporting engine? Or want to do analysis not already built into the system? This project was created as an example of using CloudForms with external reporting tools (our example uses R). Take special care: it is easy to miss the context of the data, because there is a lot of state built into the product. For guaranteed correctness, use the built-in “integrate” functionality.

Both the data collection and the analyses are fast for what they are, but aren’t particularly quick. Be patient: calculating the CPU confidence intervals of 73,000 values across 120 systems took about 90 seconds (elapsed time) on a 2011 laptop.

Required R libraries
forecast
DBI
RPostgreSQL
Installing RPostgreSQL required the postgresql-devel RPM on my Fedora 14 box.

See collect.R for an example to get started. Full code is available on GitHub.

Notes on confidence intervals
Confidence intervals express the “strength” of the likelihood that a value will fall within a given range. The 80% confidence interval is the range that values are expected to fall within 80% of the time. It is a smaller range than the 95% interval, and values inside it should be considered more likely. E.g. if you are going to hit your memory threshold within the 80% interval, look to address those limits before those that only fall within the 95% interval.
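To make the 80% versus 95% comparison concrete, here is a minimal sketch. It is not the project's code and does not use the forecast package: it assumes a made-up CPU series and base R's arima() and predict(), with normal-theory intervals, so every name in it is illustrative.

```r
set.seed(42)

# Hypothetical CPU usage series (percent); not real CloudForms data
cpu <- 50 + arima.sim(model = list(ar = 0.6), n = 200, sd = 5)

# Fit a simple AR(1) model and forecast 10 steps ahead
fit  <- arima(cpu, order = c(1, 0, 0))
pred <- predict(fit, n.ahead = 10)

# Normal-theory interval: forecast +/- z * standard error
interval <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(lower = as.numeric(pred$pred - z * pred$se),
             upper = as.numeric(pred$pred + z * pred$se))
}

i80 <- interval(0.80)
i95 <- interval(0.95)

# The 80% interval is narrower than the 95% interval at every step
width80 <- i80$upper - i80$lower
width95 <- i95$upper - i95$lower
print(head(data.frame(width80, width95)))
```

A threshold that already sits inside the 80% band is the more urgent one, since the forecast puts more weight on that narrower range.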

Notes on frequencies
Frequencies in the included functions are expressed as multiples of the data collection interval. Short-term metrics are collected at 20-second intervals; rollup metrics are at 1-hour intervals. Example: for 1-minute intervals with short-term metrics, use a frequency of 3.
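The frequency arithmetic can be sketched as a small helper. The function name samplesPerInterval is hypothetical and not part of the project; it just divides the interval you want by the collection interval.

```r
# Hypothetical helper: how many collected samples make up one interval
samplesPerInterval <- function(intervalSeconds, sampleSeconds = 20) {
  if (intervalSeconds %% sampleSeconds != 0) {
    stop("interval must be a whole multiple of the collection interval")
  }
  intervalSeconds / sampleSeconds
}

samplesPerInterval(60)            # short-term metrics, 1 minute: 3
samplesPerInterval(3600)          # short-term metrics, 1 hour: 180
samplesPerInterval(86400, 3600)   # rollup metrics, 1 day: 24
```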

Notes on fields
These are column names from the CF db. The default field is cpu_usage_rate_average. I also recommend looking at mem_usage_absolute_average.

Notes on graphs
Graphs for the systems are shown for the first X systems (up to “max”) with sufficient data to perform the analysis (# of data points > frequency * 2) and that have a range of data, e.g. min < max. Red point = min, blue point = max.
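The selection rule above might look roughly like this. It is a sketch under assumptions: the helper hasEnoughData, the system names, and the values are all made up, and maxGraphs stands in for the “max” setting.

```r
# Hypothetical check mirroring the two conditions described above:
# more than frequency * 2 data points, and a real range (min < max)
hasEnoughData <- function(values, frequency) {
  length(values) > frequency * 2 && min(values) < max(values)
}

# Made-up series keyed by system name
systems <- list(
  web01 = c(10, 12, 15, 11, 13, 14, 16),   # varied, 7 points
  db01  = rep(5, 10),                      # flat: min == max
  app01 = c(20, 25)                        # too few points
)

frequency <- 3
maxGraphs <- 2   # stands in for the "max" setting

# Keep the systems that qualify, then take the first maxGraphs of them
eligible <- names(systems)[sapply(systems, hasEnoughData, frequency = frequency)]
toPlot <- head(eligible, maxGraphs)
print(toPlot)
```

Here only web01 qualifies: db01 has no range and app01 has too few points for a frequency of 3.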

Example images
Files named *.raw.png are generated from the short-term metrics; the others are generated from the rollup data.

Health Care Leans Republican

September 22, 2009 erich

3.6 times as many former congressional staffers turned health care lobbyists, and their immediate connections, have network ties closer to former President Bush than to current President Obama.

The connections in the network map shown below, and used for the analysis above, include people and the organizations (corporate, not-for-profit, public, etc.) those people have been identified with.


Mathematicians Do It Randomly

September 8, 2009 erich

What would it look like if you took all of the Mathematics articles from JSTOR, the digital journal archive, and mapped co-authorship of the papers? It would look something like this. It is interesting to note that, while the distribution does hold to the small-world network distribution exponent, there’s some “peakiness” about it that may suggest it’s not really one network, but the merging of several. Given the role of mathematics in so many other subjects, that would not be a surprise.

JSTOR Mathematics Authors — Largest cluster of co-authorship

Zoomable image with names, after the jump.


Health Care Lobbyists Part Deux

August 11, 2009 erich

Thanks, everyone, for the strong interest in the Lobbyist map. I got a couple of nice mentions at Mother Jones and LittleSis.org, but more importantly, I’ve added in all of the other names in the map.

Circles are people, squares are organizations, and white circles are the lobbyists in question.

If you’d rather have the image than the Flash bits, here you go, all 2.5MB of it.

A zoomable version of the earlier map is here:

[Thanks to Drew Conway for the Sea Dragon zoomable suggestion]

Best Networked Healthcare Lobbyists? [updated]

August 11, 2009 erich

The Huffington Post, along with public contributors, has been collecting a list of former Congressional staffers turned healthcare lobbyists. LittleSis.org has been keeping track of these former staffers, and thanks to their API, we now have a social graph of their relationships.

Former staffers are shown in white (with names); the rest of the visual field is there to show that some are MUCH better networked than others.

If there’s interest, I can add the names of the people they are networked with and start some analysis of the group.

HCIU Congressional Staffers Turned Healthcare Lobbyists

As always, click for a larger image.

Update: network map with all names, and in a zoomable widget here.

Healthcare and the Senate Finance Committee

August 11, 2009 erich

Late last month, the NY Times had an article about the debate over healthcare legislation taking place in the Senate Finance Committee. Coincidentally, around that time, the folks over at LittleSis, the “free database detailing the connections between powerful people and organizations,” were kind enough to give me early access to their API (thanks Kevin and Matthew!).

So from NY Times:

To LittleSis:

Of the named members in the photo, neither Tom Barthold nor Phil Ellis existed at the time in the LittleSis database, but it’s still showing a pretty networked bunch.

I’d like to see someone do this one better, and include donors.

Twitter Communication is Scale Free

July 7, 2009 erich

In a network created from a sample of communications among approximately 900,000 people on Twitter, the resulting distribution of distinct communication partners fits the definition of a scale-free network. The power-law exponent is a little higher than the range usually described for scale-free social networks (2<k<3), but not much.
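The post does not say how the exponent was fitted, so the following is only a sketch of one common approach: the maximum-likelihood estimate for a continuous power law. It runs on simulated values rather than the Twitter sample, and alpha here plays the role of k above; all names and numbers are assumptions for illustration.

```r
set.seed(1)

# Hypothetical sample: continuous power-law values with a known exponent
alpha <- 2.5     # true exponent, chosen for illustration
xmin  <- 1       # smallest value the power law applies to
n     <- 10000
x <- xmin * (1 - runif(n))^(-1 / (alpha - 1))   # inverse-transform sampling

# Maximum-likelihood estimate of the exponent for continuous data
alphaHat <- 1 + n / sum(log(x / xmin))

# Approximate standard error of the estimate
stdError <- (alphaHat - 1) / sqrt(n)

cat(sprintf("estimated exponent: %.3f (+/- %.3f)\n", alphaHat, stdError))
```

Real partner counts are discrete and the lower cutoff has to be chosen, so the estimate from an actual communication sample would need more care than this.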