The Baby Measureur

R Code for Our Kid

Not too long ago, a tiny, screaming, pooping, extraordinarily amazing data manufacturing machine came into my life. Long accustomed to taking subtle cues from my wife, I was not surprised by his arrival, so I had plenty of time to prepare my optimal workflow for consuming baby data. Basically, I just installed the Baby Connect apps on all of our devices.1

Baby Connect syncs feeding, diaper, health, and all sorts of other data across multiple devices. So, when I change a diaper, I record it and can get credit for it. They also provide a number of graphs so you can see changes in the input/output of your bundle. I wanted something that would point me to changes. Fortunately, in a stroke of genius, they also allow you to download the data in CSV format from their website.2 So, with no sleep, a month of paternity leave3, and ready access to data, I started putting together some R code to look for patterns through cluster analysis.4

Feeding the Beast

For the month and a half this kid has been living with us, model-based clustering identified five clusters of feedings when measured across the datetime, time of day, and duration of feedings.

Feeding Duration
For the first week and a half, my kiddo was eating either long or short. For the next two weeks, the variation in feeding duration came down enough to be considered a single cluster. For the next two and a half weeks, the variation decreased further. The difference in feeding duration over the first fortnight is particularly noticeable in the graph below.
Feeding Duration

You’ll note I’ve discussed four clusters. The fifth has a single entry (Aug 19th, just before 5am). I have no idea what that’s about.

Long and short of it: if my kid’s like yours, you will definitely see changes to eating patterns over the first weeks.
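A minimal sketch of the model-based clustering step in R, using the mclust package (which must be installed); the data frame and column names here are synthetic stand-ins, not Baby Connect’s export format:

```r
# Model-based clustering of feedings by datetime, time of day, and duration.
# All data below is made up for illustration.
library(mclust)
set.seed(42)

feedings <- data.frame(
  datetime     = as.numeric(seq(as.POSIXct("2013-08-01"), by = "8 hours",
                                length.out = 60)),
  time_of_day  = rep(c(2, 10, 18), 20),                  # hour of day
  duration_min = c(rnorm(30, 25, 8), rnorm(30, 15, 3))   # long vs. short feedings
)

fit <- Mclust(feedings)   # BIC chooses the number of clusters automatically
fit$G                     # number of clusters found
```

Mclust fits a family of Gaussian mixture models and picks the number of clusters by BIC, so the cluster count falls out of the data rather than being chosen up front.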

Making Diaper Changing Cool Again

Running similar tests over the diaper data, I calculate three clusters, and again see them largely grouped chronologically.
Diaper Timing
In the first, my boy went any damn time he pleased. In the second, lasting a week, there’s a noticeable dropoff in the quantity of diapers. In the third, quantity picks up again, but we also see the introduction of a small kindness: fewer changes after 9pm. Yes, interested parties, my boy is thankfully starting to fall into sleep patterns as well as sleep more. But what’s going on in that middle cluster? For that, I look at the reasons for diaper changes.
Diaper Changes by Type
This graph requires explanation (and simplification). Aside from boredom and performance art, there are two main reasons I change diapers. These two reasons are often, but not always, concurrent. This graph looks at those two reasons and tests whether they are concurrent: yes on top, no on the bottom. The Y-axis is otherwise irrelevant; the vertical jitter is there only so the points don’t all fall into one unreadable line.

What we see here is that my data producer had roughly two-thirds exclusive diapers in his first two weeks. Then mostly double diapers, for a week. And now, about an even split. Note the shift to longer feedings during the same week (second graph): this coincided with a growth spurt, not that I could tell except by looking at my calendar.

What’s Next

Please, jump in. Take a look at the code. Use the code. Provide ideas, patches, comments.

GitHub: babyconnectR

  1. Thank you, Gunnar
  2. I’d prefer a way to download it all at once, but by month isn’t so bad. 
  3. Thank you, Red Hat
  4. Most of the measures didn’t show me much, but I’ve added them all to the GitHub repo, as it could be an artifact of the data. 

Calculating Conditional Entropy in R


conditionalEntropy <- function( graph ) {
   # graph is a 2 or 3 column dataframe of edges: from, to[, weight]
   if (ncol(graph) == 2) {
      names(graph) <- c("from","to")
      graph$weight <- 1
   } else if (ncol(graph) == 3) {
      names(graph) <- c("from","to","weight")
   }
   # count of distinct edges; unique() rather than rle(), which only
   # counts runs and undercounts on unsorted data
   max <- length(unique(paste(graph$from, graph$to)))
   total <- sum(graph$weight)
   entropy <- data.frame(H = 0, Hmax = 0)
   # observed entropy, in bits, of the edge-weight distribution
   entropy$H <- -sum(graph$weight/total * log2(graph$weight/total))
   # upper bound on entropy given the edge count
   entropy$Hmax <- log2(max * (max - 1))
   entropy
}
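As a sanity check on the entropy term, four equally weighted edges should give exactly log2(4) = 2 bits:

```r
# Worked example of the Shannon entropy calculation, in base R.
weights <- c(1, 1, 1, 1)      # four equally weighted edges
p <- weights / sum(weights)   # each edge carries probability 0.25
H <- -sum(p * log2(p))
H   # 2
```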


The Lambert Effect – Subtleties in Cloud Modeling

After you’ve done all of the hard work of creating the perfect model that fits your data comes the harder part: does it make sense? Have you overfit your data? Are the results confirming or surprising? If surprising, is that because there’s a real surprise, or because your model is broken?

Here’s an example: iterating on the same CloudForms data as the past few posts, we have subtle variations on the relationship between CPU and memory usage, shown through linear regressions in R. Grey dashed = the relationship across all servers/VMs and data points, without taking per-server variance into account; it says that, generally, more CPU usage indicates more memory consumed. Blue dashed = taking into account variance of the intercept but not of the slope (using factor() in lm()); it reinforces the CPU/memory relationship, but suggests it’s not as strong as the previous model. The black line varies both slope and intercept by server/VM with lmer().
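The three fits can be sketched in base R (the full varying-slope model needs the lme4 package; the column names cpu, mem, and server are illustrative, not CloudForms fields):

```r
# Synthetic data: 5 servers with different baselines, same CPU/memory slope.
set.seed(1)
d <- data.frame(
  server = rep(paste0("vm", 1:5), each = 20),
  cpu    = runif(100, 0, 100)
)
d$mem <- 200 + as.numeric(factor(d$server)) * 50 + 2 * d$cpu + rnorm(100, 0, 20)

pooled   <- lm(mem ~ cpu, data = d)                   # grey dashed: one line for all
per_host <- lm(mem ~ cpu + factor(server), data = d)  # blue dashed: per-server intercepts
# black line, varying slope and intercept, requires lme4:
# lme4::lmer(mem ~ cpu + (cpu | server), data = d)

coef(pooled)["cpu"]
coef(per_host)["cpu"]
```

With the per-server intercepts absorbed into factor(server), the CPU coefficient reflects the within-server relationship rather than differences between servers.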

So what’s the best model? Good question; I’m looking for input. I’d like a model that I can generalize to new VMs, which suggests one of the two less-fitted models.

Many thanks to Edwin Lambert who, many years ago, beat into my skull that understanding, not numbers, is the goal.

Determining Application Performance Profiles in the Cloud

I want to know how to characterize my workloads in the cloud. With that, I should be able to find systems both over-provisioned and resource-starved, to aid in right-sizing and capacity planning. CloudForms by Red Hat can do this at the system level, which is where you would most likely take any actions, but I want to see if there’s any additional value in understanding at the aggregate level. We’ll work backwards for the impatient: I found 7 unique workload types by creating clusters of CPU, memory, disk, and network use through k-means of the short-term data from CloudForms (see the RGB/Gray graph nearby). The cluster numbers are arbitrary, but ordered by median CPU usage from least to most.

From left to right, rough characterizations of the clusters are:

  1. idle
  2. light use, memory driven
  3. light use, cpu driven
  4. moderate use
  5. moderate-high everything
  6. high cpu, moderate mem, high disk
  7. cpu bound, very high memory
Continue reading “Determining Application Performance Profiles in the Cloud”
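A minimal sketch of the k-means step, with illustrative metric names and synthetic data; the relabeling at the end orders the clusters by median CPU, least to most, as in the post:

```r
# Cluster systems by usage profile; all data below is synthetic.
set.seed(7)
m <- data.frame(
  cpu  = runif(200), mem = runif(200),
  disk = runif(200), net = runif(200)
)

# scale() first so no single metric dominates the distance calculation
km <- kmeans(scale(m), centers = 7, nstart = 25)

# relabel clusters so 1..7 runs from lowest to highest median cpu
med <- tapply(m$cpu, km$cluster, median)
m$cluster <- match(km$cluster, order(med))
```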

Analyzing Cloud Performance with CloudForms and R

CloudForms by Red Hat has extensive reporting and predictive analysis built into the product. But what if you already have a reporting engine? Or want to do analysis not already built into the system? This project was created as an example of using CloudForms with external reporting tools (our example uses R). Take special care: you can miss context in the data, as there is a lot of state built into the product; for guaranteed correctness, use the built-in “integrate” functionality.

Both the data collection and the analyses are fast for what they are, but they aren’t instantaneous. Be patient: calculating the CPU confidence intervals of 73,000 values across 120 systems took about 90 seconds (elapsed time) on a 2011 laptop.

Required R libraries
Installing RPostgreSQL required the postgresql-devel rpm on my Fedora 14 box

See: collect.R for example to get started. Full code is available on github.

Notes on confidence intervals
Confidence intervals are the “strength” of likelihood that a value will fall within a given range. The 80% confidence interval is the set of values expected to fall within the range 80% of the time. It is a smaller range than the 95% interval, and should be considered more likely. E.g., if you are going to hit your memory threshold within the 80% interval, look to address those limits before those that only fall within the 95% interval.
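A quick illustration of the two interval widths, on synthetic data:

```r
# The 80% interval is always narrower than the 95% interval drawn
# from the same distribution.
set.seed(2)
x <- rnorm(10000, mean = 50, sd = 10)   # simulated metric values

ci80 <- quantile(x, c(0.100, 0.900))
ci95 <- quantile(x, c(0.025, 0.975))

diff(ci80) < diff(ci95)   # TRUE: 80% is the tighter, more likely range
```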

Notes on frequencies
Frequencies within the included functions are multiples of the collected data. Short-term metrics are collected at 20-second intervals. Rollup metrics are at 1-hour intervals. Example: for 1-minute intervals with short-term metrics, use a frequency of 3.
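For example, wrapping short-term samples in a ts object with that frequency:

```r
# 20-second samples grouped into 1-minute periods: frequency = 3.
cpu_short <- runif(300)                  # 300 samples = 100 minutes of 20s data
cpu_ts <- ts(cpu_short, frequency = 3)   # each period holds 3 observations
frequency(cpu_ts)   # 3
```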

Notes on fields
These are column names from the CF db. The default field is cpu_usage_rate_average. I also recommend looking at mem_usage_absolute_average.

Notes on graphs
Graphs for the systems are shown for the first X systems (up to “max”) with sufficient data to perform the analysis (# of data points > frequency * 2) and that have a range of data, i.e. min < max. Red point = min, blue point = max.
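That eligibility filter can be sketched like this (the metrics data frame and its columns are illustrative, not the project’s actual structures):

```r
# Keep only systems with enough points and some actual variation.
frequency <- 3
metrics <- data.frame(
  system = rep(c("a", "b", "c"), c(10, 4, 10)),
  value  = c(runif(10), runif(4), rep(0.5, 10))   # "b" is short; "c" is flat
)

eligible <- sapply(split(metrics$value, metrics$system), function(v) {
  length(v) > frequency * 2 && min(v) < max(v)
})
names(eligible)[eligible]   # only "a" qualifies
```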

Example images
*.raw.png files are generated from the short-term metrics; the others, from the rollup data.

City Green as a Function of City Parks

I stumbled across the City Nature project at Stanford via some interesting interactive data visualizations they have created, like the comparison between natural and social variables and the Naturehoods Explorer for 34 US cities.

One of the comments in the comparison chart (first project link) was the lack of clear relationships between any of the provided variables. As I’m a glutton for punishment, I thought I’d give it a go.

With the addition of only two variables at the City level to the data provided by the Naturehoods Explorer, I was able to get a good start on a linear regression model. The two variables added are city population and number of parks in the city.

Below are the results of linear regression models run through R.

> summary(lm(park_count ~ . , data =g2))

Call:
lm(formula = park_count ~ ., data = g2)

Residuals:
    Min      1Q  Median      3Q     Max 
-267.25  -42.72   -2.30   56.81  189.01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.636e+01  4.260e+01   1.323 0.185897    
greenness    4.061e+02  3.640e+01  11.157  < 2e-16 ***
pavedness   -1.881e+01  1.861e+01  -1.011 0.312164    
pct_park    -7.069e+01  1.416e+01  -4.994 6.30e-07 ***
park_need   -7.382e+00  4.833e+00  -1.528 0.126742    
popdens      4.525e-04  1.780e-04   2.542 0.011090 *  
h_inc        5.544e-04  1.293e-04   4.288 1.87e-05 ***
home_val    -2.578e-04  2.129e-05 -12.110  < 2e-16 ***
pct_own      4.294e+01  1.184e+01   3.627 0.000292 ***
diversity    5.461e-01  9.816e-02   5.563 2.91e-08 ***
nonwhite    -8.912e+01  7.936e+00 -11.230  < 2e-16 ***
parkspeak   -2.449e+02  3.087e+01  -7.933 3.12e-15 ***
lng         -6.429e-02  1.508e-01  -0.426 0.669870    
lat         -7.898e+00  3.937e-01 -20.060  < 2e-16 ***
population   1.649e-05  1.372e-06  12.023  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 77.24 on 2646 degrees of freedom
Multiple R-squared: 0.447,	Adjusted R-squared: 0.4441 
F-statistic: 152.8 on 14 and 2646 DF,  p-value: < 2.2e-16

I also normalized the fields, except park_count, to get a feel for the relative impact of the individual variables across their very different scales. Each estimate now indicates the effect of a one-standard-deviation change in the variable.
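The normalization step can be sketched with base R’s scale(); the data here is synthetic, standing in for the post’s g2/g3 frames:

```r
# Standardize every predictor (mean 0, sd 1) while leaving the
# response, park_count, on its original scale.
set.seed(11)
parks <- data.frame(park_count = rpois(100, 150),
                    greenness  = runif(100),
                    popdens    = runif(100, 0, 1e5))

parks_norm <- parks
parks_norm[, -1] <- scale(parks_norm[, -1])   # all columns except the response

apply(parks_norm[, -1], 2, sd)   # each predictor now has sd 1
```

Because standardizing predictors is just a linear rescaling, the fit itself is unchanged; that is why the residuals, R-squared, and F-statistic are identical between the two summaries.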

> summary(lm(park_count ~ . , data =g3))

Call:
lm(formula = park_count ~ ., data = g3)

Residuals:
    Min      1Q  Median      3Q     Max 
-267.25  -42.72   -2.30   56.81  189.01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 152.7865     1.4973 102.043  < 2e-16 ***
greenness    21.2528     1.9048  11.157  < 2e-16 ***
pavedness    -2.1510     2.1278  -1.011 0.312164    
pct_park     -8.3974     1.6815  -4.994 6.30e-07 ***
park_need    -2.8265     1.8503  -1.528 0.126742    
popdens       6.0435     2.3778   2.542 0.011090 *  
h_inc        15.4520     3.6039   4.288 1.87e-05 ***
home_val    -43.8665     3.6222 -12.110  < 2e-16 ***
pct_own       8.2300     2.2691   3.627 0.000292 ***
diversity    11.1726     2.0082   5.563 2.91e-08 ***
nonwhite    -21.6853     1.9311 -11.230  < 2e-16 ***
parkspeak   -14.3325     1.8066  -7.933 3.12e-15 ***
lng          -0.9985     2.3418  -0.426 0.669870    
lat         -40.1728     2.0026 -20.060  < 2e-16 ***
population   28.5058     2.3710  12.023  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 77.24 on 2646 degrees of freedom
Multiple R-squared: 0.447,	Adjusted R-squared: 0.4441 
F-statistic: 152.8 on 14 and 2646 DF,  p-value: < 2.2e-16

For explanations of the data, collection method, etc., please see the City Nature Project.

Data updated with new city values in CSV format

Load Volatility and Resource Planning for your Cloud

Having your own cloud does not mean you are out of the resource planning business, but it does make the job a lot easier. If you collect the right data, with the application of some well understood statistical practices, you can break the work down into two different tasks: supporting workload volatility and resource planning.

If the usage of our applications was changing in a predictable fashion, resource planning would be easy.  But that’s not always the case, and volatility can make it very difficult to tell what is a short term change and what is part of a long term trend.  Here are some steps to help you prioritize systems for consolidation, get ahead of future capacity problems, and understand long term trends to assist in purchasing behaviors. Our example is with data extracted from Red Hat’s ManageIQ cloud management software.

Usually, we collect and see our performance over X periods of time, where X is a small number and we don’t get much insight. More data points help, but require a lot of storage. ManageIQ natively provides rollups of metrics, providing a great balance between the two. Since we want to compare short term to long term for trends, we lose little by using the rollup data.

Our graphs look at the CPU utilization history of four systems. The first graph looks only at the short-term data, smoothed (using a process similar to the one described here) over one-minute intervals. We smooth the data to reduce the impact of intra-period volatility on our predictions. The method described corrects for “seasonality” within the periods, e.g. CPU utilization on Mondays could be predictably higher than on Tuesdays as customers come back to work and get things done they could not over the weekend. The blue dot is the highest utilization over the period, and the red dot the lowest. Continue reading “Load Volatility and Resource Planning for your Cloud”
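That smoothing step can be sketched in base R as a centered moving average over one-minute (3-sample) windows, with the period’s min and max flagged afterward:

```r
# Smooth 20-second CPU samples into one-minute means, then find the
# extremes over the period. Data is synthetic.
set.seed(3)
cpu <- runif(300, 10, 90)                    # 20-second utilization samples

smoothed <- as.numeric(stats::filter(cpu, rep(1/3, 3)))  # centered 3-point mean

lo <- which.min(smoothed)   # red point: lowest smoothed utilization
hi <- which.max(smoothed)   # blue point: highest smoothed utilization
c(low = smoothed[lo], high = smoothed[hi])
```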

Measuring Load in the Cloud: Correcting for Seasonality

Usage is over threshold, unleash the kraken! 

Short-run peaks are perfect for automated elasticity: the unpredictable consumption that we stay up late worrying about fulfilling. But short-run peaks can be difficult to tease out from expected variation within the period: seasonality. Using the open source statistical package R, we can separate and look at both.
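That separation can be sketched with base R’s decompose(), here on synthetic hourly data with a 24-hour season:

```r
# Split hourly load into seasonal, trend, and remainder components.
set.seed(5)
hours <- 1:(24 * 14)                                    # two weeks of hourly rollups
load  <- 50 + 10 * sin(2 * pi * hours / 24) +           # daily cycle
         rnorm(length(hours), 0, 3)                     # short-run noise
load_ts <- ts(load, frequency = 24)                     # 24 observations per day

parts <- decompose(load_ts)
# parts$seasonal is the expected within-day pattern; parts$random is
# what remains -- the short-run peaks worth alerting on.
range(parts$seasonal)
```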

Continue reading “Measuring Load in the Cloud: Correcting for Seasonality”

Communication Method, Scale, and Entropy

In a surprise to Marshall McLuhan, we see ad hoc conversations conducted through different electronic media demonstrating very similar scaling characteristics across number of nodes, number of edges, and number of unique edges. Looking at email lists, IRC, and long-term Twitter searches, we see more similarity than difference between the three media.

However, when we look at the observed conditional entropy (below the fold), the differences become clear: communication patterns are very different by media type, even as the networks scale similarly in communicants. Maybe McLuhan was right.

nodes vs unique edges

nodes vs total edges

unique vs total edges


Continue reading “Communication Method, Scale, and Entropy”

Life in a Networked Age

John Robb, who brought us the term “open source warfare,” wallops the concerns of governance of our increasingly global network:

A global network is too large and complex for a bureaucracy to manage.  It would be too slow, expensive, and inefficient to be of value.  Further, even if one could be built, it would be impossible to apply market dyanmics [sic] (via democratic elections) to selecting the leaders of that bureaucracy.  The diversity in the views of the 7 billion of us on this planet are too vast.