alexfarquhar's posterous

Alex Farquhar  //  I'm a Developer/Data Scientist, lost in a community of like-minded Geeks (see Our Website).

Mar 13 / 3:54pm

Data Science - learn the lessons of software

We're starting to see a deluge of companies whose businesses are all about making data analysis/science/insight "easy for the non-expert". We've been here before, quite a few times sadly. When I started writing software 12 years ago, there was great excitement in the air - finally we could use tools to design software, then press a button that would create our whole beautiful design in code! Then we could just hire some barely-sentient code monkeys to fill in the 'easy bits' like method definitions and those pesky database access routines.

It was a disaster. The fundamental problem was that by the time you'd crafted your beloved design and polished it to a high shine, the world had moved on. What may have worked on day 1 of the project was now hopelessly inadequate. We should always remember the maxim "no plan survives contact with the enemy", the enemy here being the shifting reality of what your software needs to deliver.

Another major problem with this approach was the proliferation of so-called Software Architects, beings of such insight and experience that they didn't even need to code anymore! Since they didn't code, they couldn't experience the grinding pain of trying to jam their grandiose designs into a reality-shaped hole.

Fast-forward to today - data is big, Data Science is even bigger (as a buzzword anyway), and we're all short of the right people. The answer, however, is not to make tools that hide the complex, ever-shifting reality of the analytical process. It's to make people better at doing this stuff. And there'll be no magic off-the-shelf solution that can achieve this, any more than giving a terrible golfer great clubs will make them win The Masters.

Filed under  //  data science   r  
Feb 3 / 12:37pm

Large search spaces using R

I'm working on some really interesting stuff at the moment, the details of which I can't discuss for reasons of national security (not really). However, one of the things I've been doing a lot of is searching through lots of different combinations of parameters to find an optimal solution.

Combinations of things are annoyingly...combinatorial. Let's say I have a machine that generates gold coins. The machine has 6 dials, each with 10 different positions, and the machine creates different amounts of gold for each setting. How many settings do I have to search through to find the best one? The answer is 10^6, or 1,000,000. That's not too bad, you might think, we've got those computer thingies these days. But if each combination takes 100ms to measure, that's over 27 hours to test all combinations.
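The arithmetic is easy to check in R itself. This is just a back-of-the-envelope sketch; the variable names are made up for illustration, and the 100ms per test is the assumed figure from above:

```r
# Size of the search space: 6 dials with 10 positions each
n.dials <- 6
n.positions <- 10
n.combinations <- n.positions ^ n.dials

# Assumed cost of a single measurement, in seconds (100ms)
secs.per.test <- 0.1

# Total time to test every combination one after another, in hours
total.hours <- n.combinations * secs.per.test / 3600
round(total.hours, 1)   # 27.8
```

Add a seventh dial and the total is multiplied by 10 again, which is why brute force gets painful so quickly.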

I have a very similar situation, where I'm working through millions of combinations. Each combination then needs to be tested (via some R code) to determine how "good" it is. If each test run takes over a day to complete then it's a problem, especially if you want to keep iterating with different scoring algorithms.

Another issue is in the generation of each combination. Initially I relied on the lovely expand.grid function in R. This will create all combinations of its arguments (vectors, factors, or a list of them) and return a data frame, each row representing a different combination. This works fine for reasonable numbers, but once your combination count starts going into the millions, your memory starts running out...bah!
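For a feel of what expand.grid produces, here is a small sketch on made-up parameter ranges (the names p1 to p4 are purely illustrative). The point to notice is that every combination is materialised as a row up front:

```r
# Four illustrative parameters with 2, 5, 8 and 2 possible values
grid <- expand.grid(p1 = 1:2, p2 = 1:5, p3 = 1:8, p4 = 1:2)

# One row per combination: 2 * 5 * 8 * 2 = 160
nrow(grid)

# First few combinations
head(grid, 3)

# The whole grid lives in memory at once, so its size grows with the
# number of combinations
print(object.size(grid), units = "Kb")
```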

The answer to the memory issue was surprisingly simple - if there are n different parameters, each with 10 possible values, then think of each parameter occupying a position in an n-length integer e.g. 2314 would represent a combination with "2" as the value for parameter 1, "3" for parameter 2 etc. So if I wanted to search all combinations, I just go from 0000 to 9999, each time incrementing the number to get a new combination.
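As a minimal sketch of that idea (the helper name digits.of is hypothetical, and it assumes exactly 10 values per parameter, numbered 0 to 9), a plain counter can be turned into a combination with integer division and remainder:

```r
# Hypothetical helper: split a counter value into one digit per parameter.
# Assumes every parameter has exactly 10 possible values (0 to 9).
digits.of <- function(i, n) {
  (i %/% 10 ^ ((n - 1):0)) %% 10
}

digits.of(2314, 4)   # 2 3 1 4
digits.of(7, 4)      # 0 0 0 7
```

Only the counter needs to be stored, so memory use stays flat no matter how many combinations there are.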

But what if we don't have 10 values for each parameter? What if we have 2 for param 1, 5 for param 2, 8 for param 3 and 2 for param 4? In this case I needed to implement a "special" kind of number, which had different rules for incrementing. Here's the code:

next.comb = function(p){

 if(all(sapply(p, function(d) d$x == d$max))){
  return(NULL)
 }

 for(i in length(p):1){
  val = p[[i]]$x
  new.val = val + 1
  if(new.val <= p[[i]]$max){
   p[[i]]$x = new.val
   return(p)
  } else {
   p[[i]]$x = p[[i]]$min
  }
 }
 return(p)
}

# start number
p = list(list(x = 1, min = 1, max = 2),
    list(x = 1, min = 1, max = 5),
    list(x = 1, min = 1, max = 8),
    list(x = 1, min = 1, max = 2))
# start combination: 1 => 1, 2 => 1, 3 => 1, 4 => 1
p = next.comb(p)
# combination is now: 1 => 1, 2 => 1, 3 => 1, 4 => 2
p = next.comb(p)
# combination is now: 1 => 1, 2 => 1, 3 => 2, 4 => 1 (param 4 has hit its max of 2, so it rolls over)

Ugly, huh? But it works and takes up almost no memory, allowing the traversal of millions (or even billions) of different combinations. Next time I'll go into how I solved the other problem I outlined - the length of time to score every combination. Oh alright, I'll tell you now - I just used the multicore package and mclapply to split the problem up - simple!
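Here is a rough sketch of how the splitting could look. It is not the original code: it uses the parallel package that ships with current versions of R (which also provides mclapply), and the helper comb.at and the scoring function score are made up for illustration. Instead of incrementing a combination in place, each worker decodes a plain integer index into a combination, which makes the work trivial to divide up:

```r
library(parallel)   # ships with R and provides mclapply

# Number of possible values for each parameter (2, 5, 8 and 2, as above)
sizes <- c(2, 5, 8, 2)

# Hypothetical helper: decode a 0-based index into one combination,
# treating the index as a mixed-radix number (last parameter moves fastest)
comb.at <- function(i, sizes) {
  out <- numeric(length(sizes))
  for (k in length(sizes):1) {
    out[k] <- i %% sizes[k] + 1
    i <- i %/% sizes[k]
  }
  out
}

# Placeholder scoring function, purely for illustration
score <- function(comb) sum(comb)

# Forking is not available on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

n.total <- prod(sizes)   # 160 combinations
scores <- unlist(mclapply(0:(n.total - 1),
                          function(i) score(comb.at(i, sizes)),
                          mc.cores = cores))

# Decode the best-scoring index back into its combination
best <- comb.at(which.max(scores) - 1, sizes)
best   # 2 5 8 2
```

Indexes 0, 1 and 2 decode to the same first three combinations that next.comb walks through, so the two approaches visit the search space in the same order.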

 

Filed under  //  R  
Jul 11 / 3:32pm

Creating 3D geographical plots in R using RGL

I've been playing around with the rgl package in the last week, as part of an ongoing quest to come up with nice-looking (but more importantly, useful) data visualisations. It's a nice little package, and once you've run through the excellent examples, you can rapidly create some cool stuff. The example that initially caught my eye was this one, which creates a 3D plot of the 'volcano' dataset in only a few lines of R code:

data(volcano)
y <- 2 * volcano # Exaggerate the relief
x <- 10 * (1:nrow(y)) # 10 meter spacing (S to N)
z <- 10 * (1:ncol(y)) # 10 meter spacing (E to W)
ylim <- range(y)
ylen <- ylim[2] - ylim[1] + 1
colorlut <- terrain.colors(ylen) # height color lookup table
col <- colorlut[ y-ylim[1]+1 ] # assign colors to heights
rgl.open()
rgl.surface(x, z, y, color=col, back="lines")

[Screenshot: 3D surface plot of the volcano dataset]

What you can see here is the literal mapping of geographical height - the volcano dataset is a set of height measurements over a 10m x 10m grid of a volcano in New Zealand. What you can't see from this image is that the visualization is zoomable and rotatable (if that's a real word). The power of this is hard to describe until you've actually used it to fly through a dataset. The obvious extension of this is to map something other than physical height on the vertical axis. It could be anything - pollen counts, crime figures etc. So let's see how we can do that...

Firstly, of course, you'll need to have R installed, along with the rgl package, which you can install by running:

install.packages('rgl')

And secondly, you'll need some data. I've prepared a sample (fake) dataset along with all the code from this post; you can find it here. It consists of 3 columns: latitude, longitude and a call count (the code below refers to them as lat, long and total). The first 2 columns represent a location, and the third column represents the count of mobile phone calls at the location. Let's begin by reading the data into a dataframe. I'll be using some of the awesome reshape package as well, so we'll need to load that along with rgl:

library(rgl)
library(reshape)
rgl.clear(type = c("shapes"))
calls = read.delim('call_data.tsv', header = TRUE)

Now that we've loaded the data, we hit our first problem. The volcano example above uses the rgl.surface function, which takes a matrix argument (not a dataframe sadly). What's more, in the example, the volcano data has already been nicely arranged into a grid, with each cell representing the average height of a 10m x 10m grid square. By contrast, our calls dataset has arbitrary lat/long pairs. So (as ever) we'll need to do some manipulation of our data before we can plot it.

What we need to do is bucket our call counts into evenly spaced divisions, so that we can create a grid representing the geographical layout. Fortunately, R has full support for this kind of binning, using the cut function:

bin_size = 0.18
calls$long_bin = cut(calls$long, seq(min(calls$long), max(calls$long), bin_size))
calls$lat_bin = cut(calls$lat, seq(min(calls$lat), max(calls$lat), bin_size))
calls$total = log(calls$total) / 3 #flatten out totals
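One subtlety of cut is worth seeing on toy numbers (everything in this sketch is made up for illustration). By default the intervals are open on the left and closed on the right, so a value equal to the lowest break gets no bin, and neither does anything beyond the last break:

```r
# Toy coordinates, made up for illustration
x <- c(0, 0.5, 1.2, 2.5)
bin.size <- 1

# Breaks built from min to max in steps of bin.size
breaks <- seq(min(x), max(x), bin.size)   # 0 1 2
bins <- cut(x, breaks)
as.character(bins)   # NA "(0,1]" "(1,2]" NA

# One possible way to keep every point: extend the breaks past the maximum
# and close the first interval on the left
breaks.all <- seq(min(x), max(x) + bin.size, bin.size)   # 0 1 2 3
bins.all <- cut(x, breaks.all, include.lowest = TRUE)
as.character(bins.all)   # "[0,1]" "[0,1]" "(1,2]" "(2,3]"
```

For a visualization like this one, losing a handful of edge points is unlikely to matter, but it is handy to know where the NA values come from.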

So now we've created a grid system, with each row in the dataset falling into a 0.18 x 0.18 degree grid square (I chose 0.18 for the most important reason - it makes the visualization look better :-)). Next we have to sum up all the call counts which are in the same lat/long bucket. Thankfully we have the splendid reshape library to help us here:

calls = melt(calls[,3:5])
calls = cast(calls, lat_bin~long_bin, fun = sum, fill = 0)
calls = calls[,2:(ncol(calls)-1)]
calls = as.matrix(calls)
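If you don't have reshape to hand, the same "sum within each grid cell" step can be sketched in base R with tapply. This is an alternative on made-up toy data, not what the code above does internally:

```r
# Toy data, made up for illustration: a bin label per axis and a count
toy <- data.frame(
  lat_bin  = factor(c("a", "b", "b", "b")),
  long_bin = factor(c("x", "x", "x", "y")),
  total    = c(1, 3, 4, 5)
)

# Sum the counts within each lat/long cell, giving a lat x long matrix
m <- tapply(toy$total, list(toy$lat_bin, toy$long_bin), sum)

# Cells with no observations come back as NA, so fill them with 0
m[is.na(m)] <- 0
m
```

Rows are latitude bins and columns are longitude bins, which is the same layout the cast call produces.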

Nearly there! We've now got our matrix, so we need to define the x, y, and z data for the rgl.surface function (run help(rgl.surface) for more details), and then call it:

x = (1: nrow(calls))
z = (1: ncol(calls))
rgl.surface(x, z, calls)
rgl.bringtotop()

And here's the result:

[Screenshot: 3D surface plot of call counts]

Cool - a 3D representation of the number of phone calls across the US. But that's not all! We can add colors to show the different counts more clearly. Firstly we'll clear the previous plot by using rgl.pop(), then create a color vector from the data:

rgl.pop()
# nicer colored plot
ylim <- range(calls)
ylen <- ylim[2] - ylim[1] + 1
col <- topo.colors(ylen)[ calls-ylim[1]+1 ]
x = (1: nrow(calls))
z = (1: ncol(calls))

rgl.bg(sphere=FALSE, color=c("black"), lit=FALSE)
rgl.viewpoint( theta = 300, phi = 30, fov = 170, zoom = 0.03)
rgl.surface(x, z, calls, color = col, shininess = 10)
rgl.bringtotop()

[Screenshot: colored 3D surface plot of call counts on a black background]
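The color step boils down to turning each height into an index into a palette. Here is a small standalone sketch of that lookup on a made-up matrix; unlike the code above, it rescales the heights onto a fixed-size palette, and the names are illustrative:

```r
# Toy height matrix, made up for illustration
heights <- matrix(c(0, 0.5, 1.5, 3), nrow = 2)

# A fixed-size palette from base R
n.colors <- 100
palette <- topo.colors(n.colors)

# Rescale heights to 1..n.colors and use the result as a palette index
# (assumes the heights are not all identical)
rng <- range(heights)
idx <- 1 + floor((heights - rng[1]) / (rng[2] - rng[1]) * (n.colors - 1))
col <- palette[idx]
col
```

The lowest cell gets the first palette color and the highest cell gets the last, with one color per cell in between.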

Nice - and there you have it, genuine 3D mapping in about 30 lines of R. Next time I may make some videos, so watch this space ;-)

Filed under  //  data science   r  
Jan 18 / 1:04pm

Programming Collective bugs

I've been working my way through Toby Segaran's Programming Collective Intelligence. It's mostly been great fun and very informative: I've learned a ton about the algorithms and it's sparked off loads of ideas. I'm currently porting all of the examples to Ruby (see here), so as a by-product I'm picking up bits of Python too. Although sometimes I wish I wasn't...

Anyway, my overall feeling about the book is that it could have been truly great, but the big fly in the ointment is the number of bugs in the code examples. I'm up to chapter 7 now, and I've only just found a code example without a mistake. This is a real shame and I hope the author/publisher sorts this out in the next edition, because simultaneously translating from an unfamiliar language into Ruby and trying to understand the algorithms is hard enough, without constantly wondering whether the code you're working with is correct.

In summary I'd thoroughly recommend working through this book rather than just reading it - you don't really get the elegance of a lot of the algorithms until you've coded them. But be prepared to get frustrated with the bugs, and keep this link open: it has most of the bugs documented already, and it may save you a few hours....

 

Filed under  //  data science  
Dec 17 / 6:10pm

I'm currently engaged in an exciting new project to determine how much data is required to melt MongoDB. Oh, and I'm also storing tweets for some cunning data science reasons.....details will follow once the server explodes...

Filed under  //  data science  
Nov 10 / 6:49am

Competition within software teams is destructive

For as long as I've been in software (10 years now), I've felt that there was something slightly amiss with a lot of the teams I've worked on. I've had the privilege of working with some insanely talented people, and from them have learned an enormous amount about the art/craft/science of writing quality systems. And yet... on only a few occasions have I felt truly happy with how the team was working as a collection of people. If I could put my finger on one thing that spoils a good team it would be individual competitiveness.

I feel I should be honest with you at this point. I'm REALLY competitive. I always want to finish the code first, have the best ideas, be seen as a great developer. I constantly have to fight these instincts in me, because I know from experience that they're in direct opposition to the things that make a productive and enjoyable team environment, namely cooperation and shared success.

As I read more and more about creating software over the years, I found it hard to reconcile the idealised teams in the books and blogs I read about with the real ones I was working on. In the stories, everyone pulls together and makes a real effort to work collaboratively, especially when pairing. Individual heroism was frowned upon because it could actively hinder the rest of the team's work by hiding knowledge. People were free to try ideas without fear of ridicule, and to ask questions when they didn't understand something.

The real world was very different. For instance, one of the things I like to do when I start a new project is ask stupid questions. Quite often they reveal a gap in people's shared understanding of a concept, and the resulting discussions really help the team to gel their collective picture of the world. But in hyper-competitive teams, asking stupid questions simply gives ammunition to those who seek to look like a better developer than you. So on such a team, no stupid questions are asked, shared understanding drops away, and more often than not a crappy product is the result.

A realisation was growing in me that, left to their own devices, most teams will start to drift into unhealthy individualistic competitive behaviour. That's because the team is implicitly or explicitly being incentivised to act as selfish individuals.

How are they being incentivised to do this? Well, it seems to be about recognition, that thing which all egotists (and by extension, most developers ;-)) crave. In a vacuum, recognition on a software team tends to go to those who do brilliant things as individuals - e.g. solving a difficult problem at 2am and presenting the solution the next morning to rapturous applause etc. Whereas typically there is no recognition for those who deliberately go slowly with a junior pair, or those who stop to think through a problem before tearing into the code, or those who spend hours refactoring to improve their own and others' understanding of the codebase. These things aren't sexy, they don't make a splash - but they're the lifeblood of any collaborative software effort.

Happily, this whole issue is really easy to fix - start giving real recognition to those behaviours which are explicitly anti-competition: Asking questions, coaching others, working through problems with others, sharing knowledge. And de-emphasise individual heroics. One classic example is when a senior developer is "pairing" with a junior, who just sits next to the senior all day watching them carve out code at top speed. This behaviour is understandable in a team where people get recognition for fast delivery, but get no recognition for training others and sharing knowledge.

At my current company we really try to do the right thing on our teams, and we're pretty good at it. It's the best balance of recognising individual talents vs. working as a team that I've ever come across. But we're not complacent and we try to stay aware of any behaviour (including our own!) that might start to create a corrosive competitive environment.  

Filed under  //  teams