Friday, February 15, 2013

Podcast: RedMonk's Donnie Berkholz talks Big Data

Donnie Berkholz describes himself as RedMonk's resident Ph.D.

He spent most of his career prior to RedMonk as a researcher in the biological sciences, where he did a huge amount of data analysis & visualization as well as scientific programming. He also developed and led the Gentoo Linux distribution.

In this podcast, he discusses the impact of Big Data, why you need models, how to get started in Big Data, and what we'll be saying about the whole space five years from now.

Listen to MP3 (0:09:37)
Listen to OGG (0:09:37)


Gordon Haff:  Hello, everyone. This is Gordon Haff, cloud evangelist with Red Hat. I'm sitting here at Monki Gras, with Donnie Berkholz, who's an analyst at RedMonk. As well as having done lots of other exciting things. Donnie, why don't you tell people a little bit about yourself?
Donnie Berkholz:  Sure. I've been at RedMonk for a little over a year now, as an analyst. My history's actually pretty different from most people in tech. Because a year ago, I was a scientist doing drug discovery at Mayo Clinic. It turned out that I was having so much more fun doing all the scientific programming to enable that drug discovery, and working with all the data involved in scanning lots of drugs through computers, that I decided, "I want to work on that as my job." I didn't care so much about the drug discovery itself anymore.
Gordon:  You've also done a little bit with Linux, haven't you?
Donnie:  Definitely. I've been working on open source software for about 10 years now. Mainly on a Linux distribution called Gentoo, but also on a number of other projects and learned a lot about how to lead projects without authority and how to deal with community problems, manage communities and all that kind of thing.
Gordon:  We're going to talk a little bit later about some of the intersects between open source and big data, which is a pretty huge intersection. But first, let me start off with something that's maybe a little bit provocative. There's all this talk about data now. I remember back in mostly the 1990s, there was a lot of talk about something called data warehousing. This was how all this data was going to be collected in business, was going to do wonderful things. In fact, it mostly only did wonderful things for the companies selling the expensive software that was supposed to achieve these wonderful results. Why are things different this time?
Donnie:  Things are different, but they're not nearly as different as people think they are. One of the big differences is dealing better with unstructured data. Another one is that, with the whole trend of data science, you're getting a lot more people who really understand data involved. Instead of just storing the data and querying it by business analysts or data analysts. Now, you've got people who are professional statisticians involved in understanding that data. Modeling it using patterns and trends. Being able to think about data in terms of, not just trends ongoing in the past, but predictive trends, using better statistical models than just drawing a flat line.
Gordon:  I think that's an interesting point because Wired's editor in chief, at the time, Chris Anderson, wrote a rather provocative article, maybe a couple of years ago now, essentially saying that we don't need this model stuff any longer. We have enough data. We have powerful computers. The answers are going to fall out. I may be stereotyping him a little bit. But I don't think that much. What's your reaction to that?
Donnie:  I think it's true for some definitions of models. You're always going to want to understand the data. The only way to really understand things is by modeling them. Just looking at a distribution of a million numbers doesn't give you an understanding of what that data means or how it ties into any statistical distributions. That kind of information lets you much more accurately predict the future. Another point is that modeling, in the context that I think he meant, he's talking about very simplistic ways of modeling things. But there's much more popular methods now called Robust Statistics. Not just pretending everything can be modeled with a simple average or a standard deviation, but instead saying I don't care what kind of distribution it is. So, throwing out the model in that respect, but you still want to be able to understand that data and represent it in a more abstract, more simple way.
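To make the point about robust statistics concrete, here is a minimal Python sketch. It is an illustration only, not something discussed in the podcast, and the purchase amounts are made-up numbers. It compares the mean and standard deviation with two common robust summaries, the median and the median absolute deviation (MAD), on a small data set containing one extreme value.

```python
import statistics


def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])


# Hypothetical purchase amounts; the last value is a single extreme outlier.
purchases = [20, 22, 19, 21, 23, 20, 22, 21, 5000]

# Classical summaries are pulled far away from the typical value by the outlier.
print("mean:  ", statistics.mean(purchases))
print("stdev: ", statistics.stdev(purchases))

# Robust summaries barely notice it and make no assumption about the distribution.
print("median:", statistics.median(purchases))
print("MAD:   ", mad(purchases))
```

With these numbers the mean lands above 500 even though eight of the nine purchases are close to 21, while the median stays at 21 and the MAD at 1. That is the sense in which a robust summary still describes the data in a simpler, more abstract way without assuming a particular distribution.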
Gordon:  Now, you do have areas like low‑bias models, natural language processing, things like that. In fact, we've had a pretty hard time coming up with the models, and probably have done a better job when Google just crunches through a huge amount of recorded data essentially. Can you maybe give us a little idea of having a spectrum of problems, predominantly those related to business, social networking, advertising, and so forth. What areas can you come closer throwing a bunch of data at the problem, and where are the ones that we really do maybe need to understand the problem better and come up with better models for?
Donnie:  The easiest distinction to make is, how much data you have. If you can throw more data at it, a lot of times it is cheaper to just throw more data at the problem, rather than trying to model it better. But in some cases, you have a very limited data set, and you have nothing else to resort to but to try and model it smarter. One example might be, trying to understand very small subsets of a group, and what those subsets are doing. So, if you imagine that the purchasing patterns of people who are blind, might be very different from the general population, and you want to understand what those people are like and cater to them more effectively. But you're working with a tiny subset of one percent of all of your data, and it might be hard or even impossible to collect more of it.
Gordon:  Let's turn our attention to somebody who wants to get started in this big data space. Maybe they've been a programmer, maybe working in open source for a while. But maybe they don't have a formal statistics background, and they want to get into this area. Or maybe there is someone who is a statistician and wants to get in. Let's talk about some of the open‑source tools, and for that matter knowledge, that they can pick up so that they can work more effectively here.
First of all, for example, we have distributed file systems, which is a big part of large data sets.
Donnie:  Yeah, a place that I would definitely get involved if I were thinking about working in big data, would be with the analytical tools themselves. So tools like R, or like Python, which is actually starting to become a very competitive solution for working with lots of data, with libraries like pandas. Now, there is a lot of work going on to graph more effectively too, not just to do the analysis itself. I'd certainly start there with the analytical tools. Now, on the data side, Hadoop is really the default choice. Much like GitHub is the default for version control, Hadoop is the default for Big Data.
There are lots of easy ways now to get going with Hadoop, whether it's using a Hadoop distribution from one of the popular vendors or looking into something like Project Serengeti, which is designed to help you set up a virtualized Hadoop cluster very easily.
Gordon:  If we're a few years from now and looking back...I know predictions about the future are always hard, but where do you expect we'll have seen the big wins and, conversely, where do you think we might see some disappointment, having come out of this Big Data enthusiasm today?
Donnie:  Five years out, we'll definitely have a much better understanding of what the ROI is from working with Big Data. Because a lot of companies are implementing Hadoop right now, but it's not really clear to them whether it's going to be a five percent improvement or a 25 percent improvement. And whether those costs will outweigh the benefits. One place where we'll see change is there. Another example is, I think, the idea of data science right now is one that's very exclusive sounding. What's going to happen is that's going to become much more democratized. With tools like R and Python becoming increasingly popular, not just within the data community, but in the broader world of everybody who has to deal with data on a daily basis whether that's a business analyst or a software developer.
We're going to see things become much more democratized. We're going to see what the true payoff looks like. Other than that, it's just going to be the same trends continuing. In terms of what we've seen before, with the popularity of GitHub. Everything's going to become more intermingled.
What happens is, the cultures start to mix. Suddenly you get all these interesting benefits that you wouldn't have realized were there, as those two cultures start to mingle.
Gordon:  One of the interesting things with Big Data, in many respects, like much of cloud computing, is how pervasive open source has become here. Linux was the guy on the outside coming in, competing with proprietary tools that were already in place. In many cases, with Big Data, it's not just the case of the open source tools being more innovative or less expensive. But they're often the default choice, as you put it.
Donnie:  Yeah. A big part of it is, when you have an open source solution you can have many companies collaborating together on it, to make it work much more quickly than you could otherwise. So you end up with something that works next month, instead of next year. This is what we've seen with Hadoop. It's a very effective collaboration between a number of different companies around the Hadoop ecosystem. That enables the user, which in this case is often a developer or data scientist, to start using it much more easily, to get features added much more quickly than they might otherwise. And as you know, developers have a very strong preference for open source. Just by being open source, it already has a step ahead of the competition.
Gordon:  Thank you very much. Anything else to add?
Donnie:  No. Thank you for having me on.
Gordon:  Great. Thanks, Donnie.
