Donnie Berkholz describes himself as RedMonk's resident Ph.D.
He spent most of his career prior to RedMonk as a researcher in the biological sciences, where he did a huge amount of data analysis & visualization as well as scientific programming. He also developed and led the Gentoo Linux distribution.
In this podcast, he discusses the impact of Big Data, why you need models, how to get started in Big Data, and what we'll be saying about the whole space five years from now.
Listen to MP3 (0:09:37)
Listen to OGG (0:09:37)
Transcript:
Gordon Haff:
Hello, everyone. This is Gordon Haff, cloud evangelist with Red
Hat. I'm sitting here at Monki Gras with Donnie Berkholz, who's an analyst at RedMonk
and who has done lots of other exciting things as well. Donnie, why don't you
tell people a little bit about yourself?
Donnie Berkholz:
Sure. I've been at RedMonk for a little over a year now, as
an analyst. My history's actually pretty different from most people in tech,
because a year ago I was a scientist doing drug discovery at Mayo Clinic. It
turned out that I was having so much more fun doing all the scientific programming to
enable that drug discovery, and working with all the data involved in scanning
lots of drugs through computers, that I decided, "I want to work on that as
my job." I didn't care so much about the drug discovery itself anymore.
Gordon:
You've also done a little bit with Linux, haven't you?
Donnie:
Definitely. I've been working on open source software for about 10 years
now, mainly on a Linux distribution called Gentoo, but also on a number of
other projects. I've learned a lot about how to lead projects without authority,
how to deal with community problems, how to manage communities, and all that kind
of thing.
Gordon:
We're going to talk a little bit later about some of the intersections
between open source and big data, which is a pretty huge area. But
first, let me start off with something that's maybe a little bit provocative.
There's all this talk about data now. I remember back in mostly the 1990s,
there was a lot of talk about something called data warehousing. The idea was
that all this data was going to be collected in business and was going to do wonderful
things. In fact, it mostly only did wonderful things for the companies selling
the expensive software that was supposed to achieve these wonderful results.
Why are things different this time?
Donnie:
Things are different, but they're not nearly as different as people think
they are. One of the big differences is dealing better with unstructured data.
Another one is that, with the whole trend of data science, you're getting a lot
more people involved who really understand data. Instead of just storing the
data and having business analysts or data analysts query it, you now have
professional statisticians involved in understanding that data: modeling it
using patterns and trends, and thinking about data in terms
of not just trends in the past but predictive trends, using better
statistical models than just drawing a flat line.
Gordon:
I think that's an interesting point because Wired's editor in chief at
the time, Chris Anderson, wrote a rather provocative article, maybe a couple of
years ago now, essentially saying that we don't need this model stuff any
longer. We have enough data. We have powerful computers. The answers are going
to fall out. I may be stereotyping him a little bit, but I don't think by that
much. What's your reaction to that?
Donnie:
I think it's true for some definitions of models. You're always going to
want to understand the data. The only way to really understand things is by
modeling them. Just looking at a distribution of a million numbers doesn't give
you an understanding of what that data means or how it ties into any
statistical distributions. That kind of information lets you much more
accurately predict the future. Another point is that modeling, in the context
that I think he meant, refers to very simplistic ways of modeling
things. But there are methods that are much more popular now, called robust
statistics. They don't pretend everything can be modeled with a simple average
or a standard deviation; instead they say, "I don't care what kind of
distribution it is." So you're throwing out the model in that respect, but you
still want to be able to understand that data and represent it in a more
abstract, simpler way.
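As an illustration of the robust statistics Donnie mentions (this sketch is not part of the recorded conversation), the short Python example below compares the mean and standard deviation with the median and the median absolute deviation (MAD) on a small data set. The `purchases` values and the `mad` helper are made up for the example; the point is only that a single extreme value pulls the mean and standard deviation far from the bulk of the data, while the median and MAD barely move.

```python
import statistics


def mad(values):
    """Median absolute deviation: the median distance from the median."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


# Hypothetical purchase amounts; the last value is a deliberate outlier.
purchases = [12, 14, 15, 15, 16, 18, 900]

# Classical summary: sensitive to the single extreme value.
mean = statistics.mean(purchases)
stdev = statistics.stdev(purchases)

# Robust summary: makes no assumption about the shape of the distribution.
median = statistics.median(purchases)
spread = mad(purchases)

print(f"mean={mean:.1f} stdev={stdev:.1f}")
print(f"median={median} MAD={spread}")
```

Here the robust pair still describes the typical purchase, whereas the mean lands far above every value except the outlier.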
Gordon:
Now, you do have areas like low‑bias models, natural language processing,
things like that. In fact, we've had a pretty hard time coming up with the
models, and probably have done a better job when Google just crunches through a
huge amount of recorded data essentially. Can you give us an idea
of the spectrum of problems, predominantly those related to business,
social networking, advertising, and so forth? In what areas can you come closer
by throwing a bunch of data at the problem, and where are the ones where we really
do need to understand the problem better and come up with better
models?
Donnie:
The easiest distinction to make is how much data you have. If you can
throw more data at it, a lot of times it is cheaper to just throw more data at
the problem rather than trying to model it better. But in some cases, you have
a very limited data set, and you have nothing else to resort to but to try and
model it smarter. One example might be trying to understand very small subsets
of a group, and what those subsets are doing. Imagine that the
purchasing patterns of people who are blind might be very different from the
general population, and you want to understand what those people are like and
cater to them more effectively. But you're working with a tiny subset, one
percent of all of your data, and it might be hard or even impossible to collect
more of it.
Gordon:
Let's turn our attention to somebody who wants to get started in this
big data space. Maybe they've been a programmer, maybe working in open source
for a while. But maybe they don't have a formal statistics background, and they
want to get into this area. Or maybe they're a statistician
who wants to get in. Let's talk about some of the open source tools, and for
that matter knowledge, that they can pick up so that they can work more
effectively here.
First of all, for example, we have distributed file systems, which are a big part of
large data sets.
Donnie:
Yeah, a place that I would definitely get involved, if I were thinking
about working in big data, would be with the analytical tools themselves: 
tools like R, or like Python, which is actually starting to become a very
competitive solution for working with lots of data, with libraries like pandas.
Now, there is a lot of work going on to graph more effectively too, not just to
do the analysis itself. I'd certainly start there with the analytical tools.
Now, on the data side, Hadoop is really the default choice. Much like GitHub is
the default for version control, Hadoop is the default for Big Data.
There are lots of easy ways now to get going with Hadoop, whether it's using a Hadoop distribution
from one of the popular vendors or looking into something like Project
Serengeti, which is designed to help you set up a virtualized Hadoop cluster
very easily.
Gordon:
If we're a few years from now and looking back...I know predictions about
the future are always hard, but where do you expect we'll have seen the big
wins and, conversely, where do you think we might see some disappointment,
having come out of this Big Data enthusiasm today?
Donnie:
Five years out, we'll definitely have a much better understanding of what
the ROI is from working with Big Data. A lot of companies are
implementing Hadoop right now, but it's not really clear to them whether it's
going to be a five percent improvement or a 25 percent improvement, and whether
the costs will outweigh the benefits. One place where we'll see change is
there. Another example is, I think, the idea of data science right now is one
that's very exclusive sounding. What's going to happen is that's going to
become much more democratized, with tools like R and Python becoming
increasingly popular, not just within the data community, but in the broader
world of everybody who has to deal with data on a daily basis, whether that's a
business analyst or a software developer.
We're going to see things become much more democratized. We're going to see what the
true payoff looks like. Other than that, it's just going to be the same trends
continuing, in terms of what we've seen before with the popularity of GitHub.
Everything's going to become more intermingled.
What happens is, the cultures start to mix. Suddenly you get all these interesting
benefits that you wouldn't have realized were there, as those two cultures
start to mingle.
Gordon:
One of the interesting things with Big Data, as in many respects with much
of cloud computing, is how pervasive open source has become here. Linux was the
guy on the outside coming in, competing with proprietary tools that
were already in place. In many cases, with Big Data, it's not just a case of
the open source tools being more innovative or less expensive; they're
often the default choice, as you put it.
Donnie:
Yeah. A big part of it is, when you have an open source solution you can
have many companies collaborating together on it, to make it work much more
quickly than you could otherwise. So you end up with something that works next
month, instead of next year. This is what we've seen with Hadoop. It's a very
effective collaboration between a number of different companies around the
Hadoop ecosystem. That enables the user, which in this case is often a
developer or data scientist, to start using it much more easily and to get
features added much more quickly than they might otherwise. And as you know,
developers have a very strong preference for open source. Just by being open
source, it's already a step ahead of the competition.
Gordon:
Thank you very much. Anything else to add?
Donnie:
No. Thank you for having me on.
Gordon:
Great. Thanks, Donnie.