Thursday, June 27, 2013

Links for 06-27-2013

IT risk isn't just about security

Because it mirrors a point I often feel compelled to make when discussing security and cloud computing, I wanted to highlight a couple of paragraphs from Cloud Computing: Assessing the Risks by Jared Carstensen, Bernard Golden and J.P. Morgenthal as excerpted in Tech Target.

It's important to understand that this risk limitation [whereby service providers shift the primary responsibility for risk to consumers] is not unique to Cloud Computing. Outsource providers (e.g. firms that take over operating a company's IT data centre) also limit their financial responsibility in the event of an outage. Therefore, it is important not to regard this risk limitation as a complete restriction on using a Cloud provider, unless, that is, a company regards any risk limitation by a service provider as unacceptable. In that case, the company should continue to operate its own computing environment and forego use of an external Cloud provider.

The important point from this discussion is that when Cloud Computing security is raised as an issue, other issues are often being addressed. It's important to distinguish what type of issue is of concern, as that will change the method of evaluating the issue, the demarcation of the trust boundary and the appropriate actions to be taken by the Cloud user.

One of the reasons why I think this point is important is that framing overall IT governance discussions solely in terms of security (whether we're talking public clouds, private clouds, or--increasingly--some manner of hybrid IT) is far too narrow. This narrow framing, in turn, often leads to thinking about the issue in narrow technical terms such as multi-tenant security features, encryption and key management, and physical facility protection.

These are important matters, certainly. But they're also matters that public cloud providers (like other types of outsourcers) can reasonably argue they have well in hand using well-established procedures and processes. The more difficult answers about where workloads should run come down to broader questions--and those answers may well change over time.

(I covered some of these broader issues in a presentation at the Red Hat Summit in June. I'm hoping to get a version of that presentation, "Beyond Safety: Controlling Clouds," posted over the next month or so.)

Monday, June 24, 2013

Links for 06-24-2013

Friday, June 21, 2013

The GigaOm Structure 2013 zeitgeist


I like GigaOm Structure. I find it gives a good sense of the current zeitgeist in cloud computing and related areas. What's being talked about and what isn't? What new or reimagined techs are emerging as memes? 

GigaOm's own writers (among others) covered the event in considerable depth and I won't attempt to recreate their reporting here. Rather, I wanted to hit on some general themes I noticed and a few points that particularly struck me. So with no further ado and in no particular order, here we go.

OpenStack was omnipresent. Other on-premise IaaS? Not so much.

Mind you, there was still a bit of commentary about how OpenStack is still in relatively early days. Maybe. Ryan Granard of PayPal told the audience that his company runs 20 percent of its production infrastructure on OpenStack. As readers probably know, there was much ado a while back about PayPal's adoption of OpenStack--they were and are heavy VMware users. One of Granard's points, though, was that PayPal has a strategy of deliberately making several bets as a way of getting velocity while still having a robust infrastructure.

Notably absent from this Structure was the once ubiquitous Marten Mickos of Eucalyptus. Nor did I hear much mention of CloudStack--though Citrix did have a sponsor workshop, which I didn't attend.

Speaking of PayPal. A nice endorsement for PaaS and OpenShift.

Granard also articulated, as well as I've heard it from anyone, why PaaS is such a big deal for organizations. As reported by GigaOm's Jordan Novet:

Companies big and little have been jumping aboard the concept of on-premise PaaS, to some degree because security, regulatory compliance and cloud vendor lock-in fears remain part of the conversation about running on public infrastructure.

How is PayPal going about this? It’s been running Red Hat’s OpenShift on-premise PaaS to build out products such as PayPal Here — the company’s answer to Square — as well as a developer sandbox.

With that tool, Granard said, a developer chooses a product to work on “and in minutes, we have you up and running in a fully connected container” with infrastructure resources immediately allocated.

The real money quote for me though was that PaaS lets PayPal "enable developers and get out of the way."

x86 vs. ARM. Come back next year.

By which I mean that there were a few threads on this topic. Especially in the vein of whether ARM will make a meaningful dent in the server world. But no clear resolution.

To the degree that there was something of a consensus among folks I spoke with, it largely parallels my opinion and goes something like the following: x86 is the clear incumbent on the server. ARM is the clear incumbent in new-style mobile (tablets, cell phones, etc.). There's considerable inertia to that default condition for reasons of ecosystem and other things. For either architecture to make a major dent (narrow use cases aside) outside of its home base will require it to develop a 10x advantage--which most people don't think is going to happen.

What's that SDN stuff anyway?

There was some skepticism. For example, Arne Josefsberg, CTO of ServiceNow, said that “The conversations today sound almost exactly like the conversations we had three to four years ago." Indeed, the "hot or not" panel he sat on declared SDN a loser technology.

But that seemed to be a minority opinion. Session after session returned to the idea that the networking component of infrastructures needs the same sort of rewiring in software that you can do with compute and storage if the whole dynamic IT process is going to be realized. While I think it's fair to observe that there were still a lot of open questions about how we're going to get there (and what exactly it will look like), the consensus was squarely behind SDN--at least as a concept.

Private, private, hybrid

This topic really deserves a separate post, especially given that I was on a panel about private clouds at the ODCA Forecast event preceding Structure. Suffice it to say that it's a complicated topic for a variety of reasons:

  • Depending upon the specific requirements, there are strong economic reasons to choose private over public or vice versa.
  • Different organizations have strong predispositions for in-sourcing vs. out-sourcing.
  • Existing applications can't be ignored.
  • Regulation is a factor that may or may not be "fixed" (from the perspective of public clouds).

The bottom line is that there are plenty of arguments and cherry-picked examples available to bolster "your" side. That said, there was widespread agreement that, for at least the next n years (where n is much less agreed upon), the cloud world will be hybrid.


Podcast: Geo, mobile, and more on OpenShift

Red Hat's OpenShift lets you skip the unproductive work associated with setting up development environments, work that Red Hat developer evangelist Steven Citron-Pousty (@TheSteve0 on Twitter) calls "yak shaving." Listen to this podcast to learn how easy it is to get started with OpenShift.

Here are links to some of the posts and other topics covered in the podcast:
Spatial MongoDB in OpenShift, be the next FourSquare - Part 1 | OpenShift by Red Hat
REST web services with Python, MongoDB, and Spatial data in the Cloud - Part 2 | OpenShift 

Shameless plug: I'll be on a panel with GigaOm analysts talking about PaaS Tuesday, June 25: Red Hat-Flexibility with PaaS: How to Keep Your Options Open — GigaOM Pro.

Listen to MP3 (0:17:57)
Listen to OGG (0:17:57)

Monday, June 17, 2013

Links for 06-17-2013

Podcast: The past, present, and future of Linux systems management

The scale and dynamism of cloud computing are changing the way in which systems need to be managed. Red Hat Satellite product managers Todd Warner and David Caplan talk about how these changes are being manifested in both current and upcoming Red Hat products and associated open source communities, such as Puppet and Foreman.

Listen to MP3 (0:24:09)
Listen to OGG (0:24:09)


Gordon Haff:  Hi, everyone. This is Gordon Haff, Cloud Evangelist with Red Hat, and I'm down here in Raleigh, in the shiny new Red Hat Tower, talking with our two Red Hat Systems Management Product Managers, Todd Warner and David Caplan. Welcome, Todd and David. Todd, briefly introduce yourself, and then I will have David do the same thing.
Todd Warner:  Hello Gordon, my name's Todd Warner. I've been with Red Hat for approximately 10, 11 years now, and I've been associated with the Red Hat Satellite product for the majority of that time. Recently, I had the benefit of sharing that duty with David Caplan, who's now my co‑captain on the product. David?
David Caplan:  My name's David Caplan, and I'm a Principal Product Manager for the next generation Satellite. Todd has been working diligently on Satellite 5, and I'm picking up the reins on Satellite 6.
Gordon:  We'll get into a little bit about the actual products and the road maps later on in this conversation, but let's take things up a level to start with. Maybe you could talk to our listeners about some of the big trends that you're seeing in systems management today, and how they're driving product change. Maybe we can start with you, Todd.
Todd:  Let me talk about where the industry has been for some time, and then I'm going to turn it over to David to tell us where he sees it going in the future, and how systems management needs to address that, and then also how Satellite intends to address it directly for our customers. For the longest time now, physical systems have been the primary platform for customers to build workloads on top of, and often for small shops that's been a handful of machines. For larger shops, it might be some BladeCenters or even a data center. Satellite was built with that premise in mind, the concept of the data center.
Satellite was introduced in 2002 as a means to patch systems, provision systems, and build standard operating environments associated to systems, but times are changing. Virtualization used to be a very specialized world in the old days in computing, which was just a decade ago. Now it's very prevalent, relatively cheap, and taking over the industry.
Satellite has addressed that, but with recent trends in computing, Satellite has to be more nimble, more adaptable. With that, we're talking about cloud, hyperscale, things like that. Let me turn it over to David to talk a little bit more about that.
David:  Thanks, Todd. The world of IT, in general, and systems management in particular, has certainly grown increasingly complex. The challenges of standing up servers on bare metal have now expanded to include problems like virtualization, as Todd mentioned, but not just one flavor of virtualization. There are competing standards, including RHEV/KVM on the Red Hat side. There's VMware, of course. EC2. There's OpenStack, and then there's all the variations on OpenStack that are beginning to emerge. Systems management has to evolve alongside so that it can be nimble and basically handle these different provisioning requirements, but still keep its eye on bare metal. Bare metal is also going through some radical transformation, as we go to data center densities of 10,000 servers to hundreds of thousands of servers, as we get into the world of hyperscale. That is something that we are watching here at Red Hat.
Open source is another emerging trend that has captured the imaginations of information technology [people]. It promises, with do‑it‑yourself techniques, to take care of some of the more daunting problems of configuration, drift management, and whatnot.
What we're trying to do at Red Hat, in systems management, is to take all the benefits of open source, all the innovation, and envelop it in workflows and structure, and really allow customers to derive the full value of open source innovation, but not be burdened by all the do‑it‑yourself vagaries and cul‑de‑sacs.
Gordon:  One of the interesting things I find talking about systems management and open source is I think, arguably, systems management was a relatively late field for open source. There have been some projects around for a while, mostly in the monitoring area, but I think systems management has tended to be such a large surface area type of application that it's been really hard for open source to tackle. Today, there really are some open source technologies, including some that we are using, that have really started to have a pretty major impact.
David:  Absolutely, Gordon. For many decades, systems management was controlled by a set of well‑recognized brands. The legacy providers who built very high‑quality stuff, but the iteration and improvement tended to be very slow once these things were deployed. On the other hand, open‑source moves at a very, very fast cadence, tends to be less encumbered by what's been done before, and can take a fresh look at solving problems. An example is Puppet, something that has in many respects exceeded all expectations, and certainly capabilities beyond what the legacy suppliers have been able to do, the proprietary shops.
Todd:  I did want to add to what David was saying: systems management's changed, not only technologically, but also in processes, workflow, sophistication of how people model their end systems. Just 5, 10 years ago, it was only common in the big houses, the big IT firms, where you would model systems and layers, develop well‑designed SOEs, and have teams that owned each little piece of the layer of the application that went out the door. That's becoming more common all the way down to even the smaller IT groups, because it's more accessible, and the tooling is much better. Open‑source has really been a cost‑effective way of getting that level of technology, that reach of technology, in the hands of people that can't afford the million‑dollar systems management deployment.
The open‑source community is really outmaneuvering the larger systems management houses, because they can focus on the small tasks, doing very well. Then, folks like Red Hat helped tie all this together and build a better, larger system.
Gordon:  Let's make things a little more concrete here, and let's talk about some of the specific things that are going on at Red Hat. By way of context here, what we're discussing in this podcast is, specifically, the systems management area. It's certainly not the only thing Red Hat's doing in management. We acquired a company called ManageIQ last December, which we're combining with some of our in‑house developed open source technology into an overall open hybrid cloud management product.
What we're going to talk about here, what we're really going to dive down here, rather than this CloudForms hybrid cloud management, is specifically the systems management side of things.
I thought maybe you could start out by describing what Red Hat means when it says "systems management," and where we are and where we're going with our systems management product that we're shipping today.
Todd:  According to Red Hat, systems management really is building that flow of defining the system, deploying that system, managing that system over time, and then recycling that system. Managing many of those, many of those definitions, many of those systems, and being able to manage that at scale. Being able to build policies surrounding that, for example security, patching, configuration management, and things like that. To Red Hat, that's really the definition of systems management. When we relate that to the ManageIQ acquisition, the driving technology behind CloudForms, it ties back to the future trends David was talking about: our systems management technologies, in this case Satellite, have to adapt so that we can leverage those technologies. Like, for example, the cloud technologies that ManageIQ is bringing to bear.
With Satellite, we have the systems management, that defining systems, managing systems, recycling those systems over time, at scale, physical, virtual, and now cloud.
Gordon:  David, talk about where we're going.
David:  That was an excellent description, Todd, of Red Hat and systems management, and we're building on that in our next generation Satellite, known as Satellite 6. If I could summarize in one sentence what we're attempting to achieve in Satellite 6, it would be bare metal to cloud in a single workflow. It's the recognition that most of the work today happens in the cloudy domain, whether that's hybrid cloud, whether that's private cloud, or in the public cloud. But getting there and automating the steps from new bare metal to the new world of abstracted resource is not easy for most IT customers.
Satellite 6 is designed to begin the process with bare metal discovery, in ways that previous systems could not achieve. Primarily because Satellite 6 is unaffiliated with any one supplier of hardware.
It has to do this bare metal for all of our partners and even hardware we've never seen before. Once we have things discovered, we need to register it, we need to provision it, and we do that with end‑to‑end automation. The next step is configuration, and we do configuration with recipe‑based solutions.
The current system built today for this is Puppet. We are leveraging all the power of Puppet, and building Satellite 6 around the Puppet ecosystem, by introducing something called an "external node classifier." The combination of Puppet, of Satellite 6, and the legacy that we have built on for managing content and entitlements, should provide customers with a very, very capable solution. Not just for now, but hopefully for another decade.
Gordon:  Todd, maybe you could tell our listeners specifically, whether they're current Satellite customers or someone who's interested in doing Linux systems management, what they can get from Red Hat today.
Todd:  Thank you, Gordon. Red Hat Satellite, we just released. I shouldn't say "just." At the end of 2012, we released Satellite 5.5, and 2012, that was our 10‑year anniversary for Satellite. It's a very mature product. With that release, we were really focused on modernization and compliance features. For example, "modernization" meaning keeping up with the times, IPv6, in particular, in this case. "Compliance" meaning I want to take a policy that another group gives me, apply it to a system, and report back if that system's compliant or not. Additionally, we had some content management improvements in that release, and some generalized scalability and network bridging technologies added within that release.
Satellite 5.5 was an incremental release for us that was released in October 2012. Coming this year, we have Red Hat Satellite 5.6 coming in this fall. We're really excited about this release, in that one of the key things we're bringing to bear in this release is that we're adding improved reporting. Our customers want to be able to understand how they are consuming product from Red Hat, and operating system resources better than they have in the past, and we hope to bring that coming this fall in 5.6.
Additionally, lots of improvements as far as manageability of the product. Hot backup support, being able to split the product into two so you can scale it out better when you install it. This is the server‑side piece of Satellite that you can split into two. We want to improve the way it scales.
We also are improving our ability to do client‑side introspection, as far as troubleshooting. A core operating system service goes down, Red Hat Enterprise Linux will send the details of that crash to Satellite so that, in one console, the administrator can see why that system had issues.
We're also giving some options as far as the database that Satellite actually utilizes. We're introducing PostgreSQL as an option in Satellite 5.6.
We have some coming releases in the next couple years after that. We have currently planned out two more releases after that, into 2015. I don't really want to go into details surrounding those. They're still nebulous and in motion, but we do have, currently, 5.6 planned for this fall, a 5.7, and a 5.8, all the way out to 2015.
What's more exciting, in my opinion, Satellite 5 continues to grow and mature, and it's going to be a supported platform for many years to come. In parallel to that, we're developing our next technology Satellite, which we're calling Satellite 6, and David will talk more about that and where we're going with that.
Gordon:  David, maybe you could share a little bit of detail about what people should expect, when they should expect it, what the use cases are that Satellite 6 might make the most sense for.
David:  Certainly. Thanks, Gordon. Thanks, Todd. Satellite 6 is currently under development now, and it's a system that was built from the ground up. A lot of the capabilities of Satellite 6 are derived from Satellite 5, and would be familiar to Satellite 5 customers, and familiar to Red Hat customers who are new to systems management, because the problems that it solves are very familiar. Satellite 6 is really broken up into two major components. There's a content and entitlement piece, which takes its cue, to a large degree, from Satellite 5. Some of the changes are the introduction of our customer portal as the main access for Red Hat content. Our previous access point, Red Hat Network, served Red Hat well for many, many years, but our own success has basically caused bottlenecks in getting customer access to content when they want it and where they are located.
Satellite 6 uses a worldwide content distribution network, and it plumbs to the points of presence of that network that are closest to where our customers are. It syncs content very efficiently into a kind of a common content mirror.
From there, there are exciting new capabilities, where customers are able to create very special content containers, which are called "content views," that are similar in concept to the channels that Satellite 5 supported, but are much more performant than what we are able to do with channels, so newer technology there in delivering these things.
The other part of Satellite 6 is entitlement management. Entitlement management is very important to our customers, so a lot of effort has been put into really superb and granular reporting of subscription consumption.
The other half of Satellite 6 is concerned with provisioning and configuration. Where Satellite 5 did managed kickstarts and then used configuration channels for files and other configuration information, Satellite 6 is built on an upstream project called The Foreman. It wraps itself around Puppet in a way that simplifies the construction and manipulation of kickstart files, the introduction of special, late‑binding override parameters, and a smooth and seamless handoff to Puppet.
Where the two systems are integrated seamlessly is in the content delivery part. When Puppet runs and extracts content, when it's doing its work of provisioning a server, that content is coming from the content management system that I described previously. It's a very tightly controlled set of processes. When systems are up and running, they register for errata and can be repurposed or re‑provisioned at any time. That's basically the 10,000‑foot view of Satellite 6.
Gordon:  Great. Thanks, David. Can you maybe talk a little bit about the managed design program?
David:  Of course. The timetable for Satellite 6 is roughly a year from now, so we're talking about June of 2014. It's a big program, and it has a lot of buttons and knobs. What we are hoping to do is to deliver something at that time that is ready to go, and is familiar, and useful to the largest number of Red Hat customers. The way we're getting to that, or achieving that goal, is with a special program called the "Managed Design Program." The idea of this program is to pre‑release Satellite 6 at three different stages between now and GA, and let our customers experiment with the software, exercise the workflows, and then give us feedback about what could be better and what they love about the product. We're hoping to hear a lot about what they love about it.
It also allows us to have a much closer relationship with those customers who have a real stake in open source systems management and Red Hat.
This program is a big win for our effort to build the right product at the right time, but it's also a win for our customers, because they will be able to see many of their great ideas and innovations incorporated into something that they can subscribe to when the product becomes generally available.
The first MDP drop will happen this summer, and then the others in three‑month cadence.
Gordon:  Briefly, what would a typical customer getting involved in the Managed Design Program look like?
David:  The ideal MDP participant would have Satellite 5, be using Satellite 5, and exercising many of its complex workflows. Whether it's patching, provisioning, configuration, content management. Those types of flows will be covered in the MDP version of Satellite 6, so that would be very important to us. Equally important are customers that are already ahead of the curve, and are using Foreman and Puppet today, and are looking for ways of tying it back into their content and content versioning. That would be the second candidate.
Lucky for us, as we've gone out and talked to many of our customers, we find that many, many Satellite 5 customers are already using Puppet today. They're experimenting with Puppet, and they're using it the do‑it‑yourself way, and looking to Red Hat for guidance, and looking to Red Hat for enterprise expertise in tying these things together. That basically describes the ideal candidate, Gordon.
Gordon:  I think one of the takeaways from here, and I think Todd, you may have been the one that used the term, "hyperscale," is as we look to how cloud computing is developing, one of the hallmarks is you're talking about orders of magnitude difference in terms of the increase in the number of running instances under management. Trying to handle that kind of thing manually certainly hasn't been, necessarily, a very good idea for many years now. When you're talking thousands of instances in even a moderate‑sized organization, that simply can't be done manually or you're just asking for runaway administration costs, and for that matter, runaway compliance problems.
David:  Right, Gordon. In the past decade at Red Hat, we've seen our customers expand from tens of machines to 100 machines, to thousands of machines, to tens of thousands, to hundreds of thousands of machines being managed. They're able to do this because A, the technology is getting cheaper at the client side, and B, the workflow and processes needed to manage those systems have been improving over time.
Satellite's been right there with them the whole time, but we are seeing the challenge with the boom in number of systems represented by virtualization and cloud, that Satellite has to be able to adapt to extreme large numbers of systems. How does a customer deal with this issue? The fact that they have workloads that are spanning many, many systems.
One way is to expand the number of actual human beings working on those problems, but that's not realistic. Satellite exists today and it does expand to thousands of systems, but we're working towards Satellite 6 being positioned to better manage many, many, thousands of systems with a reasonably small team of administrators. That's the challenge today. That many systems, and make sense of it, to a reasonable number of people.
Gordon:  Fundamentally, this is just one facet of everything has to be automated in a consistent way. Thank you David and Todd. Anything you'd like to add?
David:  No, only that it's a pleasure doing this podcast, I hope our customers find this interesting. Satellite is a tremendous product, it has tremendous success with our customers, and I hope you're encouraged that we're not standing still. Satellite is improving, and we're making investments in new technologies in the open source world with existing product today, and where we're going in the future.
Todd:  We're very excited about what will be coming out through our MDP program and then through our general availability of Satellite 6. Pay attention, check out and see what's going on with the latest systems management, and thank you very much.
Gordon:  Thank you David, thank you Todd.

Wednesday, June 05, 2013

Stop data silos

Apropos of the post I put up earlier today, this sponsored post on GigaOm notes that:

Data silos and integration challenges, more than security, are the biggest barriers to cloud adoption. Siloed data in cloud apps and data centers is costing companies millions annually due to inconsistency, inaccuracy and inefficiency across the business. And the enterprise software market is crossing the threshold of another transformation, now that cloud computing has shifted the center of gravity for data.

The MIT Media Lab's Sandy Pentland noted something similar at the MIT Sloan CIO Symposium a few weeks back when he said that "About 20% of big data is getting the data out of the silos and transforming it." 

That's why at Red Hat we're so interested in the idea of open hybrid cloud--which includes open hybrid storage. (Starting with Red Hat Storage based on GlusterFS.) This isn't to minimize the important, and really difficult, role of organizational change in breaking down the silos. But there's at least technology available to help break down silos rather than aid in their creation.

Links for 06-05-2013

Why we should probably retire "Big Data"

Data is hugely important. It long has been, of course. It's often argued that applications are the longest-lived IT asset. Arguably the data that those long-lived applications create and access sticks around for at least as long. Data is only going to become more important.

Is there a lot of hype around data today? Sure. Raw data isn't information. And information doesn't necessarily lead to useful action if organizations (or their customers and users) aren't willing to change behavior based on new information. Just because pricing mechanisms can be used to reduce traffic congestion based on sensor data--just one of the ideas discussed under the "Smart Cities" term--doesn't mean that even basic congestion pricing is necessarily politically viable. 

Furthermore, I strongly suspect that the lots-of-data hammer will turn out to be a rather unsuitable tool for certain (perhaps many) classes of problems--the enthusiasms of the "End of Theory" crowd notwithstanding. For example, it's unclear to what degree more data will really help companies design better products or target their ads better. (To be clear, there's long been lots of market research in the consumer space. It's just that it hasn't been terribly effective in creating the killer ad or the killer product.)

But it would be a mistake to think that today's data trends are 1990s-style Data Warehousing with just a fresh coat of paint. Whatever the jokes about the local weatherman, weather forecasting has improved. Self-driving cars will happen, though they may take a while to come into mainstream use. DNA sequencing is now commonplace--although, in a common theme, we're still in the early days of figuring out what we can (and should) do with the information obtained. And we're well on the way to sensors of all sorts becoming pervasive. 

Which makes the "Big Data" term somewhat unfortunate in my view. I realize that may seem a bit of a contradiction given what I wrote above. Let me explain.

My first problem is that "Big Data" is too narrow. This is true even if we use the term in the broader sense of data that is atypical in some respect--not necessarily in its volume. The four Vs is a common shorthand. (For a less precise, but possibly more accurate description, I like the "Difficult Data" term that I heard from the University of Washington's Bill Howe.)

But an emerging culture of data doesn't have to be about big or even difficult. Discussions about data at the MIT Sloan CIO Symposium last month included big data examples, but it was also, in no small part, about cultures of data and the breaking down of silos. Just as with IT broadly and cloud computing, data and storage have to increasingly be based on a hybrid model in which data can be accessed when and where it is needed and not locked up in one department or even one organization. Governments and others are increasingly making open data available as well. 

It's also worth remembering that Nate Silver made headlines for calling the last US presidential election correctly, not because he did big data stuff or even because he applied particularly innovative or sophisticated analysis to polling data, but mostly because he used data and not his gut.

The second issue I have with "Big Data" isn't really that term's fault. Rather, it's that "Big Data," today, is so frequently conflated with Hadoop.

Based on Google's MapReduce concept, Hadoop divides data into many small chunks, each of which may be processed or re-processed on any node in a cluster of servers. Per Wikipedia: "A MapReduce program comprises a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies)."
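To make the Wikipedia example concrete, here is a minimal single-process sketch of the students-by-first-name job in plain Python. It is not Hadoop's API and nothing here is distributed; the function names (map_phase, shuffle, reduce_phase, name_frequencies) and the sample student list are made up for illustration. In a real cluster, the map and reduce steps would run in parallel on different nodes, each working on its own chunk of the data.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for each student record."""
    return [(full_name.split()[0], 1) for full_name in students]


def shuffle(pairs):
    """Group the emitted values by key: one 'queue' per first name."""
    queues = defaultdict(list)
    for first_name, count in pairs:
        queues[first_name].append(count)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {first_name: sum(counts) for first_name, counts in queues.items()}


def name_frequencies(students):
    """Run the three steps in sequence (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(students)))


# Hypothetical sample data, not taken from any real data set.
students = ["Ana Ruiz", "Bo Chen", "Ana Lee", "Cy Jones", "Bo Park", "Ana Kim"]
print(name_frequencies(students))  # {'Ana': 3, 'Bo': 2, 'Cy': 1}
```

The point of the structure is that map_phase looks at each record in isolation and reduce_phase looks at each queue in isolation, which is what lets a framework like Hadoop spread the work across many machines and re-run any piece that fails.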

Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. (The standard file system is HDFS, but other filesystems, such as Gluster, can be substituted for higher scalability or other desirable characteristics.)

Hadoop is often a useful tool. If you can split up data sets and work on them with some degree of autonomy, Hadoop workloads can scale very large. It also allows data to be operated on in situ without being loaded and transformed into a database, which can greatly decrease overhead for certain types of jobs. (This presentation by Sam Madden at MIT CSAIL offers some benchmarks as well as some pros and cons of Hadoop relative to RDBMS systems.)

However, data can be processed and analyzed using a wide variety of tools, including NoSQL databases of various kinds, "NewSQL" databases, and even traditional RDBMSs like PostgreSQL (which can still scale sufficiently to handle a great many types of data analysis and transformation tasks). In fact, we even see something of a trend with some of the new-style databases adding back in traditional RDBMS features that had been assumed to be unnecessary.

Even high volume data doesn't begin and end with Hadoop. As Dan Woods writes for CITO Research: "The Obama campaign did have Hadoop running in the background, doing the noble work of aggregating huge amounts of data, but the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed."

Hadoop is an important tool in the kit when the amount of data is large. But there are lots of other options for that kit bag too. And never forget that it's not just about the bigness of the data but whether you share it with those who need it and whether you do anything with it.

Monday, June 03, 2013

Links for 06-03-2013