Thursday, August 29, 2019

Podcast: OpenDataHub brings together the open source tools needed by data scientists


OpenDataHub is a reference architecture that brings together tools for AI and data work. In this podcast, I talk with Red Hat's Steve Huels, Sherard Griffin, and Pete MacKinnon about the motivations behind OpenDataHub, the open source projects that make it up, how OpenDataHub works in concert with Kubernetes and Operators on OpenShift, and how it reflects a general trend towards open source communities working together in new ways.

Show notes:
- An AI and Data Platform for the Hybrid Cloud

MP3 - 22:11


Steven Huels:   I am responsible for the AI Center of Excellence at Red Hat.
Sherard Griffin:   I am in the AI Center of Excellence as well, responsible for the OpenDataHub and the internal running of the data hub at Red Hat.
Pete MacKinnon:   Red Hat AI Center of Excellence, helping customers and ISVs deploy their machine‑learning workloads on top of OpenShift and Kubernetes.
Gordon:  We heard "OpenDataHub." Let's tell our listeners what it is.
Sherard:  OpenDataHub, that's an interesting project. It's actually a metaproject of other community projects that are out there. It's basically responsible for orchestrating different components that help data engineers, data scientists, as well as business analysts and DevOps personnel manage their infrastructure and actually deploy, train, and create models, and push them out for machine learning and AI purposes.
What we tried to do is take things that are being done in the community through open‑source technologies, wrap them up in something that can be deployed into Red Hat technologies such as OpenShift, and leverage some of the other technologies like Ceph, JupyterHub, and Spark, making it easier and more accessible for data scientists and data engineers to do their work.
Gordon:  I'm going to put an architecture diagram in the show notes. Could you, at high‑level, describe what some of the more interesting components are, what the basic architecture is for OpenDataHub?
Sherard:  One of the key things that you'll see when you deploy OpenDataHub inside of OpenShift is that it solves your storage needs. We collaborated with the storage business unit at Red Hat.
We allow users to deploy Ceph object storage, where they can store massive amounts of unstructured data and start to do data engineering and machine learning on top of it. Then, you'll see other popular components like Spark, where you're able to query the data that's sitting inside of Ceph.
You can also use Jupyter notebooks for your machine‑learning tools and be able to interact with the data from those perspectives.
That's three high‑level components but there are many more that you can interact with that allow you to do things like monitoring your infrastructure, being able to get alerts from things that are going wrong, and then also doing things like pushing out models to production, testing your models, validating them, and then doing some business intelligence on top of your data.
Gordon:  We've been talking at an abstract level here. What are some of the things that our listeners might be interested in that we've used OpenDataHub for?
Pete:  There's a variety of use cases. It is a reference architecture. We take it out into the field and interact with our customers and ISVs and explain the various features.
Typically, the use cases are machine learning where you have a data scientist who is working from a Python notebook and developing and training a model.
ETL is an important part of machine‑learning pipelines. OpenDataHub can be used to pull that data out of data lakes, perhaps stored in HDFS on Hadoop, and feed it into the model development process.
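As a rough illustration of the ETL step Pete describes, here is a minimal, self‑contained sketch; the field names and sample data are invented for the example. It extracts rows from a CSV export, transforms them into typed records, and loads them as JSON lines ready for model development.

```python
import csv
import io
import json

# Invented sample data standing in for an export from a data lake.
RAW_CSV = """timestamp,cpu_pct
2019-08-01T00:00:00,41.5
2019-08-01T00:05:00,97.2
"""

def etl(raw_csv):
    """Extract CSV rows, transform to typed records, load as JSON lines."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    records = [{"ts": row["timestamp"], "cpu": float(row["cpu_pct"])}
               for row in reader]
    return "\n".join(json.dumps(r) for r in records)
```

In a real pipeline the extract stage would read from HDFS or object storage and the load stage would write to wherever the model-development tools expect their input; the three-stage shape is the same.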
Sherard:  More broadly, the OpenDataHub internally is used as a platform where we allow users to import data of interest themselves and just experiment and explore data.
Whether they want to build models and publish those models, it's really an open platform for them to experiment in, without having to worry about standing up their own infrastructure and monitoring components.
More specifically, we use it for use cases that ultimately make their way into some of our product focus. We're experimenting with things like how do we look at telemetry data for running systems and predict future capacity needs. How do we detect the anomalies and then drill down to the root cause analysis and try to remedy those things automatically?
These are all some of the use cases that we're using the OpenDataHub for. We feel a lot of these, obviously, have resonated with our customers, and they mirror the use cases they're trying to solve. Those are just a couple of the ones we're doing internally.
Gordon:  We've been talking about the internal Red Hat aspect of OpenDataHub. Of course, this isn't just an internal Red Hat thing.
Sherard:  Correct. As Pete mentioned, OpenDataHub is a reference architecture. It is open source, and we use it as a framework for how we talk with customers around implementing AI on OpenShift.
The way OpenDataHub has been broken apart and architected hits on a lot of the key points of the types of workloads and activities all customers are doing.
There's data ingestion. There's data exploration. There's analysis. There's publishing of models. There's operation of models. There's operating the cluster. There's security concerns. There's privacy concerns. All of those are commoditized within OpenDataHub.
Because it's an open source reference architecture, it gives us the freedom then to engage with customers and talk about the tool sets that they are using to manage their use cases.
Instead of just having a box of technology that's maybe loosely coupled and loosely integrated, we can gear the conversation toward, "What's your use case, what tools are you using today?" Some may be open source, some may be Red Hat, some may be ISV partner provided. We can work that into a solution for the customer.
They may not even touch on all of those levels that I discussed there. What we've tried to do is give an all‑encompassing vision, so we can build out the full picture of what's possible and then solve customer problems where they have specific needs.
Gordon:  Again, as listeners can see when they look at the show notes, there is an architecture diagram there. For example, there are a number of different storage options, a number of different streaming and event‑type models, and a number of other types of tools that they can use. Of course, they can add things that aren't even in there.
Steven:  If they add them, we'd love for them to contribute them back, because that's the entire open source model. We use the OpenDataHub community as that collection point for whether you're an ISV, whether you're an open source community. If you want to be part of this tightly integrated community, that's where we want to do that integration work.
Gordon:  That's where the real value of open source, and of OpenDataHub, is. It makes it possible to combine these technologies from different places, have outside contributors, and get outside input in a way that I don't think was ever really possible with proprietary products.
Pete:  That's where it really resonates with Red Hat customers: they finally see the power of open source in terms of actually solving real use cases that are important to their businesses. All the components are open source, the entire project is open source, OpenShift itself is open source. It's an ethos that infuses everything about OpenDataHub.
Sherard:  I would add to that. One of the interesting points that both Steven and Pete brought up, is how customers have gravitated towards that. A lot of that is because we're also showing them that, hey, you've invested in Red Hat already or you've invested in RHEL, you've invested in containers.
In order for you to get that same experience that you may see from AWS SageMaker or Microsoft's cognitive tools, you don't have to go and reinvest somewhere else.
We can show you through this reference architecture how you can take some of Red Hat's more popular technologies and use those same things like OpenShift, like RHEL, like containers and be able to have that same experience that you may see in AWS, but in your own infrastructure.
Whether that's something that's on‑prem or whether that's something in the hybrid cloud or even an OpenShift deployed in the cloud, you're able to move those workloads freely between clouds and not feel like you have to reinvest in something brand new.
Gordon:  Well, the interesting thing about OpenDataHub, and I think also about OpenShift, for example, is that over the last few years we've really started to see this power of different communities, different projects, and different technological pieces coming together.
Certainly, open source has a long history. But with cloud‑native in the case of OpenShift, and with the AI tools coming together in something like OpenDataHub, we're seeing more and more this strength of open bringing all these different pieces together.
Sherard:  Yeah, absolutely. OpenDataHub first started out as an internal proof point. How can we, with open source technologies, solve some AI needs at Red Hat? What we quickly understood is, there's more of a life cycle that machine learning models have to go through. First starting with collecting data all the way through to a business analyst and showing the value of a model.
That allowed us to map out all the different parts of the life cycle and then start to figure out, "How can we introduce open source technologies at each stage of that to help the process along?" As we've discovered what those processes are, we've internally deployed those into our own systems.
We're working towards getting a more robust system that solves our own problems. As we do that, we share that with the broader community saying, "Hey, here are the open source tools we did for each part of the life cycle of a machine learning model and here's how you can do the same thing."
Gordon:  And it even really goes beyond that. We're here at a conference at Boston University. For example, there's a lot of work being done on the research side between Red Hat and BU on privacy‑preserving AI techniques. That's too much to get into in this podcast, but that's part of the whole mix too.
Steven:  Privacy, obviously, with a lot of the trends that people have seen with some of the major companies out there, is a very hot topic. There's a lot that Red Hat has already, historically been doing in the space. Whether it was looking into things like multi‑party computing to preserve the anonymity of certain data sets, that's something we've been doing for quite some time.
There's other things we're looking into too, things like differential privacy. How do we allow access to data for analysis from multiple parties while still preserving that anonymous component? Then even beyond that, we're starting to look into things like data governance.
What exists in the open source world for data governance? How do I adhere to and maintain my GDPR compliance? These standards are only going to continue to emerge as more and more data gets collected on people. They're very hot topics. They are things that Red Hat is actively involved in and has a voice in going forward.
Gordon:  Outside of Red Hat, what are we seeing in terms of interest in and adoption of not just the individual technologies but OpenDataHub more broadly?
Steven:  If I even pull that a step back, the reason why a lot of this technology now is taking off the way it is, is because industry has readily adopted a lot of the open source standards. They started to expect the open source frameworks to support their use cases.
It's not enough that you simply have a single component that can deliver one piece of value. You want an integrated suite that solves a whole myriad of your problems. In doing so, there's a natural correlation and integration that has to occur.
That's being done in pockets in different areas. They're solving maybe niche use cases. When you look at something like OpenDataHub, it's actually crossing the spectrum and the boundaries of what it takes to operationalize the solutions to these problems.
Historically, a lot of these problems were solved by individuals who could do something on a very high‑powered workstation on their desk, but their solutions never made their way into production. The value companies were able to get from them was very, very limited. They made great PowerPoint slides, but they never really delivered any value.
Companies now expect that value to be delivered. The OpenDataHub and that type of framework is what allows something to be put into operations and then maintained, like you would any other sort of critical application.
Gordon:  I think the other thing that happened is if we looked at this space a few years ago, it almost looked like people were looking for that magic bullet, like Hadoop for example, "Oh, this is going to solve everybody's data problems."
What we're seeing is you need a toolkit. You don't necessarily want the toolkit that you have to go all over the vast Internet and assemble from scratch, figuring out which projects do the job and which ones are works in progress. OpenDataHub addresses that, it would seem.
Pete:  In recent talks, I've started off by talking about the AI Winter, this notion of a cycle of enthusiasm and investment by companies and other institutions in artificial intelligence and machine learning, only to ultimately see it fall by the wayside, fall apart.
Everything has changed now. I think open source is a big component of that change because, as Steven was saying, these various individual component projects like JupyterHub, they're their own communities, but those communities are starting to interact with each other in various ways.
What's been missing is an integrated suite like Steve was talking about. That's what we're trying to do with the OpenDataHub, something that provides a comprehensive AI and machine learning platform to defeat the cycle of AI winters that come and go.
Sherard:  I would also say, what we've seen when we talk to customers is that they're all at so many different stages of their AI journey. Some of them are at the very beginning. They just want to know what AI is and what that means to their organization. Some of them are at the tail end where they've developed models, and they don't know how to productionalize it.
One of the things we're able to do is take the OpenDataHub as a grounding moment for us to all have the same basic conversation. Then we can start to talk to them and say, "Hey, if you're just starting out, here's something you can start out with, or if you're ready to productionalize it, the OpenDataHub can help you conceptualize and have a reference architecture for how to productionalize it."
It allows us to just have conversations with our customers to let them, first, understand that we know the problem because we're doing it internally, ourselves. But then also, as we work with other customers and get more information about what they may want to do with governance, with security, with auditing, all these other things that happen before you go push it to production.
How can we go about it in a broader community sense where we're tackling all of this together, and we're really pushing the envelope, where everybody's starting to contribute and we're getting feedback from the customers, but we're also providing guidance?
Gordon:  How does a new person or organization typically get involved, whether it's OpenDataHub specifically, but these types of programs, what's your general recommendation?
Pete:  Having served in various open source communities, the best practices for getting involved become readily apparent. There's typically a lot of enthusiasm; somebody identifies a project that they think is going to be great to work on.
It's very important to approach the community and understand how the community conducts itself. Typically, open source communities these days have exactly that, a code of conduct. There's best practices around things like GitHub in terms of forming a pull request if you have a new feature or bug request to help the community.
Also, there's modern chat applications like Slack, and Google Chat, and things like that where these communities form around. It's always good to come into those arenas hat in hand, as it were, and be humble about what it is you're trying to do. Ask questions, listen to the conversations, and build up value and present that value back to the community.
Gordon:  You're saying that if I want to get involved, I shouldn't just join the mailing list and tell everyone they don't know what they're doing?
Pete:  I wouldn't recommend that, no.
Gordon:  How would you describe, very high level, the state of OpenDataHub today and what we should expect to see for next steps, what should our expectations be?
Sherard:  The state of OpenDataHub, it's ever‑evolving. We're meeting with customers, understanding what their use cases are, trying to see how do we solve the use cases with something with a reference architecture like the OpenDataHub and see where the gaps are.
If you look at when we first started this earlier this year, and we first put out our first Operator that deployed the OpenDataHub...
Gordon:  Operator?
Pete:  Talking about OpenShift and Kubernetes, an Operator is a powerful new paradigm where you basically encapsulate the application lifecycle for a particular component. The Operator interacts with the Kubernetes and OpenShift APIs to do full lifecycle management of that component. That's the 50,000‑foot view.
Gordon:  Yeah. To add a little bit to that, having seen a workshop on this topic yesterday: the idea is that for something with a lot of components like OpenDataHub, you give the Operator the job of installing your Kafka eventing system, your Spark cluster, your Jupyter notebooks, and so on.
The idea is, it's as if you had an expert on the system come in and spend a couple hours installing things for you.
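To make the Operator idea concrete, here is a toy sketch of the reconcile loop at the heart of the pattern. The component names are illustrative, and a real Operator would watch and act against the Kubernetes API rather than comparing plain dictionaries.

```python
# Toy reconcile loop: compare the desired state (what a custom resource
# asks for) with the observed state (what is actually running) and emit
# the actions needed to close the gap. A real Operator performs these
# actions via the Kubernetes/OpenShift API and runs this loop continuously.
def reconcile(desired, observed):
    actions = []
    for component, wanted in desired.items():
        running = observed.get(component, False)
        if wanted and not running:
            actions.append(("deploy", component))
        elif not wanted and running:
            actions.append(("remove", component))
    return actions

# Example: JupyterHub is requested but absent, Kafka is running but
# no longer requested, Spark is already in the desired state.
plan = reconcile(
    desired={"jupyterhub": True, "spark": True, "kafka": False},
    observed={"spark": True, "kafka": True},
)
```

The point of the pattern is exactly what Gordon describes: the expert's installation and upkeep knowledge is encoded once, in the loop, instead of being re-applied by hand on every cluster.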
Sherard:  Yeah, that's exactly right. That's why we gravitated towards Operators pretty early. The OpenDataHub Operator is a meta‑operator. It even deploys other operators, like Strimzi for Kafka. We're also working with the Seldon team, and we're going to be looking at integrating some of our other partners into that ecosystem.
What I was getting at is, where we were earlier this year was, the OpenDataHub was really focused on the data scientist use case trying to replace the experience of all of your data scientists across an organization doing work on their laptops.
Certain people may have different components installed. Everyone's doing pip installs with different versions. You have all kinds of dependencies that are very specific to that data scientist's laptop or workstation.
What we tried to do is solve that by introducing this into Kubernetes so that we have a multi‑user environment in OpenShift, so that everybody has the same playing field, every user's using the same suite of tools. They're using the same suite of dependencies and same versions of packages so it makes it easier to collaborate.
Once we did that, the next step was to start to introduce more of a management of your machine learning models. Now, we've introduced Seldon where you can actually deploy your model as a REST API. Then, we also introduced Kafka for data ingestion down into your object storage. We also had the ability to query the data using Spark.
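Seldon handles model serving on OpenShift; purely to illustrate the idea of "deploy your model as a REST API," here is a stdlib‑only stand‑in. The "model" is an invented linear scorer, not anything Seldon actually ships.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "model": a fixed linear scorer standing in for a trained model.
def predict(features):
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    """Accepts POST {"features": [...]} and returns {"prediction": ...}."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve (blocks the process):
# HTTPServer(("localhost", 8080), PredictHandler).serve_forever()
```

The value of the pattern is that downstream applications consume predictions over plain HTTP without knowing anything about the training framework behind the endpoint.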
Coming down the pipeline in the next month here, we're going to be introducing tools for the data engineer. We're looking at how you catalog data that's stored in the object storage. This is Hive Metastore, but we're also introducing technologies on top of that, such as Hue, which will allow you to manipulate the data before the data scientists even get there.
The reason that we decided on that is because we all know that before you do machine learning, data just doesn't come in cleanly. It's not perfect right out the gate. We knew that there was a step missing in enabling data engineers to massage and clean that data before the data scientists got ahold of it.
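A minimal sketch of the kind of cleanup step being described, with invented field names: drop incomplete records and normalize types before the data reaches the data scientists. Tools like Hue operate at much larger scale, but the shape of the work is the same.

```python
# Invented sample records; in practice these would come from object storage.
raw = [
    {"user": "alice", "latency_ms": "120"},
    {"user": "bob", "latency_ms": None},    # incomplete record: dropped
    {"user": "carol", "latency_ms": "95"},
]

def clean(records):
    """Drop records missing a measurement; coerce strings to integers."""
    out = []
    for r in records:
        if r.get("latency_ms") is None:
            continue
        out.append({"user": r["user"], "latency_ms": int(r["latency_ms"])})
    return out
```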
Then, down the pipeline after that, we're looking at BI tools but then also, there's going to be more governance. We're looking at tools that might help out such as Apache Ranger, Apache Atlas. We have a number of people that are contributing in that space.
We're looking at how can we introduce more cohesive end‑to‑end management of the platform. You'll see more of that as we move along in the next few months here.
Gordon:  Where could someone go to learn more?
Steven:  The OpenDataHub community site is the place to go. You'll find a number of mailing lists if you want to stay in the loop. If you want to get involved, you can sign up and we can pull you into the various workstreams.
