OpenDataHub is a reference architecture that brings together tools for AI and data work. In this podcast, I talk with Red Hat's Steve Huels, Sherard Griffin, and Pete MacKinnon about the motivations behind OpenDataHub, the open source projects that make it up, how OpenDataHub works in concert with Kubernetes and Operators on OpenShift, and how it reflects a general trend toward open source communities working together in new ways.
Show notes:
opendatahub.io - An AI and Data Platform for the Hybrid Cloud
Podcast:
MP3 - 22:11
Transcript:
Steven Huels: I am responsible for the AI Center of Excellence at Red Hat.
Sherard Griffin: I am in the AI Center of Excellence as well, responsible
for the OpenDataHub and the internal running of the data hub at Red Hat.
Pete MacKinnon: Red Hat AI Center of Excellence, helping customers and
ISVs deploy their machine‑learning workloads on top of OpenShift and
Kubernetes.
Gordon:
We heard "OpenDataHub." Let's tell our listeners what it is.
Sherard:
OpenDataHub, that's an interesting project. It's actually a metaproject
of other community projects that are out there. It basically is responsible for
orchestrating different components that will help data engineers, data
scientists, as well as business analysts and DevOps personnel to manage their
infrastructure and actually create, train, and deploy models, and push them out
for machine learning and AI purposes.
What we tried to do is take things that are being done in the community through
open‑source technologies, wrap them up in something that can be deployed into
Red Hat technologies such as OpenShift, and leverage some of the other
technologies like Ceph, JupyterHub, and Spark, to make it easier and more
accessible for data scientists and data engineers to do their work.
Gordon:
I'm going to put an architecture diagram in the show notes. Could you, at
high‑level, describe what some of the more interesting components are, what the
basic architecture is for OpenDataHub?
Sherard:
One of the key things that you'll see when you deploy OpenDataHub inside
of OpenShift is that it solves your storage needs. We collaborated with the
storage business unit at Red Hat.
We
allow users to deploy Ceph object storage that allows for users to be able to
store their massive amounts of unstructured data, and start to do data
engineering and machine learning on top of that. Then, you'll see other popular
components like Spark where you're able to then query the data that's sitting
inside of Ceph.
You
can also use Jupyter notebooks for your machine‑learning tools and be able to
interact with the data from those perspectives.
That's
three high‑level components but there are many more that you can interact with
that allow you to do things like monitoring your infrastructure, being able to
get alerts from things that are going wrong, and then also doing things like
pushing out models to production, testing your models, validating them, and then
doing some business intelligence on top of your data.
Gordon:
We've been talking at an abstract level here. What are some of the things
that our listeners might be interested in that we've used OpenDataHub for?
Pete:
There's a variety of use cases. It is a reference architecture. We take
it out into the field and interact with our customers and ISVs and explain the
various features.
Typically,
the use cases are machine learning where you have a data scientist who is
working from a Python notebook and developing and training a model.
ETL is an important part of machine learning pipelines. This part can be used to
basically pull that data out of data lakes that perhaps are stored in HDFS and
Hadoop and put it into the model development process.
Sherard:
More broadly, the OpenDataHub internally is used as a platform where we
allow users to import data of interest themselves and just experiment and
explore data.
Whether
they want to build models and publish those models, it's really an open
platform for them to play with an environment without having to worry about
standing up their own infrastructure and monitoring components.
More
specifically, we use it for use cases that ultimately make their way into some
of our product focus. We're experimenting with things like how do we look at
telemetry data for running systems and predict future capacity needs. How do we
detect the anomalies and then drill down to the root cause analysis and try to
remedy those things automatically?
These
are all some of the use cases that we're using the OpenDataHub for. We feel a
lot of these, obviously, have resonated with our customers, and they mirror the
use cases they're trying to solve. Those are just a couple of the ones we're
doing internally.
Gordon:
We've been talking about the internal Red Hat aspect of OpenDataHub. Of
course, this isn't just an internal Red Hat thing.
Sherard:
Correct. As Pete mentioned, OpenDataHub is a reference architecture. It
is open source, and we use it as a framework for how we talk with customers
around implementing AI on OpenShift.
The way OpenDataHub has been broken apart and architected hits on a lot of the
key points of the types of workloads and activities all customers are doing.
There's
data ingestion. There's data exploration. There's analysis. There's publishing
of models. There's operation of models. There's operating the cluster. There's
security concerns. There's privacy concerns. All of those are commoditized
within OpenDataHub.
Because
it's an open source reference architecture, it gives us the freedom then to
engage with customers and talk about the tool sets that they are using to
manage their use cases.
Instead
of just having a box of technology that's maybe loosely coupled and loosely
integrated, we can gear the conversation toward, "What's your use case,
what tools are you using today?" Some may be open source, some may be Red
Hat, some may be ISV partner provided. We can work that into a solution for the
customer.
They
may not even touch on all of those levels that I discussed there. What we've
tried to do is give an all‑encompassing vision, so we can build out the full
picture of what's possible and then solve customer problems where they have
specific needs.
Gordon:
Again, as listeners can see, when they look at the Show Notes, there is
an architecture there. For example, there's a number of different storage
options, there's a number of different streaming and event type models, there's
a number of other types of tools that they can use. Of course, they can add
things that aren't even in there.
Steven:
If they add them, we'd love for them to contribute them back, because
that's the entire open source model. We use the OpenDataHub community as that
collection point for whether you're an ISV, whether you're an open source
community. If you want to be part of this tightly integrated community, that's
where we want to do that integration work.
Gordon:
That's what the real value in open source, and in OpenDataHub, is.
It does make it possible to combine these technologies from
different places together, have outside contributors, get outside input in a
way that I don't think was ever really possible with proprietary products.
Pete:
That's where it really resonates with Red Hat customers: they finally
see the power of open source in terms of actually solving real use cases that
are important to their businesses. All the components are open source, the
entire project, open source, OpenShift itself is open source. It's an ethos
that infuses everything about OpenDataHub.
Sherard:
I would add to that. One of the interesting points that both Steven and
Pete brought up, is how customers have gravitated towards that. A lot of that
is because we're also showing them that, hey, you've invested in Red Hat
already or you've invested in RHEL, you've invested in containers.
In order for you to get that same experience that you may see from AWS
SageMaker or some of their tools, their cognitive tools, Microsoft's cognitive
tools, you don't have to go in and reinvest somewhere else.
We
can show you through this reference architecture how you can take some of Red
Hat's more popular technologies and use those same things like OpenShift, like
RHEL, like containers and be able to have that same experience that you may see
in AWS, but in your own infrastructure.
Whether
that's something that's on‑prem or whether that's something in the hybrid cloud
or even an OpenShift deployed in the cloud, you're able to move those workloads
freely between clouds and not feel like you have to reinvest in something brand
new.
Gordon:
Well, one of the interesting things about OpenDataHub, and I think it's also an
interesting thing about OpenShift, for example, is that over the last few years,
maybe, we've really started to see this power of different communities and
different projects and different technological pieces coming together.
Certainly, open source has a long history. But with cloud‑native in the case of
OpenShift, with the AI tools coming together in something like OpenDataHub, we're
seeing more and more this strength of open bringing all these different pieces
together.
Sherard:
Yeah, absolutely. OpenDataHub first started out as an internal proof
point. How can we, with open source technologies, solve some AI needs at Red
Hat? What we quickly understood is, there's more of a life cycle that machine
learning models have to go through, starting with collecting data and going all
the way through to a business analyst showing the value of a model.
That
allowed us to map out all the different parts of the life cycle and then start
to figure out, "How can we introduce Open Source Technologies at each
stage of that to help the process along?" As we've discovered what those
processes are, we've internally deployed those into our own systems.
We're
working towards getting a more robust system that solves our own problems. As
we do that, we share that with the broader community saying, "Hey, here
are the open source tools we used for each part of the life cycle of a machine
learning model and here's how you can do the same thing."
Gordon:
And it really goes even beyond that. We're here at Boston University at the DevConf.US
conference. For example, there's a lot of work being done on the research side
with Red Hat and BU, for instance, on privacy preserving AI techniques. That's
too much to get into in this podcast, but that's part of the whole mix too.
Steven:
Privacy, obviously, with a lot of the trends that people have seen with
some of the major companies out there, is a very hot topic. There's a lot that
Red Hat has historically been doing in the space. Looking into things like
multi‑party computation to preserve the anonymity of certain data sets is
something we've been doing for quite some time.
There's
other things we're looking into too, things like differential privacy. How do
we allow access to data for analysis from multiple parties while still
preserving that anonymous component? Then even beyond that, we're starting to
look into things like data governance.
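To make the differential privacy idea concrete, here is a minimal sketch of one textbook technique, the Laplace mechanism applied to a counting query. It is an illustration only, not something taken from OpenDataHub or Red Hat's research; the function names, the sample data, and the choice of `epsilon` are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of values matching predicate (hypothetical helper).

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 70]  # made-up records
    # Smaller epsilon means more noise and stronger privacy.
    print(private_count(ages, lambda age: age >= 65, epsilon=0.5))
```

Each analyst sees only the noisy answer, so no single record can be inferred with confidence, while averages over many queries or large populations stay useful.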
What
exists in the open source world for data governance? How do I adhere to and
maintain my GDPR compliance? These standards are only going to continue to
emerge as more and more data gets collected on people. They're very hot topics.
They are things that Red Hat is actively involved in and has a voice in going
forward.
Gordon:
Outside of Red Hat, what are we seeing in terms of interest and adoption
of not just the individual technologies but OpenDataHub more broadly?
Steven:
If I even pull that a step back, the reason why a lot of this technology
now is taking off the way it is, is because industry has readily adopted a lot
of the open source standards. They started to expect the open source frameworks
to support their use cases.
It's
not enough that you simply have a single component that can deliver one piece of
value. You want an integrated suite that solves a whole myriad of your
problems. In doing so, there's a natural correlation and integration that has
to occur.
That's
being done in pockets in different areas. They're solving maybe niche use
cases. When you look at something like OpenDataHub, it's actually crossing the
spectrum and the boundaries of what it takes to operationalize the solutions to
these problems.
Historically,
a lot of these problems were solved by individuals who could do something on a
very high‑powered work station on their desk, but they never made their way
into production. The value companies were able to get from them was very, very
limited. They made great PowerPoint slides, but they never really delivered any
value.
Companies
now expect that value to be delivered. The OpenDataHub and that type of
framework is what allows for something to be put into operation and then
maintained, like you would any other sort of critical application.
Gordon:
I think the other thing that happened is if we looked at this space a few
years ago, it almost looked like people were looking for that magic bullet,
like Hadoop for example, "Oh, this is going to solve everybody's data
problems."
What
we're seeing is you need a toolkit. You don't necessarily want the toolkit that
you have to go all over the vast Internet and assemble from scratch and figure
out what projects do the job and which ones are works in progress. OpenDataHub
kind of reaches that, it would seem.
Pete:
In recent talks I've started off by talking about the AI Winter,
which has been this notion of a cycle of enthusiasm and investment by
companies and other institutions in artificial intelligence and machine
learning, only to ultimately see it fall by the wayside, fall apart.
Everything
has changed now. I think open source is a big component of that change because,
as Steven was saying, these various individual component projects like
JupyterHub, they're their own communities, but those communities are starting
to interact with each other in various ways.
What's been missing is an integrated suite like Steve was talking about. That's
what we're trying to do with the OpenDataHub: provide a comprehensive AI and
machine learning platform to defeat the cycle of AI winters that come and go.
Sherard:
I would also say, what we've seen when we talk to customers is that
they're all at so many different stages of their AI journey. Some of them are
at the very beginning. They just want to know what AI is and what that means to
their organization. Some of them are at the tail end where they've developed
models, and they don't know how to productionalize it.
One
of the things we're able to do is take the OpenDataHub as a grounding moment
for us to all have the same basic conversation. Then we can start to talk to
them and say, "Hey, if you're just starting out, here's something you can
start out with, or if you're ready to productionalize it, the OpenDataHub can
help you conceptualize and have a reference architecture for how to
productionalize it."
It
allows us to just have conversations with our customers to let them, first,
understand that we know the problem because we're doing it internally,
ourselves. But then also, as we work with other customers, we get more information
about what they may want to do with governance, with security, with auditing,
all these other things that happen before you go push it to production.
How
can we go about it in a broader community sense where we're tackling all of
this together, and we're really pushing the envelope, where everybody's
starting to contribute and we're getting feedback from the customers, but we're
also providing guidance?
Gordon:
How does a new person or organization typically get involved, whether
it's OpenDataHub specifically, but these types of programs, what's your general
recommendation?
Pete:
Having served in various open source communities, the best practices for
getting involved become readily apparent. There's typically a lot of
enthusiasm, somebody identifies a project that they think is going to be great
to work on.
It's
very important to approach the community and understand how the community
conducts itself. Typically, open source communities these days have exactly
that, a code of conduct. There are best practices around things like GitHub, in
terms of forming a pull request if you have a new feature or bug fix to
help the community.
Also, there are modern chat applications like Slack, and Google Chat, and things
like that, which these communities form around. It's always good to come into those
arenas hat in hand, as it were, and be humble about what it is you're trying to
do. Ask questions, listen to the conversations, and build up value and present
that value back to the community.
Gordon:
You're saying that if I want to get involved, I shouldn't just join the
mailing list and tell everyone they don't know what they're doing?
Pete:
I wouldn't recommend that, no.
Gordon:
How would you describe, very high level, the state of OpenDataHub today
and what we should expect to see for next steps, what should our expectations
be?
Sherard:
The state of OpenDataHub, it's ever‑evolving. We're meeting with
customers, understanding what their use cases are, trying to see how we
solve those use cases with a reference architecture like the
OpenDataHub, and seeing where the gaps are.
If
you look at when we first started this earlier this year, and we first put out
our first Operator that deployed the OpenDataHub...
Gordon:
Operator?
Pete:
Talking about OpenShift and Kubernetes, an Operator is a powerful new
paradigm where you basically encapsulate the application lifecycle for a particular
component. The Operator interacts with the Kubernetes and OpenShift API to
do full lifecycle management of that component. That's the 50,000‑foot view.
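As a rough illustration of the lifecycle management Pete describes, the sketch below shows the general reconcile-loop idea behind Operators: compare a declared desired state with what is actually running, then act to close the gap. It is plain Python with no Kubernetes client; the `Cluster` class, the component names, and the action strings are hypothetical stand-ins, not the OpenDataHub Operator's code.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for the cluster API: component name -> replica count."""
    running: dict = field(default_factory=dict)


def reconcile(desired, cluster):
    """Drive the cluster toward the desired state and return the actions taken."""
    actions = []
    for name, replicas in sorted(desired.items()):
        current = cluster.running.get(name)
        if current is None:
            actions.append(f"create {name} x{replicas}")
        elif current != replicas:
            actions.append(f"scale {name} {current}->{replicas}")
        else:
            continue  # already in the desired state
        cluster.running[name] = replicas
    # Remove anything running that is no longer declared.
    for name in sorted(set(cluster.running) - set(desired)):
        actions.append(f"delete {name}")
        del cluster.running[name]
    return actions


if __name__ == "__main__":
    cluster = Cluster(running={"spark": 1, "legacy-db": 1})
    desired = {"jupyterhub": 1, "spark": 3}
    print(reconcile(desired, cluster))
    print(reconcile(desired, cluster))  # second pass finds nothing left to do
```

A real Operator runs this kind of loop continuously against the Kubernetes API, which is how it can handle installs, upgrades, and recovery rather than a one-time setup.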
Gordon:
Yeah. To add a little bit to that, having seen a workshop yesterday on
this topic, the idea is that the Operator can install your Kafka eventing
system, your Spark cluster, and your Jupyter notebooks for something that has a
lot of components, like OpenDataHub.
The
idea is, it's as if you had an expert on the system come in and spend a couple
hours installing things for you.
Sherard:
Yeah, that's exactly right. That's why we gravitated towards operators pretty
early. The OpenDataHub is an operator, a meta‑operator. It even deploys other
operators like Kafka Strimzi. We're also working with the Seldon team. We're
going to be looking at integrating some of our other partners into that
ecosystem.
What
I was getting at is, where we were earlier this year was, the OpenDataHub was
really focused on the data scientist use case trying to replace the experience
of all of your data scientists across an organization doing work on their
laptops.
Certain
people may have different components installed. Everyone's doing pip installs
with different versions. You have all kinds of dependencies that are very
specific to that data scientist's laptop or workstation.
What
we tried to do is solve that by introducing this into Kubernetes so that we
have a multi‑user environment in OpenShift, so that everybody has the same
playing field, every user's using the same suite of tools. They're using the
same suite of dependencies and same versions of packages so it makes it easier
to collaborate.
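The version-drift problem Sherard describes can be sketched in a few lines. The example below compares pinned `name==version` requirement lists from different machines and reports packages whose versions disagree; the helper names, user names, and version numbers are made up for illustration.

```python
def parse_pins(requirements):
    """Parse 'name==version' lines into a dict, ignoring blanks and comments."""
    pins = {}
    for line in requirements.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(environments):
    """Return {package: {env: version or None}} for packages that differ."""
    parsed = {env: parse_pins(text) for env, text in environments.items()}
    packages = set().union(*parsed.values()) if parsed else set()
    drift = {}
    for package in sorted(packages):
        versions = {env: pins.get(package) for env, pins in parsed.items()}
        if len(set(versions.values())) > 1:
            drift[package] = versions
    return drift


if __name__ == "__main__":
    laptops = {
        "alice": "numpy==1.16.4\npandas==0.24.2\n",
        "bob": "numpy==1.17.0\npandas==0.24.2\nscikit-learn==0.21.2\n",
    }
    for package, versions in find_drift(laptops).items():
        print(package, versions)
```

On a shared platform every user starts from the same image, so a check like this should come back empty.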
Once
we did that, the next step was to start to introduce more of a management of
your machine learning models. Now, we've introduced Seldon where you can
actually deploy your model as a REST API. Then, we also introduced Kafka for
data ingestion down into your object storage. We also had the ability to query
the data using Spark.
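Once a model is exposed as a REST API, a client only needs to send it an HTTP request. The sketch below builds such a request with Python's standard library. The URL and the JSON payload shape are placeholders chosen for illustration, not Seldon's documented API; check the model server's own documentation for the real endpoint and schema.

```python
import json
import urllib.request


def build_prediction_request(url, features):
    """Build (but do not send) a JSON POST request for a model endpoint.

    The {"instances": ...} payload shape is an assumption for this sketch.
    """
    body = json.dumps({"instances": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the route your deployment exposes.
    request = build_prediction_request(
        "http://models.example.com/predict", [[5.1, 3.5, 1.4, 0.2]]
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually call the model: urllib.request.urlopen(request)
```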
Coming down the pipeline in the next month here, we're going to be introducing tools
for the data engineer. What we're doing is looking at how do you catalog your
data that's stored in the object storage. This is Hive Metastore but we're also
introducing technologies on top of that such as Hue, which will allow you to be
able to manipulate the data before the data scientists even get there.
The
reason that we decided on that is because we all know that before you do
machine learning, data just doesn't come in cleanly. It's not perfect right out
the gate. We knew that there was a step missing in enabling data engineers to
massage and clean that data before the data scientists got ahold of it.
Then,
down the pipeline after that, we're looking at BI tools but then also, there's
going to be more governance. We're looking at tools that might help out such as
Apache Ranger, Apache Atlas. We have a number of people that are contributing
in that space.
We're
looking at how can we introduce more cohesive end‑to‑end management of the
platform. You'll see more of that as we move along in the next few months here.
Gordon:
Where could someone go to learn more?
Steven:
Opendatahub.io is the community site. You'll find a number of mailing lists
if you want to stay in the loop. If you want to get involved, you can sign up
and we can pull you into the various workstreams.