Episode Transcript
[00:00:04] Speaker B: Hello, and welcome to season four of the True DataOps podcast. I'm Keith Belanger, field CTO here at DataOps Live, and I'm honored to be your new host.
Now, many of you might be asking, where's Kent? For the past few years, Kent Graziano has led this podcast with incredible passion and insight. And I want to thank you, Kent, for building such a strong foundation. I'm excited to carry the torch forward from here.
As we're all aware, AI is the hot topic, and there has been a lot of talk about AI-ready data. So this season we're going to embark on a journey to unpack what that really means and what it truly takes to achieve it. We'll dive deeper into the capabilities, practices, and actions organizations need, and highlight the critical role that DataOps plays in making AI-ready data a reality.
But before we go too far down that road, we need to start at the beginning by answering a simple but powerful question. What do we actually mean by AI-ready data?
And with that, I am thrilled to bring onto the podcast our very first guest of the season.
Hello, Sanjeev, and welcome. For those of you who may not know Sanjeev, he is a former Gartner Research vice president, an author, and one of the most respected voices in the industry when it comes to data strategy, analytics, data management, and modern architecture. He's been guiding organizations worldwide on how to move from buzzwords to business value. I'm so excited to have him today and to get his perspective on what has become the buzzword you're seeing all over the place: AI-ready data. So, Sanjeev, I really would like to get your take. What do we mean by AI-ready data?
[00:01:44] Speaker A: Thank you, Keith, for your kind words and for the introduction. AI-ready data is, in my mind, absolutely central to the success of every enterprise's AI initiative. That's how important AI-ready data is. But the question it begs is: what does it mean? If I look at it at a very high level, it's data that is accurate and timely.
I can trust that data.
It's available to me, it's accessible whenever I want. So it's all of those things that make it AI-ready data.
And that is true. But there's a problem, Keith. When we stay at a high level, the question becomes: well, what's so different between the requirements we had 10, 20, 30 years ago and what AI needs? Because it sounds almost the same. But there is a difference. There are actually a couple of nuances; if we peel the onion and start looking deeper, we find what AI-ready data exactly is. It's all of what I just said. But the most important thing is context.
By context, what we mean is that it's not just metadata. It's not just that I have data in my table, I ran my DQ rules, and I made sure there were no missing values, no duplicates, and all values were unique, so my data is clean.
But what does the data really mean? What is the name of that column?
How are the columns related to each other, at a much deeper level than what the data itself tells you? So it is the data, it's the metadata, and it's the context.
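To make that concrete, the table-level DQ rules described above might look like this minimal pandas sketch (table and column names are hypothetical). The point is that every check can pass while the data still carries no business context:

```python
import pandas as pd

# Hypothetical customer table; names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2024-01-05", "2024-02-11", "2024-03-20"],
    "ltv": [120.0, 87.5, 240.0],
})

# Classic DQ rules: no missing values, no duplicate rows, unique keys.
assert df.notna().all().all(), "missing values found"
assert not df.duplicated().any(), "duplicate rows found"
assert df["customer_id"].is_unique, "customer_id is not unique"

# All checks pass, yet nothing here says what 'ltv' means, what currency
# it is in, or how this table relates to orders or support tickets.
# That missing layer is the context that makes data AI-ready.
print("DQ rules passed; business context still absent.")
```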
[00:03:35] Speaker B: Yeah, that's a great point. You often hear that if you just give your AI initiative the data alone, it doesn't know your business context, how this relates to that. You hear about things like knowledge graphs, or conceptual models, for capturing that context. That's a great point that we often don't think about. That's great.
[00:03:58] Speaker A: Here's a funny example. Well, the topic is not funny, but humor me. Let's say I have an age-old chatbot, from before LLMs came into the picture, and I tell this chatbot: I lost my job.
Can you give me the list of the tallest bridges in New York?
And it says, oh, great question. The height of the Brooklyn Bridge is such-and-such.
That's the wrong answer, because it missed the context. An LLM will understand: oh my God, this is serious. It's not going to tell me the height of the Brooklyn Bridge; it's going to counsel me instead.
So you see how important context becomes in AI. That's one. The second thing I want to differentiate is that in the past, when we talked about AI-ready data, sorry, just good data, we were talking only about structured data. But now unstructured data is a first-class citizen. I can upload a picture to a shopping site and say: I'm designing my room. Here is a picture of my living room.
What kind of sofa should I put in it? But the information about the sofa is in my product catalog.
How does it know what will fit in my room in terms of size and decor and ambiance?
That unstructured data, unified with structured data, is a first-class citizen. Like I said, AI-ready data is these two things: it's comprehensive, and it's contextual, on top of everything we've always expected from our data.
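One common way such unification gets implemented (not spelled out here, so treat this as an assumption) is to embed the image and the catalog text into a shared vector space and rank by similarity. A minimal sketch with stubbed-out encoders standing in for a real multimodal model:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_image(path: str) -> np.ndarray:
    # Stub: a real system would run a multimodal (CLIP-style) encoder here.
    return rng.random(512)

def embed_text(text: str) -> np.ndarray:
    # Stub: must share the same embedding space as the image encoder.
    return rng.random(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Structured side: the product catalog (hypothetical rows).
catalog = [
    {"sku": "SOFA-001", "desc": "mid-century 3-seat sofa, walnut, 210 cm"},
    {"sku": "SOFA-002", "desc": "compact 2-seat loveseat, light grey, 150 cm"},
]

# Unstructured side: the customer's living-room photo.
room_vec = embed_image("my_living_room.jpg")

# Rank catalog entries against the room image in the shared space.
ranked = sorted(catalog,
                key=lambda p: cosine(room_vec, embed_text(p["desc"])),
                reverse=True)
print(ranked[0]["sku"])  # best match under the stubbed similarity
```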
[00:05:47] Speaker B: Yeah. We're not just looking to make sure the pie chart is still accurate anymore, right? It's that it can really have major consequences if you don't have that context. Right.
So, what's interesting: we have all this AI, I'm going to say hype, or energy, going on around the industry. But I was reading that Gartner is saying 60% of AI projects will fail without AI-ready data. And on the other end of that, 75% of organizations are making major investments into it. That's a lot of investment for a high failure rate. And then you have MIT reporting that 95% of corporate gen AI pilots are failing. So I really would like your take on that: lots of spending going on, but lots of failure as well.
[00:06:44] Speaker A: Yeah. Keith, I have to say I don't want to disrespect these studies, and where there is smoke, there must be some fire.
I agree that the reason we are seeing a lot of reports like that, almost on a daily basis, is that to some extent the return on investment has not matched the expectations or the hype. But at the same time, the reason I'm being a little more careful in my response is that I have seen this story repeat many times. There are very interesting articles, like the CEO of IBM talking about how many personal computers there would be in the world, or predictions about the impact of the Internet calling the World Wide Web a fad.
Those were all statements made by very respected, educated folks.
My point is that, for example, when the MIT report says 95% of projects have no ROI, they're including pilots in that, but pilots are not supposed to have ROI. So we need to give it some time. This is all happening very rapidly, in a very short period of time. Over the next few years, we will solve some of these challenges that we are facing today, and then it'll be a different story.
[00:08:18] Speaker B: Yeah, as you said, going back to the World Wide Web and other things like that, you tackle these little obstacles and you move on. And what I've noticed is we're breaking down those obstacles and moving so fast. Just look at how far AI has come in the last year. We will, as an industry, get over these hurdles, and those failure percentages should start going down for sure. But our focus here is on DataOps, and I know over the last few years you've been very involved in DataOps, working with our founders, both Guy and Justin.
And I really want to look at DataOps, its position and its impact, from this AI-ready data perspective. So I'm going to bring up a few areas (there's so much we can talk about with DataOps) and get your take on how each one impacts AI-ready data in the DataOps space.
I know you co-wrote the book Data Products for Dummies. When I say data product delivery and its role in getting AI-ready data, what's your take on that?
[00:09:45] Speaker A: First of all, it occurred to me all of a sudden that the journey we started with data products was exactly the right thing for AI. Because what we did with data products was say that just looking at data in a table is not enough. We need to wrap it with something that is easily consumable. It has its own contracts and SLAs, it has versioning. There's a whole factory model behind it, basically treating it as a product: gaining trust, an easy user interface.
That was the goal of building data products. To some extent, what we did was take data, wrap metadata and context around it, and make it available in a marketplace so it could be discovered easily.
I'm very happy to see that the data product was the right move to get your data AI-ready. I'd even go as far as saying that agentic architectures and agents are actually data products.
Or if people say, oh my God, how can you call it a data product, because data has always been there and agentic AI is new stuff, then call it an AI product. Data product or AI product, it doesn't matter what label we put on it.
But one of the interesting things I want to bring up about the journey of creating data products, and why DataOps becomes so critical in that journey, is that when I create a data product, I am putting out a certified data set for business users to discover and then use in their applications. In other words, it's reusable. We didn't talk about this earlier when we talked about accuracy and trust, but I want my AI-ready data to be reusable. I don't even know how it's going to be used, because new AI use cases are always coming up in an organization. But I know it has to be reusable. And how do I do that? Through DataOps.
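As a sketch of the wrapper described here, data plus contract, SLA, version, owner, and discoverable metadata, one might model a data product like this (all field names are hypothetical, not a DataOps Live schema):

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    version: str                      # versioned like any product release
    owner: str                        # accountable team; part of the contract
    schema: dict[str, str]            # column -> type: the published contract
    sla_freshness_hours: int          # how stale the data is allowed to get
    description: str                  # the context a consumer (or LLM) needs
    tags: list[str] = field(default_factory=list)  # marketplace discovery

product = DataProduct(
    name="certified_customer_360",
    version="2.3.0",
    owner="customer-data-team",
    schema={"customer_id": "int", "ltv": "decimal(10,2)"},
    sla_freshness_hours=24,
    description="One row per active customer; ltv is lifetime value in USD.",
    tags=["customer", "certified", "reusable"],
)
```

A derived data product would then declare this one as an input and inherit its guarantees, which is the reuse being described.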
[00:12:19] Speaker B: Yep. You know, I've been in scenarios where you might have to use a data product as a source or an input to another data product. And you don't want that other data product to reinvent the wheel you already solved in the first one. I think about when I go to the store and buy a product: sometimes I'm buying just one component, and other times I'm buying it with another component added to it. But I know it's been tested, that it has been validated.
[00:12:50] Speaker A: Yeah. You know, when oil comes out of the ground, it's super valuable, but what happens to it next makes it more valuable. Oil straight out of the ground is unusable; it has to be refined. When it gets refined, it gets turned into aircraft fuel, or the fuel we put in our cars; it also becomes plastic. Those are the derived data products. And you cannot derive a data product if you don't trust that the underlying data product is good.
So, Keith, these are the things I think about sometimes. Why is AI different from what we've been doing? It's really important for me to understand the nuances; otherwise we can be accused of just hyping up a buzzword without really understanding it.
The difference that crossed my mind was with my dashboards and reports.
Let's say my dashboard was a data product by itself. It was based on finance data, and I built it for SEC reporting.
If SEC reporting is done incorrectly, the CFO can go to jail. So it's highly sensitive; it has to be accurate. But that report or dashboard does not change; it just stays intact. With AI, it's a totally different ball game, because it's probabilistic. My outcome can change. My definition of accuracy can be very different from your definition of accuracy.
Biases can come in through the data, because we're going past data and metadata; we're going into context.
So if some racist data, or inadvertently some demographic data, went in, then my output is going to reflect that.
The funny thing about AI is it is so confident in giving the answer, it won't even think twice.
Here's your answer. It's up to you.
One more thing that I want to add to your question about DataOps and the overlap: one of the most important things in DataOps is observability. It is monitoring the inputs and the outputs for various reasons: accuracy, quality, even cost.
Because cost varies a lot with models; token cost varies a lot. So observability becomes a very important part of AI-ready data: being able to monitor it and report on it on a continuous basis. In traditional software we don't have to do that.
We monitor it; if it goes down, then a container needs to spin back up, all those things. But we don't worry about the output going haywire.
In AI it can. So that's another very important part.
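To illustrate continuous output monitoring, here's a minimal sketch, assuming a hypothetical answer() call standing in for the model. Unlike an uptime check, every single response gets scored before it's trusted:

```python
import re

def answer(question: str) -> str:
    # Stub for a hypothetical LLM call.
    return "The refund policy allows returns within 30 days."

def check_output(text: str) -> list[str]:
    """Cheap checks run on every response, not just on service health."""
    issues = []
    if not text.strip():
        issues.append("empty response")
    if len(text) > 2000:
        issues.append("suspiciously long response")
    if re.search(r"\b(ssn|password|api[_ ]key)\b", text, re.I):
        issues.append("possible sensitive-data leak")
    return issues

resp = answer("What is the refund policy?")
problems = check_output(resp)
if problems:
    print("flag for review:", problems)  # route to alerting/observability
```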
[00:16:14] Speaker B: And I think that's a great point, especially the continuous part. Because now, like you said, with unstructured data, even semi-structured data, there's drift in the content and format of the data coming in. I've seen scenarios where the data is formatted one way, and then it changes to a different format, with the month and the day swapped in one scenario versus another. How do you make sure you pick that up? Are you looking for scenarios that can drive bias?
So it's monitoring not just whether this particular report ran properly, but did the data coming in change? What did I do to it? Did my transformation logic change? I love that word you used: continuous. If you're not looking at every single record processed, it could definitely throw things askew. It's no longer a nice-to-have type of thing.
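The date-format drift described here can be caught by profiling every incoming batch instead of assuming the old format; a minimal sketch (formats and data hypothetical):

```python
from datetime import datetime

EXPECTED_FORMATS = ["%Y-%m-%d"]  # what the pipeline was built to expect

def parses(value: str, fmt: str) -> bool:
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def format_drift(values: list[str]) -> float:
    """Fraction of records in a batch that no longer parse as expected."""
    bad = sum(1 for v in values
              if not any(parses(v, f) for f in EXPECTED_FORMATS))
    return bad / len(values)

# Yesterday's feed vs today's: the month and day moved into a new format.
print(format_drift(["2024-03-01", "2024-03-02"]))  # 0.0 -> healthy
print(format_drift(["03/01/2024", "03/02/2024"]))  # 1.0 -> alert upstream
```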
[00:17:15] Speaker A: Even cost.
See, in software, if I say, okay, I'm going to run this on this instance type in my cloud provider and I just let it go, the cost is what the cost is. But in AI the cost can change dramatically, because if somebody uses a reasoning model, instead of using up 60 tokens it may use 6,000 tokens, and you see your cost has gone up. But if you change to a cheaper open-source model, then your cost can come down. And models are being introduced every day. So it's not a one-and-done deal. You have to constantly monitor your AI deployment.
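To put numbers on the 60-versus-6,000-token point, the arithmetic looks like this, with purely hypothetical per-token prices:

```python
# Hypothetical prices; real per-token rates vary by provider and model.
PRICE_PER_1K_OUTPUT_TOKENS = {
    "standard_model": 0.002,      # USD per 1,000 output tokens
    "reasoning_model": 0.010,
    "cheap_open_source": 0.0004,
}

def call_cost(model: str, tokens: int) -> float:
    return tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS[model]

print(call_cost("standard_model", 60))       # $0.00012 per call
print(call_cost("reasoning_model", 6000))    # $0.06 per call: 500x more
print(call_cost("cheap_open_source", 6000))  # $0.0024: same tokens, 25x less
```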
[00:17:59] Speaker B: Yeah, and speaking of deployment, we've often talked about CI/CD and its critical aspects over the years. But with AI, I want to stress that you really have to go to 100% automated CI/CD. I was in a scenario where we only released and updated once a week, and it was very clunky. As you were saying, models and things keep changing; if you need to adjust your models or your transformation logic, you need to be doing these deployments rapidly and fast, with all of this observability built into them. To me, you need to have it fully automated.
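As a sketch of what fully automated can mean here (a generic gate, not any specific vendor's pipeline): every change runs the quality checks, and only a clean run ships, with no weekly manual release:

```python
import sys

def run_quality_gates() -> dict[str, bool]:
    # Stubs for checks an automated pipeline might run on every commit.
    return {
        "schema_tests": True,   # published contracts still hold
        "dq_rules": True,       # nulls / duplicates / uniqueness pass
        "eval_suite": True,     # model outputs scored against a baseline
        "cost_budget": True,    # projected token spend within limits
    }

def deploy() -> None:
    print("deploying new pipeline/model version")  # stub

results = run_quality_gates()
failed = [name for name, ok in results.items() if not ok]
if failed:
    print("blocked:", failed)
    sys.exit(1)  # fail fast; nothing ships
deploy()         # no human gate needed when every check is green
```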
[00:18:37] Speaker A: Yep, absolutely. In fact, the very basic definition of an agent is the fact that it's autonomous. If it is not autonomous, then it's just a chatbot, an assistant. We want our agents to be able to go do a task, reason about the output, and then go do the next task. And if we have humans always interjecting, then it's not an agent, it's a chatbot. You ask it a question, it does something; then you ask another question, it does something else. But that's not automation.
And I'm completely in the camp that we will get there. Multi-agent systems are complicated right now.
Single-task agents are already working really well.
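That act-reason-act autonomy is essentially a loop; a minimal sketch with stubbed-out tool and planning steps (everything here is hypothetical):

```python
def run_tool(task: str) -> str:
    # Stub: a real agent would invoke a tool or an LLM here.
    return f"result of {task}"

def decide_next(history: list[str]) -> str | None:
    # Stub for the reasoning step: pick the next task, or stop.
    plan = ["fetch_data", "summarize", "draft_email"]
    return plan[len(history)] if len(history) < len(plan) else None

# Agent loop: act, observe the output, reason, act again,
# with no human interjecting between steps.
history: list[str] = []
while (task := decide_next(history)) is not None:
    history.append(run_tool(task))
print(history)
```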
By the way, just as we are recording this, I was watching Marc Benioff saying how heavily Salesforce is into agents.
Here's an interesting fact I learned from his podcast. They have 15,000 salespeople, but over the 26 years of Salesforce there have been 100 million unanswered sales inquiries, and now an AI-based SDR is making those calls and talking to those prospects.
Beyond 15,000 salespeople, how many more do you want to hire? So anyway, the point is that there is huge potential in AI, and we are literally in the very early innings of it.
[00:20:25] Speaker B: Yeah. And to add on to what we were saying: governance and enforcement, the whole ethics and privacy side of this. What's your take on the enforcement of governance in this whole process?
[00:20:42] Speaker A: Yeah. One mistake we made with data governance was that it was an afterthought. After everything was done, we said, okay, now let's put in some data governance. Data governance in the past was a very defensive thing: Sarbanes-Oxley, Solvency II, that kind of stuff.
But today we take a fresh look at governance. We look at governance as an opportunity to make your data available for analysis in a safe manner.
The example I use is what is the purpose of having brakes in a car? And the answer is so you can go faster.
Because I know I have a safeguard; if something happens, I can brake. And with anti-lock brakes and all of that, the technology has advanced. So governance is no longer a defensive thing; it's an offensive thing. If I can govern my data and make it available to the business to do more AI initiatives, all the power to them; it gives me more competitive advantage. So governance folks now have a seat at the table, because they are unlocking these doors.
The old school was: to save my data from some disaster, I'm just going to lock it down. No one can use it except maybe five people. The CFO can get his or her reports, but nobody else can do anything. To me, governance now becomes part of my AI initiative right from the get-go. Right from experimentation; actually, forget experimentation, from ideation. Ideation, experimentation, production, evaluation.
I want to drive all the phases of my project's life cycle through AI governance.
[00:22:46] Speaker B: Yeah, you don't want to take the secret recipe that's been in your family for many, many years, load it into an LLM, and make it available to everybody else.
[00:23:03] Speaker A: Right? Yeah, the Coca-Cola formula.
[00:23:08] Speaker B: Exactly, exactly.
Yeah. So.
Oh, go for it.
[00:23:17] Speaker A: I'm going to turn the tables on you.
The question I have: I know DataOps Live has a new release coming out.
At Big Data London, in just a few weeks, is what I heard.
I'm curious, how are you planning to measure whether an organization's data is AI-ready or not?
[00:23:43] Speaker B: Great question. And yes, we are very excited; we're going to be showcasing it at Big Data London.
I think the announcements might even have gone out yesterday. As we just talked about, there is so much automation that needs to be built in for organizations to get governance right and become AI-ready. So we're elevating our platform to a DataOps automation platform, helping organizations generate AI-ready data.
Well then, how do you do that? How do we know? You talked about so much stuff. How do we know? Do we just say, yeah, we feel our data is ready?
So, I know you're very familiar with FAIR; that was in your Data Products for Dummies. The question is: how do we take that up a level? We're going to be introducing what we're calling AI-ready scoring, which breaks that data down into really five categories, covering everything we've been talking about. Are your data sets accurate and consistent, from a foundational quality and integrity perspective? Accessibility and operationalization (I can barely say that word): how do you use the data at the right scale and at speed? Speed and accuracy are a big thing when it comes to AI; if you're feeding AI a week's worth of bad data, that's not going to be good, so how do you make sure you're catching these things at speed? Are the predictive features the right shape for training your models? All of these things we've talked about. We're going to give people a dashboard where they can see their data products, and each data product is going to have an AI-ready score that takes all of these critical parts and pieces and rolls them into a score, so that your consumers know: hey, this is good, this is bad. It's something we're very excited about. Like everything else we've talked about in AI, it's going to evolve, it's going to change. But we do feel that the accuracy needed in all the parts and pieces matters. We talked about governance: if you don't have governance built into your data product, you're not going to have AI-ready data. Same with monitoring, and with DataOps you can do end-to-end monitoring, governance, quality testing, and automated CI/CD.
If you're doing all of those parts and pieces, then you should have a good, AI-ready data product.
And that's what we're really excited about: getting it into the hands of people, people leveraging it, and then iterating on it to make it better and better. But we don't feel you can just say, hey, we're AI-ready, without measuring. In a lot of the data initiatives I led, we had to generate metrics: am I meeting my SLA, am I meeting this? We always had metrics. So we feel you need metrics around AI-ready data, and that's what we're really excited about showcasing and having people take advantage of.
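Purely as an illustration of composite scoring, not the actual DataOps Live implementation (which had not shipped at the time of recording), a per-product score over five categories might be computed like this, with hypothetical category names and equal weights:

```python
# Hypothetical category scores (0-1) for one data product.
scores = {
    "quality_and_integrity": 0.92,
    "accessibility": 0.80,
    "timeliness": 0.75,
    "governance_and_bias": 0.60,
    "contextual_metadata": 0.85,
}
weights = {k: 0.2 for k in scores}  # equal weights for the sketch

ai_ready_score = sum(scores[k] * weights[k] for k in scores)
print(f"AI-ready score: {ai_ready_score:.2f}")  # 0.78

# A dashboard could then bucket the composite for consumers:
label = ("ready" if ai_ready_score >= 0.8
         else "needs work" if ai_ready_score >= 0.6
         else "not AI-ready")
print(label)  # "needs work": governance drags the product down
```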
[00:27:13] Speaker A: I want to dig a bit deeper for those who are listening to this and don't know what FAIR is, so let me define it.
FAIR is an open movement; it's widely adopted, and a lot of pharma companies use it. It stands for findable, accessible, interoperable, and reusable.
That has always been a part of the DataOps Live platform.
Correct?
When you now do AI-readiness scoring, what is the delta? Where are you going from FAIR, which is an industry standard, to where you are taking it?
[00:27:53] Speaker B: It's a great question.
I think, I mean, I don't know all the granularities yet, and we can definitely get into that. But it's really that FAIR didn't go into the depth of what's needed when you really get into AI and some of the stuff we were talking about: the in-depth quality checking.
Are you really building in governance and ethics, and looking for bias? How do you measure for bias? Right.
So there are things not covered in FAIR that we feel you need to add on top of it for AI readiness, things you might not have needed for a data product before. And yes, we will have FAIR scoring, so you'll be able to see them side by side: FAIR score, AI-ready score. You might say, hey, look, I'm not building an LLM; great, here's your FAIR score, you don't care about being AI-ready. But if this data product is going to be used for my fraud-detection LLM, then I'd best be sure. Maybe I need to look for biases, or for things that might cause hallucination, or I have to watch the drift of my incoming data structures differently. So really it was: how do we take that to the next level?
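On "how do you measure for bias": one simple, common metric is the demographic parity gap, the difference in positive-outcome rates between groups; a minimal sketch on hypothetical fraud-flag data (real audits use several metrics, not just this one):

```python
# Hypothetical model decisions: 1 = flagged as fraud, by protected group.
flags_by_group = {
    "group_a": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # 20% flagged
    "group_b": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],  # 50% flagged
}

rates = {g: sum(v) / len(v) for g, v in flags_by_group.items()}
parity_gap = max(rates.values()) - min(rates.values())

print(rates)       # {'group_a': 0.2, 'group_b': 0.5}
print(parity_gap)  # 0.3 -> a large gap; investigate before shipping
```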
[00:29:21] Speaker A: I like it. Okay, I see the difference. Very powerful. Thank you. I cannot wait to see it.
[00:29:28] Speaker B: Yeah, it is exciting. It's always great to be involved. I've been in the industry 29 years, a long time as well, and I've seen so much change. Many people, I guess, don't really like change. I love change. I love trying something new. I love a new problem.
Honestly, 29 years ago when I was starting, what we're doing now was science fiction to me.
And it's exciting to be a part of making that contribution to the data industry, helping organizations take that leap forward. So it's exciting times. It isn't going to be without its trials and tribulations and its gotchas, but hey, that's how we've made it as a society.
[00:30:25] Speaker A: Yeah, that's the discussion we were having earlier.
There is this very famous statement which says that we overestimate the benefits of a new technology in the short term and underestimate its benefits over the long term.
So we are overestimating how quickly AI is going to change all these things and make us so much more productive. But it's a journey. It takes time, and we will get there.
[00:30:57] Speaker B: Yeah, absolutely. And just so everybody knows, you were asking me about the AI-ready scoring and I'm only giving you the high level here. We will be having a webinar in the upcoming weeks, so people can keep an eye out for that. We'll really dive into what you were asking about, the FAIR scoring and the AI-ready scoring, and people can see it in action. But I think we're about out of time. Sanjeev, I really enjoyed this. This was a lot of fun, and getting your perspective is always great.
And I want to thank you very much for joining us today.
[00:31:34] Speaker A: It's been such a pleasure, Keith. Anytime. I'd actually love to talk more once your AI-readiness scoring comes out.
I'd love to dig deeper into it. But till then, thank you so much for having me back on the True DataOps podcast.
[00:31:53] Speaker B: Always a pleasure. Thank you.
[00:31:56] Speaker A: See you.