Episode 7

December 05, 2023

00:31:57

Matt Aslett - #TrueDataOps Podcast Ep.25 (S2 Ep7)

Hosted by

Kent Graziano
#TrueDataOps


Show Notes

“By 2026, three-quarters of organizations will adopt data engineering processes that span data integration, transformation, and preparation - producing repeatable data pipelines that create more agile information architectures.”

Streamed live and available on demand, #TrueDataOps podcast episode 25 saw Kent 'The Data Warrior' Graziano welcome Matt Aslett, VP and Research Director at Ventana Research, which recently released its new DataOps Buyers Guide 2023. Matt was shortlisted for IIAR> Analyst of the Year 2022. "We see DataOps as a methodology for the delivery of agile BI, data science, and to some extent operational data, focused primarily through the automation and orchestration of data integration and data processing pipelines." Watch this episode here.
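In the episode, Matt describes automating the monitoring of data health across attributes like freshness, volume, and schema. As a rough illustration only, here is a minimal sketch of what such automated checks can look like; all function names, thresholds, and values are hypothetical and not taken from the Ventana buyers guide or any product discussed:

```python
from datetime import datetime, timedelta

# Hypothetical data-health checks; each returns (passed, detail).
# Real observability platforms run checks like these continuously
# against pipeline metadata, not as one-off scripts.

def check_freshness(last_loaded: datetime, max_age: timedelta,
                    now: datetime) -> tuple[bool, str]:
    """Flag data that has not been refreshed within the expected window."""
    age = now - last_loaded
    return age <= max_age, f"age={age}, allowed={max_age}"

def check_volume(row_count: int, expected: int,
                 tolerance: float = 0.2) -> tuple[bool, str]:
    """Flag row counts deviating more than `tolerance` from the norm."""
    deviation = abs(row_count - expected) / expected
    return deviation <= tolerance, f"deviation={deviation:.0%}"

def check_schema(actual: dict, expected: dict) -> tuple[bool, str]:
    """Flag missing columns or changed column types (schema drift)."""
    drift = {c: t for c, t in expected.items() if actual.get(c) != t}
    return not drift, f"drift={drift}"

if __name__ == "__main__":
    now = datetime(2023, 12, 5, 8, 0)
    results = [
        check_freshness(datetime(2023, 12, 5, 6, 30),
                        timedelta(hours=24), now),
        check_volume(row_count=95_000, expected=100_000),
        check_schema({"id": "int", "amount": "float"},
                     {"id": "int", "amount": "float"}),
    ]
    for passed, detail in results:
        print("PASS" if passed else "FAIL", detail)
```

The point of the sketch is the one Matt makes in the conversation: each check is trivial in isolation, but applying thousands of them continuously across many pipelines is what pushes this beyond manual effort and into automation.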

Episodes normally stream on Wednesdays at 8 AM PST, 4 PM GMT. Subscribe here and never miss an episode.


Episode Transcript

[00:00:00] Speaker A: Welcome to this episode of our show, True DataOps. I'm your host, Kent Graziano, the Data Warrior. In each episode, we bring you a podcast covering all things DataOps with the people that are making DataOps what it is today. If you've not yet done so, please be sure to find and subscribe to the True data. Sorry. To the DataOps Live YouTube channel. That's where you're going to find all the recordings for our past episodes. There's a QR code on your screen right now for that. And if you missed any of our earlier episodes, this is your chance to catch up. Now, another option, of course, is you can go to truedataops.org and just subscribe to the podcast, and that QR code is up there, too. Then you'll be sure to get all the notifications of who our upcoming speakers are going to be and when we're going to be broadcasting. So those things will help you out if you want to keep track of what we're doing. So today my guest is Matt Aslett, who's the VP and research director for Ventana Research, and they just released a brand new DataOps buyer's guide, which we're going to discuss a little bit in our show today. Welcome to the show, Matt. [00:01:15] Speaker B: Hi, Kent. Thanks for having me on. [00:01:18] Speaker A: Yeah. So for folks who don't know you very well, tell us a little bit about your background in data management and a little bit about what you do there at Ventana. [00:01:28] Speaker B: Sure. So I've been an industry analyst covering analytics, predominantly data, since 2007. For most of that time I was at 451 Research, and a couple of years at S&P Global after it acquired 451. And then for the past two years I've been part of the analytics and data practice at Ventana Research, which was actually, just about a month and a half ago, acquired by ISG. So, you know, we keep moving onwards and upwards and, yeah, looking forward to increasing investment and expanding what we do. 
Anyway, throughout that time, I've been involved in research and analysis and providing advisory to clients on data, and a number of areas in particular over the years: things like NoSQL, NewSQL, distributed SQL databases, Hadoop, cloud databases, and data management, data governance, data streaming, and of course, most recently, a lot of focus on DataOps. [00:02:27] Speaker A: Great. Yeah. So you've been through it all. [00:02:31] Speaker B: Yeah. Gray hairs to prove it. [00:02:34] Speaker A: Yeah, yeah, yeah. The last 20 years has definitely added a few of those to most of us. Just keeping up with the terminology. [00:02:42] Speaker B: Exactly. [00:02:43] Speaker A: To turn you gray. Yeah, and you're having to do that, especially if you're writing these research papers, and it's like, okay, what are we calling it today? [00:02:51] Speaker B: Right, exactly. Yeah. [00:02:54] Speaker A: So before we jump into the report, can you give us your perspective on DataOps, what it is and how does it really fit in the evolving data landscape? [00:03:04] Speaker B: Yeah, it's something I've obviously been tracking for quite a while, and I've seen an increased number of organizations paying attention to it and taking it up as a methodology and processes. And we really see it as a methodology for the delivery of agile business intelligence and data science, and to some extent operational data, focused primarily through the automation and orchestration of data integration and data processing pipelines, but also incorporating things like improved data reliability and integrity through data monitoring and observability. And I think that casts a pretty wide net, especially in relation to the report, which I can go into in more detail, but we tried to focus there on really the practical application of things like agile development, DevOps, and lean manufacturing to the tasks and skills employed by data engineering professionals in support of data analytics and development and operations. 
So emphasizing things like continuous delivery of analytic insights, process simplification, and automation. And the buyer's guides are designed to reflect a real-world sort of RFI/RFP process. We put ourselves in the shoes of an organization evaluating products on functionality. So that gave us a list of things that we look at in terms of capabilities, but we also look at things like reliability and manageability and the viability of the vendor, and also how they can help organizations think about costs and TCO. So those are some of the things we're thinking about. But when we looked at DataOps, we eventually looked at a specific set of capabilities, and those were agile and collaborative data operations, so the development and testing of data and analytics pipelines, data orchestration, and data observability. And those are the three sort of key reports. And then we had the overall DataOps report. So there was actually one research project, and then four reports that came out of that. [00:05:27] Speaker A: Wow, okay. Yeah, it sounds like you named off most of our seven pillars of #TrueDataOps in the process of describing all that, which is great. Glad to see that. You know, the seven pillars were designed to be kind of a conceptual framework, and it sounds like you took kind of the same approach with your research, which is awesome, right? It's like, let's figure out what problem we're trying to solve and what kind of features and capabilities we need before we get down to, you know, actual technical details. [00:05:58] Speaker B: Exactly. [00:05:58] Speaker A: Yeah, it is very much philosophical, just like agile, right, and DevOps. There's a lot of philosophical and process-oriented things that are involved there. It's not just technology. 
A friend of mine used to say, you know, a bad architect with a good tool is still a bad architect. And likewise a good architect with a bad tool is still a good architect. You've just got to get the two to meet in the middle there somewhere, right? Good process, good technology together. So in the report, and I'm going to quote from your report here, you say: we assert that by 2026, so that's really just barely two years from now, three-quarters of organizations will adopt data engineering processes that span data integration, transformation, and preparation, producing repeatable data pipelines that create more agile information architectures. So the big question there is, what led to that conclusion that 75% of organizations are going to adopt this by 2026? [00:07:10] Speaker B: Yeah, I suppose a slight caveat to that is that 75% won't necessarily be adopting those processes exclusively. But we do see this as a slow, steady progression. As I say, I think DataOps has been around for many years. We've seen an increased number of organizations that are using it, and they're still using it today, perhaps in certain pockets of the organization, alongside more traditional approaches. But generally, we observe this movement towards data engineering processes that support agility and continuous data processing. Whereas more traditional approaches and tools tended towards an assumption that data pipelines are linear: you take data from these sources, you integrate it, transform it, and you deliver it over here, and then you're done. I think we see this greater focus from data engineering organizations on continuous operations, and that's when the focus becomes much more on things like repeatability and automation. And so, yeah, you mentioned the pillars. Definitely. 
When we were looking at what are the key criteria for products in this space, we looked at those pillars, we looked at the DataOps Manifesto, and we tried to reflect those organizational and cultural changes that are driving interest in those products and services, in addition to, obviously, just the features and functionality. Because we do think that is philosophically important, but I think it is truly important in differentiating these products that address DataOps from more traditional data management products. [00:08:54] Speaker A: Yeah, I guess it's hard to remember that it's only been a couple of decades. And you talk about traditional data management processes; yeah, we had a data warehouse and it was ETL, and initially, when I first started doing data warehousing, we ran it like once a month, right? And we're pre-aggregating all the data and scrubbing all the data, and then just, plop, there it is, we can do some reports. And then we got to trying to run it once a week, so it was running over the weekend, right? You couldn't run it during the week because everything else would come to a crawl. And then we eventually got to, okay, we can run it once every 24 hours, still running in an overnight window. And now with the expansion, especially with the cloud and the availability of cheaper storage, cheaper compute on demand, all of that, now we've got stuff coming from all over the place all day long, and analytics is no longer, well, let's just look in the rearview mirror and try to project a couple of things. We want to know what's happening now, and can we make adjustments right now? And that's sort of changed the whole face of this. Like you said, you've got to have a different approach. It's not even just agile anymore, right? Agile development process, definitely needed, but the continuous integration ideas, CI/CD, the repeatability, the monitoring, the observability, all of that has just exploded here. 
Like probably, you know, the last five years, I think. [00:10:25] Speaker B: Yeah, you know, just observing, there does seem to have been a significant uplift in the last five years in particular. Absolutely, yeah. [00:10:33] Speaker A: So do you think it's even possible these days for companies to really deliver value from all their data at this kind of scale if they're not, you know, adopting some agile DataOps sort of approach or mindset? [00:10:47] Speaker B: Yeah, obviously it's certainly possible to deliver value from data without necessarily adopting DataOps. But you mentioned at scale, and I think that's the key point here. The volumes of data that modern organizations are trying to make use of, and then the number of different applications and sources and projects, and the number of users. When we talk about organizations being data-driven and having self-service access to data, that changes the game completely in terms of the expectations of the output and the scale of the number of users. So that's where repeatability and automation come into play, as you say. And I think if you're talking about that large-scale level of initiative or project, then it's certainly increasingly difficult to deliver the value expected from that data without a DataOps approach. Yeah. [00:11:46] Speaker A: And I think, you know, it's been what, two decades or so since we came up with this term big data? And as Claudia Imhoff used to say, it's still just data. Right, right. And, you know, people started conflating that term with the technology, but today with all of the sources, you know, we're talking about all this data off of mobile devices, edge devices, IoT. Big isn't even the right word anymore, right? It's massive. There was research from IDC a couple of years back where they were predicting like 75 zettabytes of data. 
I think it was by 2025. I remember the report, right? That's just so many zeros. I don't even know how many zeros that is, of how much data that is. Even smaller operators are having to deal with data at scale, right? Where they may have thought they were never going to hit a terabyte of data, right? And now it's like, yeah, we passed that three years ago. It's long gone, it's way in the rearview mirror. [00:13:01] Speaker B: Yeah, absolutely. [00:13:04] Speaker A: So with that, you mentioned automation a couple of times. So how do you see the role and the importance of automation in doing these DataOps-type processes? And what do you think, are we moving towards maybe some AI-driven automation to really address this scale thing? [00:13:26] Speaker B: Yeah, I think automation generally is a key part of DataOps in our view. It's a fundamental aspect, and I think it's particularly essential if you look at data observability in particular, if organizations are focused on automating the monitoring of all that data. We just talked about the volumes of data, the number of users, and you have to assess the health of that data based on however many attributes, across things like freshness, distribution, volume, and schema, and then tracking the lineage. That's just beyond the realms of humans being able to track all that, particularly in a continuous manner. So I think the use of automation expands the volume of data that can be monitored, but it also helps improve the efficiency compared to more manual data monitoring and management, in terms of things like applying data quality checks and then recommending actions. And that's where we get into machine learning, and we do see some early experiments in terms of generative AI as well, in terms of data preparation, and data tagging as well for data quality. 
So yeah, I think the use of machine learning to automate the monitoring of data is being integrated into data observability as well as data quality tools and platforms. And, you know, that's only going to continue, because it is essential to be able to automate at that level to ensure the data is complete, it's valid, it's consistent and relevant and free from duplication. The volumes we're talking about, the scale we're talking about, it's just beyond the realms of having humans do that on a continuous basis. [00:15:24] Speaker A: Yeah. And I think when we start talking about things like AI and generative AI, the risk factor goes up quite a bit on, you know, are you using good-quality data? Has the PII data and the PHI data been tagged and managed appropriately? So, you know, things that we've always been concerned with in data warehousing, like lineage, become even more important if you're going to verify all this. So, yeah, would you think of this to a certain extent as important for risk mitigation, implementing DataOps? [00:16:10] Speaker B: Yeah, definitely. I think, you know, obviously we've seen a huge amount of excitement and interest in generative AI throughout the year, and there have been points when I sort of felt like people were not focused on the inputs and the data, and they're trusting the outputs and the reliability. Because, don't get me wrong, it's incredible what we can do with generative AI, but obviously it can be incredibly wrong, as we all know. And I think people talk a lot about generative AI democratizing access to data, which clearly it does through natural language interfaces. But it places even more importance on the ability to verify the outputs of the models. Is that data point or that statement, is that correct? Was that in the underlying data? 
And as you say, was the underlying data of high enough quality, and could it be trusted in the first place? And was there data in there that shouldn't be, in terms of privacy and reliability? So, yes, absolutely. Yeah. Even more important than ever to be able to trust your data. [00:17:19] Speaker A: Right, yeah. Because I think that question of do we trust the data, the input data. [00:17:27] Speaker B: Yeah. [00:17:28] Speaker A: Becomes even more important. I mean, it's always kind of been there, and we've had all the battles over data quality. Right. It's like, oh, it's low-quality data, so the projections may or may not be great. But now if you're, you know, democratizing access to that data with generative AI, and people are potentially drawing conclusions from it, and things like ChatGPT are writing reports, right, and summarizing that data, you know, without something like DataOps and the observability and the monitoring being automated, how in the world would we know that the data that went into that really was valid from that particular perspective, that it's the right kind of data and that it's trusted? And I think that question we have now, to the business folks in particular: do you trust the data that was used to make this decision? Especially if it's going through some sort of gen AI black box, right? [00:18:31] Speaker B: Yeah. [00:18:31] Speaker A: You can look at the output, but then we've got to be able to trace that input. So that's, I think, you mentioned that data lineage, right? That's even more important. [00:18:40] Speaker B: Yeah. And I think especially because these applications are so advanced and have this appearance of genuine conscious intelligence that it's easy, as you say... well, business users don't necessarily think about, in their day-to-day, can this be trusted? 
If they're presented with something which is an amazing report that looks great and looks like it's the result of an intelligent process, there will be an assumption that the underlying data is correct. And obviously that puts greater emphasis on data management professionals to ensure that that is in fact the case, and that these users can trust in the output and go ahead and make their business decisions based on it, rather than having to go back and check and verify everything and second-guess whether something is real or a hallucination. [00:19:39] Speaker A: Yeah. So I think from that perspective, as data management professionals, the business may be just accepting it at face value, but if there's a compliance issue or a question or an audit, we have to be prepared to show exactly what happened and show, yes, that was trusted data, and have that lineage all the way back to the source, with all the transformations and everything else that happened in there and whatever business rules were applied, to be able to very quickly and effectively show that when called upon. And I guess that's where the DataOps automation comes in, to be able to do that. Because if you've got 500 engineers writing manual code, well, you know that's going to be a nightmare to figure out where did that one little thing come from, and did something go wrong because it was coded wrong, without having some sort of framework to do that with, right? [00:20:42] Speaker B: Yeah, exactly as you said, in terms of the speed and the scale, but also the tools to have the change tracking, and to be able to actually identify: yes, this changed at that point, and this is how and why, and who was responsible. [00:20:56] Speaker A: Yeah. Okay, so let's talk a little bit more about, you know, what do you think are some of the considerations that buyers should be keeping in mind when they're looking at these tools? 
You know, when we got into the cloud world, we saw a lot of what I called cloud washing, with legacy products that were suddenly rebranded as being cloud, maybe born in the cloud or moved to the cloud or cloudified or something. Do we need to be concerned about the same kind of creative rebranding of legacy data management tools in this space, that they all of a sudden become, oh, we're a DataOps tool? [00:21:34] Speaker B: Yeah, I think there's definitely some of that, I'd say DataOps washing, going on, and also different definitions of DataOps. Obviously, as we said, DataOps has been around for a long time. People who've been involved with DataOps for a long time look at the DataOps Manifesto and the seven pillars you talked about, and they have a clear understanding of what that means. I think we also saw another use of the term DataOps which just refers to data management rebranded. And I think you have to be careful. As I said, when we looked at this, we were very clear in terms of the capabilities we were looking at: the products we were evaluating should match the kind of capabilities that are listed in the pillars and the DataOps Manifesto. We do see, obviously, data integration tools being rebranded as data orchestration platforms, and data quality tools being rebranded as data observability. As I said, we try to be careful around that. I won't name names, but what did kind of amuse me: in identifying vendors for inclusion, we came across one product in particular which had been rebranded and was being positioned as a data observability product. And one of the things we do in this process is we go and look at the documentation, and we assess the documentation. When I looked at that, it's like, well, the documentation hasn't actually changed; it's just the name that has changed. And we actually went back to the vendor to clarify: do you want to be included in this or not? 
And they actually said, no, that's not us. We're not data observability as you're defining it. But they still have data observability in the product name. So, you know, it's definitely... [00:23:22] Speaker A: But it probably didn't say observability anywhere in the documentation. [00:23:26] Speaker B: No, exactly. It was not in there. [00:23:28] Speaker A: It was integration. [00:23:29] Speaker B: Yeah. So, yes, definitely, for potential buyers, you've got to examine the core functionality of the products in terms of, is it capable of doing X, Y, and Z? But also, I think things like automated testing, change control, collaboration, continuous delivery, all of those aspects that are part of DataOps aren't necessarily covered by products that are being positioned as part of the DataOps umbrella. So, yes, I think you'd want to be very careful and cautious to think about, as a potential buyer, what are you looking for in a DataOps tool? Why are you looking for a DataOps tool in particular? And make sure that those capabilities are part of any evaluation, in addition to just the core functionality of the technology itself. [00:24:33] Speaker A: Yeah. So like always, it's buyer beware. Don't necessarily believe the marketing hype around a product, especially if it's a company that's been around and predates the DataOps world. Then, you know, look at that with some level of scrutiny. And of course, that's where your buyer's guide, I'm sure, will be very helpful to people, because you've got some clear criteria in there. So even if it's not one of the tools that you cover in the buyer's guide, that still gives people a framework of what are the right questions to be asking, right? [00:25:10] Speaker B: Yeah, well, hopefully. I mean, that is what the buyer's guides are designed to address. Obviously, we do our own evaluation of the various products that meet the inclusion criteria. 
But yes, in theory, an organization could take the buyer's guide scoring criteria and use them as they are, or adapt them and go from there. So, yeah, it definitely is designed to serve that kind of purpose too. [00:25:36] Speaker A: Yeah. So for somebody doing, like you mentioned earlier, an RFP or RFI type of investigation, and they want to evaluate a number of options, this gives them a bit of a framework to do it with. And you said you've looked at the DataOps Manifesto and the seven pillars of #TrueDataOps, so that's kind of been incorporated into the thinking already. And then the bigger one, though, to your point, is that as an organization, you have to decide what is it we're really looking for. You know, how do they define DataOps? Do they agree with how we've all defined DataOps? I know many organizations early on got really hooked on, we need CI/CD. Great. But what about all the rest of this? Right? And what they were really looking for is, you know, how can we do better version control, and, you know, they're trying to manage their agile sprints and do some sort of continuous integration there. But they hadn't thought about, oh, what about automated testing and monitoring and, you know, componentizing things, containerizing things, and all of that sort of thing. They just said, oh, we just wanted CI/CD. And if you do that, if you search on CI/CD, well, you're going to find some of the traditional DevOps tools, which may or may not be appropriate, you know, depending on the environment you're building. Right? Sure, sure, yeah. So any other advice for folks getting started with DataOps? [00:27:14] Speaker B: Yeah, and I think, you know, you kind of touched on it there. You know, obviously the whole point of the buyer's guide is it is focused on evaluating products and technology. 
But I think, you know, we also see that DataOps, as we've talked about here, does involve a change of mindset in addition to that technology. So I think organizations need to be thinking also about not just the tools, but also their processes, their methodologies. Are they organized to support agile and collaborative processes? Things like continuous delivery, reproducibility, process simplification, and measurable improvement as well. I think that's part of DataOps. It's not just a matter of, say, doing things more efficiently. It's being able to prove you're doing things more efficiently, and actually the data engineering and data management team being able to articulate the value that they are providing to the business as well. I think that is an important part. For some organizations that'll be more important than others, and you can sort of dial the gauges up and down in terms of the scoring, depending on what's more important to you. But I think, yeah, the process and methodologies need to be taken into account. Definitely, yeah. [00:28:27] Speaker A: So that's definitely the people, processes, technology question. What do you think is the biggest barrier to success in this area for folks? [00:28:37] Speaker B: Well, I mean, generally, I think you see time and time again that people is the biggest barrier to change and success with any new initiative anywhere. Because it doesn't matter how good your technology is, if people don't want to or are reluctant to use it. And this is particularly true, obviously, when we said earlier about organizations trying to be more data-driven and encourage more self-service access; you need to bring people along with you on that, with things like data culture and data literacy and data democratization initiatives. So I think I'd go with people. But that said, obviously, I was just talking about processes. The processes also need to evolve. So, yeah. 
It is definitely a combination of the people, the process, and the technology, to actually enable the people to make the best use of the technology within the organization. [00:29:35] Speaker A: Yeah. Great. So, time to close out here. Where can people find this report and keep up with what you're doing at Ventana? [00:29:45] Speaker B: Yeah, so I think we've got the QR code here, which will take you specifically to the report itself. And I should note that, unlike other analyst firms, we don't actually charge for our written research. So if you go to that link there, you do have to fill out a form, but the report is freely available to anybody, be they a vendor or an enterprise. And also, well, you can find your way from there, but at ventanaresearch.com/data you can find all the details of our latest research on the data sector of the market specifically. And you can sign up there to subscribe to our latest analyst perspectives, which again are available free of charge; you just have to fill in your details and subscribe. So, you know, we have a large and growing community of end users, and obviously on the vendor side as well, and we're always glad to welcome more people into that. [00:30:43] Speaker A: Great. All right, well, thanks for that, Matt, and thanks for being a guest on the show today. Appreciate you coming on and discussing the report. It's really exciting that the DataOps world has evolved to the point that now there's a buyer's guide. That's pretty exciting, really, and people have needed it. So I think that's going to be very helpful. You know, thanks to everyone else who's joining us today or watching the replay. Be sure to join us again in two weeks, when my guest is going to be a guy who's actually interviewed me more times than I can even remember. He's a podcaster, radio host, and industry know-it-all: Eric Kavanagh. He's the CEO of the Bloor Group and a longtime host of DM Radio, which is the longest-running show in the world of data. 
So that's going to be a really fun talk with him in two weeks. Also, be sure to like the replays on our podcast here and tell all your friends about the #TrueDataOps podcast. And don't forget to go to TrueDataOps.org and subscribe so that you don't miss any of our future episodes. So until next time, this is Kent Graziano, the Data Warrior, signing off for now.

Other Episodes