Episode 1

November 25, 2024

00:33:47

Mike Ferguson - #TrueDataOps Ep. 37

Hosted by

Kent Graziano


Show Notes

In the season 3 opener, Kent welcomes Mike Ferguson, CEO of Intelligent Business Strategies.


Episode Transcript

[00:00:04] Speaker A: All right, welcome to this episode and the third season of our show, TrueDataOps. I'm your host, Kent Graziano, the Data Warrior. In each episode, we're going to bring you a podcast discussing the world of DataOps and the people who are making DataOps what it is today. So be sure to look up and subscribe to the DataOps.live YouTube channel, because that's where you're going to find all the recordings from our previous seasons and, as we go through this season, all the recordings of the new episodes. So if you missed any of the prior seasons, now's your chance to get started and catch up. Better yet, just go to truedataops.org and you can subscribe, so you'll get notifications about the upcoming podcasts. Now, my guest today is a returning guest. He's an industry expert, analyst, author, speaker, world traveler, Mr. Mike Ferguson. He's the owner and CEO and head of research for Intelligent Business Strategies. And Mike and I have known each other for quite a while. And again, he's a returning guest. So welcome back to the show, Mike. [00:01:11] Speaker B: Thanks, Kent. Real pleasure to see you again. Haven't seen you since New York earlier in the year. [00:01:16] Speaker A: Oh, that's right, yeah. At Data Universe. [00:01:19] Speaker B: There you go. [00:01:20] Speaker A: One of those occasions where we actually were in the same town at the same time at the same show. [00:01:28] Speaker B: Makes a change. [00:01:30] Speaker A: Yeah. All right, for those folks on our show who don't know much about you, would you give us a little bit of your background in data management, the abbreviated version, because I know it's a long history. [00:01:43] Speaker B: All right. Okay. So over 40 years in the industry. Just started out as a regular DBA, but very quickly I co-founded a company very early in my career with a Turing Award winner, which is Dr. Edgar F. 
Codd, who of course invented the relational model, which caused the birth of relational databases, the SQL language, et cetera. And another guy at the company, Chris Date, who wrote lots of books on relational databases. C. J. Date, there you go. Most computer science graduates of a certain age certainly studied that at university. And then I went on to... I got headhunted and asked to come to the US to become part of a great team, chief architect of a startup called Teradata. Worked there, went there in the late 80s and the early 90s, building the massively parallel optimizer for the first massively parallel database in the data warehousing industry. And then I came back to Europe. I live in the UK right now, and set up my boutique industry analyst firm and consulting firm. And I've been in data management and analytics, consulting and researching and presenting all over the world ever since. So over 30 years now, working for myself. And then more recently, back in 2016, I got together with four great events guys and got involved in an event which we started called Big Data London, which of course runs next week. It's the ninth year now, and it's grown into the largest data analytics conference in Europe, with over 20,000 delegates and over 200 vendors on the exhibition floor and 15 theaters. It's all going to be crazy. Over 300 speakers. So next week is going to be huge. Anyway, that's me. [00:03:54] Speaker A: I think you'll be a little tired by the time that's over. That's a huge, huge event. [00:04:00] Speaker B: It is. Yeah, yeah, yeah. [00:04:02] Speaker A: Well, this season on our show, we want to take a step back and think about how the world of TrueDataOps has evolved in the last number of years and what we've learned in the past few years. So, Mike, you were one of the original contributors to the philosophy and the concepts of TrueDataOps and the development of the seven pillars of TrueDataOps. 
So that's one of the reasons I wanted to kick the season off with you. For listeners, if you've not looked at the seven pillars recently, you can find them at truedataops.org seven pillars or just scan the QR code that's showing up on your screen. So after four years since we first published the TrueDataOps.org site and the Dummies Guide to DataOps, how do you feel about the seven pillars? Do they still resonate today with all the changes with AI and everything that's coming in? [00:04:56] Speaker B: Oh, 100%. 100% still resonate, I would say, more than ever, in my opinion. I mean, I think there's still a lot of people out there who don't know or, let's say, are not sure about DataOps. And, you know, for that reason, I guess in some cases DataOps practices are adopted, in some cases maybe partially adopted, or in some cases not at all. But I think, you know, we've got to do more to educate people about what DataOps is and the benefits of it for most organizations. But, you know, the way I look upon DataOps is it kind of lays the foundation for what I would say is industrializing the development of data analytics. And, you know, before DataOps, in a way, before four years ago when we defined all of that, it kind of reminds me of what car manufacturing was before Henry Ford turned up. Okay, you know what I mean. I mean, yeah, you can build all your own custom cars, but if you really want to accelerate development, you're going to need a production line. And if you're going to put a production line in place, then you've got to get the foundations right, because everyone's got to grasp, kind of, standardizing development processes in order to be able to really get this thing moving and accelerate the whole, you know, development, to kind of shorten the amount of time and coordinate the amount of the work, you know, to deliver insights that are going to contribute to outcomes in your business. So in a way, you know, I would say, you know, four years on. Absolutely. 
You know, definitely resonates even more so. Perhaps the only thing I would say is that there's a trend that, you know, DataOps, rather than being some sort of separate technology all the time, necessarily, is, you know, starting now to kind of go into bigger platforms, data fabrics and whatnot, as kind of, you know, baked in, if you like, in order to adopt those capabilities within that kind of tooling, which of course is good because then the people using those tools get to make use of those DataOps capabilities in the tooling that they're using. [00:07:32] Speaker A: Yeah, I mean, for decades we've been on this quest for self-service BI, but for DataOps we kind of need the same thing. Right? It really needs to be more self-service, not just, you know, a couple of really good data engineers who know how to use CI/CD tools, for example. Right? [00:07:53] Speaker B: 100%. 100%, yes. I mean, no question about that. But we can talk a little bit more about that, I mean, in a second, you know, but certainly, yes. I mean, we've kind of got to break away a little bit, I think, as we go forward, from, necessarily, let's call it GitHub terminology when it comes to DataOps, because of the direction we're headed. [00:08:28] Speaker A: Right. Yeah. Because previously we would have talked about doing check in and check out, doing branches and merges, and that's very technical. [00:08:38] Speaker B: Yeah, there's definitely some terminology that, if you're not a software developer, you might struggle a bit, you know, grasping, you know, and it's more of a terminology thing, I think, you know, but there's still a little bit of education needed in and around that for people to kind of get it, so to speak. 
But in general, you know, I kind of think we've got to get away a little bit, but we'll talk a little bit, you know, got to get away from the kind of techno-speak sort of, you know, but still have those DataOps practices everywhere, right? [00:09:14] Speaker A: Yeah. So you've kind of started answering my second question already. You know, now that we've got this, AI and gen AI are all the rage. Do you think that the seven pillars and the DataOps approach are more or less important than they were when we started all of this? [00:09:36] Speaker B: Oh absolutely. More important. I mean, I think the reason I'd say that is because gen AI is just going to cause an explosion of what I would call citizen development, in the sense that, whether that be citizen data modeling or citizen data engineering or, you know, citizen data science, feature engineering, or even development of gen AI apps, etc., basically we're going to see a lot more stuff, because people are going to be using copilots, natural language prompts, in order to build things. And as we democratize, you know, we're going to see a lot more stuff, clearly, whether that's more components and models, more pipelines, more data cleansing jobs. [00:10:35] Speaker A: And data products, just the whole... well, then, data products producing a data product, right? [00:10:39] Speaker B: 100%, 100% data products. But I think the last thing you want is the wild west in a citizen development environment, you know, being caused by generative AI based, prompt-based, you know, development. That'd just be a disaster for most organizations. Because I kind of think, I'd go as far as to say, that without DataOps we could absolutely end up in a wild west. And we've been there before, right? [00:11:08] Speaker A: Yeah. [00:11:09] Speaker B: You know, if I was to cast my mind back, you know, kind of 2013 timeframe when we saw the. 
Was that 10 or 11 years ago, when we saw the emergence of self-service data prep, you know, then, you know, we had new startups, you know, the Trifactas, the Paxatas of the world and whatnot, who've both been acquired, you know, now, and then we saw the emergence of self-service data prep getting baked into BI tools and whatnot. But, you know, the attitude then was just like, you know, well, here's the tools, you figure it out. You know, we just kind of pushed the complexity on to the user and. [00:11:49] Speaker A: In many cases they were business users, right? Not. [00:11:52] Speaker B: Yeah, yeah, right, yeah, business users. And so, you know, it's kind of the last thing you would want, you know, in a self-service environment. I mean, self-service is about taking complexity away, not pushing it on to, you know, people that are lesser skilled, you know, than technical professionals. And so, you know, with those kind of tools, you know, we just ended up with a mindset of like, well, here you are everybody, go knock yourself out. And what we got was a wild west, right? I mean, we got inconsistency everywhere, and maintenance was a nightmare because nobody knew, you know, who's building what and with what tools and all of that. But the whole point, I think, of the seven pillars of TrueDataOps is to organize and manage and accelerate collaborative development, not introduce wild complexity and stall everything. And so to me, we were trying to shorten time to value when we put that together four years ago. And so to me it's all about building up more reusable data and analytical products. And the idea being that the more you have available, the less you need to build. And so therefore development gets progressively faster as more and more stuff is ready-made. It reminds me, I always use the canteen analogy. 
If you go to a canteen for lunch and you want to have chicken or something, you don't expect to walk in and somebody hands you a raw chicken and some vegetables and says, here you are, you figure it out. I mean, you want ready-made. But what we did 10, 11 years ago was exactly that. We just said, here you are, you figure it out, right? There's raw data out there all over the place. You figure it out. And I kind of think that, you know, what we were trying to do with DataOps was industrialize development, speed up our ability to produce data products. And as you incrementally build these, then consumers get progressively faster at using it and delivering value with it, because more and more of what they need is already made. And they don't have to go all the way back to zero and start developing from scratch again, or potentially even reinventing, if they're not aware of what others have created around the enterprise. You know, the last thing we need right now with gen AI in the mix is a more rapid way of creating chaos. You know, with prompt-based, you know, generation of code or pipelines or whatever it may be, we just don't want to get the chaos faster. What we need is this thing to be managed. We need it to be organized. And I think, you know, in that sense, you know, the seven pillars are even more needed in a world of gen AI. But the other thing I would like to say as well is that gen AI can be used in DataOps, I mean, you know, to help automate, you know, certain DataOps practices. I mean, for example, if I was thinking about pillar number six, you know, which is really about automated testing and monitoring, you know, then very clearly generative AI could be used there to generate synthetic test data, for example, just as an example, to be used in automating testing, or, you know, as part of that pillar. 
But you could also use it in pillar number five, which is the governance and security and change control, and indeed I think in several other pillars that we have in the TrueDataOps definition. To me, we need DataOps in order to manage and put up an environment that encourages collaborative development and reuse, and not chaos, in a gen AI world. And equally we can also take advantage of gen AI in DataOps itself. [00:16:22] Speaker A: Yeah, I want to do a quick, since we're talking about them, run through what the seven pillars are. So we've got ELT and the spirit of ELT; agility and CI/CD; and what you just mentioned really is component design and maintainability, that's one of our pillars. Environment management. Yeah, there's lots of opportunities to use gen AI in all of these. The governance and security, change control. I mean, in order to avoid the wild west, the governance and security is actually part of what we really need. Automated testing and monitoring. I think people forget about the monitoring part, which we now tend to call observability. Because you can use gen AI and build all this stuff, but if you're not paying attention to what's actually happening after you've built it and deployed it and put it in production, that can be a challenge. And of course, as you've alluded to multiple times, collaboration and self-service. So with that in mind, do you think we need any revisions in this? You know, are some of these needing to be updated now with the advent of gen AI? Do they still cover it? Did we do a good enough job? [00:17:44] Speaker B: No, I think we did a pretty good job. But I do think there needs to be some revisions. Yes. I mean, not to, you know, kind of replace anything necessarily, but strengthen it more, if you like. I mean, you know, especially now as we really are moving towards, you know, citizen-based data engineering at a pretty rapid rate. I mean, I am more than sure. 
Next week at Big Data London, you know, there could be tons of that technology on display as people demonstrate copilots baked into all of their data management tools and whatnot. And it's a good year. In fact, this time last year at Big Data London we were seeing the first products emerging. So we're a year into this now. So to me there's a couple of examples I'd like to give, then. If you take pillar number two, which is what we were talking about, agility and CI/CD. You know, again, I have an issue here because, well, just because of what I see in practice with my clients. I mean, I still think, you know, the people who, let's say, know this inside and out are the people who came from software development. Right, right. But if I look at even professional data architects and data engineers, you know, we've seen what I suppose you could describe as a stuttered uptake of CI/CD. I think the reason is simple. I mean, because it came out of software development with kind of GitHub terminology and was very well understood over there. And I think data scientists were probably the first, if you like, people working in data analytics that really grasped that and understood it, because from the get-go they were writing Python or maybe R, but probably Python most of the time. [00:19:46] Speaker A: Algorithms. [00:19:47] Speaker B: Yeah, they were having to version everything and check stuff in and version control it all. And pretty early on most data science tools were already plugged into GitHub or Bitbucket or whatever version control system you were using. And I think in that sense. But data engineers didn't necessarily come out of a software development background. They may have come out of a drag-and-drop ETL or ELT background where they're not really writing a lot of code and they're not overly familiar with that kind of terminology. I don't think they understood it as well as they should. 
[00:20:36] Speaker A: And they definitely weren't doing it at the database level. [00:20:38] Speaker B: Absolutely not. Absolutely not. [00:20:40] Speaker A: If we were lucky, they might have been versioning their DDL scripts, but they weren't. [00:20:44] Speaker B: Yeah, if you were lucky. [00:20:46] Speaker A: But they weren't versioning the schemas or the databases themselves or the data in the database. [00:20:54] Speaker B: But I mean, don't get me wrong, I mean, you know, without a doubt CI/CD has been incredibly successful and very beneficial in helping shorten development times. It's made maintenance way, way, way easier. But the problem for me is we're now heading into this world of citizen data engineering using generative AI, prompt-based user interfaces. The idea that they fully understand GitHub terminology is just... sorry, but no, I mean, I just don't think that's going to be the case. And I think, because we are quite actively out there encouraging democratization, you know, for me we have to present the idea of CI/CD in a much more business-user-friendly way, you know, better terminology, or, you know, to make it easier for them to grasp it in more layman's terminology. Or I even think, you know, just automate it or hide it or have it baked into the tooling where it just kind of takes place, but it's kind of not described as, oh, issue a pull request. Talk to a business user in marketing or finance about a pull request, and they're looking at you like you've got three heads or something. It's just not particularly friendly. And if we are deliberately moving to the world of data producers, pushing them out into the business to get them to develop pipelines using prompt-based data engineering, we've got to find a better way to describe this and get them into the habit of using these practices, or even have guardrails where they have to use those practices. 
But we should certainly not be talking about it using GitHub terminology. That's one example of where I think we need some revision. I think another revision would be pillar number three, which is component-based design and maintainability. Now, I mean, that needs improved, in my opinion, just to create more clarity. Let me explain what I mean by it, because I think when I first contributed to this four years ago, along with you and several others, I was originally thinking about data components. Data products was front of mind. But you know, and I think, you know, that can be brought together, multiple data products to serve different analytical workloads. That was forever. So I was kind of thinking of, you know, rather than having monolithic data stores, you know, we create these component-based building blocks, data building blocks, you know, as data products. But at the same time component design goes way deeper than that. I mean, you know, it's not just about code. We can have, you know, data concepts in a business glossary. So therefore you could actually have component-based data modeling, you know, data model components. We can have component-based data engineering, which has been around a long time, long before DataOps came about, you know, component-based ETL development's been talked about for decades. It's just that perhaps we haven't practiced it as well. But the whole idea about thinking about that, or component-based services like cognitive services, I don't know, voice to text or sentiment analysis. You've got some content like voice and you need to convert it to text, there's a service for that, use a component for that. Or if I need sentiment analysis done, there's a component for that. Or if I need to translate from one language of the world to another language of the world, with regards to textual data, then there's a component for that. 
And so I think the whole idea, or the secret, if you like, of success with pillar number three is organization of the components to make them findable and make them reusable. And so to me there needs to be kind of clear lines of the different types of components. We have a pillar there that says component design and maintainability, but you want to turn the lens on that and. [00:25:44] Speaker A: A little bit more focus. [00:25:46] Speaker B: Yeah, drill in, a bit more detail, if you like. Yeah, exactly right. And to create things like shareable libraries and templates and stuff like that, to have these components. What we don't want is 20 different components that do the same thing, then people getting confused over this, and then we end up with... I think we need to associate these components, if you like, with maybe stages of data engineering, like ingestion components or data cleansing components. [00:26:20] Speaker A: Data cleansing components or transformation components. [00:26:24] Speaker B: Exactly, exactly. [00:26:25] Speaker A: Taxonomy for the components. The data mesh theories all talked about, even on the data product side, they had to be discoverable and understandable. And we're really very much coming together. [00:26:37] Speaker B: Yeah, much like you've got a data marketplace telling people what's ready-made from a data product perspective, you kind of need the same idea from a component perspective. If we're going to get into, I mean, if you were going to make pillar seven really work, which is collaborative and self-service development, collaboration, if you really want that collaborative development to work, we do need a taxonomy so people aren't reinventing stuff when it's already available and we can share it. So I kind of think we need to organize that far better. And we need, so you know, we need an inventory of what all these components are. We need a taxonomy. 
We might even have critical code elements, you know, and even what I would see going on in data now we might need for components. Like, I'm beginning to see data intelligence in catalogs, which know about, you know, where all the sensitive data is and all, you know. But I kind of wonder, do you even need component intelligence? By that I mean things like frequency of use and components never used and, as you say, the different part. [00:27:46] Speaker A: Of observability, that's part of the. [00:27:48] Speaker B: Well, yes, in a way, in a way it is. You're quite right. I mean, the last thing you kind of want, especially in a world of citizen data engineering using generative AI, you know, the last thing you want is a component graveyard. It's kind of littered with, you know. [00:28:06] Speaker A: Stuff. We used to talk about that in the data warehousing world, like, well, do we know that the tables we built are being used? Right? And trying to put in that kind of monitoring, like, when was the last time that data was even queried? You know, that was a big business priority back, you know, two years ago. And then you're loading it, refreshing it every night. But is anybody actually using it? So, same thing. If we're going to build component-based data products and environments like that, yeah, you don't want duplicates, you don't want, you know, three products that are producing customer data and you don't know which one to use. And you also don't want a product out there that's just flat not being used. It's taking up resources, you know, we've productionalized it. So you've got your assembly line chugging along every minute of the day, building, rebuilding, refreshing data. And that costs. And if there's any of that that's not actually being used and isn't of value to the business, well, you want to turn that part of the production line off, right? [00:29:14] Speaker B: 100%. 
I mean, in a way, I think DataOps, you know, the whole idea behind DataOps, or a major purpose of it at least, was to stop sprawl like that, you know, I mean, and to accelerate managed development and encourage collaborative development and reuse and sharing. And I think, you know, we probably need to drill deeper on that area in order to, if you like, promote absolute clarity and get maximum reuse out of all of that stuff, you know, whilst of course having full traceability and whatnot. So yeah, to me, yeah, maybe the idea of a marketplace and a taxonomy or something like that, you know, really matters. [00:30:05] Speaker A: Yeah, well, unfortunately, Mike, as usual with you and I, we just blew right through that 30 minutes in no time, and we could go on for probably a couple of hours on this, diving into it. So we're definitely going to have to get together after Big Data London and all your other travels and get with Justin and Guy and talk through some of this. You've got some really good points there, and I think it is time that we maybe go back and put a little more detail and maybe revise some of the detail under those seven pillars. It's time for that. Yeah. [00:30:41] Speaker B: But I think, as you said, sorry, data observability, you know, wasn't really out there when, was it? [00:30:49] Speaker A: Word? [00:30:50] Speaker B: Yeah, it wasn't, when we defined it, where it is now. And absolutely, you know, if there was anything missing, if you like, or needed to be added and strengthened, I would definitely have that very much as part of DataOps today. You know that, right? [00:31:03] Speaker A: Gotta at least have a reference to it in there somewhere. And it might be under monitoring. Yeah, maybe we change it to observability and monitoring or something like that. But yeah, no, I think we do need to dive into that. So. Okay, I know next week, Big Data London. 
You've given a good pitch for it already. Is there anything else you want to say about Big Data London next week, other than how much fun it's going to be? [00:31:25] Speaker B: Well, no, it's going to be great fun, but I mean, you know, if any of you watching this are going to get there, it'd be great to see you, you know, make sure you come along and say hi. I'll be hosting the Great Data Debate with executives from Collibra and SAP and IBM and AWS and Snowflake. So it should be a really great time. But also, if anyone wants to follow up with me anyway, you can look at my website. I think there's a... there you go, there's a QR code for that. Or, you know, I go into all of this stuff that we've talked about in all kinds of detail in my educational classes. And so if you're interested in that, there's a QR code there as well. [00:32:21] Speaker A: That, and of course you're out on LinkedIn, people can find you there. [00:32:25] Speaker B: Oh, absolutely, yeah, yeah, yeah. [00:32:27] Speaker A: And if you're tuning into this, you probably saw the event scheduled on LinkedIn, you can find Mike that way as well. Then you don't have to go searching around LinkedIn for him, you just go there. Well, thanks again, Mike, for coming here and for your insights. It's great as always to catch up with you and talk about these things. Thanks to everybody out there listening and for joining us online, and those of you who are going to watch the replay of this, thanks for joining in, and I hope it was educational for you. Be sure to join us again in two weeks. My guest is going to be David Garrison, who's a principal data architect and DataOps expert from National Grid. He's also a Snowflake Data Superhero. He's a practitioner in the field. We're going to have a great time talking with him about what it's like on the ground these days with DataOps and TrueDataOps. As always, be sure to like the replays from today's show. Tell your friends about the TrueDataOps podcast. 
And don't forget to go to truedataops.org and subscribe so that you don't miss any of our future episodes. Until next time, this is Kent Graziano, the Data Warrior, signing off for now.
