Episode 58

January 22, 2026

00:30:09

#TrueDataOps Podcast - Doug Needham - Ep.58 - S4.EP5

Hosted by

Kent Graziano



Show Notes

Join Keith Belanger, Field CTO at DataOps.live and Snowflake Data Superhero, as he discusses "Getting Back to Basics" with his guest Doug Needham, "The Data Guy". Doug, Principal Solutions Architect at DataOps.live, is a seasoned data scientist with a proven track record of success and extensive expertise in enhancing, automating, and analyzing data processes to drive business growth.

His journey began in the US Marine Corps, where he honed his data management skills and developed a deep understanding of data integrity and protection. Since then, Doug has consistently delivered exceptional results for businesses of all sizes and has been instrumental in driving customer adoption at DataOps.live. His most recent book, "Data Structure Synthesis: Boosting Productivity Through Data Structure Reuse," focuses on leveraging data structure fundamentals to speed up productivity.

DataOps.live is the DataOps automation platform that helps enterprises operationalize data for trusted AI. With built-in automation, observability, and governance, DataOps.live enables data teams to deliver reusable AI data products and scale their impact across the business. Backed by Snowflake Ventures, Anthos Capital and Notion Capital, DataOps.live partners with global enterprises to make their data AI-ready.

Find out more at www.dataops.live


Episode Transcript

[00:00:00] Speaker A: I'm your host Keith Belanger, Field CTO at DataOps.live and Snowflake Data Superhero. Each episode this season we are exploring the topic of AI-ready data, so if you've missed any of our previous episodes, please visit the DataOps.live YouTube channel to subscribe and get notified about upcoming episodes. I'm really excited today. My guest is Doug Needham, principal data solutions architect here at DataOps.live. But Doug is also a seasoned data scientist and author. Welcome to the show, Doug. [00:00:28] Speaker B: Thank you, Keith. Glad to be here. [00:00:32] Speaker A: Before we really get into it, it'd be great for you to introduce yourself and give people a little bit of history, you know, from your Marine Corps days to data science and your role here at DataOps.live. [00:00:43] Speaker B: Thank you. So yes, I started my career in the Marines supporting operations around the world. I describe a lot of that in detail in my most recent book. But essentially, I maintained the mainframe databases that supported Marine Corps operations around the world, including Desert Storm. One of the cool stories related specifically to Desert Storm is that we put a mainframe on the back of an 18-wheeler and deployed it to Southwest Asia. It was a big database and we had to migrate all of it, and do it remotely. We did some tricks that you could consider creating a virtual mainframe. We had upgrades from one device to another. All of these interesting things happened. By the way, this was the 90s. There was no Agile, there was no Slack, there was none of this. I talk about doing stuff before Jira and people are like, what are you talking about? Well, you know, Jira didn't exist until the 2000s. So back in the 90s we were able to do this. And by the way, we did it with zero data loss.
And everybody survived; nobody lost their life because they didn't have the gear they needed. We did this remotely and completely in theater. So that was the start of my career, and everything just went downhill from there. I did a lot of DBA work migrating around SQL Server, Oracle, mainframes, and ended up doing a lot of data architecture and design. Then I decided: hey, with all of this heavy lifting I'm doing on data infrastructure, what generally happened is I would do all of this work, and then some data scientist, some mathematician or econometrics guy, would come in and go, I need to run this one query and put that in my model. So I would write the query for them, they would put it in the model, and then they would go present it to the leadership: hey, this is brilliant. You did one thing and I did all this other stuff. So I'm like, I'm going to figure out how to do that. I took a bunch of courses to get a data science certification and became a data scientist a number of years ago. So now I'm with DataOps.live. I help with getting our platform deployed into customers' environments, helping them figure out how to use our platform to solve their problems, and occasionally I get to work with some other really interesting data scientists. [00:03:34] Speaker A: That's great. You know, you were mentioning the 90s, and it was a different world, and here we are in what I'd like to call the AI era, which is really different. But you and I were having a conversation last week about how important it is to get back to basics. Why do you think that conversation is probably more important now than ever? [00:04:00] Speaker B: I'm going to use some analogies. I like to speak in analogies and I like to tell stories.
And so I'm going to tell a little bit of a story about the space program. If you look at the history of the space program starting in '69, you're like, hey, we have this Saturn V and we went to the moon and we did all these crazy things, and this was great. Go back a few years, though. Apollo didn't happen without Gemini. [00:04:35] Speaker A: Right. [00:04:36] Speaker B: How the space program worked was very much an iterative process. They would try to launch a rocket and it would fail. And they would go, okay, what failed? How do we do this differently? Using the tools that they had at the time. Humans, by the way, all by ourselves. We didn't need any help. Humans built these rockets and figured out how to do this kind of stuff. But it was an iterative process over a number of years to get there. The things that they learned in Gemini were very important to what ultimately happened: getting us to Apollo and landing on the moon. That was the iterative process. Those were foundational steps. So now that I've talked about a little bit of that foundation, I'll use it as an analogy. Too often, people see what's going on in the industry, in the AI world, and they're like, I want to do the AI stuff, right? Well, you want to go to the moon. Great. Have you built Gemini? Do you have any foundation? Do you even have a launchpad? Have you poured concrete? Have you put metal together that can handle the stress of the launch? Well, no, we haven't done any of that foundational stuff. We just want to go do AI. Right. Good luck. Call me in a couple months when you fail, and then we'll fix it. But you're still going to have to do this.
One of my favorite quotes from Bill Inmon is, you can pay me now or you can pay me later. [00:06:08] Speaker A: Yes, absolutely. [00:06:10] Speaker B: And, you know, you've met Bill and you know how he is. It's not so much that you're going to hire Bill to fix it, because Bill's not going to fix anything for you. He's going to tell you what to do right the first time. And if you do it, great, then you'll be successful. But it's that concept: whether you're involved with him or with whoever, you're still going to have to do these fundamental things, these foundational steps. And if you don't do them, you're going to have to redo them. [00:06:43] Speaker A: It's interesting because, you know, I started in the early 90s. I was trained following these principles, whether it be Inmon, and then Kimball, and then other folks talking about relational design and theory and those concepts. And then came the "dump all the data into a data lake and we'll figure it out" era, right? And to your point, I see that some of that stuff, those lost arts, those forgotten patterns, those capabilities, are now coming back to bite people. Because, oh, we've got to create a semantic layer or semantic model. It's like, well, if you had a model, you wouldn't be rushing to try to create a semantic model. You'd already have that model. Right. So I'm finding that people are recognizing they're not able to do that AI because they didn't do those fundamentals in their solution.
And they're having to go build these semantic models at the far right of the process versus having done it at the beginning. So, kind of building on that: you're interacting with teams out there. Where do you see teams cutting corners today in ways that come back to bite them when they're trying to build that layer of AI on top? [00:08:10] Speaker B: It varies. One of the good things about my role with DataOps.live is that I can caution people on cutting corners, and also, because of the way our platform works, so long as you do certain things, you're not cutting corners; you're building those things in. We're making it easy to do all those foundational things, and even easier now with some of the newer tools that I know you've been working with and talking about. Those foundational components, like keeping your documentation up to date: I want to make sure I have a description of this column at the database layer. And by the way, that's a comment that can be pushed out there and is really easy to keep track of. If it changes, fine, make a change in a configuration file and deploy that. Some of those foundational components, like you were talking about for a semantic layer: if you don't have a description of what columns are used for, that's ultimately going to have long-term consequences, and you're not going to have the opportunity to do the things you want to do because you took these shortcuts. It's one of those things I really caution people about: make sure you put this in there, because you will need it later. [00:09:44] Speaker A: Yeah. And what even compounds that is the speed organizations want to go these days, right?
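A minimal sketch of the idea Doug describes here: column descriptions maintained in a configuration file and pushed to the database as comments. The config shape, table names, and column names below are invented for illustration and are not DataOps.live's actual format.

```python
# Hypothetical config: column descriptions kept alongside the code,
# not hand-typed into the database. Names are made up for this sketch.
columns = {
    "customers.email": "Primary contact email, validated at ingest",
    "orders.ship_ts": "UTC timestamp when the order left the warehouse",
}

def comment_statements(cols):
    """Turn {"table.column": "description"} into COMMENT statements."""
    stmts = []
    for qualified, description in cols.items():
        table, column = qualified.split(".")
        safe = description.replace("'", "''")  # escape single quotes for SQL
        stmts.append(f"COMMENT ON COLUMN {table}.{column} IS '{safe}';")
    return stmts

for stmt in comment_statements(columns):
    print(stmt)
```

If a description changes, only the config entry changes; redeploying regenerates and reapplies the comment, which is the "change a configuration file and deploy" workflow he mentions.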
And then you start having these teams. Maybe you have 10 people, or maybe you have 100 data engineers. How do you keep that number of people working consistently and have that consistency across the board? At one point in my career, the peak I had was probably close to 150 data engineers working, and I can tell you we didn't have the tools we have now, like DataOps.live. I used to have a process that we tried to run programmatically with scripts to keep that level of consistency, because you cannot, as hard as you try, educate and train all of those people to do everything the exact same way every single time. So, we're talking about DataOps and operationalizing the fundamentals. Going a little deeper: how does DataOps allow us to take those fundamentals and make sure they can be consistently deployed within organizations, holding to those practices that protect those basics? [00:11:16] Speaker B: I think the approach that the DataOps.live platform gives to the data engineers and the data architects is that it's very easy to collaborate, and it's very easy to break things apart, modularize them, and focus on your individual component. First of all, if you have 100 data engineers working on a single project, that's going to be crazy. But let's say you have 10. Then, depending on the speed of changes you need to make and how many changes are coming in, you can break those into sub-teams and have, say, three people work on three different things, with one person overarching, keeping track of standards. Hey, did all of you name this field this way? Did all of you put comments in?
You know, I'm going to double-check. We're going to do code reviews. Oh my goodness, code reviews. Talk about the 90s. Let's make sure we're all doing this. And by the way, you work on this thing, you work on that thing. Then ultimately, as we promote things through the feature branches that we have, they're going to come together cleanly, and we're not going to have confusion and overrides and all these other things, because you've separated out the problem. Breaking a problem down into core components remains; that's not going to go away. You have these big problems, and those need to be broken down. You can break them down into logical components, give them to certain team members, and let them do their work. And so long as you do it the right way, when you merge things back together (we merge things in a stair-step fashion and build on stuff), then we have a release. Get it out there, get feedback. Oh, we need to make some changes. Okay, fine. Now that team can break it apart again into individual components and make changes. The platform makes it very easy to do that, and very robust for getting those new features deployed quickly, because the platform manages all of the continuous integration and continuous delivery. You can just let the platform handle that and focus on the value you need to provide to your customers and how to make this data product better for them. That's one of the value propositions I see in our platform: being able to allow that kind of work to get things out there. [00:14:12] Speaker A: Yeah.
You know, I go back to when I was, I would say, in the trenches and first discovered DataOps.live, and what was important to me. A lot of people kind of forget, right. I know when you talk to Guy, and when Justin and Guy came up with the seven pillars of DataOps. In my experience, many organizations, or even many different teams, will skip or not do one part or another. You brought up definitions, or documentation, or testing: oh, we delivered a data product, but it doesn't have the documentation or the testing. Another one didn't put in the governance policies. And then what you end up having are those pockets of half-baked solutions that have been put into production. Organizations are now finding, as they go into AI, that those missing parts impact the AI. Like you said: I don't have a definition on that, or I didn't put a test on top of that, and I'm now training a model with untested data. So, putting all the critical components needed to deliver a data product into production, and leveraging those seven pillars of DataOps, you can now make sure in an automated fashion that you're covered across your organization, whether it's 1 or 10. [00:15:42] Speaker B: Absolutely. [00:15:45] Speaker A: So, in your experience working with organizations going after AI that have actually gone from a manual process to automated DataOps, what changes have you seen, or what feedback have you gotten? [00:16:18] Speaker B: It's almost like an emotional release. [00:16:21] Speaker A: Yeah.
[00:16:22] Speaker B: Because in many ways, we as engineers like building things. [00:16:31] Speaker A: Yeah. [00:16:31] Speaker B: We like bit-fiddling. We like getting into the details of this kind of stuff. And so there's a little bit of inertia at first, because it's like, oh, wait a minute, I was doing all of this infrastructure stuff and you're automating it, so I'm not sure how to feel about that. However, once we explain the details and how all of this works, that we're not automating it, you're automating it with these configuration files and we're just helping you deploy it (that's what the platform is doing), then they're like, oh, okay. Wow, this keeps things very smooth. Now I can make this change and see it deployed very quickly. I'm able to experiment and try something slightly different, put it in production, and see how it works. If people like it, great. If they don't, I can disable it really quickly with just some configuration changes. Relatively recently, we were talking to a customer about how you would manage security. I have a very sophisticated security demo. It is actually a little confusing sometimes; I made it that way on purpose. I'm using a lot of Jinja logic and some configuration files and so on. Before I showed it to them, as I was ready to hit the share-screen button, I said, don't run away, because when you first see this, it's going to look frightening. But it's not; I can explain it to you. Let me answer your question this way. Let me show you what I did. I showed them the samples, how things were defined, how it deploys, and then how it applies all these grants to the various objects. Like I said, it was a little complicated.
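A rough sketch of the config-driven grants Doug describes. His demo uses Jinja templating; to keep this self-contained I use Python's stdlib string.Template instead, and the roles, privileges, and object names are invented for illustration.

```python
from string import Template

# One template, many grants: the security definition lives in data,
# and deployment renders it into GRANT statements. Entries are made up.
grant_tpl = Template("GRANT $privilege ON $object_type $object_name TO ROLE $role;")

security_config = [
    {"role": "ANALYST", "privilege": "SELECT", "object_type": "TABLE", "object_name": "SALES.ORDERS"},
    {"role": "ENGINEER", "privilege": "ALL", "object_type": "SCHEMA", "object_name": "SALES"},
]

def render_grants(config):
    """Render each config entry into a grant statement."""
    return [grant_tpl.substitute(entry) for entry in config]

for stmt in render_grants(security_config):
    print(stmt)
```

The point of the pattern is that adding or revoking access becomes a one-line change to the config list rather than hand-written SQL scattered across environments.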
So it took me about five minutes or so to walk him through it. But once I did that, the guy was like, oh man, that is so much better than what we've got. Oh my goodness. And can I do something like what you did? I said, you can see my code; I will share it with you. And if you want to build on it, great. Don't put my code in your production; that's bad. And he's like, oh wow, that's going to make life so much easier than the ways we've done it. Once you've overcome that initial hesitancy and explained how the platform manages some of these details for you, the engineers and the architects are still building it. It's just that the platform is handling some of the implementation details. [00:19:16] Speaker A: We're talking about AI, and this just popped into my head with how you're doing the code. We talk about AI solutions for business, but I think many of us on the engineering and architecture side are also leveraging AI to help us do our jobs. I'd be curious: how do you see the role AI plays in leveraging DataOps.live, being able to configure and do all these other things within DataOps, as kind of a force multiplier for yourself? [00:19:50] Speaker B: In the new book I'm writing, I have an analogy that I like so much I'm going to use it here. So you get a sneak peek at the new book I'm working on, at least a part of it. One of the things that Michelangelo said about David was that David was already there; he just carved away the rock that wasn't David. Or something to that effect. I don't remember the exact quote, so don't misquote me or Michelangelo. However, to answer your question: one of the things I like to think about with these text generation tools is that you can describe, in broad terms, the problem you think you need to solve, and the text will be generated for you. And that's your rock, right?
And there are times, depending on how well you describe it, where it can conform very closely to what you need. But you're still going to have to go in and tweak a thing or two. Having that foundation to work with is phenomenal for getting to that point of, oh, okay, now I have something that's working. Because I took a stab at it and described overall what I needed to do, all this text was generated for me. Now I can test it, tweak it, do this and that, and of course review it. Code reviews: hey, we just said that. I need to review what's there and put my sign-off on it. Now that I've tweaked it, this is good to go. Ultimately, what it will enable is this whole idea of rapid prototyping, where you can very quickly put something together, show it to your users and customers, get feedback, iterate on it, tweak it, and then follow your production rules. Like we were talking about, the data architects are going to say, hey, make sure you do this, make sure you add this column, make sure you have comments for all of the stuff. And those comments are either generated by the text generator or tweaked based on feedback from another human. I think it's the speed at which you find out how badly you're doing, because the sooner you get feedback on what needs to improve, the sooner you're able to make tweaks and improve it. [00:22:21] Speaker A: That's great. This season we've been talking about AI-ready data. If someone were to come up to you and ask what actually makes data AI-ready, how would you answer that question? [00:22:36] Speaker B: What problem are you trying to solve? What AI methodology, what AI tools are you using to solve whatever problem it is? Then we'll work back down into: is your data ready?
If you want to predict something, you need certain features to be available, and they need to conform to certain standards to feed into a regression model or something of that nature. So the question has never changed in my mind. The question is always: what problem are you trying to solve? The tools, the methodologies, the techniques, the platforms, all of this other stuff just makes it quicker to iterate. Ultimately, if somebody says, I want to use AI, I want to make sure everything's ready for it. Okay, great. What are you trying to do? Are you trying to improve your supply chain? If so, what are you looking at? Time-to-delivery metrics? Okay, are you capturing those accurately? Are all those timestamps valid? Those are the tests we need to run. It's always about what you are trying to do. That's the question that drives everything else. [00:23:53] Speaker A: Yeah. What advice would you give? I've seen and heard a lot of leaders in organizations who think AI readiness is a technology problem. [00:24:07] Speaker B: Not even a little bit. Oh, we could talk for days on this one. I've worked with application developers that have come to my DBA team like, hey, tell me how to put this table together because it needs to go into production this afternoon. No, I'm going to shoot you right now. That's not how that works. I've seen other things. It's like, well, it's taking us too long to do this stuff. I have this text field in my application where I can store stuff; I'm just going to open that up to my users and let them freeform stuff. Again, that makes me want to shoot somebody really quickly, because that's bad. There are ways to use freeform text and there are ways not to. But if you're just saying, I don't have any other choice, I'm going to make this and then tell users to type it in.
Now you're going to have to go through and do all this text processing and all this other stuff on it. Yes, it's a little bit easier today than it was 10 or 15 years ago, but there's a lot of overhead to it. Take a little bit of time and let's figure out how to do things the right way. Oh my goodness, there are just so many things I could talk about where I've seen bad use cases, and people trying to duct-tape something together to meet a deadline when, you know, take a breath and think about this; we could probably do something a little closer to best practices than just throwing something together. [00:25:41] Speaker A: Yeah. I know I could go on and on, and I'm looking at the time and I know we're getting close, so I'm kind of pivoting in my brain. But we're almost around... [00:25:52] Speaker B: Three minutes. [00:25:56] Speaker A: The future isn't about abandoning those fundamentals; it's about embracing them. And I want to talk a little bit about you. You're going to be speaking at WWDVC, right? What are you going to be talking about there? [00:26:16] Speaker B: So at the Data Vault conference, I'm going to be doing a hands-on lab, and my book, I'll bring it up here so you can see it: Data Structure Synthesis. Keith, you actually saw the presentation that this book is based on. I think I did it in '24, if I remember right. [00:26:34] Speaker A: Yes, you did. Yep. [00:26:35] Speaker B: And after that presentation, I had a number of people say, hey, do you have a book that explains some of the things going on here? I'm like, okay, all right, I'll write the book. So I wrote the book. And in the book, spoiler alert, there's a decent amount of math. I talk about some set theory, finite set theory, the way that relational theory came... [00:27:04] Speaker A: About.
[00:27:06] Speaker B: At least at a high level. I cannot duplicate what C.J. Date did in some of his books; his books are tomes. [00:27:14] Speaker A: Right. [00:27:14] Speaker B: You could kill somebody if you hit them with C.J. Date's books. But I do talk about those foundations, of course, and then it's like, okay, now here's how you build on that, right? [00:27:27] Speaker A: Yeah. [00:27:28] Speaker B: The big thing that Chris Date talks about is predicate calculus. I just recently wrote a blog where I incorporated this: you can build a table by telling a story with it. And so long as you're doing that, that's a good way of thinking about how to do this. Now, once you have that, or let me back up: those stories, those tables, do they actually conform to mathematical principles? In the book, I talk about the mathematical principles for how to discover those patterns. That's part of what I'm going to be doing at the Data Vault conference: actually showing people the math. And I'm going to put together some tools where they can upload some stuff to a little website, and it'll run through, do some regressions, and say, okay, this table that you have defined, really, it's three tables. It should not be one; it should be three, for these reasons. Then, by putting these tables together the right way, people are like, okay, so I'm conforming to this mathematical thing. What does it do? It decreases your data volume, it decreases your compute usage, and it makes your runtimes faster for anything you do with three tables instead of one. So if you want to reduce turnaround times and do things efficiently, here's how you do it. [00:29:08] Speaker A: I'll give my two cents on that.
As modern as we are today in technology, I often go back to those fundamental theories. C.J.'s book, I can't tell you how many times I've gone back to it. If I always go back to the fundamental theory, I can solve my problem. And it didn't matter if it was the 90s, the early 2000s, or today; it has held true. If anything, I would recommend, and we are at time, I heard the chime going off there, but I would definitely recommend everybody check out that book. It's great for understanding the fundamentals and getting back to the basics. Doug, it's been a pleasure having you. [00:29:49] Speaker B: On for a reason. [00:29:51] Speaker A: Exactly. It's been a pleasure having you on. I know you and I could go on and on, but we're out of time. Everybody, until next time, this is Keith Belanger. Remember: good enough data is not AI-ready data. We'll talk to you all later. Thanks, everybody. Bye. [00:30:04] Speaker B: Take care.
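The table-splitting idea Doug describes above, one table that the math says should really be several, can be sketched as pulling a functional dependency out into its own table. This is a toy illustration, not his actual tooling; the table and column names are invented.

```python
# Toy example: customer_name repeats on every order because it depends
# only on customer_id, not on the full order row. Splitting it out
# removes the repetition, which is the volume/compute win he mentions.
orders = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme", "amount": 250},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme", "amount": 75},
    {"order_id": 3, "customer_id": 11, "customer_name": "Globex", "amount": 120},
]

def split_on_dependency(rows, key, dependent_cols):
    """Split rows into (dimension, fact) tables when dependent_cols
    are determined solely by `key`."""
    dimension = {}
    facts = []
    for row in rows:
        dimension[row[key]] = {key: row[key], **{c: row[c] for c in dependent_cols}}
        facts.append({c: v for c, v in row.items() if c not in dependent_cols})
    return list(dimension.values()), facts

customers, slim_orders = split_on_dependency(orders, "customer_id", ["customer_name"])
```

Here three order rows shrink to three slim fact rows plus two customer rows, and the customer name is stored once per customer instead of once per order.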
