Episode 59

February 23, 2026

00:28:31

~TrueDataOps Podcast - Amit Kapoor - Ep.59 - S4.EP6

Hosted by

Kent Graziano
#TrueDataOps


Show Notes

Join Keith Belanger, Field CTO at DataOps.live and Snowflake Data Superhero, as he discusses 'DataOps as a Force Multiplier for Data Engineers in the Era of AI' with his guest Amit Kapoor, Senior Analytics Platform Consultant at FormativGroup.

Amit has over 20 years of experience designing and delivering data platforms. He's worked across many industries including media & digital streaming, retail and eCommerce, healthcare, and financial services, helping organizations build scalable, governed analytics ecosystems that enable analysts and data scientists to work faster, safer, and with greater confidence.

He specializes in modern data architectures, pipeline automation, and operational excellence, blending technical rigor with practical strategy to unlock value from data.

Listen today!


Episode Transcript

[00:00:04] Speaker B: Hey everyone. Welcome to another episode of True Data Ops. I'm your host Keith Belanger, Field CTO here at DataOps.live and Snowflake Data Superhero. Each episode this season we are focusing on the topic of AI-ready data. If you've missed any of our previous episodes, you can always visit the DataOps.live YouTube channel. Subscribe and you'll get notified when we release upcoming episodes. Today I'm really excited to have my guest here, Amit, who is a Senior Analytics Platform Consultant at FormativGroup. Amit, welcome to the show. Happy to have you.

[00:00:41] Speaker A: Thank you, Keith. Happy to be here.

[00:00:44] Speaker B: Actually, I'm not going to introduce you and describe who you are. For people that don't know you, give a little bit of background about yourself.

[00:00:55] Speaker A: Yeah, I've been in this industry for about 20-plus years. I started my career at Verizon as a mainframe developer and worked my way all the way up to now AI analytics engineer, going through Microsoft and a bunch of consulting companies in between. I've always been data focused, since before big data was even called big data, working in big data spaces with mainframes. I live, breathe, and think about data all day long.

[00:01:33] Speaker B: For those of you who do not know, I had the pleasure of working with Amit on some past projects, one of them where we actually implemented DataOps.live together at a healthcare organization. I used to consider Amit my data engineer magician: he would take my thoughts and ideas and make them a reality. So it was always a pleasure to work with you, Amit. Now, going back to that initiative we worked on together in healthcare.
You know, AI at that time wasn't the headline or the major topic, but what were some of the real problems we were trying to solve?

[00:02:18] Speaker A: We were getting data from various locations on various schedules, with schemas changing — a lot of schema drift. Everybody was trying to build and develop at the same time. A lot of our challenges were just dealing with all the data, the different data quality issues, and people stepping on each other's toes. Meaning, if I'm in development and I notice schema drift in a file I'm changing, I may update or alter a table, add a new column, but then I don't redeploy my views, so my views break. And we're trying to push something into QA at the same time, and now there's code in development that shouldn't make it to QA, and we have to cherry-pick commits or drop objects. It was just a lot of chaos at the very beginning, because everybody wanted to build fast and deploy fast, but you're breaking things at the same time. The other challenge was deploying from QA into production: again, there's stuff that shouldn't be in QA because it's not ready or something failed testing, but you still want to deploy half that release into prod. Now you're again deploying things manually instead of in an automated fashion. That was a lot of it. At the same time, if you used, let's say, an Azure DevOps, none of the pipelines were really data focused — they're built for code like Java or Python — so it didn't account for running the dbt tests.
And then we spent a lot of time building these pipelines to make them resilient, only to find out some software package got updated to a newer release and broke our pipelines. Then we would spend hours, even days, debugging what we thought was an issue with our code, but it turned out to be an issue with the pipeline, because some software package got updated and we didn't know about it.

[00:04:36] Speaker B: Right. Yeah, it definitely was a lot of chaos. I want to unpack that a bit. When we brought DataOps to the table, how did that actually help that environment — that chaos, as you were saying?

[00:04:53] Speaker A: Honestly, one of the simplest things was that as a developer, I open a branch and I have a database — just separation. You have your own isolated environment to work off. I create a feature, that feature is associated with a database, and that database is cloned from prod or dev or QA, whatever you want your stable environment to be. When you're cloning, those same permissions get applied, so you don't have to worry about all the rules that exist in one environment. Your feature branch exists, and you basically have an isolated development environment where you can build, test, and break things without impacting the other streams of work going on. I think that in itself was the biggest game changer. We didn't have to worry about which pipeline to use or this or that dependency. It's not a black box per se, but we knew DataOps managed all the dependencies — they have their own release cycles and their own testing. We just had things that worked. If I create a feature branch, not only do I have my own isolated database, I also have the dev environment. It just works. It's not like, as a developer, I have to install XYZ components and make sure all the dbt packages are in sync.
I'd otherwise have to make sure I have the same version of dbt or Airflow, whatever have you. Instead I can go into DataOps, go into the UI, click develop on my feature branch, and everything's right there. No more "it works on my machine but not on the dev server" — it's one isolated environment that I can develop on.

[00:06:55] Speaker B: When you're working with other developers on your team — and I know in that case we definitely had more than one developer — that still gives you that independence without having to worry about stepping on each other's toes.

[00:07:08] Speaker A: Yeah, exactly. And let's say they checked in a feature that I needed: I can always rebase my branch onto the latest version of dev, grab that, and test it without worrying about conflicts. Again, that alone was a huge time saver. It also forced everybody to check in code. A lot of times as developers we think, oh, it's just a simple insert or a simple update statement. Once you take that ability away, things get checked in and tested more thoroughly, as opposed to "let me just quickly update this table to get you on your way." It forces people into the rigor of real software development practices instead of being cowboyish and, you know, altering a table directly.

[00:08:01] Speaker B: Yeah. And you still get that opportunity in the life cycle to eventually merge together, to make sure what you're doing and what the other developers are doing harmonize before they go into production.

[00:08:15] Speaker A: Yeah. And then also think about a hotfix. Your prod environment is going to be different from your UAT environment, which is going to be different from your QA environment. Creating a hotfix branch that clones the prod database into an isolated environment and applies that hotfix — that is like, wow.
To me at least, I didn't have to think, oh, I can't just work in UAT because we just deployed a feature that changes this table, so now it's different from prod. Of course you can clone a database, but then how do you check that the whole pipeline works against that database? That itself is a game changer, right?

[00:09:05] Speaker B: Yeah. Back before, we used to have to do a break-glass scenario — get a temporary password and try to go right into prod and fix it right there. That had risk in and of itself.

[00:09:17] Speaker A: Yeah. And another thing, especially when it came to QA: say in production you're getting number Y and in QA you're getting some other number. Just being able to do a diff, compare the two branches, and know that all the code is identical — that there's no code that was manually changed in production versus QA. You can use Git for what it's meant for, to figure out what the differences are. That was a huge time saver in itself too.

[00:09:55] Speaker B: Yeah. And when you're in a regulated industry, like on that healthcare project, that's important to have, to reduce risk. Right?

[00:10:05] Speaker A: Yeah. And one of the features not a lot of people think about: every time DataOps does a build, it also creates docs. So you can compare dbt docs from production to QA and literally go through the different flows and see the lineage at that level. Every build generates docs. It's not "did you run the docs?" — they're just there. And you can go back 30, 40, 50 days, 50 commits, and literally compare the lineage version by version.

[00:10:49] Speaker B: Yeah.
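The version-by-version lineage comparison described above can be sketched against dbt's own build artifacts: each `dbt docs generate` run writes a `manifest.json` whose `parent_map` records every model's upstream dependencies. The sketch below is a hypothetical illustration of the idea, not DataOps.live's actual implementation — it assumes you have retained the manifest from two builds and simply diffs their lineage edges.

```python
import json

def lineage_edges(manifest_path):
    """Flatten a dbt manifest's parent_map into a set of (parent, child) edges."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {
        (parent, child)
        for child, parents in manifest.get("parent_map", {}).items()
        for parent in parents
    }

def lineage_diff(old_path, new_path):
    """Return the lineage edges added and removed between two builds."""
    old, new = lineage_edges(old_path), lineage_edges(new_path)
    return {"added": new - old, "removed": old - new}
```

Run against the manifests of, say, the production build and a QA build, a non-empty `added` or `removed` set immediately shows where the dependency tree drifted between the two versions.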
[00:10:49] Speaker A: I don't know of any other platform that does something like that.

[00:10:53] Speaker B: No. And I've worked on projects where you didn't have a platform like this, and you'd say, oh, we'll come back to the documentation in a couple of releases, or we'll do a sprint for that. That technical debt never gets addressed, and then you're trying to chase it down. To your point, those are mundane but important things that just happen, without having to say, oh, we have to plan for those activities.

[00:11:20] Speaker A: Yeah. And to a certain degree, in the back of your mind you know that this commit is going to generate a dbt doc by default. It forces you to document, because you know it's going to show up in the documentation. You can't forget about the task, because there's literally a task that says docs generated. It forces you to document your code and make sure that lineage is right.

[00:11:53] Speaker B: Exactly. One thing you mentioned earlier in the conversation was reproducibility. Why is that such a critical thing?

[00:12:08] Speaker A: I mean, in healthcare, I know when I worked in certain organizations, auditors would come and audit your data flow. So you may have a calculation — how did you get to that number? Well, we've had six releases since then. Being able to say, this is the exact code and these are the exact calculations that happened from start to finish, is hugely important. I don't know how many hours it would save in auditing.
Especially when you're being audited by an external party, it sometimes takes six months to figure out not only what version of the code produced that calculation, but also the lineage behind it. Because, yes, that function might not have changed, but how do you know the upstream lineage didn't change in a way that altered the calculation from version A to version B? Being able to see that complete pipeline and all those steps is crucially important. The other aspect is that you may have external systems like Collibra — how do you keep everything in sync for every version? You can actually go into DataOps and say, show me the difference between this commit and that commit, not only for this function, but all the way down the tree.

[00:13:51] Speaker B: Yeah. And when you have multiple data engineers — you could be doing what sounds like a simple thing, like generating a hash key or a checksum. If one data engineer has something slightly off in that calculation versus another, things blow up in that scenario.

[00:14:11] Speaker A: Exactly. That definitely causes many issues,

[00:14:19] Speaker B: that's for sure. So you've worked across multiple industries and multiple architectures. What does it mean for DataOps to be, as we say, a force multiplier for the data engineer? What does that mean to you?

[00:14:37] Speaker A: I mean, it's a lot. But number one, knowing that every time I make a change, I have an isolated environment before I even touch development, before I even touch anything else. Knowing I can roll back and not having to think about that. Not having to worry about, oh, I've got to put a ticket in to create a new database because I want to test some new functionality, or maybe I need to test a new package.
Just being able to say "create branch," knowing I have my isolated environment, frees me up to do that. Otherwise you put a ticket in to create a database — and then, are the grants right? — and wait two days just to have an isolated environment to do some testing, or maybe to try a new feature, maybe to try Cortex Analyst. Having that environment isolated from development and from everybody else's, knowing I'm not impacting them, is a huge change. I don't have to knock on developer A's door or talk to the DBA or IT. I'm ready. I can go.

[00:15:56] Speaker B: Yeah. And when you have data engineers coming onto projects who might have different backgrounds or different ways of doing things, you don't have to spend — at least in my experience, we didn't have to spend — weeks training people on how we do it. This is how we do it. Just use it.

[00:16:19] Speaker A: Yeah. DataOps was the tool. You go to DataOps, create a branch — you don't even have to set up your environment. Our onboarding instructions were: click develop in DataOps, and you have your environment, with access to all the connections. Nobody has to know any passwords, because it's all integrated. This is stuff a lot of people don't think about — even the governance around a development environment. You're forced to use Key Vault or secret keys. Everything's baked into the environment, so as a developer I don't need to know passwords to anything. I can just go into my environment, and DataOps tells me what I have access to. As an administrator, you divvy that out; developers don't see the secrets, but they have access to them. It just works. That's a huge time saver.

[00:17:20] Speaker B: I would say some of our biggest competition is people saying, I can just do GitHub Actions.
I can just build my own GitHub Actions pipelines, I'm good. But I think what you just described is that there's more to this than what you can do with GitHub Actions alone.

[00:17:36] Speaker A: Even if you think about it — yeah, sure, you can have private keys and set all that up, but that's just Snowflake. There's a whole ecosystem around you: AWS keys, S3 keys you may need access to, websites you may need to pull data from. It's not just Snowflake. How do you orchestrate all of that and manage all the dependencies? Now you're creating a full-time role just to manage it. You may have two or three people building out GitHub Actions pipelines. Rather than developing and adding value to your business, you're developing internal tools, which means you have to manage those internal tools, which means you now have to worry about security issues, Docker runtimes, and Kubernetes clusters, because you have ten different developers committing at the same time. And then how do you scale that environment? Oh no, you need to rotate keys. There's a lot to it. It's not as simple as "let's just create a GitHub Action." Yes, you can, but now you're managing that environment instead of adding value to your business.

[00:18:47] Speaker B: Right. And you hit a good point there: Snowflake is kind of the core, with everything around it. When we were building these architectures, as you mentioned, there's Collibra, there are third-party ELT tools — you name it, it goes on and on. It's like a membrane around Snowflake that has to be orchestrated with everything you're doing inside Snowflake as well.

[00:19:14] Speaker A: Yeah. Especially at the organization where we were both at — they were using Collibra as their data governance tool — keeping Collibra and Snowflake in sync.
You can build a tool, of course — you can use GitHub Actions to build one — or you can go to DataOps, get the operator, plug in a few variables, and know it works, and know it's going to work with the next version of Collibra, because there's a team on their side managing all those dependencies. It's not something I have to worry about: oh, now Collibra is getting upgraded, our actions are based on the old library, and we have to burn hours and pull tasks away from feature A to upgrade this Collibra connector.

[00:20:08] Speaker B: Yeah. And those things never happen in a timely fashion. They always happen at the critical time, in the middle of the night. Something changed, something broke, and you're investigating it instead of delivering to your

[00:20:20] Speaker A: business. Because it's not your full-time job to monitor all these third-party apps — you're not reading the release notes or tracking when an API is going end-of-life or introducing breaking changes. So that was a main game changer for us.

[00:20:38] Speaker B: Yeah. So we weren't really using AI then, and AI is the talk of the town now. How would you say an organization doing it manually, without that DataOps discipline, compares to one that has implemented DataOps when it comes to moving into AI initiatives?

[00:21:03] Speaker A: I think especially around governance and testing of the data. Take Cortex as an example. Of course you can just point it at your data, create a semantic view, and start asking questions. It's going to generate the SQL, and it'll generate the SQL properly. But if you don't have checks on your data — as simple as foreign key checks — the analyst is not going to say, oh, there's a missing key. When you and I were developing reports, we looked at the data and said, yeah, that doesn't look right.
Let me do a left join and see if there are nulls. The analyst is going to give you the SQL based on the question, but it's not going to say, oh, you missed half these transactions because — going back to your surrogate key — someone forgot to coalesce one of the field values and now the key is null. Just having those checks, and making sure they run every time you do a build, is even more important in the AI world, because these are business analysts using Cortex or similar intelligence tools to just ask questions, and they're going to run with the answers. It's not a report developer spending three months validating that data every step of the way. So it's even more important to make sure those tests are in place. Then there's governance: is every object properly tagged and viewable, do you have row-level permissions? All of that needs to be checked every time your data flows through. Otherwise you're exposing data — the AI is not going to know that this person shouldn't have access unless there are checks and balances and guardrails in place. Even sanity guardrails like: you can't have more customers than the US population if you're dealing with the US. Those need to be in place and thought through before you do an AI implementation. Because AI is great, but it's just being trained on the data that's available to it.

[00:23:14] Speaker B: That's a key point you just said: being trained on the data that's available. It takes that as fact — especially now, when you're doing things almost in near real time. You're loading that in, you're training that model, and it's going to take it as fact, whether it was or wasn't. In old-school analytics it was, oh, my graph looks funny, or my pie chart is off. I can remember the phrase we used all the time: ah, it's statistically insignificant.
We used to say that all the time: you're not going to worry about it. But that has huge impacts in the AI space.

[00:23:53] Speaker A: Yeah. And especially with millions of transactions, even if you're off by a penny on a couple of transactions here or there, that adds up. And in a very heavily regulated industry, that could be the difference between passing an audit and failing one. And even if you just fail an audit, that takes away from the trust factor.

[00:24:19] Speaker B: Right. Yeah, with your business. I can remember many conversations — I know you and I had many conversations with the business — and the last thing you want to do is lose their trust.

[00:24:31] Speaker A: Yeah. Right.

[00:24:32] Speaker B: I know we were talking about this before we even got on: we want to make sure our reputation as data practitioners is trusted. So putting that behind our work is critically important.

[00:24:49] Speaker A: Yeah, 100%. And there are little things too. An AI wants to help, so it's being as helpful as possible, but it's not going to check whether categories from source A and categories from source B line up. Even adding those checks and mapping rules, and flagging, oh, we have new categories that don't line up — because let's say you're searching for some kind of medical treatment, and the data isn't classified correctly and there are misclassifications: it's not going to return those records to the users. And now all of a sudden user A gets one number and user B gets another, and they're saying the data is bad, you can't trust your data. The data is actually fine — the transactions are there. It's just how you query the data.
Not anticipating some of these data quality issues can undermine the trust in your data at the end of the day.

[00:25:58] Speaker B: Yeah. And in some of these industries, like healthcare, sometimes people's lives are at stake with the information you're providing out there. Right?

[00:26:09] Speaker A: Oh yeah, that's the importance of it.

[00:26:12] Speaker B: So I know we're getting close to time, and I know you and I could probably go on talking forever. If you could give a piece of advice to a data engineer, or to teams chasing AI, what advice would you give them in this space?

[00:26:32] Speaker A: I would say focus on looking through your data, adding the rules, putting the guardrails in place, really checking your data governance, and really focus on tightening up. At the end of the day, AI can be used to generate SQL; the bigger picture you have to worry about is your pipelines. Can you reproduce the number the AI is generating? If you can't, why not? You should be able to. Even with ten different check-ins and ten different deployments, you should be able to reproduce a number. If you can't, backtrack it all the way down. Look at how you're versioning data and how your pipelines are being built. Look at all the little things that make the system more reliable before you embark on this AI journey. Because AI is great, but without proper data, without proper guardrails, without proper data quality checks, it's not going to be trustworthy. And if it can't be trusted, no one's going to use it — and that's where shadow IT comes into play.

[00:27:47] Speaker B: I love that — shadow IT. Yep. Well, to wrap it up: AI doesn't eliminate the need for DataOps. As we said here, it makes it essential.
The fundamentals we put in place years ago are exactly what is enabling AI success today. So, Amit, it's been a pleasure catching up with you. I'm always happy to talk to you, and it's been a while, so it was great to catch up today.

[00:28:16] Speaker A: Thank you.

[00:28:18] Speaker B: So again, until next time, this is Keith Belanger. Remember: good enough data is not AI-ready data. Thanks everybody, we'll see you all next time.

[00:28:26] Speaker A: Bye.
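The surrogate-key pitfall discussed in the episode — a missed coalesce turning a key component null, so joined transactions silently drop out of downstream numbers — can be sketched as a small data-quality check. This is a hypothetical illustration only (function, field, and row names are invented, and rows are plain dictionaries standing in for table records); in SQL the equivalent fix is wrapping each nullable key component in COALESCE before hashing.

```python
import hashlib

def surrogate_key(*fields, null_token=""):
    """Hash a pipe-delimited concatenation of key fields, coalescing None
    to a fixed token so the same logical record always yields the same key.
    Without the coalesce, a NULL component would poison the key (in SQL,
    'a' || NULL is NULL) and the row would vanish from inner joins."""
    parts = [null_token if f is None else str(f) for f in fields]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def rows_with_null_key_fields(rows, key_fields):
    """Data-quality check: flag rows whose surrogate-key inputs contain
    nulls, so the gap is caught at build time instead of by an analyst."""
    return [r for r in rows if any(r.get(f) is None for f in key_fields)]
```

Running a check like this on every build is the kind of guardrail Amit describes: the flagged rows surface before an AI assistant generates SQL over the data and quietly undercounts.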