Episode Transcript
[00:00:00] Speaker A: Happy New Year and welcome to 2025 and this episode of our show, True DataOps. I'm your host, Kent Graziano, the Data Warrior, and each episode we're going to bring you a podcast discussing the world of DataOps with the people that are making DataOps what it is today. So be sure to look up and subscribe to the DataOps.live YouTube channel, because that's where you're going to find all the recordings of past episodes. So if you missed any, you can go there and catch up. Better yet, go to truedataops.org and subscribe to this podcast. Now, my guest today is the co-founder and CTO of DataOps.live, my friend Guy Adams. Guy is also one of the contributors and main authors of the TrueDataOps philosophy, the Seven Pillars of TrueDataOps, and of course co-author of DataOps for Dummies. Welcome back to the show, Guy, and Happy New Year.
[00:00:57] Speaker B: Thanks, Kent. Happy New Year to you, and great to be here.
[00:01:00] Speaker A: No, I think by now everyone listening to this show pretty much knows your background, so I'm going to skip over the usual introduction today. Now, as some listeners may have noticed, we had to cancel the last episode of 2024 due to an unfortunate unexpected event. So, Guy, if you'd be so kind as to fill folks in and say a few words about what happened here.
[00:01:25] Speaker B: Yeah, it's been over a month now, but it doesn't get any easier.
Earlier in December, after a long battle with illness, Justin, who was my co-founder and the CEO of DataOps.live, unfortunately passed away. I remember we did pretty much this event this time last year, and it was Justin and I bouncing backwards and forwards on our predictions for 2024. And it's difficult and poignant to be here doing it on my own this year. But as you know better than anybody, his passion for DataOps was absolutely contagious. We worked together, you, Justin and I, on all sorts of things, not least the TrueDataOps site, the Seven Pillars, and of course DataOps for Dummies.
But more than just the work, Justin was a friend to so many people. I had the absolutely horrendous job of emailing so many customers and partners and suppliers and other teams we'd worked with, and it's a horrible thing to have to do, but the response that I got back was overwhelming in terms of people sharing memories. Hundreds and hundreds of photographs, always with Justin smiling. There were even one or two that weren't taken in a bar, which surprised me a little bit.
But the themes coming across were friendly, warming, happy, and maybe the one I like the most: full of integrity. I think anybody that knew Justin would certainly ascribe all of those to him.
I remember very fondly Justin and I coming to visit you just after your retirement, and we had a wonderful time on the beach, partying with old friends and making new friends. I was reminiscing with Justin about that just a few weeks before he passed, so I know that was a great memory for him as well. He's going to be missed by lots of people, but there are very few that are going to miss his warmth and his counsel and his friendship as much as I will.
[00:03:18] Speaker A: Yeah. It's very sad. But, as Justin would want, we soldier on and continue with the work that is DataOps, getting the concepts of TrueDataOps out there into the world to help organizations be as successful as possible with their data. You know, had it not been for Justin, I wouldn't be doing this podcast. As soon as I retired from Snowflake, I had an email from him that afternoon asking me to come on board and help you guys as an advisor. And that eventually turned into doing this podcast, now in season three. So I'm deeply thankful for him and for all he did for everyone, including me, as we evolved these concepts with you and him and really tried to help move the industry forward.
So with that, let's get to today's main topic: Guy's predictions for 2025. As you know, this year we've been trying to take a step back, look at the world of TrueDataOps and how it's evolved over the last number of years, whatever it's been, four or five years now, and see what we've learned. You're in the middle of all of this as CTO, but obviously out there working with all kinds of customers in all sorts of industries, working on the product and evolving the approach to how people work with data. So let's get into your predictions for 2025, Guy.
[00:04:58] Speaker B: Great, thanks. And it is great to have that impetus to step back and look, because when you're deep down in this all day, every day, it's kind of like your children growing up. You don't realize how tall they've got until you look at a picture from a year or two ago, because you see them every day. It's the same for me with DataOps. It's only when I look back and think, what were we doing, what conversations were we having two or three years ago, that you realize just how far DataOps as a field, as a discipline, and we as a company have come. I remember it was Big Data London three years ago where the majority of the conversations I had started with: so what's DataOps then? Not who are you as a company, not how do you do DataOps better than anybody else. It was: what is DataOps all about? And you'd have to start by saying, okay, well, it's kind of a bit like DevOps, but it's for data, and so on. Those questions are gone. I haven't had that question in probably 18 months, which in itself is great. It means that things are maturing the way that we thought they would. This is now a well understood concept. And I think it's not just a well understood concept; I think the business value is well understood as well, which is the other critical piece. You've got to help people understand what it is and then why they should care. Both of those seem to be pretty well understood now. So we're moving on to the next level, which is: okay, how do I do it best? Which are the best technologies to use? Obviously people are now more and more aware that it's not just a technology challenge, but a people and process challenge as well. So people are starting to get more and more savvy about that.
So I think that leads me to my first prediction, which is that DataOps will, in 2025, become ingrained in every business process. And I use the word business there quite carefully, because if I look back, and I probably wouldn't have acknowledged it if you'd challenged me at the time, what we were calling DataOps two years ago might have even been labeled Data Engineering Ops. It was really targeted at kind of bottom-up, very engineering-centric problems. We were approaching the world very much from a perspective of: what are the problems data engineers are having, and how do we solve those? Which, looking back, I don't make any apologies for; I think that was needed. But where I've seen things change over the second half of 2024 and certainly into 2025 is a real shift into looking at this from the business perspective.
And rather than just taking the persona of the data engineer and saying, what do they need to do, how do we make their lives more efficient? It's taking a step further back and saying, well, where and why do we need those data engineers? And the answer isn't, and will never be, that we don't. But they are so overloaded, they've got so much stuff to do. Rather than just making them a bit more efficient, or a lot more efficient, how do we actually take entire problems off their plate? And a huge part of that is the self-service element, where we're moving more and more of this to the left. It first started out that anything that came from the business would still start with some sort of ticket, some sort of request, and the very first actual action was taken by a data engineer. When we launched DataOps.live Create last year, we moved that and said, well, the business users can actually start that process themselves. They can give a basic definition of what they want this data product to look like, and that will do quite a lot of the initial work. So by the time the data engineer first sees this, not only have they got a well-defined problem, they've got a well-scaffolded project set up with the right components and so on. Where we're going in 2025 is to complete that loop and say, well, in that particular case, why do we need the data engineer at all? Certainly for the simpler sets of use cases, why can't the business user that started off defining their data product, where they're getting the data from, how they want it to look, what tests they want, complete that process, get it maybe all the way into production, then go and make updates and changes to it and deploy those updates, and ultimately maybe even do the full lifecycle, up to saying, hey, this data is not needed anymore?
Now, there will be certain organizations who still want data engineers in the loop to do approvals. But I'll tell you, a data engineer can do 50 approvals in the time they can do two changes. So keeping them in the loop for approvals is maybe a good thing, and the efficiency gain is still there. What it does allow the data engineers to do is really spend their time focusing on the tricky problems, the thorny, difficult ones, the ones with complex data, the ones with complex requirements, and particularly the ones which are maybe more on the AI and LLM side of things, which is nothing like as commoditized as, say, a simple analytical data product with data transformation, data quality and so on. So I think that shift to the business, that shift to focusing on the needs of the business, is absolutely huge. And if I was to wrap up all my predictions into one macro-level prediction for the year, I would say that by the end of 2025, we'll have case studies of organizations saying 80% of our data products are conceived, built, updated, deployed and lifecycled by business users without needing hands on keyboards from a data engineer.
Will that be industry-wide? Will that be most organizations? No. But I think the thought leaders and the people who are early on in this journey will be hitting those sorts of metrics by the end of 2025.
[00:10:21] Speaker A: And is this in part because we've shifted our thinking from just building analytics dashboards to actually thinking in terms of data products?
[00:10:33] Speaker B: 100%. One of the things that thinking about everything as a data product does is allow you to put boxes and constraints around things. If I just think about an analytical data warehouse and the things that sit in it, I can do anything. It can be anything. It's an infinite problem space, and it's very, very hard to simplify an infinite problem space for a business user. What thinking about things as data products has done is let us say: okay, there's still an infinite number of data products, or types of data products, but if I ask which are the most common ones the business needs, and which are the simplest ones the business has requirements for, let's draw a box around those and try to self-service them. There will still be data products that have all sorts of arms and legs and complexity and that need great data engineers. But by treating these things as packaged data products, we can draw a box around some of that problem space and say, you know what, the simplest data products, the transformational ones, are also the ones that most business users want most of the time. And if we can take that, whatever it is, 70, 80, 90%, and enable self-service for those, then we free the data engineers up to work on the rest. So yeah, without thinking about this in the data product way, you just don't have that base unit of management to say: what is it that I'm enabling the business user to do?
[00:11:49] Speaker A: Awesome.
All right, what's your next prediction, man?
[00:11:54] Speaker B: So I think one of the things we are certainly starting to see happen is data products, and therefore the DataOps platforms that support them, becoming more multi-platform and more multi-cloud. This was starting to happen a little bit anyway, but I really think it has dramatically accelerated due to the release of Iceberg. Iceberg is kind of the enabling technology that's driven a lot of people down this road.
[00:12:24] Speaker A: I'll stop here just for a second for the people that aren't quite up to speed: a quick definition of what Iceberg is.
[00:12:32] Speaker B: Iceberg is essentially a technology and a format that allows you to store your data in a particular way, indexed and cataloged in a particular way, so that multiple tools, multiple query engines, can sit on top of it. Rather than having to have my data duplicated across three, four, five different tools, I can have it in one authoritative store, and different platforms and technologies can sit on top of that, each with their own unique capabilities, whether it be analytics and SQL, or data science, or applications.
[00:12:58] Speaker A: And I guess notably, Snowflake added support for Iceberg in the last year, right?
[00:13:04] Speaker B: And we've got customers doing this and seeing some really incredible results, shifting the underlying storage mechanism to Iceberg, or even using Iceberg as an ingestion mechanism, a way to bypass traditional ingestion, saying, look, it's being written straight to Iceberg tables and therefore there is no load process per se.
[00:13:22] Speaker A: Right. It's an external table in Snowflake now. Right?
[00:13:26] Speaker B: Yeah.
So what this means is that the definition of a data product, and arguably this should always have been the case, but in the early days it was a little bit gray, becomes truly independent of the data platform or the cloud technology that you host it on, or that you instantiate it on. So you say, okay, I've got my data in one place, but I still need to reflect that as a data product through technology A and technology B and technology C.
The challenge with that hybrid approach is that it works great when you consider your boundary, your publication, to be raw data. Okay, I can publish it as raw data, as tables and views, through technology A, B and C. As soon as you start to move beyond that, though, remember one of the fundamental premises of a data product: you can publish your data to users in one of many ways. Direct data access is one, a UI application is another, a REST API is another, a flat file on a file share is another, LLMs are another. So there are lots of different ways of interfacing into that data, and the only one that's really well standardized is the raw data itself. If I define a data product that has, for example, a React application UI, the way it gets deployed onto technology A, B and C will be very, very different, and it may be that some technologies simply don't support running React applications. So the move towards this hybrid model is a little bit at odds, certainly for organizations who are extending their data product interfaces beyond raw data into some amount of the application space. But I think that will resolve over time.
And I think you'll see people say, okay, I might deploy my data product to technology A. Let's say I deploy it to Snowflake, so I can do the raw data there, but I can also do my UI in Streamlit, and so on. Maybe when I deploy it to technology B, I also have to deploy that UI as a container to an AWS or an Azure service or something like that. So you may start seeing that it's not just technology A, B and C; it's technology A, and then technology B plus something else, to enable that full stack: the data plus the application on top of it.
But what all of that really means is that the need for DataOps, the criticality of DataOps, increases exponentially. When all I had was one way of defining data products and one platform to deploy them on, it was all in the same place. Now, for every data product, I've got this sort of abstract definition of it, then I've got the actual Iceberg store of it, and then I've got one, two, three, four potential instantiations of it across multiple different platforms. So first of all, I've got to deploy all of that together. I can't have the data look different if I'm looking at it through a SQL interface in Snowflake versus a React application on AWS; it's got to be standardized across those, number one. And then think about testing. Today, testing a data product on one platform is relatively straightforward, but every single change a developer makes now has to be tested across two, three, four different platforms. So the value, the necessity, of DataOps in that case increases exponentially.
[00:16:42] Speaker A: Right, because you've got to have that automated. There's no way you're going to be able to scale up if you're having to write and run all of these tests manually. And then we're back to needing a battalion of data engineers to make this work, rather than the business seeing the results and being able to validate and say, yes, that's the right answer, that's what we should be getting, we can move this to production.
[00:17:10] Speaker B: But I also think, and this is where there's an overlap with the first prediction, that the definition of the data product should be done in a broadly abstract way. The business user, to be honest, shouldn't really care whether this data product is deployed on Snowflake or on something else. They say: look, this is where I want to get the data from, this is how I want to manipulate it, this is how I want to present it to my users. And maybe there are checkboxes to make this data product available on platforms A, B, C, or maybe that's completely on demand, driven from the consumer side, from the data catalog. When someone requests this data product via technology C, maybe that triggers it; maybe up to that point it hadn't been deployed on technology C at all. So there's a whole bunch of different ways of doing it, but what the business user is defining is becoming more and more abstracted away from the way that it's actually being built and delivered. And the critical piece here is that the business user should know none of that, should not care about implementation detail. They just say what they want; how that's delivered is all about the DataOps process, the DataOps pipeline. And that's really where the data engineers of the future are earning their crust: building and extending those actual deployment mechanisms.
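To make that separation concrete, here is a hedged sketch of what an abstract, platform-independent data product definition might look like. None of these field names come from the episode or from DataOps.live's actual configuration format; they are invented purely to illustrate the split between what the business user declares and what the pipeline deploys.

```python
# Hypothetical sketch: the business user's declaration is pure intent;
# per-platform build and deployment is the pipeline's job, not theirs.
from dataclasses import dataclass, field

@dataclass
class DataProductSpec:
    name: str
    sources: list[str]        # where the data comes from
    transforms: list[str]     # how it should be shaped
    tests: list[str]          # quality checks the owner cares about
    interfaces: list[str]     # e.g. "sql", "rest_api", "streamlit_ui"
    target_platforms: list[str] = field(default_factory=list)  # the "checkboxes"

spec = DataProductSpec(
    name="customer_360",
    sources=["crm.accounts", "billing.invoices"],
    transforms=["join on account_id", "mask PII columns"],
    tests=["row_count > 0", "no null account_id"],
    interfaces=["sql", "streamlit_ui"],
    target_platforms=["snowflake", "aws_container"],
)

# The pipeline, not the business user, decides how each target is realized.
for platform in spec.target_platforms:
    print(f"deploy {spec.name} -> {platform}")
```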
[00:18:21] Speaker A: Yeah. Oh, okay. All right, so next, AI as an extension of data products.
[00:18:28] Speaker B: Yeah, it's a difficult one. We have the same conversation every year about AI; it changes a little bit each time. In terms of data products, I would consider this as having two meanings. The first one is AI within the data product. As I said, you can have lots of different ways of interacting with your data as a data product, and one of them could be a natural language interface. That would be a great way of interacting with certain types of data, and I'll tell an anecdote in a minute about a personal project I worked on that very much had this. So there's the consumption of data through something like an LLM.
Then, in the middle, you've got the generation of data via an LLM. The classic example is: I get all of the user comments from my trouble ticketing system or my CRM system, I process them to find out customer satisfaction and things like that, and I store the result. You've now got AI in the data product itself, actually in the pipeline, in the transformation process. But then you've also got, and this is the area that I work mostly in and am most excited about, AI to help me build data products. Over the last year I've released a number of blogs and videos where I've built entire data products, with sophisticated, not 101 but sophisticated, interactive user interfaces, with external lookups for geolocation and heat maps and all this sort of sophisticated stuff, without ever writing a line of code, without writing a line of SQL, completely through a natural language interface. That's our Assist module, which again we released last year. And that has really changed the game for less technical people, because you can now do so much more knowing so much less. But it's also changed it for people like me. Well, my engineering team would disagree that I'm a Python programmer, but for argument's sake, say I can program a little bit of Python.
[00:20:31] Speaker A: So somebody like me, who doesn't code Python and doesn't know Streamlit or any of those tools, can potentially build a data product using them, without actually doing it myself, by using a natural language interface.
[00:20:47] Speaker B: Yeah. I want to get data from here, I want to join it together in the following way, I want it filtered in the following way, then I want to put a user interface on top that allows me to select this, this and this, and I want the results displayed this way and that way. And then, in the video that I did, I said: oh, by the way, go and look up each city name to get a latitude and longitude, then aggregate those latitude-longitudes and show me a heat map of this data presented on a map of the world. That's pretty sophisticated stuff. And then I said, it's a bit slow, implement caching. And it implemented an entire caching layer based on half a sentence from me. So this is really, really powerful stuff. And it's not just for the people that really couldn't do it any other way; it's also for people like me that in theory could, but it meant I could build an application in 20 minutes.
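The episode doesn't show the generated code, but to give a feel for what "implement caching" expands into, here is a minimal hedged sketch of a caching layer around a slow lookup in a Streamlit app. The geocoding function and its values are stand-ins, not the code the demo actually produced.

```python
# A hedged sketch, not the demo's actual output: caching a slow external
# lookup so repeated Streamlit reruns return instantly.
import time
import streamlit as st

@st.cache_data(ttl=3600)  # memoize results for an hour across reruns
def geocode_city(city: str) -> tuple[float, float]:
    time.sleep(2)          # stand-in for a slow external geolocation call
    return (45.46, 9.19)   # hypothetical latitude/longitude

city = st.text_input("City", "Milan")
lat, lon = geocode_city(city)  # cached after the first call for this city
st.write(f"{city}: {lat}, {lon}")
```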
[00:21:26] Speaker A: It makes you a lot more productive.
[00:21:28] Speaker B: Yeah, I can turn out 10 of those in a day quite happily, as opposed to spending the best part of a day building one manually myself.
[00:21:35] Speaker A: Well, and back to the first comment about abstracting the definition of the data product: this is like the epitome of that. It's abstracted out so you don't have to know the nitty-gritty details of the technology you're going to use to implement it. You do have to understand the concepts of how it works, like the idea that, hey, to speed it up, I can say implement caching. You have to understand what caching is, when you would use it and why you might want to use it, but you don't have to know how to code it on any particular platform. The AI is going to do that for you.
[00:22:13] Speaker B: Yeah, but you bring up a couple of points. You mentioned caching. As an engineer by trade, I know what caching is, but actually there are a lot of situations where I can simply say, this is slow, make it faster, and it will just work out how.
[00:22:27] Speaker A: I mean, that's something a business person could do.
[00:22:29] Speaker B: Yeah, right, absolutely. So it really is changing the game in terms of AI being used to create data products as part of an augmented process. When I talk about those 70, 80% of cases where business users can create data products and own the complete lifecycle, a big part of that is coming because we're putting generative AI in that process. They describe what they want to do, it will find relevant source data, then it will suggest how to transform it, then it will suggest automated tests for them. And of course we're still expecting them to go in and say: I like that test, I don't like this test, add that one.
But a huge amount of the heavy lifting will be done with generative AI, and the business user is really there providing sanity checking and handling a few special cases, rather than grinding through configuring 50, 100, 200 automated tests.
[00:23:27] Speaker A: Yes, that's right. Back to your first prediction of having it be part of the business process.
It's really going to become ingrained, and people aren't even going to know that it's DataOps. Ideally, in the end, it's just: this is how we build data products. We may not even call them data products, who knows? But it engages the business.
[00:23:53] Speaker B: Yeah. And so much of the technical complexity has to be abstracted away. We're talking about people that don't know what Git is, don't know what a branch is, a commit, a merge request. And they shouldn't need to know. These are things we've been imposing on non-engineering people for a long time, and the reality is they shouldn't need to know. That doesn't mean those things won't exist; they need to be there for good governance reasons, and the data engineers will absolutely want them. But the business users shouldn't know or care about that. They should open it up, have a nice way of making and validating their changes, confirm it's what they want, send it for approval, and get a green light when it's been approved. That's it.
[00:24:37] Speaker A: We have a question from our buddy Omar: do you see a future where there is a DataOps expert in the shape of an AI agent acting as a DataOps engineer? And I think that's really what we're talking about.
[00:24:48] Speaker B: Yeah, it's definitely coming. In fact, I had a very interesting philosophical conversation with someone the other day. If we consider where we're using generative AI today, we're taking the human input, putting it through some fine-tuned large language models, taking the output, whether it be SQL or YAML or Python or whatever, and storing that in the repository, and then you can bring it back up and edit it. The philosophical question was: why wouldn't we just store the human bit in the repository? If we need SQL from that, or Python, or YAML or whatever, why can't that be generated by a fine-tuned large language model on the fly? So you start the pipeline, the pipeline is given a set of human imperatives that say, this is what you're supposed to achieve, go forth and do it. And I think one day we'll get there. I think where this really starts bumping into problems is governance. I was going to say...
[00:25:42] Speaker A: That'S what came to mind is like how do you know what code was actually run unless you store the code? Because the LLMs evolve and AIs, you know, they're are supposed to get smarter over time. And so the way it generated the code today, it might have a completely different way to generate the code that may or may not be more efficient in six months.
[00:26:01] Speaker B: Exactly.
[00:26:01] Speaker A: And you've got to be able to do that comparison.
[00:26:03] Speaker B: And the reality is they are still very, very black box; you have absolutely no idea why they gave you what they gave you. Of course they advance in aggregate, but it doesn't mean that something that worked really well last month, or last version, doesn't cause problems this version. So I think we're a while away from that. I really think the value is more in being able to express what I want in natural language.
[00:26:29] Speaker A: Right.
[00:26:29] Speaker B: And then have it generated for me, have it all previewed, make sure it's all working, but then have it committed and stored as a much more rigorous thing that I can show an auditor I've got good governance over, and so on.
[00:26:42] Speaker A: Right, yeah. And for the listeners, just a little warning: we're probably going to go a little long today, because we've got one more prediction to go through and a couple of other things Guy wants to talk about.
[00:26:54] Speaker B: So I think you, we've touched on maybe my, I think my fourth prediction is, is around data agents. And this is a term that, you know, I hadn't really heard until the beginning of last year. And, and, and you know, I, I saw it kind of morph a little bit over the, over the year and I still don't think it has a rigorous definition.
So as I said, data products can have, and I think in many ways should have, lots of different interfaces. Even if I look internally, we have a customer success data product. We collect data from all over our platform and bring it together, and it helps us analyze how customers are using the system. And we've got three different interfaces on it. We've got a high-level executive view that shows the system as a whole, how the whole platform's working. Then we have lower-level interactive user interfaces where our customer success team can look at an individual customer and check how it's working for them and whether they've got any problems. And then we've got the raw data, which the product team might dip into because they want to find out, say, how many people are using a version of this that's older than whatever. So it's the same data, but it's got three very different ways of being presented, for three completely different personas and three completely different sets of use cases. There's no way our executive team is going to be writing SQL, but there's also no way we're going to have a dashboard that answers all the questions the product team want, because those questions change multiple times a day.
I think really where data agents come in is as a new and very exciting way of interfacing into that data.
So they don't become data products themselves, maybe, but they become a really powerful way of interacting with your data. As I said, the product manager wants to know how many people are using a version of this module older than version X, but today they're translating that into SQL. Why can they not just ask the question exactly as I phrased it? And similarly the executive team: they're perfectly capable of saying, I want to know which of my customers have shown the biggest increase in usage in the last three months.
Maybe the dashboard doesn't do that, but they can absolutely ask the question. Today they'd send that question as a change request and someone would update the dashboard; there's no reason that needs to happen. So I think data agents really become this very flexible way of allowing any type of user, from the least technical to the most technical, to interact with that data in ways that haven't had to be predetermined. With most of the ways of doing it today, certainly when you're building BI dashboards or applications, you're having to guess or research what the use cases will be. This bypasses that and says: you don't have to tell me what the use cases are going to be.
I can respond on the fly.
[00:29:29] Speaker A: Dumbing it down just a bit: it's like a really intelligent chatbot specifically for that data product.
[00:29:35] Speaker B: Yeah. So I was thinking about how to illustrate some of these predictions, and I've got a real-world example. It's actually not a work example; it's an example I did with a friend of mine. So we were having a drink and...
[00:29:51] Speaker A: They, how many, how many of these conversations started with we were having a drink, we're having a drink. And then that's how we started this whole Trinidad Ops thing too.
[00:29:58] Speaker B: Exactly. So we're having a drink, and my friend is looking to buy a house in Italy. Unlike in the UK, and I suspect in the US, where we have all of these aggregators, so it doesn't matter which estate agent, which realtor, is advertising your property, they all get aggregated up into one of a small number of sites and you can look in one place and see everything, apparently that doesn't exist in Italy. It's very, very regional. If you want the whole scope of it, you have to go to hundreds of different websites, all of which work differently, all of which have different ways of outputting the data and different ways of querying it. Some can filter by this, some can filter by that. And he said, my biggest problem is that every time I visit them, I don't know which listings are new. There are probably only two houses added since I last went there, but I don't know which they were. So we had a bit of a think about this problem, and, simplified as it is, it's a very nice definition of a data product. First of all, we had to collect the data. We could have done this ourselves, but there are some brilliant tools out there now for automating web scraping. So we put in a bunch of the URLs of the websites, and they automatically worked out how to do the scraping; they would recurse through all the pages, get all the house information, and give us back JSON objects. And then we dropped all of those as semi-structured data into Snowflake. So that was our data acquisition stage; that's the first bit of our pipeline.
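As an illustration of that acquisition step, here is a minimal hedged sketch of landing raw scraper output as semi-structured JSON in Snowflake. The table, column names and sample payload are invented for illustration; the episode doesn't describe the actual schema.

```python
# A hedged sketch of the landing step: raw scraper JSON into a VARIANT
# column, schema-on-read. All names here are invented placeholders.
import json
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_listings (
        source_url STRING,
        scraped_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
        payload    VARIANT    -- the whole scraped JSON object
    )
""")

scraped = {"url": "https://example-agent.it/villa/123",
           "text": "Villa with 3 camere da letto, 2 bagni, 150 mq ..."}

# PARSE_JSON turns the serialized string into a queryable VARIANT value;
# INSERT ... SELECT is used because VARIANT can't be bound directly in VALUES.
cur.execute(
    "INSERT INTO raw_listings (source_url, payload) "
    "SELECT %s, PARSE_JSON(%s)",
    (scraped["url"], json.dumps(scraped)),
)
```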
Then we said, okay, well we need to extract certain bits of that JSON because that represents the actual text of the page.
Of course, every page is different. Some have got a little details bar at the top that gives you some key information, some don't, and the key information differs from site to site. We decided to ignore all of that and just throw the whole lot into an LLM and say: I want to know the following things. How many bedrooms? How many bathrooms? How many square meters is the house? How many square meters of land? Sometimes it comes out in acres, sometimes hectares, sometimes square meters, but I wanted the answers standardized. So, using Snowflake Cortex, we turned completely free-form text into structured data, and we could then do filtering and things like that on it.
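As a rough illustration of that step, here's a hedged sketch using Snowflake's SNOWFLAKE.CORTEX.COMPLETE function to pull structured fields out of free-form listing text. The model choice, prompt wording and table names are assumptions, not the project's actual code.

```python
# A hedged sketch of free-form text -> structured JSON with Snowflake Cortex.
# Table/column names and the 'mistral-large' model choice are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE OR REPLACE TABLE listings_structured AS
    SELECT
        source_url,
        SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-large',
            'From this property listing, return only JSON with keys '
            || 'bedrooms, bathrooms, house_sqm, land_sqm. Convert acres '
            || 'and hectares to square metres. Listing: '
            || payload:text::STRING
        ) AS extracted_json    -- one LLM call per row, inside the pipeline
    FROM raw_listings
""")
```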
But then we got really clever, because he said, one of the things I want is not to do too much work; I want something I can move into straight away. And it's very rare for that to be included in the description. You get that by looking at photographs, which is fine, unless you've got a thousand-plus houses to look at. So we built an API into a public generative AI model, hosted that in Snowpark Container Services, and then for each property we threw in all of those images that we'd stored on a stage. And we said: right, based on these images, answer the following questions. On a scale of 1 to 10, how good is the decor? How neat is the garden? And even to the point where we said, guess how many bedrooms there are. That one was more of a science experiment: could it accurately guess the number of bedrooms in a property just from the pictures, and could we then compare that to the number of bedrooms that were advertised? It was really interesting, because of course some bedrooms you've got three, four, five different shots of. Can it work out that they're the same bedroom from different angles? And it was remarkable how accurately that data matched the actual metadata.
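The episode doesn't name the model or the API behind that Snowpark Container Services service, so the sketch below uses an OpenAI-style vision chat endpoint purely as a stand-in. Every name here, the client, the model, the scoring prompt, is an assumption for illustration, not the actual implementation.

```python
# A stand-in sketch: scoring property photos with a vision-capable LLM.
# The real project hosted its own model behind an API in Snowpark Container
# Services; this substitutes an OpenAI-compatible endpoint for brevity.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_property(image_paths: list[str]) -> str:
    content = [{
        "type": "text",
        "text": ("Based on these photos: rate the decor 1-10, rate the "
                 "garden tidiness 1-10, and guess the number of bedrooms. "
                 "Treat different shots of the same room as one room."),
    }]
    for path in image_paths:  # attach every photo of the one property
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(score_property(["villa123_1.jpg", "villa123_2.jpg"]))
```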
So we'd now got a set of structured data based both on free-form text and on images. My friend is not super technical, so then I had to put together a UI for him to interact with this. I built a Streamlit app that allowed him to sort and favorite them. But the most important thing for him was: show me all of the villas that we haven't already seen. So a background process would run once a day, find all the new villas and do all this processing as part of the pipeline, and then once a day he'd log in and say, right, these are all the villas I haven't seen; I like that one, I don't like this one. But then he was asking me questions I was having to answer using SQL, like: is this a good price for a villa like this?
And that's where the data agent came in. I've got all this data in Snowflake, I've got Cortex, so I added what is essentially a chatbot to my application. When he had a villa that was interesting, he could ask: is this a good price for a villa of this size in this area? Are there other villas of a similar size in this area, or is this the largest? And even things like: how does the price in this area compare to the price in that area? He was now able to interact with that through a UI, but ultimately treating the data product as a data agent. So I built a data product with three interfaces: a raw SQL interface, which I used; a UI interface that had the basics, filtering, sorting, saving, things like that; and a data agent interface, which allowed my friend to actually ask questions of the data and make decisions on the basis of that.
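Here is a hedged sketch of how those last two interfaces might sit side by side in one Streamlit app: a basic filter/browse UI plus a chat box that hands free-form questions, along with a slice of the data, to Cortex. Table names, connection setup and the agent prompt are all illustrative; a fuller agent would generate and execute SQL rather than inlining data into the prompt.

```python
# A hedged sketch: the same Snowflake-backed data product exposed as a
# simple UI and as a chat-style "data agent". All names are illustrative.
import snowflake.connector
import streamlit as st

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

st.title("Villa finder")

# --- Interface 1: basic UI, filter and browse ---
min_beds = st.slider("Minimum bedrooms", 1, 8, 3)
cur.execute("SELECT * FROM villas WHERE bedrooms >= %s ORDER BY price",
            (min_beds,))
villas = cur.fetch_pandas_all()
st.dataframe(villas)

# --- Interface 2: the data agent, free-form questions over the same data ---
question = st.text_input("Ask about these villas")
if question:
    # Pass the question plus a compact slice of the data to the LLM.
    cur.execute(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', "
        "'Answer using this villa data: ' || %s || ' Question: ' || %s)",
        (villas.head(50).to_json(), question),
    )
    st.write(cur.fetchone()[0])
```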
It's a very simple use case in theory, but when you look at it, it shows so many different things about the process of building a data product and all of the different ways that AI in particular can be used. By the way, I didn't write most of that code by hand. The vast majority of it, the data collection, the ingestion and the user interface, I wrote using Assist, so generative AI helped with all of that, because I wouldn't have been able to do it all myself otherwise.
[00:35:01] Speaker A: There was Data Ops, Live Assist, your, Your. Your AI product. Yeah, and so in the end now you've got a, you've got a data product that sounds like you could, there's probably a market for it in Italy. You could put it on the Snowflake Marketplace now and the up a subscription model and you got a little side hustle here.
[00:35:20] Speaker B: I could, I'm just hoping to get a free holiday out of it, to be honest.
[00:35:23] Speaker A: Oh yeah. Then when he buys a villa there in Italy, you can still visit.
Awesome. Well, that's great. All right, we've kind of run up on time, so a quick review here. Your predictions for 2025: DataOps becomes ingrained in every business process.
Data products become multi-platform in a hybrid, multi-cloud environment. That's a mouthful, but we all know that's coming; we all know that's real. AI becomes an extension of, and/or incorporated in, data products, which is the example you just gave. And then data agents become the new data products, or at least every data product can now potentially have a data agent as one of its interfaces.
[00:36:11] Speaker B: Yeah, awesome.
[00:36:13] Speaker A: I mean, that's a lot. And you say we're seeing bits of it already, and obviously you just built one, so we know it can be a reality. The question is how many people pick it up and run with it, right?
[00:36:27] Speaker B: Yeah, but I think, you know, the re, you know, the reality is enough people are now seeing on a, on a, on a daily, weekly basis that A, this is real, B, this works and does what it says in the tin and they've got the first hand experience of that, that, you know, everybody who's been through that process and done that evaluation has very much got to the point of like, I can't afford not to do this. You know, the co, you know, I, the one thing that hasn't changed in all the years I've been doing this is the conversations I have with heads of data or similar, which is next year I'm, I've got to do a heck of a lot more work and I've got, you know, either the same or less or maybe a few more people. But you know, I'm out of 20 more people and 2,000% more work. That hasn't changed for as long as I've been doing this. You know, and if anything that's, you know, that, that challenge, that dichotomy of, you know, doing more with less or more with the same is increasing. So yeah, this just becomes more and more and more critical.
[00:37:19] Speaker A: Yeah. All right, so what's coming up for you here in the new year events and things that folks might be interested in or where they might find you?
[00:37:29] Speaker B: So we're, we're getting prepared at the moment. There's the, the Snowflake Build event in Emea in London that's coming up early February. That's going to be really exciting. So we're taking our, our new application development toolkit is being released as part of that. So we're taking people who've never even heard of what a Snowflake native app is. I've certainly never built one, probably never built a streamlit application. And in 60 minutes, in a, in a, in a classroom kind of workshop environment, we're getting to build and publish their first meaningful native application. And we're doing that through Create, which is going to get them through all of the plumbing and infrastructure piece and then we're doing that through Assist, which is going to allow them to talk and describe what they want and have that met and then Datrox pipelines to build all of that, test all of that and deploy it into the test Snowflake account. So this is the sort of thing that, to do if you gave that challenge to someone today without this realistically as a multi week activity to learn all the things you need to know. As with all these things, our goal is to kind of simplify it away and say you don't need to know most of that, you just need to know what it is that you're trying to achieve and some very basic high level things. And we'll take care of all the plumbing and bits and pieces underneath. So kind of risky to offer 100 people that they're going to be able to build a, build and publish a native app in 60 minutes. But you know, if you don't, if you don't aim high, you don't achieve. So yeah, we're really excited about Snowflake Build and then yeah, we've got a whole bunch of events, you know, you know, podcasts and blog posts happening and other things we're going to be at. And obviously all this is working up to, you know, Snowflake Summit in the summer, which will be another huge, huge event as it always is.
[00:39:05] Speaker A: Yeah. And that'll be in San Francisco in June, right?
[00:39:07] Speaker B: Yep.
[00:39:08] Speaker A: And where's the Build EMEA?
[00:39:11] Speaker B: It's in London. I think we've got a QR code for the Build EMEA that we can put up on the screen, so if you're not registered for it, we'd love to see you there. And if you're going to be there, do come and find me. I'll be doing the hands-on lab for an hour and a half, but otherwise I'll be around the show; I'll be at the DataOps booth.
But yeah, it's a really, really good event, very much focused on the people actually doing this every day. It's not a management event or a business users' event; it's for the people on the ground actually working with these technologies day in, day out. A really valuable event.
[00:39:44] Speaker A: Great. And of course folks can connect with you over LinkedIn. And do me a favor, everybody: if you connect with Guy as a result of listening to the podcast here, just put in the note that you heard his podcast interview with Kent, so that Guy knows why you're reaching out to him and knows the context. That'd be awesome.
All right, well, thanks so much for your insights and your predictions today, and for being the guest, Guy. It's great having you back on the show, and it's always good to look forward to what's going to happen in the next year.
We move on, continuing to try to put some of Justin's vision into reality, to make that happen and provide value to the folks out there in the world that are trying to do everything with data.
[00:40:39] Speaker B: Yeah, we're committed to keeping his vision and keeping it moving forward.
[00:40:44] Speaker A: So yeah, thanks for me.
[00:40:46] Speaker B: Appreciate it.
[00:40:47] Speaker A: Yeah. And thanks to everyone else online for joining. Be sure to join me again in two weeks, as usual, when my guest is going to be an independent consultant and Data Vault, DataOps and Data Mesh expert, my buddy Paul Rankin. He knows a lot of stuff, and Guy's laughing; we've all known Paul and worked with him for years now. He's like the poster child for everything that we talk about, and he's been there and done it in the real world in large corporations. So I'm really looking forward to talking to him.
And as always, be sure to like the replays from today's show and tell your friends about the True DataOps podcast. Don't forget to go to truedataops.org and subscribe to the podcast to make sure you don't miss any of the forthcoming episodes for the rest of our season. So until next time, this is Kent Graziano, the Data Warrior, signing off for now, and again wishing you a happy and productive 2025.
[00:41:45] Speaker B: Thanks, everyone. Bye.