Episode 9

February 06, 2024

00:31:21

Sonny Rivera - #TrueDataOps Podcast Ep.27 (S2, Ep9)

Hosted by

Kent Graziano

Show Notes

“If you’re going to do this at scale, you really do need an automated process. There are too many steps, too much complexity, too much data… to do this in a timely way, you need automation and repeatability.”

The latest episode of the #TrueDataOps podcast, live on LinkedIn then available on demand, saw Snowflake Data Superhero and ThoughtSpot senior analytics evangelist Sonny Rivera join host Kent ‘The Data Warrior’ Graziano. “DataOps is applying software development principles to the entire lifecycle of data. And that lifecycle has grown over the years.” Before, Sonny says, you had to build solutions yourself: “DataOps is about operationalising and automating that, from ingestion all the way through to consuming analytics through self-service and other collaborative tools.” Watch this episode


Episode Transcript

[00:00:03] Speaker A: All right, welcome back, everyone, and happy New Year. Welcome to our episode of the True DataOps podcast, the first one of 2024. I'm your host, Kent Graziano, the Data Warrior. Now, in each episode, we try to bring you a podcast covering all things DataOps with the people that are making DataOps what it is today. If you've not yet done so, please be sure to look up and subscribe to the DataOps Live YouTube channel, because that's where you're going to find all the recordings from the past couple of years. If you missed any of our prior episodes, you know, start the new year off and get caught up. And better yet, if you just go to truedataops.org you can subscribe to the podcast and then you won't miss any more of our episodes. So my guest today is my good friend, Snowflake Data Superhero and senior analytics evangelist at ThoughtSpot, Sonny Rivera. Welcome to the show, Sonny. [00:00:58] Speaker B: Hey, thanks, Kent. And I'm glad to be the first one of 2024. So I'll kind of check that off and make that a, you know, highlight. [00:01:09] Speaker A: Yeah, your bucket list checked off. First guest of the new year. There you go. [00:01:16] Speaker B: That's right. [00:01:17] Speaker A: So for the folks who don't know you, can you give us a little bit about your background in data management and a bit about ThoughtSpot and what you do there? [00:01:25] Speaker B: Sure. Well, I'll start a little bit about me, Kent. I started in software development a long, long time ago, and I started in the defense industry. Right. Building TV-guided weapons systems, of all things. Yeah. So, okay, you and I are old enough to remember those TV-guided weapons from the first Gulf War. And so when you saw those on CNN, that's what I was working on. And really, from that experience, I learned the value of the software development process, and I think that was really important for me, you know, whenever that was, 25, almost 30 years ago.
And since then, I've moved on to building data platforms, specializing in data modeling and building cloud data platforms. You know, you and I are both Snowflake Data Superheroes, so we've worked in that space for some time. And I've built platforms for small companies, for embedded analytics, and large companies, large financial services companies and that sort of thing as well. So just a few years ago, I came out of the data platform side and joined ThoughtSpot because I just absolutely love the product. So let me tell you a little bit about ThoughtSpot, for those of you who don't know who ThoughtSpot is. ThoughtSpot really is AI-powered analytics for the modern data stack. So we work well with your cloud data platforms, Snowflake, Databricks, GCP, others, but we're really based on natural language queries, natural language search, and we've been in that space since 2012. So we've been way ahead of the curve when it comes to natural language search. We have several patents in that space, and we really have focused on enabling every user, whether you're a technical user or you're a non-technical user, to do analytics. So that's a little bit about ThoughtSpot. I'm super excited. By the way, I'd be remiss if I didn't mention last summer we acquired Mode Analytics. So we've joined forces with Mode Analytics. So whether you're business-first, kind of a non-technical user, or you're a data team that is code-first, we've got you covered on both ends for your analytics. That's a little bit about ThoughtSpot. [00:04:01] Speaker A: Great, thanks. So prior to you going to ThoughtSpot, you were really down in the trenches there building Snowflake-based solutions for a couple of years. Hence the Snowflake Data Superhero. And that's how you and I got to know each other, after one of your talks about IoT and barbecue, one of your other passions, making barbecue. And then you have all this software development experience.
So you know, from all that experience perspective, you know, what do you think DataOps is? How do you describe DataOps, and how does it really fit into this ever-evolving data landscape that we have? [00:04:40] Speaker B: Yeah, I think of DataOps really as applying our software development principles to the entire lifecycle of data. And so that lifecycle has grown over the years. When we first started out, there weren't tools to do this. So if you wanted to have a data pipeline with an orchestrator where you had some observability, you actually had to build it all yourself and cobble together pieces of it. And so it was very, you know, immature, error-prone and that sort of thing. So I would say DataOps from my perspective really is operationalizing, automating the entire lifecycle of data, from ingestion all the way through to the end where you're consuming that analytics through self-service and other collaborative tools. [00:05:36] Speaker A: Good. Yeah. Do you think it's possible for organizations to actually effectively deliver the value from all this data with the scale that we're at these days if they're not adopting some sort of agile DataOps approach? [00:05:50] Speaker B: Yeah, I think if you're going to do this at scale, you really do have to do this with an automated process. There are too many steps, there's too much complexity, there's too much data. And so I do think if you're going to get it into the market in a timely way, you need automation, repeatability. Think about something your listeners may or may not know about, the Capability Maturity Model. And this goes all the way back to early software development, where we had, yeah, I just have this ad hoc process. Yes, I do it, but every time I do it, it's a snowflake and I do something different every time. Now I have a repeatable process, right? Now I have a documented repeatable process. Now I have an optimized process. Now I have metrics for improving the process.
So I do think that if you're going to do this at scale, if you're going to do it with quality, you have to have a DataOps program. No doubt about it. No company out there today would not have a software development process that automated their deployments so they can do continuous deployment. [00:07:09] Speaker A: Right? Yeah. Because that's in our seven pillars of True DataOps. What started off the whole conversation on DataOps, when I got working with Justin and Guy back when I was the evangelist at Snowflake, was the need for CI/CD. Because that was like the number one question back in, I don't know, it's probably 2017, 2018, that we kept getting from Snowflake customers: yeah, I know about DevOps, right? And I know about Jenkins and Cucumber and all these other tools that we could go build things in. But how the heck do we do that with data? How do we do that in Snowflake? And so that was, I guess, one of the key principles we were looking for: you know, how do we enable people to do CI/CD effectively with data? [00:07:55] Speaker B: Right, right. And so one thing I love about just the whole continuous integration, continuous deployment, CI/CD, is the ability to be alerted and make people aware of what's going on, and when something broke and who actually broke it. In my early, early days in software development, still being a bit of an embedded analytics geek, I got some software and a card and a little marquee and we posted it up over the developers' cubes. And so anytime the build broke, there would be kind of a shame board: "Sonny broke the build." Right. It's been broken for this many minutes or this many hours. Right. So as soon as somebody did it, as soon as I broke the build, everybody in the building knew, and it was a little bit fun. It was fun, not humiliating, put it that way. [00:08:52] Speaker A: Yeah, yeah, you got a good sense of humor about it. But yeah, that's because that's part of CMM, but also TQM.
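The build-break alerting Sonny describes maps naturally onto data pipelines. As a rough illustration (not any specific tool's API — the field names and rules here are invented for the example), a CI step can run simple assertions over a freshly loaded batch and return a non-zero exit code so the build turns red and the team gets notified:

```python
def check_batch(rows):
    """Return a list of failure messages for a freshly ingested batch."""
    failures = []
    if not rows:
        failures.append("build broken: batch is empty")
    for i, row in enumerate(rows):
        # Hypothetical business rules standing in for real contract checks.
        if row.get("customer_id") is None:
            failures.append(f"row {i}: missing customer_id")
        if row.get("amount", 0) < 0:
            failures.append(f"row {i}: negative amount")
    return failures

def ci_gate(rows):
    """Exit code for a CI step: 0 = green build, 1 = 'Sonny broke the build'."""
    return 1 if check_batch(rows) else 0
```

In a real pipeline the failure list would also be pushed to a notifier (Slack, email, or a marquee over the cubes) rather than just returned.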
Since you learned from the government space, I'm sure you're familiar with that. Total Quality Management says there has to be a feedback loop, which today we call observability and monitoring. Right. But it needs to be automated. It doesn't do any good to automate your build process if you don't have the metrics being automated as well, and the feedback being automated. Like you said, the automatic notification that something went wrong, that turned out to be very critical. Which is again another one of the seven pillars of True DataOps: automated testing and monitoring. Again, it doesn't do any good to test if you're not actually paying attention to the results. [00:09:42] Speaker B: That's right, that's true. Not just of your data pipeline, but all the way through. We see this from your business operations. Think about ThoughtSpot out at the end, where business users are leveraging metrics in their everyday operations. We don't need to go to a Liveboard or go to something every single day, every single minute to check a metric. It may be better to go ahead and push and notify: hey, you're trending out. You're not out of your band yet, but you're trending out. So let's take some action and be proactive. And we do that in ThoughtSpot as well. So these concepts are, I think, a little more universal and can be applied in a lot of different places. [00:10:30] Speaker A: Exactly, yeah. Like you said, it's the entire data lifecycle from source through consumption. Right. And so since we're onto that, you know, one of the hot topics that's been buzzing around recently is data products, talking about data products on the consumption side. You know, it comes out of the data mesh principles, but it's certainly been more universally adopted in the data world now that, yeah, we need to be producing data products that are consumable.
So what's your take on that concept of data products and what it means, and how important it really is, again, in our landscape as it's gotten more and more complicated and big? [00:11:16] Speaker B: Yeah, well, I will say so. Data products, that idea has come out of data mesh itself. Right. And you've got the four pillars of data mesh around having data products and domain-oriented self-service. And, you know, I think what, federated. Federated compliance. Right. Or federated. [00:11:35] Speaker A: Federated governance. [00:11:35] Speaker B: Yeah, federated governance, right. [00:11:37] Speaker A: Federated computational governance, if I remember the phrase correctly. [00:11:42] Speaker B: That one trips me up a bit. But you know, I think a lot of people, when they think about data mesh in general, they focus on data products, and they might focus a little more on distributed, you know, kind of a distributed domain-oriented architecture there too. And so some of the other principles might kind of slide off. So you hear people talking, well, we're doing this meshy type of thing. It's a little meshy. But I do think data products are important here. And the concept to me is that your data really isn't valuable unless it's actually in motion and you're actually doing something with it. The idea of hoarding data, you know, we've gotten really good at ingestion, we know how to load data into lakes and we can throw it all there. Now what the heck do you do with it? And so building data products and having specifications, things like contracts, things like testing, things like automation for those, is critically important. And, you know, the idea that, let me say, it's not a centralized product, but it's decentralized and it's domain-oriented, so that the people that know the business processes and know the data own that. So I love that process. Definitely still, you know, maturing. Right.
The data mesh, it's a framework, not necessarily a hard architecture. [00:13:16] Speaker A: Right. It's not an architecture, it's not a methodology. Yeah, I think framework's probably a good term for it. It's a good way to be looking at the architecture and the perspective that you're taking as you build it, with the goals that you're trying to achieve. Like, we're trying to build data products that are going to be consumable, visible, traceable. There's a lot of observability words in there. Right. We start thinking about data catalogs and things like that so that folks can find the data and use it. One of the things that I learned, I worked for the feds very early on in my career as well and learned Total Quality Management there. And the one concept that I think I carried through from them into the Agile days and into Data Vault and up into the cloud was this concept of internal customers. And I saw that come out with the spec on data products from data mesh, you know, where the domain team has to think of everybody who's going to potentially consume the data from their domain as a customer. And a lot of those are just internal customers. You're talking about trying to build kind of a federated data warehouse, whether you're doing an internal data marketplace, whatever that is. You're really talking about focusing on your, in many cases, internal customers in your organization. And can you build a data product that makes it easy for them to consume that data and do what they need to do with it? Which means you've got to understand their perspective and what they're trying to achieve, right? [00:15:04] Speaker B: Yeah. And we saw that in software development too, right. Where we were saying, oh, we're going to develop specifications and SLAs for this particular product. The data is going to come across this way.
Here's what it means, here's the shape, the format, all those data specifications, as well as here's how quickly we're going to get it to you. So we had these, for lack of a better term, SLAs in place around that data, or around a software development API or solution. So I love the idea of bringing these principles over from Agile, from Lean. My first love was software development. So yeah, I love bringing these things over, and they're critically important for organizations that want to get to scale. Because I'll circle back, you had a question on scale. The challenge here is to be able to be efficient and be effective. So to be able to have a tool that lets you efficiently get products to market, do it with high quality, and be effective in doing that. That's really the challenge for a lot of organizations. So I can't emphasize enough having that culture of DataOps. We hear about data cultures; a subculture of that is that you have a DataOps culture within your organization. It can't be, you know, the old style, top-down, some data architect says thou shalt do it this way. Right. It really does have to build, you have to build a culture around that. [00:16:50] Speaker A: Yeah, yeah. So that's from the data engineers up. Right. They have to be on board with using Git and doing check-in and check-out and not just pushing stuff over to production without it being tested. You have to build that entire organizational culture really from the top down and the bottom up. Right. And there has to be support from the top to allow them to build that culture down at the bottom. [00:17:17] Speaker B: Yeah. And I think this expands too as you go out. If you think about those business users, or these users that are using data in a self-service way. Right. And so now they need to be aware of the data culture and that we have this automated way of doing things, whether it is building a new metric or keeping it up to date or, you know, getting data built.
And what is the, you know, what's that SLA again for when you're going to receive that data? Is it real time, is it daily, monthly? And what's the quality of the data, or the meaning, the semantics of the data, so that you as a self-service user can do more with it. Ask that next question and that next question and that next question and get to the insight that you need to take an action. [00:18:03] Speaker A: And that's really kind of where the concept of data catalogs comes into the ecosystem. There's got to be someplace where the business users can go to get that understanding and see the SLA. How fresh is this data? I remember building data warehouses where we ended up building dashboards on top of the ETL process so that we could see when was the last time the process ran. So we know, did we get yesterday's data or not, or did it fail overnight and we really got 48-hour-old data rather than 24-hour-old data? Right. But it's all there. And yeah, the more we get out to the outside of it, and out into the actual business of running the business, and the business analytics users, they need access to that data so that they can trust what they're making their decisions on, that it's something that they know. [00:18:57] Speaker B: Yeah, let me just add to that too. When you get out to the edge, it's a fan-out problem. We call this in software metrics a fan-out problem: if you don't address it at the source, it gets harder and harder to keep fixing data and the issues associated with it as you do more joins, more usage added, added, added. You get this fan-out problem where the number of challenges associated with it keeps growing. So you really have to get it as close to the source as possible, and that's where your DataOps program helps you out tremendously. [00:19:33] Speaker A: Yeah, I see there's a question came in here.
How have you overcome a scenario where data engineering leaders don't subscribe to the concept of data catalogs? [00:19:48] Speaker B: Yeah, if that question is for me, I would kind of push back to say, let's take a look. One thing I've done: let's take a look at what your teams are doing and do a day in the life, what I call a DITL. Go sit in the day in the life of that business user or that analyst who's asking the question, and you will quickly see the value of that data catalog and understanding. And then I would also tie it back to, because we've all had this: Kent said the number was X. Sonny says the number is X times 10%. You know, and now we've got this challenge. And so to be able to tie it back to something they've experienced directly shows that value. [00:20:39] Speaker A: Yeah, yeah, yeah. You've got to make the business case, basically, whether it's in the IT department or outside the IT department. [00:20:46] Speaker B: Yeah. And in any one of these things, you know, your data product really is only as good as your ability to communicate its value to your stakeholders as well as to the marketplace. So you've got to be able to do those two things. I will, at the risk of running off the rails here a little bit, Kent, we saw recently two great coaches in football retire. Right. Bill Belichick, and we saw Nick Saban retire. I worked for a company out of Tuscaloosa, Alabama. Nick Saban would come and talk to us, talk to us about motivating us and about process. But the idea here is process really is king. Nick Saban was all about the process, and the outcomes will take care of themselves. Do the right things that give you the best chance at those outcomes and they will take care of themselves. So having a repeatable process, understanding how each piece works together, is mission critical.
And so I would look to pushing those processes into place so that you get consistency, repeatability, and, I don't know, seven national championships. You know, fairly repeatable, fairly repeatable. [00:22:05] Speaker A: He's got a good record there. Yeah. So since you're a fellow data modeling enthusiast, I have to ask the question: where do you see data modeling playing a role in all that we're doing? Especially with the, I'll say, the evolution and increasing growth and hopes for AI. [00:22:27] Speaker B: Yeah. Well, one, I think there is so much data out there. There is a ton of data out there. You know, whether we're getting data from IoT streaming of our barbecue systems, or transactional systems that are out there, or, you know, even now sharing data in from third-party partners. So I believe modeling becomes even more important now, so that we know the origins of that data, the quality of that data, the granularity of that data, so that we can solve the problems that are in front of us as a business. So to me, modeling has become even more important. I think there was a day when we just thought we could all query raw files directly. [00:23:23] Speaker A: We don't need no stinking schemas. [00:23:25] Speaker B: Do the schemaless thing. And schemaless also meant semantics-less. We didn't know what these things meant. So I think when you get to the point where you say, oh, I need to understand the meaning of the data, now you need models, just like we do every day. We think in models every day. We use, you know, kind of 80/20 rules that aren't perfect, but they model what we are doing, what we're trying to achieve. So, you know, having that idea of thinking in models really leads you to the idea of, hey, I really need to model my data so I can solve problems based on those models. They're repeatable, they're easy. [00:24:07] Speaker A: It's a means of communication, the type of modeling that you and I have done in the past.
We're producing a diagram, and it's a picture. The old adage: a picture is worth a thousand words. We could do a lot more with a picture of the data than we could with just listing it out and looking at a spreadsheet. One of the things that I think people get hung up on is when we say data modeling and that we need these models, it doesn't have to be, and isn't exclusively, a schema diagram. It's not just a map of the tables in the database. Like you said, it's the semantics and the metadata. And, you know, what does this really mean? And I think that's where data lakes turned into data swamps: the poor data scientists were out there having to try to figure out what all that data was. It's like, how are you going to build a machine learning model if you don't have an understanding of the data? How do you know you're getting the right data? If there are no definitions, you don't know, you know, this file, how is this file related to that file? We have no idea. We have to figure it out through data profiling. And that was an awful lot of, I'll say, wasted effort on the part of the data scientists who, you know, were there to do a different job really, to build these models and enable AI. But they had to spend all that time, you know, basically doing data wrangling to figure out what was there. Because there weren't any data models. [00:25:34] Speaker B: That's right. And I think, when we look at data models, there are so many different kinds that the word gets a little bit overused. But you've got warehouse models, you have application models, you have semantic models, we've got AI models, we've got transformation models, things like dbt, where we're transforming pipelines, that sort of thing. So definitely having an intent. And I think that's the value that models give us: this is an intent. We're putting a domain around what we are thinking about, and here's our intent, and this is what we're driving toward.
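One lightweight way to make a model's intent executable, in the spirit of the contracts and specifications discussed earlier, is to declare the expected shape of a data product and validate records against it. A minimal sketch — the field names and types are purely illustrative, not from any real system:

```python
# A declared "model" for one hypothetical data product.
ORDER_MODEL = {"order_id": int, "order_total": float, "region": str}

def violations(record, model=ORDER_MODEL):
    """Return human-readable contract violations for a single record."""
    problems = []
    for field, expected in model.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems
```

The point is less the code than the practice: once the model is written down as data, the same declaration can drive documentation, catalog entries, and automated tests.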
[00:26:11] Speaker A: Yeah, exactly. [00:26:12] Speaker B: So, and just like anything else, here's the deal. A lot of folks say, oh, I'm going to spend a lot of time on this. What you really need to think about is, everything that we do as software engineers, as data engineers and architects, we really should be thinking of how we can do it so we have continual improvement, so that it's adaptable, so I can make that change. And so the idea that I make a software change or a model change, it doesn't have to be a gigantic enterprise or endeavor to do that. [00:26:42] Speaker A: Yep. Well, since this is our first show for 2024, Sonny, in the last minute or so here, I want to give you an opportunity to tell us at least one of your big predictions for 2024. [00:26:56] Speaker B: Yeah, you know, we've got a predictions ebook out at ThoughtSpot. And yeah, there you go. You guys can check that out. I would encourage you to do that. There are about 10 predictions in here, but the one I think is probably very appropriate to what we're talking about today is data quality and data mesh. So data contracts and data mesh will continue to face off, with a little bit of controversy about what is data mesh, what are data contracts, how valuable are they? But that's still going to be pushed back against with, hey, I have a need for data quality. So check out our predictions there and let me know what you think. [00:27:39] Speaker A: Yeah, yeah, no, the data quality is pretty key because it's still the rule. Garbage in, garbage out is still the rule. There isn't an AI that can take bad data and do the right thing with it. Really. Right. It's not that smart. It's augmented intelligence. It's not automated intelligence, really. [00:28:01] Speaker B: That's right. And I'll leave you with one thing from that: they estimate companies lose between 15 to 25% of their revenue because of bad data. That is eye-opening. It's shocking to me.
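Part of catching bad or stale data before it costs anything is the kind of freshness check Kent described earlier — did we get yesterday's data, or did the overnight load fail and leave us with 48-hour-old data? A hedged sketch, with the daily-load cadence assumed purely for illustration:

```python
from datetime import timedelta

def freshness_status(last_load, now, expected_every=timedelta(hours=24)):
    """Classify data age against an assumed daily-load SLA."""
    age = now - last_load
    if age <= expected_every:
        return "fresh"
    if age <= 2 * expected_every:
        return "stale: overnight load likely failed"
    return "stale: multiple loads missed"
```

In practice this status would feed the ETL dashboard or catalog entry so business users can see at a glance whether they are looking at 24-hour-old or 48-hour-old data.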
So I think about that: if we start feeding that at scale into AI, they can lose that much money faster. [00:28:24] Speaker B: You're absolutely right. [00:28:25] Speaker A: You really need to get that data quality. That's key. It really, I mean, seriously, it is. I think it's key to success. You know, having the model, having the semantics, but you've got to have the data quality. Because even if you have all of that information, if the quality of the data is bad, you're not going to end up with a good result. There's just no way around that. [00:28:45] Speaker B: Yeah, that's exactly right. And on the consumption side where I am, we're kind of the top of the stack. We're integrating with business users. Data quality and trust are the number one issues that we see. [00:28:57] Speaker A: Absolutely. You can't build a good data culture if there's no trust in the data. [00:29:01] Speaker B: That's exactly right. [00:29:04] Speaker A: So what's next for you, Sonny? What have you got on your agenda here for the first part of '24? [00:29:11] Speaker B: The first part of '24. Well, Enterprise Data World is coming up, so I'm excited about that. And I know, unfortunately, I'm going to miss out on Data Day Texas. I really wanted to be there, but I know you'll be there, and a lot of my cohorts, friends and folks will be there as well. But I'll be geeking out online. [00:29:32] Speaker A: That's just a couple days from now. [00:29:33] Speaker B: Yeah, exactly right. So I am excited about TDWI and Enterprise Data World. It's going to be exciting. [00:29:41] Speaker A: Good. Yeah. And Enterprise Data World is in March, is that right? [00:29:45] Speaker B: Yeah, I think it's March 9th through the... let me look at the exact dates here. Oh, March 24th through the 29th. [00:29:54] Speaker A: Okay. And that's in Orlando this year? [00:29:57] Speaker B: Yes.
[00:29:57] Speaker A: It used to be in San Diego all the time when I was going, but yeah, this year they're going over to Orlando. [00:30:02] Speaker B: All right. Yeah. Well, that's not awful. It's nice and warm. [00:30:05] Speaker A: Yeah, hopefully. So what's the best way for folks to connect with you? [00:30:12] Speaker B: You know, the best way to reach out to me and connect with me is you can find me on LinkedIn, or, you know, search for "Sonny Rivera barbecue" on YouTube and you'll find my barbecue YouTube channel. But no, really, seriously, check me out on LinkedIn and just search for Sonny Rivera. You'll find me easily. [00:30:30] Speaker A: Awesome. Awesome. So, well, thanks. Thanks for being my guest today, Sonny. You know, we've got to close out here. Thanks, everyone, for joining us for the first podcast of 2024. Be sure to join me again in two weeks. My guest is going to be a hands-on DataOps implementer and architect, Ronnie Steelman, who's the CEO of the consultancy Quadrubyte. And as always, be sure to like the replays from today's show and tell your friends about the True DataOps podcast. Don't forget to go to truedataops.org and subscribe to the podcast so you don't miss any of our next episodes. So until next time, this is Kent Graziano, the Data Warrior, saying goodbye. [00:31:15] Speaker B: Thanks, Kent.
