November 25, 2024

00:34:15

Inna Tokarev Sela - #TrueDataOps Podcast Ep. 41

Hosted by

Kent Graziano

Show Notes

In this episode, Kent interviews Inna Tokarev Sela, founder and CEO of illumex.ai, a company unlocking the potential of enterprise data with its Generative Semantic Fabric. Recognizing the complexity of unifying business data semantics—essential for GenAI readiness—illumex created a platform that simplifies semantic mapping and alignment. illumex is widely used by data-intensive enterprises for GenAI, data governance, and multi-cloud initiatives, enabling swift and error-free data-driven decisions.

Inna's career reveals a consistent theme: bridging the gap between data investments and decision-making. She previously held roles as VP of AI at Sisense and Senior Director of Machine Learning at SAP. An inventor with multiple patents, she speaks frequently at top data and AI conferences. Inna holds an MSc in Information Systems focused on neural networks and completed the Stanford MBA executive program. She also leads the Women in Data Israel chapter.


Episode Transcript

[00:00:03] Speaker A: Welcome to the True DataOps podcast. We'll get started in a few seconds to allow folks to get logged onto the live stream. Be back in a few. Okay. Welcome to this episode and season three of our show, True DataOps. I'm your host, Kent Graziano, the Data Warrior. In each episode, we want to bring you a podcast discussing the world of DataOps and the people that are making DataOps what it is today. Be sure to look up and subscribe to the DataOps Live YouTube channel. That's where you're going to find all the recordings from our past episodes. If you missed any of the prior episodes, now's a good chance to catch up. Better yet, you can go to truedataops.org and subscribe to the podcast. Then you'll get proactive notifications when we have new episodes. Now, my guest today is the CEO and founder of illumex.ai, Inna Tokarev Sela. And I hope I pronounced that as close to correct as possible. [00:01:11] Speaker B: It's perfect. And thank you so much. [00:01:14] Speaker A: Yeah, she's a multi-patent holder and an expert in knowledge graphs, natural language, deep learning, data product monetization, gen AI — a very long list here. And then something that she calls generative semantic fabric, which I'm really fascinated to learn more about. So welcome to the show. [00:01:35] Speaker B: Happy to be here. Now the expectations are high. [00:01:39] Speaker A: Yeah, well, you know, anytime I get somebody on here who's got patents, I have to mention that, because I think it's an amazing accomplishment to have invented something so new and different that you can get a patent on it. That's what we need. That's what innovation is all about and what we try to do in our industry. [00:02:02] Speaker B: Yeah, well, cheers to that. [00:02:05] Speaker A: So for folks who don't know about you, could you give us a little bit of your background in data architecture and AI and your career, and a little bit about your founding of illumex?
[00:02:17] Speaker B: Yes, of course. I'm Inna Tokarev Sela, founder of illumex. Since we are here in the data architecture context, let's review my career through the lens of data architecture. So, you might remember this program called MATLAB. My data architecture journey started in programming graphs in MATLAB for something which we call multidimensional geometry. It was during my academic studies, and we were working with Nestle, and we convinced Nestle that graphs were the best thing they could do for their analytics. And it worked. It was magical. I fell in love with graphs forever. And this is why I actually continue to be fascinated with graphs as a technology, and this experience actually brought me to start my career at SAP. At SAP I was lucky to be part of the journey of the SAP HANA Cloud platform — the in-memory architecture, the in-memory database — and Hadoop was already, as you know, there in the background. Our customers had both architectures in parallel, so it was a very interesting journey to see how the industry evolves. And here I would say the main focus was to really navigate enterprise customers between all of those architectures and make decisions, sometimes for them, to know how to operate quicker, better, with higher ROI. And then I switched to Sisense to lead artificial intelligence, and eventually we started to do new-gen knowledge graphs. So from Neo4j to Neptune, GraphDB, ArangoDB — all those graphs. We tested every one, and we implemented an architecture for gen BI on graphs at Sisense. It was pretty cool. Then I started illumex. illumex, as you rightfully mentioned, is based on a data architecture that we patented: generative semantic fabric. It's a mouthful of a definition — a knowledge graph of semantic embeddings. Think about a vector database combined with knowledge graphs. This basically encapsulates the context and reasoning behind your data. [00:04:30] Speaker A: Wow.
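[Editor's note] The "knowledge graph of semantic embeddings" idea Inna describes — a vector search combined with graph relationships — can be sketched in a few lines. This is a toy illustration under my own assumptions, not illumex's implementation: real systems would use learned embeddings and a vector database, while here the vectors and graph are hand-made.

```python
import math

# A tiny "semantic fabric": each business concept is a graph node carrying a
# vector embedding (hand-made here) plus its relationships to other concepts.
graph = {
    "customer": {"embedding": [0.9, 0.1, 0.0], "related": ["order"]},
    "order":    {"embedding": [0.7, 0.6, 0.1], "related": ["customer", "invoice"]},
    "invoice":  {"embedding": [0.2, 0.9, 0.3], "related": ["order"]},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_concept(query_embedding):
    # Proximity search over the node embeddings (the vector-database half),
    # then return the node's neighborhood (the knowledge-graph half) as
    # the "context and reasoning" around the match.
    best = max(graph, key=lambda n: cosine(graph[n]["embedding"], query_embedding))
    return best, graph[best]["related"]

concept, context = nearest_concept([0.85, 0.2, 0.0])
print(concept, context)
```

The point of the combination is that embedding similarity alone finds the closest term, but only the graph edges supply the surrounding business context a reasoning step can traverse.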
We'll talk a little bit more about that in a bit, 'cause I do want to understand a little more about how that fits into the architecture. Now, this season on the show we're trying to take a little step back. This is the third season, and so we want to look at our world of True DataOps and what's evolved, and some of the things that we've all learned in the last couple of years. Now, you've obviously been a leader in the field of machine learning and AI and in helping customers take advantage of their data to make real business decisions. So could you tell us a little bit about how you've seen the space evolve in the last couple of years, and some of your focus now at illumex? [00:05:14] Speaker B: Yeah, DataOps is a discipline that has been fascinating for a long, long time, and I must say, especially in the enterprise, you do not really touch it the way that you touch it in modern data stack companies. Right. So in the enterprise it's about monoliths and how you're basically building CI/CD workflows around monoliths, and I'm fascinated by how DataOps has evolved, especially over the last five years. Starting from, you know, this orchestration of modern data stack data flows, to basically this generative AI focus in 2022, 2023, where every DataOps operation was challenged with generative AI use cases and implementations — and of course lots of frustration, because things did not scale as expected, or as frictionlessly as expected. And this year, I think cost has been the major theme. So I would stress that the cost of running DataOps is actually lower than the cost of not running DataOps. But we can discuss it later in the context of the pillars. As for what I see for 2025: Gartner has already coined 2025 as the AI-ready data year. Right. And it stresses again that DataOps is instrumental to basically creating those data environments, those microservices for data. Right.
And having testing and monitoring around that, and having a collaboration environment for multidisciplinary teams, and so on, so forth. So I think the importance of DataOps in the year of AI-ready data is actually emphasized. [00:07:08] Speaker A: Yeah, yeah. Like you said — I mean, I think we've got a combination of things. From what I've seen, it's like we've got the scale, right? Things have just continued to grow and grow and grow, and now we've thrown in really trying to use AI and gen AI, and pretty much everybody agrees that if you don't have the data right, then the outcome from AI and gen AI is not going to be very great. It's like trying to manage all that. I think you're the first person that has said the cost of doing DataOps is less than the cost of not doing DataOps. And that's an awesome perspective. I hadn't heard anybody actually put it that way before, but I definitely agree. [00:07:51] Speaker B: Yeah. And what we also say: it takes generative AI to get generative AI ready. And this is why, from day one at illumex — so we started the company in 2021 — we incorporated both graphs and semantic models as part of our architecture, and we combined our expertise in building ontologies for basically data reconciliation and data management, and then self-service access to data analytics. But it's also very much dependent on the metadata customers have. Our ability to bring a customized experience to every landscape — whether it's on premise or in the cloud, more modern platforms like Snowflake, for example, or more traditional ones like on-premise Oracle installations, and so on, so forth — is basically through metadata. We are able to capture the nitty gritty of activities in every enterprise by basically mimicking the metadata flows. [00:08:54] Speaker A: Yeah, and you mentioned a couple of things there. You mentioned ontologies, which is something that I've been involved in for years, because from a business perspective — and I think this is where we get to it.
We had conversations about the Semantic Web years ago, and just semantic layers in general — they have to be in business terms. If the business people were really going to take advantage of all this data that's all over the place, they had to have some way to understand it, in terms that they understand, that aren't technical like some of us tech geeks use. And at the same time, though, the metadata necessary to put all that together has become even more important, like you said, because of gen AI. Right? How is a machine going to understand it in terms that are useful for solving business problems? Right? [00:09:45] Speaker B: Yeah, yeah. So that brings me to — this might explain why we call this technology generative semantic fabric. Generative, because we have used generative AI from day one. Semantic, because we're empathetic toward business users, who are actually the humans in the loop. Right. So you can build whatever architecture you would like. To me, the current perspective of any new data project should encapsulate the potential use of generative AI agents or analytics on top of it. So naturally, everything has to be governed and certified, and I do not believe in governance which does not include domain experts and more business-oriented users. So this is semantics. It's super important. We have to wrap up whatever functionality and context and reasoning we have in those systems in application workflows which have semantic meaning and business descriptions and all of that. And fabric goes to — so, I mentioned semantic layers. Semantic layers are usually encapsulated in one tool: either a warehouse, maybe some BI tool, and so on, so forth. To me, business logic might be spread around the whole organization. It might also be different in different units, maybe by design — different geographies might have different definitions.
But anyway, if you only encapsulate your semantics in one tool, especially for big organizations, it's not viable, just because data spreads throughout so many systems. And this is why I do believe that we need to have this repository, this glue, which can connect to any interface in the organization and keep the semantics aligned. [00:11:28] Speaker A: Yeah. And I'm with you 100% on that. I told you before the show, this is a conversation that I had even when I worked at Snowflake, and I've had it continuing after Snowflake: that's one of the big challenges. I started off in data warehousing and BI, where it's like, okay, we used BusinessObjects, so the semantic layer was inside of BusinessObjects. But if you wanted to then use Tableau, you had to build it all over again. And the odds are that the business logic's not necessarily going to come out the same between the two. A lot of customers I worked with over the years saw that, definitely. Back 30 years ago, the dream was to have the single source of truth, the data warehouse. Now it's like: can we have one place where we can access the data in terms we understand, regardless of what tool we want to use to do our analytics or reporting? And then again, like I said, throw the gen AI stuff on top of it. We want to run AI off of the same thing that we're running our BI off of. Right? And having that one place makes a lot of sense to me, certainly from that perspective of trying to make the data useful to as many domains and audiences as possible. [00:12:49] Speaker B: Yeah, I think the biggest promise of generative AI in the enterprise setting goes to intelligent decision making at scale. And of course, right now we have these closed gardens of application environments where you have to pre-model, basically, the data model. And the new generation of data modeling is context building.
So more expensive people of a new generation — semantic data scientists — are building context and reasoning for organizations, with the assumption that they actually capture all the business logic in the correct way and match it to the right data. So we have the same assumptions that we had in data modeling for BI, only now with a more black-box technology. So I think there's a lot of risk management that we should do on this journey, because to me, at least, it's taking an even bigger leap of faith than we did for BI. [00:13:48] Speaker A: Yeah, no, I agree with you, especially with — you know, of course, everybody knows about hallucinations and all that. With BI we never talked about that; we talked about bad data. Right? We had data quality issues, but we didn't have to worry about hallucinations, and things that were coming from nowhere, where we didn't understand where that even came from. And that is now probably one of the primary worries that people have: we want to make sure that whatever's coming out of the AI, we can trust it. Right. That's one of the pillars — you mentioned governance, and that's definitely one of our seven pillars of True DataOps: governance and security and managing all of that. How important, or maybe even more important, do you think that sort of thing is today because of things like AI and gen AI? [00:14:44] Speaker B: Totally. The importance is still there. I will stress that three major pillars out of the DataOps pillars that you mentioned are most important. The first one, of course, is governance. You don't want people to access what they're not supposed to access, and it goes to derivative insights and so on, so forth. So governance should be not only part of the data analytics behind generative AI, but also part of every generative AI interaction. Second, I would say this collaboration environment, right? Single source of truth.
Single source of truth definitions — or at least the ability to document at scale, with as much automation and augmentation as possible. In our case, we actually believe in augmentation: bring people in to certify definitions which are automatically generated. So have a combination of both worlds. And the third one, which we didn't discuss yet, is basically user experience. So the user experience right now with generative AI, and especially with semantic models, is this expectation that business users use the same terminology with which the data is stored and labeled in the data sources. Because semantic models work by proximity search. If you use different terminology, or articulate the question in a different manner than your database is labeled, here you get your hallucination. And this is where business users have a hard time actually trusting the technology. Especially for structured data, it's a black box. You usually do not know what the sources are, and how the question and answer were articulated, and all of that. So basically, trusting and using the answers for decision making is a big gap. So I would add user experience as a big issue, specifically in generative AI contexts. [00:16:33] Speaker A: Yeah. And when you talk about labeling the data, what do you mean when you say that? [00:16:40] Speaker B: For structured data, think about databases, data lakes, warehouses. My favorite SAP example — SAP likes to call its data models by absolutely non-semantic naming conventions. You simply cannot put a semantic model on that. What I mean by that: for computer vision, you had to label pictures — it's a dog, it's an elephant, and so on, so forth. So now we need to give our tables, our columns, our transformations meaningful names. So when someone asks about a metric which is defined as an alias in your data pipeline, the semantic model can refer to it in the correct way. So this is what I mean by labeling.
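[Editor's note] The "labeling" Inna describes — attaching business names to cryptic physical columns so a semantic model can resolve a user's wording — might look like the following sketch. The mapping table and synonyms are hypothetical, chosen only to echo the SAP example of non-semantic naming.

```python
# Hypothetical label store: physical column -> business name and synonyms.
# "KUNNR"/"NETWR" are SAP-style cryptic names used purely for illustration.
labels = {
    "KUNNR": {"business_name": "customer number", "synonyms": ["client id", "account number"]},
    "NETWR": {"business_name": "net order value", "synonyms": ["order amount", "net revenue"]},
}

def resolve(term):
    """Return the physical column for a business term, or None if unmapped."""
    t = term.lower()
    for column, meta in labels.items():
        if t == meta["business_name"] or t in meta["synonyms"]:
            return column
    return None  # unmapped wording: flag for a human instead of guessing

print(resolve("client id"))   # maps to the physical column
print(resolve("net margin"))  # unlabeled term, so no silent guess
```

The design point matches the interview: when the user's wording has no certified label, the right behavior is to refuse and route to a person, not to let proximity search pick the nearest-sounding column.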
[00:17:25] Speaker A: So it's applying the business terminology and the ontologies to the structures underneath, right? [00:17:36] Speaker B: Yeah. So we actually took an approach which is a little bit different from the derivative data approach. I think the typical approach with RAG and other techniques of encapsulating data context is basically: okay, we handle data labeling, we name tables with meaningful names, and then we have those example chunks, right? And we feed it into the beast and we expect it to work, without really any guarantee. So we said: especially for enterprises, especially in domains which are highly regulated, it's kind of, you know, too much of a leap of faith, too much risk to take. So let's make it transparent, right? Let's build an ontology, or knowledge graph, out of business terminology. Think about this ontology of definitions: what is a customer? What's the policy? And so on, so forth. What are the relationships? And let's incorporate it as vector embeddings, so it's automated context and reasoning underneath. On top, what you see is actually your business glossary, right? So a metric store, a business glossary, auto-generated, and then domain experts can actually certify all the definitions. And the best part is that when you implement generative AI agentic analytics, the chatbots are going to go through this business glossary, which is certified. So everything is deterministic. Why is it hallucination-free? Because actually every prompt is already matched to a permutation of the definitions that you have. So it's kind of this managed environment that you cannot program with your prompt. And we have this full explanation about how the logic has been calculated. I think it's a healthier approach, because I do not believe — I think it was a VC or someone else who just published this marketing material which says, let's take the human out of a generative implementation. I say, let's keep the human in the generative implementation.
But, you know, in a way that's scalable, that makes sense. [00:19:36] Speaker A: Right. And so I think the certification process is probably one of the more critical pieces, and that really falls under — you know, it's a form of governance, right? It's to make sure that what we're calling an apple is an apple, to, you know, trivialize it. But, you know, so that the chatbots, when I ask a question, don't misinterpret what I asked, and are able to match it to, you know, the semantics of the business properly. [00:20:06] Speaker B: Yeah, yeah. So I think a sympathetic thing to do is actually explain to the user how you understood the question and which data is mapped. This is, you know, basics, and the majority of tools do not provide that, which is a real pity. And I actually think governance is cool. Governance is on the rise. Governance is cool, and governance should not be this, you know, tedious job of going and documenting stuff. Governance should be about conflict resolution, about collaboration with business owners and business members, and maybe, you know, also collaborating with strategic planning departments to understand where business logic is going, what should be supported in the future, and so on, so forth. So actually, we're coming from these proverbial dashboards, which are static, which will still be kept for KPIs. Right? You will always have those KPIs. You want to open your dashboard and see — as I do every morning, I look into the KPIs and all the pipeline and everything, and it's fine. But to actually make your decisions based on data — I expect every employee who comes with an initiative or budget request or anything else to have numbers. So now everyone should have numbers. There are no excuses anymore. [00:21:20] Speaker A: And like you said, it's transparent as to where those numbers came from, how they were calculated, what the logic was behind it — it's not a black box, it's not just an Excel chart. [00:21:31] Speaker B: Yeah, exactly. Exactly.
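[Editor's note] The certified-glossary flow described above — a chatbot may only answer through definitions a domain expert has signed off on, and refuses otherwise — can be sketched as follows. This is an illustrative toy, not illumex's code; the glossary entries and SQL are invented for the example.

```python
# Hypothetical certified business glossary: each term carries its logic and
# a certification flag set by a domain expert.
glossary = {
    "active customer": {
        "sql": "SELECT COUNT(*) FROM customers WHERE last_order > CURRENT_DATE - 90",
        "certified": True,
    },
    "churn rate": {"sql": None, "certified": False},  # awaiting expert sign-off
}

def answer(question):
    """Match a question only against glossary terms; never free-generate."""
    q = question.lower()
    for term, entry in glossary.items():
        if term in q:
            if entry["certified"]:
                # Deterministic: the certified definition, with its logic
                # visible, is the only path to an answer.
                return ("certified", term, entry["sql"])
            return ("uncertified", term, None)  # exists, but not signed off
    return ("no_match", None, None)             # refuse instead of hallucinating

print(answer("How many active customers do we have?"))
print(answer("What is our churn rate?"))
```

The hallucination-free claim in the interview rests on exactly this restriction: every prompt resolves to a human-certified definition or to an explicit refusal, and the logic behind any answer can be shown back to the user.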
[00:21:33] Speaker A: Yeah. No, no, I think that's great. One of the other things that occurred to me, too, is with all this — you know, the labeling, and keeping the human in it to say, yes, this is correct — then, on another part of the governance aspect, there's the classification of the data, such as, you know, PII, PHI data, any kind of sensitive data. We should then be able to automate tagging the data that way as well, in order to avoid any controversies or crossing the lines of things like the various regulatory agencies. [00:22:16] Speaker B: Right, absolutely. Back to DataOps: lineage has profound importance in this flow, because you do need to understand the sources. For PII, for example — PII might be seven transformations down the road, but in pharma you're interested to know that. Right. And the second is monitoring. So being able, through metadata, to actually monitor what has changed, and what's the impact, and what is not reliable anymore. You know, this goes for both BI dashboards and generative AI. It's valid for both. So basically, having these guardrails is super important. [00:22:57] Speaker A: Yeah, and that's one of the other pillars: automated regression testing and monitoring. We kind of combine those two together. I think that's one of the benefits I see, certainly, with AI: being able to actually automate some of that more so than we did. Because in the data world, we were horrible, horrible at testing, at regression testing, especially our algorithms. And back when we were just doing ETL, there wasn't a lot of validation going on. The assumption was that the ETL was correct, and you never knew it wasn't until an end user ran a report and said, those numbers don't look right, and then you'd start back-tracing it. So I think having that in there, and being able to automate that even more. But then, like you said, the monitoring.
Because even if your algorithms are correct at the time that you build them — and this could be your machine learning algorithm as well, right? — if the nature of the data somehow changes in a way that was never anticipated, you might get some weird results, even though everything was tested along the way. And I think that's where the monitoring comes in, right? It's making sure that things are within some boundaries, some sort of parameters of reasonableness — that, yeah, this is still right. [00:24:20] Speaker B: Yes. It works as expected, or as designed, for sure. [00:24:24] Speaker A: Yeah. Because otherwise, again, we're back into: where did that answer come from? How did we get there? What you said, though, about transparency — I think that's even more important, along with lineage and all of these things: to be able to very easily see where that result came from, and to be able to trace it back and say, oh, you know, we were expecting that field to be between 1 and 100, and now we're getting values in the thousands, and that wasn't something we expected. And that changes the way we need to think about doing this. [00:25:00] Speaker B: Yeah. Or even: this query should have taken two minutes, but it actually ran for hours. So I would say that cost is another parameter of governance, and this is something which we should be monitoring closely going forward. [00:25:16] Speaker A: Yeah, yeah. I think in the last couple of years there's been a movement towards FinOps. [00:25:21] Speaker B: Right. [00:25:21] Speaker A: Making that more transparent, especially with so many people having moved to the clouds. And then, if it's a subscription service or a pay-as-you-go service, at the end of the month they get this massive bill and go: what happened? How did we manage to spend that much money? What did we actually do?
But to, again, back it up farther, so that it doesn't get to the end of the month before you see that something went crazy. Like you said, this should have run in two minutes, and it's now been 60 minutes. We need to take a look, and have an alert, right? A proactive alert that says, hey, somebody needs to go take a look at this. [00:25:55] Speaker B: You might even want to limit some of the users asking those questions in the first place. Right. So if you have this estimation that this query is a very heavy query, you might want to just allow it to specific users, overall, to manage that. Yeah. So agent ops — it was called agent cost. [00:26:14] Speaker A: Agent ops. [00:26:15] Speaker B: Yeah, agent ops. [00:26:17] Speaker A: Yeah, that's good too. So do you have any advice for companies? I mean, you obviously run a software company, so I usually ask the question about buy versus build. I'm sure you get into these conversations all the time about, you know, should we buy a tool that does what we're looking for, versus — you know, all the smart people in our company, we can just build it ourselves. What do you end up telling people about that? [00:26:45] Speaker B: You know what, it's a very valid point. At the end of the month I look at the burn rates, and you know what the highest cost is? People. It was the same in corporate; it was the same in the growth startup environment. So our high spend is people. So to me it's always, always buy if you can. Right. Of course, you need to have the TCO. I would also stress: always start with the business case. What's your business case? If the business case right now is self-service access, or customer segmentation, or call center automation — start with the business case, and then go back to what you need to build for that business case. And it goes back to: you might need the AI data solution, or a governance solution, or you can start with chatbots right away because everything is rosy.
It really depends on the situation you're in. But I believe our people are going to be even more stressed in the future, just not with the tasks they perform today. There will be a new generation of tasks for them. Never exchange labor for, you know, for dollars, for capex dollars. [00:27:53] Speaker A: Yeah, yeah. And, you know, the fear so many people have of, you know, "AI is going to take my job" — the way I say it, it's like: it's going to change your job. Right? [00:28:03] Speaker B: Yeah. [00:28:04] Speaker A: Even my son, who's in university now — one of his professors, for a paper, said, I want you to actually use AI. So, use ChatGPT to produce the paper — where a lot of people are telling them, no, no, no, that's cheating, you can't do that. And so he's now already learned how to, and he's a really good writer, and he writes very quickly as it is. But we talked about this yesterday — the productivity gain: instead of him spending a couple of hours writing this paper, it was like 15 minutes. And he got a perfect grade on it. [00:28:43] Speaker B: Yeah. But then you need to have this new skill of criticism, right? Checking the sources, maybe cross-validating, orchestrating — the new types of skills we should acquire to actually be able to be augmented by all those tools. And of course, sensitivity to privacy and all of that is required right now in the workspace. [00:29:06] Speaker A: Yeah. And like any new skill, there's a learning curve, and we've got to pay that upfront cost. He talked about that — about how long it took him to learn how to do it over the last couple of weeks, because they gave him the assignment, you know, about a month ago. I said, yeah, but now you've learned how to do it; now you have the productivity going forward. Right.
And so doing these sorts of things — working with these types of tools, with gen AI, to make business decisions — we can spend more time, I guess, even, you know, discussing: is this the right thing to do? But we can come up with a much — probably, hopefully — better proposed solution faster and move on. And then there will still be a lot of jobs. I think that, you know, it's just a productivity enhancer for people, and that means they're going to be able to do new and different things, but they've got to pick up a couple of new skills along the way. [00:30:00] Speaker B: So here at illumex we believe in an application-free future, and we believe that solutions like ours have, of course, a major impact on that. And this application-free future makes you, of course, more efficient, because you don't need to learn systems — how to operate a CRM system, how to operate your system. You just have your task, make the decision, and maybe your editing workflow, what have you. But for that, you have to de-monolithize. So, turning monoliths into more microservices on the data layer, on the software layer, on the modeling layer and semantics layer, and the agents. So we need to orchestrate all the parts, and this is where DataOps — you know, in the wider context of DataOps — becomes an even more profound underlying principle. [00:30:52] Speaker A: Right? Yeah, no, completely agree with you. That's it. It's a new world, and people are going to have to think a little differently. But yeah, the application of DataOps, like you said, has sort of expanded now, and it's even more important, right, if we're going to do these sorts of things. So we're pretty much out of time. I wanted to give you an opportunity to talk about, you know, what's coming up next for your company — any events that you're going to be at, if people want to come out and meet you and hear you speak for longer than what we were able to do here today.
[00:31:28] Speaker B: It's going to be an exciting 2025, and I see that illumex is going to be focusing more and more, of course, on the use cases which we already cover — AI-ready data, and governance, and agentic analytics deployment, from the prism of self-service access. So please connect with me if you're interested in those. November is also a super busy month, so next week I'm in New York, where I'm going to attend quite a few events. One of them is ScaleUp:AI, with Andrew Ng. I mean, I'm humbled to hear what he has to say about the future of AI. And there is also a very good conference for women in data. So, basically, how to sell AI solutions — and we are going to learn from OpenAI and other companies which are actually, you know, selling us all those credits. So I want to understand what the strategy is, and what the future of, you know, all those TCO topics is. And this brings us to, you know, my personal area of interest. I think exploring TCO — predictable total cost of ownership of managing LLM pipelines and deployment — is going to be a big issue and a big topic for discussion, at least in the next few months. So we came up with a blog on this topic, and you can see this link featured for you on that as well. But please do connect with me on LinkedIn, and let's have discussions offline. [00:32:59] Speaker A: All right, thank you. And for our viewers, we just put up a whole bunch of QR codes for these events and her LinkedIn profile. You can always replay this and take a picture of the QR code if you weren't able to catch it when it first went by. Well, thank you so much for being my guest today, and for your insights in this area. I think it's fascinating. Things are changing faster than probably any of us could ever imagine. We've seen things change fast over the last couple of decades, and it's accelerating. Surprisingly to many of us, it's going even faster now. So, great talking to you today.
[00:33:43] Speaker B: Likewise. [00:33:44] Speaker A: Thanks to everyone else online for joining. Be sure to join me again in two weeks, when my guest is going to be Snowflake Data Superhero and data practice lead at InterWorks, Chris Hastie. And as always, be sure to like the replays from today's show and tell your friends about the True DataOps podcast. Don't forget to go to truedataops.org and subscribe to the podcast, and then you won't miss any of our future episodes. So until next time, this is Kent Graziano, the Data Warrior, signing off. For now.
