In this episode of the Utilizing Tech podcast, Stephen Foskett, Allyson Klein, and Gina Rosenthal discuss dark data in edge computing. Dark data is unutilized or unknown data collected by organizations. The distributed nature and use of third-party apps can make it challenging to handle dark data, limiting insights and posing security risks. Establishing a stronger IT-business connection is crucial. Observability solutions and data analytics can aid in discovering and centralizing dark data. AI has potential for data hygiene improvement, but human-driven cleaning is still necessary. Despite challenges, edge computing offers better data management due to controlled deployments.
Stephen Foskett, Organizer of the Tech Field Day Event Series, part of The Futurum Group. Find Stephen’s writing at GestaltIT.com, on Twitter at @SFoskett, or on Mastodon at @[email protected].
Allyson Klein, Global Marketing and Communications Leader and Founder of The Tech Arena. You can connect with @TechAllyson on Twitter or LinkedIn. Find out more information on The Tech Arena website.
Gina Rosenthal, Founder and CEO of Digital Sunshine Solutions. You can connect with Gina on LinkedIn and listen to her podcast, The Tech Aunties Podcast. Learn more on her website.
Follow the podcast on Twitter at @UtilizingTech, on Mastodon at @[email protected], or watch the video version on the Gestalt IT YouTube channel.
Transcript:
Stephen Foskett: Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more. I’m your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT, and joining me today as my co-host is Allyson Klein. Welcome.
Allyson Klein: Hey Stephen, it’s great to be back.
Stephen: It’s great to have you back. We are doing a special episode here, kind of mid-season, talking to some of our Field Day delegate friends about edge, which is one reason we decided to invite Gina to join us today. Gina, welcome to the show.
Gina Rosenthal: Hey y’all thanks for having me.
Stephen: So today on the episode we are going to talk about something that is, I think, sort of lurking in the background of a lot of edge conversations, and that is this whole idea of, as Allyson coined it, dark data. Allyson, tell us a little bit about your thoughts on dark data.
Allyson: Well, dark data in my mind really refers to data that an organization may have created or collected but is unaware is within its data stores, and is not acting on in terms of monetization or service delivery associated with that data. I think that’s what they have when dark data is present, and it’s that opportunity to explore expanding what they’re doing with their data, if they can actually find out where their dark data is and put a light on it.
Stephen: And I think one of the things that’s challenging here is that there certainly is dark data in the data center and in the cloud but in the edge, well you really don’t know what you’ve got right Gina?
Gina: Well, you should kind of know what you’ve got if you’re doing it right. I think that’s kind of why you have the dark data. But I think if you’re intentional with what you’re doing at the edge, and you know what data is being created and what data is available, you’ll know whether it’s something you want to pull into your data lake or something you don’t want to keep. You’ll be able to make those decisions, but I think you need to know what kind of data you’ve got everywhere.
Stephen: Well, is that realistic? I mean, do people know what they’ve got going on out there?
Allyson: I’ve spent a lot of time in the industry, and I think we assume that all of this is a lot easier than it actually is in practice. The reality is that IT organizations have tremendous priorities associated with security and keeping their workloads up, so maybe some things fall through the cracks in terms of data that’s being created by different business groups at the edge. Maybe they don’t have the right tools in place in terms of observability and discoverability, so that when this data is created it’s obviously flagged for their IT organization to explore. And maybe the IT organization knows about the data, but it’s dark to the line of business that would actually see value from it. So I think there are a lot of different angles you can use in thinking about what the challenge or opportunity statement of dark data is.
Gina: Right I do agree with you. I think IT probably knows about the existence of dark data or the ability to create more dark data and makes decisions based on costs and not necessarily cost, you know, opportunities for the business because I know we used to do that back in the day and I don’t think that’s something that’s really changed you know. I think another opportunity to create dark data is probably the use of SaaS applications which are used everywhere to try just to get business done as fast as possible.
Allyson: That’s really interesting, Gina. How do you see the tie between SaaS and dark data? Can you say more?
Gina: Well, sure. I help a lot of people with marketing because that’s what I do, and I think a lot of times people get frustrated with the overall big-company rules about how you can do marketing, especially collecting customer data: where are they, where are all your customers at? They just want to throw something together, collect some leads, and go on with it. So you’ll have a sales team, wherever they may be, and a sales leader saying, just go get this done for me, go use XYZ application, just use your credit card, run it, and get it. Sometimes they’re really not thinking about the fact that there’s an entire CDP, which is supposed to be gathering all the data in one place from everywhere it might be. The idea is to have this single pane of glass, as they like to say, and also security and everything, especially with personal information, which we know has lots of governance around it. So the overall line of business may not know about it. The sales leader who decided to go off and create his own workshop to get the leads he needs to drive his business may not know about the opportunities of tying all the data in the organization together to find more customers and find more leads and the rest of it. So you end up with fragmented, siloed data all over the place. For edge, I think it’s interesting what you said a few minutes ago, that it’s really hard. Are there any best practices that people are using? When it comes to building an edge application, are they just grabbing whatever looks good at the time, which will get it out the door and fits within the cost constraints, just to make it happen, and nobody ever circles back around and tries to put that into a bigger system?
Stephen: I think when you’re talking about edge, one of the things that came up when we were talking about far edge applications, for example at retail, was that you have maybe not SaaS applications specifically, but applications that are being brought in by different people and different departments. Some of them are outside applications, some of them are inside but from different departments, and all of those are running and are expected to run everywhere remotely. I think that seems like an analogous situation. So you have, for example, maybe an advertising application that’s running in your Quick Mart, and that application came from a third-party vendor and has no real ties back to the company except through its own communications links, its own service provider, its own data format, and so on. All of that is really isolated from the overall corporation. I think that’s probably a very normal use case at any edge, and that’s why, when you brought up this idea of data silos and dark data, it really resonated with me in terms of edge, because I could totally see that happening in this environment where you’ve got all these different applications running. It really is like sales enablement applications running on phones and laptops and everything. It’s just a completely different set of data, right?
Allyson: I think one of the things that comes to mind is that it’s a technology problem and it’s a human problem. The technology problem is full-stack integration and really thoughtful tie-in, so that, you know, the leads generated from this application automatically flow into my CRM. Those things can be done, but are they done all the time? Sometimes not. The human problem is really having all of these different organizations deploying applications and maybe not seeing the point. To Gina’s point earlier, the sales leader doesn’t really care if we capture those leads, that contact information, for further purposes; he just wants to meet his quarterly numbers, and so that’s a lost opportunity. I think the edge is accelerating both of these challenges, because so much of the data that is generated today is being created at the edge and needs processing at the edge. We are seeing that these two challenges, technology and cross-departmental communication, are just getting in the way of the full value proposition that’s in front of us.
Gina: Think about how much better you can do things if you have the data in one location that you can write applications against, and how much better you can derive data for the entire organization. Or think of it another way: how much better it would be if you were able to push to a sales leader what he needs to create his own sales program, but all protected and using all the data from the organization. It’s kind of short-sighted not to do that, but human nature is, like you said, just to get it out there, get it done, and meet my number.
Stephen: Another thing that occurs to me is that in many cases with edge, we’re talking about data that is intentionally darkened, let’s say. You’re collecting metrics from machinery, or you are collecting video and processing it, and so on. The idea is that you are going to collect and process that data at the edge, extract what you think is valuable today from that data, and then either chuck or just ignore that source data and send back only the results. I think in many ways this is an optimization that people are doing for bandwidth, connectivity, and cost. But at the same time, that’s dark data too, right? Take a security camera example: you’ve got a whole bunch of security cameras, they’re all processing data, and you’ve got maybe a machine learning algorithm that is processing that, extracting the interesting bits, and only sending the interesting bits back home. Who’s to say that you don’t need or want that data in the future? Even if you are pretty sure you don’t need it because it’s just an empty parking lot or a machine that’s idle, you might need some of the other data that you’ve rejected. Or maybe the algorithm rejected some of it on its own and you find that, oh man, I wish I could see what was going on at this time code. I think that’s another area where data could be dark, even in a system that is correctly configured to ship some data back to the source, right?
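As a rough illustration of the pattern Stephen describes, here is a minimal Python sketch. It assumes each camera frame has already been given an "interest" score by some model; the `EdgeFilter` class, the `threshold` value, and the idea of keeping a lightweight discard log are hypothetical and not something discussed on the show. The point is only that an edge device can forward the interesting results while still recording what it chose to drop.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeFilter:
    """Forward only 'interesting' readings; remember which ones were dropped."""

    threshold: float = 0.5  # hypothetical cutoff for an "interesting" score
    forwarded: list = field(default_factory=list)    # results sent back home
    discard_log: list = field(default_factory=list)  # timestamps of dropped data

    def handle(self, timestamp: int, score: float) -> None:
        if score >= self.threshold:
            # Interesting enough: queue the result for the central site.
            self.forwarded.append((timestamp, score))
        else:
            # Not forwarded: keep only a small record, not the raw data itself.
            self.discard_log.append(timestamp)


edge = EdgeFilter()
for ts, score in [(1, 0.9), (2, 0.1), (3, 0.5)]:
    edge.handle(ts, score)
```

The discard log does not bring the rejected data back, but it at least makes the gap visible, so someone asking about a particular time code can tell the data was dropped on purpose.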
Allyson: It’s an interesting question, Stephen, because at some point, what is dark, right? Is it stuff that you don’t know about? Is it stuff that you’re not acting on? Is it stuff that you’re throwing away? I don’t know where the right demarcation point is, and I think that’s worth exploring. If an IT organization has already looked at this particular use case and said, you know what, we’re just going to move this portion of the data because we already know that this other stuff is junk, then at that point does holding on to the junk become data hoarding? Where is the demarcation line where it’s no longer dark, it’s useless, and we’re just going to get rid of it? I understand your point about the security cameras. Maybe I’ve watched too many true crime episodes, but I always know that it’s bad when they lose the data from the security camera. That’s a common theme. But all joking aside, and black humor aside, if we think about an IT organization making thoughtful decisions about what they want to keep, I don’t know if that is dark or if it’s a really good data policy, because I don’t think the right answer is that an organization should hold on to everything for all time because what is useless may become useful in the future. I don’t know, I could be talked out of that position.
Gina: I like everything you said, but I do want to say I’m not sure that the IT department should be deciding all of that, because there’s more to it than just the cost involved with keeping it. It’s also all discoverable at that point in time, and there’s also the security piece: does keeping that much data, when you know what the data is, give you a wider exposure platform? So I think again this goes back to wanting to know what your IT department is throwing away, and they should be able to provide you with the rationale for why they’re throwing it away. If they are hoarding it, they should be able to provide you with that rationale too. But there probably needs to be more than the IT department involved: probably the legal department and whoever is responsible for security. And then just the business itself: would you do anything with this data, and what would you do? That means deciding from the business point of view what should be kept and what shouldn’t.
Stephen: Yeah, I completely agree with you that businesses should be able to make these decisions, but I still feel like there’s a chance that they’re not going to be able to make a good decision, for a variety of reasons: either because they don’t know what they don’t know, or because maybe they did decide that this wasn’t needed and now they find out it is. Or, as you brought up at the beginning, Gina, third-party external applications and so on may have value that companies aren’t even aware of. I think in all those cases it’s possible that there could be data out there at the edge that people kind of wish there wasn’t. And maybe that’s the next thing to talk about: what is the repercussion here? I mean, is it bad to have data that you don’t know about, or is it okay? Is that just how business works?
Allyson: I think ultimately you’ve got a situation where it depends on the type of data that it is, right? There may be data that you don’t necessarily want to hold on to because it could open you up to privacy concerns, or for a number of other reasons your lawyers might tell you it’s actually not good business practice for us to hold on to this data. So to that point, if you don’t know you’re collecting it and it’s just sitting out at the edge somewhere, that could be exposing you to things that you don’t want to be exposed to. And then there are always security risks associated with data, so if you don’t know what you’re protecting, how do you know if you’ve been breached? If you don’t know what value that data has, how do you know the cost of that breach or the business risk associated with it? So there are a lot of questions associated with this that lead me to the determination that, if it was my IT organization, I would not want to have dark data exposure, because it’s very difficult to wrap my hands around quantifying what the risk of that exposure is.
Gina: I would also think, as an application owner, I would want to know the risks for sure, but there’s also an opportunity. If I own an application or a program from a business perspective, whether it’s a SaaS application or something we built for edge consumption, I want to know what data is being created, or the possibility of what data can be created. I want to know what it is not keeping. Is there something it could keep that we want to keep, and how do we make sure that’s secure? And if it’s discoverable, what does that mean to the organization? So I think there has to be a tighter connection between IT and the business owner of whatever that program or application is. Just having the notion that there’s such a thing as dark data matters: it can be bad, but it could also be really good if it brings the data scientists the last piece of the puzzle to help them solve a problem for the business, that bit of data that they didn’t know existed until they did know. So I think the business folks need to understand what dark data is and work with their IT people to understand the risk of keeping the data.
Allyson: You know, it’s interesting, I was just talking to the good folks over at Calyptia, Stephen. For those who don’t know who Calyptia is, they build observability solutions; they’re the ones behind Fluentd, and a lot of folks are familiar with Fluent Bit and Fluentd from the open source space. One of the things he said that really piqued my interest was that after his customers used their solutions, their data problems got worse, because they discovered their dark data and all of a sudden they had a bigger data hygiene challenge ahead of them in terms of what to do with all of this data that they didn’t even know existed. But it does kind of validate the point of what we’re talking about, that every organization has it. A lot of companies are not using those observability solutions to find it, and once they do, they’ve got a big cleanup-on-aisle-five challenge in front of them in terms of getting it in order and figuring out what’s valuable.
Stephen: I want to talk a little bit about tools here too. You mentioned Calyptia. A couple of other companies I’ve been talking to about similar issues: I talked to Hammerspace earlier. They’re talking about basically bringing unstructured data from everywhere and anywhere into corporate control and centralization. Another company I just talked to today was Resilio, which is looking at how we can transport lots of data from lots of places and consolidate it into one spot. There’s a whole world of tools out there: observability tools, data analytics tools. I’m sure that the usual suspects in the cloud would be useful in edge conditions as well in terms of collecting and centralizing and getting smart about data. And then of course there’s one more elephant in the room, and that’s AI. I know this is something that both of you have talked about in terms of how we can maybe use that as a tool to help with this dark data problem. What do you guys think about tools and about AI specifically?
Allyson: Well, I’m interested to hear what Gina says on this. Obviously the Holy Grail of getting a handle on all this data is actually to go train an algorithm and do something interesting with it, and I think looking at ways to apply AI to their business is the main opportunity in front of IT organizations to be the center of business growth. As for whether data discovery gets better with AI, or data hygiene gets better with AI, I can’t imagine that data hygiene wouldn’t get better with AI, but I haven’t seen anything that says somebody’s come up with the solution yet, and maybe I’m just unaware. Gina, have you seen anything in the industry, in this space?
Gina: I think this is a really tricky question, right, because AI is going to use algorithms to extract meaning from data, so the quality of the data is going to be what actually gives you a result. An example I have: I’m helping a company called Simon Data, and they have a marketing platform that runs on top of Snowflake Data. They’re looking at taking all of the information about customers that might exist in Snowflake and turning that into meaning for the marketers, so that they can deliver very personalized campaigns to the customers they already have or the prospects they might see in the pipe. And what they have to do for that is a ton of data cleaning. They can’t run any kind of AI to do anything until they say, okay, this customer is, let’s say, Delta Airlines. I’m just going to throw something out there; I have no idea if they’re even a customer of theirs. That might be in their CDP, and even in their Snowflake platform, 500 different ways. Which is the one way they want to talk about that, and what is the one source of data that’s going to be the best source to describe their customer base and their prospect base? Customers have to go through a whole methodology of cleaning that data and getting it straight before they can even apply AI to it, because if you don’t, you’re going to get garbage out. When you run an algorithm and it tries to find something or do something, it’s not going to do a great job. So if you have an algorithm that goes across all of your data sets to find the dark data, you’d have to have it all defined really well: what is the data I’m looking for, and can I look at this? You have to have that all defined before the algorithm could even do anything.
Allyson: So what you’re saying I think is it’s probably not going to find the dark data?
Gina: I don’t think so. Well, I don’t know. I don’t think that would be an AI. I think that might be a tool or a script, like you guys are talking about, a tool to actually go and find those extra data sets. Once you find them and bring them in line with the other data sets you have, then you probably could write an algorithm to do something, but what would you want it to do? I don’t know if you can write it. Maybe you could write an algorithm to clean the data to get it to the right place. But I think that’s one of those things where you’re going to have to go in line by line to find every single Delta Airlines, every way it’s misspelled, everything, because that’s going to be potentially too hard to teach an algorithm to do, because of the things that humans do to create the data.
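To make the "tool or a script" idea concrete, here is a minimal Python sketch of the kind of cleanup Gina describes. It uses the standard library’s `difflib` to map spelling variants onto one canonical name. The `CANONICAL` list, the `match_customer` helper, and the `0.8` cutoff are illustrative assumptions, not anything Simon Data or Snowflake actually ships, and anything the script cannot match is returned as `None` so that a person still reviews it, which is in line with her point that humans remain in the loop.

```python
import difflib
import re

# Hypothetical canonical customer names; a real list would come from the CDP.
CANONICAL = ["Delta Air Lines", "United Airlines", "Southwest Airlines"]


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def match_customer(raw: str, cutoff: float = 0.8):
    """Return the closest canonical name, or None to flag for human review."""
    lookup = {normalize(c): c for c in CANONICAL}
    hits = difflib.get_close_matches(
        normalize(raw), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


for variant in ["Delta Airlines", "  delta   air-lines ", "Acme Corp"]:
    print(repr(variant), "->", match_customer(variant))
```

A similarity cutoff like this catches simple variants, but it will also miss or mis-merge some records, which is why the unmatched cases go to a person rather than being guessed at.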
Allyson: Yeah, I think this comes back to the human problem, right? We create these challenges and create bad data. I haven’t seen an AI cleaner yet; that would be lovely, because nobody likes data cleaning, nobody likes doing it. But unless you go through the cleaning process, your algorithm is going to be trained in a way that isn’t going to be effective. So I think AI is the opportunity statement for why you want to go find that dark data, why you want to clean it, and why you want to talk across organizations about why it’s valuable, but I don’t think it’s the solution to finding it. I think that’s observability, and I think it’s discovery solutions, and I think it’s roll up your sleeves and clean some data.
Gina: Yeah, I think that too. I think that’s more of an Ops type of role, and the data scientists and data science are more for the algorithms. As for dark data, I think it can be very exciting and very good, depending on what it is and the data that comes back. I just think it comes back to good old data center hygiene, though, it just does. I also want to say I think it would be really dangerous, potentially, as things stand now, to let AI go and clean itself or clean its own data, because what kind of dark data could an AI algorithm create? I think that would create more dark data than not, or junk data. You have to be really careful with all of that right now.
Allyson: I agree with you, Gina. I don’t think it’s a negative; I think it’s buried treasure. You know, you’re going to get a lot of treasure boxes. Maybe some of them are going to be empty, but that’s kind of that thrill of, is this something that’s going to be valuable that we didn’t even know we had? That’s exciting.
Stephen: I would say that there’s actually a reason for optimism here when it comes to edge specifically, because one of the challenges and opportunities of this environment is that in many cases it’s not general-purpose computing. It’s a very specific purpose that’s being deployed by a specific organization for a specific reason. I think that gives us a possibility we don’t have in, for example, the data center, where almost anything could be run, or on the desktop, where certainly anything can be run. At the edge, if you’re deploying an application or some hardware, you kind of know what you’re putting out there. It’s not like there’s just random stuff running out there, and that means those organizations might have a better handle on the data that they’re collecting, how they’re processing it, how they’re organizing it, how they’re retrieving it, and how they’re centralizing it than they might in other kinds of environments. You also don’t have the proverbial give-my-credit-card-to-Amazon problem that we have in the cloud, where somebody can just deploy something and then it gets out of hand, because, there again, you’re just not going to be deploying that stuff at the edge unless you really know what you’re doing. So for all the things that we’ve said, I do feel like there’s some optimism here, that this time may be better than previously. Well, thank you so much, Allyson and Gina, for joining us today on Utilizing Tech. As we wrap this up, where can we connect with you and continue this conversation on edge and any other topic you’re interested in?
Gina: Yeah, you can find me on LinkedIn, Gina Rosenthal, or at Digital Sunshine Solutions – with an s – .com, and we also talk about things just like this. We are actually going to publish a new episode of our podcast interviewing a data scientist. It’s called Tech Aunties, so TechAunties.com.
Allyson: So you can find me at TheTechArena.net. You’ll see a number of different interviews from across the edge on my platform, as well as a 2023 edge report that you might want to download and check out around some of the key challenges associated with enterprise adoption of edge. You can also find me at TechAllyson on Twitter and at Allyson Klein on LinkedIn.
Stephen: And as for me, you’ll find me here on Utilizing Edge every Monday, on the On-Premise IT podcast most Tuesdays, and on the weekly Gestalt IT news rundown on Wednesdays. So thanks for listening to Utilizing Edge, part of the Utilizing Tech podcast series. If you enjoyed this discussion, we would love to hear from you, so please reach out to us. Find us on most social media networks at UtilizingTech, or just drop me a line; you’ll find me at SFoskett on most social media networks. Also, if you like listening to this, you can find us in most podcast applications as well as on YouTube. This podcast is brought to you by GestaltIT.com, your home for IT coverage from across the enterprise. For show notes and more episodes, head to our special dedicated website, UtilizingTech.com, and as I said, you can find us on social media at Utilizing Tech. Thanks for listening, and we will see you next week.