When it comes to edge computing, money is not limitless. Joining us for this episode of Utilizing Edge is Carlo Daffara of NodeWeaver, who discusses the unique economic challenges of edge with Alastair Cooke and Stephen Foskett. Cost is always a factor in technology decisions, but every decision is multiplied when designing edge infrastructure with hundreds or thousands of nodes. Total cost of ownership is a critical consideration, especially operations and deployment on-site at remote locations, and the duration of deployment must also be taken into account. Part of the solution is designing a very compact and flexible system, but the system must also work with nearly any configuration, from virtual machines to Kubernetes. Another issue is that technology will change over time, so the system must be adaptable to different hardware platforms. It is critical to consider not just the cost of hardware but also the cost of maintenance and long-term operation.
Hosts and Guest:
Stephen Foskett: Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on edge computing which demands a new approach to compute, storage, networking, and more. I’m your host Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT and joining me today as my co-host is Alastair Cooke. Welcome to the show Al!
Alastair Cooke: Thank you, Stephen, it's lovely to be back on the show.
Stephen: It’s lovely to have you. We have been talking since, well since forever about the various factors that affect the computing decisions we make and yet one of the factors that we don’t talk about enough is economic. I think it’s easy to think well we’ve got all the money in the world we can do anything we want, but especially when it comes to edge that is not a valid assumption.
Alastair: Yeah, my background is having worked with some pretty large organizations, and it feels like they have infinite money. You look at a global pharmaceutical company and you soon learn how large those amounts of money are. But that's very different when you start looking at edge, because a retail business operates on very thin margins, and edge is typically running out at all kinds of retail businesses where there isn't a lot of money sloshing around, so the ability to extract as much value as possible, revenue in the end, out of that money is important. It's not always revenue, there are organizations that aren't revenue driven, but in the end, getting as much value as possible out of that spend seems to be much more critical at the edge.
Stephen: Yeah, and there's a multiplier effect that we have to consider here, especially with edge. Obviously the same thing is true when you're buying a bunch of servers for cloud or data center, but with edge the multiplier tends to, well, multiply real quickly, because if you've got hundreds of sites or thousands of sites, every decision you make can have a huge impact on the ultimate bill and the ultimate cost effectiveness of the solution you're providing. That's one of the things that we were talking about with our guest this week, Carlo Daffara from NodeWeaver. Welcome to the show, and thank you for joining us. Do you want to introduce yourself for a minute?
Carlo Daffara: Thank you very much it’s a pleasure for me to be here as well. My name is Carlo Daffara, CEO of NodeWeaver and I’ve been working in the field of economics for IT for the last 15 years.
Stephen: And of course this is an area that you know quite a lot about because you have been instrumental in developing a very practical solution for edge computing. Tell us a little bit more about this topic from your perspective.
Carlo: Well, the overall economics of the edge is something that gets overlooked a lot, because everyone focuses on the technology alone, or on a specific single use case where everything works perfectly. When you are in a lab everything works, when you have one or two servers, when you have someone there that is able to manage or repair something. It becomes much more difficult when you have ten thousand locations, when the locations are in different legal jurisdictions, when you have problems because you are installing something on top of a telephone pole or in a place where it's not possible to reach things easily, or you don't have a monitor and keyboard. So the economics should take into account not only what works today in a lab, but what gets deployed, what will be used, and what applications will run there now and in the future.
Alastair: I think there's a really interesting point in there, this idea that edge locations can be as strange as something that goes at the top of a power pole, and that there must be some economic factors here that are delivering value in places we wouldn't previously have put compute resources. Carlo, I'm interested in what you've seen with customers about how they're using these cost-effective solutions to deliver value they wouldn't previously have considered they could possibly deliver.
Carlo: We have a wide variety of customers in many different areas, starting from industrial automation, where our initial deployment cases were. The basic idea is that edge is not a single concept; there is a wide spectrum of things that people call the edge. The edge is a very small device that is attached to a data collection system. The edge is a video recording unit in a casino, or it may be a massive processing system doing AI in areas where, maybe for legal or bandwidth reasons, you're not able to send too much data. So there are lots of areas. We start from very small devices that can fit in a hand, that have two physical cores and just two to four gigabytes of memory, but they run very important applications, for example recording data for something that needs to provide reliable timestamps, up to extremely sophisticated applications that do data processing at scale. So it's a wide variety of applications.
Stephen: It's interesting to me, Carlo, that one of the things you start with when talking about this is not the constraints of finance but the constraints of technology that may demand compromise in finance. In other words, you didn't come to it right off the bat and say, oh yes, with edge things can quickly get too expensive and you've got to control costs. You came to it and said, people want to control costs, and sometimes the technologists have to come back and say, no, no, we need a system that's good enough here. Am I reading you wrong, or is that really where you start with your conversations?
Carlo: Oh, the edge is about the application. The technology comes last. What you care about is that, for example, you have a predictive maintenance system and you need to collect information at a certain speed, so you have a certain amount of data to be delivered and processed in a certain amount of time. The key constraint is the application, because that's what drives the value for the company, for the end user. Once you have that, everything else is added cost. The ideal situation would be to minimize the hardware that is needed to deploy and execute this application. The second aspect is everything else, because you need to send the hardware somewhere, someone has to install it, and someone has to manage these devices in the field. So what we look at is minimizing the total cost of ownership and management for a wide range of applications and for a long period of time. There are customers that deploy applications that will be in the field for 10 to 15 years, which means that you have to think about things like hardware replacement. How do we replace the hardware? What kind of complexity does it involve? Do we need to shut things down? Do we need to send a trained technician? We have a customer in the south of the Sahara, and it takes a two-day drive to reach the site. If you have to send a trained technician, you first have to find one and then you have to pay for them to go there. It's a huge cost.
Alastair: I think that also hits on a really important idea, that it's about total cost over time. There's the usual tension between "I need enough resources, enough capability, to do what's being asked and deliver value" versus "what that's going to cost for the hardware." But that idea of the engineer who has to drive for two days, and two days back as well, highlights to me how this explodes when it goes wrong. If we get the math wrong and the economics wrong in a data center, we might be ten, fifteen, twenty percent over, and in a cloud environment that's terrible. But in an edge environment, if you get things wrong and you have to go back out and send engineers to every site, you're talking about a multiple of your normal operating costs for that site. I think that's where the focus on cost needs to be more about the whole life cycle of the application, remembering that over ten years we're probably going to want to deploy additional applications out there. There must be some tensions around having enough resources for future applications versus just current applications.
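The multiplier effect the hosts describe can be made concrete with a back-of-the-envelope total-cost model. This sketch is purely illustrative; all figures and the cost structure are hypothetical assumptions, not numbers from the episode:

```python
# Hypothetical back-of-the-envelope edge TCO model.
# All figures are illustrative assumptions, not real pricing.

def edge_tco(sites, hw_cost_per_site, truck_roll_cost,
             failure_rate_per_year, years):
    """Total cost of ownership across all edge sites over the deployment."""
    capex = sites * hw_cost_per_site
    # Assume each failure at a remote site triggers one on-site visit.
    truck_rolls = sites * failure_rate_per_year * years
    opex = truck_rolls * truck_roll_cost
    return capex + opex

# 1,000 sites, $2,000 of hardware each, $1,500 per technician visit,
# 5% of sites failing per year, over a 10-year deployment:
print(edge_tco(1000, 2000, 1500, 0.05, 10))  # 2750000.0
```

With these made-up numbers, on-site visits alone add $750,000 on top of $2,000,000 of hardware, which is why small per-site decisions and avoided truck rolls dominate edge economics at scale.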
Carlo: Yeah, the big issue for the end user is that they start with one application that they need to deploy. They have a use case, they have the economics for it, which means that they know what kind of benefit they expect the application to bring in terms of added value, increased reliability, and so on. The edge deployment is built around that initial core application. What we found is that after roughly one or two years, they start to deploy more, because they see the value of it. They are already invested in the infrastructure, the software, and the knowledge that is needed to keep it up, and at that point they start to see the value of platforms that can grow more or less linearly, without having to change everything or shut things down, so that the platform can continue to operate even if you change the application itself.
Stephen: Carlo, I want to get back to one of the things you said at the very top, about functional systems in the lab, because as you point out, it's very easy in your lab or on your desktop to put something together that sort of works. But to have something that is guaranteed, absolutely, definitely, one hundred percent to work when it's deployed on site, at scale, and over time is absolutely critical. How do you do that? How can you possibly test that and know for sure that it's going to work?
Carlo: Well, the key aspect is treating everything as a possible failure, both the hardware and the software. That's why one aspect of edge is autonomics, the ability of the system to compensate for failures, which will happen. If there is one thing that is certain, it's that you will have failures, so you need a system that is able to reliably handle issues like storage that doesn't work, or that sometimes works and sometimes doesn't. For example, we have a system that was deployed on a platform that we discovered later on was vibrating. When it started vibrating, the storage stopped working, and it started back again after a few minutes. Or you have systems that overheat, like one we had in Ethiopia which is basically exposed under the sun. Everything, including the software components themselves, needs to be treated as something that can fail and needs to be able to restart or compensate automatically. When you have something like this, you can be reasonably sure that you have a minimum level of support from the infrastructure for your application, and eventually someone can do the fine tuning if it's needed. But the idea is that when you deploy ten thousand systems, you will have roughly one percent that have some kind of failure, and you need to make sure that these failures are handled automatically, because otherwise you are looking at having a full staff of ten people or so just doing firefighting.
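The autonomic idea Carlo describes, treating every component as something that can fail and restarting it automatically rather than escalating to a human, can be sketched in miniature as a supervision loop. This is a simplified illustration, not NodeWeaver's implementation; the component model and names are hypothetical:

```python
# Minimal sketch of an autonomic supervision loop: every component,
# hardware or software, is assumed to fail eventually and must be
# recovered automatically rather than escalated to a person on site.

class Component:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def check(self):
        # A real probe would test storage, network, a process, etc.
        return self.healthy

    def restart(self):
        self.restarts += 1
        self.healthy = True  # assume the restart compensates for the fault

def supervise(components, rounds):
    """Run health checks and restart anything that fails, no human needed."""
    for _ in range(rounds):
        for c in components:
            if not c.check():
                c.restart()

disk = Component("storage")
disk.healthy = False           # e.g. vibration made the storage drop out
supervise([disk], rounds=3)
print(disk.healthy, disk.restarts)  # True 1
```

A production system would layer timeouts, backoff, and escalation on top, but the core discipline is the same: failures are expected inputs to the control loop, not exceptions.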
Alastair: Do you see customers looking to receive that sort of redundancy and reliability from an underlying platform, which is more akin to how enterprises build their applications, where the application can assume everything underneath it is perfect? Or do you see customers building it more like it runs on a cloud, where your application has to tolerate the underlying infrastructure failing? Or is it a combination of both of those things that comes together to build the system?
Carlo: That's a very good question. It really depends on the customer and the kind of basic technology choices they make when they deploy an application. What we saw from our current customers is that, first of all, despite all the talk about containers, the vast majority of them still deploy VMs. They have homegrown VMs, they have applications from major providers that still run in VMs, and they will keep running VMs for a long time. So you need some underlying layer that provides reliability for these VMs. You cannot simply expect everything to be handled at the application level. We see a movement towards reliability at the highest level, for example through Kubernetes or other management platforms. The biggest problem is that some of these platforms come from the world of the data center, especially large-scale data centers, so they expect a level of availability and a quantity of resources that sometimes is not available at the edge. We know a customer that started out at the edge deploying a platform based on Kubernetes, and they started by saying, okay, we need 192 gigabytes of memory to boot it. When the technician said, okay, we have space for something that is book sized, it should consume no more than forty watts, and it will have eight gigabytes of memory, they said, oh well, then we will not use Kubernetes. The biggest point is that, again, the application is king. What drives everything is the application. If the application runs in a VM, then we need to provide the reliability for it. If it runs in containers, sometimes that's done by the higher levels, but most of the time they expect some aspects of manageability and reliability to be provided by the platform anyway.
Alastair: I think you highlighted a recurring theme, that although the dream of the edge is sold on Kubernetes and containers, the reality of the edge is still a heck of a lot of legacy, what we normally refer to as production. And I think that perspective on Kubernetes as being heavyweight is not uncommon. How do you run a Kubernetes cluster at the top of a power pole? Container orchestration takes on a whole other dimension when you're talking about edge, because Kubernetes wasn't designed for running ten thousand clusters; it was designed to run ten thousand containers in a cluster.
Stephen: And then there's the disconnection aspect as well, Alastair, that we've talked about, where it was not designed to have occasional or interrupted connectivity and so on.
Alastair: Yeah, and we see that a lot, things that work really well in the cloud that are then being pushed out to the edge. Some of the larger edge solutions run nicely so long as you've got a full-time connection, but they don't operate by themselves without one. One of the things I liked about the NodeWeaver solution as I was looking at it was the idea of this autonomic management and having a minimal required infrastructure, because they do this thing called DNSOps, where rather than having a heavyweight infrastructure to deliver configuration, you just look up some DNS entries to find your configuration. Carlo, how much infrastructure do customers actually need to have in place in order to get value out of edge platforms, and the one you know the most, of course, is NodeWeaver?
Carlo: Well, on the edge side we have customers that deploy applications, for example in the industrial world, on fanless systems with two physical cores and eight gigabytes of memory, so they are very small. We do lots of industrial controls like SCADA, which tend to be Windows machines with 16 gigabytes of memory. The infrastructure side tends to be very light, because DNS is universal. It works, it's distributed, it's reliable, and it takes very small UDP packets, so it's very fast, very quick. The overall layer, including for example all the distributed monitoring aspects, usually takes one or two VMs hosted somewhere just to archive the data for logs and things like this. So it's something that can be done really by a company of any size.
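The DNS-based configuration idea could look something like the sketch below: a node resolves TXT records for its own name and parses them into settings. The record layout, keys, and domain are all hypothetical, and the lookup half assumes the third-party dnspython library; only the parsing half is self-contained:

```python
def parse_txt_config(txt_strings):
    """Turn TXT record strings like 'role=worker,storage=local'
    into a configuration dict. Keys here are made up for illustration."""
    config = {}
    for record in txt_strings:
        for pair in record.split(","):
            key, _, value = pair.partition("=")
            if key:
                config[key.strip()] = value.strip()
    return config

def fetch_config(node_name, domain="edge.example.com"):
    """Look up a node's configuration from DNS TXT records.
    Requires the third-party dnspython package (pip install dnspython)."""
    import dns.resolver
    answers = dns.resolver.resolve(f"{node_name}.{domain}", "TXT")
    # each rdata carries its text as a tuple of byte strings
    strings = [b"".join(rdata.strings).decode() for rdata in answers]
    return parse_txt_config(strings)

print(parse_txt_config(["role=worker,storage=local"]))
# {'role': 'worker', 'storage': 'local'}
```

The appeal Carlo describes is that the "management plane" here is just an ordinary DNS zone: distributed, cacheable, and reachable over a single small UDP exchange.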
Stephen: I think that some of the people in an enterprise might disagree with you about the reliability of DNS, but I should point out that the unreliability people encounter is often due to changes in configuration, not to any inherent unreliability of the system itself. I think most of the errors that we hear about are actually errors that someone has committed. So given this, and on the VM topic as well, another aspect is that even if you are one hundred percent containerized now, there's no saying that you won't be needing a VM in the future, because as we talked about, this thing is going to be out there for a long time. You don't want to touch it, you don't want to mess with it, so it should be ready for that eventuality as well. And I think that's another reason that these, I guess, hyper-converged systems, if we can call them that, are attractive, because essentially you can run anything on them. Is that the idea? I mean, NodeWeaver supports a heck of a lot of applications running on these nodes, pretty much anything.
Carlo: Yeah, we have everything from extremely old systems for doing microscopy running on Windows 95, to lots of applications in the financial sector, to lots of virtual network functions. One of the largest cruise ship operators has all its onboard networking done through NodeWeaver, and it runs multiple virtual appliances by major vendors, and they all appear as running on bare metal. That's very important because they need to be certified. Some of those applications are simply not containerizable; they need special kernels, they need special device drivers, and this means that you need to run them in a VM. Actually, what we do is run Kubernetes as well, running in what we call thin VMs, which are very thin hypervisors similar in design to Kata Containers, but designed to run nearly everything instead of just one or two things, and this way we have fairly good efficiency. We basically have the same performance as a pure container layer, but it's completely insulated, so you can even run, as some customers do, multiple versions of Kubernetes at the same time.
Stephen: And the key is that it's incredibly lightweight. I think that's the technical differentiator here: your hypervisor is really not taking up much memory at all. When we talked about the solution, that was the thing that really impressed me. It's very thin.
Carlo: Yeah, we had the possibility to work with the European Commission on a few research projects on this, on the minimization of the platform itself. So we are fairly proud of being able to run the orchestrator, the autonomic system, the software-defined storage and networking, and the hypervisor in less than one gigabyte of memory. That is a very important point from the economics point of view, because if your application takes a few gigabytes of memory, you don't have to buy much larger hardware to run it. You just need exactly the hardware you would need if it were executing on bare metal.
Stephen: Yeah, and it's the same when it comes to storage, as we discussed. I have a lot of experience running various Kubernetes flavors and distributions, and many of them take up a lot more storage than you would expect, especially as they're running over time. And again, that's another thing I think people don't realize: you can install it on just a few gigabytes, but pretty soon it's going to be consuming many, many gigabytes of storage capacity for logs and random stuff that Kubernetes puts out there.
Carlo: Yeah, the biggest problem is that Kubernetes was designed for a different environment. In most edge devices you have a limited amount of space, because the devices tend to be small. They are also built with hardware that needs to be reliable, which means it is not very fast. And Kubernetes takes for granted that you have nearly unlimited storage and that this storage is available, meaning it will always be there in one form or another. So it's not that Kubernetes is not good; Kubernetes is a wonderful technology. The point is that trying to apply Kubernetes as is, everywhere, brings its own impedance mismatch, and it becomes difficult to adapt things to the edge itself. In our platform, storage is treated as a sort of cross between an object store and a transactional system, and we had to do this because we take for granted that we will have shutdowns, power-offs, and hardware failures more or less continuously. In fact, one of the things that we test is a server that is shut down forcibly roughly every three or four minutes, and it needs to survive. This is not so strange. We had a customer deployment in areas like rural India where power failures are so common that basically no one cares about them anymore, but the hardware does, and the software especially does; your application does.
Alastair: Cycling back to the economics we started with, it does seem like leaning out your application and the infrastructure it requires is an important part. But I want to bring back the idea that over time there's the operational effort of getting people there, getting it deployed, getting hardware replaced, and adding hardware when you find that you can deliver more value with more of it out there. What kind of things do you see as being important with customers on that journey towards making things far more economically scalable than a data center operational model allows?
Carlo: Well, there are a few things that we have seen in the last seven years. One, for example, is the basic assumption that the hardware will change. You cannot take for granted that you will always have the hardware available. We had this example with a retail customer during the pandemic: they had to replace a system and they had no way of having it shipped, so they had to use whatever hardware they had available, which was a PC used by the secretary. So you cannot assume that you always have the hardware, that there will be a technician there able to replace it, that the replacement will be transparent, and especially that the application will stay the same. One thing that we have found is that the application changes with time, so the configuration and the tuning that you do in the beginning to make it run optimally will not be optimal one year from now. That is why we have an engine that watches what the application is doing and uses AI to adapt the hypervisor parameters to the workload that is running now, because it is not the same one that was running one year ago and you have a different volume. For example, in a video streaming application you start with ten cameras, and after six months there are four hundred cameras on a single node, and you have to change things because it will not run otherwise. Having someone go there and do this kind of manual tuning is extremely costly. It needs a lot of competence, and it also means getting multiple companies and multiple vendors to work together at the same time, which is like herding cats. So you basically need something that does it on its own. If you're able to automatically tune things to reach 90 to 95 percent efficiency, you're done; you don't need anything more. That is a huge value, because the customers simply see everything running as it should, instead of degrading performance over time.
Stephen: How do you deal with the fact that a system might have multiple different node types with different hardware capabilities all working together? I can see that over time you might have a very old node, a very new one, and a very off-the-cuff repurposed desktop or something, all working together. How do you balance that? How do you decide how to make proper use of the resources on those nodes?
Carlo: Well, that's a very important point, because one of the things we found in the industrial world is that after five years, the hardware that you want to replace probably is not manufactured anymore, and it's so old that it's not economically effective; you need to buy something new. So what we do is take into account not only things like CPU speed or the amount of memory, but a whole bunch of other things, like how many interrupts you're processing and what kind of network card you have, basically everything, through a group of small binaries called probes that run on every system. Then we dispatch the individual pieces to the individual nodes and we see how they perform, whether they are going fast enough or too slow, and they basically move and migrate on their own. There is no central point of management. Every node watches the others and tries to say, "I'm not able to take any more, because if I take a little bit more I will start to degrade my performance, so please, someone, take some of my work." This kind of thing balances itself over time, so it's not, let's say, a precise, analytically computed solution, but it stochastically reaches the best performance over time.
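The peer-to-peer balancing Carlo describes, where each node monitors only its own headroom and asks peers to take work rather than relying on a central manager, can be sketched like this. It is purely illustrative and not NodeWeaver's actual algorithm; the capacity units and node names are invented:

```python
import random

# Illustrative sketch of decentralized load shedding: each node knows only
# its own capacity and hands work to a random peer when it is over its
# limit. Run locally on every node, load evens out stochastically.

class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # abstract units of work it can handle
        self.load = 0

    def headroom(self):
        return self.capacity - self.load

def shed_step(nodes):
    """One balancing round: overloaded nodes shed work to a random peer."""
    for node in nodes:
        if node.headroom() < 0:                 # performance would degrade
            peers = [p for p in nodes if p is not node and p.headroom() > 0]
            if peers:
                peer = random.choice(peers)
                moved = min(-node.headroom(), peer.headroom())
                node.load -= moved
                peer.load += moved

old = Node("old-node", capacity=4)
new = Node("new-node", capacity=16)
old.load = 10                      # repurposed desktop is overloaded
for _ in range(5):                 # a few gossip rounds
    shed_step([old, new])
print(old.load, new.load)          # 4 6
```

Nothing here is globally optimal in a single step, which mirrors the point in the conversation: repeated local decisions converge toward a good placement over time without any big central node doing the math.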
Stephen: And this is key to the economics as well, because essentially what you're talking about is making optimal use of the equipment available, not sacrificing cost for consistency but making the most you can out of the equipment available.
Carlo: Yeah, exactly. Also, equipment changes with time. SSDs become slower over time because they accumulate too many writes. Rotational units may become more or less damaged, and even your system can become slower because it's accumulating dust and the temperature inside it grows, which are a few of the things we have found over time when you deploy in the field. You discover lots of things. The key point is that having the system do it on its own, without the need for central management, means that every node takes some of the load itself; you don't need a very big node in the center to manage everything. The other aspect is that this is done continuously, so the kind of balance that works today will be different ten months from now, one year from now, when the system itself will be different.
Alastair: So I think in terms of the economics that we're talking about, there are a couple of pieces here. One is that we have a relatively static amount of resource available this year, and yet we need to make the best use of it as our workloads change over time, so we're delivering the most value. Then there's another dimension around how you actually physically operate that over time; the enemy of an edge deployment is sending a human to site, and particularly sending a skilled human to site. And then there's the cost when we first deploy stuff out. So I think there are kind of three dimensions where things can go wrong, and my takeaway is that at the edge these three dimensions can each go wrong far more severely than we would see in a more centralized deployment.
Carlo: Yeah, that's absolutely true. It was a huge effort for us in the first deployments that we did to actually go with the customer and see what they are doing and why they do the things that they do. They always have a reason. If you go into a plant, you may have regulations which mean, for example, that your hardware has to be checked before entering, which means that you cannot bring anything in outside of the hardware itself. You may not be able, for security reasons, to have an external technician go there, which means that you have to lay out the instructions on a single sheet of paper, and everything needs to be done with only a screwdriver. That's why when you do, for example, zero-touch deployment, you basically just boot up the hardware without a monitor and keyboard, because in most areas you don't have a monitor and keyboard, and you just wait. After roughly two or three minutes you hear the system playing a tune, a happy tune, which means that everything works fine. And if not, you hear something like a bad tune, which means that the hardware is not working and you need to replace it.
Stephen: Yeah, I think these are the key factors to consider, and Alastair, I really love how you summed that up. The key for me is really what you pointed out there, that any of these things can explode. It's easy to think that the initial hardware choice is the most important factor: if I decide that I need to deploy thirty-two gigs of RAM instead of sixteen, multiply that by a thousand and there's the total cost of that decision, or I need to deploy three nodes versus four. But that's really not the right way to think about it, because you also have to think about growth over time, you have to think about maintaining serviceability over time, and as you mentioned, and it's so true, depending on the environment, the operational and hands-on aspects can really wreck the economics of the entire situation. So given all of this, I think it's pretty clear that the optimal solution almost anywhere is going to be a system that is very flexible, makes the best use of the hardware at hand, and is also, as Carlo was just saying, very hands-off, very zero touch. Because even if it's not a big deal, even if you don't have to have somebody drive across the desert for two days to fix it, you may just not want anybody to have hands on it. And so I think that a very autonomous and configurable system is really the ideal one. So thank you so much for joining us here, Carlo. It's been a lot of fun talking to you. We can't wait to see you as well at Edge Field Day. Before we go, where can people connect with you and continue this conversation and maybe learn more about NodeWeaver?
Carlo: Well, they can go to our website at nodeweaver.eu, but we would really love to have everyone watch us at Edge Field Day, where we will try to show what we can do in the best possible way and especially get questions from the attendees.
Stephen: Absolutely, and we welcome questions during Edge Field Day as well, so please do find us on social media, on LinkedIn, and so on. Alastair, how about you?
Alastair: Well, you can find me online under my DemitasseNZ (NZ for New Zealand) brand, as well as at the vBrownBag, where I'm very involved. You can catch up with me at VMworld, either in the US or Europe, and I'm hoping to be involved in Edge Field Day 2 as well. I really enjoyed Edge Field Day 1, and definitely the questions are an important part of it. Edge Field Day and the whole Tech Field Day family are about a conversation between vendors and technologists who have their own perspectives and interests.
Stephen: Absolutely, and I do love a good demitasse of coffee, especially New Zealand coffee, so I'm looking forward to seeing you again, Al. Thank you for joining us and for listening to this episode of the Utilizing Edge podcast, part of the Utilizing Tech podcast series. If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving a review; we would love to hear from you. This podcast was brought to you by GestaltIT.com, your home for IT coverage from across the enterprise. For show notes and more episodes, head over to UtilizingTech.com or find us on Twitter or Mastodon at @UtilizingTech. And as I mentioned, Edge Field Day is coming in July, and you can learn more about that at TechFieldDay.com. Thanks for listening, and we'll see you next week.