Achieving High Availability at the Edge with StorMagic

Although everyone wants high availability from IT systems, the cost to achieve it must be weighed against the benefits. This episode of Utilizing Edge focuses on HA solutions at the edge with Bruce Kornfeld of StorMagic, Alastair Cooke, and Stephen Foskett. It might be tempting to build the same infrastructure at the edge as in the data center, but this can get very expensive. With multi-node server clusters and RAID storage, the risk of a so-called split brain means that in most cases not just two nodes but three must be deployed. StorMagic addresses this issue in a novel way, with a remote node providing a quorum witness and reducing the need for on-site hardware. Edge infrastructure also relies on so-called hyperconverged systems, which use software to create advanced services on simple and inexpensive hardware.

Hosts and Guest:

Stephen Foskett, Organizer of the Tech Field Day Event Series, part of The Futurum Group. Find Stephen’s writing at GestaltIT.com, on Twitter at @SFoskett, or on Mastodon at @[email protected].

Alastair Cooke is a CTO Advisor for The Futurum Group. You can connect with Alastair on LinkedIn or on X/Twitter and read his research notes and insights on The Futurum Group’s website.

Bruce Kornfeld, Chief Marketing and Product Officer at StorMagic. You can connect with Bruce on LinkedIn and find out more about StorMagic on their website. You can also see presentations from StorMagic at Edge Field Day 2 in July.

Full Transcript

Stephen Foskett: Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on edge computing, which demands a new approach to compute, storage, networking, and more. I’m your host Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. Joining me today as my co-host is Alastair Cooke. Welcome, Al.

Alastair Cooke: Hi Stephen, nice to be back talking to you today.

Stephen: It is great having you from the other side of the world. Now we both got into IT a while ago and we’ve both seen quite a lot of water go under that bridge. It’s interesting, edge is quite a bit different from data center or cloud IT, don’t you think?

Alastair: I think it draws from both of them, and it definitely is different. You can’t just transplant from either of those locations and expect everything to work the same at the edge. But fundamentally we come back to what is the business problem you’re trying to solve and how can we do that with the least effort and least risk.

Stephen: And yet some things in Enterprise Tech, some of the data center concepts that you and I have spent our whole career working on, things like high availability and continuous availability, you know, that’s a little bit more difficult to accomplish and some people I think are maybe throwing up their hands and saying “I’m not sure that we can have this.” Maybe in Edge environments we need to make things disposable and run it on a single node instead of multi-node or things like that. It’s a real challenge.

Alastair: Yeah, I think there’s a mix in there. There are some very cost conscious places where there really isn’t valuable data or services that are really critical, where an outage of five minutes, an hour, a week is inconvenient but it’s not business critical. But I think there are some edge use cases where that does become really critical. Imagine you have an outage at your favorite gas station and you can’t fill up for a week. That seems like a problem. I mean, even not being able to get service for five minutes is probably going to lead customers to go to a different station. And so there can be some business impact even at the places that we think of as being relatively low cost, with low-value data, but actually the service delivery there is important.

Stephen: Well absolutely, I mean I need my Slurpee and my hot dog too, I mean don’t forget that. But one of the things that occurs to me is that… one of the reasons that we don’t have some of these things, I think people would want high availability at the edge. I think it’s just too hard to achieve in some cases. It takes too much hardware, too much software, too much cost, and I think that that’s the reason that some people aren’t implementing it. But that’s what we’re going to talk about today on Utilizing Edge. Our guest today is an old friend of mine, Bruce Kornfeld, who is Chief Marketing and Product Officer at StorMagic. And one of the things about StorMagic is that they’re trying to tackle this high availability question at the edge. So welcome to the show, Bruce!

Bruce Kornfeld: Thanks Stephen, great to be here. Thanks for having me. Looking forward to a lively chat.

Stephen: Well it’s always a lively chat with you. I am really interested in diving into this because like I said one of the challenges for high availability is the idea that you need to have basically redundant hardware, you know, and redundant hardware is expensive. So if I tell you you need to have three nodes instead of two or if I tell you you need to buy 33 percent more storage, 33 percent more memory, 33 percent more everything, that’s a huge problem when it’s multiplied by a thousand nodes right?

Bruce: Yeah listen, at StorMagic, you know, we talk to customers about this all the time, and a lot of times they just cannot cost justify the extra hardware and software needed to achieve the high availability. They work with their organizations to say, sorry, you know what, we can’t, the budget isn’t there, so we’re going to have a single node, we’re gonna have a single server, and if that thing goes down, here’s the impact: there’s going to be downtime for four hours, a day, a couple days, don’t know. There are cases that have to go that route, but we’re seeing a lot of cases the other way where, because technology is becoming less expensive, there are ways of achieving both.

Alastair: I think there’s a really interesting piece also around efficient use of the storage, because there are a few solutions around that do this cluster of servers, turning local storage into shared storage, which is the most cost effective way of doing this. But some of them put on quite a lot of overhead, and of course it’s the same story as having that extra third node: having an extra 20 percent of resources in every server that you’ve got at ten thousand sites gets to be really expensive. So I think there’s a need for solutions that actually fit the small scale.

Bruce: Yeah, typically what we’re seeing is that a lot of IT professionals are used to what they’ve worked on in the past, which is typically they built a big data center or a small data center that has dozens, hundreds, thousands of servers and lots of storage. The inclination is let’s just do the same thing at the edge, and that’s a big problem because typically you need at least three servers. And yeah, using the internal storage, that’s the whole software-defined storage thing, and there’s a lot of solutions for that out there, and they all do it fine and they all provide the high availability, but typically it requires at least three, sometimes four, in order to ensure that a node can go down. So you know, a lot of times it comes down to the cost equation and the management bit, because typically these data center designed solutions, software-defined storage, HCI, use the flavor of word that you want, they’re designed for big enterprise, and the edge presents a lot of challenges that make those solutions hard to fit. And it’s mostly cost. I mean, it really comes down to cost.

Stephen: Yeah, that’s my feeling too. It really does come down to cost, because, you know, a lot of people listening here, you might not be storage nerds. I’m a storage nerd, but if I say RAID to you, the first thing you’re probably thinking of is waste, that that’s extremely expensive, you know: oh, I got to go buy twice as much, I got to buy three times as much. If I say high availability, the first thing you’re going to think of is waste. Oh, I got to go buy multiple systems, I got to go buy multiple servers, I got to ship them all out there, I got to power them, I got to keep them running, and they’re not doing anything. You know, why do I have to have all this wasted idle hardware out there just in case something goes wrong? And honestly, as much as it pains me as a data center guy to say this, you’re not wrong about that. RAID is overhead, it is waste. High availability, you know, having an extra system out there, it absolutely is. And you need to think about that, you need to think about the cost, and you need to think about the benefit of deploying this stuff and whether it’s worth it. Right?

Bruce: Yeah, yeah, the way… I completely agree, and a lot of customers still, they cost justify it. It makes sense for their business to have the RAID. I mean, RAID is very, very common, as you know. But, you know, the good news there is the cost of the physical storage has come down so much that yep, instead of, you know, three drives we’re going to have to buy four or five, but that cost isn’t a major one. It becomes more significant when you’re talking about another whole node, a whole other server with five drives in it, etc. The technology that we deliver that helps in this case is to allow for a two node solution instead of three. Most everyone else requires three, and that’s a huge cost when you have dozens, hundreds, or thousands of sites. And the technology is kind of simple. You and I have talked about it in the past, but it’s a remote witness, because you still need three nodes in order to have a proper HA solution. I don’t think that the topic of this is to argue that down to the bottom, but I’m pretty sure that most IT people would agree that you need three to have a very highly available cluster or quorum.

Stephen: So that’s I think the key. Again, to the IT audience that we have, I think that it’s important to understand: yes, if you have a cluster of, well, if you have a cluster of one node then you don’t have a cluster, I guess. You have a node. And if it goes down, it goes down. If you have a cluster of two nodes and you run everything on one of them, and if it dies then it runs on the other one, that’s all well and good. But it’s really hard to make that highly available, because what happens if you snip the connection between those two nodes and both of them think that they are the node? That’s when things go wrong. And that’s what Bruce is talking about here. The reason that we IT people always say you need three nodes is exactly that. They call it a split brain or something like that, where essentially you’ve got two nodes that used to be coordinating and now are no longer coordinating, and both of them are like “I’m in charge here,” you know, Alexander Haig. And they’re all but, you know, running off saying “Hey, listen to me, I’m doing my thing, I’m making the decisions,” and that’s a big problem because things can get corrupted. You can have torn database accesses, for example. You can have a client that’s talking to this one and thinks that he’s just charged my Slurpee, and this other one is like “nope, no Slurpee here,” and suddenly you’re losing money. It’s a huge problem, and that’s why in most cases you can’t have a two node cluster. You need a three node cluster, and again, that’s the thing that StorMagic is focused on, that’s a thing that a lot of companies are focused on: trying to figure out how can we not have to have three nodes.
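Editor’s note: the split-brain reasoning Stephen describes can be illustrated with a minimal majority-vote sketch. This is a generic illustration of quorum voting, not any vendor’s implementation, and the function names `has_quorum` and `surviving_partitions` are hypothetical. The rule assumed here is that a group of nodes may keep serving only if it can reach a strict majority of all voters.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may keep serving only with a strict majority of all votes."""
    return reachable_votes > total_votes // 2


def surviving_partitions(partition_sizes: list[int]) -> list[bool]:
    """For each network partition, report whether it still holds quorum."""
    total = sum(partition_sizes)
    return [has_quorum(size, total) for size in partition_sizes]


# Two voters, link cut: each side sees 1 of 2 votes, so neither has a majority.
print(surviving_partitions([1, 1]))  # [False, False]

# Three voters, one isolated: the side with 2 of 3 votes carries on alone.
print(surviving_partitions([2, 1]))  # [True, False]
```

With only two voters, a cut link leaves the cluster with two bad choices: both sides stop (no availability) or both sides carry on (split brain). A third voter breaks the tie, which is why three is the usual minimum.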

Bruce: Yeah, there are technologies to do that without putting that third node on every single site, and that’s where we’ve architected our software-defined storage solutions. They’ve been architected for this environment, for the edge, in that we designed it so you don’t need that third node on site. You can have the third node in the cloud, you can have the third node in your data center, and the way that it basically works is it’s a simple little VM that acts as the third node for lots of sites, up to a thousand sites for one little VM. So that acts as a node for quorum for up to a thousand clusters. And that’s simply how we do it, and it is pretty amazing what we hear from customers because, you know, when they hear about our solution they’re blown away: oh, so I can do this with two. It’s reliable. I do get the real HA with proper quorum, I just don’t have to invest in that third physical server. And that’s pretty much the way it works.

Alastair: I think there’s another aspect to that I really want to highlight, which is the one-to-many relationship between the central resource and the clusters. A lot of the data center solutions that have been adapted to the remote office, branch office use case have one-to-one kind of mappings, and that works just fine if you’ve got half a dozen or a dozen sites. But when you’re getting to hundreds and thousands of sites, that one-to-many mapping of a witness becomes absolutely crucial. I don’t want to have to run 10,000 witness VMs in my data center because I’ve got ten thousand sites. I’d rather only be running fifteen to twenty of them. Ideally, I’d find a cloud solution for running those so I don’t even have to have them in my data center. So there’s a whole element around the manageability for large numbers of clusters that becomes very different to what data center infrastructure and ROBO kind of use cases are used to when you come out to that vast scale of edge, and that’s where I see some of the differences between enterprise technology and cloud technology kind of start to play into this discussion of what’s the right thing for your edge.
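Editor’s note: the one-to-many point comes down to simple arithmetic. The sketch below uses the two figures mentioned in the conversation (ten thousand sites, and up to a thousand clusters per witness VM) and assumes one cluster per site. The helper name `witnesses_needed` is hypothetical, and a real deployment would likely add headroom for redundancy and geography.

```python
def witnesses_needed(sites: int, clusters_per_witness: int) -> int:
    """Minimum number of witness VMs, assuming one cluster per site."""
    if sites < 0 or clusters_per_witness < 1:
        raise ValueError("sites must be >= 0 and clusters_per_witness >= 1")
    # Integer ceiling division: a partly filled witness still counts as one VM.
    return (sites + clusters_per_witness - 1) // clusters_per_witness


print(witnesses_needed(10_000, 1))      # one-to-one mapping: 10000
print(witnesses_needed(10_000, 1_000))  # one-to-many mapping: 10
```

The fifteen to twenty that Alastair mentions would sit comfortably above this minimum, leaving spare capacity.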

Bruce: Yeah, and you know, I completely agree, Al. If you think about it, the way that systems have been designed for years and years and years has been for data center and cloud. That’s where the money’s been, that’s where vendors go “let’s go get as much market share as we can selling to cloud and large data center.” And the technologies that exist are awesome and they work great, but they’re dependent on either a third node, or some of them do have witnesses, but these witnesses are in the data path. And that’s the technical thing I wanted to mention here: the core of the reason why we’re different is that we’re proud to say that we don’t do network RAID or erasure coding, however you want to… you know, there’s different ways of doing it, but these data center class, cloud class storage and HCI solutions, they scale across hundreds of nodes, so they have this complicated erasure coding that works for the data center. Once you try to do that at the edge and then do a remote witness, you can, but that remote witness is still in the data path. So now you’re sending all of your local IOs, all of your credit card transactions, all of the information from the wind farm, whatever it is, it’s not just staying local, it’s also having to go to the cloud and back. So that’s just not cost inducing, I don’t know if that’s the right word, cost enabling either, right? So with the way that we do things, it’s a simple synchronous mirror locally, so the data is really fast, really local, and then the remote witness is just a heartbeat, right? We’re just checking: are you alive? Are you alive? Are you alive? And if we don’t get an answer, then we know how to coordinate things from the cloud. Simply put, that’s how we do it.
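Editor’s note: the distinction Bruce draws, a witness that only sees heartbeats and never sits in the data path, can be sketched as follows. This is a hypothetical illustration and not StorMagic’s actual protocol: the `Witness` class, the five-second timeout, and the tie-break rule (lowest node name wins when both nodes are alive but cannot see each other) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Witness:
    """Tracks liveness only; no application data ever passes through it."""
    timeout_s: float = 5.0  # assumed value, for illustration
    last_seen: dict[tuple[str, str], float] = field(default_factory=dict)

    def heartbeat(self, cluster: str, node: str, now: float) -> None:
        """Record an 'I am alive' message from one node of one cluster."""
        self.last_seen[(cluster, node)] = now

    def is_alive(self, cluster: str, node: str, now: float) -> bool:
        seen = self.last_seen.get((cluster, node))
        return seen is not None and now - seen <= self.timeout_s

    def may_continue(self, cluster: str, node: str, peer: str, now: float) -> bool:
        """Asked by a node that has lost contact with its mirror partner."""
        if not self.is_alive(cluster, node, now):
            return False
        if not self.is_alive(cluster, peer, now):
            return True  # partner is really gone: the survivor carries on
        # Both alive but partitioned: a deterministic tie-break avoids split brain.
        return node < peer


witness = Witness()
witness.heartbeat("store-1", "a", now=100.0)
witness.heartbeat("store-1", "b", now=100.0)
witness.heartbeat("store-1", "a", now=110.0)  # node b has gone silent
print(witness.may_continue("store-1", "a", "b", now=110.0))  # True
```

Because the witness stores only a timestamp per node, one small VM can plausibly serve many clusters, which fits the one-to-many arrangement discussed earlier.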

Stephen: There’s another aspect to this too, and I think that, again, a lot of solutions that are kind of brought from the data center to the edge are kind of over… they’re over engineered. There’s a lot more hardware in there. You know, my first edge environment that I was exposed to was a retailer, and they had a Fibre Channel SAN with a Fibre Channel switch with a multi-controller storage array, and they put these in retail stores. And it was, let’s say, fairly expensive. As you’ve heard if you’ve been listening to Utilizing Edge this whole season and if you tuned in for Edge Field Day, a lot of companies are, well, let’s say they’re not doing that. What they’re doing instead is they’re using software, what’s called software defined, basically software running on commodity hardware, or as you heard Bruce just mention, hyperconverged. The idea is that instead of having specialized hardware to do things like controlling storage devices, you have software that does it on standard PC stuff. We’ve talked about our Intel NUCs quite a lot, but of course there’s a lot of other x86 type and Arm type PC stuff that’s deployed out there. And in all these cases, essentially the result is that you have an inexpensive piece of hardware that you’re deploying out there, and in some ways this makes it easier to deploy redundant hardware, because essentially if you’re deploying a redundant device that costs fifty thousand dollars, well, then that’s a big expense. But if it’s a $500 device, then it’s a lot less of an expense. Now it’s still an expense because you may have to buy a thousand of them, but it’s not quite so bad, and I think that that’s another aspect here. Most of the products that we’re looking at here are either software or incredibly inexpensive hardware.
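Editor’s note: the arithmetic behind Stephen’s comparison is easy to work through. The sketch below uses only the figures he mentions (a $50,000 device, a $500 device, a thousand sites); the helper name `fleet_cost` is hypothetical, and power, shipping, and support costs are deliberately left out.

```python
def fleet_cost(unit_cost_usd: int, sites: int, extra_devices_per_site: int = 1) -> int:
    """Hardware cost of adding redundant devices at every site."""
    return unit_cost_usd * sites * extra_devices_per_site


print(f"${fleet_cost(50_000, 1_000):,}")  # $50,000,000
print(f"${fleet_cost(500, 1_000):,}")     # $500,000
```

Either way the per-site figure is multiplied by the number of sites, which is why small savings per site matter so much at the edge.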

Bruce: Yeah, I completely agree, and I would say that, you know, this may not be a unanimous agreement across the table here, but I would say for edge use cases, and I’m thinking small sites, this isn’t hundreds, this is a small site, they’ve got ten or twenty VMs they need to run, the days of the three-tier architecture are gone. Like, they’re not going to buy a couple of servers and Fibre Channel and an external SAN. I mean, believe it or not, we still see customers running in that way, and when we show them, guess what? You can take a couple of servers these days, you know, I’m not an investor either way, but I’m hearing AMD’s doing quite a good job with performance these days, but it doesn’t matter, Intel or AMD. Take two servers, super high powered, you can get plenty of storage in them. You don’t need the expense of that third thing, that SAN, to manage, to pay for, and by the way, it’s still a single point of failure, right? You still have three. Yeah, you have two servers and you have this thing, it can still fail, and then, you know, both servers are gone, right? So that’s the long-winded way to say that hyperconverged, what you’re talking about, software defined, two servers, use the internal storage, that is here today. It’s not everywhere at the edge, but it’s definitely heading in that direction.

Alastair: So Bruce, you talked about Edge sites that have half a dozen, a dozen virtual machines. What do you see in those virtual machines because we hear a lot of noise from the everything’s containers in the Kubernetes part of the world but I suspect your experience of what a lot of customers are running at the edge is not all the latest and greatest because although the future’s here it’s not evenly distributed. Are you seeing customers that just want to run containers, that are looking at highly available solutions or are you seeing a much, what I’m expecting is, a much more mixed population of applications at the edge?

Bruce: Yeah, that’s a really good question, Al. So I’ll say this: primarily what we still see is it’s a virtualization world at the edge. Hypervisors are everywhere at the edge. It’s mostly VMware. Microsoft is doing a great job, but it’s some VMware, it’s some Microsoft, and then KVM is making a play, right? There are some users that are technical enough, happy with open source, and they’ll go with a Linux KVM build out there. But containers are not mainstream at the edge yet. We definitely are talking to a lot of customers about it, because as you both know, containers are very popular in the cloud and in the data center. So guess what, software developers, you know, they don’t think about the infrastructure at the edge. All they know is the next retail application is going to be a container, right? So the IT folks, the infrastructure team that’s building the next generation edge, they know that they have to be ready for containers, and in fact we do have customers doing it today. Typically the way we see it running today is they’re simply running containers inside of a VM, and that’s a perfectly fine way to do it. The purists will say it’s not the best way and long term everything should be containers so you don’t have the extra layer of the hypervisor. We’ll get there eventually, but for now, and I would say at least the next five years, you’re going to see a mix of both.

Stephen: We’re definitely going to see more and more containers, I think. I mean, anyone who has actually experienced this, I mean, from my perspective, whether it’s on-prem, cloud, or edge, I don’t want to run things not in containers, because of the benefits that containers give you in terms of being able to create the exact correct software environment for whatever the application is, and it’s known good. I’m looking at you, Python, I’m looking at you. You know, everything that I run, I convert it into a container, and I’m seeing the same at the edge with Kubernetes. I mean, we talked with Brian and with some of the other companies about this as well. You know, Kubernetes gives us a common language, a common concept for deploying applications whether they’re on-prem, in the cloud, or at the edge. So Kubernetes is going to be at the edge, but it’s not going to be used at the edge the way it’s used in the cloud. It’s going to be used at the edge as more of a deployment and a language for applications, in my opinion. You think so, Al?

Alastair: Yeah, absolutely. So fundamentally I see Kubernetes, or containerization, not necessarily Kubernetes, I have mixed feelings about the overhead of running Kubernetes at each site, but definitely containers are the mechanism for software distribution. I think that’s the thing that Docker got really right with containers. And now as I’m building applications I’m choosing to package them as Docker containers, and I find that so much easier than having to install dependencies on every server where my application needs to run. So I think, yeah, increasingly as new applications are being built, they’re being built in containers. But returning to Bruce’s point, there is always a really long tail of applications not built for containers, and until there’s a mass extinction event that cuts off that long tail, we’re going to see them. We may see a decrease as the cost for running them continues to increase and containers become more cost effective, and I think I’d love to see more of these edge platforms doing more integrated containers. Last time I talked with Bruce was for the GigaOm report on hyper-converged infrastructure, and they were in the radar for edge hyper-converged infrastructure, and there was sort of a separation between VM focused and container focused application vendors in there, and it was an interesting thought about which is best for which use case.

Bruce: Yeah, the other thing I’d say about containers, if you take this another step: we talked about container usage for cloud and data center, and it’s going to make its way to the edge eventually. Think about this, though: the edge isn’t everything. Obviously there’s another group we haven’t even talked about today, which is small and medium business. They might have one site, they might have five, and they have the same pain points that we’ve been talking about today. They want high availability. They want to be able to run all of their local apps with one hundred percent uptime. They don’t have budgets, they don’t have IT people at their locations. Containers are probably a long way away for them, right, because they just don’t have the skill set to do it. They don’t have an IT team, you know, up in a data center somewhere pushing containers to them. So that’s just another example of how the container world is going to take a while to be pervasive everywhere. Particularly in the SMB world, that’s probably going to be even longer, I would think.

Stephen: Yeah, and particularly in the edge world, I’ll point out too, because that’s the thing: as long as you need one virtual machine, then you need a hypervisor at the edge, and there are so many applications, just like we saw in the VMware space in the data center. One of the superpowers that VMware brought, and now modern hyperconverged and other virtualization platforms bring, is the ability to basically take that old thing and run it on this new thing. And that’s even more true in the edge, where you may have an application that not only is not containerized, it’s not really intended to be run on a virtual machine, but you can run it on a virtual machine because it’s none the wiser. And so I feel like that’s actually a reason that we’re going to need virtualization at the edge even though we’re using containerization for more modern apps.

Bruce: Yep, completely agree, completely agree. Yeah, it’s one thing that you mentioned, something just jumped into my head about using existing infrastructure, and that’s another thing that we see. Edge, I’ll just say it, edge is cheap, right? They can’t afford an extra five hundred bucks, let alone two thousand dollars per site, because it adds up fast. So a lot of times when they’re making this HA decision, “can I afford it? Can I afford it?” The thought process has always been, okay, I got a single server, and I can think of a lot of customers that we’ve had this conversation with: got a single server, can I afford to go to three? That’s a big jump. Can I even afford to go to two? That’s a big jump. You know, one of the things that we do, because of the way we do things, synchronous mirror, it’s not erasure coding, is we’re allowing customers to cluster different types of servers, different brands, different models, different ages. It doesn’t matter, as long as it can run the same version of the hypervisor. So there’s another example of keeping customer costs down: they can leave their server in production, roll a new one in, cluster them together, and away they go. They’ve added HA for the cost of one server instead of two or three. So that’s another little benefit of going with a two node HA solution at the edge.

Alastair: So you might see that fit with budget cycles where one of the two nodes gets replaced every two years or three years, right, and you can just do this rolling thing rather than having to do the whole bulk update. And of course you’re then doing it as a failover between two nodes: you know, fail over from the oldest node to the newest node, retire the old one, put in a new node. And that could be very cost effective for spreading that cost over the years, particularly here in New Zealand where we like a low-cost solution, and so we’ll retain servers for five to seven years, particularly in the provinces where I am.

Stephen: Yeah, we may actually see a situation, Al, to that point, where companies are basically going to use the stuff until it falls over and then dispose of it. In a situation like that, you’re almost guaranteed not to have the same hardware on both nodes, because it’s almost impossible to imagine a situation where five years from now you’re going to be able to buy the same node. You know, something’s going to be different.

Alastair: Now you are heading towards the cloud economics of run it until it fails and replace it when it fails, or maybe a little ahead, because you probably want to have continuous availability through that failure.

Stephen: Are you seeing that though Bruce? That kind of situation where companies are basically running it until it falls over and fails over and then replacing it?

Bruce: Yeah, we do see some of it. I would say it’s more on the smaller company side. Our large customers that have thousands of sites, they have pretty tight protocols, but I would say four years, five years is pretty common in those environments even. But I would say yeah, we do have customers that get every day that they can out of a server. So you see some out there for six or seven, but I’d say for the big ones you’re not going to see it go past five. That’s what I’ve seen.

Alastair: As we’re heading towards the close, I want to really bring up the idea that it continues to be: build the right solution for the problems that you have. And so this idea of having a highly available solution as being really cost effective just adds another tool you can use to solve the problems for the organization with the right piece of technology. Sometimes the right technology is a single node that will fail and we’ll have an extended outage, because that fits the business risk. Sometimes that risk’s unacceptable and so you need a multi-node cluster, maybe two nodes. Sometimes you need something bigger. The edge isn’t just a single solution for a single problem, and I think knowing the possible solutions is really important. Bruce, have you got anything to bring this one home as well?

Bruce: I would just say, first of all, thanks for including me in this conversation. As you can tell, I’m quite passionate about this topic, particularly for edge. I see nothing but growth of edge computing. I would even go so far as to say that we’re starting to see cloud applications and cloud decisions coming back to edge because they’re realizing the cloud is very, very expensive. So I see this problem getting worse. It’s getting bigger, there’s going to be more customers, more opportunity to think about how do I run applications locally at the least cost with high availability. So I love talking about this stuff, I’m glad you brought me on here. We will be hopefully continuing this conversation at Edge Field Day coming up in July, which we’re excited about. But yeah, it’s been fun, so thanks for having me on.

Stephen: Yeah, well, it’s great to have you, and I have to say too, I think it’s interesting, you know, in the discussions that we’ve been having here on Utilizing Edge as well as Edge Field Day, and of course the Roundtable Discussions that we’ve done as well on this topic, it occurs to me that edge is sort of a forcing function for us to reevaluate some of the decisions we’ve made, just like cloud was. Just like the intense focus on automation and everything as code and the cloud forced us to reevaluate a lot of the things that we were doing in the data center, edge is forcing us to reevaluate a lot of the decisions that we might make, and I think that we’re coming up to a much more interesting solution. So far, a lot of technologies are kind of trickling into the edge from cloud and from on-prem data center. But I think over time some of the lessons that we’re going to be learning here are going to be trickling right back out, and we may be starting to say, you know, maybe we shouldn’t do this or do that. Maybe we shouldn’t invest in this high-end piece of hardware, we can do it in software. I mean, that’s kind of one of the messages of cloud as well, so it really is an interesting conversation and I really appreciate you hopping in here, Bruce. Before we go, where can we connect with you and continue this conversation?

Bruce: Well, I love to chat with you, Stephen and Al, and anyone in the industry. We will be attending Edge Field Day, where it’s going to be a nice open format, broadcast live, so we hope that we can take questions from the attendees in the room plus anyone else listening. So let’s have some fun with Edge Field Day in a couple weeks, six weeks, something like that.

Stephen: Al, how about you? Where can people connect with you?

Alastair: You can find me online as DemitasseNZ on Twitter and also at demitasse.co.nz on the intertubes. I’m going to be at VMware Explore in Las Vegas later in the year, and if you’re in Europe you can catch me at VMware Explore in Europe much later in the year too.

Stephen: And as for me, of course, you can find me here at Gestalt IT. You’ll find me at Edge Field Day. Also, of course, we do have our Tuesday On-Premise IT podcast as well as our Wednesday News Rundown. So thank you for tuning in for Utilizing Edge, part of the Utilizing Tech podcast series. If you enjoyed this discussion, please give us a subscription. You’ll find us in all your favorite podcast applications and maybe check out GestaltIT.com, your home for IT coverage from across the Enterprise. If you are interested in show notes, and we’ve been doing a lot more expansive show notes for this season, head over to UtilizingTech.com. You’ll also find us on Twitter and Mastodon at Utilizing Tech. Thanks for listening and we will see you next week.