Move slow and break shit
GitHub Outages Since Microslop Acquisition
Submitted 1 week ago by 0x0@lemmy.zip to technology@lemmy.zip
https://lemmy.zip/pictrs/image/30b98e80-7e91-4d12-b1b7-f6223d364626.avif
Comments
DahGangalang@infosec.pub 1 week ago
Obv a gross looking chart, but I am bothered that the left hand scale is trimmed off. I expect those are 10% increments, but wouldn’t be shocked if Original was like 99.0, 98.0, 97.0, etc.
raspberriesareyummy@lemmy.world 1 week ago
Thank you! I was thinking “it can’t just be me that’s bothered”
merc@sh.itjust.works 1 week ago
I’ve worked on services with 5 nines of availability (i.e. 99.999% available, about 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.
Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.
A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
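The downtime figures above follow directly from the definition of "N nines". A minimal sketch of that arithmetic, where the function name and the 365-day year are illustrative assumptions rather than any standard:

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget in minutes for an availability target of `nines` nines."""
    # 2 nines -> 1% unavailable, 4 nines -> 0.01%, 5 nines -> 0.001%
    unavailability = 10 ** -nines
    return unavailability * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        per_year = allowed_downtime_minutes(n)
        per_quarter = allowed_downtime_minutes(n, period_days=90)
        per_day = allowed_downtime_minutes(n, period_days=1)
        print(f"{n} nines: {per_year:8.1f} min/year, "
              f"{per_quarter:7.2f} min/quarter, {per_day:6.3f} min/day")
```

Under these assumptions, two nines works out to 14.4 minutes per day, four nines to roughly 13 minutes per 90-day quarter, and five nines to a bit over 5 minutes per year.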
p03locke@lemmy.dbzer0.com 1 week ago
Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem
No, it means you don’t have outages. Ever.
Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.
It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.
merc@sh.itjust.works 1 week ago
No, it means you don’t have outages. Ever.
No, that’s infinite nines, which isn’t possible.
Five-nines is something like 7 minutes of downtime throughout the entire year. At best, you might have automated failover systems that require tiny outages. No human involvement, though, unless you’re dealing with some major breakage that would have killed the five-nines commitment that year, anyway.
Yes, you have automated failover systems. But, if something happens which causes those systems to fail over, you need to immediately investigate what happened and why. Even at four nines you have automatic failover, redundant systems, hot spares, etc. But, you accept that sometimes not everything will work as planned and you’ll need to fix something. Five nines is just that and more.
It takes a human something like 5-10 minutes just to get out of bed and figure out the situation, anyway.
Right, which is why I said that four nines is your realistic maximum if you’re going to have people on call who aren’t actually at their desks. To get better than four nines you need to have around the clock coverage with people at their desks so when a system breaks you have eyes on it in something like 30s.
Waraugh@lemmy.dbzer0.com 1 week ago
I’m used to environments where they expect five nines, get 3-4 nines, and fund for 1 nine.
HrabiaVulpes@europe.pub 1 week ago
I call bullshit on “Gamers wouldn’t put up with a two-nines system for a video game”
Elder Scrolls Online has a weekly scheduled outage for about 8h. Every Monday. Players have been complaining about it for years, but the game is still popular.
merc@sh.itjust.works 1 week ago
How often is it offline outside the maintenance windows?
Yeah, maintenance windows are annoying, but they don’t really count when describing the availability of a system. Many government systems are only available during normal business hours. That means they’re offline for 16 hours per day. What matters is how available they are when they’re supposed to be working.
For Elder Scrolls, two nines would mean that the game was allowed to be down for more than an hour a week outside of those maintenance windows. Or, if measured by quarters, which is more typical, the game would still have those maintenance windows, but, in addition, it might be down for a full day once per quarter.
Basically, the 8 hour window every Monday is a trade-off so that the rest of the week is uninterrupted. They probably manage three nines the rest of the week by shifting any serious maintenance into the weekly downtime.
And, as for the game being “still popular”, one site says that there are currently 7199 players in Elder Scrolls but more than 161k in World of Warcraft. It could be that part of the reason that World of Warcraft is more popular is that it doesn’t have 8 hour maintenance windows every week, but it does often have 2+ hour windows. The number of players who are willing to put up with 8 hour maintenance windows every week seems pretty small.
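The point about maintenance windows not counting against availability can be sketched in a few lines. This assumes the 8-hour weekly window mentioned above; the function names are hypothetical, and real SLAs define their own measurement periods and exclusions:

```python
HOURS_PER_WEEK = 7 * 24


def downtime_budget_hours(target: float,
                          maintenance_hours: float = 8.0,
                          period_hours: float = HOURS_PER_WEEK) -> float:
    """Unplanned downtime allowed while still meeting `target` availability.

    Only the hours the service is supposed to be up count toward the budget.
    """
    scheduled_hours = period_hours - maintenance_hours
    return (1.0 - target) * scheduled_hours


def availability_outside_maintenance(unplanned_down_hours: float,
                                     maintenance_hours: float = 8.0,
                                     period_hours: float = HOURS_PER_WEEK) -> float:
    """Availability measured only over the scheduled (non-maintenance) hours."""
    scheduled_hours = period_hours - maintenance_hours
    return 1.0 - unplanned_down_hours / scheduled_hours


if __name__ == "__main__":
    for label, target in (("two nines", 0.99), ("three nines", 0.999)):
        weekly = downtime_budget_hours(target)
        # Assumption: a quarter is treated as 13 weeks.
        quarterly = downtime_budget_hours(target,
                                          maintenance_hours=13 * 8.0,
                                          period_hours=13 * HOURS_PER_WEEK)
        print(f"{label}: {weekly:.2f} h/week, {quarterly:.1f} h/quarter")
```

With an 8-hour window, two nines leaves 1.6 hours of unplanned downtime per week, or about 21 hours per 13-week quarter, which matches the "more than an hour a week" and "a full day once per quarter" estimates.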
raspberriesareyummy@lemmy.world 1 week ago
Nothing to make a point like snipping off the y-axis scaling.
I hate Microslop like any person with > 2 brain cells, but that graph is useless - all visible y-entries end in a 0 - might as well be 99.990, 99.980, 99.970, …
Jordan117@lemmy.world 1 week ago
It’s just Xitter’s image viewer cropping it automatically; the original upload has it.
prenatal_confusion@feddit.org 1 week ago
It is still bad practice to select a narrow window of an axis like this, so that a difference looks massive relative to what is shown but isn’t that significant once you can see it in relation to the whole.
Graph 101
k0e3@lemmy.ca 1 week ago
Surely they could just Copilot their way out of this mess lmao
tja@sh.itjust.works 1 week ago
They are trying ^^
JordanZ@lemmy.world 1 week ago
My understanding was GitHub was primarily hosted on AWS when Microsoft acquired it. I’m assuming a lot of that instability has been caused by moving it over to Azure in bits and pieces.
BleatingZombie@lemmy.world 1 week ago
I think you’re right, which is funny because now I don’t trust Azure either
paris@lemmy.blahaj.zone 1 week ago
damrnelson.github.io/github-historical-uptime/
A lot of this is GitHub Actions alone, but a lot of it isn’t. I also don’t know how well GitHub tracked outages before the Microsoft acquisition. It’s entirely possible the graph looks so bad because they only took outage tracking seriously after being acquired. I don’t know.
Further related discussion on Hacker News: news.ycombinator.com/item?id=47925317
possiblylinux127@lemmy.zip 1 week ago
It is impressive how badly Microsoft is fumbling the bag
0x0@lemmy.zip 1 week ago
Safeguard@beehaw.org 1 week ago
Is that real? Because that… Makes it real clear…
MBech@feddit.dk 1 week ago
How does this correspond with growth? I imagine having 100% uptime is much harder the bigger a platform is, so did Github grow a lot in the same period?
I’m not questioning whether or not Microsoft has issues, I just find it relevant whether or not they very suddenly saw a 2000% increase in server usage or something.
jatone@lemmy.dbzer0.com 1 week ago
I imagine having 100% uptime is much harder the bigger a platform is, so did Github grow a lot in the same period?
it’s not. there are scale points where once you hit a critical number you need to re-architect your backend: 1k, 10k, 1mil, etc. usually these vary based on your app, but they’re usually exponential, so once you hit the higher levels it takes much longer to reach the next level.
on top of that, by the higher tiers you usually have proper backpressure and signals being sent to the frontend systems to dynamically manage the load generated. so suddenly uptime is much easier.
when you see large repeated failures like this the cause is almost always corporate causing issues.
- reducing engineering budget.
- not listening to engineering department on product decisions. (see the recent product manager AI generated commit that got merged and caused a mild uproar of 'co-authored by copilot')
- rushing nonsense out before it’s ready.
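For anyone unfamiliar with the backpressure mentioned above, here is a toy sketch of the idea: a bounded queue that rejects work when full, so callers get an explicit signal to slow down instead of the backend silently falling over. The class and method names are made up for illustration, and this is not a description of how GitHub or any specific platform does it.

```python
import queue


class BoundedWorkQueue:
    """Toy backpressure: accept work only while there is capacity."""

    def __init__(self, capacity: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=capacity)

    def submit(self, request: object) -> bool:
        """Return True if accepted, False if the caller should back off."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            # In a real service this might surface as a "try again later"
            # response so the frontend can retry with a delay.
            return False

    def load(self) -> float:
        """Fraction of capacity in use; a frontend could throttle on this."""
        return self._q.qsize() / self._q.maxsize

    def take(self) -> object:
        """Remove one item for processing, freeing capacity."""
        return self._q.get_nowait()


if __name__ == "__main__":
    work = BoundedWorkQueue(capacity=2)
    for name in ("a", "b", "c"):
        accepted = work.submit(name)
        print(f"request {name}: accepted={accepted}, load={work.load():.0%}")
```

The design choice is that overload becomes a fast, cheap rejection at the edge rather than a slow failure deep in the backend.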
bagsy@lemmy.world 1 week ago
But the payment processing service has 9 nines of uptime…
SocialistVibes01@lemmy.ml 1 week ago
How many of those outages were due to AI training?
ServantOfRa@lemmy.blahaj.zone 1 week ago
Remember when mSlop bought HotMail? Same shit, different decade.
bitjunkie@lemmy.world 1 week ago
That’s just fucking disgraceful.
possiblylinux127@lemmy.zip 1 week ago
You should see what they are doing to Minecraft
bitjunkie@lemmy.world 1 week ago
Unfortunately I have, my kid is absolutely fucking obsessed with it
9point6@lemmy.world 1 week ago
Image
Damage@feddit.it 1 week ago
I see one nine
huquad@lemmy.ml 1 week ago
Microsoft never promised where the nines would be
caseyweederman@lemmy.ca 1 week ago
I see six
AnUnusualRelic@lemmy.world 1 week ago
Lies! 89.98% has two nines in it!
raspberriesareyummy@lemmy.world 1 week ago
Thank you, that is much more helpful than OP graph