Destiny Back End Stability

Over the past few months, Bungie put out a few posts on Destiny detailing what they were doing on the back end, and I finally got a chance to read through them.

THIS WEEK AT BUNGIE – 05/18/2023
THIS WEEK IN DESTINY – 06/29/2023

I just want to take a moment to applaud more visibility into back end work for large scale games. The work here is often invisible, and there is limited detail on how a lot of games work.

The TLDR on the first post is that they are upgrading their services and hardware to new tech. They had been rolling out new releases as background updates, but these have put the game into a bad state requiring total downtime. They have moved to a downtime model for all releases.

There isn’t much detail at all in this first post, but it’s at least a glimpse into something.

The second post has a few things:

There is a service called Claims that handles a lot of gameplay traffic.
Claims has some connectivity issues and recovery from that isn’t working correctly.
When it doesn’t work correctly it can reach a death spiral state and the whole game has to be turned off.
They are continuing to improve their deployment and incident response process.
They will improve their logging and alerting for Claims.
Fix Claims auto recovery and kill dead/old messages in the backlog when this happens.
Add some chaos testing and more logging in non-Claims services.
Add more Claims unit tests.
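
To make the "kill dead/old messages in the backlog" item a bit more concrete, here is a minimal sketch of what age-based backlog pruning can look like in general. This is purely an illustration: Bungie has not described how Claims stores or processes its backlog, so the `Backlog` class, the `max_age_seconds` setting, and the message names are all hypothetical and not taken from either post.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    enqueued_at: float  # seconds, on the same clock as the `now` values below


@dataclass
class Backlog:
    """Hypothetical backlog that can discard messages that have waited too long."""
    max_age_seconds: float
    queue: deque = field(default_factory=deque)

    def enqueue(self, payload, now=None):
        now = time.monotonic() if now is None else now
        self.queue.append(Message(payload, now))

    def drop_stale(self, now=None):
        """Discard messages older than max_age_seconds; return how many were dropped."""
        now = time.monotonic() if now is None else now
        dropped = 0
        # Messages are appended in time order, so the oldest sit at the left.
        while self.queue and now - self.queue[0].enqueued_at > self.max_age_seconds:
            self.queue.popleft()
            dropped += 1
        return dropped

    def pending(self):
        return [m.payload for m in self.queue]


# Example: after an outage, prune before resuming normal processing.
backlog = Backlog(max_age_seconds=30)
backlog.enqueue("old-request", now=0)
backlog.enqueue("recent-request", now=50)
dropped_count = backlog.drop_stale(now=60)
remaining = backlog.pending()
```

The idea is that a recovering service which works through every queued message, including ones whose callers gave up long ago, can stay behind indefinitely, which is one plausible way to end up in the kind of death spiral the post mentions. Dropping the stale ones first lets it catch up on work that still matters.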

Improvements

After reading both posts as an outside reader, I still don’t really know what is going on: what the Claims service is, what the architecture looks like, which challenges they have solved, and which challenges remain. I think there could be a few improvements to the communication on this:

Describe the high level architecture a bit more and add an architecture diagram. I don’t really understand where the Claims service fits in the bigger picture, what kind of service it is, what kind of connections we are talking about, or why those connections are even an issue. I’d like a little more background detail to be able to understand.
It’s good to keep things at a fairly high level, but this seems watered down a bit too much. Maybe that is intentional, as the audience may not be that technical. A lot is in generalities or uses non-technical terms such as “bogged down”. Does that mean CPU constrained?
Explain the why. Both posts talk about the problems, but little can be gathered about why these issues are important. Expand on the opportunities to make things interesting. Take “background update has put our services into a bad state” as an example. There’s probably something much more interesting there to be shared, and something I can relate to. It is really hard for me to understand the importance of “a bad state” without knowing what that state was.
Highlight the challenges you have and the ones you solved and how you solved them, and describe the challenges you need to solve. This is the interesting meat of these. I bet there was some really cool work that was done that really isn’t highlighted here and I bet there were clever or interesting fixes to things. There were probably many surprises and a-ha moments but little is described here.
The “action items” either don’t relate to what has been described or are too generic. “Continue to improve things”, followed by a bolded note that these will continue forever, doesn’t really tell me anything interesting. What is the interesting stuff you are actually doing here? What is not at the bar you are expecting?
Adding logging and alerting takes months? There has got to be some interesting nugget there on why this is measured in weeks/months rather than hours/days.
There’s a lot of language on being careful, on how changes can make things worse, and on how things can’t happen overnight. This makes sense, but meanwhile things still seem to be breaking even without changes. It feels like this language comes from the wrong mindset overall. “These changes are designed to minimize the risk of further degrading stability, while helping us to confirm the effectiveness of fixes further out on the roadmap.” I don’t understand why this is a bullet point at all; it’s a note on another item, not an action item itself. The “we have to be super cautious on changes” language just feels like ass covering, and it makes me more worried. Why do players need to worry about how careful you need to be? Are things really that brittle?
A post noting you are continually improving the incident response process, followed by a long outage 3 days later with no communication kind of takes the wind out of the sails on some of the follow ups.

Despite the room for improvement, I do have to tip my hat to the folks at Bungie for putting this all out there. I know this stuff can be hard to get into public blog posts, especially with folks like me critiquing 🙂, and revealing inner workings and details isn’t always appreciated. They didn’t have to share anything at all, but I think moving towards a more transparent future is better for everyone. Running back end game services at scale is actually really difficult.

For someone who is less technical on back end systems, maybe this post is great, as it can probably be understood by a larger portion of the audience. Maybe this post isn’t even meant for me.

Pete Shima