This is part 1 of a series on Reliability in Games (RIG), where we explore why games have so many reliability problems. The focus is on large online games rather than single-player or board games.
This post covers only the high-level groupings of problems.
Games are Complicated and Built Under Pressure
Game development isn’t exactly a “clean” process. In fact, many describe it as messy or chaotic, and the larger the game, the more complexity there is. As a game is built, it will likely change significantly over time. This isn’t completely different from a software product that isn’t a game, but I find the swings in game development to be a lot larger than in other software. If the game you make isn’t fun enough (using the term loosely), it can be back to the drawing board and starting all over, sometimes multiple times. Sometimes large portions of work are thrown out, sometimes even years’ worth of work.
Game development is also typically quite long. Even smaller indie games can take multiple years, and it’s not uncommon for AAA games to spend 5-10 years in development with tens or hundreds of millions of dollars invested. With so much changing during the development window, the tech underneath it also has continually changing requirements. Maybe first it was single player, then it was multiplayer; or maybe it had one type of loadout or gameplay element, and a month later something completely different.
Extending a game’s launch date typically doesn’t reduce the pressure to launch, it only increases it; all available time will be consumed in the same ways the rest of the game was built. More time doesn’t often mean there is suddenly enough time to complete what is needed, it just means more ideas, more changes, and more pressure. This is similar to the Law of Stretched Systems. Games will go into soft or hard lock, where no additional changes can be made, which combats some of this. I think it is only near hard lock that someone can really know whether something is going to ship.
This pressure creates an interesting environment for back-end teams. Requirements for the back end are never set and final; they are always changing, sometimes with major changes that would justify a full rewrite. But there isn’t time for that, and you can’t rewrite the back end every three months either.
It’s this pressure and complexity that tends to bleed into everything that is done, often resulting in cut corners, hacked-together working models, or an amount of debt that ultimately slows things down rather than speeding them up.
Because of the complexity and pressure, communication carries a high overhead, and the lack of it can leave teams misaligned, often building components that don’t fit well together. But the game has to ship.
Backend as a Utility
There’s typically a large skill gap between in-game and back-end work. You could think of this as in-engine development versus out-of-engine development, although that oversimplifies it a bit. These areas require very different skill sets, and the development itself can differ wildly.
Occasionally you can find someone who can do both well, but they are pretty rare, and those who can typically aren’t deep experts in either. This leads to not only a skill gap but a language/communication gap between gameplay and back end, where each side sees the other as a black box: gameplay folks don’t understand the back end, and back-end folks don’t understand in-engine work.
In smaller dev teams there might not even be a dedicated online or back-end person, and the same goes for infrastructure. If you have to choose between investing in gameplay to make your game more fun and investing in the back end, you will probably choose gameplay, art, and so on. The back end gets treated as a utility – something we know is needed but that doesn’t offer the same game value as art, gameplay, or graphics. That is, until everyone realizes the game can’t run without the back end, and art, gameplay, and graphics don’t matter if no one can log in.
This is a fundamental difference between games and SaaS products. In games the back end is a utility providing power and water; in a SaaS product the back end is often a lot closer to being the product itself.
This is one of the major high-level issues in game reliability: it is always at odds with gameplay. The expertise isn’t always there, people are often wearing multiple hats, and issues in the core gameplay loop are prioritized over the back end.
This also often creates another issue – a big lack of investment in the back end, whether in staffing, money, or expertise.
Game companies that do well in this space have typically figured out how to manage this, or have incredibly good folks running things who have the expertise or are particularly good at navigating it.
Popularity and Ability to Scale
Some games do see unexpected popularity at launch. This isn’t just a few more players; it could be 10x, 100x, or 1000x more than expected. These cases are more the edge case than the standard, though: a lot of games might plan for a million players at launch, get significantly fewer than that, and still struggle. Why is that?
You can often see this play out in comments on Reddit and elsewhere, especially for a highly anticipated game launch.
“They knew this was going to be a big release, why didn’t they plan for this?”
In reality, they probably did, but they either ran out of time or ran into one of many unexpected surprises.
“Add more servers!”
Oftentimes just adding more servers would actually make things worse, even though it’s a common response when folks can’t log in. The actual issue is often not raw dedicated server count at all.
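As a rough, hypothetical sketch of why that is: each additional game server usually brings its own connection pool pointed at the same shared database or login service, so the component that was already struggling gets hit harder. In Go, that cap lives in a couple of pool settings (the driver choice, DSN, and numbers below are placeholders, not from any particular game):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

func main() {
	// Each game-server instance opens its own pool against the *same* database.
	// Doubling instances doubles the potential connection count, so the shared
	// database is often the real bottleneck, not dedicated server count.
	db, err := sql.Open("postgres", "postgres://game:secret@db.internal/players") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}

	// Cap the pool per instance so that (instances x MaxOpenConns) stays within
	// what the database can actually handle.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(10)
	db.SetConnMaxLifetime(5 * time.Minute)
}
```

The exact numbers don’t matter; the point is that instances × pool size has to stay within what the database can take, and “add more servers” moves that math in the wrong direction.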
The answer is simple – unexpected bottlenecks or surprises.
At launch, both the game and the back end are unoptimized in many ways. The game was built under crunch, probably the back end as well, and there simply wasn’t time for polish.
Player behavior can be another surprise. Maybe you didn’t expect players to open gift boxes for 8 hours straight, and the database load looks nothing like what you saw in testing.
There can also be a lack of expertise in scaling. How do you scale up the database if you never have before? How do you move from a single server to a sharded or replicated setup? How do you scale up in the cloud or on Kubernetes? Why is the cache not working? Where are the calls coming from?
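To make one of those questions concrete, here is a minimal, hypothetical sketch of what routing players to shards can look like, assuming player data is keyed by an account ID. The shard DSNs are made up, and the genuinely hard parts – resharding and migrating existing data – are left out entirely:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps an account ID to one of n database shards using a stable hash.
// Changing the shard count later changes the mapping, which is why resharding
// and data migration are the hard parts this sketch leaves out.
func shardFor(accountID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return h.Sum32() % numShards
}

func main() {
	// Hypothetical shard DSNs; in practice these come from config or service discovery.
	shards := []string{
		"postgres://db-shard-0.internal/players",
		"postgres://db-shard-1.internal/players",
		"postgres://db-shard-2.internal/players",
		"postgres://db-shard-3.internal/players",
	}
	id := "account-12345"
	fmt.Println("route", id, "to", shards[shardFor(id, uint32(len(shards)))])
}
```

The routing function is the easy part; answering those scaling questions for the first time during a launch is where teams get stuck.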
Cloud limits are often a gotcha as well. Many cloud providers have quotas you don’t see until you hit them, and once you do, you’re stuck until they’re raised.
And one of the most common surprises is the game client making unexpected calls, effectively DDoS’ing its own back end. When things start failing, the client may react in ways that were never tested at scale, retrying calls or re-running entire game flows, all of which hit the back end. This can be very difficult to untangle at scale.
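A hedged sketch of the client side of this: the difference between a tight retry loop and capped, jittered backoff is the difference between the client amplifying an outage and riding it out. The endpoint, attempt count, and delays below are illustrative assumptions, not from any real game:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// fetchProfile calls a hypothetical back-end endpoint with capped, jittered
// exponential backoff. A naive "retry immediately until it works" loop in the
// client multiplies load exactly when the back end is least able to handle it.
func fetchProfile(url string) (*http.Response, error) {
	const maxAttempts = 5
	backoff := 500 * time.Millisecond

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a client error not worth retrying
		}
		if resp != nil {
			resp.Body.Close()
		}

		// Full jitter: sleep a random duration up to the current backoff,
		// then double the ceiling for the next attempt.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return nil, errors.New("giving up after repeated failures")
}

func main() {
	// Placeholder URL for illustration only.
	if _, err := fetchProfile("https://backend.example.com/v1/profile"); err != nil {
		fmt.Println(err)
	}
}
```

Even this is only part of the story: millions of clients retrying anything in lockstep can still look like a DDoS from the back end’s point of view, which is why this behavior has to be tested at scale, not just on a dev machine.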
If your game is way more popular at launch than you expected, that’s a great problem to have. But if it’s less popular than you thought and there are still issues, you’ve probably hit one of these many surprises.
The Trifecta
It is this trifecta of very high-level problems above: unending complexity with rushing and cut corners, the back end treated as a utility more than a driver for the game, and the ability to manage scale even when it exceeds projections. Player communication is also a huge factor that can magnify these issues in either direction.
Back-end issues at launch can make or break a game, but then again, if the game isn’t fun or doesn’t have the right hooks, the back end may not matter anyway.
This is another interesting point –

