Clive Johnson CIG talks about the current 30k issues.

Aramsolari

Vice Admiral
Donor
May 9, 2019
638
2,070
500
RSI Handle
Aramsolari
All very interesting stuff. TLDR: "It's not worth it for us to focus on it right now, we'll wait till feature/content is done".

I see what he's getting at but it's so frustrating at times. Comments?


Clive Johnson, CIG
Today at 20:59


https://robertsspaceindustries.com/spectrum/community/SC/forum/50259/thread/server-issues-30k/3034939
There's quite a lot to answer in your question. I'll try to cover the major points but apologies if you feel I have left out something important.

When will the servers be stable? At the end of beta. Why not before? Because we need to finish making the rest of the game first.

When a game is being worked on as a closed alpha the focus is on feature and content development. Stability and bug fixing take a back seat and only issues that would hinder further development are addressed. This may sound an unprofessional way of doing things but the idea behind it is to try out ideas as quickly and cheaply as possible. That allows the developers to find out which parts of the game's design work and which need revisiting. There is no point spending time bug fixing a feature that may change or even be completely pulled from the game at any time. Development will continue with the game in this semi-broken state at least until all features and content have been locked down. The game then enters the beta phase of development where bug fixing, optimisation, balance and polish are at the forefront. Ideally no feature work happens during beta but there are almost always some last-minute changes pushed in.

SC of course is open development, so while the focus in alpha is still on trying out different ideas, we need the game to be stable and functional enough for backers to test it and give their feedback. The key word there is "enough" which of course does not mean perfect. It is important that we strike the right balance between bug fixing and further development: too much bug fixing and development slows, too little and we don't get enough feedback or the bugs hinder further development.

Has CIG got the right balance between bug fixing and development?

The problem with determining whether a build is stable "enough" is that we can only look at how stability affects the playerbase as a whole, i.e. the average. There will therefore be some lucky backers who experience far fewer crashes or other problems than average while there will be some poor souls for whom the build appears a bug-ridden crash fest. Ask the lucky players if we have the balance right and they might say no, the game is stable enough and we need to focus more on expanding the game. Ask the unlucky ones and they might still say no but want us to stop working on new features until all the current bugs are fixed. Very few people are going to say yes.

As a rule of thumb, before releasing a patch to Live, we try to make sure it is at least as stable as the previous Live release. Some patches may be more or less stable for particular play styles than previous ones but, overall, stability should get better from patch to patch. Of course sometimes things don't work out how we'd like and average stability will end up not as good as it was on the previous version.

Why aren't we fixing the server crashes causing 30000 disconnection errors?

We are. It only seems like we aren't because, regardless of the cause, all server crashes result in clients getting the same 30000 disconnect. This disconnect happens because once the server has crashed the clients suddenly stop receiving network traffic from it. They then wait for 30 seconds to see if traffic will resume (in case the server was stuck on a temporary stall or there was a short network outage) before giving up, returning to the front end menus and showing the disconnection error. During these 30 seconds clients will see doors fail to open as well as AI, terminals and other entities become non-responsive. Backers sometimes mistake these symptoms for a sign that the server is about to crash, and you might see in-game chat saying a server crash is incoming, but the truth is that the server is already dead. It is an ex-server. It has ceased to be. If we hadn't nailed it to its perch it would be pushing up the daisies. (In-game chat only continues to work because that is handled by a different server.)
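To make the timeout concrete, here is a minimal sketch in Python of a client-side watchdog of the kind described above. It is not CIG's code: the class name `ConnectionWatchdog`, its methods and the injectable clock are all assumptions made for illustration. Only the 30-second wait and the 30000 error code come from the post.

```python
import time

DISCONNECT_TIMEOUT_S = 30.0   # grace period mentioned in the post
ERROR_SERVER_TIMEOUT = 30000  # error code shown to the player


class ConnectionWatchdog:
    """Hypothetical helper that tracks when the last server packet arrived."""

    def __init__(self, timeout_s=DISCONNECT_TIMEOUT_S, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_packet_at = clock()

    def on_packet(self):
        """Call whenever any traffic arrives from the game server."""
        self._last_packet_at = self._clock()

    def check(self):
        """Return None while connected, or the error code once timed out."""
        silent_for = self._clock() - self._last_packet_at
        if silent_for >= self._timeout_s:
            return ERROR_SERVER_TIMEOUT
        return None
```

Note that the watchdog cannot tell a crashed server from a stalled one or a network outage. It only sees silence, which is why every cause ends in the same error and why the game world looks frozen while the timer runs.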

When a new patch is being prepared on PTU, new builds are available for download almost daily. Once DevOps in ATX has pushed the new build up to the servers and made it available for download they then monitor the build for the first few hours, often working late to do so, looking for anything to indicate a problem that needs dealing with immediately. For the next few hours people play the game, uploading their crash reports, submitting to the Issue Council, responding to feedback forums, etc. Server crashes are all automatically recorded to a database. When the EU studios wake up, Technical QA look through the uploaded client crashes and recorded server crashes and make an initial assessment of which are the worst offenders, based on how often they happen and how soon after joining a game. Server crashes almost always go to the top of the pile, purely because they affect more people than individual client crashes. Jiras get created and passed on to Production. Production do three things here: first they send the crash Jiras to the Leads for triage, second they confirm priorities and which crashes QA should try to reproduce or otherwise assist with, third they flag any particularly bad crashes with Directors for priority calls in case additional people need to be reassigned to try and ensure a speedy resolution. Meanwhile the Leads triage the crashes making sure they go to the right Programmers on the right teams. Then the Programmers investigate the bugs, often working with QA to find as much info on the bug as possible. Most of the time Programmers can commit a fix the same day but sometimes it might take a day or two longer. In rare cases it can take a couple of weeks to track down the problem and come up with a fix.
In very rare cases the bug is a symptom of some deeper flaw that will require restructuring some system to work a different way, can't be done in time or without significant risk for the current patch, and needs to be added to a backlog to be scheduled for a future release. As ATX comes online Community and DevOps publish their reports on the previous build from information gathered over the past day. Production kick a build with all the latest fixes and meet with QA, Community and DevOps to make an assessment on whether the new build is likely to be better than the last or whether additional fixes are needed first. Production pass their recommendation on to the Executives who make a go/no-go decision on trying to push the new build to PTU that day. If yes, ATX QA and DevOps start working their way through a pre-release checklist that takes several hours to complete. When LA comes online EU Programmers may hand over any issues that were specifically for LA teams or that EU teams were working on but are unresolved and would benefit from continued investigation after EU has finished for the day. When ATX have completed the pre-release checklist, and if the build has passed, the cycle starts again.
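The initial "worst offenders" assessment can be pictured as a sort. The sketch below is only an illustration in Python: the `CrashGroup` fields and the exact ordering rule are assumptions, since the post only says that frequency and time since joining are considered and that server crashes almost always go to the top.

```python
from dataclasses import dataclass


@dataclass
class CrashGroup:
    """Hypothetical summary of all reports that share one crash signature."""
    signature: str
    is_server: bool                   # server crashes affect everyone connected
    occurrences: int                  # how often the crash was recorded
    median_seconds_after_join: float  # how soon after joining it tends to hit


def triage_order(groups):
    """Sort crash groups so the likely worst offenders come first.

    Assumed rule: server crashes first, then the most frequent,
    then the ones that strike soonest after joining a game.
    """
    return sorted(
        groups,
        key=lambda g: (not g.is_server, -g.occurrences, g.median_seconds_after_join),
    )
```

In practice the ordering is a starting point for people to review rather than a final answer, which matches the post's description of Production and the Leads confirming priorities afterwards.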

If we are fixing the crashes why do 30000 disconnections keep happening?

Between every quarterly release we change a lot of code. Some of it is completely new and some of it is merely modifications to existing code. Each change we make has a chance that it may contain bugs. We're only human and all make mistakes from time to time so each quarter there is the potential for having added a lot of new bugs. There are processes in place to reduce the chances of that happening but some always slip through. Once a bug is discovered it needs fixing. Sometimes a fix doesn't work. Sometimes it only fixes the crash in some cases but not all. Sometimes the fix itself has a bug in it that can cause other problems. One of the things we see quite a lot is that once a frequent crash is fixed one or more other crashes will start appearing more often. That happens because the crash that was just fixed was blocking the other crashes from occurring as much as they otherwise would have. As mentioned above there are also crashes that can't be fixed immediately and need to wait until there is more time to fix them properly or until some other planned work is completed. Eventually though the majority of the most frequent crashes get fixed. What we are then left with are the really rare crashes, the ones that only occur once every month and that we don't yet have enough information to fix or reproduce. One of these rare bugs isn't going to make much difference on its own but a hundred such bugs would be enough for at least three server crashes a day.
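The closing figure is simple arithmetic, worked through below in Python. The assumptions are mine, not the post's: a month is taken as 30 days, and each rare bug is treated as independent and as triggering once per month across the same set of servers.

```python
def expected_crashes_per_day(rare_bug_count, days_between_occurrences):
    """Average daily crash count if every bug fires once per given interval."""
    return rare_bug_count / days_between_occurrences


# A hundred bugs that each cause one crash every 30 days.
rate = expected_crashes_per_day(rare_bug_count=100, days_between_occurrences=30)
hours_between_crashes = 24 / rate

print(f"{rate:.2f} crashes per day")                      # 3.33 crashes per day
print(f"one every {hours_between_crashes:.1f} hours")     # one every 7.2 hours
```

So a long tail of individually negligible bugs still adds up to a crash roughly every seven hours, which is consistent with "at least three server crashes a day".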

If we can't make the servers stable why don't we provide some kind of recovery?

It has been suggested that providing some kind of cargo insurance could prevent players losing large sums of aUEC when their server crashes mid cargo run. I believe this has been considered but the potential for it to be abused as an exploit is clear. Until that problem is solved cargo insurance is unlikely to appear in-game.

Another suggestion is to add some kind of server crash recovery. The idea here is that when a server crashes, all the clients would be kicked back to the menus with a 30000 as they are now but would then be given the option to join a newly spun-up server that has restored the state of the original from persistence. This is actually something we're hoping to do but it requires more work to be done on SOCS and full persistence before it can happen so is still a long way off.

There have also been other suggestions, such as clients or servers saving out the game's state in local files, but these either aren't secure or would be a temporary solution, and implementing and maintaining them would waste work that could be spent on the proper solution instead.

For now the best option is for us to continue to fix crashes as we find them and hope that servers are stable enough for most players to be able to test the game.
 

August

Admiral
Officer
Donor
Aug 27, 2018
2,574
9,637
750
RSI Handle
August-TEST
It is still an alpha product and logically I agree with the above.

I see both sides of this though and asking people to remain enthusiastic about a product which is regularly broken is a tall order.

The community which wants a stable game is also the source of income for the game's development. Not addressing stability is far more efficient, but also bites the hand which feeds.
 

Aramsolari

Vice Admiral
Donor
May 9, 2019
638
2,070
500
RSI Handle
Aramsolari
Indeed it is. Thanks for sharing with us.

As for the TL;DR part......This is one article that EVERYONE should read in its entirety.
A MUST READ for all Alpha TESTers.
Pleasure. I saw that posted on the Star Citizen subreddit, went over to the Spectrum thread from there and read the entirety. Thought everyone here should take a peek at it too. It kinda explains their stance on 30ks and their apparent decision to not prioritize it.

I agree with Clive Johnson in that trying to address the constant 30ks now would probably slow development. That said, it's pretty frustrating. As we're entering a free flight week, it's not gonna make a good impression on people looking to try out the game either.
 

Tei

Commander
Dec 8, 2019
162
406
100
RSI Handle
TeiwazWolf
All very interesting stuff. TLDR: "It's not worth it for us to focus on it right now, we'll wait till feature/content is done".
Ufff your TLDR scared me a bit - this game will be in alpha/content creation for another decade....
I thought they would just add new stuff (bugs) and wait until beta to fix them.
But after reading the full thing I see that they will do some bug fixing along the way. I just hope that it will be a healthy ratio....

As both a software developer and a consumer of software I can empathize with both sides.
However CIG is more on the sloppy side of code quality - at least it feels that way, without having the ability to peek into the code. But I remember the old show, Bugsmashers, and there were some bad things, like hidden time bombs.
So if they do not improve their practices, both by staff being more aware of code change implications and by decoupling as much as possible, we will all be frustrated by updates like the current 3.9.0.
 

Vavrik

Grand Admiral
Donor
Sep 19, 2017
2,897
11,904
1,100
RSI Handle
Vavrik
As both a software developer and a consumer of software I can empathize with both sides.
However CIG is more on the sloppy side of code quality - at least it feels that way, without having the ability to peek into the code. But I remember the old show, Bugsmashers, and there were some bad things, like hidden time bombs.
So if they do not improve their practices, both by staff being more aware of code change implications and by decoupling as much as possible, we will all be frustrated by updates like the current 3.9.0.
I agree with that. Bugsmashers used to bother me, because I never once saw a unit test being performed. Without them, you can't really tell when things are broken. And they still don't seem to be using try-blocks, which is how you capture a failure and prevent the user from crashing out of the game.
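For anyone who hasn't seen the two ideas side by side, here is a small Python sketch. It is purely illustrative and says nothing about how CIG's code is written; `Door`, `BrokenTerminal` and `update_all` are made-up names. The try-block keeps one failing entity from taking down the whole update loop, and the test function shows how a unit test would notice if that isolation stopped working.

```python
class Door:
    """Hypothetical entity whose update always succeeds."""

    def __init__(self):
        self.updated = False

    def update(self):
        self.updated = True


class BrokenTerminal:
    """Hypothetical entity whose update always fails."""

    def update(self):
        raise RuntimeError("terminal state is corrupt")


def update_all(entities):
    """Update every entity, collecting failures instead of crashing."""
    failures = []
    for entity in entities:
        try:
            entity.update()
        except Exception as exc:
            # Record the failure and carry on with the remaining entities.
            failures.append((entity, exc))
    return failures


def test_update_all_isolates_failures():
    """Unit test: a broken entity must not stop later entities updating."""
    door = Door()
    broken = BrokenTerminal()
    failures = update_all([broken, door])
    assert door.updated
    assert len(failures) == 1
    assert failures[0][0] is broken
```

The trade-off is that swallowing an exception can leave the failed entity in a bad state, so the collected failures still need to be logged and dealt with rather than ignored.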
 

Radegast74

Grand Admiral
Oct 8, 2016
2,457
8,975
1,400
RSI Handle
Radegast74
Worth reading for this bit alone!
....regardless of the cause, all server crashes result in clients getting the same 30000 disconnect. This disconnect happens because once the server has crashed the clients suddenly stop receiving network traffic from it. They then wait for 30 seconds to see if traffic will resume (in case the server was stuck on a temporary stall or there was a short network outage) before giving up, returning to the front end menus and showing the disconnection error. During these 30 seconds clients will see doors fail to open as well as AI, terminals and other entities become non-responsive. Backers sometimes mistake these symptoms for a sign that the server is about to crash, and you might see in-game chat saying a server crash is incoming, but the truth is that the server is already dead. It is an ex-server. It has ceased to be. If we hadn't nailed it to its perch it would be pushing up the daisies.
 

Bambooza

Grand Admiral
Sep 25, 2017
3,882
12,263
950
RSI Handle
MrBambooza
I agree with that. Bugsmashers used to bother me, because I never once saw a unit test being performed. Without them, you can't really tell when things are broken. And they still don't seem to be using try-blocks, which is how you capture a failure and prevent the user from crashing out of the game.
I've always found unit tests to be one of those double-edged swords, damned if you do, damned if you don't, due to how long they can take to program. I've been using them a lot at the interface level on the new .NET Core project I am working on, but in the past, with game development, it always got messy with them in the code, especially if they were not coded correctly and maintained. As Clive said, when developing new features huge portions of code get thrown away, new methods are added and the logic is tweaked.

It has always been my opinion that, at a minimum, interfaces should have unit tests on them and be well documented, but even here I find people go into a class and make things public static instead of doing the work of rewriting the code correctly.
 

Phil

Space Marshal
Donor
Nov 22, 2015
980
2,602
1,650
RSI Handle
Bacraut
I understand alpha, but let's be clear: CIG created this pre-production/production/alpha type of play-as-we-go development. This isn't just a normal alpha, so I don't accept the normal alpha responses.

I understand some people are affected more than others. Personally I have suffered a 30k almost every session I have played since 3.9 hit, and 3.8 wasn't much better; there were times I had 6 in one day. So I get it, alpha, yes, but it should still be playable, and 30k should be a priority. If I had to recommend this game right now.... I wouldn't. I would tell people to wait until the 30k was fixed, and if they can't or won't fix it my time will be pretty much 0%. Sure, I may log on to do the fleet week but that's it. I am tired of logging on and spending 10 min flying somewhere only to have a 30k happen before I can even get started. My time is limited: I get a few hrs a day if that, and maybe some extra on the weekends, not enough to spend my time restarting in SC over and over again lol.

Outside of the 30ks my complaints are limited. Again, the game looks great, and I wish it was further along. I couldn't care less about SQ42; it's a single-player campaign that most of us will have done in the first week, so getting hyped up over that seems a bit much to me when the main part of the game is still years away. But I get it, we all want something at this point.
 

Aramsolari

Vice Admiral
Donor
May 9, 2019
638
2,070
500
RSI Handle
Aramsolari
I suspect it has to do with some specific gameplay loop.

I do a lot of bounty hunting and have not had much issue.

Are you trading a lot?
I've been doing mostly mercenary/bounty stuff and my server experience has been erratic. 3.9 has definitely been less enjoyable to me than 3.8 in that regard.

I'm staying away from any gameplay loop that requires me to invest my ingame money (i.e. hauling). It's just not worth it right now.
 

Bambooza

Grand Admiral
Sep 25, 2017
3,882
12,263
950
RSI Handle
MrBambooza
*knocks on wood* I've had only 1 game drop, outside of the buggy transition from finished prison time back to civ.

The biggest issue I've had is with the turrets doing this weird sticking while firing instead of allowing you to turn off the gyro and smooth mouse movement. It makes it very hard to track targets even in ships moving slowly.

The other issue I've found is that sometimes POIs don't populate their trade terminals even if your ship shows up on the screen. (The workaround seems to be to take off and land again.)
 