How to handle a big issue on a live server that has already broken a lot of user data
Have you ever caused a big problem on production and panicked, not knowing what to do to remedy the mess you've just made? Well, I have. So here's what I've found.
1. Find what caused the issue
It's probably the most recent change you've pushed. If not, try to remember whatever code you've ever had doubts about; it's usually one of those. If you can't find it, tough luck, move on to the next step. Don't spend too much time hunting for the root cause.
2. Stop the function that caused the issue (e.g. a cron that cancels invoices)
In my case, it was a cron job. Easy fix: disable the cron job on both the server and in the codebase. If it was a crucial add/update process, then disabling it in the code and returning an alert like 'something went wrong, we're currently fixing it' should suffice. Just make sure to apologize to your support person later.
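If the offending job happens to be triggered over HTTP, a kill switch in front of the handler is usually the quickest safe way to disable it without ripping the logic out. Here's a minimal sketch, assuming a Flask-style endpoint and an environment-variable flag; the names CANCEL_INVOICES_ENABLED and cancel_overdue_invoices are hypothetical placeholders for whatever your job is actually called.

```python
# A minimal sketch: one flag you can flip without touching the data-changing code.
import os
from flask import Flask, jsonify

app = Flask(__name__)

def feature_enabled(name: str) -> bool:
    # Treat anything other than an explicit "true" as disabled.
    return os.environ.get(name, "false").lower() == "true"

@app.route("/cron/cancel-overdue-invoices", methods=["POST"])
def cancel_overdue_invoices():
    if not feature_enabled("CANCEL_INVOICES_ENABLED"):
        # The cron can still fire, but it does nothing and says why.
        return jsonify(error="Something went wrong, we're currently fixing it"), 503
    # ... the original cancellation logic would go here ...
    return jsonify(status="ok"), 200
```

The nice part is that turning the job back on after the fix is just flipping the variable, not another deploy.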
3. Update the live database with data from a backup
That is, if you even have a backup in the first place. If not, then you've just learned a valuable lesson in database management: always have backups. Also, good luck trying to remap/fix the broken data.
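If you do have a backup, restoring only the rows the bad job touched is usually safer than overwriting the whole table, because the live database has kept changing since the backup was taken. A rough sketch, assuming PostgreSQL with psycopg2 and that the backup was loaded into a separate "backup" schema; the invoices table and its columns are hypothetical.

```python
# A minimal sketch: copy back only the columns and rows the broken cron touched.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # connection details are placeholders
try:
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE invoices AS live
            SET status = bak.status,
                cancelled_at = bak.cancelled_at
            FROM backup.invoices AS bak
            WHERE live.id = bak.id
              AND live.status = 'cancelled'     -- rows the broken cron flipped
              AND bak.status <> 'cancelled'     -- that were still fine in the backup
        """)
        print(f"restored {cur.rowcount} rows")
finally:
    conn.close()
```

Run it inside a transaction (as above) and sanity-check the row count against what you expect before you commit anything like this for real.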
4. Find the underlying issue & its solution
Now that everything has stopped breaking, it's the perfect time to find the root cause and solve the problem. Make sure you imprint this problem in your memory; we don't want to cause another problem on production, now do we?
5. Fix code & deploy changes
After you've fixed it, make sure to have another programmer or your tech lead review your changes. Best case scenario, they see that it can be improved even further. After that, deploy and test the function again on live. Hopefully, it won't break anything this time.
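A regression test that reproduces the bug first, then passes with the fix, gives the reviewer something concrete to check and makes the re-deploy less of a gamble. Here's a minimal sketch with pytest, where select_invoices_to_cancel and the Invoice fields are hypothetical stand-ins for whatever your cron actually calls.

```python
# A minimal sketch of a regression test pinning down the bug before re-deploying.
from dataclasses import dataclass
from datetime import date

@dataclass
class Invoice:
    id: int
    status: str
    due_date: date

def select_invoices_to_cancel(invoices, today):
    # The fixed rule: only unpaid invoices past their due date get cancelled.
    return [inv for inv in invoices if inv.status == "unpaid" and inv.due_date < today]

def test_paid_invoices_are_never_cancelled():
    invoices = [
        Invoice(1, "paid", date(2023, 1, 1)),    # was wrongly cancelled before the fix
        Invoice(2, "unpaid", date(2023, 1, 1)),  # legitimately overdue
    ]
    to_cancel = select_invoices_to_cancel(invoices, today=date(2023, 2, 1))
    assert [inv.id for inv in to_cancel] == [2]
```

Keep the test around afterwards; it's the cheapest insurance against the same cron breaking the same data twice.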