HepBoat Postmortem – Late Post

Over the past few months, we received reports that HepBoat was slow during peak hours while users went about their normal daily tasks. After investigating the possible causes of the slowness, we determined that migrating to another server with better resources was the most promising fix, as the previous server was several years old and still had remnants of old code lying around.

However, with the bot's state worsening and its performance becoming unacceptable, we needed to do the migration right away and were not able to provide any prior notice before beginning the process. As we began the migration, we soon ran into issues, described in the following series of events.

NOTE: All event times are in Eastern Daylight Time. 

July 29th, 2019 –

02:32 (2:32am): We initiated the server transfer, which initially looked fine.

Morning Hours: We soon ran into migration issues: one of the drives holding the database data was slowing down badly during the migration process. Suspecting a faulty drive, we immediately began moving data off the disk. However, before we could finish, the server went offline mid-migration and would no longer boot, citing corruption of the disk's boot record and partition table. This meant that we had lost access to the entire PostgreSQL database (server whitelists, configuration, infractions, etc.) before it could be fully migrated.

We normally back up this data about every two days, but because the data directory had been changed in the Postgres configuration, it fell outside the scope of our backup script (which backs up 8 different servers), and we did not have an up-to-date backup of the database itself. The only backup we had was from July 1st, 2019, nine days before we began the Rawgoat migration. Losing roughly 29 days of data was unacceptable, so we kept looking for other solutions.
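
To show concretely what went wrong here, below is a minimal sketch, not our actual backup script, of the kind of sanity check that would have caught the mismatch: asking the running Postgres server where its data directory actually lives instead of assuming a hard-coded location. The psycopg2 connection details and the expected path are illustrative placeholders.

```python
# A minimal sketch (not our actual backup script) of a sanity check that
# asks the running Postgres server where its data directory actually is
# instead of assuming a hard-coded path.
import psycopg2

EXPECTED_DATA_DIR = "/var/lib/postgresql/11/main"   # placeholder: path the backup job assumed

conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("SHOW data_directory;")
    actual_data_dir = cur.fetchone()[0]

if actual_data_dir != EXPECTED_DATA_DIR:
    # This is the situation we hit: the data directory had moved, so a
    # file-level backup of the old path quietly missed the live database.
    raise SystemExit(
        f"data_directory is {actual_data_dir}, expected {EXPECTED_DATA_DIR}; "
        "update the backup job before trusting tonight's backup"
    )
print("data_directory matches the backup job's expectations")
```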

Still Morning Hours into the Evening Hours: After discovering the migration issues, we launched an investigation into what could be done about the corrupted drive. We did not want to restore the July 1st backup, as it did not cover the servers whitelisted after that date and would have required our users to reconfigure their servers from scratch.

Our final course of action was to work from a prior, partially corrupted snapshot to retrieve the data from the failed drive. This process took eight hours, with no certainty that we would get the data back given the snapshot's partially corrupted state, but the effort paid off and we were successful in the end. The snapshot let us try many different recovery methods without worry: typical data recovery operations are risky because they can cause further damage to the disk, but since we could always revert to the original snapshot and try again, we could safely run even potentially destructive operations against the data remaining on the disk. The snapshot proved to be our saving grace.
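
To make the idea concrete, here is a rough conceptual sketch, not our actual tooling, of the retry loop the snapshot enabled. The snapshot name and the shell helpers are hypothetical placeholders for whatever your hypervisor or filesystem provides.

```python
# Conceptual sketch: every recovery attempt starts from a fresh copy of the
# partially corrupted snapshot, so a destructive attempt never costs us the
# original data. All helper scripts below are hypothetical placeholders.
import subprocess

SNAPSHOT = "hepboat-db-2019-07-29"                 # hypothetical snapshot ID
RECOVERY_ATTEMPTS = [
    ["./attempt_fsck_repair.sh"],                  # hypothetical recovery scripts,
    ["./attempt_partition_table_rebuild.sh"],      # ordered least to most invasive
    ["./attempt_raw_file_carve.sh"],
]

def restore_working_copy(snapshot: str) -> None:
    """Throw away the working copy and re-clone it from the read-only snapshot."""
    subprocess.run(["./clone_snapshot.sh", snapshot], check=True)   # hypothetical helper

def postgres_data_recovered() -> bool:
    """Check whether the Postgres data directory is readable on the working copy."""
    return subprocess.run(["./verify_pg_data.sh"]).returncode == 0  # hypothetical helper

for attempt in RECOVERY_ATTEMPTS:
    restore_working_copy(SNAPSHOT)   # always start from a pristine copy
    subprocess.run(attempt)          # destructive operations are safe here
    if postgres_data_recovered():
        print("Recovered the data with", attempt[0])
        break
else:
    print("No attempt succeeded; the original snapshot is still untouched.")
```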

July 30th, 2019 –

01:20 (1:20am): With the newly retrieved data, we began migrating it onto the new server, then started the recovery process to backfill the messages missed across all HepBoat servers while the bot was down. We ended up backfilling about 516,000 messages from the 23-hour outage, all of which are safely back in the database.
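
For those curious what backfilling looks like in practice, here is a rough sketch using discord.py, the library the rewrite targets (the bot at the time was still built on disco.py). The token, the outage window, and the store_message() helper are illustrative placeholders, not our actual recovery code.

```python
# Rough sketch: walk each channel's history for the outage window and write
# the missed messages back into the database.
import datetime
import discord

OUTAGE_START = datetime.datetime(2019, 7, 29, 6, 32)   # UTC, example values
OUTAGE_END = datetime.datetime(2019, 7, 30, 5, 20)

client = discord.Client()

async def store_message(message: discord.Message) -> None:
    """Placeholder for writing the message back into the database."""
    ...

@client.event
async def on_ready():
    backfilled = 0
    for guild in client.guilds:
        for channel in guild.text_channels:
            # history() walks the channel's past messages via the REST API.
            async for message in channel.history(
                limit=None, after=OUTAGE_START, before=OUTAGE_END
            ):
                await store_message(message)
                backfilled += 1
    print(f"Backfilled {backfilled} messages")

client.run("BOT_TOKEN")
```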

03:03 (3:03am): At this point, we had completed the recovery process and the bot was back to normal speeds. We marked the bot as fully operational and continued to monitor its performance.

Final Comments – 

From this timeline of events, we determined that the outage lasted 23 hours. While not quite a full day, that is still unacceptable: our users should not be left without HepBoat for that long. This is why we spent 15 hours straight as a team making sure we got every byte of data back.

We have looked into what we can do to improve this situation going forward. Our plan is to address the backup issues so that we take more frequent and accurate backups of Postgres, giving us confidence that we can have the service back up and running in the shortest amount of time possible if an issue like this happens again.
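
As an example of the direction we are heading, here is a minimal sketch, not our production script, of a nightly logical dump with pg_dump. Because pg_dump connects to the server like any other client, it does not depend on where the data directory lives on disk, which is exactly the assumption that bit us. The database name and backup path are placeholders.

```python
# Minimal sketch: write a timestamped custom-format dump that pg_restore can read.
import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/backups/hepboat")   # placeholder backup location
DATABASE = "hepboat"                            # placeholder database name

def dump_database() -> pathlib.Path:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    target = BACKUP_DIR / f"{DATABASE}-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", str(target), DATABASE],
        check=True,
    )
    return target

if __name__ == "__main__":
    print("wrote", dump_database())
```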

In the end, we decided that the next course of action was to rewrite HepBoat so the bot will be more stable, even during peak hours. Not only will we be rewriting HepBoat, we will also be migrating its library from disco.py to discord.py, along with the exclusive pizza plugin. As of the current date (November 23rd, 2019), the HepBoat rewrite is about halfway complete, and there is still work to be done before it is ready for the public. As we get closer to the public release, we will share details about what the rewrite will mean for HepBoat users, as well as Rawgoat users who will be migrating to HepBoat. Stay tuned.

We want to thank you for your patience and understanding during this outage and the ongoing problems over the past few months. We truly appreciate you guys.

If you have any questions regarding the outage, please do not hesitate to ask. 

Best Regards, 

JakeyPrime, Bunnerz, Ghoul, and the rest of the HepBoat Development Team
