Thursday 2 February 2017

GitLab went offline after a sysadmin deleted the wrong folder



GitLab, an online source code repository service similar to GitHub, went offline for more than 12 hours after one of its sysadmins deleted the wrong folder in production. The service has since been restored, and the data loss affects less than 1% of the user base, specifically peripheral metadata written during a 6-hour window.

GitLab kept a running log of the recovery in a public Google Docs file. According to that document, the possible impact is:

Impact



  • ±6 hours of data loss
  • 4613 regular projects, 74 forks, and 350 imports are lost (roughly); 5037 projects in total. Since Git repositories are NOT lost, we can recreate all of the projects whose user/group existed before the data loss, but we cannot restore any of these projects’ issues, etc.
  • ±4979 (so ±5000) comments lost
  • 707 users lost potentially, hard to tell for certain from the Kibana logs
  • Webhooks created before Jan 31st 17:20 were restored, those created after this time are lost


Also, there were several problems encountered during the restoration process.


  • LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
  • Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
  • SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
  • Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
  • The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
  • The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
  • SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.
  • Our backups to S3 apparently don’t work either: the bucket is empty
  • We don’t have solid alerting/paging for when backups fail; we are seeing this in the dev host too now.
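
Several of the points above (dump files only a few bytes in size, an empty bucket, no alerting on failed backups) come down to nobody checking that a backup actually exists and looks plausible. A minimal sketch of such a check follows. It is not GitLab's tooling. The function names, the 1 MiB size threshold and the local-directory layout are assumptions made for illustration, and the 24-hour age limit only mirrors the backup interval mentioned above.

```python
import time
from pathlib import Path

# Assumed thresholds, for illustration only.
MIN_BACKUP_BYTES = 1024 * 1024       # a real database dump should be well above 1 MiB
MAX_BACKUP_AGE_SECONDS = 24 * 3600   # backups were expected once every 24 hours


def check_backup(path, now=None, min_bytes=MIN_BACKUP_BYTES,
                 max_age=MAX_BACKUP_AGE_SECONDS):
    """Return a list of problems found with the backup file at `path`."""
    now = time.time() if now is None else now
    path = Path(path)
    if not path.is_file():
        return [f"{path}: backup file is missing"]
    problems = []
    stat = path.stat()
    # A dump of only a few bytes usually means the dump tool failed silently.
    if stat.st_size < min_bytes:
        problems.append(
            f"{path}: only {stat.st_size} bytes, expected at least {min_bytes}")
    # A stale file means the scheduled job has stopped producing backups.
    if now - stat.st_mtime > max_age:
        problems.append(f"{path}: older than {max_age} seconds")
    return problems


def check_backup_dir(directory, **kwargs):
    """Check the newest file in `directory`; an empty directory is a problem."""
    directory = Path(directory)
    if not directory.is_dir():
        return [f"{directory}: backup directory is missing"]
    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return [f"{directory}: no backups found"]
    latest = max(files, key=lambda p: p.stat().st_mtime)
    return check_backup(latest, **kwargs)
```

A script like this could run on a schedule and page someone whenever the returned list is not empty. It only catches missing, tiny or stale files; it does not prove that a backup can be restored, which still needs a periodic test restore.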


GitLab said on its blog, "Losing production data is unacceptable, and in a few days we'll post the five why's of why this happened and a list of measures we will implement".

Twitter users are praising GitLab for the transparency with which the company handled the incident. Updates were posted throughout on the blog, the Twitter account and the Google Docs file. This was a great way of keeping users and the press informed, and GitLab deserves praise for it. Still, I doubt GitLab or any other organisation would ever want to face a worst-case situation like this.

