Incident Report for Outage on April 3 2019.
Posted by Kian Ng on 08 April 2019 06:18 PM
Dear customers,
This report was compiled after detailed interviews and an investigation with our technical team and our software vendor (Virtuozzo.com). Any communication provided before this incident report should be disregarded, as our team may have shared information that was believed to be accurate at the time of the outage but was shown to be inaccurate after investigation.
(1) A bug in the virtualisation infrastructure running ReadySpace Cloud Servers was discovered.
(2) Two versions of the platform are currently in use, VZ6 and VZ7, and affected VMs would have been running on either or both.
(3) The issue occurred when engineers observed that VZ6 VMs were suddenly and unexpectedly unable to connect to storage. Error: MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found.
(4) A support ticket was submitted to the software vendor, and the cause was identified as a bug. (Ticket Reference - [Virtuozzo #26185] Ticket Contents - "Object type 12 is MDS_OBJ_KVSTORE. It was added in VZ7.0.8 storage release. I have submitted a bug with internal ID PSBM-93335 to our Developers Team. Regarding this case, for now you is not able to create vz6 MDS in the mixed mode cluster. Workaround - upgrade this vz6 storage node to VZ7.")
(5) No immediate fix from Virtuozzo (VZ) was available.
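For readers who monitor similar clusters, the error quoted in (3) can be picked out of a log with a simple pattern. The following is only a minimal sketch: the regular expression is inferred from the single error line quoted in this report, and the function name `parse_mds_death` and the field names are hypothetical. The real log format may differ.

```python
import re

# Pattern inferred from the one error line quoted in this report.
# This is an assumption; actual log lines may carry extra fields.
MDS_DEATH = re.compile(
    r"MON ERR MDS#(?P<mds_id>\d+) died unexpectedly \((?P<code>\d+)\): "
    r"snap: no objtype (?P<objtype>\d+) found"
)

def parse_mds_death(line):
    """Return the MDS id, error code and object type, or None if no match."""
    match = MDS_DEATH.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}

sample = "MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found"
event = parse_mds_death(sample)
print(event)
```

The extracted object type (12) is the value the vendor identified as MDS_OBJ_KVSTORE in the ticket quoted in (4).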
What were the issues?
(1) VZ6 and VZ7 were suddenly unable to run in mixed mode after the recent VZ7 storage release.
(2) No errors were encountered for 110 days after the initial update to VZ7 until an unexpected error occurred in the early hours of 3rd April. (MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found (3)). Immediate action to initiate recovery was required to prevent data loss as many clients had not procured a backup service.
How was the issue resolved?
(1) As advised by Virtuozzo, the workaround for this issue was to upgrade the VZ6 storage node to VZ7.
(2) As 2 nodes in VZ6 that contained 12TB of data were degraded, we had to add additional storage (at least 4.5TB) to the VZ7 cluster to accommodate the data transfer. Again, this was necessary to prevent any form of data loss, as many affected clients had not purchased a backup.
(3) Servers were immediately deployed and connected to the VZ7 cluster to allow sync to start. This process would have taken 5 days to complete which meant data would be unavailable to server nodes for 5 days.
(4) Another server node of 6TB was deployed and connected to the VZ7 cluster to speed up the recovery process. Although the amount of data to be replicated was massive, the process was successfully reduced from the initial estimate of 5 days to approximately 20 hours.
(5) Within these 20 hours, data first needed to be duplicated in degraded mode, which took about 10 hours. A workaround was then applied by the Virtuozzo Support Team to allow data from the disconnected VZ6 to be moved into the VZ7 storage cluster. This took another 10 hours. During this process, some VMs were in degraded mode and may not have been accessible, because the data chunks of an individual VM may have been split between VZ6 and VZ7.
(6) Once completed, all VMs were back online, albeit with low IOPS. This was because another VZ6 storage had to be replicated. This meant that, while the servers were online, data replication continued, resulting in the low IOPS.
(7) Once this replication is completed, all VMs will be back online with high IOPS.
(8) The last step of this recovery will be to upgrade all VZ6 compute and memory to VZ7 compute and memory.
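To put the recovery times in steps (3) to (5) in perspective, the implied average transfer rates can be worked out as below. This is a rough sketch under stated assumptions: it assumes the full 12TB from step (2) was the amount replicated, uses decimal terabytes, and treats the rates as simple averages. The function name `avg_throughput_mb_s` is illustrative only.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)
MB = 10**6   # decimal megabyte in bytes

def avg_throughput_mb_s(data_tb, hours):
    """Average rate in MB/s needed to move data_tb terabytes in the given hours."""
    return data_tb * TB / (hours * 3600) / MB

initial_estimate = avg_throughput_mb_s(12, 5 * 24)  # 5-day estimate
actual = avg_throughput_mb_s(12, 20)                # roughly 20 hours
speedup = (5 * 24) / 20

print(f"5-day estimate: {initial_estimate:.1f} MB/s")
print(f"20-hour run:    {actual:.1f} MB/s")
print(f"speed-up:       {speedup:.0f}x")
```

Under these assumptions, adding the extra node corresponds to roughly a sixfold increase in average replication rate.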
What is the status now?
(1) Normal service has resumed and customers are able to access the Cloud Servers (VMs).
(2) Customers who are using VZ6 compute and memory are scheduled to be migrated to VZ7 compute and memory.
(3) The software provider (Virtuozzo) is producing a fix for this issue.
Conclusion and next steps:
(1) A maintenance window will be scheduled immediately to add more SSD server nodes for higher IOPS.
(2) This issue was caused by a software bug that surfaced unexpectedly between VZ6 and VZ7 storage nodes. Reference for the software-defined storage we use - https://www.virtuozzo.com/products/virtuozzo-storage.html
(3) We will no longer provide mixed-mode clusters from VZ7 onwards. Customers who need the latest version will need to migrate their VMs to another cluster.
(4) Although software bugs can occur regardless of the HA ratio, we will be extending our cluster to a higher HA ratio.
(5) We will move our websites from main cluster to an isolated cluster so as to maintain constant communication with our users during service outages.
(6) We will re-evaluate the VZ Storage cluster and start exploring alternative software-defined storage solutions, e.g. Ceph.
(7) We sincerely and unreservedly apologise for the severity of downtime and the extreme inconvenience caused to our customers and their users.
(8) The management team at ReadySpace would like to thank our technical team for working around the clock and the customer service team for managing relentless client queries during the time of outage.
Should you have further queries, please contact us via firstname.lastname@example.org with the subject "Outage on 3rd April".