Apr
8
Incident Report for Outage on April 3 2019.
Posted by Kian Ng on 08 April 2019 06:18 PM
Dear customers, 

This report was compiled after detailed interviews and investigation with our technical team and software vendor (Virtuozzo.com). Any communication provided prior to this incident report should be disregarded, as our team may have provided information that was believed to be accurate at the time of the outage but was shown not to be so after investigation.



What happened?

(1) A bug in the virtualisation infrastructure running ReadySpace Cloud Servers was discovered.

(2) There are currently 2 versions in use, v6 (VZ6) and v7 (VZ7), and affected VMs would have been running on either or both.

(3) The issue occurred when v6 VMs suddenly and unexpectedly became unable to connect to storage, as observed by our engineers. Error: MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found.

(4) A support ticket was submitted to the software vendor, and the cause was identified as a bug. (Ticket reference: [Virtuozzo #26185]. Ticket contents: "Object type 12 is MDS_OBJ_KVSTORE. It was added in VZ7.0.8 storage release. I have submitted a bug with internal ID PSBM-93335 to our Developers Team. Regarding this case, for now you is not able to create vz6 MDS in the mixed mode cluster. Workaround - upgrade this vz6 storage node to VZ7.")

(5) No immediate fix from Virtuozzo (VZ) was available.
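
For readers who want to look for the same message in their own logs, the following is a minimal sketch in Python that extracts the numeric fields from the error line quoted in (3). The pattern is built only from that one line; the field names (mds_id, code, objtype) are our own labels for illustration, not Virtuozzo terminology, and other log lines may be formatted differently.

```python
import re

# Pattern derived from the single error line quoted in this report.
# The group names are illustrative labels, not vendor terminology.
MDS_ERROR = re.compile(
    r"MON ERR MDS#(?P<mds_id>\d+) died unexpectedly "
    r"\((?P<code>\d+)\): snap: no objtype (?P<objtype>\d+) found"
)


def parse_mds_error(line):
    """Return the numeric fields of a matching line, or None if there is no match."""
    match = MDS_ERROR.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = "MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found"
print(parse_mds_error(sample))  # {'mds_id': 4, 'code': 122, 'objtype': 12}
```

The objtype value 12 is the one the vendor ticket identifies as MDS_OBJ_KVSTORE.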



What were the issues?

(1) VZ6 and VZ7 were suddenly unable to run in mixed mode after a recent VZ7 storage release.

(2) No errors were encountered for 110 days after the initial update to VZ7 until an unexpected error occurred in the early hours of 3rd April. (MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found (3)). Immediate action to initiate recovery was required to prevent data loss as many clients had not procured a backup service.



How was the issue resolved?

(1) As advised by Virtuozzo, the workaround for this issue was to upgrade the VZ6 storage nodes to VZ7.

(2) As 2 nodes in VZ6 that contained 12TB of data were degraded, we had to add additional storage (at least 4.5TB) to the VZ7 cluster to accommodate the data transfer. Again, this was necessary to prevent any form of data loss, as many affected clients had not purchased a backup service.

(3) Servers were immediately deployed and connected to the VZ7 cluster to allow sync to start. This process would have taken 5 days to complete which meant data would be unavailable to server nodes for 5 days.

(4) Another server node of 6TB was deployed and connected to the VZ7 cluster to speed up the recovery process. Although the amount of data to be replicated was massive, the process was successfully reduced from the initial estimate of 5 days to approximately 20 hours.

(5) Within these 20 hours, data first needed to be duplicated in degraded mode, which took about 10 hours. A workaround was then applied by the Virtuozzo Support Team to allow data from the disconnected VZ6 storage to be moved into the VZ7 storage cluster. This took another 10 hours. During this process, some VMs were in degraded mode and may not have been accessible, because the data chunks of an individual VM may have been split between VZ6 and VZ7 storage.

(6) Once completed, all VMs were back online, albeit with low IOPS. This was because another VZ6 storage node still had to be replicated. While the servers were online, data replication continued, resulting in the low IOPS.

(7) Once this replication is complete, all VMs will be back online with high IOPS.

(8) The last step of this recovery will be to upgrade all VZ6 compute and memory to VZ7 compute and memory.
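
As a rough illustration of the time figures in (3) and (4), the sketch below works out the average transfer rate each estimate would imply. It assumes the full 12TB from (2) had to be moved, uses decimal units (1TB = 10^12 bytes), and treats the rate as constant; none of these assumptions comes from measured data, so the numbers are indicative only.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def average_rate_mb_s(data_bytes, hours):
    """Average rate in MB/s needed to move data_bytes within the given hours."""
    return data_bytes / (hours * 3600) / MB


data_bytes = 12 * TB    # data held on the degraded VZ6 nodes, from (2)
initial_hours = 5 * 24  # initial estimate of 5 days, from (3)
actual_hours = 20       # approximate duration after the extra node, from (4)

print(round(average_rate_mb_s(data_bytes, initial_hours), 1))  # 27.8
print(round(average_rate_mb_s(data_bytes, actual_hours), 1))   # 166.7
print(initial_hours / actual_hours)                            # 6.0
```

Under these assumptions, the additional server node shortened the recovery by a factor of about 6.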



What is the status now?

(1) Normal service has resumed and customers are able to access the Cloud Servers (VMs).

(2) Customers who are using vz6 compute and memory are scheduled to be migrated to vz7 compute and memory.

(3) Software provider (Virtuozzo) is producing a fix to this issue.



Conclusion and next steps:

(1) A maintenance window will be scheduled immediately to add more SSD server nodes for more IOPS.

(2) This issue was caused by a software bug that surfaced unexpectedly between VZ6 and VZ7 storage nodes. Reference for the software-defined storage we use: https://www.virtuozzo.com/products/virtuozzo-storage.html

(3) We will no longer provide mixed-mode clusters from VZ7 onwards. Customers who need the latest version will need to migrate their VMs to another cluster.

(4) Although software bugs can occur regardless of the HA ratio, we will be extending our cluster to a higher HA ratio.

(5) We will move our websites from main cluster to an isolated cluster so as to maintain constant communication with our users during service outages.

(6) We will re-evaluate the VZ Storage cluster and start exploring alternative software-defined storage solutions, e.g. Ceph.

(7) We sincerely and unreservedly apologise for the severity of downtime and the extreme inconvenience caused to our customers and their users.

(8) The management team at ReadySpace would like to thank our technical team for working around the clock and the customer service team for managing relentless client queries during the time of outage.



Should you have further queries, please contact us via help@readyspace.com with the subject "Outage on 3rd April".


ReadySpace Team



Apr
3
Issue with storage at our Cloud platforms
Posted by Perez Koh on 03 April 2019 05:22 AM

Dear customers, 

We are having an issue with storage at our cloud platforms and some users on the Cloud Server and Cloud Infrastructure platforms may experience disruptions to their services during this period.  Our Level 3 engineers are already working on this case and all services should be resumed soon.

We apologize for any inconvenience this may cause.

ReadySpace Team.

Update 1: We have added storage to the cluster and are now configuring it.

Update 2: The recovery process is ongoing. Some of the affected services have already come back online.

Update 3: https://helpdesk.readyspace.com/index.php?/News/NewsItem/View/497/update-issue-with-storage-at-our-cloud-platforms





Jul
4
Issue Affecting Singtel Users
Posted by Kian Ng on 04 July 2018 11:36 AM
Dear customers, 

If you are a user of Singtel services, you may experience some difficulty connecting to our services as there is a reported issue with the Singtel network. There are currently no reported issues on any of our services. You may wish to check with Singtel if you do encounter any difficulties with connectivity. :)


http://www.channelnewsasia.com/news/singapore/singtel-customers-experience-internet-connectivity-issues-10497642


Best regards,

ReadySpace Team



Jun
21
Temporary Change of SG Contact Number
Posted by Kian Ng on 21 June 2018 04:50 PM
Hello!

  • We'd like to let everybody know that our Singapore telephone number will temporarily be changed to:
  • +65 6914 2694
  • This number will be in use until further notice. All other channels of communication remain unchanged.



Cheers!


ReadySpace Team



Jun
21
Update on Delayed Email Issue
Posted by Kian Ng on 21 June 2018 02:13 AM
Dear all,

With regard to the previously reported issue with delayed sending/receiving of email that may have affected some users, we are pleased to announce that all users should have normal service restored by now.

Thank you for your patience, and our sincere apologies to all users who have been inconvenienced.



Regards 


ReadySpace Team




ReadySpace Helpdesk - Giving you space for growth