Software bug caused outage of encrypted mail service ProtonMail

A software bug made the encrypted email service ProtonMail and other Proton services unreachable for an hour last Monday. The email provider describes this in an analysis of the outage. On the morning of Monday, February 1, Proton carried out planned maintenance on its database hardware, which was completed successfully.

For the maintenance, the API servers were put into an offline mode. Each server keeps a local cache of important configuration information. This data expires after three to four minutes, after which the server retrieves a fresh configuration. Normally the expiration times of the servers are staggered. Enabling the offline mode, however, left the servers' expiration times virtually synchronized.
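
The difference between staggered and synchronized expiration can be illustrated with a minimal sketch. It is not Proton's code: the fleet size, the 180-second base lifetime with up to 60 seconds of random spread, and the idea that every cache is reloaded at the same moment are all assumptions chosen to match the "three to four minutes" figure.

```python
import random

BASE_TTL = 180.0  # seconds; assumed lower bound of "three to four minutes"
JITTER = 60.0     # seconds; assumed random spread on top of the base TTL

def cache_expiry(loaded_at, rng):
    """Time at which a configuration cache loaded at `loaded_at` expires."""
    return loaded_at + BASE_TTL + rng.uniform(0.0, JITTER)

def spread(times):
    """Width of the window in which all expirations fall."""
    return max(times) - min(times)

rng = random.Random(1)
servers = 50  # hypothetical fleet size

# Normal operation: each server loaded its cache at a different moment.
staggered = [cache_expiry(rng.uniform(0.0, 3600.0), rng) for _ in range(servers)]

# Hypothetical maintenance case: every server reloads its cache at t = 0.
synchronized = [cache_expiry(0.0, rng) for _ in range(servers)]

print(f"staggered window:    {spread(staggered):.0f} s")
print(f"synchronized window: {spread(synchronized):.0f} s")
```

Under these assumptions the synchronized expirations can never be more than a minute apart, so all servers would turn to the database for a fresh configuration at nearly the same time.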

Normally, the configuration cache is not needed while a server is in offline mode. A bug in the application code caused requests from logged-in users to erroneously need this data even in offline mode. Once each server's configuration cache expired a few minutes later, the servers sent requests to the database, which was still offline.
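
The intended behavior, in which an offline server answers without touching its configuration, might look like the sketch below. The class, the function, the status codes and the configuration key are all hypothetical and serve only to illustrate the idea; nothing here is taken from Proton's application code.

```python
class ConfigCache:
    """Hypothetical stand-in for a server's local configuration cache."""

    def __init__(self):
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        # In the incident, a refresh at this point would have reached
        # the database that was still offline.
        return {"greeting": "hello"}[key]

def handle_request(offline, cache):
    # Check the offline flag first, so that no code path reaches the cache.
    if offline:
        return 503, "Service temporarily unavailable"
    return 200, cache.get("greeting")

cache = ConfigCache()
print(handle_request(True, cache), cache.lookups)
print(handle_request(False, cache), cache.lookups)
```

The bug described above corresponds to a request path that reads from the cache before, or regardless of, the offline check.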

According to ProtonMail, processing a request normally takes milliseconds, but these requests now waited ten seconds before timing out. This had a knock-on effect, putting so much load on the servers' memory and processors that they stopped responding. The servers could only be recovered with a manual hard reboot, which took some time. Some of the servers that came back online were still misconfigured, so engineers had to check each server before all services could be restored.
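
Why slow requests exhaust a server can be approximated with Little's law, which says that the average number of requests in progress equals the arrival rate multiplied by the time each request takes. The sketch below uses the ten-second timeout from the article. The arrival rate of 500 requests per second and the normal latency of 10 milliseconds are purely illustrative assumptions, not figures from Proton.

```python
def requests_in_flight(arrival_rate_per_s, seconds_per_request):
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * seconds_per_request

ARRIVAL_RATE = 500.0    # requests per second per server; purely illustrative
NORMAL_LATENCY = 0.010  # 10 ms; assumed stand-in for "milliseconds"
TIMEOUT = 10.0          # seconds, as stated in the article

normal = requests_in_flight(ARRIVAL_RATE, NORMAL_LATENCY)
stalled = requests_in_flight(ARRIVAL_RATE, TIMEOUT)
print(f"normal: {normal:.0f} in flight, stalled: {stalled:.0f} in flight")
```

With these numbers a server that usually holds a handful of requests would suddenly hold thousands, each tying up memory and processor time until its timeout expires.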

ProtonMail says the incident has shown that its infrastructure and code need more extensive testing under exceptional circumstances, such as when certain services are offline. The outage also underlines the need to test more often. The offline mode that ProtonMail used had been tested in the past, but between the last test and the incident on February 1 the number of users increased significantly. According to the mail provider, more testing should have been done. ProtonMail will also ensure that engineers are on call during maintenance in case something unexpected happens.

