A few weeks ago, a friend who looks after a web server had an outage on their website and asked me to help troubleshoot. The cause of the outage surprised me and is the reason why I’m writing about it.
The website outage was due to a dependency it had on the server of the Certificate Authority (CA) that issued its TLS certificate. Yup, that’s right: the availability of the website was directly tied to the availability of its CA’s server, and during that time the CA’s server had an outage.
So how do a CA’s servers affect the availability of its customers’ web applications? It has to do with OCSP Stapling.
What is OCSP Stapling?
Online Certificate Status Protocol (OCSP) is a protocol for obtaining the revocation status of a TLS certificate. It’s how Let’s Encrypt would have revoked their 3 million certificates on March 4 [1] if they had followed through with it [2]. The status check is done by connecting to the servers of the CA. If you inspect a TLS certificate, the Authority Information Access (AIA) section provides the OCSP address. Below is a screenshot of the AIA for the TLS certificate for https://letsencrypt.org.
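If you’d rather read the OCSP address programmatically than from a certificate viewer, here is a minimal Python sketch. It relies on the standard library’s `getpeercert()`, which exposes the AIA OCSP URIs under an "OCSP" key when the certificate carries them. The function names `ocsp_urls_from_peercert` and `fetch_ocsp_urls` are my own and purely illustrative.

```python
import socket
import ssl


def ocsp_urls_from_peercert(cert):
    """Return the OCSP responder URLs found in a getpeercert() dictionary.

    The "OCSP" key is only present when the certificate's AIA extension
    lists an OCSP responder, so a missing key yields an empty list.
    """
    return list(cert.get("OCSP", ()))


def fetch_ocsp_urls(host, port=443, timeout=5.0):
    """Connect to host over TLS and return the OCSP URLs from its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return ocsp_urls_from_peercert(tls.getpeercert())
```

Calling `fetch_ocsp_urls("letsencrypt.org")` needs network access, and what it returns depends on the certificate being served at the time.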
The OCSP request is usually done by the client, specifically the web browser. However, this adds overhead to the TLS handshake and places a large amount of load on the OCSP servers. This is where OCSP Stapling comes in: the web application does the OCSP lookup instead and attaches (“staples”) the OCSP response when presenting its TLS certificate to the client. The OCSP response from the CA server is signed and contains a timestamp so that the client can validate its authenticity. The web application is then able to cache the OCSP response and query the OCSP server at regular intervals. This helps improve the web application’s overall performance.
OCSP Stapling isn’t enabled by default on Apache servers. It needs to be enabled using the SSLUseStapling Apache directive [3].
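To make the timestamp part concrete, here is a rough Python sketch of the freshness check a client might apply to a stapled response. OCSP responses carry `thisUpdate` and `nextUpdate` times. The `StapledResponse` type, the `is_acceptable` function and the five-minute clock-skew allowance are my own assumptions for illustration, and signature verification is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StapledResponse:
    """Hypothetical, already-parsed view of a stapled OCSP response."""
    cert_status: str        # "good", "revoked" or "unknown"
    this_update: datetime   # when the CA produced this status
    next_update: datetime   # when the CA promises newer information


def is_acceptable(resp, now, clock_skew=timedelta(minutes=5)):
    """Accept only a "good" status whose validity window contains `now`.

    A real client must also verify the CA's signature on the response;
    that step is omitted from this sketch.
    """
    in_window = (resp.this_update - clock_skew) <= now <= (resp.next_update + clock_skew)
    return in_window and resp.cert_status == "good"
```

This window is also why a server can cache and reuse a response: it stays usable until `nextUpdate`, so the server doesn’t have to ask the CA on every handshake.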
The Outage
So how did an outage on the CA server cause the website to become unavailable? If the web server is unable to obtain an OCSP response, shouldn’t it continue its communication with the client without an OCSP response, or send the last valid OCSP response in its cache? The client can then decide on the best course of action. However, it seems like there is a bug in Apache, where OCSP stapling passes on temporary server outages to clients [4]. This was reported in 2014 against Apache 2.4.6, but it seems it has not been addressed.
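The behaviour I expected can be sketched as a small toy model in Python. This is not how Apache implements stapling. `StapleCache`, its parameters and the `fetch` callable (standing in for a request to the CA’s OCSP responder) are all hypothetical.

```python
import time
from typing import Callable, Optional


class StapleCache:
    """Toy model of a server-side staple cache with graceful fallback."""

    def __init__(self, fetch: Callable[[], bytes], refresh_after: float,
                 valid_for: float, clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                  # hypothetical call to the OCSP responder
        self._refresh_after = refresh_after  # seconds before a refresh is attempted
        self._valid_for = valid_for          # seconds a cached response stays usable
        self._clock = clock
        self._response: Optional[bytes] = None
        self._fetched_at = 0.0

    def staple(self) -> Optional[bytes]:
        """Return the response to staple, or None to send no staple at all."""
        now = self._clock()
        age = now - self._fetched_at
        if self._response is not None and age < self._refresh_after:
            return self._response            # cached copy is recent enough
        try:
            self._response = self._fetch()
            self._fetched_at = now
            return self._response
        except OSError:
            # Responder unreachable: reuse the last good response while it is
            # still valid, otherwise continue the handshake without a staple.
            if self._response is not None and age < self._valid_for:
                return self._response
            return None
```

In this model the responder’s outage never reaches the client as an error. The client either gets a still-valid cached response or no staple at all, and can then decide for itself what to do.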
The Fix
There’s an Apache directive called SSLStaplingReturnResponderErrors [5] which we can set to off. It defaults to on. With it off, Apache only staples OCSP responses that report a certificate status of “good”, and leaves the staple out of the handshake when the query to the OCSP server fails. The other option is to increase the duration a good OCSP response is kept in the cache. This can be done through the SSLStaplingStandardCacheTimeout [6] directive, which defaults to 3600 seconds.
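For reference, here is roughly how these directives might sit together in an Apache configuration. Treat it as a sketch, not a drop-in config. SSLStaplingReturnResponderErrors is set to off, which per the mod_ssl documentation means only responses with a “good” certificate status are stapled. The cache path and size follow the example in the mod_ssl documentation, and the 86400-second timeout is purely my own assumption. Whatever value you pick should stay below the validity period of your CA’s OCSP responses.

```apache
# Global server context (outside any <VirtualHost>); a stapling cache
# is required once SSLUseStapling is turned on.
SSLStaplingCache shmcb:logs/ssl_stapling(32768)

<VirtualHost *:443>
    # ... existing SSLEngine and certificate directives ...
    SSLUseStapling on

    # Only staple responses with a "good" status; do not forward
    # responder errors to clients.
    SSLStaplingReturnResponderErrors off

    # Keep good responses cached for longer (assumed value, in seconds).
    SSLStaplingStandardCacheTimeout 86400
</VirtualHost>
```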
Final Thoughts
It looks like this has happened before with Let’s Encrypt in 2017 [7]. While reflecting on this website’s outage, I wondered how many other websites were also affected. A quick search on the internet for all Apache 2.4.x servers with TLS certificates issued by that particular CA came up with 116,360 servers, of which 3,388 were based in Australia. These are just the public-facing servers. I wondered how many of them were also unavailable during the CA server outage.
[3] https://httpd.apache.org/docs/trunk/mod/mod_ssl.html#sslusestapling
[4] https://bz.apache.org/bugzilla/show_bug.cgi?id=57121
[5] https://httpd.apache.org/docs/trunk/mod/mod_ssl.html#SSLStaplingReturnResponderErrors
[6] https://httpd.apache.org/docs/trunk/mod/mod_ssl.html#SSLStaplingStandardCacheTimeout