Unveiling the hidden bottleneck of a website outage

August 15, 2024

On a quiet Wednesday morning, the alarms went off, warning us of a significant decline in the number of requests reaching the production servers. The metrics showed a 90% drop in requests and a significant increase in page rendering times.

After a quick investigation, we discovered the culprit was the following code. Can you spot the problem?

function get_avatar_url($user_id) {
    $path = "/mnt/nas/avatar/" . $user_id . ".png";
    if ( file_exists($path) ) {
        return "http://nas.example.com/avatar/" . $user_id . ".png";
    }
    return "http://nas.example.com/avatar/default.png";
}

The code shown here is simplified for the purposes of this article.

Background

Roughly between 2010 and 2016, our team of six took care of a social network for a large Latin American telecom company. Our responsibilities ranged from developing new features to providing 24/7 on-call support. We were located in Europe, while most of the traffic and users came from Latin America. In other words, we worked during the low-traffic hours and provided support during our nighttime.

The code was a classic case of a legacy proof-of-concept promoted to production. It worked, but it wasn't scalable, and we were constantly putting out fires to keep the service running. As soon as we improved one thing, the traffic increased and something else broke.

The frontend was plain old HTML with CSS, while the backend was written in PHP, running against a MySQL database. Most users accessed the service from their mobile devices, the so-called feature phones. Since these devices did not support JavaScript, every single page had to be rendered server-side.

The service ran on ~40 servers, serving ~200 million pageviews every day to 2 million monthly active users. Fun fact: that's roughly the population of Slovenia, where the six of us were from.

This article is based on the state of the service at the time of the incident. The service has undergone several changes since then, which are not reflected here.

Architecture

The proof-of-concept was a monolithic application running on an application server (AS). We scaled the application horizontally by adding more AS instances and having a load balancer distribute the incoming requests evenly among them.
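To illustrate the idea, an Nginx-style load balancer configuration could look roughly like the sketch below. The article doesn't document the actual load balancer setup, so the upstream block and hostnames here are purely placeholders:

upstream app_servers {
    # Incoming requests are spread across the AS instances,
    # round-robin by default.
    server as1.internal.example.com;
    server as2.internal.example.com;
    server as3.internal.example.com;
}

server {
    listen 80;
    server_name www.example.com;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}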

User-generated content, including photos, videos and avatars, was stored on a single NAS server with plenty of storage. The NAS was mounted as a network drive on the AS instances.

Nginx was used to serve the requests on both the NAS and AS servers.
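On the AS side, a setup along these lines would have Nginx hand the dynamic requests to PHP. The FastCGI socket and document root below are assumptions for illustration, not the actual configuration:

server {
    listen 80;
    server_name www.example.com;
    root /var/www/app/public;

    # Dynamic pages are rendered by PHP behind FastCGI;
    # the socket path is a placeholder.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php-fpm.sock;
    }
}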

The application handled read and write database operations independently, which made it possible to scale the read replicas horizontally and to shard the data later on.
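As a rough sketch of that read/write split, the application could keep two database handles and route queries accordingly. The hostnames, credentials, and helper names below are made up for illustration; the original code is not shown here:

function db_writer() {
    static $writer = null;
    if ($writer === null) {
        // Writes always go to the primary database.
        $writer = new mysqli("db-primary.internal", "app", "secret", "social");
    }
    return $writer;
}

function db_reader() {
    static $reader = null;
    if ($reader === null) {
        // Any of the read replicas; more replicas can be added as traffic grows.
        $reader = new mysqli("db-replica-1.internal", "app", "secret", "social");
    }
    return $reader;
}

// Reads go to a replica...
$profile = db_reader()->query("SELECT name FROM users WHERE id = 42");

// ...while writes hit the primary.
db_writer()->query("UPDATE users SET last_seen = NOW() WHERE id = 42");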

The overall high-level architecture looked something like this:

High-level architecture

The problem

Back to that quiet Wednesday morning. Traffic was almost at its daily minimum when the alarms went off: the response times were increasing, yet the rest of the metrics didn't show any anomalies. We started looking into it by taking one of the application servers out of the active load balancer pool and connecting to it directly. That was the peak of our investigation tooling at the time.

Not long after, we found the culprit in the following function, namely the line containing file_exists():

function get_avatar_url($user_id) {
    $path = "/mnt/nas/avatar/" . $user_id . ".png";
    if ( file_exists($path) ) {
        return "http://nas.example.com/avatar/" . $user_id . ".png";
    }
    return "http://nas.example.com/avatar/default.png";
}

The NAS drive was mounted on each application server over the network, so every file_exists() check had to go over the network to the NAS. It seems that early that morning, when someone uploaded a new avatar, we hit the file and socket descriptor limits. Since the limits were already maxed out, we needed an alternative solution.

The solution

We decided to handle the avatar-existence check directly on the NAS. With this change, requesting nas.example.com/avatar/does-not-exist.png returned the default avatar instead of a 404 Not Found. The Nginx configuration was changed to something like this:

server {
    server_name nas.example.com;

+    location ~ ^/avatar/.+\.png$ {
+        try_files $uri /avatar/default.png;
+    }

    location / {
        try_files $uri =404;
    }
}

Second, we modified the code to act as if the avatar always exists:

function get_avatar_url($user_id) {
    return "http://nas.example.com/avatar/" . $user_id . ".png";
}

The next day, traffic increased now that the networking bottleneck was gone. The fix also bought us time to properly implement the user-generated content storage and logic in the upcoming months.

Lessons learned

When we inherited the system, one of the first things we did was to set up monitoring and automate as much of the deployment process as possible. It was a lot of work and quite an investment, but it paid off with this incident alone. Monitoring helped us identify the service degradation the moment it happened and steered us in the right direction. Being able to deploy the changes quickly was another win.

The whole team was equally engaged in finding the cause and implementing the solution. Having everyone understand how the system worked proved invaluable during the incident, as well as later on.