Tuesday, March 29, 2016

Optimising Nginx, Node.js and networking for heavy workloads

Used in conjunction, Nginx and Node.js are the perfect partnership for high-throughput web applications. They’re both built on event-driven design principles and can scale far beyond the classic C10K limitations that afflict standard web servers such as Apache. The out-of-the-box configuration will get you pretty far, but when you need to start serving upwards of thousands of requests per second on commodity hardware, there’s some extra tweaking you must perform to squeeze every ounce of performance out of your servers.
This article assumes you’re using Nginx’s HttpProxyModule to proxy your traffic to one or more upstream Node.js servers. We’ll cover tuning sysctl settings in Ubuntu 10.04 and above, as well as Node.js application and Nginx tuning. You may be able to achieve similar results on a different Debian-based Linux distribution, but YMMV if you’re using something else.

Tuning the network

Meticulous configuration of Nginx and Node.js would be futile without first understanding and optimising the transport mechanism over which their traffic is sent. Most commonly, Nginx is connected to your web clients and to your upstream applications via TCP sockets.
Your system imposes a variety of thresholds and limits on TCP traffic, dictated by its kernel parameter configuration. The default settings are designed to accommodate generic networking use, not the high volumes of short-lived connections handled by a web server.
The parameters listed here are the main candidates for tuning a server’s TCP throughput. To have them take effect, you can drop them into your /etc/sysctl.conf file, or into a new config file such as /etc/sysctl.d/99-tuning.conf, and run sysctl -p to have the kernel pick them up; there’s a quick example after the list below. We use a sysctl cookbook to do the hard work.
Please note the following values are guidelines, and you should be able to use them safely, but you are encouraged to research what each one means so you can choose a value appropriate for your workload, hardware and use-case.
net.ipv4.ip_local_port_range='1024 65000'
net.ipv4.tcp_tw_reuse='1'
net.ipv4.tcp_fin_timeout='15'
net.core.netdev_max_backlog='4096'
net.core.rmem_max='16777216'
net.core.somaxconn='4096'
net.core.wmem_max='16777216'
net.ipv4.tcp_max_syn_backlog='20480'
net.ipv4.tcp_max_tw_buckets='400000'
net.ipv4.tcp_no_metrics_save='1'
net.ipv4.tcp_rmem='4096 87380 16777216'
net.ipv4.tcp_syn_retries='2'
net.ipv4.tcp_synack_retries='2'
net.ipv4.tcp_wmem='4096 65536 16777216'
vm.min_free_kbytes='65536'
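
For example, assuming you’ve saved the settings to /etc/sysctl.d/99-tuning.conf as suggested above, applying and spot-checking them looks like this:
sudo sysctl -p /etc/sysctl.d/99-tuning.conf
sysctl net.ipv4.tcp_fin_timeout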

Highlighting a few of the important ones…
net.ipv4.ip_local_port_range
To serve a client request via an upstream application, Nginx must open at least two TCP connections: one to the client, and one to the upstream. Under heavy traffic this can rapidly saturate the system’s available port capacity. The net.ipv4.ip_local_port_range directive widens the range of ephemeral ports well beyond the default, giving the system room to allocate many more ports. If you’re seeing errors in your /var/log/syslog such as “possible SYN flooding on port 80. Sending cookies”, it means the server is receiving new connections faster than it can accept them and its queue of pending connections is overflowing; widening the port range, along with the larger backlog settings above, helps alleviate the symptom.
net.ipv4.tcp_tw_reuse
When the server has to cycle through a high volume of TCP connections, it can build up a large number of connections in TIME_WAIT state. TIME_WAIT means a connection is closed but the allocated resources are yet to be released. Setting this directive to 1 tells the kernel it may reuse sockets in TIME_WAIT state for new outgoing connections (such as those Nginx opens to its upstreams) when it is safe to do so. This is much cheaper than setting up a brand new connection.
net.ipv4.tcp_fin_timeout
The number of seconds an orphaned connection (one no longer attached to any application) will remain in the FIN-WAIT-2 state before the kernel tears it down. Lowering this value means resources held by half-closed connections are released sooner.
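
You can check what your system is currently using for any of these parameters by querying them with sysctl:
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse net.ipv4.tcp_fin_timeout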

How to check connection status

Using netstat:
netstat -tan | awk '{print $6}' | sort | uniq -c
Using ss:
ss -s

Nginx

As load on our web servers continually increased, we started hitting some odd limitations in our Nginx cluster. I noticed connections were being throttled or dropped, and the kernel was complaining about SYN flooding with the error message I mentioned earlier. Frustratingly, I knew the servers could handle more, because load average and CPU usage were negligible.
On further investigation, I tracked down an extraordinarily high number of connections idling in TIME_WAIT state. This was the ss output from one of the servers:
ss -s
Total: 388 (kernel 541)
TCP:   47461 (estab 311, closed 47135, orphaned 4, synrecv 0, timewait 47135/0), ports 33938

Transport Total     IP        IPv6
*          541       -         -        
RAW        0         0         0        
UDP        13        10        3        
TCP        326       325       1        
INET       339       335       4        
FRAG       0         0         0

47,135 connections in TIME_WAIT! Moreover, ss indicates that they are all closed connections. This suggests the server is burning through a large portion of the available port range, which implies that it is allocating a new port for each connection it’s handling. Tweaking the networking settings helped firefight the problem a bit, but the socket range was still getting saturated.
After some digging around, I uncovered some documentation about an upstream keepalive directive. The docs state:
Sets the maximum number of idle keepalive connections to upstream servers that are retained in the cache per one worker process
This is interesting. In theory, this will help minimise connection wastage by pumping requests down connections that have already been established and cached. The documentation also states that the proxy_http_version directive should be set to “1.1” and the “Connection” header cleared. On further research, it’s clear this is a good idea, since HTTP/1.1 uses TCP connections far more efficiently than HTTP/1.0, which is what Nginx defaults to for proxied connections.
Making both of these changes, our upstream config looks more like:
upstream backend_nodejs {
  server nodejs-3:5016 max_fails=0 fail_timeout=10s;
  server nodejs-4:5016 max_fails=0 fail_timeout=10s;
  server nodejs-5:5016 max_fails=0 fail_timeout=10s;
  server nodejs-6:5016 max_fails=0 fail_timeout=10s;
  keepalive 512;
}

I made the recommended changes to the proxy directives in the server stanza. While I was at it, I added a proxy_next_upstream directive to skip out-of-service servers (helping with zero-downtime deploys), tweaked the client keepalive_timeout, and disabled all logging. The config now looks more like:
server {
  listen 80;
  server_name fast.gosquared.com;

  client_max_body_size 16M;
  keepalive_timeout 10;

  location / {
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_set_header   Connection "";
    proxy_http_version 1.1;
    proxy_pass http://backend_nodejs;
  }

  access_log off;
  error_log /dev/null crit;
}

When I pushed out the new configuration to the nginx cluster, I noticed a 90% reduction in occupied sockets. Nginx is now able to use far fewer connections to send many requests. Here is the new ss output:
ss -s

Total: 558 (kernel 604)
TCP:   4675 (estab 485, closed 4183, orphaned 0, synrecv 0, timewait 4183/0), ports 2768

Transport Total     IP        IPv6
*          604       -         -        
RAW        0         0         0        
UDP        13        10        3        
TCP        492       491       1        
INET       505       501       4   

Node.js

Thanks to its event-driven design and asynchronous handling of I/O, Node.js is geared up to handle high volumes of connections and requests out of the box. There are still additional tweaks and observations that can be made, and we’re going to focus on Node.js processes.
Node is single-threaded and doesn’t automatically make use of more than a single core of your (most likely multi-core) machine. This means that unless you design your application differently, it won’t take full advantage of the capacity the server hosting it has to offer.

Clustering node processes

It’s possible to modify your application so that it forks a number of processes that all accept() connections on the same port, load balancing the work across multiple CPU cores. Node has a core module called cluster that offers all of the tools you need to achieve this, although it’ll take a bit of elbow grease to work it into your application. If you’re using express, eBay have built a module for it called cluster2.
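
As a rough, minimal sketch of the approach (not our production code), a clustered HTTP server built on the core cluster module might look something like this, using port 5016 to match the upstream config above:

var cluster = require('cluster');
var http = require('http');
var os = require('os');

// Spawn N-1 workers, leaving one core free for the kernel and other
// tasks (see the note on context switching below).
var numWorkers = Math.max(os.cpus().length - 1, 1);

if (cluster.isMaster) {
  for (var i = 0; i < numWorkers; i++) {
    cluster.fork();
  }

  // Respawn any worker that dies so capacity stays constant.
  cluster.on('exit', function (worker) {
    console.log('worker ' + worker.process.pid + ' died, respawning');
    cluster.fork();
  });
} else {
  // Every worker shares the same listening socket; the OS (or Node's
  // round-robin scheduler) distributes incoming connections between them.
  http.createServer(function (req, res) {
    res.end('hello from worker ' + process.pid + '\n');
  }).listen(5016);
}

Each worker is an ordinary Node process, so the rest of your application code runs unchanged inside the worker branch.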

Beware of the context switch

When running multiple processes on your machine, try to make sure each CPU core is kept busy by a single application process at a time. As a general rule, you should look to spawn N-1 application processes, where N is the number of available CPU cores. That way, each process is guaranteed a good slice of one core, and there’s one spare for the kernel scheduler to run other server tasks on. Additionally, try to make sure the server runs little or no work other than your Node.js application, so processes don’t fight for CPU.
We made the mistake of deploying two busy Node.js applications to the same servers, with both apps spawning N-1 processes each. The applications’ processes began fiercely competing for CPU, and CPU load and usage increased dramatically. Even though we were running these on beefy 8-core servers, we were paying a noticeable penalty for context switching: the behaviour whereby the kernel suspends one running task, saving its state, so it can load and execute another. After simply reducing the number of processes the applications spawned so that they shared the cores equally, load dropped significantly:
[Graph: load reduction after cutting the number of Node.js processes per server]
Notice how load decreases as the number of processes running (the blue line) crosses beneath the number of CPU cores (the red line). We noticed a similar improvement across all other servers. Since the amount of work spread between all the servers remained constant, this efficiency must have been due to reduced context switch overhead.
