Tuning the TCP stack | System Administrator

Transmission Control Protocol and Internet Protocol (TCP/IP) is a standard set of protocols used by every network-enabled device. TCP/IP defines the standards to communicate over a network. TCP/IP is a set of protocols and is divided in two parts: TCP and IP. IP defines the rules for IP addressing and routing packets over network and provides an identity IP address to each host on the network. TCP deals with the interconnection between two hosts and enables them to exchange data over network. TCP is a connection-oriented protocol and controls the ordering of packets, retransmission, error detection, and other reliability tasks.

TCP stack is designed to be very general in nature so that it can be used by anyone for any network conditions. Servers use the same TCP/IP stack as used by their clients. For this reason, the default values are configured for general uses and not optimized for high-load server environments. New Linux kernel provides a tool called sysctl that can be used to modify kernel parameters at runtime without recompiling the entire kernel. We can use sysctl to modify and TCP/IP parameters to match our needs.

In this recipe, we will look at various kernel parameters that control the network. It is not required to modify all parameters listed here. You can choose ones that are required and suitable for your system and network environment.

It is advisable to test these modifications on local systems before doing any changes on live environment. A lot of these parameters directly deal with network connections and related CPU and memory uses. This can result in connection drops and/or sudden increases in resource use. Make sure that you have read the documentation for the parameter before you change anything.

Also, it is a good idea to set benchmarks before and after making any changes to sysctl parameters. This will give you a base to compare improvements, if any. Again, benchmarks may not reveal all the effects of parameter changes. Make sure that you have read the respective documentation.

Quick overview of the steps taken during data transmission and reception:

1. The application first writes the data to a socket which in turn is put in the transmit buffer.
2. The kernel encapsulates the data into a PDU – protocol data unit.
3. The PDU is then moved onto the per-device transmit queue.
4. The NIC driver then pops the PDU from the transmit queue and copies it to the NIC.
5. The NIC sends the data and raises a hardware interrupt.
6. On the other end of the communication channel the NIC receives the frame, copies it on the receive buffer and raises a hard interrupt.
7. The kernel in turn handles the interrupt and raises a soft interrupt to process the packet.
8. Finally the kernel handles the soft interrupt and moves the packet up the TCP/IP stack for decapsulation and puts it in a receive buffer for a process to read from.



How your operating system deals with data

Data destined to a particular system is first received by the NIC and is stored in the ring buffer of the reception (RX) present in the NIC, which also has TX (for data transmission). Once the packet is accessible to the kernel, the device driver raises softirq (software interrupt), which makes the DMA (data memory access) of the system send that packet to the Linux kernel. The packet data in the Linux kernel gets stored in the sk_buff data structure to hold the packet up to MTU (maximum transfer unit). When all the packets are filled in the kernel buffer, they get sent to the upper processing layer – IP, TCP or UDP. The data then gets copied to the preferred data receiving process.


Note: Initially, a hard interrupt is raised by the device driver to send data to the kernel, but since this is an expensive task, it is replaced by a software interrupt. This is handled by the NAPI (new API), which makes the processing of incoming packets more efficient by putting the device driver in polling mode.

Benchmark before tuning:

To get maximum or even improved network performance, our goal is to increase the throughput (data transfer rate) and latency of our network’s receiving and sending capabilities. But tuning without measurement (effected values) is useless as well as dangerous. So, it is always advisable to benchmark every change you make, because any change that doesn’t result in an improvement is pointless and may degrade performance. There are a few benchmarking tools available for networks, of which the following two can be the best choices.

Netperf: This is a perfect tool to measure the different aspects of a network’s performance. Its primary focus is on data transfer using either TCP or UDP. It requires a server and a client to run tests. The server should be run by a daemon called netserver, through which the client (testing system) gets connected.
Remember that the netperf default runs on Port 12865 and shows results for an IPv4 connection. Netperf supports a lot of different types of tests but for simplicity, you can use default scripts located at /usr/share/doc/netperf/examples. It is recommended that you read the netperf manual for clarity on how to use it.

Iperf: This tool performs network throughput tests and has to be your main tool for testing. It also needs a server and client, just like netperf.

To run iperf as the server, use $iperf -s -p port, where ‘port’ can be any random port.
On the client side, to connect to the server, use $iperf -c server_ip -p server_port. The result has been tested on my localhost as both server and client, so the output is not appropriate.

Start the tuning now

Before proceeding to tune your system, you must know that every system differs, whether it is in terms of the CPU, memory, architecture or other hardware configuration. So a tunable parameter may or may not enhance the performance of your system as there are no generic tweaks for all systems. Another important point to remember is that you must develop a good understanding of the entire system. The only part of a system that should ever be tuned is the bottleneck. Improving the performance of any other component other than the bottleneck will result in no visible performance gains to the end user. But the focus of this article is on the general, potential bottlenecks you might come across; even on healthy systems, these tweaks may give improved performance.
One obvious situation to tune your network for is when you start to drop received packets or transmit/ receive the message ‘error has occurred’. This can happen when your packet holding structures in the NIC, device driver, kernel or network layer are not able to handle all the received packets or are not able to process these fast enough. The bottleneck in this case can be related to any of the following:
1) NIC
2) Soft interrupt issued by a device driver
3) Kernel buffer
4) The network layer (IP, TCP or UDP)

It’s better to find the exact bottleneck from among these options and then tune only that area. Or you can apply the methods given below, one by one, and then find which changes improve performance.

To find out the packet drop/error details, you can use the ip utility.
More specific results can be viewed by using ethtool, as follows:

To find out the packet drop/error details, you can use the ip utility.

More specific results can be viewed by using ethtool, as follows:

ethtool-S eth0

Tuning the network adapter (NIC)

Jumbo frames: A single frame can carry a maximum of 1500 bytes by default as shown by the ip addr command.

enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
 link/ether 08:00:27:5d:4e:57 brd ff:ff:ff:ff:ff:ff
 inet brd scope global enp0s3
 valid_lft forever preferred_lft forever

Here, for eth0, which is the default network interface for the Ethernet, the MTU (maximum transfer unit) value is set to 1500. This value is defined for a single slot. Ethernet connections of 10Gbps or more may need to increase the value to 9000. These large frames are called Jumbo frames.To get Ethernet to use Jumbo frames, you can use ifconfig, as follows:

$ifconfig eth0 mtu 9000

Note: Setting the value to 9000 bytes doesn’t make all frame sizes that large, but this value will be only used depending on your Ethernet requirements.

Interrupt Coalescence (IC)
After the network adapter receives the packets, the device driver issues a single hard interrupt followed by soft interrupts (handled by NAPI). Interrupt Coalescence is the number of packets the network adapter receives before issuing a hard interrupt. Changing the value to cause interrupts fast can lead to lots of overhead and, hence, decrease performance, but having a high value may lead to packet loss. By default, your settings are in adaptive IC mode, which auto balances the value according to network traffic. But since the new kernels use NAPI, which makes fast interrupts much less costly (performance wise), you can disable this feature.

$ ethtool -c eth0

To disable it, use the following command:

$ethtool -C eth0 adaptive-rx off

Caution: If your system generates high traffic as in the case of a hosting Web server, you must not disable it.

Pause frames:

A pause frame is the duration of the pause, in milliseconds, which the adapter issues to the network switch to stop sending packets if the ring buffer gets full. If the pause frame is enabled, then loss of packets will be minimal.

To view the pause frame setting, use the following command:

$ethtool -a eth0

To enable pause frames for both RX (receiving) and TX (transferring), run the command given below:

$ethtool -A eth0 rx on tx on

Software interrupts:

SoftIRQ timeout: If the software interrupt doesn’t process packets for a long time, it may cause the NIC buffer to overflow and, hence, can cause packet loss. netdev_budget shows the default value of the time period for which softirq should run.

$sysctl net.core.netdev_budget

The default value should be 300 and may have to be increased to 600 if you have a high network load or a 10Gbps (and above) system.

# sysctl -w net.core.netdev_budget=600

IRQ balance: Interrupt requests are handled by different CPUs in your system in a round robin manner. Due to regular context switching for interrupts, the request handling timeout may increase. The irqbalance utility is used to balance these interrupts across the CPU and has to be turned off for better performance.

$service irqbalance stop

If your system doesn’t deal with high data receiving and transmission (as in the case of a home user), this configuration will be helpful to increase performance. But for industrial use, it has to be tweaked for better performance, as stopping it may cause high load on a single CPU – particularly for services that cause fast interrupts like ssh. You can also disable irqbalance for particular CPUs. I recommend that you do that by editing /etc/sysconfig/irqbalance.

Kernel buffer

The socket buffer size is:


These parameters show the default and maximum write (receiving) and read (sending) buffer size allocated to any type of connection. The default value set is always a little low since the allocated space is taken from the RAM. Increasing this may improve the performance for systems running servers like NFS. Increasing them to 256k/4MB will work best, or else you have to benchmark these values to find the ideal value for your system’s configuration.

$sysctl  -w net.core.wmem_default=262144.
$sysctl -w net.core.wmem_max=4194304
$sysctl -w net.core.rmem_default=262144
$sysctl -w net.core.rmem_max=4194304

Every system has different values and increasing the default value may improve its performance but you have to benchmark for every value change.

Maximum queue size:

Before processing the data by the TCP/UDP layer, your system puts the data in the kernel queue. The net.core.netdev_max_backlog value specifies the maximum number of packets to put in the queue before delivery to the upper layer. The default value is not enough for a high network load, so simply increasing this value cures the performance drain due to the kernel. To see the default value, use sysctl with $sysctl net.core.netdev_max_backlog. The default value is 1000 and increasing it to 3000 will be enough to stop packets from being dropped in a 10Gbps (or more) network.

TCP/UDP processing:

The TCP buffer size is:


These values are an array of three integers specifying the minimum, average and maximum size of the TCP read and send buffers, respectively.


The net.core.rmem_max setting defines the maximum receive socket buffer size in bytes.

There are a few different settings that all appear to be very similar. You can see that on Ubuntu 15.04 (3.18.0-13-generic) the default value for net.core.rmem_max is 212992. The default and max values are the same in this case. Raising this to a larger value will increase the buffer size, but this can have nasty effects in terms of “buffer bloat”.

A. net.core.wmem_max:

tcp_wmem (since Linux 2.4) This is a vector of 3 integers: [min, default, max].

These parameters are used by TCP to regulate send buffer sizes. TCP dynamically adjusts the size of the send buffer from the default values listed below, in the range of these values, depending on memory available.

  • min Minimum size of the send buffer used by each TCP socket. The default value is the system page size. (On Linux 2.4, the default value is 4K bytes.)

This value is used to ensure that in memory pressure mode, allocations below this size will still succeed. This is not used to bound the size of the send buffer declared using SO_SNDBUF on a socket.

  • default The default size of the send buffer for a TCP socket. This value overwrites the initial default buffer size from the generic global net.core.wmem_default defined for all protocols.

The default value is 16K bytes.If larger send buffer sizes are desired, this value should be increased (to affect all sockets).

To employ large TCP windows, the /proc/sys/net/ipv4/tcp_window_scaling must be set to a non-zero value (default).

  • max The maximum size of the send buffer used by each TCP socket. This value does not override the value in /proc/sys/net/core/wmem_max. This is not used to limit the size of the send buffer declared using SO_SNDBUF on a socket.

The default value is calculated using the formula: max(65536, min(4MB, tcp_mem[1]*PAGE_SIZE/128))

On Linux 2.4, the default value is 128K bytes, lowered 64K depending on low-memory systems.)

II. net.ipv4.tcp_rmem:

tcp_rmem (since Linux 2.4) This is a vector of 3 integers: [min, default, max].

These parameters are used by TCP to regulate receive buffer sizes. TCP dynamically adjusts the size of the receive buffer from the defaults listed below, in the range of these values, depending on memory available in the system.

  • min Minimum size of the receive buffer used by each TCP socket. The default value is the system page size. On Linux 2.4, the default value is 4K, lowered to PAGE_SIZE bytes in low-memory systems.

This value is used to ensure that in memory pressure mode, allocations below this size will still succeed. This is not used to bound the size of the receive buffer declared using SO_RCVBUF on a socket.

  • default The default size of the receive buffer for a TCP socket. The default value is 87380 bytes. (On Linux 2.4, this will be lowered to 43689 in low-memory systems.)

This value overwrites the initial default buffer size from the generic global net.core.rmem_default defined for all protocols.

If larger receive buffer sizes are desired, this value should be increased (to affect all sockets). To employ large TCP windows, the net.ipv4.tcp_window_scaling must be enabled (default).

  • max The maximum size of the receive buffer used by each TCP socket. The default value is calculated using the formula: max(87380, min(4MB, tcp_mem[1]*PAGE_SIZE/128)). (On Linux 2.4, the default is 87380*2 bytes, lowered to 87380 in low-memory systems).

This value does not override the global net.core.rmem_max. This is not used to limit the size of the receive buffer declared using SO_RCVBUF on a socket.


  • tcp_max_syn_backlog (integer; default: see below; since Linux 2.2)
  • The maximum number of queued connection requests which have still not received an acknowledgement from the connecting client. If this number is exceeded, the kernel will begin dropping requests.
  • The default value of 256 is increased to 1024 when the memory present in the system is adequate or greater (>= 128Mb), and reduced to 128 for those systems with very low memory (<= 32Mb).
  • It is recommended that if this needs to be increased above 1024, TCP_SYNQ_HSIZEin include/net/tcp.h be modified to keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog,and the kernel be recompiled.

To modify this value you can add the following line to /etc/sysctl.conf

net.ipv4.tcp_max_syn_backlog = $integer


Note: Values are in pages. To see the page size, use the command $getconf PAGE_SIZE

For the latest kernel (after 2.6), there is a feature for auto tuning, which dynamically adjusts the TCP buffer size till the maximum value is attained. This feature is turned on by default and I recommend that it be left turned on. You can check it by running the following command:

$cat /proc/sys/net/ipv4/tcp_moderate_rcvbuf

Then, to turn it on in case it is off, use the command given below:

$sysctl -w net.ipv4.tcp_moderate_rcvbuf=1

This setting allocates space up to the maximum value, in case you need to increase the maximum buffer size if you find the kernel buffer is your bottleneck. The average value need not be changed, but the maximum value will have to be set to higher than the BDP (bandwidth delay product) for maximum throughput.

Maximum pending connections:(net.core.somaxconn)

An application can specify the maximum number of pending requests to put in queue before processing one connection. When this value reaches the maximum, further connections start to drop out. For applications like a Web server, which issue lots of connections, this value has to be high for these to work properly. To see the maximum connection backlog, run the following command:

SOMAXCONN can be used as a statement which specifies the max number of connection requests that can be queued for any given listening socket.

If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128. In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.

/proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of request. For example, to support a burst of 1024 requests, set somaxconn to 1024.

$ sysctl net.core.somaxconn

The default value is 128 and can increase to much secure value.

$sysctl -w net.core.somaxconn=2048

TCP timestamp

TCP timestamp is a TCP feature that puts a timestamp header in every packet to calculate the precise round trip time. This feature causes a little overhead and can be disabled, if it is not necessary to increase CPU performance to process packets.

$sysctl -w net.ipv4.tcp_timestamps=0


TCP Selective Acknowledgements (SACK) is a feature that allows TCP to send ACK for every segment stream of packets, as compared to the traditional TCP that sends ACK for contiguous segments only. This feature can cause a little CPU overhead; hence, disabling it may increase network throughput.

# sysctl -w net.ipv4.tcp_sack=0

TCP FIN timeout

In a TCP connection, both sides have to close the connection independently. Linux TCP sends a FIN packet to close the connection and waits for FINACK till the defined time mentioned in…


The default value (60) is quite high, and can be decreased to 20 or 30 to let the TCP close the connection and free resources for another connection.

$sysctl -w net.ipv4.tcp_fin_timeout=20

UDP buffer size

UDP generally doesn’t need tweaks to improve performance, but you can increase the UDP buffer size if UDP packet losses are happening.

$ sysctl net.core.rmem_max

Miscellaneous tweaks

IP ports: net.ipv4.ip_local_port_range shows all the ports available for a new connection. If no port is free, the connection gets cancelled. Increasing this value helps to prevent this problem.
To check the port range, use the command given below:

$sysctl net.ipv4.ip_local_port_range

You can increase the value by using the following command:

$sysctl -w net.ipv4.ip_local_port_range=’20000 60000’


tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)

Allow to reuse TIME_WAIT sockets for new connections when it is safe from protocol viewpoint. It should not be changed without advice/request of technical experts.

If you do wish to enable this option you can do so by modifying sysctl.conf

net.ipv4.tcp_tw_reuse = 1

TCP Memory Concept:-

The kernel keeps track of the memory allocated to TCP in multiple of pages, not in bytes. This is a first bit of confusion that a lot of people run into because some settings are in bytes and other are in pages (and most of the time 1 page = 4096 bytes).

cat /proc/sys/net/ipv4/tcp_mem
3093984 4125312 6187968

The values are in number of pages.

They get automatically sized at boot time (values above are for a machine with 32GB of RAM). They mean:

  1. When TCP uses less than 3093984 pages (11.8GB), the kernel will consider it below the “low threshold” and won’t bother TCP about its memory consumption.
  2. When TCP uses more than 4125312 pages (15.7GB), enter the “memory pressure” mode.
  3. The maximum number of pages the kernel is willing to give to TCP is 6187968 (23.6GB). When we go above this, we’ll start seeing thBIN%e “Out of socket memory” error and Bad Things will happen.
$ cat /proc/net/sockstat
sockets: used 14565
TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894
UDP: inuse 11 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

Now let’s find how much of that memory TCP actually uses.

The last value on the second line (mem 1894) is the number of pages allocated to TCP. In this case we can see that 1894 is way below 6187968, so there’s no way we can possibly be running out of TCP memory. So in this case, the “Out of socket memory” error was caused by the number of orphan sockets.


Tuned is a very useful tool to auto-configure your system according to different profiles. Besides doing manual tuning, tuned makes all the required tweaks for you dynamically. You can download it by using the following command:

$apt-get install tuned

To start the daemon, use the command given below:

$tuned -d

To check the available profiles, you can see man tuned-profiles. You can create your own profile but for the best network performance, tuned already has some interesting profile options like network-throughput and network-latency. To set a tuned profile, use the following command:

$tuned -p network-throughput

The following table summarizes the recommended TCP/IP settings for Linux. These settings are in the /etc/sysctl.conf file

Setting Recommended Value Rationale
net.core.netdev_max_backlog 30000 Set maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. Recommended setting is for 10GbE links. For 1GbE links use 8000.
net.core.wmem_max 67108864 Set max to 16MB (16777216) for 1GbE links and 64MB (67108864) for 10GbE links.
net.core.rmem_max 67108864 Set max to 16MB (16777216) for 1GbE links and 64MB (67108864) for 10GbE links.
net.ipv4.tcp_congestion_control htcp There seem to be bugs in both bic and cubic (the default) for a number of versions of the Linux kernel up to version 2.6.33. The kernel version for Redhat 5.x is 2.6.18-x and 2.6.32-x for Redhat 6.x
net.ipv4.tcp_congestion_window 10 This is the default for Linux operating systems based on Linux kernel 2.6.39 or later.
net.ipv4.tcp_fin_timeout 10 This setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. The default value is 60. The recommened setting lowers its to 10. You can lower this even further, but too low, and you can run into socket close errors in networks with lots of jitter.
net.ipv4.tcp_keepalive_interval 30 This determines the wait time between isAlive interval probes. Default value is 75. Recommended value reduces this in keeping with the reduction of the overall keepalive time.
net.ipv4.tcp_keepalive_probes 5 How many keepalive probes to send out before the socket is timed out. Default value is 9. Recommended value reduces this to 5 so that retry attempts will take 2.5 minutes.
net.ipv4.tcp_keepalive_time 600 Set the TCP Socket timeout value to 10 minutes instead of 2 hour default. With an idle socket, the system will wait tcp_keepalive_time seconds, and after that try tcp_keepalive_probes times to send a TCP KEEPALIVE in intervals of tcp_keepalive_intvl seconds. If the retry attempts fail, the socket times out.
net.ipv4.tcp_low_latency 1 Configure TCP for low latency, favoring low latency over throughput
net.ipv4.tcp_max_orphans 16384 Limit number of orphans, each orphan can eat up to 16M (max wmem) of unswappable memory
net.ipv4.tcp_max_tw_buckets 1440000 Maximal number of timewait sockets held by system simultaneously. If this number is exceeded time-wait socket is immediately destroyed and warning is printed. This limit exists to help prevent simple DoS attacks.
net.ipv4.tcp_no_metrics_save 1 Disable caching TCP metrics on connection close
net.ipv4.tcp_orphan_retries 0 Limit number of orphans, each orphan can eat up to 16M (max wmem) of unswappable memory
net.ipv4.tcp_rfc1337 1 Enable a fix for RFC1337 – time-wait assassination hazards in TCP
net.ipv4.tcp_rmem 10240 131072 33554432 Setting is min/default/max. Recommed increasing the Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_wmem 10240 131072 33554432 Setting is min/default/max. Recommed increasing the Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_sack 1 Enable select acknowledgments
net.ipv4.tcp_slow_start_after_idle 0 By default, TCP starts with a single small segment, gradually increasing it by one each time. This results in unnecessary slowness that impacts the start of every request.
net.ipv4.tcp_syncookies 0 Many default Linux installations use SYN cookies to protect the system against malicious attacks that flood TCP SYN packets. The use of SYN cookies dramatically reduces network bandwidth, and can be triggered by a running Geode cluster. If your Geode cluster is otherwise protected against such attacks, disable SYN cookies to ensure that Geode network throughput is not affected.
NOTE: if SYN floods are an issue and SYN cookies can’t be disabled, try the following:
net.ipv4.tcp_timestamps 1 Enable timestamps as defined in RFC1323:
net.ipv4.tcp_tw_recycle 1 This enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Should be used with caution with load balancers and not at all when behind a NAT device.
net.ipv4.tcp_tw_reuse 1 This allows reusing sockets in TIME_WAIT state for new connections when it is safe from protocol viewpoint. Default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle. The tcp_tw_reuse setting is particularly useful in environments where numerous short connections are open and left in TIME_WAIT state, such as web servers and loadbalancers.
net.ipv4.tcp_window_scaling 1 Turn on window scaling which can be an option to enlarge the transfer window:

Other settings:-

1.Set the maximum open files limit:

$ ulimit -n   # check existing limits for logged in user
# ulimit -n 65535   # root change values above hard limits

2. To permanently set limits for a user, open /etc/security/limits.conf and add
the following lines at end of the file. Make sure to replace values in brackets, <> :

<username>  soft          nofile            <value>           # soft limits
<username>  hard        nofile            <value>           # hard limits

Save limits.conf and exit. Then restart the user session.

3.Set the maximum number of times the IPV4 packet can be reordered in the TCP
packet stream:
# echo ‘net.ipv4.tcp_reordering=3’ >> /etc/sysctl.conf

Other Links to Read :-




Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s