Fluentd failed to start because of “Address already in use — bind(2) for 0.0.0.0:24224”

Mitsunori Komatsu
3 min readMay 9, 2021

--

I restarted Fluentd (td-agent) process a few times per day these days for some reasons. But unfortunately, I’ve occasionally seen Fluentd worker kept failing to start due to “Address already in use — bind(2) for 0.0.0.0:24224”. But no processes was listening to the port 24224 at that time. Interesting! I looked into the issue.

Investigation

The log file of Fluentd was as follows.

2021-05-06 02:18:53 +0000 [error]: #0 unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24224"
:
2021-05-06 02:18:53 +0000 [info]: Worker 0 finished unexpectedly with status

I also checked TCP ports. Fluentd worker (in_forword plugin) is supposed to listen TCP port 24224, but no process was listening to the port

$ ss -tl 'sport = 24224'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
$

So… why does Fluentd worker fail to bind the TCP port 24224?

Interestingly a client library (https://github.com/komamitsu/fluency in this case) connected to itself through port 24224.

$ sudo lsof -i | grep 24224
java 19973 nobody 62u IPv4 1092503 0t0 TCP localhost:24224->localhost:24224 (ESTABLISHED)

But how did it happen…?

TCP self-connection

This behavior may be known as “TCP self-connection”

When a server socket port (24224 in this case) is within ephemeral ports, client socket could use the port as src port and connect to the same port (dst port) if the port number is available. I guessed this happened during the restart of td-agent service.

Can we reproduce this issue?

I tried to reproduce this issue.

Stop Fluentd (td-agent service) in advance

$ sudo systemctl stop td-agent
$ sudo systemctl status td-agent
● td-agent.service - td-agent: Fluentd based data collector for Treasure Data
Loaded: loaded (/lib/systemd/system/td-agent.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2021-05-07 04:07:05 UTC; 23min ago
:

Reproduce TCP self-connection on client side

I wrote this Ruby script

require 'socket'cnt = 0
loop do
if cnt % 1000 == 0
puts "cnt: #{cnt}"
end
# Retrying forever so that client socket uses the same port as source port
begin
s = TCPSocket.new 'localhost', 24224
# Write something to the socket
s.puts "hello"
# Read from the socket
puts "Read: '#{s.gets.chop}'"
puts 'Connected to itself. Waiting for a while'
sleep 3600
rescue Errno::ECONNREFUSED
# Expected
end
cnt += 1
end

and checked the ephemeral port range

$ cat /proc/sys/net/ipv4/ip_local_port_range
4096 65635

then executed the script

$ ruby tcp_self_conn.rb 
cnt: 0
cnt: 1000
cnt: 2000
cnt: 3000
:
cnt: 27000
cnt: 28000
cnt: 29000
Read: 'hello'
Connected to itself. Waiting for a while

The TCP port was like this

$ sudo lsof -i | grep 24224
ruby 15666 td-admin 7u IPv4 14499424 0t0 TCP localhost:24224->localhost:24224 (ESTABLISHED)

The client socket has connected to itself. Looks like I successfully reproduced it?

After the TCP self connection, I restarted Fluentd and checked the log file

2021-05-07 02:41:38 +0000 [info]: #0 starting fluentd worker pid=25233 ppid=23137 worker=0
2021-05-07 02:41:38 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
2021-05-07 02:41:38 +0000 [error]: #0 unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24224"

Bingo! 🎉 I can’t say the restart failure I experienced was 100% occurred because of the same cause, but I can say this TCP self connection can be the cause.

Remediation

I think we have the following options

  1. Change Fluentd’s in_forward port to less than the default ephemeral port range
  2. Change the ephemeral port range so that port 24224 gets outside ephemeral port range
  3. Make client library reconnect at some interval. This doesn’t solve the root cause but just mitigates this issue. BTW, according to the test result of tcp_self_conn.rb, client application can’t detect the weird situation since it can write data through the socket without any error

I tried option 2 and it worked

$ echo "24228 65535" | sudo tee /proc/sys/net/ipv4/ip_local_port_range
24228 65535
td-admin@ip-172-23-169-9:~$ ruby tcp_self_conn.rb
cnt: 0
cnt: 1000
cnt: 2000
cnt: 3000
cnt: 4000
cnt: 5000
cnt: 6000
cnt: 7000
cnt: 8000
cnt: 9000
cnt: 10000
:
cnt: 192000
cnt: 193000

If you are facing this issue, one of the above options may work.

--

--