Go transport 长连接导致 goroutine 泄露
记录一个近期发现的线上服务 goroutine 泄露问题,通过监控发现服务存在缓慢的内存泄露,通过 pprof 工具查询其 goroutine 数量,http://ip:port/debug/pprof/goroutine?debug=1
发现其存在 goroutine 泄露,泄露的 goroutine 达到 20000+,在 debug 页面中可以看到类似如下的展示
# 0x42f8b4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:203
# 0x4cb714 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4cc5aa internal/poll.(*pollDesc).waitRead+0x19a /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4cc58c internal/poll.(*FD).Read+0x17c /usr/local/go/src/internal/poll/fd_unix.go:169
# 0x5414be net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_unix.go:202
# 0x553dad net.(*conn).Read+0x8d /usr/local/go/src/net/net.go:184
# 0x7045b4 net/http.(*persistConn).Read+0x74 /usr/local/go/src/net/http/transport.go:1825
# 0x5c9ae2 bufio.(*Reader).fill+0x102 /usr/local/go/src/bufio/bufio.go:100
# 0x5c9c4e bufio.(*Reader).Peek+0x4e /usr/local/go/src/bufio/bufio.go:138
# 0x705247 net/http.(*persistConn).readLoop+0x1a7 /usr/local/go/src/net/http/transport.go:1978
将 debug 模式改为2,http://ip:port/debug/pprof/goroutine?debug=2
,可以看到如下的调用栈
goroutine 2280278 [IO wait, 435 minutes]:
internal/poll.runtime_pollWait(0x7f36c5cc9aa0, 0x72, 0x18d73e0)
/usr/local/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc01a65b698, 0x72, 0xc0313d2000, 0x1000, 0x1000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc01a65b680, 0xc0313d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1b1
net.(*netFD).Read(0xc01a65b680, 0xc0313d2000, 0x1000, 0x1000, 0xc0002015c0, 0xc020906d98, 0xc020906c20)
/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0205ea7d0, 0xc0313d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:182 +0x8e
net/http.(*persistConn).Read(0xc0006e9320, 0xc0313d2000, 0x1000, 0x1000, 0xc020906eb0, 0x46ca40, 0xc020906eb0)
/usr/local/go/src/net/http/transport.go:1887 +0x77
bufio.(*Reader).fill(0xc01f7f5ce0)
/usr/local/go/src/bufio/bufio.go:101 +0x105
bufio.(*Reader).Peek(0xc01f7f5ce0, 0x1, 0x2, 0x0, 0x0, 0x0, 0xc02691cfc0)
/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc0006e9320)
/usr/local/go/src/net/http/transport.go:2040 +0x1a8
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1708 +0xcb7
比较困惑的是,goroutine 的调用是由底层源码发起的,可以猜测是某些基础库的使用方式有误,注意到代码是 http 的 lib 相关,能看到明显的 pollWait、persistConn、IO wait 字眼,为了定位是什么位置的代码引用,我们可以去线上查找当前服务进程打开的连接句柄
使用 pidof 或者 ps 命令找到自己服务进程的 pid,之后 lsof -p pid
来查看进程打开的句柄,能看到许多异常的建立起的连接
srv 17987 work 30u IPv4 30417287 0t0 TCP localhost:49788->localhost:fmtp (ESTABLISHED)
srv 17987 work 31u IPv4 30697379 0t0 TCP localhost:61063->localhost:fmtp (ESTABLISHED)
srv 17987 work 32u IPv4 30565197 0t0 TCP localhost:34199->localhost:fmtp (ESTABLISHED)
srv 17987 work 33u IPv4 30574610 0t0 TCP localhost:34687->localhost:fmtp (ESTABLISHED)
srv 17987 work 34u IPv4 29940569 0t0 TCP localhost:47346->localhost:fmtp (ESTABLISHED)
srv 17987 work 35u IPv4 30708877 0t0 TCP localhost:61533->localhost:fmtp (ESTABLISHED)
srv 17987 work 36u IPv4 30164366 0t0 TCP localhost:43103->localhost:fmtp (ESTABLISHED)
srv 17987 work 37u IPv4 30563301 0t0 TCP localhost:34663->localhost:fmtp (ESTABLISHED)
srv 17987 work 38u IPv4 30709910 0t0 TCP localhost:61559->localhost:fmtp (ESTABLISHED)
srv 17987 work 39u IPv4 30238360 0t0 TCP localhost:58203->localhost:fmtp (ESTABLISHED)
srv 17987 work 40u IPv4 30342583 0t0 TCP localhost:35954->localhost:fmtp (ESTABLISHED)
srv 17987 work 41u IPv4 30838726 0t0 TCP localhost:42024->localhost:fmtp (ESTABLISHED)
srv 17987 work 42u IPv4 30415497 0t0 TCP localhost:49098->localhost:fmtp (ESTABLISHED)
srv 17987 work 43u IPv4 30840173 0t0 TCP localhost:42478->localhost:fmtp (ESTABLISHED)
srv 17987 work 44u IPv4 30416471 0t0 TCP localhost:49762->localhost:fmtp (ESTABLISHED)
srv 17987 work 45u IPv4 30839661 0t0 TCP localhost:42500->localhost:fmtp (ESTABLISHED)
srv 17987 work 46u IPv4 31288225 0t0 TCP localhost:39302->localhost:fmtp (ESTABLISHED)
srv 17987 work 47u IPv4 30997795 0t0 TCP localhost:27359->localhost:fmtp (ESTABLISHED)
srv 17987 work 48u IPv4 31009992 0t0 TCP localhost:27841->localhost:fmtp (ESTABLISHED)
srv 17987 work 49u IPv4 30559437 0t0 TCP localhost:33999->localhost:fmtp (ESTABLISHED)
如果 grep -c 统计一下,可以发现其数量和泄露的 goroutine 数量几乎一致,可以判断这就是我们要找的泄露连接。使用 lsof -P -p pid
打印完整端口号,找出连接对应的服务,在我们的例子中,是存在大量连接到 consul 的连接。
为什么 Transport 会导致泄露呢? 我们跟踪进源码,在 Transport 的注释中可以看到
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.
意为 Transports 应该被重复使用而不是重复创建,其在多个 goroutine 之间使用是并发安全的。而实际上 Transport 维护的是一个简单的长连接,goroutine 完成后,其所使用的 fd 并不能被释放,因此便会导致 goroutine 产生泄露。
再回到业务代码中,在对应位置查找对于 Transport 的使用,发现有明显的重新创建行为,而正是该行为导致了大量的 goroutine 泄露,我们只需要将使用到 Transport 的初始化为全局变量,创建一次后重复使用便可以避免该问题了。
Be the first person to leave a comment!