
Graceful Websocket Connection Shutdown #3633

Closed
john-markham opened this issue Apr 2, 2025 · 4 comments

Comments

@john-markham

john-markham commented Apr 2, 2025

Hi! How do we currently shut down websocket connections gracefully, and is it actually graceful?

This has been brought up before, but the discussion there doesn't seem complete: #2847. The solution there implies that canceling outstanding request contexts from InitFunc when the server context shuts down is enough, and that gqlgen should handle the websocket shutdown rigamarole for us. Seems sensible.

Whether or not I use this approach (it seems to have no effect either way, which suggests to me something is up in gqlgen), I see tons of these errors in my ErrorFunc on deploy:

websocket write: websocket: close 1006 (abnormal closure): unexpected EOF
websocket read: websocket connection closed
websocket write: websocket: close sent

My InitFunc (see below for full code) looks something like:

InitFunc: func(reqCtx context.Context, _ transport.InitPayload) (context.Context, *transport.InitPayload, error) {
	liveWebsocketConnections.Add(1)

	newReqCtx, initFuncCancel := context.WithCancel(reqCtx)

	go func() {
		select {
		case <-serverCtx.Done():
			fmt.Printf("Server shutting down. Gracefully shutting down WS based request. Outstanding WS connections %d", liveWebsocketConnections.Load())
			initFuncCancel()
		case <-newReqCtx.Done():
			return
		}
	}()

	return newReqCtx, nil, nil
},

I see those error messages immediately after my "Server shutting down" log line.

Looking at gqlgen's code, I should hit closeOnCancel and then c.close(...) on server shutdown:

func (c *wsConnection) closeOnCancel(ctx context.Context) {
	<-ctx.Done()

	if r := closeReasonForContext(ctx); r != "" {
		c.sendConnectionError("%s", r)
	}
	c.close(websocket.CloseNormalClosure, "terminated")
}

...

func (c *wsConnection) close(closeCode int, message string) {
	c.mu.Lock()
	if c.closed {
		c.mu.Unlock()
		return
	}
	_ = c.conn.WriteMessage(websocket.CloseMessage, websocket.FormatCloseMessage(closeCode, message))
	for _, closer := range c.active {
		closer()
	}
	c.closed = true
	c.mu.Unlock()
	_ = c.conn.Close()

	if c.CloseFunc != nil {
		c.CloseFunc(c.ctx, closeCode)
	}
}
  1. Why don't we wait for a closing frame from the client before closing c.conn? Isn't that part of the Websocket RFC? (A rough sketch of the handshake I have in mind follows this list.)
  2. Where do we prevent in-flight writes and reads from happening after we've initiated shutdown but before it completes (besides in the close func, which is where we should be reading the client's response)? Should we? It doesn't seem like we do, and I'm guessing that partially explains the 1006s.
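To make (1) concrete, here is roughly the handshake I would expect, sketched directly against gorilla/websocket (this is not gqlgen's code, just an illustration of the idea, and it assumes nothing else is reading from the conn at that point):

import (
	"time"

	"github.com/gorilla/websocket"
)

// closeGracefully is a sketch, not gqlgen code: send our Close frame, then
// wait (bounded) for the client's Close frame before dropping the TCP conn.
func closeGracefully(conn *websocket.Conn, code int, reason string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)

	// 1. Send our Close frame.
	msg := websocket.FormatCloseMessage(code, reason)
	if err := conn.WriteControl(websocket.CloseMessage, msg, deadline); err != nil {
		return conn.Close()
	}

	// 2. Read until the client's Close frame arrives (ReadMessage then returns
	//    a *websocket.CloseError) or the deadline expires.
	_ = conn.SetReadDeadline(deadline)
	for {
		if _, _, err := conn.ReadMessage(); err != nil {
			break
		}
	}

	// 3. Only now tear down the underlying TCP connection.
	return conn.Close()
}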

These 1006 errors ultimately show up as Protocol(ResetWithoutClosingHandshake) in my downstream Rust proxy. I don't think these errors should always be retried (see snapview/tokio-tungstenite#101), so it's hard to know what to do even further downstream on the client. These errors are nearly 100% correlated with my deployments and with pods being naturally cycled out, and they are not reproducible any other way.

It would be nice if gorilla/websocket provided an API for this the way net/http does for shutting down non-websocket connections, but it seems like it does not and will not (gorilla/websocket#448). gqlgen seems like the place to handle gracefully closing websocket connections, no? Others in the community have already started working around this issue themselves (ethereum/go-ethereum#30482 (comment)), which is really not ideal.
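For completeness, the kind of server-side coordination I have in mind looks something like the sketch below (not gqlgen API; the helper name and timeouts are placeholders, and it assumes the serverCtx cancel func and liveWebsocketConnections counter from my full code further down). The key point is that net/http's Shutdown does not wait for hijacked connections such as websockets, so they have to be drained separately:

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// shutdownGracefully is a sketch: Shutdown handles plain HTTP requests, then
// canceling serverCtx drives gqlgen's closeOnCancel -> close for every
// websocket, and the loop waits for CloseFunc to drain the counter.
func shutdownGracefully(srv *http.Server, cancelServerCtx context.CancelFunc, live *atomic.Int64) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Stop accepting new requests; hijacked (websocket) connections are not
	// touched by Shutdown.
	_ = srv.Shutdown(ctx)

	// Cancel serverCtx so each InitFunc goroutine cancels its request context.
	cancelServerCtx()

	// Wait (bounded) for CloseFunc to decrement the counter to zero.
	for live.Load() > 0 {
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond):
		}
	}
}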

My full code:

transport.Websocket{
	Upgrader: websocket.Upgrader{
		CheckOrigin: func(r *http.Request) bool {
			return true
		},
		Error: func(w http.ResponseWriter, r *http.Request, status int, reason error) {
			// Report the failure once; writing a body before http.Error would
			// send the response header too early.
			http.Error(w, fmt.Sprintf("Websocket upgrade failed: %s with status %v", reason, status), status)
		},
	},
	InitFunc: func(reqCtx context.Context, _ transport.InitPayload) (context.Context, *transport.InitPayload, error) {
		liveWebsocketConnections.Add(1)

		newReqCtx, initFuncCancel := context.WithCancel(reqCtx)

		// Tie the request context to the server context so gqlgen tears the
		// connection down (via closeOnCancel) when the server shuts down.
		go func() {
			select {
			case <-serverCtx.Done():
				fmt.Printf("Server shutting down. Gracefully shutting down WS based request. Outstanding WS connections %d", liveWebsocketConnections.Load())
				initFuncCancel()
			case <-newReqCtx.Done():
				return
			}
		}()

		return newReqCtx, nil, nil
	},
	PingPongInterval: 1500 * time.Millisecond,
	ErrorFunc: func(ctx context.Context, err error) {
		_, _ = fmt.Println("Websocket error: ", err)
	},
	CloseFunc: func(ctx context.Context, closeCode int) {
		liveWebsocketConnections.Add(-1)
		fmt.Printf("Gracefully shut down WS based request. Remaining WS connections %d", liveWebsocketConnections.Load())

		if closeCode == websocket.CloseNormalClosure {
			return
		}
		_, _ = fmt.Println("Websocket closed with close code: ", closeCode)
	},
},
@john-markham
Author

@StevenACoffman Curious to hear your opinion here. If you think my points are legitimate, I can work on a PR.

@StevenACoffman
Collaborator

@john-markham I'm feverish at the moment, so I don't really trust my own technical judgement. 😅 However, I am always in favor of PRs that do not break existing behavior and only help fix problems. The tricky part is verifying both of those for websockets.

Some of the recent contributors to websocket-related code probably have more hands-on experience and insight into this particular issue as well.

@UnAfraid @telemenar @wiegell @vlad-tokarev @szgupta @df-wg Can you please take a look and share your thoughts?

@john-markham
Author

@StevenACoffman Feel better. Thank you so much for the energy you bring to this project and for keeping it alive!

@john-markham
Author

john-markham commented Apr 16, 2025

I'm closing this issue. After testing out Add websocket shutdown grace period #3653 in my infra, I saw that my application-layer shutdown hook was not running in time before my load balancer's deregistration delay was up, which is what was causing 99% of the 1006s.

I'm still skeptical that we implement the full closing handshake, but it seems like it just doesn't matter 99% of the time. For the other 1%, I think #3653 might be useful.

For those looking at this issue in the future, I found this article: https://www.grammarly.com/blog/engineering/perfecting-smooth-rolling-updates-in-amazon-elastic-container-service/ which seems to suggest that these errors are unavoidable, but in my experience that was not the case. Another thing we are considering doing long term is migrating to a service mesh to get rid of the ALB-related headaches.
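To spell out the timing that bit me, here is a sketch of the ordering only (placeholder names and durations, not my real config; drainWebsockets stands in for the drain loop sketched in my first comment):

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// handleSIGTERM sketches the ordering: the whole websocket close has to start
// as soon as SIGTERM arrives and finish inside both the ALB target group's
// deregistration delay and the pod's termination grace period, otherwise
// clients see 1006 resets instead of a close frame.
func handleSIGTERM(serverCtxCancel context.CancelFunc, drainWebsockets func(context.Context)) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	// Budget comfortably shorter than the deregistration delay and grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	// Canceling serverCtx cancels every request context, which gqlgen turns
	// into closeOnCancel -> close; CloseFunc decrements the counter as each
	// connection finishes.
	serverCtxCancel()
	drainWebsockets(ctx)
}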
