
Graceful Websocket Connection Shutdown #3633

Closed
john-markham opened this issue Apr 2, 2025 · 4 comments

Comments

@john-markham

john-markham commented Apr 2, 2025

Hi! How do we currently shut down websocket connections gracefully, and is it actually graceful?

This has been brought up before, but the discussion there doesn't seem complete: #2847. The solution there implies that canceling outstanding request contexts from InitFunc when the server context shuts down is enough, and that gqlgen should handle the websocket shutdown rigamarole for us. Seems sensible.

Whether or not I use this approach (it seems to have no effect either way, which suggests to me something is up in gqlgen), I see tons of these errors in my ErrorFunc on deploy:

websocket write: websocket: close 1006 (abnormal closure): unexpected EOF
websocket read: websocket connection closed
websocket write: websocket: close sent

My InitFunc (see below for full code) looks something like:

InitFunc: func(reqCtx context.Context, _ transport.InitPayload) (context.Context, *transport.InitPayload, error) {
	liveWebsocketConnections.Add(1)

	newReqCtx, initFuncCancel := context.WithCancel(reqCtx)

	go func() {
		select {
		case <-serverCtx.Done():
			fmt.Printf("Server shutting down. Gracefully shutting down WS based request. Outstanding WS connections %d", liveWebsocketConnections.Load())
			initFuncCancel()
		case <-newReqCtx.Done():
			return
		}
	}()

	return newReqCtx, nil, nil
},

I see those error messages immediately after my "Server shutting down" log line.

Looking at gqlgen's code, I should hit closeOnCancel and then c.close(...) on server shutdown:

func (c *wsConnection) closeOnCancel(ctx context.Context) {
	<-ctx.Done()

	if r := closeReasonForContext(ctx); r != "" {
		c.sendConnectionError("%s", r)
	}
	c.close(websocket.CloseNormalClosure, "terminated")
}

...

func (c *wsConnection) close(closeCode int, message string) {
	c.mu.Lock()
	if c.closed {
		c.mu.Unlock()
		return
	}
	_ = c.conn.WriteMessage(websocket.CloseMessage, websocket.FormatCloseMessage(closeCode, message))
	for _, closer := range c.active {
		closer()
	}
	c.closed = true
	c.mu.Unlock()
	_ = c.conn.Close()

	if c.CloseFunc != nil {
		c.CloseFunc(c.ctx, closeCode)
	}
}
  1. Why don't we wait for a closing frame from the client before closing c.conn? Isn't that part of the Websocket RFC? (A rough sketch of the handshake I have in mind follows this list.)
  2. Where do we prevent in-flight writes and reads from happening after we've initiated shutdown but before it completes (besides in the close func, which is where we should be reading the client's response)? Should we? It doesn't seem like we do, and I'm guessing that partially explains the 1006s.
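To make (1) concrete, here is roughly the handshake I would expect, sketched directly against gorilla/websocket (this is not gqlgen's code, just an illustration of the idea, and it assumes nothing else is reading from the conn at that point):

import (
	"time"

	"github.com/gorilla/websocket"
)

// closeGracefully is a sketch, not gqlgen code: send our Close frame, then
// wait (bounded) for the client's Close frame before dropping the TCP conn.
func closeGracefully(conn *websocket.Conn, code int, reason string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)

	// 1. Send our Close frame.
	msg := websocket.FormatCloseMessage(code, reason)
	if err := conn.WriteControl(websocket.CloseMessage, msg, deadline); err != nil {
		return conn.Close()
	}

	// 2. Read until the client's Close frame arrives (ReadMessage then returns
	//    a *websocket.CloseError) or the deadline expires.
	_ = conn.SetReadDeadline(deadline)
	for {
		if _, _, err := conn.ReadMessage(); err != nil {
			break
		}
	}

	// 3. Only now tear down the underlying TCP connection.
	return conn.Close()
}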

These 1006 errors ultimately show up as Protocol(ResetWithoutClosingHandshake) in my downstream Rust proxy. I don't think these errors should always be retried (see snapview/tokio-tungstenite#101), so it's hard to know what to do even further downstream on the client. These errors are nearly 100% correlated with my deployments and with pods being naturally cycled out, and they are not reproducible any other way.

It would be nice if gorilla/websocket provided an API for this the way net/http does for shutting down non-websocket connections, but it seems like it does not and will not (gorilla/websocket#448). gqlgen seems like the place to handle gracefully closing websocket connections, no? Others in the community have already started working around this issue themselves (ethereum/go-ethereum#30482 (comment)), which is really not ideal.
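For completeness, the kind of server-side coordination I have in mind looks something like the sketch below (not gqlgen API; the helper name and timeouts are placeholders, and it assumes the serverCtx cancel func and liveWebsocketConnections counter from my full code further down). The key point is that net/http's Shutdown does not wait for hijacked connections such as websockets, so they have to be drained separately:

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// shutdownGracefully is a sketch: Shutdown handles plain HTTP requests, then
// canceling serverCtx drives gqlgen's closeOnCancel -> close for every
// websocket, and the loop waits for CloseFunc to drain the counter.
func shutdownGracefully(srv *http.Server, cancelServerCtx context.CancelFunc, live *atomic.Int64) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Stop accepting new requests; hijacked (websocket) connections are not
	// touched by Shutdown.
	_ = srv.Shutdown(ctx)

	// Cancel serverCtx so each InitFunc goroutine cancels its request context.
	cancelServerCtx()

	// Wait (bounded) for CloseFunc to decrement the counter to zero.
	for live.Load() > 0 {
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond):
		}
	}
}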

My full code:

transport.Websocket{
	Upgrader: websocket.Upgrader{
		CheckOrigin: func(r *http.Request) bool {
			return true
		},
		Error: func(w http.ResponseWriter, r *http.Request, status int, reason error) {
			// Report the failure once; writing a body before http.Error would
			// send the response header too early.
			http.Error(w, fmt.Sprintf("Websocket upgrade failed: %s with status %v", reason, status), status)
		},
	},
	InitFunc: func(reqCtx context.Context, _ transport.InitPayload) (context.Context, *transport.InitPayload, error) {
		liveWebsocketConnections.Add(1)

		newReqCtx, initFuncCancel := context.WithCancel(reqCtx)

		// Tie the request context to the server context so gqlgen tears the
		// connection down (via closeOnCancel) when the server shuts down.
		go func() {
			select {
			case <-serverCtx.Done():
				fmt.Printf("Server shutting down. Gracefully shutting down WS based request. Outstanding WS connections %d", liveWebsocketConnections.Load())
				initFuncCancel()
			case <-newReqCtx.Done():
				return
			}
		}()

		return newReqCtx, nil, nil
	},
	PingPongInterval: 1500 * time.Millisecond,
	ErrorFunc: func(ctx context.Context, err error) {
		_, _ = fmt.Println("Websocket error: ", err)
	},
	CloseFunc: func(ctx context.Context, closeCode int) {
		liveWebsocketConnections.Add(-1)
		fmt.Printf("Gracefully shut down WS based request. Remaining WS connections %d", liveWebsocketConnections.Load())

		if closeCode == websocket.CloseNormalClosure {
			return
		}
		_, _ = fmt.Println("Websocket closed with close code: ", closeCode)
	},
},
@john-markham
Author

@StevenACoffman Curious to hear your opinion here. If you think my points are legitimate, I can work on a PR.

@StevenACoffman
Collaborator

@john-markham I'm feverish at the moment, so I don't really trust my own technical judgement. 😅 However, I am always in favor of PRs that do not break existing behavior and only help fix problems. The tricky part is verifying both of those for websockets.

Some of the recent contributors to websocket-related code probably have more hands-on experience and insight into this particular issue as well.

@UnAfraid @telemenar @wiegell @vlad-tokarev @szgupta @df-wg Can you please take a look and share your thoughts?

@john-markham
Author

@StevenACoffman Feel better. Thank you so much for the energy you bring to this project and for keeping it alive!

@john-markham
Author

john-markham commented Apr 16, 2025

I'm closing this issue. After testing out Add websocket shutdown grace period #3653 in my infra, I saw that my application-layer shutdown hook was not running in time before my load balancer's deregistration delay was up, which is what was causing 99% of the 1006s.

I'm still skeptical that we implement the full closing handshake, but it seems like it just doesn't matter 99% of the time. For the other 1%, I think #3653 might be useful.

For those looking at this issue in the future, I found this article: https://www.grammarly.com/blog/engineering/perfecting-smooth-rolling-updates-in-amazon-elastic-container-service/ which seems to suggest that these errors are unavoidable, but in my experience that was not the case. Another thing we are considering doing long term is migrating to a service mesh to get rid of the ALB-related headaches.
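To spell out the timing that bit me, here is a sketch of the ordering only (placeholder names and durations, not my real config; drainWebsockets stands in for the drain loop sketched in my first comment):

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// handleSIGTERM sketches the ordering: the whole websocket close has to start
// as soon as SIGTERM arrives and finish inside both the ALB target group's
// deregistration delay and the pod's termination grace period, otherwise
// clients see 1006 resets instead of a close frame.
func handleSIGTERM(serverCtxCancel context.CancelFunc, drainWebsockets func(context.Context)) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	// Budget comfortably shorter than the deregistration delay and grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	// Canceling serverCtx cancels every request context, which gqlgen turns
	// into closeOnCancel -> close; CloseFunc decrements the counter as each
	// connection finishes.
	serverCtxCancel()
	drainWebsockets(ctx)
}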
