Discard unused body of Unary and ClientStream methods #5331

Merged
merged 12 commits into from
Apr 3, 2025

Conversation

paskozdilar
Contributor

@paskozdilar paskozdilar commented Mar 6, 2025

When io.Copy is called synchronously on HTTP request body, it stalls the clients who keep the request stream open.
This broke WebSocket gateway implementations which rely on Close detection to know when to terminate.

This PR runs the io.Copy method in the background.
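The difference between the two behaviours can be sketched with the standard library alone. This is only an illustration, not the generated gateway code: `drain` and `returnsWhileBodyOpen` are hypothetical names, and an `io.Pipe` whose writer stays open stands in for a request body that the client has not closed.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// drain discards the body, either synchronously or in a goroutine.
// Hypothetical helper standing in for the generated handler code.
func drain(body io.Reader, background bool) {
	if background {
		go io.Copy(io.Discard, body) // handler does not wait for EOF
		return
	}
	io.Copy(io.Discard, body) // blocks until the body reports EOF
}

// returnsWhileBodyOpen reports whether drain returns while the writer
// side of the body is still open.
func returnsWhileBodyOpen(background bool) bool {
	pr, pw := io.Pipe()
	defer pw.Close() // releases any goroutine still reading when we return

	returned := make(chan struct{})
	go func() {
		drain(pr, background)
		close(returned)
	}()

	select {
	case <-returned:
		return true
	case <-time.After(100 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("blocking drain returned:", returnsWhileBodyOpen(false))
	fmt.Println("background drain returned:", returnsWhileBodyOpen(true))
}
```

With the body held open, only the background variant returns before the timeout.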

References to other Issues or PRs

Fixes #5326.

Have you read the Contributing Guidelines?

Yes.

Brief description of what is fixed or changed

Added go keyword in front of io.Copy call in the template when body annotation is not set.

Other comments

@paskozdilar
Contributor Author

Forgot to regenerate files. Doing it now.

@paskozdilar
Contributor Author

Missing import for sync.WaitGroup.

I don't feel like tinkering with the import machinery - and benchmarks don't show any significant slowdown when using channels instead.

I'll update the implementation to use plain channels.
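A rough sketch of what waiting on a channel instead of a sync.WaitGroup could look like. The helper names `drainAsync` and `drainedBytes` are made up for this example and are not taken from the PR's templates.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// drainAsync discards body in a goroutine and returns a channel that
// delivers the number of bytes discarded once the copy has finished.
func drainAsync(body io.Reader) <-chan int64 {
	done := make(chan int64, 1)
	go func() {
		n, _ := io.Copy(io.Discard, body)
		done <- n
		close(done)
	}()
	return done
}

// drainedBytes waits for the background drain and returns its byte count.
func drainedBytes(payload string) int64 {
	return <-drainAsync(strings.NewReader(payload))
}

func main() {
	fmt.Println("bytes discarded:", drainedBytes("ignored body"))
}
```

The buffered channel lets the goroutine finish even if nobody ever receives from it, so no extra import is needed beyond `io`.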

@paskozdilar
Contributor Author

I'll try to write an integration test for this too.

@paskozdilar
Contributor Author

paskozdilar commented Mar 7, 2025

Apparently, transactional request-response flow is inherent to Go HTTP, including grpc-gateway runtime.
As such, it is impossible to "stream" data via HTTP client as we would with a WebSocket wrapper, so it's impossible to write an integration test for this without implementing a WebSocket gateway inside the server.

@johanbrandhorst
Would it be acceptable to add a minimal WebSocket wrapper inside the integration test server?

@johanbrandhorst
Collaborator

Sure, it'd be great to know if we break the websocket proxy again in the future, why not!

@paskozdilar
Contributor Author

I've opened a repository for the websocket proxy itself:
https://github.com/paskozdilar/grpc-gateway-websocket

It's fairly well covered by tests, which fail with io.Copy and pass with go io.Copy.
I'll use this code as reference.

@paskozdilar
Contributor Author

Please see: #5338

@paskozdilar
Contributor Author

Integration tests are failing even with the new changes :) good thing we decided to include them.

Unfortunately, it seems that the very act of waiting for io.Copy() to finish makes the request hang on the client side - even when the gRPC method returns, the connection never gets disconnected until the client closes the connection.

This is certainly not desirable behavior - we want clients to be disconnected when server returns. That only leaves us with go io.Copy as an option.

I will revert the commits replacing the go io.Copy with blocking version - tests should then pass.

@johanbrandhorst
Collaborator

Is it only affecting server side streaming methods, or bi-directional methods too?

@paskozdilar
Contributor Author

It happens on Unary methods and Server Stream methods (i.e. non-client-stream methods).

Bidirectional and Client Stream methods don't use this template branch, so the problem doesn't apply - and neither does the root issue in which gateway doesn't read client body...


In other news, I'm having trouble with properly testing WebSocket context cancellation:

  1. in case of Unary and ServerStream methods, closing the WS connection should cancel the context of the method
  2. in case of ClientStream and Bidirectional methods, closing the WS connection should send io.EOF to the gRPC stream

Since WebSocket proxy layer cannot possibly know which method is going to be executed, we cannot decide what to do on WS close message:

  1. if we close the request pipe, Unary and ServerStream methods will treat that as EOF of request and still execute
  2. if we cancel the request context, ClientStream and Bidirectional methods will fail immediately and not return the response.
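The two signals a proxy can send on WS close can be told apart in a small standard-library sketch. It assumes the request body behaves like an `io.Pipe`; `observe` is a hypothetical helper, not part of any proxy.

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// observe applies one of the two close strategies and reports which
// signal the server side sees: "EOF" on the body or "canceled" on the context.
func observe(closePipe bool) string {
	pr, pw := io.Pipe()
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	defer pw.Close()

	eof := make(chan struct{})
	go func() {
		io.Copy(io.Discard, pr) // returns once the writer side is closed
		close(eof)
	}()

	if closePipe {
		pw.Close() // option 1: the request body ends, the context stays alive
	} else {
		cancel() // option 2: the context is cancelled, the body stays open
	}

	select {
	case <-eof:
		return "EOF"
	case <-ctx.Done():
		return "canceled"
	}
}

func main() {
	fmt.Println("close pipe:", observe(true))
	fmt.Println("cancel context:", observe(false))
}
```

A handler that only watches the body never notices option 2, and one that only watches the context never notices option 1, which is why the proxy needs to know the method type.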

In an ideal world, grpc-gateway would generate the WebSocket proxy itself, and it would generate two different implementations depending on the method type. Maybe that would be a cool feature in the future?


So yeah, this is going to be a hairy one. I'll try to figure something out.

@johanbrandhorst
Collaborator

I think @tmc has wanted to integrate the websocket proxy into the gateway for a long time but we just haven't decided the best way forward. I think it would be a bigger task than this issue. If we have to choose between supporting the websocket proxy and removing a bug in the gateway (#5236), then we're going to fix the bug. It might be good enough that we support the websocket proxy only for bi-directional streaming methods?

@paskozdilar
Contributor Author

paskozdilar commented Mar 19, 2025

Eh, you're right. I'm way too deep in the rabbit hole.
I'll take a simpler approach - I'll just use tmc/grpc-websocket-proxy in the first place, and only add tests for the context hanging from the bug report. That would be just no-body unary and server-stream methods (which were affected by the original change).

So I can use both 1) coder/websocket library, as suggested, and 2) your suggestion to wait for io.Copy at end of the request.
The fact that it doesn't quite work with my implementation of the websocket proxy is the fault of my implementation.

For any further bugs and features, I'll open new issues and possibly further PRs.
I'd love to give full websocket integration a try :)

@paskozdilar
Contributor Author

Will test how this works with https://github.com/jnis23/grpc-gateway-proxy-example provided by the bug report.

@paskozdilar
Contributor Author

paskozdilar commented Mar 19, 2025

From the example provided by the bug report, even the "working" server-stream (with body) doesn't properly close websocket methods.
The problem is the same - after the request is unmarshalled, the body remains unread, so the EOF never reaches server.

And since server-stream request needs to return before request body is closed (in case of websockets), we should add go io.Copy there as well, after unmarshalling request, regardless of whether or not body: "*" is defined. The one with deferred wait won't work...

I'll add that to templates and integration test as well...

@paskozdilar
Contributor Author

Integration tests are failing on integration-tests branch, bazel workflows are passing on main, and my local tests show that tmc/grpc-websocket-proxy works with these changes.
As far as I'm concerned, I think this is ready for merge.

Let me know if there's anything else I need to do.

@paskozdilar paskozdilar changed the title from "Discard body of body-less methods in background" to "Discard unused body of Unary and ClientStream methods in background" on Mar 21, 2025
@paskozdilar
Contributor Author

I've modified the PR name to better align with changes.

Collaborator

@johanbrandhorst johanbrandhorst left a comment

I'm sorry but I really think we need to get to the bottom of the behavior here before we can merge it. This proposed solution will affect all users of the gateway (the goroutine), and I want to be sure this is really necessary. The original fix (#5240) fixed a minor connection leak. The impact of that change was very small because it only affected RPCs with no body. Somehow it broke the websocket proxy. That is also probably a pretty small impact. Now fixing both of the issues requires creating a goroutine, which may have a big impact on all users of the gateway. I just want to be sure this is the right solution, and honestly I'm still tempted to say we should just break the websocket proxy rather than impose this on all users.

@paskozdilar
Contributor Author

Fair enough.

I'll try to investigate the root cause of this issue.

@paskozdilar
Contributor Author

I've deduced that the issue appears in the google.golang.org/grpc/internal/transport/clientStream.waitOnHeader() method - when request is not closed (or completely read, in case of no-body methods), the s.headerChan is never closed for some reason.
Adding go io.Copy has a side-effect of s.headerChan being closed, which unblocks the whole thing.

The newClientWithParams function also contains a comment that may be related:

	// Possible context leak:
	// The cancel function for the child context we create will only be called
	// when RecvMsg returns a non-nil error, if the ClientConn is closed, or if
	// an error is generated by SendMsg.
	// https://github.com/grpc/grpc-go/issues/1818.

Indeed, during tests, the context doesn't seem to be closed when connection closes, but instead blocks forever.

This is what I have for now. I'll try to figure out why this happens in the first place.

@paskozdilar
Contributor Author

You're right - it seems that this issue can be fixed in the websocket-proxy itself, instead of in the grpc-gateway.

The root cause of this is the conn-read -> req-write loop in the websocket proxy:
https://github.com/tmc/grpc-websocket-proxy/blob/master/wsproxy/websocket_proxy.go#L255

If nobody is reading the req.Body, the Write call gets stuck, and the loop never goes on to read the body again, so it never detects the connection close.
One possible solution is to add a "buffer" goroutine in the websocket-proxy itself. That way this PR is not necessary at all.
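A minimal sketch of the stuck Write, assuming the proxy feeds the request body through something that behaves like `io.Pipe` (whose Write blocks until a reader consumes the data). `writeCompletes` is a hypothetical helper; the drain goroutine plays the role of whoever reads req.Body.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// writeCompletes reports whether a single Write into the pipe returns
// within the timeout, with or without a goroutine draining the read side.
func writeCompletes(drain bool) bool {
	pr, pw := io.Pipe()
	defer pr.Close() // unblocks a writer that is still stuck when we return

	if drain {
		go io.Copy(io.Discard, pr) // someone is reading the body
	}

	done := make(chan struct{})
	go func() {
		pw.Write([]byte("message")) // blocks until the bytes are read
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(100 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("write completes without reader:", writeCompletes(false))
	fmt.Println("write completes with reader:", writeCompletes(true))
}
```

Without a reader the writer never gets back to its read loop, so it cannot notice that the connection was closed.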


Arguably, it should be the server's responsibility to read requests, even when they are not needed e.g. in order to not block the network buffer.
Server should not assume how much data the client will send - and the cancellation of context should definitely not depend on client not sending more data than expected.

Of course, in this case, the fault is on the websocket-proxy library, since it mistakenly assumes that request write will always pass.
But the fact that this worked at all is because most Go HTTP servers require clients to send the whole body before even starting to process the data...


This has been a wild ride. There's a lot to consider here.
I've created the PR for websocket-proxy here:
tmc/grpc-websocket-proxy#41

@paskozdilar
Contributor Author

paskozdilar commented Mar 26, 2025

Wait, that won't work either. The blocking io.Copy prevents WebSocket from continuing in this case because io.Copy expects Body to be closed, which isn't the case with websocket-proxy - and cannot be, in general case, because proxy doesn't know which method expects request and which expects a stream. Only grpc-gateway knows this.

So either we break HTTP clients that send body when body is not defined or all WS clients when body is not defined...

I'd like to point out that grpc-gateway already uses a goroutine in the implementation of client-streaming methods (ClientStream and Bidirectional):

	go func() {
		for {
			if err := handleSend(); err != nil {
				break
			}
		}
		if err := stream.CloseSend(); err != nil {
			grpclog.Errorf("Failed to terminate client stream: %v", err)
		}
	}()

For what it's worth, my benchmarks with and without go io.Copy on a Unary method that doesn't do anything (just returns a value) seem pretty close:

$ sudo nice -20 go test -bench=. -benchtime=10s -benchmem .
goos: linux
goarch: amd64
pkg: github.com/paskozdilar/bug-report-grpc-gateway
cpu: 13th Gen Intel(R) Core(TM) i7-13800H
BenchmarkUnary/UnaryOK-20                  84042            146442 ns/op           24236 B/op        360 allocs/op
BenchmarkUnary/UnaryGoIOCopy-20            83079            145399 ns/op           24227 B/op        360 allocs/op
PASS
ok      github.com/paskozdilar/bug-report-grpc-gateway  27.313s

@johanbrandhorst
Collaborator

So websocket clients with a body defined still work? I'm happy to put constraints on what types of websocket clients we can support, but I'm not really happy breaking any part of the gateway to support the websocket wrapper.

@paskozdilar
Contributor Author

paskozdilar commented Mar 28, 2025

I understand. Messing with a widely used library is tricky, and one must be careful.


Some shortcuts to cut down on repetitive phrases:

  • BDUSM: Body-Defined Unary and ServerStream Methods
  • NBDUSM: No-Body-Defined Unary and ServerStream Methods

However, I'd like to present some more arguments for including go io.Copy in both BDUSM and NBDUSM anyway:

  1. they don't break behavior, since the handler never reads any data beyond first message (or at all, if body is not defined)
  2. if client sends more data than the default marshaler.Decoder ingests in the first scan (which is 512 bytes), the issue also happens with methods with body defined.

For reproduction of second bug, please see: https://github.com/paskozdilar/bug-report-grpc-gateway/tree/bug-when-body-defined (this line being crucial).

So this bug on body-defined methods exists even with HTTP clients. It's just that json.Decode accidentally fixes it for excess data under 512 bytes.
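The buffering effect can be observed with plain encoding/json. This sketch assumes the gateway's marshaler.Decoder behaves like `json.Decoder` in this respect, and `unreadAfterDecode` is a hypothetical helper: it decodes one object followed by padding and reports how much of the source the decoder left unread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// unreadAfterDecode decodes one JSON object followed by `excess` bytes of
// padding and returns how many bytes remain unread in the source reader.
// It returns -1 if decoding fails.
func unreadAfterDecode(excess int) int {
	src := strings.NewReader(`{"name":"x"}` + strings.Repeat(" ", excess))
	var msg struct {
		Name string `json:"name"`
	}
	if err := json.NewDecoder(src).Decode(&msg); err != nil {
		return -1
	}
	return src.Len()
}

func main() {
	fmt.Println("small excess, bytes left unread:", unreadAfterDecode(100))
	fmt.Println("large excess, bytes left unread:", unreadAfterDecode(4096))
}
```

A small amount of excess data is pulled into the decoder's buffer along with the message, while a large amount stays in the body, which matches the behaviour described above.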

We could add a blocking io.Copy to BDUSM too, which would fix this issue for all HTTP clients regardless of excess body size, but it would break all WebSocket Unary/ServerStream methods.
In contrast, go io.Copy would fix both HTTP and WebSocket clients.


If adding a goroutine is categorically not acceptable, then using a codegen for WebSocket clients will be necessary for it to work.
The generated WebSocket client could, for BDUSM, read the message body, and then close the request in the background, while keeping the WebSocket connection open. And when WebSocket connection is closed, it would cancel the context of the request, thus keeping the behavior correct, while being compatible with the HTTP implementation.
For NBDUSM, we could simply close the request immediately.

Alternatively, similar mechanism could be used by specifying MethodType query parameter in WebSocket URL, similar to how HTTP method is defined.

I'm willing to work on this to help tmc/grpc-websocket-proxy establish a correct behavior.


Then, a question comes to mind - if client sends more data than server expects (e.g. more than a single JSON object in case of BDUSM, or anything at all for NBDUSM), should grpc-gateway return an error, e.g. 400 Bad Request?
The request is, after all, not correct.
We could easily add that by waiting for EOF from marshaler.Decoder (and returning an error on anything else) for BDUSM, and, for NBDUSM, calling req.Body.Read with a one-byte slice until it either returns EOF or reads a byte.
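The one-byte probe for methods without a body could look roughly like this. `hasExcess` and `bodyHasExcess` are hypothetical names, and a `strings.Reader` stands in for req.Body.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// hasExcess reports whether r yields at least one byte before EOF.
func hasExcess(r io.Reader) (bool, error) {
	var b [1]byte
	for {
		n, err := r.Read(b[:])
		if n > 0 {
			return true, nil // the client sent data the method does not expect
		}
		if err == io.EOF {
			return false, nil // clean end of an empty body
		}
		if err != nil {
			return false, err
		}
		// n == 0 with a nil error is allowed by io.Reader; try again.
	}
}

// bodyHasExcess is a convenience wrapper for string payloads.
func bodyHasExcess(body string) bool {
	excess, err := hasExcess(strings.NewReader(body))
	return err == nil && excess
}

func main() {
	fmt.Println("empty body has excess:", bodyHasExcess(""))
	fmt.Println("non-empty body has excess:", bodyHasExcess("{}"))
}
```

Note that on a body that is never closed this probe blocks just like io.Copy does, so it does not help the WebSocket case by itself.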

@paskozdilar
Contributor Author

paskozdilar commented Mar 28, 2025

Alternatively, similar mechanism could be used by specifying MethodType query parameter in WebSocket URL, similar to how HTTP method is defined.

Yes, I think this could work. That way we could add blocking io.Copy to all Unary and ServerStreaming methods, and everything would work for both HTTP and WebSocket clients.

I have a proof-of-concept example here: https://github.com/paskozdilar/grpc-gateway-proxy-example
This example uses my fork of tmc/grpc-websocket-proxy: https://github.com/paskozdilar/grpc-websocket-proxy


So, would it be acceptable to add a blocking io.Copy to both BDUSM and NBDUSM?
Or maybe to check for excess data and refuse requests in the first place?

If so, I'll implement the integration test for that case, too.
WebSocket tests can be held off, and if my PR gets accepted, we can add it too: tmc/grpc-websocket-proxy#41

@johanbrandhorst
Collaborator

Thank you for spending so much time investigating this and laying out the options. I absolutely think a blocking io.Copy would be acceptable. We could perhaps have a

defer func() {
    io.Copy(io.Discard, req.Body)
    req.Body.Close()
}()

What do you think?

@paskozdilar
Contributor Author

paskozdilar commented Mar 31, 2025

No, that would not work.

  1. req.Body.Close() would only run after io.Copy finishes, and io.Copy only finishes when req.Body is closed
  2. The call causing problems in ServerStream is stream.Header(), so the body must be drained until closed before stream.Header() is called (more detailed explanation here: Discard unused body of Unary and ClientStream methods #5331 (comment)).

Considering many HTTP servers would read the body completely before even starting to process the response, I believe the correct behavior would be adding the following lines before stream.Header() call:

// NBDUSM
[...]
n, err := io.Copy(io.Discard, req.Body)
if err != nil {
    return nil, metadata, status.Errorf(codes.InvalidArgument, "%v", err)
}
if n != 0 {
    return nil, metadata, status.Errorf(codes.InvalidArgument, "unexpected data")
}
[...]

// BDUSM
[...]
d := marshaler.NewDecoder(req.Body)
if err := d.Decode(&protoReq); err != nil {
    return nil, metadata, status.Errorf(codes.InvalidArgument, "%v", err)
}
if err := d.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
    return nil, metadata, status.Errorf(codes.InvalidArgument, "unexpected data")
}
[...]
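A self-contained version of the body-defined check, using encoding/json in place of the gateway's marshaler (an assumption made to keep the sketch runnable). `decodeExactlyOne`, `accepts` and `errUnexpectedData` are illustrative names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

var errUnexpectedData = errors.New("unexpected data")

// decodeExactlyOne decodes a single JSON value into v and rejects the
// input if anything other than whitespace follows it.
func decodeExactlyOne(r io.Reader, v any) error {
	d := json.NewDecoder(r)
	if err := d.Decode(v); err != nil {
		return err
	}
	// A second Decode must hit io.EOF; any other outcome means extra data.
	if err := d.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return errUnexpectedData
	}
	return nil
}

// accepts reports whether body passes the strict check.
func accepts(body string) bool {
	var msg map[string]any
	return decodeExactlyOne(strings.NewReader(body), &msg) == nil
}

func main() {
	fmt.Println("single object:", accepts(`{"a":1}`))
	fmt.Println("two objects:", accepts(`{"a":1} {"b":2}`))
}
```

Trailing whitespace is still accepted, because the second Decode skips it before reporting io.EOF.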

Interestingly, this issue of not finishing json.Decoder stream is documented in several blogposts and even an official Go issue:


Do you agree with this approach?

Pros:

  • parsing of request body is well-defined, so clients cannot accidentally cause context leaks while everything appears to work

Cons:

  • we may break some clients that depend on the previous (incorrect) behavior

We could circumvent the latter by adding a backwards-compatibility opt for the protoc. I believe this would be the right thing to do.

paskozdilar added a commit to paskozdilar/bug-report-grpc-gateway that referenced this pull request Mar 31, 2025
@paskozdilar
Contributor Author

paskozdilar commented Mar 31, 2025

I have written a proof-of-concept fix with the above code on a branch bug-all-together:
https://github.com/paskozdilar/bug-report-grpc-gateway/tree/bug-all-together

Running the code prints the expected scenario - invalid bodies get rejected, and valid bodies handle contexts cleanly:

> Running invalid requests:
requesting: ServerStreamNoBody
requesting: UnaryNoBody
requesting: UnaryBody
requesting: ServerStreamBody
request failed: ServerStreamNoBody: 400 Bad Request
request failed: UnaryBody: 400 Bad Request
request failed: UnaryNoBody: 400 Bad Request
request failed: ServerStreamBody: 400 Bad Request
> Running valid requests:
requesting: ServerStreamNoBody
requesting: UnaryBody
requesting: UnaryNoBody
requesting: ServerStreamBody
ServerStreamNoBody open
UnaryBody open
UnaryNoBody open
ServerStreamBody open
request failed: ServerStreamNoBody: Post "http://localhost:8081/example/v1/ServerStreamNoBody": context canceled
request failed: UnaryNoBody: Post "http://localhost:8081/example/v1/UnaryNoBody": context canceled
request failed: UnaryBody: Post "http://localhost:8081/example/v1/UnaryBody": context canceled
request failed: ServerStreamBody: Post "http://localhost:8081/example/v1/ServerStreamBody": context canceled
UnaryNoBody close
UnaryBody close
ServerStreamNoBody close
ServerStreamBody close

As for WebSocket issue, would you be open to adding method-type metadata to response headers? That way, tmc/grpc-websocket-proxy could detect the method type automatically and adjust its behavior accordingly - and everything would work correctly!

@paskozdilar
Copy link
Contributor Author

paskozdilar commented Apr 1, 2025

This issue (sending body to methods that don't expect it) also seems to appear in some of the tests, e.g. many of the testEcho* and testABE* methods send body to method that doesn't expect it:

This is a sign that many clients may expect methods to work even with excess body.

So the way I see it, there are three choices:

  • use go io.Copy(io.Discard, req.Body) to drain body in the background, keeping the current (erroneous) clients and websocket proxy working
  • use blocking io.Copy(io.Discard, req.Body), keeping the current (erroneous) clients working, needing additional metadata to differentiate between methods in websocket proxy
  • use strict request checking as described in comment-2765779112 and risk breaking some (erroneous) clients, needing additional metadata to differentiate between methods in websocket proxy

I am currently in favor of the third option - I think strictness-that-breaks-some-clients is better than relaxedness-that-allows-erroneous-behavior-to-go-unnoticed. If that's not acceptable, the second one would be preferable.

What do you think?

@paskozdilar paskozdilar changed the title from "Discard unused body of Unary and ClientStream methods in background" to "Implement stricter checks on HTTP request body" on Apr 1, 2025
@paskozdilar
Contributor Author

Strange, tests are failing on http.MethodOption calls, which I didn't touch at all... I'll investigate this.

@johanbrandhorst
Collaborator

Thanks again for your work on this. Out of the options, I'm afraid that even if the third option is the more correct, we cannot break users, even if they are doing the wrong thing today. Thus I think option 2 is preferred to me.

I'm confused about the need for the method type in the response headers. What do you mean by method type? Could it be an option, or even better, a third party middleware?

@paskozdilar
Contributor Author

paskozdilar commented Apr 2, 2025

What do you mean by method type?

Method type would be one of the strings: "Unary", "ServerStreaming", "ClientStreaming", "DuplexStreaming", signifying the gRPC method type that grpc-gateway is forwarding.
The idea is to send that data to clients as part of response headers, e.g. x-grpc-method-type, x-grpc-gateway-body.

I'm confused about the need for the method type in the response headers.

The issue is that:

  • WebSocket proxy keeps the request writer open until WS conn is closed
  • this PR will expect:
    • Unary / ServerStream request writers to be closed immediately after sending data
    • ClientStream / Bidirectional request writers to be kept open.

WebSocket proxy cannot, by itself, know when it is calling a Unary / ServerStream method, and when it is calling ClientStream / Bidirectional methods, so it can't know when to close the request writer and when to keep it open.

WebSocket proxy could add a query parameter for users to fill in with correct method type, but I figured if it can automatically get that information from the response header itself, that would be more convenient for the users. You can see preliminary implementation here:
https://github.com/paskozdilar/grpc-websocket-proxy/blob/master/wsproxy/websocket_proxy.go#L201
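On the proxy side, the decision could be sketched as below. The header name `X-Grpc-Method-Type` is only the proposal from this thread, not something grpc-gateway sets, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// methodTypeHeader is the header name proposed in this thread (hypothetical).
const methodTypeHeader = "X-Grpc-Method-Type"

// closeAfterFirstMessage reports whether the proxy should close the request
// writer right after forwarding the first message. Unknown or missing types
// keep the writer open, which matches the proxy's current behaviour.
func closeAfterFirstMessage(methodType string) bool {
	switch methodType {
	case "Unary", "ServerStreaming":
		return true
	default: // "ClientStreaming", "DuplexStreaming", or not set
		return false
	}
}

// closeFromHeader applies the same decision to a set of response headers.
func closeFromHeader(h http.Header) bool {
	return closeAfterFirstMessage(h.Get(methodTypeHeader))
}

func main() {
	h := http.Header{}
	h.Set(methodTypeHeader, "ServerStreaming")
	fmt.Println("close request writer:", closeFromHeader(h))
}
```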

Could it be an option, or even better, a third party middleware?

It could be an option, but it would have to be codegen, since (I assume?) codegen is the only place where we have data about which method is which type.

A third party middleware, I'm not sure how it would work. The tmc/grpc-websocket-proxy is a third party middleware.
Only grpc-gateway knows which method type each method is.

@paskozdilar paskozdilar changed the title from "Implement stricter checks on HTTP request body" to "Discard unused body of Unary and ClientStream methods" on Apr 3, 2025
@paskozdilar
Contributor Author

paskozdilar commented Apr 3, 2025

Alright, the tests are passing now.


As for the server metadata, since we're pretty liberal about the request body, we only need method type metadata for WebSocket proxy to work fine without changes.

I see there's already some metadata header prefixes used in https://github.com/grpc-ecosystem/grpc-gateway/blob/main/runtime/context.go#L23:

const MetadataHeaderPrefix = "Grpc-Metadata-"
[...]
const MetadataPrefix = "grpcgateway-"
[...]
const MetadataTrailerPrefix = "Grpc-Trailer-"
[...]
const xForwardedFor = "X-Forwarded-For"
const xForwardedHost = "X-Forwarded-Host"

I am not 100% sure what those prefixes are used for, but I'll dig in deeper.

We could define a new runtime.AnnotationContextOption for method type, which would append to response headers something like X-Grpc-Method-Type: Unary, etc.


Assuming everything is fine, you may merge this PR as-is if you want, or we can work on the request metadata in this PR, too. Your call.

@johanbrandhorst
Collaborator

I believe the header prefixes you found are used for parsing incoming headers, I don't know that we set any specific headers in the response. You can write a custom ForwardResponseOption that sets headers in the response (such as cookies), and maybe we could have some way for users to use that to inform the websocket of the type? I also think it's OK for us to require websocket users to set the type of the RPC explicitly. I expect doing it manually won't be that hard because who maintains more than 1-2 websockets around the gateway anyway?
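One way users could declare the RPC type explicitly, sketched with net/http only. This is a plain middleware around the gateway mux rather than a ForwardResponseOption; `withMethodType`, `headerFor`, the path-to-type map and the `X-Grpc-Method-Type` header name are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withMethodType wraps next and sets a method-type response header for
// every path the user has listed explicitly in types.
func withMethodType(types map[string]string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if t, ok := types[r.URL.Path]; ok {
			w.Header().Set("X-Grpc-Method-Type", t)
		}
		next.ServeHTTP(w, r)
	})
}

// headerFor runs one request through the middleware and returns the
// method-type header seen by the client. The inner handler stands in
// for the gateway mux.
func headerFor(path string) string {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := withMethodType(map[string]string{
		"/example/v1/UnaryNoBody": "Unary",
	}, inner)

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, nil))
	return rec.Result().Header.Get("X-Grpc-Method-Type")
}

func main() {
	fmt.Printf("listed path: %q\n", headerFor("/example/v1/UnaryNoBody"))
	fmt.Printf("unlisted path: %q\n", headerFor("/example/v1/Other"))
}
```

Keeping the map in user code means the gateway itself stays unchanged, at the cost of maintaining the list by hand.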

Collaborator

@johanbrandhorst johanbrandhorst left a comment

Thanks again for all your work on this, let's tackle the exact websocket wrapper work in the original issue. I can happily approve this PR knowing that there's nothing particularly bad about draining the request body.

@johanbrandhorst johanbrandhorst merged commit b776bd5 into grpc-ecosystem:main Apr 3, 2025
14 checks passed
Successfully merging this pull request may close these issues.

v2.26.2 breaks grpc-websocket-proxy
2 participants