DRIVERS-2884 Avoid connection churn when operations timeout #1675
base: master
Conversation
source/client-side-operations-timeout/tests/connection-churn.yml (outdated, resolved)
source/client-side-operations-timeout/tests/connection-churn.yml (outdated, resolved)
# after maxTimeMS, whereas mongod returns it after
# max(blockTimeMS, maxTimeMS). Until this ticket is resolved, these tests
# will not pass on sharded clusters.
topologies: ["standalone", "replicaset"]
standalone -> single (the unified test format's topology enum uses "single", not "standalone")
- name: findOne
  object: *collection
  arguments:
    timeoutMS: 50
In Python this timeout is too small and causes this find to fail before sending anything to the server. The same problem exists in the other tests too. Perhaps all of these tests should run a setup command (e.g. ping) to ensure a connection is created and available in the pool, then run the finds. What do you think?
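For illustration only, here is a rough sketch of that idea using the Go driver's API rather than the YAML test format (the function name and structure are assumptions): a setup ping guarantees a pooled connection already exists, so the 50ms timeoutMS is spent on the find itself rather than on connection establishment.

```go
package example

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// pingThenFind runs a cheap setup command so a connection is already checked
// into the pool before the deliberately tiny timeout is applied to the find.
func pingThenFind(ctx context.Context, coll *mongo.Collection) error {
	// The ping establishes (or reuses) a connection without any artificial timeout.
	if err := coll.Database().RunCommand(ctx, bson.D{{Key: "ping", Value: 1}}).Err(); err != nil {
		return err
	}

	// Now the 50ms budget applies to the find itself, not to connection setup.
	findCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	return coll.FindOne(findCtx, bson.D{}).Err()
}
```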
I'd like to wait until #1792 is merged to review the schema changes. From what I can see in the UTF specification, those changes look good.
source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md (resolved)
#### Connection Aliveness Check Fails

1. Initialize a mock TCP listener to simulate the server-side behavior. The listener should write at least 5 bytes to
Thoughts on adding a mock server to drivers-evergreen-tools for these tests? I could go either way - there are only two, so the burden on drivers isn't too great but it might be nice if drivers didn't need to worry about the mock server logic themselves.
I’m concerned that this solution will require drivers to spin up a server when trying to test locally. I’ve suggested DRIVERS-3183 to support raw-TCP connection test entities which will allow us to convert these prose tests to a unified spec test in the future.
source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md (resolved)
The next time the connection is checked out, the driver MUST attempt to read and discard the remaining response from the
socket. The workflow for this is as follows:

- The connection MUST persist the current time recorded immediately after the original socket timeout, and this
Not opposed to having the logic in the connection, but we do have precedent for other connection-state-related actions in checkIn (destroying the connection, etc.). Would recording the start time make sense at checkIn instead of after the timeout in the driver's connection abstraction?
I have two thoughts on this. First, some drivers will likely add an object to the connection to maintain state for a pending response, for example:
```go
type pendingResponseState struct {
	remainingBytes int32
	requestID      int32
	start          time.Time
}

type connection struct {
	// pendingResponseState contains information required to attempt a pending read
	// in the event of a socket timeout for an operation that has appended
	// maxTimeMS to the wire message.
	pendingResponseState *pendingResponseState
}
```
It would be more pragmatic to update the current time where remainingBytes and requestID are assigned (which is when the socket times out).
Additionally, we want to start this “countdown” ASAP in case the connection is “dead”: in such cases the “aliveness check” will be a non-blocking failure while awaiting a pending response. Delaying when we set the current time reduces the likelihood (albeit small) that the 3-second pending response timeout has been exceeded while the connection remains idle in the pool.
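To make that concrete, a minimal sketch (reusing the pendingResponseState type above; the helper name and the timeout detection are assumptions, not the Go driver's actual API):

```go
// capturePendingState is called from the read path the moment a socket timeout
// is observed, which is also the point where remainingBytes and requestID are known.
// Requires the "errors", "net", and "time" imports.
func capturePendingState(err error, remainingBytes, requestID int32) *pendingResponseState {
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return &pendingResponseState{
			remainingBytes: remainingBytes,
			requestID:      requestID,
			start:          time.Now(), // the countdown starts at the timeout itself, not at checkIn
		}
	}
	return nil
}
```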
Okay, leaving this where it is works for me. Thanks!
@@ -576,7 +576,106 @@ other threads from checking out [Connections](#connection) while establishing a
Before a given [Connection](#connection) is returned from checkOut, it must be marked as "in use", and the pool's
availableConnectionCount MUST be decremented.

```text
If an operation times out the socket while awaiting a server response and CSOT is enabled and `maxTimeMS` was added to
the command, the driver MUST mark the connection as "pending" and record the current time in a way that can be updated.
So - I do understand this is the approach outlined in the design and that it is generally valuable to have example algorithms and pseudocode in the specification. However, I do think this approach makes assumptions about connection layer implementations and socket APIs that don't hold true for all drivers (at least in Node, they certainly don't). For example - Node's socket API is push-based and we collect chunks when they're available. Immediate reads do not make sense for our connection layer implementation, and so the existing implementation of `await_pending_response` doesn't either.
I'm happy to work with you on the phrasing here, but can we try to phrase these requirements in a way that outlines the requirements, and then outlines a particular implementation that satisfies the requirements? Ex: in Node, the socket API we use pushes chunks of data to us automatically (there is no `read(n)` method). We collect this into a buffer automatically when chunks are available. So when we implement these changes in Node, what we will likely do instead of this algorithm is:
- on timeout, fail the current request and record the time of timeout but don't stop receiving data chunks from the socket.
- set the socket's timeout to 3s (time out the socket if no chunks arrive in 3s)
- in checkout, check if the pending connection has finished reading the response from the server. If it has, discard and continue. If not, calculate the wait time and wait. On success, proceed. On timeout, close the connection.
I think this approach still satisfies the goals of the spec changes but as the spec is currently written, our implementation would not be spec compliant. Thoughts?
IIUC in the Node case, bytes that have not been consumed will still sit on the socket after a timeout. So you would still need to read and discard any buffered data when checking out a connection that has been pinned to a pending response. The three bullet points you note do not conflict with the algorithm, AFAICT. I would be happy to troubleshoot this offline with you, though.
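For reference, here is one way that checkout-time "read and discard" could look in Go; a minimal sketch under assumptions (raw net.Conn access, the pendingResponseState type from the earlier comment, a hypothetical function name), not the required implementation:

```go
// awaitPendingResponse drains what is left of a timed-out server reply before the
// connection is handed back to the operation that checked it out.
// Requires the "net" and "time" imports.
func awaitPendingResponse(conn net.Conn, state *pendingResponseState) error {
	const pendingResponseTimeout = 3 * time.Second

	buf := make([]byte, 4096)
	for state.remainingBytes > 0 {
		// The static window is measured from the last successful read (initially
		// the moment the original socket timeout occurred). If it has already
		// expired, the Read below fails almost immediately.
		if err := conn.SetReadDeadline(state.start.Add(pendingResponseTimeout)); err != nil {
			return err
		}

		limit := int32(len(buf))
		if state.remainingBytes < limit {
			limit = state.remainingBytes
		}
		n, err := conn.Read(buf[:limit])
		if n > 0 {
			state.remainingBytes -= int32(n)
			state.start = time.Now() // progress was made; refresh the window
		}
		if err != nil {
			return err // timeout or network error: the caller closes the connection
		}
	}
	return nil // remaining response fully read and discarded; the connection can be reused
}
```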
No, not exactly. Unless we explicitly disable this behavior, any data that arrives on the socket will actually be processed by the driver, even while the connection is checked into the pool and unused (unless there's something preventing the chunks from reaching Node's socket layer, such as AWS Lambda freezing the runtime, but at that point this is sort of moot because the driver's process is frozen). Perhaps a rough outline of what I imagine the final product will more or less look like in Node might help:
```js
// in our connection layer's `command()` method:
async command() {
  try {
    // ...
  } catch (error) {
    if (isTimeoutError(error)) {
      connection.lastChunkTime = now();
      connection.state = 'pending response';
      // if there is more than 3s between chunks, Node.js will close this socket and error our connection for us
      connection.socketTimeoutMS = 3_000;
    }
    throw error;
  }
}

// in the connection pool
async checkout() {
  // other steps already happened above, we have a connection:
  if (connection.state === 'pending response') {
    await awaitPendingRead(connection);
  }
}

async awaitPendingRead(connection, remainingTimeMS) {
  // wait either 3s minus the time already waited since the last chunk, or remainingTimeMS, whichever is smaller
  const remainingTime = 3_000 - (now() - connection.lastChunkTime);
  const duration = Math.min(remainingTime, remainingTimeMS);
  // connection.responses.next() resolves to a full wire protocol message, indicating we finished reading everything from the socket
  await Promise.race([sleep(duration), connection.responses.next()]);
  const timedOut = (now() - connection.lastChunkTime) > duration;
  if (timedOut) throw new Error('timed out');
  // connection closes itself when socket timeout hits 3s - no need to close it here
}
```
> The three bullet points you note do not conflict with the algorithm, AFAICT
I mean, I agree to some extent. I think the implementation above satisfies the requirements here, but I wouldn't say that this implementation "implements" the algorithm (or if it does, it's a bit obfuscated by our connection layer and Node's connection APIs). That's why I'd prefer if we could rework the requirements to be a set of requirements that is socket API-agnostic, with a proposed algorithm that satisfies the requirements that drivers can implement if they want.
connectionId: int64;

/**
 * The time it took to complete the pending read.
So long as data is still coming back from the socket in intervals of <3s, it is possible for the same connection to require multiple checkout requests to fully exhaust. So - is this duration the total time it took to read all of the data off of the socket (now() - time of timeout) or the amount of time that the checkout request waited on the final pending read?
(same comment for logging events)
I would anticipate this duration to be within the context of ConnectionPendingResponseStarted, i.e. 1 call to await_pending_response.
Agreed. Can we clarify that in the description of duration? We can take inspiration from the definitions of duration for checkout failed and checkout succeeded events. Ex:
```ts
/**
 * The time it took to establish the connection.
 * In accordance with the definition of establishment of a connection
 * specified by `ConnectionPoolOptions.maxConnecting`,
 * it is the time elapsed between emitting a `ConnectionCreatedEvent`
 * and emitting this event as part of the same checking out.
 *
 * Naturally, when establishing a connection is part of checking out,
 * this duration is not greater than
 * `ConnectionCheckedOutEvent`/`ConnectionCheckOutFailedEvent.duration`.
 *
 * A driver MAY choose the type idiomatic to the driver.
 * If the type chosen does not convey units, e.g., `int64`,
 * then the driver MAY include units in the name, e.g., `durationMS`.
 */
duration: Duration;
```
So, maybe something like:
```ts
/**
 * The time it took to complete the pending read.
 * This duration is defined as the time elapsed between emitting a `PendingResponseStarted` event
 * and emitting this event as part of the same checking out.
 *
 * A driver MAY choose the type idiomatic to the driver.
 * If the type chosen does not convey units, e.g., `int64`,
 * then the driver MAY include units in the name, e.g., `durationMS`.
 */
duration: Duration;
```
(same comment for other definitions of duration in this PR).
 */
connectionId: int64;

/**
Do you think it would be valuable to include a duration here as well, indicating how long the request waited for the pending read before failing?
(same comment for logging events)
I do think this is a good idea; it could highlight the case where 1-byte aliveness checks are trickling in but the response continues to time out while attempting to discard the input TCP buffer. Good call.
If we're in agreement that this seems valuable, can you make this change?
This change was made in 9500fd5
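As an aside, a rough Go sketch of the single-byte aliveness check referenced above (assuming raw net.Conn access and the pendingResponseState type from the earlier comment; not the spec's mandated API):

```go
// alivenessCheck performs a near-immediate one-byte read when the 3-second window
// has already expired while the connection sat idle in the pool.
// Requires the "errors", "net", "os", and "time" imports.
func alivenessCheck(conn net.Conn, state *pendingResponseState) error {
	// A deadline only a millisecond away lets Read return any already-buffered
	// data, but fail fast if the server has sent nothing.
	if err := conn.SetReadDeadline(time.Now().Add(time.Millisecond)); err != nil {
		return err
	}
	var one [1]byte
	n, err := conn.Read(one[:])
	if n > 0 {
		state.remainingBytes -= int32(n)
		state.start = time.Now() // data arrived; refresh the window and keep draining
		return nil
	}
	if errors.Is(err, os.ErrDeadlineExceeded) {
		return err // nothing buffered; the connection is treated as dead and closed
	}
	return err
}
```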
`ConnectionPendingResponseFailed` events.
3. Instantiate a connection pool using the mock listener’s address, ensuring readiness without error. Attach the event
   monitor to observe the connection’s state.
4. Check out a connection from the pool and initiate a read operation with an appropriate socket timeout (e.g., 10ms)
This comment is related to my other comment about a less socket-API-specific implementation. Is it possible to write these tests in a way that doesn't require explicit use of a read API? Node's connection layer doesn't expose a read method; we only expose command(), which performs write+read on the underlying socket.
So Node only has access to a round-trip API? Even in the non-public API? Could you not just discard the write half in the mock listener?
I interpreted "mock listener" as "mock server", but reading over this test more closely, that doesn't seem to be what you're asking for. Are you asking for some way of stubbing a socket and pushing chunks into the socket's buffer, instead of a mock server that the driver opens real TCP sockets to? That's possible in Node but I'd check with other drivers (I don't know if something like this exists yet in the specs; it may not be possible in other drivers).
Regardless, I'm not sure if these tests would work for the Node implementation I mentioned here.
7. Verify that one event for each `ConnectionPendingResponseStarted` and `ConnectionPendingResponseFailed` was emitted.
   Also verify that the fields were correctly set for each event.

#### Connection Aliveness Check Succeeds
Can we add a test that demonstrates that multiple aliveness checks might be required to fully read the response from the socket? I'm imagining the mock server emitting chunks of data every second for longer than 6s (2 * the static timeout). Each checkout should fail, but we'll continue to read
It’s unclear to me what the goal of this would be. Are you just wanting to make sure the driver does not prematurely close the connection if the aliveness check succeeds? We could use event monitoring for that instead.
Yes - I want a test somewhere that demonstrates that drivers might need to span multiple checkout + pending responses, and that that process can take longer than the static timeout, before reading the full response from the connection.
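To sketch what that test's mock listener might look like (assumptions: a Go test helper, illustrative timings, not a prescribed test fixture):

```go
// startTricklingListener accepts one connection and writes a single byte every
// second for ~8 seconds, so the pending response keeps making progress (each gap
// is well under the 3-second window) yet spans multiple checkout attempts.
// Requires the "net", "testing", and "time" imports.
func startTricklingListener(t *testing.T) net.Addr {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { l.Close() })

	go func() {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		for i := 0; i < 8; i++ {
			if _, err := conn.Write([]byte{0x00}); err != nil {
				return
			}
			time.Sleep(1 * time.Second)
		}
	}()
	return l.Addr()
}
```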
- description: "force a pending response read, fail first try, succeed second try"
  operations:
    - name: createEntities
If possible, can we add a test that demonstrates that when the pending read checkout has no timeoutMS set, we use socket_timeout_ms (if it is <3s)?
Great catch! The Go Driver doesn’t support socket timeouts, which is a technically deprecated option. Perhaps @ShaneHarvey can opine. If we decide to add this test, would you mind implementing it, since the Go Driver has no way of verifying it?
Changes to the unified test format LGTM.
@prestonvasquez as per our conversation around where to add the missing event names in #1782, this schema version would be an ideal candidate as it already adds new events to the list.
socket. The workflow for this is as follows:

- The connection MUST persist the current time recorded immediately after the original socket timeout, and this
  timestamp MUST be updated to the current time whenever any data is successfully read from the socket during a
> Are we concerned specifically about a case where the socket timeout beats maxTimeMS and the operation succeeds server-side but the connection is still put into a "pending response” state?

Yeah - like a network delay that slows down the response from the server initially. There wouldn't be anything that ends the operation server-side in that scenario.
  object: *database
  arguments:
    command:
      ping: 1
The UTR runCommand operation requires the `commandName: "ping"` field:
https://github.com/mongodb/specifications/blob/30d15e0600a798834fb40d36eb468ccf9c49bc04/source/unified-test-format/unified-test-format.md#runcommand
socket. The workflow for this is as follows:

- The connection MUST persist the current time recorded immediately after the original socket timeout, and this
  timestamp MUST be updated to the current time whenever any data is successfully read from the socket during a
Yes, this is a good call out. If moreToCome is present on the response, we should keep reading the connection until we get a response with moreToCome: false or the timeout expires.
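A rough sketch of what "keep reading until moreToCome is false" could look like over a raw connection (OP_MSG's moreToCome flag is bit 0x2 of flagBits; the helper name and deadline handling are assumptions, not a prescribed implementation):

```go
const opMsgMoreToCome = 0x2

// drainExhaustReplies reads and discards OP_MSG replies until one arrives without
// the moreToCome flag set, or until the supplied deadline is hit.
// Requires the "encoding/binary", "fmt", "io", "net", and "time" imports.
func drainExhaustReplies(conn net.Conn, deadline time.Time) error {
	header := make([]byte, 16) // messageLength, requestID, responseTo, opCode
	for {
		if err := conn.SetReadDeadline(deadline); err != nil {
			return err
		}
		if _, err := io.ReadFull(conn, header); err != nil {
			return err
		}
		msgLen := int32(binary.LittleEndian.Uint32(header[0:4]))
		if msgLen < 20 {
			return fmt.Errorf("malformed message length %d", msgLen)
		}
		body := make([]byte, msgLen-16)
		if _, err := io.ReadFull(conn, body); err != nil {
			return err
		}
		// flagBits is the first field of an OP_MSG body.
		if binary.LittleEndian.Uint32(body[0:4])&opMsgMoreToCome == 0 {
			return nil // final reply read and discarded
		}
	}
}
```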
This PR implements the design for connection pooling improvements described in DRIVERS-2884, based on the CSOT (Client-Side Operation Timeout) spec. It addresses connection churn caused by network timeouts during operations, especially in environments with low client-side timeouts and high latency.
When a connection is checked out after a network timeout, the driver now attempts to resume and complete reading any pending server response (instead of closing and discarding the connection). This may require multiple checkouts.
Each pending response read is subject to a cumulative 3-second static timeout. The timeout is refreshed after each successful read, acknowledging that progress is being made. If no data is read and the timeout is exceeded, the connection is closed.
To reduce unnecessary latency, if the timeout has expired while the connection was idle in the pool, a non-blocking single-byte read is performed; if no data is available, the connection is closed immediately.
This update introduces new CMAP events and logging messages (PendingResponseStarted, PendingResponseSucceeded, PendingResponseFailed) to improve observability of this path.