EXPERIMENTAL: Improve WebSocket Re-Connections #1216

jpsantosbh · 2025-05-09T15:29:37Z

Description

These changes are intended for internal testing/usage, since at the moment the e2e coverage has been removed because it was producing false positives.

Depending on the cause and the signalling stage, the WebSocket reconnection needs to behave differently.

avoid multiple signalwire.connect in the same session
stop the RTC negotiation if the authorization_state sync is lost
don't use verto.invite if the instance still has its reference state
clean up the previous call id when not trying to reattach

Type of change

Internal refactoring
Bug fix (bugfix - non-breaking)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Code snippets

In case of new feature or breaking changes, please include code snippets.

changeset-bot · 2025-05-09T15:29:40Z

🦋 Changeset detected

Latest commit: 475184b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 6 packages

Name	Type
@signalwire/webrtc	Minor
@signalwire/core	Minor
@signalwire/js	Minor
@sw-internal/e2e-js	Patch
@signalwire/realtime-api	Patch
@signalwire/web-api	Patch

jpsantosbh · 2025-05-09T15:31:29Z

packages/core/src/BaseSession.ts

@@ -451,6 +461,12 @@ export class BaseSession {
    const payload = this.decode<JSONRPCRequest | JSONRPCResponse>(event.data)
    this.logger.wsTraffic({ type: 'recv', payload })

+    if (this._waitingAuthStateUpdate && !isAuthStateEvent(payload)) {
+      // we lost a `signalwire.authorization.state` probably in a WS reconnection
+      // the server should always send before any other msg


@Astaelan can you validate this assumption?

I believe this should be correct, I can't find any other messages that could sneak through. A ping maybe but should still never come before the auth state and connect response. If auth fails however, an error response may come without auth state, so you should validate that potential.

The other thing to note is that if you reconnect with an authorization state and nothing changes, then you may not get an auth state, because nothing changed, validate this scenario as well.

@Astaelan does that mean after sending the “signalwire.connect”, the SDK should NOT send any request before receiving the “auth_state” event?

The longer answer is that you should never depend on authorization state events in the first place. They can come whenever Hagrid decides it needs the client to retain new information on a reconnect, and then only sends the authorization state event if something actually changed. This may or may not happen right after connect, may or may not happen during an invite, even broadcast stuff can change it I believe. So I originally proposed NOT depending on the auth state event itself.

Instead, what I originally proposed, is that if the client starts a verto.invite request, it should not consider the invite/call valid until it receives the invite response. If there is going to be an authorization state update, it will happen during the invite before the response returns to the client. Approaching it based on the invite response means that getting an invite response is the trigger for a healthy call (if it is a successful response) and if you needed an auth state you would have gotten one before the response.

If you approach it like this, then the client would "throw away" knowledge about the verto.invite it sent if it gets disconnected before receiving the invite response. This would in turn mean that when reconnecting with the same protocol, it would ignore the verto.answer event pending for that previous call attempt, and could start a new call with a new call id. This would be the most reliable way, since auth state is just a state update and should not be treated as guaranteed to ever occur; there is a sort of delta check so it is only sent if it changes.

What you CAN depend on, is that IF there is going to be an authorization state update during a verto.invite, then you WILL receive it before the verto.invite response. Therefore you can depend on the verto.invite response as the point when things are healthy for a reconnect/hijack/reattach.

Therefore to answer the original question in shorter form, it doesn't really matter. You may not get an auth state, so don't depend on it, depend on the invite response if you need something to control state and flow, but there is nothing stopping you from sending a ping or ping response or other things while waiting for an invite response. The main point here is not to assume you can reattach a call until after you get the invite response, if you get disconnected before the invite response you should throw the call away.

🤔 Then I need to make changes...

The verto.invite is easy to handle, like you said. My problem was with the verto.answer.
The verto.answer is mostly why I did it this way...

Is there a way for the SDK to understand the stored auth_state is an old/stale auth_state?

Not really, it's supposed to be blind and just send the latest. It doesn't know if it hasn't received one because of a disconnect. Making any assumptions about expecting to receive one is also sketchy as it may not change depending on the scenario (like reattaching attempts).

The only true point of safe knowledge is that once we receive an invite response, we can assume IF we needed an updated authorization state then we will have already received it before the invite response making this call "reattachable".

Maybe we need to think about this slightly differently. The simplest way I can think about it is "An outbound call is not reattachable until we get the invite response", which means we should not try to reattach, answer, or deal with anything related to that call if we see information about it after a reconnect that says it's not reattachable.
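To make the rule above concrete, here is a minimal sketch of how a client could track it. Everything in it is hypothetical: `OutboundInviteTracker` and its method names are illustration only and do not exist in the SDK, and the wiring to the real session and verto handlers is assumed rather than shown.

```typescript
// Hypothetical sketch: none of these names come from the SDK.
type InviteState = 'pending' | 'confirmed'

class OutboundInviteTracker {
  // Invites this client started, keyed by call id.
  private invites = new Map<string, InviteState>()
  // Call ids thrown away because the socket dropped before the invite response.
  private discarded = new Set<string>()

  /** Record that a verto.invite request was sent for this call id. */
  inviteSent(callId: string): void {
    this.invites.set(callId, 'pending')
  }

  /** Record a successful verto.invite response for this call id. */
  inviteResponseReceived(callId: string): void {
    if (this.invites.get(callId) === 'pending') {
      this.invites.set(callId, 'confirmed')
    }
  }

  /** On disconnect, throw away every invite that never got its response. */
  socketDisconnected(): void {
    this.invites.forEach((state, callId) => {
      if (state === 'pending') {
        this.invites.delete(callId)
        this.discarded.add(callId)
      }
    })
  }

  /** A call is reattachable only after its invite response arrived. */
  isReattachable(callId: string): boolean {
    return this.invites.get(callId) === 'confirmed'
  }

  /** Events (e.g. verto.answer) for a discarded call id should be ignored. */
  shouldIgnoreEvent(callId: string): boolean {
    return this.discarded.has(callId)
  }
}
```

With something like this in place, a verto.answer arriving after a reconnect would first be checked with `shouldIgnoreEvent(callId)`, and a reattach would only be attempted when `isReattachable(callId)` is true. Note that the sketch does not look at authorization state events at all, which matches the point that they cannot be relied on.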

@iAmmar7 @Astaelan I'm moving the discussion to the ticket...

Can you link the ticket to this PR so I can find it, thanks.

iAmmar7 · 2025-05-09T15:41:51Z

avoid multiple signalwire.connect in the same session

Could you please share in which scenario the SDK sends this request twice?

stop the RTC negotiation if the authorization_state sync is lost

What does it mean by lost? How does the SDK lose this during the call?

jpsantosbh · 2025-05-09T17:05:32Z

packages/js/src/fabric/SATSession.ts

@@ -78,8 +80,9 @@ export class SATSession extends JWTSession {
        variation: this.options.apiRequestRetriesDelayIncrement,
      }),
      expectedErrorHandler: (error) => {
-        if (error?.message?.startsWith('Authentication failed')) {


@iAmmar7

Could you please share in which scenario the SDK sends this request twice?

In the case of a double signalwire.connect, the server doesn't fail the authentication; it fails with a "Method not recognized" error instead. You can verify a real case in the production logs for the protocol signalwire_5d863653-9c10-414c-a9ee-5368b06da742_a988d370-f27a-4b97-8011-89b1bda89b83_b85f0246-f186-4d79-a42c-266cced37153.

The problem was two retry mechanisms in "parallel" for the signalwire.connect. With this change, we let only the connection handle the retries.
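As an illustration of the failure mode only: the PR itself removes one of the two retry layers, whereas the sketch below shows a different, complementary guard that collapses concurrent connect attempts into one in-flight request. `SingleFlightConnect` and `sendConnect` are hypothetical names, not SDK APIs.

```typescript
// Hypothetical sketch: a guard so that overlapping retry paths cannot
// issue a second signalwire.connect while one is still in flight.
class SingleFlightConnect {
  private inFlight: Promise<void> | null = null

  // `sendConnect` stands in for whatever actually sends signalwire.connect.
  constructor(private readonly sendConnect: () => Promise<void>) {}

  connect(): Promise<void> {
    if (this.inFlight === null) {
      const attempt = this.sendConnect()
      this.inFlight = attempt
      // Clear the slot once the attempt settles, whether it succeeded or failed,
      // so a later retry can start a fresh attempt.
      const clear = () => {
        this.inFlight = null
      }
      attempt.then(clear, clear)
    }
    return this.inFlight
  }
}
```

Both retry paths would call `connect()` on the same guard; whichever arrives second receives the promise of the attempt already in progress instead of sending a duplicate request.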

jpsantosbh · 2025-05-09T17:18:24Z

@iAmmar7

What does it mean by lost? How does the SDK lose this during the call?
lost sync

An authorization_state sent by the server never gets to the client. Since WebSocket reconnections can't clean the authorization_state, the server and client end up out of sync.

A real case in production can be verified with the connection id 943134ba-f6b3-4c94-8db0-7008f5f54241

Astaelan · 2025-05-09T22:37:53Z

Discussions moved to https://github.com/signalwire/cloud-product/issues/14632

iAmmar7 · 2025-05-29T17:27:16Z

@jpsantosbh, since you mentioned this PR in the standup for review. I’m honestly not sure what we’re trying to achieve here, it seems the scope of the PR may have changed significantly.

As @Astaelan mentioned, the SDK only needs to:

"Ignore a verto.answer event (or any events) with a call id that matches to a verto.invite request the client started where the client never received a matching verto.invite response"

I strongly suggest splitting this PR work into multiple PRs, each addressing a single task. Otherwise, reviewing the PR will be very hard and would generate a lot of back-and-forth.

jpsantosbh · 2025-05-30T12:37:21Z

@jpsantosbh, since you mentioned this PR in the standup for review. I’m honestly not sure what we’re trying to achieve here, it seems the scope of the PR may have changed significantly.

As @Astaelan mentioned, the SDK only needs to:

"Ignore a verto.answer event (or any events) with a call id that matches to a verto.invite request the client started where the client never received a matching verto.invite response"

I strongly suggest splitting this PR work into multiple PRs, each addressing a single task. Otherwise, reviewing the PR will be very hard and would generate a lot of back-and-forth.

Sure...
I split it into 4 other PRs.

jpsantosbh added 11 commits April 16, 2025 16:53

shouldAttach change

82e9b58

reenable tests

be09de2

Merge branch 'main' into joao/fix_network_handoff

e195e97

no verto.invite for simple re connections

d90b023

changeset

2fd2f12

Merge branch 'main' into joao/fix_network_handoff

783d4e8

Merge branch 'main' into joao/fix_network_handoff

571132f

review changes

8e7faa8

use dialAddress

26a9a10

manually tested

d6a9bed

fix ws re-connections

2cea4c1

jpsantosbh requested review from giavac, Astaelan and iAmmar7 May 9, 2025 15:29

jpsantosbh commented May 9, 2025

jpsantosbh mentioned this pull request May 9, 2025

Fix WebSocket reconnections #1206

Closed

4 tasks

jpsantosbh commented May 9, 2025

revert expect auth)state

2982f35

jpsantosbh changed the base branch from joao/fix_network_handoff to main May 15, 2025 19:44

jpsantosbh added 7 commits May 16, 2025 13:25

outbound calls reconnects

f549670

fix import

02cb470

skip test

595fd47

restore auth retry

67b77c0

incoming calls reattach

5c60b10

Merge branch 'main' into joao/fix_reconnections

26cb6f1

cleanup

85c58cc

jpsantosbh marked this pull request as draft May 27, 2025 14:45

jpsantosbh added 2 commits May 27, 2025 12:31

calee reattach fixed

406877b

cleanup

39f46b0

jpsantosbh added 2 commits May 29, 2025 16:17

Merge branch 'main' into joao/fix_reconnections

1e44780

merge changes from slipped

475184b

This was referenced May 30, 2025

Joao/fix reconnection after verto invite only #1227

Open

Joao/fix reconnection after connect #1228

Open

Joao/fix reconnection sdp message only #1229

Open

Joao/fix callle reattach only #1230

Open
