Limit to_device EDU size to 65536 #18416

Open

MatMaul wants to merge 11 commits into develop

Conversation

Contributor

@MatMaul commented May 9, 2025

If a set of messages exceeds this limit, the messages are split across several EDUs.

Should fix #17035.

There is currently no official limit for EDUs in the spec, but the consensus seems to be that it would be useful to have one, to avoid this bug by bounding the transaction size.

As a side effect it also limits the size of a single to-device message to a bit less than 65536 bytes.

This should probably be added to the spec similarly to the message size limit.
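For illustration, a minimal sketch of the splitting idea (simplified: it ignores the message_id and opentracing context fields and the per-device nesting, and the helper name is hypothetical, not the actual code in this PR):

    from typing import Any, Dict, List

    from canonicaljson import encode_canonical_json

    MAX_EDU_SIZE = 65536


    def build_edu_contents(
        sender_user_id: str,
        message_type: str,
        messages: Dict[str, Any],
    ) -> List[Dict[str, Any]]:
        """Split a recipient -> message map into several EDU `content` dicts so
        that each stays under MAX_EDU_SIZE once canonical-JSON encoded."""

        def new_content() -> Dict[str, Any]:
            return {"sender": sender_user_id, "type": message_type, "messages": {}}

        edu_contents: List[Dict[str, Any]] = []
        current = new_content()
        # Size of the EDU content without any messages in it.
        current_size = len(encode_canonical_json(current))

        for recipient, message in messages.items():
            # Extra bytes this entry adds: drop the braces of the single-entry
            # object (-2) and allow for the separator joining it to its neighbours (+1).
            entry_size = len(encode_canonical_json({recipient: message})) - 2 + 1
            if current["messages"] and current_size + entry_size > MAX_EDU_SIZE:
                edu_contents.append(current)
                current = new_content()
                current_size = len(encode_canonical_json(current))
            current["messages"][recipient] = message
            current_size += entry_size

        edu_contents.append(current)
        return edu_contents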

Pull Request Checklist

@MatMaul marked this pull request as ready for review May 9, 2025 14:38
@MatMaul requested a review from a team as a code owner May 9, 2025 14:38
@@ -28,6 +28,7 @@

# the max size of a (canonical-json-encoded) event
MAX_PDU_SIZE = 65536
MAX_EDU_SIZE = 65536
Contributor

Suggested change
MAX_EDU_SIZE = 65536
# This isn't spec'ed but is our own reasonable default to play nice with Synapse's
# `max_request_size`/`max_request_body_size`. We chose the same as `MAX_PDU_SIZE` as our
# `max_request_body_size` math is currently limited by 200 `MAX_PDU_SIZE` things. The
# spec for a `/federation/v1/send` request sets the limit at 100 EDUs and 50 PDUs,
# which is below that 200 `MAX_PDU_SIZE` limit (`max_request_body_size`).
#
# Allowing oversized EDUs results in failed `/federation/v1/send` transactions (because
# the request overall can overrun the `max_request_body_size`) which are retried over
# and over and prevent other outbound federation traffic from happening.
MAX_EDU_SIZE = 65536

Comment on lines +304 to +307
edu_contents = get_device_message_edu_contents(
    sender_user_id, message_type, messages, context
)
remote_edu_contents[destination] = edu_contents
Contributor

Instead of changing the structure of remote_edu_contents (from a map of destination to a single EDU's content, to a map of destination to a list of EDU contents), could we just call add_messages_to_device_inbox(...) multiple times?

Contributor Author

The multi version should have some performance gains, since it runs in a single transaction and pre-allocates all the stream IDs.

I am fine if we decide to keep it simple and sacrifice some perf for that, but I am not sure it's worth it here since it's not overly complicated.

"type": message_type,
"message_id": random_string(16),
}
# This is the size of the full EDU without any messages and without the opentracing context
Contributor

Why is the BASE_EDU_SIZE calculated without BASE_EDU_CONTENT["org.matrix.opentracing_context"]?

Contributor Author

To my understanding it's some data that helps report to the correct opentracing context throughout the code, and it's stripped out of the payload before the transaction is sent.

Contributor

@MatMaul Thanks for linking the context! We should explain this here in the comment

if current_edu_size + message_entry_size > MAX_EDU_SIZE:
    edu_contents.append(current_edu_content)
    logger.debug(
        "Splitting %d device messages from %s into EDU msgid %s, %d EDUs queued",
Contributor

Suggested change
"Splitting %d device messages from %s into EDU msgid %s, %d EDUs queued",
"Splitting %d to-device messages from %s into EDU (message_id=%s), (total EDUs so far: %d)",


edu_contents = []

current_edu_content: JsonDict = deepcopy(BASE_EDU_CONTENT)
Contributor

Instead of this cloning, perhaps it's easier to understand if we just have a little helper (maybe performs better as well 🤷):

    def create_new_to_device_edu_content() -> JsonDict:
        """Create a new `m.direct_to_device` EDU `content` object with a unique message ID."""
        content = {
            "messages": {},
            "sender": sender_user_id,
            "type": message_type,
            "message_id": random_string(16),
            "org.matrix.opentracing_context": json_encoder.encode(context)
        }
        return content

) -> List[JsonDict]:
    """
    This function takes a dictionary of messages and splits them into several EDUs if needed.

Contributor

Could use a docstring for the args and return.

And some context on why we care to split, similar to how we explain it for MAX_EDU_SIZE above.

Comment on lines +489 to +495
logger.debug(
    "Queuing last %d device messages from %s into EDU msgid %s, %d EDUs queued",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)
Contributor

Suggested change
logger.debug(
    "Queuing last %d device messages from %s into EDU msgid %s, %d EDUs queued",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)
logger.debug(
    "Splitting remaining %d device messages from %s into EDU (message_id=%s), (total EDUs so far: %d)",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)


mock_send_transaction.reset_mock()

# 2 messages, each just big enough to fit in an EDU
Contributor

Suggested change
# 2 messages, each just big enough to fit in an EDU
# 2 messages, each just big enough to fit into their own EDU


self.assertEqual(mock_send_transaction.call_count, 2)

# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
Contributor

For my own understanding, this happens at

) = await self.queue._get_to_device_message_edus(edu_limit - 10)

It would be good to label this magic value as a constant which we could also cross-reference here.
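For example (the constant names below are hypothetical, purely to illustrate the suggestion; 100 is the per-transaction EDU cap mentioned in the test comment above):

    # The spec caps a /federation/v1/send transaction at 100 EDUs; Synapse keeps a
    # few of those slots back for EDUs other than to-device messages.
    MAX_EDUS_PER_TRANSACTION = 100
    RESERVED_EDUS_FOR_OTHER_PURPOSES = 10

    # ...and the call site quoted above would then read:
    #     ) = await self.queue._get_to_device_message_edus(
    #         edu_limit - RESERVED_EDUS_FOR_OTHER_PURPOSES
    #     )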


for recipient, message in messages.items():
    # We remove 2 for the curly braces and add 1 for the colon
    message_entry_size = len(encode_canonical_json({recipient: message})) - 2 + 1
Member

Drive-by thought: instead of trying to work out the lengths and calculate the number of messages we can add, it might be easier to just generate the EDU and then check the size of it. If it's too big you halve the number of messages and try again.

The common case will be that we don't need to split up the EDU, at the expense of duplicating some work. It feels a bit hacky, but I think it might be a little less brittle?
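A rough sketch of that generate-then-halve approach (illustrative only; the function name is hypothetical, and in practice each resulting EDU would also need its own message_id):

    from typing import Any, Dict, List

    from canonicaljson import encode_canonical_json

    MAX_EDU_SIZE = 65536


    def split_by_halving(
        base_content: Dict[str, Any], messages: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        """Optimistically build one EDU content; if it encodes too large,
        halve the batch of messages and recurse on each half."""
        content = dict(base_content, messages=messages)
        if len(encode_canonical_json(content)) <= MAX_EDU_SIZE or len(messages) <= 1:
            # Common case: everything fits in a single EDU (a single oversized
            # message would need its own error handling).
            return [content]
        items = list(messages.items())
        mid = len(items) // 2
        return split_by_halving(base_content, dict(items[:mid])) + split_by_halving(
            base_content, dict(items[mid:])
        )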

Contributor Author

I am not sure it will be that much simpler for comprehension TBH.

If we think we can eat the perf cost, the simplest option is probably to call encode_canonical_json on the whole EDU after each added message, and if it's larger than the max, remove that message and start a new EDU.

My calculation tricks are there to avoid doing a full serialization for each added message.
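A sketch of that simpler-but-slower variant (illustrative only, not code from the branch; the function name is hypothetical):

    from typing import Any, Dict, List

    from canonicaljson import encode_canonical_json

    MAX_EDU_SIZE = 65536


    def split_by_reserialising(
        base_content: Dict[str, Any], messages: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        """Add messages one at a time, re-encoding the whole EDU content after each
        addition; when it grows past MAX_EDU_SIZE, move the last message to a new EDU."""
        edu_contents: List[Dict[str, Any]] = []
        current: Dict[str, Any] = dict(base_content, messages={})
        for recipient, message in messages.items():
            current["messages"][recipient] = message
            too_big = len(encode_canonical_json(current)) > MAX_EDU_SIZE
            if too_big and len(current["messages"]) > 1:
                # Pull the offending message back out, finish this EDU, start a new one.
                del current["messages"][recipient]
                edu_contents.append(current)
                current = dict(base_content, messages={recipient: message})
        edu_contents.append(current)
        return edu_contents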

Contributor Author

And add a special case, tried first, where we try to put everything in one EDU?

I don't know TBH, the idea of splitting in 2 is nice too, but I feel like it is going to be quite annoying to implement and hence not simpler.

Development

Successfully merging this pull request may close these issues.

A long queue of to-device messages can prevent outgoing federation working
3 participants