Skip to content

[Feature] Noise XKpsk3 integration (2025 version) #5692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

simonwicky
Copy link
Contributor

@simonwicky simonwicky commented Apr 7, 2025

Description

Noise PR #4360 is dead, long live Noise PR #5692.

No stacked PRs this time, commits description are quite explicit (and it's less of a mess than last time)


This change is Reviewable

@simonwicky simonwicky requested a review from jstuczyn April 7, 2025 12:27
Copy link

vercel bot commented Apr 7, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nym-explorer-v2 ❌ Failed (Inspect) May 21, 2025 7:59am
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
docs-nextra ⬜️ Ignored (Inspect) Visit Preview May 21, 2025 7:59am
nym-next-explorer ⬜️ Ignored (Inspect) Visit Preview May 21, 2025 7:59am

@timkuijsten
Copy link

Is there any rationale of the choice for XKpsk3? I.e. how to do DoS mitigation? Also I'm curious how this relates to the announcement of planning to use McEliece.

@aniampio
Copy link
Contributor

@timkuijsten Thanks for the question! We decided to use XKpsk3 because it provides the best forward secrecy, authentication, replay protection, and identity-hiding guarantees for client-server settings where the client cannot be identified based on its source IP address. Also, as PQ-security is on our roadmap, we opted for the variant with PSK, as the pre-shared key can be later used to inject Post-Quantum safety into the protocol. We also decided to use the XKpsk3 between nodes, as we are considering supporting private gateways in the future, so the identity of the initiator may not be obvious from the source IP address (entry-mixnode). Then, for the rest of the connections (mix node to mix node, and mix node to exit) we also use the same pattern to keep the usage across the network uniform.

In terms of DoS protection, deploying Noise at the transport layer helps shield the Sphinx layer, as each message must first be validated by Noise before any Sphinx processing occurs. However, Noise itself can still be vulnerable to DoS attacks. WireGuard mitigates this risk using a cookie-based mechanism. We could adopt a similar approach, or explore alternative strategies — all of which are currently under consideration.

let node_id = nym_node.node_id;

let role: NodeRole = rewarded_set.role(node_id).into();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you copied it from skimmed response but you dropped filtering for inactive nodes, what gives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since (at least for the moment) we need this endpoint to get the Noise keys, we also need it for inactive nodes (tests and stuff) so I did not why we would need it only for active ones.
That being said, I can add the filtering back

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's not designed to be a noise-exlusive endpoint : )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the option for later in d4d6e42
I didn't implement the active version of the endpoint though

match self.as_str().find("psk") {
Some(n) => {
let psk_index = n + 3;
let psk_char = self.as_str().chars().nth(psk_index).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about it, given we're using hardcoded patterns, couldn't we also hardcode that psk_position to avoid unwrap alltogether?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding/modifying patterns, I don't want to have that info in two places, that can lead to mismatch if not maintained properly.

The snow library is parsing the pattern string to get its scheme so it's not really far from that.

Given that, I feel like this is still the better option no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding/modifying patterns, I don't want to have that info in two places, that can lead to mismatch if not maintained properly.

that's why we write unit tests : )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can derive EnumIter on the guy and use it in tests to make sure every variant is always covered and returns expected value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added tests to ensure validity of all the noise pattern we hardcode in 090026d
Let's keep the hardcoded position in the tests though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why not also use constant for the psk_position itself xD (I guess I really don't like unwraps that could be removed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reckon having the automated construction somewhere and the constants somewhere else is a good thing to ensure validity.
I'm team automated construction in the actual code and constants in the tests, if the tests pass, those unwraps cannot fail.
Alternatively, I can remove them and make the fn fallible, since it's already used in a fallible function. But again, if the tests pass, they can't fail

.support
.inner
.load()
.get(&ip_addr.to_canonical()) // SW default bind address being [::]:1789, it can happen that a responder sees the ipv6-mapped address of the initiator, this check for that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would that matter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, wouldn't it be better if we always and only used canonical address?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter because I have seen a case where the responder sees the IPv6-mapped version of the initiator's IP, and it needs to canonicalize it to find it.

As for only using the canonical version, using solely the IP for this is kinda shaky, so let's put all the chances on our side no? We're more likely to miss a Noise supporting node than to think a node is supporting it when it doesn't

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. wouldn't using canonical address help us in reducing chances of accidentally having duplicates of the same underlying node?

}

// Only for phased update
//SW This can lead to some troubles if two nodes shares the same IP and one support Noise but not the other. This in only for the progressive update though and there is no workaround
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i hope this doesn't introduce any problems/vulnerabilities, because somebody will attempt to use that.

couldn't we simply say that if we have connection from ip A.B.C.D and it doesn't use noise, all connections from that ip are not allowed to use noise (and vice versa)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They might exploit and block you if you don't update. If you do, they can't.

We could but then it means somebody can force a downgrade on your up-to-date node. Better to cause problems on older than newer nodes no?

return Poll::Ready(Err(io::ErrorKind::InvalidInput.into()));
};
noise_buf.truncate(len);
match projected_self.inner_stream.start_send(noise_buf.into()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but start_send does not imply the data has been sent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added flushing afterwards in fc16dc6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work as you wouldn't be able to flush it before sending is finished. you'd need some sort of state machine to keep track of it. i think. perhaps use one of tokio-util's adapters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold up, Tokio is doing exactly that in their SinkWriter implementation to go from a Sink to an AsyncWrite :
https://docs.rs/tokio-util/latest/src/tokio_util/io/sink_writer.rs.html#104

let res = self
.client
.get_all_basic_nodes()
.get_all_expanded_nodes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a tiny issue. currently nym api will return unimplemented on that endpoint : (

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well let's sure to update the nym-api before the nodes.
This endpoint is implemented in that PR

.flat_map(|n| {
n.basic.ip_addresses.iter().map(|ip_addr| {
(
SocketAddr::new(*ip_addr, n.basic.mix_port),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, hold on a second here. i'm not sure this is the port you want to be using. this is the port the remote node is listening on. when it tries to establish connection to your node, it will be different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the info I want. The port info is used by initiator, which is sending to your node.
The responder will only use the IP part that is constructed here, where I'm removing the port info :

let noise_support = new

@timkuijsten
Copy link

In terms of DoS protection, deploying Noise at the transport layer helps shield the Sphinx layer, as each message must first be validated by Noise before any Sphinx processing occurs. However, Noise itself can still be vulnerable to DoS attacks. WireGuard mitigates this risk using a cookie-based mechanism. We could adopt a similar approach, or explore alternative strategies — all of which are currently under consideration.

I guess XK + cookie and making sure there is no responder-side state for unauthenticated connections (like rosenpass did) would put you in a good position. Are there already any public discussions, designs or transcripts?

.support
.inner
.load()
.get(&ip_addr.to_canonical()) // SW default bind address being [::]:1789, it can happen that a responder sees the ipv6-mapped address of the initiator, this check for that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. wouldn't using canonical address help us in reducing chances of accidentally having duplicates of the same underlying node?


Poll::Ready(Some(Ok(noise_msg))) => {
//We have a new moise msg
let mut dec_msg = vec![0u8; MAXMSGLEN];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, do we always have to allocate so much data per each message? i don't think that was the noise design

Copy link
Contributor Author

@simonwicky simonwicky Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit : Reading the code (from snow too) again, indeed allocating noise_message.len() - TAGLEN will suffice, fixed in ea233c7

let mut dec_msg = vec![0u8; MAXMSGLEN];
let len = match projected_self.noise {
Some(transport_state) => {
match transport_state.read_message(&noise_msg, &mut dec_msg) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just return those bytes immediately here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we have things in the buffer that needs to be returned before that.
A call to this poll_read might end up with us having data we have decrypted but can't return yet.
Hence we need to check it first before returning that part

Poll::Ready(Err(err)) => Poll::Ready(Err(err)),

Poll::Ready(Ok(())) => {
let mut noise_buf = BytesMut::zeroed(MAXMSGLEN + TAGLEN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not reuse existing buf or at least use its length? it's seems very wasteful to always allocate 60k for each tiny write...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory the writes can be up to 60k bytes though

Copy link
Contributor Author

@simonwicky simonwicky Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit : Reading the code (from snow too) again, indeed allocating buf.len() + TAGLEN will suffice, fixed in ea233c7

}
}

impl AsyncWrite for NoiseStream {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would https://docs.rs/tokio-util/latest/tokio_util/io/struct.SinkWriter.html work instead of implementing AsyncWrite? actually, why do we even need AsyncWrite + AsyncRead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without Noise, we're wrapping the TcpStream into a tokio_util::codec::Framed with the sphinx packet codec.
To code a one-for-one replacement, NoiseStream has to be AsyncWrite + AsyncRead in a way or another

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SinkWriter doesn't work for the same reason StreamReader doesn't. We need to handle the encryption in the middle

@simonwicky
Copy link
Contributor Author

@timkuijsten There are no public discussions or designs yet no.
Using cookies is a good lead, although it's just used to confirm the IP address so it can then be limited with other means.

@timkuijsten
Copy link

Using cookies is a good lead, although it's just used to confirm the IP address so it can then be limited with other means.

But nym uses TCP right? Isn't TCP's three-way-handshake enough for IP ownership confirmation? (unlike WireGuard which needs the cookie because it uses UDP).

@simonwicky
Copy link
Contributor Author

@timkuijsten For the moment it does indeed. That's a fair point, I'll need to think about it.

@simonwicky simonwicky added this to the Godiva milestone Apr 22, 2025
Err(err) => {
error!("Failed to perform Noise handshake with {address} - {err}");
// we failed to finish the noise handshake - increase reconnection attempt
self.current_reconnection.fetch_add(1, Ordering::SeqCst);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't we end up in a constant reconnection loop if receiver doesn't support noise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is a bit misleading tbh, my bad.
If the receiver doesn't support Noise, there won't be any handshake, and we will just return the TcpStream wrapped in a Connection for compatibility

match self.as_str().find("psk") {
Some(n) => {
let psk_index = n + 3;
let psk_char = self.as_str().chars().nth(psk_index).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why not also use constant for the psk_position itself xD (I guess I really don't like unwraps that could be removed)


while !handshake.is_handshake_finished() {
if handshake.is_my_turn() {
self.send_handshake_msg(&mut handshake).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking slightly deeper into it. i'm a bit confused: why are you having a method on self that takes handshake argument that originally was part of self but you temporarily removed it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need the inner_stream from self to send the encrypted handshake message.
I'm taking the handshake out from self to prevent duplicate calls to this fn from messing with the first one.

@simonwicky simonwicky modified the milestones: Appenzeller, Brie May 5, 2025
@vercel vercel bot temporarily deployed to Preview – nym-next-explorer May 20, 2025 14:33 Inactive
@vercel vercel bot temporarily deployed to Preview – nym-explorer-v2 May 20, 2025 14:33 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants