Skip to content

binder: Introduce server pre-authorization #12127

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 73 commits into
base: master
Choose a base branch
from

Conversation

jdcormie
Copy link
Member

@jdcormie jdcormie commented Jun 3, 2025

grpc-binder clients authorize servers by checking the UID of the sender of the SETUP_TRANSPORT Binder transaction against some SecurityPolicy. But merely binding to an unauthorized server to learn its UID can enable "keep-alive" and "background activity launch" abuse, even if security policy ultimately causes the grpc connection to fail. Pre-authorization mitigates this kind of abuse by resolving addresses and authorizing a candidate server Application's UID before binding to it. Pre-auth is especially important when the server's address is not fixed in advance but discovered by PackageManager lookup.

Design at: go/grpc-binder-service-discovery "Mitigating Keep-Alive Attacks"

ejona86 and others added 30 commits June 26, 2025 18:33
We can easily compute the rdsName and avoiding the state means we don't
need to override onResourceDoesNotExist() to keep the cache in-sync with
the config.
I noticed we deviated from gRFC A37 in some ways. It turned out those
were added to the gRFC later in grpc/proposal#344:
- NACKing empty aggregate clusters
- Failing aggregate cluster when children could not be loaded
- Recusion limit of 16. We had this behavior already, but it was
  ascribed to matching C++

There's disagreement on whether we should actually fail the aggregate
cluster for bad children, so I'm preserving the pre-existing behavior
for now.

The code is now doing a depth-first leaf traversal, not breadth-first.
This was odd to see, but the code was also pretty old, so the reasoning
seems lost to history. Since we haven't seen more than a single level of
aggregate clusters in practice, this wouldn't have been noticed by
users.

XdsDependencyManager.start() was created to guarantee that the callback
could not be called before returning from the constructor. Otherwise
XDS_CLUSTER_SUBSCRIPT_REGISTRY could potentially be null.
Just use a regular method instead of reusing the EvictionListener API.
Fix a few comments as well. Both of these changes were based on review
comments to pre-existing code in grpc#11203.

Contributes to grpc#11243
Changed the builder pattern to pass the builder to the constructor,
since I was changing almost all the arguments of the constructor anyway.
We here address the following obstacles in grpc-java to using Bazel's
--incompatible_disable_target_default_provider_fields flag:

```
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: in _java_grpc_library rule @@googleapis+//google/devtools/build/v1:build_java_grpc:
Traceback (most recent call last):
        File "/private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/grpc-java+/java_grpc_library.bzl", line 94, column 30, in _java_rpc_library_impl
                args.add(toolchain.plugin.files_to_run.executable, format = "--plugin=protoc-gen-rpc-plugin=%s")
Error: Accessing the default provider in this manner is deprecated and will be removed soon. It may be temporarily re-enabled by setting --incompatible_disable_target_default_provider_fields=false. See bazelbuild/bazel#20183 for details.
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: Analysis of target '@@googleapis+//google/devtools/build/v1:build_java_grpc' failed
ERROR: Analysis of target '//src:bazel' failed; build aborted: Analysis failed
```
The watchers can be completely regular, so the base class can do the
cache management while the subclasses are only concerned with
subscribing to children.
It was introduced in fcb5c54 because at the time we didn't change the
API to communicate the status. When onResult2() was introduced in
90d0fab this hack stopped being necessary.
None of these classes were intended to be extended. Even non-public
classes need final to prevent mocks from doing horrible things.
This will be used for logical dns clusters as part of gRFC A74. Swapping
to EnumMap wasn't really necessary, but was easy given the new type
system.

I can't say I'm particularly happy with the name of the new
TrackedWatcher type, but XdsConfigWatcher prevented using "Watcher"
because it won't implement the new interface, and ResourceWatcher
already exists in XdsClient. So we have TrackedWatcher, WatcherTracer,
TypeWatchers, and TrackedWatcherType.
TimeProvider provides wall time. That can move forward and backward as time is adjusted. OutlierDetection is measuring durations, so it should use a monotonic clock.

Fixes grpc#11622
This is missing behavior defined in gRFC A74:

> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.

Without the logic, RPCs can race and see the error:

> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster

Fixes grpc#12152. This fixes the regression introduced in 297ab05
This should often not matter much, but in b/412468630 it was cleary
visible that child creation order can skew load for the first batch of
RPCs. This doesn't solve all the cases, as further-away backends will
still be less likely chosen initially and it is ignorant of the LB
policy. But this doesn't impact correctness, is easy, and is one fewer
cases to worry about.
ClusterResolverLb gets the NameResolverRegistry from
LoadBalancer.Helper, so a new API was added in NameResover.Args to
propagate the same object to the name resolver tree.

RetryingNameResolver was exposed to xds. This is expected to be
temporary, as the retrying is being removed from ManagedChannelImpl and
moved into the resolvers. At that point, DnsNameResolverProvider would
wrap DnsNameResolver with a similar API to RetryingNameResolver and xds
would no longer be responsible.
…mination (grpc#12167)

# Conflicts:
#	binder/src/androidTest/java/io/grpc/binder/internal/BinderClientTransportTest.java
#	binder/src/main/java/io/grpc/binder/internal/BinderTransport.java
#	binder/src/testFixtures/java/io/grpc/binder/internal/SettableAsyncSecurityPolicy.java
ClusterResolverLb is still doing DNS itself, so disable it in XdsDepMan
until that migration has finished. EDS is fine in XdsDepman, because
XdsClient will share the result with ClusterResolverLb.
The previous code did a ping-pong to make sure the transport had enough
time to process, but then proceeded to sleep 5 seconds. That sleep would
have been needed without the ping-pong, but with the ping-pong we are
confident all events have been drained from the transport. Deleting the
unnecessary sleeps saves 10 seconds, for each of the 9 instances of this
test.
…dentity before binding.

# Conflicts:
#	binder/src/androidTest/java/io/grpc/binder/internal/BinderClientTransportTest.java
#	binder/src/main/java/io/grpc/binder/internal/BinderTransport.java
#	binder/src/test/java/io/grpc/binder/RobolectricBinderSecurityTest.java
Parameterize RobolectricBinderSecurityTest
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants