Auth caching issue #5883

Open
baznikin opened this issue Mar 26, 2025 · 7 comments
@baznikin

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug that the documentation does not mention anything about
  • I have found a bug for which there are no open or closed related issues
  • I have provided version information about my environment and done my best to provide a reproducer

Description of bug

We run the buildkit daemon to build Docker images with GitLab. We are facing issues which are caused, I believe, by some sort of auth caching.

  1. User A, with specific registry rights, runs a build job in repo groupA/repoA. The build is successful.
  2. User B, with different permissions, runs a build job in repo groupB/repoB. The build fails with the error server message: insufficient_scope: authorization failed. The build logs mention groupA/repoA:
#21 [auth] groupA/projectA/branch:pull,push groupB/projectB/othebranch:pull token for registry.mydomain.com

If we restart the buildkit daemon and run the jobs in a different order, the situation changes.

We first encountered this issue 4 days after we implemented the buildkit daemon. After some restarts, and running the same versions of the CLI (it was rootless-master, became 0.20.1) and the daemon (it was 0.18, became 0.20.1), the issue vanished and then returned today, when I tried to build an image for a completely isolated project and user who has no access to other repositories. Upon studying the logs I noticed a mention of a completely different project, which this limited user has no access to and which wasn't mentioned in the pipeline either - obviously it came from buildkit itself!

Reproduction

I have no particular reproduction scenario; logs follow. The logs were redacted by substituting sensitive information such as project names, group names, and domains with placeholders.

CLI logs, as I see them in GitLab:

#17 [9/9] COPY ./ /app
#17 DONE 7.4s
#18 exporting to image
#18 ...
#19 [auth] groupA/projectA/branch:pull,push token for registry.mydomain.com
#19 DONE 0.0s
#18 exporting to image
#18 exporting layers done
#18 exporting manifest sha256:f5c3c9d7afc6b922afaf1cac8ff2b7cb4111d43bf1fa7a3dce3e898843950ee2 0.0s done
#18 exporting config sha256:c35890b3582f426c92c3523c053e12483201a4a3918be4732a81ad809f0a925f done
#18 pushing layers
#18 ...
#20 [auth] groupA/projectA/branch:pull,push groupA/projectB/agregator/backend/buildkit-cache:pull token for registry.mydomain.com
#20 DONE 0.0s
#21 [auth] groupA/projectA/branch:pull,push groupA/projectB/agregator/backend/buildkit-cache:pull token for registry.mydomain.com
#21 DONE 0.0s
#22 exporting to image
#22 exporting manifest sha256:f5c3c9d7afc6b922afaf1cac8ff2b7cb4111d43bf1fa7a3dce3e898843950ee2 done
#22 exporting config sha256:c35890b3582f426c92c3523c053e12483201a4a3918be4732a81ad809f0a925f done
#22 pushing layers 0.7s done
#22 ERROR: failed to push registry.mydomain.com/groupA/projectA/branch:be7804fa: server message: insufficient_scope: authorization failed
#18 exporting to image
#18 pushing layers 0.7s done
#18 CANCELED

Buildkit daemon log and trace for this build: https://gist.github.com/baznikin/9bd860a22a96b0bbbf5cef9601e76b44

Version information

daemon: moby/buildkit:v0.20.1-rootless
cli: moby/buildkit:v0.20.1-rootless

The daemon is installed with a Helm chart using Terraform; resource declaration:

resource "helm_release" "buildkit-daemon" {
  name       = "buildkit"
  namespace  = "gitlab-runner"
  repository = "https://andrcuns.github.io/charts"
  chart      = "buildkit-service"
  version    = "0.9.0"

  values = [<<-YAML
    image:
      tag: v0.20.1-rootless
    preStop: true
    rootless: true
    resources:
      requests:
        #ephemeral-storage: 95Gi
        cpu: 10
        memory: 16Gi
    nodeSelector:
      purpose: buildkit
    tolerations:
      - key: buildkit
        effect: NoSchedule
  YAML
  ]
}

The daemon is called from the GitLab pipeline template this way:

      buildctl
      --addr tcp://buildkit-buildkit-service:1234
      build
      --frontend dockerfile.v0
      --local context=${BUILD_CONTEXT}
      --local dockerfile=${BUILD_DOCKERFILE_DIR}
      --opt filename=${BUILD_DOCKERFILE_NAME}
      $BUILD_ARGS
      --export-cache type=registry,ref=${CI_REGISTRY_IMAGE}/buildkit-cache:${CI_COMMIT_REF_SLUG}-${CI_JOB_NAME_SLUG},mode=max,image-manifest=true,ignore-error=true
      --import-cache type=registry,ref=${CI_REGISTRY_IMAGE}/buildkit-cache:${CI_COMMIT_REF_SLUG}-${CI_JOB_NAME_SLUG}
      --import-cache type=registry,ref=${CI_REGISTRY_IMAGE}/buildkit-cache:${CI_DEFAULT_BRANCH}-${CI_JOB_NAME_SLUG}
      --output type=image,name=${LATEST},push=true
      --output type=image,name=${BUILD_DESTINATION},push=true
@tonistiigi
Member

This is somewhat expected in multi-user use cases, as the buildkit daemon does not provide isolation between different users. The content from the first build is matched against the second build, but in order to push it with a cross-repo mount, the second build needs to prove that it had access to the original source.

There might be some hacky fix for this if we can detect the conditions where cross-repo mount does not work and fall back to an inefficient re-upload of the layer bytes.
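
To illustrate what "cross-repo mount" means here: when a layer already exists in another repository on the same registry, the client asks the registry to mount it instead of re-uploading the bytes. A rough sketch against the Docker Registry HTTP API v2, using the placeholder names from the logs above:

# Sketch of the cross-repo blob mount request (Docker Registry HTTP API v2).
# The registry honors "mount" only if the caller's token also carries pull
# scope for the source repository named in "from" - which is why the [auth]
# lines above request pull on groupA/projectB/.../buildkit-cache.
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  "https://registry.mydomain.com/v2/groupA/projectA/branch/blobs/uploads/?mount=sha256:<layer-digest>&from=groupA/projectB/agregator/backend/buildkit-cache"
# 201 Created  -> layer mounted without re-uploading its bytes
# 202 Accepted -> mount refused; the client must fall back to a full upload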

@baznikin
Author

I wonder how people even use the buildkit daemon with GitLab - are they lucky enough to have the same access rights for all users?

It is pretty strange to me why one job needs access to a completely different repo. In my opinion, the daemon is instructed to build an image - we provide it with a Dockerfile and context, we provide it with credentials that are enough to build the image, and we instruct it where to get caches and where to put the results. If the daemon sees "similar" layers from previous builds, let it just pick them from the local cache and reuse them. If it can't, and would have to pull them from some other location not mentioned in the current build job, just rebuild them without using the cache.

At first glance I thought the daemon cached credentials from the first build job ("to access registry.mydomain.com, always use this") and then failed to push to a different image on the same registry. If so - maybe just add a switch to turn off auth caching?

@tonistiigi
Member

#21 [auth] groupA/projectA/branch:pull,push groupA/projectB/agregator/backend/buildkit-cache:pull token for registry.mydomain.com
#21 DONE 0.0s
#22 exporting to image
#22 exporting manifest sha256:f5c3c9d7afc6b922afaf1cac8ff2b7cb4111d43bf1fa7a3dce3e898843950ee2 done
#22 exporting config sha256:c35890b3582f426c92c3523c053e12483201a4a3918be4732a81ad809f0a925f done
#22 pushing layers 0.7s done
#22 ERROR: failed to push registry.mydomain.com/groupA/projectA/branch:be7804fa: server message: insufficient_scope: authorization failed
#18 exporting to image

You are pushing to registry.mydomain.com/groupA/projectA/branch, so you always need to have a push token for that. My understanding is that this build failed because you didn't have access to groupA/projectB/agregator/backend/buildkit-cache:pull, and that is the part that was used by the previous build (and now layers from that source are part of your local storage).
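
The scopes visible in the #21 [auth] line map directly onto the token request buildkit makes before pushing. A sketch of that request, assuming GitLab's registry token endpoint (the realm URL actually comes from the registry's WWW-Authenticate header):

# Token negotiation behind the "#21 [auth]" step (illustrative endpoint).
# One scope per repository touched by the push: the destination needs
# pull,push; the cross-repo mount source needs pull.
curl -u "$CI_REGISTRY_USER:$CI_JOB_TOKEN" \
  "https://gitlab.mydomain.com/jwt/auth?service=container_registry&scope=repository:groupA/projectA/branch:pull,push&scope=repository:groupA/projectB/agregator/backend/buildkit-cache:pull"
# GitLab grants only the scopes the authenticated user is authorized for;
# a user with no access to the second repository gets no pull grant for it,
# which later surfaces as "insufficient_scope: authorization failed".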

@baznikin
Author

you didn't have access to groupA/projectB/agregator/backend/buildkit-cache:pull

Exactly! It is a different project with a different team and different access rights. The only thing they have in common is that they are both written in Python and use very similar-looking Dockerfiles based on python:3.13, that's all. Looks like buildkit now supposes the layers of the image are bound to another repository and wants pull access to it? :)

@chrisbradleydev

@baznikin

My team recently ran into the same issue.

We're currently exploring our options.

@one-adam-nolan

I am not sure if this should be considered a feature request or a bug.

First, it's important to know that each project within GitLab has its own instance of a Container Registry; see GitLab Container Registry.

@baznikin My assumption is that your pipeline is authenticating with CI_REGISTRY_PASSWORD or CI_JOB_TOKEN (which, if both are present, are the same value; see Predefined Variables). The permissions for that token are scoped to the project the GitLab Runner is working with...

This means that when a build is completed and it is using layers from the local cache that have been pushed to projects other than the current one, we get this error. This is because the CI_JOB_TOKEN does not have access to the other project(s), including their registries.
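
For context, a typical GitLab CI job makes these credentials available to buildctl by writing a Docker config file (a sketch using GitLab's predefined CI variables; buildctl reads registry credentials from ~/.docker/config.json):

# Write registry credentials for buildctl from GitLab's predefined variables.
# CI_REGISTRY_USER/CI_JOB_TOKEN are scoped to the current project only.
mkdir -p ~/.docker
cat > ~/.docker/config.json <<EOF
{
  "auths": {
    "$CI_REGISTRY": {
      "auth": "$(printf '%s:%s' "$CI_REGISTRY_USER" "$CI_JOB_TOKEN" | base64 | tr -d '\n')"
    }
  }
}
EOF
# Any layer that buildkit tries to cross-repo mount from another project's
# registry path falls outside this token's scope and is rejected.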

It seems the existing behavior of BuildKit's local cache operates under the presumption that all builds will be pushed to the same image registry - this is not how GitLab works.

To point out the obvious (maybe not so obvious) - by isolating all of the registries, GitLab has lots of layer duplication across projects in their respective caches... which is not great, but is by design with the isolated registries.

What I see as potential solutions for this are the following:

  1. Add a flag named something like `--isolate-registries`. When used, if the path of the cached layer's project does not match the path where the new image is being pushed, a new layer is pushed to the new image's registry cache that does not reference the location/project of the locally cached layer.

  2. Add a flag such as `--bypass-local-cache` that can be used when both `--import-cache` and `--export-cache` are provided.

If you have made it this far, thanks for listening! My team ran into the same issue this week when attempting to switch build systems.

@fiam fiam self-assigned this Apr 3, 2025
@baznikin
Author

baznikin commented Apr 8, 2025

@baznikin My assumption is that your pipeline is authenticating with CI_REGISTRY_PASSWORD or CI_JOB_TOKEN (which, if both are present, are the same value; see Predefined Variables). The permissions for that token are scoped to the project the GitLab Runner is working with...

This means that when a build is completed and it is using layers from the local cache that have been pushed to projects other than the current one, we get this error. This is because the CI_JOB_TOKEN does not have access to the other project(s), including their registries.

Exactly!

It seems the existing behavior of BuildKit's local cache operates under the presumption that all builds will be pushed to the same image registry - this is not how GitLab works.

Yeah, basically it looks like buildkit "claims" a layer for a specific repository when it is first pushed to that repository.

I do not see why this is required... I suppose it should build the image using its caches and the provided context, and then just push it to the specified destination using the specified credentials. Don't "bind" layers to a specific repository, or don't cache auth credentials.

Maybe I am wrong about the implementation details; I didn't read the sources.

To point out the obvious (maybe not so obvious) - by isolating all of the registries, GitLab has lots of layer duplication across projects in their respective caches... which is not great, but is by design with the isolated registries.

What I see as potential solutions for this are the following:

1. Add a flag named something like `--isolate-registries`. When used, if the path of the cached layer's project does not match the path where the new image is being pushed, a new layer is pushed to the new image's registry cache that does not reference the location/project of the locally cached layer.

2. Add a flag such as `--bypass-local-cache` that can be used when both `--import-cache` and `--export-cache` are provided.

Or `--bypass-auth-cache`?
