supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839
Comments
Extracting out this bit of ~speculation to its own comment: why isn't the policy applied on boot? I didn't know. It might be a kernel bug, or a docker bug, or a systemd bug. Or it might just be a good old-fashioned race condition? At the highest level these are activating in the right order
this chain is there:
And I can confirm that after a
followed by re-pulling the image and trying again fruitlessly in a loop which terminates only when you run But clearly there is something going (enablingly) wrong in the application of the profile from cold start of an existing hassio_supervisor container. So there may be value (once the profile is not breaking the udev function) in adding to
and/or indeed fully move the sort of |
Having the same issue
Removing AppArmor from the supervisor compose file eliminates this error, and HA updates work as expected too. Thus I suppose the issue is in the Supervisor's AppArmor profile, as originally reported.
specifically, as I sort of said, believe the issue with the policy is, for some reason now-and-not-before, the interpretation of now applying also to I don't believe just adding an I think giving it a cut out will require instead saying like...
Of course ... the logical question there is ... does it actually need non-raw access to most of those domains? (probs no) so probs the sweeping [allow] maybe want like(?)
(maybe bridge? unix?) |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
Thanks for this in-depth analysis! I was able to reproduce the original symptom (Supervisor getting unhealthy) on Supervised on Debian 12 Bookworm.

I then wondered whether I could reproduce this with HAOS 12.2.rc1, specifically because it comes with a new Buildroot which pushes it closer to what Debian Bookworm is using (specifically, Docker 25). However, I was not able to reproduce the problem on HAOS 12.2.rc1, which really made me question what is different between Debian 12 and HAOS 12.2.rc1 🤔

Just like in Supervised, the AppArmor profile only gets applied on initial Supervisor container start on HAOS 12.2.rc1 as well. So this part seems to (mis)behave exactly the same way.

HAOS uses AppArmor 3.1.2 already in HAOS 11, no update has taken place there. Debian 12 uses 3.0.8. I looked through the changes between the two versions, but nothing caught my eye. So far I am a bit clueless 😢

I feel it should always have not worked (as in, In any case, I think explicitly allowing protocols as you suggested makes sense. We don't need Bluetooth in Supervisor, so I think I'd go with:
|
I am still lost as to why Debian behaves differently than HAOS here 😢 A few things to note: Docker exec seems to not apply the AppArmor profile:
Hence my udevadm command always worked, even with the uncorrected profile. The AppArmor rules processing seems not very well defined to me, especially for networking. The syntax is:
The two rules so far:
Again, technically allowed all networking, but then disallowed all In any case, raw seems to be disallowed by default, so allowing only netlink explicitly seems to be the way to go. It seems that
I'll create a PR to update the profile, for dev and beta channel first. |
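A toy model of the precedence being discussed, in case it helps anyone reading along. It assumes (the rule snippets are not reproduced in this thread) a blanket allow of all networking plus a deny on raw sockets, and it relies on the documented AppArmor behaviour that a matching deny rule wins over any allow rule. `NetRule` and `allowed` are made-up names for illustration, and the "explicit" rule list is only one possible shape of a fix, not necessarily the one that ends up in the profile.

```python
from typing import NamedTuple, Optional


class NetRule(NamedTuple):
    """Simplified network rule; None acts as a wildcard (hypothetical model)."""

    deny: bool
    domain: Optional[str] = None
    sock_type: Optional[str] = None

    def matches(self, domain: str, sock_type: str) -> bool:
        return (self.domain in (None, domain)
                and self.sock_type in (None, sock_type))


def allowed(rules, domain: str, sock_type: str) -> bool:
    """Deny wins over allow; with no matching allow the access is refused."""
    matching = [r for r in rules if r.matches(domain, sock_type)]
    if any(r.deny for r in matching):
        return False
    return any(not r.deny for r in matching)


# Assumed shape of the old rules: allow everything, then deny raw.
blanket = [NetRule(deny=False), NetRule(deny=True, sock_type="raw")]

# Illustrative explicit allow list (not the real profile).
explicit = [
    NetRule(deny=False, domain="inet"),
    NetRule(deny=False, domain="inet6"),
    NetRule(deny=False, domain="netlink", sock_type="raw"),
]

netlink_raw_blanket = allowed(blanket, "netlink", "raw")
netlink_raw_explicit = allowed(explicit, "netlink", "raw")
```

In this model the blanket pair refuses a raw netlink socket even though "all networking" is allowed, which is the shape of failure libudev would run into, while the explicit list lets it through and leaves unlisted domains such as bluetooth refused.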
just to confirm I was bitten by the above deny rule. I changed to what @agners suggested and all is healthy. |
TL;DR basically like the title says: I have discovered:
hassio-supervisor apparmor policy from being actually applied/enforced on any process in an existing-on-boot hassio_supervisor container (with e.g. ps auZ showing them "unrestricted", and apparmor_status listing none of them)
Describe the issue you are experiencing
Until pretty recently after many hours of investigating and writing I did not have a specific proposed pathology (had only the "disease"), but I have found it (the worrying "cure") and it raises potentially greater concerns, including both how it interacts with the udev issue and that it occurs at all.
the "disease"
As captured in the log-snippets below (and at root in #4381 and potentially related to continuing issues with #4827 after release of docker 25.0.1), the eventual problem is that (reliably after container recreation, so e.g. supervisor update, as well as potentially after some supervisor restarts in my testing, and also apparently just in letting HA run sometimes?) this code goes down the unhappy path:
supervisor/supervisor/hardware/monitor.py
Lines 29 to 43 in d24543e
producing a false-positive (or at least transiently-valid) claim of running an unsupported/unhealthy configuration, failing by ~short-circuit under the "privileged" check, owing to what I would assert is an erroneous re-usage of that error code in the network code, as
remains the case at every point.
Against an already running core, this also blocks add-on (and core) upgrades even as the supervisor may be de facto functioning adequately (for most purposes, presumably it would fail at any responsibilities specifically related to hardware (change) monitoring)
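To make the short-circuit concrete, here is a minimal sketch of the kind of code path I mean. It is not the Supervisor's actual code: `UnhealthyReason`, `start_udev_monitor` and the `open_monitor` callable are hypothetical names, and only the log message and the "privileged" reason string are taken from the logs in this issue.

```python
import enum
import logging

_LOGGER = logging.getLogger("supervisor.hardware.monitor")


class UnhealthyReason(str, enum.Enum):
    """Hypothetical stand-in for the reasons the Supervisor can report."""

    PRIVILEGED = "privileged"


def start_udev_monitor(open_monitor, unhealthy: set) -> bool:
    """Try to open a udev monitor; flag the system unhealthy on failure.

    open_monitor is any callable that raises OSError when the netlink
    socket cannot be created (for example EACCES under AppArmor).
    """
    try:
        open_monitor()
    except OSError:
        # In this sketch every OSError is reported the same way, so an
        # AppArmor denial is indistinguishable from a container that
        # really was started without privileges.
        unhealthy.add(UnhealthyReason.PRIVILEGED)
        _LOGGER.critical("Not privileged to run udev monitor!")
        return False
    return True


def denied_by_policy():
    """Simulate the netlink socket being refused with EACCES."""
    raise PermissionError(13, "Permission denied")


reasons = set()
started = start_udev_monitor(denied_by_policy, reasons)
```

With the simulated denial, `started` is False and `reasons` contains the "privileged" reason even though nothing about the container's privileges has changed, which is the false positive described above.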
the worrying "cure"
With limited exceptions (when running Docker 25.0.0 and hitting other problems) I have generally found that rebooting the entire machine will solve my problem… for a while.
I did not know why, even as I started to dig into this udev issue and home in on the specifics within libudev, but as it turns out, after I reboot, the loaded hassio-supervisor profile is simply not applied to the container.
This is in contrast to the other hassio_* containers, which are instead/at least under the docker-default profile (by virtue of having no specified alternative profile).
What type of installation are you running?
Home Assistant Supervised
affecting supported systems?
I am of the impression that supervised-installer run on Debian Bookworm is considered supported, so ... then yes
But even if it is not, whereas there seems to be a reproducible critical failure upon the application of the stable apparmor profile at all, the issue may very well be waiting in the wings for HAOS
apparmor, systemd/systemd-udev;
I hope y'all take this seriously given the many hours [more than I meant to] I have put into it
but I will put my cards on the table, admitting I might be out of spec somewhere (I believe only in an older kernel, but let me know if you see something else):
uname -a and filtered dpkg-query -l
Which operating system are you running on?
Debian
Steps to reproduce the issue
1. Reboot host machine (with a hassio_supervisor container already created. If I stop and delete before reboot, the initial creation is "broken". An additional reboot achieves)
2. Observe that supervisor comes up reporting as in healthy state; optionally: proceed in blissful ignorance as to why/how
3. Alternatively ruin your sense of well-being by looking into the apparmor status of that-there container
3.5. Confirm that in fact there is such profile actually
3.75. Try various methods to simply restart the container [ha supervisor restart, systemctl restart hassio-supervisor, or even systemctl stop hassio-supervisor, wait a while, systemctl start hassio-supervisor] ... and oops actually everything was "fine" (I kinda swear at least once it broke on restart, but I might be misremembering, or hit one of the edges suggested below)
still "fine" is all the processes remain unconstrained from the apparmor profile intended for the,
Okay, but anything which actually recreates the container (as /usr/sbin/hassio-supervisor will do when the startup-marker (/run/supervisor/startup-marker) is left in place (so that is, if the supervisor crashes early?), or indeed as outlined above when the profile is missing initially, or on supervisor update; or if simply someone/something runs docker rm on the stopped container)
So yeah we can do
... for example
And now we will find our error state of interest
What has changed? I mean you know, but it's
i.e.
We can go deeper actually. You can see mostly the same thing by starting with debug and debug_wait and using the and attaching from vscode; but to really get there, let's more awkwardly do
systemctl stop hassio-supervisor
docker rm hassio_supervisor
sh <(sed 's/docker container create/docker run --rm -it --entrypoint \/bin\/bash/' /usr/sbin/hassio-supervisor)
This should (still; see footnote 1) drop you into an interactive shell that is set up like the supervisor, including (which running our docker inspect ... pipeline can verify) the apparmor profile.
From here it's instructive to do 3 things (restarting the shell between or not as you prefer):
actually run the supervisor (e.g. python3 -m supervisor) and see it fail same as when daemonized (including CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!)
run udevadm monitor, and discover that it works perfectly (prints the headers and will respond if I go e.g. plug and unplug USB devices)
Attempt my minimal repro case, expanded from the real line via the pyudev source, by starting a python repl, then
and like the supervisor, you'll get
Why do these things? Well, it reveals that the apparmor policy is in effect AND is killing udev connect for the supervisor (and the python interpreter in general, and most things).
How do I know that? Because udevadm still works. It is calling the same function (udev_monitor_new_from_netlink) of the same library (from alpine's eudev-libs 3.2.11-r8): https://github.com/eudev-project/eudev/blob/611b6fbae2cca9ceeacb820befb3a0e8caec88b8/src/udev/udevadm-monitor.c#L214
(Unless in fact pyudev's wheel is bringing along its own copy of e.g. systemd libudev; but I don't see evidence of that)
So it follows that the only reason that udevadm still works when the other two don't is because the apparmor policy gives it an unrestricted pass:
where python gets no such
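For anyone wanting to check from inside the python repl itself whether the profile is in effect, a minimal sketch follows. It reads the process's AppArmor label from /proc/<pid>/attr/current, which is the same information ps auZ displays. The helper names `parse_apparmor_label` and `current_label` are mine, and the example label strings are just the formats I would expect ("unconfined", or "profile (mode)").

```python
from pathlib import Path


def parse_apparmor_label(raw: str):
    """Split a /proc/<pid>/attr/current value into (profile, mode).

    A confined process reads like "hassio-supervisor (enforce)";
    an unconfined one reads "unconfined" and has no mode.
    """
    label = raw.strip("\x00 \t\r\n")
    if label.endswith(")") and " (" in label:
        profile, _, mode = label.rpartition(" (")
        return profile, mode[:-1]
    return label, None


def current_label(pid="self"):
    """Return the parsed label for pid, or None if it cannot be read."""
    try:
        raw = Path(f"/proc/{pid}/attr/current").read_text()
    except OSError:
        # No AppArmor on this kernel, no such pid, or no permission.
        return None
    return parse_apparmor_label(raw)


confined = parse_apparmor_label("hassio-supervisor (enforce)\n")
unconfined = parse_apparmor_label("unconfined\n")
```

Calling `current_label()` in the repl, and `current_label(1)` for the container's pid 1, should make it easy to compare a docker exec shell against the supervisor process.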
Additional information
So what is actually failing?
the libudev call
it's apparently udev_monitor_new_from_netlink(3) which is unexpectedly returning a NULL pointer and setting errno to 13 (EACCES, "Permission denied")
eudev-libs 3.2.11-r8. meaning it is a simple wrapper around its udev_monitor_new_from_netlink_fd function, whose source begins at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L166
That's pretty much, at core, 3 syscalls:
access on /run/udev/control - which, ls -l shows it srw------- root root and a stat also passed
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)
Indeed I can just now say with some confidence it is this syscall, because I belatedly discovered that pyudev does hook a few log priority functions and
That error appearing to be generated at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L203-L206 :
getsockname(udev_monitor->sock, &snl.sa, &addrlen);
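The socket() call in that list can be tried on its own from Python, without libudev or pyudev in the picture. This is a rough sketch rather than my original repro: `try_uevent_socket` is a made-up helper, and the value 15 for NETLINK_KOBJECT_UEVENT is taken from linux/netlink.h in case the socket module does not export it.

```python
import errno
import socket

# Value from <linux/netlink.h>; fall back to it if Python does not export it.
NETLINK_KOBJECT_UEVENT = getattr(socket, "NETLINK_KOBJECT_UEVENT", 15)


def try_uevent_socket():
    """Attempt the same socket() call libudev makes.

    Returns (True, None) on success, or (False, errno_name) on failure.
    """
    if not hasattr(socket, "AF_NETLINK"):
        # Netlink is Linux-only; report it the way the kernel would.
        return False, "EAFNOSUPPORT"
    sock_type = socket.SOCK_RAW | socket.SOCK_CLOEXEC | socket.SOCK_NONBLOCK
    try:
        sock = socket.socket(socket.AF_NETLINK, sock_type, NETLINK_KOBJECT_UEVENT)
    except OSError as err:
        return False, errno.errorcode.get(err.errno, str(err.errno))
    sock.close()
    return True, None


ok, error_name = try_uevent_socket()
```

If the theory in this issue holds, running this under the hassio-supervisor profile should give `(False, "EACCES")`, while the same call from an unconfined shell should succeed.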
So... what's the problem? I don't know. Most of the internet agrees that a net_admin capability ought to clear the way for interfacing with netlink
If I had to guess, I'd point at maybe the:
at https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L7-L8
hitting up against the SOCK_RAW flag on the call ... but again nothing here has changed recently? or has it?
Anyway ... this turned into 8 hours of investigation & writing today on top of the 5 of investigation and 3 of probs useless PR (#4834) on Tuesday; ADHD be like that.
the rest of the standard bug form stuff
Anything in the Supervisor logs that might be useful for us?
... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
... System is running in an unhealthy state and needs manual intervention!
#after restart
... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
... (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
System Health information
System Information
Home Assistant Community Store
Home Assistant Supervisor
Dashboards
Recorder
Supervisor diagnostics
config_entry-hassio-230a4ddd53de32307921bbc57f86054b.json.txt
Footnotes
1. When I first investigated this on Tuesday (with docker 25.0.0 installed) it was sufficient to do docker exec -it hassio_supervisor bash to repro as below, but something has changed and now that generates a shell sibling to the pid1 as unrestricted instead of also getting the apparmor profile. Bafflingly at one point Tuesday this behavior seemed to change after I ran systemd-analyze set-log-level debug)
2. Up to writing this section that didn't click and I was looking at systemd-udev's version; and it seems possible that the mismatch (with the offered udevd) could be in play as the issue. But the code between it (eudev) and the systemd versions of it and the nested device_monitor_new_full, do not differ greatly, though perhaps systemd's version's if (DEBUG_LOGGING) { gated stanzas may end up instructive]