supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

donaldguy · 2024-01-26T04:54:15Z

TL;DR basically like the title says: I have discovered:

that the supervisor has only ever(? recently-ever?) been working/starting for me because some bug on (re)boot is preventing the hassio-supervisor apparmor policy from being actually applied/enforced on any process in an existing-on-boot hassio_supervisor container (with e.g. ps auZ showing them "unrestricted", and `apparmor_status listing none of them)
As soon as something causes the policy to actually be applied (most reproducibly a re-creation of the container, but possibly others) - the system cannot be healthy due to udev permission failures; as in Supervisor keeps going “Unhealthy” #4381

Describe the issue you are experiencing

Until pretty recently after many hours of investigating and writing I did not have a specific proposed pathology (had only the "disease"), but I have found it (the worrying "cure") and it raises potentially greater concerns, including both how it interacts with the udev issue and that it occurs at all.

the "disease"

As captured in the log-snippets below (and at root in #4381 and potentially related to continuing issues with #4827 after release of docker 25.0.1), the eventual problem is that (reliably after container recreation, so e.g. supervisor update, as well as potentially after some supervisor restarts in my testing, and also apparently just in letting HA run sometimes?) this code goes down the unhappy path:

supervisor/supervisor/hardware/monitor.py

Lines 29 to 43 in d24543e

    
           async def load(self) -> None: 
        
               """Start hardware monitor.""" 
        
               try: 
        
                   self.monitor = pyudev.Monitor.from_netlink(self.context, "kernel") 
        
                   self.monitor.set_receive_buffer_size(32 * 1024 * 1024) 
        
                   self.observer = pyudev.MonitorObserver( 
        
                       self.monitor, 
        
                       callback=lambda x: self.sys_loop.call_soon_threadsafe( 
        
                           self._udev_events, x 
        
                       ), 
        
                   ) 
        
               except OSError: 
        
                   self.sys_resolution.unhealthy = UnhealthyReason.PRIVILEGED 
        
                   _LOGGER.critical("Not privileged to run udev monitor!")

producing a false-positive (or at least transiently-valid) claim of running an unsupported/unhealthy configuration, failing by ~short-circuit under the "privileged" check, owing to what I would assert is an erroneous re-usage of that error code in the network code, as

$ docker inspect hassio_supervisor -f '{{.HostConfig.Privileged}}'
true

remains the case at every point.

Against an already running core, this also blocks add-on (and core) upgrades even as the supervisor may be de facto functioning adequately (for most purposes, presumably it would fail at any responsibilities specifically related to hardware (change) monitoring)

the worrying "cure"

With limited exceptions (when running Docker 25.0.0 and hitting other problems) I have generally found that rebooting the entire machine will solve my problem… for a while.

I did not know why, even as I started to dig into this udev issue and hone in on the specifics within libudev

but as it turns out, after I reboot, the loaded hassio-supervisor profile is simply not applied to the container

$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

$ aa-status | grep hassio-supervisor
   hassio-supervisor
   hassio-supervisor///usr/bin/gdbus
   hassio-supervisor///usr/bin/git

# the shortness of which alone should hint the issue, but we can confirm

$ docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid
LABEL                               PID TTY      STAT   TIME COMMAND
LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
# with a more elaborate pipeline used below indicating that it does not in fact somehow come into effect by `python`

this being in contrast to the other hassio_* containers which are instead/at least under docker-default profile. (by virtue having no specified alternative profile)

What type of installation are you running?

Home Assistant Supervised

affecting supported systems?

I am of the impression that supervised-installer run on Debian Bookworm is considered supported, so ... then yes
But even if it is not, whereas there seems to be a reproducible critical failure upon the application of the stable apparmor profile at all, the issue may very well be waiting in the wings for HAOS
- pending some upgrade to apparmor, systemd/sytemd-udev ;
- but it is also possible it is a kernel bug that has already been fixed upstream and taken by y'all

I hope y'all take this seriously given the many hours [more than I meant to] I have put into it

but I will put my cards on the table, admitting I might be of spec somewhere (I believe only in an older kernel, but let me know if you see something else):

uname -a and filtered dpkg-query -l

$ uname -a
Linux odroidn2 6.1.63-current-meson64 #1 SMP PREEMPT Mon Nov 20 10:52:19 UTC 2023 aarch64 GNU/Linux

$ sudo dpkg-query -l | grep -E 'apparmor|cifs-utils|curl|dbus|jq|libglib2.0-bin|lsb-release|network-manager|nfs-common|systemd|udisks2|wget|^linux|docker|homeassistant'
hii  apparmor                                              3.0.8-3                        arm64        user-space parser utility for AppArmor
ii  cifs-utils                                            2:7.0-2                        arm64        Common Internet File System utilities
ii  curl                                                  7.88.1-10+deb12u5              arm64        command line tool for transferring data with URL syntax
ii  dbus                                                  1.14.10-1~deb12u1              arm64        simple interprocess messaging system (system message bus)
ii  dbus-bin                                              1.14.10-1~deb12u1              arm64        simple interprocess messaging system (command line utilities)
ii  dbus-broker                                           33-1                           arm64        Linux D-Bus Message Broker
ii  dbus-daemon                                           1.14.10-1~deb12u1              arm64        simple interprocess messaging system (reference message bus)
ii  dbus-session-bus-common                               1.14.10-1~deb12u1              all          simple interprocess messaging system (session bus configuration)
ii  dbus-system-bus-common                                1.14.10-1~deb12u1              all          simple interprocess messaging system (system bus configuration)
ii  dbus-user-session                                     1.14.10-1~deb12u1              arm64        simple interprocess messaging system (systemd --user integration)
ii  docker-ce                                             5:25.0.1-1~debian.12~bookworm  arm64        Docker: the open-source application container engine
ii  docker-ce-cli                                         5:25.0.1-1~debian.12~bookworm  arm64        Docker CLI: the open-source application container engine
ii  homeassistant-supervised                              1.5.0                          all          Home Assistant Supervised
ii  jq                                                    1.6-2.1                        arm64        lightweight and flexible command-line JSON processor
ii  libapparmor1:arm64                                    3.0.8-3                        arm64        changehat AppArmor library
ii  libcurl3-gnutls:arm64                                 7.88.1-10+deb12u5              arm64        easy-to-use client-side URL transfer library (GnuTLS flavour)
ii  libcurl4:arm64                                        7.88.1-10+deb12u5              arm64        easy-to-use client-side URL transfer library (OpenSSL flavour)
ii  libdbus-1-3:arm64                                     1.14.10-1~deb12u1              arm64        simple interprocess messaging system (library)
ii  libglib2.0-bin                                        2.74.6-2                       arm64        Programs for the GLib library
ii  libjq1:arm64                                          1.6-2.1                        arm64        lightweight and flexible command-line JSON processor - shared library
ii  libpam-systemd:arm64                                  252.19-1~deb12u1               arm64        system and service manager - PAM module
ii  libsystemd-shared:arm64                               252.19-1~deb12u1               arm64        systemd shared private library
ii  libsystemd0:arm64                                     252.19-1~deb12u1               arm64        systemd utility library
ii  libudisks2-0:arm64                                    2.9.4-4                        arm64        GObject based library to access udisks2
ii  lsb-release                                           12.0-1                         all          Linux Standard Base version reporting utility (minimal implementation)
ii  network-manager                                       1.42.4-1                       arm64        network management framework (daemon and userspace tools)
ii  network-manager-openvpn                               1.10.2-2                       arm64        network management framework (OpenVPN plugin core)
ii  networkd-dispatcher                                   2.2.3-1                        all          Dispatcher service for systemd-networkd connection status changes
ii  nfs-common                                            1:2.6.2-4                      arm64        NFS support files common to client and server
ii  python3-dbus                                          1.3.2-4+b1                     arm64        simple interprocess messaging system (Python 3 interface)
ii  systemd                                               252.19-1~deb12u1               arm64        system and service manager
ii  systemd-journal-remote                                252.19-1~deb12u1               arm64        tools for sending and receiving remote journal logs
ii  systemd-resolved                                      252.19-1~deb12u1               arm64        systemd DNS resolver
ii  systemd-sysv                                          252.19-1~deb12u1               arm64        system and service manager - SysV compatibility symlinks
ii  udisks2                                               2.9.4-4                        arm64        D-Bus service to access and manipulate storage devices
ii  wget                                                  1.21.3-1+b1                    arm64        retrieves files from the web

Which operating system are you running on?

Debian

Steps to reproduce the issue

Reboot host machine (with a hassio_supervisor container already created. If I stop and delete before reboot, the initial creation is "broken". An additional reboot achieves)

Observer that supervisor comes up reporting as in healthy state;

odroidn2:~:# uptime
 18:44:13 up 1 min,  1 user,  load average: 1.65, 0.75, 0.28
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -2
24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1

optionally: proceed in blissful ignorance as to why/how

Alternatively ruin your sense of well-being by looking into the apparmor status of that-there container

odroidn2:~:# docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
    xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
    xargs ps Z

LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2503 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2641 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2680 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
unconfined                         2726 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
unconfined                         2727 ?        Ssl    0:06 python3 -m supervisor
unconfined                         7424 ?        S      0:00 sleep 300

3.5. Confirm that in fact there is such profile actually

odroidn2:~:# aa-status | grep hassio
   hassio-supervisor
   hassio-supervisor///usr/bin/gdbus
   hassio-supervisor///usr/bin/git
odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep apparmor
24-01-25 19:31:42 INFO (MainThread) [supervisor.host.apparmor] Loading AppArmor Profiles: {'hassio-supervisor'}
odroidn2:~:# head /usr/share/hassio/apparmor/hassio-supervisor
#include <tunables/global>

profile hassio-supervisor flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  #include <abstractions/python>

  network,
  deny network raw,

  signal (send) set=(kill,term,int,hup,cont),

3.75. Try various methods to simply restart the container [ha supervisor restart, systemctl restart hassio-supervisor or even systemctl stop hassio-supervisor wait a while systemctl start hassio-supervisor ... and oops actually everything was "fine";

I kinda swear at least once it broke on restart, but I might be misremembering, or hit one of the edges suggested below)

odroidn2:~:# ha supervisor restart
Command completed successfully.
odroidn2:~:# docker ps | grep  hassio_supervisor
974b16d01e95   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  14 minutes ago   Up 13 minutes                                                                                                                                  hassio_supervisor
odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1

still "fine" is all the processes remain unconstrained from the apparmor profile intended for the,

Okay, but anything which actually recreates the container (as /usr/sbin/hassio-supervisor will do when the startup-marker (/run/supervisor/startup-marker) is left in place (so that is the supervisor crashes early?), or indeed as outlined above when the profile is missing initially, or on supervisor update; or if simply someone/something runs docker rm on the stopped container)

So yeah we can do

systemctl stop hassio-supervisor && touch /run/supervisor/startup-marker && systemctl start hassio-superviosr

... for example

And now we will find our error state of interest

odroidn2:~:# docker ps | grep hassio_supervisor
325c2a447241   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  About a minute ago   Up About a minute                                                                                                                                  hassio_supervisor
odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: false
supported: true
version: 2023.12.1
version_latest: 2023.12.1

odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged

odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!

What has changed? I mean you know, but its

odroidn2:~:# aa-status | grep hassio-supervisor
   hassio-supervisor
   hassio-supervisor///usr/bin/gdbus
   hassio-supervisor///usr/bin/git
   /package/admin/s6-2.12.0.2/command/s6-svscan (9035) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-supervise (9077) hassio-supervisor
   /package/admin/s6-linux-init-1.1.2.0/command/s6-linux-init-shutdownd (9079) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-supervise (9086) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-supervise (9087) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-ipcserverd (9095) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-supervise (9135) hassio-supervisor
   /package/admin/s6-2.12.0.2/command/s6-supervise (9137) hassio-supervisor
   /usr/local/bin/python3.11 (9138) hassio-supervisor
   /usr/bin/bash (9142) hassio-supervisor
   /bin/busybox (9773) hassio-supervisor

i.e.

odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
  xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
  xargs ps Z
LABEL                               PID TTY      STAT   TIME COMMAND
hassio-supervisor (enforce)        9035 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
hassio-supervisor (enforce)        9077 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
hassio-supervisor (enforce)        9079 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
hassio-supervisor (enforce)        9086 ?        S      0:00 s6-supervise s6rc-fdholder
hassio-supervisor (enforce)        9087 ?        S      0:00 s6-supervise s6rc-oneshot-runner
hassio-supervisor (enforce)        9095 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
hassio-supervisor (enforce)        9135 ?        S      0:00 s6-supervise supervisor
hassio-supervisor (enforce)        9137 ?        S      0:00 s6-supervise watchdog
hassio-supervisor (enforce)        9138 ?        Ssl    0:03 python3 -m supervisor
hassio-supervisor (enforce)        9142 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)        9773 ?        S      0:00 sleep 30

We can go deeper actually. You can see most the same thing by starting with debug and debug_wait and using the and attaching from vscode; but to really get there - lets more awkwardly do
```
systemctl stop hassio-supervisor
docker rm hassio_supervisor
sh <(sed 's/docker container create/docker run --rm -it --entrypoint \/bin\/bash/' /usr/sbin/hassio-supervisor)
```
This should (still¹) give you into an interactive shell that is set up like the supervisor, including (which running our docker inspect ... pipeline can verify) the apparmor profile.

From here its instructive to do 3 things (restarting the shell between or not as you prefer):
1. actually run the supervisor (e.g. python3 -m supervisor ) and see it fail same as when deamonized (including CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!)
2. run udevadm monitor, and discover that it works perfectly (prints the headers and will respond if I go e.g. plug and unplug USB devices)
3. Attempt my minimal repro case, expanded from the real line via the pyudev source
  by starting a python repl, then
```
import pyudev
import os

from ctypes import get_errno

c = pyudev.Context()
m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')

if bool(m):
  print("successfully got pointer to a `struct udev_monitor`")
else:
  print("udev_monitor_new_from_netlink(3) returned NULL pointer")
  e = get_errno()
  print(f"Errno set was: {os.strerror(e)} ({e})")
```
  and like the supervisor, you'll get
  
  udev_monitor_new_from_netlink(3) returned NULL pointer
  Errno set was: Permission denied (13)
Why do these things? Well, it reveals that the apparmor policy is in effect AND is killing udev connect for the supervisor (and the python interpreter in general, and most things).

How do I know that? Because udevadm still works. It is calling the same function (udev_monitor_new_from_netlink) of the same library (from alpine's eudev-libs 3.2.11-r8): https://github.com/eudev-project/eudev/blob/611b6fbae2cca9ceeacb820befb3a0e8caec88b8/src/udev/udevadm-monitor.c#L214

(Unless in fact pyudev's wheel is bringing along its own copy of e.g. systemd libudev; but I don't see evidence of that)

So it follows that the only reason that udevadm still works when the other two don't is because the apparmor policy gives it an unrestricted pass:
```
/bin/udevadm Ux,
```
where python gets no such

Additional information

So what is actually failing?

the libudev call

it's apparently udev_monitor_new_from_netlink(3) which is unexpectedly returning a NULL pointer and setting errno to 13 ("Access denied")

Specifically I guess it is the version provided² by alpine's eudev-libs 3.2.11-r8. meaning it is a simple wrapper around its udev_monitor_new_from_netlink_fd function, whose source begins at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L166

That's pretty much, at core, 3 syscalls:

access on /run/udev/control - which, ls -l shows it srw------- root root and a stat also passed

socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)

Indeed I can just now say with some confidence it is this syscall, because I belatedly discovered that pyudev does hook a few log priority functions and

>>> import syslog
>>> c.log_priority = syslog.LOG_DEBUG
>>> m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')
error getting socket: Permission denied

That error appearing to be generated at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L203-L206 :

        if (fd < 0) {
                udev_monitor->sock = socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT);
                if (udev_monitor->sock < 0) {
                        log_debug_errno(errno, "error getting socket: %m");

getsockname(udev_monitor->sock, &snl.sa, &addrlen);

So... what's the problem? I don't know. Most of the internet agrees that a net_admin capability ought to clear the way for interfacing with netlink

If I had to guess, I'd point at maybe the:

network,
deny network raw,

at https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L7-L8

hitting up against the SOCK_RAW flag on the call ... but again nothing here has changed recently? or has it?

Anyway ... this turned into 8 hours of investigation & writing today on top of the 5 of investigation and 3 of probs useless PR (#4834) on Tuesday; ADHD be like that.

the rest of the standard bug form stuff

Anything in the Supervisor logs that might be useful for us?

...
CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
...
System is running in an unhealthy state and needs manual intervention!

#after restart
...
CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
...
(MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged

System Health information

System Information

version	core-2024.1.5
installation_type	Home Assistant Supervised
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.11.6
os_name	Linux
os_version	6.1.63-current-meson64
arch	aarch64
timezone	America/Denver
config_dir	/config

Home Assistant Community Store

GitHub API	ok
GitHub Content	ok
GitHub Web	ok
GitHub API Calls Remaining	5000
Installed Version	1.34.0
Stage	running
Available Repositories	1380
Downloaded Repositories	19

Home Assistant Supervisor

host_os	Debian GNU/Linux 12 (bookworm)
update_channel	stable
supervisor_version	supervisor-2023.12.1
agent_version	1.6.0
docker_version	25.0.1
disk_total	915.8 GB
disk_used	10.8 GB
healthy	failed to load: Unhealthy
supported	true
supervisor_api	failed to load: unreachable
version_api	ok
installed_addons	Studio Code Server (5.15.0), ESPHome (2023.12.8), Mosquitto broker (6.4.0), Tailscale (0.14.0), Music Assistant BETA (2.0.0b84), Zigbee2MQTT (1.35.1-1), Govee to MQTT Bridge (2024.01.21-33e8cd7c)

Dashboards

dashboards	1
resources	12
views	5
mode	storage

Recorder

oldest_recorder_run	January 16, 2024 at 9:45 PM
current_recorder_run	January 25, 2024 at 7:53 PM
estimated_db_size	290.71 MiB
database_engine	sqlite
database_version	3.41.2

Supervisor diagnostics

config_entry-hassio-230a4ddd53de32307921bbc57f86054b.json.txt

When I first investigated this on Tuesday (with docker 25.0.0 installed) it was sufficient to do docker exec -it hassio_supervisor bash to repro as below, but something has changed and now that generates a shell sibling to the pid1 as unrestricted instead of also getting the apparmor profile. Bafflingly at one point Tuesday this behavior seemed to change after I ran systemd-analyze set-log-level debug ) ↩
Up to writing this section that didn't click and I was looking at systemd-udev's version ; and it seems possible that the mismatch (with the offered udevd) could be in play as the issue. But the code between it (eudev) and the systemd versions of it and the nested device_monitor_new_full, do not differ greatly, though perhaps systemd's version's if (DEBUG_LOGGING) { gated stanza's may end up instructive] ↩

The text was updated successfully, but these errors were encountered:

donaldguy · 2024-01-26T04:58:09Z

Extracting out this bit of ~speculation to its own comment:

why isn't the policy applied on boot?

I didn't know. It might be a kernel bug, or a docker bug, or a systemd bug. Or it might just be a good old fashioned race condition?

At the highest level these are activating in the right order

odroidn2:~:# systemctl status apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:39 MST; 19min ago
odroidn2:~:# systemctl status hassio-apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:40 MST; 19min ago
odroidn2:~:# systemctl status hassio-supervisor | grep active
     Active: active (running) since Thu 2024-01-25 18:42:44 MST; 19min ag

this chain is there:

Wanted and After for hassio-apparmor.service in hassio-supervisor.service

And I can confirm that after a systemctl stop hassio-supervisor && docker rm -f $(docker ps -aq) && aa-remove-unknown && systemctl start hassio-supervisor, you end up at a-different-than-this-issue supervisor restart loop involving such journal lines as

audit[8358]: AVC apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" hassio-supervisor" pid=15881 comm="runc:[2:INIT]"
audit: type=1400 audit(1706235970.127:70): apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" name="hassio-supervisor" pid=15881 comm="runc:[2:INIT]"
...
hassio-supervisor[15823]: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: unable to apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/apparmor/exec: no such file or directory: unknown
hassio-supervisor[15823]: Error: failed to start containers: hassio_supervisor

followed by re-pulling the image and trying again fruitlessly in a loop which terminates only when you run systemctl start hassio-apparmor (at which point you end up landing in the profile-applied-to-supervisor-so-it-doesn't work state, until system reboot)

But clearly there is something going (enabalingly) wrong in the application of profile from cold start of an existing hassio_supervisor container

So there may be value (once the profile is not breaking the udev function) in adding to hassio-apparmor.service (and indeed potentially hassos-apparmor.service something like a

ExecStartPre=/usr/bin/aa-enabled -q

and/or indeed fully move the sort of apparmor_parser call which is the heart of these services to be instead a ExecStartPre of the supervisor

voron · 2024-02-22T16:13:25Z

Having the same issue

24-02-22 15:57:00 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!

Removal of app-armor from supervisor compose file eliminates this error, HA updates work as expected too. Thus I suppose the issue is in app-armor supervisor's profile, as originally reported.

donaldguy · 2024-02-29T06:59:09Z

specifically, as I sort of said, believe the issue with the policy is,

for some reason now-and-not-before, the interpretation of deny network raw
https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L8

now applying also to netlink domain, meaning the

I don't believe just adding an allow network netlink raw will work ... (but maybe it will with the right ordering?)

I think giving it a cut out will require instead saying like...

deny network inet raw,
deny network ax25 raw,
deny network ipx raw,
deny network appletalk raw,
deny network netrom raw,
deny network bridge raw,
deny network atpmvc raw,
deny network x25 raw,
deny network inet6 raw,
deny network rose raw,
deny network netbui raw,
deny network security raw,
deny network key raw,
deny network packet raw,
deny network ash raw,
deny network econet raw,
deny network atmsvc raw,
deny network sna raw,
deny network irda raw,
deny network ppox raw,
deny network wanpipe raw,
deny network bluetooth raw,
deny network unix raw,
deny network rds raw,
deny network llc raw,
deny network can raw,
deny network tipc raw,
deny network iucy raw,
deny network rxrpc raw,
deny network isdn raw,
deny network phonet raw,
deny network ieee802154 raw,
deny network caif raw,
deny network alg raw,
deny network nfc raw,
deny network vsock raw,
deny network mpls raw,
deny network ib raw,

Of course ... the logical question there is ... does it actually need non-raw access to most of those domains? (probs no)

so probs the sweeping [allow] network [] [] [*] of line 7 probs should go

maybe want like(?)

allow network inet tcp,
allow network inet6 tcp,
allow network bluetooth,
allow network netlink,

(maybe bridge? unix?)

github-actions · 2024-03-30T07:04:05Z

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

agners · 2024-04-08T17:14:43Z

Thanks for this in depth analyzes!

I was able to reproduce the original symptom (Supervisor getting unhealthy) on Supervised on a Debian 12 Bookworm.

I then was actually wondering if I can reproduce this with HAOS 12.2.rc1, specifically because it comes with new Buildroot which pushes it closer to what Debian Bookworm is using (specifically, Docker 25). However, I was not able to reproduce the problem on HAOS 12.2.rc1.

Which really made me question, what is different between Debian 12 and HAOS 12.2.rc1 🤔

Just like in Supervised, the AppArmor profile only gets applied on initial Supervisor container start on HAOS 12.2.rc1 as well. So this part seems to (mis)behave exactly the same way.

HAOS uses AppArmor 3.1.2 already in HAOS 11, no update has taken place there. Debian 12 uses 3.0.8. I looked through the changes between the two versions, but nothing caught my eyes.

So far I am a bit clueless 😢

I feel it should always have not worked (as in, deny network raw should always lead to deny network netlink raw).

In any case, I think applying an explicit allow protocols as you suggested make sense. We don't need Bluetooth in Supervisor, so I think I'd go with:

network inet  stream,
network inet6 stream,
network inet  dgram,
network inet6 dgram,
network netlink raw,

agners · 2024-04-09T14:04:31Z

I am still lost as to why Debian behaves different than HAOS here 😢

A few things to note: Docker exec seems to not apply the AppArmor profile:

hassio-supervisor (enforce)     root       17131  0.2 13.8 192868 136592 ?       Ssl  15:32   0:00 python3 -m supervisor
hassio-supervisor (enforce)     root       17132  0.0  0.3   4160  3516 ?        Ss   15:32   0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)     root       17151  0.0  0.0   1612     4 ?        S    15:32   0:00 sleep 300
unconfined                      root       17325  0.0  0.4  10008  4488 pts/1    S+   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17326  0.0  0.0  10008   480 pts/6    Ss   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17327  0.0  2.8 1770644 28048 pts/6   Sl+  15:33   0:00 docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17342  0.0  0.2   2616  2288 pts/0    Ss   15:33   0:00 /bin/bash

Hence my udevadm command always worked, even with the uncorrected profile.

The AppArmor rules processing seems not very well defined to me, especially for networking. The syntax is:

NETWORK RULE = [ QUALIFIERS ] 'network' [ DOMAIN ] [ TYPE | PROTOCOL ]

The two rules so far:

network,
deny network raw,

Again, technically allowed all networking, but then disallowed all raw protocol for all domains, even for netlink.

In any case, raw seems to be disallowed by default, so allowing only netlink explicitly seems to be the way to go. It seems that unix domain is also allowed by default. I did not found where this defaults are specified. But it is probably better to enable it explicitly too, so I end up with:

network unix stream,
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
network netlink raw,

I'll create a PR to update the profile, for dev and beta channel first.

RichardPar · 2024-05-23T14:14:14Z

just to confirm I was bitten by the above deny rule. I changed to what @agners suggested and all is healthy.

donaldguy added the bug label Jan 26, 2024

github-actions bot added the stale label Mar 30, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 6, 2024

agners added no-stale Avoids issue going stale and removed stale labels Apr 8, 2024

agners reopened this Apr 8, 2024

bdraco mentioned this issue May 23, 2024

Ubuntu app armor update causes Bluetooth sensors to stop working home-assistant/core#117768

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

donaldguy commented Jan 26, 2024 •

edited

Loading

Anything in the Supervisor logs that might be useful for us?

System Health information

System Information

Supervisor diagnostics

donaldguy commented Jan 26, 2024 •

edited

Loading

voron commented Feb 22, 2024

donaldguy commented Feb 29, 2024 •

edited

Loading

github-actions bot commented Mar 30, 2024

agners commented Apr 8, 2024 •

edited

Loading

agners commented Apr 9, 2024

RichardPar commented May 23, 2024

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

Comments

donaldguy commented Jan 26, 2024 • edited Loading

Describe the issue you are experiencing

the "disease"

the worrying "cure"

What type of installation are you running?

affecting supported systems?

Which operating system are you running on?

Steps to reproduce the issue

Additional information

So what is actually failing?

Anything in the Supervisor logs that might be useful for us?

System Health information

System Information

Supervisor diagnostics

Footnotes

donaldguy commented Jan 26, 2024 • edited Loading

why isn't the policy applied on boot?

voron commented Feb 22, 2024

donaldguy commented Feb 29, 2024 • edited Loading

github-actions bot commented Mar 30, 2024

agners commented Apr 8, 2024 • edited Loading

agners commented Apr 9, 2024

RichardPar commented May 23, 2024

donaldguy commented Jan 26, 2024 •

edited

Loading

donaldguy commented Jan 26, 2024 •

edited

Loading

donaldguy commented Feb 29, 2024 •

edited

Loading

agners commented Apr 8, 2024 •

edited

Loading