[CentOS] dbus/systemd failure on startup (CentOS 7.7)

Thu Jan 23 15:17:18 UTC 2020
Simon Matter <simon.matter at invoca.ch>

> Simon Matter via CentOS wrote:
>>
>>> We are seeing a problem that occurs ~5% of the time when rebooting
>>
>> I see such issues on a quite large multi user system but when this
>> happens, after forced restarts for kernel updates, I usually don't have
>> the time to analyze and play doctor on it. My "solution" now is to simply
>> reboot the server again in such a case, AKA the systemd way :-)
>>
>>> CentOS 7.7 where systemd gets a 'Connection timed out' to D-Bus just
>>> after the D-Bus service starts - from 'journalctl -x' :
>>>
>>> ...
>>> Jan 21 16:09:59 linux7-7.mpc.local systemd[1]: Started D-Bus System
>>> Message Bus.
>>> -- Subject: Unit dbus.service has finished start-up
>>> -- Defined-By: systemd
>>> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>>> --
>>> -- Unit dbus.service has finished starting up.
>>> --
>>> -- The start-up result is done.
>>> Jan 21 16:10:24 linux7-7.mpc.local systemd[1]: Failed to register match
>>> for Disconnected message: Connection timed out
>>> Jan 21 16:10:24 linux7-7.mpc.local systemd[1]: Failed to initialize
>>> D-Bus connection: Connection timed out
>>> ...
>>>
>>> This then has a knock-on effect that causes other services to fail - e.g.
>>>
>>> -- Unit gdm.service has begun starting up.
>>> Jan 21 16:10:39 linux7-7.mpc.local dbus[817]: [system] Activating
>>> systemd to hand-off: service name='org.freedesktop.login1'
>>> unit='dbus-org.freedesktop.login1.service'
>>> Jan 21 16:10:50 linux7-7.mpc.local dbus[817]: [system] Failed to
>>> activate service 'org.freedesktop.systemd1': timed out
>>> Jan 21 16:10:50 linux7-7.mpc.local systemd-logind[1221]: Failed to
>>> enable subscription: Failed to activate service
>>> 'org.freedesktop.systemd1': timed out
>>> Jan 21 16:10:50 linux7-7.mpc.local systemd-logind[1221]: Failed to fully
>>> start up daemon: Connection timed out
>>> Jan 21 16:10:50 linux7-7.mpc.local systemd[1]: systemd-logind.service:
>>> main process exited, code=exited, status=1/FAILURE
>>> Jan 21 16:10:50 linux7-7.mpc.local systemd[1]: Failed to start Login
>>> Service.
>>> -- Subject: Unit systemd-logind.service has failed
>>> -- Defined-By: systemd
>>> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>>> --
>>> -- Unit systemd-logind.service has failed.
>>> --
>>> -- The result is failed.
>>>
>>> Whatever the issue is, it appears that polkit might be involved - if we
>>> restart the polkit service, things appear to return to normal (e.g. gdm
>>> starts up etc)
>>>
>>> We can't find any similar reports of this happening elsewhere with
>>> CentOS 7.7 - but we were wondering if anyone else had come across a
>>> problem like this?
>>
>> I think the root of the problem is that there are missing definitions in
>> some of the systemd scripts. They allow things to work in 95% or greater
>> of the cases but this happens by chance, not because of perfect process
>> handling and system control. Small delays somewhere or uncommon system
>> environments then lead to intermittent failures which are difficult to
>> diagnose - at least for me.
>>
>> The good news is that you can just fiddle with the systemd scripts the
>> same way we fiddled with init scripts in the past. That way you can use
>> trial and error until you find a solution. It doesn't sound like being in
>> full control of things, but it's better than not finding a solution at all.
>
> Yeah, we found that by introducing a small delay before the ExecStart in
> the dbus.service unit - even a delay of just 0.01 seconds (via
> 'ExecStartPre=/usr/bin/sleep 0.01') _seems_ to work around the issue ...

Nice that you found at least a workaround. I think I remember that dbus is
quite special here because systemd starts it but also depends on it. At
least I remember cases where dbus went crazy for whatever reason: the
result was that systemd became completely unresponsive and unmanageable
and the whole system went down the drain, slowly but steadily. Ever tried to
shut down a box when systemd doesn't listen to you anymore? The perfect
Windows experience on Linux ;-)
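
For anyone who wants to try the same workaround without editing the
packaged unit file, a minimal sketch of a drop-in follows. The directive is
the one quoted above; the drop-in file name is my own choice and not
something from the original report, and I have not verified this on an
affected system.

```ini
# /etc/systemd/system/dbus.service.d/10-start-delay.conf
# (file name is hypothetical; any *.conf name in this directory should work)
[Service]
# Short pause before dbus-daemon starts, as described in the report above.
ExecStartPre=/usr/bin/sleep 0.01
```

After creating the file, 'systemctl daemon-reload' makes systemd pick it up;
the delay then takes effect the next time dbus.service starts, i.e. normally
on the next boot. A drop-in also survives package updates, which a direct
edit of the shipped unit file would not.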

> However, we would still like to know what the issue is and get a 'real'
> fix - I guess we could try creating a bug report with Red Hat ...

By bug report, do you mean a BZ or a support request as a paying RHEL customer?

Unfortunately, I'm not too happy anymore with how BZs are handled these
days. Am I alone in feeling this way?

Regards,
Simon