[CentOS] ssh stalls/hangs instead of exiting

Thu Apr 15 10:40:24 UTC 2021
Simon Matter <simon.matter at invoca.ch>

> On 4/14/21 2:22 AM, Simon Matter wrote:
>>>>> On 4/13/21 11:36 PM, Chris Schanzle via CentOS wrote:
>>>>>> On 4/13/21 5:00 PM, Frank Cox wrote:
>>>>>>> On Tue, 13 Apr 2021 22:29:26 +0200
>>>>>>> Simon Matter wrote:
>>>>>>>
>>>>>>>> You could try running strace on the hanging process to see what
>>>>>>>> it's doing.
>>>>>>> [frankcox at mutt temp]$ rsync -avv ../temp/ jeff:temp
>>>>>>> opening connection using: ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp  (7 args)
>>>>>>> sending incremental file list
>>>>>>> delta-transmission enabled
>>>>>>> abc is uptodate
>>>>>>> total: matches=0  hash_hits=0  false_alarms=0 data=0
>>>>>>>
>>>>>>> Leaving that sit there apparently doing nothing (but still not giving
>>>>>>> me my cursor back) I switched to another terminal window and did the
>>>>>>> following:
>>>>>>>
>>>>>>> [frankcox at mutt ~]$ ps -FA | grep rsync
>>>>>>> frankcox    5400    2435  0 60586  3160   5 14:52 pts/0    00:00:00 rsync -avv ../temp/ jeff:temp
>>>>>>> frankcox    5401    5400  0 67980  7440   1 14:52 pts/0    00:00:00 ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp
>>>>>>> frankcox    5526    5416  0 55476  1076   3 14:53 pts/1    00:00:00 grep --color=auto rsync
>>>>>>>
>>>>>>> [frankcox at mutt ~]$ strace -p 5401
>>>>>>> strace: Process 5401 attached
>>>>>>> select(11, [5 9 10], [], NULL, NULL
>>>>>>>
>>>>>>> Then it just sits there with no further action.  I get my cursor back
>>>>>>> when I hit ctrl-c.
>>>>>>>
>>>>>>> [frankcox at mutt ~]$ strace -p 5400
>>>>>>> strace: Process 5400 attached
>>>>>>> restart_syscall(<... resuming interrupted nanosleep ...>) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>>>
>>>>>>> The wait4-etc line just keeps repeating endlessly until I hit ctrl-c.
>>>>>>>
>>>>>>> Unfortunately, I have no idea what any of the above actually means.
>>>>>>> Does it tell us anything interesting?
>>>>>> Yay!  I am glad someone else on the planet is experiencing this.
>>>>>> I noticed this started happening to me after updating some CentOS
>>>>>> Linux 8 systems today.
>>>>>>
>>>>>> I discovered if I set ForwardX11=no (either on ssh command line or in
>>>>>> ~/.ssh/config) the hang does not happen.  But why does that matter?
>>>>>> No updates to openssh.
>>>>>>
>>>>>> It is not the systemd update doing something silly with session
>>>>>> management.  I painfully downgraded manually and rebooted to no effect.
>>>>>>
>>>>>> As an aside, why can't we have nice things in life like 'dnf downgrade
>>>>>> systemd\*' actually work?  I did the below - might be dumb, but it
>>>>>> worked -- alternate suggestions to downgrade are appreciated -
>>>>>> searching the list and my google-fu was off the mark today.
>>>>>>
>>>>>>   cd [path-to-repo]/centos/8/BaseOS/x86_64/os/Packages
>>>>>>   dnf downgrade $(rpm -qa systemd\* | grep 239-41.el8_3.2 | sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/')
>>>>>> Chris
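
For anyone reading the sed pipeline above for the first time, here is a minimal sketch of what it does to a single package name. The sample name is only an assumed example of what `rpm -qa systemd\*` might print on an affected system, and `to_downgrade_path` is a made-up helper name; the three sed expressions are copied from the command above.

```shell
# Hypothetical helper: turn an installed package name (as printed by
# "rpm -qa") into the relative file name of the previous build.
to_downgrade_path() {
    printf '%s\n' "$1" |
        sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/'
    #          ^ el8_3.2 -> el8_3.1   ^ prefix "./"   ^ suffix ".rpm"
}

# Assumed sample input, not taken from a real system.
to_downgrade_path 'systemd-239-41.el8_3.2.x86_64'
```

The first expression rewrites only the first "3.2" on each line, which is why the grep for 239-41.el8_3.2 comes first: it keeps out names where "3.2" would match somewhere unintended.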
>>>>>
>>>>> [adjusted the subject, hope that is OK.]
>>>>>
>>>>> Found it!  It's the dbus update to 1.12.8-12.  Downgrade to -11
>>>>> and ssh connections close normally.
>>>>>
>>>>> To clarify the problem, with the new dbus, simple ssh's like:
>>>>>
>>>>> ssh somehost uptime
>>>>>
>>>>> will print the uptime, but do not return to the local shell prompt until
>>>>> you hit ctrl-c.  It works normally if you downgrade dbus or
>>>>>
>>>>> ssh -o forwardx11=no somehost uptime
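
The same workaround can be made persistent per host in ~/.ssh/config, as mentioned earlier in the thread. A small sketch, assuming the placeholder host name "somehost" from the example above; `x11_off_stanza` is a made-up helper, and the stanza is written to a scratch file for review rather than appended to the real config.

```shell
# Hypothetical helper: print an ssh_config stanza that disables
# X11 forwarding for one host.
x11_off_stanza() {
    printf 'Host %s\n    ForwardX11 no\n' "$1"
}

# Write the stanza to a scratch file first; merge it into
# ~/.ssh/config by hand once it looks right.
scratch=$(mktemp)
x11_off_stanza somehost > "$scratch"
cat "$scratch"
```

This only sidesteps the hang for hosts listed that way; sessions that really need X11 forwarding still need one of the other fixes.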
>>>>>
>>>>> I'm sure a bug report exists somewhere, but that's something to dig for
>>>>> or create tomorrow.
>>>>>
>>>>> To downgrade, packages were scattered in different locations, so I copied
>>>>> them to one directory and did
>>>>>
>>>>> dnf downgrade ./*
>>>>>
>>>>> The packages I needed to downgrade on an x86_64 system were:
>>>>>
>>>>> dbus-1.12.8-11.el8.x86_64.rpm
>>>>> dbus-common-1.12.8-11.el8.noarch.rpm
>>>>> dbus-daemon-1.12.8-11.el8.x86_64.rpm
>>>>> dbus-devel-1.12.8-11.el8.x86_64.rpm
>>>>> dbus-libs-1.12.8-11.el8.x86_64.rpm
>>>>> dbus-tools-1.12.8-11.el8.x86_64.rpm
>>>>> dbus-x11-1.12.8-11.el8.x86_64.rpm
>>>> Now that's really interesting, I was wondering why I don't see this on
>>>> OL8. The thing is that certain OL8 packages have an additional RPM
>>>> revision added like .0.1. Just checked dbus and its changelog shows:
>>>>
>>>> * Tue Feb 16 2021 Kevin Lyons <kevin.x.lyons at oracle.com> -1.12.8-12.0.1
>>>> - bus: raise fd limits before dropping privs [Orabug: 31175643]
>>>> - fix netlink poll: error 4 (Zhenzhong Duan)
>>>>
>>>> So OL is definitely not 100% bug-for-bug compatible like the other clones
>>>> :-)
>>>>
>>>> And it makes me a bit worried why O* fixed this on Feb 16 and the broken
>>>> dbus packages are now (in April) installed on CentOS servers?
>>> Sorry, maybe I'm wrong here and the OL8 addons are fixing other things?
>>> Could someone who experiences the issue test with the OL8 dbus
>>> packages?
>>>
>> Could it be BZ #1940067?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1940067
>
> Bullseye, Simon!  Many thanks.
>
> A reasonable one-liner fix / workaround is below.  Also works when
> requesting a terminal with 'ssh -Xt'.  Adds a "tty -s || return" line
> in the right spot to check if a tty exists and if not, bail out w/o
> starting dbus-launch.  Change "-i" to "-i.bak" to make a backup.
>
>  sed -i '/SHLVL/atty -s || return' /etc/profile.d/ssh-x-forwarding.sh
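
Before touching the real file, the one-liner can be tried on a scratch copy. A minimal sketch, assuming GNU sed; the stand-in script below is invented for the demonstration and is NOT the actual contents of /etc/profile.d/ssh-x-forwarding.sh.

```shell
# Hypothetical stand-in for the profile script; the real file differs.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# stand-in profile snippet
if [ "$SHLVL" -eq 1 ]; then
    echo "dbus-launch would be started here"
fi
EOF

# Same sed expression as above, pointed at the scratch copy.
# GNU sed: "a" appends the text after every line matching /SHLVL/;
# -i.bak edits in place and keeps the original as "$demo.bak".
sed -i.bak '/SHLVL/atty -s || return' "$demo"

# Show the inserted line (diff exits non-zero when files differ).
diff "$demo.bak" "$demo" || true
```

"tty -s" exits non-zero when stdin is not a terminal, so the appended line returns early from the sourced script in that case. The "return" only works because files under /etc/profile.d are sourced, not executed.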

Hi Chris,

IMHO we see a fundamental problem here if desktop toys like D-Bus can have
such an impact on basic tools like rsync. It's even worse if D-Bus goes
crazy and makes systemd become unmanageable. Not fun on big servers :-)

Regards,
Simon