[CentOS] ssh stalls/hangs instead of exiting

Wed Apr 14 06:22:52 UTC 2021
Simon Matter <simon.matter at invoca.ch>

>>> On 4/13/21 11:36 PM, Chris Schanzle via CentOS wrote:
>>>> On 4/13/21 5:00 PM, Frank Cox wrote:
>>>>> On Tue, 13 Apr 2021 22:29:26 +0200
>>>>> Simon Matter wrote:
>>>>>
>>>>>> You could try running strace on the hanging process to see what it's
>>>>>> doing.
>>>>> [frankcox at mutt temp]$ rsync -avv ../temp/ jeff:temp
>>>>> opening connection using: ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp  (7 args)
>>>>> sending incremental file list
>>>>> delta-transmission enabled
>>>>> abc is uptodate
>>>>> total: matches=0  hash_hits=0  false_alarms=0 data=0
>>>>>
>>>>> Leaving that sit there apparently doing nothing (but still not giving
>>>>> me my cursor back) I switched to another terminal window and did the
>>>>> following:
>>>>>
>>>>> [frankcox at mutt ~]$ ps -FA | grep rsync
>>>>> frankcox    5400    2435  0 60586  3160   5 14:52 pts/0    00:00:00 rsync -avv ../temp/ jeff:temp
>>>>> frankcox    5401    5400  0 67980  7440   1 14:52 pts/0    00:00:00 ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp
>>>>> frankcox    5526    5416  0 55476  1076   3 14:53 pts/1    00:00:00 grep --color=auto rsync
>>>>>
>>>>> [frankcox at mutt ~]$ strace -p 5401
>>>>> strace: Process 5401 attached
>>>>> select(11, [5 9 10], [], NULL, NULL
>>>>>
>>>>> Then it just sits there with no further action.  I get my cursor back
>>>>> when I hit ctrl-c.
>>>>>
>>>>> [frankcox at mutt ~]$ strace -p 5400
>>>>> strace: Process 5400 attached
>>>>> restart_syscall(<... resuming interrupted nanosleep ...>) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>>
>>>>> The wait4-etc line just keeps repeating endlessly until I hit ctrl-c.
>>>>>
>>>>> Unfortunately, I have no idea what any of the above actually means.
>>>>> Does it tell us anything interesting?
>>>>
>>>> Yay!  I am glad someone else on the planet is experiencing this.
>>>> I noticed this started happening to me after updating some CentOS
>>>> Linux 8 systems today.
>>>>
>>>> I discovered that if I set ForwardX11=no (either on the ssh command line
>>>> or in ~/.ssh/config) the hang does not happen.  But why does that matter?
>>>> There were no updates to openssh.
>>>>
>>>> It is not the systemd update doing something silly with session
>>>> management.  I painfully downgraded manually and rebooted to no
>>>> effect. 
>>>
>>>> As an aside, why can't we have nice things in life like 'dnf downgrade
>>>> systemd\*' actually work?  I did the below - might be dumb, but it
>>>> worked -- alternate suggestions to downgrade are appreciated -
>>>> searching the list and my google-fu was off the mark today.
>>>>
>>>>   cd [path-to-repo]/centos/8/BaseOS/x86_64/os/Packages
>>>>   dnf downgrade $(rpm -qa systemd\* | grep 239-41.el8_3.2 | sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/')
>>>>
>>>> Chris
>>>
>>>
>>> [adjusted the subject, hope that is OK.]
>>>
>>> Found it!  It's the dbus update to 1.12.8-12.  Downgrade to -11
>>> and ssh connections close normally.
>>>
>>> To clarify the problem, with the new dbus, simple ssh commands like:
>>>
>>> ssh somehost uptime
>>>
>>> will print the uptime, but do not return to the local shell prompt until
>>> you hit ctrl-c.  It works normally if you downgrade dbus or
>>>
>>> ssh -o forwardx11=no somehost uptime
>>>
>>> I'm sure a bug report exists somewhere, but that's something to dig for or
>>> create tomorrow.
>>>
>>> To downgrade, packages were scattered in different locations, so I copied
>>> them to one directory and did
>>>
>>> dnf downgrade ./*
>>>
>>> The packages I needed to downgrade on an x86_64 system were:
>>>
>>> dbus-1.12.8-11.el8.x86_64.rpm
>>> dbus-common-1.12.8-11.el8.noarch.rpm
>>> dbus-daemon-1.12.8-11.el8.x86_64.rpm
>>> dbus-devel-1.12.8-11.el8.x86_64.rpm
>>> dbus-libs-1.12.8-11.el8.x86_64.rpm
>>> dbus-tools-1.12.8-11.el8.x86_64.rpm
>>> dbus-x11-1.12.8-11.el8.x86_64.rpm
>>
>> Now that's really interesting, I was wondering why I don't see this on
>> OL8. The thing is that certain OL8 packages have an additional RPM
>> revision added like .0.1. Just checked dbus and its changelog shows:
>>
>> * Tue Feb 16 2021 Kevin Lyons <kevin.x.lyons at oracle.com> -1.12.8-12.0.1
>> - bus: raise fd limits before dropping privs [Orabug: 31175643]
>> - fix netlink poll: error 4 (Zhenzhong Duan)
>>
>> So OL is definitely not 100% bug-for-bug compatible like the other clones
>> :-)
>>
>> And it makes me a bit worried why O* fixed this on Feb 16 and the broken
>> dbus packages are now (in April) installed on CentOS servers?
>
> Sorry, maybe I'm wrong here and the OL8 addons are fixing other things?
> Could someone who experiences the issue test with the OL8 dbus packages?
>
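
For anyone willing to test, a rough way to check whether a given host still
shows the hang might look like the sketch below. The "hangs" helper and the
15 second limit are only placeholders of mine, not something taken from a bug
report. It assumes coreutils timeout is available and treats exit status 124
(the status timeout returns when the limit is reached) as "did not exit on its
own". somehost is the placeholder host name used earlier in the thread.

```shell
# Hypothetical helper: run a command with a time limit and report whether
# the limit was reached. Usage: hangs SECONDS COMMAND [ARGS...]
hangs() {
    limit=$1
    shift
    # stdin comes from /dev/null so the command never waits on the terminal;
    # output is discarded because only the exit status matters here.
    timeout "$limit" "$@" </dev/null >/dev/null 2>&1
    # coreutils timeout exits with 124 when the time limit was reached.
    [ $? -eq 124 ]
}

# Example (somehost is a placeholder):
#   if hangs 15 ssh somehost uptime; then
#       echo "ssh did not exit on its own"
#   fi
#   if ! hangs 15 ssh -o ForwardX11=no somehost uptime; then
#       echo "exits normally with X11 forwarding disabled"
#   fi
```

Comparing the two calls on the same host before and after swapping the dbus
packages should show whether the packages make a difference.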

Could it be BZ #1940067?

https://bugzilla.redhat.com/show_bug.cgi?id=1940067
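
To see quickly whether a box has the build reported in this thread, something
like the following might do. It is only a sketch: is_affected_dbus is a
made-up helper name, and the pattern 1.12.8-12.el8* is my assumption about how
the CentOS release string looks, based on the -11.el8 package names quoted
above. It does not match the OL8 1.12.8-12.0.1 build.

```shell
# Hypothetical helper: succeeds if the given version-release string looks
# like the dbus build reported as affected in this thread.
# ASSUMPTION: the affected release string starts with "1.12.8-12.el8".
is_affected_dbus() {
    case "$1" in
        1.12.8-12.el8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query the RPM database when rpm is actually present.
if command -v rpm >/dev/null 2>&1; then
    installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' dbus 2>/dev/null | head -n 1)
    if is_affected_dbus "$installed"; then
        echo "dbus $installed: matches the build reported as affected"
    fi
fi
```

If it prints nothing, either dbus is at a different release or rpm is not
available on that machine.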

Regards,
Simon