[CentOS] ssh stalls/hangs instead of exiting

Wed Apr 14 06:16:37 UTC 2021
Simon Matter <simon.matter at invoca.ch>

>> On 4/13/21 11:36 PM, Chris Schanzle via CentOS wrote:
>>> On 4/13/21 5:00 PM, Frank Cox wrote:
>>>> On Tue, 13 Apr 2021 22:29:26 +0200
>>>> Simon Matter wrote:
>>>>
>>>>> You could try running strace on the hanging process to see what it's
>>>>> doing.
>>>> [frankcox at mutt temp]$ rsync -avv ../temp/ jeff:temp
>>>> opening connection using: ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp  (7 args)
>>>> sending incremental file list
>>>> delta-transmission enabled
>>>> abc is uptodate
>>>> total: matches=0  hash_hits=0  false_alarms=0 data=0
>>>>
>>>> Leaving that sit there apparently doing nothing (but still not giving
>>>> me my cursor back) I switched to another terminal window and did the
>>>> following:
>>>>
>>>> [frankcox at mutt ~]$ ps -FA | grep rsync
>>>> frankcox    5400    2435  0 60586  3160   5 14:52 pts/0    00:00:00 rsync -avv ../temp/ jeff:temp
>>>> frankcox    5401    5400  0 67980  7440   1 14:52 pts/0    00:00:00 ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp
>>>> frankcox    5526    5416  0 55476  1076   3 14:53 pts/1    00:00:00 grep --color=auto rsync
>>>>
>>>> [frankcox at mutt ~]$ strace -p 5401
>>>> strace: Process 5401 attached
>>>> select(11, [5 9 10], [], NULL, NULL
>>>>
>>>> Then it just sits there with no further action.  I get my cursor back
>>>> when I hit ctrl-c.
>>>>
>>>> [frankcox at mutt ~]$ strace -p 5400
>>>> strace: Process 5400 attached
>>>> restart_syscall(<... resuming interrupted nanosleep ...>) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>>
>>>> The wait4-etc line just keeps repeating endlessly until I hit ctrl-c.
>>>>
>>>> Unfortunately, I have no idea what any of the above actually means.
>>>> Does it tell us anything interesting?
>>>
>>> Yay!  I am glad someone else on the planet is experiencing this.
>>> I noticed this started happening to me after updating some CentOS
>>> Linux 8 systems today.
>>>
>>> I discovered if I set ForwardX11=no (either on ssh command line or in
>>> ~/.ssh/config) the hang does not happen.  But why does that matter?
>>> No updates to openssh.
>>>
>>> It is not the systemd update doing something silly with session
>>> management.  I painfully downgraded manually and rebooted to no
>>> effect. 
>>>
>>> As an aside, why can't we have nice things in life like 'dnf
>>> downgrade systemd\*' actually work?  I did the below - might be dumb,
>>> but it worked -- alternate suggestions to downgrade are appreciated -
>>> searching the list and my google-fu was off the mark today.
>>>
>>>   cd [path-to-repo]/centos/8/BaseOS/x86_64/os/Packages
>>>   dnf downgrade $(rpm -qa systemd\* | grep 239-41.el8_3.2 | sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/')
>>>
>>> Chris
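
The one-liner above is a bit dense, so here is a small sketch of what its
grep/sed stage does to a single package name.  The function name
to_downgrade_paths and the sample package string are made up for
illustration; the grep pattern and the sed expressions are taken from the
quoted command.

```shell
#!/bin/sh
# Turn installed package names (one per line on stdin, in the format
# printed by `rpm -qa`) into local .rpm file paths for the previous
# release.  Names that do not carry the 239-41.el8_3.2 release are dropped.
to_downgrade_paths() {
    grep '239-41\.el8_3\.2' |
        sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/'
}

# Illustrative input: a literal name instead of live `rpm -qa` output.
printf '%s\n' 'systemd-239-41.el8_3.2.x86_64' | to_downgrade_paths
```

The first sed expression rewrites the release suffix, the second prepends
"./" so dnf treats the result as a local file, and the third appends
".rpm".  This only works when run from the directory holding the older
packages.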
>>
>>
>> [adjusted the subject, hope that is OK.]
>>
>> Found it!  It's the dbus update to 1.12.8-12.  Downgrade to -11
>> and ssh connections close normally.
>>
>> To clarify the problem, with the new dbus, simple ssh's like:
>>
>> ssh somehost uptime
>>
>> will print the uptime, but do not return to the local shell prompt until
>> you hit ctrl-c.  It works normally if you downgrade dbus or
>>
>> ssh -o forwardx11=no somehost uptime
>>
>> I'm sure a bug report exists somewhere, but that's something to dig for
>> or create tomorrow.
>>
>> To downgrade, packages were scattered in different locations, so I
>> copied them to one directory and did
>>
>> dnf downgrade ./*
>>
>> The packages I needed to downgrade on an x86_64 system were:
>>
>> dbus-1.12.8-11.el8.x86_64.rpm
>> dbus-common-1.12.8-11.el8.noarch.rpm
>> dbus-daemon-1.12.8-11.el8.x86_64.rpm
>> dbus-devel-1.12.8-11.el8.x86_64.rpm
>> dbus-libs-1.12.8-11.el8.x86_64.rpm
>> dbus-tools-1.12.8-11.el8.x86_64.rpm
>> dbus-x11-1.12.8-11.el8.x86_64.rpm
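
For anyone unsure which build a machine is running, a rough sketch of a
check follows.  The function name classify_dbus is hypothetical, and the
".el8" patterns are an assumption based on the file names listed above.
Anything else, including rebuilds with an extra revision suffix, is
reported as unknown rather than guessed at.

```shell
#!/bin/sh
# Classify a dbus package string of the form name-version-release.arch.
# Per this thread: 1.12.8-12 shows the hang, 1.12.8-11 does not.
classify_dbus() {
    case "$1" in
        dbus-1.12.8-12.el8*) echo "affected" ;;
        dbus-1.12.8-11.el8*) echo "known-good" ;;
        *)                   echo "unknown" ;;
    esac
}

# Only query the RPM database where rpm is actually available.
if command -v rpm >/dev/null 2>&1; then
    classify_dbus "$(rpm -q dbus)"
fi
```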
>
> Now that's really interesting, I was wondering why I don't see this on
> OL8. The thing is that certain OL8 packages have an additional RPM
> revision added like .0.1. Just checked dbus and its changelog shows:
>
> * Tue Feb 16 2021 Kevin Lyons <kevin.x.lyons at oracle.com> -1.12.8-12.0.1
> - bus: raise fd limits before dropping privs [Orabug: 31175643]
> - fix netlink poll: error 4 (Zhenzhong Duan)
>
> So OL is definitely not 100% bug-for-bug compatible like the other clones :-)
>
> And it makes me a bit worried: why did O* fix this on Feb 16 while the
> broken dbus packages are only now (in April) being installed on CentOS
> servers?

Sorry, maybe I'm wrong here and the OL8 addons are fixing other things?
Could someone who experiences the issue test with the OL8 dbus packages?
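
To make such a test repeatable, something like the sketch below might
help.  It is only a rough idea, not a verified reproducer: the function
name returns_promptly and the 10 second limit are my own choices, and it
assumes key-based login so that ssh never waits for a password.

```shell
#!/bin/sh
# Run a command under a time limit and report whether it came back on
# its own.  Relies on coreutils `timeout`, which exits with status 124
# when the limit is reached.  "returned" only means the command exited
# by itself, not that it succeeded.
returns_promptly() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"
    else
        echo "returned"
    fi
}

# Intended use, with "somehost" as in the example quoted above:
#   returns_promptly 10 ssh somehost uptime
#   returns_promptly 10 ssh -o ForwardX11=no somehost uptime
```

If the first call prints "hang" and the second prints "returned", the
machine shows the behaviour described in this thread.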

Thanks,
Simon