[CentOS] ssh stalls/hangs instead of exiting

Wed Apr 14 06:10:17 UTC 2021
Simon Matter <simon.matter at invoca.ch>

> On 4/13/21 11:36 PM, Chris Schanzle via CentOS wrote:
>> On 4/13/21 5:00 PM, Frank Cox wrote:
>>> On Tue, 13 Apr 2021 22:29:26 +0200
>>> Simon Matter wrote:
>>>
>>>> You could try running strace on the hanging process to see what it's
>>>> doing.
>>> [frankcox at mutt temp]$ rsync -avv ../temp/ jeff:temp
>>> opening connection using: ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp  (7 args)
>>> sending incremental file list
>>> delta-transmission enabled
>>> abc is uptodate
>>> total: matches=0  hash_hits=0  false_alarms=0 data=0
>>>
>>> Leaving that sit there apparently doing nothing (but still not giving
>>> me my cursor back) I switched to another terminal window and did the
>>> following:
>>>
>>> [frankcox at mutt ~]$ ps -FA | grep rsync
>>> frankcox    5400    2435  0 60586  3160   5 14:52 pts/0    00:00:00 rsync -avv ../temp/ jeff:temp
>>> frankcox    5401    5400  0 67980  7440   1 14:52 pts/0    00:00:00 ssh jeff rsync --server -vvlogDtpre.iLsfxC . temp
>>> frankcox    5526    5416  0 55476  1076   3 14:53 pts/1    00:00:00 grep --color=auto rsync
>>>
>>> [frankcox at mutt ~]$ strace -p 5401
>>> strace: Process 5401 attached
>>> select(11, [5 9 10], [], NULL, NULL
>>>
>>> Then it just sits there with no further action.  I get my cursor back
>>> when I hit ctrl-c.
>>>
>>> [frankcox at mutt ~]$ strace -p 5400
>>> strace: Process 5400 attached
>>> restart_syscall(<... resuming interrupted nanosleep ...>) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>> nanosleep({tv_sec=0, tv_nsec=20000000}, NULL) = 0
>>> wait4(5401, 0x7ffd45105564, WNOHANG, NULL) = 0
>>>
>>> The wait4-etc line just keeps repeating endlessly until I hit ctrl-c.
>>>
>>> Unfortunately, I have no idea what any of the above actually means.
>>> Does it tell us anything interesting?
>>
>> Yay!  I am glad someone else on the planet is experiencing this. 
>> I noticed this started happening to me after updating some CentOS Linux 8
>> systems today.
>>
>> I discovered if I set ForwardX11=no (either on ssh command line or in
>> ~/.ssh/config) the hang does not happen.  But why does that matter?  No
>> updates to openssh.
>>
>> It is not the systemd update doing something silly with session
>> management.  I painfully downgraded manually and rebooted to no effect. 
>>
>> As an aside, why can't we have nice things in life like 'dnf downgrade
>> systemd\*' actually work?  I did the below - might be dumb, but it
>> worked -- alternate suggestions to downgrade are appreciated - searching
>> the list and my google-fu was off the mark today.
>>
>>   cd [path-to-repo]/centos/8/BaseOS/x86_64/os/Packages
>>   dnf downgrade $(rpm -qa systemd\* | grep 239-41.el8_3.2 | sed -e 's/3\.2/3.1/' -e 's/^/.\//' -e 's/$/.rpm/')
>>
>> Chris
>
>
> [adjusted the subject, hope that is OK.]
>
> Found it!  It's the dbus update to 1.12.8-12.  Downgrade to -11
> and ssh connections close normally.
>
> To clarify the problem, with the new dbus, simple ssh's like:
>
> ssh somehost uptime
>
> will print the uptime, but do not return to the local shell prompt until
> you hit ctrl-c.  It works normally if you downgrade dbus or
>
> ssh -o forwardx11=no somehost uptime
>
> I'm sure a bug report exists somewhere, but that's something to dig for or
> create tomorrow.
>
> To downgrade, packages were scattered in different locations, so I copied
> them to one directory and did
>
> dnf downgrade ./*
>
> The packages I needed to downgrade on an x86_64 system were:
>
> dbus-1.12.8-11.el8.x86_64.rpm
> dbus-common-1.12.8-11.el8.noarch.rpm
> dbus-daemon-1.12.8-11.el8.x86_64.rpm
> dbus-devel-1.12.8-11.el8.x86_64.rpm
> dbus-libs-1.12.8-11.el8.x86_64.rpm
> dbus-tools-1.12.8-11.el8.x86_64.rpm
> dbus-x11-1.12.8-11.el8.x86_64.rpm

Now that's really interesting, I was wondering why I don't see this on
OL8. The thing is that certain OL8 packages have an additional RPM
revision added like .0.1. Just checked dbus and its changelog shows:

* Tue Feb 16 2021 Kevin Lyons <kevin.x.lyons at oracle.com> -1.12.8-12.0.1
- bus: raise fd limits before dropping privs [Orabug: 31175643]
- fix netlink poll: error 4 (Zhenzhong Duan)

So OL is definitely not 100% bug-for-bug compatible like the other clones :-)

And it makes me a bit worried that O* fixed this on Feb 16, yet the broken
dbus packages are only now (in April) being installed on CentOS servers.
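
For anyone who wants to check a host against the versions mentioned above,
here is a minimal shell sketch. It is not taken from any post in this
thread: the function names classify_dbus and still_running_after are made
up for illustration, the patterns only cover the version strings quoted
here (whatever follows the release number is not checked), and it assumes
timeout(1) from coreutils is available.

```shell
#!/bin/sh
# Sketch only. Both helper names are hypothetical, and the patterns are
# based solely on the dbus versions quoted in this thread.

# classify_dbus VERSION-RELEASE
# Prints which of the builds discussed in the thread the string matches.
classify_dbus() {
    case "$1" in
        # Must come first: it is a more specific form of the -12 release.
        1.12.8-12.0.1*)         echo "OL build with extra fixes" ;;
        1.12.8-12|1.12.8-12.*)  echo "build reported to cause the hang" ;;
        1.12.8-11|1.12.8-11.*)  echo "previous build, reported to work" ;;
        *)                      echo "not discussed in this thread" ;;
    esac
}

# still_running_after SECONDS COMMAND [ARGS...]
# Succeeds if COMMAND was still running at the deadline and had to be
# killed by timeout(1), which exits with status 124 in that case.
# Caveat: a command that itself exits with 124 would be misread as a hang.
still_running_after() {
    _secs=$1
    shift
    _status=0
    timeout "$_secs" "$@" >/dev/null 2>&1 || _status=$?
    [ "$_status" -eq 124 ]
}
```

Possible usage on an affected client (somehost is the placeholder host name
used earlier in the thread):

  classify_dbus "$(rpm -q --qf '%{VERSION}-%{RELEASE}' dbus)"
  still_running_after 15 ssh somehost uptime && echo "ssh did not exit"
  still_running_after 15 ssh -o ForwardX11=no somehost uptime || echo "exited normally"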

Regards,
Simon