Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
If we don't hear any negative comments by Wednesday, July 19th, 2017, then we are going to push those to updates, as they solve iSCSI issues with some hardware and don't seem to impact anything else based on limited testing.
BTW .. to test, edit the xen repo config file in /etc/yum.repos.d/ and turn on the testing repository .. or
yum --enablerepo=centos-virt-xen-testing upgrade kernel*
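For the repo-file route, this is roughly the edit (a sketch: the repo id comes from the yum command above, but the repo file name under /etc/yum.repos.d/ varies, so check your system):

```shell
# Flip the [centos-virt-xen-testing] section to enabled=1 in a yum repo file.
# The file name below is hypothetical; look in /etc/yum.repos.d/ for the real one.
enable_testing_repo() {
    # from the section header up to the next section header, set enabled=0 -> 1
    sed -i '/\[centos-virt-xen-testing\]/,/^\[/ s/^enabled=0/enabled=1/' "$1"
}
# e.g. enable_testing_repo /etc/yum.repos.d/CentOS-Xen.repo
```

Remember to set enabled back to 0 after testing, or later yum updates will keep pulling testing kernels.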
Thanks, Johnny Hughes
Hello, is this kernel usable also for KVM, or is it only for Xen?
Best regards, Kristián Feldsam Tel.: +420 773 303 353, +421 944 137 535 E-mail: support@feldhost.cz
www.feldhost.cz - FeldHost™ – professional hosting and server services at fair prices.
On 17 Jul 2017, at 16:45, Johnny Hughes <johnny@centos.org> wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
If we don't hear any negative comments by Wednesday July 19th, 2017 then we are going to push those to updates as they solve iscsi issues with some hardware and don't seem to impact anything else based on limited testing.
BTW .. to test, edit the xen repo config file in /etc/yum.repos.d/ and turn on the testing repository .. or
yum --enablerepo=centos-virt-xen-testing upgrade kernel*
Thanks, Johnny Hughes
CentOS-virt mailing list CentOS-virt@centos.org https://lists.centos.org/mailman/listinfo/centos-virt
On 07/17/2017 09:47 AM, Kristián Feldsam wrote:
Hello, is this kernel usable also for KVM or is only for XEN?
This kernel is intended for the Xen repos. None of us are testing it with KVM to my knowledge, but it may work. The KVM-related virt SIG repos don't include a custom kernel.
This kernel tracks an upstream LTS kernel and is built for Xen-specific functionality. Personally, I would stick with the base kernels for CentOS, as they're intended to run KVM and are maintained longer than upstream LTS kernels.
On 07/17/2017 09:47 AM, Kristián Feldsam wrote:
Hello, is this kernel usable also for KVM or is only for XEN?
It certainly will work with KVM (I use it on both my KVM test server and my Xen test server).
But probably a better kernel for KVM is here:
http://mirror.centos.org/altarch/7/experimental/x86_64/
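For anyone wanting to try it, a sketch of a repo stanza pointing at that tree (the repo id and settings here are my own invention, not an official repo file; gpgcheck is disabled only because the signing setup for the experimental tree isn't given in this thread):

```ini
# /etc/yum.repos.d/centos-experimental-kernel.repo (hypothetical)
[centos-experimental-kernel]
name=CentOS 7 AltArch experimental kernel
baseurl=http://mirror.centos.org/altarch/7/experimental/x86_64/
enabled=0
gpgcheck=0
```

Then something like: yum --enablerepo=centos-experimental-kernel list 'kernel*'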
On 17 Jul 2017, at 16:45, Johnny Hughes <johnny@centos.org> wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
If we don't hear any negative comments by Wednesday July 19th, 2017 then we are going to push those to updates as they solve iscsi issues with some hardware and don't seem to impact anything else based on limited testing.
BTW .. to test, edit the xen repo config file in /etc/yum.repos.d/ and turn on the testing repository .. or
yum --enablerepo=centos-virt-xen-testing upgrade kernel*
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and not for other people either, as far as I know). The last stable kernel was 4.9.13-22.
Since 4.9.25-26 I often get, on 3 Supermicro servers (different generations):
- memory allocation errors on Dom0 and corresponding lost page writes due to buffer I/O errors on PV guests
- after such a memory allocation error on dom0 I have also spotted:
  - NFS client hangups on guests (server not responding, still trying => server OK)
  - iptables lockups on PV guest reboot

on 1 Supermicro server:
- memory allocation errors on Dom0 and SATA lockups (many, if not all, SATA channels at once):
  exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
  hard resetting link
  failed to IDENTIFY (I/O error, err_mask=0x4)
  then: blk_update_request: I/O error, dev sd., sector ....

All of these machines have been tested with memtest; no memory problems detected. No such things occur when I boot 4.9.13-22. Most of my guests are CentOS 6 x86_64, bridged.
Has anyone had such problems, or dealt with them somehow?
Since spotting these errors I have done many tests, compiling and testing to pin down a single code change (kernel version, patch) - no conclusions yet.
But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size and configuration. The 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been compiled into the kernel; here is a shortened but significant list:
- iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
- SATA_AHCI
- ATA_AHCI (PATA, what the heck?)
- FBDEV_FRONTEND
- HID_MAGICMOUSE
- HID_NTRIG
- USB_XHCI
- INTEL_SMARTCONNECT
Do we really need these compiled into the dom0 kernel?
I assume that the biggest change in size is due to yama and CRYPTO_*, and that is not going to change.
Regards,
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
Since 4.9.25-26 I do often get: on 3 supermicro servers (different generations):
- memory allocation errors on Dom0 and corresponding lost lost page writes due to buffer I/O error on PV guests
- after such memory allocation error od dom0 I have spotted also:
- NFS client hangups on guests (server not responding, still trying
=> server OK) - iptables lockups on PV guest reboot
on 1 supermicro server:
- memory allocation errors on Dom0 and SATA lockups (many, if not SATA
channels at - once): exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen hard resetting link failed to IDENTIFY (I/O error, err_mask=0x4) then: blk_update_request: I/O error, dev sd., sector ....
All of these machines have been tested with memtest, no detected memory problems. No such things occur, when I boot 4.9.13-22 Most of my guests are centos 6 x86_64, bridged.
Do anyone had such problems, dealt with it somehow?
Since spotting these errors I have done many tests, compiled and tested to point out single code change (kernel version, patch) - no conclusions yet.
But one has changed much between 4.9.13 and 4.9.25: kernel size and configuration. 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been compiled into kernel, here is shortened, but significant list:
- iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
- SATA_AHCI
- ATA_AHCI (PATA, what a heck?)
- FBDEV_FRONTEND
- HID_MAGICKMOUSE
- HID_NTRIG
- USB_XHCI
- INTEL_SMARTCONNECT
Modules that are not loaded are not used; a built module has no impact at all on performance or compatibility unless it is loaded. If you take an lsmod from the kernel that works and one from the kernel with issues, we can see whether there are LOADED modules that might cause issues.
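A sketch of that comparison (the kernel versions and file names below are examples only):

```shell
# Normalize `lsmod` output (read on stdin) to a sorted list of loaded
# module names, so snapshots taken under each kernel can be diffed.
snapshot_modules() {
    awk 'NR > 1 { print $1 }' | sort
}
# Under each kernel:  lsmod | snapshot_modules > modules-$(uname -r).txt
# Then:               comm -13 modules-4.9.13.txt modules-4.9.34.txt
# (lines printed = modules loaded only on the problem kernel)
```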
The modules that are built match what Fedora builds and, where the option exists in the RHEL 7 kernel, what RHEL 7 builds.
We did troubleshoot and turn off some things recently, one thing in particular being CONFIG_IO_STRICT_DEVMEM, which is on in Fedora but off in some other distros, and which causes issues with iSCSI and some other things.
We also added some specific xen patches, one for netback queue, one for apic, one for nested dom0. Also upstream has added in several xen patches since 4.9.13.
And yes, we did change the kernel configs specifically to add in iptables as many people want them.
If you can point to problems with a specific module, we can discuss it here and turn it off if necessary.
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
Since 4.9.25-26 I do often get: on 3 supermicro servers (different generations):
- memory allocation errors on Dom0 and corresponding lost lost page writes due to buffer I/O error on PV guests
- after such memory allocation error od dom0 I have spotted also:
- NFS client hangups on guests (server not responding, still trying
=> server OK) - iptables lockups on PV guest reboot
on 1 supermicro server:
- memory allocation errors on Dom0 and SATA lockups (many, if not SATA
channels at - once): exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen hard resetting link failed to IDENTIFY (I/O error, err_mask=0x4) then: blk_update_request: I/O error, dev sd., sector ....
All of these machines have been tested with memtest, no detected memory problems. No such things occur, when I boot 4.9.13-22 Most of my guests are centos 6 x86_64, bridged.
Do anyone had such problems, dealt with it somehow?
Since spotting these errors I have done many tests, compiled and tested to point out single code change (kernel version, patch) - no conclusions yet.
But one has changed much between 4.9.13 and 4.9.25: kernel size and configuration. 4.9.13 size was 6MB and 4.9.24 is 7.1MB. Many modules have been compiled into kernel, here is shortened, but significant list:
- iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
- SATA_AHCI
- ATA_AHCI (PATA, what a heck?)
- FBDEV_FRONTEND
- HID_MAGICKMOUSE
- HID_NTRIG
- USB_XHCI
- INTEL_SMARTCONNECT
Modules that are not loaded are not used. It has no impact at all on performance or compatibility unless it is used. If you take an lsmod of the kernel that works and one of the kernel with issues, we can see if there are LOADED modules that might cause issues.
The modules that are built are the same as Fedora and if in the RHEL 7 kernel, RHEL 7.
We did troubleshoot and turn off some things recently, one thing in particular was CONFIG_IO_STRICT_DEVMEM , which is on in fedora, but which is off in some other distros and causes issues with ISCSI and some other things.
We also added some specific xen patches, one for netback queue, one for apic, one for nested dom0. Also upstream has added in several xen patches since 4.9.13.
There are several very important patches in this kernel for xen (for example):
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.36
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
I think I have nailed down the faulty combo. My tests showed that the SLUB allocator does not work well in a Xen Dom0, on top of the Xen hypervisor. It does not work on at least one of my testing servers (old AMD K8 (1 proc, 1 core), only 1 paravirt guest). If a kernel with SLUB is booted bare (w/o the Xen hypervisor), it works well. If booted as a dom0 module under the Xen hypervisor - it almost instantly gets a page allocation failure.
SLAB=>SLUB was changed in the kernel config starting from 4.9.25. Then problems started to explode in my production environment, and on the testing server mentioned above.
After recompiling recent 4.9.34 with SLAB - everything works well on that testing machine. I will try to test 4.9.38 with the same config on my production servers.
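A quick way to confirm which allocator a given build selected (a sketch; these kernels install their config as /boot/config-&lt;version&gt;):

```shell
# Print the slab allocator (SLAB/SLUB/SLOB) selected in a kernel config file.
allocator_of() {
    grep -E '^CONFIG_(SLAB|SLUB|SLOB)=y$' "$1" | sed -e 's/^CONFIG_//' -e 's/=y$//'
}
# e.g. allocator_of /boot/config-4.9.34-29.el6.x86_64   # per this thread: SLUB
```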
Moreover, digging into logs of memory allocation failures on my production supermicro servers resulted in some interesting findings:
Jul 9 05:02:47 xen kernel: [3040088.089379] gzip: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
Jul 10 12:18:01 xen kernel: [3152495.802565] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.815871] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:18:01 xen kernel: [3152495.816826] 2.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:18:01 xen kernel: [3152495.832477] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 10 12:20:20 xen kernel: [3152635.070680] 1.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 10 12:20:20 xen kernel: [3152635.083952] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 12 09:15:15 xen kernel: [118420.343615] 10.xvda5-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 12 09:15:15 xen kernel: [118420.359779] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
What is node "-1" ? 8-/
I think it should be reported to Xen and/or SLUB developers. I suggest releasing new Xen kernels with SLAB, until the issue is resolved.
Regards,
On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
I think I have nailed down the faulty combo. My tests showed, that SLUB allocator does not work well in Xen Dom0, on top of Xen Hypervisor. Id does not work at least on one of my testing servers (old AMD K8 (1 proc, 1 core), only 1 paravirt guest). If kernel with SLUB booted as main (w/o Xen hypervisor), it works well. If booted as Xen hypervisor module - it almost instantly gets page allocation failure.
SLAB=>SLUB was changed in kernel config, starting from 4.9.25. Then problems started to explode in my production environment, and on testing server mentioned above.
After recompiling recent 4.9.34 with SLAB - everything works well on that testing machine. A will try to test 4.9.38 with the same config on my production servers.
I was having page allocation failures on 4.9.25 with SLUB, but these problems seem to be gone with 4.9.34 (still with SLUB). Have you checked this build? It was moved to the stable repo on July 4th.
config-4.9.25-27.el6.x86_64:CONFIG_SLUB=y config-4.9.34-29.el6.x86_64:CONFIG_SLUB=y
On Thu, 20 Jul 2017, Kevin Stange wrote:
On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
I think I have nailed down the faulty combo. My tests showed, that SLUB allocator does not work well in Xen Dom0, on top of Xen Hypervisor. Id does not work at least on one of my testing servers (old AMD K8 (1 proc, 1 core), only 1 paravirt guest). If kernel with SLUB booted as main (w/o Xen hypervisor), it works well. If booted as Xen hypervisor module - it almost instantly gets page allocation failure.
SLAB=>SLUB was changed in kernel config, starting from 4.9.25. Then problems started to explode in my production environment, and on testing server mentioned above.
After recompiling recent 4.9.34 with SLAB - everything works well on that testing machine. A will try to test 4.9.38 with the same config on my production servers.
I was having page allocation failures on 4.9.25 with SLUB, but these problems seem to be gone with 4.9.34 (still with SLUB). Have you checked this build? It was moved to the stable repo on July 4th.
Yes, 4.9.34 was failing too. And this was actually the worst case, with I/O error on guest:
Jul 16 06:01:03 dom0 kernel: [452360.743312] CPU: 0 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:03 guest kernel: end_request: I/O error, dev xvda3, sector 9200640
Jul 16 06:01:03 dom0 kernel: [452360.758931] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:03 guest kernel: Buffer I/O error on device xvda3, logical block 1150080
Jul 16 06:01:03 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:03 guest kernel: Buffer I/O error on device xvda3, logical block 1150081
Jul 16 06:01:03 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:03 guest kernel: Buffer I/O error on device xvda3, logical block 1150082
Jul 16 06:01:03 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:03 guest kernel: Buffer I/O error on device xvda3, logical block 1150083
Jul 16 06:01:03 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:03 guest kernel: Buffer I/O error on device xvda3, logical block 1150084
Jul 16 06:01:03 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:03 dom0 kernel: [452361.449389] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:03 dom0 kernel: [452361.449685] CPU: 1 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:03 dom0 kernel: [452361.449934] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.0c 02/25/2010
Jul 16 06:01:03 guest kernel: end_request: I/O error, dev xvda3, sector 6102784
Jul 16 06:01:03 dom0 kernel: [452361.462103] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:03 dom0 kernel: [452361.676257] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:03 dom0 kernel: [452361.676531] CPU: 0 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:03 guest kernel: end_request: I/O error, dev xvda3, sector 6127872
Jul 16 06:01:03 dom0 kernel: [452361.692171] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:07 dom0 kernel: [452365.438565] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:07 dom0 kernel: [452365.438870] CPU: 0 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:07 dom0 kernel: [452365.454213] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:07 guest kernel: end_request: I/O error, dev xvda3, sector 6477112
Jul 16 06:01:09 dom0 kernel: [452366.732994] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:09 dom0 kernel: [452366.733306] CPU: 0 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:09 dom0 kernel: [452366.746362] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:09 guest kernel: end_request: I/O error, dev xvda3, sector 6546488
Jul 16 06:01:09 guest kernel: Buffer I/O error on device xvda3, logical block 818311
Jul 16 06:01:09 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:09 guest kernel: Buffer I/O error on device xvda3, logical block 818312
Jul 16 06:01:09 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:09 guest kernel: Buffer I/O error on device xvda3, logical block 818313
Jul 16 06:01:09 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:09 guest kernel: Buffer I/O error on device xvda3, logical block 818314
Jul 16 06:01:09 guest kernel: lost page write due to I/O error on xvda3
Jul 16 06:01:09 guest kernel: Buffer I/O error on device xvda3, logical block 818315
Jul 16 06:01:09 dom0 kernel: [452366.913734] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:09 dom0 kernel: [452366.914002] CPU: 1 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:09 guest kernel: end_request: I/O error, dev xvda3, sector 6366208
Jul 16 06:01:09 dom0 kernel: [452366.929809] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:09 dom0 kernel: [452367.288193] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:09 dom0 kernel: [452367.288455] CPU: 1 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:09 dom0 kernel: [452367.301690] SLUB: Unable to allocate memory on node -1, gfp=0x2000000(GFP_NOWAIT)
Jul 16 06:01:09 guest kernel: end_request: I/O error, dev xvda3, sector 6630656
Jul 16 06:01:10 dom0 kernel: [452368.253435] 12.xvda3-0: page allocation failure: order:0, mode:0x2200000(GFP_NOWAIT|__GFP_NOTRACK)
Jul 16 06:01:10 dom0 kernel: [452368.253701] CPU: 0 PID: 28450 Comm: 12.xvda3-0 Tainted: G O 4.9.34-29.el6.x86_64 #1
Jul 16 06:01:10 guest kernel: end_request: I/O error, dev xvda3, sector 6708224
Regards,
On 07/20/2017 03:14 PM, Piotr Gackiewicz wrote:
On Thu, 20 Jul 2017, Kevin Stange wrote:
On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels do not work well for me any more (and for other people neither, as I know). Last stable kernel was 4.9.13-22.
I think I have nailed down the faulty combo. My tests showed, that SLUB allocator does not work well in Xen Dom0, on top of Xen Hypervisor. Id does not work at least on one of my testing servers (old AMD K8 (1 proc, 1 core), only 1 paravirt guest). If kernel with SLUB booted as main (w/o Xen hypervisor), it works well. If booted as Xen hypervisor module - it almost instantly gets page allocation failure.
SLAB=>SLUB was changed in kernel config, starting from 4.9.25. Then problems started to explode in my production environment, and on testing server mentioned above.
After recompiling recent 4.9.34 with SLAB - everything works well on that testing machine. A will try to test 4.9.38 with the same config on my production servers.
I was having page allocation failures on 4.9.25 with SLUB, but these problems seem to be gone with 4.9.34 (still with SLUB). Have you checked this build? It was moved to the stable repo on July 4th.
Yes, 4.9.34 was failing too. And this was actually the worst case, with I/O error on guest:
I will happily create a test kernel with SLAB .. what is your config file diff?
On Fri, 21 Jul 2017, Johnny Hughes wrote:
I will happily create a test kernel with SLAB .. what is your config file diff?
I just chose the SLAB allocator in menuconfig. That implied several other internal configuration changes.
The overall difference (config file patch) is in the attachment.
But my point about PATA etc. being compiled in, instead of built as modules, still stands ;-).
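The attachment is not reproduced here. Purely as an illustration of the kind of change involved (this is not the actual patch), selecting SLAB in a 4.9 .config looks like:

```
# selecting SLAB in menuconfig flips these two lines; the SLUB-only
# options (CONFIG_SLUB_DEBUG etc.) drop out as the implied changes
CONFIG_SLAB=y
# CONFIG_SLUB is not set
```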
Regards,
On 07/21/2017 06:06 AM, Piotr Gackiewicz wrote:
On Fri, 21 Jul 2017, Johnny Hughes wrote:
I will happily create a test kernel with SLAB .. what is your config file diff?
I have just choosed SLAB allocator in menuconfig. It has implied several other internal configurations changes.
Overall differencess (config file patch) is in attachment.
But my considerations about compiled in PATA etc., instead of modules, remain actual ;-).
OK .. I will create a 4.9.39 kernel with SLUB off and SLAB on later today. The only change I will make is to also turn slab_debug on (it will add things to the debuginfo file that Sarah created for the kernels).
WRT the other items in the kernel (modules or compiled directly), I would rather leave everything else alone as it is how the Red Hat kernels seem to be done .. unless someone has them actually causing a problem.
On 07/21/2017 06:06 AM, Piotr Gackiewicz wrote:
On Fri, 21 Jul 2017, Johnny Hughes wrote:
I will happily create a test kernel with SLAB .. what is your config file diff?
I have just choosed SLAB allocator in menuconfig. It has implied several other internal configurations changes.
Overall differencess (config file patch) is in attachment.
But my considerations about compiled in PATA etc., instead of modules, remain actual ;-).
OK .. I have built:
kernel-4.9.39-29.el6 and kernel-4.9.39-29.el7
They have been tagged to the testing repository
They should show up in the testing repo in a couple of hours.
Everyone who was having memory issues, give those a try.
Also, please test the iscsi configs as well.
On 21/07/2017 18:26, Johnny Hughes wrote:
On 07/21/2017 06:06 AM, Piotr Gackiewicz wrote:
On Fri, 21 Jul 2017, Johnny Hughes wrote:
I will happily create a test kernel with SLAB .. what is your config file diff?
I have just choosed SLAB allocator in menuconfig. It has implied several other internal configurations changes.
Overall differencess (config file patch) is in attachment.
But my considerations about compiled in PATA etc., instead of modules, remain actual ;-).
OK .. I have built:
kernel-4.9.39-29.el6 and kernel-4.9.39-29.el7
They have been tagged to the testing repository
They should show up in a couple of hours into the testing repo.
Everyone who was having memory issues, give those a try.
Also, please test the iscsi configs as well.
tested
kernel-4.9.39-29.el6
and iSCSI on the bnx2i driver for a Broadcom 57810 is working fine - only a few hours so far, but much better than the latest released kernel
On 07/20/2017 03:14 PM, Piotr Gackiewicz wrote:
On Thu, 20 Jul 2017, Kevin Stange wrote:
On 07/20/2017 05:31 AM, Piotr Gackiewicz wrote:
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 09:23 AM, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels no longer work well for me (nor for other people, as far as I know). The last stable kernel was 4.9.13-22.
I think I have nailed down the faulty combo. My tests showed that the SLUB allocator does not work well in a Xen Dom0, on top of the Xen hypervisor. It does not work on at least one of my testing servers (an old AMD K8, 1 processor, 1 core, only 1 paravirt guest). If a kernel with SLUB is booted bare (without the Xen hypervisor), it works well. If booted under the Xen hypervisor, it almost instantly gets a page allocation failure.
The config changed from SLAB to SLUB starting with 4.9.25. That is when problems started to explode in my production environment and on the testing server mentioned above.
After recompiling the recent 4.9.34 with SLAB, everything works well on that testing machine. I will try to test 4.9.38 with the same config on my production servers.
I was having page allocation failures on 4.9.25 with SLUB, but these problems seem to be gone with 4.9.34 (still with SLUB). Have you checked this build? It was moved to the stable repo on July 4th.
Yes, 4.9.34 was failing too. And this was actually the worst case, with an I/O error on the guest:
I did find one server running 4.9.34 that was still throwing SLUB page allocation errors, but oddly, the only servers ever to have this issue for me are spares that are running no domains. I've just tried booting that box up on 4.9.39, but I may not know if the switch back to SLAB fixes anything for several weeks.
Otherwise, the other server I'm running 4.9.39 on for the past 72 hours has been stable with running domains.
On 07/24/2017 03:05 PM, Kevin Stange wrote:
Cool,
We have several good reports .. I'll wait until Wednesday and push this kernel to "release" if we don't get any bad reports.
On Wed, 19 Jul 2017, Johnny Hughes wrote:
On 07/19/2017 04:27 AM, Piotr Gackiewicz wrote:
But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size and configuration. The 4.9.13 kernel was 6 MB and 4.9.24 is 7.1 MB. Many modules have been compiled into the kernel; here is a shortened but significant list:
- iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
- SATA_AHCI
- ATA_AHCI (PATA, what the heck?)
- FBDEV_FRONTEND
- HID_MAGICMOUSE
- HID_NTRIG
- USB_XHCI
- INTEL_SMARTCONNECT
Modules that are not loaded are not used.
You got me wrong: these are compiled into the kernel statically, not as modules...
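The distinction at issue is `=y` (built-in) versus `=m` (module) in the kernel config file. A quick way to sort a config fragment by that distinction; the three lines below are illustrative, not the real CentOS kernel config:

```shell
# Separate built-in (=y) from modular (=m) options in a config fragment.
cat > config.frag <<'EOF'
CONFIG_SATA_AHCI=y
CONFIG_ATA_GENERIC=m
CONFIG_HID_NTRIG=y
EOF
awk -F= '$2 == "y" {print $1, "built-in"} $2 == "m" {print $1, "module"}' config.frag
```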
More updates on the kernel problems: 4.9.13 (no Xen patches), recompiled with the config from 4.9.25, appears to also be faulty in my environment. I'll try to dissect the config and compilation differences and isolate what went wrong there. Trace included below.
Regards,
[  151.899085] vif vif-1-0 vif1.0: Trying to unmap invalid handle! pending_idx: 0xe5
[  151.899158] ------------[ cut here ]------------
[  151.899163] kernel BUG at drivers/net/xen-netback/netback.c:428!
[  151.899168] invalid opcode: 0000 [#1] SMP
[  151.899171] Modules linked in: xt_physdev br_netfilter xt_mac ebt_arp xen_pciback xen_gntalloc hwmon_vid ebtable_filter ebtables bridge 8021q mrp garp stp llc xt_CT xt_addrtype iptable_raw nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_owner iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd ppdev parport_pc parport fjes 3c59x asus_atk0110 pcspkr serio_raw k8temp via_rhine mii i2c_viapro shpchp raid1 ata_generic sata_via
[  151.899271] CPU: 0 PID: 2097 Comm: vif1.0-q0-deall Not tainted 4.9.13-26.itl.3.el6.x86_64 #1
[  151.899276] Hardware name: System manufacturer System Product Name/K8V-VM, BIOS 0902 05/14/2007
[  151.899282] task: ffff88006786c900 task.stack: ffffc900408f8000
[  151.899286] RIP: e030:[<ffffffffc00a576b>] [<ffffffffc00a576b>] xenvif_tx_dealloc_action+0x23b/0x240 [xen_netback]
[  151.899301] RSP: e02b:ffffc900408fbbe8 EFLAGS: 00010246
[  151.899305] RAX: 0000000000000045 RBX: ffffc90040915000 RCX: 0000000000000000
[  151.899309] RDX: 0000000000000000 RSI: ffff88007220e0e8 RDI: ffff88007220e0e8
[  151.899313] RBP: ffffc900408fbe18 R08: 00000000fffffffe R09: 0000000000000000
[  151.899318] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9004091cf48
[  151.899322] R13: 0000000000017852 R14: aaaaaaaaaaaaaaab R15: ffffc900408fbe68
[  151.899331] FS:  00007f207a23f9a0(0000) GS:ffff880072200000(0000) knlGS:0000000000000000
[  151.899336] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.899340] CR2: 00007f3d6f821000 CR3: 0000000066538000 CR4: 0000000000000660
[  151.899345] Stack:
[  151.899349]  ffff88006786c980 ffff88006786ea80 ffff880072216e80 0000000000000000
[  151.899356]  ffff88007220b890 0000000000000000 ffffc900408fbca8 ffffffff8101e0da
[  151.899363]  ffff88007220b890 ffff88007220c090 ffff88007220b890 ffff88007220c090
[  151.899370] Call Trace:
[  151.899380]  [<ffffffff8101e0da>] ? xen_load_sp0+0x9a/0x1a0
[  151.899385]  [<ffffffff8101e28a>] ? xen_load_tls+0xaa/0x160
[  151.899392]  [<ffffffff8102eb3c>] ? __switch_to+0x1dc/0x680
[  151.899397]  [<ffffffff810ce493>] ? finish_task_switch+0x93/0x270
[  151.899405]  [<ffffffff818ce9f8>] ? __schedule+0x238/0x530
[  151.899411]  [<ffffffff818d2e7f>] ? _raw_spin_lock_irqsave+0x1f/0x50
[  151.899417]  [<ffffffff818d2ba6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[  151.899423]  [<ffffffff810ed202>] ? prepare_to_wait_event+0x82/0x130
[  151.899428]  [<ffffffff818d2e7f>] ? _raw_spin_lock_irqsave+0x1f/0x50
[  151.899434]  [<ffffffff818d2ba6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[  151.899439]  [<ffffffff810ecd10>] ? finish_wait+0x70/0x90
[  151.899445]  [<ffffffffc00a57f6>] xenvif_dealloc_kthread+0x86/0x110 [xen_netback]
[  151.899451]  [<ffffffff810ecb60>] ? woken_wake_function+0x20/0x20
[  151.899456]  [<ffffffff818cedda>] ? schedule+0x3a/0xa0
[  151.899461]  [<ffffffff818d2ba6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[  151.899467]  [<ffffffffc00a5770>] ? xenvif_tx_dealloc_action+0x240/0x240 [xen_netback]
[  151.899477]  [<ffffffff810c60dd>] kthread+0xcd/0xf0
[  151.899481]  [<ffffffff810ce493>] ? finish_task_switch+0x93/0x270
[  151.899487]  [<ffffffff810d100e>] ? schedule_tail+0x1e/0xc0
[  151.899491]  [<ffffffff810c6010>] ? __kthread_init_worker+0x40/0x40
[  151.899496]  [<ffffffff818d32d5>] ret_from_fork+0x25/0x30
[  151.899501] Code: c6 01 44 89 f0 49 39 c5 7f be 0f 0b eb fe 0f 0b eb fe 48 8b 43 20 48 c7 c6 60 da 0a c0 48 8b b8 20 03 00 00 31 c0 e8 75 f3 6f c1 <0f> 0b eb fe 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec
[  151.899554] RIP  [<ffffffffc00a576b>] xenvif_tx_dealloc_action+0x23b/0x240 [xen_netback]
[  151.899561] RSP <ffffc900408fbbe8>
[  151.899612] ---[ end trace 7f804a9e1f8ce687 ]---
On 2017-07-19 02:27 AM, Piotr Gackiewicz wrote:
On Mon, 17 Jul 2017, Johnny Hughes wrote:
Are the testing kernels (kernel-4.9.37-29.el7 and kernel-4.9.37-29.el6, with the one config file change) working for everyone:
(turn off: CONFIG_IO_STRICT_DEVMEM)
Hello. Maybe it's not the most appropriate thread or time, but I have been signalling it before:
4.9.* kernels no longer work well for me (nor for other people, as far as I know). The last stable kernel was 4.9.13-22.
Since 4.9.25-26, on 3 Supermicro servers (different generations), I often get:
- memory allocation errors on Dom0, and corresponding lost page writes due to buffer I/O errors on PV guests
- after such a memory allocation error on Dom0, I have also spotted:
  - NFS client hangups on guests (server not responding, still trying => server OK)
  - iptables lockups on PV guest reboot

On 1 Supermicro server:
- memory allocation errors on Dom0, and SATA lockups (many, if not all, SATA channels at once):
    exception Emask 0x0 SAct 0x20 SErr 0x0 action 0x6 frozen
    hard resetting link
    failed to IDENTIFY (I/O error, err_mask=0x4)
  then: blk_update_request: I/O error, dev sd., sector ....
All of these machines have been tested with memtest; no memory problems were detected. No such things occur when I boot 4.9.13-22. Most of my guests are CentOS 6 x86_64, bridged.
Has anyone had such problems, or dealt with them somehow?
Since spotting these errors I have done many tests, compiling and testing to pinpoint a single code change (kernel version, patch); no conclusions yet.
But one thing has changed a lot between 4.9.13 and 4.9.25: kernel size and configuration. The 4.9.13 kernel was 6 MB and 4.9.24 is 7.1 MB. Many modules have been compiled into the kernel; here is a shortened but significant list:
- iptables (NETFILTER_XTABLES, IP_NF_FILTER, IP_NF_TARGET_REJECT)
- SATA_AHCI
- ATA_AHCI (PATA, what the heck?)
- FBDEV_FRONTEND
- HID_MAGICMOUSE
- HID_NTRIG
- USB_XHCI
- INTEL_SMARTCONNECT
Do we really need these compiled into the dom0 kernel?
I assume that the biggest change in size is due to Yama and CRYPTO_*, and that is not going to change.
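One way to enumerate exactly which options went from module to built-in between two releases is to join the two config files on the option name. A sketch with tiny invented fragments; a real comparison would use the full files from /boot/config-*:

```shell
# List options that changed from =m to =y between two (made-up) config files.
# join requires its inputs sorted on the join field, which these are.
cat > config-4.9.13 <<'EOF'
CONFIG_HID_NTRIG=m
CONFIG_SATA_AHCI=m
CONFIG_USB_XHCI_HCD=m
EOF
cat > config-4.9.25 <<'EOF'
CONFIG_HID_NTRIG=y
CONFIG_SATA_AHCI=y
CONFIG_USB_XHCI_HCD=m
EOF
join -t '=' config-4.9.13 config-4.9.25 |
  awk -F= '$2 == "m" && $3 == "y" {print $1}'
```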
Regards,
I have not done any deep digging on this, but we had Xen on CentOS 7 on a couple of servers for some experimental VMs: AMD, with a Tyan S2882 board. (Other than the kernel, all the packages are up to date.) The CPUs do not support HVM.
We host paravirtualized CentOS 7 instances (stock CentOS 7 kernels), but in our testing it was rebooting even with no VMs running.
With 4.9.23-26 (I think*; I would rather retest that before saying for sure) and 4.9.25-27 (I can confirm this one), we get periodic reboots and kernel panics; we haven't tested 4.9.34-29. (* By "I think": we have tested 2 kernels since 4.9.13-22, both of which had troubles so far, and I recall us not yet testing 4.9.34-29.)
Temporary solution for us is just holding onto the 4.9.13-22 kernel.
Anyway, I suppose this is a simple "me too" for now. I have not brought it up before because we haven't done any of the legwork to narrow it down.