My environment is "heterogeneous": my authentication and home servers are
currently stuck on a 1G shared network, while the production servers and
storage servers are on a bonded 40G network. All are in the same VLAN. I
have about 100 servers on the bonded 40G network, each with 12 cores and
128GB of memory.
They are running CentOS 6.6.
Except for my storage servers, they are all just running large and small
research jobs on a grid engine.
Two questions, after some background.
The errors that one user's jobs (more on her below) seem to spawn are:
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
At some point we start getting errors that the file locks are
stuck. You can still read from and write to the lockfile, but programs that
depend on the C file-locking construct throw file-lock errors until we reboot.
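To check whether locking itself is stuck (as opposed to plain reads and writes), a small probe along these lines might help. This is only a sketch under assumptions: it assumes the failing programs use POSIX advisory locks (the `lockf`/`fcntl` family), it assumes Python 3 is available on the node, and the function name `probe_lock` is made up for illustration.

```python
import errno
import fcntl
import os


def probe_lock(path):
    """Try to take and release an exclusive POSIX lock on `path`.

    Returns (True, None) on success, or (False, errno_name) on failure.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB asks for an immediate error instead of waiting
        # when another process already holds the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    except OSError as exc:
        # Translate the numeric errno into its symbolic name, e.g. ENOLCK.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
```

Calling `probe_lock()` on a scratch file under the NFS-mounted home directory, from a node that is showing the problem, would show whether a fresh lock request succeeds or comes back with an error such as ENOLCK. On a hard mount the call may still block if the server is not answering at all.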
Why are dmesg, /var/log/dmesg, and /var/log/messages different from each
other? I thought dmesg was a representation of /var/log/messages.
Is there a way to get a date stamp for dmesg output? If a job failed in
the last hour and the message is from yesterday, and I can't tell that,
the message doesn't help.
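One possible workaround for the date stamp question, sketched under assumptions: if the kernel is set to prefix each message with seconds since boot (for example via the `printk.time=1` boot parameter; the dmesg output in this post shows no such prefix, so it would have to be enabled first), the offset can be added to the boot time to get an approximate wall-clock time. The helper names below are hypothetical, and the result can drift from real time.

```python
import re
from datetime import datetime, timedelta

# Matches the "[12345.678901] " prefix the kernel adds when printk
# timestamps are enabled (seconds since boot).
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def read_uptime(path="/proc/uptime"):
    """The first field of /proc/uptime is seconds since boot."""
    with open(path) as fh:
        return float(fh.read().split()[0])


def boot_time(now, uptime_seconds):
    """Wall-clock time of boot, given the current time and the uptime."""
    return now - timedelta(seconds=uptime_seconds)


def stamp_line(line, booted):
    """Prefix a dmesg line with an approximate wall-clock time.

    Lines without a kernel timestamp are returned unchanged.
    """
    m = STAMP.match(line)
    if m is None:
        return line
    when = booted + timedelta(seconds=float(m.group(1)))
    return "%s %s" % (when.strftime("%b %d %H:%M:%S"), m.group(2))
```

In use, `booted = boot_time(datetime.now(), read_uptime())` would be computed once, and each line of saved dmesg output passed through `stamp_line(line, booted)`. This is mainly of interest where the installed dmesg has no option of its own for human-readable times.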
I think what I am troubleshooting is THAT user who REFUSES to follow
directions and is sending thousands of very large jobs, each of which
might immediately spawn another 10-20 jobs, to a grid of 100 servers in a
matter of seconds, overwhelming the network, the home directory
server, or the authentication server. When she strikes,
users sometimes cannot get a response from LDAP or the home server
for as long as 10 seconds. Thus she breaks NFS because it gets
hammered, and I have to restart all the servers on my grid.
We have had problems with "out of memory" errors due to her programs in
the recent past and had to restart all 100 servers.
*/var/log/messages gives this*
Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_result() failed:
Can't contact LDAP server
Oct 18 13:26:08 blade5-2-1 nslcd[2520]: [dd5cc5] ldap_abandon() failed
to abandon search: Other (e.g., implementation specific) error
Oct 18 13:27:14 blade5-2-1 nslcd[2520]: [e01acb] ldap_result() failed:
Can't contact LDAP server
Oct 18 13:27:30 blade5-2-1 nslcd[2520]: [8c7a8f] ldap_result() failed:
Can't contact LDAP server
*dmesg gives these*
lockd: server home not responding, still trying
lockd: server home OK
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
lockd: spurious grace period reject?!
lockd: failed to reclaim lock for pid 8225 (errno -37, status 4)
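For reference, the numbers in the reclaim message can be decoded. On Linux, errno 37 is ENOLCK ("No locks available"), and status 4 in the NLM (Network Lock Manager) protocol means the request was denied because the server is in its grace period. A small sketch that pulls these out of a log line follows; the function and table names are made up for illustration, and the errno table only covers the value seen here.

```python
import re

# Linux errno value seen in the log (assumption: the client is Linux,
# where 37 is ENOLCK; other systems number errno values differently).
ERRNO_NAMES = {37: "ENOLCK (No locks available)"}

# NLM status codes as defined by the NLM protocol.
NLM_STATUS = {
    0: "granted",
    1: "denied",
    2: "denied_nolocks",
    3: "blocked",
    4: "denied_grace_period",
}

RECLAIM = re.compile(
    r"lockd: failed to reclaim lock for pid (\d+) "
    r"\(errno (-?\d+), status (\d+)\)"
)


def decode_reclaim(line):
    """Return the pid, errno name, and NLM status from a reclaim failure.

    Returns None if the line is not a reclaim-failure message.
    """
    m = RECLAIM.search(line)
    if m is None:
        return None
    pid, err, status = (int(g) for g in m.groups())
    return {
        "pid": pid,
        # The kernel logs errno as a negative number, hence abs().
        "errno": ERRNO_NAMES.get(abs(err), "errno %d" % abs(err)),
        "nlm_status": NLM_STATUS.get(status, "status %d" % status),
    }
```

Read this way, the messages suggest that after the home server stopped responding and came back, the client tried to reclaim a lock held by pid 8225 and the server rejected the request. That interpretation is a guess based on the codes, not something confirmed on this system.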
*/var/log/dmesg gives this*
ipmi_si: probing via SMBIOS
ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 10
ipmi_si: Adding SMBIOS-specified kcs state machine
ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca8,
slave address 0x20, irq 10
(NULL device *): The BMC does not support setting the recv irq bit,
compensating, but the BMC needs to be fixed.
IRQ 10/ipmi_si: IRQF_DISABLED is not guaranteed on shared IRQs
ipmi_si ipmi_si.0: Using irq 10
ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100,
dev_id: 0x20)
ipmi_si ipmi_si.0: IPMI kcs interface initialized
ACPI: No handler for Region [SYSI] (ffff882029e57348) [IPMI]
power_meter ACPI000D:00: Found ACPI power meter.
ipmi device interface
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts:
EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts:
Adding 121724924k swap on /dev/mapper/vg_server-lv_swap. Priority:-1
extents:1 across:121724924