... Initializing..
hi all,
I have a very basic setup: two boxes connected directly via two MHEH28-XTC cards, and I cannot get the ports to become active. One peculiar thing is that I get the following in the kernel log (at random, and not often):
[85947.090496] AMD-Vi: Event logged [
[85947.090539] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]
[85947.298509] AMD-Vi: Event logged [
[85947.298550] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]
which is the card itself, judging by the device ID. Would you have any thoughts to share, please?
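To make the "judging by the device ID" reasoning explicit, here is a minimal Python sketch that tallies IO_PAGE_FAULT events per device and checks whether the faulting address shares a bus and slot with the card. It assumes the log lines look exactly like the excerpt above; the function names `summarize_faults` and `same_slot` are made up for illustration. Note that the fault is logged against function 7 (09:00.7), while the card is queried at function 0 (09:00.0), so only the bus:slot part matches.

```python
import re
from collections import Counter

# Matches fault lines in the format shown in the excerpt above (an assumption;
# other kernel versions may print these events differently).
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT device=(?P<bus>[0-9a-f]{2}):(?P<slot>[0-9a-f]{2})\.(?P<func>[0-7])"
    r".*?address=(?P<addr>0x[0-9a-f]+)",
    re.IGNORECASE,
)

def summarize_faults(log_text):
    """Count IO_PAGE_FAULT events per (bus:slot.func, address) pair."""
    counts = Counter()
    for m in FAULT_RE.finditer(log_text):
        bdf = f"{m['bus']}:{m['slot']}.{m['func']}"
        counts[(bdf, m["addr"])] += 1
    return counts

def same_slot(bdf_a, bdf_b):
    """True if two bus:slot.func addresses share bus and slot (function may differ)."""
    return bdf_a.split(".")[0] == bdf_b.split(".")[0]

# The two events quoted in this post.
SAMPLE = """\
[85947.090496] AMD-Vi: Event logged [
[85947.090539] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]
[85947.298509] AMD-Vi: Event logged [
[85947.298550] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]
"""

if __name__ == "__main__":
    for (bdf, addr), n in summarize_faults(SAMPLE).items():
        print(bdf, addr, n, "same slot as 09:00.0:", same_slot(bdf, "09:00.0"))
```

Running it over a longer log (for example the saved output of `dmesg`) would show whether the faults always hit the same address and the same function.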
$ ./flint/mstflint -d 09:00.0 q # for both cards
-W- Running quick query - Skipping full image integrity checks.
Image type:      FS2
FW Version:      2.9.1000
Device ID:       25408
Description:     Node             Port1            Port2            Sys image
GUIDs:           0008f104039a62a0 0008f104039a62a1 0008f104039a62a2 0008f104039a62a3
MACs:                                 000000000000     000000000001
VSD:
PSID:            MT_04A0110001
$ ibstat
CA 'mlx4_0'
    CA type: MT25408
    Number of ports: 2
    Firmware version: 2.9.1000
    Hardware version: a0
    Node GUID: 0x0008f104039a08dc
    System image GUID: 0x0008f104039a08df
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 10
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0259086a
        Port GUID: 0x0008f104039a08dd
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0259086a
        Port GUID: 0x0008f104039a08de
        Link layer: InfiniBand
In the opensm log:
Jan 06 17:00:28 817185 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1cd1
Jan 06 17:00:28 817200 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x11 (NodeInfo); Possible mis-set mkey?