NVMe M.2 disk problem

24 Feb 2019


Hi list,
I'm running CentOS 7.6 on a Corsair Force MP500 120 GB. The root filesystem is ext4
and the drive is about 1 year old.
The system works very well except at boot.
During every boot, a file system check runs on the NVMe drive.
Running smartctl on the drive gives this:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    1%
Data Units Read:                    5,355,595 [2,74 TB]
Data Units Written:                 5,826,517 [2,98 TB]
Host Read Commands:                 67,978,550
Host Write Commands:                75,422,898
Controller Busy Time:               32,863
Power Cycles:                       811
Power On Hours:                     2,813
Unsafe Shutdowns:                   317
Media and Data Integrity Errors:    0
Error Information Log Entries:      177
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               77 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
   0        177     0  0x0014  0x4004      - 8796109799680     1     -
   1        176     0  0x0019  0x4004      - 8796109799680     1     -
   2        175     0  0x001a  0x4004      - 8796109799680     1     -
   3        174     0  0x0005  0x4004      - 8796109799680     1     -
   4        173     0  0x000c  0x4004      - 8796109799680     1     -
   5        172     0  0x0019  0x4004      - 8796109799680     1     -
   6        171     0  0x001d  0x4004      - 8796109799680     1     -
   7        170     0  0x0014  0x4004      - 8796109799680     1     -
   8        169     0  0x0011  0x4004      - 8796109799680     1     -
   9        168     0  0x000f  0x4004      - 8796109799680     1     -
  10        167     0  0x0000  0x4004      - 8796109799680     1     -
  11        166     0  0x0006  0x4004      - 8796109799680     1     -
  12        165     0  0x0008  0x4004      - 8796109799680     1     -
  13        164     0  0x000e  0x4004      - 8796109799680     1     -
  14        163     0  0x0008  0x4004      - 8796109799680     1     -
  15        162     0  0x0006  0x4004      - 8796109799680     1     -
... (48 entries not shown)
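For reference, the repeated Status value 0x4004 in the table above can be split into its fields. This is a minimal sketch, assuming the Status column is the raw 16-bit word from the NVMe error log entry, with the phase tag in bit 0 and the status field in bits 15:1. The helper name `decode_status` and the short lookup table are illustrative only; the table lists just a few generic status codes.

```python
# A few Generic Command Status codes (status code type 0) from the NVMe
# base specification. This table is deliberately incomplete.
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
    0x04: "Data Transfer Error",
    0x06: "Internal Error",
    0x0B: "Invalid Namespace or Format",
}


def decode_status(raw):
    """Split a raw 16-bit status word (phase tag in bit 0) into its fields.

    Assumption: the value is printed exactly as stored in the error log
    entry, i.e. it has not already been shifted by the tool.
    """
    field = raw >> 1                # drop the phase tag bit
    sc = field & 0xFF               # status code
    sct = (field >> 8) & 0x7        # status code type (0 = generic)
    more = bool((field >> 13) & 1)  # "More" bit
    dnr = bool((field >> 14) & 1)   # "Do Not Retry" bit
    if sct == 0:
        name = GENERIC_STATUS.get(sc, "unknown generic status")
    else:
        name = "non-generic status"
    return {"sct": sct, "sc": sc, "more": more, "dnr": dnr, "name": name}


print(decode_status(0x4004))
```

Under that assumption, 0x4004 is a generic "Invalid Field in Command" status, which reports a rejected command rather than a media error. That would be consistent with "Media and Data Integrity Errors: 0" above, but it is only a reading of the log, not a diagnosis.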
I noticed that the Unsafe Shutdowns counter is increasing rapidly, and I don't know
what is causing the unsafe shutdowns. Roughly every 3 or 4 boots this value
increases by 1.
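One way to pin down which boots bump the counter is to record it together with Power Cycles at every boot and compare consecutive entries. A minimal sketch, assuming the text layout of the smartctl output shown above; the function name `parse_counters` is hypothetical, and the sample text is copied from that output:

```python
import re


def parse_counters(smart_text):
    """Extract selected integer counters from smartctl text output."""
    wanted = ("Power Cycles", "Unsafe Shutdowns", "Error Information Log Entries")
    counters = {}
    for line in smart_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in wanted:
            # Drop thousands separators and padding, keep digits only.
            counters[name.strip()] = int(re.sub(r"\D", "", value))
    return counters


# Sample lines taken from the output above.
sample = """\
Power Cycles:                       811
Power On Hours:                     2,813
Unsafe Shutdowns:                   317
Error Information Log Entries:      177
"""

counters = parse_counters(sample)
print(counters)
```

Feeding this the smartctl output saved at each boot, and appending the result to a file with a timestamp, would show whether the increments line up with particular shutdowns, reboots, or suspend cycles.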
I can't find any errors in the system logs.
Can someone point me in the right direction?
Thanks in advance.
Alessandro.