[CentOS-devel] Broken hyperthreading on Intel Skylake and Kaby Lake processors

Tue Jun 27 21:17:00 UTC 2017
Tru Huynh <tru at centos.org>

On Tue, Jun 27, 2017 at 09:34:52PM +0100, Phil Perry wrote:
> 
> I have a potentially affected system. I've filed a bug with Red Hat
> to request microcode_ctl be updated to include the latest microcode:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1465631
> 
> I can confirm the issue is not fixed in the current RHEL7.4beta
> microcode_ctl package.
The microcode update is already being worked on:
https://bugzilla.redhat.com/show_bug.cgi?id=1456339
There is no ETA yet.
> 
> In the meantime I've manually applied the microcode update on my
> affected system.
...
https://downloadcenter.intel.com/download/26798/Linux-Processor-Microcode-Data-File
does not mention any Xeon E5 v4
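
To see what a given box is actually running, the family, model, stepping and loaded microcode revision can be read from /proc/cpuinfo. The following is only a minimal sketch, assuming a Linux x86 system; the helper name cpu_signatures is made up for illustration, and it only reports values, it does not decide whether a revision contains any particular fix.

```python
def cpu_signatures(cpuinfo_text):
    """Return the set of (family, model, stepping, microcode) tuples found.

    Hypothetical helper: parses text in the /proc/cpuinfo format used by
    Linux on x86 (one "key : value" block per logical CPU).
    """
    wanted = ("cpu family", "model", "stepping", "microcode")
    found = set()
    current = {}
    # Blocks are separated by blank lines; the extra "" flushes the last one.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if all(key in current for key in wanted):
                found.add(tuple(current[key] for key in wanted))
            current = {}
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted:
            value = value.strip()
            # The microcode revision is printed in hex, the rest in decimal.
            current[key] = int(value, 16) if key == "microcode" else int(value)
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""
    for family, model, stepping, ucode in sorted(cpu_signatures(text)):
        print(f"family {family} model {model} stepping {stepping} "
              f"microcode 0x{ucode:x}")
```

Comparing the printed revision before and after a manual microcode update is a quick way to confirm that the new data file was actually loaded.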

But there is this changelog from the Debian team:
http://metadata.ftp-master.debian.org/changelogs/non-free/i/intel-microcode/unstable_changelog
...
intel-microcode (3.20170511.1) unstable; urgency=medium

  * New upstream microcode datafile 20170511
...
    + This release fixes undisclosed errata on the desktop, mobile and
      server processor models from the Haswell, Broadwell, and Skylake
      families, including even the high-end multi-socket server Xeons
    + Likely fix the TSC-Deadline LAPIC errata (BDF89, SKL142 and
      similar) on several processor families
    + Fix erratum BDF90 on Xeon E7v4, E5v4(?) (closes: #862606)
    + Likely fix serious or critical Skylake errata: SKL138/144,
      SKL137/145, SKL149
    * Likely fix nightmare-level Skylake erratum SKL150.  Fortunately,
      either this erratum is very-low-hitting, or gcc/clang/icc/msvc
      won't usually issue the affected opcode pattern and it ends up
      being rare.
      SKL150 - Short loops using both the AH/BH/CH/DH registers and
      the corresponding wide register *may* result in unpredictable
      system behavior.  Requires both logical processors of the same
      core (i.e. sibling hyperthreads) to be active to trigger, as
      well as a "complex set of micro-architectural conditions"
...
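
Since SKL150 needs both sibling hyperthreads of a core to be active, it is useful to know whether a machine has any such siblings online. The following is a rough sketch, assuming the Linux sysfs topology files (thread_siblings_list) are available; the function names are made up for illustration.

```python
import glob


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "2-3" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        cpus.update(range(int(first), int(last if sep else first) + 1))
    return cpus


def hyperthreading_active(sibling_lists):
    """True if any core reports more than one logical CPU as sibling."""
    groups = {frozenset(parse_cpu_list(text)) for text in sibling_lists}
    return any(len(group) > 1 for group in groups)


if __name__ == "__main__":
    # One file per online logical CPU; the list is empty on other systems.
    paths = glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")
    contents = []
    for path in paths:
        with open(path) as f:
            contents.append(f.read())
    print("sibling hyperthreads active:", hyperthreading_active(contents))
```

If this reports no active siblings (for example because hyperthreading is disabled in the BIOS), the trigger condition quoted in the changelog is not met on that machine.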

I am worried by the "This release fixes undisclosed errata ... including even the high-end multi-socket server Xeons".

It may relate to
https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v4-spec-update.html
...
BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit
Internal Parity Errors or Unpredictable System Behavior
Problem: Under a complex series of microarchitectural events while running Intel
Hyper-Threading Technology, a correctable internal parity error or unpredictable
system behavior may occur.
Implication: A correctable error (IA32_MC0_STATUS.MCACOD=0005H and
IA32_MC0_STATUS.MSCOD=0001H) may be logged. The unpredictable system
behavior frequently leads to faults (e.g. #UD, #PF, #GP).
Workaround: It is possible for the BIOS to contain a workaround for this erratum.
Status: For the Steppings affected, see the Summary Tables of Changes.
...
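
To check whether a logged machine check matches the signature given in the BDF76 "Implication" text, the raw IA32_MC0_STATUS value can be split into its fields. The following is only a sketch, assuming the architectural IA32_MCi_STATUS layout (MCACOD in bits 15:0, MSCOD in bits 31:16, UC in bit 61, VAL in bit 63); the function names are made up for illustration, and how the raw value is obtained (kernel log, mcelog, ...) is left out.

```python
# Signature quoted in the BDF76 erratum text.
BDF76_MCACOD = 0x0005
BDF76_MSCOD = 0x0001


def decode_mc_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into the fields used here."""
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "uncorrected": bool((status >> 61) & 1),  # UC: error was not corrected
        "mscod": (status >> 16) & 0xFFFF,         # model-specific error code
        "mcacod": status & 0xFFFF,                # architectural error code
    }


def matches_bdf76(status):
    """True for a valid, corrected error carrying the BDF76 codes."""
    fields = decode_mc_status(status)
    return (fields["valid"]
            and not fields["uncorrected"]
            and fields["mcacod"] == BDF76_MCACOD
            and fields["mscod"] == BDF76_MSCOD)


if __name__ == "__main__":
    # Synthetic example value, not taken from a real log.
    example = (1 << 63) | (BDF76_MSCOD << 16) | BDF76_MCACOD
    print(hex(example), decode_mc_status(example), matches_bdf76(example))
```

A match only says that the logged codes agree with the erratum text; it does not prove that BDF76 was the cause.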

Tru

-- 
Tru Huynh 
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xBEFA581B