Seagate 8TB ST8000AS0002 vs. Linux (aka: kernel issues you might face)

Have I told you that I just got 4 brand-new 8TB Seagate SATA HDDs?

Well, unfortunately my RAID setup raised some problems. The first reshaping attempt failed, with the host simply hanging, apparently with something like a kernel panic that, unfortunately, I missed on the console, as it was… blank! (BTW: this is one of the main reasons to disable console blanking in datacenter contexts.)
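To avoid the same blind spot, console blanking can be switched off both at runtime and permanently; a minimal sketch, assuming the usual CentOS 7 GRUB layout:

# disable blanking on the active virtual console for the current session
setterm -blank 0

# make it permanent: append consoleblank=0 to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg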

As I had set up the backup-server with syslog forwarding to a central log-server, I was able to easily retrieve the last messages logged:

so my first guess was: “Uhm… have I been so unlucky to receive a faulty drive?” (sda, actually)
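By the way, the syslog forwarding that preserved those last messages boils down to a single rsyslog rule on the sending host; a minimal sketch, with a hypothetical log-server name:

# /etc/rsyslog.d/forward.conf on the backup-server ("@@" forwards via TCP, a single "@" via UDP)
*.* @@logserver.example.com:514

# on the central log-server, enable the matching TCP listener (legacy rsyslog directives)
$ModLoad imtcp
$InputTCPServerRun 514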

Subsequent attempts to restart the server using the installed OS (residing on a separate, non-RAID partition) were unsuccessful for several reasons (the most important ones being: the RAID partition automatically mounted at boot, with the NFS server automatically started at boot and serving files exactly from there; my not-so-deep understanding of systemd fixing/troubleshooting; problems with the RAID due to the failure of sdd during a reshaping activity). A sketch of how to defuse the first of those issues follows below.
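The “RAID mounted at boot” part, in particular, is easy to make non-fatal, so that a broken array does not take the whole boot down with it; a sketch with hypothetical device name and mountpoint:

# /etc/fstab: mark the RAID filesystem as non-critical for boot
/dev/md0  /srv/data  ext4  defaults,nofail,x-systemd.device-timeout=30  0 2

# and keep NFS from auto-starting until the array is known to be healthy
systemctl disable nfs-server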

So I ended up booting the server from a CD (I always have a copy of SystemRescueCD with me) and trying to fix the various issues. For various reasons (which will probably be the source for another POST here), I ended up destroying and recreating the RAID-5 array.
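Recreating the array itself is a single mdadm invocation; a sketch assuming a 4-disk RAID-5 built on the fifth partition of each drive (the array name is an assumption):

# WARNING: this destroys any existing RAID metadata on the listed partitions
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]5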

Once more, after some time (more or less one day of activity), the server went down again:

As you can see, the above messages refer to sdd while, before, we talked about sda. This is normal, as I ended up swapping sda with sdd, just to lower the chance of missing a boot due to an sda failure (the first SATA HDD is configured, in the BIOS, as the BOOT drive). So the sdd above and the earlier sda are exactly the same drive.

With this second set of disk errors I just started searching for Amazon's return policy, as, after all, the drives were really young (less than two weeks old) and should still be under proper warranty. Also, it should have been easy to demonstrate the failures, as the error counts were clearly shown in the SMART parameters. While searching, I said to myself: “Am I the only guy experiencing problems with these disks?”
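Those error counts are easy to pull out with smartmontools; a quick sketch (the device name is just an example):

# full SMART report: attributes, error log, self-test log
smartctl -a /dev/sda

# attributes only, to eyeball counters such as Raw_Read_Error_Rate or UDMA_CRC_Error_Count
smartctl -A /dev/sda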

A quick search for ST8000AS0002 turned up lots of information.

To my big surprise, I started reading some threads (like this and this) where people reported problems with these drives, but specifically with certain kernel versions and filesystems (EXT4, XFS, BTRFS). Other people reported that with NTFS (on Linux) the problems disappeared.

To keep it short: I discovered that it is perfectly possible that what appears to be a hardware failure is, instead, an error generated by the kernel when it encounters some strange conditions.

In the second thread mentioned above (“Bug 93581: 3.17..3.19 all fail with new Seagate Archive 8TB S-ATA disk (NCQ timeouts)“) I found a post by Martin K. Petersen including a patch to solve the problem in recent kernel versions.

A few messages below, it is stated that such a patch has been accepted and scheduled to be included in kernel 4.4.

Unfortunately:

  1. I’m working on CentOS 7 (now 7.2.1511), whose official kernel is currently 3.10.0-327.3.1.el7.x86_64;
  2. the most up-to-date kernel-ml package from ELRepo for CentOS 7 is, currently, kernel-ml-4.3.3.

So I decided to benefit once more from the open-source model underlying all of these technologies and… prepared everything needed to create a new RPM package, based on kernel-ml-4.3.3 but including the patch provided by Martin K. Petersen mentioned above.

I just set up a similar VM (CentOS 7) with all the needed dev packages, downloaded and installed the SRPMS, and fixed the SPEC file, adding a proper reference to the patch:


[...]
Source3: config-%{version}-x86_64
Source4: cpupower.service
Source5: cpupower.config
Patch0: ST8000AS0002-1NA17Z-AR13.patch
[...]
%prep
%setup -q -n %{name}-%{version} -c
cd linux-4.3.3
%patch0 -p1
cd ..

and launched the classic
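(The “classic”, here, is presumably just rpmbuild run against the patched spec; a sketch, assuming the standard ~/rpmbuild layout, with the spec file name guessed from the package name:)

cd ~/rpmbuild/SPECS
# -ba builds both the binary and the source packages
rpmbuild -ba kernel-ml-4.3.spec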

I know that a lot of time is needed to go through the whole compile process and so… I'm going to sleep, leaving everything running. Tomorrow, hopefully, I should have new kernel RPMS ready to be installed on my backup-server.

Stay tuned!


Update

The RPM build process was successful. It left me with the following results:

So I copied the RPM files to my backup-server and installed them with:
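(Likely the usual local install; a sketch, with the actual package file names replaced by a glob:)

# install, rather than upgrade, so the previous kernel stays available as a fallback
yum localinstall kernel-ml-*.rpm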

After a restart, and after choosing the right kernel at the GRUB boot menu, here are the results:

As the RAID5 array was faulty, due to problems with the sdd disk, I decided to reinitialize it (sdd5), blanking the related metadata and re-adding it to the array:
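In mdadm terms, the blank-and-re-add step is roughly the following (the array device name is an assumption):

# wipe the stale RAID superblock from the partition, then add it back as a fresh member
mdadm --zero-superblock /dev/sdd5
mdadm --add /dev/md0 /dev/sdd5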

Now, after a couple of hours, here are the results:

So, in 749 minutes (or so…) I should be able to see whether I have been successful in running a properly patched kernel to support my brand-new 8TB disks.
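(An easy way to keep an eye on the rebuild and its ETA, again assuming the array is /dev/md0:)

# the "finish=" field in /proc/mdstat shows the estimated minutes left for the rebuild
watch -n 60 cat /proc/mdstat

# or query the array state directly
mdadm --detail /dev/md0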

Stay tuned!

5 Comments

  1. Hi DV,

    A detailed post! Any updates on how it works for you finally?

    I have a similar problem: several (more than 4) Seagate 8TB Archive disks in a box. Unfortunately, it does not work very well even with the 4.10.1 kernel.

    Any updates from your side will be nice. Thanks.

    • ladreune

      Hi Eric! Thanks for your comment.

      As mentioned in the POST, on Jan 5th 2016 I upgraded my kernel from 3.10.0-229.20.1 to 4.3.3-1 (patched as described).

      After that upgrade, the system was perfectly stable up to Jan 22nd 2017.

      On that day the microserver rebooted (for reasons unrelated to our HDDs; see below) and unfortunately it rebooted with kernel 3.10.0-514.6.1.
      As you can see, in the meantime the “stock” CentOS kernel was upgraded (from 3.10.0-229.20.1 to 3.10.0-514.6.1).
      Unfortunately I had _NOT_ noticed such a change (I forgot to update the GRUB configuration so as to have “my” 4.3.3-1 as the default boot kernel).

      Anyway, despite that unexpected reboot with the wrong kernel (the stock 3.10.0-514.6.1 instead of “my” 4.3.3-1), the system is currently working _PERFECTLY_NICE_, without any side problems.

      Please note, however, that the Jan 22nd reboot was required because we _REPLACED_ the HP Microserver (keeping the old HDDs). Basically, we had the chance to replace the old GEN-7 HP Microserver with a new GEN-8 one. So we took out the four HDDs and just put them in the new GEN-8 box. Everything has worked flawlessly up to now. I'm mentioning this because there is a small chance that the HDD problems have been solved “also” by the stock kernel (3.10.0-514.6.1), thanks to the hardware change.

      As we got rid of the old GEN-7 box, it is impossible for me to check whether the stock 3.10.0-514.6.1 kernel solves the issues with the HDDs.

      Having said all of this, I can surely assert that, with the old hardware and the very same disks, the system worked flawlessly with “my” patched kernel.

      HTH.

      Cheers,
      DV

      • Thanks DV for the detailed reply! And good to know the disks work well for you now.

        The stock 3.10.0-514.6.1.el7 kernel unfortunately did not work well enough for us (after several days, problems started to appear). So we changed to the 4.10.1 kernel. It worked better than the stock 3.10.0-514.6.1 kernel (it lasts longer before showing problems), but it is still not good enough: after a week or so, Linux starts to report that one or two disks have disappeared from the servers.

        That may be related to the hardware (motherboard), the hardware configuration, or our workloads. I will continue checking these disks and servers.

        But thanks all the same for your info. That’s very good input for us for checking the problems.

  2. Lorenzo Delana

    Hi,
    only to report that today, using Ubuntu 16.04.3 LTS with kernel 4.4.0-104-generic, I got the same problem (FLUSH CACHE EXT) with a 10TB Seagate disk (ST10000VN0004), as follows:

    Dec 26 21:36:24 fs01bk kernel: [18449.220474] ata6.00: failed command: FLUSH CACHE EXT
    Dec 26 21:36:24 fs01bk kernel: [18449.220628] ata6.00: status: { DRDY }
    Dec 26 21:36:24 fs01bk kernel: [18449.712349] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    Dec 26 21:36:24 fs01bk kernel: [18449.718308] ata6.00: retrying FLUSH 0xea Emask 0x4
    Dec 26 21:36:24 fs01bk kernel: [18449.718512] ata6: EH complete

    and this can result in a failing filesystem and a read-only remount. The tricky thing is that one of these errors (which appear only under certain heavy loads) may not block the system in a RAID6 configuration, but if it happens on 3 different disks at the same time, your system can get locked by a read-only remount.

    The workaround that I am trying now is to reduce the NCQ queue_depth with the following script in /etc/rc.local:

    for i in a b c d e f g h i; do
        echo 16 > /sys/block/sd$i/device/queue_depth
    done

    where sd[a-i] are disks

    I am trying not to disable NCQ entirely (queue_depth=1), so as not to lose the advantage of command queueing, which optimizes physical disk access and gives a performance benefit.

    • ladreune

      Hi Lorenzo. Thanks for commenting. I'm glad to know that you've been able to solve your problem by fine-tuning queue_depth. Thanks for sharing this info! 🙂
