Seagate 8TB ST8000AS0002 vs. Linux (aka: kernel issues you might face)

Have I told you that I just got 4 brand-new 8TB Seagate SATA HDDs?

Well, unfortunately my RAID setup raised some problems. The first reshaping attempt failed, with the host simply hanging, apparently with something like a kernel panic that, unfortunately, I missed on the console, as it was… blank! (BTW: this is one of the main reasons to disable console blanking in datacenter contexts.)
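To avoid the same blind spot, console blanking can be switched off both at runtime and permanently; a minimal sketch, assuming the usual CentOS 7 GRUB layout:

# disable blanking on the active virtual console for the current session
setterm -blank 0

# make it permanent: append consoleblank=0 to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg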

As I had set up the backup-server with syslog forwarding to a central log-server, I was able to easily retrieve the last messages logged:

so my first guess was: “Uhm… have I been so unlucky to receive a faulty drive?” (sda, actually)
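By the way, the syslog forwarding that preserved those last messages boils down to a single rsyslog rule on the sending host; a minimal sketch, with a hypothetical log-server name:

# /etc/rsyslog.d/forward.conf on the backup-server ("@@" forwards via TCP, a single "@" via UDP)
*.* @@logserver.example.com:514

# on the central log-server, enable the matching TCP listener (legacy rsyslog directives)
$ModLoad imtcp
$InputTCPServerRun 514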

Subsequent attempts to restart the server using the installed OS (residing on a separate, non-RAID partition) were unsuccessful for several reasons (the most important ones being: the RAID partition automatically mounted at boot, with the NFS server automatically started at boot and serving files exactly from there; my not-so-deep understanding of systemd fixing/troubleshooting; problems with the RAID due to the failure of sdd during a reshaping activity). A sketch of how to defuse the first of those issues follows below.
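The “RAID mounted at boot” part, in particular, is easy to make non-fatal, so that a broken array does not take the whole boot down with it; a sketch with hypothetical device name and mountpoint:

# /etc/fstab: mark the RAID filesystem as non-critical for boot
/dev/md0  /srv/data  ext4  defaults,nofail,x-systemd.device-timeout=30  0 2

# and keep NFS from auto-starting until the array is known to be healthy
systemctl disable nfs-server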

So I ended up booting the server from a CD (I always have a copy of SystemRescueCD with me) and trying to fix the various issues. For various reasons (which will probably be the source for another POST here), I ended up destroying and recreating the RAID-5 array.
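Recreating the array itself is a single mdadm invocation; a sketch assuming a 4-disk RAID-5 built on the fifth partition of each drive (the array name is an assumption):

# WARNING: this destroys any existing RAID metadata on the listed partitions
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]5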

Once more, after some time (more or less one day of activity), the server went down again:

As you can see, the above messages refer to sdd while, before, we talked about sda. This is normal, as I ended up swapping sda with sdd, just to lower the chance of missing a boot due to an sda failure (the first SATA HDD is configured, in the BIOS, as the BOOT drive). So the sdd above and the earlier sda are exactly the same drive.

With this second set of disk errors I just started searching for Amazon's return policy, as, after all, the drives were really young (less than two weeks old) and should still be under proper warranty. Also, it should have been easy to demonstrate the failures, as the error counts were clearly shown in the SMART parameters. While searching, I said to myself: “Am I the only guy experiencing problems with these disks?”
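Those error counts are easy to pull out with smartmontools; a quick sketch (the device name is just an example):

# full SMART report: attributes, error log, self-test log
smartctl -a /dev/sda

# attributes only, to eyeball counters such as Raw_Read_Error_Rate or UDMA_CRC_Error_Count
smartctl -A /dev/sda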

A quick search for ST8000AS0002 turned up lots of information.

To my big surprise, I started reading some threads (like this and this) where people reported problems with these drives, but specifically with certain kernel versions and filesystems (EXT4, XFS, BTRFS). Other people reported that with NTFS (on Linux) the problems disappeared.

To keep it short: I discovered that it is perfectly possible that what appears to be a hardware failure is, instead, an error generated by the kernel when it encounters some strange conditions.

In the second thread mentioned above (“Bug 93581: 3.17..3.19 all fail with new Seagate Archive 8TB S-ATA disk (NCQ timeouts)“) I found a post by Martin K. Petersen including a patch to solve the problem in recent kernel versions.

A few messages below, it is stated that such a patch has been accepted and scheduled to be included in kernel 4.4.

Unfortunately:

  1. I’m working on CentOS 7 (now 7.2.1511), whose official kernel is currently 3.10.0-327.3.1.el7.x86_64;
  2. the most up-to-date kernel-ml package from ELRepo for CentOS 7 is, currently, kernel-ml-4.3.3.

So I decided to benefit once more from the open-source model underlying all of these technologies and… prepared everything needed to create a new RPM package, based on kernel-ml-4.3.3 but including the patch provided by Martin K. Petersen mentioned above.

I just set up a similar VM (CentOS 7) with all the needed dev packages, downloaded and installed the SRPMS, and fixed the SPEC file, adding a proper reference to the patch:


[...]
Source3: config-%{version}-x86_64
Source4: cpupower.service
Source5: cpupower.config
Patch0: ST8000AS0002-1NA17Z-AR13.patch
[...]
%prep
%setup -q -n %{name}-%{version} -c
cd linux-4.3.3
%patch0 -p1
cd ..

and launched the classic
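(The “classic”, here, is presumably just rpmbuild run against the patched spec; a sketch, assuming the standard ~/rpmbuild layout, with the spec file name guessed from the package name:)

cd ~/rpmbuild/SPECS
# -ba builds both the binary and the source packages
rpmbuild -ba kernel-ml-4.3.spec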

I know that a lot of time is needed to go through the whole compile process and so… I'm going to sleep, leaving everything running. Tomorrow, hopefully, I should have new kernel RPMS ready to be installed on my backup-server.

Stay tuned!


Update

The RPM build process was successful. It left me with the following results:

So I copied the RPM files to my backup-server and installed them with:
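(Likely the usual local install; a sketch, with the actual package file names replaced by a glob:)

# install, rather than upgrade, so the previous kernel stays available as a fallback
yum localinstall kernel-ml-*.rpm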

After a restart, and after choosing the right kernel at the GRUB boot menu, here are the results:

As the RAID5 array was faulty, due to problems with the sdd disk, I decided to reinitialize it (sdd5), blanking the related metadata and re-adding it to the array:
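In mdadm terms, the blank-and-re-add step is roughly the following (the array device name is an assumption):

# wipe the stale RAID superblock from the partition, then add it back as a fresh member
mdadm --zero-superblock /dev/sdd5
mdadm --add /dev/md0 /dev/sdd5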

Now, after a couple of hours, here are the results:

So, in 749 minutes (or so…) I should be able to see whether I have been successful in running a properly patched kernel to support my brand-new 8TB disks.
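(An easy way to keep an eye on the rebuild and its ETA, again assuming the array is /dev/md0:)

# the "finish=" field in /proc/mdstat shows the estimated minutes left for the rebuild
watch -n 60 cat /proc/mdstat

# or query the array state directly
mdadm --detail /dev/md0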

Stay tuned!

5 Comments

  1. Hi DV,

    A detailed post! Any updates on how it works for you finally?

    I have a similar problem: several (more than 4) Seagate 8TB Archive disks in a box. Unfortunately, it does not work very well even with the 4.10.1 kernel.

    Any updates from your side will be nice. Thanks.

    • ladreune

      Hi Eric! Thanks for your comment.

      As mentioned in the POST, on Jan 5th 2016 I upgraded my kernel from 3.10.0-229.20.1 to 4.3.3-1 (patched as described).

      After that upgrade, the system was perfectly stable up to Jan 22nd 2017.

      On that day the microserver rebooted (for reasons unrelated to our HDDs; see below) and unfortunately it rebooted with kernel 3.10.0-514.6.1.
      As you can see, in the meantime the “stock” CentOS kernel was upgraded (from 3.10.0-229.20.1 to 3.10.0-514.6.1).
      Unfortunately I had _NOT_ noticed such a change (I forgot to update the GRUB configuration so as to have “my” 4.3.3-1 as the default boot kernel).

      Anyway, despite that unexpected reboot with the wrong kernel (the stock 3.10.0-514.6.1 instead of “my” 4.3.3-1), the system is currently working _PERFECTLY_NICE_, without any side problems.

      Please note, however, that the Jan 22nd reboot was required because we _REPLACED_ the HP Microserver (keeping the old HDDs). Basically, we had the chance to replace the old GEN-7 HP Microserver with a new GEN-8 one. So we took out the four HDDs and just put them in the new GEN-8 box. Everything has worked flawlessly up to now. I'm mentioning this because there is a small chance that the HDD problems have been solved “also” by the stock kernel (3.10.0-514.6.1), thanks to the hardware change.

      As we got rid of the old GEN-7 box, it is impossible for me to check whether the stock 3.10.0-514.6.1 kernel solves the issues with the HDDs.

      Having said all of this, I can surely assert that, with the old hardware and the very same disks, the system worked flawlessly with “my” patched kernel.

      HTH.

      Cheers,
      DV

      • Thanks DV for the detailed reply! And good to know the disks work well for you now.

        The stock 3.10.0-514.6.1.el7 kernel unfortunately did not work well enough for us (after several days, problems started to appear). So we changed to the 4.10.1 kernel. It worked better than the stock 3.10.0-514.6.1 kernel (it lasts longer before showing problems), but it is still not good enough: after a week or so, Linux starts to report that one or two disks have disappeared from the servers.

        That may be related to the hardware (motherboard), the hardware configuration, or our workloads. I will continue checking these disks and servers.

        But thanks all the same for your info. That’s very good input for us for checking the problems.

  2. Lorenzo Delana

    Hi,
    only to report that today, using Ubuntu 16.04.3 LTS with kernel 4.4.0-104-generic, I got the same problem (FLUSH CACHE EXT) with a 10TB Seagate disk (ST10000VN0004), as follows:

    Dec 26 21:36:24 fs01bk kernel: [18449.220474] ata6.00: failed command: FLUSH CACHE EXT
    Dec 26 21:36:24 fs01bk kernel: [18449.220628] ata6.00: status: { DRDY }
    Dec 26 21:36:24 fs01bk kernel: [18449.712349] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    Dec 26 21:36:24 fs01bk kernel: [18449.718308] ata6.00: retrying FLUSH 0xea Emask 0x4
    Dec 26 21:36:24 fs01bk kernel: [18449.718512] ata6: EH complete

    and this can result in a failing filesystem and a read-only remount. The tricky thing is that one of these errors (which appear only under certain heavy loads) may not block the system in a RAID6 configuration, but if it happens on 3 different disks at the same time, your system can get locked by a read-only remount.

    The workaround that I am trying now is to reduce the NCQ queue_depth with the following script in /etc/rc.local:

    for i in a b c d e f g h i; do
        echo 16 > /sys/block/sd$i/device/queue_depth
    done

    where sd[a-i] are disks

    I am trying not to disable NCQ entirely (queue_depth=1), so as not to lose the advantage of command queueing, which optimizes physical disk access and gives a performance benefit.

    • ladreune

      Hi Lorenzo. Thanks for commenting. I'm glad to know that you've been able to solve your problem by fine-tuning queue_depth. Thanks for sharing this info! 🙂
