Author Topic: Any RAID gurus in here? (more RAID problems) (Read 17678 times)

jondecker76 · « **on:** July 30, 2008, 04:12:44 am »

Well, I'm back to having RAID troubles again. (This is the same thing that caused me to lose all of my data last time.)

I did some simple checks with mdadm, and it agrees that the status of one of the drives is Removed. (What is strange is that the drive is good - it said this last time too, and the drive was fine on a fresh install for over a month.)

Code: [Select]

linuxmce@dcerouter:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Jun 14 09:30:50 2008
     Raid Level : raid5
     Array Size : 1953524992 (1863.03 GiB 2000.41 GB)
  Used Dev Size : 976762496 (931.51 GiB 1000.20 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Jul 29 22:04:34 2008
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : ef05e3dd:c2ec8d78:bd9f1658:0a1d2015 (local to host dcerouter)
         Events : 0.140014

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd

Looking in the web admin, sdb, sdc, and sdd are the 3 drives for the array. In the admin, /dev/sdb is shown as "Spare", "Removed".

What should my next step be? I don't want to jump the gun again and lose all my data again in the process...

jondecker76 · « **Reply #1 on:** July 30, 2008, 03:11:47 pm »

I checked and changed out cables this morning - everything looks as it should inside the core. Something is definitely wrong, however, as the performance of the core has degraded.

I'm at work for the day, so hopefully by the time I get home tonight this will be seen by someone with experience with mdadm and software RAID.

mikedehaan · « **Reply #2 on:** July 30, 2008, 08:30:10 pm »

Please post your mdadm.conf file as well as output from the following command:

Code: [Select]

mdadm -Ebsc partitions

This command will scan your partitions for the existence of any RAID superblock. This is primarily important for the config lines it will spit out. These should match what you have in your mdadm.conf file.

In my opinion, this is really just the first place to start. Hopefully we'll figure something out from this.
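
A rough sketch of that comparison, in case it helps. It assumes you have saved the scan output to a file first and that your config lives at /etc/mdadm/mdadm.conf (the usual Debian/Ubuntu location); the function name `missing_array_uuids` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report ARRAY UUIDs that a superblock scan finds but mdadm.conf lacks.
# Both arguments are plain text files, so this can be tried on saved output.
#   $1: output of `mdadm -Ebsc partitions`, saved to a file
#   $2: mdadm.conf (assumed to be /etc/mdadm/mdadm.conf on Debian/Ubuntu)
missing_array_uuids() {
    local scan_file=$1 conf_file=$2 uuid
    # Pull every UUID=... token from the ARRAY lines in the scan output.
    grep -E '^ARRAY' "$scan_file" | grep -oE 'UUID=[0-9a-fA-F:]+' | sort -u |
    while read -r uuid; do
        # An uncommented ARRAY line in the config should carry the same UUID.
        if ! grep -E '^[[:space:]]*ARRAY' "$conf_file" | grep -qF "$uuid"; then
            echo "${uuid#UUID=}"
        fi
    done
}
```

Usage would be something like `sudo mdadm -Ebsc partitions > /tmp/scan.txt` followed by `missing_array_uuids /tmp/scan.txt /etc/mdadm/mdadm.conf`. Any UUID it prints is an array the scan can see but the config file does not name.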

jondecker76 · « **Reply #3 on:** July 30, 2008, 09:47:04 pm »

Code: [Select]

linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015

Code: [Select]

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays

# This file was auto-generated on Mon, 21 Apr 2008 15:58:25 -0400
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $
PROGRAM /usr/pluto/bin/monitoring_RAID.sh

Thanks for the reply!

mikedehaan · « **Reply #4 on:** July 30, 2008, 10:36:43 pm »

One thing to consider is adding the following line to your mdadm.conf file.

Code: [Select]

ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015

The resulting mdadm.conf file should look like this:

Code: [Select]

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015

# This file was auto-generated on Mon, 21 Apr 2008 15:58:25 -0400
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $
PROGRAM /usr/pluto/bin/monitoring_RAID.sh

This will give mdadm a big hint on where to find your raid devices and might help prevent future confusion on mdadm's part.

You'll need to re-add your removed device to get the raid back up and running in full mode:

Code: [Select]

mdadm -a /dev/md0 /dev/sdb

You should then start to see your raid re-build itself in the mdstat file in proc:

Code: [Select]

cat /proc/mdstat

Do not remove any of the drives while this process is in progress or data loss will likely occur.
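
If you would rather not eyeball the progress bar, here is a minimal sketch that pulls the rebuild percentage out of mdstat-format text. The function name `rebuild_progress` is hypothetical, and it assumes the progress line carries a "recovery =" or "resync =" label followed by a percentage.

```shell
#!/usr/bin/env bash
# Sketch: print "<array> <percent>" for each md array that is rebuilding.
# Reads mdstat-format text from the file in $1 (default: /proc/mdstat).
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }        # remember which array we are in
        /(recovery|resync) *=/ {
            # The percentage is the field ending in "%".
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/%$/, "", $i); print array, $i }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `rebuild_progress` with no argument reads /proc/mdstat directly. It prints nothing when no array is rebuilding.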

jondecker76 · « **Reply #5 on:** July 30, 2008, 10:51:38 pm »

OK, I will give it a try.

One thing to mention is that my RAID is working, and I can access the files. It's just the one drive that has been removed somehow.

Thanks

mikedehaan · « **Reply #6 on:** July 30, 2008, 11:05:09 pm »

Yes, your raid is currently running in degraded mode, meaning if you were to lose either sdc or sdd right now, you'd lose all of your data. Once sdb has been rebuilt, your raid should have the expected level of redundancy.

My speculation is that the drive disappeared because of the mdadm configuration. I had an issue in the past with device names (e.g. sda, sdb) versus UUIDs. My raid was never safe. mdadm would try to scan for superblocks on my system and would only sometimes succeed. I needed to change my fstab to mount by UUID as well as tell mdadm which UUID to look for (mdadm.conf). My theory is that your issue is related. At the very least, explicitly telling mdadm where your raid is couldn't hurt.

I hope this helps.

jondecker76 · « **Reply #7 on:** July 30, 2008, 11:20:04 pm »

Very good help, thanks. So far it's looking promising - the device added just fine, and checking /proc/mdstat shows it's rebuilding fine.

Code: [Select]

linuxmce@dcerouter:/etc/mdadm$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 sdb[3] sdc[1] sdd[2]
      1953524992 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [=>...................]  recovery =  6.2% (61183728/976762496) finish=216.5min speed=70473K/sec
      
unused devices: <none>

The big thing now is that we have to figure out exactly why this is happening, as this is a very important role for the Core (or any server), and I'm sure others will be bitten by it.

If you wouldn't mind telling me a little more about what goes on behind the scenes with mdadm (and why it would have removed /dev/sdb in my case), I could then look at including a permanent fix for the 0804 release, as well as improving the web admin tools. For instance, now that I know how to see what drives are part of the RAID and how to add drives, I can add this functionality to the web admin to make it easier for people when they have problems.

Anyway, thanks a lot for your help. I lost 1.8 TB of information just a few months ago, including irreplaceable family pictures we had just taken at Disneyland, due to this same bug. It feels good that I won't be starting all over again!

mikedehaan · « **Reply #8 on:** July 30, 2008, 11:40:42 pm »

My sincere condolences for your data loss. Something like that is quite disconcerting, especially considering that the whole purpose behind a raid is to prevent such an occurrence.

I will help in any way that I can, though I will admit, I may not have the depth of knowledge you seek. I will do my best.

That being said, here's one of my previous posts concerning this issue:

http://ubuntuforums.org/showthread.php?p=5010533#post5010533

...again I'm not 100% sure this is exactly the issue you've faced. I just know the road to software raid became a little more bumpy with 8.04. If your device was removed after a reboot/power outage, then my theory holds a little more water. Otherwise, we'll need to start looking at the syslog entries for more information.

jondecker76 · « **Reply #9 on:** July 31, 2008, 03:19:56 am »

Ok, here is where things stand... I had to leave for a few hours, and I came back to the core being hard locked - couldn't even ssh in - so I rebooted.

Upon coming back online, my media is still accessible (phew!). However, running mdadm -Ebsc partitions now shows something a little different (it is reporting the presence of a spare):

Code: [Select]

linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
   spares=1

Also, I wanted to see the progress, so I checked with cat /proc/mdstat:

Code: [Select]

linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid5 sdc[1] sdb[3] sdd[2]
      1953524992 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [>....................]  recovery =  2.3% (23159384/976762496) finish=244.2min speed=65070K/sec

It is repairing again (even though the first attempt ran for over 4 hours, against a projected time of about 250 minutes).

I will see if it rebuilds this time without locking and report back here.

If anyone has any more ideas, I'd like to hear them!

mikedehaan · « **Reply #10 on:** July 31, 2008, 04:38:10 am »

Out of curiosity, what does "mdadm --detail /dev/md0" say after the reboot?

mdadm might have flagged your 3rd drive as a spare while it's rebuilding it.

jondecker76 · « **Reply #11 on:** July 31, 2008, 11:51:10 am »

Well, I let it rebuild all night. When I woke up this morning, I was pleased to see that it appeared to be successful. Just to be sure, I did a full reboot of the core. After coming online, I could see that the web admin showed the status as "OK". I tested a movie on the RAID, which played as it should.

Here are the results of some commands I ran on the core - everything looks good!

Quote

linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb[0] sdd[2] sdc[1]
      1953524992 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
linuxmce@dcerouter:~$ mdadm --detail /dev/md0
mdadm: cannot open /dev/md0: Permission denied
linuxmce@dcerouter:~$ sudo mdadm --detail /dev/md0
[sudo] password for linuxmce:
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Jun 14 09:30:50 2008
     Raid Level : raid5
     Array Size : 1953524992 (1863.03 GiB 2000.41 GB)
  Used Dev Size : 976762496 (931.51 GiB 1000.20 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jul 31 05:43:57 2008
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : ef05e3dd:c2ec8d78:bd9f1658:0a1d2015 (local to host dcerouter)
         Events : 0.172712

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
linuxmce@dcerouter:~$

Thanks for your help - everything went fine, and I suffered no data loss!

For anyone using the Software RAID feature, I highly recommend checking up on it at least once a week in the web admin. LMCE currently gives no warning to the user in the event of RAID drive failure or degradation. When I found my problem, I just happened to be browsing around the web admin. If I hadn't seen it, I would not have known, and would have been at a much higher risk of losing data, not to mention putting so much extra load on my system.
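
Until the web admin warns about this, the weekly check can at least be scripted. Below is a minimal sketch, not anything LMCE ships: the function names `degraded_arrays` and `check_raid` are made up, and it assumes the member map in /proc/mdstat (e.g. [_UU]) uses "_" for a missing device, as in the output earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list md arrays whose member map (e.g. [_UU]) shows a missing device.
# Reads mdstat-format text from the file in $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { array = $1 }
        # Status line ends in a map such as [UUU]; "_" marks a missing member.
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print array }
    ' "${1:-/proc/mdstat}"
}

# Return non-zero (and complain on stderr) if any array is degraded.
check_raid() {
    local bad
    bad=$(degraded_arrays "$@")
    if [ -n "$bad" ]; then
        echo "RAID degraded: $bad" >&2
        return 1
    fi
}
```

Run from cron, a non-zero exit status from `check_raid` could trigger whatever notification you prefer. mdadm's own monitor mode (`mdadm --monitor --scan`) is meant to cover this as well, using the MAILADDR and PROGRAM lines in mdadm.conf, so it is worth checking whether that is actually running.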

jondecker76 · « **Reply #12 on:** July 31, 2008, 05:28:18 pm »

Ok, here I am 6 hours after I woke up to find that the RAID was back to normal.

I just checked the web admin (from work), and my drive status is back to "Damaged" with /dev/sdb listed as removed again.

Isn't anyone else using the software raid feature? Can anyone check their status in the web admin to see if they are having similar issues?

I'm going to start pricing barebones NAS systems; the software RAID is just way too unstable.

jondecker76 · « **Reply #13 on:** July 31, 2008, 05:48:35 pm »

Just talked to my wife on the phone. Apparently, a couple of hours ago when she got home, the core and all MDs were hard locked, so she rebooted. I've never had stability issues on this setup, so I'm guessing it is related to the RAID (the first rebuild attempt hard locked as well).

So, it's back to the drawing board.

mikedehaan · « **Reply #14 on:** July 31, 2008, 06:03:31 pm »

Well...if you know the approximate time frame we might be able to find something in the syslog regarding why mdadm removed the device (or if something else did).
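
As a starting point, here is a rough sketch for narrowing the log down. Everything in it is an assumption to adjust: the function name `raid_log_lines` is made up, the log is assumed to be /var/log/syslog with the traditional "Mon DD HH:MM:SS" timestamp prefix, and the search pattern is only a guess at what md, the disks (sdb through sdd here), and the SATA layer might log.

```shell
#!/usr/bin/env bash
# Sketch: show md/RAID-related lines from a syslog-format file.
#   $1: log file (default /var/log/syslog; an assumption about this system)
#   $2: optional timestamp prefix regex to narrow the window,
#       e.g. "Jul 31 1[0-2]:" for 10:00 through 12:59 on July 31
raid_log_lines() {
    local log=${1:-/var/log/syslog} when=${2:-}
    # First keep lines in the time window, then keep lines that mention md,
    # a raid personality, mdadm, one of the member disks, or an ATA port.
    grep -E "^${when}" "$log" |
        grep -Ei 'md[0-9]*:|raid[0-9]+:|mdadm|sd[b-d]|ata[0-9]+'
}
```

For example, `raid_log_lines /var/log/syslog "Jul 31 1[0-2]:"` would limit the search to late morning on the 31st. An empty result only means the guessed pattern matched nothing, not that nothing was logged.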
