Author Topic: RAID problems.....again......  (Read 4522 times)

jondecker76

  • Alumni
  • wants to work for LinuxMCE
  • *
  • Posts: 763
RAID problems.....again......
« on: October 31, 2008, 09:15:26 pm »
I have had a ton of software RAID problems with LMCE in the past. My 4-disk RAID has been working fine for quite some time now, but today it crashed completely. In the web admin, the RAID array is listed as FAILED, with each individual disk listed as REMOVED / SPARE DISK.

Here is some output I'm experiencing while trying to see what is going on...

Code: [Select]
linuxmce@dcerouter:~$  sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
linuxmce@dcerouter:~$ sudo mdadm -a /dev/md0 /dev/sdb
mdadm: cannot get array info for /dev/md0
linuxmce@dcerouter:~$ sudo  mdadm --detail /dev/md0
mdadm: md device /dev/md0 does not appear to be active.
linuxmce@dcerouter:~$ sudo mount /dev/md0
mount: can't find /dev/md0 in /etc/fstab or /etc/mtab
linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb[0](S) sde[3](S) sdd[2](S) sdc[1](S)
      3907049984 blocks
       
unused devices: <none>

Because I have had so many problems in the past, I check the RAID array status at least once every few days to make sure everything looks good. I checked just yesterday and there wasn't a single problem. Last night, I tried to copy the movie "Leatherheads" to the core; it stopped at 3% and gave a message that the copy might take a long time. I left it in overnight, and this morning everything had completely crashed (I couldn't even SSH into the core).
Upon reboot, nothing RAID-related appears to work.
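
For anyone doing the same kind of periodic status check by hand, it can be scripted. The following is only a minimal sketch, assuming the /proc/mdstat layout shown in the output above; `check_mdstat` is a hypothetical helper name, not part of LMCE or mdadm. It flags arrays that are inactive or that show a missing member (`_`) in the status brackets.

```shell
#!/bin/sh
# Sketch: flag md arrays that are inactive or degraded in mdstat-format text.
# check_mdstat is a hypothetical helper; it reads /proc/mdstat by default.
check_mdstat() {
    awk '
        BEGIN { bad = 0 }
        # Array header lines look like: "md0 : active raid5 sde[4] sdb[0] ..."
        /^md[^ ]* :/ {
            name = $1
            if ($3 == "inactive") { print name ": inactive"; bad = 1 }
            next
        }
        # Status lines end with e.g. "[4/3] [UUU_]"; "_" marks a missing member.
        /\[[U_]+\]/ {
            if (match($0, /\[[U_]+\]/)) {
                s = substr($0, RSTART, RLENGTH)
                if (index(s, "_") > 0) { print name ": degraded " s; bad = 1 }
            }
        }
        # Exit status 1 if any array looked unhealthy, 0 otherwise.
        END { exit bad }
    ' "${1:-/proc/mdstat}"
}
```

Because the function returns a non-zero status when something looks wrong, it could be called from a cron job that only sends mail on failure. That wiring is left out here since it depends on the local setup.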

I had almost 3TB of data on this RAID array. I really hope it is not all lost - just 6 months ago it was all lost and I had to restart from scratch. I don't have the energy or time to do that again.

Can anybody help me find out what the problem is this time? (Since last time, I have completely changed motherboards and everything; I am now using an M2N-SLI Deluxe, which is known to work very well.)

Also, I would warn anybody against using LMCE's software RAID, to save themselves all of the problems I have had. I wish I could afford a dependable RAID solution, but unfortunately, with a single income and a family of 6, that is just never going to happen. (It is also why I cannot afford to do true backups of 3TB of data.)

Thanks for any help anyone can offer.
« Last Edit: October 31, 2008, 09:17:07 pm by jondecker76 »

colinjones

  • Alumni
  • LinuxMCE God
  • *
  • Posts: 3003
Re: RAID problems.....again......
« Reply #1 on: October 31, 2008, 09:36:20 pm »
Jon - sorry to hear this has happened again, and glad to see you back, haven't seen you for a while....

Can't help with LMCE's RAID system, but perhaps to be on the safe side (rather than accidentally doing something that permanently corrupts the RAID container), you should look into the various disk recovery software that is out there.

Some of it is very sophisticated: you can configure it for precisely the striping, parity and block parameters you were using, and it should be able to recover your data.... just a thought, as I realise that data is king here and that is where your focus is...

jondecker76

Re: RAID problems.....again......
« Reply #2 on: November 01, 2008, 11:20:38 am »
Yeah, it's been a busy year... working 2 jobs, sports with the kids, etc etc etc.. Things are slowing down now and I'm going to get back into the LMCE swing.

OK, it looks like I may have this fixed (or at least in the process of being fixed).

First, I re-assembled the array:
Code: [Select]
linuxmce@dcerouter:~$ sudo mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: /dev/md0 has been started with 3 drives (out of 4).


Then, I noticed that /dev/sde was listed as spare, so I re-added it:
Code: [Select]
linuxmce@dcerouter:~$ sudo mdadm -a /dev/md0 /dev/sde
mdadm: re-added /dev/sde
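
The two steps above (assemble, then add back the member that was left out) can be sketched as a small dry-run script. This is an illustration under stated assumptions, not a tested recovery tool: `run`, `recover_array`, `readd_member` and the `MDADM_RUN` variable are hypothetical names, and the `--examine` and `--stop` steps were not part of the original fix. They are included on the assumption that it helps to inspect each member's "Events" counter first and to release the members of an inactive array before reassembling. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print (or, with MDADM_RUN=1 and root privileges, execute) the
# recovery sequence from this post. All function names are hypothetical.
run() {
    if [ "${MDADM_RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

recover_array() {
    md=$1; shift                      # e.g. /dev/md0, then its member disks

    # Assumption: look at each member's superblock first. A member whose
    # "Events" count lags behind the others is likely to be left out.
    for dev in "$@"; do
        run mdadm --examine "$dev"
    done

    # Assumption: stop the inactive array so its members are released.
    run mdadm --stop "$md"

    # Same command as in the post: reassemble from the listed members.
    run mdadm --assemble "$md" "$@"
}

readd_member() {
    # Same as "mdadm -a" in the post: add the left-out member back,
    # which starts the rebuild.
    run mdadm --add "$1" "$2"
}
```

A dry run with the devices from this thread would be `recover_array /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde` followed by `readd_member /dev/md0 /dev/sde`. Reviewing the printed commands before setting `MDADM_RUN=1` is the point of the dry-run default.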


Right now the array is rebuilding... Looks like it should be successful...
Code: [Select]
linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde[4] sdb[0] sdd[2] sdc[1]
      2930287488 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  1.4% (14153288/976762496) finish=240.1min speed=66797K/sec
     
unused devices: <none>
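
To keep an eye on a rebuild like the one above without reading the whole file each time, the progress line can be pulled out with awk. This is a rough sketch that assumes the mdstat layout shown in this post; `recovery_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: summarize rebuild progress from mdstat-format text.
# recovery_status is a hypothetical helper; it reads /proc/mdstat by default.
recovery_status() {
    awk '
        # Remember which array the following lines belong to.
        /^md[^ ]* :/ { name = $1 }
        # Progress lines look like:
        #   [>....]  recovery =  1.4% (14153288/976762496) finish=240.1min ...
        /recovery|resync/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i
                if ($i ~ /^finish=/) eta = substr($i, 8)
            }
            if (pct != "") print name ": " pct " done, about " eta " remaining"
        }
    ' "${1:-/proc/mdstat}"
}
```

Running it in a loop, or simply running `watch cat /proc/mdstat`, gives the same information at a glance.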


jondecker76

Re: RAID problems.....again......
« Reply #3 on: November 01, 2008, 10:19:29 pm »
Success! All data is intact and the RAID array is back to normal.

I really need to take some time and see what it would take to get some tighter LMCE/mdadm integration. It would be a huge relief knowing I don't have to worry about my software RAID all the time.

Enigmus

  • Veteran
  • ***
  • Posts: 132
Re: RAID problems.....again......
« Reply #4 on: December 29, 2008, 04:48:20 pm »
Thanks for posting this; it just saved my bacon.  However, what seemed to happen in my case is that fsck decided to check sdc for file corruption due to 208 days without a check (I had rebooted).  After that, the RAID reported the same problems that jondecker76 posted earlier.  The fix went as follows:

# mdadm --assemble /dev/md1 /dev/sda /dev/sdb /dev/sdc
mdadm: cannot open device /dev/sdc: Device or resource busy
mdadm: /dev/sdc has no superblock - assembly aborted

# mdadm --assemble /dev/md1 /dev/sda /dev/sdb
mdadm: /dev/md1 has been started with 2 drives (out of 3).

# mdadm -a /dev/md1 /dev/sdc
mdadm: Cannot open /dev/sdc: Device or resource busy

From the web console I removed /dev/sdc.  However, when I added the drive back, it appeared as /dev/sdd.  It's rebuilding now.  I can see all the data, which is the important part.  I'm looking into excluding the drives from the initial fsck check on boot.
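
One way to exclude a device from the boot-time check is to set the sixth field of its /etc/fstab entry (the fsck pass number) to 0. The sketch below assumes the device is listed in fstab under the exact name passed in (it may instead appear as a UUID= or LABEL= entry); `disable_boot_fsck` is a hypothetical helper name. It only prints the modified table and does not touch the real file.

```shell
#!/bin/sh
# Sketch: print an fstab with the fsck pass number (6th field) set to 0
# for one device. disable_boot_fsck is a hypothetical helper.
# Usage: disable_boot_fsck DEVICE [FSTAB_FILE]
disable_boot_fsck() {
    awk -v dev="$1" '
        # Leave comments and blank lines untouched.
        /^[ \t]*#/ || NF == 0 { print; next }
        # Assigning to a field rewrites the line with single spaces.
        $1 == dev && NF >= 6 { $6 = 0 }
        { print }
    ' "${2:-/etc/fstab}"
}

# Alternative for ext2/ext3 filesystems (not run here): the "N days without
# a check" trigger is a per-filesystem interval that tune2fs can turn off,
# e.g. tune2fs -c 0 -i 0 DEVICE
```

The output can be reviewed and then copied over /etc/fstab by hand after making a backup.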