Well, I'm back to having RAID troubles again. (This is the same thing that caused me to lose all of my data last time.)
I did some simple checks with mdadm - and it agrees that the status of one of the drives is Removed... (What is strange is that the drive is good - it said this last time too, and it was fine on a fresh install for over a month.)
linuxmce@dcerouter:~$ sudo mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sat Jun 14 09:30:50 2008
Raid Level : raid5
Array Size : 1953524992 (1863.03 GiB 2000.41 GB)
Used Dev Size : 976762496 (931.51 GiB 1000.20 GB)
Raid Devices : 3
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Jul 29 22:04:34 2008
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : ef05e3dd:c2ec8d78:bd9f1658:0a1d2015 (local to host dcerouter)
Events : 0.140014
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 32 1 active sync /dev/sdc
2 8 48 2 active sync /dev/sdd
Looking in the web admin, sdb, sdc, and sdd are the 3 drives for the array. There, /dev/sdb is shown as "Spare", "Removed".
What should my next step be? I don't want to jump the gun and lose all my data again in the process...
I checked and changed out cables this morning - everything seems as it should inside the core. Something is definitely wrong, however, as the performance of the core has degraded.
I'm at work for the day, so hopefully by the time I get home tonight this will be seen by someone with experience with mdadm and software RAIDs.
Please post your mdadm.conf file as well as output from the following command:
mdadm -Ebsc partitions
This command will scan your partitions for the existence of any RAID superblock. This is primarily important for the config lines it will spit out. These should match what you have in your mdadm.conf file.
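For reference, that command is shorthand; the long form (assuming a reasonably current mdadm) would be:
sudo mdadm --examine --brief --scan --config=partitions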
In my opinion, this is really just the first place to start. Hopefully we'll figure something out from this.
linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
# This file was auto-generated on Mon, 21 Apr 2008 15:58:25 -0400
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $
PROGRAM /usr/pluto/bin/monitoring_RAID.sh
Thanks for the reply!
One thing to consider is adding the following line to your mdadm.conf file.
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
The resulting mdadm.conf file should look like this:
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
# This file was auto-generated on Mon, 21 Apr 2008 15:58:25 -0400
# by mkconf $Id: mkconf 324 2007-05-05 18:49:44Z madduck $
PROGRAM /usr/pluto/bin/monitoring_RAID.sh
This will give mdadm a big hint about where to find your RAID devices and might help prevent future confusion on mdadm's part.
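If you'd rather not type the line by hand, you can append the scan output directly - a quick sketch, assuming the config lives at /etc/mdadm/mdadm.conf (check afterwards that you haven't duplicated an existing ARRAY line):
sudo mdadm --examine --scan | sudo tee -a /etc/mdadm/mdadm.conf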
You'll need to re-add the removed device to get the RAID back up and running with full redundancy:
mdadm -a /dev/md0 /dev/sdb
You should then see your RAID rebuilding itself in the mdstat file in /proc:
cat /proc/mdstat
Do not remove any of the drives while this process is in progress or data loss will likely occur.
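If you'd rather watch it update continuously instead of re-running cat, watch(1) does the job:
watch -n 10 cat /proc/mdstat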
OK, I will give it a try.
One thing to mention is that my RAID is working, and I can access the files. It's just the one drive that has been removed somehow.
Thanks!
Yes, your RAID is currently running in degraded mode, meaning that if you were to lose either sdc or sdd right now, you'd lose all of your data. Once sdb has been rebuilt, your RAID should have the expected level of redundancy.
My speculation as to why the drive disappeared is the mdadm configuration. I had an issue in the past with device names (e.g. sda, sdb) versus UUIDs. My RAID was never safe: mdadm would scan for superblocks on my system and only sometimes succeed. I needed to change my fstab to mount by UUID, as well as tell mdadm which UUID to look for (in mdadm.conf). My theory is that your issue is related. At the very least, explicitly telling mdadm where your RAID is couldn't hurt.
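To illustrate what I mean (the UUID and mount point below are made up - get the real UUID for your filesystem with blkid):
sudo blkid /dev/md0
# then in /etc/fstab, mount by UUID instead of device name, e.g.:
# UUID=1234abcd-5678-ef90-1234-567890abcdef  /mnt/raid  ext3  defaults  0  2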
I hope this helps.
Very good help, thanks. So far it's looking promising - the device added just fine, and checking /proc/mdstat shows it's rebuilding fine:
linuxmce@dcerouter:/etc/mdadm$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb[3] sdc[1] sdd[2]
1953524992 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
[=>...................] recovery = 6.2% (61183728/976762496) finish=216.5min speed=70473K/sec
unused devices: <none>
The big thing now is that we have to figure out exactly why this is happening, as this is a very important role for the Core (or any server), and I'm sure others will be bitten by it.
If you wouldn't mind telling me a little more about what goes on behind the scenes with mdadm (and why it would have removed /dev/sdb in my case), I could then look at including a permanent fix for the 0804 release, as well as improving the web admin tools. (For instance, now that I know how to see which drives are part of the RAID and how to add drives, I can add this functionality to the web admin to make it easier for people when they have problems.)
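For the web admin piece, even something this simple could surface array membership - just a sketch, to be adapted to however the admin code gathers status:
sudo mdadm --detail /dev/md0 | grep -E 'active sync|removed|spare'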
Anyway, thanks a lot for your help. I lost 1.8 TB of information just a few months ago to this same bug, including irreplaceable family pictures we had just taken in Disneyland. It feels good that I won't be starting all over again!
My sincere condolences for your data loss. Something like that is quite disconcerting, especially considering that the whole purpose behind a RAID is to prevent such an occurrence. :'(
I will help in any way that I can, though I will admit I may not have the depth of knowledge you seek. I will do my best.
That being said, here's one of my previous posts concerning this issue:
http://ubuntuforums.org/showthread.php?p=5010533#post5010533
...again I'm not 100% sure this is exactly the issue you've faced. I just know the road to software raid became a little more bumpy with 8.04. If your device was removed after a reboot/power outage, then my theory holds a little more water. Otherwise, we'll need to start looking at the syslog entries for more information.
OK, here is where things stand... I had to leave for a few hours, and I came back to the core being hard locked - I couldn't even ssh in - so I rebooted.
Upon coming back online, my media is still accessible (phew!). However, running mdadm -Ebsc partitions now shows something a little different (it is reporting the presence of a spare):
linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
spares=1
Also, I wanted to see the progress, so I checked with cat /proc/mdstat:
linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdc[1] sdb[3] sdd[2]
1953524992 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
[>....................] recovery = 2.3% (23159384/976762496) finish=244.2min speed=65070K/sec
It is repairing again (even though the first time it ran for over 4 hours, against a projected time of about 250 minutes).
I will see if it rebuilds this time without locking and report back here.
If anyone has any more ideas, I'd like to hear them!
Out of curiosity, what does "mdadm --detail /dev/md0" say after the reboot?
mdadm might have flagged your 3rd drive as a spare while it's rebuilding it.
Well, I let it rebuild all night. When I woke up this morning, I was pleased to see that it appeared to have been successful. Just to be sure, I did a full reboot of the core. After it came back online, I could see that the web admin showed the status as "OK". I tested a movie on the RAID, which played as it should.
Here are the results of some commands I ran on the core - everything looks good!
linuxmce@dcerouter:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb[0] sdd[2] sdc[1]
1953524992 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
linuxmce@dcerouter:~$ mdadm --detail /dev/md0
mdadm: cannot open /dev/md0: Permission denied
linuxmce@dcerouter:~$ sudo mdadm --detail /dev/md0
[sudo] password for linuxmce:
/dev/md0:
Version : 00.90.03
Creation Time : Sat Jun 14 09:30:50 2008
Raid Level : raid5
Array Size : 1953524992 (1863.03 GiB 2000.41 GB)
Used Dev Size : 976762496 (931.51 GiB 1000.20 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Jul 31 05:43:57 2008
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : ef05e3dd:c2ec8d78:bd9f1658:0a1d2015 (local to host dcerouter)
Events : 0.172712
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 48 2 active sync /dev/sdd
linuxmce@dcerouter:~$ sudo mdadm -Ebsc partitions
ARRAY /dev/md0 level=raid5 num-devices=3 UUID=ef05e3dd:c2ec8d78:bd9f1658:0a1d2015
linuxmce@dcerouter:~$
Thanks for your help - everything went fine, and I suffered no data loss!
For anyone using the Software RAID feature, I highly recommend checking up on it at least once a week in the web admin. LMCE currently gives no warning to the user in the event of RAID drive failure or degradation. When I found my problem, I just happened to be browsing around the web admin. If I hadn't seen it, I would not have known, and I would have been at a much higher risk of losing data, not to mention putting so much extra load on my system.
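Until such a warning exists, a poor man's check is easy to cron - just a sketch, assuming local mail delivery from the core works (a degraded array shows an underscore in the [UUU] field of /proc/mdstat):
#!/bin/sh
# hypothetical /etc/cron.hourly/raidcheck
grep -q '\[.*_.*\]' /proc/mdstat && echo "md array degraded" | mail -s "RAID warning" root
(Note that mdadm's own monitor daemon should already mail root, per the MAILADDR line in mdadm.conf, so verifying that mail from the core actually gets delivered may be the simpler fix.)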
OK, here I am, 6 hours after I woke up to find that the RAID was back to normal.
I just checked the web admin (from work), and my drive status is back to "Damaged" with /dev/sdb listed as removed again.
Isn't anyone else using the software RAID feature? Can anyone check their status in the web admin to see if they are having similar issues?
I'm going to start pricing barebones NAS systems; the software RAID is just way too unstable.
Just talked to my wife on the phone. Apparently, a couple of hours ago when she got home, the core and all MDs were hard locked, so she rebooted. I've never had stability issues on this setup, so I'm guessing it is related to the RAID (the first rebuild attempt hard locked as well).
So, it's back to the drawing board.
Well... if you know the approximate time frame, we might be able to find something in the syslog regarding why mdadm removed the device (or whether something else did).
It was somewhere between 0630 and 0900 this morning that it locked up. My wife didn't reboot it until about 1100, which is when I was able to log into the web admin remotely and check on the status.
Any ideas off the top of your head on which log files to check?
These should provide us with some clue.
Log messages generated during the boot process:
/var/log/dmesg
System log file:
/var/log/syslog
OK, I'll check these when I get home in about 4 hours. I'll post back then.
Thanks again.
When you have a chance also check your memory usage. I remember having a problem with the mdadm monitoring daemon where it was consuming memory until the system died. I don't remember how exactly I fixed it, but if it appears to be your problem, I'll do some digging.
I used "top" to track memory usage and hit the "M" key to sort by memory. mdadm should be using trivial amounts of memory (less than 1%) (0.1% on my system).
I think the problem with your array could be caused by the hard rebooting; in my experience, RAID arrays do not like that at all and often have to be rebuilt afterwards. The hard locking could be caused by some other problem not directly related to RAID.
Something you can also try is to add a line like this to /etc/syslog.conf:
*.* /dev/tty12
Then, after the syslog daemon is restarted (/etc/init.d/sysklogd restart), it will output all kernel and system messages on tty12, so you could leave the core displaying that terminal ([Ctrl][Alt][F12]) until it crashes. With a bit of luck that will give some more clues.
OK, there are some interesting things in the logs...
First, syslog shows no logging activity after 0634 this morning, then resumes at 11:14 when my wife rebooted. Prior to locking up, there are a ton of drive errors and failed mounts...
Jul 31 06:33:58 dcerouter kernel: [ 3601.716000] hda: media error (bad sector): error=0x30 { LastFailedSense=0x03 }
Jul 31 06:33:58 dcerouter kernel: [ 3601.716000] ide: failed opcode was: unknown
Jul 31 06:34:02 dcerouter kernel: [ 3605.484000] hda: error code: 0x70 sense_key: 0x03 asc: 0x11 ascq: 0x06
Jul 31 06:34:02 dcerouter kernel: [ 3605.484000] end_request: I/O error, dev hda, sector 2031320
Jul 31 06:34:02 dcerouter kernel: [ 3605.484000] Buffer I/O error on device hda, logical block 507830
Jul 31 06:34:02 dcerouter kernel: [ 3605.484000] Buffer I/O error on device hda, logical block 507831
Jul 31 06:34:08 dcerouter kernel: [ 3612.068000] hda: media error (bad sector): status=0x51 { DriveReady SeekComplete Error }
Jul 31 06:34:08 dcerouter kernel: [ 3612.068000] hda: media error (bad sector): error=0x30 { LastFailedSense=0x03 }
Jul 31 06:34:08 dcerouter kernel: [ 3612.068000] ide: failed opcode was: unknown
Jul 31 06:34:10 dcerouter kernel: [ 3613.684000] hda: error code: 0x70 sense_key: 0x03 asc: 0x11 ascq: 0x06
Jul 31 06:34:10 dcerouter kernel: [ 3613.684000] end_request: I/O error, dev hda, sector 2031320
Jul 31 06:34:10 dcerouter kernel: [ 3613.684000] Buffer I/O error on device hda, logical block 507830
Jul 31 06:34:10 dcerouter kernel: [ 3613.684000] Buffer I/O error on device hda, logical block 507831
Jul 31 06:34:16 dcerouter kernel: [ 3619.664000] printk: 1 messages suppressed.
Jul 31 06:34:16 dcerouter kernel: [ 3619.664000] rtc: lost 27 interrupts
Jul 31 06:34:16 dcerouter kernel: [ 3620.248000] hda: media error (bad sector): status=0x51 { DriveReady SeekComplete Error }
Jul 31 06:34:16 dcerouter kernel: [ 3620.248000] hda: media error (bad sector): error=0x30 { LastFailedSense=0x03 }
Jul 31 06:34:16 dcerouter kernel: [ 3620.248000] ide: failed opcode was: unknown
Jul 31 06:34:20 dcerouter kernel: [ 3623.944000] hda: error code: 0x70 sense_key: 0x03 asc: 0x11 ascq: 0x06
Jul 31 06:34:20 dcerouter kernel: [ 3623.944000] end_request: I/O error, dev hda, sector 2031320
Jul 31 06:34:20 dcerouter kernel: [ 3623.944000] Buffer I/O error on device hda, logical block 507830
Jul 31 06:34:23 dcerouter kernel: [ 3626.916000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 06:34:24 dcerouter kernel: [ 3627.920000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 37453836
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438025
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438024
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438023
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438022
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438021
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] EXT3-fs: sda1: 6 orphan inodes deleted
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] EXT3-fs: recovery complete.
Jul 31 11:14:06 dcerouter kernel: [ 11.040000] EXT3-fs: mounted filesystem with ordered data mode.
Jul 31 11:14:06 dcerouter kernel: [ 19.396000] r8169: eth1: link up
Jul 31 11:14:06 dcerouter kernel: [ 20.112000] md: md0 stopped.
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdb>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdd>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdc>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: kicking non-fresh sdb from array!
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: unbind<sdb>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: export_rdev(sdb)
Jul 31 11:14:06 dcerouter kernel: [ 20.476000] i2c-adapter i2c-0: nForce2 SMBus adapter at 0x4c00
Jul 31 11:14:06 dcerouter kernel: [ 20.476000] i2c-adapter i2c-1: nForce2 SMBus adapter at 0x4c40
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: device sdc operational as raid disk 1
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: device sdd operational as raid disk 2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: allocated 3163kB for md0
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] RAID5 conf printout:
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] --- rd:3 wd:2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] disk 1, o:1, dev:sdc
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] disk 2, o:1, dev:sdd
Jul 31 11:14:06 dcerouter kernel: [ 20.784000] input: PC Speaker as /class/input/input3
...
...
Jul 31 11:15:36 dcerouter automount[12640]: lookup(program): lookup for 117 failed
Jul 31 11:15:36 dcerouter automount[12640]: failed to mount /mnt/device/117
Jul 31 11:15:36 dcerouter automount[12674]: lookup(program): lookup for 133 failed
Jul 31 11:15:36 dcerouter automount[12674]: failed to mount /mnt/device/133
Jul 31 11:15:38 dcerouter automount[13579]: lookup(program): lookup for 117 failed
Jul 31 11:15:38 dcerouter automount[13579]: failed to mount /mnt/device/117
Jul 31 11:15:40 dcerouter automount[13719]: lookup(program): lookup for 133 failed
Jul 31 11:15:40 dcerouter automount[13719]: failed to mount /mnt/device/133
Jul 31 11:15:41 dcerouter automount[13858]: lookup(program): lookup for 117 failed
Jul 31 11:15:41 dcerouter automount[13858]: failed to mount /mnt/device/117
Jul 31 11:15:42 dcerouter automount[13856]: lookup(program): lookup for 133 failed
Jul 31 11:15:42 dcerouter automount[13856]: failed to mount /mnt/device/133
Jul 31 11:15:42 dcerouter kernel: [ 136.024000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:15:43 dcerouter automount[13971]: lookup(program): lookup for 117 failed
Jul 31 11:15:43 dcerouter automount[13971]: failed to mount /mnt/device/117
Jul 31 11:15:43 dcerouter kernel: [ 137.024000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:15:53 dcerouter automount[16676]: lookup(program): lookup for 133 failed
Jul 31 11:15:53 dcerouter automount[16676]: failed to mount /mnt/device/133
Jul 31 11:15:54 dcerouter automount[16795]: lookup(program): lookup for 117 failed
Jul 31 11:15:54 dcerouter automount[16795]: failed to mount /mnt/device/117
Jul 31 11:15:55 dcerouter automount[16960]: lookup(program): lookup for 133 failed
Jul 31 11:15:55 dcerouter automount[16960]: failed to mount /mnt/device/133
Jul 31 11:15:56 dcerouter automount[17127]: lookup(program): lookup for 117 failed
Jul 31 11:15:56 dcerouter automount[17127]: failed to mount /mnt/device/117
Jul 31 11:15:57 dcerouter automount[17222]: lookup(program): lookup for 133 failed
...
...
Jul 31 11:16:47 dcerouter automount[23600]: failed to mount /mnt/device/133
Jul 31 11:16:48 dcerouter automount[23645]: lookup(program): lookup for 117 failed
Jul 31 11:16:48 dcerouter automount[23645]: failed to mount /mnt/device/117
Jul 31 11:16:49 dcerouter automount[23734]: lookup(program): lookup for 117 failed
Jul 31 11:16:49 dcerouter automount[23734]: failed to mount /mnt/device/117
Jul 31 11:16:49 dcerouter automount[23842]: lookup(program): lookup for 133 failed
Jul 31 11:16:49 dcerouter automount[23842]: failed to mount /mnt/device/133
Jul 31 11:16:50 dcerouter automount[23876]: lookup(program): lookup for 117 failed
Jul 31 11:16:50 dcerouter automount[23876]: failed to mount /mnt/device/117
...
...
Jul 31 11:21:07 dcerouter automount[9166]: failed to mount /mnt/device/133
Jul 31 11:21:08 dcerouter automount[9440]: lookup(program): lookup for 117 failed
Jul 31 11:21:08 dcerouter automount[9440]: failed to mount /mnt/device/117
Jul 31 11:21:09 dcerouter automount[9506]: lookup(program): lookup for 133 failed
Jul 31 11:21:09 dcerouter automount[9506]: failed to mount /mnt/device/133
Jul 31 11:21:10 dcerouter automount[9550]: lookup(program): lookup for 117 failed
Jul 31 11:21:10 dcerouter automount[9550]: failed to mount /mnt/device/117
Jul 31 11:21:17 dcerouter kernel: [ 470.340000] rtc: lost 27 interrupts
Jul 31 11:21:19 dcerouter kernel: [ 472.396000] rtc: lost 28 interrupts
Jul 31 11:21:25 dcerouter kernel: [ 478.856000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:21:26 dcerouter kernel: [ 479.852000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:21:49 dcerouter kernel: [ 503.180000] rtc: lost 26 interrupts
Jul 31 11:22:08 dcerouter kernel: [ 521.648000] rtc: lost 28 interrupts
Jul 31 11:22:10 dcerouter kernel: [ 523.700000] rtc: lost 27 interrupts
Jul 31 11:22:14 dcerouter kernel: [ 527.808000] rtc: lost 28 interrupts
Jul 31 11:22:18 dcerouter kernel: [ 531.912000] rtc: lost 27 interrupts
Jul 31 11:22:20 dcerouter kernel: [ 533.964000] rtc: lost 28 interrupts
Jul 31 11:22:22 dcerouter kernel: [ 536.016000] rtc: lost 27 interrupts
Jul 31 11:22:24 dcerouter kernel: [ 538.068000] rtc: lost 27 interrupts
Jul 31 11:22:49 dcerouter kernel: [ 562.696000] rtc: lost 28 interrupts
Jul 31 11:22:51 dcerouter automount[13908]: lookup(program): lookup for 133 failed
Jul 31 11:22:51 dcerouter automount[13908]: failed to mount /mnt/device/133
Jul 31 11:22:52 dcerouter automount[14018]: lookup(program): lookup for 117 failed
Jul 31 11:22:52 dcerouter automount[14018]: failed to mount /mnt/device/117
Jul 31 11:22:53 dcerouter automount[14118]: lookup(program): lookup for 133 failed
Jul 31 11:22:53 dcerouter automount[14118]: failed to mount /mnt/device/133
Jul 31 11:22:54 dcerouter automount[14225]: lookup(program): lookup for 117 failed
Jul 31 11:22:54 dcerouter automount[14225]: failed to mount /mnt/device/117
Jul 31 11:22:56 dcerouter automount[14299]: lookup(program): lookup for 133 failed
Jul 31 11:22:56 dcerouter automount[14299]: failed to mount /mnt/device/133
Jul 31 11:22:57 dcerouter automount[14342]: lookup(program): lookup for 117 failed
Jul 31 11:22:57 dcerouter automount[14342]: failed to mount /mnt/device/117
Jul 31 11:22:58 dcerouter automount[14392]: lookup(program): lookup for 133 failed
Jul 31 11:22:58 dcerouter automount[14392]: failed to mount /mnt/device/133
Jul 31 11:22:59 dcerouter automount[14481]: lookup(program): lookup for 117 failed
Jul 31 11:22:59 dcerouter automount[14481]: failed to mount /mnt/device/117
Jul 31 11:23:02 dcerouter automount[14703]: lookup(program): lookup for 133 failed
Jul 31 11:23:02 dcerouter automount[14703]: failed to mount /mnt/device/133
Jul 31 11:23:03 dcerouter automount[14739]: lookup(program): lookup for 117 failed
Jul 31 11:23:03 dcerouter automount[14739]: failed to mount /mnt/device/117
Jul 31 11:23:04 dcerouter automount[14785]: lookup(program): lookup for 133 failed
Jul 31 11:23:04 dcerouter automount[14785]: failed to mount /mnt/device/133
Jul 31 11:23:05 dcerouter automount[14829]: lookup(program): lookup for 117 failed
Jul 31 11:23:05 dcerouter automount[14829]: failed to mount /mnt/device/117
Jul 31 11:23:06 dcerouter automount[14919]: lookup(program): lookup for 133 failed
Jul 31 11:23:06 dcerouter automount[14919]: failed to mount /mnt/device/133
Jul 31 11:23:07 dcerouter automount[15022]: lookup(program): lookup for 117 failed
Jul 31 11:23:07 dcerouter automount[15022]: failed to mount /mnt/device/117
Jul 31 11:23:08 dcerouter automount[15128]: lookup(program): lookup for 133 failed
Jul 31 11:23:08 dcerouter automount[15128]: failed to mount /mnt/device/133
Jul 31 11:23:08 dcerouter automount[15198]: lookup(program): lookup for 117 failed
Jul 31 11:23:08 dcerouter automount[15198]: failed to mount /mnt/device/117
Jul 31 11:23:19 dcerouter kernel: [ 592.952000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:20 dcerouter kernel: [ 593.952000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:21 dcerouter kernel: [ 594.952000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:29 dcerouter kernel: [ 602.960000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:30 dcerouter kernel: [ 603.960000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:31 dcerouter kernel: [ 604.964000] eth0: too many iterations (6) in nv_nic_irq.
Jul 31 11:23:36 dcerouter kernel: [ 609.920000] rtc: lost 28 interrupts
Jul 31 11:23:38 dcerouter kernel: [ 611.972000] rtc: lost 27 interrupts
Jul 31 11:23:40 dcerouter kernel: [ 614.024000] rtc: lost 27 interrupts
Jul 31 11:23:48 dcerouter kernel: [ 622.236000] rtc: lost 28 interrupts
Jul 31 11:23:53 dcerouter kernel: [ 626.340000] rtc: lost 27 interrupts
Jul 31 11:23:55 dcerouter kernel: [ 628.392000] rtc: lost 7 interrupts
Jul 31 11:23:57 dcerouter kernel: [ 630.444000] rtc: lost 27 interrupts
dmesg shows that sdb was kicked from the array, though I'm not sure why:
[ 8.136000] Probing IDE interface ide0...
[ 8.152000] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
[ 8.152000] sd 0:0:0:0: [sda] Write Protect is off
[ 8.152000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 8.152000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.152000] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
[ 8.152000] sd 0:0:0:0: [sda] Write Protect is off
[ 8.152000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 8.152000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.152000] sda: sda1 sda2 < sda5 sda6 >
[ 8.200000] sd 0:0:0:0: [sda] Attached SCSI disk
[ 8.200000] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.200000] sd 1:0:0:0: [sdb] Write Protect is off
[ 8.200000] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 8.200000] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.200000] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.200000] sd 1:0:0:0: [sdb] Write Protect is off
[ 8.200000] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 8.200000] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.200000] sdb: unknown partition table
[ 8.208000] sd 1:0:0:0: [sdb] Attached SCSI disk
[ 8.208000] sd 2:0:0:0: [sdc] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.208000] sd 2:0:0:0: [sdc] Write Protect is off
[ 8.208000] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 8.208000] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.208000] sd 2:0:0:0: [sdc] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.208000] sd 2:0:0:0: [sdc] Write Protect is off
[ 8.208000] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 8.208000] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.208000] sdc: unknown partition table
[ 8.212000] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 8.212000] sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.212000] sd 3:0:0:0: [sdd] Write Protect is off
[ 8.212000] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[ 8.212000] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.212000] sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
[ 8.212000] sd 3:0:0:0: [sdd] Write Protect is off
[ 8.212000] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[ 8.212000] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 8.212000] sdd: unknown partition table
[ 8.216000] sd 3:0:0:0: [sdd] Attached SCSI disk
[ 8.220000] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 8.220000] sd 1:0:0:0: Attached scsi generic sg1 type 0
[ 8.220000] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 8.220000] sd 3:0:0:0: Attached scsi generic sg3 type 0
[ 8.496000] EXT3-fs: INFO: recovery required on readonly filesystem.
[ 8.496000] EXT3-fs: write access will be enabled during recovery.
[ 8.872000] hda: JLMS DVD-ROM XJ-HD166, ATAPI CD/DVD-ROM drive
[ 9.404000] ieee1394: Host added: ID:BUS[0-00:1023] GUID[0011d80000d41943]
[ 9.656000] hdb: SONY CD-RW/DVD-ROM CRX310EE, ATAPI CD/DVD-ROM drive
[ 9.712000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
[ 9.720000] Probing IDE interface ide1...
[ 10.288000] r8169 Gigabit Ethernet driver 2.2LK loaded
[ 10.288000] ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
[ 10.288000] ACPI: PCI Interrupt 0000:01:09.0[A] -> Link [APC2] -> GSI 17 (level, low) -> IRQ 21
[ 10.288000] eth1: RTL8110s at 0xf88a6000, 00:18:f8:09:ce:90, XID 04000000 IRQ 21
[ 10.296000] hda: ATAPI 48X DVD-ROM drive, 512kB Cache, UDMA(33)
[ 10.296000] Uniform CD-ROM driver Revision: 3.20
[ 10.296000] hdb: ATAPI 63X DVD-ROM CD-R/RW drive, 1536kB Cache, UDMA(33)
[ 11.036000] kjournald starting. Commit interval 5 seconds
[ 11.036000] EXT3-fs: sda1: orphan cleanup on readonly fs
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 37453836
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438025
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438024
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438023
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438022
[ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438021
[ 11.036000] EXT3-fs: sda1: 6 orphan inodes deleted
[ 11.036000] EXT3-fs: recovery complete.
[ 11.040000] EXT3-fs: mounted filesystem with ordered data mode.
[ 19.396000] r8169: eth1: link up
[ 20.112000] md: md0 stopped.
[ 20.376000] md: bind<sdb>
[ 20.376000] md: bind<sdd>
[ 20.376000] md: bind<sdc>
[ 20.376000] md: kicking non-fresh sdb from array!
[ 20.376000] md: unbind<sdb>
[ 20.376000] md: export_rdev(sdb)
[ 20.476000] i2c-adapter i2c-0: nForce2 SMBus adapter at 0x4c00
[ 20.476000] i2c-adapter i2c-1: nForce2 SMBus adapter at 0x4c40
[ 20.760000] raid5: device sdc operational as raid disk 1
[ 20.760000] raid5: device sdd operational as raid disk 2
[ 20.760000] raid5: allocated 3163kB for md0
[ 20.760000] raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
[ 20.760000] RAID5 conf printout:
[ 20.760000] --- rd:3 wd:2
[ 20.760000] disk 1, o:1, dev:sdc
[ 20.760000] disk 2, o:1, dev:sdd
[ 20.784000] input: PC Speaker as /class/input/input3
[ 20.936000] Linux agpgart interface v0.102 (c) Dave Jones
[ 20.948000] usbcore: registered new interface driver usbserial
http://pastebin.com/m78b79e18
Some clarification:
The errors for devices 117 and 133 in the post above refer to disk drives in a couple of media directors that were turned off, so those are unrelated...
Also, the important parts of syslog are:
Jul 31 11:14:06 dcerouter kernel: [ 8.136000] Probing IDE interface ide0...
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.152000] sda: sda1 sda2 < sda5 sda6 >
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 0:0:0:0: [sda] Attached SCSI disk
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.200000] sdb: unknown partition table
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 1:0:0:0: [sdb] Attached SCSI disk
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.208000] sdc: unknown partition table
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 2:0:0:0: [sdc] Attached SCSI disk
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Write Protect is off
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 31 11:14:06 dcerouter kernel: [ 8.212000] sdd: unknown partition table
Jul 31 11:14:06 dcerouter kernel: [ 8.216000] sd 3:0:0:0: [sdd] Attached SCSI disk
Jul 31 11:14:06 dcerouter kernel: [ 8.220000] sd 0:0:0:0: Attached scsi generic sg0 type 0
Jul 31 11:14:06 dcerouter kernel: [ 8.220000] sd 1:0:0:0: Attached scsi generic sg1 type 0
Jul 31 11:14:06 dcerouter kernel: [ 8.220000] sd 2:0:0:0: Attached scsi generic sg2 type 0
Jul 31 11:14:06 dcerouter kernel: [ 8.220000] sd 3:0:0:0: Attached scsi generic sg3 type 0
Jul 31 11:14:06 dcerouter kernel: [ 8.496000] EXT3-fs: INFO: recovery required on readonly filesystem.
Jul 31 11:14:06 dcerouter kernel: [ 8.496000] EXT3-fs: write access will be enabled during recovery.
Jul 31 11:14:06 dcerouter kernel: [ 8.872000] hda: JLMS DVD-ROM XJ-HD166, ATAPI CD/DVD-ROM drive
Jul 31 11:14:06 dcerouter kernel: [ 9.404000] ieee1394: Host added: ID:BUS[0-00:1023] GUID[0011d80000d41943]
Jul 31 11:14:06 dcerouter kernel: [ 9.656000] hdb: SONY CD-RW/DVD-ROM CRX310EE, ATAPI CD/DVD-ROM drive
Jul 31 11:14:06 dcerouter kernel: [ 9.712000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Jul 31 11:14:06 dcerouter kernel: [ 9.720000] Probing IDE interface ide1...
Jul 31 11:14:06 dcerouter kernel: [ 10.288000] r8169 Gigabit Ethernet driver 2.2LK loaded
Jul 31 11:14:06 dcerouter kernel: [ 10.288000] ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
Jul 31 11:14:06 dcerouter kernel: [ 10.288000] ACPI: PCI Interrupt 0000:01:09.0[A] -> Link [APC2] -> GSI 17 (level, low) -> IRQ 21
Jul 31 11:14:06 dcerouter kernel: [ 10.288000] eth1: RTL8110s at 0xf88a6000, 00:18:f8:09:ce:90, XID 04000000 IRQ 21
Jul 31 11:14:06 dcerouter kernel: [ 10.296000] hda: ATAPI 48X DVD-ROM drive, 512kB Cache, UDMA(33)
Jul 31 11:14:06 dcerouter kernel: [ 10.296000] Uniform CD-ROM driver Revision: 3.20
Jul 31 11:14:06 dcerouter kernel: [ 10.296000] hdb: ATAPI 63X DVD-ROM CD-R/RW drive, 1536kB Cache, UDMA(33)
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] kjournald starting. Commit interval 5 seconds
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] EXT3-fs: sda1: orphan cleanup on readonly fs
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 37453836
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438025
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438024
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438023
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438022
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] ext3_orphan_cleanup: deleting unreferenced inode 36438021
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] EXT3-fs: sda1: 6 orphan inodes deleted
Jul 31 11:14:06 dcerouter kernel: [ 11.036000] EXT3-fs: recovery complete.
Jul 31 11:14:06 dcerouter kernel: [ 11.040000] EXT3-fs: mounted filesystem with ordered data mode.
Jul 31 11:14:06 dcerouter kernel: [ 19.396000] r8169: eth1: link up
Jul 31 11:14:06 dcerouter kernel: [ 20.112000] md: md0 stopped.
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdb>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdd>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: bind<sdc>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: kicking non-fresh sdb from array!
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: unbind<sdb>
Jul 31 11:14:06 dcerouter kernel: [ 20.376000] md: export_rdev(sdb)
Jul 31 11:14:06 dcerouter kernel: [ 20.476000] i2c-adapter i2c-0: nForce2 SMBus adapter at 0x4c00
Jul 31 11:14:06 dcerouter kernel: [ 20.476000] i2c-adapter i2c-1: nForce2 SMBus adapter at 0x4c40
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: device sdc operational as raid disk 1
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: device sdd operational as raid disk 2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: allocated 3163kB for md0
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] RAID5 conf printout:
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] --- rd:3 wd:2
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] disk 1, o:1, dev:sdc
Jul 31 11:14:06 dcerouter kernel: [ 20.760000] disk 2, o:1, dev:sdd
Working with some of the guys on IRC right now... Running a badblocks test on the drive that keeps getting booted.
Any luck?
I started the badblocks scan last Thursday night before going on a short vacation for the weekend. When I came back last night, the core was locked up, so the scan never completed. I rebooted the core this morning and started a badblocks scan again - hopefully I will have results tonight.
Still no luck. Started badblocks (again) this morning at about 06:00. By 18:00, the core was dead-locked again, and the scan was not finished.
linuxmce@dcerouter:~$ sudo badblocks -vws /dev/sdb
[sudo] password for linuxmce:
Checking for bad blocks in read-write mode
From block 0 to 976762584
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: 95983808/ 976762583Read from remote host 192.168.80.1: Connection reset by peer
Connection to 192.168.80.1 closed.
After over 30 hours of badblocks testing (both runs crashed, at different places), it is apparent that I now have a stability issue (I had used this setup, based on an M2NPV-VM, for 6 months with no stability issues), so I have to trace that down as well.
Regarding badblocks - I'm not quite sure how it is supposed to work. It's starting to look like it would run forever if I let it... For a test, I did a check from block 900000000 to 976762584. As of this moment, here is my output at the terminal:
linuxmce@dcerouter:~$ sudo badblocks -vws /dev/sdb 976762583 900000000
[sudo] password for linuxmce:
last_block = 976762583 (976762583)
from_count = 900000000
Checking for bad blocks in read-write mode
From block 900000000 to 976762584
Testing with pattern 0xaa: 0done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: 938319152/ 976762584
First it did the test with 0xAA, then with 0x55, and now it's on 0xFF... Will this just continue forever? How will I know if it finds a problem? Checking the man page for badblocks didn't yield an answer.
Anyway, I'm not sure where to go from here... I have not seen any evidence that the drive is bad yet. I don't want to order another drive until I know for sure that it is bad - but as you can see, I'm about 4 days into just trying to find out whether there is a problem with the drive, with no luck. I still believe there is a bug in the RAID implementation.
OK, things didn't get too far on this last night... A badblocks scan (of about 10% of the drive) did finally complete, and I now know what to expect. So far, I have scanned the last 20% of the drive, 10% at a time, with no errors shown.
linuxmce@dcerouter:~$ sudo badblocks -vws /dev/sdb 976762583 900000000
[sudo] password for linuxmce:
last_block = 976762583 (976762583)
from_count = 900000000
Checking for bad blocks in read-write mode
From block 900000000 to 976762584
Testing with pattern 0xaa: 0done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
linuxmce@dcerouter:~$ sudo badblocks -vws /dev/sdb 900000000 800000000
[sudo] password for linuxmce:
last_block = 900000000 (900000000)
from_count = 800000000
Checking for bad blocks in read-write mode
From block 800000000 to 900000001
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
linuxmce@dcerouter:~$ Read from remote host 192.168.80.1: Connection reset by peer
Connection to 192.168.80.1 closed.
Considering that the array was just about 100% full both times I've had RAID problems, I'm still going to guess that the drive is good. I'll continue to run badblocks scans of about 10% at a time though.
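Since the chunked approach works, a small loop could save retyping the ranges by hand - just a rough sketch (note that badblocks takes last-block then first-block, as in my commands above, and that -w is destructive, so this is only safe on a drive that is out of the array):
#!/bin/sh
# hypothetical helper: destructive badblocks scan of /dev/sdb in ~10% chunks
TOTAL=976762584
STEP=$((TOTAL / 10))
start=0
while [ "$start" -lt "$TOTAL" ]; do
    end=$((start + STEP - 1))
    [ "$end" -ge "$TOTAL" ] && end=$((TOTAL - 1))
    echo "scanning blocks $start through $end"
    sudo badblocks -vws /dev/sdb "$end" "$start" || break
    start=$((start + STEP))
done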
Also, when I woke up this morning, the core was hard locked again. Therefore, I think it's safe to say that badblocks is not causing the crashes - rather, a full scan takes so long that, with my instability problem, it just always crashes before it completes.
As far as the instability goes, I'm not sure where to start... My syslog shows two things:
1) I am seeing, very often, NETDEV WATCHDOG transmit timeouts in the log, followed by pages of dumped rx/tx registers:
Aug 5 05:21:28 dcerouter kernel: [46401.988000] NETDEV WATCHDOG: eth0: transmit timed out
Aug 5 05:21:28 dcerouter kernel: [46401.988000] eth0: Got tx_timeout. irq: 00000036
Aug 5 05:21:28 dcerouter kernel: [46401.988000] eth0: Ring at 1f9c0000
Aug 5 05:21:28 dcerouter kernel: [46401.988000] eth0: Dumping tx registers
Aug 5 05:21:28 dcerouter kernel: [46401.988000] 0: 00000036 000000ff 00000003 001203ca 00000000 00000000 00000000 00000000
Aug 5 05:21:28 dcerouter kernel: [46401.988000] 20: 00000000 f0000000 00000000 00000000 00000000 00000000 00000000 00000000
Aug 5 05:21:28 dcerouter kernel: [46401.988000] 40: 0420e20e 0000a455 00002e20 00000000 00000000 00000000 00000000 00000000
...
...
...
2) My log is absolutely flooded, beginning to end, with failed drive mounts. The drives that are failing are in other media directors that are offline. I can understand the failed mount (as the drive is offline), but honestly, should it be retrying hundreds of times each minute, 24 hours a day?
Aug 5 05:46:10 dcerouter automount[12112]: lookup(program): lookup for 133 failed
Aug 5 05:46:10 dcerouter automount[12112]: failed to mount /mnt/device/133
Aug 5 05:46:11 dcerouter automount[12160]: lookup(program): lookup for 117 failed
Aug 5 05:46:11 dcerouter automount[12160]: failed to mount /mnt/device/117
Aug 5 05:46:12 dcerouter automount[12205]: lookup(program): lookup for 133 failed
Aug 5 05:46:12 dcerouter automount[12205]: failed to mount /mnt/device/133
Aug 5 05:46:13 dcerouter automount[12247]: lookup(program): lookup for 117 failed
Aug 5 05:46:13 dcerouter automount[12247]: failed to mount /mnt/device/117
Aug 5 05:46:15 dcerouter automount[12469]: lookup(program): lookup for 133 failed
Aug 5 05:46:15 dcerouter automount[12469]: failed to mount /mnt/device/133
Aug 5 05:46:16 dcerouter automount[12539]: lookup(program): lookup for 117 failed
Aug 5 05:46:16 dcerouter automount[12539]: failed to mount /mnt/device/117
Aug 5 05:46:16 dcerouter automount[12582]: lookup(program): lookup for 133 failed
Aug 5 05:46:16 dcerouter automount[12582]: failed to mount /mnt/device/133
Aug 5 05:46:17 dcerouter automount[12616]: lookup(program): lookup for 117 failed
Aug 5 05:46:17 dcerouter automount[12616]: failed to mount /mnt/device/117
Aug 5 05:46:19 dcerouter automount[12679]: lookup(program): lookup for 133 failed
Aug 5 05:46:19 dcerouter automount[12679]: failed to mount /mnt/device/133
Aug 5 05:46:20 dcerouter automount[12724]: lookup(program): lookup for 117 failed
Aug 5 05:46:20 dcerouter automount[12724]: failed to mount /mnt/device/117
Aug 5 05:46:21 dcerouter automount[12881]: lookup(program): lookup for 133 failed
Aug 5 05:46:21 dcerouter automount[12881]: failed to mount /mnt/device/133
Aug 5 05:46:22 dcerouter automount[12980]: lookup(program): lookup for 117 failed
Aug 5 05:46:22 dcerouter automount[12980]: failed to mount /mnt/device/117
...
...
...
Any help would be appreciated. It looks like I'm down and out until I can get this fixed, and this is a very busy time of year for me (wake up at 0500 for work, get home at about 1530, kids' football practice starts at 1700, we get home at 2000, then it's time for showers and bedtime), so I only have an hour or two each day to try to get things going again.
Are you using the forcedeth driver? Does the core completely lock up, or does the network get stuck?
I'd try Zaerc's approach with the console logging.
best regards,
hari
Hari - now that I know the network is getting killed, it may not be a hard lockup (I thought it was, since I could not ssh into the core - but of course that would fail with no networking on the core). So I will have to look a little deeper into whether it is actually hard-locking or not. Reading a few posts on the Internet, the NETDEV WATCHDOG error does indeed kill the networking.
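If I catch it with the local console still alive, I might try bouncing the driver instead of rebooting - just a guess at a recovery step, assuming eth0 really is the forcedeth-driven onboard NIC:
sudo ifconfig eth0 down
sudo rmmod forcedeth
sudo modprobe forcedeth
sudo ifconfig eth0 up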
I'm just not sure why, after 6 months of using this core, I'm starting to get this error. The only recent network change is replacing my old 8-port gigabit switch (a cheap D-Link from Office Max) and a cheap Belkin wireless-G router used as an access point with a 24-port Netgear gigabit rackmount switch (GS524NA) and a Netgear WG302 access point. I sure hope this is not related, as the switch/access point was fairly expensive, and I need the extra ports and stability offered by the higher-end equipment!
I will try Zaerc's suggestion and see exactly where in the log things are breaking.
I don't see a GS524NA on Netgear's site, but I have the Netgear JGS524. I think we're talking the same thing. Mine has been running rock solid for about a year and a half now.
Jimmy
Yes, it's the same switch. I didn't think it would be related to my recent problems, but it was a fairly recent change to my network, so I mentioned it just in case. Maybe I will unplug the WG302 access point for a few days and see if this has any effect on things (I just won't be able to use my wireless orbiters while it's unplugged).
I called home and talked to my wife... It does seem to us that we've had lockups every single day since adding the Netgear WG302 access point. I had her unplug it, and we will watch for the next day or so (I don't think we've made it over 24 hours in 2 weeks, so if it's still going tomorrow, I will look into why the access point is causing the network to croak).
Also, the last 40% of the drive that keeps getting booted from my RAID array has now been scanned by badblocks, with no bad blocks found thus far.
Quote from: jondecker76 on August 05, 2008, 02:59:19 PM
Hari - now that I know the network is getting killed, it may not be a hard lockup (I thought it was, since I could not ssh into the core - but of course that would fail with no networking on the core). So I will have to look a little deeper into whether it is actually hard-locking or not. Reading a few posts on the Internet, the NETDEV WATCHDOG error does indeed kill the networking.
Can you give me some details about the NIC you are using?
best regards,
hari
Hari - the NETDEV WATCHDOG error was against eth0 - the onboard NIC on my Asus M2NPV-VM.
I had two occurrences of the same NIC lockup on my Abit AN-M2 this year. If you have a high-volume network, the chance of hitting the lockup is higher. If you ask me, the nvidia NIC is crap.
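One thing worth trying before swapping hardware: the limit behind the "too many iterations (6) in nv_nic_irq" message is a forcedeth module parameter, and raising it sometimes quiets things down. A sketch, assuming the parameter is still called max_interrupt_work on your kernel (check with "modinfo forcedeth"):
echo "options forcedeth max_interrupt_work=20" | sudo tee -a /etc/modprobe.d/forcedeth
(then reload the module or reboot for it to take effect)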
best regards,
Hari
Hari - you probably are right about the onboard being crap.
However, for the last several days I have been running only the core - I have not allowed anyone to watch TV or otherwise use LMCE at all until I can find the problem - and the lockups have still been happening every day like clockwork. The other confusing part is that this hadn't happened in the prior 6 months on this setup, even with 4 simultaneous video streams. If all else fails, I may try another NIC soon (though all of my PCI slots are full, which is why I'd like to hold off if I can).
I should have a better idea after I wake up in the morning. If the core is up and running normally in the morning, it will be the first time in weeks since the addition of the access point. Since we unplugged it today, I think that would narrow it down.
Either way, I need these stability issues fixed before getting back to the RAID problems. The badblocks scan is taking an incredible amount of time (I am doing it in blocks of about 10% of the drive, with each scan taking about 6 hours). That would make a total of about 60 hours for a badblocks scan of a 1 TB drive in write mode. Ouch!
Another thing that is catching my eye in syslog:
Aug 5 15:31:28 dcerouter kernel: [36517.332000] rtc: lost 28 interrupts
Aug 5 15:31:30 dcerouter kernel: [36519.384000] rtc: lost 27 interrupts
Aug 5 15:31:32 dcerouter kernel: [36521.440000] rtc: lost 27 interrupts
Aug 5 15:31:35 dcerouter kernel: [36523.492000] rtc: lost 27 interrupts
Aug 5 15:31:39 dcerouter kernel: [36527.600000] rtc: lost 27 interrupts
Aug 5 15:31:41 dcerouter kernel: [36529.652000] rtc: lost 28 interrupts
Aug 5 15:31:43 dcerouter kernel: [36531.708000] rtc: lost 4 interrupts
Aug 5 15:31:49 dcerouter kernel: [36537.864000] rtc: lost 28 interrupts
Aug 5 15:31:57 dcerouter kernel: [36546.072000] printk: 1 messages suppressed.
Aug 5 15:31:57 dcerouter kernel: [36546.072000] rtc: lost 27 interrupts
Aug 5 15:31:59 dcerouter kernel: [36548.124000] rtc: lost 27 interrupts
Aug 5 15:32:01 dcerouter kernel: [36550.176000] rtc: lost 28 interrupts
Aug 5 15:32:07 dcerouter kernel: [36556.332000] printk: 1 messages suppressed.
Aug 5 15:32:07 dcerouter kernel: [36556.332000] rtc: lost 27 interrupts
Aug 5 15:32:16 dcerouter kernel: [36564.544000] printk: 1 messages suppressed.
Aug 5 15:32:16 dcerouter kernel: [36564.544000] rtc: lost 28 interrupts
Aug 5 15:32:18 dcerouter kernel: [36566.596000] rtc: lost 27 interrupts
Aug 5 15:32:20 dcerouter kernel: [36568.584000] eth0: too many iterations (6) in nv_nic_irq.
Aug 5 15:32:22 dcerouter kernel: [36570.700000] printk: 1 messages suppressed.
Aug 5 15:32:22 dcerouter kernel: [36570.700000] rtc: lost 28 interrupts
Aug 5 15:32:28 dcerouter kernel: [36576.872000] printk: 2 messages suppressed.
Aug 5 15:32:28 dcerouter kernel: [36576.872000] rtc: lost 26 interrupts
Aug 5 15:32:32 dcerouter kernel: [36580.976000] rtc: lost 22 interrupts
Aug 5 15:32:36 dcerouter kernel: [36585.080000] printk: 1 messages suppressed.
Aug 5 15:32:36 dcerouter kernel: [36585.080000] rtc: lost 28 interrupts
Aug 5 15:32:42 dcerouter kernel: [36591.236000] printk: 1 messages suppressed.
Aug 5 15:32:42 dcerouter kernel: [36591.236000] rtc: lost 28 interrupts
Aug 5 15:32:50 dcerouter kernel: [36599.444000] printk: 1 messages suppressed.
Aug 5 15:32:50 dcerouter kernel: [36599.444000] rtc: lost 28 interrupts
Aug 5 15:32:55 dcerouter kernel: [36603.548000] rtc: lost 28 interrupts
My log is flooded with these too... I wonder what is generating interrupts so fast that the rtc can't keep up?
the forcedeth driver does the frame handling in the interrupt routine (afaik).
best regards,
Hari
Hari -
I'm not very familiar with the forcedeth driver... Is this a common problem? Is my onboard NIC fried?
I think you'll find the rtc messages relate to the High Precision Event Timer (HPET) on your motherboard. Mine does this too and it doesn't seem to cause any problems - I tried disabling HPET in the BIOS and that stopped it, but as it didn't seem to cause any problems (that I know of), I re-enabled it... a red herring, I think...
colinjones - thanks, that gives me a little bit of relief... I'm going to dig through some old logs and see if it has always been like that.
Also, I talked to Netgear tech support today, and I definitely did not have the access point set up correctly, so that may have been causing my network to crash (I think it was handing out its own DHCP leases in the same .80.xxx range). I have it set up correctly now, so I'll see if that takes care of the network stability.
No go... Woke up this morning to the network being completely hung up again. This time I went to the core locally, and it was not frozen. I did a force reload of the router, but it did not bring the network back.
This thread has become a bit messy with all of the problems I'm suddenly having. I'm going to start a new thread for my network issues and stick to the RAID problems in this thread.
Regarding the RAID problems - I finished badblocks scans on the last 60% of the drive, and it passed all of them. Since the drives are about full, I've decided not to do the first 40%, as it is taking so long.
Last night I added /dev/sdb back to the array:
sudo mdadm -a /dev/md0 /dev/sdb
This morning (after my reboot due to the network locking up again), the web admin shows the RAID status back to "OK" and everything seems normal again (though I won't hold my breath - the array has magically broken itself 3 times now). The array is back to normal and functioning as I would expect. Let's see how it holds up... As per Zaerc's recommendation, I will make sure that a hard reset is not done on the core, and see if that keeps the drive from being kicked from the array again.
I know this thread is old, but I just thought I'd post the solution in case anyone else runs into this issue.
This issue occurs on nForce boards utilizing the HPET chip, running kernels built for AMD_X64 prior to 2.6.23, with the Asterisk timing module zt_dummy loaded. I'm not sure what broken code exists in the zt_dummy module, but it causes a lockup of seemingly random devices (read: NIC, SATA, IDE, etc.) after an arbitrary number of RTC errors.
Using a kernel compiled with the HPET_EMULATE_RTC=y option will solve the issue (this config option is enabled by default in all build architectures except AMD_X64, where it was disabled by accident). Alternatively, removing the zt_dummy module will work (but will also break Asterisk). On some boards that have both RTC and HPET chips, you can disable the HPET chip in the BIOS, although I don't recommend this on anything that will be playing media - the HPET is useful for smooth playback of processor-intensive media, such as h264 content.
I'm not entirely sure whether these boards emulate the RTC for older operating systems like Windows XP, or whether they have both an RTC and an HPET. If you disable ACPI, that should also force the system to use the RTC, or at least emulate it.
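Both conditions are easy to verify from a shell - a quick check, assuming a stock Ubuntu kernel with its config in /boot:
grep HPET_EMULATE_RTC /boot/config-$(uname -r)
# you want to see CONFIG_HPET_EMULATE_RTC=y
lsmod | grep zt
# shows whether the zaptel dummy timer module (zt_dummy/ztdummy) is loaded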