Any RAID gurus in here? (more RAID problems)

jondecker76 · August 05, 2008, 04:46:40 PM

Jimmy

Yes, it's the same switch. I didn't think it would be related to my recent problems, but it was a fairly recent change to my network, so I mentioned it just in case. Maybe I will unplug the WG302 access point for a few days and see if this has any effect on things (I just won't be able to use my wireless orbiters while it's unplugged).

jondecker76 · August 05, 2008, 05:59:11 PM

I called home and talked to my wife. It does seem to us that we've had lockups every single day since adding the Netgear WG302 access point. I had her unplug it, and we will watch for the next day or so. (I don't think we've made it over 24 hours in 2 weeks, so if it's still going tomorrow I will look into why the access point is causing the network to croak.)

Also, the last 40% of the drive that keeps getting booted from my RAID array has been scanned by badblocks, with no bad blocks found thus far.

hari · August 05, 2008, 07:39:39 PM

Quote from: jondecker76 on August 05, 2008, 02:59:19 PM
Hari - now that I know the network is getting killed, it may not be a hard lockup (I thought it was, as I could not ssh into the core - but of course this would fail with no networking on the core) So I will have to look a little deeper into whether it is actually hard-locking or not. Reading a few posts on the Internet, the NETDEV WATCHDOG error does indeed kill the networking.

can you give me some details about the nic you are using?

best regards,
hari

jondecker76 · August 05, 2008, 08:07:34 PM

Hari - the NETDEV WATCHDOG error was against eth0, the onboard NIC on my Asus M2NPV-VM.

hari · August 05, 2008, 08:43:09 PM

I had two occurrences of the same NIC lockup on my Abit AN-M2 this year. If you have a high-volume network, the chance of hitting the lockup is higher. If you ask me, the nvidia NIC is crap.

best regards,
Hari

jondecker76 · August 05, 2008, 08:59:28 PM

Hari - you probably are right about the onboard being crap.

However, for the last several days I have been running only the core and have not allowed anyone to watch TV or otherwise use LMCE at all until I can find the problem, and the lockups have still been happening every day like clockwork. The other confusing part is that until recently, this hadn't happened in the prior 6 months on this setup, even with 4 simultaneous video streams. If all else fails, I may try another NIC soon (though all of my PCI slots are full, which is why I'd like to hold off if I can).

I should have a better idea after I wake up in the morning. If the core is up and running normally in the morning, it will be the first time in weeks, since the addition of the access point. Since we unplugged it today, I think that would narrow it down.

Either way, I need these stability issues fixed before getting back to the RAID problems. The badblocks scan is taking an incredible amount of time. (I am doing it in blocks of about 10% of the drive, with each scan taking about 6 hours. That would make a total of about 60 hours for a badblocks scan on a 1TB drive in write mode.) Ouch!
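
For anyone wanting to split a scan the same way, here is a minimal sketch of how the block range for each 10% chunk could be worked out. The function name `chunk_range` is made up for illustration, and the 4096-byte block size in the usage note is an assumption, not what was used in this thread.

```shell
# Print the first and last block of chunk N (1-based) when a device of
# TOTAL blocks is split into CHUNKS roughly equal pieces.
# Usage: chunk_range TOTAL CHUNKS N   -> prints "FIRST LAST"
# (chunk_range is a hypothetical helper, not part of badblocks.)
chunk_range() {
    total=$1; chunks=$2; n=$3
    size=$(( (total + chunks - 1) / chunks ))   # round up so no block is skipped
    first=$(( (n - 1) * size ))
    last=$(( n * size - 1 ))
    # The final chunk must not run past the end of the device.
    if [ "$last" -ge "$total" ]; then
        last=$(( total - 1 ))
    fi
    echo "$first $last"
}
```

With the total taken from something like `$(( $(blockdev --getsize64 /dev/sdb) / 4096 ))`, the two numbers can be passed to badblocks as `badblocks -b 4096 -sv /dev/sdb LAST FIRST` (badblocks takes the last block before the first block). Adding `-w` makes it a write-mode test, which destroys the data on the range scanned. At about 6 hours per chunk, 10 chunks works out to the 60 hours mentioned above.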

jondecker76 · August 05, 2008, 09:57:14 PM

Another thing that is catching my eye in syslog:

Code:

Aug  5 15:31:28 dcerouter kernel: [36517.332000] rtc: lost 28 interrupts
Aug  5 15:31:30 dcerouter kernel: [36519.384000] rtc: lost 27 interrupts
Aug  5 15:31:32 dcerouter kernel: [36521.440000] rtc: lost 27 interrupts
Aug  5 15:31:35 dcerouter kernel: [36523.492000] rtc: lost 27 interrupts
Aug  5 15:31:39 dcerouter kernel: [36527.600000] rtc: lost 27 interrupts
Aug  5 15:31:41 dcerouter kernel: [36529.652000] rtc: lost 28 interrupts
Aug  5 15:31:43 dcerouter kernel: [36531.708000] rtc: lost 4 interrupts
Aug  5 15:31:49 dcerouter kernel: [36537.864000] rtc: lost 28 interrupts
Aug  5 15:31:57 dcerouter kernel: [36546.072000] printk: 1 messages suppressed.
Aug  5 15:31:57 dcerouter kernel: [36546.072000] rtc: lost 27 interrupts
Aug  5 15:31:59 dcerouter kernel: [36548.124000] rtc: lost 27 interrupts
Aug  5 15:32:01 dcerouter kernel: [36550.176000] rtc: lost 28 interrupts
Aug  5 15:32:07 dcerouter kernel: [36556.332000] printk: 1 messages suppressed.
Aug  5 15:32:07 dcerouter kernel: [36556.332000] rtc: lost 27 interrupts
Aug  5 15:32:16 dcerouter kernel: [36564.544000] printk: 1 messages suppressed.
Aug  5 15:32:16 dcerouter kernel: [36564.544000] rtc: lost 28 interrupts
Aug  5 15:32:18 dcerouter kernel: [36566.596000] rtc: lost 27 interrupts
Aug  5 15:32:20 dcerouter kernel: [36568.584000] eth0: too many iterations (6) in nv_nic_irq.
Aug  5 15:32:22 dcerouter kernel: [36570.700000] printk: 1 messages suppressed.
Aug  5 15:32:22 dcerouter kernel: [36570.700000] rtc: lost 28 interrupts
Aug  5 15:32:28 dcerouter kernel: [36576.872000] printk: 2 messages suppressed.
Aug  5 15:32:28 dcerouter kernel: [36576.872000] rtc: lost 26 interrupts
Aug  5 15:32:32 dcerouter kernel: [36580.976000] rtc: lost 22 interrupts
Aug  5 15:32:36 dcerouter kernel: [36585.080000] printk: 1 messages suppressed.
Aug  5 15:32:36 dcerouter kernel: [36585.080000] rtc: lost 28 interrupts
Aug  5 15:32:42 dcerouter kernel: [36591.236000] printk: 1 messages suppressed.
Aug  5 15:32:42 dcerouter kernel: [36591.236000] rtc: lost 28 interrupts
Aug  5 15:32:50 dcerouter kernel: [36599.444000] printk: 1 messages suppressed.
Aug  5 15:32:50 dcerouter kernel: [36599.444000] rtc: lost 28 interrupts
Aug  5 15:32:55 dcerouter kernel: [36603.548000] rtc: lost 28 interrupts

My log is flooded with these too.. I wonder what is causing interrupts so fast that rtc can't respond fast enough?
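
To get a feel for how bad the flood is, something like the sketch below could total up the messages. The function name `rtc_lost_summary` is made up, and it only assumes the line format shown in the excerpt above.

```shell
# Summarise "rtc: lost N interrupts" kernel messages read from stdin.
# Prints: "<matching lines> <total lost interrupts>"
# (rtc_lost_summary is a hypothetical helper name.)
rtc_lost_summary() {
    awk '/rtc: lost [0-9]+ interrupts/ {
             # Find the word "lost"; the field after it is the count.
             for (i = 1; i < NF; i++)
                 if ($i == "lost") { lines++; total += $(i + 1); break }
         }
         END { printf "%d %d\n", lines, total }'
}
```

Usage would be along the lines of `rtc_lost_summary < /var/log/syslog` (the log path is an assumption and may differ on your install). Running it against older rotated logs as well would show whether the messages started recently or have always been there.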

hari · August 05, 2008, 10:36:56 PM

the forcedeth driver does the frame handling in the interrupt routine (afaik).

best regards,
Hari

jondecker76 · August 05, 2008, 10:51:34 PM

Hari -
I'm not very familiar with the forcedeth driver. Is this a common problem? Is my onboard NIC fried?

colinjones · August 05, 2008, 11:46:31 PM

I think you'll find the rtc messages relate to the High Precision Event Timer (HPET) on your motherboard. Mine does this too, doesn't seem to cause any problems - tried disabling HPET in the BIOS and that stopped it, but as it didn't seem to cause any problems (I know of) I re-enabled it.... red herring I think...

jondecker76 · August 05, 2008, 11:53:51 PM

colinjones - thanks, that gives me a little bit of relief. I'm going to dig through some old logs and see if it's always been like that.

Also, I talked to Netgear tech support today, and I definitely did not have the access point set up correctly, so that may have been causing my network to crash (I think it was handing out addresses from its own DHCP server in the same .80.xxx range). I have it set up correctly now, so I'll see if that takes care of the network stability.

jondecker76 · August 06, 2008, 11:46:32 AM

No go... Woke up this morning to the network being completely hung up again. This time I went to the core locally, and it was not frozen. I did a force reload of the router, but it did not bring the network back.

This thread has become a bit messy with all of the problems I'm having all of a sudden. I'm going to start a new thread for my network issues, and stick to the RAID problems in this thread.

Regarding the RAID problems - I finished badblocks scans on the last 60% of the drive, and it has passed all of them. I decided that since the drives are about full, I'm not going to do the first 40%, since it is taking so long.

Last night I added /dev/sdb back to the array:

Code:

sudo mdadm -a /dev/md0 /dev/sdb

This morning (after my reboot due to the network locking up again), the web admin shows the RAID status back to OK, and the array is functioning as I would expect (though I won't hold my breath, as the array has magically broken itself 3 times now). Let's see how it holds up. As per Zaerc's recommendation, I will make sure that a hard reset is not done on the core and see if that keeps the drive from being kicked from the array again.
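
The same status can also be checked from the command line with `cat /proc/mdstat` or `sudo mdadm --detail /dev/md0`. As a rough sketch, a check like the one below could flag an array that has lost a member again. The function name `degraded_arrays` is made up, and it assumes the usual mdstat layout where each array's member map looks like `[UU_]`, with `_` standing for a missing disk.

```shell
# Read /proc/mdstat-style text on stdin and print the name of every
# array whose member map contains "_" (a missing or failed member).
# (degraded_arrays is a hypothetical helper name.)
degraded_arrays() {
    awk '
        /^md[0-9]+ :/     { name = $1 }    # remember the array being described
        /\[[U_]*_[U_]*\]/ { print name }   # "_" in the member map = missing disk
    '
}
```

Usage would be `degraded_arrays < /proc/mdstat`. No output means every listed array has all of its members. While a rebuild is still running the array will keep showing up here until the re-added disk has finished syncing.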

Monkgs · September 07, 2008, 03:21:35 PM

I know this thread is old, but I just thought I'd post the solution in case anyone else runs into this issue.

This issue occurs on nforce boards utilizing the HPET chip, running kernels built with AMD_X64 prior to 2.6.23 and running the Asterisk module zt_dummy. I'm not sure what broken code exists in the zt_dummy module, but it causes a lockup of seemingly random devices (read: NIC, SATA, IDE, etc.) after an arbitrary number of RTC errors.

Using a kernel that is compiled with the HPET_EMULATE_RTC=y option will solve your issue (this config option is enabled by default in all build architectures except AMD_X64, in which it was disabled by accident). Alternatively, removing the zt_dummy module will work (but will also break Asterisk). On some boards that have both RTC and HPET chips you can disable the HPET chip in the BIOS, although I don't recommend this on anything that will be playing media. The HPET is useful for smooth playback of processor-intensive media, such as h264 content.

I'm not entirely sure whether these boards emulate an RTC for older operating systems like Windows XP, or whether they have both an RTC and an HPET. If you disable ACPI, that should also force the system to use the RTC, or at least emulate it.
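
A quick way to see whether a given core is affected might look like the sketch below. Both function names are made up for illustration. The `CONFIG_` prefix is how kernel options appear in the saved config file, and the module name is taken as written above, so it may be spelled differently in your `lsmod` output.

```shell
# Read a kernel config file on stdin and report whether RTC emulation
# via the HPET is built in. (hpet_rtc_status is a hypothetical helper.)
hpet_rtc_status() {
    if grep -q '^CONFIG_HPET_EMULATE_RTC=y'; then
        echo "enabled"
    else
        echo "not enabled"
    fi
}

# Read lsmod output on stdin; succeed if the named module is listed.
# (module_loaded is a hypothetical helper.)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

Usage, assuming the distribution saves the running kernel's config under /boot: `hpet_rtc_status < "/boot/config-$(uname -r)"` and `lsmod | module_loaded zt_dummy && echo "zt_dummy is loaded"`. If the first prints "not enabled" and the second reports the module as loaded, the machine matches the conditions described above.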

LinuxMCE Forums
