LinuxMCE Forums

General => Users => Topic started by: LegoGT on September 17, 2008, 08:47:06 pm

Title: Core/Hybrid Locking up: Possible Solution?
Post by: LegoGT on September 17, 2008, 08:47:06 pm
Recently, I've been enjoying the 100% CPU usage lockups and having to cold boot my system every day or so once it goes fully unresponsive. I've put LMCE on 3 separate newly built machines with different hardware and have pretty much ruled out any cooling issues. Finally, I came across some postings about ACPI BIOS functionality hosing up Linux systems and decided to try disabling it in the BIOS.

Well, so far the current system's been fully functional for about 4 days straight with no signs of squirrelly behavior. It might not be the magic bullet (I've been playing with kid gloves on this box) but it seems to be working rather well. I'll probably try this on the other 2 Core/Hybrid machines and see if it helps them out, too.

Note: Current test with Phoenix BIOS and LMCE 7.10 RC2
Title: Re: Core/Hybrid Locking up: Possible Solution?
Post by: LegoGT on September 18, 2008, 09:32:06 pm
No dice. The Hybrid is unresponsive at the UI level (can't CTRL-ALT-Fx to other shells), but I can still SSH into it. I tried looking at the log files, but many of them are either zeroed out or 20 bytes long (gzipped files). In particular, xorg.conf.log and xorg.conf.log.1 have no information.

DCERouter.log has messages about "Going to rotate logs..." and then, a few messages later, a "Query failed (MySQL server has gone away)". After that it repeats "Event #75 has no handlers" hundreds of times. I'm sure this is more of a symptom than the source of the problem, but I'm not sure where to look next.

FWIW, pluto.log is zeroed out as well. In pluto.log.1 the last few messages are "20 (spawning device) ... Device died..." and variations of the same.
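
For anyone retracing these checks over SSH, here is a rough sketch of how they could be scripted. The function name log_triage is made up for illustration, and it assumes you pass it (or run it from) whatever directory holds DCERouter.log and the other logs on your install. It lists files of 20 bytes or less and counts the two DCERouter messages mentioned above.

```shell
# Hypothetical helper: flag empty or near-empty logs and count two
# DCERouter.log messages. The log directory is an assumption you supply.
log_triage() {
    # $1: directory containing the logs (defaults to the current directory)
    dir="${1:-.}"

    echo "Logs of 20 bytes or less:"
    # -size -21c matches files smaller than 21 bytes, i.e. 0 to 20 bytes
    find "$dir" -type f -size -21c

    echo "Message counts in DCERouter.log:"
    for msg in 'MySQL server has gone away' 'has no handlers'; do
        # grep -c exits non-zero when the count is 0, so ignore its status
        n=$(grep -c -- "$msg" "$dir/DCERouter.log" 2>/dev/null) || true
        echo "$msg: ${n:-0}"
    done
}
```

Something like `log_triage /path/to/logs` prints the suspiciously small files first, then one count per message. A count of 0 may only mean the log was rotated away before you looked.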

Any ideas on where I should be looking next?

Title: Re: Core/Hybrid Locking up: Possible Solution?
Post by: williammanda on September 21, 2008, 01:46:24 am
Post your computer hardware!
Title: Re: Core/Hybrid Locking up: Possible Solution?
Post by: jimbodude on September 24, 2008, 06:36:25 pm
I recently overcame a similar problem caused by the nvidia driver (I'm assuming you're using an nvidia card).  To tell if you're having the same issue, do this:

Code:
ls /proc/irq/*

This will give you a listing of the IRQs being used by your system.  If your graphics driver is nvidia, and it's on the same line as something else, these devices are sharing an IRQ.  This should work just fine - but I guess there are some very specific (and completely undocumented) situations when it is not ok to do this.
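
Another place to see the same information is /proc/interrupts, which prints one line per IRQ and separates the handlers sharing it with commas. Below is a minimal sketch that prints only the lines where nvidia shares an IRQ with something else. The function name shared_nvidia_irqs is made up for illustration, and it assumes the driver registers itself under a name containing "nvidia".

```shell
# Hypothetical helper: print /proc/interrupts lines where nvidia shares an IRQ.
shared_nvidia_irqs() {
    # $1: file to read (defaults to the live /proc/interrupts)
    # A shared IRQ lists its handlers on one line, separated by commas,
    # so keep lines that mention nvidia and also contain a comma.
    awk 'tolower($0) ~ /nvidia/ && /,/ { print }' "${1:-/proc/interrupts}"
}
```

Run `shared_nvidia_irqs` on the live system, or give it a saved copy of the file as its argument. No output means nvidia is not sharing an IRQ (or is not loaded under that name).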

To further verify the problem, look in /var/log/syslog around the time of the lock-up.  If you see a bunch of Xid 8 and 16 errors, this is likely your problem.  I got these errors about 75% of the time - so you may have to check over a few lockups.
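
To save scrolling through the whole file, something like the sketch below can tally the Xid codes for you. It assumes the driver logs lines of the form "NVRM: Xid (...): <code>, ..." (check a real line from your own syslog first, since the exact format may differ between driver versions), and count_xid_errors is just a name picked for illustration.

```shell
# Hypothetical helper: tally NVIDIA Xid error codes found in a log file.
count_xid_errors() {
    # $1: log file to scan (defaults to /var/log/syslog)
    # Assumed line format: "... NVRM: Xid (<bus id>): <code>, <details>"
    # sed keeps only the numeric code; uniq -c counts each distinct code.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' "${1:-/var/log/syslog}" |
        sort -n | uniq -c
}
```

Each output line is a count followed by an Xid code, so seeing both 8 and 16 in the list around a lock-up matches the pattern described here. Older lock-ups may be in rotated files such as syslog.1, which you can pass as the argument.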

If this is your problem:
The easiest way to correct this is to physically move the cards around in your system until the nvidia device has its own IRQ, or is sharing with a device that doesn't use its IRQ much or at all (like an ivtv tuner).

Some additional background:
I was using an onboard nVidia 6150, which worked fine until I added a second NIC.  I did a reinstall at that time, and then xorg and orbiterGL started to spin at 100% CPU usage.  I was able to get a running system by putting in a PCI graphics card - which led me to believe the problem was a cooked card.  The trick here was that I had removed my tuner card to put in the graphics card - this put the PCI graphics on a physically different IRQ pin, making an overlap impossible.  I bought a new PCI-Express nVidia 7300, reinstalled the tuner, and immediately started to have the same issue.  I just happened to notice the exact time of a few lock-ups, and went looking for patterns in the logs.  Finding Xid 8 and 16, I started to do some research.  After a while, I came to the conclusion above, and now everything seems to be just fine.

This took me about a month to track down because it is so obscure and undocumented.  There is one forum post on the nvidia site explaining that a combination of Xid 8 and 16 points to an IRQ conflict.  It goes on to claim that it is technically impossible to have an IRQ conflict like this with the way the nvidia driver is designed (which is clearly false).  There is also almost no documentation of what individual Xid errors actually mean - it seems to be some sort of corporate secret - so it's very hard to get a picture of what is going on unless you ask at the nvidia forums with the exact combination of Xid errors you're getting.

Hope I can save someone all this trouble...