« on: September 24, 2008, 06:36:25 pm »
I recently overcame a similar problem caused by the nvidia driver (I'm assuming you're using an nvidia card). To tell if you're having the same issue, list the IRQs being used by your system; on Linux, cat /proc/interrupts shows them.
In that listing, if your graphics driver is nvidia and it's on the same line as something else, these devices are sharing an IRQ. This should work just fine - but I guess there are some very specific (and completely undocumented) situations when it is not ok to do this.
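If you'd rather not eyeball the listing, here is a rough Python sketch of the same check. It assumes the usual Linux /proc/interrupts layout, where devices sharing an IRQ are comma-separated at the end of the line; the function name nvidia_irq_sharing is just something I made up for illustration.

```python
def nvidia_irq_sharing(interrupts_text):
    """Return {irq: [other devices on that line]} for every IRQ that lists nvidia.

    Assumes the common /proc/interrupts layout: "IRQ: counts... controller dev1, dev2".
    """
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # header row, or named rows such as NMI/LOC
        parts = rest.split(",")
        first_fields = parts[0].split()
        if not first_fields:
            continue  # nothing after the IRQ number
        # The first device name is the last field before the first comma.
        devices = [first_fields[-1]] + [p.strip() for p in parts[1:]]
        if any("nvidia" in d.lower() for d in devices):
            shared[int(irq)] = [d for d in devices if "nvidia" not in d.lower()]
    return shared


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as f:
            result = nvidia_irq_sharing(f.read())
    except OSError as exc:
        print("could not read /proc/interrupts:", exc)
    else:
        for irq, others in result.items():
            if others:
                print(f"IRQ {irq}: nvidia shares with {', '.join(others)}")
            else:
                print(f"IRQ {irq}: nvidia has this IRQ to itself")
```

An empty list for an IRQ means nvidia has that line to itself; anything else names the devices it shares with.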
To further verify the problem, look in /var/log/syslog around the time of the lock-up. If you see a bunch of Xid 8 and 16 errors, this is likely your problem. The errors only showed up in my log for about 75% of the lock-ups, so you may have to check over a few of them.
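Scanning the log by hand gets tedious over several lock-ups, so here is a small Python sketch that counts Xid codes. It assumes the driver writes lines of the form "NVRM: Xid (...): <number>, ..."; that pattern, and the helper names count_xids and looks_like_irq_conflict, are my own assumptions, so adjust the regex to whatever your log actually contains.

```python
import re
import sys
from collections import Counter

# Assumed log shape: "NVRM: Xid (<device>): <code>, <details>"
XID_RE = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def count_xids(log_text):
    """Count how often each Xid code appears in the given log text."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        counts[int(match.group(1))] += 1
    return counts


def looks_like_irq_conflict(counts):
    """True when both Xid 8 and Xid 16 were seen, the combination described above."""
    return counts[8] > 0 and counts[16] > 0


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    try:
        with open(path, errors="replace") as f:
            counts = count_xids(f.read())
    except OSError as exc:
        print(f"could not read {path}: {exc}")
    else:
        print("Xid counts:", dict(counts))
        if looks_like_irq_conflict(counts):
            print("Both Xid 8 and Xid 16 present; check for a shared IRQ.")
```

Pass a different log path as the first argument if your distribution keeps kernel messages somewhere else.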
If this is your problem:
The easiest way to correct this is to physically move the cards around in your system until the nvidia device has its own IRQ, or is sharing with a device that doesn't use its IRQ much or at all (like an ivtv tuner).
Some additional background:
I was using an onboard nVidia 6150, which worked fine until I added a second NIC. I did a reinstall at that time, and after that xorg and orbiterGL started to spin at 100% CPU usage. I was able to get a running system by putting in a PCI graphics card, which led me to believe the problem was a cooked onboard chip. The catch was that I had removed my tuner card to make room for the graphics card, which put the PCI graphics on a physically different IRQ pin and made an overlap impossible. I then bought a new PCI-Express nVidia 7300, reinstalled the tuner, and immediately started to have the same issue. I just happened to notice the exact time of a few lock-ups, and went looking for patterns in the logs. Finding Xid 8 and 16, I started to do some research. After a while, I came to the conclusion above, and now everything seems to be just fine.
This took me about a month to track down because it is so obscure and undocumented. There is one forum post on the nvidia site explaining that a combination of Xid 8 and 16 points to an IRQ conflict. It goes on to claim that an IRQ conflict like this is technically impossible given the way the nvidia driver is designed (which is clearly false). There is also almost no documentation of what individual Xid errors actually mean - it seems to be some sort of corporate secret - so it's very hard to get a picture of what is going on unless you ask at the nvidia forums with the exact combination of Xid errors you're getting.
Hope I can save someone all this trouble...