Author Topic: Nightmare Install Problems on new core  (Read 3028 times)

Glasswalker

  • Regular Poster
  • **
  • Posts: 20
Nightmare Install Problems on new core
« on: November 11, 2008, 10:53:18 pm »
Ok, here's the skinny:

Hardware:
Tyan Thunder S2882
Dual Opteron 200 series 2.0GHz (not sure of precise model)
4GB of RAM
4X 3Ware 8506-4LP SATA RAID cards
10x Seagate 300GB HDDs (SATA)
4X Seagate 500GB HDDs (SATA)
1X Western Digital 120GB HDD (SATA) (connected to onboard SATA controller).
IDE DVDROM
Dual Onboard Gigabit NICs
Single Onboard 10/100 NIC.

Problem:
I installed this a couple days ago. The first time I tried, it booted almost immediately to a BusyBox prompt. I figured it was a CD read error or something weird, so I just rebooted. The install started and finished just fine, LinuxMCE started up, I walked through the AV wizard, and came to a running core...

I fucked around with settings a bunch, broke some stuff, and decided to re-install (this was a new core, old one was still running so I had the option to experiment).

On the next install it took 5 reboots to get the install to go (Same thing as before) (note nothing is changing between these reboots, just hit reset button and wait for it to boot)

Did another reinstall... This time it installed with no reboots, first try.

Then tried another reinstall, this time took 20-30 reboots, (pissed me off) but installed fine... This time I left it up, and configured it somewhat, it was stable, and I rebooted the system a few times for various reasons, it rebooted fine no worries on the installed system.

(at this point I should mention that up until now, the first 4 bays running on the first 3ware card were populated with the 500GB drives, no other drives were populated yet, and the first 3ware card was configured for a 4 drive RAID5 hardware array with the 4 500 gb drives)

So I added that raid array as a drive in LMCE, and started migrating my media across from the old core to the new one (across the network, network ports working fine, tested both the 100mb and the gigabit ports)

Now with all the data migrated, I wanted to move some of the drives from the old system into the new system, and figured a fresh install wouldn't hurt (I know I know I probably didn't need it, but I did it anyway).

So I moved my drives across:
Installed a second 120GB drive on the onboard controller
Populated the remaining bays in the system with the 300gb drives
Added some ram from the other system to this one bringing it up to 6GB of ram.
I didn't configure the 300gb drives on hardware raid, I plan on letting lmce use software raid for those.
I did configure the 2x 120gb with raid using the motherboards bios, RAID1 mirror, so I could use that as my OS partition.

Rebooted off the install DVD. This time it dies almost right away with:
Code:
udevd-event[2452]: run_program: '/sbin/modprobe' abnormal exit

BusyBox v1.1.3 (Debian 1:1.1.3-5ubuntu7) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)

So I assume it's the same issue as before, so I reboot about 30 times... then I got pissed off...

So then I notice that the modprobe abnormal exit is different this time than last time... So I decide let's put the hardware back exactly like before...

So I removed all the drives I added, removed the extra RAM, and reset the RAID settings on the onboard controller to standard SATA, with only the single 120GB drive installed...

Still getting same error... retried another 10 times...

Then I tried reburning the disc in case the disc was bad

Tried another 10 times....

Figured, must be a hardware failure or something... So I tried a centos 5 disk I had laying around...

(all drives and network working fine)

Installs fine, boots to a desktop...

Tried a Windows Server 2003 install... It works fine boots to a desktop (all drives and network working fine)

Then I tried searching on the error, found some bugs relating to seagate drives... Said to use all_generic_ide

So I tried that, that got it past the first error, but now it locks hard on "Configuring Network Interfaces"

WHAT THE HELL?

So after some more searching I find that there are 2 kernel bugs known specifically with the 0710 kubuntu install livecd, that cause this error either on the seagate hdds or on certain onboard network interfaces.

So I go into the bios and disable the onboard nics entirely...

Reboot, now the installer starts... So I run through the install, but after the normal grub install part, when it spits the disc out and reboots, it does nothing (the screen blanks, leaving a blinking cursor in the top left corner; it responds to keystrokes, but does nothing else... hangs there forever, I left it for half an hour to be sure).

At this point I hit the reset button... It booted off the hard drive, but hung with a kernel panic??? what the hell???

So at this point I re-download the iso... MD5 Check it, it's fine... Re-burn it at 1X... Retry the boot... Same friggin deal...
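
For anyone repeating the MD5 step above, here is a minimal Python sketch of the check. The function names and the chunk size are illustrative assumptions, not anything from the LinuxMCE tooling; the expected digest is whatever the download page publishes.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so a multi-gigabyte ISO does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path, expected_md5):
    """Compare the computed digest with the published one.
    Whitespace and letter case in the published value are ignored."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Note that a matching digest only says the downloaded file is intact. It says nothing about the burned disc or the drive reading it, which is why re-burning at a low speed is a separate step.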

So I retry the install again... using the all_generic_ide and only the gigabit nics disabled... No go... (still failing at configuring network)
Tried again with all_generic_ide and the 100mb nics disabled... No go... (still failing at configuring network)
Tried again with all nics disabled, but without the generic ide... Dies with busybox prompt...

So I retried again with all nics disabled and all_generic_ide... Again, install works fine, but hangs before reboot... Left it for half an hour, hit reset button...

This time it boots, loads the kernel, then my monitor gives me an error saying that the video signal it's receiving is out of its refresh rate range... (no this is not the av wizard, this is IMMEDIATELY like within half a second of the kernel loading after the bootloader).

Left it for a bit, then hit reset again...

This time I stopped grub, and took off the quiet and splash options... This time it hung at...
udevd-event[2053]: run_program: '/sbin/modprobe' abnormal exit

So I reboot again, this time editing the grub line to pull out the quiet/splash and to add all_generic_ide
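
The edit described here (drop quiet and splash, add all_generic_ide) can be sketched as a small Python helper that rewrites a kernel line. This is only an illustration: the function name and the sample kernel line are made up, and the real line on a given install will have a different kernel path and root device.

```python
def patch_kernel_line(line, remove=("quiet", "splash"), add=("all_generic_ide",)):
    """Return the kernel line without the options in `remove`
    and with each option in `add` appended once."""
    tokens = line.split()
    kept = [token for token in tokens if token not in remove]
    for option in add:
        if option not in kept:
            kept.append(option)
    return " ".join(kept)


# Hypothetical kernel line, for illustration only.
example = "kernel /boot/vmlinuz root=/dev/sda1 ro quiet splash"
print(patch_kernel_line(example))
```

Running the helper on its own output changes nothing further, so it would be safe to apply to a boot config that has already been patched.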

This time it boots through the proper startup for dcerouter, which gets angry as all hell because I have no network interfaces, but it does get me to avwizard...
Then X starts, and the launch manager comes up.
Got the girl on the screen, did the config steps (whipped through with minimal config to see what's next)
It regenerated the orbiter
While it was doing that I dropped out to a terminal to edit my boot config to force it to set the all_generic_ide option, but the pass for linuxmce user I set during install wasn't working (that may have been a typo during install though)
It came up with onscreen orbiter fine, so I rebooted it via the menu.
This time I hit the bios and re-enabled the onboard NICs
then interrupted grub and added all_generic_ide and still disable the quiet and splash so I can see what's going on...
This time around, it passes configuring network interfaces, but not by much (didn't see the exact line unfortunately); after 4-5 more lines maybe, it kernel panics...
specifically: Kernel panic - not syncing: Aiee, killing interrupt handler

Anyway, so at this point I'm ready for a full blown nervous breakdown... Ready to smash something seriously... was up all night last night, was also sick as a dog... Feel like shit, and frustrated as hell right now...
AND since it worked fine before, and I copied my data over, I have cannibalized my old core, so now I am completely without a core! which is REALLY pissing me off lol...

Just for shits and giggles... I retried with the CentOS5.1 dvd again... To confirm it still works... (with the network cards enabled, and no special boot options) (to see if it works without any "workarounds")
The installer booted fine, detected nics fine, detected hard drives fine, with no errors...
The install completed fine with no errors.
And it rebooted and started udev fine...
Detected the nics fine...
fired up networking fine, and booted to X perfectly fine...
From within X I tested networking, and it's working perfectly...
Kernel was 2.6.18-53

So I ask a few questions:

#1 - WHAT THE HELL?
#2 - Why would something that appears to be hardware incompatibility, work perfectly "sometimes", and then suddenly stop working at all unless it's hardware failure.
#3 - If it's not hardware failure, why does it work perfectly in all other OSs lol (including other linux distros)

ok, I'm done ranting now, just if anyone has any other suggestions for me to try PLEASE they would be much appreciated...

Thanks!

hari

  • Administrator
  • LinuxMCE God
  • *****
  • Posts: 2428
    • ago control
Re: Nightmare Install Problems on new core
« Reply #1 on: November 11, 2008, 11:45:08 pm »
Quote
So I ask a few questions:

#1 - WHAT THE HELL?
look into the sky. Do you see a small black cloud that sticks above your head and is raining cats and dogs?

Quote
#2 - Why would something that appears to be hardware incompatibility, work perfectly "sometimes", and then suddenly stop working at all unless it's hardware failure.
#3 - If it's not hardware failure, why does it work perfectly in all other OSs lol (including other linux distros)
I'd swap the kernel and boot LMCE with a vanilla one or something like that...

best regards,
Hari
rock your home - http://www.agocontrol.com home automation

Glasswalker

  • Regular Poster
  • **
  • Posts: 20
Re: Nightmare Install Problems on new core
« Reply #2 on: November 12, 2008, 05:07:30 pm »
Thanks Hari :) Yeah I was having a bad day and was going a bit overboard lol... Being sick and sleep deprived (and ridiculously stubborn) was making me a little irrational...

Turns out it was ram... Had a bad stick of ram...

Still confuses me, in that running any other Linux or Windows OS it worked fine, and it passed a few hours straight of memtest checks... But in the end, I removed sticks of ram one at a time, and this one came out, and suddenly everything worked perfectly, no more flaky behavior lol...

Anywhoo thanks to everyone on IRC that put up with my whining the past 2 days lol...

caddywhompus

  • Regular Poster
  • **
  • Posts: 45
Re: Nightmare Install Problems on new core
« Reply #3 on: November 12, 2008, 06:29:40 pm »
FWIW I feel your pain my brother.  I've had the same issues with RAM in the past.  It appears as though sometimes MS products will function with flaky memory and the system is just slightly unstable; often people attribute this to MS when in fact it is hardware.  It's possible that only a select location (address) on the stick is not functioning and the system will work 'no problem' until it attempts to read/write from that location.  The installation programs of all the different OSes you tried would all use different portions and addresses of RAM, so it's possible that some installs were accidentally avoiding that bad spot and therefore completing with success, while others kept tripping up on it.
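
A toy Python model of that idea, for illustration only: memory is a row of numbered cells with one bad cell, and each "boot" uses a contiguous block starting at a random position. Real kernels and installers do not lay out memory this simply, and all the names and numbers below are assumptions, but it shows how the same hardware can fail on some runs and pass on others.

```python
import random


def touches_bad_cell(start, length, bad_cell):
    """True if the block [start, start + length) includes the bad cell."""
    return start <= bad_cell < start + length


def failure_rate(total_cells, used_cells, bad_cell, runs=10_000, seed=0):
    """Estimate how often a randomly placed block of `used_cells`
    lands on the bad cell, over `runs` simulated boots."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    failures = 0
    for _ in range(runs):
        # Pick any start position where the block still fits in memory.
        start = rng.randrange(total_cells - used_cells + 1)
        if touches_bad_cell(start, used_cells, bad_cell):
            failures += 1
    return failures / runs


# Hypothetical numbers: a small footprint hits the bad cell far less
# often than one that fills most of memory.
print(failure_rate(total_cells=1000, used_cells=50, bad_cell=700))
print(failure_rate(total_cells=1000, used_cells=900, bad_cell=700))
```

In this model a workload that uses a small slice of memory rarely touches the bad cell, while one that fills most of memory almost always does, which fits a box that installs cleanly one day and dies thirty times in a row the next.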

I've also had optical drives that were bad and would exhibit the booting frustration you describe.  Generally they would work fine (or seem to) when tossed in a Windows box and run in normal environment, but try to boot from them and install an OS and all bets are off.  I went through 3 DVD drives setting up MythDora before I found one that could complete the install.  Once the OS was up and running, all 3 drives seemed to function fine.  Go figure.

By the way, NICE CORE SYSTEM!  Holy Crap I wish I had the resources to put something like that together! :)