Author Topic: Diskless boot fails: transmit timed out

organicveggie

  • Newbie
  • Posts: 5
Diskless boot fails: transmit timed out
« on: March 31, 2009, 05:28:45 am »
So I started with a fresh install of 0710 from DVD, which appears to be working beautifully. Next I tried to PXE boot a second machine to function as a Media Director (MD) - that failed early on due to a problem with the driver for the NVidia ethernet adapter. Since I happened to have an extra RealTek 8139 NIC lying around, I threw that in the machine. I can boot off that, get an IP address from the Core and start the process... unfortunately, it fails partway through:

NETDEV WATCHDOG: eth1: transmit timed out
eth1: link up, 100Mbps, full-duplex, lpa 0x45E1

That repeats for a while then eventually it just gives up and sits there.

I tried reinstalling the Core from scratch, but that didn't help.

Any suggestions would be greatly appreciated!

Thanks.

-Sean

colinjones

  • Alumni
  • LinuxMCE God
  • Posts: 3003
Re: Diskless boot fails: transmit timed out
« Reply #1 on: March 31, 2009, 07:12:41 am »
Get one of the Intel NICs that are mentioned in the forums and wiki as working... it will almost certainly be easier.

The only other pointers I can give are - you really need to try to disable the onboard NIC before using a card, otherwise it might get confused.... I say this because I noted that the message is talking about eth1 instead of eth0, so I'm assuming it is still seeing the onboard NIC. Also, the Realtek 8168/8169 driver mess-up could possibly be implicated if they are using the same driver, or at least misidentifying the product in the same way in the PCI alias.

Path of least resistance: choose an Intel NIC that is mentioned somewhere as working, install that, and disable the onboard NIC.

organicveggie

  • Newbie
  • Posts: 5
Re: Diskless boot fails: transmit timed out
« Reply #2 on: March 31, 2009, 09:54:10 pm »
Quote
you really need to try to disable the onboard NIC before using a card, otherwise it might get confused.

I tend to doubt it got confused... it got an IP address from DHCP using eth1 and definitely recognized that there was no available connection on eth0. But it's still worth trying. :)

That also reminded me that the RealTek 8139 NIC I'm trying to use is a cheap piece of junk. It's quite possible the NIC is bad... and, as it just so happens, I have another identical 8139 card that I can try instead. So I'll give that a shot as well tonight, in addition to disabling the onboard NIC.

Quote
Get one of the Intel NICs that are mentioned in the forums and wiki as working... it will almost certainly be easier.

If disabling the onboard NIC and swapping in a different RealTek 8139 both fail, I'll pick up a new NIC. From the forums, it sounds like the Intel PWLA8391GT gigabit NIC would be a good choice for a PCI NIC and at $24 from NewEgg it's a lot cheaper than building a new machine. :)

Thanks for the suggestions.

-Sean

colinjones

  • Alumni
  • LinuxMCE God
  • Posts: 3003
Re: Diskless boot fails: transmit timed out
« Reply #3 on: March 31, 2009, 10:47:31 pm »
Quote
you really need to try to disable the onboard NIC before using a card, otherwise it might get confused.

I tend to doubt it got confused... it got an IP address from DHCP using eth1 and definitely recognized that there was no available connection on eth0. But it's still worth trying. :)


After the MD has PXE booted and gotten its DHCP IP address, it pulls down the vmlinuz and initramfs images from the core. Together these are a minimal Linux kernel and initial RAM filesystem with just enough functionality to mount an NFS share, pull over the boot files, etc. But when that kernel starts, it has its own NIC drivers which take over from the PXE boot software. When that happens, what I am concerned about is whether it is setting eth1 to DHCP as it would normally expect to set eth0 to DHCP. I'm not sure what approach the code takes, but if it only expects eth0 and so only sets that, at this point you would likely lose network connectivity, since the kernel's TCP/IP stack is separate from the one the PXE software was using... this certainly does happen in some cases, and the result is a kernel panic because the boot code can no longer contact the boot server for the rest of its files and so cannot do anything else. Catch-22.

When you are talking about simple routing, it shouldn't really matter, but we are talking about the NIC being handed over from one piece of code to another, so it is conceivable that the device names could play a part. For instance, they definitely play a part on a core... eth0 generally needs to be pointing to the external network and eth1 to the internal. It isn't arbitrary, because the software needs to set one to DHCP and the other to static 192.168.80.1. If the device names are the wrong way around compared with how you have actually patched the box, you will end up with the internal NIC pointing outside and the external pointing inside! This is for different reasons, but it is an example of where device names can be important.
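If you can get to a shell on the machine (from a live CD, say), one way to see which device name the kernel actually gave each NIC is to read the standard Linux sysfs entries under /sys/class/net. The following is only a rough sketch of that check: the function names are made up for illustration, and it assumes a kernel that exposes the usual address, operstate and device/driver entries there.

```python
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def list_interfaces(sysfs_root="/sys/class/net"):
    """Map each interface name (eth0, eth1, ...) to its MAC, link state and driver.

    list_interfaces is a hypothetical helper name; sysfs_root is a parameter
    only so the function can be pointed at a different directory.
    """
    result = {}
    for name in sorted(os.listdir(sysfs_root)):
        base = os.path.join(sysfs_root, name)
        info = {
            "mac": _read(os.path.join(base, "address")),
            "state": _read(os.path.join(base, "operstate")),
            "driver": None,  # stays None for virtual devices such as lo
        }
        # For physical NICs, device/driver is a symlink whose last path
        # component is the name of the kernel driver bound to the card.
        driver_link = os.path.join(base, "device", "driver")
        if os.path.islink(driver_link):
            info["driver"] = os.path.basename(os.readlink(driver_link))
        result[name] = info
    return result
```

Calling list_interfaces() and printing the result shows, for each name, the MAC address and driver, so you can tell whether the add-in card came up as eth0 or eth1 and compare its MAC with the one the core has on record.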

organicveggie

  • Newbie
  • Posts: 5
Re: Diskless boot fails: transmit timed out
« Reply #4 on: April 01, 2009, 01:38:07 am »
Quote
you really need to try to disable the onboard NIC before using a card, otherwise it might get confused.

I tend to doubt it got confused... it got an IP address from DHCP using eth1 and definitely recognized that there was no available connection on eth0. But it's still worth trying. :)


... But when that kernel starts, it has its own NIC drivers which take over from the PXE boot software. When that happens, what I am concerned about is whether it is setting eth1 to DHCP as it would normally expect to set eth0 to DHCP.

I hear what you're saying. Makes sense. Turns out that's not a problem. The real cause: bad network card.

However, I ran into a few more problems due to my own stupidity... which I should probably record for posterity (in case anyone else runs into similar problems).

First off, this computer only supports PXE boot from the on-board NIC, which doesn't work with LMCE. So I recompiled GRUB with support for the RealTek 8139 and created a "boot CD" with a simple GRUB config, following the instructions at: http://wiki.linuxmce.org/index.php/GRUB_PXE_network_boot

However, to make matters more complicated, I forgot that I had already tried setting up a different MD with that network card - so LMCE had already created an MD (#35) associated with that MAC address. The default target for the kernel and initrd image did not include a reference to the MD object number. Specifically, the lines in my menu.lst for GRUB looked like the following:

Code:
kernel /tftpboot/default/vmlinuz root=/dev/nfs acpi=off vga=normal ramdisk_size=10240 rw ip=all apicpmtimer
initrd /tftpboot/default/initrd

When I boot that way, it continually times out with the same message:

Code:
NETDEV WATCHDOG: eth0: transmit timed out
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

So I manually launched the network boot (I'll update my boot CD later) using the following settings instead (culled from the GRUB PXE wiki page):

Code:
kernel /tftpboot/35/vmlinuz ramdisk=10240 rw root=/dev/nfs boot=nfs nfsroot=192.168.80.1:/usr/pluto/diskless/35
initrd /tftpboot/35/initrd.img

And voila. Problem solved. I'm a happy camper and my faith in LMCE has been restored. ;)
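
For anyone who needs the same lines for a different MD number, here is a small sketch that fills the device number into the two lines above. The function name is just something I made up, the default core address is the 192.168.80.1 used in this thread, and the kernel options are copied as-is from the working entry; anything else your menu.lst needs (network setup and so on) is not covered here.

```python
def grub_lines_for_md(md_id, core_ip="192.168.80.1"):
    """Build the kernel/initrd lines for a given MD device number.

    grub_lines_for_md is a hypothetical helper; the paths and options mirror
    the entry that worked for MD #35 in this thread and are not verified
    against any other LinuxMCE setup.
    """
    md_id = int(md_id)
    if md_id <= 0:
        raise ValueError("MD device number must be a positive integer")
    kernel = (
        "kernel /tftpboot/%d/vmlinuz ramdisk=10240 rw root=/dev/nfs "
        "boot=nfs nfsroot=%s:/usr/pluto/diskless/%d" % (md_id, core_ip, md_id)
    )
    initrd = "initrd /tftpboot/%d/initrd.img" % md_id
    return kernel + "\n" + initrd + "\n"
```

Running print(grub_lines_for_md(35)) reproduces the two lines I used, so swapping in another device number should give the matching entry for that MD.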

Thanks again for your assistance. I really appreciate it.

-Sean