All MDs are down

joerod · August 17, 2011, 05:33:47 AM

I got home today from work and all three of my Media Directors are giving the same error messages at boot.

They get their initial address via DHCP and start loading their respective initrd.img

then it says "Loading, please wait..."

then we see all the nice ip info followed by: filename : /tftpboot/pxelinux.0

then immediately followed by "connect: Connection refused"

at the end it says:

mount: mounting /dev on /root/dev failed: No such file or directory
mount: mounting /sys on /root/sys failed: No such file or directory
mount: mounting /proc on /root/proc failed: No such file or directory
Target filesystem doesn't have /sbin/init.
No init found. Try passing init= bootarg

..............

The core seems to be working fine... I've rebooted it and it boots fine.

Please help. I've been running this system for about a year and it's worked great. I'm just beginning to really understand how this system works, but I just don't understand what's going on here...

Thanks.

fibres · August 17, 2011, 07:34:34 PM

This is only a possible suggestion and I would advise trying it on one MD and see if it works.

There is no guarantee that this won't break the MD or lose all of its settings.

Go into Web Admin.

Click on Media Directors on the left-hand side, then pick one of the MDs (not the top one, which will be your core) and click the rebuild image option for that media director.

This will rebuild the MD's filesystem on your core. Then try booting the media director.

If this works then do the same for all.

Remember I take no responsibility if this breaks things further or loses the MD's settings!

Hope it helps.

Regards

joerod · August 17, 2011, 08:14:19 PM

Yeah, that was one of the first things I tried, but no go; it still does the same thing.

purps · August 17, 2011, 08:25:20 PM

Does your internal network function correctly?

You could try going into web admin, and under Wizard -> Restart, select "Net" for one of the MDs. Then try booting the MD, see if that works. Long shot, but it won't hurt.

Cheers,
Matt.

klovell · August 17, 2011, 08:55:37 PM

Quote from: purps on August 17, 2011, 08:25:20 PM
Does your internal network function correctly?

You could try going into web admin, and under Wizard -> Restart, select "Net" for one of the MDs. Then try booting the MD, see if that works. Long shot, but it won't hurt.

Cheers,
Matt.

Definitely try this!!

I've rebuilt a couple of MDs before and this was the problem.

joerod · August 18, 2011, 03:12:33 AM

I just tried the above suggestion, but unfortunately nothing has changed.

purps · August 18, 2011, 10:16:00 AM

Does your internal network function correctly though? Do you have an independent machine on the internal network from which you can ping the core?

joerod · August 18, 2011, 03:57:22 PM

Yes, I have a Windows box and another Linux workstation that work perfectly. They have a connection to the internet and access to dcerouter through web admin and ssh.

purps · August 18, 2011, 05:19:07 PM

Are all the MDs exactly the same hardware?

Have you updated/upgraded or done anything to the system at all, or has it literally just randomly started doing this?

Have you got another switch you could try? I know everything appears to be working OK from your Windows box, but switches aren't necessarily either "working perfectly" or "completely broken"; certain switch components can fail before others.

joerod · August 19, 2011, 08:43:08 PM

Well, I upgraded after the fact (to see whether it would help), and I had also upgraded about a week before it happened. I have not yet tried another switch, because other clients, including a few Netflix players I have, are working great.

Sigg3.net · August 19, 2011, 09:59:51 PM

Could be a filesystem error. Some have had luck running filesystem checks (on ext-based filesystems). Read more here: http://ubuntuforums.org/showthread.php?t=1167710
Remember that the drive (partition) must be unmounted to run fs checks.

Do you have HDDs on these MDs? Could be worth checking them for hardware errors with a vendor-specific diagnostic disc.

Given that you boot via PXE I guess it doesn't really matter unless they're hybrids. Then it could be the file system internal to the MD disk image file on the core. You should be able to dd a copy of the image and run e2fsck on it. Make a backup copy and don't work on the original. I am not sure where or how the PXE disk images are stored, unfortunately.

The good news is that if it's corruption in the disk image, your MDs should be fine, though you'll have to re-create them.

purps · August 19, 2011, 10:46:24 PM

Quote from: joerod on August 19, 2011, 08:43:08 PM
Well, I upgraded after the fact (to see whether it would help), and I had also upgraded about a week before it happened. I have not yet tried another switch, because other clients, including a few Netflix players I have, are working great.

The other clients may appear to be working fine from the point of view of Internet browsing etc, yes, but the netbooting and the netbooting alone may be affected by a potentially faulty switch.

Cheers,
Matt.

dextaslab · August 20, 2011, 03:44:54 PM

Check to make sure the NFS daemon is running: /etc/init.d/nfs-kernel-server start. Then check the logs for any errors.

I'd be checking '/etc/exports' for:
...
## BEGIN : DisklessMDRoots

/usr/pluto/diskless/46 192.168.80.0/255.255.255.0(rw,no_root_squash,no_all_squash,sync,no_subtree_check)
...
If you need to make changes, run 'exportfs -a -v' afterwards, before trying to boot your MDs.

Check the file '/etc/fstab' for each MD: 'cat /usr/pluto/diskless/<MD#>/etc/fstab'
Look for something similar to the following:
...
192.168.80.1:/usr/pluto/diskless/46 / nfs intr,nolock,udp,rsize=32768,wsize=32768,retrans=10,timeo=50 1 1
...
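
The two checks above can be combined into one small script. This is only a minimal sketch, not part of LinuxMCE: the helper names (exported_dirs, fstab_nfs_root, md_root_is_exported) are made up for illustration, and it assumes the layouts shown in the excerpts above, i.e. the exported directory is the first field in /etc/exports and the MD's root entry in fstab has "/" as mount point and "nfs" as type.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not LinuxMCE tools.

# Print the exported directory (first field) of every non-comment,
# non-empty line in an exports file.
exported_dirs() {
    awk '!/^[ \t]*#/ && NF { print $1 }' "$1"
}

# Print the server-side path of the NFS root ("/" mount of type nfs)
# found in an fstab file, with the "server:" prefix stripped.
fstab_nfs_root() {
    awk '!/^[ \t]*#/ && $2 == "/" && $3 == "nfs" {
        sub(/^[^:]*:/, "", $1)
        print $1
    }' "$1"
}

# Succeed only if the NFS root named in the fstab file ($2)
# is listed as an exported directory in the exports file ($1).
md_root_is_exported() {
    root=$(fstab_nfs_root "$2")
    [ -n "$root" ] && exported_dirs "$1" | grep -Fqx -- "$root"
}
```

On the core this could be used as, for example, md_root_is_exported /etc/exports /usr/pluto/diskless/46/etc/fstab && echo "export matches" (46 is just the MD number from the example above). It only compares paths; it does not check the network range or the export options.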

Tell us how you went?
Cheers

joerod · August 20, 2011, 04:16:18 PM

Awesome, thanks for the replies. I noticed today that on boot there is a line that says "mounting local filesystem" failed, but the core boots... and the local filesystem is there and writable. I also noticed none of my orbiters (webdt) boot completely (the orbiter screen never connects to the core), and just today the orbiter screen UI1 (on the core) stops loading at 98% (this one is new as of today).

I'm going to try fsck on the core's filesystem (that should work with "touch /forcecheck"), right?

dextaslab · August 20, 2011, 06:07:19 PM

The failure to mount your file system relates to an entry in your PXE configs, for example:
cat /tftpboot/pxelinux.cfg/01-00-19-d1-86-3d-93
...
APPEND initrd=139/initrd.img ramdisk=10240 rw root=/dev/nfs boot=nfs nfsroot=192.168.80.1:/usr/pluto/diskless/49
...

That again is an NFS mount point, which should bring you back to my earlier post.
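
To see which NFS root a given MD will actually try to mount, the nfsroot= value can be pulled out of its PXE config and compared with /etc/exports. The following is a minimal sketch under the assumption that the config looks like the excerpt above; pxe_cfg_name and pxe_nfsroot are hypothetical helper names, not LinuxMCE tools. The file name pattern (01- followed by the lowercase, dash-separated MAC address) is the usual pxelinux per-client naming, which the example file name above follows.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not LinuxMCE tools.

# Convert a MAC address such as 00:19:D1:86:3D:93 into the per-client
# config file name pxelinux looks for: "01-" (Ethernet hardware type)
# followed by the MAC in lowercase with dashes instead of colons.
pxe_cfg_name() {
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F:' 'a-f-')"
}

# Print the value of every nfsroot= option found on the APPEND line(s)
# of a pxelinux config file.
pxe_nfsroot() {
    awk 'toupper($1) == "APPEND" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^nfsroot=/) {
                sub(/^nfsroot=/, "", $i)
                print $i
            }
    }' "$1"
}
```

For example, pxe_nfsroot "/tftpboot/pxelinux.cfg/$(pxe_cfg_name 00:19:D1:86:3D:93)" would print the server and path that MD mounts as its root; the path part should match one of the directories listed in /etc/exports.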

LinuxMCE Forums
