Strange NFS issue (or not)...

tsoukas · November 12, 2012, 11:57:44 PM

Dear community,
Here's something that I never saw before... After convincing a friend to go the LinuxMCE way (instead of going AMX/Crestron/Vantage), I helped him set up 10 rooms and a hybrid/core as a server.
Installation was successful (system will be used for A/V distribution only and all mediaboxes are in the dedicated computer room), and systems one after the other were configured successfully.
When the last one finished, the whole system was powered down and the hybrid/core was powered on first. After a while, WOL kicked in.
This is where the things start to get strange:

Out of the 10 media boxes, 5 booted normally and 5 were stuck at "Mounting root file system". What was amazing was that, even though all booted at the same time through WOL, the first one was OK, the second was not, the third was OK, and so on.
After the five had booted successfully, I started to (manually) shut down and boot each of the others. The sixth came up OK. When the 7th was booted, performance started to degrade considerably, and then all hell broke loose: the core started to reload continuously, the proxy and web orbiters crashed whenever the room was selected, etc.
When the 7th media director was powered down, everything was normal again!

Now you might say that 6 is my lucky number, but is there anything I can do to enhance/correct NFS performance (so that all 10 media boxes can boot concurrently)? Is this the problem, or should I be looking at something else?
The hybrid/core is an IBM x3250 M3 with 4GB RAM, the media directors are Xtreamer Ultra boxes with 2GB each, and all are connected to a 24-port gigabit Ethernet switch.
Any ideas?

Thanks,
ted

iberium · November 14, 2012, 01:38:35 AM

If he was willing to pay for the AMX or Crestron route, you probably should have gone with Dianemo. It could be the switch overloading, or your core NIC. Booting off the network takes a lot of bandwidth, and with that many at one time any flaw in communication could cause strange effects.

iberium · November 14, 2012, 01:39:27 AM

You can install the media directors locally and they will run much faster anyways.

tsoukas · November 14, 2012, 04:59:38 PM

Thanks man.
I have never tried that, but I think it's a good time to do so.
(with a small SSD disk inside, things could be very fast)
To make my life easier, when I first started designing the project specs, I was planning to install LinuxMCE 8.10 final, but at installation time it would not see the IBM controller (ServeRAID M1015).
Anyway, will keep all posted on how it goes.

Thank you.
ted

tschak909 · November 14, 2012, 05:19:38 PM

This system is i/o bound.

It is imperative that you use:

* a fantastic switch. (I have used a Cisco SG200 for my last two installations.)
* as good cabling as possible (I use Category 6A cabling with 8P8C connectors, and actually installed CAT 7 in a recent installation.)
* a small and fast system disk in the core.

-Thom

mkbrown69 · November 14, 2012, 06:50:06 PM

Quote from: tsoukas on November 14, 2012, 04:59:38 PM
(with a small SSD disk inside, things could be very fast)
To make my life easier, when I first started designing the project specs, I was planning to install LinuxMCE 8.10 final, but at installation time it would not see the IBM controller (ServeRAID M1015).

As Thom said, you're I/O bound. Your issues are the result of 10 NFS clients using the NFS shares as their root file system, so you have all sorts of random I/O's happening, including lots of writes due to logging. I'm also presuming that the MySQL database which powers the application logic on the core is residing on the same spindle, so you're probably maxing out the IOPS of a single drive, and hitting wait states and timeouts as a result. 'top' and 'iotop' run on both the core and MD's will tell you for sure.
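If you want a number for the wait states without installing anything, the "wa" figure that top shows can be derived from two samples of the aggregate cpu line in /proc/stat. The following is only a rough sketch: the function name iowait_pct is made up for illustration, and it assumes the usual field order on that line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Assumed field order: cpu user nice system idle iowait irq softirq steal ...
iowait_pct() {
    # $1 and $2: two "cpu ..." lines taken a few seconds apart
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Sum fields 2..9 only; the later guest fields are already part of user time.
        NR == 1 { for (i = 2; i <= 9; i++) t1 += $i; w1 = $6 }
        NR == 2 { for (i = 2; i <= 9; i++) t2 += $i; w2 = $6
                  if (t2 > t1) printf "%.1f\n", 100 * (w2 - w1) / (t2 - t1)
                  else print "0.0" }'
}

# Usage on a live system (sample, wait, sample again):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

A value that stays high on the core while the MDs boot would point at the disk rather than the network.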

Things you can do to improve the situation:

* Install your core's filesystem on an SSD. You'll be going from about 100 IOPS to 4K+ IOPS in one easy step. If you can't re-install or move it, then add an SSD to the system using LVM, and create logical volumes mounted at /var/lib/mysql and /usr/pluto/diskless to host the I/O-intensive workloads. Obviously, you'll need to move the contents of those directories while in single-user mode, then reboot.
* If you can't do SSD, then use the RAID card in RAID 10 mode for the core's file system. SSD would be better though for random I/O. Either way, set the I/O scheduler for that /dev/sdX to 'deadline'.
* If you have lots of memory on your MDs, consider making /tmp a tmpfs ramdisk.
* Consider setting vm.swappiness=1 in sysctl.conf on your MDs to reduce swapping, or install a local disk for swap if you don't have enough memory on the MDs.
* Tune the TCP stack for throughput on the core, and increase your memory buffers. http://www.cyberciti.biz/faq/linux-tcp-tuning/
* Use a really good network card for the internal network. Intel GigE cards are typically the best on Linux; Realtek and NVIDIA on-boards are crap under heavy loads. If you have a capable managed switch and a multi-port NIC (or multiple NICs), you can look at port aggregation for increased bandwidth. Or go 10G between the core and the switch if you have money to spare. No matter what, you need a switch with a non-blocking fabric, capable of full wire speed on all of its ports.
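A couple of the settings above can be tried at runtime before committing them to config files. What follows is a minimal dry-run sketch, not a tested recipe for LinuxMCE: the function names, the APPLY switch, and the default device name sda are all assumptions made for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run wrapper: prints the command unless APPLY=1 is set in the environment.
# (APPLY, run, tune_core and tune_md are hypothetical names for this sketch.)
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

# On the core: switch the I/O scheduler of the system disk to 'deadline'.
# The device name is an assumption; pass the real one as $1.
tune_core() {
    dev="${1:-sda}"
    run sh -c "echo deadline > /sys/block/$dev/queue/scheduler"
}

# On each MD: reduce the tendency to swap. This lasts until reboot only;
# the persistent form is the line "vm.swappiness = 1" in /etc/sysctl.conf.
tune_md() {
    run sysctl -w vm.swappiness=1
}

# A tmpfs /tmp is better set up at boot than mounted over a live /tmp,
# e.g. with an fstab line such as (the size is an arbitrary example):
#   tmpfs  /tmp  tmpfs  defaults,mode=1777,size=256m  0  0
```

Run tune_core or tune_md as root with APPLY=1 once the printed commands look right for your system.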

Hope that helps!

/Mike

Marie.O · November 14, 2012, 06:51:58 PM

* posde wonders how all of a sudden a load of ten PCs doing network booting off a single system is such a big deal

tsoukas · November 16, 2012, 11:36:13 PM

Thom, Mike, thanks for your input. I had to visit my friend to check out the config.
The system has
- 2x Intel Gigabit nics
- 2x 3.5" 500GB SATA disks in RAID 1 configuration (the server is 1U, so disk replacement is difficult) on the IBM ServeRAID M1015 controller
- 4GB RAM (although LMCE reports 2GB total..)
- each media box has 4GB ram as well
- All is connected in 48 port Cisco SF-200 48x10/100 + 2x1000 ports (eth1 of linuxmce + ReadyNas Ultra6 connected)
I also tried a ProCurve 1810-24G (24-port managed gigabit switch), which gave much better boot times (since the media boxes have gigabit adapters) but the same result as far as the number of media boxes that can be ACTIVE concurrently.

I will try your suggestions once I have some time to play with the system. So far, the only way to have a working installation is to power off any media box that is not needed, keeping a maximum of 6 powered on at any time.
As noted before, when the 7th or 8th (it varies) is powered on, the system starts to behave erratically upon attempting media/audio playback (continuous reloading of the core, DCERouter dies and the proxy orbiter collapses). Even if DCERouter is forcefully restarted, the proxy orbiter never wakes up.

Please let me know if you need any logs or anything else from me. Following your helpful suggestion to monitor iotop, I recorded the following session during boot. Notice that even though all 10 media directors fired up, only 8 nfsd sessions were active.

dcerouter_1032148:/home/linuxmce# iotop
Total DISK READ: 2.85 M/s | Total DISK WRITE: 1459.94 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  387 be/3 root        0.00 B/s   91.01 K/s  ?unavailable?  [jbd2/sda1-8]
 1766 be/4 root      538.47 K/s    0.00 B/s  ?unavailable?  [nfsd]
 1767 be/4 root      307.16 K/s    0.00 B/s  ?unavailable?  [nfsd]
 1768 be/4 root      174.43 K/s    3.79 K/s  ?unavailable?  [nfsd]
 1769 be/4 root       64.46 K/s   60.67 K/s  ?unavailable?  [nfsd]
 1770 be/4 root      492.97 K/s    0.00 B/s  ?unavailable?  [nfsd]
 1771 be/4 root      728.07 K/s   11.38 K/s  ?unavailable?  [nfsd]
 1772 be/4 root      420.92 K/s   30.34 K/s  ?unavailable?  [nfsd]
 1773 be/4 root      193.39 K/s    3.79 K/s  ?unavailable?  [nfsd]
 2048 rt/4 asterisk    0.00 B/s    0.00 B/s  ?unavailable?  asterisk -p -U asterisk
    1 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  init
    2 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [kthreadd]
    3 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [migration/0]
    4 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [ksoftirqd/0]
    5 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [watchdog/0]
    6 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [migration/1]
    7 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [ksoftirqd/1]
    8 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [watchdog/1]
    9 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [migration/2]
   10 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [ksoftirqd/2]
   11 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [watchdog/2]
   12 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [migration/3]
   13 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [ksoftirqd/3]
   14 rt/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [watchdog/3]
   15 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [events/0]
   16 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [events/1]
   17 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [events/2]
   18 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [events/3]
   19 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [cpuset]
   20 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [khelper]
CONFIG_TASK_DELAY_ACCT not enabled in kernel, cannot determine SWAPIN and IO %
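
For anyone who wants to compare captures like this one between runs, the nfsd rows can be totalled with a few lines of awk. This is just a sketch: the function name nfsd_summary is made up, and it assumes the column layout of the paste above (rate and unit in fields 4 to 7, command in the last field). On the sample above, the eight nfsd read rates add up to 2919.87 K/s, which is the whole 2.85 M/s total, so all disk reads at that moment were NFS traffic.

```shell
#!/bin/sh
# Hypothetical helper: count [nfsd] rows in saved iotop output read from
# stdin and total their read/write rates in K/s.
# Assumes columns: TID PRIO USER read unit write unit ... COMMAND
nfsd_summary() {
    awk '
        # Convert a rate to K/s based on the unit column printed by iotop.
        function kib(v, u) { return u == "B/s" ? v / 1024 : (u == "M/s" ? v * 1024 : v) }
        $NF == "[nfsd]" { n++; rd += kib($4, $5); wr += kib($6, $7) }
        END { printf "nfsd threads: %d, read: %.2f K/s, write: %.2f K/s\n", n, rd, wr }
    '
}

# Usage: nfsd_summary < saved_iotop_output.txt
```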

I am also attaching the result of dmesg I gathered from the server in case it helps.

This behavior was never noticed in LMCE 8.10. Although this is the first time I have attempted to install such a huge system (up to now the largest number of remote-boot MDs I have installed was 8, without any issues on 8.10), I really hope someone has a clue....
All ideas welcome. I will try to implement Mike's suggestions and report back.

Thanks again for your help,
ted

mkbrown69 · November 17, 2012, 03:53:43 AM

Ted,

The IBM card is coming up in RAID mode (meaning it wasn't flashed into IT mode, which is a passthrough mode). So, you'll want to set the I/O scheduler to 'noop' rather than 'deadline', and let the RAID card handle the re-ordering of I/O operations.

/etc/rc.local

echo noop > /sys/block/sda/queue/scheduler

The default CFQ scheduler just adds unnecessary latency and I/O operations when feeding HBA's and RAID controller cards. It's more useful on desktops/laptops with SATA disks.
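
To confirm the change took effect after a reboot, note that the kernel marks the active scheduler in square brackets in that same sysfs file. A small sketch for pulling it out (the function name active_scheduler is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: print the scheduler shown in [brackets] in a
# sysfs scheduler file, e.g. "noop deadline [cfq]" gives "cfq".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Usage: active_scheduler /sys/block/sda/queue/scheduler
```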

Presuming those are consumer SATA drives, not SAS or enterprise SATA, you'll be limited by the spindle speed. Hopefully those aren't green drives; 7200 RPM or higher would be necessary for the root file system.

That RAID card supports four ports, so depending on how it was cabled at the factory, you might have the other two ports available to you, in which case you could tuck an SSD inside the server case and Velcro it in. It might be using SFF-8087 or 8088 cables, in which case you'll be out of luck for adding internal disks. Take a look at http://public.presalesadvisor.com/LiteratureUploads/Literature-482.pdf which is for the IBM server you identified, and see how it compares to what you have.

Use the GigE switch for production usage. It'll help, but your core issue is disk related. You may also need to tune MySQL, as 5.1, InnoDB tables, and ext4 don't necessarily play well together, resulting in I/O waits which you'll definitely notice as everything is hitting the same set of spindles.

HTH!

/Mike

Marie.O · November 17, 2012, 11:14:30 AM

After reading the details of what happened after changing to an all-GigE switch, it is, at least in my mind, fairly obvious that HDD speed is not the issue. It might very well be that there is some kind of race condition happening within DCERouter. And re ProxyServer not coming up: I experience the same thing, which is why I wrote a small script to restart the ProxyOrbiter devices when stuff like that happens. Interestingly, a plain kill for the ProxyOrbiter isn't enough, you need kill -KILL ... go figure.

To check where the problem is, I would do the following. Start the first six MDs and wait for things to settle down, i.e. all MDs are up and top on the core is back to idle. For giggles, stop UpdateMediaDaemon ( /usr/pluto/bin/UpdateMediaDaemonControl -disable ) and start the 7th MD. See if the problems persist. If they do, pull out the DCERouter.log for the period from boot-up until the strange behavior starts, and see if you can find anything helpful in there.
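
The "wait for things to settle down" step could be scripted roughly like this. It is only a sketch: it uses the 1-minute load average from /proc/loadavg as a stand-in for "top is back to idle", which is an assumption, and the function names and the 0.5 threshold are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers. $1 is the load threshold; $2 is an optional file
# to read instead of /proc/loadavg (handy for trying it out).
load_below() {
    # Succeeds when the 1-minute load average (first field) is below $1.
    awk -v max="$1" '{ exit !($1 < max) }' "${2:-/proc/loadavg}"
}

wait_until_idle() {
    # Poll every 5 seconds until the load drops below the threshold.
    until load_below "$1" "${2:-/proc/loadavg}"; do
        sleep 5
    done
}

# Usage on the core, once the first six MDs are up:
#   wait_until_idle 0.5
#   /usr/pluto/bin/UpdateMediaDaemonControl -disable
#   ...then power on the 7th MD and watch DCERouter.log
```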

LinuxMCE Forums
