Topic: Buggy behaviour with RAID devices- best way to deal with it?

indulis

  • Veteran
  • ***
  • Posts: 147
Buggy behaviour with RAID devices- best way to deal with it?
« on: October 06, 2008, 07:05:52 am »
Hi,

I have an md RAID device for /, and another for /home.  When LMCE starts, it puts a symlink to each of the / and /home devices into /home/public/data/others, and UpdateMedia then goes into an endless loop, descending the same directory tree over and over as it scans for files.  That is, it goes down to /home/public/data/others/md4_SYMLINK, which is a symlink back to the device that is mounted as /home, then goes down again to what is now

 /home/public/data/others/md4_SYMLINK/public/data/others/md4_SYMLINK and then does it again, reaching

 /home/public/data/others/md4_SYMLINK/public/data/others/md4_SYMLINK/public/data/others/md4_SYMLINK

I've changed the actual Symlink name to make this more readable.

I have looked at  3 places where this could be fixed:

1) StorageDevices_Symlinks.sh, which takes all of the disk devices and creates the symlinks.  What I'd do here is check, before making a symlink, that it does not point back up the tree towards /.  In other words, if the link is being put in /home/public/data/others, then the link should not point to a disk/RAID device that is mounted at a higher point in the same path.  That is, no link should point to devices that are mounted on /, /home, /home/public, /home/public/data, or /home/public/data/others, or an infinite loop will result.

2) checkforRaids.sh, where the system checks for RAID storage: knock out any devices associated with a RAID device that is already mounted.

3) UpdateMedia: change it so that it recognises when it starts to recurse and jumps out.  It would not have to be perfect; it could keep the last, say, 255 devices that it scanned in this scan, and check that a directory transition does not take us to a device where we've been before.  If it is a new device, just add it to the list.
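A minimal sketch of the idea behind option 3, in Python purely for illustration (UpdateMedia's own code is not shown in this thread, so none of this is its actual implementation). It is a variation on the "list of devices" idea: instead of remembering devices only, it remembers the (device, inode) pair of every directory it has entered, so a symlink that leads back to a directory already scanned is skipped. The function name `walk_without_loops` is made up for the example.

```python
import os

def walk_without_loops(root):
    """Yield file paths under root, following symlinks but never
    entering the same physical directory twice."""
    seen = set()        # (st_dev, st_ino) of every directory already entered
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            info = os.stat(directory)   # os.stat follows symlinks
        except OSError:
            continue                    # dangling link or unreadable entry
        key = (info.st_dev, info.st_ino)
        if key in seen:
            continue                    # already scanned: this is a loop
        seen.add(key)
        try:
            entries = sorted(os.listdir(directory))
        except OSError:
            continue
        for name in entries:
            path = os.path.join(directory, name)
            if os.path.isdir(path):     # also true for symlinks to directories
                stack.append(path)
            else:
                yield path
```

With a symlink in /home/public/data/others that points back at /home, the walk reaches the link, sees that its target is a directory it has already entered, and moves on instead of descending again. The set is unbounded here rather than capped at 255 entries; a cap would be an optimisation on top of the same check.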

Which should I do?  I am leaning towards #1, as it means RAID devices are still discovered, but UpdateMedia can cope as it doesn't have to traverse infinite loops.  Is there any other use for the symlinks apart from UpdateMedia?

Cheers,

Indulis

tschak909

  • LinuxMCE God
  • ****
  • Posts: 5549
  • DOES work for LinuxMCE.
Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #1 on: October 06, 2008, 01:13:13 pm »
The symlinks are used mostly for UpdateMedia, but also to make sure that all the disks are easily reached within the different media bins in the samba share.

-Thom

Zaerc

  • Alumni
  • LinuxMCE God
  • *
  • Posts: 2256
  • Department of Redundancy Department.
Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #2 on: October 06, 2008, 07:07:00 pm »
Actually, the best way to deal with it is to not put your root and home filesystems on raid like that but only your media shares.  Messing with the plumbing like this will only make things go from bad to worse. 

Now you could try setting the devices to "disabled" in the "devices-tree" and if that doesn't help, try setting "This device is controlled via" to "- Please select -".  Also remove those entries you added for /mnt/device/... from your /etc/fstab.

And just because you have made a mess of things doesn't automatically mean it's buggy behavior. This system can be very fragile; you can't just go move everything about as you please and still expect it to work as intended.  I'm willing to bet that most of us have found this out the hard way, just like you.
"Change is inevitable. Progress is optional."
-- Anonymous


indulis

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #3 on: October 07, 2008, 02:04:34 pm »
Zaerc- please explain to me how wanting to have my operating system separated from user data so I can rebuild it easily, and having it set to RAID to provide reliability is "making a mess of things".

It seems to me that making the system more robust and able to cope with reasonable configurations (like LVM, different RAID configurations etc) is a good idea.

What is wrong with changing UpdateMedia to be smarter and not follow recursive links? That'd fix the problem straight away.
« Last Edit: October 07, 2008, 02:06:55 pm by indulis »

indulis

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #4 on: October 07, 2008, 02:15:18 pm »
OK I thought about it, if the links are also used to export data to other devices, then allowing infinite recursion is probably not a good idea for them, even though a modified UpdateMedia could cope.  So, option 1 sounds OK: don't add symlinks for any device whose mount point is /home/public/data/others or one of its parent directories.

If I was going to stick to something that would not break LMCE, what is the best way to get a root fs (excluding any media files) and, say, a /tmp fs set up as RAID?  I don't want my media files on my / filesystem thanks!  Just really bad practice IMHO. I'd also prefer to NOT have /home in the same FS as / 'cos if, say, a /home/x/y FS does not mount, the / FS then fills up and BLAM.

« Last Edit: October 07, 2008, 02:18:14 pm by indulis »

tschak909

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #5 on: October 07, 2008, 03:22:54 pm »
The system was designed so that a RAID for storing media could be easily added by someone without having to explicitly lay things out.

What Zaerc meant was, "This is the way things expect to be, and if you change them, the underlying system will most definitely break."

As such, there is no easy way to solve this problem, providing RAID to those who want it and not to those who don't, without either:

(1) becoming so restrictive that you have to have x # of drives, and they will be done in Y fashion,
or
(2) having to create a complicated system to handle RAID within the web admin to handle even the most common variations.

:-/

not an easy thing to solve.

-Thom

indulis

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #6 on: October 07, 2008, 03:55:14 pm »
Well I will think about the "exclude /, /home, /home/public, /home/public/data, /home/public/data/others from discovery" idea.  That should work I think.

I realise that we are sort of "de-appliancing" LinuxMCE.

Thanks!

tschak909

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #7 on: October 07, 2008, 03:59:22 pm »
which I will not do.

-Thom

Zaerc

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #8 on: October 07, 2008, 08:06:34 pm »
Well, you can either go with the flow, or fight the system tooth and nail all the way; that's your choice.  Now if you'd like to know how this makes a mess of things, simply have a look at your system: is it more robust now?

And excluding all the paths where your media is supposed to be found may seem like a great idea at first, but in practice you'll find the results rather disappointing.


indulis

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #9 on: October 08, 2008, 05:24:12 am »
Thom,

When I said "de-appliancing" I meant making the system more able to automatically cope with changes in configuration without user intervention.

Zaerc- I am not proposing to stop all sym links and mounts, just to exclude ones that will cause infinite loops.  That is- no symlinks to RAID devices that are mounted as /, /home, /home/public, /home/public/data, and /home/public/data/others (or the equivalent in whatever path the sym links will be going into). That seems a sensible (and non-intrusive) thing to do!

It seems a bit tragic that LMCE can't cope with something  simple like / on RAID.  Which is ESSENTIAL if I am going to hand over control of my home to LMCE.  I don't want a single dead disk on my Core trashing my phone, lights, security, etc all in one fell swoop! I am sure that I am not alone in expecting this of a "core" component.

If this small change can fix it without affecting other operations, why not?

Guys- what is the downside of excluding the recursive sym links?  If there is none, I'd like to do it on my system, and then hand the code over to someone to test for possible inclusion in LMCE.  i.e. contribute code (which is what you both are always such strong advocates for)

I'd propose to look at the mounted md devices, find the ones that are on a bad path, look up the partitions that make up the RAID device, and blacklist them, as per the other "blacklisting" that happens in /usr/pluto/bin/checkforRaids.sh
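A rough sketch of the "look up the partitions that make up the RAID device" step, in Python purely for illustration. It assumes the usual /proc/mdstat layout, where each array has a line such as `md4 : active raid1 sdb3[1] sda3[0]`. The function names are made up, and how checkforRaids.sh actually stores its blacklist is not shown in this thread, so the result here is just a list of device paths.

```python
import re

def raid_members(mdstat_text):
    """Map each md array name to its member partitions,
    e.g. {'md4': ['sda3', 'sdb3']}."""
    members = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if not match:
            continue                    # header, status or blank line
        name, rest = match.groups()
        # Members look like "sda3[0]", optionally followed by a flag
        # such as "(S)" for a spare or "(F)" for a failed disk.
        members[name] = sorted(re.findall(r"(\S+?)\[\d+\]", rest))
    return members

def partitions_to_blacklist(mdstat_text, looping_arrays):
    """Return /dev paths of the partitions behind the named arrays."""
    members = raid_members(mdstat_text)
    blacklist = []
    for name in looping_arrays:
        blacklist.extend("/dev/" + part for part in members.get(name, []))
    return blacklist

SAMPLE_MDSTAT = """\
Personalities : [raid1]
md4 : active raid1 sdb3[1] sda3[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0]
      9767424 blocks [2/2] [UU]

unused devices: <none>
"""

print(partitions_to_blacklist(SAMPLE_MDSTAT, ["md4"]))
```

The sample text is invented for the example; on a real system the input would be the contents of /proc/mdstat, and `looping_arrays` would be the md devices found to be mounted on a bad path.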

I am asking for your expertise in the architecture of LMCE to tell me the consequences of making a small change that then allows users to build a more reliable system.

Anyway, you have to break things in order to fix them and improve them.  Otherwise improvements would never get made!
« Last Edit: October 08, 2008, 05:58:06 am by indulis »

colinjones

  • Alumni
  • LinuxMCE God
  • *
  • Posts: 3003
Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #10 on: October 08, 2008, 05:57:30 am »
indulis - lay out the logic in simple pseudo code in this thread, and what the exact deliverables would be, so that Zaerc, Thom and others can see what you are proposing. Assuming they can find no holes in the logic, it improves the user experience and there are no drawbacks, then I'm sure they will be happy to consider any code you cut based on it, for inclusion.

Problem is, at the moment, it's very high level and conceptual. Show how it would actually fit into the system and work and you will garner the support you are looking for. Alternatively, if there are holes in your logic and they get pointed out then you will be further up the learning curve!

indulis

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #11 on: October 08, 2008, 06:06:36 am »
Thanks Colin!  I will actually write the code anyway and try it on my system.  It is pretty simple and I've already got the shell commands figured out and some idea of how to do it.

My main Q is whether to blacklist the "recursive" md devices during the scan for RAID during boot (/usr/pluto/bin/checkforRaids.sh), or in the script that creates the symlinks (/usr/pluto/bin/StorageDevices_Symlinks.sh).  Right now I think it should be in checkforRaids.sh, which means that as far as LMCE is concerned, these RAID devices don't exist.  And the code is already in there to blacklist devices like swap etc from partitions that LMCE will manage.  But I may lose some automatic management (data scrubbing etc), which is why I thought of doing it at the symlink stage instead, and why I was asking for opinions.

Again, note that I am only talking about md devices that are mounted on
/
/home
/home/public
/home/public/data
/home/public/data/others

or the equivalent path for wherever the RAID device sym links are being created.
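A minimal sketch of that check, in Python purely for illustration: given the directory the symlinks go into, work out that directory plus every directory above it, then pick out the devices mounted on any of them. It assumes input in /proc/mounts format (device, mount point, then other fields, separated by whitespace). The function names and the sample mount table are made up for the example.

```python
import posixpath

def ancestors(path):
    """Return path and every directory above it,
    e.g. '/a/b' -> ['/a/b', '/a', '/']."""
    path = posixpath.normpath(path)
    result = [path]
    while True:
        parent = posixpath.dirname(path)
        if parent == path:              # reached the top of the tree
            break
        path = parent
        result.append(path)
    return result

def devices_to_skip(mounts_text, link_dir):
    """Return {device: mount_point} for block devices mounted on
    link_dir or on any directory above it."""
    risky = set(ancestors(link_dir))
    skip = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        device, mount_point = fields[0], fields[1]
        if device.startswith("/dev/") and mount_point in risky:
            skip[device] = mount_point
    return skip

SAMPLE_MOUNTS = """\
/dev/md1 / ext3 rw 0 0
/dev/md4 /home ext3 rw 0 0
/dev/md5 /mnt/media ext3 rw 0 0
proc /proc proc rw 0 0
"""

print(ancestors("/home/public/data/others"))
print(devices_to_skip(SAMPLE_MOUNTS, "/home/public/data/others"))
```

Only md1 and md4 are flagged in the sample; md5 is mounted outside the path to the link directory, so a symlink to it cannot lead back up the tree and it would still get its link.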

Zaerc

Re: Buggy behaviour with RAID devices- best way to deal with it?
« Reply #12 on: October 08, 2008, 09:10:30 am »
Well, I told you earlier what to try; that is how I'm running, with my rootfs on lvm2/raid5 (and none of my media on it). But if you really don't want to listen and only want to be argumentative, then good luck with it!