LinuxMCE Forums

General => Feature requests & roadmap => Topic started by: johanr on October 14, 2008, 01:08:43 am

Title: Core Redundancy:What if?
Post by: johanr on October 14, 2008, 01:08:43 am
Hi!

This thread is more of a poll: is this possible for me to achieve, or do you as Gurus (yes, you are) see any obstacles that I should be aware of before attempting or even spending time on this?
Maybe the majority of users feel that this is a really good killer feature that could be developed by someone more skilled (obviously I am not)  :)

Having worked in telecoms for a decent number of years, I like the idea of redundancy.
Is there currently any redundancy for the Core, or are there any plans for it?
I know that the MTBF for computer hardware is fairly high, but still I cannot stop thinking about this.
Of course there cannot be total redundancy for things like alarm sensors, phone lines, etc.,
 but...

Let me explain a bit more about what is important to me: doing this without rewriting kernels and patching (because that's way out of my league), but by using the existing components even more (even that is a challenge).

LMCE is controlling the whole house (or actually we are, with the help of..). How cool is that?
Although.
 Assuming you have a proper UPS and that sort of easy redundancy sorted, what happens when something breaks inside the Core, like the CPU, HDD, etc.?
The alarm system: how does it react to a hardware failure in the Core?
Or what happens when the system locks up/freezes totally? Yeah, I know this is not a Windows machine, but it may happen.
Typically the burglar will try to disarm your system, and nothing is impossible. I am not saying that LMCE should be used to guard the Pentagon or the White House, but for most people their belongings are equally important. Are you with me on what I am trying to depict?

I would like to know if it is possible to somehow easily configure the Core to have a close friend (standby).
 So all config files would be stored externally (for example), and the other Core is brought to life (from suspend, maybe) when the Master Core stops responding to whatever is used as a status check (an echo reply, for example).
It then starts to load the config from the external HDDs, and when it's up it polls each device for its status and sends a message to the orbiters (like the UE = mobile phone).
It will also make sure the Master Core is really dead. Maybe even a script in the Master Core will reset it when a certain trigger is met, and a message is broadcast to the orbiters.
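
The status check described above can be sketched as a counter of consecutive missed replies, so that a single lost echo reply does not trigger a takeover. This is only an illustration in Python, not LMCE code: `FailoverMonitor`, `max_missed`, and the threshold of three are assumptions, and how the reply is actually obtained (ping, TCP connect, ...) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Counts consecutive missed status checks of the Master Core (hypothetical helper)."""
    max_missed: int = 3  # assumed threshold, not an LMCE setting
    missed: int = 0

    def record(self, master_replied: bool) -> bool:
        """Record one check result; return True when the standby should take over."""
        if master_replied:
            self.missed = 0  # any reply resets the counter
        else:
            self.missed += 1
        return self.missed >= self.max_missed


# Example run: one reply in the middle resets the count, so only the
# final run of three misses in a row asks for a takeover.
monitor = FailoverMonitor(max_missed=3)
checks = (True, False, False, True, False, False, False)
decisions = [monitor.record(ok) for ok in checks]
```

Requiring several misses in a row is a simple guard against acting on one dropped packet; it does not by itself prove that the Master Core is really dead, which is the harder part of the idea.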

Another unanswered question (in my head) is how the routing could be sorted, however..
I am not looking for an HSRP/VRRP setup (even though it would be really nice) but something simpler.

To my understanding this should be possible to achieve in the current LMCE. Am I just being dumb or is it possible?

Sorry for all the text, but I want to make sure my idea is understood and not taken as an "I want this, please deliver me this.." kind of thread, because I would like to contribute what I can, if possible of course (my time and skills are kind of show-stoppers though)..

Br
Johan

Title: Re: Core Redundancy:What if?
Post by: syphr42 on October 14, 2008, 02:17:26 am
I'm interested in this as well. I tried Pluto a while ago and I have been toying around with LMCE a little bit, but I am afraid to fully commit to the whole-house solution because I haven't seen a way to allow redundancy. I am worried that I will hand over full control of the house, and then the Core will die and the family will never let me forget the time when there was "no tv for a week."

On my network right now there is a database server that provides databases for Bacula (backup solution), several MythTV boxes, Zarafa (Exchange-type mail), etc., and it uses a RAID 5 setup with nightly backups to another RAID 5 file server. Is it possible to externalize the LMCE database so that, in the event of catastrophic hardware failure, I could bring another machine online and just tell it to use the same database? Or is there just too much going on behind the scenes to make this kind of thing work?

By the way, thanks to everyone that works on this project. I am truly excited about implementing a whole house solution eventually (hopefully sooner, rather than later).
Title: Re: Core Redundancy:What if?
Post by: totallymaxed on October 14, 2008, 02:51:35 pm
The simple answer is... No redundancy at the level you are describing here.

However you can RAID your storage in a separate NAS and you can back that RAID'd data up off site too.

But none of that protects you from a motherboard failure in your Core, for example... if you get a major failure in the Core's hardware then that will bring your whole system down. Now you could have a 100% identical Core ready for that eventuality and have a duplicate of the boot drive ready to go... then if you get a failure you just power up the backup Core and you're back in business.

However I have to say that this approach is paranoid in the extreme ;-)

My current home Core has been up 24/7 for nearly 2 years now... so the MTBF is pretty good.

All the best

Andrew
Title: Re: Core Redundancy:What if?
Post by: johanr on October 14, 2008, 09:29:28 pm
 ;D I know..
Although, Murphy (the guy who makes the impossible happen) is a close friend of mine, and I would actually assume that if I set up the Core with its security functions, it will go down when I'm on holiday with the family in a country far, far away..
It would be OK if only the TV or video sessions were disturbed, but when talking security we bring the system uptime to a different level. At least I expect the system to have some sort of parachute in case something goes wrong.
 Just a very hot summer day can put a "computer" to sleep.
 OK, I am not going to try to convince the skeptics about the positive side of node redundancy :)
I am strange, I know  ;D

OK, I will start a small project of my own on this then.
I will start with that RAID on a NAS, beginning with the media folders (video, audio and pictures) to get something to base it on. That sounds like something that already exists in the wiki; I will search and read that.
The next step will be to figure out how, and if it is possible, to have a system sleeping (to save power) and still have it poll from the same node (sounds a bit contradictory, I know...).
Then I will start experimenting with how to wake up a PC. I have seen the option in the BIOS but never used it for real. I think there is even a wiki page for that regarding media directors.
I don't see much more than that that has to be settled to make this work, actually.
 Except for the hardware cost and setup if going for full redundancy with PLCBUS control etc.  :-\
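
For the wake-up step, the BIOS option mentioned above is normally Wake-on-LAN, which reacts to a "magic packet": six 0xFF bytes followed by the target's MAC address repeated 16 times. A minimal Python sketch follows, assuming the standby machine has Wake-on-LAN enabled; the function names, the broadcast address, and UDP port 9 are illustrative choices, not anything taken from LMCE.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 bytes of 0xFF, then the MAC repeated 16 times (102 bytes in total).
    return b"\xff" * 6 + mac_bytes * 16


def send_wake_on_lan(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local network (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


# Usage (hypothetical MAC of the standby Core):
# send_wake_on_lan("00:11:22:33:44:55")
```

The packet has to come from a machine that is still awake, so the polling would need to live somewhere other than the sleeping standby itself.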

I am going to try a standby Core first that has similar hardware but only controls/monitors the security, though. Otherwise the $$$ will make this not so attractive.

Questions:

For LMCE, is there a software architecture page that describes the architecture in more detail? Or is it common Linux/Kubuntu architecture/knowledge that applies?
I would just like to find out which directories are vital for a Core to come up and which could be left out since they may be node-specific (hardware config etc.).
What happens with the alarm when everything is locked and the Core dies?
 Will it unlock itself, or stay armed?

Let's see how far I get in the limited "free" time given as a parent. I will do extensive research on the matter, and maybe it really is overkill in the extreme :)

Br
Johan
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 14, 2008, 10:03:08 pm
I suggest looking at the Programmer's Guide on the wiki; it includes a description of the architecture.

It is a highly distributed messaging system that is very strongly typed (every possible command, event, and piece of data is defined in the master database, compiled into C++ code, and used throughout the system). As such, while each piece is relatively small, the interaction of all the individual pieces is where the complexity comes into play.

While the individual devices can be distributed to other machines, there is always one DCERouter, and certain devices are plugins that must run in the DCERouter's memory space (because they need access to the others' data structures to intercept messages, etc.).
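
To picture why plugins have to live inside the router, here is a toy Python sketch of a single in-process message router whose interceptors see every message before it is delivered. It is not the DCE API: `ToyRouter`, `Message`, the device IDs, and the command names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    sender: int   # hypothetical device IDs
    target: int
    command: str


@dataclass
class ToyRouter:
    """One central router; interceptors run in its process and see all traffic."""
    handlers: Dict[int, Callable[[Message], None]] = field(default_factory=dict)
    interceptors: List[Callable[[Message], bool]] = field(default_factory=list)

    def route(self, msg: Message) -> str:
        # Plugin-style interceptors get the first look and may claim the message.
        for intercept in self.interceptors:
            if intercept(msg):
                return "intercepted"
        handler = self.handlers.get(msg.target)
        if handler is None:
            return "undeliverable"
        handler(msg)
        return "delivered"


# Example wiring: device 2 records commands, and a plugin claims "disarm".
received: List[str] = []
router = ToyRouter()
router.handlers[2] = lambda m: received.append(m.command)
router.interceptors.append(lambda m: m.command == "disarm")
```

Even in this toy, the routing table and interceptor list are plain in-memory state of one process, which hints at why a second router cannot simply be started alongside the first.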

The DCE devices themselves often wrap other pieces of software, exposing a command interface for the other DCE devices, and there is also a boatload of custom scripts below that deal with system configuration (well over 260 scripts at current count), so that this system can behave as an appliance.

So, as you can see, very complex, but in a good way. It does mean that research into this area is very much a long term venture.

-Thom
Title: Re: Core Redundancy:What if?
Post by: johanr on October 14, 2008, 10:37:46 pm
Thank you, yeah, I think I have grasped the complexity of the architecture, meaning I will not be able to learn what everything is doing in this life :)

So, just to make sure I understood you right.
 You basically mean that there is no (easy) way of setting up two nodes (Cores) the same way, having one in something like a hibernate state, and then making it take the active role when the main Core dies?
Of course there will always be a difference in the hardware (like MAC addresses and such), but excluding that and concentrating only on the security parts.
Sorry for the newbie-type questions..

But that's really helpful info, since I would more or less be wasting my time on something that is not really going to work :-\  or maybe be more challenged to get it working  ;D

Will do some more reading on that page. How could I have missed that one? (Found it now.)

-johan


Title: Re: Core Redundancy:What if?
Post by: colinjones on October 14, 2008, 10:48:36 pm
The only "easy" way that springs to mind would be to run your Core as dedicated (no MD) in a virtual machine with VMware. There are multiple options with that platform for regularly taking exact snapshots of VMs, or sharing storage, maybe using a shared SCSI bus or iSCSI, or even VMotion if you're prepared to go to ESX.

That way you could set up 2 physically separate machines running VMware and something to perform a heartbeat and trigger failover to the second piece of hardware when needed. As the image is identical, the rest of your LMCE need not know anything. There is even a level of virtualisation of switching/routing in VMware that would allow the 2 NICs to be set up to fail over as well.

Could be an expensive proposition though, as some of the advanced features are only available in the commercial versions of VMware...
Title: Re: Core Redundancy:What if?
Post by: patmankn on October 15, 2008, 06:37:00 pm
Well,

High availability with open source should be quite common. There are several projects and papers about this issue.
I wrote my diploma thesis about that *siiggh*, where I compared three solutions for an HA framework.
With redundancy in mind, it should be determined first what's important to keep alive. The Core? The DB? The filesystem? Both? How about the TV cards (keep in mind THE week w/o TV!)? The easiest way to increase HA is to go for RAID 1, hoping Murphy won't grill both HDs, so you can just keep a second "standby" Core, switch the HD over, and rebuild the RAID 1.

Things like heartbeat etc. could do the trick as well.
The "winner" of the conclusion was http://www.openais.org, an open-source implementation of the Service Availability Forum specifications, as the idea was to keep several independent software modules online.
The company I worked for never implemented anything based on my thesis .... I wonder why ;O)

Havin' this in mind... HA is pure pain!

Some literature:
http://weblog.infoworld.com/geeks/archives/2007/02/achieve_more_re.html

http://www.linux-ha.org/

and a presentation: http://www.linux-ha.org/_cache/HeartbeatTutorials__LWCE08-ha-tutorial.pdf



Title: Re: Core Redundancy:What if?
Post by: johanr on October 15, 2008, 09:52:40 pm
Gee, thanks... Well, as you say, it has to be decided what's important to protect and at what cost.

When it comes to the alarm, which was the main reason for concern, I am starting to doubt the actual need for redundancy.
 After all, if I decide to keep the normal way of opening a door with a key, the house is at least not open.
I will definitely search further into the matter, but it seems closer to reality to start with RAIDed discs (discs have proven to have shorter lives) and also a good UPS system.

If Totallymaxed has had a Core running for two years (with regular maintenance, I presume), then there should at least be a possibility that mine will as well, if I just keep my hands off when all is working..

Had my first freeze (small crash) yesterday, and that one made me feel safer with this software.

 After choosing a video (DVD), when it started I hit F7 and then menu (to skip the foreplay).
Then, when going to the DVD options and choosing what I wanted, it simply closed the video and froze.
As a seasoned windose user I felt like, ohh! crap.. now I need to reset it. But then I saw that the media director program closed and I came to the LMCE manager, I think it's called. Then after a few seconds the media director came back up again...
That impressed me just a little bit (quite a lot actually). To me it feels like there is at least some sort of code checking that all processes are up and, if not, restarting them. Wow!

Is there such a "documented" function that watches over the processes and restarts one if it crashes? Or was I just lucky?

In that case I feel my worry about the Core going into a windos-freeze kind of mode is less likely to come true.


-johan
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 15, 2008, 10:12:33 pm
DCE itself has a watchdog process which watches every thread; if a particular thread takes too long to execute (60 seconds), it will kill the router and send a message to the launch manager to restart the DCERouter and the associated devices.
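
The timeout idea can be illustrated with a small Python sketch. This is not the DCE watchdog itself: `ToyWatchdog` and its method names are hypothetical, only the 60-second limit comes from the description above, and the kill/restart step is left to the caller.

```python
import time
from typing import Callable, Dict, List


class ToyWatchdog:
    """Tracks when each task started and reports those running past the timeout."""

    def __init__(self, timeout_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tried without waiting
        self.started: Dict[str, float] = {}

    def task_started(self, name: str) -> None:
        self.started[name] = self.clock()

    def task_finished(self, name: str) -> None:
        self.started.pop(name, None)

    def stuck_tasks(self) -> List[str]:
        """Names of tasks that have been running longer than timeout_s."""
        now = self.clock()
        return sorted(n for n, t0 in self.started.items() if now - t0 > self.timeout_s)
```

A supervising loop would call `stuck_tasks()` periodically and, if the list is non-empty, trigger whatever restart action applies; in the system described above that action is a reload of the whole router.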

There are also other associated scripts for each of the major daemons, such as Asterisk, MythTV, etc.

-Thom
Title: Re: Core Redundancy:What if?
Post by: johanr on October 15, 2008, 10:38:34 pm
 ::) Makes me wonder why I even bothered thinking about node redundancy..
Thanks.. unless the watchdog dies, I think I can live without node redundancy and feel pretty safe, actually.

 Because hanging/freezing or crashing was one of my main concerns (as a regular windoz user I'm used to it..)

 Thanks!
 You just saved me a lot of time...


-Johan
Title: Re: Core Redundancy:What if?
Post by: Marie.O on October 15, 2008, 11:19:20 pm
I wonder if this could be integrated into a heartbeat-type setup with a hot standby. Hmmm

Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 16, 2008, 12:10:51 am
People really need to stop and thank Pluto for designing a system that was ultimately to be used as an appliance. The sheer amount of forethought that went into this system is nothing short of holy shit staggering.

-Thom
Title: Re: Core Redundancy:What if?
Post by: totallymaxed on October 17, 2008, 12:19:56 pm
Quote from: tschak909
DCE itself has a watchdog process which watches every thread, and if a particular thread takes too long to execute (60 seconds), it will kill the router, and send a message to the launch manager to restart the DCERouter and the associated devices.

There are also other associated scripts for each of the major daemons, such as Asterisk, MythTV, etc.

-Thom

The problem with that, though, is that it brings the whole system to a halt while the router reload happens. In some situations this is a pain, e.g. I'm watching a movie and some device thread dies... and the watchdog decides to reload the router!... my movie playback gets killed for possibly 1-2 mins on a big complex system while the reload happens... then I have to manually restart my movie. Not really very nice at all.

Ideally we need to be able to resurrect a thread without having to restart the whole router to do it...

Andrew
Title: Re: Core Redundancy:What if?
Post by: johanr on October 17, 2008, 12:41:20 pm
Quote from: totallymaxed
The problem with that, though, is that it brings the whole system to a halt while the router reload happens. In some situations this is a pain, e.g. I'm watching a movie and some device thread dies... and the watchdog decides to reload the router!... my movie playback gets killed for possibly 1-2 mins on a big complex system while the reload happens... then I have to manually restart my movie. Not really very nice at all.

Ideally we need to be able to resurrect a thread without having to restart the whole router to do it...

Andrew

Yes, that's true. That's of course a drawback. How often does this happen for you (or your customers)? (Router reload due to the watchdog.)
However, I would rather have this than the whole router getting jammed because of a thread doing nothing.

-johan
Title: Re: Core Redundancy:What if?
Post by: fryed_1 on October 21, 2008, 11:25:38 pm
I suppose it would be possible to have a separate DB box running MySQL, so you can cluster it on a RAID 5 filesystem and all that.  That keeps your database going no matter what.  Same with your media files.

Two identical, replicated boxes could sit behind a load balancer on your network, each with backend connections to the database.  In the event of the primary box failing, the load balancer could shift all traffic from that box to the backup one.

Not sure how the media directors would react to that though, unless you had some scripts to automate changing of channels on the two router boxes so they stayed in sync.  And you'd still have to reload movies, TV shows and the like in the event of a failover.

But if you were absent at the time and the primary box went down, you could at least rest assured that security would only be down for a minute or two once the failover took place.  You could probably set up a reload router script that could also be triggered manually from the command line, to ensure that a failover situation started fresh.
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 21, 2008, 11:26:57 pm
*shake-head*

*head-in-hands*

guys, stop thinking with duct tape! damn it.

-Thom
Title: Re: Core Redundancy:What if?
Post by: fryed_1 on October 21, 2008, 11:31:09 pm
I used half a roll of duct tape and some cardboard to replace a radiator cap that lasted 5 months before I fixed it. 

Don't shun the duct tape :P
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 21, 2008, 11:33:15 pm
I'm being very serious.

You guys don't seem to be thinking these overcomplicated ideas all the way through, and in the process you're introducing UNRELIABILITY in order to make things reliable.

Come on guys, study how the system actually works, and then you might be able to make some properly educated guesses on how to make the system more reliable.

Yes, I do sound like an asshole right now. Tough. I'm trying to beat this into your heads.

-Thom
Title: Re: Core Redundancy:What if?
Post by: syphr42 on October 22, 2008, 04:43:56 am
These ideas may seem overly complex, but if you put everything into one box with no redundancy and no disaster recovery plan, you are setting yourself up for failure. Now, if you don't really rely on the system, that's another story. For many people, this added complexity would all be a waste of time and money. However, I'm sure most people in the IT industry will agree that when it comes to mission critical, you need redundancy and a plan. All of the software improvements we could come up with won't help in the event of a hardware failure if you have no redundancy. If you want to reduce downtime for a system that could potentially control all of your media, security, climate controls, and lighting (and I'm sure much more), there has to be a way to build in redundant hardware. It would probably be a good idea to split up some of the functions as well, like telecom, security, and media director coordination (and I've heard rumors that it may already be possible to split some things apart).

All of that being said, I'm not suggesting that the people who work on improving LMCE should devote time to building in elaborate mechanisms that, in all likelihood, only a few people will use. I just think it's an interesting topic for discussion and something to think about if you really are going to rely entirely on a single box.
Title: Re: Core Redundancy:What if?
Post by: colinjones on October 22, 2008, 05:20:18 am
As I say, I've already suggested one option: use something like VMware Server and snapshot to a second VMware host. This will give you a perfect copy of the Core on a separate piece of hardware that can come up at a moment's notice. VMware has enough virtual networking options to allow the networking to fail over transparently, and there are plenty of heartbeat/failover options to automate it, too.

I think redundancy in this context is way down the list of priorities, but if you really want it, there is at least one option just there!
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 22, 2008, 05:38:56 am
and once again, you guys misunderstood me.

*hmm* do I have to spell it all out for those of you too slow to get it?

We have... a message passing architecture....

This means, that ANY redundancy solution NEEDS TO START HERE.

This means, real engineering, and possibly, running two DCE routers.. then you have synchronization issues...how do you solve those?

The database is tied heavily into the nature of the message router, not to mention, state is maintained in memory. Since DCERouter is a multithreaded application, this gives rise to possible locking problems... I can go on, but I hope some of you are starting to at least get the picture, that this isn't as simple as replicating the damn database services.

come on people, if you're going to solve the problem, use your brains. Look at how the system works and ACTUALLY ENGINEER A SOLUTION THAT WILL WORK!

What we have right now, works. It is at least able to recover from faults. I'm not saying it's perfect, and we should stop...what I'm saying is, given the complex nature of the system, DUCT TAPE CAN'T BE USED!

-Thom
Title: Re: Core Redundancy:What if?
Post by: colinjones on October 22, 2008, 06:05:54 am
No Thom, I haven't misunderstood you. I understand what you are saying about in-memory system state and fully agree that ideally the software should support redundancy (perhaps with DB syncing through transaction log shipping, an active/passive DCE router architecture and heartbeats, etc.)

I was simply answering a previous point specifically on hardware redundancy. A snapshot would capture the database "as is" along with the entire Core hard drive and VM image. A hardware failure at that point would mean that you could roll back to point-of-snapshot exactly as if you had lost power to the Core at that same point. You would not be able to roll forward to point-of-failure, as you would have lost all DB changes and system state (including stuff like lighting states and other home automation state). With the ability to quiesce the database first, you would potentially be in an even better state than after a power failure (less lost in lazy writes, etc.).

The point is, we can do it at a hardware level if you really want to. I just don't see much point unless the software component is in place (as you say after some real engineering stuff is done to change the architecture to support this). And if the software component is done right, then the hardware component almost becomes irrelevant/unnecessary... I realise your concern is duct tape vs elegance! (And I agree)
Title: Re: Core Redundancy:What if?
Post by: totallymaxed on October 22, 2008, 01:12:40 pm
Quote from: johanr
Yes, that's true. That's of course a drawback. How often does this happen for you (or your customers)? (Router reload due to the watchdog.)
However, I would rather have this than the whole router getting jammed because of a thread doing nothing.

-johan

Well, it's certainly not very frequent, but nevertheless having to reload the whole router to allow any changes to be accepted or to make a change to a single device is not ideal. It's like having to reboot your laptop/PC just because one app was not responding (reloading the router can take some considerable time on a large installation with many devices).

Andrew
Title: Re: Core Redundancy:What if?
Post by: indulis on October 22, 2008, 03:16:49 pm
In my work with UNIX I have set up quite a number of failover systems over the years (IBM HACMP), and also have experience with Oracle RAC which is a highly available clustered database, and GPFS which is a highly available (HA) clustered filesystem.

It can be complex.  I have often recommended the "KISS" approach (no HA failover cluster) because, as Thom rightly points out, if HA clustering is not done right it can make the system *less* reliable, especially with sysadmins who don't know what they are doing.

The way that the failover systems work is that you typically have a set of scripts that start and stop a service (or multiple services) on a server.  There are multiple servers, and they run failover software that has the responsibility to determine which one of the nodes is the "cluster master".  This cluster master node tells the others what to do regarding the cluster, and orders the other nodes (and itself) to start or stop services.  If there is a server failure, then the "cluster master" responsibility passes to another server in the cluster.  Typically there is a voting method to determine who is the cluster master.  In its simplest form it is the node that "owns" a disk (has put its fingerprint on it).  Other, more complex voting schemes require that a node which comes up and can't see >50% of the other nodes in the cluster knows it is not allowed to make itself the master.
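
The ">50% of the other nodes" rule can be written down as a one-line check. The Python sketch below follows the wording above literally; the function name and arguments are illustrative and do not come from any particular cluster product, and real cluster managers may count the majority differently (for example, including the node itself).

```python
def may_become_master(visible_others: int, cluster_size: int) -> bool:
    """Return True if this node sees more than half of the OTHER cluster nodes."""
    if cluster_size < 2:
        raise ValueError("the rule only makes sense with at least two nodes")
    others = cluster_size - 1
    if not 0 <= visible_others <= others:
        raise ValueError("visible_others must be between 0 and cluster_size - 1")
    # Integer form of: visible_others / others > 0.5
    return 2 * visible_others > others
```

In a five-node cluster, a node that sees two of its four peers sits at exactly 50% and must not promote itself, while a node that sees three may. The point of the rule is that two halves of a split network cannot both elect a master.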

Software clustering of a software service, i.e. going from a failover approach to dual active nodes, is very complex to write and make highly available.  You end up writing the HA clustering code within your own software.  Unless you can use someone else's existing code!  Oracle RAC allows all normal applications that run on Oracle to be made highly available, as it puts the complexity into the database service software.

I have been thinking about the same HA requirement.  Hardware failure of the DCE router could be disastrous once you come to integrate it into a house.

There are Linux HA cluster software products too.  I haven't had any experience with them, but SteelEye LifeKeeper is one: http://www.steeleye.com/products/linux/

An open source HA cluster
http://openssi.org/ssi-intro.pdf
http://openssi.org/

...with MySQL
http://wiki.openssi.org/go/MySQL_Clustering

I think this is a very useful thread.

Even if it turned out to be an approach where you manually do the failover to another physical server this would be a good thing. 

Title: Re: Core Redundancy:What if?
Post by: hari on October 22, 2008, 08:22:02 pm
well said, indulus.

Regarding hardware failures, you can build/buy rock solid hardware configurations. More money buys you either a good hardware service contract or a replacement machine. Use a nice hardware raid and swap the disks to the other chassis if necessary.

Clustering this beast is another story. From a DCE perspective at the moment there is only failover. MySQL can be set up active-active. Clustered filesystems have some limitations, DRBD could help. Tying this into the automagic bits of LMCE.. adapting LaunchManager. All doable, but much work.

best regards,
Hari
Title: Re: Core Redundancy:What if?
Post by: johanr on October 22, 2008, 08:55:45 pm
*shake-head*

*head-in-hands*

guys, stop thinking with duct tape! damn it.

-Thom


*Laughing(with you, not at you)*
I can really picture you sitting there, sighing, shaking your head.
 :)


Wow, I like the discussion that was raised, although tschak sees all the work that has to be done. I think it's a fair question.
 Because this, as I see it, really decides where we are going to place LMCE on the future map. Maybe the question is mostly relevant for Totallymaxed and his customers (assuming some sort of support agreement is included with the product?)

I agree with the fact that the constant router reload is a little bit annoying but those reloads are only during install/config change right?

Anyways, what made me stop thinking that the hw redundancy, or as I would like to call it Node redundancy, was crucial was the fact that there is a watchdog resetting threads that are not functioning.
 So there is at least a sw function watching my core when I am away. :)

Also in case of power failure(likely to happen nowadays for some reason) a good UPS with a diesel(or similar) generator needs to be installed to take over.

In case the hw dies for some reason:
 Then the worst case scenario would be that you have no alarm or security and that the hw failure was initiated by the burglars themselves. >:(
But assuming that the doorlocks in the house are properly installed I would guess the house is still locked. They will find their way in if they want... :-\

In case of fire, unless it was started by the core itself catching fire, the core can be configured to give you or the friendly neighbours a warning (as I have understood it)

So in order to get rid of the redundancy holes I listed this to be necessary equipment:
* UPS (the core has to be configured to shut down all unnecessary power (tv, receiver, lamps etc) to save energy until power is back online)
* Diesel generator with starter motor
* Fireproof cabinet for both cores
* Two Cores have to be built
* Two internet connections (one using 3G/Gprs for example and the other using fixed broadband)
* Nas or similar for Hdd redundancy
* Twice the amount of all security sensors/cameras etc
* and the list goes on..
Not to mention all the days/nights/months/years that has to be spent to get everything running...

My point being: I basically knew before starting this thread that for a normal person all of this cannot be achieved and is not really relevant when looking at the whole picture. I just wanted to know if I was the only one having this "concern" and if it would be possible to achieve by easy means (duct tape?). Seems like it is not.

ColinJones, I remember seeing something about VMware and LMCE not functioning as it should (it's too complex) when doing my research

Although.. for a Pluto paying customer, this would be a killer "feature"(maybe I should tell them about the idea?)  :)

So with that said, unless someone pays (big money) for this redundancy (as utilized in IPSO for example) to be developed in LMCE, I of course agree that it does not feel very important for a normal user, as long as they are aware of why and what the sacrifices actually are. (money talks once again)

And it's now being written here in this forum for everyone having the same concern with what it will take.

-johan
Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 22, 2008, 08:59:04 pm
You are far too paranoid.

-Thom
Title: Re: Core Redundancy:What if?
Post by: johanr on October 22, 2008, 09:10:00 pm
Or more like, "I do it right otherwise I don't bother spending the time.."  :)


-johan

Title: Re: Core Redundancy:What if?
Post by: tschak909 on October 22, 2008, 09:44:44 pm
Dude, there comes a point where you have to step back and realise that you can't think of everything. Take a deep breath, and take a serious thought as to whether this really makes sense?

If the electricity goes out in your house, does it really matter?

-Thom
Title: Re: Core Redundancy:What if?
Post by: johanr on October 22, 2008, 10:51:18 pm
Well, that's exactly my point. It doesn't make sense if you don't do the redundancy to the fullest. And I am not so extreme that I am going to do that.
 But yeah, I am thinking a lot!
Currently working on that.. may be corrected in the next release although my employer will not like that release ;D

-->Now I think(!) I have not been very clear in the pre-previous statement. That was to the extreme just to show you(all) my point and to visualize even more what made Me decide Not to bother anymore about redundancy  ;)   <---

Thinking outside LMCE:
If electricity to my house is cut off for a few hours that Will matter.(live in a cold country)
 There have been events where people in my country have been without power for weeks. But when that happens of course no one will prioritise getting the LMCE core up before food is on the table..

-johan
Title: Re: Core Redundancy:What if?
Post by: colinjones on October 22, 2008, 11:16:01 pm
Just a thought, johan, some may have had problems with VMWare but I'm certain that they are not insurmountable as the DVD image is built on a VMWare image (check the xorg.0.log file when it first starts after a build!)
Title: Re: Core Redundancy:What if?
Post by: indulis on November 13, 2008, 05:16:36 pm
Actually, it could be pretty horrible to live in a house where LMCE is tied into all the home automation.

No lights for a start.  Also no phone.  Garage door does not open when you come home.  Automated locks don't open.  Climate control does not work.

I'd be happy with some way to manually move (say) a USB memory stick from one server to another to manually make it the new core.  If all my storage was on the network, I could then access the core data.

Anyway, just ideas. Future fully automated houses will have to address redundancy and availability concerns just like banks do with their vital applications.
Title: Re: Core Redundancy:What if?
Post by: tschak909 on November 13, 2008, 05:20:00 pm
Well even with LMCE handling everything, the light switches etc, are still autonomous...

but think about it.. when the power goes out... so does everything else, right?

At worst, when LMCE goes out, your land phone line goes out... but we all have cell phones, right?

Not to say that redundancy is a bad thing.. It's what we are striving towards.. but in the mean time, it's not a bad thing to trust your house to LMCE.

-Thom
Title: Re: Core Redundancy:What if?
Post by: perspectoff on December 06, 2008, 06:04:15 pm
I know this is a stupid solution, but why not just have a backup hard drive image on an external hard drive? I mean, an external hard drive costs, like, $60 these days?

If the Core burned out somehow, then you just boot from this backup drive on whichever backup computer you have (you could use one of your spare Media Directors, for example).

I keep a backup image of all my mission critical servers hard drives for exactly this possibility.

Virtualization is very labor intensive as a backup/redundancy solution.
Title: Re: Core Redundancy:What if?
Post by: indulis on December 29, 2008, 01:25:19 am
It probably sounds like a reasonable option to "just" back up your core regularly.

So you have to take the core down for an hour or two while you do it.  If you have the air con, phone, all your TV recordings reliant on the core then when do you want to do this?  2 AM backups are not fun to do.

I have been thinking that a 2nd server with http://www.drbd.org/ (http://www.drbd.org/) DRBD might be the way to keep another backup server image "in sync" with the core.  Then a documented procedure to get the 2nd server up and running.

And then another procedure to "fail back" (which is often forgotten and is as hard as the "fail over").  Because you want any updates done to your data over the week it takes you to fix the original server to not be discarded.
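
To show why the "fail back" step matters, here is a small Python sketch of the idea. It is a toy model only and has nothing to do with how DRBD is actually driven. The `Server` class, the `data_version` counter and the function names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    data_version: int = 0     # hypothetical counter, bumped on every write
    is_primary: bool = False


def write(primary: Server) -> None:
    """Record one update on the server that currently acts as the core."""
    if not primary.is_primary:
        raise RuntimeError(f"{primary.name} is not the active core")
    primary.data_version += 1


def replicate(source: Server, target: Server) -> None:
    """Bring the target's data up to date with the source."""
    target.data_version = source.data_version


def fail_over(failed: Server, standby: Server) -> None:
    """The standby takes over the core role from the failed server."""
    failed.is_primary = False
    standby.is_primary = True


def fail_back(current: Server, repaired: Server) -> None:
    """Return the core role to the repaired server without losing updates."""
    if not current.is_primary:
        raise RuntimeError(f"{current.name} is not the active core")
    # Resync first, so updates made during the outage are not discarded.
    replicate(current, repaired)
    current.is_primary = False
    repaired.is_primary = True
```

The point is the order inside `fail_back`: the repaired server has to be resynced from the stand-in before it takes the core role back, otherwise the week of updates made on the stand-in is lost.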
Title: Re: Core Redundancy:What if?
Post by: hari on December 29, 2008, 04:39:18 pm
guys, we are talking about HA (home automation) and not some financial transaction system. You won't have big financial losses or injured people if it is down for an hour or two. If you need security for your Picasso even if the core is down, get a dedicated panel with UPS.

br, Hari