Core Redundancy: What if?

johanr · October 17, 2008, 12:41:20 PM

Quote from: totallymaxed on October 17, 2008, 12:19:56 PM
The problem with that, though, is that this brings the whole system to a halt while the router reload happens. In some situations this is a pain, i.e. I'm watching a movie and some device thread dies... and the watchdog decides to reload the router!... my movie playback gets killed for possibly 1-2 mins on a big complex system while the reload happens... then I have to manually restart my movie. Not really very nice at all.

Ideally we need to be able to resurrect a thread without having to restart the whole router to do it...

Andrew

Yes, that's true. That's of course a drawback. How often does this happen for you (or your customers)? (router reload due to the watchdog)
However, I would rather have this than the whole router getting jammed because of a thread doing nothing.

-johan
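To make the trade-off above concrete, here is a minimal Python sketch of a watchdog that restarts only the worker that died, instead of reloading everything. It is not DCERouter code: `Supervisor`, `add_worker` and `check_once` are hypothetical names made up for illustration, and the sketch only notices threads that have exited, not threads that are alive but stuck.

```python
import threading


class Supervisor:
    """Hypothetical watchdog that restarts only the worker thread that died."""

    def __init__(self):
        self._targets = {}   # worker name -> callable the worker runs
        self._threads = {}   # worker name -> current Thread object
        self.restarts = {}   # worker name -> number of restarts so far

    def add_worker(self, name, target):
        self._targets[name] = target
        self.restarts[name] = 0
        self._start(name)

    def _start(self, name):
        thread = threading.Thread(target=self._targets[name], name=name, daemon=True)
        self._threads[name] = thread
        thread.start()

    def check_once(self):
        """Restart every worker whose thread has exited; return their names."""
        restarted = []
        for name, thread in list(self._threads.items()):
            if not thread.is_alive():
                self._start(name)          # only this worker is touched
                self.restarts[name] += 1
                restarted.append(name)
        return restarted
```

Calling `check_once()` periodically leaves healthy workers (the movie playback, in Andrew's example) untouched. Detecting a hung thread would need something extra, such as each worker reporting a timestamp that the supervisor checks.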

fryed_1 · October 21, 2008, 11:25:38 PM

I suppose it would be possible to have a separate DB box running MySQL so you can cluster it on a RAID 5 filesystem and all that. That keeps your database going no matter what. Same with your media files.

Two identical, replicated boxes could sit behind a load balancer on your network, each with backend connections to the database. In the event of a primary box failure, the load balancer could shift all traffic from that box to the backup one.

Not sure how the media directors would react to that though, unless you had some scripts to automate changing of channels on the two router boxes so they stayed in sync. And you'd still have to reload movies, tv shows and the like in the event of a failover.

But if you were absent at the time and the primary box went down, you could at least rest assured that security would only be down for a minute or two once the failover took place. You could probably set up a reload router script as well, which could be triggered from the command line, to ensure that a failover situation started fresh.
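As a rough illustration of the failover idea in this post, the following Python sketch picks the active box from heartbeat timestamps. Everything in it is an assumption for illustration: `FailoverMonitor`, the node names, and the 5 second timeout are hypothetical, and it says nothing about in-memory router state or running media sessions, which would still be lost on failover.

```python
class FailoverMonitor:
    """Hypothetical heartbeat tracker that decides which box should be active."""

    def __init__(self, primary, backup, timeout=5.0):
        self.primary = primary
        self.backup = backup
        self.timeout = timeout    # seconds without a heartbeat before a box counts as dead
        self.last_seen = {}       # node name -> time of last heartbeat
        self.active = primary

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        seen = self.last_seen.get(node)
        return seen is not None and now - seen <= self.timeout

    def pick_active(self, now):
        """Keep the current active box while it is alive, else switch to the standby."""
        if self.is_alive(self.active, now):
            return self.active
        standby = self.backup if self.active == self.primary else self.primary
        if self.is_alive(standby, now):
            self.active = standby
        return self.active
```

The sketch deliberately does not fail back on its own when the old primary returns, so traffic does not bounce between boxes. The times are passed in as plain numbers so the logic can be tried without real clocks or a network.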

tschak909 · October 21, 2008, 11:26:57 PM

*shake-head*

*head-in-hands*

guys, stop thinking with duct tape! damn it.

-Thom

fryed_1 · October 21, 2008, 11:31:09 PM

I used half a roll of duct tape and some cardboard to replace a radiator cap that lasted 5 months before I fixed it.

Don't shun the duct tape

tschak909 · October 21, 2008, 11:33:15 PM

i'm being very serious.

you guys don't seem to think these overcomplicated ideas all the way through, and in the process you're introducing UNRELIABILITY in order to make things reliable?

come on guys, study how the system actually works, and then you might be able to make some properly educated guesses on how to make the system more reliable.

Yes, I do sound like an asshole right now. Tough. I'm trying to beat this into your heads.

-Thom

syphr42 · October 22, 2008, 04:43:56 AM

These ideas may seem overly complex, but if you put everything into one box with no redundancy and no disaster recovery plan, you are setting yourself up for failure. Now, if you don't really rely on the system, that's another story. For many people, this added complexity would all be a waste of time and money. However, I'm sure most people in the IT industry will agree that when it comes to mission critical, you need redundancy and a plan. All of the software improvements we could come up with won't help in the event of a hardware failure if you have no redundancy. If you want to reduce downtime for a system that could potentially control all of your media, security, climate controls, and lighting (and I'm sure much more), there has to be a way to build in redundant hardware. It would probably be a good idea to split up some of the functions as well, like telecom, security, and media director coordination (and I've heard rumors that it may already be possible to split some things apart).

All of that being said, I'm not suggesting that the people who work on improving LMCE should devote time to building in elaborate mechanisms that, in all likelihood, only a few people will use. I just think it's an interesting topic for discussion and something to think about if you really are going to rely entirely on a single box.

colinjones · October 22, 2008, 05:20:18 AM

As I say, I've already suggested one option: use something like VMware Server and snapshot to a second VMware host. This will give you a perfect copy of the core on a separate piece of hardware that can come up at a moment's notice. VMware has enough virtual networking options to allow the networking to fail over transparently, and there are plenty of heartbeat/failover options to automate it, too.

I think redundancy in this context is way down the list of priorities, but if you really want it, there is at least one option just there!

tschak909 · October 22, 2008, 05:38:56 AM

and once again, you guys misunderstood me.

*hmm* do I have to spell it all out for those of you too slow to get it?

We have... a message passing architecture....

This means, that ANY redundancy solution NEEDS TO START HERE.

This means, real engineering, and possibly, running two DCE routers.. then you have synchronization issues...how do you solve those?

The database is tied heavily into the nature of the message router, not to mention, state is maintained in memory. Since DCERouter is a multithreaded application, this gives rise to possible locking problems... I can go on, but I hope some of you are starting to at least get the picture, that this isn't as simple as replicating the damn database services.

come on people, if you're going to solve the problem, use your brains. Look at how the system works and ACTUALLY ENGINEER A SOLUTION THAT WILL WORK!

What we have right now, works. It is at least able to recover from faults. I'm not saying it's perfect, and we should stop...what I'm saying is, given the complex nature of the system, DUCT TAPE CAN'T BE USED!

-Thom

colinjones · October 22, 2008, 06:05:54 AM

No Thom, I haven't misunderstood you. I understand what you are saying about in-memory system state and fully agree that ideally the software should support redundancy (perhaps with db syncing through transaction log shipping, active/passive DCE router architecture and heartbeats, etc).

I was simply answering a previous point specifically on hardware redundancy. A snapshot would capture the database "as is" along with the entire core hard drive and VM image. A hardware failure at that point would mean that you could roll back to point-of-snapshot exactly as if you had lost power to the core at that same point. You would not be able to roll forward to point-of-failure, as you would have lost all db changes and system state (including stuff like lighting states and other HA stuff). With the ability to quiesce the database first, you would potentially be in an even better state than after a power failure (less lost in lazy-writes, etc).

The point is, we can do it at a hardware level if you really want to. I just don't see much point unless the software component is in place (as you say after some real engineering stuff is done to change the architecture to support this). And if the software component is done right, then the hardware component almost becomes irrelevant/unnecessary... I realise your concern is duct tape vs elegance! (And I agree)
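The difference between rolling back to point-of-snapshot and rolling forward to point-of-failure can be shown with a toy Python sketch. It models state as a plain dictionary and the shipped transaction log as a list of changes; `apply_change` and `restore` are hypothetical helpers, not anything from MySQL, VMware or LMCE.

```python
import copy


def apply_change(state, change):
    """Apply one logged change, given as a (key, value) pair, to the state."""
    key, value = change
    state[key] = value


def restore(snapshot, shipped_log=()):
    """Rebuild state from a snapshot, then replay any log entries that were shipped."""
    state = copy.deepcopy(snapshot)
    for change in shipped_log:
        apply_change(state, change)
    return state


# State on the live core at the moment the snapshot is taken.
live = {"hall_light": "off"}
snapshot = copy.deepcopy(live)

# Changes made after the snapshot; the log is what log shipping would copy away.
log = []
for change in [("hall_light", "on"), ("alarm", "armed")]:
    apply_change(live, change)
    log.append(change)

rolled_back = restore(snapshot)          # snapshot only: later changes are gone
rolled_forward = restore(snapshot, log)  # snapshot plus shipped log: matches live
```

With the snapshot alone, the light and alarm changes are lost; replaying the shipped log brings the copy back in line with the live state. State that only ever lived in the router's memory would not be in either, which is the part that needs the software-level work discussed above.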

totallymaxed · October 22, 2008, 01:12:40 PM

Quote from: johanr on October 17, 2008, 12:41:20 PM
Quote from: totallymaxed on October 17, 2008, 12:19:56 PM
The problem with that, though, is that this brings the whole system to a halt while the router reload happens. In some situations this is a pain, i.e. I'm watching a movie and some device thread dies... and the watchdog decides to reload the router!... my movie playback gets killed for possibly 1-2 mins on a big complex system while the reload happens... then I have to manually restart my movie. Not really very nice at all.

Ideally we need to be able to resurrect a thread without having to restart the whole router to do it...

Andrew

Yes, that's true. That's of course a drawback. How often does this happen for you (or your customers)? (router reload due to the watchdog)
However, I would rather have this than the whole router getting jammed because of a thread doing nothing.

-johan

Well, it's certainly not very frequent, but nevertheless having to reload the whole router to allow any changes to be accepted, or to make a change to a single device, is not ideal. It's like having to reboot your laptop/PC just because one app was not responding (reloading the router can take some considerable time on a large installation with many devices).

Andrew

indulis · October 22, 2008, 03:16:49 PM

In my work with UNIX I have set up quite a number of failover systems over the years (IBM HACMP), and also have experience with Oracle RAC which is a highly available clustered database, and GPFS which is a highly available (HA) clustered filesystem.

It can be complex. I have often recommended the "KISS" process (no HA failover cluster) because, as Thom rightly points out, if HA clustering is not done right it can make the system *less* reliable, esp. with sysadmins that don't know what they are doing.

The way that the failover systems work is that you typically have a set of scripts that start and stop a service (or multiple services) on a server. There are multiple servers, and they run failover software that has the responsibility to determine which one of the nodes is the "cluster master". This cluster master node tells the others what to do re the cluster, and orders the other nodes (and itself) to start or stop services. If there is a server failure, then the "cluster master" responsibility passes to another server in the cluster. Typically there is a voting method to determine who is the cluster master. In its simplest form it is the node that "owns" a disk (has put its fingerprint in it). Other, more complex cluster voting schemes require that if a node comes up and can't see >50% of the other nodes in the cluster, it knows it is not allowed to make itself the master.
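The ">50% of the other nodes" rule can be written down in a few lines of Python. This is only a sketch of the rule as worded above, not how HACMP or any other product implements it; `has_quorum` and `may_become_master` are hypothetical names, and allowing a lone node with no peers to be master is an assumption of the sketch.

```python
def has_quorum(visible_peers, total_peers):
    """True if strictly more than 50% of the other nodes are visible."""
    return 2 * visible_peers > total_peers


def may_become_master(node, reachable, cluster):
    """Decide whether `node` may make itself cluster master.

    `cluster` lists all node names; `reachable` is the set of nodes this node can see.
    """
    others = [n for n in cluster if n != node]
    if not others:
        # Assumption: a single-node "cluster" has nobody to disagree with.
        return True
    visible = [n for n in others if n in reachable]
    return has_quorum(len(visible), len(others))
```

In a five-node cluster, a node that sees three of its four peers may become master, while one that sees exactly two (50%) may not. The strict "greater than" is what stops both halves of an evenly split cluster from each electing their own master.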

Software clustering of a software service, i.e. going from a failover approach to dual active nodes, is very complex to write and make highly available. You end up writing the HA clustering code within your own software. Unless you can use someone else's existing code! Oracle RAC allows all normal applications that run on Oracle to be made highly available, as it puts the complexity into the database service software.

I have been thinking about the same HA requirement. Hardware failure of the DCE router could be disastrous once you come to integrate it into a house.

There are Linux HA cluster software products too. I haven't had any experience with them, but Steeleye Lifekeeper is one http://www.steeleye.com/products/linux/

An open source HA cluster
http://openssi.org/ssi-intro.pdf
http://openssi.org/

...with MySQL
http://wiki.openssi.org/go/MySQL_Clustering

I think this is a very useful thread.

Even if it turned out to be an approach where you manually do the failover to another physical server this would be a good thing.

hari · October 22, 2008, 08:22:02 PM

well said, indulis.

Regarding hardware failures, you can build/buy rock solid hardware configurations. More money buys you either a good hardware service contract or a replacement machine. Use a nice hardware RAID and swap the disks to the other chassis if necessary.

Clustering this beast is another story. From a DCE perspective, at the moment there is only failover. MySQL can be set up active-active. Clustered filesystems have some limitations; DRBD could help. Then there is tying this into the automagic bits of LMCE and adapting launchmanager. All doable, but much work.

best regards,
Hari

johanr · October 22, 2008, 08:55:45 PM

Quote from: tschak909 on October 21, 2008, 11:26:57 PM
*shake-head*

*head-in-hands*

guys, stop thinking with duct tape! damn it.

-Thom

*Laughing(with you, not at you)*
I can really picture you sitting there, sighing, shaking your head.

Wow, I like this discussion that was raised, although tschak sees all the work that has to be done. I think it's a fair question.
Because this, as I see it, really shows where we are going to place LMCE on the future map. Maybe the question is mostly relevant for Totallymaxed and his customers (assuming some sort of support agreement is included with the product?).

I agree that the constant router reload is a little bit annoying, but those reloads only happen during install/config changes, right?

Anyway, what made me stop thinking that hw redundancy (or, as I would like to call it, node redundancy) was crucial was the fact that there is a watchdog resetting threads that are not functioning.
So there is at least a sw function watching my core while I am away.

Also, in case of power failure (likely to happen nowadays for some reason), a good UPS with a diesel (or similar) generator needs to be installed to take over.

In case the hw dies for some reason:
Then the worst case scenario would be that you have no alarm or security and that the hw failure was initiated by the burglars themselves.

But assuming that the door locks in the house are properly installed, I would guess the house is still locked. They will find their way in if they want... :-\

In case of fire (unless it is caused by the core itself catching fire), the core can be configured to give you or the friendly neighbors a warning (as I have understood it).

So in order to get rid of the redundancy holes, I listed this as necessary equipment:
* UPS (the core has to be configured to shut down all unnecessary power (TV, receiver, lamps etc.) to save energy until power is back online)
* Diesel generator with starter motor
* Fireproof cabinet for both cores
* Two cores have to be built
* Two internet connections (one using 3G/GPRS for example and the other using fixed broadband)
* NAS or similar for HDD redundancy
* Twice the amount of all security sensors/cameras etc.
* and the list goes on..
Not to mention all the days/nights/months/years that have to be spent to get everything running...

My point being: I basically knew before starting this thread that, for a normal person, all of this cannot be achieved and is not really relevant when looking at the whole picture. I just wanted to know if I was the only one having this "concern" and if it would be possible to achieve by easy means (duct tape?). Seems like it is not.

ColinJones, I remember seeing something about VMware and LMCE not functioning as it should (it's too complex) when doing my research.

Although.. for a paying Pluto customer, this would be a killer "feature" (maybe I should tell them about the idea?)

So with that said, unless someone pays (big money) for this redundancy (as utilized in IPSO, for example) to be developed in LMCE, I of course agree that it does not feel very important for a normal user, as long as they are aware of why and what the sacrifices actually are. (Money talks once again.)

And what it will take is now written down here in this forum for everyone having the same concern.

-johan

tschak909 · October 22, 2008, 08:59:04 PM

You are far too paranoid.

-Thom

johanr · October 22, 2008, 09:10:00 PM

Or more like, "I do it right otherwise I don't bother spending the time.."

-johan

LinuxMCE Forums
