More resilient device (re)starting and status reporting

chrisbirkinshaw · April 23, 2010, 01:53:34 AM

1. It would be nice to have a page in the orbiter which displays a list of all devices which have either failed to start or have died and not restarted. At the moment I have problems whereby the zwave device dies but I never know unless I try to switch some lights on/off.

2. If a device is not started then more verbose logging should be provided somewhere, and it should be easily accessible. I have some GSD devices which do not start and are not disabled, yet nothing is logged anywhere (I grepped the log dir for both the device name and the id with no results). The /var/log/pluto/<device_id> log should be opened as early as possible so it can be logged into. I have devices which do not start but never even have a log file created.

3. If the core dcerouter crashes then the media players and orbiters etc should continue running and try to reconnect in the background. The current behaviour has the effect of making the system quite infuriating sometimes (and as a result I have moved all media playback to dedicated devices - such as Xtreamers - under IR control)
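To make point 2 concrete, here is a minimal Python sketch of "open the log before doing anything else". It is not how LinuxMCE actually launches devices; `open_device_log`, `start_device`, the `start` callable and the `<device_id>.log` file name are all hypothetical names chosen for illustration. The only idea it shows is that the per-device log exists before the start is attempted, so a failed start always leaves a trace.

```python
import logging
from pathlib import Path
from typing import Callable


def open_device_log(log_dir: Path, device_id: int) -> logging.Logger:
    """Create the per-device log file before the device is started."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"device.{device_id}")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid adding a second handler on restart
        # Hypothetical naming scheme: one file per device id.
        handler = logging.FileHandler(log_dir / f"{device_id}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def start_device(log_dir: Path, device_id: int,
                 start: Callable[[], None]) -> bool:
    """Run the (hypothetical) start callable, logging any failure."""
    log = open_device_log(log_dir, device_id)
    log.info("starting device %d", device_id)
    try:
        start()
    except Exception:
        # log.exception records the message plus the full traceback.
        log.exception("device %d failed to start", device_id)
        return False
    log.info("device %d started", device_id)
    return True
```

With something like this in the launcher, grepping the log dir for the device id would always find at least the "starting device" line, and a status page could list every device whose last attempt returned False.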

totallymaxed · May 26, 2010, 10:08:11 AM

Quote from: chrisbirkinshaw on April 23, 2010, 01:53:34 AM
1. It would be nice to have a page in the orbiter which displays a list of all devices which have either failed to start or have died and not restarted. At the moment I have problems whereby the zwave device dies but I never know unless I try to switch some lights on/off.

Check out Web Admin->Automation->Device Status for general devices. For ZWave devices, just tail the ZWave log on the Core.

Quote

2. If a device is not started then more verbose logging should be provided somewhere, and it should be easily accessible. I have some GSD devices which do not start and are not disabled, yet nothing is logged anywhere (I grepped the log dir for both the device name and the id with no results). The /var/log/pluto/<device_id> log should be opened as early as possible so it can be logged into. I have devices which do not start but never even have a log file created.

...add some logging to your GSD ;-)

Quote

3. If the core dcerouter crashes then the media players and orbiters etc should continue running and try to reconnect in the background. The current behaviour has the effect of making the system quite infuriating sometimes (and as a result I have moved all media playback to dedicated devices - such as Xtreamers - under IR control)

Generally, in our experience, MDs will auto-reconnect if the Core locks up and has to be restarted. It has to be said that it is very rare for even our development Cores to crash or lock up. However, if you use alpha/beta code in production systems then you have to expect instability and issues... helping to document and fix those with the devs is what using these incomplete builds is all about.

jimbodude · May 26, 2010, 09:01:28 PM

Quote from: totallymaxed on May 26, 2010, 10:08:11 AM
Quote from: chrisbirkinshaw on April 23, 2010, 01:53:34 AM

3. If the core dcerouter crashes then the media players and orbiters etc should continue running and try to reconnect in the background. The current behaviour has the effect of making the system quite infuriating sometimes (and as a result I have moved all media playback to dedicated devices - such as Xtreamers - under IR control)

Generally, in our experience, MDs will auto-reconnect if the Core locks up and has to be restarted. It has to be said that it is very rare for even our development Cores to crash or lock up. However, if you use alpha/beta code in production systems then you have to expect instability and issues... helping to document and fix those with the devs is what using these incomplete builds is all about.

I think what chrisbirkinshaw is talking about on a higher level is making the different pieces of the system less dependent on each other whenever possible. The idea being that if one device or system goes "boom" others are not affected or maybe some other device or system takes over the dead one's work. In theory, I agree with this idea of a self-healing or redundant system. I wonder which pieces can be set up to act in this way and how much effort it would take in each case.

In my experience, if the dcerouter goes down, playback stops. This is happening much less frequently than it used to (0710 to current 0810 builds), but I don't know if this is due to dcerouter improvements, or just because everything that talks to the dcerouter is becoming more stable. In any case - it seems to me that media playback does not necessarily need to be so dependent on the dcerouter. Of course features like controlling playback from orbiters would be affected if the dcerouter dropped out, but not the basic playback. When dcerouter comes back online, it can tap into whatever is going on and update its state.
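As a rough illustration of "keep going and reconnect in the background", here is a small Python sketch of a retry loop with exponential backoff. Nothing here is DCE's real API: `reconnect`, `backoff_delays` and the `connect` callable are hypothetical, and the 1 second base and 60 second cap are arbitrary assumptions.

```python
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... seconds, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def reconnect(connect: Callable[[], T],
              max_attempts: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, waiting longer after each failure.

    connect is a hypothetical callable that returns a connection object or
    raises OSError (for example ConnectionRefusedError) while the router
    is down. With max_attempts=None the loop retries forever.
    """
    attempts = 0
    for delay in backoff_delays():
        try:
            return connect()
        except OSError:
            attempts += 1
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(delay)
```

A media player could run `reconnect` in a daemon thread (`threading.Thread(target=..., daemon=True)`) while its playback loop carries on untouched, then report its current state once the connection is back.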

There is also the issue of notifying the human when something has failed. It's important to know when something has gone wrong, without a doubt. This system is supposed to secure my home - it can't fail silently. Currently, most failures require us to dig through logs to figure out what devices are affected. I think someone was working on an e-mail message event responder - that might be one good way to get the word out. In order to make this robust, there would need to be a separate entity in charge of making sure everything is running properly - one that is not associated with anything that can fail. That probably means that it could not use DCE, or at least not have its messages routed through dcerouter.
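A separate checker like that could be very small. The Python sketch below is only an assumption about what it might look like: it probes TCP ports directly, so it does not depend on DCE or on messages being routed through dcerouter, and it builds an e-mail listing whatever did not answer. The function names, the `targets` mapping and the addresses in it are all hypothetical placeholders.

```python
import socket
from email.message import EmailMessage
from typing import Callable, Dict, List, Tuple

Target = Tuple[str, int]  # (host, port)


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_failures(targets: Dict[str, Target],
                  probe: Callable[[str, int], bool] = port_is_open
                  ) -> List[str]:
    """Return the sorted names of all targets that did not respond."""
    return sorted(name for name, (host, port) in targets.items()
                  if not probe(host, port))


def build_alert(failures: List[str], sender: str,
                recipient: str) -> EmailMessage:
    """Build (but do not send) a notification listing the dead services."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(failures)} service(s) not responding"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("No response from:\n" + "\n".join(failures))
    return msg
```

Run from cron on a box other than the Core, the result of `build_alert` could be handed to `smtplib.SMTP(...).send_message(msg)`, assuming a mail relay is reachable. A port check only proves that something is listening, not that the device behind it works, so this is a floor rather than a full health check.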

How do those of you who use the security and telecom features approach reliability checks and error reporting? Or do you just "trust it"?

chrisbirkinshaw · June 04, 2010, 10:36:33 AM

Thanks for the responses, totallymaxed. However, my point wasn't really that it should be more stable (yes, this will come); my point was that it should be more resilient to failure. When something dies, the system should let me know via the orbiter and/or web interface.

My pet niggles for example:

1. When the core dies media playback stops
2. Serial ports occasionally seem to disconnect, at which point the device becomes disabled
3. Some devices do not get started and nothing is logged (I was talking about the Onkyo RS232 driver and URS_UIRT, so these already have logging)
4. When I turn lights on/off they are updated on the floorplan but often commands are not sent by Zwave or X10

It might be that addressing these kinds of things saves a lot of time on support. Most people on the forums seem to be asking for help with troubleshooting, and I see that this is taking up a lot of the developers' time.

Regards,

Chris

LinuxMCE Forums
