LESSON ONE: CRASH OFTEN
The essence of this lesson is: automatically restart failed and cancerous services.
The downside of running your own system in a colo is that you are on the hook for maintenance. When your service dies you have to fix it now, even at 2AM. This is a constant tension in your life. You have to take a computer with you everywhere and you know that anytime anyone calls it could be another disaster you have to fix. It ruins your life.
One way to mitigate this problem is restart process that have died or become cancerous. Reddit uses Supervise to automatically restart applications. Special monitoring programs kill processes that use too much memory, use too much CPU, or aren’t responsive. Instead of worrying just restart and the system is up. Of course you have to read the logs and find a root cause, but until then it keeps you sane.
Так оно и есть, причем сейчас есть более продвинутая реализация супервизора - runit. Кстати, в аппаратных реализациях mission-critical систем использование микросхемы-супервизора является обязательным. Но электротехнику, электронику и прочие дисциплины современные "программисты" не изучают...