Tuesday, May 5, 2015

Extensible Chaos Monkey

Chaos Monkey

Chaos Monkey is a great concept introduced by Netflix: it deliberately creates random failures in cloud server environments so that weaknesses are found and fixed early, and so that systems are tested against unexpected failures. The goal is to make sure a system can recover from the kinds of failure that can happen in production.

The good news is that they didn't keep it secret. They published the tool, including its source, on GitHub. It is written in Java, so it is a little difficult for the .NET community to extend. There are a number of .NET ports of the tool, or attempts to implement the same idea.

The problem

In our company, things are no different from any other production environment. There are issues that are first reported in production, or that only happen there. We have an online audit system that contains many backend processing components. The backend queueing system is built so that any application track developer can add a new queue type and write its associated handler. Though there are guidelines for backend developers on rolling back properly, there are times when developers skip them to meet deadlines, and the code goes to production. In production, we end up with wrong data states. Some of the causes are unexpected IIS / AppPool recycling, database timeouts and outages, and so on. Since the application does not expect those states, it has no way to recover from them. In the end, support has to manually run SQL statements on the production servers to correct the state.

Ideally, QA is supposed to test scenarios such as an IIS reset while the backend services are running. But QA has limitations. It is very difficult to make sure that all the scenarios are covered. Ensuring randomness, and tracing an abnormal event so that it can be reproduced in dev, is difficult. It is also hard to create abnormal scenarios manually when the tests run overnight.

Solution

All of this leads to a testing strategy in which the issues and abnormal scenarios expected in production need to be created in the dev/QA environments, so that we can see how the system behaves under those abnormal events and, where needed, change the application flow to recover from those states.

This is the point at which we started looking at the existing Chaos Monkey solution. It seemed suitable for us. But we could not use it straight away, because Chaos Monkey targets the cloud and our production deployment is in-house.

Why extensible

So we tried to extend the tool. But since it is in Java and most of our developers are familiar with the .NET ecosystem, we started looking for a .NET port of Chaos Monkey. Unfortunately, we could not find much. There are some, but they also target the cloud. To restart the local IIS server, we would have had to write a lot of code.

So finally I decided to take one as a base and make it extensible to fit our scenario. I forked the ChaosMonkey implementation on GitHub by Simonmnro and added my own changes to make it pluggable.
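To illustrate what "pluggable" means here, below is a minimal sketch and not the project's actual code. It is written in Python for brevity rather than .NET, and every name in it (`chaos_plugin`, `unleash`, the three plugin names) is hypothetical. Each failure type is a small function registered under a name, and a seeded random generator picks which one to run, so a run that exposed a bad data state can be replayed in dev with the same seed. The plugin bodies are placeholders that only return a message; a real plugin would perform the disruptive action.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical registry: plugin name -> function that injects one failure.
_PLUGINS: Dict[str, Callable[[], str]] = {}


def chaos_plugin(name: str):
    """Decorator that registers a failure-injection function under a name."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        _PLUGINS[name] = func
        return func
    return decorator


@chaos_plugin("iis_reset")
def iis_reset() -> str:
    # Placeholder only: a real plugin would restart IIS here
    # (for example by invoking the `iisreset` command).
    return "IIS reset requested"


@chaos_plugin("recycle_app_pool")
def recycle_app_pool() -> str:
    # Placeholder only: a real plugin would recycle the application pool.
    return "AppPool recycle requested"


@chaos_plugin("db_timeout")
def db_timeout() -> str:
    # Placeholder only: a real plugin would provoke a database timeout.
    return "database timeout simulated"


def unleash(seed: int, rounds: int,
            enabled: Optional[List[str]] = None) -> List[str]:
    """Run `rounds` randomly chosen plugins and return a trace of what ran.

    The same seed always produces the same sequence, which makes an
    overnight run reproducible afterwards.
    """
    rng = random.Random(seed)
    # Sort the names so the choice does not depend on registration order.
    names = sorted(enabled if enabled is not None else _PLUGINS)
    if not names:
        raise ValueError("no plugins enabled")
    trace = []
    for step in range(rounds):
        name = rng.choice(names)
        result = _PLUGINS[name]()  # KeyError if the name is not registered
        trace.append(f"step={step} plugin={name} result={result}")
    return trace
```

Calling `unleash(seed=42, rounds=5)` twice returns the same trace both times, and passing `enabled=["db_timeout"]` restricts the run to that one plugin. Adding a new failure type means writing one more decorated function, without touching the scheduler.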

My version of Extensible Chaos Monkey is available at the location below.

What's Next

  • Clean separation of plugin code and config
  • Adding more plugins, such as one that increases memory pressure
Feel free to comment if you know a better solution to this problem, and contact me if you would like to contribute to the project.
