Tuesday, January 21, 2020

Can a developer be a superhero, or should we embrace Site Reliability Engineering?

This is again a post with hardly any code. I really don't want to write such posts, but some things have to be explained.

One of my previous posts touched on the problem of developers not producing quality code and other people having to suffer for it. Unfortunately, that post had to be edited to remove work-related matter due to the policy of my current company.

But that made me think a lot about enterprise software engineering. Is it possible to build an environment where nobody has to stretch beyond 8 hours or waste their career firefighting? For the time being, I am excluding start-ups and technology product developers, where the same team defines the business requirements and release timelines.

Below are some questions.
  1. If developers are given the time they estimated to develop perfect software (infinite time is not practical), will there be no issues in QA and higher environments such as production?
  2. If the QA team gets all the time it needs, is there a guarantee that nothing will break in production?
  3. Should there be a dedicated team to fight issues in production? This excludes the operations team, which does routine maintenance, and the support team, which performs high-privileged operations such as fixing data.

What may a developer miss even after getting the full estimated time?

Infrastructure

Let's start with the enterprise infrastructure. There is no guarantee that an application development team in an enterprise knows about all the enterprise infrastructure decisions. Even with enough training, they may not have a strong enough grounding in networking, security, etc., to digest everything.
For example, the application might have been built on the assumption below:

"The country name and office address available in the active directory are maintained religiously"

It would work 99.99% of the time. But it may break for one user in production, and he/she happens to be a director or higher official.
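Defensive coding can hedge against such assumptions. Below is a minimal Python sketch (the function and the dictionary of attributes are hypothetical; real code would fetch the entry via LDAP) that validates a directory attribute and degrades gracefully instead of breaking for that one user:

```python
# Hypothetical sketch: never trust that directory attributes are
# maintained religiously. Validate them and fall back to a safe default.

def resolve_country(user_entry: dict, default: str = "UNKNOWN") -> str:
    """Return the user's country from a directory entry, or a safe default.

    `user_entry` stands in for a dict of Active Directory attributes;
    'co' is AD's country-name attribute (an assumption worth verifying
    against your own schema).
    """
    country = user_entry.get("co")
    if country and country.strip():
        return country.strip()
    # Attribute missing or blank: degrade gracefully instead of crashing.
    return default
```

The point is not the helper itself but the habit: every externally maintained attribute gets a validation-plus-fallback path.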

Business requirement

A technology product team mostly defines the features themselves. Even if a developer doesn't know a feature exactly, his lead or manager would be fully aware of how the feature is expected to work. But most enterprise development teams get feature requirements from outside. Things also get lost or altered in transit from business sponsors to business analysts to the development team. In a regulated industry, the regulations might have changed or been newly introduced. What do all those factors lead to?
Production issues, if not caught during the QA cycle.

Application developers write less than 50% of the code!

Yes, today's enterprise application developers are writing less than 50% of the code needed to run their applications. Let's consider a .NET, Java, or Node.js application. Below are the areas where the application developers don't have any control, but their application may fail because of issues there.
  • OS - Bugs in the OS or cloud provider that are not yet discovered or not yet patched.
  • Drivers - Bugs in drivers used for networking, files, databases, etc.
  • Runtime - Unpatched or yet-to-be-discovered bugs in the .NET runtime, JVM, or Node.js.
  • Third-party libraries - Bugs in the libraries, or libraries that are no longer maintained.
  • Compiler - A DSL compiler, if one is used, or rare bugs in the C# or Java compiler itself.
If developers want to write perfect code, they may need to write all the above software too. And then there is the hardware all of this runs on, which may fail due to overheating or manual mistakes.

Can software be fully certified by a QA team that got its full estimated hours?

First, let us see what is meant by the estimate to uncover all the defects of an enterprise software system.

Estimate to find all the defects of a software system

Most of us who have ever worked in QA will be thinking this is a utopian idea. If it were possible, Facebook and other giants would not have started bug bounty programs rewarding the public for finding bugs.
Today's software systems are really complex, both in terms of in-process code paths and integration graphs. Considering that fact, the estimate would be a massive number if QA wants to uncover every defect that may affect production. It would bring us back to the old waterfall model, where a release happens once in a year or two. Everybody knows the waterfall model worked at some point, but it does not work well now.
In today's agile world, if the QA team needs 1 or 2 years to completely test the application, the application might be obsolete by the time of release.

We cannot deny the fact that QA automation can reduce the estimate, but it would still be a considerable amount.

Environment parity

Every enterprise theoretically wants to replicate the production environment in QA. But even with the help of IaC (Infrastructure as Code), it is difficult. Just consider Active Directory as an example. Most organizations don't want QA to use their production AD, as it may affect real users. QA often gets a replica of the production AD. The application may work there easily. But when it is put into production, a user may be in a different forest, and there may not be enough infrastructure measures taken to speed things up. Oh yes, Murphy's law applies everywhere, and that user often turns out to be a business sponsor or other higher official.

Other factors, such as knowledge of the business requirements and changes in those requirements, may affect QA too. There could also be many other things that affect the estimate and the execution of the QA process, resulting in bugs in production.

Do we agree that defect-free software is a myth in this era?

Defect-free software may have been possible in the early days of computer science, when complexity was minimal. But now, in a world of high-speed innovation and competition, it is very difficult to produce defect-free software. There are some links on this topic in the references.

Does this mean developers & QA don't have responsibility?

Absolutely not. They cannot just commit code and run away. Think about life-support systems or trading systems, where the damage is very high.
They have to take all possible measures to minimize defects. The list of measures is never-ending; some are below.

Measures to minimize defects

Below are some measures to minimize the defects. Each item may need its own post to go in detail.

Development

  • Designing the application before coding.
  • Using type-safe languages.
  • Writing observable code.
  • Code reviews.
  • Scanning third-party libraries and evaluating CVEs.
  • Programmatically stopping the software from running in unsupported environments.
  • Unit tests.
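As one example, stopping software from running in unsupported environments can be as simple as a guard that fails fast at startup instead of mysteriously later in production. A hypothetical Python sketch (the minimum version is an assumed value a team would set to whatever QA actually tested):

```python
import sys

# Assumed minimum runtime version that QA has certified.
MIN_SUPPORTED = (3, 8)

def check_supported_runtime(version_info=None) -> None:
    """Raise at startup if the runtime was never tested by QA.

    `version_info` defaults to the live interpreter's version; it is a
    parameter only so the guard itself can be unit tested.
    """
    version = version_info or sys.version_info
    if (version[0], version[1]) < MIN_SUPPORTED:
        raise RuntimeError(
            f"Unsupported Python {version[0]}.{version[1]}; "
            f"need {MIN_SUPPORTED[0]}.{MIN_SUPPORTED[1]}+"
        )
```

The same idea applies to OS versions, driver versions, or missing environment configuration: refuse to start rather than limp along.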

QA

  • Participating in the design process
  • Automated integration testing
  • Automated security testing
  • Knowing when to release

Deployment

  • DevOps - Reduce the feedback cycle time.
  • Scripting everything such as IaC (Infrastructure as Code)
All these still don't ensure that there are no defects in production. Let us see next how to deal with production issues using an enhanced approach.

Enter Site Reliability Engineering

If we accept the fact that there is no defect-free software and that failures are normal in production even with the world's most perfect development and QA teams, let's move forward in addressing the same.

Many of us might have participated in short-lived mission teams formed to fight critical issues in production: Rapid Response team, S.W.A.T. team, Tiger team, A-team, etc., to name a few. Most of the time, those teams disappear after the mission. What if we keep such a team consistently throughout the lifecycle of the software? Is that part of the software engineering process, and what is the industry-accepted name for such a team?

Yes, such a team exists, and the industry calls it "Site Reliability Engineering" (SRE).

Below are some basics about SRE as of writing this post. Since it is in its early stages, things may change over time.
  • Like other technologies at scale, SRE too was pioneered by Google. As per Google, SRE is how Google runs production systems. Benjamin Treynor gets the credit.
  • There is a conference called SREcon, organized by USENIX from 2014 onwards. Around 4 conferences are planned around the world for 2020.
  • Books - Google has published its SRE books.
  • The good news is that there are job postings for SRE roles on LinkedIn, Glassdoor, etc.

How to run an SRE team

Since SRE is in its early stages, there are no rigid standards. Google has practiced it for long and even published a book, but not every organization has Google's scale and problems. Below are some general guidelines, which again are subject to change as SRE matures.
  • Automation is the key part.
  • It is not about monitoring giant screens full of fancy graphs to make decisions, or sending daily reports saying the system works fine. Instead, apply A.I. to automate the same.
  • The team spends 50% of its time on reactive activities and the rest on proactive ones.
    • Reactive - production calls, documenting incidents, feedback to development teams, etc.
    • Proactive - developing strategies, tools, etc. to avoid reliability issues that may occur in the future; also, understanding the product they are supporting.
  • The team comprises 50% members from a development background and the rest from operational and infrastructure backgrounds.
  • The team must be experienced enough, as they are dealing with production-critical systems.
  • The team has the ability to block the next release if the system is below the expected SLA or has any other reliability issues.
  • Engage SRE only for applications that are production-critical, as SRE is not free.
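The ability to block a release is often expressed as an error budget derived from the service's availability target. A hypothetical Python sketch of that idea (the function names, the SLO value, and the request counts are illustrative, not from any standard):

```python
# Hypothetical error-budget gate: the SLO allows a certain fraction of
# failures; once that budget is spent, SRE can veto the next deploy.

def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative = overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, nothing spent
    availability = good / total
    budget = 1.0 - slo          # allowed failure fraction, e.g. 0.001
    spent = 1.0 - availability  # actual failure fraction
    return (budget - spent) / budget

def release_allowed(slo: float, good: int, total: int) -> bool:
    """Allow the release only while some error budget remains."""
    return error_budget_remaining(slo, good, total) > 0.0
```

For example, with a 99.9% SLO and 999,900 good requests out of 1,000,000, 90% of the budget remains and a release can proceed; at 997,000 good requests the budget is overspent and the gate says no.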

Conclusion

As mentioned in multiple places, SRE is in its early stages, and the things described in this post may change over time. It is good to see the software industry unifying production support into software engineering rather than seeing it as a separate entity.
  • Today's massive, mission-critical enterprise applications need a dedicated SRE team to keep them reliable.
  • The team can be formed from existing system admins, application support, and development teams, but they have to think about coded solutions to prevent issues rather than firefighting all the time.
  • They should have the right to say 'NO' to deployments and a voice in planning.

SRE may be seen as a programmable version of old production support meant to attract developers into operations. But today's massive, complex distributed applications need developers and coded solutions to keep them running.

Maybe until A.I. takes over these boring, repetitive tasks.

Sorry if anybody feels bad about glorifying coding tasks over others. Unfortunately, it is a fact.

References

https://www.gartner.com/en/documents/3979405/devops-teams-must-use-site-reliability-engineering-to-ma
