Dealing with a legacy project or code base is tricky. The same problem exists in other engineering disciplines, such as civil engineering. Think of a civil engineer who is asked to work on an old house, perhaps to add an extension or build an upper floor. Should he spend time fixing problems in the existing house, or concentrate on the new construction? If he spends more time on the existing house, for example on strengthening it, that work may not be visible to the stakeholder, i.e. the house owner. Yet it is required before the extension can be carried out. Convincing the owner how much to spend, and where, is a difficult job for the engineer.
Now let's come to software engineering. An engineer is asked to add features to a project that has legacy code. At first sight, the code does not follow even the basic principles of coding: classes with thousands of lines, every method public, everything coupled to everything else, and so on. How do we convince the stakeholders that it requires refactoring before we start on the new features?
Sometimes even in new projects we feel that developers are spending more time fixing issues because of bad code, and that the code needs refactoring. But how can we identify which areas require refactoring? The ideal would be to refactor the entire application, but when the code base is large that is usually neither practical nor cost effective.
Let's go back to civil engineering. If the civil engineer has proof, such as cracks in the walls, leaks or thin walls, he can present it to the owner and get approval for strengthening. But how can a software engineer find such proof to justify refactoring?
The answer is that he needs to find the hot-spots, the places where the code is hot.
First, let's understand what cold code is. It is code that is dead, unused, rarely changed, or that just works! In any legacy system we can see that more than 50-60% of the code is not changing, even though it is badly written. Sometimes that code is never called, or even if it is called, it does its duty. Bad code does not mean code that fails to do its duty. Bad code here refers to wrongly structured code that is difficult to understand and maintain, where maintaining means adding features and fixing defects. If the code works as expected, should we change it just to keep up with coding standards and make it beautiful?
Can we write perfect code?
Programming is closer to art than to engineering, and in art there is no perfection. Today's perfection may not be tomorrow's, and it also varies from person to person. So whatever we do to beautify our code today, after some time we will feel that it is legacy and needs refactoring. Another reason is that in programming there are many ways to achieve the same thing. One person thinks one way is great; the next person thinks a different way is.
Unless we have unlimited funding, it is not practical to refactor a large code base entirely. So we need to find the hot spots where refactoring adds value.
What is hot code? We can say it is the code that needs to be refactored first. Below are some examples. We need the help of the source control system to find the hot spots where hot code is located.
Code that changes continuously
If a portion of code keeps changing, we can say it is hot code. Remember the open/closed principle: code should be open for extension but closed for modification. Continuously modified code is not following that principle, and refactoring that area will help us in the long term.
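As a rough illustration, change frequency can be estimated by counting how many commits touched each file. The sketch below is only one possible approach: it assumes the history has been exported with `git log --name-only --pretty=format:` (which prints just the file names of each commit), and the function name `change_counts` and the sample paths are made up for the example.

```python
from collections import Counter


def change_counts(log_text: str) -> Counter:
    """Count how many commits touched each file.

    Expects the text printed by:
        git log --name-only --pretty=format:
    i.e. one file path per line, with blank lines between commits.
    """
    counts = Counter()
    for line in log_text.splitlines():
        path = line.strip()
        if path:  # blank lines only separate commits
            counts[path] += 1
    return counts


# Hypothetical history: three commits, made-up file names.
sample_log = """\
src/Billing.java
src/Serializer.java

src/Billing.java

src/Billing.java
README.md
"""

counts = change_counts(sample_log)
print(counts.most_common(1))  # [('src/Billing.java', 3)]
```

The files at the top of this list are candidates for a closer look, not automatic proof. A file may also change often for harmless reasons, such as a version number or a changelog.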
Code that is changed by different teams
If a portion of code is being changed by multiple functional teams, we can smell one more area of hot code. Ideally, if that portion follows the Single Responsibility Principle, only one functional team will be touching it. Another problem when multiple teams touch the same code is lack of ownership. If something breaks in that area, nobody takes responsibility, because many people are working on it. Refactoring that portion helps to avoid such problems.
The best example is a serializer or other generic utility that is changed by multiple teams. If a team building the 'User Registration' feature has to change the serializer, we can easily see that the serializer is not generic and needs refactoring.
As we have seen, these methods work by analyzing the history in the source control system.
Even if we present the above proof in favour of refactoring, stakeholders may not agree to it. So we need to explain the other advantages of finding hot-spots. One is bug prediction.
The concept is simple. If some code keeps changing and the reason for the changes is defect fixes, we can predict that new changes in that area will introduce more defects unless we refactor. If we refactor that area, we can reduce the cost of defect fixing. Stakeholders always look at money and at the quality delivered to the end user. If defects are reduced and quality for the end user improves, they will agree to refactor.
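A minimal sketch of this check, assuming the history is exported with `git log --name-only --pretty=format:@%an`, so that each commit starts with a line holding `@` plus the author name. Git does not know about teams, so the author-to-team mapping `team_of`, the function name `teams_per_file` and all names in the sample are assumptions made for the illustration.

```python
from collections import defaultdict


def teams_per_file(log_text, team_of):
    """Map each file to the set of teams whose members changed it.

    Expects the text printed by:
        git log --name-only --pretty=format:@%an
    Lines starting with '@' name the author of the commit; the
    non-blank lines that follow are the files of that commit.
    """
    teams = defaultdict(set)
    current_team = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Authors missing from the mapping are grouped as "unknown".
            current_team = team_of.get(line[1:], "unknown")
        elif current_team is not None:
            teams[line].add(current_team)
    return dict(teams)


# Hypothetical author-to-team mapping and history.
team_of = {"asha": "registration", "ben": "billing", "chen": "billing"}
sample_log = """\
@asha
src/Serializer.java
src/RegistrationForm.java

@ben
src/Serializer.java
src/Invoice.java

@chen
src/Invoice.java
"""

result = teams_per_file(sample_log, team_of)
shared = sorted(path for path, teams in result.items() if len(teams) > 1)
print(shared)  # ['src/Serializer.java']
```

Here `Invoice.java` is changed by two people but only one team, so it is not flagged. The serializer is touched by both the registration and the billing team, which is the smell described above.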
Below is a link describing how Google used this source control based technique.
http://google-engtools.blogspot.com/2011/12/bug-prediction-at-google.html
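As far as I understand the linked post, each file is scored by summing a time-weighted value over its bug-fixing commits, so that recent fixes count much more than old ones: score = sum of 1 / (1 + e^(-12 t + 12)), where t is the time of a bug-fixing commit normalized so that 0 is the start of the history and 1 is now. The sketch below follows that reading. The function name, its parameters and the sample timestamps are assumptions for illustration, and deciding which commits are bug fixes (for example from the commit message or the issue tracker) is left out.

```python
import math


def hotspot_score(fix_times, first_commit, now):
    """Time-weighted score for one file.

    fix_times:    timestamps of the bug-fixing commits touching the file
    first_commit: timestamp of the earliest commit in the repository
    now:          timestamp at which the score is computed
    All three use the same unit, e.g. seconds since the epoch.
    """
    span = now - first_commit
    score = 0.0
    for ts in fix_times:
        t = (ts - first_commit) / span  # 0 = oldest, 1 = now
        score += 1.0 / (1.0 + math.exp(-12.0 * t + 12.0))
    return score


# Hypothetical timeline running from 0 to 100.
recent = hotspot_score([90, 95, 100], first_commit=0, now=100)
old = hotspot_score([5, 10, 15], first_commit=0, now=100)
print(recent > old)  # True
```

A fix made right now contributes 0.5 to the score, while a fix from the very beginning of the history contributes almost nothing. A file that was buggy long ago but has been stable since therefore drops down the list on its own.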
After it was implemented, there were studies on whether this really helps developers. They found that it cannot tell the developer exactly what the bug is. It only says that a file is likely to contain bugs because it has changed frequently.
Below are some implementations of the Google bug prediction algorithm.
There are other tools as well for analyzing source control systems and extracting information.
http://people.engr.ncsu.edu/ermurph3/papers/icsm11.pdf
Moral of the story
Next time, do not just say that we need to refactor the entire code base and then stand there without an answer when stakeholders ask why we need to refactor and what the estimate is.
Instead, present proof: these are the hot-spots we need to refactor, and this is how much time we need.