Building quality into our code using the red bins system
BUILDING BRIDGES – In this article, we learn how a team of software developers leverages the lean concept of “red bins” to ensure the quality of the code they write.
Words: Marek Kalnik, CTO and Co-founder, BAM - Paris, France
When I went to Japan back in October 2017, I was deeply impressed by one of the companies we visited. Their CEO told us their client allows them three mistakes a year, and that they were already at two. Two mistakes in a year, in a company with 1,500 different product references and producing 8 million pieces a month! When I visited, they were on a 65-days-without-a-defect streak.
This made me wonder about quality at my company. How many bugs are present in our code each week?
I decided to measure it, and the result was quite sobering: 17 bugs found in a week. With nine projects running in parallel, that is an average of roughly two bugs per project. However, I quickly found that some projects were outliers with four to nine bugs a week, while others were pretty stable at zero to two bugs a week.
THE RED BINS SYSTEM
In lean thinking, to build quality into the process we use the so-called “red bins” – a system that ensures that no defect is left undetected and unfixed. No error is allowed to make it past the part of the process that generated it. The idea behind it is twofold: first of all, the red bins help develop the operator’s ability to recognize defects; secondly, they highlight the problematic part of the process, telling us where we need to intervene.
With the red bins in mind, I started working with two teams that were consistently generating a lot of bugs to understand what was going on. The first thing I learned was that those teams were not ready to understand the causes of the bugs on their own. When I asked them what caused their last bug, the answer I generally got was, “We haven't tested enough”. After developing a feature for the client, each developer tests it manually before shipping, to see if it works correctly. The teams I talked to were deeply convinced that they didn't test enough and that they should test more to avoid these bugs. Testing is important, but what about the root causes? What about the code that has been written incorrectly?
To help the team understand the problems they had, I started to work on a red bin routine with the team leader: twice a week, he and I would meet with the developer who generated a bug to analyze it. Together, we’d try to understand in detail what went wrong.
After describing the bug and discussing its impact, we located the exact part of the code that was written incorrectly and caused the problem. This was by no means an easy task: even with the feature already fixed, we would often spend half an hour discussing where exactly the error was. This was also a way of establishing a common standard, by trying to agree on the best way to implement the feature.
Having identified the faulty line of code, we discussed how it was developed. How was the feature specified? What was the technical strategy followed? What were the exact steps taken by the developer to implement it? We tried to get down to the maximum level of detail in understanding how the developer worked on the feature.
This helped us identify the skills the team needs to develop in order to produce defect-free code. We also looked for tools that we could introduce to prevent some families of problems. This feedback helped the team leader understand the next steps to take in training the team.
One of the teams I worked with was building a mobile app that allows customers to rent a car without having to go through a rental company. The application was using an external module that allowed it to lock and unlock car doors via a Bluetooth device.
At one point we discovered that in some cases the application was not terminating the rental correctly. We traced the error to the Bluetooth key module, which sometimes failed due to connectivity issues: the application was able to finish the rental when offline, but the Bluetooth code was throwing an error and stopping the execution. The door-locking code was executing first, and when it couldn’t lock the car, it prevented the user from terminating her rental. Here’s the erroneous code:
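The original listing is not reproduced here. As a rough illustration only, the failure described above could look like the following TypeScript sketch. `CarKey`, `RentalStore` and `terminateRentalUnsafe` are hypothetical names, not the application's real API, and the calls are kept synchronous for brevity even though real Bluetooth calls would most likely be asynchronous.

```typescript
// Hypothetical names throughout; none of them come from the original application.
interface CarKey {
  lock(): void; // assumed to throw when the Bluetooth device is unreachable
}

interface RentalStore {
  markTerminated(rentalId: string): void; // assumed to work offline
}

// Erroneous pattern: the lock call runs first and its error is not handled,
// so a Bluetooth failure aborts the function before the rental is closed.
function terminateRentalUnsafe(rentalId: string, key: CarKey, store: RentalStore): void {
  key.lock();                     // may throw and stop execution here
  store.markTerminated(rentalId); // never reached when locking fails
}
```

In this sketch nothing is wrong with either call on its own; the defect lies in the ordering combined with the unhandled error.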
This bug led the team to a detailed discussion on how the application was handling errors between its external modules and how to best protect the user. In this particular case, we decided that the client should be able to terminate her rental, rather than get stuck with a car that doesn’t lock and an active rental. The car is equipped with a timed security lock anyway, so it locks itself after some time when the engine is off. This was our fix:
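The original fix is not reproduced here. Under the decision described above (the rental must end even when the car cannot be locked), a minimal sketch might look like this. All names (`CarKey`, `RentalStore`, `terminateRental`, `reportError`) are hypothetical, the code is synchronous for brevity, and catching every error from the lock call is an assumption rather than the team's actual standard.

```typescript
// Hypothetical names throughout; none of them come from the original application.
interface CarKey {
  lock(): void; // assumed to throw when the Bluetooth device is unreachable
}

interface RentalStore {
  markTerminated(rentalId: string): void; // assumed to work offline
}

interface TerminationResult {
  terminated: boolean;
  carLocked: boolean;
}

function terminateRental(
  rentalId: string,
  key: CarKey,
  store: RentalStore,
  reportError: (error: unknown) => void
): TerminationResult {
  let carLocked = true;
  try {
    key.lock();
  } catch (error) {
    // A locking failure must not block the rental from ending: record it and continue.
    carLocked = false;
    reportError(error);
  }
  store.markTerminated(rentalId);
  return { terminated: true, carLocked };
}
```

Returning `carLocked` lets the caller tell the user that the car did not lock, while the rental is still closed; the timed security lock mentioned above remains the fallback.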
After analyzing other parts of the application’s code, we discovered that other features might be affected by this incorrect error-handling pattern. The team has created a new error-handling standard and started Quality Dojos to fix existing code according to the standard. This has allowed us to reduce the number of bugs generated by this error to zero.
Since I started trying to better understand the causes of the bugs in our code, I have analyzed over 50 bugs with our teams. Using the Pareto method, we found common root causes and subsequently launched kaizens to address them:
- Nine bugs were caused by the incorrect handling of third-party APIs or SDKs – the team that had this problem developed an "API implementation checklist" and an error-handling standard, reducing the number of bugs in this category to zero. We are now looking into how we can share these standards with the whole company.
- Five bugs were introduced during refactoring, that is, while modifying and improving the structure of existing code. This is one of the hardest tasks for a developer and we are working on improving our coders’ capabilities.
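The Pareto ranking behind this list can be sketched in a few lines of TypeScript. This is only an illustration: `paretoRanking` and the cause labels are hypothetical, and the sample data contains only the two categories quantified above, so the cumulative shares are relative to those 14 bugs rather than to the 50-plus bugs actually analyzed.

```typescript
interface ParetoRow {
  cause: string;
  count: number;
  cumulativeShare: number; // running share of all bugs passed in, between 0 and 1
}

// Counts bugs per root cause and orders the causes from most to least frequent.
function paretoRanking(rootCauses: string[]): ParetoRow[] {
  const counts: Record<string, number> = {};
  for (const cause of rootCauses) {
    counts[cause] = (counts[cause] || 0) + 1;
  }
  const rows = Object.keys(counts)
    .map((cause) => ({ cause, count: counts[cause] }))
    .sort((a, b) => b.count - a.count);
  let running = 0;
  return rows.map((row) => {
    running += row.count;
    return { ...row, cumulativeShare: running / rootCauses.length };
  });
}

// Only the two categories quantified in the article; the other bugs are omitted.
const analyzedBugs: string[] = [];
for (let i = 0; i < 9; i++) analyzedBugs.push("third-party API/SDK handling");
for (let i = 0; i < 5; i++) analyzedBugs.push("refactoring");

for (const row of paretoRanking(analyzedBugs)) {
  console.log(`${row.cause}: ${row.count} bugs`);
}
```

Tagging each analyzed bug with one root-cause label during the red bin review is what makes this kind of ranking possible.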
The task of understanding bugs and improving quality with kaizen is a challenging one. After six months of working on this, however, I am happy to say we have learned a lot. The projects we have focused on have seen a reduction in the number of bugs introduced. On the other hand, our main indicator – the total number of bugs across all projects – is pretty much stable, which means we have yet to discover how to apply our learnings at company level and achieve the level of quality I witnessed during my visit to Japan.