This blog series discusses scalability from three distinct perspectives: people, processes and technology. If you haven't read the first part yet, we recommend reading it before continuing with this post.
Processes are critical to scale. They cover and assist all the different stages of software development, from the early phases (planning, design), to implementation, to putting the software into production and maintaining it. Any activity is ultimately risky, and being able to understand the potential risks and gains is essential when growing your business.
When the user load is growing and adding more resources to the existing system is no longer enough, the first step usually consists of redesigning the system. As in every journey, it's good to know both the start and the destination first. So before starting any architecture or software redesign, you need to understand what the current state is. Calculating the headroom of your current system is a process that involves determining the current usage and the free capacity of each component, and measuring it against the expected growth. This will give you a better idea of the time remaining before service degradation or outages, as well as help identify the real bottlenecks of the current architecture.
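As a rough illustration of the headroom calculation described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the component names and figures are invented, growth is modelled as a constant compound monthly rate, and only 80% of the nominal capacity is treated as usable (to leave a safety margin). A real assessment would use measured usage and your own growth forecast.

```python
import math

def months_of_headroom(capacity, current_usage, monthly_growth_rate,
                       usable_fraction=0.8):
    """Months until usage reaches the usable capacity, assuming compound growth.

    usable_fraction is a hypothetical safety margin: we assume the component
    degrades before it reaches 100% of its nominal capacity.
    """
    usable = capacity * usable_fraction
    if current_usage >= usable:
        return 0.0          # already out of headroom
    if monthly_growth_rate <= 0:
        return math.inf     # no growth, so capacity is never exhausted
    # Solve current_usage * (1 + rate) ** months == usable for months.
    return math.log(usable / current_usage) / math.log(1 + monthly_growth_rate)

def find_bottleneck(components, monthly_growth_rate):
    """Return the component that runs out first, plus headroom per component."""
    headroom = {
        name: months_of_headroom(c["capacity"], c["current_usage"],
                                 monthly_growth_rate)
        for name, c in components.items()
    }
    bottleneck = min(headroom, key=headroom.get)
    return bottleneck, headroom

# Invented figures (e.g. requests per second) for three hypothetical components.
components = {
    "web":      {"capacity": 10_000, "current_usage": 4_000},
    "database": {"capacity": 5_000,  "current_usage": 3_200},
    "cache":    {"capacity": 50_000, "current_usage": 10_000},
}

bottleneck, headroom = find_bottleneck(components, monthly_growth_rate=0.10)
for name, months in sorted(headroom.items(), key=lambda item: item[1]):
    print(f"{name}: {months:.1f} months of headroom")
print(f"First bottleneck: {bottleneck}")
```

With these made-up numbers the database is the component that runs out first, even though the web tier handles more traffic in absolute terms. That is the kind of result this exercise is meant to surface before any redesign starts.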
For the purposes of this blog post, we'll divide these processes into three categories:
- Architecture process
- Development process
- Release process
Frameworks with fancy names - the key to sound architectural designs
The early design decisions are of course also the most delicate, so it's generally advisable to use some sort of risk-mitigation process. Some industries, universities and government agencies such as NASA or the US Army have promoted various standards, identified by fancy acronyms like ATAM (Architecture Tradeoff Analysis Method), ACDM (Architecture Centric Design Method), AGATE, etc. They serve this purpose very well, providing a framework for:
- gathering the objectives and the context about the system
- describing the requirements
- creating different scenarios
- identifying risks and trade-offs early in the life-cycle
- validating the proposals against the expectations.
The goal is to structure the thoughts and the proposals in a format which is easy to analyse and helps you form an educated decision. The models might go as far as describing the system under different "facets" or "views", like the business processes, the services of the system, and its logical and physical architecture.
Many of the aforementioned methods have an iterative nature, with close feedback loops that promote interaction between architects, engineers, system administrators and all the stakeholders. This helps you gather everyone's experience, goals, knowledge and perspective, all of which ultimately leads to a good design that covers most of the issues raised.
Regardless of the size of your company or your project, following a rigorous methodology in the early design stages will save a lot of time and issues down the line. Despite the complex names, such a process is actually quite simple to implement, and effective even if not followed verbatim.
The main idea is to have a collaborative design process, with input from all the involved engineering assets and stakeholders, leading to the creation of various proposals, which are then analysed under different scenarios. Ultimately the process is meant to reveal potential risks with each approach, and raise awareness of the trade-offs of each proposal.
Development process: ensuring quality and time-to-market
The central part of the product life cycle is the actual development. The way this is carried out affects product quality and its time-to-market, so it's another area that you'll want to optimise. There are several methods to streamline the development process, from the waterfall method to a series of frameworks that fall under the umbrella of "agile" or "lean" development, like Extreme Programming, Scrum and Kanban.
The main idea behind the agile approach is to build the product in small iterative steps, each resulting in a release. This way, the most important features (those having the highest value) are implemented first, and constant feedback is used as a control mechanism. These short iterations are like project phases, and all development processes take place within these phases. The short duration of the iterations allows for continuous re-steering of the direction of the product, as the development reveals more about its nature, bit by bit, as it's being built.
The agile values of feedback, adaptability and user involvement go beyond the software development life cycle, extending to project management, leadership, governance, procurement, and other areas. As such, agile favours change and continuous improvement, with practices like incremental design (instead of BDUF, or Big Design Up Front), test-driven development, refactoring, continuous integration, etc. All of these tools are designed to provide the team with a way to respond to change and scale the application at any point. Moving to agile successfully requires a difficult mental shift - viewing work in terms of business value - which my colleague Ian Barber recently blogged about.
Release process: barrier conditions and risk evaluation
Whenever a change or a new feature is implemented, before being deployed into the production environment, it should pass a series of barrier conditions that protect your project from significant failures and keep the availability of the service high. These barriers are represented by code reviews, automated and manual Quality Assurance processes, performance and stress testing, monitoring, etc.
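To make the idea of barrier conditions more concrete, here is a minimal Python sketch of a release gate. It is an illustration under assumptions, not a description of any particular tool: the `Barrier` class, the field names of the release record (`review_approvals`, `tests_failed`, `p95_latency_ms`, `latency_budget_ms`) and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Barrier:
    """A named condition that a release candidate must satisfy."""
    name: str
    check: Callable[[dict], bool]

def evaluate_barriers(release: dict,
                      barriers: List[Barrier]) -> Tuple[bool, List[str]]:
    """Return (may_deploy, names of the barriers that failed)."""
    failed = [b.name for b in barriers if not b.check(release)]
    return (not failed, failed)

# Hypothetical barriers; missing data is treated as a failure, not a pass.
BARRIERS = [
    Barrier("code review approved",
            lambda r: r.get("review_approvals", 0) >= 1),
    Barrier("automated tests pass",
            lambda r: r.get("tests_failed", 1) == 0),
    Barrier("p95 latency within budget",
            lambda r: r.get("p95_latency_ms", float("inf"))
                      <= r.get("latency_budget_ms", 0)),
]

candidate = {
    "review_approvals": 2,
    "tests_failed": 0,
    "p95_latency_ms": 180,
    "latency_budget_ms": 250,
}

may_deploy, failed = evaluate_barriers(candidate, BARRIERS)
print("deploy" if may_deploy else f"blocked by: {', '.join(failed)}")
```

The one design choice worth noting is that each check defaults to failing when its input is absent, so a release cannot slip through simply because a test or benchmark was never run.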
A barrier condition that deserves a special mention is risk assessment and management: studying the probability, the impact and the effect of every known risk on the project, so you can focus your efforts first on the risks that are most likely to occur or that have the biggest impact.
There are many ways to quantify the risk of any action, from gut feeling (fast, but not very accurate), to more rigorous and accurate procedures like PDPC (Process Decision Program Chart) and FMEA (Failure Mode and Effects Analysis), to name two popular ones. Once you have identified and evaluated the risks and their countermeasures, you should also define a threshold for the amount of risk that you are willing to accept for any given release, as well as plan rollback strategies in case a risk materialises.
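One simple way to express such a threshold is to score each risk as probability times impact and compare the total against a budget per release. The Python sketch below assumes exactly that model; it is one of several possible approaches, not the method prescribed by PDPC or FMEA. The `Risk` class, the example risks and all the numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Risk:
    description: str
    probability: float  # estimated chance of occurring, between 0 and 1
    impact: float       # estimated cost if it occurs, in an arbitrary unit

    @property
    def exposure(self) -> float:
        """Expected cost of this risk: probability multiplied by impact."""
        return self.probability * self.impact

def assess_release(risks: List[Risk],
                   max_total_exposure: float) -> Tuple[bool, float, List[Risk]]:
    """Return (acceptable, total exposure, risks ranked by exposure)."""
    ranked = sorted(risks, key=lambda r: r.exposure, reverse=True)
    total = sum(r.exposure for r in ranked)
    return total <= max_total_exposure, total, ranked

# Hypothetical risks attached to a hypothetical release.
risks = [
    Risk("schema migration locks a large table", probability=0.10, impact=50),
    Risk("cache stampede after deploy",          probability=0.30, impact=10),
    Risk("misconfigured feature flag",           probability=0.05, impact=20),
]

acceptable, total, ranked = assess_release(risks, max_total_exposure=10)
print(f"total exposure: {total:.1f}, acceptable: {acceptable}")
for risk in ranked:
    print(f"  {risk.exposure:4.1f}  {risk.description}")
```

The ranked list shows where mitigation effort pays off first, and the threshold turns "how much risk are we willing to accept?" into an explicit, reviewable number rather than a gut feeling.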
As with the aforementioned design methodologies, the purpose of these frameworks is to invite some thought and raise awareness about the risks and their severity, so potential issues can be tackled before they even occur, or at least easily detected. They can also help you shape your services in a way that limits the effects and the impact of failures. To make this possible, the system must have appropriate monitoring hooks to catch issues (not only failures, but also cases where it behaves significantly differently from how it's supposed to). Also, you should always plan for outages, be prepared and have a backup/rollback plan.
You can find example FMEA worksheets online.
Once the risks have been identified and quantified, you should take actions to eliminate the failure mode or minimise its occurrence and severity, and improve its detection. For these worksheets to be really valuable, they should be kept updated during the entire product life cycle: when it's designed, deployed, changed, or when the surrounding conditions change, or when new problems arise.
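FMEA worksheets commonly rate each failure mode on three 1 to 10 scales (severity, occurrence and detection, where a higher detection rating means the failure is harder to detect) and multiply them into a Risk Priority Number (RPN). The Python sketch below shows that calculation. The failure modes and their ratings are invented, and the `FailureMode` class is a hypothetical helper, not part of any FMEA standard or library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (very unlikely) to 10 (almost certain)
    detection: int   # 1 (always detected) to 10 (practically undetectable)

    def __post_init__(self):
        # Reject ratings outside the conventional 1-10 range.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection

def prioritise(modes: List[FailureMode]) -> List[FailureMode]:
    """Order failure modes so the highest RPN is addressed first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

# Invented worksheet rows for illustration only.
worksheet = [
    FailureMode("database primary fails",       severity=9, occurrence=2, detection=3),
    FailureMode("log volume fills the disk",    severity=6, occurrence=5, detection=2),
    FailureMode("silent corruption in exports", severity=8, occurrence=3, detection=8),
]

for mode in prioritise(worksheet):
    print(f"RPN {mode.rpn:3d}  {mode.name}")
```

In this made-up worksheet the silent corruption ranks highest, mostly because it is hard to detect. Each of the three actions mentioned above maps onto one rating: reducing severity, reducing occurrence, or improving detection lowers the RPN, and re-scoring after each change is what keeps the worksheet up to date.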
The benefits of these processes are many, from early identification and elimination of potential issues, to improved quality and reliability of the product - ultimately leading to cost savings and increased revenues. If done well, not only do they improve the product or service, but also the processes themselves, in a way that will minimise the likelihood of failures, since at a certain scale they can be very costly.
We now have an optimised team following rigorous processes, working towards the goal of a truly scalable system. But what about the system itself? In my next blog post, I'll discuss scalability from a technical perspective, looking at certain architectural principles that promote scalable environments.