Load Shedding in Distributed Systems
Graceful Failures: The reality behind Distributed Systems
Distributed systems are composed of multiple nodes that can fail independently, so in practice they experience frequent partial failures. These failures may be caused by node outages, network partitions, or any number of other unanticipated events. Left unhandled, they can bring the entire system down and degrade the user experience. More troubling still, some of these failures recur at unpredictable intervals. One way to handle such failures is load shedding: deliberately reducing the work that a system under stress accepts in order to prevent a more widespread failure.
Introduction
When there is a limited supply of something, we ration it. When electricity is not available in sufficient quantity, utilities institute load shedding: if the grid demands more power than can be supplied, different parts of the grid are switched off for a while so that total demand stays within supply and no area takes more than its fair share. When the same idea is applied to distributed systems, we call it load shedding in distributed systems.
Distributed System
A distributed system is a collection of nodes or processes that operate independently but coordinate with one another toward a common goal. The nodes typically run on separate machines, often at a distance from one another, and communicate over a network through mechanisms such as remote procedure calls or message passing.
What is load shedding?
Load shedding, a term borrowed from electrical engineering, is defined by dictionary.com as "the deliberate shutdown of electric power in a part or parts of a power-distribution system, generally to prevent the failure of the entire system when the demand strains the capacity of the system."
Load shedding is a controlled process in which electricity supply is deliberately interrupted in certain areas in order to manage demand and prevent the entire power grid from collapsing. It is typically done when there is a shortage of electricity generation or transmission capacity, or when there is a high risk of overloading the system.
Load Shedding in Computer Science
From the definitions above, it is clear that load shedding occurs when the underlying system has insufficient capacity to continue operating normally. In computing, it is a technique for handling situations where a system is overwhelmed and cannot keep up with demand. When load shedding occurs, the system prioritizes certain requests and temporarily stops processing others in order to reduce its load and avoid crashing.
What is Load Shedding in Distributed Systems?
Load shedding is the act of deliberately shedding some load to keep a system from collapsing due to overload. The distributed systems that power the internet and our businesses are built on many computers that are expected to be available at all times. Each of these systems has a finite amount of resources, and demand for them is often high enough to exceed what the system can serve.
When a system is overloaded or partially failing, there are essentially two options:
Let the system fail
Engage in load shedding
Load shedding in distributed systems can mean shutting down services, slowing down operations, or re-routing requests. Load shedding is a defensive approach to dealing with an overburdened system that involves deliberately bringing the system to a lower level of service than usual in order to buy time for system administrators to add new capacity to the system or repair broken equipment. It is a common practice in power grids and other critical systems where a lack of capacity can lead to system failure.
Failing Gracefully in Distributed Systems
Distributed systems that fail gracefully are designed to redistribute the load shed from the failed component onto the healthy components of the system.
A distributed system is said to have successfully shed load if it survives the excess load without failing entirely, typically by continuing to serve a reduced share of the traffic. Distributed systems have more components that can fail than their centralized counterparts, but they have the advantage of handling very large amounts of traffic at relatively low infrastructure cost.
Because partial failures caused by node outages, network partitions, and other unanticipated events are frequent, load shedding gives a distributed system a way to fail gracefully: it deliberately reduces the work it accepts under stress in order to prevent a more widespread failure.
The major reason for load shedding
The main reason load shedding is necessary is that distributed systems should not be expected to run at 100% of capacity. A commonly cited rule of thumb is that they work best at around 80% capacity or less. When one part of the system is running at 100%, requests queue up behind it and the parts that depend on it slow down too. A system that is constantly 100% busy is usually a sign of poor capacity planning, because it leaves no room for error: any traffic spike or lost node can push it into overload and cause it to crash. Operators of distributed systems need to know which parts of the system are at capacity and where load can be shed to keep the system from going down.
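A simple queueing model helps show why headroom matters. The sketch below assumes an M/M/1 queue (a single server with random arrivals and random service times), which is an illustrative assumption and not a model of any particular system. Under that assumption, mean response time is the service time divided by (1 - utilization), so latency grows sharply as utilization approaches 100%. The function name and the 10 ms service time are hypothetical.

```python
def mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    Assumes a single server with Poisson arrivals and exponential
    service times; real systems will deviate from this model.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue grows without bound")
    return service_time_s / (1.0 - utilization)


# Hypothetical service time of 10 ms per request.
for rho in (0.5, 0.8, 0.9, 0.99):
    latency_ms = mean_response_time(0.010, rho) * 1000
    print(f"utilization {rho:.0%}: about {latency_ms:.0f} ms")
```

In this model, going from 50% to 80% utilization raises mean latency from roughly 20 ms to roughly 50 ms, while going from 90% to 99% raises it from roughly 100 ms to roughly 1 second.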
How to trigger load shedding?
When load-shedding is triggered, the system will stop processing some of the incoming requests and prioritize others. This is done in order to reduce the load on the system and prevent it from crashing. The requests that are prioritized may be those that are considered more important or time-sensitive, such as requests for critical services or emergency services.
The key to successfully shedding load in a distributed system is the ability to detect failures and trigger an automated response that reroutes traffic to other healthy nodes.
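One possible way to automate the trigger is a pair of thresholds with hysteresis, so that the system does not flap between shedding and not shedding when utilization hovers near a single limit. This is a minimal sketch: the class name `ShedTrigger` and the 0.8 and 0.6 thresholds are assumptions chosen for illustration, and a real system might watch latency or queue depth instead of utilization.

```python
class ShedTrigger:
    """Turns shedding on at or above `high` and off at or below `low`."""

    def __init__(self, high: float = 0.8, low: float = 0.6) -> None:
        if not 0 <= low < high <= 1:
            raise ValueError("expected 0 <= low < high <= 1")
        self.high = high
        self.low = low
        self.shedding = False

    def update(self, utilization: float) -> bool:
        """Feed the latest utilization sample; returns True while shedding."""
        if utilization >= self.high:
            self.shedding = True
        elif utilization <= self.low:
            self.shedding = False
        # Between the two thresholds the previous state is kept.
        return self.shedding


trigger = ShedTrigger()
states = [trigger.update(u) for u in (0.5, 0.85, 0.7, 0.55)]
print(states)  # [False, True, True, False]
```

The third sample (0.7) is below the high threshold, but shedding stays on until utilization drops to the low threshold.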
Load Shedding Strategies
Several methods and strategies can be used to implement load-shedding in a distributed system.
Limiting the resource consumption rate
Resource limits can be used to shed load in distributed systems by capping the rate at which a resource is consumed. The consumption rate is monitored, and if it exceeds a certain threshold, the excess work is delayed or rejected to prevent the system from being overloaded.
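A token bucket is one common way to enforce such a limit. The sketch below is an assumption about how the idea could be implemented, not a prescribed design: the class name `TokenBucket` and its parameters are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows roughly `rate` units per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request fits the limit, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `bucket.allow()` before doing the work and reject or defer the request when it returns False.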
Merging and rerouting requests
To avoid a cascading failure in a distributed system, the load-shedding strategy must take load off the node that is experiencing the problem, rather than off the node that merely received the request. For example, when a node responsible for serving requests is experiencing problems, the strategy should send its requests to another node that can handle them. Merging, in turn, means serving duplicate or overlapping requests with a single piece of work so that fewer requests reach the struggling node.
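The rerouting step could look something like the following sketch. The `Node` type, its `healthy` flag, and the `pick_node` function are all hypothetical; how health is determined (heartbeats, error rates, timeouts) is outside the scope of this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    name: str
    healthy: bool = True


def pick_node(preferred: Node, nodes: Sequence[Node]) -> Optional[Node]:
    """Use the preferred node if it is healthy, otherwise the first healthy peer.

    Returns None when no healthy node exists, in which case the caller
    has to fall back to another strategy such as dropping the request.
    """
    if preferred.healthy:
        return preferred
    for node in nodes:
        if node is not preferred and node.healthy:
            return node
    return None


a, b, c = Node("a", healthy=False), Node("b"), Node("c")
target = pick_node(a, [a, b, c])
print(target.name if target else "no healthy node")  # b
```

Note that rerouting moves load rather than removing it, so the healthy nodes need enough headroom to absorb the extra traffic.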
Dropping requests
When the load-shedding strategy involves dropping requests, the nodes in the distributed system must be able to recognize and ignore certain types of requests. Dropping requests is typically used as a last resort, when other load-shedding strategies (e.g., reducing the resource consumption rate or rerouting requests) are insufficient or would require too much effort to implement.
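As a rough illustration, a node could keep a list of request types that are safe to ignore while it is overloaded. The request types and the `handle` function below are hypothetical, and the use of HTTP status codes is an assumption, since the strategy itself is not tied to HTTP. Status 503 (Service Unavailable) with a `Retry-After` header is a standard way to tell clients to back off.

```python
from typing import Dict, Tuple

# Hypothetical request types that are safe to ignore under overload.
SHEDDABLE_TYPES = frozenset({"analytics", "prefetch", "recommendations"})


def handle(request_type: str, overloaded: bool) -> Tuple[int, Dict[str, str]]:
    """Return an (HTTP status, headers) pair for the request."""
    if overloaded and request_type in SHEDDABLE_TYPES:
        # Reject cheaply, before doing any real work for the request.
        return 503, {"Retry-After": "5"}
    # Critical requests, and all requests under normal load, are processed.
    return 200, {}


print(handle("analytics", overloaded=True))  # (503, {'Retry-After': '5'})
print(handle("checkout", overloaded=True))   # (200, {})
```

Rejecting early matters here: a dropped request should cost far less than a served one, otherwise dropping does little to relieve the overload.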
Queuing Requests
One common method is to use a queue to store incoming requests and process them in a first-in, first-out (FIFO) order. When the queue becomes full, the system can stop accepting new requests and process the ones that are already in the queue.
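A bounded FIFO queue of this kind can be sketched with Python's standard `queue.Queue`, whose `put_nowait` raises `queue.Full` once `maxsize` items are waiting. The helper name `try_enqueue` and the queue size of 2 are assumptions for the example.

```python
import queue


def try_enqueue(q: "queue.Queue[str]", request: str) -> bool:
    """Accept the request if there is room, otherwise shed it."""
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        return False


pending: "queue.Queue[str]" = queue.Queue(maxsize=2)
results = [try_enqueue(pending, r) for r in ("a", "b", "c")]
print(results)  # [True, True, False]

# Requests already in the queue are still processed in FIFO order.
first = pending.get_nowait()
print(first)  # a
```

Keeping the queue short is a deliberate choice in this sketch: a very long queue accepts requests that may have timed out on the client side before they are ever processed.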
Load Balancing Requests
Another method is to use a load balancer to distribute incoming requests evenly across multiple servers in the system. If one server becomes overloaded, the load balancer can redirect requests to other servers in the system to help alleviate the load.
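One simple balancing policy is to send each request to the server with the fewest active requests. The sketch below assumes the balancer tracks an active-request count per server. The function `pick_server` and the per-server limit are hypothetical, and production load balancers offer several other policies.

```python
from typing import Dict, Optional


def pick_server(active: Dict[str, int], max_per_server: int) -> Optional[str]:
    """Return the least-loaded server, or None if every server is at its limit."""
    if not active:
        return None
    name = min(active, key=lambda server: active[server])
    if active[name] >= max_per_server:
        # Even the least-loaded server is full, so the request must be shed.
        return None
    return name


print(pick_server({"a": 5, "b": 2, "c": 9}, max_per_server=10))  # b
print(pick_server({"a": 10, "b": 10}, max_per_server=10))        # None
```

Returning None when all servers are full is where load balancing hands over to the other strategies, such as queuing or dropping requests.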
Artificial intelligence (AI) algorithms
Load-shedding can also be implemented using artificial intelligence (AI) algorithms, such as machine learning (ML) algorithms. These algorithms can analyze incoming requests and determine which ones should be prioritized based on various factors, such as the importance of the request, the expected response time, and the current load on the system.
Side Effects of Load Shedding
While load shedding can be an effective way to prevent a distributed system from crashing, it also has negative consequences. For example, a system that is constantly shedding load is not meeting the demands of its users. This can lead to frustration and may result in a loss of business or customer loyalty.
To mitigate these negative consequences, it is important to carefully monitor the system and implement load-shedding only when necessary. It is also important to have adequate capacity in the system to handle the expected workload and to have robust failover mechanisms in place to ensure that the system remains operational even in the event of a failure.
Conclusion
Load shedding in distributed systems refers to the practice of deliberately reducing the work a system accepts, or the level of service it provides, to prevent the entire system from failing due to overload. This can involve shutting down services, slowing down operations, or rerouting requests. Its purpose is to prevent cascading failures, where the failure of one component leads to the failure of other components in a chain reaction. To implement effective load shedding in distributed systems, it is important to have clear policies and procedures in place, as well as the ability to monitor the system and identify potential issues before they become critical. Load shedding should be implemented carefully and only when necessary, and the system should be designed with sufficient capacity and failover mechanisms to ensure reliable operation.