Load Shedding in Distributed Systems
Graceful Failures: The reality behind Distributed Systems
Distributed systems are made up of many parts, each of which can fail on its own. As a result, partial failures are common in practice. They may be caused by node failures, network partitions, or any number of other unplanned events. Left unchecked, such failures can bring the whole system down and affect users. More troubling still, some of them recur at unpredictable times. Load shedding is one way to deal with these failures: we deliberately cut back on the work the system takes on in order to prevent a wider failure during times of stress.
Introduction
When a resource is in limited supply, we try to ration it. When there is not enough electricity to meet demand, utilities ration it through load shedding: different parts of the grid are turned off for a while so that total demand stays within what the grid can supply. Distributed systems borrow both the idea and the name.
Distributed System
In a distributed system, independent computers cooperate to achieve a common goal. Such a system consists of numerous nodes or processes that operate independently of one another and communicate over a network through mechanisms such as remote procedure calls or message passing.
What is load shedding?
Load shedding is a term borrowed from electrical engineering. Dictionary.com defines it as "the deliberate shutdown of electric power in a part or parts of a power-distribution system, generally to prevent the failure of the entire system when the demand strains the capacity of the system."
In other words, load shedding is a regulated process in which the electricity supply is intentionally interrupted in specific locations to manage demand and prevent the entire power grid from collapsing. It is typically done when there is a shortage of generation or transmission capacity, or when there is a high risk of overloading the system.
Load Shedding in Computer Science
From the definitions above, it is clear that load shedding occurs when the underlying system has insufficient capacity to continue operating normally. In computing, it is a technique for handling situations where a system is overwhelmed and cannot keep up with demand. When load shedding occurs, the system prioritizes certain requests and temporarily stops processing others in order to reduce its load and avoid crashing.
What is Load Shedding in Distributed Systems?
Load shedding is the act of deliberately shedding some load to keep a system from collapsing due to overload. The distributed systems that power the internet and our businesses are built on many computers that are expected to be available at all times. Those computers provide only a finite amount of resources, and demand for them is often high and can exceed what is available.
When a system is partially failing under overload, there are essentially two options:
Let the system fail
Engage in load shedding
Load shedding in distributed systems can mean shutting down services, slowing down operations, or rerouting requests. It is a defensive approach to an overburdened system: the system is deliberately brought to a lower level of service than usual. This buys time for system administrators to add capacity or repair broken equipment. The same practice is common in power grids and other critical systems where a lack of capacity can lead to total failure.
Failing Gracefully in Distributed Systems
Distributed systems that fail gracefully are designed to redistribute the load shed from the failed component onto the healthy components of the system.
A distributed system is said to have successfully shed load if it survives the overload without failing entirely, even if some requests are rejected or delayed along the way. Because they have more components, distributed systems are more likely than their centralized counterparts to have something broken at any given time. Their advantage is that they can handle extremely large amounts of traffic at relatively low infrastructure cost.
Partial failures are therefore frequent in distributed systems, whether from node outages, network partitions, or other unanticipated events. Graceful failure means responding to them by deliberately reducing the work the system takes on under stress, so that a local problem does not become a widespread one.
The major reason for load shedding
The main reason load shedding is necessary is that distributed systems are not meant to run at 100% of capacity. A common rule of thumb is that they work best with headroom, for example at around 80% capacity or less, although the right figure depends on the workload. When one component runs at 100%, requests queue up behind it, and the components that depend on it slow down as well. A system that is 100% busy is usually a sign of inadequate capacity planning, because it leaves no room for error: any spike in traffic or loss of a node can push it into a crash or outage. Operators of distributed systems need to know which parts of the system are at capacity and where load can be shed to keep the system running.
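One way to see why headroom matters is a simple queueing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and random service times). That model is an illustrative assumption, not a description of any particular system, but it shows the general shape: mean response time grows as service time divided by (1 - utilization), so it rises sharply as utilization approaches 100%.

```python
def mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time a request spends in an M/M/1 system (waiting plus service).

    Assumes the textbook M/M/1 formula: service_time / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


# Hypothetical service that needs 10 ms of work per request.
for rho in (0.5, 0.8, 0.9, 0.99):
    latency_ms = mean_response_time(0.010, rho) * 1000
    print(f"utilization {rho:.0%}: about {latency_ms:.0f} ms")
```

Under these assumptions a 10 ms request takes about 20 ms at 50% utilization, 50 ms at 80%, 100 ms at 90%, and 1000 ms at 99%.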
How to trigger load shedding?
When load shedding is triggered, the system stops processing some incoming requests and prioritizes others in order to reduce its load and avoid crashing. The prioritized requests are usually those considered more important or time-sensitive, such as requests for critical or emergency services.
The key to shedding load successfully in a distributed system is the ability to detect overload and failures, and to trigger an automated response such as rerouting traffic to other healthy nodes.
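As a rough illustration of an automated trigger, the sketch below switches shedding on when measured utilization crosses a high threshold and off again once it falls below a lower one. The class name `ShedTrigger`, the threshold values, and the use of two thresholds (hysteresis, to avoid flapping) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ShedTrigger:
    """Decides whether the system should currently be shedding load."""

    high: float = 0.85  # hypothetical threshold: start shedding at or above this
    low: float = 0.70   # hypothetical threshold: stop shedding at or below this
    shedding: bool = False

    def update(self, utilization: float) -> bool:
        """Feed in the latest utilization sample (0.0 to 1.0)."""
        if utilization >= self.high:
            self.shedding = True
        elif utilization <= self.low:
            self.shedding = False
        # Between the two thresholds the previous state is kept (hysteresis).
        return self.shedding
```

A monitoring loop could call `update()` with each new utilization sample and enable one of the shedding strategies while it returns `True`.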
Load Shedding Strategies
Several strategies can be used to implement load shedding in a distributed system.
Limiting the resource consumption rate
Resource limits shed load by capping the rate at which a resource is consumed. The consumption rate is monitored, and when it exceeds a set threshold, the excess requests are rejected or delayed so that the system is not overloaded.
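One common way to cap a consumption rate is a token bucket. The following is a minimal single-threaded sketch; the class name, parameters, and the injectable `now` clock (included only to make the example easy to test) are assumptions, not part of any specific library.

```python
import time


class TokenBucket:
    """Allows up to `rate_per_s` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float, now=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self._now = now
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call `allow()` before doing any work and return an error (for example, HTTP 429) when it returns `False`.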
Merging and rerouting requests
To avoid a cascading failure, load should be moved away from the node that is experiencing the problem, rather than simply being shed at the node that first receives the request. For example, when a node responsible for serving requests is struggling, the load shedding strategy should send its requests to another node that can handle them. This only helps if the other node has spare capacity; otherwise rerouting merely moves the overload elsewhere.
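The rerouting idea can be sketched as a small routing function. The `Node` type, its `healthy` flag, and the `route` function are hypothetical names for this example; in a real system the health flag would be set by health checks or failure detection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    healthy: bool = True  # assumed to be maintained by a separate health checker


def route(preferred: Node, fallbacks: list[Node]) -> Optional[Node]:
    """Return the preferred node if healthy, otherwise the first healthy fallback."""
    for node in [preferred, *fallbacks]:
        if node.healthy:
            return node
    return None  # nothing healthy: the caller has to queue or drop the request
```

Returning `None` rather than raising keeps the decision about what to do with an unroutable request in the caller's hands.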
Dropping requests
When the load shedding strategy involves dropping requests, the nodes in the distributed system must be able to recognize and ignore certain types of requests. Dropping requests is typically a last resort, used when other strategies (e.g., reducing the resource consumption rate or rerouting requests) are not enough or are impractical to implement.
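A simple way to decide which requests to ignore is to attach a priority to each request type and drop lower priorities first as load rises. The priority levels and the 85% and 95% thresholds below are illustrative assumptions.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0    # never dropped in this sketch
    NORMAL = 1
    BACKGROUND = 2  # first to be dropped


def should_drop(priority: Priority, utilization: float) -> bool:
    """Decide whether to drop a request, given current utilization (0.0 to 1.0)."""
    if utilization >= 0.95:  # hypothetical "severe overload" threshold
        return priority > Priority.CRITICAL
    if utilization >= 0.85:  # hypothetical "overload" threshold
        return priority > Priority.NORMAL
    return False
```

With these thresholds, background work is dropped first, normal requests only under severe overload, and critical requests are always served.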
Queuing Requests
One common method is to use a queue to store incoming requests and process them in a first-in, first-out (FIFO) order. When the queue becomes full, the system can stop accepting new requests and process the ones that are already in the queue.
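A bounded FIFO queue of this kind can be sketched with Python's standard `queue.Queue`. The wrapper class `BoundedIntake` and its method names are assumptions for this example.

```python
import queue


class BoundedIntake:
    """FIFO intake that refuses new requests once `max_pending` are waiting."""

    def __init__(self, max_pending: int):
        self._q: queue.Queue = queue.Queue(maxsize=max_pending)

    def submit(self, request) -> bool:
        """Try to enqueue a request; return False if the queue is full."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self):
        """Return the oldest pending request, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

Keeping the queue short matters as much as bounding it: a very long queue can accept requests whose callers will have timed out before they are ever processed.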
Load Balancing Requests
Another method is to use a load balancer to distribute incoming requests evenly across multiple servers in the system. If one server becomes overloaded, the load balancer can redirect requests to other servers in the system to help alleviate the load.
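As a sketch of this idea, the balancer below sends each request to the server with the lowest fraction of its capacity in use, and skips servers that are already full. The class name, the per-server capacity figures, and the least-loaded policy are assumptions; real load balancers offer several policies.

```python
from typing import Optional


class LeastLoadedBalancer:
    """Picks the server with the most spare capacity; skips full servers."""

    def __init__(self, capacities: dict[str, int]):
        self.capacities = dict(capacities)  # max concurrent requests per server
        self.in_flight = {name: 0 for name in capacities}

    def acquire(self) -> Optional[str]:
        """Choose a server for a new request, or None if every server is full."""
        candidates = [
            n for n in self.capacities if self.in_flight[n] < self.capacities[n]
        ]
        if not candidates:
            return None
        chosen = min(candidates, key=lambda n: self.in_flight[n] / self.capacities[n])
        self.in_flight[chosen] += 1
        return chosen

    def release(self, name: str) -> None:
        """Mark one request on `name` as finished."""
        self.in_flight[name] = max(0, self.in_flight[name] - 1)
```

When `acquire()` returns `None`, balancing alone cannot help any more, and the request has to be queued or dropped.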
Artificial intelligence (AI) algorithms
Load shedding can also be driven by artificial intelligence (AI) techniques such as machine learning (ML). These algorithms can analyze incoming requests and determine which ones to prioritize based on factors such as the importance of the request, the expected response time, and the current load on the system.
Side Effects of Load Shedding
While load shedding can be an effective way to prevent a distributed system from crashing, it can also have negative consequences. For example, a system that is constantly shedding load is not meeting the demands of its users. This can lead to frustration and may result in a loss of business or customer loyalty.
To mitigate these consequences, it is important to monitor the system carefully and shed load only when necessary. It is also important to have adequate capacity for the expected workload and robust failover mechanisms, so that the system remains operational even in the event of a failure.
Conclusion
In distributed systems, load shedding is the practice of deliberately reducing the work a system takes on, or the level of service it offers, so that the whole system doesn't fail from being overloaded. This can mean turning off services, slowing down processes, or sending requests in a different direction. Its purpose is to stop cascading failures, in which the failure of one part sets off a chain of failures in other parts. For load shedding to work well, there need to be clear policies and procedures in place, as well as the ability to monitor the system and spot possible problems before they become serious. The system should be built with enough capacity and failover mechanisms to ensure reliable operation. Load shedding should be done carefully and only when it's necessary.