<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Backend Engineering w/Sofwan]]></title><description><![CDATA[A dedicated, experienced, and versatile Backend Engineer with a strong engineering background with a Bachelor's degree focused in Computer Science (Education) f]]></description><link>https://blog.sofwancoder.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1660112579313/2jmZOt6W9.png</url><title>Backend Engineering w/Sofwan</title><link>https://blog.sofwancoder.com</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 14:56:00 GMT</lastBuildDate><atom:link href="https://blog.sofwancoder.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Protocols: Nothing Works Without Rules]]></title><description><![CDATA[If you've spent any time in the trenches of software engineering, especially wrestling with distributed systems or large-scale architectures, you know that complexity is the name of the game. We build layers upon layers, abstractions over abstraction...]]></description><link>https://blog.sofwancoder.com/protocols-nothing-works-without-rules</link><guid isPermaLink="true">https://blog.sofwancoder.com/protocols-nothing-works-without-rules</guid><category><![CDATA[engineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[internet]]></category><category><![CDATA[Kernel]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sofwan A. 
Lawal]]></dc:creator><pubDate>Sun, 20 Apr 2025 19:19:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745161873332/67641e82-6dc7-4a52-821f-67f8be590d52.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've spent any time in the trenches of software engineering, especially wrestling with distributed systems or large-scale architectures, you know that complexity is the name of the game. We build layers upon layers, abstractions over abstractions, trying to tame the beast. But have you ever stopped to think about the <em>absolute bedrock</em> upon which all this complexity rests?</p>
<p>It’s not design patterns, not fancy frameworks or syntactic sugar. It’s something much more fundamental, almost invisible, yet utterly pervasive: <strong>Protocols</strong>, good old rules. They are really just those invisible boundaries and structures that define how things should work, and more importantly, how they shouldn’t.</p>
<p>Lately, I’ve been reflecting a lot on the core of our discipline, and I’ve come to a conclusion that feels both obvious and philosophical at the same time: <strong>everything is protocol</strong>. Strip away all the abstraction, peel back the layers of your clean architecture, and what you’ll find underneath, if it’s well-engineered, is a set of protocols: defined, agreed-upon ways of interaction.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Protocols aren't just about <code>HTTP</code> requests or <code>TCP</code> handshakes; they are the fundamental rule sets that govern <em>any</em> interaction, from the subatomic level to sprawling global networks. They are the source of immense power, enabling collaboration and complexity (the "Good"), but also the origin of frustrating constraints, baffling bugs, and security nightmares (the "Evil"). They are, quite literally, how <em>anything</em> is built on <em>anything</em>.</p>
<p>At the simplest level, a protocol is a rule or set of rules that define how entities communicate and interact. In software, that could be as high-level as HTTP or as low-level as TCP/IP. But this idea goes beyond networking.</p>
<p>Consider object-oriented programming, what is an interface if not a protocol? A promise: “<strong>Any class that implements this will behave this way.</strong>” Same thing with APIs. Same thing with serialization formats. Protocols are everywhere. They are not optional, and they are not “nice to haves”. They are the very things that enable systems to be built on top of one another.</p>
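<p>To make that concrete, here is a minimal Rust sketch (the trait and names are mine, purely illustrative) of an interface acting as a protocol: callers depend only on the promised behaviour, never on the concrete type.</p>

```rust
// A trait is a protocol in miniature: a promise that any implementor
// will respond to `greet` in a defined way.
trait Greeter {
    fn greet(&self, name: &str) -> String;
}

struct Formal;

impl Greeter for Formal {
    fn greet(&self, name: &str) -> String {
        format!("Good day, {}.", name)
    }
}

// Callers depend only on the contract, never on the concrete type.
fn welcome(g: &dyn Greeter, name: &str) -> String {
    g.greet(name)
}

fn main() {
    println!("{}", welcome(&Formal, "Sofwan")); // Good day, Sofwan.
}
```

<p>Swap in any other implementor of <code>Greeter</code> and <code>welcome</code> keeps working; that is the whole point of the contract.</p>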
<h2 id="heading-what-are-we-really-talking-about-when-we-say-protocol">What Are We <em>Really</em> Talking About When We Say "Protocol"?</h2>
<p>Forget RFCs and formal definitions for a second. At its heart, a protocol is simply an <strong>agreement</strong>. It's a set of rules that defines how two or more entities will interact. Think about the simplest human interaction: a handshake.  </p>
<ol>
<li><p><strong>Initiation:</strong> One person extends their hand.</p>
</li>
<li><p><strong>Syntax:</strong> The hand is usually open, palm facing inwards or slightly up.</p>
</li>
<li><p><strong>Response:</strong> The other person mirrors the action, extending their own hand.</p>
</li>
<li><p><strong>Semantics:</strong> The hands clasp. This signifies greeting, agreement, or farewell.  </p>
</li>
<li><p><strong>Action:</strong> A brief shake (the timing and pressure are also subtle parts of the protocol!).</p>
</li>
<li><p><strong>Termination:</strong> The hands release.</p>
</li>
</ol>
<p>Break any of these implicit rules, and the interaction feels off, fails, or conveys a different message entirely. Offer a closed fist, hold on too long, use the wrong hand in some cultures – you've violated the protocol.</p>
<p>This simple example highlights the core components of <em>any</em> protocol, whether social or technical:</p>
<ul>
<li><p><strong>Syntax:</strong> The structure or format of the messages/actions (e.g., the layout of bits in a network packet, the required fields in a JSON payload, the posture of a handshake).</p>
</li>
<li><p><strong>Semantics:</strong> The meaning of the messages/actions (e.g., <code>SYN</code> means "I want to connect," <code>200 OK</code> means "Request successful," a clasped hand means "Greeting acknowledged").</p>
</li>
<li><p><strong>Timing/Ordering:</strong> When messages/actions should happen and in what sequence (e.g., you must send a <code>SYN</code> before an <code>ACK</code> in TCP, you offer your hand <em>before</em> shaking).</p>
</li>
</ul>
<p>Without these agreed-upon rules, communication and interaction descend into chaos. Nothing gets built. Nothing functions.</p>
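<p>The handshake’s syntax, semantics, and ordering can even be sketched as a tiny state machine (a toy model of my own, not any formal spec), where every out-of-order action is a protocol violation:</p>

```rust
// A toy "handshake protocol" encoded as a state machine. Ordering is
// enforced: you cannot shake before both hands are extended.
#[derive(Debug, PartialEq)]
enum Handshake {
    Idle,
    Offered, // one person extends a hand (Initiation + Syntax)
    Clasped, // the other mirrors it (Response + Semantics)
    Done,    // a brief shake, then release (Action + Termination)
}

#[derive(Debug)]
enum Event {
    Offer,
    Mirror,
    ShakeAndRelease,
}

// Advance the state machine; any out-of-order event is a violation.
fn step(state: Handshake, event: Event) -> Result<Handshake, String> {
    use Handshake::*;
    match (state, event) {
        (Idle, Event::Offer) => Ok(Offered),
        (Offered, Event::Mirror) => Ok(Clasped),
        (Clasped, Event::ShakeAndRelease) => Ok(Done),
        (s, e) => Err(format!("protocol violation: {:?} in state {:?}", e, s)),
    }
}

fn main() {
    let mut state = Handshake::Idle;
    for event in [Event::Offer, Event::Mirror, Event::ShakeAndRelease] {
        state = step(state, event).expect("legal sequence");
    }
    println!("final state: {:?}", state); // Done

    // Breaking the ordering rule fails immediately:
    println!("{:?}", step(Handshake::Idle, Event::ShakeAndRelease));
}
```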
<p><a target="_blank" href="https://blog.sofwancoder.com/http-on-tcp-stateless-protocol-on-the-internets-stateful-network">Read more on how Stateless HTTP protocol was built on TCP here</a>.</p>
<h2 id="heading-what-are-these-rules">What are these rules?</h2>
<p>These can be low-level like TCP, or application-level like gRPC or GraphQL. You can even think of REST conventions or Kafka message schemas as protocols.</p>
<p>Take this example of a client-server interaction over HTTP:</p>
<pre><code class="lang-plaintext">GET /api/users HTTP/1.1
Host: example.com
Accept: application/json
Authorization: Bearer token123
</code></pre>
<p>If the server <strong>doesn’t follow the HTTP protocol</strong>, and maybe responds with a malformed header, your well-behaved client might crash or throw a cryptic error. The contract is broken.</p>
<p>Protocols are beautiful because they create predictability. Predictability means stability, and in a large system, that’s gold.</p>
<p>But protocols are also strict. They don’t care about your business logic or your fancy framework. If the data doesn’t conform, the request is rejected. If a handshake isn’t done right, the connection dies. If the quorum isn’t met, no consensus is reached. It’s brutal, but it’s necessary.</p>
<p>I’ve seen well-meaning developers treat protocols as if they were guidelines. That’s a mistake. <strong>Protocols aren’t guidelines. They’re contracts. Break them, and the consequences range from silent failures to catastrophic outages.</strong></p>
<hr />
<p>For example, consider the following rust code which expects a json response (contract):</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> reqwest::header::ACCEPT;
<span class="hljs-keyword">use</span> std::error::Error;
<span class="hljs-meta">#[tokio::main]</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() -&gt; <span class="hljs-built_in">Result</span>&lt;(), <span class="hljs-built_in">Box</span>&lt;<span class="hljs-keyword">dyn</span> Error&gt;&gt; {
    <span class="hljs-keyword">let</span> client = reqwest::Client::new();
    <span class="hljs-keyword">let</span> res = client
        .get(<span class="hljs-string">"https://malformed.example.com/api/user/42"</span>)
        .header(ACCEPT, <span class="hljs-string">"application/json"</span>)
        .send()
        .<span class="hljs-keyword">await</span>;
    <span class="hljs-keyword">match</span> res {
        <span class="hljs-literal">Ok</span>(response) =&gt; {
            <span class="hljs-keyword">if</span> response.status().is_success() {
                <span class="hljs-keyword">let</span> json = response.json::&lt;serde_json::Value&gt;().<span class="hljs-keyword">await</span>;
                <span class="hljs-keyword">match</span> json {
                    <span class="hljs-literal">Ok</span>(data) =&gt; <span class="hljs-built_in">println!</span>(<span class="hljs-string">"User profile: {:#?}"</span>, data),
                    <span class="hljs-literal">Err</span>(e) =&gt; eprintln!(<span class="hljs-string">"⚠️  Failed to parse JSON: {}"</span>, e),
                }
            } <span class="hljs-keyword">else</span> {
                eprintln!(<span class="hljs-string">"❌ HTTP error: {}"</span>, response.status());
            }
        }
        <span class="hljs-literal">Err</span>(e) =&gt; {
            eprintln!(<span class="hljs-string">"🚨 Protocol error: {}"</span>, e);
            <span class="hljs-comment">// e.g. "error reading response headers: invalid HTTP header"</span>
        }
    }
    <span class="hljs-literal">Ok</span>(())
}
</code></pre>
<p>This code looks innocent and robust. You handle errors based on status codes and expect a clean JSON body.</p>
<p>But here’s what could <em>break</em> everything:</p>
<p><strong>Scenario:</strong></p>
<p>The server responds like this:</p>
<pre><code class="lang-plaintext">HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 37
X-Custom-Header

{ "name": "Sofwan", "role": "Admin" }
</code></pre>
<p>Notice that malformed header? <code>X-Custom-Header</code> is missing a value.</p>
<p><strong>What happens?</strong></p>
<ul>
<li><p>The HTTP parser chokes.</p>
</li>
<li><p><code>reqwest</code> surfaces a low-level protocol error such as <code>error reading response headers: invalid HTTP header</code> before any JSON parsing ever runs.</p>
</li>
<li><p>Your graceful business logic never gets a chance to run.</p>
</li>
</ul>
<hr />
<p><strong>Lesson:</strong></p>
<ul>
<li><p>Your <strong>client</strong> followed the rules.</p>
</li>
<li><p>Your <strong>code</strong> was clean.</p>
</li>
<li><p>But the <strong>server</strong> broke the protocol contract.</p>
</li>
</ul>
<p>And the protocol doesn’t bend. There’s no “maybe” or “almost correct.” It either complies or it fails—hard.</p>
<hr />
<p>One time, in a project where we were building a real-time event pipeline, we switched from one message broker to another, both supporting the same messaging semantics. Except, as it turned out, one of them was more strict about message acknowledgment order. That slight difference in protocol adherence exposed a lurking race condition that had been hiding in our consumer logic for weeks. We only found out because production messages started disappearing. Just like that.</p>
<h2 id="heading-protocols-in-engineering-the-lifeblood-of-distributed-systems">Protocols in Engineering: The Lifeblood of Distributed Systems</h2>
<p>Nowhere is the power and peril of protocols more evident than in distributed computing. When you have multiple machines, potentially separated by unreliable networks, trying to coordinate and achieve a common goal, unambiguous, robust protocols are not just helpful; they are <strong>essential</strong>.</p>
<ul>
<li><p><strong>Consensus Protocols (Raft, Paxos):</strong> How do multiple nodes agree on a value or a state transition, even if some nodes crash or messages get lost? These protocols are incredibly intricate sets of rules defining message types (AppendEntries, RequestVote), state transitions, and leader election logic. Get the protocol implementation slightly wrong, and you get split-brain scenarios, data corruption, or total system unavailability.<br />  Protocols like Raft are designed to be <em>unambiguous</em>, precisely because ambiguity in distributed consensus leads to those failures. For example:</p>
<pre><code class="lang-rust">  <span class="hljs-keyword">if</span> <span class="hljs-keyword">self</span>.state == Follower &amp;&amp; election_timeout_expired() {
      <span class="hljs-keyword">self</span>.state = Candidate;
      <span class="hljs-keyword">self</span>.current_term += <span class="hljs-number">1</span>;
      <span class="hljs-keyword">self</span>.votes = <span class="hljs-number">1</span>; <span class="hljs-comment">// vote for self</span>
      broadcast_vote_request();
  }
</code></pre>
<p>  This is not a simple if-statement. It’s part of a carefully defined <strong>state machine</strong>, where every node must transition the same way for the cluster to remain consistent.</p>
</li>
<li><p><strong>Replication Protocols:</strong> How does a primary database node ensure its replicas have the same data? Synchronous, asynchronous, semi-synchronous replication – these are all protocols defining the interaction and guarantees between the primary and its followers.</p>
</li>
<li><p><strong>Messaging Protocols (AMQP, MQTT, Kafka Protocol):</strong> How do producers send messages and consumers receive them reliably and efficiently via a broker? These protocols define message formats, delivery guarantees (at-least-once, at-most-once, exactly-once – each a different protocol!), acknowledgments, and topic/queue semantics.</p>
</li>
<li><p><strong>Remote Procedure Call (RPC) Protocols (gRPC, Thrift):</strong> How does one service invoke a function on another service across a network as if it were local? These involve protocols for serialization (Protocol Buffers, Avro – protocols themselves!), request/response mapping, error handling, and connection management.</p>
</li>
<li><p><strong>API Contracts (REST, GraphQL):</strong> While often seen as architectural styles, the specific way you structure your URLs, use HTTP verbs, format your JSON/GraphQL queries and responses <em>is</em> a protocol between your frontend and backend, or between microservices. A poorly defined or inconsistently implemented API protocol leads to endless integration headaches.</p>
</li>
</ul>
<p>In distributed systems, protocols are the invisible threads holding everything together over a chasm of network latency and potential failures. You cannot fake it. <strong>You cannot fake correctness</strong>. Either your protocols are sound, or your system is broken. They <em>are</em> the system, in many ways.</p>
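<p>To illustrate the "carefully defined state machine" point above, here is a heavily simplified, runnable sketch of Raft-style election bookkeeping (my own toy model; it omits term comparison on incoming messages, log checks, and everything else the real protocol mandates):</p>

```rust
// Simplified sketch of Raft-style leader election bookkeeping.
// Illustrative only: real Raft also validates terms, log indices, etc.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

struct Node {
    role: Role,
    current_term: u64,
    votes: usize,
    cluster_size: usize,
}

impl Node {
    // On election timeout, a follower becomes a candidate,
    // increments its term, and votes for itself.
    fn on_election_timeout(&mut self) {
        if self.role == Role::Follower {
            self.role = Role::Candidate;
            self.current_term += 1;
            self.votes = 1; // vote for self
        }
    }

    // Each granted RequestVote adds a vote; a strict majority
    // (the quorum) promotes the candidate to leader.
    fn on_vote_granted(&mut self) {
        if self.role == Role::Candidate {
            self.votes += 1;
            if self.votes > self.cluster_size / 2 {
                self.role = Role::Leader;
            }
        }
    }
}

fn main() {
    let mut node = Node { role: Role::Follower, current_term: 3, votes: 0, cluster_size: 5 };
    node.on_election_timeout();
    println!("{:?}, term {}", node.role, node.current_term); // Candidate, term 4
    node.on_vote_granted(); // 2 votes
    node.on_vote_granted(); // 3 of 5: majority reached
    println!("{:?}", node.role); // Leader
}
```

<p>Note how every transition is conditional on the current state: if two implementations disagree on even one of these conditions, the cluster's nodes diverge.</p>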
<h2 id="heading-the-grand-tapestry-anything-built-on-anything">The Grand Tapestry: Anything Built on Anything</h2>
<p>It’s almost poetic, right? But it’s also very literal in software engineering. This is where the magic, and sometimes the madness, truly lies. Our world, both natural and artificial, <strong>is a stack of protocols.</strong> It’s no exaggeration to say that the <strong>internet itself is a stack of protocols</strong>, beautifully layered, <strong>each one building on the constraints and guarantees of the one below</strong>. That’s not an accident. That’s engineering.</p>
<p>Think about physics. The fundamental forces and particles interact according to strict rules (protocols). These rules allow atoms to form. The rules governing atomic interactions (chemistry protocols) allow molecules to form. Molecular interactions (biochemical protocols) allow cells to function. Cell interactions allow organisms. Organism interactions (social protocols, language) allow societies.  </p>
<p>It’s protocols all the way down.</p>
<p>Now, map this to our world of software engineering, using a simplified five-layer view of the OSI model:</p>
<ol>
<li><p><strong>Physical Layer:</strong> How voltages or light pulses represent bits on a wire or fiber. That's a protocol.  </p>
</li>
<li><p><strong>Data Link Layer:</strong> How bits are grouped into frames, how to detect errors, how to manage access to the physical medium (e.g., Ethernet protocol). Built <em>on</em> the physical layer protocol.</p>
</li>
<li><p><strong>Network Layer:</strong> How to route packets across multiple networks (e.g., IP protocol). Built <em>on</em> the data link layer protocol. It doesn't care if it's Ethernet or Wi-Fi underneath, as long as the lower layer adheres to <em>its</em> expected protocol.  </p>
</li>
<li><p><strong>Transport Layer:</strong> How to provide reliable (TCP) or unreliable (UDP) end-to-end communication, manage flow control, and segment data. Built <em>on</em> the network layer protocol.</p>
</li>
<li><p><strong>Application Layer:</strong> How specific applications communicate (e.g., HTTP for web, SMTP for email, gRPC for RPC). Built <em>on</em> the transport layer protocol.  </p>
</li>
</ol>
<p>Each layer relies on the guarantees provided by the layer below it, interacting with it through a well-defined protocol (an interface, essentially). It abstracts away the details of the lower layers, allowing engineers working on the Application Layer (like many of us) to think about application logic without worrying about voltage levels or frame collisions.</p>
<hr />
<p>This layering, enabled entirely by protocols, is the <em>only</em> reason we can build systems as complex as the modern internet or large-scale distributed databases. Imagine trying to write a web application if you had to manually manage packet routing and error correction for every single request! Protocols are the great abstraction enablers.</p>
<p>A message broker speaks AMQP or MQTT. Your backend talks JSON over HTTPS. Inside your services, gRPC messages dance over HTTP/2. Below all of that, it's TCP. Beneath that, IP. Below that, Ethernet. Each layer is built on a protocol, defined to the letter, specifying behaviour, expectations, constraints.</p>
<p>Systems can only interoperate if they speak the same language, and the language is defined by a protocol.</p>
<p>Take distributed systems for example. The moment you split a system across network boundaries, you’ve walked into a land ruled by protocols. Consistency? Availability? Partition tolerance? These CAP theorem elements are not abstract ideas, they manifest in how your nodes agree (<strong>consensus protocols</strong>), how they replicate (<strong>replication protocols</strong>), how they detect faults (<strong>heartbeat protocols</strong>), and how they recover (<strong>failure and healing protocols</strong>).</p>
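<p>A heartbeat protocol, for instance, can be sketched in a few lines (a toy timeout-based failure detector of my own; production systems use richer schemes such as phi-accrual detectors):</p>

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Toy heartbeat-based failure detector: a peer is suspected dead if we
// have not heard from it within `timeout`. Illustrative only.
struct Detector {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl Detector {
    // Record a heartbeat received from `node` at time `now`.
    fn heartbeat(&mut self, node: &str, now: Instant) {
        self.last_seen.insert(node.to_string(), now);
    }

    // A node is suspected if its last heartbeat is too old,
    // or if we have never heard from it at all.
    fn suspected(&self, node: &str, now: Instant) -> bool {
        match self.last_seen.get(node) {
            Some(t) => now.duration_since(*t) > self.timeout,
            None => true,
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut d = Detector { timeout: Duration::from_millis(500), last_seen: HashMap::new() };
    d.heartbeat("node-b", start);
    println!("{}", d.suspected("node-b", start + Duration::from_millis(100))); // false
    println!("{}", d.suspected("node-b", start + Duration::from_secs(2)));     // true
}
```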
<hr />
<p><strong>Take the following code snippet for example</strong></p>
<pre><code class="lang-typescript"><span class="hljs-comment">// Application Layer – Developer's perspective</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">sendWelcomeEmail</span>(<span class="hljs-params">user: User</span>) </span>{
  <span class="hljs-keyword">const</span> message = {
    to: user.email,
    subject: <span class="hljs-string">"Welcome to Our Platform"</span>,
    body: <span class="hljs-string">`Hi <span class="hljs-subst">${user.name}</span>, thanks for joining us!`</span>
  };

  <span class="hljs-comment">// Message is serialized into JSON over HTTPS</span>
  <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">"https://email-service.internal/api/send"</span>, {
    method: <span class="hljs-string">"POST"</span>,
    headers: {
      <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span>,
      <span class="hljs-string">"Authorization"</span>: <span class="hljs-string">`Bearer <span class="hljs-subst">${process.env.API_TOKEN}</span>`</span>
    },
    body: <span class="hljs-built_in">JSON</span>.stringify(message)
  });
}
</code></pre>
<p>That looks simple, right? But here’s what’s actually happening underneath:</p>
<p><strong>HTTP Layer (Application → Transport)</strong></p>
<ul>
<li><p>The <code>fetch()</code> call constructs an HTTP request—following the <strong>HTTP/1.1 or HTTP/2 protocol</strong> spec.</p>
</li>
<li><p>Headers are formatted, body is encoded, request line is created.</p>
</li>
</ul>
<p><strong>TLS Layer (Transport → Network)</strong></p>
<ul>
<li>If HTTPS is used, TLS handles <strong>encryption</strong>, <strong>handshake</strong>, and <strong>certificate validation</strong>—following the <strong>TLS protocol</strong>.</li>
</ul>
<p><strong>TCP Layer</strong></p>
<ul>
<li>Below that, the request is chunked into packets and sent over <strong>TCP</strong>, which manages <strong>ordering</strong>, <strong>packet loss</strong>, and <strong>retry mechanisms</strong>.</li>
</ul>
<p><strong>IP Layer</strong></p>
<ul>
<li>TCP hands data to <strong>IP</strong>, which handles <strong>addressing</strong> and <strong>routing</strong> packets across networks.</li>
</ul>
<p><strong>Link Layer (Ethernet)</strong></p>
<ul>
<li>The network adapter frames the IP packets and sends them as <strong>electrical signals or photons</strong> using <strong>Ethernet</strong> or Wi-Fi protocols.  </li>
</ul>
<hr />
<p><strong>On the flip side;</strong></p>
<ul>
<li><p>If the <strong>TLS handshake fails</strong>, the entire request fails.</p>
</li>
<li><p>If <strong>TCP drops a packet</strong>, but retries work, the developer never notices.</p>
</li>
<li><p>If <strong>Ethernet collisions</strong> aren’t handled by the protocol, your entire application breaks.</p>
</li>
</ul>
<p>That’s the beauty of protocol layering: every layer <strong>abstracts away the horror</strong> of the layer beneath it, while <strong>strictly enforcing contracts</strong>.</p>
<p>And if any layer doesn’t speak the exact expected protocol? Miscommunication. Failure. Silence.</p>
<h2 id="heading-the-protocols-we-create">The Protocols We Create</h2>
<p>Protocols aren’t just things we consume, they’re also things we design. When you're designing a protocol, be it an internal API, an event contract, or a distributed consensus mechanism, you are shaping the <em>interface between people</em>.</p>
<p>You create a REST API? That’s a protocol. You publish a Kafka event format? That’s a protocol. You define a SQL schema? That’s a protocol between your code and the database.</p>
<p>Here’s a mini example of a <strong>custom internal protocol</strong> for idempotent request handling:</p>
<pre><code class="lang-plaintext">POST /api/charge HTTP/1.1
Idempotency-Key: a5c7efb4-91f4-11e5-bf7f-feff819cdc9f
Content-Type: application/json
</code></pre>
<p>The contract:</p>
<ul>
<li><p>If the same <code>Idempotency-Key</code> is received again, return the original result.</p>
</li>
<li><p>Must store response against the key for a period of time.</p>
</li>
<li><p>Must hash body to detect replay attacks.</p>
</li>
</ul>
<p>If your backend <em>ignores</em> any part of this protocol, duplicate charges could happen. Or worse, undetectable bugs.</p>
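<p>The contract above can be sketched as an in-memory store (illustrative only, with hypothetical names; a real service would persist keys with a TTL and use a cryptographic hash):</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Possible outcomes of the idempotency contract.
enum Outcome {
    Fresh(String),    // first time this key was seen: processed and stored
    Replayed(String), // same key, same body: original response returned
    Conflict,         // same key, different body: reject (replay or bug)
}

// Non-cryptographic hash, for illustration only.
fn body_hash(body: &str) -> u64 {
    let mut h = DefaultHasher::new();
    body.hash(&mut h);
    h.finish()
}

struct IdempotencyStore {
    // key -> (hash of request body, stored response)
    seen: HashMap<String, (u64, String)>,
}

impl IdempotencyStore {
    fn handle(&mut self, key: &str, body: &str, process: impl Fn(&str) -> String) -> Outcome {
        let h = body_hash(body);
        match self.seen.get(key) {
            Some((stored, resp)) if *stored == h => Outcome::Replayed(resp.clone()),
            Some(_) => Outcome::Conflict,
            None => {
                let resp = process(body);
                self.seen.insert(key.to_string(), (h, resp.clone()));
                Outcome::Fresh(resp)
            }
        }
    }
}

fn main() {
    let mut store = IdempotencyStore { seen: HashMap::new() };
    let charge = |body: &str| format!("charged: {}", body);

    // A retry with the same key and body returns the stored response
    // instead of charging twice.
    for _ in 0..2 {
        match store.handle("key-123", r#"{ "amount": 100 }"#, &charge) {
            Outcome::Fresh(r) => println!("processed -> {}", r),
            Outcome::Replayed(r) => println!("replayed  -> {}", r),
            Outcome::Conflict => println!("conflict: same key, different body"),
        }
    }
}
```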
<p>Great engineers don’t just design systems. They design <strong>agreements</strong> that enable systems to survive change, growth, and human error. And that’s what protocols really are.</p>
<h2 id="heading-the-engineers-burden-crafting-better-rules">The Engineer's Burden: Crafting Better Rules</h2>
<p>As software engineers, particularly those working on distributed or foundational systems, we are often not just <em>consumers</em> of protocols, but also <em>designers</em>. Whether defining an API contract, creating an internal RPC mechanism, or developing a new distributed algorithm, we are crafting the rules of interaction.</p>
<p>This is a significant responsibility. A poorly designed protocol can inflict pain for years, hindering development, causing production issues, and limiting future evolution. Conversely, a clean, well-defined, extensible protocol is a gift to future developers (including our future selves).</p>
<p>What makes a "good" protocol?</p>
<ul>
<li><p><strong>Clarity &amp; Unambiguity:</strong> Leave no room for interpretation. Define states, transitions, message formats, and error conditions precisely.</p>
</li>
<li><p><strong>Simplicity (where possible):</strong> Favor simplicity unless complexity is truly justified by the requirements.</p>
</li>
<li><p><strong>Extensibility:</strong> Think about future evolution. How will you add features? How will you version the protocol? (e.g., using feature flags, well-defined version negotiation).</p>
</li>
<li><p><strong>Efficiency:</strong> Consider the performance implications – serialization overhead, number of round trips, etc.</p>
</li>
<li><p><strong>Robustness:</strong> Define how errors are handled. What happens if a message is lost, duplicated, or corrupted?</p>
</li>
<li><p><strong>Security:</strong> Build security considerations in from the start, don't bolt them on later.</p>
</li>
</ul>
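<p>Extensibility, for instance, often comes down to a simple negotiation rule. Here is a hedged sketch (the scheme and names are mine, not a standard) in which the client advertises the versions it speaks and the server picks the highest one it also supports:</p>

```rust
// Toy version negotiation: pick the highest protocol version that
// both client and server support, or fail the connection cleanly.
fn negotiate(client: &[u32], server: &[u32]) -> Option<u32> {
    client.iter().copied().filter(|v| server.contains(v)).max()
}

fn main() {
    println!("{:?}", negotiate(&[1, 2, 3], &[2, 3, 4])); // Some(3)
    println!("{:?}", negotiate(&[1], &[2, 3]));          // None: no common version
}
```

<p>The important part is the explicit <code>None</code> branch: a well-designed protocol fails loudly when no common version exists, rather than limping along in ambiguity.</p>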
<h2 id="heading-conclusion-masters-of-the-rules">Conclusion: Masters of the Rules</h2>
<p>Protocols are the invisible architecture of our connected world and our complex software systems. They are the embodiment of the "anything built on anything" principle, enabling layers of abstraction and interoperability that make modern technology feasible. They are the source of immense "Good," allowing systems to communicate, coordinate, and scale.</p>
<p>But they also carry the potential for "Evil" – the rigidity of legacy, the complexity that breeds bugs, the ambiguities that cause friction, and the vulnerabilities that expose us.</p>
<p>As engineers, understanding protocols isn't just about knowing TCP vs. UDP or REST vs. gRPC. It's about recognizing the fundamental role of agreed-upon rules in <em>any</em> system we build. It's about appreciating the trade-offs inherent in their design and striving to be thoughtful, meticulous creators of the rules that will govern the interactions within our own complex creations. Because ultimately, the quality of our systems often comes down to the quality of the protocols holding them together.</p>
]]></content:encoded></item><item><title><![CDATA[Distributed Systems: Consensus Protocols]]></title><description><![CDATA[In the realm of distributed systems, consensus protocols play an important role. They ensure that multiple, often geographically dispersed, components of a system agree on a single source of truth. This agreement is essential for maintaining data co...]]></description><link>https://blog.sofwancoder.com/distributed-systems-consensus-protocols</link><guid isPermaLink="true">https://blog.sofwancoder.com/distributed-systems-consensus-protocols</guid><category><![CDATA[distributed system]]></category><category><![CDATA[protocols]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Wed, 14 Aug 2024 22:54:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723675935394/f2794b51-5827-433f-bc9c-c17833a8ac74.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the realm of distributed systems, consensus protocols play an important role. They ensure that multiple, often geographically dispersed, components of a system agree on a single source of truth. This agreement is essential for maintaining data consistency, reliability, and overall system coherence. The challenge lies in achieving consensus efficiently and accurately in an environment where individual components may fail, messages might be delayed or lost, and malicious actors could exist. This article delves into the intricacies of consensus protocols, their importance, various types, and practical implementations in distributed systems.</p>
<h3 id="heading-the-importance-of-consensus-protocols">The Importance of Consensus Protocols</h3>
<p>Consensus protocols are foundational to the operation of distributed systems, which include databases, cloud services, blockchain technologies, and more. The primary reasons for their importance are:</p>
<ol>
<li><p><strong>Consistency and Reliability</strong>: Ensuring that all nodes in a system have a consistent view of data is important. Without consensus, different parts of the system could make contradictory decisions, leading to data corruption and unreliable operations.</p>
</li>
<li><p><strong>Fault Tolerance</strong>: Distributed systems must be resilient to failures, whether they are due to network issues, hardware malfunctions, or software bugs. Consensus protocols help the system continue functioning correctly even when some components fail.</p>
</li>
<li><p><strong>Coordination and Synchronization</strong>: In many distributed applications, nodes need to coordinate actions, like committing a transaction or updating a record. Consensus protocols provide the mechanism for this coordination.</p>
</li>
</ol>
<h3 id="heading-types-of-consensus-protocols">Types of Consensus Protocols</h3>
<p>Consensus protocols can be broadly categorised based on their approach to achieving agreement among distributed nodes. The main types include:</p>
<ol>
<li><p><strong>Classical Consensus Protocols</strong>:</p>
<ul>
<li><p><strong>Paxos</strong>: Developed by Leslie Lamport, Paxos is a family of protocols for solving consensus in a network of unreliable or asynchronous processors. It is renowned for its robustness and is widely used in practical implementations.</p>
</li>
<li><p><strong>Raft</strong>: Designed to be more understandable than Paxos, Raft achieves the same goals of consistency and fault-tolerance. Raft divides the consensus problem into leader election, log replication, and safety.</p>
</li>
</ul>
</li>
<li><p><strong>Blockchain-Based Consensus</strong>:</p>
<ul>
<li><p><strong>Proof of Work (PoW)</strong>: Used by Bitcoin, PoW requires participants (miners) to solve complex cryptographic puzzles to validate transactions and create new blocks. This method is energy-intensive but has proven effective in decentralized settings.</p>
</li>
<li><p><strong>Proof of Stake (PoS)</strong>: Instead of computational power, PoS relies on participants staking their own cryptocurrency to validate transactions. This method is more energy-efficient and is used by platforms like Ethereum 2.0.</p>
</li>
</ul>
</li>
<li><p><strong>Byzantine Fault Tolerant (BFT) Protocols</strong>:</p>
<ul>
<li><p><strong>PBFT (Practical Byzantine Fault Tolerance)</strong>: Designed to tolerate Byzantine faults, where nodes may act maliciously or unpredictably. PBFT is highly efficient in environments where nodes are assumed to be partially trusted.</p>
</li>
<li><p><strong>Tendermint</strong>: Used in various blockchain applications, Tendermint provides BFT consensus with fast finality and is designed to support high transaction throughput.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-detailed-analysis-of-key-protocols">Detailed Analysis of Key Protocols</h3>
<h4 id="heading-paxos">Paxos</h4>
<p>Paxos is one of the most influential consensus protocols. It operates under the assumption that some nodes might fail or act asynchronously. Paxos consists of three roles: proposers, acceptors, and learners. The process involves multiple phases:</p>
<ol>
<li><p><strong>Prepare Phase</strong>: A proposer sends a prepare request with a proposal number to a quorum of acceptors. Acceptors respond with a promise not to accept proposals with a lower number and may include the last accepted proposal.</p>
</li>
<li><p><strong>Accept Phase</strong>: Once a majority of promises are received, the proposer sends an accept request with the proposal. An acceptor accepts the proposal unless it has already promised to honour a higher proposal number.</p>
</li>
<li><p><strong>Learn Phase</strong>: Once a proposal is accepted by a majority, the learners are informed about the chosen value, ensuring the entire system converges on this value.</p>
</li>
</ol>
<p>Paxos is highly fault-tolerant but can be complex to implement due to its multiple phases and requirements for quorum management.</p>
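<p>The acceptor side of the two phases above can be sketched as follows. This is a minimal, single-process illustration: the <code>Acceptor</code> class and its field names are hypothetical, not taken from any real Paxos implementation.</p>

```typescript
// Minimal sketch of a Paxos acceptor's state (illustrative only).
type Proposal = { number: number; value: string };

class Acceptor {
  private promisedNumber = -1;              // highest proposal number promised
  private accepted: Proposal | null = null; // last accepted proposal, if any

  // Phase 1: respond to a prepare request.
  prepare(n: number): { promised: boolean; accepted: Proposal | null } {
    if (n > this.promisedNumber) {
      this.promisedNumber = n;
      // The promise carries the last accepted proposal, if any, so the
      // proposer can adopt an already-chosen value.
      return { promised: true, accepted: this.accepted };
    }
    return { promised: false, accepted: null };
  }

  // Phase 2: respond to an accept request.
  accept(p: Proposal): boolean {
    // Accept unless we have already promised a higher-numbered proposal.
    if (p.number >= this.promisedNumber) {
      this.promisedNumber = p.number;
      this.accepted = p;
      return true;
    }
    return false;
  }
}
```

<p>A proposer that gathers promises from a majority of such acceptors before sending accept requests implements the quorum logic described above.</p>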
<h4 id="heading-raft">Raft</h4>
<p>Raft simplifies the consensus process by clearly defining roles and steps. It comprises three main components: leader election, log replication, and safety.</p>
<ol>
<li><p><strong>Leader Election</strong>: Nodes elect a leader who is responsible for managing the log replication. If a leader fails, a new one is elected.</p>
</li>
<li><p><strong>Log Replication</strong>: The leader receives log entries from clients and replicates them to follower nodes. Once a majority of followers acknowledge the log entries, they are committed and applied to the state machine.</p>
</li>
<li><p><strong>Safety</strong>: Raft ensures that once a log entry is committed, it remains committed and will be applied by all future leaders.</p>
</li>
</ol>
<p>Raft's structured approach makes it easier to understand and implement compared to Paxos, leading to its adoption in many modern distributed systems like etcd and Consul.</p>
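<p>The majority-commit rule at the heart of log replication can be sketched like this. The <code>RaftLeader</code> helper and its method names are hypothetical; this is not a full Raft implementation.</p>

```typescript
// Sketch of Raft's commit rule (illustrative only).
class RaftLeader {
  // log index -> set of follower ids that acknowledged the entry
  private acks = new Map<number, Set<string>>();

  constructor(private clusterSize: number) {}

  // Record a follower's acknowledgment of the entry at `index`.
  recordAck(index: number, followerId: string): void {
    if (!this.acks.has(index)) this.acks.set(index, new Set());
    this.acks.get(index)!.add(followerId);
  }

  // An entry is committed once a majority of the cluster stores it.
  // The leader itself counts as one replica, hence the +1.
  isCommitted(index: number): boolean {
    const followerAcks = this.acks.get(index)?.size ?? 0;
    return followerAcks + 1 > this.clusterSize / 2;
  }
}
```

<p>Using a <code>Set</code> per index means duplicate acknowledgments from the same follower are not double-counted.</p>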
<h4 id="heading-proof-of-work-and-proof-of-stake">Proof of Work and Proof of Stake</h4>
<p>In blockchain networks, consensus ensures the integrity and security of the decentralized ledger.</p>
<ol>
<li><p><strong>Proof of Work (PoW)</strong>: PoW requires participants to perform computational work to propose a new block. The process includes solving a cryptographic puzzle, which ensures that adding new blocks requires significant effort, deterring malicious actors. However, PoW is criticised for its high energy consumption.</p>
</li>
<li><p><strong>Proof of Stake (PoS)</strong>: PoS selects validators based on the number of coins they hold and are willing to "stake" as collateral. Validators are chosen randomly, and their probability of being selected is proportional to their stake. PoS is more energy-efficient and offers quicker finality than PoW.</p>
</li>
</ol>
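<p>A toy version of the PoW puzzle can be sketched as follows. This is illustrative only: real networks use binary difficulty targets and vastly more work, and the <code>mine</code> helper here is hypothetical.</p>

```typescript
import { createHash } from "crypto";

// Toy proof-of-work: find a nonce so that SHA-256(data + nonce)
// starts with `difficulty` hex zeros.
function mine(data: string, difficulty: number): { nonce: number; hash: string } {
  const target = "0".repeat(difficulty);
  let nonce = 0;
  for (;;) {
    const hash = createHash("sha256").update(data + nonce).digest("hex");
    if (hash.startsWith(target)) return { nonce, hash };
    nonce++;
  }
}
```

<p>Note the asymmetry PoW relies on: finding the nonce takes many hash attempts, but verifying it takes exactly one.</p>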
<h4 id="heading-byzantine-fault-tolerance-bfthttpsenwikipediaorgwikibyzantinefault"><a target="_blank" href="https://en.wikipedia.org/wiki/Byzantine_fault">Byzantine Fault Tolerance (BFT)</a></h4>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Byzantine_fault">BFT</a> protocols are designed to function correctly even if some nodes behave maliciously.</p>
<ol>
<li><p><a target="_blank" href="https://www.geeksforgeeks.org/practical-byzantine-fault-tolerancepbft/"><strong>Practical Byzantine Fault Tolerance (PBFT)</strong></a>: PBFT operates in a sequence of rounds, where a primary node proposes a value, and the other nodes (replicas) agree on this value through multiple rounds of voting. PBFT is designed for environments where the number of faulty nodes is less than one-third of the total nodes.</p>
</li>
<li><p><a target="_blank" href="https://cosmos-network.gitbooks.io/cosmos-academy/content/introduction-to-the-cosmos-ecosystem/tendermint-bft-consensus-algorithm.html"><strong>Tendermint</strong></a>: Tendermint uses a similar approach to PBFT but is optimised for blockchain applications. It offers quick finality and high transaction throughput, making it suitable for decentralized applications that require fast and secure consensus.</p>
</li>
</ol>
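<p>The fault threshold above can be expressed as a small helper: with n = 3f + 1 nodes, PBFT tolerates f Byzantine nodes and requires 2f + 1 matching votes. The function names are illustrative, not from a real PBFT library.</p>

```typescript
// PBFT tolerates f Byzantine nodes out of n = 3f + 1 total.
function maxFaulty(n: number): number {
  return Math.floor((n - 1) / 3);
}

// A decision is safe once 2f + 1 matching votes are collected.
function quorumSize(n: number): number {
  return 2 * maxFaulty(n) + 1;
}

function hasQuorum(n: number, matchingVotes: number): boolean {
  return matchingVotes >= quorumSize(n);
}
```

<p>For example, a four-node cluster tolerates one faulty node and needs three matching votes per round.</p>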
<h3 id="heading-considerations-for-implementing-consensus-protocols">Considerations for Implementing Consensus Protocols</h3>
<p>Implementing consensus protocols involves careful consideration of system requirements and constraints. Here are key aspects to consider:</p>
<ol>
<li><p><strong>Fault Tolerance and Network Assumptions</strong>: Different protocols are designed to handle different types of faults (e.g., crash faults, Byzantine faults). Understanding the failure model of your system is crucial for selecting the appropriate protocol.</p>
</li>
<li><p><strong>Performance and Scalability</strong>: The choice of protocol can significantly impact the system's performance and scalability. For instance, PoW offers robust security but is less scalable due to its high energy consumption, whereas PoS provides better scalability but requires a secure staking mechanism.</p>
</li>
<li><p><strong>Ease of Implementation</strong>: Protocols like Raft are easier to implement and understand, making them suitable for many practical applications. In contrast, Paxos, while robust, can be more challenging to implement correctly.</p>
</li>
<li><p><strong>Use Case Specifics</strong>: The application domain (e.g., blockchain, distributed databases) often dictates the choice of consensus protocol. Blockchain applications might prioritise security and decentralisation (favouring PoW or PoS), while distributed databases might prioritise consistency and performance (favouring Raft or Paxos).</p>
</li>
</ol>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Consensus protocols are the backbone of distributed systems, ensuring that multiple nodes can agree on a single source of truth despite failures and network issues. From the classical Paxos and Raft to the modern blockchain-based PoW and PoS, each protocol offers unique advantages and challenges. Understanding these protocols' principles, strengths, and limitations is essential for designing robust, reliable, and scalable distributed systems. As technology evolves, so will these protocols, continuing to play a pivotal role in the advancement of distributed computing.</p>
]]></content:encoded></item><item><title><![CDATA[Fault tolerance  in distributed systems]]></title><description><![CDATA[In today's connected world, distributed systems are everywhere. They help run things like cloud computing and social media, which we use every day. But these systems can sometimes fail, so making sure they work well is very important. That's why we n...]]></description><link>https://blog.sofwancoder.com/fault-tolerance-in-distributed-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/fault-tolerance-in-distributed-systems</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Fri, 13 Oct 2023 21:52:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1697232688392/e59a547a-79c8-49c9-9083-a1b73c1ad0b7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today's connected world, distributed systems are everywhere. They help run things like cloud computing and social media, which we use every day. But these systems can sometimes fail, so making sure they work well is very important. That's why we need fault tolerance. In this article, we'll talk about why fault tolerance matters in distributed systems and discuss different ways to make it happen.</p>
<h2 id="heading-understanding-distributed-systems"><strong>Understanding Distributed Systems</strong></h2>
<p>Before we talk about why fault tolerance is important in distributed systems, let's first understand what they are. A distributed system is a group of connected computers that work together to reach a shared goal. Instead of doing everything on one computer like in a traditional system, distributed systems spread tasks and information across many computers. This helps with scalability, load balancing, and fault handling, making distributed systems useful for many different applications.</p>
<h2 id="heading-the-importance-of-fault-tolerance"><strong>The Importance of Fault Tolerance</strong></h2>
<p>Fault tolerance is the ability of a system to keep functioning properly when failure occurs. In distributed systems, component failures can occur for various reasons, such as hardware malfunctions, network issues, or software errors.</p>
<p>Without fault tolerance mechanisms in place, a single point of failure can lead to system outages, data loss, and a bad user experience. The importance of fault tolerance in distributed systems can be summarized in several key points:</p>
<h3 id="heading-1-increased-reliability"><strong>1. Increased Reliability</strong></h3>
<p>Fault tolerance makes a distributed system more reliable. It limits the impact of failures so users can keep using the system without interruption. This reliability is critical in domains such as banking, healthcare, and infrastructure management.</p>
<h3 id="heading-2-high-availability"><strong>2. High Availability</strong></h3>
<p>Fault tolerance keeps a distributed system up and running despite the obstacles that come its way. High availability is crucial for applications that cannot afford downtime, such as e-commerce websites, streaming platforms, and communication tools, which all need to be accessible 24/7.</p>
<h3 id="heading-3-data-integrity"><strong>3. Data Integrity</strong></h3>
<p>In distributed systems, data is replicated across many locations for performance and redundancy. Fault tolerance mechanisms keep that data correct and consistent, even when components fail or data moves between nodes.</p>
<h3 id="heading-4-scalability"><strong>4. Scalability</strong></h3>
<p>One of the best things about distributed systems is their scalability: they can easily adapt to handle more work as needed. Fault tolerance is crucial in maintaining the scalability of these systems, as the addition or removal of nodes should not disrupt overall system operation.</p>
<h3 id="heading-5-disaster-recovery"><strong>5. Disaster Recovery</strong></h3>
<p>In distributed systems, a whole data centre might fail. Fault tolerance strategies, such as geographic redundancy, can help in disaster recovery scenarios, ensuring that the system can recover and continue operating in a different location.</p>
<h2 id="heading-achieving-fault-tolerance-in-distributed-systems"><strong>Achieving Fault Tolerance in Distributed Systems</strong></h2>
<p>To achieve fault tolerance in distributed systems, various strategies and techniques are employed. Here are some of the most common approaches:</p>
<h3 id="heading-1-redundancy"><strong>1. Redundancy</strong></h3>
<p>Redundancy involves replicating data or services across multiple nodes or components. If one node fails, another node that holds the same data can seamlessly take over, ensuring uninterrupted service. Redundancy can be applied at various levels, including data redundancy, node redundancy, and component redundancy.</p>
<h3 id="heading-2-load-balancing"><strong>2. Load Balancing</strong></h3>
<p>Load balancing is a technique that distributes incoming traffic or requests evenly across multiple nodes. This not only improves performance but also enhances fault tolerance. If one node becomes overwhelmed or fails, the load balancer can redirect traffic to healthy nodes, preventing overloads and downtime.</p>
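<p>The redirect-on-failure behaviour described above can be sketched as a round-robin picker that skips unhealthy nodes. The <code>RoundRobinBalancer</code> class and its node names are hypothetical, not from any real load balancer.</p>

```typescript
// Minimal round-robin balancer that skips unhealthy nodes (illustrative).
class RoundRobinBalancer {
  private next = 0;

  constructor(private nodes: string[], private healthy: Set<string>) {}

  markDown(node: string): void { this.healthy.delete(node); }
  markUp(node: string): void { this.healthy.add(node); }

  // Return the next healthy node in rotation, or null if none remain.
  pick(): string | null {
    for (let i = 0; i < this.nodes.length; i++) {
      const node = this.nodes[this.next];
      this.next = (this.next + 1) % this.nodes.length;
      if (this.healthy.has(node)) return node;
    }
    return null;
  }
}
```

<p>Production balancers add health probes, weights, and connection counts, but the core idea is this rotation over a healthy set.</p>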
<h3 id="heading-3-failover-and-failback"><strong>3. Failover and Failback</strong></h3>
<p>Failover and failback mechanisms automatically switch from a failed component to a backup or secondary component. This approach is often used for critical systems such as databases and web servers. After the primary component recovers, failback mechanisms switch back to the original component.</p>
<h3 id="heading-4-replication"><strong>4. Replication</strong></h3>
<p>Data replication is an essential technique for fault tolerance. By replicating data across multiple nodes, distributed systems can ensure data availability even if some nodes fail. Various replication strategies, including master-slave, leader-follower, and quorum-based approaches, are used to maintain data consistency and availability.</p>
<h3 id="heading-5-geographic-redundancy"><strong>5. Geographic Redundancy</strong></h3>
<p>For disaster recovery and high availability, geographic redundancy is employed. This involves replicating data and services across multiple data centres or locations, often in different regions or countries. If one location experiences a failure, the system can continue operating from another location.</p>
<h3 id="heading-6-error-detection-and-recovery"><strong>6. Error Detection and Recovery</strong></h3>
<p>Implementing mechanisms for error detection and recovery is crucial for fault tolerance. Systems can use techniques such as <a target="_blank" href="https://en.wikipedia.org/wiki/Heartbeat_(computing)">heart-beating</a>, health checks, and automated recovery procedures to identify and mitigate failures in real time.</p>
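<p>The heartbeat-based detection mentioned above can be sketched as a simple timeout check. This is illustrative only; the <code>HeartbeatMonitor</code> class is hypothetical, and real detectors (e.g. phi-accrual) are considerably more nuanced.</p>

```typescript
// Simple timeout-based heartbeat failure detector (illustrative).
class HeartbeatMonitor {
  private lastSeen = new Map<string, number>();

  constructor(private timeoutMs: number) {}

  // Record a heartbeat from `node` at time `now` (milliseconds).
  beat(node: string, now: number): void {
    this.lastSeen.set(node, now);
  }

  // A node is suspected failed if no heartbeat arrived within the timeout,
  // or if it has never been seen at all.
  isSuspected(node: string, now: number): boolean {
    const last = this.lastSeen.get(node);
    return last === undefined || now - last > this.timeoutMs;
  }
}
```

<p>Passing the clock in explicitly keeps the detector deterministic and easy to test; a real system would use monotonic time.</p>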
<h3 id="heading-7-distributed-consensus-algorithms"><strong>7. Distributed Consensus Algorithms</strong></h3>
<p>Distributed consensus algorithms like <a target="_blank" href="https://en.wikipedia.org/wiki/Paxos_(computer_science)">Paxos</a> and <a target="_blank" href="https://raft.github.io/">Raft</a> play a significant role in maintaining data consistency and fault tolerance. These algorithms help distributed systems agree on the order of operations and ensure that data remains accurate, even in the presence of network partitions or node failures.</p>
<h3 id="heading-8-monitoring-and-logging"><strong>8. Monitoring and Logging</strong></h3>
<p>Comprehensive monitoring and logging are essential for identifying and diagnosing failures. Logging enables administrators to trace the cause of issues while monitoring tools provide real-time insights into system performance and health.</p>
<h2 id="heading-challenges-of-fault-tolerance"><strong>Challenges of Fault Tolerance</strong></h2>
<p>While fault tolerance is essential for the reliability of distributed systems, it comes with its own set of challenges:</p>
<ol>
<li><p><strong>Complexity</strong>: Implementing fault tolerance mechanisms can significantly increase the complexity of a distributed system, making it more challenging to design, deploy, and maintain.</p>
</li>
<li><p><strong>Resource Overhead</strong>: Redundancy, replication, and other fault tolerance strategies usually require additional hardware and computational resources, which can increase operational costs.</p>
</li>
<li><p><strong>Consistency vs. Availability</strong>: Maintaining a balance between data consistency and system availability is a common challenge in distributed systems. Ensuring both can be complex, particularly in the presence of network partitions.</p>
</li>
<li><p><strong>Latency</strong>: Some fault tolerance mechanisms, such as geographic redundancy, can introduce latency, which may be unacceptable for real-time or low-latency applications.</p>
</li>
</ol>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Fault tolerance is key in distributed systems: it helps them stay reliable and protects data even when components fail. Using methods like redundancy, load balancing, and data replication, these systems can be strong enough for today's connected world. But it's important to balance fault tolerance against its challenges and costs, to make sure the system aligns with its purpose and goals. As technology keeps changing, fault tolerance remains an important concern for people who design and run distributed systems.</p>
]]></content:encoded></item><item><title><![CDATA[Distributed Systems: Synchronisation in Complex Systems]]></title><description><![CDATA[Complex systems are used in almost every aspect of computer science and engineering, from distributed databases and networked applications to multi-core processors and real-time embedded systems. To make sure that these complex systems work right and...]]></description><link>https://blog.sofwancoder.com/distributed-systems-synchronisation-in-complex-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/distributed-systems-synchronisation-in-complex-systems</guid><category><![CDATA[distributed system]]></category><category><![CDATA[engineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[protocols]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sun, 20 Aug 2023 20:36:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1692550992100/a9bf9fcd-4a34-4ee0-afbf-c0a9659e04a3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Complex systems are used in almost every aspect of computer science and engineering, from distributed databases and networked applications to multi-core processors and real-time embedded systems. To make sure that these complex systems work right and give dependable results, it is of the utmost importance to make sure that they are consistent and honest. Synchronisation becomes a key idea in keeping this regularity, allowing different parts and processes to work together well and produce correct results.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>A distributed system is a type of computer system that is made up of numerous independent computers which are connected through a network and communicate with one another. A computer that is part of such a system is referred to as a node, and each node in the system is tasked with carrying out a certain operation.</p>
<p>Distributed systems are capable of managing enormous amounts of data and traffic and can continue to function normally even if some of the nodes in the system fail. Because they are composed of multiple nodes, distributed systems can provide excellent scalability, dependability, and fault tolerance, which are their key advantages.</p>
<p>However, some difficulties emerge with distributed systems, such as the requirement to <strong>synchronise</strong> and maintain <a target="_blank" href="https://blog.sofwancoder.com/consistency-models-in-distributed-system"><strong>consistency</strong></a> across all nodes.</p>
<h2 id="heading-understanding-synchronization"><strong>Understanding Synchronization</strong></h2>
<p>Synchronisation is the coordination of activities and events between multiple entities so that they reach a shared goal correctly and coherently. In the context of complex systems, synchronisation is the process of managing how different parts, processors, threads, or distributed nodes interact with each other. This is done to keep the system in a coherent state and avoid conflicts that could cause incorrect behaviour or corrupt data.</p>
<h2 id="heading-what-are-the-challenges-in-complex-systems"><strong>What are the Challenges in Complex Systems?</strong></h2>
<p>Complex systems often have many parts or processes that run at the same time and need to share resources, communicate, and exchange data.</p>
<p>Several issues can occur when these interactions are not properly synchronised; a few of them are highlighted below:</p>
<h3 id="heading-race-conditions"><strong>Race Conditions</strong></h3>
<p>These occur when multiple processes or threads access shared resources concurrently and the final outcome depends on the order of execution. Race conditions can lead to unpredictable behavior and data corruption.</p>
<h3 id="heading-deadlocks"><strong>Deadlocks</strong></h3>
<p>A deadlock happens when multiple processes are unable to proceed because each is waiting for a resource held by another, resulting in a standstill.</p>
<h3 id="heading-data-inconsistency"><strong>Data Inconsistency</strong></h3>
<p>In distributed systems, data is often replicated across different nodes. Without proper synchronization, inconsistencies can arise due to delayed updates or conflicting modifications.</p>
<h3 id="heading-starvation">Starvation</h3>
<p>Some processes may be indefinitely delayed in accessing resources or progressing due to poor synchronization strategies, leading to reduced system performance.</p>
<h2 id="heading-which-aspect-of-distributed-system-requires-synchronization">Which aspect of distributed system requires Synchronization?</h2>
<p>Synchronisation is essential in a distributed system because it guarantees that all of the system's nodes are working towards the same objective and are aware of the actions taken by the other nodes in the system.</p>
<p>Simply put, the <strong>process of coordinating the actions of numerous computers (nodes)</strong> to make them function more efficiently together is referred to as synchronisation.</p>
<p>It is required in several aspects of distributed systems, some of which are:</p>
<h3 id="heading-resource-access">Resource Access</h3>
<p>It's possible that numerous nodes in a distributed system will require access to the same resource at the same time. Therefore, to make sure that only one node can access a resource at any given moment, synchronisation techniques are utilised.</p>
<h3 id="heading-event-ordering">Event Ordering</h3>
<p>Events can happen at different times on different parts of a distributed system. Synchronisation techniques are used to make sure that events are ordered correctly so that nodes can process them in the right sequence.</p>
<h3 id="heading-clock-synchronization">Clock Synchronization</h3>
<p>In a distributed system, each node has its own clock that isn't necessarily in sync with the others. Synchronisation techniques are used to ensure that all nodes have the same perception of time passing.</p>
<h2 id="heading-what-is-consistency-in-distributed-systems">What is Consistency in Distributed Systems?</h2>
<p><a target="_blank" href="https://blog.sofwancoder.com/consistency-models-in-distributed-system">Consistency</a> is the requirement that all nodes in a distributed system see the same data at the same time. In a distributed system, maintaining consistency is challenging because data can be modified on different nodes at different times. Consistency is required in several areas, two of which are:</p>
<p><strong>Data replication:</strong> Data may be replicated across numerous nodes in a distributed system to offer fault tolerance and availability. Consistency between replicas is required to ensure that all nodes see the same data.</p>
<p><strong>Distributed transactions:</strong> <a target="_blank" href="https://blog.sofwancoder.com/distributed-transactions-overview">Transactions in a distributed system</a> may involve numerous nodes. To ensure that a transaction is completed correctly, all nodes engaged in the transaction must maintain consistency.</p>
<h2 id="heading-what-are-the-types-of-synchronization-mechanisms"><strong>What are the types of Synchronization Mechanisms?</strong></h2>
<h3 id="heading-locksmutexes"><strong>Locks/Mutexes</strong></h3>
<p>These are basic synchronisation primitives that let only one process or thread hold a lock at a time, restricting access to a shared resource. Although they work well for avoiding race conditions, incorrect use can lead to deadlocks or reduced concurrency.</p>
<h3 id="heading-semaphores"><strong>Semaphores</strong></h3>
<p>Semaphores are numeric counters used to control access to a pool of resources. They are useful for limiting the number of processes that can use a shared resource at the same time.</p>
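<p>A counting semaphore can be sketched in TypeScript as a small async class. This is an illustrative sketch, not a production primitive; the <code>Semaphore</code> class here is hypothetical.</p>

```typescript
// Minimal async counting semaphore (illustrative sketch).
class Semaphore {
  private waiters: (() => void)[] = [];

  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    // No permit available: park until release() hands us one.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to a waiting acquirer
    else this.permits++;
  }
}
```

<p>Constructing it with <code>new Semaphore(1)</code> gives a mutex; larger counts bound how many tasks touch a resource concurrently.</p>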
<h3 id="heading-monitors">Monitors</h3>
<p>Monitors are a construct that bundles shared data and the synchronisation primitives protecting it into a single unit. They encapsulate the data together with the methods that operate on it and allow only one thread to execute inside the monitor at a time, so the data is never accessed concurrently.</p>
<h3 id="heading-message-passing">Message Passing</h3>
<p>In a distributed system, processes talk to each other by sending each other messages. Messages are sent and received consistently and in the right order when the right protocols are used.</p>
<h3 id="heading-atomic-operations"><strong>Atomic Operations</strong></h3>
<p>These are operations that execute as a single indivisible step and cannot be interrupted partway through. Because no other process can observe a half-completed operation, they guarantee the integrity of the data they modify.</p>
<h2 id="heading-what-are-the-techniques-for-achieving-synchronization-and-consistency">What are the techniques for achieving synchronization and consistency?</h2>
<p>Several techniques are used to achieve synchronization and consistency in distributed systems. Some of these techniques include:</p>
<h3 id="heading-two-phase-commit">Two-phase commit</h3>
<p><a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics">Two-phase commit is a protocol used to ensure that distributed transactions are completed correctly.</a> In the two-phase commit protocol, all nodes involved in a transaction must <a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics#:~:text=all%20participants%20agree%20to%20commit%20the%20transaction.">agree to commit</a> the transaction before it is considered complete.</p>
<h3 id="heading-vector-clocks">Vector clocks</h3>
<p>Vector clocks and logical clocks are techniques for ordering events in a distributed system, which aids in tracking the causality relationship between occurrences. Vector clocks give each node a vector that represents its point of view on events, whereas logical clocks keep a global logical time ordering. Even in the presence of network delays and asynchronous communication, these techniques help maintain a consistent view of events.</p>
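<p>The vector-clock bookkeeping described above can be sketched as follows. The helper names are hypothetical; the happened-before check follows the standard element-wise comparison.</p>

```typescript
// Minimal vector clock (illustrative sketch).
type VClock = Map<string, number>;

// Increment this node's entry on a local event or before sending.
function tick(clock: VClock, node: string): void {
  clock.set(node, (clock.get(node) ?? 0) + 1);
}

// On receiving a message, take the element-wise maximum, then tick.
function merge(local: VClock, remote: VClock, node: string): void {
  remote.forEach((t, n) => {
    local.set(n, Math.max(local.get(n) ?? 0, t));
  });
  tick(local, node);
}

// a "happened before" b iff a <= b element-wise and a !== b.
function happenedBefore(a: VClock, b: VClock): boolean {
  let leq = true;
  let strictlyLess = false;
  a.forEach((t, n) => {
    const tb = b.get(n) ?? 0;
    if (t > tb) leq = false;
    if (t < tb) strictlyLess = true;
  });
  b.forEach((tb, n) => {
    if ((a.get(n) ?? 0) < tb) strictlyLess = true;
  });
  return leq && strictlyLess;
}
```

<p>When <code>happenedBefore</code> is false in both directions, the two events are concurrent, which is exactly the causality information wall-clock timestamps cannot give you.</p>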
<h3 id="heading-quorum-based-systems">Quorum-based systems</h3>
<p><a target="_blank" href="https://blog.sofwancoder.com/distributed-system-understanding-quorum-based-systems">Quorum-based systems</a> are used to ensure that data replicas are consistent. A majority of nodes in a <a target="_blank" href="https://blog.sofwancoder.com/distributed-system-understanding-quorum-based-systems">quorum-based system</a> must agree on the value of a piece of data before it is considered correct.</p>
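<p>The rule behind this overlap guarantee is simple: with N replicas, a read quorum R and a write quorum W intersect whenever R + W &gt; N, so every read sees at least one replica holding the latest write. As a tiny illustrative helper:</p>

```typescript
// Quorum overlap rule: R + W > N guarantees that every read quorum
// intersects the most recent write quorum (illustrative helper).
function quorumsOverlap(n: number, r: number, w: number): boolean {
  return r + w > n;
}

// Common configuration: simple majorities for both reads and writes.
function majority(n: number): number {
  return Math.floor(n / 2) + 1;
}
```

<p>Systems can tune R and W within this constraint, e.g. a small R for read-heavy workloads paired with a larger W.</p>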
<h3 id="heading-consensus-algorithms">Consensus algorithms</h3>
<p>Consensus algorithms are used in distributed systems to ensure that all nodes agree on a certain value or decision. Consensus algorithms are used in situations when nodes must agree on a value, such as when electing a leader or establishing the order of transactions.</p>
<h3 id="heading-clock-synchronization-protocols">Clock synchronization protocols</h3>
<p>Clock synchronisation techniques are used to ensure that all nodes in a distributed system view time in the same way. Network Time Protocol (NTP) is a commonly used clock synchronisation protocol that keeps clocks across the network synchronised to within a few milliseconds.</p>
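<p>As a simplified illustration of how such protocols estimate clock offset, here is a Cristian-style calculation that assumes symmetric network delay. NTP itself uses multiple timestamps, repeated samples, and filtering; this helper is only a sketch.</p>

```typescript
// Cristian-style clock offset estimate (illustrative).
// t0: client time when the request was sent
// serverTime: the time reported by the server
// t1: client time when the reply arrived
// Assuming symmetric delay, the server's reading corresponds to the
// midpoint of the round trip, so the offset is serverTime minus that midpoint.
function estimateOffset(t0: number, serverTime: number, t1: number): number {
  const roundTrip = t1 - t0;
  return serverTime - (t0 + roundTrip / 2);
}
```

<p>A client would then slew its clock by the estimated offset rather than jumping it, to avoid time moving backwards.</p>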
<h2 id="heading-what-are-the-things-to-consider-when-adopting-synchronisation">What are the things to consider when adopting synchronisation?</h2>
<h3 id="heading-network-latency"><strong>Network Latency</strong></h3>
<p>Network latency increases the time it takes for nodes in different places to communicate, which makes it hard to maintain a consistent real-time view of the system. Synchronisation methods have to account for varying latencies and ensure that nodes don't mistake delayed updates for events that happened out of order.</p>
<h3 id="heading-node-failures"><strong>Node Failures</strong></h3>
<p>Nodes in distributed systems often fail because of hardware problems, software bugs, or network issues. Synchronisation methods need to handle situations in which nodes stop responding, so that the system continues to work correctly despite such failures.</p>
<h3 id="heading-scalability"><strong>Scalability</strong></h3>
<p>As distributed systems get bigger, it gets harder to keep everything in sync. Synchronisation mechanisms must scale as the number of nodes and interactions grows, without sacrificing performance.</p>
<h3 id="heading-consistency-availability-trade-off"><strong>Consistency-Availability Trade-off</strong></h3>
<p>The CAP theorem says that a distributed system can provide at most two out of three properties: Consistency, Availability, and Partition tolerance. Strategies for synchronisation often require making trade-offs between these qualities, which requires careful thought based on the needs of the application.</p>
<h3 id="heading-concurrency-and-contentions"><strong>Concurrency and Contentions</strong></h3>
<p>When multiple nodes access and change shared resources at the same time, contention and conflicts can arise. Finding the right balance between concurrency and synchronisation is tricky: too much locking hurts performance, while too much unchecked concurrency can lead to data corruption.</p>
<h2 id="heading-demonstrating-synchronization-in-a-distributed-system">Demonstrating synchronization in a distributed system</h2>
<p>Suppose we have a simple distributed system with two replicas of a key-value store. We want to ensure that any updates to the key-value store are synchronized across both replicas so that clients can always read the latest version of the data.</p>
<p>We can achieve this by implementing a synchronization protocol between the replicas. One such protocol is the <a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics">two-phase commit protocol</a>, which involves the following phases:</p>
<ol>
<li><p><a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics#heading-the-prepare-phase"><strong>Prepare phase</strong></a><strong>:</strong> The coordinator asks all replicas to prepare to commit changes.</p>
</li>
<li><p><a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics#heading-the-commit-phase"><strong>Commit phase:</strong></a> If all replicas can prepare successfully, the coordinator asks all replicas to commit the changes.</p>
</li>
</ol>
<p>In this example, we define a <code>Replica</code> class that represents a single replica of the distributed database. The class contains a <code>Map</code> that holds the key-value pairs of the database, as well as <code>get()</code> and <code>set()</code> methods to read and write to the database.</p>
<pre><code class="lang-typescript"><span class="hljs-comment">// Define a replica class that holds a copy of the distributed database</span>
<span class="hljs-keyword">class</span> Replica {
  <span class="hljs-keyword">private</span> data: <span class="hljs-built_in">Map</span>&lt;<span class="hljs-built_in">string</span>, <span class="hljs-built_in">string</span>&gt; = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>();

  get(key: <span class="hljs-built_in">string</span>): <span class="hljs-built_in">string</span> | <span class="hljs-literal">undefined</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.data.get(key);
  }

  set(key: <span class="hljs-built_in">string</span>, value: <span class="hljs-built_in">string</span>): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.data.set(key, value);
  }
}
</code></pre>
<p>We also define a <code>Coordinator</code> class that acts as the coordinator for the two-phase commit protocol between replicas. The class contains an array of <code>Replica</code> objects, as well as a <code>transactionInProgress</code> flag to prevent multiple transactions from occurring simultaneously. The <code>beginTransaction()</code> method is the entry point for the two-phase commit protocol. It first checks whether a transaction is already in progress, and returns an error if one is. Otherwise, it proceeds to the prepare and commit phases.</p>
<pre><code class="lang-typescript"><span class="hljs-comment">// Define the coordinator class that coordinates two-phase commit protocol between replicas</span>
<span class="hljs-keyword">class</span> Coordinator {
  <span class="hljs-keyword">private</span> replicas: Replica[];
  <span class="hljs-keyword">private</span> transactionInProgress: <span class="hljs-built_in">boolean</span> = <span class="hljs-literal">false</span>;

  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">replicas: Replica[]</span>) {
    <span class="hljs-built_in">this</span>.replicas = replicas;
  }

  <span class="hljs-keyword">async</span> beginTransaction(transaction: Transaction): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">boolean</span>&gt; {
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.transactionInProgress) {
      <span class="hljs-built_in">console</span>.error(<span class="hljs-string">"Transaction already in progress."</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }

    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Starting transaction with key <span class="hljs-subst">${transaction.key}</span> and value <span class="hljs-subst">${transaction.value}</span>`</span>);

    <span class="hljs-keyword">try</span> {
    <span class="hljs-comment">// Phase 1: Prepare phase - ask all replicas to prepare to commit changes</span>
      <span class="hljs-built_in">this</span>.transactionInProgress = <span class="hljs-literal">true</span>;
      <span class="hljs-keyword">const</span> prepareResponses = <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.all(
        <span class="hljs-built_in">this</span>.replicas.map(<span class="hljs-keyword">async</span> (replica) =&gt; <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.sendPrepare(replica, transaction))
      );

      <span class="hljs-keyword">if</span> (prepareResponses.some(<span class="hljs-function">(<span class="hljs-params">response</span>) =&gt;</span> !response)) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">"One or more replicas failed to prepare."</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
      }

    <span class="hljs-comment">// Phase 2: Commit phase - ask all replicas to commit changes</span>
      <span class="hljs-keyword">const</span> commitResponses = <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.all(
        <span class="hljs-built_in">this</span>.replicas.map(<span class="hljs-keyword">async</span> (replica) =&gt; <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.sendCommit(replica, transaction))
      );

      <span class="hljs-keyword">if</span> (commitResponses.some(<span class="hljs-function">(<span class="hljs-params">response</span>) =&gt;</span> !response)) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">"One or more replicas failed to commit."</span>);
        <span class="hljs-comment">// Strong consistency</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
      }

      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Transaction committed with key <span class="hljs-subst">${transaction.key}</span> and value <span class="hljs-subst">${transaction.value}</span>`</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error during transaction: <span class="hljs-subst">${error.message}</span>`</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    } <span class="hljs-keyword">finally</span> {
      <span class="hljs-built_in">this</span>.transactionInProgress = <span class="hljs-literal">false</span>;
    }
  }

  <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> sendPrepare(replica: Replica, transaction: Transaction): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">boolean</span>&gt; {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Preparing replica <span class="hljs-subst">${replica}</span>`</span>);
    <span class="hljs-comment">// Simulate network delay</span>
    <span class="hljs-keyword">await</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve</span>) =&gt;</span> <span class="hljs-built_in">setTimeout</span>(resolve, <span class="hljs-built_in">Math</span>.random() * <span class="hljs-number">1000</span>));
    <span class="hljs-comment">// Simulate the replica's prepare vote (in a real system the replica</span>
    <span class="hljs-comment">// would validate the transaction and persist it to a write-ahead log)</span>
    <span class="hljs-keyword">const</span> prepared = <span class="hljs-built_in">Math</span>.random() &gt; <span class="hljs-number">0.1</span>;
    <span class="hljs-keyword">if</span> (prepared) {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Replica <span class="hljs-subst">${replica}</span> prepared successfully`</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    } <span class="hljs-keyword">else</span> {
      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Replica <span class="hljs-subst">${replica}</span> failed to prepare`</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }
  }

  <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> sendCommit(replica: Replica, transaction: Transaction): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">boolean</span>&gt; {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Committing to replica <span class="hljs-subst">${replica}</span>`</span>);
    <span class="hljs-comment">// Simulate network delay</span>
    <span class="hljs-keyword">await</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve</span>) =&gt;</span> <span class="hljs-built_in">setTimeout</span>(resolve, <span class="hljs-built_in">Math</span>.random() * <span class="hljs-number">1000</span>));
    replica.set(transaction.key, transaction.value);
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Replica <span class="hljs-subst">${replica}</span> committed successfully`</span>);
    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  }
}
</code></pre>
<p>In the prepare phase, the <code>Coordinator</code> object sends a <code>prepare</code> message to each replica, and waits for a response. If any replica fails to prepare successfully, the transaction is aborted and an error response is returned.</p>
<p>In the commit phase, the <code>Coordinator</code> object sends a <code>commit</code> message to each replica, and waits for a response. If any replica fails to commit successfully, the transaction is aborted and an error response is returned.</p>
<p>To simulate the network delay between replicas, the <code>sendPrepare()</code> and <code>sendCommit()</code> methods each contain a call to <code>setTimeout()</code> with a random delay.</p>
<p>Finally, we set up an Express app with a single endpoint <code>/transaction</code> that calls the <code>beginTransaction()</code> method on the <code>Coordinator</code> object.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express <span class="hljs-keyword">from</span> <span class="hljs-string">"express"</span>;

<span class="hljs-keyword">interface</span> Transaction {
  key: <span class="hljs-built_in">string</span>;
  value: <span class="hljs-built_in">string</span>;
}

<span class="hljs-keyword">const</span> replica1 = <span class="hljs-keyword">new</span> Replica();
<span class="hljs-keyword">const</span> replica2 = <span class="hljs-keyword">new</span> Replica();
<span class="hljs-keyword">const</span> coordinator = <span class="hljs-keyword">new</span> Coordinator([replica1, replica2]);

<span class="hljs-keyword">const</span> app = express();
app.use(express.json());

app.post(<span class="hljs-string">"/transaction"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> transaction = req.body <span class="hljs-keyword">as</span> Transaction;
  <span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> coordinator.beginTransaction(transaction);
  <span class="hljs-keyword">if</span> (result) {
    res.sendStatus(<span class="hljs-number">200</span>);
  } <span class="hljs-keyword">else</span> {
    res.sendStatus(<span class="hljs-number">500</span>);
  }
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"Server started on port 3000"</span>);
});
</code></pre>
<p>With this implementation, clients can initiate a transaction by sending a POST request to the <code>/transaction</code> endpoint. The <code>Coordinator</code> object will then coordinate the two-phase commit protocol between the replicas, ensuring that any updates to the database are <strong>synchronized</strong> across both replicas.</p>
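<p>For illustration, here is a sketch of what a client call to this endpoint might look like. The key and value are hypothetical, and the <code>fetch</code> call itself appears only as a comment so the snippet stays self-contained:</p>

```typescript
// Build the JSON payload a client would POST to the /transaction endpoint.
// The key and value below are hypothetical example data.
interface Transaction {
  key: string;
  value: string;
}

const transaction: Transaction = { key: "user:42", value: "alice" };

// Options that could be passed to fetch(); sending is not performed here.
const requestOptions = {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(transaction),
};

console.log(requestOptions.body);
// e.g. await fetch("http://localhost:3000/transaction", requestOptions)
```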
<h2 id="heading-some-real-world-applications-of-synchronisation">Some Real-world applications of synchronisation</h2>
<p>The role of synchronization in maintaining consistency is evident across a broad spectrum of real-world applications:</p>
<ol>
<li><p><strong>Databases:</strong> Transaction management relies on synchronization to maintain the integrity of data and prevent conflicts between concurrent transactions.</p>
</li>
<li><p><strong>Operating Systems:</strong> To ensure fair and effective resource utilisation, process scheduling, memory management, and resource sharing all need synchronisation.</p>
</li>
<li><p><strong>Parallel Programming:</strong> Synchronisation is used by multi-core processors to organise the execution of multiple threads, making sure that data is shared correctly and preventing "race conditions."</p>
</li>
<li><p><strong>Distributed Systems:</strong> Systems with multiple nodes need synchronisation to handle data replication, maintain stability, and prevent conflicts.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In the complex and interconnected world of computing, <strong>synchronisation is a key part of keeping systems correct, secure, and able to work together.</strong> Synchronisation mechanisms <strong>prevent race conditions, deadlocks, and data inconsistencies</strong> by managing how different components communicate and share resources. As technology advances and systems grow more complex, a solid understanding of synchronisation and how <strong>it can be used to build reliable, high-performance systems</strong> that give accurate results remains essential.</p>
]]></content:encoded></item><item><title><![CDATA[Database Engines: Overview]]></title><description><![CDATA[Today, information is essential to the success of every business. The need for efficient data storage, retrieval, and administration grows as the volume of data generated grows. This is when DBMSs come in handy. Data storage and management are made e...]]></description><link>https://blog.sofwancoder.com/database-engines-overview</link><guid isPermaLink="true">https://blog.sofwancoder.com/database-engines-overview</guid><category><![CDATA[Databases]]></category><category><![CDATA[databasemanagement]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[system]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Tue, 25 Jul 2023 21:51:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1690321803670/4293b6e8-365a-4171-9b2e-c082cc7032a1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, information is essential to the success of every business. As the volume of data generated grows, so does the need for efficient storage, retrieval, and administration. This is where database management systems (DBMSs) come in. Database engines make data storage and management easier: they offer a toolkit for effective data storage, retrieval, and manipulation. What are database engines, and why are they important in database engineering? We'll answer those questions and more in this post.</p>
<h2 id="heading-definition-of-database-engines"><strong>Definition of Database Engines</strong></h2>
<p>A database engine is a component of software that facilitates working with databases. It controls how information in a database is saved, retrieved, and altered. Database engines act as a bridge between the database's data and the user, facilitating interaction with the information stored there. MySQL, Oracle, and SQL Server are just a few of the most well-known database management systems.</p>
<h2 id="heading-why-do-database-engines-matter-in-database-engineering"><strong>Why do Database Engines Matter in Database Engineering?</strong></h2>
<p>Database engines play a critical role in database engineering. They provide the tools and capabilities necessary for the efficient and effective management of databases. Without database engines, it would be challenging to store, retrieve, and manipulate data in an organized and systematic manner.</p>
<p>Some of the key benefits of using database engines include:</p>
<h3 id="heading-data-integrity">Data Integrity</h3>
<p>Database engines ensure that data is stored, retrieved, and manipulated accurately and consistently. This is crucial for maintaining data integrity, which is essential for making informed decisions based on the data.</p>
<h3 id="heading-scalability">Scalability</h3>
<p>Database engines provide scalable solutions for storing and managing data. This means that as the amount of data increases, the database can be scaled up to accommodate the growth.</p>
<h3 id="heading-security">Security</h3>
<p>Database engines provide a range of security features to protect data from unauthorized access, modification, and destruction.</p>
<h3 id="heading-performance">Performance</h3>
<p>Database engines are designed to optimize the performance of database operations, ensuring that data is retrieved and manipulated quickly and efficiently.</p>
<h3 id="heading-ease-of-use">Ease of Use</h3>
<p>Database engines provide user-friendly interfaces that make it easy for users to interact with the database and perform operations on the data.</p>
<h2 id="heading-types-of-database-engines"><strong>Types of Database Engines</strong></h2>
<p>There are several types of database engines, each with its own strengths and weaknesses.</p>
<p>The most common types of database engines include:</p>
<h3 id="heading-relational-database-engines">Relational Database Engines</h3>
<p>Relational database engines are the most widely used type of database engine. They are based on the relational database model, which organizes data into tables, each with a unique primary key. Relational databases use Structured Query Language (SQL) to retrieve and manipulate data. Some of the popular relational database engines include MySQL, Oracle, and SQL Server.</p>
<h4 id="heading-how-relational-database-engines-work">How Relational Database Engines Work</h4>
<p>Relational database engines store data in tables, with each table representing a different entity or object. Each table has a unique primary key that is used to identify the rows in the table. Tables can be related to each other using foreign keys, which are used to establish relationships between tables.</p>
<p>To retrieve data from a relational database, users use SQL queries. SQL is a standard language used to manipulate relational databases. SQL queries are used to select, insert, update, and delete data from the database.</p>
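<p>As a rough sketch (not a real engine), the table-and-foreign-key idea above can be modelled in a few lines of TypeScript. The <code>Author</code> and <code>Book</code> tables and their data are invented for illustration:</p>

```typescript
// Two "tables" keyed by primary key, related through a foreign key.
interface Author { id: number; name: string }
interface Book { id: number; title: string; authorId: number } // authorId is the foreign key

const authors = new Map<number, Author>([
  [1, { id: 1, name: "Ada" }],
  [2, { id: 2, name: "Linus" }],
]);

const books = new Map<number, Book>([
  [10, { id: 10, title: "Notes", authorId: 1 }],
  [11, { id: 11, title: "Diary", authorId: 2 }],
]);

// Roughly: SELECT b.title, a.name FROM books b JOIN authors a ON b.authorId = a.id
const joined = [...books.values()].map((b) => ({
  title: b.title,
  author: authors.get(b.authorId)?.name,
}));

console.log(joined);
```

<p>A real engine, of course, adds indexing, query planning, and transactional guarantees on top of this basic shape.</p>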
<h4 id="heading-examples-of-relational-database-engines">Examples of Relational Database Engines</h4>
<ul>
<li><p><strong>MySQL</strong>:</p>
<p>  MySQL is an open-source relational database engine that is widely used in web applications. It is known for its scalability and performance.</p>
</li>
<li><p><strong>Oracle</strong>:</p>
<p>  Oracle is a popular relational database engine used in enterprise applications. It provides a range of features for managing large and complex databases.</p>
</li>
</ul>
<h3 id="heading-nosql-database-engines">NoSQL Database Engines</h3>
<p>NoSQL database engines are designed to handle unstructured or semi-structured data. Unlike relational databases, NoSQL databases do not use tables to store data. Instead, they use a variety of data models, such as document, key-value, graph, and column-family models. NoSQL databases are highly scalable and can handle large amounts of data with ease. Some of the popular NoSQL database engines include MongoDB, Cassandra, and Redis.</p>
<h4 id="heading-how-nosql-database-engines-work">How NoSQL Database Engines Work</h4>
<p>NoSQL databases are designed to handle unstructured or semi-structured data, which makes them highly flexible and scalable. NoSQL databases use a variety of data models to store data, including document, key-value, graph, and column-family models.</p>
<p>In a document model, data is stored in a document, which is similar to a JSON object. In a key-value model, data is stored as a set of key-value pairs. In a graph model, data is stored as nodes and edges, which are used to represent relationships between objects. In a column-family model, data is stored in columns, which are grouped into column families.</p>
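<p>To make the contrast concrete, here is the same hypothetical user record expressed in a document shape and in a flat key-value shape (the key naming scheme is just one possible convention):</p>

```typescript
// Document model: a nested, JSON-like document.
const userDocument = {
  _id: "u1",
  name: "Ada",
  addresses: [{ city: "London", primary: true }],
};

// Key-value model: flat keys mapping to opaque string values.
const kvStore = new Map<string, string>([
  ["user:u1:name", "Ada"],
  ["user:u1:address:0:city", "London"],
]);

console.log(userDocument.addresses[0].city, kvStore.get("user:u1:name"));
```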
<p>NoSQL databases use a variety of query languages to retrieve and manipulate data. Some of the popular query languages used in NoSQL databases include MongoDB Query Language (MQL) and Cassandra Query Language (CQL).</p>
<h4 id="heading-examples-of-nosql-database-engines">Examples of NoSQL Database Engines</h4>
<ul>
<li><p><strong>MongoDB</strong>:</p>
<p>  MongoDB is a popular NoSQL database engine that is widely used in web applications. It uses a document data model and provides a range of features for handling unstructured data.</p>
</li>
<li><p><strong>Cassandra</strong>:</p>
<p>  Cassandra is a highly scalable NoSQL database engine that is designed to handle large amounts of data with ease. It uses a column-family data model and provides a range of features for handling distributed data.</p>
</li>
</ul>
<h3 id="heading-graph-database-engines">Graph database engines</h3>
<p>Graph database engines are built to handle complex connections between data, like those found in social networks, recommendation engines, and knowledge graphs. They are based on graph theory, the branch of mathematics concerned with representing and analysing relationships between objects.</p>
<p>Graph database engines store data as nodes and edges. Nodes are things like people, places, or goods, and edges are things like "likes," "follows," or "location" that show how these things are related to each other. This makes it possible to query complicated relationships quickly and easily and to do advanced analytics and machine learning on the data.</p>
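<p>A minimal sketch of this idea in TypeScript: people as nodes, "follows" edges stored in an adjacency list, and a two-hop relationship query. The names and data are invented for illustration:</p>

```typescript
// Adjacency list: each node maps to the nodes it has "follows" edges to.
const follows = new Map<string, string[]>([
  ["alice", ["bob"]],
  ["bob", ["carol", "dave"]],
  ["carol", []],
  ["dave", []],
]);

// Two-hop query: who do the people Alice follows themselves follow?
function followsOfFollows(user: string): string[] {
  const direct = follows.get(user) ?? [];
  return direct.flatMap((friend) => follows.get(friend) ?? []);
}

console.log(followsOfFollows("alice")); // ["carol", "dave"]
```

<p>A production graph engine stores these edges with indexes so that multi-hop traversals stay fast even across billions of relationships.</p>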
<p>One strength of graph database engines is that they can scale horizontally, which makes them well suited to large datasets. They are also very flexible: new nodes and edges can be added without redesigning the schema. This makes them a good fit for recommendation engines, fraud detection, and knowledge graphs.</p>
<p>Some popular graph database engines include Neo4j and Amazon Neptune.</p>
<h3 id="heading-object-oriented-database-engines">Object-oriented database engines</h3>
<p>These are designed to store objects rather than rows and columns. Examples include ObjectDB and Versant.</p>
<h3 id="heading-in-memory-database-engines">In-Memory Database Engines</h3>
<p>In-memory database engines keep data in main memory rather than on disk, so data can be read and updated very quickly. In-memory databases are frequently used in applications that need to process data in real time, such as online trading platforms and games.</p>
<h4 id="heading-how-in-memory-database-engines-work">How In-Memory Database Engines Work</h4>
<p>In-memory database engines hold their working data in memory, which makes them very fast. Many support standard SQL queries to read and modify data, just like regular relational databases. In-memory stores are also often paired with traditional databases: data is persisted on disk and then cached in memory so it can be accessed quickly.</p>
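<p>The caching pattern described above can be sketched as follows, with a <code>Map</code> standing in for both the disk-backed store and the memory cache (all names and data are illustrative):</p>

```typescript
// Cache-aside sketch: read from memory first, fall back to "disk" on a miss.
const diskStore = new Map<string, string>([["config", "v1"]]); // stand-in for a disk-backed database
const memoryCache = new Map<string, string>();
let diskReads = 0;

function read(key: string): string | undefined {
  const cached = memoryCache.get(key);
  if (cached !== undefined) return cached; // fast path: served from memory
  diskReads++;
  const value = diskStore.get(key); // slow path: fetch from disk
  if (value !== undefined) memoryCache.set(key, value); // populate the cache
  return value;
}

read("config"); // miss: hits the disk store
read("config"); // hit: served from memory
console.log(diskReads); // 1
```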
<h4 id="heading-examples-of-in-memory-database-engines">Examples of In-Memory Database Engines</h4>
<ul>
<li><p><strong>SAP HANA</strong>:</p>
<p>  SAP HANA is an in-memory database engine that is widely used in enterprise applications. It is known for its high performance and scalability.</p>
</li>
<li><p><strong>Redis</strong>:</p>
<p>  Redis is an open-source in-memory database engine that is commonly used as a cache. It provides a range of features for storing and retrieving data in memory.</p>
</li>
</ul>
<h2 id="heading-factors-to-consider-when-choosing-a-database-engine"><strong>Factors to Consider When Choosing a Database Engine</strong></h2>
<p>When choosing a database engine, there are several factors to consider, including:</p>
<ul>
<li><p><strong>Scalability</strong>: Can the database engine handle the amount of data you need to store and manipulate?</p>
</li>
<li><p><strong>Security</strong>: Does the database engine provide the necessary security features to protect your data?</p>
</li>
<li><p><strong>Data Consistency</strong>: Does the database engine ensure that data is stored and manipulated accurately and consistently?</p>
</li>
<li><p><strong>Performance</strong>: How quickly can the database engine retrieve and manipulate data?</p>
</li>
<li><p><strong>Ease of Use</strong>: Is the database engine easy to use and understand?</p>
</li>
<li><p><strong>Cost</strong>: What is the cost of using the database engine?</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Database engines are an important part of modern data-management tooling. They provide the capabilities needed to handle databases efficiently and effectively, and choosing the right one is essential if you want to store, retrieve, and modify data in a consistent and accurate way. The most popular types of database engines are relational, NoSQL, and in-memory, each with its own strengths and weaknesses. It's important to consider factors such as scalability, security, data consistency, performance, ease of use, and cost when picking a database engine.</p>
]]></content:encoded></item><item><title><![CDATA[Distributed System: Understanding Quorum-Based Systems]]></title><description><![CDATA[In distributed systems, quorum-based approaches are essential mechanisms for maintaining consistency and availability in the face of network partitions or failures. A quorum is a subset of nodes in a distributed system that must agree on a particular...]]></description><link>https://blog.sofwancoder.com/distributed-system-understanding-quorum-based-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/distributed-system-understanding-quorum-based-systems</guid><category><![CDATA[Node.js]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[System Architecture]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sat, 06 May 2023 22:58:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683410183611/9367d38a-db10-4df4-9e8e-2030147f34b6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In distributed systems, quorum-based approaches are essential mechanisms for maintaining consistency and availability in the face of network partitions or failures. A quorum is a subset of nodes in a distributed system that must agree on a particular decision or action before it is considered valid. Quorum-based systems are designed to ensure that a decision or action is not taken unless a sufficient number of nodes agree on it, thereby guaranteeing data consistency and availability. In this article, we will explore the concept of quorum-based systems in detail, including their role in distributed systems, the types of quorums, and how they are implemented.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>As you may already know, distributed systems are made up of many independent components that must coordinate their efforts to achieve a common goal. However, if these nodes are unable to communicate with one another, or if a node fails, the results can be catastrophic. Quorum-based systems address this problem.</p>
<p>A Quorum-Based System is a simple mechanism for ensuring consensus among a group of nodes in a distributed system. A majority of nodes, or a "Quorum," is required to reach a consensus.</p>
<p>Without quorum-based mechanisms, distributed systems struggle to make fair and timely decisions: critical choices may be delayed or made inconsistently. Managing distributed systems is difficult, but with a quorum-based approach in place, conflicting decisions between nodes are avoided.</p>
<h2 id="heading-quorum-based-systems-in-distributed-systems">Quorum-Based Systems in Distributed Systems</h2>
<p>In order to accomplish a goal, multiple servers or nodes in a network coordinate and share data. In these types of systems, it is of the utmost importance that data remain accessible and intact in the case of a node failure or network partition.</p>
<p>Quorum-based systems are one way to keep data consistent and available. In a quorum-based system, a decision or action only takes effect if enough nodes agree on it, which ensures the action is valid and that the system stays consistent and available.</p>
<p>As an illustration, consider a distributed system with ten nodes and a required quorum of six. A decision is not considered valid unless it has the support of at least six of the nodes in the network; only once six or more nodes vote in favour can the decision be carried out.</p>
<p>Quorum-based approaches are particularly useful in distributed systems because individual nodes can fail and the network can become partitioned. It is therefore extremely important to maintain data consistency and availability even when some nodes are offline or unable to communicate with one another.</p>
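<p>The majority check described above is simple to express in code. A minimal sketch, assuming ten nodes and a simple-majority quorum of six:</p>

```typescript
// With 10 nodes, a simple majority quorum is floor(10 / 2) + 1 = 6.
const TOTAL_NODES = 10;
const QUORUM = Math.floor(TOTAL_NODES / 2) + 1;

// A decision is valid only if at least QUORUM nodes acknowledge it.
function decisionIsValid(acks: boolean[]): boolean {
  const agreed = acks.filter(Boolean).length;
  return agreed >= QUORUM;
}

console.log(decisionIsValid([true, true, true, true, true, true, false, false, false, false])); // true
console.log(decisionIsValid([true, true, true, false, false, false, false, false, false, false])); // false
```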
<h2 id="heading-types-of-quorums"><strong>Types of Quorums</strong></h2>
<p>There are several types of quorums used in distributed systems. The most common types are:</p>
<h3 id="heading-read-quorum"><strong>Read Quorum</strong></h3>
<p>In a distributed system, a read quorum is the number of nodes that must agree on a read operation for it to be valid. For example, say we have a distributed system with ten nodes and a read quorum of six. When a read is requested, the system will only return the data if at least six nodes can be reached and agree on its value.</p>
<p>Read quorums help ensure that data retrieved from a distributed system is both correct and available. If fewer than six nodes can be reached, the read is invalid, and the system waits until a quorum can be met.</p>
<h3 id="heading-write-quorum"><strong>Write Quorum</strong></h3>
<p>A write quorum is the number of nodes in a distributed system that must agree on a write operation for it to be valid. For example, say we have a distributed system with ten nodes and a write quorum of six. When a write is requested, the system won't perform it unless it can reach at least six nodes that all accept the new value.</p>
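<p>Read and write quorums interact through a standard intersection property: when R + W &gt; N, every read quorum overlaps every write quorum, so at least one replica in any read set holds the latest committed write. A minimal sketch (the N, R, and W values are illustrative):</p>

```typescript
// Quorum intersection: a read of R replicas is guaranteed to see the latest
// write (made to W replicas out of N) whenever R + W > N.
function quorumsOverlap(n: number, r: number, w: number): boolean {
  return r + w > n;
}

console.log(quorumsOverlap(10, 6, 6)); // true:  6 + 6 > 10, every read sees the write
console.log(quorumsOverlap(10, 3, 6)); // false: a read of 3 replicas might miss it
```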
<h3 id="heading-membership-quorum"><strong>Membership Quorum</strong></h3>
<p>In a distributed system, a membership quorum is the set of nodes that must agree on changes to the system's membership before those changes take effect. Say, for example, we have a distributed system with ten nodes and a membership quorum of six. When a node joins or leaves the system, the change is only valid if at least six nodes agree to it.</p>
<h3 id="heading-configuration-quorum"><strong>Configuration Quorum</strong></h3>
<p>In a distributed system, a configuration quorum is the set of nodes that must agree on changes to the system's configuration before those changes take effect. For example, say we have a distributed system with ten nodes and a configuration quorum of six. When a configuration change is requested, such as changing the number of nodes in the system or the replication factor, the change is only valid if at least six nodes agree to it.</p>
<h2 id="heading-implementing-quorum-based-systems"><strong>Implementing Quorum-Based Systems</strong></h2>
<p>Implementing quorum-based systems in distributed systems requires careful thought about a number of things, such as the size of the system, the number of nodes, and the desired level of consistency and availability.</p>
<p>Using a consensus algorithm, like the <strong>Paxos algorithm</strong> or the <strong>Raft algorithm</strong>, to set up quorum-based systems is one way to do it. These methods make sure that at least a certain number of nodes agree on a decision or action, even if some nodes fail or the network breaks up.</p>
<p>Another approach is a <strong>distributed hash table (DHT)</strong>, a structure that maps keys to values across many nodes. In a <strong>DHT</strong>, a read or write can be performed by contacting only a subset of the system's nodes, with a quorum required to make sure the operation is valid.</p>
<p>No matter what method is used, setting up quorum-based systems requires careful planning and testing to make sure that the system stays consistent and usable even if there are problems or parts of the network are cut off.</p>
<h2 id="heading-quorum-consensus-algorithms">Quorum Consensus algorithms</h2>
<p>Quorum consensus algorithms keep distributed systems consistent and reliable: by exchanging messages, they help the nodes of a distributed system agree on a single value.</p>
<p>Some popular Quorum Consensus algorithms are:</p>
<ul>
<li><p><strong>Paxos</strong> is a protocol used in fault-tolerant distributed systems. It guarantees that at most one value is chosen, and that once a value has been chosen, the system never goes back on it.</p>
</li>
<li><p>The <strong>Raft</strong> algorithm is another prominent one. It was designed to be easier to understand than Paxos by decomposing consensus into distinct subproblems, such as leader election and log replication.</p>
</li>
<li><p><strong>Zab</strong> is another consensus algorithm that is used in Apache ZooKeeper, a popular open-source distributed coordination system.</p>
</li>
<li><p><strong>Viewstamped Replication</strong> is a state-machine replication protocol introduced in 1988.</p>
</li>
</ul>
<p>Even though quorum-based methods are very useful, implementing them is not always easy: coordination is hard to get right, and performance can be difficult to optimise. Still, the benefits generally outweigh the costs, because quorum-based systems protect against incorrect or conflicting results.</p>
<h2 id="heading-examples-of-quorum-based-systems">Examples of Quorum-Based Systems</h2>
<p>Moving on to some examples of systems using quorum-based protocols, we have Cassandra, DynamoDB, HBase, and Consul.</p>
<ul>
<li><p><strong>Cassandra is a distributed NoSQL database</strong> used by large organizations such as Apple and Instagram. It offers tunable, quorum-based consistency levels and uses a modified version of the <strong>Paxos consensus algorithm</strong> for lightweight transactions.</p>
</li>
<li><p>On the other hand, <strong>DynamoDB, a managed NoSQL database</strong> service from AWS, descends from Amazon's Dynamo design and uses <strong>quorum-based replication</strong> for maintaining consistency.</p>
</li>
<li><p><strong>HBase, an open-source, distributed database system</strong>, uses the Hadoop Distributed File System to store data. It relies on Apache ZooKeeper, which was inspired by Google’s Chubby lock service, for distributed coordination.</p>
</li>
<li><p>Lastly, <strong>Consul, a service discovery and configuration tool</strong>, uses a Raft-based consensus algorithm.</p>
</li>
</ul>
<p>These examples show how quorum-based systems are not limited to a specific industry or use case. The range of systems incorporating this protocol shows its applicability to multiple scenarios, from databases to distributed systems.</p>
<h2 id="heading-advantages-of-quorum-based-systems">Advantages of Quorum-Based Systems</h2>
<ol>
<li><p><strong>Consistency</strong>: Quorum-based systems keep distributed systems consistent by needing a certain number of nodes to agree on a decision or action before it can be considered valid.</p>
</li>
<li><p><strong>Availability</strong>: Quorum-based systems make sure that a certain number of nodes are always available to handle requests and make decisions, even if some of the nodes fail or the network is split up.</p>
</li>
<li><p><strong>Flexibility</strong>: Quorum-based systems can be set up to work with different sizes and types of quorums to meet different needs for consistency and availability.</p>
</li>
<li><p><strong>Fault tolerance</strong>: Quorum-based systems can keep working as long as a certain number of nodes are available, even if some of the nodes fail or the network breaks up.</p>
</li>
<li><p><strong>Scalability</strong>: In order to improve throughput and performance, quorum-based systems can scale horizontally by adding more nodes to the system.</p>
</li>
</ol>
<h2 id="heading-disadvantages-of-quorum-based-systems">Disadvantages of Quorum-Based Systems</h2>
<ol>
<li><p><strong>Complexity</strong>: Implementing quorum-based systems can be hard, and they need to be carefully designed and tested to make sure they work well and are always available, even if there are problems or parts of the network that don't work.</p>
</li>
<li><p><strong>Performance</strong>: Quorum-based systems can hurt performance because they need communication between nodes to reach a quorum, which can slow down throughput and increase latency.</p>
</li>
<li><p><strong>Maintenance</strong>: Quorum-based systems need to be maintained on a regular basis to make sure they stay consistent and ready. This can take a lot of time and resources.</p>
</li>
<li><p><strong>Configuration</strong>: It can be hard to set up quorum-based systems because the size and type of the quorum must be carefully chosen to meet particular requirements for consistency and availability.</p>
</li>
<li><p><strong>Security</strong>: Quorum-based systems can be attacked in ways like <a target="_blank" href="https://en.wikipedia.org/wiki/Byzantine_fault"><strong>Byzantine faults</strong></a>, which can hurt the stability of the system and the security of the data. You can read more <a target="_blank" href="https://academy.binance.com/en/articles/byzantine-fault-tolerance-explained">here</a></p>
</li>
</ol>
<h2 id="heading-example-poc-in-nodejstypescript-with-express">Example (PoC) in Node.js/TypeScript with Express</h2>
<p>This PoC implements a simple distributed system that allows clients to read and write data to the system using quorum-based systems. The system uses Express as the web framework and Axios as the HTTP client for communicating with other nodes in the system.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-comment">// Parse JSON request bodies so req.body is populated in the write route</span>
app.use(express.json());
<span class="hljs-keyword">const</span> nodes = [
  <span class="hljs-string">'http://node1.example.com'</span>,
  <span class="hljs-string">'http://node2.example.com'</span>,
  <span class="hljs-string">'http://node3.example.com'</span>,
];

<span class="hljs-keyword">const</span> readQuorumSize = <span class="hljs-number">2</span>;
<span class="hljs-keyword">const</span> writeQuorumSize = <span class="hljs-number">3</span>;

app.get(<span class="hljs-string">'/read/:key'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
  <span class="hljs-keyword">const</span> key = req.params.key;
  <span class="hljs-keyword">const</span> readNodes = nodes.slice(<span class="hljs-number">0</span>, readQuorumSize);

  <span class="hljs-keyword">const</span> values = <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.allSettled(
    readNodes.map(<span class="hljs-function">(<span class="hljs-params">node</span>) =&gt;</span> axios.get(<span class="hljs-string">`<span class="hljs-subst">${node}</span>/data/<span class="hljs-subst">${key}</span>`</span>))
  );

  <span class="hljs-comment">// Use a type predicate so TypeScript narrows the settled results</span>
  <span class="hljs-keyword">const</span> validValues = values.filter(
    (value): value <span class="hljs-keyword">is</span> PromiseFulfilledResult&lt;<span class="hljs-built_in">any</span>&gt; =&gt;
      value.status === <span class="hljs-string">'fulfilled'</span> &amp;&amp; value.value.data !== <span class="hljs-literal">undefined</span>
  );

  <span class="hljs-keyword">if</span> (validValues.length &lt; readQuorumSize) {
    <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">500</span>).send(<span class="hljs-string">'Not enough nodes available to read data'</span>);
  }

  <span class="hljs-keyword">const</span> data = validValues[<span class="hljs-number">0</span>].value.data;
  res.send(data);
});

app.post(<span class="hljs-string">'/write/:key'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
  <span class="hljs-keyword">const</span> key = req.params.key;
  <span class="hljs-keyword">const</span> value = req.body;
  <span class="hljs-keyword">const</span> writeNodes = nodes.slice(<span class="hljs-number">0</span>, writeQuorumSize);

  <span class="hljs-keyword">const</span> results = <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.allSettled(
    writeNodes.map(<span class="hljs-function">(<span class="hljs-params">node</span>) =&gt;</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${node}</span>/data/<span class="hljs-subst">${key}</span>`</span>, value))
  );

  <span class="hljs-keyword">const</span> validResults = results.filter(
    <span class="hljs-function">(<span class="hljs-params">result</span>) =&gt;</span> result.status === <span class="hljs-string">'fulfilled'</span>
  );

  <span class="hljs-keyword">if</span> (validResults.length &lt; writeQuorumSize) {
    <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">500</span>).send(<span class="hljs-string">'Not enough nodes available to write data'</span>);
  }

  res.send(<span class="hljs-string">'Data written successfully'</span>);
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server started'</span>);
});
</code></pre>
<p>The system uses a configurable quorum for both read and write operations, which ensures that a quorum of nodes agrees on the operation before it is considered valid. The read quorum size is set to 2 and the write quorum size to 3, meaning that at least 2 nodes must respond for a read operation to be valid, and at least 3 nodes must acknowledge a write for it to be valid.</p>
<p>In the read operation, the system queries a subset of nodes specified by the read quorum size, and if enough valid responses are received, the system returns the value of the data. In the write operation, the system sends the data to a subset of nodes specified by the write quorum size, and if enough valid responses are received, the system returns a success message.</p>
<p><strong>This PoC is just a basic example of a quorum-based system and can be extended and modified to meet specific requirements.</strong></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In distributed systems, quorum-based systems are an essential mechanism for maintaining consistency and availability. By requiring a quorum of nodes to agree on a particular decision or action, quorum-based systems ensure that the system remains consistent and available, even in the face of node failures or network partitions.</p>
]]></content:encoded></item><item><title><![CDATA[Modulus Sharding in Software Engineering]]></title><description><![CDATA[Modulus sharding is a method used in software engineering to spread data across various servers in a way that improves performance, scalability, and reliability. The method involves dividing data into smaller, easier-to-handle pieces and putting thos...]]></description><link>https://blog.sofwancoder.com/modulus-sharding-in-software-engineering</link><guid isPermaLink="true">https://blog.sofwancoder.com/modulus-sharding-in-software-engineering</guid><category><![CDATA[Databases]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[sharding]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sun, 23 Apr 2023 01:31:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1682213365714/bb973134-68be-45bf-974c-fd919e4bfc37.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modulus sharding is a method used in software engineering to spread data across various servers in a way that improves performance, scalability, and reliability. The method involves dividing data into smaller, easier-to-handle pieces and putting those pieces on different computers, or "shards." This article will give a detailed overview of modulus sharding, including its benefits, challenges, and best practices.</p>
<h2 id="heading-overview-of-modulus-sharding">Overview of Modulus Sharding</h2>
<p>Modulus sharding involves dividing data into shards spread across a predetermined number of servers, or "nodes." Each shard is given a unique identifier, usually an integer value, and the data is assigned to shards based on the remainder of dividing the shard ID by the number of nodes. For example, if we have three nodes and six shards, each node would be responsible for two shards, with the shards distributed as follows:</p>
<ul>
<li><p>Node 1: Shards 1 and 4</p>
</li>
<li><p>Node 2: Shards 2 and 5</p>
</li>
<li><p>Node 3: Shards 3 and 6</p>
</li>
</ul>
<p>This distribution ensures that the shards are spread evenly across the nodes, which helps to improve performance and scalability. Also, because each node is in charge of only a subset of the shards, the system as a whole is more resilient to failures: any one node can fail without affecting the availability of the whole system.</p>
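<p>The distribution above can be computed in a few lines of TypeScript (a sketch using the same illustrative shard and node counts):</p>
<pre><code class="lang-typescript">const nodeCount = 3;
const shardIds = [1, 2, 3, 4, 5, 6];

// Group shards by the node responsible for them, using the remainder of the
// shard ID divided by the node count (remainder 0 maps to the last node).
const assignment = new Map&lt;number, number[]&gt;();
for (const shardId of shardIds) {
  const node = shardId % nodeCount || nodeCount;
  assignment.set(node, [...(assignment.get(node) ?? []), shardId]);
}

for (const [node, shards] of assignment) {
  console.log(`Node ${node}: Shards ${shards.join(' and ')}`);
}
// Node 1: Shards 1 and 4
// Node 2: Shards 2 and 5
// Node 3: Shards 3 and 6
</code></pre>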
<h2 id="heading-benefits-of-modulus-sharding">Benefits of Modulus Sharding</h2>
<p>There are several benefits to using modulus sharding in software engineering:</p>
<h3 id="heading-improved-performance">Improved Performance</h3>
<p>Modulus sharding can enhance the performance of a system by dividing data across numerous nodes, so that each node manages only a smaller portion of the data. Reducing the burden placed on individual nodes in this way helps to improve overall response times.</p>
<h3 id="heading-increased-scalability">Increased Scalability</h3>
<p>Modulus sharding can also boost the scalability of a system, because more nodes can be added as needed to accommodate expanding volumes of data. And since each node is accountable for only a part of the data, additional nodes can be added without negatively affecting the performance of the nodes that are already present.</p>
<h3 id="heading-better-fault-tolerance">Better Fault Tolerance</h3>
<p>A system's fault tolerance can be improved by utilising modulus sharding because any individual node can fail without taking down the overall system. Provided the shards are also replicated, data held by a failed node can be recovered and reassembled from the nodes that are still available.</p>
<h2 id="heading-challenges-of-modulus-sharding">Challenges of Modulus Sharding</h2>
<p>While modulus sharding can provide significant benefits, there are also several challenges to consider when implementing the technique:</p>
<h3 id="heading-data-distribution">Data Distribution</h3>
<p>When dealing with enormous amounts of data, the process of distributing the data among multiple shards can become very complicated. It is essential to give careful consideration to the distribution algorithm to guarantee that the data is dispersed uniformly across all of the shards.</p>
<h3 id="heading-shard-rebalancing">Shard Rebalancing</h3>
<p>If the system continues to expand and undergoes modifications, it is possible that rebalancing the shards may become essential to guarantee that the data will be dispersed uniformly throughout the nodes. This can be a difficult process because it involves moving data between nodes without affecting the system's availability.</p>
<h3 id="heading-query-performance">Query Performance</h3>
<p>It is possible that queries that span many shards would run more slowly than queries that just require data from a single shard because data is dispersed across multiple nodes. It is essential to give careful consideration to the design of queries to reduce the negative effects of shard distribution on the performance of queries.</p>
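<p>The cost difference can be seen by contrasting a single-shard lookup with a scatter-gather query that must fan out to every shard. This sketch assumes a hypothetical per-shard <code>fetch</code> function standing in for the real network call:</p>
<pre><code class="lang-typescript">// Hypothetical per-shard lookup; in practice this is a network round trip.
type ShardQuery&lt;T&gt; = (shardIndex: number) =&gt; Promise&lt;T[]&gt;;

// Keyed query: one round trip to the single shard that owns the key.
async function queryByKey&lt;T&gt;(key: number, shardCount: number, fetch: ShardQuery&lt;T&gt;): Promise&lt;T[]&gt; {
  return fetch(key % shardCount);
}

// Cross-shard query: fan out to all shards and merge, paying shardCount
// round trips and waiting for the slowest shard to respond.
async function queryAllShards&lt;T&gt;(shardCount: number, fetch: ShardQuery&lt;T&gt;): Promise&lt;T[]&gt; {
  const perShard = await Promise.all(
    Array.from({ length: shardCount }, (_, i) =&gt; fetch(i))
  );
  return perShard.flat();
}
</code></pre>
<p>Because the scatter-gather path is only as fast as its slowest shard, query designs that keep hot lookups on a single shard pay off directly in latency.</p>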
<h2 id="heading-example-poc">Example (PoC)</h2>
<p><strong>As an example (PoC),</strong> we'll write code that sets up application-level sharding across four MongoDB databases and uses an Express server to expose a REST API for creating and retrieving user objects. The user objects are stored in the MongoDB shards and distributed across them using a simple shard key based on the user ID.</p>
<ol>
<li><p>Firstly, we'll import the required dependencies:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Import the required packages</span>
 <span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
 <span class="hljs-keyword">import</span> mongoose <span class="hljs-keyword">from</span> <span class="hljs-string">'mongoose'</span>;
</code></pre>
</li>
<li><p>Creating an instance of the Express app:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Create an instance of the Express application</span>
 <span class="hljs-keyword">const</span> app = express();
</code></pre>
</li>
<li><p>Defining an interface/schema for the User object:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Define the User interface that extends the mongoose.Document interface</span>
 <span class="hljs-keyword">interface</span> User <span class="hljs-keyword">extends</span> mongoose.Document {
   id: <span class="hljs-built_in">number</span>;
   name: <span class="hljs-built_in">string</span>;
 }

 <span class="hljs-comment">// Define the schema</span>
 <span class="hljs-keyword">const</span> userSchema = <span class="hljs-keyword">new</span> mongoose.Schema({
   id: <span class="hljs-built_in">Number</span>,
   name: <span class="hljs-built_in">String</span>,
 });
</code></pre>
</li>
<li><p>Initializing an empty array to store connections to each shard:</p>
<pre><code class="lang-typescript"> <span class="hljs-keyword">const</span> shards: mongoose.Connection[] = [];
</code></pre>
</li>
<li><p>Defining a function to connect to a shard:</p>
<pre><code class="lang-typescript"> <span class="hljs-keyword">const</span> connectToShard = <span class="hljs-keyword">async</span> (shardNumber: <span class="hljs-built_in">number</span>) =&gt; {
   <span class="hljs-keyword">const</span> shard = mongoose.createConnection(<span class="hljs-string">`mongodb://localhost/users<span class="hljs-subst">${shardNumber}</span>`</span>);
   <span class="hljs-keyword">await</span> shard.once(<span class="hljs-string">'open'</span>, <span class="hljs-function">() =&gt;</span> {
     <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Connected to shard <span class="hljs-subst">${shardNumber}</span>`</span>);
   });
   shards.push(shard);
 };
</code></pre>
</li>
<li><p>Connecting to each shard:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Connect to all four MongoDB shards</span>
 <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
   connectToShard(i);
 }
</code></pre>
</li>
<li><p>Defining a function to get the shard for a user:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Define a function to get the MongoDB shard for a given user ID</span>
 <span class="hljs-keyword">const</span> getUserShard = (userId: <span class="hljs-built_in">number</span>): mongoose.Connection =&gt; {
   <span class="hljs-keyword">const</span> shardIndex = userId % shards.length;
   <span class="hljs-keyword">return</span> shards[shardIndex];
 };
</code></pre>
</li>
<li><p>Defining a function to get a UserModel for a specific shard:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Define a function to get the UserModel for a given shard</span>
 <span class="hljs-keyword">const</span> UserModel = (shardIndex: <span class="hljs-built_in">number</span>): mongoose.Model&lt;User&gt; =&gt; {
   <span class="hljs-keyword">const</span> shard = shards[shardIndex];
   <span class="hljs-keyword">return</span> shard.model&lt;User&gt;(<span class="hljs-string">'User'</span>, userSchema);
 };
</code></pre>
</li>
<li><p>Defining a route to get a user by ID:</p>
<pre><code class="lang-typescript"> <span class="hljs-comment">// Define a route to get a user by ID</span>
 app.get(<span class="hljs-string">'/users/:id'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
   <span class="hljs-keyword">const</span> userId = <span class="hljs-built_in">parseInt</span>(req.params.id);
   <span class="hljs-comment">// Get the MongoDB shard for the user ID</span>
   <span class="hljs-keyword">const</span> shardIndex = userId % shards.length;
   <span class="hljs-comment">// Get the UserModel for the shard</span>
   <span class="hljs-keyword">const</span> UserModelForShard = UserModel(shardIndex);
   <span class="hljs-keyword">const</span> user = <span class="hljs-keyword">await</span> UserModelForShard.findOne({ id: userId });
   <span class="hljs-keyword">if</span> (!user) {
     res.status(<span class="hljs-number">404</span>).send(<span class="hljs-string">'User not found'</span>);
     <span class="hljs-keyword">return</span>;
   }
   res.send(user);
 });
</code></pre>
<ul>
<li><p>The logic behind <code>const shardIndex = userId % shards.length;</code> is to determine which shard a given user's data should be stored on based on their user ID.</p>
</li>
<li><p>In this case, the <code>userId</code> is used to calculate a modulus value (<code>%</code>) based on the length of the <code>shards</code> array. The modulus operation returns the remainder of dividing the <code>userId</code> by the <code>shards.length</code>.</p>
</li>
<li><p>The resulting modulus value is then used as an index to access the corresponding shard in the <code>shards</code> array. This ensures that each user's data is stored on a specific shard based on their user ID, while also distributing the data evenly across all available shards for horizontal scaling.</p>
</li>
</ul>
</li>
<li><p>Defining a route to create a new user:</p>
<pre><code class="lang-typescript">app.post(<span class="hljs-string">'/users'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
  <span class="hljs-keyword">const</span> user: User = req.body;
  <span class="hljs-comment">// Get the MongoDB shard for the user ID</span>
  <span class="hljs-keyword">const</span> shardIndex = user.id % shards.length;
  <span class="hljs-comment">// Get the UserModel for the shard</span>
  <span class="hljs-keyword">const</span> UserModelForShard = UserModel(shardIndex);
  <span class="hljs-keyword">const</span> createdUser = <span class="hljs-keyword">await</span> UserModelForShard.create(user);
  res.send(createdUser);
});
</code></pre>
</li>
<li><p>Starting the Express server:</p>
<pre><code class="lang-typescript">app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server running on port 3000'</span>);
});
</code></pre>
</li>
</ol>
<h3 id="heading-putting-it-all-together"><strong>Putting it all together!</strong></h3>
<p>The code demonstrates how to connect to a MongoDB shard, how to create a user schema and model, and how to query and create user objects using the appropriate shard.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> mongoose <span class="hljs-keyword">from</span> <span class="hljs-string">'mongoose'</span>;

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-comment">// Parse JSON request bodies so req.body is populated</span>
app.use(express.json());

<span class="hljs-keyword">interface</span> User <span class="hljs-keyword">extends</span> mongoose.Document {
  id: <span class="hljs-built_in">number</span>;
  name: <span class="hljs-built_in">string</span>;
}

<span class="hljs-keyword">const</span> userSchema = <span class="hljs-keyword">new</span> mongoose.Schema({
  id: <span class="hljs-built_in">Number</span>,
  name: <span class="hljs-built_in">String</span>,
});

<span class="hljs-keyword">const</span> shards: mongoose.Connection[] = [];

<span class="hljs-comment">// Define a function to connect to a MongoDB shard</span>
<span class="hljs-keyword">const</span> connectToShard = <span class="hljs-keyword">async</span> (shardNumber: <span class="hljs-built_in">number</span>) =&gt; {
  <span class="hljs-keyword">const</span> shard = mongoose.createConnection(<span class="hljs-string">`mongodb://localhost/users<span class="hljs-subst">${shardNumber}</span>`</span>);
  <span class="hljs-comment">// `once` only registers a listener, so there is nothing to await here</span>
  shard.once(<span class="hljs-string">'open'</span>, <span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Connected to shard <span class="hljs-subst">${shardNumber}</span>`</span>);
  });
  shards.push(shard);
};

<span class="hljs-comment">// Connect to all four MongoDB shards</span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
  connectToShard(i);
}

<span class="hljs-comment">// Define a function to get the MongoDB shard for a given user ID</span>
<span class="hljs-keyword">const</span> getUserShard = (userId: <span class="hljs-built_in">number</span>): mongoose.Connection =&gt; {
  <span class="hljs-keyword">const</span> shardIndex = userId % shards.length;
  <span class="hljs-keyword">return</span> shards[shardIndex];
};

<span class="hljs-comment">// Define a function to get the UserModel for a given shard</span>
<span class="hljs-keyword">const</span> UserModel = (shardIndex: <span class="hljs-built_in">number</span>): mongoose.Model&lt;User&gt; =&gt; {
  <span class="hljs-keyword">const</span> shard = shards[shardIndex];
  <span class="hljs-keyword">return</span> shard.model&lt;User&gt;(<span class="hljs-string">'User'</span>, userSchema);
};

<span class="hljs-comment">// Define a route to get a user by ID</span>
app.get(<span class="hljs-string">'/users/:id'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
  <span class="hljs-keyword">const</span> userId = <span class="hljs-built_in">parseInt</span>(req.params.id);
  <span class="hljs-comment">// Get the MongoDB shard for the user ID</span>
  <span class="hljs-keyword">const</span> shardIndex = userId % shards.length;
  <span class="hljs-comment">// Get the UserModel for the shard</span>
  <span class="hljs-keyword">const</span> UserModelForShard = UserModel(shardIndex);
  <span class="hljs-keyword">const</span> user = <span class="hljs-keyword">await</span> UserModelForShard.findOne({ id: userId });
  <span class="hljs-keyword">if</span> (!user) {
    res.status(<span class="hljs-number">404</span>).send(<span class="hljs-string">'User not found'</span>);
    <span class="hljs-keyword">return</span>;
  }
  res.send(user);
});

app.post(<span class="hljs-string">'/users'</span>, <span class="hljs-keyword">async</span> (req: Request, res: Response) =&gt; {
  <span class="hljs-keyword">const</span> user: User = req.body;
  <span class="hljs-comment">// Get the MongoDB shard for the user ID</span>
  <span class="hljs-keyword">const</span> shardIndex = user.id % shards.length;
  <span class="hljs-comment">// Get the UserModel for the shard</span>
  <span class="hljs-keyword">const</span> UserModelForShard = UserModel(shardIndex);
  <span class="hljs-keyword">const</span> createdUser = <span class="hljs-keyword">await</span> UserModelForShard.create(user);
  res.send(createdUser);
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server running on port 3000'</span>);
});
</code></pre>
<h2 id="heading-best-practices-for-modulus-sharding">Best Practices for Modulus Sharding</h2>
<p>To ensure that modulus sharding is implemented effectively, there are several best practices that software engineers should follow:</p>
<h3 id="heading-plan-for-growth">Plan for Growth</h3>
<p>Modulus sharding should be designed to handle future growth in both the amount of data and the number of nodes. It's important to think carefully about the distribution algorithm and the way shards are rebalanced to make sure they can scale well.</p>
<h3 id="heading-monitor-performance">Monitor Performance</h3>
<p>It is important to keep an eye on important metrics like query response times, shard distribution, and node health to make sure the system is working well. This can help figure out if there are any problems or speed bottlenecks with certain nodes or shards.</p>
<h3 id="heading-consider-data-access-patterns">Consider Data Access Patterns</h3>
<p>When designing the sharding plan, it's important to think about how the data is accessed. If certain pieces of data are frequently accessed together, it may be best to keep them on the same shard to speed up queries.</p>
<h3 id="heading-use-consistent-hashing">Use Consistent Hashing</h3>
<p>Consistent hashing is a way to spread data across nodes so that shards don't have to be rebalanced as often. This can help make the system more scalable and lessen the effect of adding or taking away nodes.</p>
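<p>As an illustrative sketch (production implementations also place several "virtual nodes" per server on the ring to even out the distribution), a minimal hash ring might look like this:</p>
<pre><code class="lang-typescript">import { createHash } from 'crypto';

// Place a node or key on the ring by hashing it to a 32-bit position.
function ringPosition(value: string): number {
  return createHash('sha256').update(value).digest().readUInt32BE(0);
}

class HashRing {
  // Sorted [position, node] pairs around the ring.
  private ring: Array&lt;[number, string]&gt; = [];

  addNode(node: string): void {
    this.ring.push([ringPosition(node), node]);
    this.ring.sort((a, b) =&gt; a[0] - b[0]);
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter(([, n]) =&gt; n !== node);
  }

  // A key belongs to the first node at or after its position, wrapping around.
  nodeFor(key: string): string {
    const pos = ringPosition(key);
    const match = this.ring.find(([p]) =&gt; p &gt;= pos) ?? this.ring[0];
    return match[1];
  }
}

const ring = new HashRing();
['node-a', 'node-b', 'node-c'].forEach((n) =&gt; ring.addNode(n));
const owner = ring.nodeFor('user:42');
ring.addNode('node-d');
// After adding a node, each key either keeps its owner or moves to the new
// node; there is no cluster-wide reshuffle as with plain modulus placement.
console.log(owner, ring.nodeFor('user:42'));
</code></pre>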
<h3 id="heading-implement-replication">Implement Replication</h3>
<p>It is important to set up data replication across various nodes to improve fault tolerance. This can help make sure that data is still accessible if a node fails.</p>
<h3 id="heading-test-and-validate">Test and Validate</h3>
<p>Before putting a sharded system into production, it's important to test and validate it fully. This can help find problems or performance bottlenecks before they affect end users.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Modulus sharding is a powerful method that can improve the performance, scalability, and fault tolerance of software systems. But it's important to think carefully about the distribution algorithm, the way shards are rebalanced, and related concerns to make sure the implementation works. By following best practices and keeping an eye on performance, software engineers can use modulus sharding to help their systems grow and work well.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Scalability: Beyond Speed]]></title><description><![CDATA[When discussing the planning and development of software, it is common practice to use the terms "speed" and "scalability" interchangeably. However, these ideas do not refer to the same thing, and it is critical to have a solid understanding of the d...]]></description><link>https://blog.sofwancoder.com/understanding-scalability-beyond-speed</link><guid isPermaLink="true">https://blog.sofwancoder.com/understanding-scalability-beyond-speed</guid><category><![CDATA[scalability]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sat, 15 Apr 2023 23:48:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1681600751188/39371cc1-3c1c-4c8a-8e09-4a20747ecddb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When discussing the planning and development of software, it is common practice to use the terms "speed" and "scalability" interchangeably. However, these ideas do not refer to the same thing, and it is critical to have a solid understanding of the distinctions between them to develop software that is both successful and efficient.</p>
<h2 id="heading-speed">Speed</h2>
<p>Speed refers to the ability of a software system to perform a specific task quickly. For instance, a search engine can be regarded as having a high level of performance if it can provide pertinent results in a matter of milliseconds. Speed plays a crucial role in user interaction with a software system, as it directly impacts their experience and satisfaction. Users have a widespread expectation that software will be quick and responsive; any delays or lags in performance will likely result in irritation and unhappiness on their part.</p>
<h2 id="heading-scalability">Scalability</h2>
<p>Scalability, on the other hand, is the capacity of a software system to manage an expanding volume of work or traffic, such as a growing number of users. Scalability is essential because software systems are frequently intended to expand and transform throughout the course of their lifetimes; as a result, these systems must be able to manage rising demand without crashing or becoming inoperable.</p>
<h2 id="heading-understanding-the-difference">Understanding the difference</h2>
<p>When discussing software systems, it is essential to have an understanding that scalability and speed are two separate and independent ideas. Either <strong>a system can be quick while lacking the ability to scale</strong>, or <strong>it can be scalable while lacking the ability to be quick</strong>. Examining a web application that enables users to upload and share images is a great way to demonstrate this idea, so let's get started.</p>
<p>Imagine that <strong>this app was originally built to support only a small number of users and photos</strong>. In that case, the application might be very good at quickly processing and displaying the images. But if the user base and photo library grow significantly, the system may not keep up with demand: performance degrades, leading to slow response times or even a total system failure. Even though the application is very fast under smaller workloads, it is not scalable. This example illustrates why it matters to distinguish between these two characteristics when assessing the performance of software systems.</p>
<p><strong>Consider the opposite case: a software system that is intended to manage a high volume of traffic but is not optimised for speed.</strong> Such a system might accommodate a huge number of requests and users, yet take a long time to process each request. In this scenario, the system is scalable, but individual users will find it slow.</p>
<h2 id="heading-speed-vs-scalability-in-web-applications">Speed vs Scalability in Web Applications</h2>
<p>Let's consider a web application that allows users to upload and share photos. To make things simple, let's assume that each photo is stored as a file on disk and that the web application simply serves the file to the user when requested.</p>
<p>Here is some code that reads a photo file from the disk and returns it to the user:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> path <span class="hljs-keyword">from</span> <span class="hljs-string">'path'</span>;

<span class="hljs-keyword">const</span> app = express();

app.get(<span class="hljs-string">'/photo/:filename'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> photoPath = path.join(__dirname, <span class="hljs-string">'photos'</span>, req.params.filename);
  res.sendFile(photoPath);
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server is listening on port 3000'</span>);
});
</code></pre>
<p>This code is fast because it simply reads the file from the disk and returns it to the user. However, if the number of users and photos grows significantly, the system may become overwhelmed and slow down or even crash. In this case, the system may not be scalable, even if it is fast for small loads.</p>
<p>To make the system more scalable, we could add an HTTP caching layer so that clients and intermediate caches can reuse responses instead of re-requesting the same file. Here is some TypeScript code that sets caching headers using the express-cache-controller middleware:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> path <span class="hljs-keyword">from</span> <span class="hljs-string">'path'</span>;
<span class="hljs-keyword">import</span> cacheController <span class="hljs-keyword">from</span> <span class="hljs-string">'express-cache-controller'</span>;

<span class="hljs-keyword">const</span> app = express();

app.use(cacheController({ maxAge: <span class="hljs-number">60</span> }));

app.get(<span class="hljs-string">'/photo/:filename'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> photoPath = path.join(__dirname, <span class="hljs-string">'photos'</span>, req.params.filename);
  res.sendFile(photoPath);
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server is listening on port 3000'</span>);
});
</code></pre>
<p>In this code, we've added the express-cache-controller middleware with a <code>maxAge</code> of 60 seconds. The middleware adds a <code>Cache-Control</code> header to the response that tells the client (and any intermediate caches, such as a CDN) how long it may reuse the response. This caching layer can help to improve scalability by reducing repeat file reads and network requests for frequently accessed photos.</p>
<h2 id="heading-speed-vs-scalability-in-database-systems">Speed vs Scalability in Database Systems</h2>
<p>Let's consider a database system that stores information about users and their purchases. To make things simple, let's assume that we have a single table called <code>users</code> with the following schema:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  email <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  address <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  city <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  state <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  zip <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">10</span>)
);
</code></pre>
<p>Here is some TypeScript code that retrieves a user's information from the database using the mysql2 library:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> mysql <span class="hljs-keyword">from</span> <span class="hljs-string">'mysql2/promise'</span>;

<span class="hljs-keyword">const</span> pool = mysql.createPool({
  host: <span class="hljs-string">'localhost'</span>,
  user: <span class="hljs-string">'root'</span>,
  password: <span class="hljs-string">'password'</span>,
  database: <span class="hljs-string">'database'</span>,
});

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getUserInfo</span>(<span class="hljs-params">userId: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">const</span> connection = <span class="hljs-keyword">await</span> pool.getConnection();
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> [rows] = <span class="hljs-keyword">await</span> connection.query(
      <span class="hljs-string">'SELECT name, email, address, city, state, zip FROM users WHERE id = ?'</span>,
      [userId],
    );
    <span class="hljs-keyword">const</span> row = rows[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">if</span> (!row) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>; <span class="hljs-comment">// no such user</span>
    <span class="hljs-keyword">return</span> {
      name: row.name,
      email: row.email,
      address: row.address,
      city: row.city,
      state: row.state,
      zip: row.zip,
    };
  } <span class="hljs-keyword">finally</span> {
    connection.release();
  }
}
</code></pre>
<p>This code is fast because it simply executes a single SQL query to retrieve the user's information from the database. However, if the number of users and purchases grows significantly, the system may become overwhelmed and slow down or even crash. In this case, the system may not be scalable, even if it is fast for small loads.</p>
<p>To make the system more scalable, we could introduce a caching layer that stores frequently accessed user information in memory. Here is some TypeScript code that implements this caching layer using the node-cache library:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> mysql <span class="hljs-keyword">from</span> <span class="hljs-string">'mysql2/promise'</span>;
<span class="hljs-keyword">import</span> NodeCache <span class="hljs-keyword">from</span> <span class="hljs-string">'node-cache'</span>;

<span class="hljs-keyword">const</span> pool = mysql.createPool({
  host: <span class="hljs-string">'localhost'</span>,
  user: <span class="hljs-string">'root'</span>,
  password: <span class="hljs-string">'password'</span>,
  database: <span class="hljs-string">'database'</span>,
});

<span class="hljs-keyword">const</span> userCache = <span class="hljs-keyword">new</span> NodeCache({ stdTTL: <span class="hljs-number">60</span>, checkperiod: <span class="hljs-number">120</span> });

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getUserInfo</span>(<span class="hljs-params">userId: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">let</span> userInfo = userCache.get(userId);
  <span class="hljs-keyword">if</span> (!userInfo) {
    <span class="hljs-keyword">const</span> connection = <span class="hljs-keyword">await</span> pool.getConnection();
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> [rows] = <span class="hljs-keyword">await</span> connection.query(
        <span class="hljs-string">'SELECT name, email, address, city, state, zip FROM users WHERE id = ?'</span>,
        [userId],
      );
      <span class="hljs-keyword">const</span> row = rows[<span class="hljs-number">0</span>];
      <span class="hljs-keyword">if</span> (!row) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>; <span class="hljs-comment">// no such user</span>
      userInfo = {
        name: row.name,
        email: row.email,
        address: row.address,
        city: row.city,
        state: row.state,
        zip: row.zip,
      };
      userCache.set(userId, userInfo);
    } <span class="hljs-keyword">finally</span> {
      connection.release();
    }
  }
  <span class="hljs-keyword">return</span> userInfo;
}
</code></pre>
<p>In this code, we've added a caching layer using the node-cache library. This caching layer stores user information in memory for a specified period (60 seconds in this example). If the requested user information is in the cache, we return it immediately without querying the database. Otherwise, we query the database and store the result in the cache before returning it to the client. This caching layer can help to improve scalability by reducing the number of database queries and network requests.</p>
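<p>To see the pattern in isolation, here is a minimal sketch (with hypothetical names, not part of the article's stack) of the same cache-aside idea the node-cache example uses: check the cache first, fall back to a loader on a miss, and expire entries after a TTL. The injectable <code>now</code> parameter exists only so expiry can be exercised without waiting.</p>

```typescript
// Minimal TTL cache sketch illustrating the cache-aside pattern.
// The `now` argument is injectable so expiry is testable without sleeping.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.store.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Cache-aside: consult the cache, fall back to the loader on a miss.
async function cached<V>(
  cache: TtlCache<V>,
  key: string,
  loader: () => Promise<V>,
): Promise<V> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await loader();
  cache.set(key, value);
  return value;
}
```

<p>The MySQL example above is exactly this shape, with the database query playing the role of the loader.</p>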
<h2 id="heading-real-world-examples">Real-World Examples</h2>
<p>To illustrate this concept further, let's look at some real-world examples.</p>
<h3 id="heading-example-1-social-media-platforms">Example 1: Social Media Platforms</h3>
<p>Social media platforms such as Facebook, Twitter, and Instagram are examples of software systems that need to be both fast and scalable. These platforms need to be fast to provide a good user experience, and they need to be scalable to handle the large number of users and data that they generate.</p>
<p>For example, Facebook has over 2.8 billion monthly active users, and it needs to be able to handle a huge amount of traffic and data. To achieve this, Facebook uses a variety of techniques to improve scalability, including distributed systems, caching, load balancing, and sharding. These techniques allow Facebook to handle a huge amount of data and traffic, but they also introduce some latency in the system. In other words, Facebook may not always be the fastest platform, but it is designed to be highly scalable.</p>
<h3 id="heading-example-2-e-commerce-platforms">Example 2: E-commerce Platforms</h3>
<p>E-commerce platforms such as Amazon and eBay also need to be both fast and scalable. These platforms need to be fast to provide a good user experience, and they need to be scalable to handle the large number of products and transactions that they generate.</p>
<p>For example, Amazon is one of the largest e-commerce platforms in the world, and it needs to be able to handle a huge amount of traffic and data. To achieve this, Amazon uses a variety of techniques to improve scalability, including distributed systems, caching, load balancing, and partitioning. These techniques allow Amazon to handle a huge amount of data and traffic, but they also introduce some latency in the system. In other words, Amazon may not always be the fastest platform, but it is designed to be highly scalable.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, speed and scalability are two important but distinct concepts in software design and development. A system may be fast for small loads but may not be scalable when the load increases significantly. To make a system more scalable, we may need to introduce additional layers such as caching, load balancing, or sharding. These layers may introduce additional complexity and overhead, but they can help to improve scalability and ensure that the system can handle increased loads in the future.</p>
]]></content:encoded></item><item><title><![CDATA[Real-Time Messaging Protocol (RTMP/s)]]></title><description><![CDATA[RTMP is a widely used protocol for streaming audio, video, and data over the Internet in real time. It is particularly well-suited for applications that require low latency, such as live streaming events, online gaming, and video conferencing. RTMPS ...]]></description><link>https://blog.sofwancoder.com/real-time-messaging-protocol-rtmps</link><guid isPermaLink="true">https://blog.sofwancoder.com/real-time-messaging-protocol-rtmps</guid><category><![CDATA[messaging]]></category><category><![CDATA[protocols]]></category><category><![CDATA[data]]></category><category><![CDATA[backend]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sun, 26 Mar 2023 22:10:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1679868471334/fc7d8aba-3c42-4993-8f3e-054171a3227e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>RTMP is a widely used protocol for streaming audio, video, and data over the Internet in real time. It is particularly well-suited for applications that require low latency, such as live streaming events, online gaming, and video conferencing. RTMPS is a secure version of the protocol that adds an additional layer of security by encrypting the data being transmitted between the client and the server.</p>
<p>In this article, we will discuss RTMP in detail, its architecture, the advantages and disadvantages of RTMP, and its use cases.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Real-time Messaging Protocol (RTMP) is a streaming protocol that is designed to deliver video, audio, and other types of data in real time over the internet. Developed by Macromedia, it was first released in 2002 and is now owned by Adobe Systems. RTMP is widely used for live streaming and video-on-demand (VOD) applications.</p>
<h2 id="heading-what-is-rtmp">What is RTMP?</h2>
<p>RTMP (Real-Time Messaging Protocol) is a streaming protocol designed to deliver audio, video, and other types of data in real time over the internet. It supports low-latency streaming, high-quality audio and video, adaptive bitrate streaming, and encryption for the secure transmission of data, and it is commonly used for live streaming, video-on-demand, webinars, and gaming applications.</p>
<p>RTMP is a client-server protocol, which means that the client (usually a Flash player or a web browser) establishes a connection with the server, and the server sends the data to the client over the established connection. The client can then display the data (e.g., video or audio) in real time as it is received.</p>
<h2 id="heading-why-rtmp">Why RTMP?</h2>
<p>RTMP was originally developed to support the real-time streaming of video and audio data between a Flash player and a server, and it is still widely used today for this purpose. However, the protocol has also been adopted by many other streaming platforms and applications, including YouTube, Facebook, and Twitch.</p>
<p>Before the development of RTMP, traditional HTTP-based streaming protocols had significant latency, which made them unsuitable for live-streaming applications. RTMP solved this problem by enabling low-latency streaming, which made it possible to deliver real-time content to viewers reliably and efficiently. Today, RTMP remains a popular choice for live streaming and video-on-demand applications and is widely used by broadcasters, content providers, and businesses.</p>
<h2 id="heading-rtmps">RTMPs</h2>
<p>In addition to RTMP, there is also a secure variant of the protocol called RTMPS (RTMP over a Secure Sockets Layer). RTMPS uses the same underlying protocol as RTMP but adds a layer of security by encrypting the data transmitted between the client and the server using SSL/TLS. This helps protect against man-in-the-middle attacks and other forms of data tampering.</p>
<h2 id="heading-rtmp-architecture">RTMP Architecture</h2>
<p>RTMP uses a client-server architecture where the client sends a request to the server to establish a connection. Once the connection is established, the server sends the requested data to the client. The client can also send data to the server, such as commands and user input.</p>
<p>The RTMP protocol consists of several components:</p>
<ol>
<li><p><strong>RTMP Client:</strong> This is the application that sends the request to the RTMP server to establish a connection and receive the streaming data. The RTMP client can be a web browser, a desktop application, a mobile app, or any other device that can connect to the internet and receive streaming data.</p>
</li>
<li><p><strong>RTMP Server:</strong> This is the server that receives the request from the RTMP client and sends the streaming data back to the client. The RTMP server can be a dedicated server or a cloud-based server, and it is responsible for managing the connection, handling the streaming data, and delivering it to the client.</p>
</li>
<li><p><strong>RTMP Protocol:</strong> This is the communication protocol used between the RTMP client and the RTMP server. It defines the format of the streaming data, how it is transmitted over the internet, and how it is processed by the client and server. RTMP runs over TCP (Transmission Control Protocol), typically on port 1935, and it supports various codecs for audio and video compression.</p>
</li>
</ol>
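<p>To make the client-server exchange concrete, here is a hedged sketch of the first bytes an RTMP client sends during the handshake (C0 and C1), as described in Adobe's RTMP specification. Actually opening the TCP connection on port 1935 and reading S0/S1/S2 back is omitted, and the function names are illustrative:</p>

```typescript
import { randomFillSync } from 'crypto';

const RTMP_VERSION = 3;      // plain (unencrypted) RTMP
const HANDSHAKE_SIZE = 1536; // C1/S1/C2/S2 are all 1536 bytes

function buildC0(): Buffer {
  // C0 is a single byte carrying the protocol version.
  return Buffer.from([RTMP_VERSION]);
}

function buildC1(timestamp = 0): Buffer {
  // C1: 4-byte timestamp, 4 zero bytes, then 1528 bytes of random filler.
  const c1 = Buffer.alloc(HANDSHAKE_SIZE);
  c1.writeUInt32BE(timestamp >>> 0, 0);
  // bytes 4..7 are left as zero per the specification
  randomFillSync(c1, 8);
  return c1;
}
```

<p>After the client sends C0 and C1, the server answers with S0, S1, and S2, the client echoes back C2, and only then does chunked audio/video data begin to flow.</p>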
<h2 id="heading-advantages">Advantages</h2>
<ol>
<li><p><strong>Low Latency</strong>: RTMP has very low latency, which makes it ideal for live-streaming applications. Latency is the delay between the time a video is captured and the time it is displayed on the screen. Low latency means that viewers can watch live events in real time without any noticeable delay.</p>
</li>
<li><p><strong>High Quality</strong>: RTMP supports high-quality video and audio streams, which makes it ideal for broadcasting high-quality content.</p>
</li>
<li><p><strong>Adaptive Bitrate</strong>: RTMP supports adaptive bitrate streaming, which means that the video quality can be adjusted based on the user's internet connection speed. This ensures that users with slow internet connections can still watch the content without buffering.</p>
</li>
<li><p><strong>Security</strong>: RTMP supports encryption, which makes it secure for transmitting sensitive data.</p>
</li>
<li><p><strong>Cross-platform Support</strong>: RTMP is supported by most major operating systems, including Windows, Mac, and Linux.</p>
</li>
</ol>
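<p>The adaptive bitrate idea above can be sketched as a small selection function: given the renditions a server offers and an estimate of the viewer's bandwidth, pick the highest-bitrate rendition that still fits. This is an illustrative toy, not part of any RTMP API; real players add buffering heuristics, hysteresis, and bandwidth smoothing on top:</p>

```typescript
// Toy client-side rendition picker for adaptive bitrate streaming.
interface Rendition {
  label: string;
  bitrateKbps: number;
}

function pickRendition(renditions: Rendition[], bandwidthKbps: number): Rendition {
  if (renditions.length === 0) throw new Error('no renditions offered');
  // Sort ascending so we can walk up to the best rendition that fits.
  const sorted = [...renditions].sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let choice = sorted[0]; // fall back to the lowest quality
  for (const r of sorted) {
    if (r.bitrateKbps <= bandwidthKbps) choice = r;
  }
  return choice;
}
```

<p>A viewer on a slow connection gets the lowest rendition rather than a stalled stream, which is exactly the buffering-avoidance property described above.</p>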
<p>In addition to its low latency and high-bandwidth capabilities, RTMP also has several other features that make it a popular choice for streaming applications. These include:</p>
<ul>
<li><p><strong>Protocol multiplexing:</strong> RTMP can multiplex multiple streams of data over a single connection, which allows for efficient use of network resources.</p>
</li>
<li><p><strong>Stream control:</strong> RTMP provides several controls that can be used to adjust the quality and resolution of a stream in real time, based on the available bandwidth and other factors.</p>
</li>
<li><p><strong>Encryption:</strong> RTMP supports encryption of the data being transmitted between the client and the server, which helps to protect against man-in-the-middle attacks and other forms of data tampering.</p>
</li>
<li><p><strong>Metadata:</strong> RTMP supports the inclusion of metadata with a stream, which can be used to provide information about the stream, such as the title, description, and other metadata.</p>
</li>
</ul>
<h2 id="heading-example">Example</h2>
<p>Here is an example of how you might implement RTMP streaming in Node.js using TypeScript and the Express web framework:</p>
<p>First, install the required dependency (you will also need <code>ffmpeg</code> available on your system to publish a stream):</p>
<pre><code class="lang-bash">npm install node-media-server
</code></pre>
<p>Next, create a new TypeScript file and import the dependencies:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> NodeMediaServer <span class="hljs-keyword">from</span> <span class="hljs-string">'node-media-server'</span>;
</code></pre>
<p>Then, configure and start the NodeMediaServer:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">const</span> config = {
  rtmp: {
    port: <span class="hljs-number">1935</span>,
    chunk_size: <span class="hljs-number">60000</span>,
    gop_cache: <span class="hljs-literal">true</span>,
    ping: <span class="hljs-number">60</span>,
    ping_timeout: <span class="hljs-number">30</span>
  },
  http: {
    port: <span class="hljs-number">8000</span>,
    allow_origin: <span class="hljs-string">'*'</span>
  }
};

<span class="hljs-keyword">const</span> nms = <span class="hljs-keyword">new</span> NodeMediaServer(config);
nms.run();
</code></pre>
<p>Next, publish a stream using <code>ffmpeg</code>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># video file with H.264 video and AAC audio:</span>
ffmpeg -re -i INPUT_FILE_NAME -c copy -f flv rtmp://localhost/live/STREAM_NAME

<span class="hljs-comment"># video file that is encoded in other audio/video format</span>
ffmpeg -re -i INPUT_FILE_NAME -c:v libx264 -preset veryfast -tune zerolatency -c:a aac -ar 44100 -f flv rtmp://localhost/live/STREAM_NAME
</code></pre>
<p>Finally, access the stream over RTMP or HTTP-FLV:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># RTMP</span>
rtmp://localhost/live/STREAM_NAME
<span class="hljs-comment"># http-flv</span>
http://localhost:8000/live/STREAM_NAME.flv
</code></pre>
<p>More context on the <a target="_blank" href="https://github.com/illuspas/Node-Media-Server">NodeMediaServer Package here</a>.</p>
<h2 id="heading-use-cases-for-rtmp">Use Cases for RTMP</h2>
<ol>
<li><p><strong>Live Streaming:</strong> RTMP is commonly used for live-streaming applications, such as sports events, concerts, and news broadcasts.</p>
</li>
<li><p><strong>Video-on-Demand:</strong> RTMP is also used for video-on-demand applications, where users can watch pre-recorded videos.</p>
</li>
<li><p><strong>Webinars:</strong> RTMP is commonly used for webinars, where presenters can stream live video and interact with the audience in real time.</p>
</li>
<li><p><strong>Gaming:</strong> RTMP is used for streaming gaming content, such as live gameplay and tournaments.</p>
</li>
</ol>
<p>In recent years, there has been a shift away from RTMP towards other streaming protocols, such as HTTP Live Streaming (HLS) and Dynamic Adaptive Streaming over HTTP (DASH). Despite these shifts towards other protocols, RTMP is still widely used in the streaming industry and is likely to remain a popular choice for many applications in the coming years.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>RTMP is widely used in the streaming industry because it is a reliable and efficient protocol for streaming high-quality video and audio over the Internet. It is particularly well-suited for applications that require low latency, such as live streaming events, online gaming, and video conferencing.</p>
]]></content:encoded></item><item><title><![CDATA[Internet Control Message Protocol (ICMP)]]></title><description><![CDATA[ICMP (Internet Control Message Protocol) is an Internet Standard protocol used for network health and control, error reporting, network diagnostics, and monitoring. It allows network devices to request information from each other, find the source of ...]]></description><link>https://blog.sofwancoder.com/internet-control-message-protocol-icmp</link><guid isPermaLink="true">https://blog.sofwancoder.com/internet-control-message-protocol-icmp</guid><category><![CDATA[networking]]></category><category><![CDATA[internet]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Computer Science]]></category><category><![CDATA[backend developments]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sun, 19 Mar 2023 14:47:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1679236959487/e1da736d-ccef-4ae2-92f1-0d4603605295.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>ICMP (Internet Control Message Protocol) is an Internet Standard protocol used for network health and control, error reporting, network diagnostics, and monitoring. It allows network devices to request information from each other, find the source of a network problem, and monitor the health of a network. ICMP runs on top of the Internet Protocol (IP).</p>
<h2 id="heading-what-is-icmp">What is ICMP?</h2>
<p>ICMP is an abbreviation for Internet Control Message Protocol. It is used to monitor and manage networks and to report errors, and it is one of the Internet's fundamental protocols. ICMP is specified as part of the TCP/IP stack and is therefore independent of any particular application. Every server, router, and switch on the Internet uses ICMP. The protocol is used to diagnose and monitor networks, as well as to report and handle errors, and it works in tandem with other protocols such as TCP, UDP, and IP.</p>
<h2 id="heading-purpose-of-icmp">Purpose of ICMP</h2>
<p>Network administrators and IT support staff rely heavily on ICMP for reporting errors and troubleshooting connections. If an error occurs while sending or receiving an IP packet, a network device can use ICMP to notify the other end of the connection. Besides determining whether two devices are communicating, ICMP messages can be used to identify and resolve connectivity problems. On top of that, network managers use ICMP to track down the source of slowdowns or malfunctions in their networks.</p>
<h2 id="heading-importance-of-icmp">Importance of ICMP</h2>
<p>Within the realm of computer networking, ICMP is responsible for several essential tasks. The following are some of its most important roles:</p>
<h3 id="heading-error-reporting">Error Reporting</h3>
<p>ICMP is the protocol used to report errors and problems that occur while IP packets are being transmitted. If a packet is lost, or if a router encounters an error while processing a packet, ICMP messages can be used to inform the sender about the problem.</p>
<h3 id="heading-troubleshooting-network">Troubleshooting Network</h3>
<p>ICMP messages can be used to verify connectivity between devices on a network and to diagnose network problems. For instance, the ping utility sends an ICMP Echo Request message to a device and waits for an Echo Reply. If the receiving device is reachable and able to respond, it sends back an Echo Reply message; if not, the sender can conclude that there is a problem with the connection.</p>
<p><strong>Traceroute is another tool that relies on ICMP messages</strong> to diagnose network problems. It sends a series of packets toward a destination device, each with an increasing TTL value, and each router along the way replies with an ICMP Time Exceeded message when the TTL runs out. By looking at which routers reply and how long they take, network managers can map the path packets take and spot problems along the way.</p>
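<p>The TTL mechanics behind traceroute can be modelled in a few lines. The sketch below is a toy simulation, not a real network probe (the addresses and function names are made up for illustration); it only shows why an increasing TTL reveals one router per probe:</p>

```typescript
// Toy model of traceroute: each probe carries an increasing TTL, and the
// router where the TTL expires answers with an ICMP Time Exceeded message,
// revealing one hop of the path. The final probe reaches the destination,
// which answers with an Echo Reply instead.
function probeWithTtl(route: string[], ttl: number): { from: string; type: string } {
  let remaining = ttl;
  for (let hop = 0; hop < route.length; hop++) {
    remaining--; // each router decrements the TTL
    const isDestination = hop === route.length - 1;
    if (remaining === 0 && !isDestination) {
      return { from: route[hop], type: 'Time Exceeded' };
    }
    if (isDestination && remaining >= 0) {
      return { from: route[hop], type: 'Echo Reply' };
    }
  }
  throw new Error('unreachable');
}

function traceroute(route: string[]): string[] {
  const hops: string[] = [];
  for (let ttl = 1; ttl <= route.length; ttl++) {
    hops.push(probeWithTtl(route, ttl).from);
  }
  return hops;
}
```

<p>Running the model over a three-hop route reports each router in order, which is exactly the hop-by-hop picture a real traceroute prints.</p>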
<h3 id="heading-path-mtu-discovery">Path MTU Discovery</h3>
<p>ICMP can be used to discover the maximum transmission unit (MTU) of the network path between two devices. This ensures that packets are not fragmented along the way, which improves network performance and lowers the chance of packet loss. Using ICMP messages, a sender can learn the MTU of a network path and size its packets to match.</p>
<h3 id="heading-traffic-management">Traffic Management</h3>
<p>ICMP messages can be used to manage network traffic by letting routers signal each other and adjust how packets flow, keeping the network from becoming congested. For example, a router could use the ICMP Source Quench message to tell a sender to slow down its transmission rate when the network is busy (although Source Quench has since been deprecated by RFC 6633).</p>
<h3 id="heading-security">Security</h3>
<p>ICMP messages can also help detect and mitigate some types of network attacks, such as denial-of-service (DoS) attacks and IP spoofing attacks. For example, an ICMP Echo Request flood attack sends a large number of Echo Request messages to a target device, overwhelming it and degrading the network. By monitoring ICMP traffic, network managers can detect and block these kinds of attacks.</p>
<h3 id="heading-ipv6-neighbor-discovery">IPv6 Neighbor Discovery</h3>
<p>ICMPv6 is used by IPv6 devices for neighbour discovery, which is the process of discovering other devices on a network. ICMPv6 messages are used to identify and communicate with other devices on the same network segment, which is essential for IPv6 network operations.</p>
<h2 id="heading-icmp-message-types">ICMP Message Types</h2>
<p>ICMP messages come in different types, and each type serves a unique purpose. Some of the common ICMP message types include:</p>
<h3 id="heading-echo-requestreply">Echo Request/Reply</h3>
<p>An ICMP Echo Request/Reply exchange, commonly known as a "ping," is a straightforward network diagnostic tool that enables one device to check its network connectivity to another device.</p>
<p>The process works as follows:</p>
<ol>
<li><p>An ICMP Echo Request packet is sent from the starting device to the IP address of the receiving device. This packet contains a unique identifier and a sequence number.</p>
</li>
<li><p>After receiving the ICMP Echo Request packet, the receiver responds with an ICMP Echo Reply packet. This reply carries the same identifier and sequence number as the request.</p>
</li>
<li><p>Upon receiving the ICMP Echo Reply packet, the initiating device can determine whether the destination device is reachable and responsive.</p>
</li>
</ol>
<p>If the initiating device doesn't get an ICMP Echo Reply packet within a certain amount of time, it can assume that the target device is not reachable or is having trouble connecting.</p>
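<p>The identifier and sequence number mentioned above live in a small, fixed header. Below is a hedged sketch of the wire format of an ICMP Echo Request (type 8, code 0) together with the RFC 1071 Internet checksum that covers it; actually transmitting the packet would require a raw socket and elevated privileges, which is omitted:</p>

```typescript
// RFC 1071 Internet checksum: one's complement of the one's complement
// sum of the data taken as big-endian 16-bit words.
function internetChecksum(data: Buffer): number {
  let sum = 0;
  for (let i = 0; i < data.length; i += 2) {
    // A trailing odd byte is padded with zero.
    const word = (data[i] << 8) + (i + 1 < data.length ? data[i + 1] : 0);
    sum += word;
  }
  // Fold the carries back in (one's complement addition).
  while (sum > 0xffff) sum = (sum & 0xffff) + (sum >>> 16);
  return ~sum & 0xffff;
}

function buildEchoRequest(id: number, seq: number, payload: Buffer): Buffer {
  const packet = Buffer.alloc(8 + payload.length);
  packet.writeUInt8(8, 0);      // type 8: Echo Request
  packet.writeUInt8(0, 1);      // code 0
  packet.writeUInt16BE(0, 2);   // checksum placeholder
  packet.writeUInt16BE(id, 4);  // identifier
  packet.writeUInt16BE(seq, 6); // sequence number
  payload.copy(packet, 8);
  packet.writeUInt16BE(internetChecksum(packet), 2);
  return packet;
}
```

<p>A handy property of this checksum is that recomputing it over a correctly checksummed packet yields zero, which is how receivers validate incoming ICMP messages.</p>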
<h3 id="heading-destination-unreachable">Destination Unreachable</h3>
<p>This type of message is generated when a packet cannot be delivered to its specified destination. The message provides details about why delivery failed, such as a network error or an unreachable host or port.</p>
<h3 id="heading-time-exceeded">Time Exceeded</h3>
<p>This message type is sent when a packet is discarded because its time-to-live (TTL) value has expired. The TTL of a packet is reduced by one at each router it passes through; once it reaches zero, the packet is discarded, and the discarding router sends a Time Exceeded message back to the sender.</p>
<h3 id="heading-redirect">Redirect</h3>
<p>A router sends this message type to inform a device that a better route to a given destination is available.</p>
<h3 id="heading-router-advertisement-solicitation">Router Advertisement/ Solicitation</h3>
<p>Routers use these message types to advertise their presence and share information about the topology of the network.</p>
<h2 id="heading-applications-of-icmp">Applications of ICMP</h2>
<p>ICMP is primarily used for error reporting and network troubleshooting, but it also has several other applications, which include:</p>
<h3 id="heading-finding-out-which-host-is-responsible-for-a-network-problem">Finding Out Which Host Is Responsible For A Network Problem</h3>
<p>Locating the source of a network problem is one of the most important uses of ICMP. By sending a packet with a time-to-live (TTL) value of 1, a network administrator forces the first router on the path to discard it and return an ICMP Time Exceeded message. That error message carries the router's IP address, so the administrator learns exactly which device handled the packet; by increasing the TTL one step at a time, each successive hop can be identified until the failing device is found.</p>
<h3 id="heading-network-monitoring-and-reporting">Network Monitoring and Reporting</h3>
<p>ICMP can be used to monitor the health of a network. For example, you can use ICMP to check whether a host is reachable or if a network device is up and running. When you send a request to a host, it will generate a response to let you know that everything is okay or if there is a problem.</p>
<p>Three main ICMP message types can be used for monitoring and reporting:</p>
<ul>
<li><p><strong>Echo request (ping)</strong> - This is used to check if a host is up and running. This can be used to look for hosts that are unreachable on a network. It can also be used to find out how long a host takes to respond to a ping request.</p>
</li>
<li><p><strong>Echo reply</strong> - This is used to respond to an echo request. It tells you when a host is up and running.</p>
</li>
<li><p><strong>Destination unreachable</strong> - This can be used to let you know that a host is down, or that there is a network problem that causes the host to be unreachable.</p>
</li>
</ul>
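<p>At its core, a monitoring probe only needs to map whatever comes back (or a timeout) onto a health status. The type numbers below are the standard IPv4 ICMP assignments; the <code>classify</code> helper itself is a hypothetical sketch:</p>

```python
ECHO_REPLY = 0        # standard IPv4 ICMP type numbers
DEST_UNREACHABLE = 3
TIME_EXCEEDED = 11

def classify(icmp_type):
    """Map the ICMP type of a probe's response to a health status."""
    if icmp_type == ECHO_REPLY:
        return "up"
    if icmp_type == DEST_UNREACHABLE:
        return "down or unreachable"
    if icmp_type is None:         # nothing came back before the timeout
        return "no response"
    return "unknown"
```
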
<h3 id="heading-network-tracing">Network Tracing</h3>
<p>ICMP also lets a host trace the path that data packets take through a network. A traceroute tool sends a series of probes with increasing TTL values: the first probe (TTL 1) is dropped by the first router, which returns an ICMP Time Exceeded message identifying itself; the second probe (TTL 2) reveals the second router, and so on, until a reply comes back from the destination itself. The collected replies map out every hop along the route.</p>
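<p>The common TTL-based tracing technique, in which each TTL-limited probe draws a reply from one hop further along the path, can be sketched as a simulation (the router names are made up):</p>

```python
def traceroute(path_routers, max_ttl=30):
    """Simulated traceroute: each TTL-limited probe reveals one more hop."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        if ttl <= len(path_routers):
            hops.append(path_routers[ttl - 1])  # Time Exceeded from this router
        else:
            hops.append("destination")          # Echo Reply from the target
            break
    return hops
```
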
<h2 id="heading-conclusion">Conclusion</h2>
<p>ICMP is a critical protocol in computer networking with many applications. It plays a vital role in error reporting and network troubleshooting, and it also has several other applications, such as path MTU discovery, traffic management, security, and IPv6 neighbour discovery. Without ICMP, network administrators would have a much harder time diagnosing and resolving network issues, optimizing network performance, and maintaining network security.</p>
]]></content:encoded></item><item><title><![CDATA[Authentication and Identity Validation]]></title><description><![CDATA[Authentication and identity validation are important concepts in software engineering, as they ensure that only authorized users have access to certain resources or systems. In this article, we will explore the basics of authentication and identity v...]]></description><link>https://blog.sofwancoder.com/authentication-and-identity-validation</link><guid isPermaLink="true">https://blog.sofwancoder.com/authentication-and-identity-validation</guid><category><![CDATA[identity-management]]></category><category><![CDATA[identity platform]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sat, 28 Jan 2023 19:23:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1674931921714/0bad6249-a723-4106-89c6-58abf05f679a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Authentication and identity validation are important concepts in software engineering, as they ensure that only authorized users have access to certain resources or systems. In this article, we will explore the basics of authentication and identity validation, and discuss some best practices for implementing these security measures.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>When you log into your bank, social media account, or email, you are usually prompted to provide some sort of verification such as a password or a secret answer. This is known as “authentication”.</p>
<p>When you are building an authentication system from scratch, you need identity validation as part of your user onboarding process. With identity validation, we can check what information about the user is publicly available and how trustworthy those sources are.</p>
<p>For this article, I've consulted <a target="_blank" href="https://www.linkedin.com/in/ibraheem-zulkifli/">a DevRel expert</a> from <a target="_blank" href="https://myidentitypass.com">IdentityPass</a>—a company that offers a suite of products to help you verify and gain deeper insights about your customers/business to stay compliant and avoid fraudulent activities— for tips to consider when dealing with Identity-aware systems.</p>
<h2 id="heading-what-is-authentication">What is Authentication?</h2>
<p>Authentication is the process of verifying the identity of a user, device, or system. It is typically done through the use of credentials, such as a username and password, biometric data, or a token. When you log into a website, app or other digital service, you are authenticating yourself with that service.</p>
<p>In other words, authentication is how users prove their identity to a system. When you log into your email account, you prove that you own the account by entering your password.</p>
<h2 id="heading-types-of-authentication">Types of Authentication</h2>
<p>The goal of authentication is to confirm that the user or device attempting to access a system or resource is who or what they claim to be. There are several types of authentication methods, including:</p>
<h3 id="heading-knowledge-based-authentication-what-you-know">Knowledge-based authentication: What you know</h3>
<p>This type of authentication relies on the user being able to provide a piece of information that only they should know, such as a password or a personal identification number (PIN).</p>
<h3 id="heading-possession-based-authentication-what-you-have">Possession-based authentication: What you have</h3>
<p>This type of authentication requires the user to present a physical object that they possess, such as a security token or a smart card.</p>
<h3 id="heading-inherence-based-authentication-what-you-are">Inherence-based authentication: What you are</h3>
<p>This type of authentication uses biometric data, such as fingerprints, facial recognition, or voice recognition, to verify the identity of the user.</p>
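<p>For the knowledge factor, the golden rule is to store a salted hash rather than the secret itself. A minimal sketch using Python's standard library (the iteration count here is illustrative, not a recommendation):</p>

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash; the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def verify_password(stored: bytes, salt: bytes, attempt: str) -> bool:
    """Constant-time comparison avoids leaking information via timing."""
    return hmac.compare_digest(stored, hash_password(attempt, salt))

salt = os.urandom(16)                        # a fresh random salt per user
stored = hash_password("correct horse", salt)
```
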
<h2 id="heading-what-is-identity-validation">What is Identity Validation?</h2>
<p>Identity validation is the process of confirming that an individual is who they claim to be: that a person attempting to sign up for a service or log into an existing account really is that person. It involves verifying that the information a user provides is accurate and corresponds to a real person, which can include checking the user's name, date of birth, address, and other personal details.</p>
<p>There are several methods for identity validation, including:</p>
<ul>
<li><p><strong>Manual verification:</strong> This involves manually reviewing the information provided by the user and comparing it to other sources, such as government records or credit bureau data.</p>
</li>
<li><p><strong>Automated verification:</strong> This involves using software or other automated systems to verify the information provided by the user. This can include using algorithms to check for inconsistencies or red flags or using external sources—<a target="_blank" href="https://myidentitypass.com/"><strong>such as IdentityPass</strong></a>— to confirm the accuracy of the information.</p>
</li>
</ul>
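<p>A toy version of an automated check might flag obvious inconsistencies before ever calling an external provider. Everything below — the field names and the rules — is hypothetical, purely to illustrate the idea:</p>

```python
from datetime import date

def red_flags(record: dict) -> list:
    """Return a list of inconsistencies found in a sign-up record."""
    flags = []
    dob = record.get("date_of_birth")
    if dob is None or dob > date.today():
        flags.append("implausible date of birth")
    if not record.get("full_name", "").strip():
        flags.append("missing name")
    if record.get("country") not in record.get("document_countries", []):
        flags.append("ID document issued in a different country")
    return flags
```

<p>A real pipeline would feed records that pass such cheap checks on to an external verification source, rather than paying for a lookup on data that is internally inconsistent.</p>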
<p>When you create an account on Uber or Lyft, you authenticate yourself through your account by entering your email and password. But when you first use the account, you are asked to confirm your identity by adding a photo of your license. You are validating that you are who you say you are. Identity validation is often used in situations where sensitive information is being shared, like when applying for a loan or setting up a health insurance account. It is also used in business settings for compliance reasons, such as in financial services where certain types of accounts require an “accepted” or “verified” ID.</p>
<h2 id="heading-why-is-identity-validation-important">Why is Identity Validation Important?</h2>
<p>Identity validation is important for several reasons:</p>
<ol>
<li><p><strong>Security:</strong> By verifying the identity of users, organizations can ensure that only authorized individuals have access to sensitive information and systems. This helps to prevent unauthorized access, data breaches, and other security incidents.</p>
</li>
<li><p><strong>Compliance:</strong> Many industries and organizations are subject to regulations that require them to verify the identity of individuals. For example, financial institutions are required to comply with anti-money laundering (AML) and know-your-customer (KYC) regulations, which mandate the verification of customer identities.</p>
</li>
<li><p><strong>Fraud prevention:</strong> By validating the identity of users, organizations can detect and prevent fraudulent activity. For example, by verifying that the information provided by a user corresponds to a real person, organizations can prevent individuals from creating fake accounts or using stolen identities.</p>
</li>
<li><p><strong>Trust and credibility:</strong> By validating the identity of users, organizations can build trust and credibility with their customers. This can be especially important for businesses that rely on online transactions, where customers may be hesitant to provide personal information without assurance that it will be protected.</p>
</li>
<li><p><strong>Accurate record-keeping:</strong> Identity validation also helps organizations to maintain accurate and up-to-date records of their customers and clients. This can help organizations comply with regulations and laws that require the maintenance of accurate records and can be useful for future reference.</p>
</li>
</ol>
<p>Overall, identity validation is an important aspect of security and compliance that helps organizations to protect their assets and customers, prevent fraud, and maintain trust and credibility. Organizations must have well-defined and implemented procedures for identity validation that are compliant with industry and legal standards.</p>
<h2 id="heading-authenticate-identify-verify">Authenticate! identify!! Verify!!!</h2>
<p>Yes! Exactly in that order. Let your users tell you who they are, attempt to identify them, and finally verify that they are exactly everything they claim to be.</p>
<ul>
<li><p><strong>Authentication</strong> is the process of proving that you are who you say you are by presenting a password or another credential. When you log into an account or website, the system authenticates you by checking the credentials you supply against the ones it has on record.</p>
</li>
<li><p><strong>Identification</strong> is the process of stating who you are. You identify yourself by providing personal details like your name, date of birth or address.</p>
</li>
<li><p><strong>Verification</strong> is the process of confirming that a claim is true. You can verify an account, for example, by providing an identifying feature like your mother’s maiden name or your National Identification Number.</p>
</li>
</ul>
<h2 id="heading-should-you-implement-identity-validation">Should you implement identity validation?</h2>
<p>If you are storing sensitive information like National Identification Numbers, you may be required to implement identity validation. Additionally, some industries, like financial services, demand that businesses meet strict identity validation requirements. Other industries, like healthcare, also often require identity validation. Before implementing identity validation, make sure you understand the requirements in your industry.</p>
<p>It is important to note that not all identity verification providers are created equal. Selecting a provider that offers the features required for your business, including robust fraud prevention and reliable results, is critical. A very good example is <a target="_blank" href="https://myidentitypass.com/">IdentityPass</a> which offers a wide range of solutions to help businesses with many possible identity and business verification needs.</p>
<h2 id="heading-best-practices-for-implementing-identity-validation">Best Practices for Implementing Identity Validation</h2>
<ul>
<li><p><strong>Collect only what is necessary:</strong> Review your data requirements and make sure they are necessary and relevant to your business. The fewer verifications you require, the less friction there is in the sign-up process.</p>
</li>
<li><p><strong>Build flexibility into your requirements to reduce false negatives:</strong> A false negative occurs when a legitimate user fails a verification check and is unable to sign up for an account. These cost you real customers, so allow alternative checks when a primary one fails.</p>
</li>
<li><p><strong>Make it easy for users to verify their information:</strong> Offer multiple verification methods, and make sure you are guiding users through the process and helping them along the way.</p>
</li>
<li><p><strong>Use identity validation early in the onboarding process:</strong> By validating the user’s identity at the beginning of the onboarding process, you can significantly reduce false negatives and decrease the complexity of your sign-up and onboarding process.</p>
</li>
<li><p><strong>Use encrypted communication:</strong> When transmitting authentication credentials or other sensitive information, it's important to use encrypted communication to prevent interception by attackers.</p>
</li>
<li><p><strong>Implement identity validation measures:</strong> To ensure that the information provided by users is accurate and corresponds to real people, it's important to implement identity validation measures, such as manual or automated verification processes.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Authentication is the process by which a user proves their identity to a system, while identity validation is the process of confirming that a person attempting to sign up for a service or log into an existing account is who they claim to be.</p>
]]></content:encoded></item><item><title><![CDATA[Minimising Correlated Failures in Distributed Systems]]></title><description><![CDATA[Scalability and dependability are two areas where distributed systems face new problems. When many services are hosted on separate computers, they must use network protocols to talk to one another. The more the variety of services available, the grea...]]></description><link>https://blog.sofwancoder.com/minimising-correlated-failures-in-distributed-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/minimising-correlated-failures-in-distributed-systems</guid><category><![CDATA[distributed system]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Mon, 16 Jan 2023 09:11:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1673829812088/8f2e4a02-2cca-4cc9-b0b3-942be587d26c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Scalability and dependability are two areas where distributed systems face new problems. When many services are hosted on separate computers, they must use network protocols to talk to one another. The more the variety of services available, the greater the likelihood that something will go wrong. Distributed systems will inevitably experience some form of failure; the important thing is how you plan to handle it. Furthermore, if your system is hosted on cloud infrastructure's virtual machines (VMs), failures can have knock-on repercussions for other customers who are utilising the same physical hardware. This article explores the topic of improving the resilience of large-scale distributed systems in the face of failure.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Distributed systems are complex. They involve many different processes and services that can fail or degrade independently. As a result, distributed applications need to be able to handle these failures gracefully. This article outlines five techniques for minimizing correlated failures in distributed systems:</p>
<ul>
<li><p>failure isolation,</p>
</li>
<li><p>defensive coding,</p>
</li>
<li><p>continuous monitoring,</p>
</li>
<li><p>peer review, and</p>
</li>
<li><p>immutable APIs.</p>
</li>
</ul>
<p>These techniques help developers avoid the most common sources of correlated failures in software stacks and services across all layers of the stack and make it easier to debug issues when they do occur. These techniques will not eliminate every instance of correlated failures across your system’s architecture, but they will go a long way toward reducing their presence and impact on your end users.</p>
<h2 id="heading-the-challenges-of-scalability-and-reliability">The Challenges of Scalability and Reliability</h2>
<p><strong>Distributed systems handle a large amount of data across a network of systems that may be geographically distributed</strong>. Distributed systems are more complex than centralized systems, but they can also be more efficient and scalable because they can employ additional computing resources.</p>
<p>For example, you might use a distributed system if you need to analyze data in a very large database that can’t be processed on one computer. Distributed systems also have their unique challenges related to scalability and reliability.</p>
<p>To explain; <strong>let’s take a look at the high-level architecture of a typical distributed system:</strong></p>
<p>Distributed system architectures follow a standard pattern where data is ingested into a centralized data store.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673826472315/543ca6cd-e4a0-47ae-9353-3686fee4ad72.png" alt class="image--center mx-auto" /></p>
<p>This data store is responsible for replicating and distributing that data to other data stores that are spread across the network. If you’re building a distributed system, it’s important to understand that these data stores are not equally reliable.</p>
<p>Each data store has its unique level of availability and reliability. In many ways, this is what makes distributed systems difficult to build. You have to account for these differences, and <strong>you have to account for the fact that systems inevitably fail.</strong> In other words, <strong>distributed systems are fragile by default.</strong> You have to <strong>do things (and lots of things) differently to make them reliable.</strong></p>
<h2 id="heading-why-is-it-so-hard-to-build-reliable-distributed-systems">Why is it so hard to build reliable distributed systems?</h2>
<p>The central challenge of building a distributed system is that the system itself is distributed — all the components are distributed across different locations. Distributed systems pose unique reliability challenges that can’t be solved with a single, centralized approach because that centralized approach will only be as reliable as the weakest component. When something fails, it can impact the entire system, and distributed systems are always susceptible to failure.</p>
<h2 id="heading-identifying-the-critical-path-in-distributed-systems">Identifying the Critical Path in Distributed Systems</h2>
<p>When you’re trying to optimize a distributed system, the first step is to identify the critical path. The critical path is the path that determines the overall availability of the system. <strong>The critical path is the path that takes the longest amount of time to complete. It’s the path where a failure will have the most impact on the system as a whole.</strong> If this path fails, the entire system will be at risk of failing.</p>
<p>To identify the critical path, <strong>you have to look at everything that your system does.</strong> <strong>You have to understand every operation that your system performs</strong> and every operation that each component of your system performs. <strong>Once you’ve identified the critical path, you can focus your attention on making that path as reliable as possible.</strong> The less reliable the path, the more attention you should give to it.</p>
<h2 id="heading-use-redundancy-to-reduce-failures-in-distributed-systems">Use Redundancy to Reduce Failures in Distributed Systems</h2>
<p><strong>Redundancy is the ability to withstand failure by having multiple redundant components that can take over if a component fails.</strong> It is a common technique used to improve the availability of distributed systems because it enables you to make the critical path more reliable by adding more components to that path.</p>
<p>The more components you have performing a single operation, the less likely each component is to fail. There are many different kinds of redundancy you can use in distributed systems, including:</p>
<ul>
<li><p><strong>Automated failover</strong> - Automated failover promotes a secondary component to take over when the primary component fails. In its simplest form this may be little more than an alert that prompts an operator to fail over manually; mature systems promote the standby automatically.</p>
</li>
<li><p><strong>Service-level agreement (SLA)</strong> - An SLA is an agreement between a service owner and a client that specifies the level of availability and performance the system will maintain.</p>
</li>
<li><p><strong>Load balancing</strong> - Load balancing distributes the workload across multiple components. This can be useful for distributing the workload across multiple instances of an application or across instances of multiple applications.</p>
</li>
<li><p><strong>Redundant data stores</strong> - Redundant data stores allow you to write the same data to multiple copies of a data store. This helps to ensure that the data will be retained in the event that one data store fails.</p>
</li>
</ul>
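<p>Redundancy at the read path can be sketched in a few lines: try each replica in turn and return the first answer that arrives. The replica interface here (a plain callable) is an assumption for illustration:</p>

```python
def read_with_fallback(replicas, key):
    """Query replicas in order; a single failed replica is not fatal."""
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except Exception as exc:      # in real code, catch specific errors
            last_error = exc
    raise last_error                  # every replica failed
```
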
<h2 id="heading-use-failure-detection-to-repair-your-system-after-a-failure-occurs">Use Failure Detection to Repair Your System After a Failure Occurs</h2>
<p><strong>Failure detection is the process of monitoring your system to identify when it fails.</strong> For example, you can detect a failure when a service is unavailable or when it returns an error. There are several different techniques for detecting failures, but some of the most common include:</p>
<ul>
<li><p><strong>Timeouts</strong> - Timeouts can be used to detect when a service is taking too long to respond. This is especially useful when communicating with services hosted on different networks.</p>
</li>
<li><p><strong>Retry logic</strong> - Retry logic can be used to detect when a service is unavailable by retrying the request until it succeeds.</p>
</li>
<li><p><strong>Circuit breakers</strong> - Circuit breakers can be used to detect when a service is failing and automatically stop sending requests to that service.</p>
</li>
<li><p><strong>Thresholds</strong> - Thresholds can be used to trigger an alert when a metric crosses a certain threshold.</p>
</li>
<li><p><strong>Outages</strong> - Outages occur when a service is completely unavailable.</p>
</li>
</ul>
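<p>Of these, the circuit breaker is the least obvious to implement, so here is a minimal single-threaded sketch (the failure threshold and reset window are illustrative):</p>

```python
import time

class CircuitBreaker:
    """Stop calling a failing service until a reset window has passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None     # half-open: let one trial request through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0             # any success closes the circuit fully
        return result
```

<p>While the circuit is open, callers fail fast instead of piling requests onto a struggling service, which gives it room to recover.</p>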
<h2 id="heading-dont-rely-on-one-thing-when-building-a-distributed-system">Don’t Rely on One Thing When Building a Distributed System</h2>
<p>One of the most important things to remember when building a distributed system is that <strong>you can’t rely on any single component to provide 100% uptime.</strong> You can’t rely on a single data store, a single network, or a single service. Instead, you have to <strong>build the system in such a way that it can survive even when one or more components fail.</strong> You have to build the system in such a way that it can withstand the occasional, inevitable failure of a component. To do that, you have to <strong>design the system to be fault-tolerant</strong> by using the following principles:</p>
<ul>
<li><p><strong>Isolation</strong> - Isolation is the ability to run one component as an independent unit, so that a failure inside it is contained rather than spreading to the rest of the system.</p>
</li>
<li><p><strong>Decoupling</strong> - Decoupling is the ability of components to communicate without depending directly on each other, for example through queues or asynchronous messages, so that one component can fail or be replaced without its peers noticing.</p>
</li>
<li><p><strong>Redundancy</strong> - Redundancy is the ability to withstand failure by using multiple components to perform the same task, so that losing any single copy does not take the task down.</p>
</li>
</ul>
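<p>Decoupling in particular can be sketched with nothing more than a queue between producer and consumer; the order-processing names below are made up for illustration:</p>

```python
from queue import Queue

events = Queue()  # the only thing producer and consumer share

def record_order(order_id):
    """Producer: publish and move on; never calls the consumer directly."""
    events.put({"order": order_id})

def process_orders():
    """Consumer: drain whatever has accumulated, at its own pace."""
    processed = []
    while not events.empty():
        processed.append(events.get()["order"])
    return processed
```

<p>Even if the consumer crashes and restarts, the producer keeps accepting orders; the queue absorbs the failure.</p>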
<h2 id="heading-more-techniques-for-avoiding-correlated-failures">More Techniques for avoiding correlated failures</h2>
<p>The most fundamental truth of distributed systems is that they will experience failures. No matter how meticulously you plan and test your system, unexpected problems will occur. To build distributed systems that are both scalable and reliable, it is essential to learn how to handle these failures.</p>
<h3 id="heading-failure-isolation">Failure Isolation</h3>
<p><strong>The key to dealing with failures is to minimize the side effects of those failures.</strong> This means <strong>isolating the part of the system that failed</strong> from the rest of the system. The best way to do this is to design your system so that each component can be operated in isolation. This lets you scale systems horizontally by adding more capacity without adding the risk of cascading failure.</p>
<h3 id="heading-defensive-coding">Defensive Coding</h3>
<p>Defensive coding is another important part of distributed-system design. Building a distributed system requires a different way of thinking than building an application that runs on a single server: unlike single-server apps, distributed systems must account for stability, scalability, and performance from the start.</p>
<p>Distributed systems, in particular, must <strong>follow best practices for handling errors</strong>, especially unplanned events like network and hardware failures. That means <strong>handling all errors gracefully</strong>, not just fatal ones. Because hardware and networks can fail at any time, distributed <strong>systems must be designed to treat failure as a normal event</strong>, which makes error handling harder than in single-server applications.</p>
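<p>A common defensive pattern is to treat transient failures as routine: retry with exponential backoff rather than giving up on the first error. A minimal sketch (the attempt count and delays are illustrative):</p>

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.05):
    """Retry transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise                 # out of retries: surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

<p>Note that only errors known to be transient are retried; anything else propagates immediately, which is itself a form of graceful handling.</p>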
<h3 id="heading-continuous-monitoring">Continuous Monitoring</h3>
<p>Monitoring and logging play a significant role in the design of distributed systems. This is because it can be difficult to keep track of the numerous moving elements that distributed systems frequently contain. In particular, sharding is frequently used to scale distributed systems that rely on distributed databases. Data distribution across numerous machines is called sharding. How do you identify the downed machines and the missing data if your distributed system depends on a sharded database? How may communication errors between database shards be found?</p>
<p><strong>Monitoring entails more than uptime checks</strong>: it means <strong>tracking every element that affects uptime</strong>. In a distributed system, monitoring itself must also be distributed. A single centralised monitoring solution cannot reliably observe components spread across many machines and networks, just as a single machine cannot hold all of the system's data; metrics should instead be collected close to each component and then aggregated.</p>
<h3 id="heading-peer-review">Peer Review</h3>
<p>Reviewing your code with fresh eyes is one of the best ways to keep bugs out. Peer reviews should be conducted as soon as possible after the design phase of a distributed system is complete: this surfaces problems and design flaws up front, before the code is ever put into use. By showing your design to a coworker, you can spot potential difficulties before they become serious issues that must be corrected. There are a few approaches you can take: use a collaborative tool such as a shared document or a design review board, or present your design to a colleague in person.</p>
<h3 id="heading-immutable-api">Immutable API</h3>
<p>API design is also crucial in developing distributed systems: APIs are the connective tissue between components, so it is not enough to ship an endpoint and hope for the best. A well-designed API is essential, and an immutable API design is one approach. Simply put, <strong>an immutable API design exposes the usual operations (creating, reading, updating, and deleting) in a way that never modifies existing records in place: updates append new versions rather than overwriting data</strong>. This matters for a couple of reasons. Your API can scale more easily because you avoid concurrency and resource-locking issues on shared records. <strong>The only guarantee in distributed systems is that something will go wrong</strong>, and <strong>an API that never alters data in place reduces the risk of a single component failure triggering a cascading failure across the system.</strong></p>
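<p>One way to read "immutable" here is append-only: an update creates a new version of a record instead of overwriting the old one. A toy in-memory sketch of that idea (all names hypothetical):</p>

```python
class AppendOnlyStore:
    """Records are never modified in place; every write appends a version."""

    def __init__(self):
        self._versions = {}

    def put(self, key, value):
        self._versions.setdefault(key, []).append(value)

    def get(self, key):
        return self._versions[key][-1]   # the latest version wins

    def history(self, key):
        return list(self._versions.get(key, []))
```

<p>Because old versions are never destroyed, a reader that races with a writer still sees a complete, consistent version rather than a half-applied update.</p>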
<h2 id="heading-summing-up">Summing up</h2>
<p>Distributed systems present new challenges in scalability and reliability. When all services reside on different machines, they must communicate with one another through networked protocols. The more services there are, the more opportunities there are for something to go wrong. Failure is a natural part of distributed systems, but it’s how you deal with that failure that matters most. To handle this, you can use redundancy to reduce failures, use failure detection to repair the system after a failure occurs, and don't rely on one thing when building a distributed system. Distributed systems are challenging to build, but they are also powerful and scalable. To optimize them, you have to understand the critical path and focus on making them as reliable as possible.</p>
]]></content:encoded></item><item><title><![CDATA[Distributed Transactions: Overview]]></title><description><![CDATA[A distributed transaction is one that involves numerous database systems or other resources in a single transaction. Changes made to one system or resource must be reflected in all other systems or resources involved in the transaction in such instan...]]></description><link>https://blog.sofwancoder.com/distributed-transactions-overview</link><guid isPermaLink="true">https://blog.sofwancoder.com/distributed-transactions-overview</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Distributed Database]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Tue, 03 Jan 2023 21:48:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1672775528871/b52c54bc-dd33-497e-92ae-7464cb31427b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A distributed transaction is one that involves numerous database systems or other resources in a single transaction. Changes made to one system or resource must be reflected in all other systems or resources involved in the transaction in such instances. In other words, any changes made by the transaction must be committed or rolled back in all systems or resources involved. This article defines distributed transactions and describes how they function.</p>
<h2 id="heading-what-are-transactions">What are Transactions?</h2>
<p>Transactions are a basic notion in database systems and other data-manipulation systems. A transaction is a unit of work that consists of one or more data operations. In a nutshell, a transaction is a collection of commands that either complete entirely or fail altogether.</p>
<h3 id="heading-properties-of-a-transaction">Properties of a Transaction</h3>
<p>The key properties of a transaction are atomicity, consistency, isolation, and durability.</p>
<h4 id="heading-atomicity">Atomicity</h4>
<p>Atomicity refers to the property of a transaction that ensures that either all or none of the operations in the transaction are performed. This means that if an error happens during transaction execution, all changes made by the transaction are undone, and the system is returned to a consistent state.</p>
<h4 id="heading-consistency">Consistency</h4>
<p>The property of a transaction that assures that the transaction leaves the system in a consistent state is referred to as consistency. A consistent state is one in which all of the system's rules and restrictions are met.</p>
<h4 id="heading-isolation">Isolation</h4>
<p>The property of a transaction that ensures that the changes performed by the transaction are not visible to other transactions until the transaction is committed is referred to as isolation. This means that other transactions cannot see the intermediate states of the data while the transaction is running.</p>
<h4 id="heading-durability">Durability</h4>
<p>Durability is the property of a transaction that guarantees that, once the transaction is committed, its modifications will survive a system failure such as a crash or power loss.</p>
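<p>Atomicity, the first of these properties, can be illustrated with a minimal in-memory sketch (not a real database): every operation is applied to a snapshot, and the snapshot replaces the old state only if all operations succeed. The names <code>Accounts</code> and <code>transfer</code> are illustrative:</p>
<pre><code class="lang-typescript">// Minimal sketch of atomicity: apply all operations or none.
type Accounts = { [name: string]: number };

function transfer(accounts: Accounts, from: string, to: string, amount: number): Accounts {
  // Work on a snapshot so a failure leaves the original state untouched.
  const snapshot: Accounts = { ...accounts };
  snapshot[from] -= amount;
  snapshot[to] += amount;
  if (0 > snapshot[from]) {
    // Insufficient funds: abort, leaving the caller's state unmodified (rollback).
    throw new Error('insufficient funds');
  }
  return snapshot; // commit: the whole snapshot replaces the old state
}

let accounts: Accounts = { alice: 100, bob: 50 };
accounts = transfer(accounts, 'alice', 'bob', 30); // succeeds atomically
console.log(accounts.alice, accounts.bob); // 70 80
try {
  accounts = transfer(accounts, 'alice', 'bob', 500); // fails, nothing changes
} catch (err) {
  console.log(accounts.alice, accounts.bob); // still 70 80
}
</code></pre>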
<h2 id="heading-when-is-a-transaction-distributed">When is a Transaction Distributed?</h2>
<p>Transactions are straightforward to implement in a single-system environment. The system records the changes made by the transaction in a temporary log called the transaction log and, depending on the outcome of the transaction, either commits the changes to the data store or rolls them back.</p>
<p>However, the problem becomes more complicated in a distributed system, where numerous systems or resources are involved in a single transaction. This is because the transaction must produce results that are consistent and persistent across all systems or resources involved. This is known as a distributed transaction.</p>
<p><strong>Distributed transactions refer to a situation where multiple database systems, or other resources, are involved in a single transaction.</strong> In such cases, the changes made to one system or resource must be reflected in all the other systems or resources participating in the transaction. In other words, all the changes made by the transaction must be committed or rolled back in all the participating systems or resources.</p>
<h2 id="heading-requirements-for-distributed-transactions">Requirements for distributed transactions</h2>
<p>There are two important requirements for distributed transactions:</p>
<ul>
<li><p>Consistency: this means all distributed databases are equally up to date with the most recent information.</p>
</li>
<li><p>Termination: the distributed transaction is either fully executed or not executed at all. If a distributed transaction fails, it needs to fail for every database that participated in the transaction.</p>
</li>
</ul>
<h2 id="heading-importance-of-distributed-transactions">Importance of Distributed Transactions</h2>
<p>When a business process involving several systems or resources must be atomic—that is, all changes must be committed or none of them are committed—distributed transactions become crucial. A distributed transaction would be necessary, for instance, to guarantee the completion or reversal of a bank transfer between two different banking systems in the event of an error.</p>
<p>Distributed transactions can also be helpful when processing a payment, for example when validating and charging a credit card. Typically, billing information is kept in a separate database from credit card information. Using distributed transactions, we can keep the data in these two databases synchronised.</p>
<h2 id="heading-challenges-of-implementing-distributed-transactions">Challenges of Implementing Distributed Transactions</h2>
<p>There are two major challenges involved in implementing distributed transactions, which are:</p>
<h3 id="heading-consistency-problem">Consistency Problem</h3>
<p>A major hurdle in implementing distributed transactions is making sure the changes performed by the transaction are consistent across all systems or resources involved. The issue is commonly referred to as the "consistency problem."</p>
<h3 id="heading-durability-problem">Durability Problem</h3>
<p>Assuring that the transaction's modifications will survive a system failure is another difficult task. The issue is commonly referred to as the "durability problem."</p>
<h2 id="heading-techniques-and-protocols-for-implementing-distributed-transactions">Techniques and Protocols for Implementing Distributed Transactions</h2>
<p>To solve the consistency and durability problems, various techniques and protocols have been developed for implementing distributed transactions.</p>
<h3 id="heading-two-phase-commit-protocol-2pchttpsblogsofwancodercomtwo-phased-commit-and-extended-architecture-the-basics"><a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics">Two-Phase Commit Protocol (2PC)</a></h3>
<p>The transaction log is a short-term log used in this protocol to record the changes made by the transaction. The transaction coordinator, the coordinating entity, sends each system or resource involved in the transaction a request asking it to prepare to commit the modifications. Once the coordinator determines that all systems or resources are ready to commit, it issues a commit request. If the coordinator detects that any system or resource is not ready to commit, it issues a rollback request, at which point every participant reverses its recent changes.</p>
<p>There are several variations of the <a target="_blank" href="https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics">two-phase commit protocol</a>, including the three-phase commit protocol and the distributed commit protocol. These variations address specific problems or improve the efficiency of the protocol.</p>
<h3 id="heading-optimistic-concurrency-control-occ">Optimistic Concurrency Control (OCC)</h3>
<p>In this technique, each system or resource participating in the transaction maintains a version number for the data. When a transaction attempts to update the data, it checks the version number. If the version number has not changed, the transaction updates the data and increments the version number. If the version number has changed, the transaction rolls back the changes and retries the update.</p>
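<p>The version check described above amounts to a compare-and-set. A small TypeScript sketch, with illustrative names (<code>Versioned</code>, <code>occUpdate</code>):</p>
<pre><code class="lang-typescript">// Hedged sketch of optimistic concurrency control with a version number.
type Versioned = { value: number; version: number };

function occUpdate(record: Versioned, expectedVersion: number, newValue: number): boolean {
  if (record.version !== expectedVersion) {
    return false; // someone else updated the record first: caller must re-read and retry
  }
  record.value = newValue;
  record.version += 1; // bump the version so concurrent writers detect the change
  return true;
}

const row: Versioned = { value: 10, version: 1 };

// Two writers both read version 1; only the first update wins.
const firstWrite = occUpdate(row, 1, 20);  // succeeds, version becomes 2
const secondWrite = occUpdate(row, 1, 30); // fails: stale version
console.log(firstWrite, secondWrite, row);
</code></pre>
<p>In a real database the check and the update must happen atomically (for example via a conditional <code>UPDATE ... WHERE version = $1</code>), which this in-memory sketch glosses over.</p>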
<h3 id="heading-xa-standard">XA Standard</h3>
<p>The XA standard is a technique for implementing distributed transactions that involve the use of an XA interface to coordinate the transaction. The XA interface defines a set of functions that can be used to start, end, and roll back a transaction.</p>
<h3 id="heading-sagas-pattern">Sagas Pattern</h3>
<p>One way to execute distributed transactions is through the use of the Sagas pattern, which entails slicing up the transaction into several smaller, self-contained pieces. Individual sagas can be committed or rolled back without affecting others.</p>
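<p>A sketch of the Sagas pattern, with illustrative step names: each step carries a compensating action, and when a step fails, the already-completed steps are undone in reverse order:</p>
<pre><code class="lang-typescript">// Illustrative saga sketch: each step has a compensating action.
type Step = { name: string; action: () => void; compensate: () => void };

function runSaga(steps: Step[], log: string[]): void {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      step.action();
      log.push('did ' + step.name);
      done.push(step);
    } catch (err) {
      // Roll back every completed step, newest first.
      for (const completed of done.reverse()) {
        completed.compensate();
        log.push('undid ' + completed.name);
      }
      return;
    }
  }
}

const sagaLog: string[] = [];
runSaga([
  { name: 'reserve-stock', action: () => {}, compensate: () => {} },
  { name: 'charge-card', action: () => { throw new Error('card declined'); }, compensate: () => {} },
], sagaLog);
console.log(sagaLog); // [ 'did reserve-stock', 'undid reserve-stock' ]
</code></pre>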
<h3 id="heading-eventual-consistency-model">Eventual Consistency Model</h3>
<p>The eventual consistency model is a technique for implementing distributed transactions that relaxes the consistency constraints of the transaction and lets the participating systems or resources converge on a consistent state over time.</p>
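<p>An illustrative in-memory sketch of the idea: writes are acknowledged by the primary immediately and propagate to a secondary replica later, so a read from the secondary may be stale until replication catches up. All names here are made up for the example:</p>
<pre><code class="lang-typescript">// Rough sketch of eventual consistency with asynchronous replication.
type Store = { [key: string]: string };

const primary: Store = {};
const secondary: Store = {};
const pending: [string, string][] = [];

function write(key: string, value: string): void {
  primary[key] = value;       // acknowledged immediately
  pending.push([key, value]); // replication happens later
}

function replicate(): void {
  // Drain the replication queue; after this, the replicas agree (convergence).
  for (const [key, value] of pending.splice(0)) {
    secondary[key] = value;
  }
}

write('user:1', 'Sofwan');
const staleRead = secondary['user:1']; // undefined: replication has not run yet
replicate();
const freshRead = secondary['user:1']; // 'Sofwan': replicas have converged
console.log(staleRead, freshRead);
</code></pre>
<p>Real systems replicate over the network rather than in memory, but the convergence property is the same.</p>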
<h2 id="heading-summary">Summary</h2>
<p>A distributed transaction is necessary because transactions that span multiple databases might fail due to network interruptions or other issues. It is an important concept in database systems and other distributed systems, and various techniques and protocols have been developed to solve the consistency and durability problems involved in implementing distributed transactions.</p>
]]></content:encoded></item><item><title><![CDATA[Two-Phased Commit and eXtended Architecture: The Basics]]></title><description><![CDATA[Two-phase commit (2PC) and XA (eXtended Architecture) are two important concepts in database transactions and distributed systems. They both provide a way to ensure that transactions involving multiple resources are either completed successfully or r...]]></description><link>https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics</link><guid isPermaLink="true">https://blog.sofwancoder.com/two-phased-commit-and-extended-architecture-the-basics</guid><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[backend]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Microservices]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Fri, 30 Dec 2022 17:44:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1672421630782/3ef034eb-618a-4555-8cc8-ccb80562aa84.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two-phase commit (2PC) and XA (eXtended Architecture) are two important concepts in database transactions and distributed systems. They both provide a way to ensure that transactions involving multiple resources are either completed successfully or rolled back in case of failure, thus maintaining the integrity of the data. In this article, we will explain the two-phase commit protocol and XA in detail and discuss their use cases and limitations.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>In distributed transaction processing, a commit operation finalizes a transaction and makes it visible to other participants. In a two-phase commit (2PC) protocol, the commit action is split into two phases: <code>Prepare</code> and <code>Commit</code>. The first phase is called <code>prepare</code> because each participant prepares to commit by checking some pre-conditions. If those conditions are not satisfied, the participant cannot continue to the second phase by committing and has to roll back its work. A failure at this point results in aborting the transaction and starting again from the beginning of the process. The main advantage of a 2PC protocol is that it enables automatic recovery from failures during transactions.</p>
<h2 id="heading-what-is-a-transaction">What is a Transaction?</h2>
<p>Before we delve into 2PC and XA, it is important to understand what a transaction is. A transaction is a sequence of operations that are performed as a single unit of work. The main goal of a transaction is to ensure that the data remains consistent and reliable, even in the face of failures or errors.</p>
<p>In database systems, a transaction can consist of multiple database operations, such as inserts, updates, and deletes, that are performed on one or more tables. Transactions allow us to ensure that the data remains consistent and correct, even if some of the operations fail. For example, if we are transferring money from one bank account to another, we want to make sure that the money is deducted from the first account and added to the second account, or that no changes are made at all if something goes wrong.</p>
<h2 id="heading-why-distributed-transaction-processing">Why Distributed Transaction Processing?</h2>
<p>Distributed transaction processing has become an important requirement in many application scenarios. The reasons are simple:</p>
<ul>
<li><p>first, we want to achieve scalability by increasing the size of the computing clusters to handle larger workloads.</p>
</li>
<li><p>Second, we want to achieve availability by ensuring that no single point of failure can bring down the system.</p>
</li>
</ul>
<p>Achieving scalability and availability requires distributed systems with atomic transactions.</p>
<h2 id="heading-what-is-the-two-phase-commit-protocol-2pc">What is the Two-Phase Commit Protocol (2PC)?</h2>
<p>The two-phase commit protocol is a distributed transaction protocol that ensures that a transaction is either completed successfully or rolled back in case of failure. It is called "two-phase" because it consists of two phases: a prepare phase and a commit phase.</p>
<h3 id="heading-the-prepare-phase">The <code>Prepare</code> Phase</h3>
<p>In the <code>prepare</code> phase, the transaction coordinator (also known as the "transaction manager") sends a request to all the participating resources (such as databases or message queues) to prepare for the commit. The resources then perform any necessary checks and updates, and return a response indicating whether they are ready to commit or not. If all the resources are ready to commit, the transaction coordinator moves on to the commit phase.</p>
<h3 id="heading-the-commit-phase">The <code>Commit</code> Phase</h3>
<p>In the <code>commit</code> phase, the transaction coordinator sends a commit request to all the resources. If all the resources respond successfully, the transaction is considered committed and the changes are made permanent. If any of the resources fail to commit, the transaction coordinator sends a rollback request to all the resources and the transaction is considered failed.</p>
<hr />
<p>The two-phase commit protocol is used to ensure that all the participating resources are in sync and that the changes are made consistently across all the resources. It is a reliable and widely used protocol, but it has some limitations, which we will discuss later.</p>
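<p>The two phases can be simulated in a few lines of TypeScript. <code>Participant</code> and <code>twoPhaseCommit</code> are illustrative names, not a real library:</p>
<pre><code class="lang-typescript">// Minimal in-memory simulation of the two phases described above.
class Participant {
  constructor(public name: string, private canCommit: boolean) {}
  prepare(): boolean { return this.canCommit; } // vote yes or no
  commit(log: string[]): void { log.push(this.name + ' committed'); }
  rollback(log: string[]): void { log.push(this.name + ' rolled back'); }
}

function twoPhaseCommit(participants: Participant[], log: string[]): boolean {
  // Phase 1: every participant must vote yes in the prepare phase.
  const votes = participants.map(function (p) { return p.prepare(); });
  if (votes.every(function (v) { return v; })) {
    // Phase 2: all voted yes, so tell everyone to commit.
    participants.forEach(function (p) { p.commit(log); });
    return true;
  }
  // Any "no" vote aborts the whole transaction.
  participants.forEach(function (p) { p.rollback(log); });
  return false;
}

const outcomeLog: string[] = [];
const ok = twoPhaseCommit(
  [new Participant('orders-db', true), new Participant('billing-db', false)],
  outcomeLog,
);
console.log(ok, outcomeLog); // false, both participants rolled back
</code></pre>
<p>Note that a real coordinator must also persist its decision so it can recover after a crash, which this sketch omits.</p>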
<h2 id="heading-distributed-transaction-processing-with-2pc">Distributed Transaction Processing with 2PC</h2>
<p>In distributed transaction processing, a two-phase commit protocol ensures that a transaction spanning more than one participant is managed and controlled consistently. Since the participants cannot coordinate the outcome among themselves directly, a distributed transaction manager is required to control the transaction.</p>
<h3 id="heading-transaction-manager">Transaction Manager</h3>
<p>The transaction manager is responsible for controlling the transaction and coordinating the communication between the distributed resource managers. It does this by using a two-phase commit protocol.</p>
<p>A two-phase commit protocol requires at least two participants in every transaction:</p>
<ul>
<li><p>the transaction manager and</p>
</li>
<li><p>at least one resource manager.</p>
</li>
</ul>
<p>This means that <strong>a two-phase commit protocol requires a network connection between the transaction manager and the resource managers.</strong></p>
<h2 id="heading-what-is-the-extended-architecture-protocol">What is the eXtended Architecture Protocol?</h2>
<p>XA is an extension of the two-phase commit protocol that allows transactions to span multiple resources, such as databases, message queues, and file systems. It is used to coordinate the commit or rollback of a transaction across multiple resources, ensuring that the changes are made consistently and reliably.</p>
<p>In XA, each resource participating in the transaction is represented by an XA resource manager. The XA resource manager is responsible for managing the transactions on the resource and communicating with the transaction manager. The transaction manager is responsible for coordinating the commit or rollback of the transaction across all the participating resources.</p>
<p>The XA protocol defines a set of APIs (Application Programming Interfaces) that the transaction manager and the XA resource managers use to communicate and coordinate the transaction. These APIs include functions for starting, committing, and rolling back a transaction, as well as for checking the status of a transaction.</p>
<p>XA is a powerful tool for managing distributed transactions, but it has some limitations, which we will discuss later.</p>
<h2 id="heading-use-cases-for-2pc-and-xa">Use Cases for 2PC and XA</h2>
<p>2PC and XA are used in a variety of scenarios where transactions involve multiple resources, such as databases, message queues, and file systems. Some common use cases include:</p>
<ol>
<li><p>Financial transactions: 2PC and XA are widely used in the financial industry to ensure the integrity of financial transactions, such as money transfers, stock trades, and credit card payments.</p>
</li>
<li><p>E-commerce: In e-commerce systems, 2PC and XA are used to ensure that orders, payments, and inventory updates are all completed consistently and reliably.</p>
</li>
<li><p>Supply chain management: In supply chain management systems, 2PC and XA are used to ensure that orders, shipments, and inventory updates are all coordinated and consistent across multiple resources.</p>
</li>
<li><p>Healthcare: In healthcare systems, 2PC and XA are used to ensure that patient records, treatments, and billing information are all consistent and accurate.</p>
</li>
</ol>
<h2 id="heading-limitations-of-2pc-and-xa">Limitations of 2PC and XA</h2>
<p>While 2PC and XA are powerful tools for managing distributed transactions, they have some limitations:</p>
<ol>
<li><p>Performance: 2PC and XA can have a significant impact on performance, as they involve multiple round-trips and communication between the participating resources and the transaction manager. This can make them slower than other transaction protocols.</p>
</li>
<li><p>Complexity: 2PC and XA are complex protocols that require a significant amount of programming and infrastructure to implement.</p>
</li>
<li><p>Single point of failure: The transaction manager is a single point of failure in the 2PC and XA protocols. If the transaction manager fails, the entire transaction will fail.</p>
</li>
<li><p>Limited scalability: 2PC and XA can be challenging to scale, as they involve multiple round-trips and communication between the participating resources and the transaction manager.</p>
</li>
</ol>
<h2 id="heading-2pc-with-no-rollback">2PC With No Rollback</h2>
<p>In a 2PC scenario where no rollback occurs, the <code>prepare</code> phase proceeds and all participants agree to commit. Since no participant is executing a rollback at this point, the transaction can be committed. A 2PC with no rollback is an optimistic implementation where the transaction participants proceed with the commit action in the second phase. If, however, some participants were not able to satisfy the conditions, they won’t proceed and will roll back their work. This is called an optimistic approach because the participants proceed with committing their work without necessarily knowing whether their work will be visible to the other participants. The advantage of an optimistic approach is that it can lead to faster throughput in distributed transactions since no participants will be delaying the completion of their work.</p>
<h2 id="heading-2pc-with-rollback">2PC With Rollback</h2>
<p>The main difference between a 2PC with no rollback and a 2PC with rollback is that a 2PC with rollback can proceed only if all participants agree to commit the transaction. If any participant fails to meet the pre-conditions and is unable to continue with the <code>commit</code> in the second phase, every participant has to roll back its work. The trade-off of a 2PC with rollback is that it is more conservative and is therefore likely to lead to slower throughput, because many distributed transactions may take longer to complete. For example, if a transaction cannot proceed because a resource manager is down, the participants will not be able to commit, and they will have to roll back their work.</p>
<h2 id="heading-example-xa-with-nodejs-typescript-amp-express">Example: XA with NodeJS, TypeScript &amp; Express</h2>
<p>Here is an example of using XA with Node.js, TypeScript, and Express:</p>
<p><strong>NOTE: These code examples are for illustration purposes only and do not represent a complete or real-world implementation of a distributed transaction management system. They are meant to provide a general understanding of the concepts involved and should not be used as is in a production environment.</strong></p>
<ul>
<li>Firstly, we're going to create an <code>XA</code> class to manage distributed transactions. This class will help us create and manage transactions that involve multiple resources. We'll be using the PostgreSQL client (<code>pg</code>) to persist and coordinate the state of the transaction and ensure it is either committed or rolled back as needed.</li>
</ul>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Client } <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> XA {
  <span class="hljs-keyword">private</span> client: Client;
  <span class="hljs-keyword">private</span> transaction: Transaction | <span class="hljs-literal">null</span> = <span class="hljs-literal">null</span>;

  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">client: Client</span>) {
    <span class="hljs-built_in">this</span>.client = client;
  }

  <span class="hljs-keyword">async</span> beginTransaction(): <span class="hljs-built_in">Promise</span>&lt;Transaction&gt; {
    <span class="hljs-comment">// Begin a new transaction</span>
    <span class="hljs-built_in">this</span>.transaction = <span class="hljs-keyword">new</span> Transaction(<span class="hljs-built_in">this</span>.client);
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.transaction;
  }
}
</code></pre>
<ul>
<li>Then we're going to create a <code>Transaction class</code>. The <code>Transaction</code> class is an important part of a distributed transaction management system because it helps to coordinate the actions of multiple resources involved in a transaction. It is responsible for managing the lifecycle of a distributed transaction, including the <code>prepare</code>, <code>commit</code>, and <code>rollback</code> phases.</li>
</ul>
<pre><code class="lang-typescript"><span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> Transaction {
  <span class="hljs-keyword">private</span> client: Client;
  <span class="hljs-keyword">private</span> transactionId: <span class="hljs-built_in">string</span>;
  <span class="hljs-keyword">private</span> resourceManagers: ResourceManager[] = [];

  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">client: Client</span>) {
    <span class="hljs-built_in">this</span>.client = client;
    <span class="hljs-comment">// NOTE: a random string is fine for illustration, but not a collision-safe ID in production</span>
    <span class="hljs-built_in">this</span>.transactionId = <span class="hljs-built_in">Math</span>.random().toString(<span class="hljs-number">36</span>).slice(<span class="hljs-number">2</span>, <span class="hljs-number">12</span>);
  }

  <span class="hljs-keyword">async</span> addResourceManager(url: <span class="hljs-built_in">string</span>): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Add a new resource manager to the list</span>
    <span class="hljs-built_in">this</span>.resourceManagers.push(<span class="hljs-keyword">new</span> ResourceManager(url));
  }

  <span class="hljs-keyword">async</span> prepare(data: <span class="hljs-built_in">any</span>): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Send a prepare request to all the resource managers, including the transaction ID and necessary data</span>
    <span class="hljs-keyword">const</span> results = <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.all(<span class="hljs-built_in">this</span>.resourceManagers.map(<span class="hljs-function"><span class="hljs-params">rm</span> =&gt;</span> rm.prepare(<span class="hljs-built_in">this</span>.transactionId, data)));

    <span class="hljs-comment">// Update the transaction status in the database</span>
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.client.query(<span class="hljs-string">'INSERT INTO transactions (id, status) VALUES ($1, $2)'</span>, [<span class="hljs-built_in">this</span>.transactionId, <span class="hljs-string">'prepared'</span>]);

    <span class="hljs-comment">// If any of the resource managers failed to prepare, rollback the transaction</span>
    <span class="hljs-keyword">if</span> (results.some(<span class="hljs-function"><span class="hljs-params">result</span> =&gt;</span> !result)) {
      <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.rollback();
      <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">'Transaction failed to prepare'</span>);
    }
  }

  <span class="hljs-keyword">async</span> commit(): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Send a commit request to all the resource managers, including the transaction ID</span>
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.all(<span class="hljs-built_in">this</span>.resourceManagers.map(<span class="hljs-function"><span class="hljs-params">rm</span> =&gt;</span> rm.commit(<span class="hljs-built_in">this</span>.transactionId)));

    <span class="hljs-comment">// Update the transaction status in the database</span>
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.client.query(<span class="hljs-string">'UPDATE transactions SET status = $1 WHERE id = $2'</span>, [<span class="hljs-string">'committed'</span>, <span class="hljs-built_in">this</span>.transactionId]);
  }

  <span class="hljs-keyword">async</span> rollback(): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Send a rollback request to all the resource managers, including the transaction ID</span>
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">Promise</span>.all(<span class="hljs-built_in">this</span>.resourceManagers.map(<span class="hljs-function"><span class="hljs-params">rm</span> =&gt;</span> rm.rollback(<span class="hljs-built_in">this</span>.transactionId)));

    <span class="hljs-comment">// Update the transaction status in the database</span>
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.client.query(<span class="hljs-string">'UPDATE transactions SET status = $1 WHERE id = $2'</span>, [<span class="hljs-string">'reverted'</span>, <span class="hljs-built_in">this</span>.transactionId]);
  }
}
</code></pre>
<ul>
<li>Now, to the resource manager class which is another important part of the distributed transaction management system. The <code>ResourceManager</code> class is typically responsible for receiving requests from the <code>Transaction</code> class to <code>prepare</code>, <code>commit</code>, or <code>rollback</code> a transaction, and for interacting with the shared resource to perform these actions. It may also be responsible for other tasks related to managing the shared resource, such as creating and releasing locks on the resource, or handling errors that occur during the transaction.</li>
</ul>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">class</span> ResourceManager {
  <span class="hljs-keyword">private</span> url: <span class="hljs-built_in">string</span>;

  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">url: <span class="hljs-built_in">string</span></span>) {
    <span class="hljs-built_in">this</span>.url = url;
  }

  <span class="hljs-keyword">async</span> prepare(transactionId: <span class="hljs-built_in">string</span>, data: <span class="hljs-built_in">any</span>): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">boolean</span>&gt; {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// Send a prepare request to the resource manager, including the transaction ID and necessary data</span>
      <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${<span class="hljs-built_in">this</span>.url}</span>/prepare`</span>, { transactionId, data });
      <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }
  }

  <span class="hljs-keyword">async</span> commit(transactionId: <span class="hljs-built_in">string</span>): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Send a commit request to the resource manager, including the transaction ID</span>
    <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${<span class="hljs-built_in">this</span>.url}</span>/commit`</span>, { transactionId });
  }

  <span class="hljs-keyword">async</span> rollback(transactionId: <span class="hljs-built_in">string</span>): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">void</span>&gt; {
    <span class="hljs-comment">// Send a rollback request to the resource manager, including the transaction ID</span>
    <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${<span class="hljs-built_in">this</span>.url}</span>/rollback`</span>, { transactionId });
  }
}
</code></pre>
<p>Here is an example of how to use the updated <code>XA</code> and <code>Transaction</code> classes to manage a distributed transaction in a Node.js application using TypeScript and the Express framework:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Router } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> { Client } <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;
<span class="hljs-keyword">import</span> { XA, Transaction } <span class="hljs-keyword">from</span> <span class="hljs-string">'./xa'</span>;

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> router = Router();
<span class="hljs-keyword">const</span> client = <span class="hljs-keyword">new</span> Client();
<span class="hljs-keyword">const</span> xa = <span class="hljs-keyword">new</span> XA(client);

router.post(<span class="hljs-string">'/transfer'</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-comment">// Begin a new transaction</span>
    <span class="hljs-keyword">const</span> transaction = <span class="hljs-keyword">await</span> xa.beginTransaction();

    <span class="hljs-comment">// Add the necessary resource managers to the transaction</span>
    <span class="hljs-keyword">await</span> transaction.addResourceManager(<span class="hljs-string">'http://debit.service/api'</span>);
    <span class="hljs-keyword">await</span> transaction.addResourceManager(<span class="hljs-string">'http://credit.service/api'</span>);

    <span class="hljs-comment">// Prepare the transaction, including the necessary data</span>
    <span class="hljs-keyword">const</span> data = {
      fromAccount: req.body.fromAccount,
      toAccount: req.body.toAccount,
      amount: req.body.amount,
    };
    <span class="hljs-keyword">await</span> transaction.prepare(data);

    <span class="hljs-comment">// Commit the transaction</span>
    <span class="hljs-keyword">await</span> transaction.commit();

    res.sendStatus(<span class="hljs-number">200</span>);
  } <span class="hljs-keyword">catch</span> (error) {
    res.sendStatus(<span class="hljs-number">500</span>);
  }
});

app.use(router);

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Listening on port 3000'</span>);
});
</code></pre>
<p>In this example, the <code>prepare</code> method of the <code>Transaction</code> class will send a <code>POST</code> request to the <code>/prepare</code> endpoint of each of the resource managers, passing along the <code>transactionId</code> and the necessary data. The resource managers will then use this data to prepare for the transaction.</p>
<p>The <code>commit</code> method of the <code>Transaction</code> class will then send a <code>POST</code> request to the <code>/commit</code> endpoint of each of the resource managers, passing along the <code>transactionId</code>. The resource managers will use this request to commit the actions they prepared for in the previous step.</p>
<p>If any errors occur during the transaction, the <code>rollback</code> method of the <code>Transaction</code> class will be called, which will send a <code>POST</code> request to the <code>/rollback</code> endpoint of each of the resource managers, passing along the <code>transactionId</code>. The resource managers will use this request to roll back any actions they took during the <code>prepare</code> phase.</p>
<p>This is a basic example of how to use the <code>XA</code> and <code>Transaction</code> classes to manage a distributed transaction in a Node.js application. You may need to modify these classes and the example code to fit the specific needs of your application.</p>
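<p>To round out the picture, here is a rough sketch of what a participant behind the <code>/prepare</code>, <code>/commit</code>, and <code>/rollback</code> endpoints might do. The class and method names are illustrative assumptions, and the in-memory <code>Map</code> stands in for the durable storage a real resource manager would need:</p>
<pre><code class="lang-typescript">// Illustrative participant-side state handling for two-phase commit.
// Prepared work is staged under its transactionId until the coordinator decides.
class ResourceManager {
  private prepared = new Map();

  // Phase 1: stage the work (e.g. place a hold on the funds) and vote "yes"
  prepare(transactionId: string, data: unknown): boolean {
    if (this.prepared.has(transactionId)) return false; // already prepared
    this.prepared.set(transactionId, data);
    return true;
  }

  // Phase 2a: make the staged work permanent
  commit(transactionId: string): boolean {
    return this.prepared.delete(transactionId);
  }

  // Phase 2b: discard the staged work
  rollback(transactionId: string): boolean {
    return this.prepared.delete(transactionId);
  }
}
</code></pre>
<p>In a real system, each of these operations must survive a crash between the two phases, which is precisely what makes production-grade 2PC implementations hard.</p>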
<hr />
<p><strong>Warning: These code examples are for illustration purposes only and do not represent a complete or real-world implementation of a distributed transaction management system. They are meant to provide a general understanding of the concepts involved and should not be used as is in a production environment.</strong></p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, 2PC and XA are important concepts in database transactions and distributed systems. They provide a way to ensure that transactions involving multiple resources are either completed successfully or rolled back in case of failure, thus maintaining the integrity of the data. However, they have some limitations, including performance, complexity, and scalability, which should be taken into consideration when deciding whether to use them in a particular application.</p>
]]></content:encoded></item><item><title><![CDATA[Circuit Breaker in Microservices]]></title><description><![CDATA[It’s a given fact that microservices-based software architecture brings its own set of challenges. With so many microservices and services interacting with each other, increased complexity and the risk of failures — or cascade failures — are inevitab...]]></description><link>https://blog.sofwancoder.com/circuit-breaker-in-microservices</link><guid isPermaLink="true">https://blog.sofwancoder.com/circuit-breaker-in-microservices</guid><category><![CDATA[Software Engineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Node.js]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Wed, 28 Dec 2022 12:27:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1672156426781/497aef02-c080-4aea-b35b-2f42ca8b02ea.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It’s a given fact that microservices-based software architecture brings its own set of challenges. With so many services interacting with one another, increased complexity and the risk of failures, including cascading failures, are inevitable. To address these challenges, we need to find ways to isolate risky components and prevent their failure from propagating throughout the system. In this article, we will explore the circuit breaker pattern in microservices architecture and see how it can help you deal with faults and failures.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>When designing an enterprise microservices architecture, one of the biggest concerns is how to manage failure in a distributed system. The software industry has seen several examples of large-scale failures, such as the <a target="_blank" href="https://winbuzzer.com/2017/03/16/microsoft-azure-customers-faced-seven-hour-outage-yesterday-xcxwbn/">Microsoft Azure outage</a> and the <a target="_blank" href="https://winbuzzer.com/2017/02/28/amazon-web-services-aws-outage-dragging-numerous-websites-services-xcxwbn/">AWS S3 outage</a> in 2017. Both cloud outages had a huge impact on many businesses because they were so widespread. With the microservices architecture pattern, you can build your applications using small services that have a single responsibility and are combined to create larger capabilities. When building these smaller services, it's important to implement resiliency measures to make sure they remain online when encountering errors or unexpected conditions.</p>
<h2 id="heading-what-is-a-circuit-breaker">What is a Circuit Breaker?</h2>
<p>When things go wrong, we must have some contingency in place. Otherwise, our services will keep failing and we’ll end up with no services at all. This is where circuit breakers come into play.</p>
<p><strong>A Circuit Breaker is a fault-tolerance pattern that's used to handle transient errors and prevent cascading failures.</strong> In other words, <strong>it's a mechanism to stop the propagation of errors by shutting things down gracefully.</strong></p>
<h3 id="heading-in-distributed-systems">In Distributed Systems</h3>
<p>In distributed systems, <strong>a circuit breaker can be implemented to stop the flow of requests</strong> to a service that has exceeded its maximum threshold of error rate and latency. The pattern is widely used in distributed systems to improve their reliability and availability. <strong>It is implemented as a monitoring mechanism to detect faults and then decide if an action needs to be taken to prevent the faulty components from affecting the system as a whole.</strong></p>
<h2 id="heading-why-use-the-circuit-breaker-pattern">Why use the circuit breaker pattern?</h2>
<p>There are several benefits to using the Circuit Breaker pattern in a microservice architecture:</p>
<ul>
<li><p>Improved resilience and reliability: By automatically failing requests to downstream services that are not responding or experiencing high latency, the circuit breaker helps to prevent cascading failures and improve the overall resilience and reliability of the system.</p>
</li>
<li><p>Increased availability: By failing fast and stopping the chain reaction of failures, the circuit breaker helps to ensure that the system remains available and can continue to serve user requests.</p>
</li>
<li><p>Reduced resource consumption: When a downstream service is experiencing problems, the circuit breaker can help to reduce the load on the service by failing requests before they reach the service. This can help to reduce the resource consumption of the service and prevent it from becoming overloaded.</p>
</li>
<li><p>Enhanced monitoring and visibility: The circuit breaker can provide useful information about the health of downstream services, allowing the system to be monitored and any issues to be identified and addressed quickly.</p>
</li>
</ul>
<p>Overall, the Circuit Breaker pattern is an important tool for improving the resilience and reliability of microservice architectures and helping to ensure that the system remains available and responsive to user requests.</p>
<h2 id="heading-why-is-it-needed-in-microservices-architecture">Why is it needed in Microservices Architecture</h2>
<p>A microservice architecture consists of a large number of microservices that are built for specific tasks and can be reused across different applications. Because these microservices are independent, they can be deployed and scaled independently, allowing your organization to meet changing business requirements.</p>
<h3 id="heading-the-problem">The Problem</h3>
<p>This architecture is highly distributed and is therefore susceptible to a wide variety of faults, such as latency issues, outages, or unbalanced loads. When a problem occurs in one of these services, it can quickly escalate and affect all of the other microservices in the system. When a service fails, the error it returns can travel up through its callers, creating a chain reaction across the entire system.</p>
<h3 id="heading-the-solution">The Solution</h3>
<p>Circuit breakers can be used to stop this propagation of errors by shutting things down gracefully. When a circuit breaker is activated, it prevents requests from reaching a faulty microservice, preventing the error from cascading through the system. With circuit breakers, you can ensure that your system remains stable even in the event of an unexpected incident.</p>
<h2 id="heading-implementing-a-circuit-breaker-in-microservices">Implementing a Circuit Breaker in Microservices</h2>
<p>One way to implement a circuit breaker is to use a state machine that tracks the health of the downstream service. The state machine has three states:</p>
<ul>
<li><p><code>closed</code>: Allow all requests</p>
</li>
<li><p><code>open</code>: Fail all requests</p>
</li>
<li><p><code>half-open</code>: Allow some requests</p>
</li>
</ul>
<p>When the circuit breaker is in the closed state, requests to the downstream service are allowed to pass through. If the downstream service starts to experience failures or high latency, the circuit breaker transitions to the open state and begins to fail requests to the downstream service.</p>
<p>After a certain amount of time has passed, the circuit breaker transitions to the half-open state and allows a limited number of requests to pass through. If these requests are successful, the circuit breaker transitions back to the closed state. If the requests fail, the circuit breaker transitions back to the open state.</p>
<p><strong>Here is an example of how you might implement the circuit breaker in a Node.js application using the Express framework and TypeScript:</strong></p>
<p><strong>First,</strong> we will create a simple function that represents a downstream service that we want to protect with a circuit breaker. This function will make an HTTP request to a mock service and return the response data.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">callDownstreamService</span>(<span class="hljs-params"></span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">string</span>&gt; </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(<span class="hljs-string">'http://mock-service.com/data'</span>);
    <span class="hljs-keyword">return</span> response.data;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(error.message);
  }
}
</code></pre>
<p><strong>Next,</strong> we will create a circuit breaker class that will wrap the call to the downstream service. This class will use a state machine to track the health of the downstream service and automatically fail requests if necessary:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;

<span class="hljs-comment">// Enum to track the state of the circuit breaker</span>
<span class="hljs-built_in">enum</span> CircuitBreakerState {
  CLOSED, <span class="hljs-comment">// Circuit is closed and requests to the downstream service are allowed through</span>
  OPEN, <span class="hljs-comment">// Circuit is open and requests to the downstream service are failed</span>
  HALF_OPEN, <span class="hljs-comment">// Circuit is half-open and a limited number of requests are allowed through</span>
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> CircuitBreaker {
  <span class="hljs-comment">// The current state of the circuit breaker</span>
  <span class="hljs-keyword">private</span> state: CircuitBreakerState;
  <span class="hljs-comment">// The number of errors that need to occur before the circuit breaker transitions to the open state</span>
  <span class="hljs-keyword">private</span> errorThreshold: <span class="hljs-built_in">number</span>;
  <span class="hljs-comment">// The amount of time the circuit breaker stays in the open state before transitioning to the half-open state</span>
  <span class="hljs-keyword">private</span> resetTimeout: <span class="hljs-built_in">number</span>;
  <span class="hljs-comment">// The number of errors that have occurred</span>
  <span class="hljs-keyword">private</span> errorCount: <span class="hljs-built_in">number</span>;
  <span class="hljs-comment">// The number of requests allowed through in the half-open state</span>
  <span class="hljs-keyword">private</span> halfOpenRequests: <span class="hljs-built_in">number</span>;
  <span class="hljs-comment">// The time when the circuit breaker last changed state</span>
  <span class="hljs-keyword">private</span> lastStateChange: <span class="hljs-built_in">number</span>;

  <span class="hljs-comment">// Constructor for the circuit breaker class</span>
  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">errorThreshold: <span class="hljs-built_in">number</span>, resetTimeout: <span class="hljs-built_in">number</span>, halfOpenRequests: <span class="hljs-built_in">number</span></span>) {
    <span class="hljs-built_in">this</span>.state = CircuitBreakerState.CLOSED;
    <span class="hljs-built_in">this</span>.errorThreshold = errorThreshold;
    <span class="hljs-built_in">this</span>.resetTimeout = resetTimeout;
    <span class="hljs-built_in">this</span>.halfOpenRequests = halfOpenRequests;
    <span class="hljs-built_in">this</span>.errorCount = <span class="hljs-number">0</span>;
    <span class="hljs-built_in">this</span>.lastStateChange = <span class="hljs-built_in">Date</span>.now();
  }

  <span class="hljs-comment">// Method to call the downstream service using the circuit breaker</span>
  <span class="hljs-keyword">async</span> callService(): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">string</span>&gt; {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// Check the current state of the circuit breaker</span>
      <span class="hljs-keyword">switch</span> (<span class="hljs-built_in">this</span>.state) {
        <span class="hljs-keyword">case</span> CircuitBreakerState.CLOSED:
          <span class="hljs-comment">// Circuit is closed, so allow the request through and reset the error count</span>
          <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.callServiceAndResetErrorCount();
        <span class="hljs-keyword">case</span> CircuitBreakerState.OPEN:
          <span class="hljs-comment">// Circuit is open, so check if the reset timeout has expired</span>
          <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.isResetTimeoutExpired()) {
            <span class="hljs-comment">// Reset timeout has expired, so transition to the half-open state and allow a request through</span>
            <span class="hljs-built_in">this</span>.transitionToHalfOpenState();
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.callServiceAndIncrementErrorCount();
          } <span class="hljs-keyword">else</span> {
            <span class="hljs-comment">// Reset timeout has not expired, so fail the request</span>
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">'Circuit is open'</span>);
          }
        <span class="hljs-keyword">case</span> CircuitBreakerState.HALF_OPEN:
          <span class="hljs-comment">// Circuit is half-open, so check if the number of allowed requests has been exceeded</span>
          <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.errorCount &lt; <span class="hljs-built_in">this</span>.halfOpenRequests) {
            <span class="hljs-comment">// Allowed requests have not been exceeded, so allow the request through</span>
            <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.callServiceAndIncrementErrorCount();
          } <span class="hljs-keyword">else</span> {
            <span class="hljs-comment">// Allowed requests have been exceeded, so transition back to the open state and fail the request</span>
            <span class="hljs-built_in">this</span>.transitionToOpenState();
            <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">'Circuit is open'</span>);
          }
      }
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-comment">// Rethrow without incrementing the error count here: the helper methods</span>
      <span class="hljs-comment">// already count downstream failures, and a request rejected because the</span>
      <span class="hljs-comment">// circuit is open should not be counted as a service error</span>
      <span class="hljs-keyword">throw</span> error;
    }
  }

  <span class="hljs-comment">// Method to call the downstream service and reset the error count</span>
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> callServiceAndResetErrorCount(): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">string</span>&gt; {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// Call the downstream service</span>
      <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> callDownstreamService();
      <span class="hljs-comment">// Reset the error count</span>
      <span class="hljs-built_in">this</span>.resetErrorCount();
      <span class="hljs-comment">// Return the response from the downstream service</span>
      <span class="hljs-keyword">return</span> response;
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-comment">// An error occurred, so increment the error count</span>
      <span class="hljs-built_in">this</span>.incrementErrorCount();
      <span class="hljs-keyword">throw</span> error;
    }
  }

  <span class="hljs-comment">// Method to call the downstream service, increment the error count, and transition to the closed state</span>
  <span class="hljs-comment">// if the request is successful</span>
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">async</span> callServiceAndIncrementErrorCount(): <span class="hljs-built_in">Promise</span>&lt;<span class="hljs-built_in">string</span>&gt; {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// Call the downstream service</span>
      <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> callDownstreamService();
      <span class="hljs-comment">// Reset the error count</span>
      <span class="hljs-built_in">this</span>.resetErrorCount();
      <span class="hljs-comment">// Transition to the closed state</span>
      <span class="hljs-built_in">this</span>.transitionToClosedState();
      <span class="hljs-comment">// Return the response from the downstream service</span>
      <span class="hljs-keyword">return</span> response;
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-comment">// An error occurred, so increment the error count</span>
      <span class="hljs-built_in">this</span>.incrementErrorCount();
      <span class="hljs-keyword">throw</span> error;
    }
  }

  <span class="hljs-comment">// Method to reset the error count</span>
  <span class="hljs-keyword">private</span> resetErrorCount(): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.errorCount = <span class="hljs-number">0</span>;
  }

  <span class="hljs-comment">// Method to increment the error count and transition to the open state if the error threshold is reached</span>
  <span class="hljs-keyword">private</span> incrementErrorCount(): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.errorCount++;
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.errorCount &gt;= <span class="hljs-built_in">this</span>.errorThreshold) {
      <span class="hljs-built_in">this</span>.transitionToOpenState();
    }
  }

  <span class="hljs-comment">// Method to check if the reset timeout has expired</span>
  <span class="hljs-keyword">private</span> isResetTimeoutExpired(): <span class="hljs-built_in">boolean</span> {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">Date</span>.now() - <span class="hljs-built_in">this</span>.lastStateChange &gt; <span class="hljs-built_in">this</span>.resetTimeout;
  }

  <span class="hljs-comment">// Method to transition to the closed state</span>
  <span class="hljs-keyword">private</span> transitionToClosedState(): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.state = CircuitBreakerState.CLOSED;
    <span class="hljs-built_in">this</span>.lastStateChange = <span class="hljs-built_in">Date</span>.now();
  }

  <span class="hljs-comment">// Method to transition to the open state</span>
  <span class="hljs-keyword">private</span> transitionToOpenState(): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.state = CircuitBreakerState.OPEN;
    <span class="hljs-built_in">this</span>.lastStateChange = <span class="hljs-built_in">Date</span>.now();
  }

  <span class="hljs-comment">// Method to transition to the half-open state, resetting the error count</span>
  <span class="hljs-comment">// so that the limited number of trial requests can actually pass through</span>
  <span class="hljs-keyword">private</span> transitionToHalfOpenState(): <span class="hljs-built_in">void</span> {
    <span class="hljs-built_in">this</span>.state = CircuitBreakerState.HALF_OPEN;
    <span class="hljs-built_in">this</span>.errorCount = <span class="hljs-number">0</span>;
    <span class="hljs-built_in">this</span>.lastStateChange = <span class="hljs-built_in">Date</span>.now();
  }
}
</code></pre>
<p><strong>Finally</strong>, we can use the <code>CircuitBreaker</code> class in an Express route handler to protect the call to the downstream service:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> { CircuitBreaker } <span class="hljs-keyword">from</span> <span class="hljs-string">'./circuit-breaker'</span>;

<span class="hljs-keyword">const</span> app = express();

<span class="hljs-comment">// Create a new circuit breaker with an error threshold of 5, a reset timeout of 10 seconds, and allowing 2 requests through in the half-open state</span>
<span class="hljs-keyword">const</span> circuitBreaker = <span class="hljs-keyword">new</span> CircuitBreaker(<span class="hljs-number">5</span>, <span class="hljs-number">10000</span>, <span class="hljs-number">2</span>);

app.get(<span class="hljs-string">'/data'</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-comment">// Call the downstream service using the circuit breaker</span>
    <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> circuitBreaker.callService();
    <span class="hljs-comment">// Send the response data back to the client</span>
    res.send(data);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-comment">// An error occurred, so send a 500 error back to the client</span>
    res.status(<span class="hljs-number">500</span>).send(error.message);
  }
});

app.listen(<span class="hljs-number">3000</span>, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Server listening on port 3000'</span>);
});
</code></pre>
<p>This example creates a circuit breaker with an error threshold of 5, a reset timeout of 10 seconds, and allows 2 requests through in the half-open state. If the downstream service fails 5 times in a row, the circuit breaker will transition to the open state and start failing requests. After 10 seconds have passed, the circuit breaker will transition to the half-open state and allow 2 requests through. If these requests are successful, the circuit breaker will transition back to the closed state. If they fail, the circuit breaker will transition back to the open state.</p>
<hr />
<p>Using a circuit breaker can help to improve the resilience and reliability of a microservice architecture by providing <strong>a mechanism to fail fast and prevent cascading failures.</strong> It is important to <strong>tune the circuit breaker's parameters, such as the time to stay in the open state and the number of requests to allow through in the half-open state, to ensure that it is effective in protecting the system without causing undue disruption.</strong></p>
<hr />
<p>I hope this example helps to illustrate how the Circuit Breaker pattern can be implemented in a Node.js application using the Express framework and TypeScript.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>When designing an enterprise microservices architecture, it's important to implement resiliency measures to make sure the services remain online when encountering errors or unexpected conditions. A circuit breaker is a fault-tolerance pattern that can be used to handle transient errors and prevent cascading failures in distributed systems. With circuit breakers, you can ensure that your system remains stable even in the event of an unexpected incident.</p>
]]></content:encoded></item><item><title><![CDATA[Service Discovery in Distributed Systems]]></title><description><![CDATA[Service discovery is a key component of microservices architecture, as it enables microservices to communicate with each other and discover each other's location. In this article, we will delve into the concept of service discovery in microservices, ...]]></description><link>https://blog.sofwancoder.com/service-discovery-in-distributed-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/service-discovery-in-distributed-systems</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Mon, 26 Dec 2022 10:55:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1671908051815/79d54f00-b1a2-4ca6-b3a5-6310e0118f91.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Service discovery is a key component of microservices architecture, as it enables microservices to communicate with each other and discover each other's location. In this article, we will delve into the concept of service discovery in microservices, its benefits, and how it is implemented using NodeJS/Typescript.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Service discovery is a key aspect of building and operating microservices-based architectures. It refers to the process of finding and connecting to the desired service or resource within a distributed system. In a microservices architecture, each service is independently deployable and scalable, and they communicate with each other through APIs or other means. Service discovery enables these services to locate and connect reliably and efficiently, regardless of their location or state. It is an essential component of a microservices architecture, as it allows for the dynamic discovery and orchestration of services, enabling flexibility and resilience in the face of change.</p>
<h2 id="heading-service-discovery-in-microservices">Service Discovery in Microservices</h2>
<p>In a microservices architecture, each service is a self-contained unit that performs a specific task and communicates with other services through APIs. Service discovery refers to the process of finding the location of a particular service and establishing communication with it. It is an essential part of microservices architecture, as it enables microservices to discover and communicate with each other.</p>
<h2 id="heading-service-registry">Service Registry</h2>
<p>Service discovery is typically implemented using a service registry, which is a centralized database that stores the location and metadata of all the services in the system. When a service wants to communicate with another service, it queries the service registry to find the location of the target service. The service registry returns the location of the target service, and the calling service establishes a connection and communicates with it through APIs.</p>
<p>There are several ways to implement a service registry. <strong>One common approach is to use a central datastore, such as a relational database or a distributed key-value store.</strong> Another approach is to use a central HTTP server that provides a REST API for registering, unregistering, and looking up services.</p>
<p>Regardless of the implementation, <strong>it is important to ensure that the service registry is reliable and highly available</strong>, as it is a critical component of the microservice architecture. <strong>If the registry goes down, the services in the system may be unable to communicate with each other</strong> and the system may become unavailable.</p>
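<p>To make the registry's role concrete, here is a toy in-memory sketch. It is for illustration only: a real registry would be backed by a replicated, highly available store and would expire entries via health checks or TTLs, and the <code>register</code>/<code>unregister</code>/<code>lookup</code> method names are assumptions rather than a standard API:</p>
<pre><code class="lang-typescript">// A toy, in-memory service registry (illustration only).
class ServiceRegistry {
  private services = new Map();

  // Register an instance (host:port) under a service name
  register(name: string, address: string): void {
    const instances = this.services.get(name) || [];
    instances.push(address);
    this.services.set(name, instances);
  }

  // Remove an instance, e.g. on shutdown or after a failed health check
  unregister(name: string, address: string): void {
    const instances = (this.services.get(name) || []).filter(
      (a: string) => a !== address,
    );
    this.services.set(name, instances);
  }

  // Look up all known instances of a service
  lookup(name: string): string[] {
    return this.services.get(name) || [];
  }
}
</code></pre>
<p>A service would call <code>register</code> on startup and <code>unregister</code> on shutdown, while callers use <code>lookup</code> before each request (often caching the result).</p>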
<h2 id="heading-benefits-of-service-discovery-in-microservices">Benefits of Service Discovery in Microservices</h2>
<p>There are several benefits to using service discovery in microservices architecture:</p>
<h3 id="heading-decentralized-architecture">Decentralized Architecture</h3>
<p>Service discovery enables a decentralized architecture, where each service can operate independently and communicate with other services through APIs. This allows for greater flexibility and scalability, as services can be added, removed, or modified without affecting the overall system.</p>
<h3 id="heading-resilience">Resilience</h3>
<p>Service discovery allows services to communicate with each other through APIs, which means that services can continue to operate even if other services are down or unavailable. This increases the overall resilience of the system.</p>
<h3 id="heading-load-balancing">Load Balancing</h3>
<p>Service discovery can be used to implement load balancing, where multiple instances of a service are available to handle requests. The service registry can store the location of all the instances of a service, and the calling service can use a load-balancing algorithm to distribute requests among the instances.</p>
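<p>As a small illustration, a calling service could rotate through the instances returned by the registry with a round-robin picker like the following (the <code>createRoundRobin</code> helper is an assumed name, not a standard API):</p>
<pre><code class="lang-typescript">// Illustrative round-robin selection over the instances a registry returned.
function createRoundRobin(instances: string[]) {
  let next = 0;
  return function pick(): string {
    const instance = instances[next % instances.length];
    next = next + 1;
    return instance;
  };
}
</code></pre>
<p>More sophisticated strategies, such as least-connections or latency-aware balancing, follow the same shape: look the instances up in the registry, then choose one per request.</p>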
<h3 id="heading-dynamic-scaling">Dynamic Scaling</h3>
<p>Service discovery can be used to implement dynamic scaling, where the number of instances of a service is increased or decreased based on the workload. This allows the system to scale up or down based on demand, which helps to optimize resources and reduce costs.</p>
<h2 id="heading-how-does-service-discovery-work">How Does Service Discovery Work?</h2>
<p>Service discovery typically involves the use of a registry, which is a central repository that maintains a list of all the available services and their locations. When a service needs to communicate with another service, it queries the registry to find the location of the target service.</p>
<p>There are several ways in which the registry can be implemented, including:</p>
<ol>
<li><p><strong>Centralized registry:</strong> In a centralized registry, all services register with a central server, which maintains a list of all the available services and their locations. When a service needs to communicate with another service, it queries the central server to find the location of the target service.</p>
</li>
<li><p><strong>Decentralized registry:</strong> In a decentralized registry, each service maintains its own registry and shares it with other services. When a service needs to communicate with another service, it queries the registry of the target service to find its location.</p>
</li>
<li><p><strong>Hybrid registry:</strong> In a hybrid registry, a central server maintains a list of all the available services, but each service also maintains its own registry. This allows for a centralized view of all the available services, while still allowing for decentralized communication between services.</p>
</li>
</ol>
<p>There are also several tools and technologies available for implementing service discovery, including:</p>
<ol>
<li><p><strong>DNS:</strong> Domain Name System (DNS) is a distributed database that maps domain names to IP addresses. DNS can be used for service discovery by mapping a domain name to the IP address of a service.</p>
</li>
<li><p><strong>Load balancers:</strong> Load balancers can be used for service discovery by routing requests to the appropriate service based on the domain name or IP address.</p>
</li>
<li><p><strong>Service mesh:</strong> A service mesh is a network of microservices that can be used to implement service discovery and communication between services.</p>
</li>
</ol>
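<p>As a sketch of the DNS approach: SRV records carry a host, port, priority, and weight for each instance of a service. The record name <code>_orders._tcp.example.internal</code> below is hypothetical, and the weight-based tie-breaking the SRV specification describes is omitted for brevity:</p>

```typescript
import { resolveSrv } from 'node:dns/promises';

// Shape of a DNS SRV answer (host, port, priority, weight per instance).
type SrvRecord = { name: string; port: number; priority: number; weight: number };

// Turn SRV records into connectable addresses, lowest priority first
// (SRV priority is "lower wins").
function srvToAddresses(records: SrvRecord[]): string[] {
  return [...records]
    .sort((a, b) => a.priority - b.priority)
    .map((r) => `http://${r.name}:${r.port}`);
}

// Live discovery against a DNS server that publishes SRV records for a
// hypothetical name like '_orders._tcp.example.internal':
async function discover(service: string): Promise<string[]> {
  return srvToAddresses(await resolveSrv(service));
}

// Offline demonstration with hand-written records:
const addresses = srvToAddresses([
  { name: 'b.example.internal', port: 3002, priority: 20, weight: 0 },
  { name: 'a.example.internal', port: 3001, priority: 10, weight: 0 },
]);
console.log(addresses);
```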
<h2 id="heading-implementing-a-service-registry-for-service-discovery-using-a-centralized-registry">Implementing a Service Registry for Service Discovery Using a Centralized Registry</h2>
<p>Here is an example of how you could implement a centralized registry for service discovery using the Express.js framework:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> bodyParser <span class="hljs-keyword">from</span> <span class="hljs-string">'body-parser'</span>;

<span class="hljs-comment">// Service registry: maps service names to their addresses</span>
<span class="hljs-comment">// Ideally, this is a database system (NoSQL/Redis/Mysql etc).</span>
<span class="hljs-keyword">const</span> serviceRegistry = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>&lt;<span class="hljs-built_in">string</span>, <span class="hljs-built_in">string</span>&gt;();

<span class="hljs-comment">// Create an Express app</span>
<span class="hljs-keyword">const</span> app = express();

<span class="hljs-comment">// Parse request bodies as JSON</span>
app.use(bodyParser.json());

<span class="hljs-comment">// Register routes for the four actions</span>
app.get(<span class="hljs-string">'/services'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-comment">// Return the list of registered services</span>
  res.json(<span class="hljs-built_in">Array</span>.from(serviceRegistry.keys()));
});

app.post(<span class="hljs-string">'/register'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-comment">// Add the new service to the registry</span>
  <span class="hljs-keyword">const</span> { name, address } = req.body;
  serviceRegistry.set(name, address);
  res.send(<span class="hljs-string">'Success'</span>);
});

app.post(<span class="hljs-string">'/unregister'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-comment">// Remove the service from the registry</span>
  <span class="hljs-keyword">const</span> { name } = req.body;
  serviceRegistry.delete(name);
  res.send(<span class="hljs-string">'Success'</span>);
});

app.get(<span class="hljs-string">'/lookup/:name'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-comment">// Look up the requested service in the registry and return its address</span>
  <span class="hljs-keyword">const</span> name = req.params.name;
  <span class="hljs-keyword">const</span> address = serviceRegistry.get(name);
  <span class="hljs-keyword">if</span> (address) {
    res.json({ address });
  } <span class="hljs-keyword">else</span> {
    res.status(<span class="hljs-number">404</span>).send(<span class="hljs-string">'Not found'</span>);
  }
});

<span class="hljs-comment">// Start the server</span>
<span class="hljs-keyword">const</span> port = <span class="hljs-number">3000</span>;
app.listen(port, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Registry listening on port <span class="hljs-subst">${port}</span>`</span>);
});
</code></pre>
<p>This registry implementation listens for HTTP requests on port 3000, and provides the following four actions:</p>
<ul>
<li><p><code>GET /services</code>: Returns a list of the names of all the registered services.</p>
</li>
<li><p><code>POST /register</code>: Registers a new service by adding it to the registry. The request body should be a JSON object with two properties: <code>name</code> (the name of the service) and <code>address</code> (the address of the service).</p>
</li>
<li><p><code>POST /unregister</code>: Unregisters a service by removing it from the registry. The request body should be a JSON object with a single property: <code>name</code> (the name of the service).</p>
</li>
<li><p><code>GET /lookup/&lt;name&gt;</code>: Looks up the address of the service with the given name. If the service is not found, the server returns a 404 error.</p>
</li>
</ul>
<p>Each service that wants to register with the registry can make a POST request to <code>/register</code> with its name and address in the request body, and unregister with a POST request to <code>/unregister</code> with its name in the request body. Other services can look up the address of a specific service by making a GET request to <code>/lookup/&lt;name&gt;</code>, where <code>&lt;name&gt;</code> is the name of the service they want to find.</p>
<h2 id="heading-implementing-a-service-discovery-mechanism-for-microservices-using-a-centralized-registry">Implementing a Service Discovery Mechanism for Microservices Using a Centralized Registry</h2>
<p>Here is an example of how you could implement a service discovery mechanism for microservices using a centralized registry, written in TypeScript and using the Express.js framework:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">'axios'</span>;
<span class="hljs-keyword">import</span> express, { Request, Response } <span class="hljs-keyword">from</span> <span class="hljs-string">'express'</span>;
<span class="hljs-keyword">import</span> bodyParser <span class="hljs-keyword">from</span> <span class="hljs-string">'body-parser'</span>;

<span class="hljs-comment">// Base URL of the registry</span>
<span class="hljs-keyword">const</span> registryUrl = <span class="hljs-string">'http://registry:3000'</span>;

<span class="hljs-comment">// Register a new service with the registry</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">registerService</span>(<span class="hljs-params">name: <span class="hljs-built_in">string</span>, address: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${registryUrl}</span>/register`</span>, { name, address });
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Service "<span class="hljs-subst">${name}</span>" registered with the registry`</span>);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error registering service "<span class="hljs-subst">${name}</span>": <span class="hljs-subst">${error.message}</span>`</span>);
  }
}

<span class="hljs-comment">// Unregister a service with the registry</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">unregisterService</span>(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">`<span class="hljs-subst">${registryUrl}</span>/unregister`</span>, { name });
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Service "<span class="hljs-subst">${name}</span>" unregistered from the registry`</span>);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error unregistering service "<span class="hljs-subst">${name}</span>": <span class="hljs-subst">${error.message}</span>`</span>);
  }
}

<span class="hljs-comment">// Look up the address of a service in the registry</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">lookupService</span>(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">string</span> | <span class="hljs-title">undefined</span>&gt; </span>{
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> axios.get(<span class="hljs-string">`<span class="hljs-subst">${registryUrl}</span>/lookup/<span class="hljs-subst">${name}</span>`</span>);
    <span class="hljs-keyword">return</span> response.data.address;
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error looking up service "<span class="hljs-subst">${name}</span>": <span class="hljs-subst">${error.message}</span>`</span>);
  }
}

<span class="hljs-comment">// Create an Express app</span>
<span class="hljs-keyword">const</span> app = express();

<span class="hljs-comment">// Parse request bodies as JSON</span>
app.use(bodyParser.json());

<span class="hljs-comment">// Register a route to register a new service</span>
app.post(<span class="hljs-string">'/register'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> { name, address } = req.body;
  registerService(name, address);
  res.send(<span class="hljs-string">'Success'</span>);
});

<span class="hljs-comment">// Register a route to unregister a service</span>
app.post(<span class="hljs-string">'/unregister'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> { name } = req.body;
  unregisterService(name);
  res.send(<span class="hljs-string">'Success'</span>);
});

<span class="hljs-comment">// Start this service's own HTTP server</span>
<span class="hljs-keyword">const</span> port = <span class="hljs-number">3001</span>;
app.listen(port, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Service listening on port <span class="hljs-subst">${port}</span>`</span>);
});

<span class="hljs-comment">// Example usage: register a new service and look it up.</span>
<span class="hljs-comment">// Wrapped in an async IIFE so the registration completes before the</span>
<span class="hljs-comment">// lookup, and because top-level await is only valid in ES modules.</span>
(<span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">await</span> registerService(<span class="hljs-string">'service-a'</span>, <span class="hljs-string">'http://localhost:3001'</span>);
  <span class="hljs-keyword">const</span> address = <span class="hljs-keyword">await</span> lookupService(<span class="hljs-string">'service-a'</span>);
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Address of service "service-a": <span class="hljs-subst">${address}</span>`</span>);
})();
</code></pre>
<p>This example uses the Axios library to make HTTP requests to the registry. The registry is assumed to be running at the URL <a target="_blank" href="http://registry:3000"><code>http://registry:3000</code></a>.</p>
<p>The <code>registerService</code> function makes a POST request to <code>/register</code> with the name and address of the service to be registered. The <code>unregisterService</code> function makes a POST request to <code>/unregister</code> with the name of the service to be unregistered. The <code>lookupService</code> function makes a GET request to <code>/lookup/&lt;name&gt;</code> to look up the address of the service with the given name.</p>
<p>Each microservice can use these functions to register and unregister itself with the registry, and to look up the addresses of other services it needs to communicate with.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Service discovery is a crucial component of microservices architecture: it enables microservices to communicate with each other and to discover each other's locations as instances come and go. It is typically implemented using a centralized service registry, which must itself be kept reliable and highly available.</p>
]]></content:encoded></item><item><title><![CDATA[Load Shedding in Distributed Systems]]></title><description><![CDATA[Distributed systems are made up of many parts, each of which can fail on its own. Because of this, distributed systems often have partial breakdowns in the real world. These problems could be caused by node failures, network partitions or any number ...]]></description><link>https://blog.sofwancoder.com/load-shedding-in-distributed-systems</link><guid isPermaLink="true">https://blog.sofwancoder.com/load-shedding-in-distributed-systems</guid><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Computer Science]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sat, 24 Dec 2022 09:08:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1671801358216/3814516f-13a3-4c5a-8a6f-724fa7166393.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Distributed systems are made up of many parts, each of which can fail on its own. Because of this, distributed systems often have partial breakdowns in the real world. These problems could be caused by node failures, network partitions or any number of other things that weren't planned. These unexpected failures could bring the whole system down and affect users. What’s even more troubling is that some of these failures tend to happen again and again at unpredictable times. Load shedding is one way that we deal with these kinds of unplanned system failures— we purposely cut back on resources to stop more general failures during times of stress.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>When the supply of something is limited, we ration it. Power grids do exactly this: when there is not enough electricity to meet demand, different parts of the grid are switched off for a while so that no area consumes more than its fair share. When we apply the same idea to computing, deliberately refusing some work so that the rest can be served, we call it load shedding in distributed systems.</p>
<h2 id="heading-distributed-system">Distributed System</h2>
<p>In a distributed system, individuals or organisations work independently but in collaboration to achieve a common goal. A distributed system consists of numerous nodes or processes that operate independently of one another and are connected via network services such as remote procedure calls or message-passing services.</p>
<h2 id="heading-what-is-load-shedding">What is load shedding?</h2>
<p>Load shedding —a term often used in electrical engineering— according to <a target="_blank" href="https://www.dictionary.com/browse/load-shedding">dictionary.com</a> is "<em>the deliberate shutdown of electric power in a part or parts of a power-distribution system, generally to prevent the failure of the entire system when the demand strains the capacity of the system.</em>"</p>
<p>Load shedding is a regulated process in which—electricity supply is intentionally disrupted in specific locations to manage demand and avoid the entire power grid from collapsing. It is typically done when there is a shortage of electricity generation or transmission capacity, or when there is a high risk of overloading the system.</p>
<h2 id="heading-load-shedding-in-computer-science">Load Shedding in Computer Science</h2>
<p>From the definitions above, it is clear that load shedding occurs when the underlying system has insufficient capacity to continue operating normally. Hence, it is <strong>a technique used in systems to handle situations where the system is overwhelmed and cannot keep up with demand</strong>. When load shedding occurs, the system will prioritize certain requests and temporarily stop processing others in order to reduce the load on the system and prevent it from crashing.</p>
<h2 id="heading-what-is-load-shedding-in-distributed-systems">What is Load Shedding in Distributed Systems?</h2>
<p>Load shedding is the act of deliberately dropping some load to keep a system from collapsing due to overload. The distributed systems that power the internet and our businesses run on a finite pool of machines that must stay up for the system to work, and demand for those resources is often high, sometimes higher than the pool can comfortably serve.</p>
<p>In certain situations where the system is failing partially, there are only two options:</p>
<ul>
<li><p>Let the system fail</p>
</li>
<li><p>or engage in load shedding.</p>
</li>
</ul>
<p>Load shedding in distributed systems can mean shutting down services, slowing down operations, or re-routing requests. <strong>Load shedding is a defensive approach to dealing with an overburdened system that involves deliberately bringing the system to a lower level of service</strong> than usual in order to buy time for system administrators to add new capacity to the system or repair broken equipment. It is a common practice in power grids and other critical systems where a lack of capacity can lead to system failure.</p>
<h2 id="heading-failing-gracefully-in-distributed-system">Failing Gracefully in Distributed System</h2>
<p>Distributed systems that fail gracefully are designed to redistribute the load shed from the failed component onto the healthy components of the system.</p>
<p><strong>A distributed system is said to have successfully shed load if it can handle the excess load without failing entirely</strong>. Though distributed systems have a higher probability of failure than their centralized counterparts, they have an advantage in that they can handle extremely large amounts of traffic with relatively low infrastructure costs.</p>
<p>Distributed systems experience frequent partial failures. These failures may be due to node outages, network partitioning, or any other number of unanticipated events. Load shedding is a process by which we handle these unanticipated system failures —we deliberately reduce resources under stress as a means of preventing more widespread failure.</p>
<h2 id="heading-the-major-reason-for-load-shedding">The major reason for load shedding</h2>
<p>The major reason load shedding is necessary is that <strong>distributed systems are not meant to run at 100% capacity</strong>. In fact, they work best when running at around 80% capacity or less. In a perfectly balanced system, if one component runs at 100%, it takes resources away from other parts of the system and slows them down. <strong>A system that is 100% busy is a sign of bad management:</strong> it has no room for error, and that is what causes crashes and outages. <strong>Managers of distributed systems need to know where the system is near capacity and where load can be shed to keep it from crashing or going down.</strong></p>
<h2 id="heading-how-to-trigger-load-shedding">How to trigger load shedding?</h2>
<p>When load-shedding is triggered, the system will stop processing some of the incoming requests and prioritize others. This is done in order to reduce the load on the system and prevent it from crashing. The requests that are prioritized may be those that are considered more important or time-sensitive, such as requests for critical services or emergency services.</p>
<p><strong>The key to successfully shedding load in a distributed system is the ability to detect failures and trigger an automated response that reroutes traffic to other healthy nodes.</strong></p>
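<p>As an illustrative sketch (the threshold and helper names are assumptions, not a standard API), a service can track its own in-flight requests and refuse new ones once a limit is reached:</p>

```typescript
// Threshold-based shedding: count requests in flight and refuse new work
// once a limit is reached. MAX_IN_FLIGHT is an illustrative number; real
// systems derive it from measured capacity.
const MAX_IN_FLIGHT = 100;
let inFlight = 0;

function shouldShed(current: number, max: number): boolean {
  return current >= max;
}

// Express-style middleware (req/res typed loosely to keep the sketch
// self-contained): overloaded requests get an immediate 503 with a
// Retry-After hint instead of queueing work that cannot finish in time.
function loadShedder(req: any, res: any, next: () => void): void {
  if (shouldShed(inFlight, MAX_IN_FLIGHT)) {
    res.status(503).set('Retry-After', '1').send('Service overloaded');
    return;
  }
  inFlight += 1;
  res.on('finish', () => { inFlight -= 1; });
  next();
}

console.log(shouldShed(99, MAX_IN_FLIGHT), shouldShed(100, MAX_IN_FLIGHT));
```

<p>Returning an explicit 503 lets well-behaved clients back off and retry, rather than timing out against a server that silently queued their request.</p>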
<h2 id="heading-load-shedding-strategies">Load Shedding Strategies</h2>
<p>Several methods and strategies can be used to implement load-shedding in a distributed system.</p>
<h3 id="heading-limiting-the-resource-consumption-rate">Limiting the resource consumption rate</h3>
<p>Resource limits can be used to shed load in distributed systems by reducing the resource consumption rate. The rate at which a resource is consumed can be monitored, and if the rate exceeds a certain threshold, the resource is shed to prevent the system from being overloaded.</p>
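<p>One common way to enforce such a rate limit is a token bucket. The sketch below is illustrative; the capacity and refill rate are made-up numbers that a real system would derive from measured capacity:</p>

```typescript
// A token bucket: each request costs one token; tokens refill at a fixed
// rate up to the bucket's capacity. An empty bucket means the request is
// shed instead of served. Timestamps are passed in explicitly so the
// behaviour is easy to test.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number,
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request may proceed, false if it should be shed.
  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Capacity 2, refilling 1 token/second: the third immediate request is
// shed, but one second later a token has returned.
const bucket = new TokenBucket(2, 1, 0);
const results = [bucket.tryAcquire(0), bucket.tryAcquire(0), bucket.tryAcquire(0)];
const afterRefill = bucket.tryAcquire(1000); // one second later
console.log(results, afterRefill);
```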
<h3 id="heading-merging-and-rerouting-requests">Merging and rerouting requests</h3>
<p>In order to avoid a cascading failure in distributed systems, the load-shedding strategy must be designed to shed load from the node that is experiencing the problem rather than shedding from the node that receives the request. For example, when a node that is responsible for serving requests is experiencing problems, the load-shedding strategy must be designed to send the requests to another node that can handle them.</p>
<h3 id="heading-dropping-requests">Dropping requests</h3>
<p>When the load-shedding strategy involves dropping requests, the nodes in the distributed system must have the ability to recognize and ignore certain types of requests. Dropping requests is typically used as a last resort, when other load-shedding strategies (e.g., reducing the resource consumption rate or rerouting requests) would require too much effort to implement.</p>
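<p>A minimal sketch of priority-based dropping (the request classes and thresholds below are assumptions chosen for illustration, not a standard scheme):</p>

```typescript
// Shedding by request class: when the system is saturated, low-priority
// requests are dropped first so that critical traffic keeps flowing.
type Priority = 'critical' | 'normal' | 'low';

function admit(priority: Priority, loadFactor: number): boolean {
  // loadFactor is current load divided by capacity.
  if (loadFactor < 0.8) return true;                // healthy: admit everything
  if (loadFactor < 1.0) return priority !== 'low';  // strained: shed low-priority
  return priority === 'critical';                   // saturated: critical only
}

console.log(admit('low', 0.9), admit('critical', 1.2));
```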
<h3 id="heading-queuing-requests">Queuing Requests</h3>
<p>One common method is to use a queue to store incoming requests and process them in a first-in, first-out (FIFO) order. When the queue becomes full, the system can stop accepting new requests and process the ones that are already in the queue.</p>
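<p>A bounded queue makes the "stop accepting when full" behaviour explicit. This sketch rejects new work instead of letting the backlog grow without limit:</p>

```typescript
// A bounded FIFO queue: work beyond the queue's capacity is rejected
// immediately instead of piling up unbounded.
class BoundedQueue<T> {
  private items: T[] = [];
  constructor(private readonly capacity: number) {}

  // Returns false (sheds the item) when the queue is full.
  enqueue(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // First in, first out.
  dequeue(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}

// With capacity 2, the third request is shed; processing is FIFO.
const queue = new BoundedQueue<string>(2);
const accepted = [queue.enqueue('req-1'), queue.enqueue('req-2'), queue.enqueue('req-3')];
const first = queue.dequeue();
console.log(accepted, first);
```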
<h3 id="heading-load-balancing-requests">Load Balancing Requests</h3>
<p>Another method is to use a load balancer to distribute incoming requests evenly across multiple servers in the system. If one server becomes overloaded, the load balancer can redirect requests to other servers in the system to help alleviate the load.</p>
<h3 id="heading-artificial-intelligence-ai-algorithms">Artificial intelligence (AI) algorithms</h3>
<p>Load-shedding can also be implemented using artificial intelligence (AI) algorithms, such as machine learning (ML) algorithms. These algorithms can analyze incoming requests and determine which ones should be prioritized based on various factors, such as the importance of the request, the expected response time, and the current load on the system.</p>
<h2 id="heading-side-effects-of-load-shedding">Side Effects of Load Shedding</h2>
<p>While load-shedding can be an effective way to prevent a distributed system from crashing, it can also have negative consequences. For example, if the system is constantly shedding the load, <strong>it may not be able to meet the demands of its users.</strong> This can lead to frustration and may result in a loss of business or customer loyalty.</p>
<p>To mitigate these negative consequences, it is important to <strong>carefully monitor the system and implement load-shedding only when necessary.</strong> It is also important to <strong>have adequate capacity in the system to handle the expected workload</strong> and to <strong>have robust failover mechanisms</strong> in place to ensure that the system remains operational even in the event of a failure.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In distributed systems, load shedding is the deliberate reduction of resources or capability so that the whole system doesn't fail from being overloaded. This can mean turning off services, slowing down processes, or rerouting requests. The goal is to stop cascading failures, in which the failure of one part sets off a chain reaction of failures in other parts. For load shedding to work well, there need to be clear policies and procedures in place, along with the ability to monitor the system and spot problems before they become serious. The system should be built with enough capacity and failover mechanisms to ensure reliable operation, and load shedding should be done carefully and only when necessary.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding The Difference Between LSM Tree and B-Tree]]></title><description><![CDATA[Let’s face it, data is a tricky thing to manage. All kinds of challenges arise when you attempt to store and organize data efficiently. In the world of databases, some structures are better suited than others for specific tasks. In this blog post, we...]]></description><link>https://blog.sofwancoder.com/understanding-the-difference-between-lsm-tree-and-b-tree</link><guid isPermaLink="true">https://blog.sofwancoder.com/understanding-the-difference-between-lsm-tree-and-b-tree</guid><category><![CDATA[Databases]]></category><category><![CDATA[data structures]]></category><category><![CDATA[data]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Sun, 18 Dec 2022 18:06:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1671385736958/xRgDNpQZw.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s face it, data is a tricky thing to manage. All kinds of challenges arise when you attempt to store and organize data efficiently. In the world of databases, some structures are better suited than others for specific tasks. In this blog post, we’ll discuss two common tree-based database structures: LSM Tree and B-Tree. Both have their advantages, with LSM Trees being more commonly used in modern applications. Which one is right for you? Keep reading for more information!</p>
<h2 id="heading-what-is-an-lsm-tree">What is an LSM Tree?</h2>
<p>An LSM Tree is a data structure that’s commonly used in databases. It’s not a specific database, it’s a data structure that can be used in several different database types. LSM stands for Log-Structured Merge. This structure has a few great benefits: It’s fast. It’s durable. It’s efficient in both space and time.</p>
<h2 id="heading-what-is-a-b-tree">What is a B-Tree?</h2>
<p>A B-Tree is a specific type of data structure that stores data sorted by key so that it is easy to find and manage. B-Trees are commonly used in relational databases, such as MySQL and Oracle. A B-Tree keeps its contents ordered and rebalances itself to accommodate more data as it is added to the database. B-Trees are made up of nodes, where data is stored, and pointers, which are used to navigate between nodes.</p>
<h2 id="heading-whats-the-difference-between-an-lsm-tree-and-a-b-tree">What’s the difference between an LSM Tree and a B-Tree?</h2>
<p>There are some key differences between an LSM Tree and a B-Tree. The biggest difference is in how each structure stores data.</p>
<ul>
<li><p><strong>Write path: an LSM Tree buffers writes in memory (a memtable) and flushes them to disk sequentially</strong>, while <strong>a B-Tree updates fixed-size pages in place</strong>, wherever the affected key happens to live.</p>
</li>
<li><p><strong>On-disk layout: an LSM Tree keeps its data in multiple immutable sorted files</strong> (often called SSTables) that are merged in the background by compaction. <strong>A B-Tree keeps a single mutable tree of pages</strong> that is rebalanced as data is added and removed.</p>
</li>
<li><p><strong>Read path: an LSM Tree lookup may need to check the memtable and several sorted files</strong>, newest first, before finding a key. <strong>A B-Tree lookup follows one path from the root page down to a single leaf.</strong></p>
</li>
</ul>
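<p>To make the write and read paths concrete, here is a toy, in-memory LSM-style store. It is a deliberately simplified sketch: real engines add a write-ahead log, bloom filters, and background compaction:</p>

```typescript
// A toy LSM-style store: writes go to an in-memory "memtable"; when it
// fills, it is flushed as an immutable sorted run; reads check the
// memtable first, then runs from newest to oldest.
class ToyLsm {
  private memtable = new Map<string, string>();
  private runs: Array<Array<[string, string]>> = [];

  constructor(private readonly memtableLimit: number) {}

  put(key: string, value: string): void {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memtableLimit) this.flush();
  }

  get(key: string): string | undefined {
    if (this.memtable.has(key)) return this.memtable.get(key);
    // Newest run first, so recent values shadow older ones.
    for (let i = this.runs.length - 1; i >= 0; i--) {
      const hit = this.runs[i].find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }

  private flush(): void {
    // Runs are written sorted by key, like an SSTable.
    const sorted = [...this.memtable.entries()].sort(([a], [b]) => a.localeCompare(b));
    this.runs.push(sorted);
    this.memtable.clear();
  }
}

const store = new ToyLsm(2);
store.put('b', '1');
store.put('a', '2'); // triggers a flush: one sorted run on "disk"
store.put('a', '3'); // newer value shadows the flushed one
console.log(store.get('a'), store.get('b'));
```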
<h2 id="heading-when-should-you-use-an-lsm-tree">When should you use an LSM Tree?</h2>
<p>If you need to take in a substantial amount of data quickly, an LSM Tree is an excellent option to consider. Because writes are buffered in memory and flushed to disk sequentially, an LSM Tree can sustain very high write throughput. The trade-off is on the read side: a lookup may have to check several files before finding a key, so read latency is less predictable than with a B-Tree, and read-heavy workloads can suffer.</p>
<h2 id="heading-when-should-you-use-a-b-tree">When should you use a B-Tree?</h2>
<p>If your workload is read-heavy, a B-Tree is the way to go. Because every key lives in exactly one place, reads are fast and predictable, and the sorted layout makes range queries efficient, which is why B-Trees are the default index structure in relational databases such as MySQL and Oracle. B-Trees are also frequently employed to keep track of metadata, or data about data. Their weakness is heavy write traffic: every update modifies pages in place, which costs more random I/O than an LSM Tree's sequential writes.</p>
<h2 id="heading-final-words-which-one-is-best-for-you">Final words: Which one is best for you?</h2>
<p>Now that we've gone over the basics of both LSM Trees and B-Trees, you should have a good idea of what each data structure has to offer. An LSM Tree absorbs large volumes of writes quickly and efficiently; a B-Tree offers fast, predictable reads and ordered access. Think about what kind of data you want to store and how you will need to get to it. Use a B-Tree if your workload is dominated by reads and range queries; use an LSM Tree if you need to take in a lot of data quickly and efficiently.</p>
]]></content:encoded></item><item><title><![CDATA[Time-based One-Time Password (TOTP): What is it?]]></title><description><![CDATA[A time-based one-time password (TOTP) is a one-time password generated based on the current time and a shared secret key. This method of authentication is used in addition to a username/email and password for increased security. TOTP is used in situa...]]></description><link>https://blog.sofwancoder.com/time-based-one-time-password-totp-what-is-it</link><guid isPermaLink="true">https://blog.sofwancoder.com/time-based-one-time-password-totp-what-is-it</guid><category><![CDATA[Node.js]]></category><category><![CDATA[backend]]></category><category><![CDATA[authentication]]></category><category><![CDATA[Security]]></category><category><![CDATA[Programming Tips]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Mon, 28 Nov 2022 18:17:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669659351988/P68a79Yy0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A time-based one-time password (TOTP) is a one-time password generated based on the current time and a shared secret key. This method of authentication is used in addition to a username/email and password for increased security. TOTP is used in situations where it is not feasible to use hardware-based tokens, such as when logging in from a public computer. TOTP is an open standard defined in <a target="_blank" href="https://tools.ietf.org/html/rfc6238">RFC6238</a>. This article will explain the basics you need to know about TOTP authentication as well as how to implement it in NodeJS applications. Let’s get started!</p>
<h2 id="heading-what-is-totp-authentication">What is TOTP Authentication?</h2>
<p>TOTP authentication is an authentication method that uses a time-based One-Time Password (OTP). It makes traditional login methods more secure because each generated code is only valid for a short window of time; a code entered outside that window is rejected. TOTP authentication is most frequently used for two-factor authentication (2FA), software logins, and remote employee access.</p>
<h2 id="heading-how-does-totp-authentication-work">How does TOTP Authentication work?</h2>
<p>A shared secret key and the current time are used to generate a new one-time-use password for TOTP authentication. The user then logs in or completes a login using this password (in the case of 2FA). Users are advised to enter the code as soon as it is generated because this password changes every 30 seconds. Users must enter a username, a code generated from the given time period, and —depending on the system requirement— a password in order to use TOTP authentication. In contrast to conventional login techniques, where users only need to remember their username and password, this requires more information from the user.</p>
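<p>Under the hood, this scheme is standardized in RFC 6238, which builds on RFC 4226's HOTP. As a rough sketch, using only Node's built-in <code>crypto</code> module (<code>generateTotp</code> is an invented helper name, not part of any library), the current time is bucketed into 30-second steps, HMAC-ed with the shared secret, and truncated to six digits:</p>
<pre><code class="lang-typescript">import { createHmac } from "crypto";

// Minimal sketch of the RFC 6238 TOTP computation (HMAC-SHA1,
// 30-second time step, 6 digits). generateTotp is a hypothetical name.
function generateTotp(secret: string, unixSeconds: number): string {
  // 1. Bucket the current time into 30-second steps
  const counter = Math.floor(unixSeconds / 30);

  // 2. Encode the step counter as an 8-byte big-endian buffer
  const msg = Buffer.alloc(8);
  msg.writeBigUInt64BE(BigInt(counter));

  // 3. HMAC the counter with the shared secret
  const hmac = createHmac("sha1", Buffer.from(secret, "ascii")).update(msg).digest();

  // 4. Dynamic truncation (RFC 4226): the low nibble of the last byte
  //    selects a 4-byte window; mask the sign bit, keep 6 decimal digits
  const offset = hmac[hmac.length - 1] % 16;             // same as (byte AND 0x0f)
  const binary = hmac.readUInt32BE(offset) % 0x80000000; // same as masking with 0x7fffffff
  return String(binary % 1000000).padStart(6, "0");
}
</code></pre>
<p>Checked against the RFC 4226 test vectors, the secret <code>12345678901234567890</code> at Unix time 59 (step counter 1) yields <code>287082</code>. Libraries like Speakeasy add base32 secret handling and a verification window on top of this core.</p>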
<h2 id="heading-why-is-totp-becoming-more-popular">Why is TOTP becoming more popular?</h2>
<p>Because of its high level of security, TOTP authentication is being used more frequently. In the traditional authentication technique, it is simpler for hackers to access numerous accounts because users just need one username and password. However, users must enter a code produced from the selected time period during TOTP authentication, which adds an additional degree of protection. This implies that in order to log into several accounts, a user would require access to various devices.</p>
<h2 id="heading-how-to-implement-totp-in-nodejs">How to implement TOTP in NodeJS?</h2>
<p>There are a few things we need to do in order to implement TOTP authentication in NodeJS. In this section, we're going to use a package called <a target="_blank" href="https://github.com/speakeasyjs/speakeasy">speakeasy</a>. Speakeasy stands out from the other 2FA projects on GitHub because of how actively it is maintained. To keep things simple, we're going to experiment with this package in a fresh project.</p>
<h3 id="heading-install-dependencies">Install Dependencies</h3>
<p><strong>Execute the following commands</strong> to start a new NodeJS project and install the Speakeasy package. This also installs Express.js along with the package required for parsing POST request payloads.</p>
<pre><code class="lang-typescript">npm init -y
npm install express body-parser speakeasy
npm install -D typescript <span class="hljs-meta">@types</span>/express <span class="hljs-meta">@types</span>/speakeasy
</code></pre>
<p>Ensure that an <code>app.ts</code> file is created in the current project directory and that it contains the boilerplate content listed below.</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> express, { Request, Response, NextFunction } <span class="hljs-keyword">from</span> <span class="hljs-string">"express"</span>;
<span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> bodyParser <span class="hljs-keyword">from</span> <span class="hljs-string">"body-parser"</span>;
<span class="hljs-keyword">import</span> { generateSecret, totp } <span class="hljs-keyword">from</span> <span class="hljs-string">'speakeasy'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> app = express();

app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: <span class="hljs-literal">true</span> }));

<span class="hljs-comment">// generate a secret token to be saved in an application like Google Authenticator</span>
app.post(<span class="hljs-string">"/generate-secret"</span>, <span class="hljs-function">(<span class="hljs-params">request: Request, response: Response, next: NextFunction</span>) =&gt;</span> { });

<span class="hljs-comment">// validate that the TOTP is valid for a given secret and is not expired</span>
app.post(<span class="hljs-string">"/validate-token"</span>, <span class="hljs-function">(<span class="hljs-params">request: Request, response: Response, next: NextFunction</span>) =&gt;</span> { });

app.listen(<span class="hljs-number">5000</span>, <span class="hljs-function">() =&gt;</span> {
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"Listening at :5000"</span>);
});
</code></pre>
<h3 id="heading-generate-totp-secret">Generate TOTP Secret</h3>
<p><strong>First, we need to generate a shared secret key.</strong> This key will be used by the authenticator app to generate the one-time-use passwords. The generated secret can also be turned into a QR code that an authenticator app can scan.</p>
<pre><code class="lang-typescript">app.post("/generate-secret", (request: Request, response: Response, next: NextFunction) =&gt; {
    const { otpauth_url, base32 } = generateSecret({ length: 20 });
    // How you store the generated secret varies by implementation
    saveSecretToDB(request.userId, base32);
    response.send({ "secret": base32 });
});
</code></pre>
<p><strong>—or— Generating a QR code</strong></p>
<p>By manually entering a key or scanning a QR code, users can add an account to applications like Google Authenticator. The latter is common and significantly quicker. We use the <a target="_blank" href="https://www.npmjs.com/package/qrcode">QRcode</a> library to produce QR images. We can install it by running <code>npm install qrcode @types/qrcode</code></p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> QRCode <span class="hljs-keyword">from</span> <span class="hljs-string">'qrcode'</span>;

app.post(<span class="hljs-string">"/generate-secret"</span>, <span class="hljs-function">(<span class="hljs-params">request: Request, response: Response, next: NextFunction</span>) =&gt;</span> {
    <span class="hljs-keyword">const</span> {otpauth_url, base32} = generateSecret({ length: <span class="hljs-number">20</span> });
    <span class="hljs-comment">// How you store generated secret key varies by implementation</span>
    saveSecretToDB(request.userId, base32);
    QRCode.toFileStream(response, otpauth_url);
});
</code></pre>
<h3 id="heading-validate-totp-secret">Validate TOTP Secret</h3>
<p><strong>To validate a code provided by the user</strong> —which is generated by the authenticator app— we verify it against the stored secret:</p>
<pre><code class="lang-typescript">app.post("/validate-token", (request: Request, response: Response, next: NextFunction) =&gt; {
    // How you get the secret for first-time activation depends on the implementation
    const secret = request.body.secret || findSecretFromDB(request.userId);
    const isValid = totp.verify({
        secret: secret,
        encoding: "base32",
        token: request.body.token,
        window: 0
    });
    // Depending on your implementation, the secret is probably already saved
    saveSecret(secret);
    response.send({ isValid });
});
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>TOTP authentication is an extra layer of security used in two-factor login methods. It creates a one-time-use password using a shared secret key and a predetermined amount of time. Due to its high level of security and simplicity of use, TOTP authentication is becoming more and more popular as an additional security measure.</p>
]]></content:encoded></item><item><title><![CDATA[Idempotency In APIs: Planning for Uncertainty]]></title><description><![CDATA[Idempotency is a property that can be applied to operations, algorithms, and code. In software engineering, it refers to the ability of an operation to be performed multiple times on the same input without resulting in an unnatural state. An idempote...]]></description><link>https://blog.sofwancoder.com/idempotency-in-apis-planning-for-uncertainty</link><guid isPermaLink="true">https://blog.sofwancoder.com/idempotency-in-apis-planning-for-uncertainty</guid><category><![CDATA[idempotence]]></category><category><![CDATA[APIs]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Frontend Development]]></category><dc:creator><![CDATA[Sofwan A. Lawal]]></dc:creator><pubDate>Mon, 07 Nov 2022 11:51:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1667821778117/V-GPVFoYS.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Idempotency is a property that can be applied to operations, algorithms, and code. In software engineering, it refers to the ability of an operation to be performed multiple times on the same input without resulting in an unnatural state. An idempotent operation is one that can be invoked any number of times without changing the result. The opposite of idempotency is non-idempotency, which occurs when the result changes with every call. This blog post explains why you should care about idempotency and provides examples of how you can design your APIs to make them more idempotent.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>Idempotency is a property of operations that can be performed repeatedly without changing the result beyond the initial application. Idempotent operations can be repeated multiple times and have the same effect as if they had been performed once. With non-idempotent APIs, it is difficult to handle errors and network uncertainties, which cause clients to resend requests that the server has already handled successfully.</p>
<p><strong>Consider a backend application which receives a debit order from a client (let's keep it simple):</strong></p>
<pre><code class="lang-typescript">import express, { Request, Response } from "express";
import * as bodyParser from "body-parser";

const app = express();
app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());

let orders = [];
let balance = 500;

// Defining the process-order endpoint to handle the request
app.post('/process-order', (req: Request, res: Response) =&gt; {
  if (balance &lt; req.body.amount) {
    return res.status(400).json({ message: 'Insufficient wallet balance!' });
  }

  balance = balance - Number(req.body.amount);

  let newOrder = {
    id: Math.floor(Date.now() + Math.random()), // pseudo-random number to act as ID
    item: req.body.item,
    amount: req.body.amount,
  };

  // add it to the orders database
  orders.push(newOrder);

  const response = {
    message: `Order placed successfully!`,
  };

  return res.status(201).json(response);
});

// Finally, start the server on a port
app.listen(3000, () =&gt; console.log('Server running!'));
</code></pre>
<p>Now to the interesting part. <strong>Consider a request sent to the backend application containing a debit order from your wallet:</strong></p>
<pre><code class="lang-typescript">async function sendOrder() {
  return await axios.post("https://server.url/process-order", {
    amount: 100,
    item: "lord-of-the-rings"
  });
}

const response = await sendOrder(); // response.status is 201

// Assuming there was a network error on the client side, retrying this
// request will result in a second debit
const retried = await sendOrder(); // response.status is 201 again

// If the request is tried one more time, we get yet another redundant
// debit on our user's wallet
const retriedAgain = await sendOrder(); // response.status is 201 again
</code></pre>
<p>The performance implications of idempotency come from not having to execute an operation more than once when it doesn't change the state of the system. You only need one request to trigger an action, not a second one with the same parameters, and when something goes wrong on the first try, a retry does not introduce duplicate side effects.</p>
<h2 id="heading-idempotency-by-examples">Idempotency by examples</h2>
<p>Idempotency is an important property for APIs because it allows them to be invoked multiple times with the same input without failing or producing unpredictable results. Idempotency can be applied at the service endpoint, inputs and outputs of data transformations, and individual request-response interactions within services.</p>
<p>We can implement an idempotency key in any way we deem fit; common approaches are:</p>
<ul>
<li><p>Adding the idempotent key in the body:</p>
<pre><code class="lang-typescript">// request body
{
  amount: 100,
  item: 'item-key',
  idempotentKey: 'unique-key-id'
}
</code></pre>
</li>
<li><p>Adding the idempotent key in the URL query parameters:</p>
<pre><code class="lang-typescript">// request url
const url = `https://server.url/endpoint?idempotentKey=unique-key-id`;
</code></pre>
</li>
<li><p>Adding the idempotent key in a header (most recommended): <code>X-Idempotent-Key: some-unique-key-id</code></p>
</li>
</ul>
<p>Now consider the backend application from the example above, redesigned to support idempotency. We'll pass the idempotent key in the <code>X-Idempotent-Key</code> header.</p>
<pre><code class="lang-typescript"><span class="hljs-comment">// To keep it simple, in real-world application, you'll setup an external cache server like redis</span>
<span class="hljs-comment">// or any distributed caching strategy you find fit for your application</span>
<span class="hljs-keyword">const</span> cache: Record&lt;<span class="hljs-built_in">string</span>, OrderItem|<span class="hljs-built_in">boolean</span>&gt; = {}

<span class="hljs-comment">// This middleware intercepts every requests going to the process-order endpoints</span>
app.use(<span class="hljs-string">'/process-order'</span>, (req: Request, res: Response, next: NextFunction) =&gt; {
  <span class="hljs-comment">// fetching the key from the header when it exists</span>
  <span class="hljs-keyword">const</span> idempotentKey = req.headers[<span class="hljs-string">'x-idempotent-key'</span>] 

  <span class="hljs-keyword">if</span> (!idempotentKey) {
   <span class="hljs-comment">// proceed to handle the request because the request is not idempotent;</span>
    <span class="hljs-keyword">return</span> next() 
  }

  <span class="hljs-keyword">const</span> processedOrder = cache[idempotentKey];

  <span class="hljs-keyword">if</span> (!processedOrder) {
    <span class="hljs-keyword">return</span> next() <span class="hljs-comment">// Proceed to processing because the request was not previously processed</span>
  }

  <span class="hljs-comment">// Here, the request has been handled already</span>
  <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">200</span>).json({
    message: <span class="hljs-string">`Order processed successfully!`</span>,
  });
})
</code></pre>
<p>The implementation above, which depends on a cache, includes a flaw that should be considered. Using a non-consistent cache for idempotency introduces a consistency bug, because there is no atomicity guarantee between the cache and the "main database" (whatever that may be). Depending on the precise timing of the idempotence information update, you can:</p>
<ul>
<li><p>end up with duplicate operations if you update the idempotence DB too late (or not at all, if the process fails or encounters an error!)</p>
</li>
<li><p>fail to register some operations because you update the idempotence DB too soon and crash, and when the client tries again, you act as though the change has already been done but in fact, it hasn't because it crashed the first time!</p>
</li>
</ul>
<p><strong>To solve this</strong> I recommend relying on the actual main database to decide when the operation fails or succeeds. The implementation above can be re-written as</p>
<pre><code class="lang-typescript">app.use(<span class="hljs-string">'/process-order'</span>, (req: Request, res: Response, next: NextFunction) {
  <span class="hljs-comment">// fetching the key from the header when it exists</span>
  <span class="hljs-keyword">const</span> idempotentKey = req.headers[<span class="hljs-string">'x-idempotent-key'</span>] 

  <span class="hljs-keyword">if</span> (!idempotentKey) {
   <span class="hljs-comment">// proceed to handle the request because the request is not idempotent;</span>
    <span class="hljs-keyword">return</span> next() 
  }

  <span class="hljs-comment">// Relying on the actual data-source to figure out if the operation has been completed before</span>
  <span class="hljs-comment">// This operation usually involves queries from the database</span>
  <span class="hljs-keyword">const</span> processedOrder = orders.find(<span class="hljs-function"><span class="hljs-params">order</span> =&gt;</span> order.requestId === idempotentKey);

  <span class="hljs-keyword">if</span> (!processedOrder) {
    <span class="hljs-keyword">return</span> next() <span class="hljs-comment">// Proceed to processing because the request was not previously processed</span>
  }

  <span class="hljs-comment">// Here, the request has been handled already</span>
  <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">200</span>).json({
    message: <span class="hljs-string">`Order processed successfully!`</span>,
  });
})
</code></pre>
<p>The actual process-order endpoint can be written to look like this implementation</p>
<pre><code class="lang-typescript"><span class="hljs-comment">// Normal processing</span>
app.post(<span class="hljs-string">'/process-order'</span>, <span class="hljs-function">(<span class="hljs-params">req: Request, res: Response</span>) =&gt;</span> {
  <span class="hljs-keyword">if</span> (balance &lt; req.body.amount) {
    <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">400</span>).json({ message: <span class="hljs-string">'Insufficient wallet balance!'</span> });
  }

  balance = balance - <span class="hljs-built_in">Number</span>(req.body.amount);

  <span class="hljs-keyword">const</span> idempotentKey = req.headers[<span class="hljs-string">'x-idempotent-key'</span>]

  <span class="hljs-keyword">let</span> newOrder = {
    id: <span class="hljs-built_in">Math</span>.floor(<span class="hljs-built_in">Date</span>.now() + <span class="hljs-built_in">Math</span>.random()), <span class="hljs-comment">// pseudo-random number to act as ID</span>
    item: req.body.item,
    amount: req.body.amount,
    requestId: idempotentKey ?? <span class="hljs-literal">null</span> <span class="hljs-comment">// Added the idempotentKey as requestId</span>
  };

  <span class="hljs-comment">//add it to orders database</span>
  orders.push(newOrder);

  <span class="hljs-comment">// If you employ a cache, but this has its considerations and issues</span>
  <span class="hljs-comment">// cache[idempotentKey] = true;</span>

  <span class="hljs-keyword">const</span> response = {
    message: <span class="hljs-string">`Order placed successfully!`</span>,
  };

  <span class="hljs-keyword">return</span> res.status(<span class="hljs-number">201</span>).json(response);
});
</code></pre>
<p>Now, thinking "idempotently 😂", let's re-write the earlier client implementation to be idempotent:</p>
<pre><code class="lang-typescript">// Probably persisted in SessionStorage/LocalStorage or component memory.
// This key is attached to each unique action and replaced when a new
// action needs to be triggered
let idempotentKey = 'some-unique-key';

async function sendOrder(idempotentKey: string) {
  return await axios.post("https://server.url/process-order", {
    amount: 100,
    item: "lord-of-the-rings"
  }, {
    headers: { "X-Idempotent-Key": idempotentKey }
  });
}

const response = await sendOrder(idempotentKey); // response.status is 201
</code></pre>
<p>Subsequent requests will return <code>200</code> without causing multiple debits or redundant orders: <code>const response = await sendOrder(idempotentKey); // response.status is 200</code></p>
<p>Trying again results in no state change: <code>const response = await sendOrder(idempotentKey); // response.status is 200 again</code></p>
<p>This has important performance implications for applications, so it’s worth thinking about how to design idempotent APIs from the start.</p>
<h2 id="heading-idempotent-and-high-performance-apis">Idempotent and High-Performance APIs</h2>
<p>APIs that don’t guarantee idempotency can lead to very slow applications. Imagine an e-commerce website where users are allowed to place orders. If an order is placed in error, you might want to cancel it, and to cancel it you need to identify it. If the order ID is non-idempotent, the client cannot simply reuse an identifier it already holds: it has to execute a whole chain of lookups (fetch the order total, the payment method used to place the order, and so on) before it can find the order it wants to cancel. Each of these operations involves communicating with the backend services, and each of them has the potential to fail due to network issues or other problems. When something does fail along the way, the whole chain of operations has to be retried, possibly even more slowly than before.</p>
<h2 id="heading-api-responses-without-idempotency">API Responses without Idempotency</h2>
<p>An example of an API response that is not idempotent is one that includes a unique ID for each resource that is returned. This is the approach you’ll find in many database APIs, where you get an ID as part of the result. These IDs are not suitable as unique identifiers since they can change each time you retry an operation. IDs generated based on a combination of the current time and the internal state of the server are also not suitable as unique identifiers. These time-based IDs are bound to change on each retry, making them non-idempotent. Any other approach that relies on some internal server state that changes with each request is not idempotent.</p>
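<p>The contrast can be sketched in a few lines. This is a toy illustration; <code>deriveOrderId</code> is an invented helper, not part of the article's API. An ID based on the clock changes on every retry, while an ID derived deterministically from the client's idempotent key does not:</p>
<pre><code class="lang-typescript">import { createHash } from "crypto";

// Non-idempotent: depends on the clock and random state, so every retry
// produces a different ID (this mirrors the Math.floor(Date.now() + ...)
// pattern used earlier in the article)
function timeBasedId(): string {
  return String(Math.floor(Date.now() + Math.random()));
}

// Idempotent: derived purely from the client-supplied idempotent key, so
// recomputing it on a retry yields the exact same ID
function deriveOrderId(idempotentKey: string): string {
  return createHash("sha256").update("order:" + idempotentKey).digest("hex").slice(0, 12);
}
</code></pre>
<p>With an ID like <code>deriveOrderId</code> produces, a retried request maps onto the same order row, so a uniqueness constraint on the ID alone is enough to reject the duplicate.</p>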
<h2 id="heading-the-problem-with-non-idempotent-apis">The problem with non-idempotent APIs</h2>
<p>While the examples above are easy to identify, real-world APIs often break idempotency in ways that are harder to spot. What if the response to an order details request also includes the order ID? It may seem like you now have everything you need in one call. But as long as the order ID is tied to the current state of the server, it’s not idempotent. It’s still possible that the first attempt at placing the order fails due to misconfiguration or network issues, and you need to retry the order details request. What if the order ID is stored as a global variable on the server side? This isn’t idempotent either, as it will be reset as soon as the order is placed.</p>
<h2 id="heading-how-to-make-your-api-idempotent">How to make your API idempotent</h2>
<p>To make your API idempotent, you need to identify the state that might change between retries and make it immutable. Data that can't change between retries is suitable for storing the unique identifiers, and such data can be shared across multiple requests without needing to be duplicated. If your service runs on several nodes, that state has to live in storage all of them share. One thing to consider: never use any kind of separate database for idempotence, unless you can make this separate DB consistent with the main DB in one transactional context (which is slow and complex).</p>
<p>Hence, avoid implementing idempotence as a bolt-on solution using an external DB for which you cannot guarantee the consistency of writes relative to where the business entities live. Usually, this means you need one of:</p>
<ul>
<li><p><strong>Atomicity</strong> - you must write the operation ID together with the changes that it introduces, so that you can rely on the equivalence: operation_ID_is_present === the_operation_was_already_applied. This usually means DB transactions, sometimes even distributed transactions (JTA/XA, but don't go down that rabbit hole).</p>
</li>
<li><p><strong>Natural idempotence</strong> - you don't rely on any external operation identifier, but instead deduplicate the essence of the command. This is tricky, for example you might not be able to have an "Add to cart" command, but instead "Make it so that there is 1 of this item in the cart". Then, deduplicating commands is a matter of seeing: is there currently 1 unit of this item in the cart? If so, do nothing.</p>
</li>
</ul>
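<p>The "natural idempotence" point is easiest to see in code. A hypothetical sketch, with names invented for illustration: an "add to cart" command double-counts when replayed, while "set the quantity of this item to N" can be replayed any number of times:</p>
<pre><code class="lang-typescript">// A command-style mutation: replaying it changes the result every time,
// so it is NOT naturally idempotent
function addToCart(cart: { [item: string]: number }, item: string, qty: number): void {
  cart[item] = (cart[item] ?? 0) + qty;
}

// A state-style mutation: "make it so there is exactly qty of this item".
// Replaying it leaves the cart unchanged, so it IS naturally idempotent
function setCartQuantity(cart: { [item: string]: number }, item: string, qty: number): void {
  cart[item] = qty;
}
</code></pre>
<p>With the second form, no operation ID or deduplication table is needed; the deduplication check is simply "is the state already what the command asks for?".</p>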
<h2 id="heading-conclusion">Conclusion</h2>
<p>APIs that are designed for high performance from the start are easier to develop and maintain. They are less likely to break in unexpected ways, and they are easier to optimise. This includes making sure they are idempotent. An API that is not idempotent may work just fine in testing, but once it’s in production, it’s bound to have serious performance issues. The key to idempotency is identifying the state that can change in each retry and making it immutable. Data that can’t change between retries is suitable for storing the unique identifiers. Furthermore, such data can be shared across multiple requests without needing to be duplicated. This way, you can reduce the number of RPC calls significantly, which saves on network overhead, as well as processing resources. It can also make your system more robust by reducing the number of points of failure.</p>
]]></content:encoded></item></channel></rss>