Software Architecture Concepts: [Part 5] Key Characteristics of Distributed Systems

In this blog post I will talk about some key characteristics of a highly scalable distributed system, in particular:

  • Scalability (see Part 4)
  • Reliability
  • Availability
  • Efficiency and Manageability.

Scalability

Scalability is the capability of a system, process, or a network to grow and manage increased demand. Any distributed system that can continuously evolve in order to support the growing amount of work is considered to be scalable.

A system may need to scale for many reasons, such as increased data volume or an increased amount of work, e.g., number of transactions. A scalable system should handle this growth without losing performance.

Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more servers to your pool of resources (scale out), whereas vertical scaling means that you scale by adding more power (CPU, RAM, storage, etc.) to an existing server (scale up).

Good examples of horizontal scaling are Cassandra and MongoDB database systems as they both provide an easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a good example of vertical scaling is MySQL as it allows for an easy way to scale vertically by switching from smaller to bigger machines. However, this process often involves downtime.
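To make the scale-out idea concrete, here is a minimal sketch that spreads a fixed number of requests over a pool of servers in round-robin order. The function name distribute and the server names s1 to s4 are made up for illustration, and round-robin is only one of several possible distribution strategies; real databases such as Cassandra or MongoDB partition data in their own ways.

```python
from collections import Counter


def distribute(num_requests, servers):
    """Assign requests to servers round-robin; return requests handled per server."""
    load = Counter()
    for i in range(num_requests):
        # Request i goes to the next server in the pool, wrapping around.
        load[servers[i % len(servers)]] += 1
    return dict(load)


# Same workload, before and after adding two machines to the pool (scale out).
before = distribute(1000, ["s1", "s2"])
after = distribute(1000, ["s1", "s2", "s3", "s4"])

print(before)  # {'s1': 500, 's2': 500}
print(after)   # {'s1': 250, 's2': 250, 's3': 250, 's4': 250}
```

Doubling the pool halves the work each server has to do, which is the effect horizontal scaling aims for. Vertical scaling would instead keep one server and make it able to handle more requests by itself.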

Reliability

By definition, reliability is the probability that a system will not fail in a given period. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its components fail. Reliability is one of the main characteristics of any distributed system, since in such systems a failing unit (machine) can be replaced by another working (healthy) one, ensuring the completion of the requested task.

A reliable distributed system can achieve this through redundancy of both the software logic components and its stored data. If the server carrying the user’s messages fails, another server that has the exact replica of the data should replace it.

Redundancy has a cost, and a reliable system has to pay it: resilience is achieved by eliminating every single point of failure, which means running and maintaining extra copies of services and data.
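The idea of a replica taking over from a failed server can be sketched as follows. This is a simplified illustration, not a production design: the classes Replica and ReplicaDown and the function read_with_failover are hypothetical names, and the sketch assumes every replica already holds an identical copy of the data.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve requests."""


class Replica:
    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = dict(data)  # each replica keeps its own copy of the data
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try replicas in order; return (replica_name, value) from the first healthy one."""
    for replica in replicas:
        try:
            return replica.name, replica.get(key)
        except ReplicaDown:
            continue  # this copy is unavailable, try the next one
    raise RuntimeError("all replicas are down")


messages = {"user:42": ["hello", "see you at 5"]}
primary = Replica("primary", messages, healthy=False)  # simulate a failed server
backup = Replica("backup", messages)

served_by, value = read_with_failover([primary, backup], "user:42")
print(served_by, value)  # backup ['hello', 'see you at 5']
```

The caller still gets the user's messages even though the primary is down. The cost is visible too: the data is stored twice, and a real system would also need to keep the copies in sync.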

Availability

By definition, availability is the time a system remains working to perform its required function in a specific period. It is a simple measure of the percentage of time that a system remains operational under normal conditions. A car that can be driven for many hours a month without much downtime can be said to have a high availability. Availability takes into account time needed for maintenance, repair, spares availability, and other factors. If a car is down for maintenance, it is considered not available during that time.
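As a small worked example, availability can be computed as the share of a period during which the system was operational. The function names and the numbers below (a 30-day month of 720 hours with 3.6 hours of downtime) are chosen purely for illustration.

```python
def availability_percent(uptime_hours, downtime_hours):
    """Percentage of the period during which the system was operational."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the period must have a positive length")
    return 100.0 * uptime_hours / total


def allowed_downtime_hours(period_hours, target_percent):
    """Downtime budget that still meets the availability target."""
    return period_hours * (1 - target_percent / 100.0)


period = 30 * 24    # a 30-day month, in hours
downtime = 3.6      # hours spent in maintenance or repair (illustrative)

month_availability = availability_percent(period - downtime, downtime)
budget = allowed_downtime_hours(period, 99.9)

print(f"{month_availability:.2f}%")  # 99.50%
print(f"{budget * 60:.1f} minutes")  # 43.2 minutes
```

Read the other way around, a 99.9% target over the same month leaves a downtime budget of roughly 43 minutes, including planned maintenance.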

Reliability is availability over time, considering the full range of real-world conditions that can occur. A car that can make it through any possible hot weather is more reliable than one that has problems during summer.

Reliability compared to Availability

If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing time needed for repair and ensuring that spares are always available when they are needed.
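One common way to put numbers on this is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. These two terms are not used elsewhere in this post, and the figures below are invented for illustration; the sketch only shows how a system that fails often can still be highly available if it is repaired quickly.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# System A: fails rarely (more reliable) but takes a long time to repair.
a = availability(mtbf_hours=1000, mttr_hours=10)

# System B: fails ten times as often (less reliable) but is repaired in minutes,
# e.g. because spares are always at hand.
b = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"A: {a:.4f}")  # A: 0.9901
print(f"B: {b:.4f}")  # B: 0.9990
```

System B is the less reliable of the two, yet it ends up more available because its repair time is so short.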

Efficiency

To understand how to measure the efficiency of a distributed system, let’s assume we have an operation that runs in a distributed manner and delivers a set of items as result. Two standard measures of its efficiency are the response time (or latency) that denotes the delay to obtain the first result from the system and the throughput (or bandwidth) which denotes the number of results delivered in a given time unit. The two measures correspond to the following unit costs:

  • Number of data items processed by the nodes of the system, regardless of their size.
  • Size of the data items that need to be processed.
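The two measures described above can be sketched for a single operation as follows. The generator produce_items is a hypothetical stand-in for a distributed operation that delivers its results one at a time, and the sleep merely simulates work; only the measuring logic is the point here.

```python
import time


def produce_items(n, delay_s=0.001):
    """Stand-in for a distributed operation that yields results one at a time."""
    for i in range(n):
        time.sleep(delay_s)  # simulated processing and network delay
        yield i


def measure(operation):
    """Return (latency_s, throughput_per_s, count) for an iterable of results."""
    start = time.perf_counter()
    first_result_at = None
    count = 0
    for _ in operation:
        if first_result_at is None:
            first_result_at = time.perf_counter()  # delay until the first result
        count += 1
    elapsed = time.perf_counter() - start
    latency = None if first_result_at is None else first_result_at - start
    throughput = count / elapsed if elapsed > 0 else 0.0
    return latency, throughput, count


latency, throughput, count = measure(produce_items(50))
print(f"first result after {latency * 1000:.1f} ms, {throughput:.0f} items/s")
```

The printed values depend on the machine, so no expected output is shown. Note that the two numbers can move independently: batching results, for instance, may raise throughput while making the first result arrive later.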

Serviceability or Manageability

Another critical consideration while designing a distributed system is how easy it is to operate and maintain. Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained. If the time needed to fix a failed system and bring it back up increases, availability decreases. Things to consider for manageability are the amount of monitoring and logging available to diagnose and understand problems when they occur, the ease of making updates or modifications, and how simple the system is to operate (i.e., does it normally work without failures or exceptions?).

Early and automatic detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center or send a notification to their operator when a fault occurs, without human intervention and without customers having to call a customer service representative.
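A very small sketch of such automatic fault detection is a heartbeat check: every node reports periodically, and a node that has been silent for too long is flagged and reported to the operator. The names find_failed_nodes and notify_operator, the node names, and the 30-second timeout are assumptions made for this example; a real system would send the alert through a paging or messaging service rather than append it to a list.

```python
def find_failed_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout_s
    )


def notify_operator(failed_nodes, send):
    """Send one alert per failed node through the given callable."""
    for node in failed_nodes:
        send(f"ALERT: {node} missed its heartbeat")


# Timestamps (in seconds) of the last heartbeat received from each node.
last_heartbeat = {"node-a": 100.0, "node-b": 71.0, "node-c": 98.5}

alerts = []  # stands in for a real notification channel
failed = find_failed_nodes(last_heartbeat, now=102.0, timeout_s=30.0)
notify_operator(failed, alerts.append)

print(alerts)  # ['ALERT: node-b missed its heartbeat']
```

Only node-b has been silent for more than 30 seconds, so it is the only node reported, and the operator hears about it without any customer having to call in.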