GLOSSARY OF SYSTEM DESIGN BASICS
Key Characteristics of Distributed Systems
Scalability
- Scalability is the capability of a system, process, or network to grow and manage increased demand.
- Any distributed system that can continuously evolve to support a growing amount of work is considered scalable.
- horizontal scaling: add more servers to the pool of resources
- vertical scaling: add more power (CPU, RAM, storage) to an existing server
Reliability
- By definition, reliability is the probability that a system will perform its intended function without failure over a given period.
- In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail.
- A reliable distributed system achieves this through redundancy of both the software components and data.
- Redundancy has a cost, and a reliable system must pay it to achieve such resilience by eliminating every single point of failure.
Availability
By definition, availability is the time a system remains operational to perform its required function in a specific period.
It is a simple measure of the percentage of time that a system, service, or a machine remains operational under normal conditions.
If a system is reliable, it is available. However, if it is available, it is not necessarily reliable.
In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed.
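Availability is often quoted in "nines" (99.9%, 99.99%, and so on). A quick sketch of what those targets mean in practice, assuming a 365-day year (the targets shown are illustrative):

```python
# Allowed downtime per year for a given availability target ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_minutes_per_year(target):.1f} min/year")
```

Note how each extra "nine" cuts the allowed downtime by a factor of ten, which is why very high availability targets get expensive quickly.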
Efficiency
- response time (or latency)
- throughput (or bandwidth)
Serviceability or Manageability
- the simplicity and speed with which a system can be repaired or maintained
Load Balancing
To utilize full scalability and redundancy, we can try to balance the load at each layer of the system. We can add LBs at three places:
- Between the user and the web server
- Between web servers and an internal platform layer, like application servers or cache servers
- Between the internal platform layer and the database
Load Balancing Algorithms:
- Health Checks
- Least Connection Method
- Least Response Time Method
- Least Bandwidth Method
- Round Robin Method
- Weighted Round Robin Method
- IP Hash
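The two round-robin strategies above can be sketched in a few lines. This is a minimal illustration, not a production load balancer; the server names and weights are made up:

```python
import itertools

servers = ["app1", "app2", "app3"]

# Round Robin: cycle through the servers in order.
rr = itertools.cycle(servers)

def round_robin():
    return next(rr)

# Weighted Round Robin: a server with weight w appears w times in the
# rotation, so it receives proportionally more requests.
weights = {"app1": 3, "app2": 1, "app3": 1}
weighted_order = [s for s, w in weights.items() for _ in range(w)]
wrr = itertools.cycle(weighted_order)

def weighted_round_robin():
    return next(wrr)

print([round_robin() for _ in range(4)])           # app1, app2, app3, app1
print([weighted_round_robin() for _ in range(5)])  # app1 x3, then app2, app3
```

Weighted round robin is useful when servers have different capacities; the weight is typically set in proportion to a server's processing power.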
Caching
Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again.
They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications, and more.
Application server cache
Content Distribution Network (CDN)
Cache Invalidation:
keep cache coherent with the source of truth (e.g., database).
Write-through cache
Under this scheme, data is written into the cache and the corresponding database at the same time.
Write-around cache
This technique is similar to write-through cache, but data is written directly to permanent storage, bypassing the cache.
Write-back cache
Under this scheme, data is written to cache alone and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions.
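A toy write-through cache makes the first scheme concrete: every write goes to both the cache and the backing store, so the two never diverge. The dict-based "database" below is a stand-in for real permanent storage:

```python
class WriteThroughCache:
    def __init__(self, database: dict):
        self.cache = {}
        self.database = database

    def write(self, key, value):
        # Write-through: update the cache and the database together.
        self.cache[key] = value
        self.database[key] = value

    def read(self, key):
        # Serve from the cache when possible; fall back to the database.
        if key in self.cache:
            return self.cache[key]
        value = self.database[key]
        self.cache[key] = value
        return value

db = {}
c = WriteThroughCache(db)
c.write("user:1", "alice")
print(db["user:1"])  # the database already has the value: "alice"
```

A write-back variant would update only `self.cache` in `write()` and flush to `self.database` later, trading durability for lower write latency.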
Cache eviction policies:
- First In First Out (FIFO)
- Last In First Out (LIFO)
- Least Recently Used (LRU)
- Most Recently Used (MRU)
- Least Frequently Used (LFU)
- Random Replacement (RR)
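LRU, the most common of these policies, can be sketched with an `OrderedDict` that tracks recency: the least recently used key sits at the front and is evicted first. A minimal sketch, not tuned for production use:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")             # "a" is now most recently used
cache.put("c", 3)          # capacity exceeded: evicts "b"
print(list(cache.items))   # ['a', 'c']
```

Swapping the eviction line for `popitem(last=True)` would turn this into MRU eviction; LFU would instead need a per-key access counter.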
Data Partitioning
Data partitioning is a technique to break up a big database (DB) into many smaller parts.
Partitioning methods:
Horizontal partitioning
In this scheme, we put different rows into different tables.
Vertical Partitioning
In this scheme, we divide the data by feature, storing the tables related to a specific feature on their own server.
Directory-Based Partitioning
A lookup service that knows the partitioning scheme maps each key to the server holding it, decoupling the partitioning logic from the rest of the application.
Partitioning criteria:
- Key or Hash-based partitioning
- List partitioning
- Round-robin partitioning
- Composite partitioning
Common problems of data partitioning:
- Joins and Denormalization
- Referential integrity
- Rebalancing
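Key/hash-based partitioning and its rebalancing problem can both be shown in a few lines. The sketch below routes each key to a shard with `hash(key) % num_shards`, then adds one shard and counts how many keys move. It uses a stable hash (md5) rather than Python's built-in `hash()`, which is randomized across runs; the key names are illustrative:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user:{i}" for i in range(1000)]

before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}  # one shard added
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")  # typically around 80% move
```

This is exactly the rebalancing problem listed above: changing the shard count remaps almost every key, which is what consistent hashing (described later) is designed to avoid.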
Indexes
The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want.
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
An index can dramatically speed up data retrieval, but the additional index structure must be updated on every write, which slows down data insertion and updates.
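A small sketch with SQLite makes the effect visible: the query planner reports a full-table scan before the index exists and an index search afterwards. The table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

# Without an index, a lookup by email scans the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("user500@example.com",)).fetchone()
print(plan)  # detail column mentions a SCAN of the table

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index, the same lookup becomes a search on the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("user500@example.com",)).fetchone()
print(plan)  # detail column mentions idx_users_email
```

The trade-off mentioned above applies here too: every `INSERT` and `UPDATE` on `email` now also maintains `idx_users_email`.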
Proxies
In short, a proxy server is a piece of software or hardware that acts as an intermediary for requests from clients seeking resources from other servers.
- Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compressing a resource).
- Another advantage of a proxy server is that its cache can serve a lot of requests. If multiple clients access a particular resource, the proxy server can cache it and serve it to all the clients without going to the remote server.
Open Proxy:
- Anonymous Proxy
- Trаnspаrent Proxy
Reverse Proxy:
- A reverse proxy retrieves resources on behalf of a client from one or more servers. These resources are then returned to the client, appearing as if they originated from the proxy server itself.
Redundancy and Replication
- server redundancy
- data replication: a master-slave relationship between the original and the copies
SQL vs. NoSQL
SQL:
- Relational databases store data in rows and columns.
- Each row contains all the information about one entity and each column contains all the separate data points.
- Some of the most popular relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB.
NoSQL:
- Key-Value Stores: Data is stored in an array of key-value pairs. The ‘key’ is an attribute name which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and Dynamo.
- Document Databases: In these databases, data is stored in documents (instead of rows and columns in a table) and these documents are grouped together in collections. Each document can have an entirely different structure. Document databases include the CouchDB and MongoDB.
- Wide-Column Databases: Instead of ‘tables,’ in columnar databases we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase.
- Graph Databases: These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph database include Neo4J and InfiniteGraph.
| | SQL | NoSQL |
|---|---|---|
| Storage | tables | key-value, document, graph, and columnar |
| Schema | fixed | dynamic |
| Querying | SQL (structured query language) | database-specific APIs (sometimes called UnQL, unstructured query language) |
| Scalability | vertically scalable, expensive | horizontally scalable, easy |
| ACID Compliance | ACID compliant | often sacrifice ACID compliance for performance and scalability |

**ACID: Atomicity, Consistency, Isolation, Durability**
CAP Theorem
The CAP theorem states that it is impossible for a distributed software system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance. In other words, when designing a distributed system we can pick at most two of the three:
Consistency: All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads.
All users see the same data at the same time.
Availability: Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers.
System continues to function even with node failures.
Partition tolerance: The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.
System continues to function even if the communication fails between nodes.
Consistent Hashing
Distributed Hash Table (DHT)
index = hash_function(key) — this yields a logical address, not necessarily a physical one
Drawbacks of the `key % n` hash function:
- It is NOT horizontally scalable: adding or removing a server changes `n` and remaps almost every key.
- It may NOT be load balanced, especially for non-uniformly distributed data.
In Consistent Hashing, when the hash table is resized (e.g., a new cache host is added to the system), only `k/n` keys need to be remapped, where `k` is the total number of keys and `n` is the total number of servers. Recall that in a caching system using `mod` as the hash function, all keys need to be remapped.
In Consistent Hashing, objects are mapped to the same host if possible. When a host is removed from the system, the objects on that host are shared by other hosts; when a new host is added, it takes its share from a few hosts without touching other’s shares.
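A minimal consistent-hash ring illustrates this behavior. Hosts and keys are hashed onto the same ring, and a key belongs to the first host clockwise from its position. This sketch omits virtual nodes (which real implementations use to even out the load), and the host and key names are made up:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash so results are reproducible across runs.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, hosts):
        self.ring = sorted((ring_hash(h), h) for h in hosts)

    def host_for(self, key: str):
        positions = [pos for pos, _ in self.ring]
        # First host at or after the key's position, wrapping around.
        i = bisect.bisect(positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

    def add_host(self, host):
        bisect.insort(self.ring, (ring_hash(host), host))

ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
keys = [f"obj:{i}" for i in range(1000)]
before = {k: ring.host_for(k) for k in keys}
ring.add_host("cache4")
after = {k: ring.host_for(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")  # only keys landing on cache4 move
```

Contrast this with the `key % n` scheme above, where adding a host remaps most keys: here every key that moves, moves onto the new host, and all other assignments are untouched.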
Long-Polling vs WebSockets vs Server-Sent Events
standard HTTP web request:
- The client opens a connection and requests data from the server.
- The server calculates the response.
- The server sends the response back to the client on the opened request.
Ajax Polling:
The basic idea is that the client repeatedly polls (or requests) a server for data.
The client makes a request and waits for the server to respond with data. If no data is available, an empty response is returned.
- The client opens a connection and requests data from the server using regular HTTP.
- The requested webpage sends requests to the server at regular intervals (e.g., every 0.5 seconds).
- The server calculates the response and sends it back, just like regular HTTP traffic.
- The client repeats the above three steps periodically to get updates from the server.
HTTP Long-Polling (Hanging GET):
- If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available.
- Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server, so that the server will almost always have a waiting request available that it can use to deliver data in response to an event.
- The client makes an initial request using regular HTTP and then waits for a response.
- The server delays its response until an update is available or a timeout has occurred.
- When an update is available, the server sends a full response to the client.
- The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period.
- Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed due to timeouts.
WebSockets:
- WebSocket provides full-duplex (two-way, simultaneous) communication channels over a single TCP connection.
- It provides a persistent connection between a client and a server that both parties can use to start sending data at any time.
- The client establishes a WebSocket connection through a process known as the WebSocket handshake.
Server-Sent Events (SSEs):
- Under SSEs the client establishes a persistent and long-term connection with the server.
- The server uses this connection to send data to a client.
- Client requests data from a server using regular HTTP.
- The requested webpage opens a connection to the server.
- The server sends the data to the client whenever there’s new information available.