COEN 317 Distributed system

09/21

layer tier

layer: software
tier: hardware

layerd architecture style

separation of concerns
unnecessary transaction
grey line, hard to decide which layer to put

event based architecture styles

dynamic
at the beginning, each component don’t know who to communicate to
after that, can point-to-point

middleware

libararies, runtime environment
JVM

blocking

sent/transmission != reach the destination
blocking: suspended until transmission finishes
transmission: IT DEPENDS!
1. sending message buffer
2. channel
3. receiving messsage buffer

persistent communication

message queue
need to store data

asynchronous

synchronous: wait unitl delivery, not process of the data

soft state

cache

09/26

application-specific protocol
application-independent protocol:

middleware
TCP/IP protocol:

local OS

why layered?

separation of concerns

OSI reference model

application
presentation
- process information for applications
- serialization/deserialization
- encryption/decryption
session
transport
- end-to-end communication between different applications
network
- routing packets
data link
- send, detect and corrent data frames
- data frames: units of bits
physical layer
- actual transfer of binary bits data

TCP/IP protocol suite

application
- HTTP, FTP, SMTP
- process-to-process
transport
- TCP, UDP
- host-to-host
internet
- IP
- routing table
  
  host a-> router -> router -> host b
link

TCP UDP

TCP

data in streams
UDP

data in messages

remote procedure calls

remote invocation

one-way RPC
asynchronous RPC

remote object calls

message brokers

between clients and servers
beyond simple queues
message will be persistent

socket APIs

transport-level socket programming via socket interface
message-oriented transient communication at the transport layer

Messaging-passing interface

message-oriented transient communication at the application layer

queuing model

message-oriented persistent communication
asynchronous/time uncoupling
advanced message queuing protocol

rabbitMQ

0928

why some inaccuracy is tolerable?

more than one failure detection mechanism
in heart-beating, after some times it will correct itself

atomic > reliable

atomic: all or none receive the message overkill
reliable: we can tolerate some inaccuracy

centralized heart-beating out of consideration!

single point / performance
too much workload for the center node
can not be dynamic: the center needs to know the global state
saturated

ring heart-beating valid

predecessor and successor
what if 2 nodes fail? stablization protocol
detection is too slow

all-to-all heart-beating valid

full of unnecessary message
can not be dynamic

false positive / false negative

inaccurate
incomplete! intolerable

local buffers processes and channels

copy message to outgoing message buffer send
copy message from incoming message buffer receive
latency: in channel and in operation system

process failure / channel failure

1003

virtualization

Virtual Machine Memory = Hypervisor

platform virtualization, hosted VM
shared OS
1. hardware
2. host OS
  Patch, upgrade together
3. hypervisor, virtualizing the OS
4. applications, limited(no kernel module)
  example: Linux container
  host OS adminstrater sees everything! not fully isolated!
  reboot OS? all down!
full virtualization, native VM
independent guest OS
1. hardware
2. hypervisor, virtualizing the hardware
3. virtual hardwares, including VCPU, VNIC
4. guest OSs(Windows, Linux, MacOS)
5. applications
  fully isolated!
  less scalable
  better performance
combined
within each guest OS, platform virtualization(virtualzing some guest OS)

network address translation

NIC: Network Interface Control
NAT: Network address translation
hardware: MAC, Media Access Control
VMs: MACs
query NAT table for every connection
improvement: SRIOV

SRIOV

integrate NAT into NIC
PCIe: Peripheral Component Interconnect express chip-to-chip interconnect
Ethernet: via cables system-to-system

network bridging

poor scalability
bridge over the hypervisor for better performance

host only networking

inter instance communication, VM to VM

1005

P2P exmaple: Napster

server
1. stores no files
2. maintains a table
3. response with where to download the music(IP address, port)
client stores their own data
server search: ternary tree algorithm
problem:
1. infringement of copyright
2. security : each peer exposes a port; clear-text communication
3. fault-tolerance: centralized server, centralized algorithm
  single point of failure

Distributed Hash Table

objects: files
bucket: nodes
hash function: consistent hashing hash ring
SHA-1(filename IP address, port) => 160-bit string(key) m bits
may use leader-election algorithm
routing mechanism:
- routing: all operations, including insert, lookup and delete
- each node knows its successor and predecessor node id: 0~pow(2,m)-1
- each node computes a finger table(routing table, nodes it knows)
- if doesn’t have the file(after local search), querying, each step halves the distance O(log(N))
  1. go(forward) to the farthest node in finger table, but no crossing the key
  2. if none exist, send query to successor
- all query steps are RPCs
search under peer failures
- one solution: maintain (r) more than one successor entries instead of one
  - if one query fails, go to the next successor
  - sender’s finger table not yet corrected
- one solution: replicate file/key at r successors and predecessors
new peers joining (dynamic changes churn)
- stablization protocol followed by all nodes stablization algorithm
  - periodically check and update
  - ensure non-loopiness
- copy keys and update finger table
- bandwidth cost busy network
- alternatives: store and replicate only pointers to files(meta info)

1010

naming in distributed systems

same name; unique identifier
address = identifer + access point
context is import!
- global
- local
dynamic or static binding?

broadcast link network

identifier: IP and MAC address Media Access Control
arp -a every node in the network will broadcast their id

independent nodes P2P system

identifier: hash to 160-bits-string, then truncate it to m bits(name space)
name resolution: distributed hash tables
shaw-one algorithm

flat naming

no structure, just plain bits
ip address

structured naming

directory
file system:
- alias: short, multiple names
- multiple path to one file

centralized management generic problems

global state: no one can know all
transaction for second
single point of failure

hard links && symbolic links

hard links:
- When adding one hard link, this file’s reference count++
- must delete reference count before deleting
symbolic links:
- When adding one link, the file will store the meta data pointing to the original file
- linux: ln -s

mounting

nfs protocol
remote server B: share file/directory
- linux: exportfs /home/steen
local server A: mount remote shared file/directory locally
- linux: mount B.ip:/home/steen /remote/vu
- linux: cd /remote/vu

name resolution

client send query to name servers
host query DNS server
linux:
- the host knows where to query in DNS
  /etc/reslve.conf
1. nslookup www.scu.edu
2. dig
iterative
recursive
redundancy and caching server
- speed
- fault tolerance
caching on the client side: local browser name resolver

attribute-based naming

tuple/triple: key, value/set of values
Lightweigh Directory Access Protocol
when sending email, client query LDAP server
west coast, east coast, europe, …
Directory Information Base data sharding

1017

gossip multicast

fault tolerance
scalability sending messages fast

centralized approach

performance bottleneck not scalable
single point of failure not reliable
communication channel may be unreliable
- acknowlegement protocol

tree-based mechanism

overheads:
1. someone needs to be responsible to set up and maintain the tree
2. acknowledgement ACKs / NAKs
3. RMTP SRM

gossip protocol epidemic multicast

every round, pick random targets and send messages via UDP
may have duplicates(send messages back)
once become infected, do the same
push vs pull
- push: whenever have a new mutlicast message
- pull: periodically poll
  if node crashes, after coming back, and poll messages fault-tolerant
topology-aware gossip:
- to reduce router load
- fix: pick gossip target in its subnet

1024

time and ordering

host1, host2, host3
t1 t2 t3
each have local clocks(local frequencies)
within host1, different processes, all have same time t1
each process will order tasks it has
However, to order p1 from host1 and p2 from host2? Time Synchronization

time synchronization

external time synchronization:
- rely on external time source
- when off bounds, sychronize
- Cristian’s algorithm, NTP
- problem: unbounded latency in asynchronous system:
  1. network latency
  2. processing latency
internal time synchronization:
- between hosts’s processes, clock skew
- when off bounds, sychronize
- Berkeley algorithm
- problem: time drift from the UTC(correct time)
logical time synchronization

clock skew && clock drift

clock skew; distance
clock drift: speeds difference
non-zero clock skew: clocks are not synchronized
non-zero clock drift: the skew will increase

cristian’s algorithm

RTT: round-trip time
erro is at most (RTT-min2-min1)/2
still has problem:
bounded error => error rate more than zero
what if processes asynchronize within the bound? => won’t trigger the algorithm

Network Time Protocol

to reduce min1, min2
leaf nodes are clients
child has ts1, tr1, ts2, tr2, then calculate the offset = (tr1-tr2+ts2-ts1)/2
error: difference between the calcuated offset and the real offset
still have a non-zero error!! => still invalid in asynchronous distributed system

logical time, instead of absolute time

Leslie Lamport, Turing award
causality:
1. ordering within one process
2. send-recieve message between two processes
transitivity
lamport timestaps ts=max(local,msg)+1
timestamp doesn’t not determine order, only causal path does!
timestamps only matter if there is a causal path!
when there is no causal path? concurrent events
shortcoming: can’t tell where two events are concurrent or not

vector timestamps

n processes => n elements within the vector
each process maintains a vector
causality: iff VT1 <= VT2
concurrent (VT1 ||| VT2): !(VT1 <= VT2) && !(VT2 <= VT1)

1026

coordination coordinator/leader/introducer

receive client requrest => forward to appropriate host
handling of failure/join/leave

property/attribute: cpu, disk, ID, space
safety: get an election result
liveness: election terminates

ring election protocol consensus problem

elect highest id process as leader
received id == this.id => elected!!
whenever a new node joins, run the protocol
either become the new leader, or get the leader

fixing for failure

have predecessor/successor detect failure
use the failure detector

bully algorithm

premise: all knows all
handle failures better

google chubby

quorum: voting set
master lease: time duration

apache zookeeper

paxos: Zookeeper Atomic Broadcast

1031

transaction

series of operations executed by client
each operation is a RPC to a server
either commit or abort

atomicity

all or nothing; commit or abort
failure/exception: roll back(undone)

consistency

if the server starts in a consistent state, the transaction ends the server in a consistent state
Data is in a consistent state when a transaction starts and when it ends. For example, in an application that transfers funds from one account to another, the consistency property ensures that the total value of funds in both the accounts is the same at the start and end of each transaction.

isolation

scenario: concurrency of transactions
two transactions can not update the same object
one has to wait for the prior to commit

no access to intermediate result/states of other transactions
free from interference by operations of other transactions

durability

transactions’effects are saved in permanent storage

multiple servers within one transaction – distributed transaction

solution: 2-phase commit

lost update problem

depending on how the server schedules, one of the update wil be lost
cause of problem: inconsistency, simultaneous access of a shared&unsynchronized object; write-write conflict

inconsistent retrieval problem

concurrent transactions

transaction per second
goal: increase concurrency while maintaining correctness(ACID)

serial & interleaving schedule

serializability: T1,T2 == T1.T2 or T2.T1
why: serial executions are most consistent

conflicting operations

same object:
1. read write
2. write read
3. write write
serially equivalent

prevent violation of isolation

pessimistic
1. exclusive locking
  1. before read/write, lock(O)
  2. either enter the lock or wait
  3. when done, unlock(O)
  - improvement: exclusive locking is an overkill for read
2. another approach: read/write modes
  - read_lock(O), write_lock(O), unlock(O)
optimistic
1. check commit time/serial equivalence

two-phase locking protocol

guarantee serial equivalence

growing phase: only acquires or promotes lock
shrinking phase: only release lock

drawback: deadlock
detect deadlocks in a wait-for graph

3 necessary conditions for deadlock

Some objects are accessed in exclusive lock modes
Transactions holding locks cannotbe preempted
There is a circular wait (cycle) in the Wait-for graph

Coffman conditions if all four simultaneously => deadlock

mutual exlusion resources
hold and wait processes
no preemption process holding resources
cirular wait

optimistic lock

example: dropbox
first-cut
ordering
multi-version concurrency control: svn

1102

distributed system mutual exclusion

ATM
avoid WW conflicts, serial equivalence: false
distributed file systems
accessing objects in a safe and consistent way
server cooridination
in industry, Chubby, Apache Zookeeper
piece of code we ensure that there is at most one process executing it at any point of time
critical section of the process
1. enter()
2. accessResource()
3. exit()

single OS: process synchronization

semaphore
- signalling mechanism
- an integer variable S, initialized with the number of resources available
- only two functions can change its value:
  1. wait():
  - if S>0: S–, accessResource()
  - if S==0: wait
  1. signal(): S++
mutex
- mutual exclusion object
- locked-based technique
- lock, use, release

need to guarantee 3 properties

safety(essential)
at most one at a time
liveness(essential)
all request for a CS is granted eventually
ordering(desirable)

distributed system: central solution

elect a central master(leader)
master keeps:
1. a queue of waiting requests from processes
2. a special token which allows its holder to access CS
any process in group:
1. enter()
  - send a request to master
  - wait for token from master
2. exit()
  - send back token

analyzing performance metrics

bandwidth
client delay
synchronization delay

ring based mutual exlusion

safety: yes
liveness: yes
ordering: no
the nature of ring topology

Ricart-Agrawala’s Algorithm

premise: reliable all-to-all communication
no token
uses the notion of causality and multicast
send messages of <timestamp, process id>
lowing waiting time to enter CS
state
- wanted
- held
- released

Maekawa’s Algorithm

voting set(quorum)
safety but not liveness

1107

replication control

example: airline booking system
one client, query which one of multiple servers
multiple clients, coordinations; concurrency

why replication => higher availability

fault-tolerance
load balancing

availability vs reliability

availability:
1. The percentage of time that the infrastructure, system, or solution is operational under normal circumstances.
2. Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time
reliability:
1. The probability that the system will meet certain performance standards and yield correct output for a specific time.
2. Mean Time Betwen Failures
3. MTBF = (total elapsed time – sum of downtime)/number of failures
The measurement of Availability is driven by time loss whereas the measurement of Reliability is driven by the frequency and impact of failures.
Mathematically, the Availability of a system can be treated as a function of its Reliability. In other words, Reliability can be considered a subset of Availability.

maintain two properties

replication consistency
- All clients see single consistent copy of data, in spite of replication
- For transactions, ACID
- achieved by passive/active replication
replication transparency
- transparent to a single client
- achieved by FEs

replication consistency

two ways to forward updates from front-ends to replica group
1. passive replication
  1. leader election, select a primary replica (master)
  2. FEs send request to the master
2. active replication
  - treats all replicas indentically
both approaches: replicated state machine

passive replication

election: ring, bully, Google Chubby

active replication

FEs send multicast requests: gossip
multicast ordering:
1. FIFO
2. causal ordering, Lamport timestamp
3. total ordering
  - object-based: timestamp, size, etc
4. hybrid ordering

one-copy serializability

a concurrent execution of transactions equivalent to a serial execution of these transactions
one-copy: results should be consistent/the same: doesn’t matter if I execute a, or b, or both
if non-replication: just order the concurrent transactions; ensure correctness
when replicated:

distributed transactions

transaction: sequence of R/W operations
main challenge: maintain ACID, especially Atomiciy
atomic commit problem:
1. commit itself is distributed
2. make sure commit itself is atomic

one-phase commit not usable

special server called coordinator, which initiates atomic commit
tell other servers to either commit or abort
why not usable
1. one-way communication
  server has no say(can not time out), just coordinator telling servers what to do
2. server may crash before receiving commit message

two-phase commit

coordinator prepared to save
servers respond with yes/no
coordinator decides whether commit or abort

two-phase commit failures:

server crashes
save tentative updates into permanent storage right before replying yes/no
coordinator crashes:
log all decisions and received messages on disk
message loss: pessimistic
1. prepare message loss
2. yes/no message loss
3. commit/abort

1109

distributed systems snapshots

example:
transactions, take a global snapshot to construct the weight-for graph
distributed system global snapshot:
1. process state
2. communiation channel
  two main parts of distributed system: hosts and communications

uses of having a global picture

checkpointing: can restart distributed application on failure
garbage collection: objects at servers that don’t have any other objects at any servers with pointers to them
deadlock detection: useful in database transaction systems
termination of computation: useful in batch computing systems

how to take a global snapshot

obvious first solution
1. synchronize clocks of all processes
2. then ask all processes to record their states at known time t
problems:
1. time synchronization always has error clock skew
2. also, doesn’t record the state of messages in the channels
again: synchronization is not required! causality is enough!
(global) state to state movement obeys causality
1. process receives message
2. process sends message
3. proces takes a step

system model

requirements

snapshot should not interfere with normal applcation actions/sending messages
each process is able to record its own state
its heap, registers, program counter, code, etc (coredump)
global state is collected in a distributed manner
any process may initiate the snapshot

Chandy-Lamport global snapshot algorithm

initiator process records its own state
create and send Marker messages to all other processes via outgoing channels
turn on recording on all incoming channels
other processes start recording on each of the incoming channels
- if first marker message:
  1. record its own state
  2. mark incoming channel as emtpy (<>)
  3. turn on recording on other incoming channels
  4. send out markers
- already seen a marker message:
  1. k: sender; i: receiver
  2. mark the state of channel Cki as all the messages that have arrived on it since recording was turned on for Cki

consistent cut

cut: time frontier at each process and at each channel
- in the cut
- out of the cut
consistent cut: a cut that obeys causality
event e in cut C; f happens before e (f->e)
then event f is also in the cut C
any run of the Chandy-Lamport gloabal snapshot algorithm creates a consistent cut

correctness in distributed systems

liveness:
guarantee that something good will happen, eventually
eventually: does not imply a time bound, but if you let the system run long enough, then…
- examples:
  1. distributed computation: guarantee that it will terminate
  2. completeness in failure detectors: every failure is eventually detected by some non-faulty process
  3. consensus (election, Google Chubby): all processes eventually decide on a value
safety:
guarantee that something bad will never happen
- examples:
  1. there is no deadlock in a distributed system
  2. no object is orphaned in a distributed system
  3. accuracy in failure detectors
  4. in consensus: no two processes decide on different values

stable properties

stable: once true, stays true forever afterwards
all stable properties can be detected using the Chandy-Lamport algorithm

1114

file system

higher level of abstraction
file contents:
1. header:
  1. timestamps
  2. file type
  3. ownership
  4. access control list
  5. reference count: number of directories containing this file
2. blocks
directory

unix file system local methods

file descriptors
1. open
2. creat
3. close
read-write pointer
1. read
2. write
3. lseek
status
1. link
2. unlink
3. stat/fstat

distributed file systems

transparency
support concurrent clients
replcation: fault-tolerance

idempotent
stateless

security

autentication
authorization:
1. Access Control Lists, per file
2. Capability Lists, per user

why DFS not using file descriptors?

transparency: client should not know the absolute position of the file
confusion: a file may be replicated across multiple servers, each having a file descriptor

Network File System

virtual file system: provide transparency
- local unix file system, if file is local
- NFS client system, if file is not local
  - do RPCs
server: exportfs
client: mount
- not clone, but hook(point to)
- unlink dropbox, where file is local

why not clone?

replcation: consistency

virtual file system module

allow processes to access files via file descriptors
transparency: local and romote files look the same
“NFS file handles”
v-node

server optimizations

server caching: locality of access
writes: two flavors
1. delayed write:
- write in memory, flush to disk every 30s
- non-blocking
- fast but not consistent
1. write-through
- blocking
- consistent but may be slow(latency)

client optimization

client caching
cache recently-accessed blocks
when block is written, do a delayed-write to server
compromise between consistency and efficiency

AFS

whole file serving
whole file caching

1116

security threats

leakage
tampering
vandalism

common attacks

eavesdropping: taps into network
masquerading: pretend
message tampering: modify
replay attack
denial of service: bombard a port

CIA properties

confidentiality
integrity
availability

separate policy from mechanism

policy: what’s the goal
mechanism: how the goal is accomplished

Mechanisms: Golden A’s

authentication:
is a user claiming to be Alice, really Alice?
authorization:
Yes, it is Alice, but is Alice allowed to perfrom the requested operation?
auditing:
How did Eve manage to attack the system and breach defenses?
Usually done by continuously logging all operations.
No attack should go unnoticed!

designing secure systems

don’t know how powerful attacker is
when designing a secruity protocol, need to:
1. specify Attacker Model
  capabilities of attacker; tied to reality, no imagination
  - study
2. design security mechanisms to satisfy policy under the attacker model
  - engineers implement the mechanisms
3. prove that mechanisms satisfy policy under attacker model
  - a second organization may come into play, try to attack
4. measure effect on overall performance in the common ase, i.e., no attacks
  compare:
  - no-security-control throughput
  - throughput with security control
security is not a one-time deal!

basic cryptography

key:
- sequence of bytes assigned to a user
- can be used to lock a message, and only this key can be used to unlock that locked message

encryption

encrypted message = encrypt(original message, key)
decrypted message = decrypt(encrypted message, key)

two cryptography systems

symmetric key systems: DES
- shared keys
- hard to revoke permissions from principals
public-private key systems: RSA, PGP
- private key
- public key: known to all
- anything encrypted with Kapriv can be decrypted only with Kapub
- anything encrypted with Kapub can be decrypted only with Kapriv
- relatively costly: 2 operations, at least one is costly
combination: use public-private key system to generate shared key

authentication

direct authentication
indirect authentication: uses a trusted third-party server, authentication server

digital signatures

between two parties
SHA-1, MD-5

digital certificates

transitivity: chains of certificates
trace back

authorization

Access Control Matrix
Access Control Lists, per file/object
Capability Lists, per user/principal