0%

COEN317

COEN 317 Distributed system

09/21

layer tier

  • layer: software
  • tier: hardware

layerd architecture style

  • separation of concerns
  • unnecessary transaction
  • grey line, hard to decide which layer to put

event based architecture styles

  • dynamic
  • at the beginning, each component don’t know who to communicate to
  • after that, can point-to-point

middleware

  • libararies, runtime environment
  • JVM

blocking

  • sent/transmission != reach the destination

  • blocking: suspended until transmission finishes

  • transmission: IT DEPENDS!

    1. sending message buffer

    2. channel

    3. receiving messsage buffer

persistent communication

  • message queue
  • need to store data

asynchronous

  • synchronous: wait unitl delivery, not process of the data

soft state

cache

09/26

  • application-specific protocol

  • application-independent protocol:

    middleware

  • TCP/IP protocol:

    local OS

why layered?

separation of concerns

OSI reference model

  1. application

  2. presentation

    • process information for applications
    • serialization/deserialization
    • encryption/decryption
  3. session

  4. transport

    • end-to-end communication between different applications
  5. network

    • routing packets
  6. data link

    • send, detect and corrent data frames
    • data frames: units of bits
  7. physical layer

    • actual transfer of binary bits data

TCP/IP protocol suite

  1. application

    • HTTP, FTP, SMTP
    • process-to-process
  2. transport

    • TCP, UDP
    • host-to-host
  3. internet

    • IP

    • routing table

      host a-> router -> router -> host b

  4. link

TCP UDP

  • TCP

    data in streams

  • UDP

    data in messages

remote procedure calls

remote invocation

  • one-way RPC
  • asynchronous RPC

remote object calls

message brokers

  • between clients and servers

  • beyond simple queues

  • message will be persistent

socket APIs

  • transport-level socket programming via socket interface
  • message-oriented transient communication at the transport layer

Messaging-passing interface

  • message-oriented transient communication at the application layer

queuing model

  • message-oriented persistent communication

  • asynchronous/time uncoupling

  • advanced message queuing protocol

    rabbitMQ

0928

why some inaccuracy is tolerable?

  1. more than one failure detection mechanism
  2. in heart-beating, after some times it will correct itself

atomic > reliable

  • atomic: all or none receive the message overkill
  • reliable: we can tolerate some inaccuracy

centralized heart-beating out of consideration!

  • single point / performance
  • too much workload for the center node
  • can not be dynamic: the center needs to know the global state
  • saturated

ring heart-beating valid

  • predecessor and successor
  • what if 2 nodes fail? stablization protocol
  • detection is too slow

all-to-all heart-beating valid

  • full of unnecessary message
  • can not be dynamic

false positive / false negative

  • inaccurate
  • incomplete! intolerable

local buffers processes and channels

  • copy message to outgoing message buffer send
  • copy message from incoming message buffer receive
  • latency: in channel and in operation system

process failure / channel failure

1003

virtualization

  • Virtual Machine Memory = Hypervisor
  1. platform virtualization, hosted VM
    shared OS

    1. hardware
    2. host OS
      Patch, upgrade together
    3. hypervisor, virtualizing the OS
    4. applications, limited(no kernel module)
      example: Linux container
      host OS adminstrater sees everything! not fully isolated!
      reboot OS? all down!
  2. full virtualization, native VM
    independent guest OS

    1. hardware
    2. hypervisor, virtualizing the hardware
    3. virtual hardwares, including VCPU, VNIC
    4. guest OSs(Windows, Linux, MacOS)
    5. applications
      fully isolated!
      less scalable
      better performance
  3. combined
    within each guest OS, platform virtualization(virtualzing some guest OS)

network address translation

NIC: Network Interface Control
NAT: Network address translation
hardware: MAC, Media Access Control
VMs: MACs
query NAT table for every connection
improvement: SRIOV

SRIOV

integrate NAT into NIC
PCIe: Peripheral Component Interconnect express chip-to-chip interconnect
Ethernet: via cables system-to-system

network bridging

poor scalability
bridge over the hypervisor for better performance

host only networking

inter instance communication, VM to VM

1005

P2P exmaple: Napster

  • server
    1. stores no files
    2. maintains a table
    3. response with where to download the music(IP address, port)
  • client stores their own data
  • server search: ternary tree algorithm
  • problem:
    1. infringement of copyright
    2. security : each peer exposes a port; clear-text communication
    3. fault-tolerance: centralized server, centralized algorithm
      single point of failure

Distributed Hash Table

  • objects: files

  • bucket: nodes

  • hash function: consistent hashing hash ring
    SHA-1(filename IP address, port) => 160-bit string(key) m bits
    may use leader-election algorithm

  • routing mechanism:

    • routing: all operations, including insert, lookup and delete
    • each node knows its successor and predecessor node id: 0~pow(2,m)-1
    • each node computes a finger table(routing table, nodes it knows)
    • if doesn’t have the file(after local search), querying, each step halves the distance O(log(N))
      1. go(forward) to the farthest node in finger table, but no crossing the key
      2. if none exist, send query to successor
    • all query steps are RPCs
  • search under peer failures

    • one solution: maintain (r) more than one successor entries instead of one
      • if one query fails, go to the next successor
      • sender’s finger table not yet corrected
    • one solution: replicate file/key at r successors and predecessors
  • new peers joining (dynamic changes churn)

    • stablization protocol followed by all nodes stablization algorithm
      • periodically check and update
      • ensure non-loopiness
    • copy keys and update finger table
    • bandwidth cost busy network
    • alternatives: store and replicate only pointers to files(meta info)

1010

naming in distributed systems

  • same name; unique identifier

  • address = identifer + access point

  • context is import!

    • global
    • local
  • dynamic or static binding?

  • identifier: IP and MAC address Media Access Control
  • arp -a every node in the network will broadcast their id

independent nodes P2P system

  • identifier: hash to 160-bits-string, then truncate it to m bits(name space)
  • name resolution: distributed hash tables
  • shaw-one algorithm

flat naming

  • no structure, just plain bits
  • ip address

structured naming

  • directory
  • file system:
    • alias: short, multiple names
    • multiple path to one file

centralized management generic problems

  • global state: no one can know all
  • transaction for second
  • single point of failure
  • hard links:

    • When adding one hard link, this file’s reference count++
    • must delete reference count before deleting
  • symbolic links:

    • When adding one link, the file will store the meta data pointing to the original file
    • linux: ln -s

mounting

  • nfs protocol
  • remote server B: share file/directory
    • linux: exportfs /home/steen
  • local server A: mount remote shared file/directory locally
    • linux: mount B.ip:/home/steen /remote/vu
    • linux: cd /remote/vu

name resolution

  • client send query to name servers
    host query DNS server

  • linux:

    • the host knows where to query in DNS
      /etc/reslve.conf
    1. nslookup www.scu.edu
    2. dig
  • iterative

  • recursive

  • redundancy and caching server

    • speed
    • fault tolerance
  • caching on the client side: local browser name resolver

attribute-based naming

  • tuple/triple: key, value/set of values
  • Lightweigh Directory Access Protocol
  • when sending email, client query LDAP server
    west coast, east coast, europe, …
  • Directory Information Base data sharding

1017

gossip multicast

  • fault tolerance
  • scalability sending messages fast

centralized approach

  • performance bottleneck not scalable
  • single point of failure not reliable
  • communication channel may be unreliable
    • acknowlegement protocol

tree-based mechanism

  • overheads:
    1. someone needs to be responsible to set up and maintain the tree
    2. acknowledgement ACKs / NAKs
    3. RMTP SRM

gossip protocol epidemic multicast

  • every round, pick random targets and send messages via UDP

  • may have duplicates(send messages back)

  • once become infected, do the same

  • push vs pull

    • push: whenever have a new mutlicast message
    • pull: periodically poll
      if node crashes, after coming back, and poll messages fault-tolerant
  • topology-aware gossip:

    • to reduce router load
    • fix: pick gossip target in its subnet

1024

time and ordering

  • host1, host2, host3
  • t1 t2 t3
  • each have local clocks(local frequencies)
  • within host1, different processes, all have same time t1
  • each process will order tasks it has
  • However, to order p1 from host1 and p2 from host2? Time Synchronization

time synchronization

  1. external time synchronization:
    • rely on external time source
    • when off bounds, sychronize
    • Cristian’s algorithm, NTP
    • problem: unbounded latency in asynchronous system:
      1. network latency
      2. processing latency
  2. internal time synchronization:
    • between hosts’s processes, clock skew
    • when off bounds, sychronize
    • Berkeley algorithm
    • problem: time drift from the UTC(correct time)
  3. logical time synchronization

clock skew && clock drift

  • clock skew; distance
  • clock drift: speeds difference
  • non-zero clock skew: clocks are not synchronized
  • non-zero clock drift: the skew will increase

cristian’s algorithm

  • RTT: round-trip time
  • erro is at most (RTT-min2-min1)/2
  • still has problem:
    bounded error => error rate more than zero
    what if processes asynchronize within the bound? => won’t trigger the algorithm

Network Time Protocol

  • to reduce min1, min2
  • leaf nodes are clients
  • child has ts1, tr1, ts2, tr2, then calculate the offset = (tr1-tr2+ts2-ts1)/2
  • error: difference between the calcuated offset and the real offset
  • still have a non-zero error!! => still invalid in asynchronous distributed system

logical time, instead of absolute time

  • Leslie Lamport, Turing award
  • causality:
    1. ordering within one process
    2. send-recieve message between two processes
  • transitivity
  • lamport timestaps ts=max(local,msg)+1
  • timestamp doesn’t not determine order, only causal path does!
  • timestamps only matter if there is a causal path!
  • when there is no causal path? concurrent events
  • shortcoming: can’t tell where two events are concurrent or not

vector timestamps

  • n processes => n elements within the vector
  • each process maintains a vector
  • causality: iff VT1 <= VT2
  • concurrent (VT1 ||| VT2): !(VT1 <= VT2) && !(VT2 <= VT1)

1026

coordination coordinator/leader/introducer

  1. receive client requrest => forward to appropriate host
  2. handling of failure/join/leave
  • property/attribute: cpu, disk, ID, space

  • safety: get an election result

  • liveness: election terminates

ring election protocol consensus problem

  • elect highest id process as leader
  • received id == this.id => elected!!
  • whenever a new node joins, run the protocol
    either become the new leader, or get the leader

fixing for failure

  • have predecessor/successor detect failure
  • use the failure detector

bully algorithm

  • premise: all knows all
  • handle failures better

google chubby

  • quorum: voting set
  • master lease: time duration

apache zookeeper

  • paxos: Zookeeper Atomic Broadcast

1031

transaction

  • series of operations executed by client
  • each operation is a RPC to a server
  • either commit or abort

atomicity

  • all or nothing; commit or abort
  • failure/exception: roll back(undone)

consistency

  • if the server starts in a consistent state, the transaction ends the server in a consistent state
  • Data is in a consistent state when a transaction starts and when it ends. For example, in an application that transfers funds from one account to another, the consistency property ensures that the total value of funds in both the accounts is the same at the start and end of each transaction.

isolation

  • scenario: concurrency of transactions
  • two transactions can not update the same object
    one has to wait for the prior to commit
  1. no access to intermediate result/states of other transactions
  2. free from interference by operations of other transactions

durability

  • transactions’effects are saved in permanent storage

multiple servers within one transaction – distributed transaction

  • solution: 2-phase commit

lost update problem

  • depending on how the server schedules, one of the update wil be lost
  • cause of problem: inconsistency, simultaneous access of a shared&unsynchronized object; write-write conflict

inconsistent retrieval problem

concurrent transactions

  • transaction per second
  • goal: increase concurrency while maintaining correctness(ACID)

serial & interleaving schedule

  • serializability: T1,T2 == T1.T2 or T2.T1
  • why: serial executions are most consistent

conflicting operations

  • same object:

    1. read write
    2. write read
    3. write write
  • serially equivalent

prevent violation of isolation

  • pessimistic

    1. exclusive locking

      1. before read/write, lock(O)
      2. either enter the lock or wait
      3. when done, unlock(O)
      • improvement: exclusive locking is an overkill for read
    2. another approach: read/write modes

      • read_lock(O), write_lock(O), unlock(O)
  • optimistic

    1. check commit time/serial equivalence

two-phase locking protocol

  • guarantee serial equivalence
  1. growing phase: only acquires or promotes lock
  2. shrinking phase: only release lock
  • drawback: deadlock
  • detect deadlocks in a wait-for graph

3 necessary conditions for deadlock

  1. Some objects are accessed in exclusive lock modes
  2. Transactions holding locks cannotbe preempted
  3. There is a circular wait (cycle) in the Wait-for graph

Coffman conditions if all four simultaneously => deadlock

  1. mutual exlusion resources
  2. hold and wait processes
  3. no preemption process holding resources
  4. cirular wait

optimistic lock

  • example: dropbox
  • first-cut
  • ordering
  • multi-version concurrency control: svn

1102

distributed system mutual exclusion

  • ATM
    avoid WW conflicts, serial equivalence: false

  • distributed file systems
    accessing objects in a safe and consistent way
    server cooridination
    in industry, Chubby, Apache Zookeeper

  • piece of code we ensure that there is at most one process executing it at any point of time

  • critical section of the process

    1. enter()
    2. accessResource()
    3. exit()

single OS: process synchronization

  • semaphore

    • signalling mechanism

    • an integer variable S, initialized with the number of resources available

    • only two functions can change its value:

      1. wait():
      • if S>0: S–, accessResource()
      • if S==0: wait
      1. signal(): S++
  • mutex

    • mutual exclusion object
    • locked-based technique
    • lock, use, release

need to guarantee 3 properties

  1. safety(essential)
    at most one at a time
  2. liveness(essential)
    all request for a CS is granted eventually
  3. ordering(desirable)

distributed system: central solution

  • elect a central master(leader)
  • master keeps:
    1. a queue of waiting requests from processes
    2. a special token which allows its holder to access CS
  • any process in group:
    1. enter()
      • send a request to master
      • wait for token from master
    2. exit()
      • send back token

analyzing performance metrics

  • bandwidth
  • client delay
  • synchronization delay

ring based mutual exlusion

  • safety: yes
  • liveness: yes
  • ordering: no
    the nature of ring topology

Ricart-Agrawala’s Algorithm

  • premise: reliable all-to-all communication
  • no token
  • uses the notion of causality and multicast
    send messages of <timestamp, process id>
  • lowing waiting time to enter CS
  • state
    • wanted
    • held
    • released

Maekawa’s Algorithm

  • voting set(quorum)
  • safety but not liveness

1107

replication control

  • example: airline booking system
  • one client, query which one of multiple servers
  • multiple clients, coordinations; concurrency

why replication => higher availability

  1. fault-tolerance
  2. load balancing

availability vs reliability

  • availability:

    1. The percentage of time that the infrastructure, system, or solution is operational under normal circumstances.
    2. Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time
  • reliability:

    1. The probability that the system will meet certain performance standards and yield correct output for a specific time.
    2. Mean Time Betwen Failures
    3. MTBF = (total elapsed time – sum of downtime)/number of failures
  • The measurement of Availability is driven by time loss whereas the measurement of Reliability is driven by the frequency and impact of failures.

  • Mathematically, the Availability of a system can be treated as a function of its Reliability. In other words, Reliability can be considered a subset of Availability.

maintain two properties

  1. replication consistency

    • All clients see single consistent copy of data, in spite of replication
    • For transactions, ACID
    • achieved by passive/active replication
  2. replication transparency

    • transparent to a single client
    • achieved by FEs

replication consistency

  • two ways to forward updates from front-ends to replica group
    1. passive replication
      1. leader election, select a primary replica (master)
      2. FEs send request to the master
    2. active replication
      • treats all replicas indentically
  • both approaches: replicated state machine

passive replication

  • election: ring, bully, Google Chubby

active replication

  • FEs send multicast requests: gossip
  • multicast ordering:
    1. FIFO
    2. causal ordering, Lamport timestamp
    3. total ordering
      • object-based: timestamp, size, etc
    4. hybrid ordering

one-copy serializability

  • a concurrent execution of transactions equivalent to a serial execution of these transactions
  • one-copy: results should be consistent/the same: doesn’t matter if I execute a, or b, or both
  • if non-replication: just order the concurrent transactions; ensure correctness
  • when replicated:

distributed transactions

  • transaction: sequence of R/W operations
  • main challenge: maintain ACID, especially Atomiciy
  • atomic commit problem:
    1. commit itself is distributed
    2. make sure commit itself is atomic

one-phase commit not usable

  • special server called coordinator, which initiates atomic commit
  • tell other servers to either commit or abort
  • why not usable
    1. one-way communication
      server has no say(can not time out), just coordinator telling servers what to do
    2. server may crash before receiving commit message

two-phase commit

  1. coordinator prepared to save
  2. servers respond with yes/no
  3. coordinator decides whether commit or abort

two-phase commit failures:

  1. server crashes
    save tentative updates into permanent storage right before replying yes/no
  2. coordinator crashes:
    log all decisions and received messages on disk
  3. message loss: pessimistic
    1. prepare message loss
    2. yes/no message loss
    3. commit/abort

1109

distributed systems snapshots

  • example:
    transactions, take a global snapshot to construct the weight-for graph
  • distributed system global snapshot:
    1. process state
    2. communiation channel
      two main parts of distributed system: hosts and communications

uses of having a global picture

  1. checkpointing: can restart distributed application on failure
  2. garbage collection: objects at servers that don’t have any other objects at any servers with pointers to them
  3. deadlock detection: useful in database transaction systems
  4. termination of computation: useful in batch computing systems

how to take a global snapshot

  • obvious first solution

    1. synchronize clocks of all processes
    2. then ask all processes to record their states at known time t
  • problems:

    1. time synchronization always has error clock skew
    2. also, doesn’t record the state of messages in the channels
  • again: synchronization is not required! causality is enough!
    (global) state to state movement obeys causality

    1. process receives message
    2. process sends message
    3. proces takes a step

system model

requirements

  1. snapshot should not interfere with normal applcation actions/sending messages
  2. each process is able to record its own state
    its heap, registers, program counter, code, etc (coredump)
  3. global state is collected in a distributed manner
  4. any process may initiate the snapshot

Chandy-Lamport global snapshot algorithm

  • initiator process records its own state
  • create and send Marker messages to all other processes via outgoing channels
    turn on recording on all incoming channels
  • other processes start recording on each of the incoming channels
    • if first marker message:
      1. record its own state
      2. mark incoming channel as emtpy (<>)
      3. turn on recording on other incoming channels
      4. send out markers
    • already seen a marker message:
      1. k: sender; i: receiver
      2. mark the state of channel Cki as all the messages that have arrived on it since recording was turned on for Cki

consistent cut

  • cut: time frontier at each process and at each channel

    • in the cut
    • out of the cut
  • consistent cut: a cut that obeys causality
    event e in cut C; f happens before e (f->e)
    then event f is also in the cut C

  • any run of the Chandy-Lamport gloabal snapshot algorithm creates a consistent cut

correctness in distributed systems

  • liveness:
    guarantee that something good will happen, eventually
    eventually: does not imply a time bound, but if you let the system run long enough, then…
    • examples:
      1. distributed computation: guarantee that it will terminate
      2. completeness in failure detectors: every failure is eventually detected by some non-faulty process
      3. consensus (election, Google Chubby): all processes eventually decide on a value
  • safety:
    guarantee that something bad will never happen
    • examples:
      1. there is no deadlock in a distributed system
      2. no object is orphaned in a distributed system
      3. accuracy in failure detectors
      4. in consensus: no two processes decide on different values

stable properties

  • stable: once true, stays true forever afterwards
  • all stable properties can be detected using the Chandy-Lamport algorithm

1114

file system

  • higher level of abstraction
  • file contents:
    1. header:
      1. timestamps
      2. file type
      3. ownership
      4. access control list
      5. reference count: number of directories containing this file
    2. blocks
  • directory

unix file system local methods

  • file descriptors
    1. open
    2. creat
    3. close
  • read-write pointer
    1. read
    2. write
    3. lseek
  • status
    1. link
    2. unlink
    3. stat/fstat

distributed file systems

  1. transparency
  2. support concurrent clients
  3. replcation: fault-tolerance
  • idempotent
  • stateless

security

  1. autentication
  2. authorization:
    1. Access Control Lists, per file
    2. Capability Lists, per user

why DFS not using file descriptors?

  1. transparency: client should not know the absolute position of the file
  2. confusion: a file may be replicated across multiple servers, each having a file descriptor

Network File System

  • virtual file system: provide transparency

    • local unix file system, if file is local
    • NFS client system, if file is not local
      • do RPCs
  • server: exportfs

  • client: mount

    • not clone, but hook(point to)
    • unlink dropbox, where file is local

why not clone?

  • replcation: consistency

virtual file system module

  • allow processes to access files via file descriptors
  • transparency: local and romote files look the same
  • “NFS file handles”
  • v-node

server optimizations

  • server caching: locality of access

  • writes: two flavors

    1. delayed write:
    • write in memory, flush to disk every 30s
    • non-blocking
    • fast but not consistent
    1. write-through
    • blocking
    • consistent but may be slow(latency)

client optimization

  • client caching
  • cache recently-accessed blocks
  • when block is written, do a delayed-write to server
  • compromise between consistency and efficiency

AFS

  • whole file serving
  • whole file caching

1116

security threats

  1. leakage
  2. tampering
  3. vandalism

common attacks

  1. eavesdropping: taps into network
  2. masquerading: pretend
  3. message tampering: modify
  4. replay attack
  5. denial of service: bombard a port

CIA properties

  • confidentiality
  • integrity
  • availability

separate policy from mechanism

  • policy: what’s the goal
  • mechanism: how the goal is accomplished

Mechanisms: Golden A’s

  • authentication:
    is a user claiming to be Alice, really Alice?
  • authorization:
    Yes, it is Alice, but is Alice allowed to perfrom the requested operation?
  • auditing:
    How did Eve manage to attack the system and breach defenses?
    Usually done by continuously logging all operations.
    No attack should go unnoticed!

designing secure systems

  • don’t know how powerful attacker is
  • when designing a secruity protocol, need to:
    1. specify Attacker Model
      capabilities of attacker; tied to reality, no imagination
      • study
    2. design security mechanisms to satisfy policy under the attacker model
      • engineers implement the mechanisms
    3. prove that mechanisms satisfy policy under attacker model
      • a second organization may come into play, try to attack
    4. measure effect on overall performance in the common ase, i.e., no attacks
      compare:
      • no-security-control throughput
      • throughput with security control
  • security is not a one-time deal!

basic cryptography

  • key:
    • sequence of bytes assigned to a user
    • can be used to lock a message, and only this key can be used to unlock that locked message

encryption

  • encrypted message = encrypt(original message, key)
  • decrypted message = decrypt(encrypted message, key)

two cryptography systems

  1. symmetric key systems: DES

    • shared keys
    • hard to revoke permissions from principals
  2. public-private key systems: RSA, PGP

    • private key
    • public key: known to all
    • anything encrypted with Kapriv can be decrypted only with Kapub
    • anything encrypted with Kapub can be decrypted only with Kapriv
    • relatively costly: 2 operations, at least one is costly
  3. combination: use public-private key system to generate shared key

authentication

  • direct authentication
  • indirect authentication: uses a trusted third-party server, authentication server

digital signatures

  • between two parties
  • SHA-1, MD-5

digital certificates

  • transitivity: chains of certificates
  • trace back

authorization

  • Access Control Matrix

  • Access Control Lists, per file/object

  • Capability Lists, per user/principal