I am following the Spring 2022 offering.

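At first I was surprised that the CS self-learning guide (CS自学指南) estimates 200 hours for MIT6.824: Distributed System. After a few days of study I finally understand why it takes so long: setting aside the difficulty of the projects, almost every lecture comes with a paper of more than 15 pages, which takes a lot of time to read. I read a paper over and over and still come away with only a partial understanding.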

LEC 4: Primary-Backup Replication

video

2 approaches:

  • state transfer
  • replicated state machine

The Design of a Practical System for Fault-Tolerant Virtual Machines

Abstract

1 Introduction

Two ways to replicate servers with the primary/backup approach:

  • ship changes to all state of the primary (CPU, memory, and I/O devices) to the backup: needs large bandwidth, particularly for changes in memory
  • state machine approach:
    • far less extra information is needed to keep the primary and backup in sync
    • low bandwidth
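
To make the two approaches concrete, here is a minimal sketch in Python. It is not the paper's implementation (VMware FT replicates a whole virtual machine); the `KVStateMachine` class, the operation format, and both sync helpers are hypothetical names made up for illustration. It only shows that a deterministic state machine reaches the same state whether the backup receives a copy of the full state or just the ordered operations.

```python
import copy


class KVStateMachine:
    """Hypothetical deterministic state machine: the same operations applied
    in the same order always produce the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "append":
            self.state[key] = self.state.get(key, "") + value
        else:
            raise ValueError(f"unknown op: {kind}")


def sync_by_state_transfer(primary, backup):
    # State transfer: ship the primary's whole state to the backup.
    # The cost grows with the size of the state.
    backup.state = copy.deepcopy(primary.state)


def sync_by_replicated_ops(backup, ops):
    # Replicated state machine: ship only the operations and let the
    # backup re-execute them. The cost grows with the size of the inputs.
    for op in ops:
        backup.apply(op)


ops = [("put", "x", "1"), ("append", "x", "2"), ("put", "y", "hello")]

primary = KVStateMachine()
for op in ops:
    primary.apply(op)

backup_a = KVStateMachine()
sync_by_state_transfer(primary, backup_a)

backup_b = KVStateMachine()
sync_by_replicated_ops(backup_b, ops)

print(backup_a.state == primary.state, backup_b.state == primary.state)
```

The second approach only works if every operation is deterministic, which is why the paper spends so much effort on deterministic replay.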

This paper only attempts to deal with fail-stop failures, which are server failures that can be detected before the failing server causes an incorrect externally visible action.

2 Basic FT Design

2.1 Deterministic Replay Implementation

2.2 FT Protocol

2.3 Detecting and Responding to Failure

3 Practical Implementation of FT

3.1 Starting and Restarting FT VMs

3.2 Managing the Logging Channel

3.3 Operation on FT VMs

3.4 Implementation Issues for Disk IOs

3.5 Implementation Issues for Network IO

4 Design Alternatives

4.1 Shared vs. Non-shared Disk

4.2 Executing Disk Reads on the Backup VM

5 Performance Evaluation

5.1 Basic Performance Results

5.2 Network Benchmarks

7 Conclusion and Future Work

LEC 5: Fault Tolerance: Raft(1)

Video:

Majority vote: the majority is counted out of all of the servers, not just the servers that are currently alive.

e.g. 2 of 3

With 2f+1 servers, the system can tolerate f failures and still keep going (quorum systems).

Any two majorities overlap in at least one server. Raft relies on this property to avoid split brain.
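
The majority arithmetic can be checked with a small Python sketch. The helper names `majority`, `tolerated_failures`, and `all_majorities_overlap` are my own, not from the lecture or the Raft paper. The overlap check enumerates only minimum-size majorities, which is enough because any larger majority contains one of them.

```python
from itertools import combinations


def majority(n):
    """Smallest number of servers that is more than half of all n servers."""
    return n // 2 + 1


def tolerated_failures(n):
    """Number of failed servers that still leaves a majority alive."""
    return (n - 1) // 2


def all_majorities_overlap(n):
    """Brute-force check that every pair of minimum-size majorities
    shares at least one server."""
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a, b in combinations(quorums, 2))


for n in (3, 5, 7):
    print(n, majority(n), tolerated_failures(n), all_majorities_overlap(n))
```

For n = 2f+1 this gives a majority of f+1 and f tolerated failures, matching the "2 of 3" example.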

In Search of an Understandable Consensus Algorithm

LEC 7: Fault Tolerance: Raft(2)