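I followed the Spring 2022 offering.
At first I was surprised that CS自学指南 (the CS self-study guide) estimates 200 hours for MIT 6.824: Distributed Systems. After a few days of study I finally understood why it takes so long: leaving aside the difficulty of the projects, the paper assigned for nearly every lecture runs over 15 pages and takes a lot of time to read. I really do read each paper over and over, and still come away with only a partial understanding.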
LEC 4: Primary-Backup Replication
video
2 approaches:
- state transfer
- replicated state machine
The Design of a Practical System for Fault-Tolerant Virtual Machines
Abstract
1 Introduction
Two ways to replicate servers with the primary/backup approach:
- ship changes to all state of the primary to the backup: needs large bandwidth (particularly for changes in memory)
- state machine approach:
  - less extra information is needed to keep the primary and backup in sync
  - hence low bandwidth
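To make the two approaches concrete, here is a minimal sketch in Python. It is not how VMware FT works internally; `KVCounter`, `sync_by_state_transfer`, and `sync_by_replicated_ops` are hypothetical names invented for illustration, and the "state machine" is just a dictionary of counters that changes only through deterministic operations.

```python
class KVCounter:
    """Toy deterministic state machine: state changes only through apply()."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # An operation is a (key, delta) pair; applying it is deterministic.
        key, delta = op
        self.state[key] = self.state.get(key, 0) + delta


def sync_by_state_transfer(primary, backup):
    # State transfer: copy the primary's entire state to the backup.
    # The cost grows with the size of the state.
    backup.state = dict(primary.state)


def sync_by_replicated_ops(backup, ops):
    # Replicated state machine: send only the operations (inputs) and let
    # the backup re-execute them in the same order.
    for op in ops:
        backup.apply(op)


ops = [("x", 1), ("y", 5), ("x", 2)]

primary = KVCounter()
for op in ops:
    primary.apply(op)

backup_a = KVCounter()
sync_by_state_transfer(primary, backup_a)

backup_b = KVCounter()
sync_by_replicated_ops(backup_b, ops)
```

Both backups end up with the same state as the primary. The second approach only works because every operation is deterministic and applied in the same order on both sides, which is why the paper has to deal with non-deterministic events separately.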
This paper only attempts to deal with fail-stop failures, which are server failures that can be detected before the failing server causes an incorrect externally visible action.
2 Basic FT Design
2.1 Deterministic Replay Implementation
2.2 FT Protocol
2.3 Detecting and Responding to Failure
3 Practical Implementation of FT
3.1 Starting and Restarting FT VMs
3.2 Managing the Logging Channel
3.3 Operation on FT VMs
3.4 Implementation Issues for Disk IOs
3.5 Implementation Issues for Network IO
4 Design Alternatives
4.1 Shared vs. Non-shared Disk
4.2 Executing Disk Reads on the Backup VM
5 Performance Evaluation
5.1 Basic Performance Results
5.2 Network Benchmarks
6 Related Work
7 Conclusion and Future Work
LEC 5: Fault Tolerance: Raft (1)
Video:
Majority vote: a majority out of all of the servers, not just the servers that are alive.
e.g. 2 of 3
2f+1 servers -> can tolerate f failures and still keep going (quorum systems)
Any two majorities overlap in at least one server
-> Raft relies on this overlap to avoid split brain.
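The majority arithmetic can be checked with a small Python sketch. The function names here (`majority`, `tolerated_failures`, `all_majorities_overlap`) are made up for this note and are not part of Raft or the course labs; the overlap check simply brute-forces every pair of minimum-size majorities.

```python
from itertools import combinations


def majority(n):
    # Smallest number of servers that is more than half of ALL n servers,
    # counting failed servers too.
    return n // 2 + 1


def tolerated_failures(n):
    # With n = 2f + 1 servers, f of them can fail and a majority remains.
    return (n - 1) // 2


def all_majorities_overlap(n):
    # Enumerate every minimum-size majority and check that each pair shares
    # at least one server. Larger majorities are supersets of these, so they
    # overlap as well.
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a, b in combinations(quorums, 2))


for n in (3, 5, 7):
    print(n, majority(n), tolerated_failures(n), all_majorities_overlap(n))
```

The overlap holds because two sets that each contain more than half of the servers cannot be disjoint. That shared server is what prevents two sides of a partition from both collecting a majority.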