豆瓣🔗:
atomic
In the context of concurrent operations
: describing an operation that appears to take effect at a single point
in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation.
In the context of transactions
: grouping together a set of writes that must either all be committed or all be rolled back
, even if faults occur.
REST
REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration, and is often associated with microservices. An API designed according to the principles of REST is called RESTful.
message broker/message queue/message-oriented middleware
Using a message broker has several advantages compared to direct RPC:
It can act as a buffer
if the recipient is unavailable or overloaded, and thus improve system reliability.
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
It allows one message to be sent to several recipients. It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
References
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann
Reading Notes:
1、
- 数据密集:Big-Data
- 计算密集:Deeping-Learning、HPC(High-Performance-Computing)
2、FOSS: Free and Open Source Software
3、back pressure(背压)=flow control(流量控制)
4、byzantine fault(拜占庭错误):转发任何类型(矛盾的or虚假的)的消息
5、failover: the leadership role from one node to another node
CH1
1、
Redis: datastores/message queues
Apache Kafka: message queues/database-like durability guarantees 持久化存储保证
2、
Memcached
Elasticsearch/Solr: full-text search server
main database
3、
- 水平拓展:单个节点——>分布式多机环境
- 无状态服务:容易
- 有状态服务:复杂
- 垂直拓展
CH2
1、网络、层次模型——>关系模型(RDBMS+SQL)
NoSQL:Not only SQL
- 比RDB更好的拓展需求:very large datasets, very high write throughput
- free and open source software over commercial database products
- specialized query operations that are not well supported by the rational model
- for a more dynamic and expressive data model
2、ORM(Objected-Reational Mapping frameworks): ActiveRecord, Hibernate
3、Document-oriented Database: MongoDB, RethinkDB, CouchDB, Espresso
无模式(读时模式):数据的结构是隐式的,只有在读取时才解释。类似于:动态(运行时)类型检查
写时模式(rational database):模式是显式的,确保数据写入时都必须遵循。类似于:静态(编译时)类型检查
4、property graph: Neo4j, Titan, InfiniteGraph
CH3
Chapter 3 数据存储与检索
storage engines
存储引擎:
-
log-structured ~: 日志结构
- 仅支持追加式更新的log
-
page-oriented ~: 面向页 B-tree
the hash table’s limitations:
- the hash table must fit in memory:哈希表必须全部放入内存。
- it is difficult to make an on-disk hash map perform well
- a lot of random access I/O
- expensive to grow when it becomes full
- hash collisions require fiddly logic
- Range queries are not efficient:只能采用逐个查找的方式查询每一个键。
SSTable
SSTable
- Sorted String Table 排序字符串表
- 按key-value对的顺序按键key排序
LSM-Tree
Log-Structured Merge-Tree(LSM-Tree) 日志结构的合并树
LSM存储引擎:基于合并和压缩排序文件原理的存储引擎。
Lucene
- an indexing engine for full-text search used by Elasticsearch and Solr 索引引擎
- key: 单词(词条)
- value: 所有包含该单词的文档ID的列表(倒排表)
Bloom filters: memory-efficient
data structure for approximating the content of a set: if a key does not appear in the database.
B-Tree
-
fixed-size blocks/pages
-
read/write one page at a time
-
individual keys(a leaf page):
-
inline
-
references to the pages where the values can be found
-
-
an additional data structure on a disk: a write-ahead log(WAL,redo log) 预写日志/重做日志(仅支持追加)
分支因子:B-Tree中一个页所包含的子页引用数量
B-Tree: faster for reads
- 先写WAL
- 后修改树本身的页(还可能发生页分裂)
LSM-Tree: faster for writes
Index
- clustered index: storing
all row data
within the index 在索引中直接存储行数据 - nonclustered index: storing
only references to the data
with the index 仅存储索引中数据的引用
View
- materialized view: 查询结果的实际副本,并被写到磁盘。
- standard(virtual) view:编写查询的快捷方式。从虚拟存储中读取时,SQL引擎将其动态拓展到视图的底层查询,然后处理拓展查询。
OLTP(Online Transaction Processing)
- log-structured school: BitCask, SSTable, LSM-Tree, LevelDB, Cassandra, HBase, Lucene
- update-in-place school: B-Tree(RDB, NoSQL)
CH4
1、
in-memory representation——>a byte sequence
- encoding
- serialization
- marshalling
a byte sequence——>in-memory representation
- decoding
- derialization
- unmarshalling
PART 2 Distributed Data
Chapter 5: Replication
数据复制的难点:需要复制的数据不是一成不变的,是持续变化的。
复制数据变化的3种发放:
- 主存复制
- 多主节点复制
- 无主节点复制
主存复制
- 只有主节点可以接受写请求
- 客户端角度,从节点都是只读的
同步复制与异步复制
半同步:某一个从节点是同步的,其他节点则是异步模式
全异步:
- 无法保证数据的持久化(主节点失败且不可恢复,复制到从节点的写请求会丢失)
- 主节点可以继续响应从节点,系统的吞吐性能更好
处理节点失效
从节点失效:追赶式恢复
主节点失效:节点切换
脑裂:两个从节点同时自认为是主节点
复制日志的实现
基于语句
不适用的场景:
- 任何调用
非确定性函数
的语句:NOW()、RAND()- 主节点在记录操作语句时,将
非确定性函
数替换为执行之后的确定性结果
- 但需要考虑很多边界条件
- 主节点在记录操作语句时,将
- 使用了
自增列
、依赖于DB的现有数据
的语句 - 有副作用的语句:触发器、存储过程、用户定义的函数
基于预写日志WAL
缺点:
- 日志描述的数据结果WAL非常底层:哪些磁盘块的哪些字节发生改变
- 复制方案和存储引擎紧密耦合
基于行的逻辑日志
逻辑日志
e.g., MySQL的二进制日志binlog
基于触发器
应用程序层进行复制控制
多主节点复制
无主节点复制
Chapter 9: 一致性与共识
可线性化(原子一致性、强一致性)
可线性化 Linearizability :
-
读写寄存器(单个对象)的最新值保证
-
例如:
- 两阶段加锁
- 实际以串行执行
- 关系型数据库中主键的约束,需要线性化保证;外键/属性约束,并不一定要求线性化
- 多核CPU上的内存:非线性化(除非使用了内存屏障/fence指令)
-
放弃线性化的原因:为了性能,而不是容错
可串行化 Serializability :
- 事务的隔离属性
- 用来确保事务执行的结果与串行执行的结果完全相同
CAP定理
“网络分区的情况下,选择一致还是可用”
Chapter 10: Batch Processing批处理系统
As usual, history has a tendency of repeating itself.
the ideas and lessons from Unix carry over to large-scale, heterogeneous distributed data systems.
UNIX的思想和经验普遍见于大规模、异构分布式数据系统。
Batch Processing with Unix Tools
Simple Log analysis简单日志分析
|
|
in a matter of seconds
在几秒钟内
Chains of commands versus custom program
a simple program to do the same thing
|
|
Sorting versus in-memory aggregation
排序 VS 内存聚合
排序:
-
UNIX流水线
-
job’s working set is larger than the available memory: make efficient use of disks 高效地使用磁盘
-
SSTables and LSM-Tables
- chunks of data can be sorted in memory
- written out to disk as segment files
- multiple sorted segments can be merged into a larger sorted file
-
Merge-sort has sequential access patterns that perform well on disks. ? 不太理解这句。
内存聚合:
- Ruby脚本使用一个URL的内存哈希表
- 适用于大多数中小型网站(取决于URL的不同个数)
发现中文翻译的确不太好。中文副标题为“排序与内存聚合”,刚开始还没反应过来这其实是对日志文件的两种不同处理方式。去翻英文版,发现标题为Sorting versus in-memory aggregation
,使用了versus
,马上可以体会到对日志分析的不同处理方法。还是读英文版吧,毕竟是第一手而不是又经过另一个人翻译有偏差的二手资料。阮一峰翻译好的一部分原因是他会保留一些英文表示,可以理解为该书翻译是为了更多读者可以看懂,但versus使用一个“与”代替反而失掉了准确性😅。
The Unix Philosophy UNIX设计哲学
A uniform interface 统一接口
interface: file文件(file Descriptor文件描述符)
A file is just an ordered sequence of bytes.
- an actual file on the filesystem
- a
communication channel
toanother process
(Unix socket, stdin, stdout) - a device driver (say /dev/audio or /dev/Ip0)
- a socket representating a TCP connection
Separation of logic and wiring 逻辑与布线分离
比如这个,英文为“Separation of logic and wiring”,也是很明显英文容易理解,当时读中文就在想这什么意思。
stdin、stdout
pros:
- loose coupling松耦合/late binding后期绑定/inversion of control控制反转
cons:
- 不利于多个输入输出的程序 Programs that need multiple inputs or outputs are possible but tricky. ?不太懂
- Can’t pipe a program’s output into a network connection. ?不太懂
Transparentcy and experimentation 透明与测试
MapReduct and Distributed Filesystems
有意思的类比:
MapReduce有点像分布在数千台机器上的UNIX工具。
A single MapReduce job is comparable to a single Unix process.
- takes one/more inputs
- produces one/more outputs:
- are written once in a sequential fashion
- not modifying any existing part of a file once it has been written
HDFS
shard-nothing approach: HDFS
- requires no special hardware
- only computers connected by a
conventional datacenter network
.
shard-disk approach
- Network Attached Storage(NAS)
- Storage Area Network(SAN)
- implemented by a
centralized storage appliance
集中存储设备, often using custom hardware and special networkinfrastructure
such as Fibre Channel
NameNode
: a center server keeps track of which file blocks are stored on which machine.
MapReduce Job Execution
The sorting is performed in stages. 分阶段进行排序。
MapReduce discards the partial output of a failed job
.