Designing Data-Intensive Applications Notes

豆瓣🔗：

英文：Designing Data-Intensive Applications
中文：《数据密集型应用系统设计》

atomic

In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation.

In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur.

REST

REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration, and is often associated with microservices. An API designed according to the principles of REST is called RESTful.

message broker/message queue/message-oriented middleware

Using a message broker has several advantages compared to direct RPC:

It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.

It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.

It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).

It allows one message to be sent to several recipients. It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).

References

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

Reading Notes:

1、

数据密集：Big-Data
计算密集：Deeping-Learning、HPC(High-Performance-Computing)

2、FOSS: Free and Open Source Software

3、back pressure(背压)=flow control(流量控制)

4、byzantine fault(拜占庭错误)：转发任何类型（矛盾的or虚假的）的消息

5、failover: the leadership role from one node to another node

CH1

1、

Redis: datastores/message queues

Apache Kafka: message queues/database-like durability guarantees 持久化存储保证

2、

Memcached

Elasticsearch/Solr: full-text search server

main database

3、

水平拓展：单个节点——>分布式多机环境
- 无状态服务：容易
- 有状态服务：复杂
垂直拓展

CH2

1、网络、层次模型——>关系模型(RDBMS+SQL)

NoSQL:Not only SQL

比RDB更好的拓展需求：very large datasets, very high write throughput
free and open source software over commercial database products
specialized query operations that are not well supported by the rational model
for a more dynamic and expressive data model

2、ORM(Objected-Reational Mapping frameworks): ActiveRecord, Hibernate

3、Document-oriented Database: MongoDB, RethinkDB, CouchDB, Espresso

无模式（读时模式）：数据的结构是隐式的，只有在读取时才解释。类似于：动态（运行时）类型检查

写时模式（rational database）：模式是显式的，确保数据写入时都必须遵循。类似于：静态（编译时）类型检查

4、property graph: Neo4j, Titan, InfiniteGraph

CH3

Chapter 3 数据存储与检索

storage engines

存储引擎：

log-structured ～: 日志结构
- 仅支持追加式更新的log
page-oriented ～: 面向页 B-tree

the hash table’s limitations:

the hash table must fit in memory：哈希表必须全部放入内存。
it is difficult to make an on-disk hash map perform well
- a lot of random access I/O
- expensive to grow when it becomes full
- hash collisions require fiddly logic
Range queries are not efficient：只能采用逐个查找的方式查询每一个键。

SSTable

Sorted String Table 排序字符串表
按key-value对的顺序按键key排序

LSM-Tree

Log-Structured Merge-Tree(LSM-Tree) 日志结构的合并树

LSM存储引擎：基于合并和压缩排序文件原理的存储引擎。

Lucene

an indexing engine for full-text search used by Elasticsearch and Solr 索引引擎
key: 单词（词条）
value: 所有包含该单词的文档ID的列表（倒排表）

Bloom filters: memory-efficient data structure for approximating the content of a set: if a key does not appear in the database.

B-Tree

fixed-size blocks/pages
read/write one page at a time
individual keys(a leaf page):
- inline
- references to the pages where the values can be found
an additional data structure on a disk: a write-ahead log(WAL,redo log) 预写日志/重做日志（仅支持追加）

分支因子：B-Tree中一个页所包含的子页引用数量

B-Tree: faster for reads

先写WAL
后修改树本身的页（还可能发生页分裂）

LSM-Tree: faster for writes

Index

clustered index: storing all row data within the index 在索引中直接存储行数据
nonclustered index: storing only references to the data with the index 仅存储索引中数据的引用

View

materialized view：查询结果的实际副本，并被写到磁盘。
standard(virtual) view：编写查询的快捷方式。从虚拟存储中读取时，SQL引擎将其动态拓展到视图的底层查询，然后处理拓展查询。

OLTP(Online Transaction Processing)

log-structured school: BitCask, SSTable, LSM-Tree, LevelDB, Cassandra, HBase, Lucene
update-in-place school: B-Tree(RDB, NoSQL)

CH4

1、

in-memory representation——>a byte sequence

encoding
serialization
marshalling

a byte sequence——>in-memory representation

decoding
derialization
unmarshalling

PART 2 Distributed Data

Chapter 5: Replication

数据复制的难点：需要复制的数据不是一成不变的，是持续变化的。

复制数据变化的3种发放：

主存复制
多主节点复制
无主节点复制

主存复制

只有主节点可以接受写请求
客户端角度，从节点都是只读的

同步复制与异步复制

半同步：某一个从节点是同步的，其他节点则是异步模式

全异步：

无法保证数据的持久化（主节点失败且不可恢复，复制到从节点的写请求会丢失）
主节点可以继续响应从节点，系统的吞吐性能更好

处理节点失效

从节点失效：追赶式恢复

主节点失效：节点切换

脑裂：两个从节点同时自认为是主节点

复制日志的实现

基于语句

不适用的场景：

任何调用非确定性函数的语句：NOW()、RAND()
- 主节点在记录操作语句时，将非确定性函数替换为执行之后的确定性结果
- 但需要考虑很多边界条件
使用了自增列、依赖于DB的现有数据的语句
有副作用的语句：触发器、存储过程、用户定义的函数

基于预写日志WAL

缺点：

日志描述的数据结果WAL非常底层：哪些磁盘块的哪些字节发生改变
复制方案和存储引擎紧密耦合

基于行的逻辑日志

逻辑日志

e.g., MySQL的二进制日志binlog

基于触发器

应用程序层进行复制控制

多主节点复制

无主节点复制

Chapter 9: 一致性与共识

可线性化（原子一致性、强一致性）

可线性化 Linearizability ：

读写寄存器（单个对象）的最新值保证
例如：
- 两阶段加锁
- 实际以串行执行
- 关系型数据库中主键的约束，需要线性化保证；外键/属性约束，并不一定要求线性化
- 多核CPU上的内存：非线性化（除非使用了内存屏障/fence指令）
放弃线性化的原因：为了性能，而不是容错

可串行化 Serializability ：

事务的隔离属性
用来确保事务执行的结果与串行执行的结果完全相同

CAP定理

“网络分区的情况下，选择一致还是可用”

Chapter 10: Batch Processing批处理系统

As usual, history has a tendency of repeating itself.

the ideas and lessons from Unix carry over to large-scale, heterogeneous distributed data systems.

UNIX的思想和经验普遍见于大规模、异构分布式数据系统。

Batch Processing with Unix Tools

Simple Log analysis简单日志分析

1
2
3
4
5
6


cat /var/log/nginx/access.log |
awk '{print $7}' |
sort |
uniq -c |
sort -r -n |
head -n 5

in a matter of seconds 在几秒钟内

Chains of commands versus custom program

a simple program to do the same thing

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


counts=Hash.new(0)

File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url=line.split[6]
    counts[url]+=1
  end
end

top5=counts.map{|url,count| [count,url]}.sort.reverse[0...5]
top5.each{|count,url| puts "#{count} #{url}"}

Sorting versus in-memory aggregation

排序 VS 内存聚合

排序：

UNIX流水线
job’s working set is larger than the available memory: make efficient use of disks 高效地使用磁盘
SSTables and LSM-Tables
- chunks of data can be sorted in memory
- written out to disk as segment files
- multiple sorted segments can be merged into a larger sorted file
Merge-sort has sequential access patterns that perform well on disks. ? 不太理解这句。

内存聚合：

Ruby脚本使用一个URL的内存哈希表
适用于大多数中小型网站（取决于URL的不同个数）

发现中文翻译的确不太好。中文副标题为“排序与内存聚合”，刚开始还没反应过来这其实是对日志文件的两种不同处理方式。去翻英文版，发现标题为Sorting versus in-memory aggregation，使用了versus,马上可以体会到对日志分析的不同处理方法。还是读英文版吧，毕竟是第一手而不是又经过另一个人翻译有偏差的二手资料。阮一峰翻译好的一部分原因是他会保留一些英文表示，可以理解为该书翻译是为了更多读者可以看懂，但versus使用一个“与”代替反而失掉了准确性😅。

The Unix Philosophy UNIX设计哲学

A uniform interface 统一接口

interface: file文件（file Descriptor文件描述符)

A file is just an ordered sequence of bytes.

an actual file on the filesystem
a communication channel toanother process(Unix socket, stdin, stdout)
a device driver (say /dev/audio or /dev/Ip0)
a socket representating a TCP connection

Separation of logic and wiring 逻辑与布线分离

比如这个，英文为“Separation of logic and wiring”，也是很明显英文容易理解，当时读中文就在想这什么意思。

stdin、stdout

pros:

loose coupling松耦合/late binding后期绑定/inversion of control控制反转

cons:

不利于多个输入输出的程序 Programs that need multiple inputs or outputs are possible but tricky. ？不太懂
Can’t pipe a program’s output into a network connection. ？不太懂

Transparentcy and experimentation 透明与测试

MapReduct and Distributed Filesystems

有意思的类比：MapReduce有点像分布在数千台机器上的UNIX工具。

A single MapReduce job is comparable to a single Unix process.

takes one/more inputs
produces one/more outputs:
- are written once in a sequential fashion
- not modifying any existing part of a file once it has been written

HDFS

shard-nothing approach: HDFS

requires no special hardware
only computers connected by a conventional datacenter network.

shard-disk approach

Network Attached Storage(NAS)
Storage Area Network(SAN)
implemented by a centralized storage appliance集中存储设备, often using custom hardware and special network infrastructure such as Fibre Channel

NameNode: a center server keeps track of which file blocks are stored on which machine.

MapReduce Job Execution

The sorting is performed in stages. 分阶段进行排序。

MapReduce discards the partial output of a failed job.

Table of Contents