Notes: Designing Data-Intensive Applications

Notes on Martin Kleppmann's excellent Designing Data-Intensive Applications.

    data Tuple = (ID, Data | ID)

    # Property graph
    type Vertex = >
    type Edge = >

    # Triple-store
    type Data = string | number | boolean | .
    type Triple =

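
The two graph models named above can be sketched roughly as follows. This is only an illustration: the field names (`id`, `tail`, `head`, `label`, `properties`), the `{ ref }` wrapper, and the helper `toTriples` are assumptions, not the definitions from the original notes.

```typescript
// Hypothetical shapes for illustration; all field names are assumptions.
type ID = string
type Data = string | number | boolean

// Property graph: vertices and edges both carry key-value properties.
type Vertex = { id: ID; properties: { [key: string]: Data } }
type Edge = { id: ID; tail: ID; head: ID; label: string; properties: { [key: string]: Data } }

// Triple-store: every fact is a (subject, predicate, object) statement.
// The object is either a primitive value or a reference to another vertex.
type Triple = { subject: ID; predicate: string; object: Data | { ref: ID } }

// Flatten a property graph into triples.
// Edge properties are dropped here to keep the sketch short.
function toTriples(vertices: Vertex[], edges: Edge[]): Triple[] {
  const triples: Triple[] = []
  vertices.forEach(v => {
    Object.keys(v.properties).forEach(key => {
      triples.push({ subject: v.id, predicate: key, object: v.properties[key] })
    })
  })
  edges.forEach(e => {
    triples.push({ subject: e.tail, predicate: e.label, object: { ref: e.head } })
  })
  return triples
}

const exampleVertices: Vertex[] = [
  { id: 'lucy', properties: { name: 'Lucy' } },
  { id: 'idaho', properties: { name: 'Idaho', type: 'state' } }
]
const exampleEdges: Edge[] = [
  { id: 'e1', tail: 'lucy', head: 'idaho', label: 'bornIn', properties: {} }
]
const exampleTriples = toTriples(exampleVertices, exampleEdges)
```

Each vertex property becomes one triple whose object is a primitive, and each edge becomes one triple whose object points at another vertex.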
Property | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
---|---|---|
Type | Relational (row-oriented) or Document | Columnar (column-oriented) |
User | End user | Analyst, Data Scientist |
Type of data | Latest | All historical data |
Read pattern | Small number of records, fetched by key | Aggregate over many records |
Write pattern | Random-access, low latency | ETL or event stream |
Perf bottleneck | Disk seek time | Disk bandwidth |
Indexing data structure | Log-structured (SSTable, LSM) or Update-in-place (B-tree) | LSM |
Examples | MySQL, LevelDB, Cassandra, HBase, Lucene | Hive, Presto, Spark, Redshift |
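
To make the read-pattern rows of the table concrete, here is a small sketch contrasting a row-oriented and a column-oriented layout. The sales data and the names `SaleRow`, `ColumnStore`, `getById`, and `totalQuantity` are made up for illustration, and in-memory arrays stand in for on-disk storage.

```typescript
// Hypothetical sales data; the names and values are made up for illustration.
type SaleRow = { id: number; product: string; quantity: number; price: number }

// Row-oriented layout (typical for OLTP): each record is stored together,
// so fetching one record by key touches a single entry.
const rowStore: SaleRow[] = [
  { id: 1, product: 'apple', quantity: 3, price: 2 },
  { id: 2, product: 'pear', quantity: 1, price: 5 },
  { id: 3, product: 'apple', quantity: 4, price: 2 }
]

// Column-oriented layout (typical for OLAP): each column is stored together,
// so an aggregate only scans the columns it needs.
type ColumnStore = { id: number[]; product: string[]; quantity: number[]; price: number[] }

function toColumns(rows: SaleRow[]): ColumnStore {
  return {
    id: rows.map(r => r.id),
    product: rows.map(r => r.product),
    quantity: rows.map(r => r.quantity),
    price: rows.map(r => r.price)
  }
}

// OLTP-style read: a small number of records, fetched by key.
function getById(rows: SaleRow[], id: number): SaleRow | undefined {
  return rows.filter(r => r.id === id)[0]
}

// OLAP-style read: aggregate over many records, touching one column only.
function totalQuantity(columns: ColumnStore): number {
  return columns.quantity.reduce((sum, q) => sum + q, 0)
}

const columnStore = toColumns(rowStore)
const pearSale = getById(rowStore, 2)
const salesTotal = totalQuantity(columnStore)
```

The aggregate never reads `product` or `price`, which is why a columnar layout is limited by disk bandwidth rather than seek time.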

    type Index = <[IndexKey: string]: HeapFileKey>
    type HeapFile =

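
The relationship between an index and a heap file can be sketched as below. This assumes that a `HeapFileKey` is simply a row position in an append-only heap file; the `HeapRow` shape and the `IndexedTable` class are hypothetical and only meant to show the indirection.

```typescript
// Hypothetical shapes; HeapFileKey is assumed here to be a row position.
type HeapFileKey = number
type HeapRow = { key: string; value: string }
type HeapFile = HeapRow[]
type Index = { [indexKey: string]: HeapFileKey }

class IndexedTable {
  private heap: HeapFile = []
  private index: Index = {}

  // Append the row to the heap file, then point the index at its position.
  put(key: string, value: string): void {
    this.heap.push({ key, value })
    this.index[key] = this.heap.length - 1
  }

  // Look up the position in the index, then read that row from the heap file.
  get(key: string): string | undefined {
    if (!Object.prototype.hasOwnProperty.call(this.index, key)) return undefined
    return this.heap[this.index[key]].value
  }
}

const usersTable = new IndexedTable()
usersTable.put('user:1', 'Ada')
usersTable.put('user:2', 'Grace')
// A newer version is appended; the index now points at it.
usersTable.put('user:1', 'Ada Lovelace')
const latestUser1 = usersTable.get('user:1')
```

The index stores only a reference, so several indexes could point into the same heap file without duplicating the rows.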
Format | Created By | Encoding Formats | Schema Support? | Backwards/Forwards Compatibility |
---|---|---|---|---|
Avro | Apache | Binary | Yes. Supports `union`, `null` | Yes. Writer/Reader schemas are auto-translated |
Protocol Buffers | Google | Binary | Yes. Supports `repeated` | Yes. Can change field names, but can only add fields. New fields must be optional. |
Thrift | Facebook | Binary | Yes. Supports nested lists | Yes. Can change field names, but can only add fields. New fields must be optional. |
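
The "writer/reader schemas are auto-translated" idea can be illustrated with a simplified sketch. This is not Avro's schema format or API: `RecordSchema`, `FieldSchema`, and `resolveRecord` are hypothetical stand-ins that only mimic the resolution rules (match fields by name, fill reader-only fields from defaults, ignore writer-only fields).

```typescript
// Simplified stand-in for a record schema; not Avro's actual format or API.
type FieldValue = string | number | boolean | null
type FieldSchema = { name: string; default?: FieldValue }
type RecordSchema = { fields: FieldSchema[] }
type DecodedRecord = { [field: string]: FieldValue }

// Translate a record written with the writer's schema into the reader's schema.
function resolveRecord(written: DecodedRecord, writer: RecordSchema, reader: RecordSchema): DecodedRecord {
  const writerFields = writer.fields.map(f => f.name)
  const result: DecodedRecord = {}
  reader.fields.forEach(field => {
    if (writerFields.indexOf(field.name) !== -1) {
      // Field known to both sides: copy the written value.
      result[field.name] = written[field.name]
    } else if (field.default !== undefined) {
      // Field only the reader knows: fill in the reader's default.
      result[field.name] = field.default
    } else {
      throw new Error('No default for field ' + field.name)
    }
  })
  // Fields only the writer knows are skipped, since the reader never asks for them.
  return result
}

const writerSchemaV1: RecordSchema = { fields: [{ name: 'userName' }, { name: 'favoriteNumber' }] }
const readerSchemaV2: RecordSchema = { fields: [{ name: 'userName' }, { name: 'email', default: null }] }
const resolvedRecord = resolveRecord(
  { userName: 'Martin', favoriteNumber: 1337 },
  writerSchemaV1,
  readerSchemaV2
)
```

Here the reader drops `favoriteNumber` and fills `email` with its default, which is why added or removed fields need a default value to stay compatible in both directions.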
Protocol | Data format | Schema |
---|---|---|
REST | JSON | Often no schema. Can be codegenned, eg. using Swagger |
SOAP | XML | Yes, using WSDL |
RPC | Binary (eg. gRPC uses Protobuf) | Yes |
GraphQL | JSON | Yes |
Property | Read Committed | Snapshot Isolation | Serializable |
---|---|---|---|
Also known as | | Repeatable read (PostgreSQL, MySQL), serializable (Oracle) | |
Race conditions prevented | Dirty read, Dirty Write | All of Read Committed, plus: Read skew, Lost updates | All of Snapshot Isolation, plus: Write skew |
Implementation | Row-level locks for writes, serve old values while writes are in progress for reads | Row-level locks for writes, Multi-Version Concurrency Control (MVCC) for reads | Executing transactions in serial order (often using stored procedures, and only possible when data fits in memory), Two-phase locking ("2PL"), Serializable Snapshot Isolation ("SSI") |
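
As a rough illustration of how MVCC lets snapshot isolation serve consistent reads, the sketch below keeps several versions of one value and decides which version a transaction may see. It is a simplified model under stated assumptions: transaction IDs increase over time, aborted transactions are not modelled, and the names `RowVersion`, `TxSnapshot`, `isVisible`, and `readBalance` are hypothetical rather than any database's API.

```typescript
// A simplified MVCC model; the names below are assumptions, not a real database's API.
type RowVersion = { value: number; createdBy: number; deletedBy: number | null }
type TxSnapshot = { txid: number; inProgress: number[] }

// A write counts for the snapshot if it was made by the reading transaction itself,
// or by an earlier transaction that was no longer in progress when the snapshot was taken.
// This sketch does not model aborted transactions.
function isCommittedBefore(writerTxid: number, snapshot: TxSnapshot): boolean {
  if (writerTxid === snapshot.txid) return true
  return writerTxid < snapshot.txid && snapshot.inProgress.indexOf(writerTxid) === -1
}

// A version is visible if its creation counts and its deletion (if any) does not.
function isVisible(version: RowVersion, snapshot: TxSnapshot): boolean {
  const created = isCommittedBefore(version.createdBy, snapshot)
  const deleted = version.deletedBy !== null && isCommittedBefore(version.deletedBy, snapshot)
  return created && !deleted
}

// Return the newest visible version's value, if any version is visible.
function readBalance(versions: RowVersion[], snapshot: TxSnapshot): number | undefined {
  const visible = versions.filter(v => isVisible(v, snapshot))
  return visible.length > 0 ? visible[visible.length - 1].value : undefined
}

// Transaction 13 updated the balance from 500 to 400: the old version is
// marked as deleted by 13, and a new version is created by 13.
const balanceVersions: RowVersion[] = [
  { value: 500, createdBy: 3, deletedBy: 13 },
  { value: 400, createdBy: 13, deletedBy: null }
]
const seenByTx12 = readBalance(balanceVersions, { txid: 12, inProgress: [] }) // 500
const seenByTx14 = readBalance(balanceVersions, { txid: 14, inProgress: [] }) // 400
```

Transaction 12 started before the update, so it keeps seeing the old balance without blocking the writer, which is the behaviour that prevents read skew.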