Data Structure: log structure merge tree(LSM)

Data Structure: log structure merge tree(LSM):

LSM Tree is a data structure that restricts your datastore to append only operations, whereas your datastore is simply a log containing totally ordered key-value pairs. The sort order is by a key which can be any byte sequence. The values too can be any arbitrary byte sequence. For example, in Cassandra's data model, a column family (akin to a table in a relational database) contains a row key (the key) and any sequence of arbitrary key values that can vary in cardinality (the value).

1. Fast write
2. Data stored in relational database
a. backed by data structure B-tree

B-tree:

  • it's optimized for reads
  • updating is relatively more expensive as it involves random I/o, updating multiple pages on the disk
  • this limits how fast a B-tree can ingest data
3. LSM works differently writes are batched in memory as they arrive in a structure called Memtable (memory)
1. Memtable is ordered by object key and is usually implemented as a balanced binary trees
4. Memtable reaches some specific size, it is flushed to disk as an immutable Sorted String table(disk)

LSM tree


SSTable:

  1. SSTable stores the key-value pairs in a sorted sequence. These writes are all sequential IO, they are fast on any storage media
  2. The new SSTable becomes the most recent segment of the LSM tree
  3. As more data comes in more and more of these immutable SSTables are created and added to the LSM tree
  4. with each one representing a small chronological subset of the incoming changes
  5. an update to an existing object key does not overwrite an existing SSTable.
  6. instead, the new entry is added to the most recent SSTable which supersedes any entries in old SSTables for the object key.
  7. to perform a delete it adds a marker called a tombstone to the most recent SSTable for the object key
  8. when we encounter a tombstone on read we know the object has been deleted
  9. As outdated entries as keys are updated and tombstones are added it will take up precious space to fix these issues there are periodic merging and compaction processes running in the background
  10. To Merge SSTable and discard outdated or deleted values. This help to reclaim disk space and caps the number of SSTables a reader has to look through.
  11. Since they are sored this merging and compaction process is simple and efficient
  12. the approach is similar to the one used to merge phase of the mergesort algorithm
  13. this in a nutshell is how an LSM tree provides fast writes


Recap:

* An LSM tree buffers incoming writes in memory
* when the buffer fills up, we sort and flush it to disk as an immutable SSTable.
* As more and more SSTable are created because of update and delete which mark tombstone and consumes precious memory space, to resolve this issue LSM tree merges the SSTables and Compacts them in the Background using an approach similar to the one used to merge phase of the mergesort algorithm


Compaction Process:

  • When SSTables are merged they are organized into levels
  • This is where tree part of the LSM tree comes in part
  • the different strategies to determine when and how SSTables are merged and compacted
1. Tiering => write optimized
2. Leveling => read Optimized

Key Points:

  1. Compaction keeps the number of SSTables manageable
  2. SSTables are organized into levels where each is exponentially larger as SSTables from the level above are merged into it
  3. Compaction consumes a lot of I/O, Mistuned Compaction could starve the system and slow down both reading and writing


Standard Optimization for LSM trees in Production Systems

  • to look up a key it performs searches on SSTables at every level Though the search is fast on sorted data going through all the data consumes a lot of I/O.
  • Many Systems keep a summary table in memory that contains the min/max range of each disk block of every level this allows the system to skip searches on those disk blocks where the key does not fall within the scope this saves a lot of random IO
  • Another issue that could be potentially expensive is looking up a key that does not expensive
  • this would require looking through all the eligible blocks in all level
  • The most system keeps a Bloom filter on each level, Bloom Filter is a space-efficient data structure that returns 
    1. a firm no if the key does not exist
    2. probable yes if the key might exist

Bloom Filter:

  • it's a space-efficient probabilistic data structure
  • is used to answer the question, Is this element in the set?
    1. A Bloom filter would answer with either a "firm no" or a "Probable yes".
    2. False positives are possible -> element not there but say it is present
    3. False negatives are not possible -> element is there but says it is not 
    4. we can't remove the item from the bloom filter, it never forgets.
        • This above feature allows the system to skip a level entirely if the key does not exist there which dramatically reduces the number of random I/O required.

        LSM tree, Bloom filter on all stages


        NO SQL database backed by LSM tree could be tuned to support a very high write rate

         NoSQL database: Cassandra, RocksDB, influx dB

        reference:  youtube LSM treeLSM Tree Blog

        Comments

        Popular Posts