diff --git a/docs/readthedocs/source/etl.rst b/docs/readthedocs/source/etl.rst
new file mode 100644
index 0000000000000000000000000000000000000000..327142fde30823c4c958235e91d92a1fd77fc51d
--- /dev/null
+++ b/docs/readthedocs/source/etl.rst
@@ -0,0 +1,138 @@
+===
+ETL
+===
+
+The ETL framework is most commonly used in `staged sync <../../eth/stagedsync>`_.
+
+It implements a pattern where we extract some data from a database, transform it,
+write it into temp files, and then insert it back into the database in sorted order.
+
+Inserting entries into our KV storage sorted by key helps to minimize write
+amplification, so it is much faster, even taking into account the additional
+I/O generated by the temp files.
+
+It behaves similarly to enterprise Extract, Transform, Load frameworks, hence
+the name. We use temporary files because that helps keep RAM usage predictable
+and allows using ETL on large amounts of data.
+
+Example
+=======
+
+.. code-block:: go
+
+    // keyTransformExtractFunc builds an ExtractFunc that rewrites each key
+    // with transformKey before passing the entry on.
+    func keyTransformExtractFunc(transformKey func([]byte) ([]byte, error)) etl.ExtractFunc {
+        return func(k, v []byte, next etl.ExtractNextFunc) error {
+            newK, err := transformKey(k)
+            if err != nil {
+                return err
+            }
+            return next(k, newK, v)
+        }
+    }
+
+    err := etl.Transform(
+        db,                                              // database
+        dbutils.PlainStateBucket,                        // "from" bucket
+        dbutils.CurrentStateBucket,                      // "to" bucket
+        datadir,                                         // where to store temp files
+        keyTransformExtractFunc(transformPlainStateKey), // transform on extraction
+        etl.IdentityLoadFunc,                            // transform on load
+        etl.TransformArgs{                               // additional arguments
+            Quit: quit,
+        },
+    )
+    if err != nil {
+        return err
+    }
+
+Data Transformation
+===================
+
+Data can be transformed in two places along the pipeline:
+
+* transform on extraction
+
+* transform on loading
+
+Transform On Extraction
+=======================
+
+.. code-block:: go
+
+    type ExtractFunc func(k []byte, v []byte, next ExtractNextFunc) error
+
+The transform-on-extraction function receives the current key and value from
+the source bucket.
+
+Transform On Loading
+====================
+
+.. code-block:: go
+
+    type LoadFunc func(k []byte, value []byte, state State, next LoadNextFunc) error
+
+In addition to the current key and value, the transform-on-loading function
+receives a ``State`` object that can read data from the destination bucket.
+
+That is used in index generation, where we want to extend existing index
+entries with new data instead of just adding new ones.
+
+Buffer Types
+============
+
+Before the data is flushed into temp files, it is collected into a buffer
+until the buffer overflows (``etl.ExtractArgs.BufferSize``).
+
+There are several buffer types available, each with different behaviour:
+
+* ``SortableSliceBuffer`` -- just appends (k, v1), (k, v2) to a slice. Duplicate
+  keys lead to duplicate entries: [(k, v1), (k, v2)].
+
+* ``SortableAppendBuffer`` -- merges on duplicate keys: (k, v1), (k, v2)
+  lead to k: [v1 v2].
+
+* ``SortableOldestAppearedBuffer`` -- keeps the oldest on duplicate keys:
+  (k, v1), (k, v2) lead to k: v1.
+
+Loading Into Database
+=====================
+
+We load data from the temp files into the database in batches, limited by
+``IdealBatchSize()`` of an ``ethdb.Mutation``.
+
+Saving & Restoring State
+========================
+
+Interrupting in the middle of loading can lead to an inconsistent state in
+the database.
+
+To avoid that, the ETL framework allows storing progress by setting
+``OnLoadCommit`` in ``etl.TransformArgs``.
+
+We can then use this data to find out how much progress the ETL
+transformation has made.
+
+You can also specify ``ExtractStartKey`` and ``ExtractEndKey`` to limit the
+number of items transformed.
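+
+As an illustration, here is a sketch of how these arguments might be combined.
+The exact signature of the ``OnLoadCommit`` callback is an assumption here,
+and ``saveProgress``, ``startKey``, and ``endKey`` are hypothetical names
+standing in for the caller's own progress-persisting logic and key range:
+
+.. code-block:: go
+
+    err := etl.Transform(
+        db,
+        dbutils.PlainStateBucket,
+        dbutils.CurrentStateBucket,
+        datadir,
+        keyTransformExtractFunc(transformPlainStateKey),
+        etl.IdentityLoadFunc,
+        etl.TransformArgs{
+            Quit:            quit,
+            ExtractStartKey: startKey, // start extracting from this key ...
+            ExtractEndKey:   endKey,   // ... and stop around this one
+            // Assumed callback shape: called when a batch is committed, with
+            // the last key loaded, so progress can be persisted and restored
+            // if the transformation is interrupted.
+            OnLoadCommit: func(putter ethdb.Putter, key []byte, isDone bool) error {
+                return saveProgress(putter, key, isDone) // hypothetical helper
+            },
+        },
+    )
+    if err != nil {
+        return err
+    }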
+
+
+``etl.Transform`` function
+==========================
+
+In the vast majority of use cases, we extract data from one bucket and, in
+the end, load it into another bucket. That is the use case for the
+``etl.Transform`` function.
+
+``etl.Collector`` struct
+========================
+
+If you want more modular behaviour instead of just reading from the DB (like
+generating intermediate hashes in
+https://github.com/ledgerwatch/turbo-geth/blob/master/core/chain_makers.go),
+you can use the ``etl.Collector`` struct directly.
+
+It has a ``Collect()`` method that you can provide your data to.
+
+
+Optimizations
+=============
+
+* If all the data fits into a single file, we don't write anything to disk
+  and just use in-memory storage.
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index 66d171db208e57df38db4118e1ba06a5e4b465c5..10881a60393f7d60eadc49d7dad4e6f6a43d6fdd 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -14,6 +14,7 @@ Welcome to Turbo Geth's documentation!
    rpc/tutorial
    stagedsync
    snappy
+   etl
    rpc/index
 
 Indices and tables