Time series data storage
There are multiple lists of time series databases around;
awesome-time-series-database is a curated and relatively complete one.
I spent some time choosing a DBMS for storing such data, and I'm
summarising here the things I've learned.
The primary issue with a relatively large amount of time series data
(dozens of billions of rows) and relatively limited resources (no dozens
of TiB of RAM, not even that much SSD storage) is reading from disk:
the queries are usually simple, but plain reading is the bottleneck.
With wider rows (dozens of columns, or some columns storing arrays),
a column-oriented DBMS looks like a good idea, if only some of the columns
are needed for typical queries. There are a few more things specific to time
series data, e.g., a lot of querying by range and a lot of insertions, but
rare removals or updates. Specialized DBMSes tend to acknowledge that,
supposedly optimizing for it, and even provide handy facilities for
building reports based on TS data.
Some types of DBMSes are quick to filter out, so at least they don't waste
much of one's time:
- Proprietary: Easy to filter out.
- Abandoned: For some reason these stay in the lists, but relying on an
  abandoned DBMS to store your data is unwise, unless you have enough
  time and resources to continue its development.
- Experimental: When there's just a paper saying how great it is, and a
  small short-lived repository suggesting to try their Docker image.
  Might be fun, of course, but not for production.
- Small: Even the NSA and GCHQ publish open-source analytical DBMSes (it
  must be somewhat amusing to store data in those), as do all kinds of
  advertising companies, and I guess there are lesser-known projects led
  by enthusiasts. It is possible that some of those are good, but it would
  take a while to find and try out enough of them to find anything good,
  and even then there would be a risk of them getting abandoned.
- In-memory: Simply not suitable in my case, since there's too much data
  to fit in memory (and I probably wouldn't be looking for a fast one if
  the amount of data were that small).
The remaining ones tend to have smaller communities than general-purpose
DBMSes, and to be generally worse in everything except the supposed
performance on time series data. I've tried just a few, since there's not
much left to try after the filtering, and it takes quite some time even to
fill them with test data:
- InfluxDB: Looks too enterprise-y, but it is in the Debian repositories,
  so it's easy to try. Or so I thought: insertion turned out to be too
  slow, inserting about 100 minutes' worth of test data in an hour. Maybe
  the issue was in the client library, and that's after decreasing batches
  to 1000 records at a time, since otherwise it was failing with timeouts.
  It looked handy, but I gave up on benchmarking it and moved on to the
  others. There is also an issue with clustering (or, rather, the lack of
  it in the libre version), which is scary when the data is close to not
  fitting on a single machine.
- PostgreSQL: Not a specialized TSDB, so I didn't expect it to outperform
  the specialized ones, but at that point I decided to actually try it, to
  have something to compare the others against: it is a good RDBMS, and
  has been good for many years, in many respects. Initially I was going to
  use JSON fields, but columns (including array columns) turned out to
  consume about half as much disk space. It was rather slow for large
  selections with just a primary key, but considerably faster with
  index-only scans (a sketch of that follows this list).
- KairosDB + Cassandra: I expected this one to be fast, but it turned out
  to be a bit slower than Postgres with just a primary key. Actually that
  was a relief, since there are a few things I didn't quite like about
  them (mostly the documentation, and perhaps Java). Apparently a built-in
  web UI is among its prominent features.
- ClickHouse (with MergeTree): Made by Yandex. The build process, the
  documentation, the functionality – pretty much everything is rather
  poor, but the performance is indeed fine: like Postgres with index-only
  scans (sometimes a little faster, sometimes a little slower, but roughly
  that). It takes less disk space than Postgres does. Maybe it's better
  for larger workloads, but I haven't noticed anything outstanding in its
  performance, while it's generally less flexible and less feature-rich
  than Postgres.
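To illustrate the index-only scans mentioned for PostgreSQL above, here is a
minimal sketch; the table and column names are made up for the example, and
the INCLUDE clause requires PostgreSQL 11 or later (a plain multicolumn index
on all the queried columns achieves a similar effect on older versions).

    -- Hypothetical schema, only meant to show the shape of a covering index.
    CREATE TABLE readings (
        sensor_id integer NOT NULL,
        "time"    timestamptz NOT NULL,
        value     double precision,
        PRIMARY KEY (sensor_id, "time")
    );

    -- INCLUDE stores "value" in the index leaf pages, so the query below can
    -- be answered with an index-only scan, without touching the table itself
    -- (provided the visibility map is reasonably fresh, e.g. after a VACUUM).
    CREATE INDEX readings_covering
        ON readings (sensor_id, "time") INCLUDE (value);

    EXPLAIN SELECT "time", value
    FROM readings
    WHERE sensor_id = 42
      AND "time" BETWEEN '2018-01-01' AND '2018-02-01';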
I had planned to use a specialized DBMS instead of Postgres if the former
considerably outperformed the latter, but that didn't happen: while the
bottleneck is disk I/O, there's not much for a DBMS to improve.
Aggregation alone would increase performance with any DBMS by more than a
thousand times in this case, and then there are RAID 0, sharding, and
possibly hardware upgrades.
Some approaches that I found useful with PostgreSQL:
- Storage of time intervals as (time, interval) tuples, where time is the
  midpoint of the interval and interval is the distance from it to either
  end. This allows records to be grouped or partitioned (including with
  native partitioning) easily and reliably; see the first sketch after
  this list.
- If row-level security policies are used and rely on many nested queries,
  Postgres may expect far more results than there can actually be, and
  plan inefficient queries. Using a function to retrieve the list of
  available objects (e.g., sensors, if the records refer to those), and
  setting something like rows 1000 on it, speeds up some queries
  considerably (second sketch below).
- Custom aggregate functions are very handy (third sketch below).
- When aggregation is used, storing the number of raw records is useful
  both for diagnostics and for adding new data into already-aggregated
  records (e.g., for taking weighted averages); see the fourth sketch
  below.
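A minimal sketch of the first point, with a hypothetical measurements table
(the names are made up for the example, and the actual schema may differ):
the midpoint of each measurement interval goes into time, and the distance
from the midpoint to either end goes into interval.

    -- Hypothetical partitioned table; the point is the (time, interval) pair
    -- and native range partitioning (PostgreSQL 10+) on the midpoint.
    CREATE TABLE measurements (
        sensor_id  integer NOT NULL,
        "time"     timestamptz NOT NULL, -- midpoint of the measurement interval
        "interval" interval NOT NULL,    -- distance from the midpoint to either end
        value      double precision
    ) PARTITION BY RANGE ("time");

    CREATE TABLE measurements_2018_01 PARTITION OF measurements
        FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');

Since the partitioning key is the midpoint alone, every record falls into
exactly one partition, however long its interval is.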
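A sketch of the row-level security point, reusing the measurements table from
the previous sketch and assuming a hypothetical sensors table and an
available_sensors() helper; what matters is the rows estimate on the
function, which keeps the planner from grossly overestimating the results of
nested policy subqueries.

    -- Hypothetical table of sensors and their owners.
    CREATE TABLE sensors (
        id         integer PRIMARY KEY,
        owner_name text NOT NULL
    );

    -- The "ROWS 1000" estimate is the point: without it the planner may
    -- assume a much larger result set and choose inefficient plans.
    CREATE FUNCTION available_sensors() RETURNS SETOF integer
        LANGUAGE sql STABLE
        ROWS 1000
    AS $$
        SELECT id FROM sensors WHERE owner_name = current_user;
    $$;

    ALTER TABLE measurements ENABLE ROW LEVEL SECURITY;

    CREATE POLICY measurements_by_sensor ON measurements
        USING (sensor_id IN (SELECT * FROM available_sensors()));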
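A sketch of a custom aggregate, here a weighted mean (the names are
illustrative, not from the original setup); the weight would typically be the
stored raw-record count from the next point.

    -- Accumulate (weighted sum, total weight) in a two-element array.
    CREATE FUNCTION weighted_mean_acc(state float8[], val float8, weight float8)
        RETURNS float8[] LANGUAGE sql IMMUTABLE STRICT
    AS $$
        SELECT ARRAY[state[1] + val * weight, state[2] + weight];
    $$;

    -- Divide the weighted sum by the total weight at the end.
    CREATE FUNCTION weighted_mean_final(state float8[])
        RETURNS float8 LANGUAGE sql IMMUTABLE STRICT
    AS $$
        SELECT CASE WHEN state[2] = 0 THEN NULL ELSE state[1] / state[2] END;
    $$;

    CREATE AGGREGATE weighted_mean(float8, float8) (
        SFUNC     = weighted_mean_acc,
        STYPE     = float8[],
        FINALFUNC = weighted_mean_final,
        INITCOND  = '{0,0}'
    );

    -- Usage, e.g. over pre-aggregated records that store a raw-record count:
    --   SELECT sensor_id, weighted_mean(mean_value, record_count)
    --   FROM measurements_hourly GROUP BY sensor_id;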
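A sketch of the last point, continuing with hypothetical names: an hourly
aggregate table stores the number of raw records next to the mean, so a new
batch can later be folded in as a weighted average (the ON CONFLICT clause
relies on the primary key and requires PostgreSQL 9.5 or later).

    -- Hypothetical aggregate table; record_count makes later merging possible.
    CREATE TABLE measurements_hourly (
        sensor_id    integer NOT NULL,
        hour         timestamptz NOT NULL,
        mean_value   double precision,
        record_count bigint NOT NULL,
        PRIMARY KEY (sensor_id, hour)
    );

    -- Aggregate a day of raw data and merge it into existing hourly records,
    -- weighting the means by the stored record counts.
    INSERT INTO measurements_hourly (sensor_id, hour, mean_value, record_count)
    SELECT sensor_id, date_trunc('hour', "time"), avg(value), count(*)
    FROM measurements
    WHERE "time" >= '2018-01-01' AND "time" < '2018-01-02'
    GROUP BY sensor_id, date_trunc('hour', "time")
    ON CONFLICT (sensor_id, hour) DO UPDATE SET
        mean_value = (measurements_hourly.mean_value * measurements_hourly.record_count
                      + EXCLUDED.mean_value * EXCLUDED.record_count)
                     / (measurements_hourly.record_count + EXCLUDED.record_count),
        record_count = measurements_hourly.record_count + EXCLUDED.record_count;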