Friday, March 16, 2018

NDB Checkpoints vs Disk-based checkpoints

At one point during the development of the Partial LCP algorithm I started
wondering how it compares to an approach that uses standard page-based
checkpointing, as in a traditional disk-based DBMS.

The scenario I considered is one where the system has 10 TBytes of memory.
In this case the likelihood of updating the same row twice during a
checkpoint is very small (except for hotspot rows, of course).

With a row size of 250 bytes there will be roughly 40 billion rows in the database.
I often assume in my modelling that a checkpoint takes about 5 minutes.
This means that even with 1 million writes per second, less than 1% of the
data in the database is updated during a checkpoint.
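
As a quick back-of-envelope sketch of this arithmetic (plain Python, not NDB
code; the constants are just the assumptions stated above):

  # Scenario: 10 TBytes of in-memory data, 250-byte rows, 5-minute checkpoints.
  DATABASE_BYTES = 10 * 10**12
  ROW_SIZE = 250
  CHECKPOINT_SECONDS = 5 * 60

  rows = DATABASE_BYTES // ROW_SIZE                     # ~40 billion rows
  rows_updated = 1_000_000 * CHECKPOINT_SECONDS         # 1M writes/s for 5 minutes
  print(rows_updated / rows)                            # 0.0075 -> less than 1%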

A more normal update speed would be e.g. 100,000 writes per second.
In this case the changed rows amount to 25 MBytes per second, and thus
about 7.5 GBytes is written in 5 minutes. This represents the delta,
the changed records.

In addition, the Partial LCP implementation must also write a portion of the
database in full. In this case we have updated 0.075% of the database since
the last checkpoint. The default setting thus requires that we select enough
parts to ensure that at least 0.15% of the database is fully written, about
15 GBytes. A few of the 2048 parts are therefore written in full, and the
total checkpoint size becomes about 22.5 GBytes. To write this in 5 minutes
we need to write 75 MBytes of checkpoint data per second.
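
This sizing rule can be sketched as follows. Note that the factor used here,
where the fully written parts amount to (100 / RecoveryWork) times the delta,
is my reading of the figures in this post rather than a statement of the
actual part-selection code:

  def partial_lcp_checkpoint(db_bytes, row_size, writes_per_sec,
                             checkpoint_secs=300, recovery_work=50):
      # Delta: every row changed during the checkpoint is written once.
      delta = writes_per_sec * row_size * checkpoint_secs
      # Fully written parts: assumed here to be (100 / RecoveryWork) * delta,
      # which reproduces the 0.15% / 22.5 GByte figures for the default of 50.
      full_parts = delta * 100 / recovery_work
      total = delta + full_parts
      return delta, full_parts, total, total / checkpoint_secs

  delta, full, total, rate = partial_lcp_checkpoint(10 * 10**12, 250, 100_000)
  # delta = 7.5 GBytes, full = 15 GBytes, total = 22.5 GBytes, rate = 75 MBytes/s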

Now let us do the same experiment with a traditional disk-based
DBMS. We assume a standard page size of 8 kBytes. This means
that the page cache holds about 1.25 billion pages. With 100,000
updates per second we dirty about 30 million pages during a 5 minute
checkpoint, around 2.4% of the pages. Thus it is very unlikely that
any large number of pages receive more than one write.

Thus a checkpoint must write each and every one of those 30 million pages.
This means a total checkpoint size of about 240 GBytes, so the DBMS must
write around 800 MBytes of checkpoint data per second, more than 10x the
amount NDB writes using the Partial LCP algorithm.
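
The same kind of sketch for the page-based checkpoint, assuming every update
dirties a distinct page and using decimal units as in the rest of this post:

  def page_checkpoint(db_bytes, page_size, updates_per_sec, checkpoint_secs=300):
      pages = db_bytes // page_size                     # ~1.25 billion pages
      dirty_pages = updates_per_sec * checkpoint_secs   # ~30 million pages
      total = dirty_pages * page_size                   # ~240 GBytes
      return dirty_pages / pages, total, total / checkpoint_secs

  dirty_fraction, total, rate = page_checkpoint(10 * 10**12, 8 * 10**3, 100_000)
  # dirty_fraction ~ 2.4%, total ~ 240 GBytes, rate ~ 800 MBytes/s,
  # i.e. more than 10x the 75 MBytes/s needed by the Partial LCP.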

In NDB it is possible to write even less during a checkpoint by setting
the configuration parameter RecoveryWork higher. Setting it to its
maximum value of 100 means, in the calculation above, that we only
need to write 15 GBytes per checkpoint, and thus the checkpoint
speed is 50 MBytes per second.

The drawback of this setting is that we increase the work at recovery
instead. With the setting 100 the overhead stored on disk is 2x, and the
same overhead applies at recovery time. The default setting gives 50%
overhead in storage and at recovery.

It is possible to set it lower as well, down to 25. In this case we write
more: in the example we would write 37.5 GBytes, and thus
125 MBytes per second. That is still more than 6x better than the
disk-based DBMS. In this case the overhead in storage is 25%, and
similarly the overhead at recovery.
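
Sweeping the three RecoveryWork settings through the same model (again
assuming the (100 / RecoveryWork) factor mentioned above) gives the numbers
used in this post:

  delta = 100_000 * 250 * 300                 # 7.5 GBytes of changed rows
  for recovery_work in (25, 50, 100):
      total = delta * (1 + 100 / recovery_work)
      print(recovery_work, total / 10**9, total / 300 / 10**6)
  # RecoveryWork=25:  37.5 GBytes per checkpoint, 125 MBytes/s
  # RecoveryWork=50:  22.5 GBytes per checkpoint,  75 MBytes/s
  # RecoveryWork=100: 15.0 GBytes per checkpoint,  50 MBytes/s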

Although the overhead for restoring checkpoints is higher using
Partial LCP, recovery will be a lot faster in 7.6. Recovery includes
running one LCP as part of the restart, and this LCP can be 100x
faster than executing an LCP in 7.5. Thus recovery will often be
significantly faster using 7.6.

Also the disk storage for LCPs is decreased in 7.6.
