Saturday, March 17, 2018

NDB Cluster and disk columns

NDB is mainly an in-memory database. However, we also have the possibility to
store non-indexed columns on disk. This data goes through a page cache, just as
in any other disk-based DBMS.

Interestingly, with ever increasing memory sizes one could think that
disk data becomes less important for MySQL Cluster. Actually the opposite
is true.

The reason is once again HW development. NDB is designed with predictable
latency as a very basic requirement. In the past disks meant hard drives. The access
time to a hard disk was several milliseconds at best. Given that our requirement
was to handle complex transactions within 10 milliseconds, disk data storage
was out of the question.

Modern HW is completely different: it uses SSD devices, first attached through
the SATA interface, which enabled up to around 500 MByte per second and
a few thousand IO operations per second (IOPS). The second step was the
introduction of SSD devices on the PCI bus. This lifted performance to more
than 1 GByte per second. These devices are extremely small and still very powerful.
I have an Intel NUC at home that has two of those devices.

Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe
devices. These still use the same HW, but with a new standard designed for
this new type of storage device. With those devices we now have the ability to
execute millions of IOPS on a standard server box, with access times of a few tens
of microseconds.

This means that modern HW fits very well into the NDB architecture. While
developing the Partial LCP algorithm we also did a lot of work on improving
our disk data implementation. We see more and more people using disk data
columns in NDB.

The next step is even more interesting: it will bring storage onto the memory bus, with
access times of around one microsecond. For NDB such storage can be treated as
memory to start with, thus making it possible to soon have multiple TBytes of memory
in standard boxes.

Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the
disk data columns to store small files in the meta data server of the HopsFS
implementation of the Hadoop HDFS NameNode. This means much faster
access to small files. The tests they did showed that they could handle hundreds
of thousands of file reads and writes per second, even using fairly standard SSDs
on the servers.

The implementation of disk data in NDB is done such that each row can have three
parts. The fixed-size memory part is accessed quickly using a row id. The variable-sized
part is accessed through a pointer from the fixed-size part.

The disk columns form the third part and are accessed through a reference in the
fixed-size part. This reference is an 8-byte value that holds the page id and page
index of the disk columns.
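
To make the three-part layout concrete, here is a minimal C++ sketch. The struct
and field names are invented for illustration only and do not reflect the actual
NDB internal data structures.

  /* Illustrative sketch only: the real NDB row layout and type names differ. */
  #include <cstdint>

  struct DiskRef {            /* reference stored in the fixed-size part     */
    uint32_t page_id;         /* disk page that holds the disk columns       */
    uint32_t page_index;      /* index of the row's entry within that page   */
  };

  struct FixedPart {          /* located directly through the row id         */
    uint64_t gci;             /* global checkpoint id, used as row timestamp */
    void    *var_part;        /* pointer to the variable-sized part          */
    DiskRef  disk_ref;        /* reference to the disk column part           */
    /* ... fixed-size in-memory columns follow ... */
  };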

Before we can access those pages we go through a page cache. The page cache is
implemented using caching techniques that were state of the art a few years ago.

The idea is quite simple. The page cache uses a normal hot page queue. Pages are
brought up in this queue when they are accessed. A single access will bring a page up,
but to stay longer in the page cache a page has to be accessed several times.

Now each page is represented in those queues by a page state record. The basis
of the page cache algorithm is that a page can be represented in a page state
record even if the page is not in the page cache.

NDB has a configuration variable called DiskPageBufferEntries; by default this is
set to 10. It is the multiplication factor for how many more pages we have
page state records for, compared to the number of pages that fit in the page cache.

So for example if we have set DiskPageBufferMemory to 10 GByte and
DiskPageBufferEntries to 10, we will have page state records covering
100 GBytes of pages in the queues. Thus even when a page is paged out we keep it in the
list, and we can therefore see reuse patterns that are longer than what the page cache
itself can hold. With the factor of 10 the page state records amount to
about 3% of the size of the page cache itself. Thus the benefit of the extra
knowledge about page usage patterns comes at a fairly low cost. The factor
of 10 is configurable.
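
A small back-of-the-envelope program illustrates the arithmetic in this example.
The 32 kByte disk page size and the roughly 100 byte size of a page state record
are assumptions made for illustration, not exact NDB internals.

  #include <cstdio>

  int main() {
    const double page_cache_bytes   = 10.0 * 1024 * 1024 * 1024; /* DiskPageBufferMemory = 10 GByte */
    const double page_size_bytes    = 32.0 * 1024;               /* assumed disk page size          */
    const int    buffer_entries     = 10;                        /* DiskPageBufferEntries (default) */
    const double state_record_bytes = 100.0;                     /* assumed page state record size  */

    double pages_in_cache  = page_cache_bytes / page_size_bytes;
    double tracked_pages   = pages_in_cache * buffer_entries;    /* pages with page state records  */
    double tracked_bytes   = tracked_pages * page_size_bytes;    /* ~100 GByte of page history     */
    double record_overhead = tracked_pages * state_record_bytes;

    printf("pages in the cache   : %.0f\n", pages_in_cache);
    printf("tracked page history : %.0f GByte\n", tracked_bytes / (1024.0 * 1024 * 1024));
    printf("state record memory  : %.2f GByte (%.1f%% of the page cache)\n",
           record_overhead / (1024.0 * 1024 * 1024),
           100.0 * record_overhead / page_cache_bytes);
    return 0;
  }

With these assumptions the state records track about 100 GBytes of page history
while using roughly 0.3 GByte of memory, which is where the figure of about 3% comes from.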

Many cloud servers come equipped with hundreds of GBytes of memory (some even TBytes)
and can also store a number of TBytes on NVMe devices. NDB is well suited
for those modern machines, and MySQL Cluster 7.6 has been designed to be
suitable for this new generation of HW.

Friday, March 16, 2018

Discovering rows that have been updated since last checkpoint

One important problem that requires a solution is to decide whether
a row has been updated since the last checkpoint or not.

Most implementations use some kind of mechanism that requires extra
memory resources and/or CPU resources to handle this.

NDB uses the fact that each row is already stamped with a timestamp.
The timestamp is what we call a global checkpoint id. A new global
checkpoint is created about once every 2 seconds (can be faster or
slower by configuration).

Since a global checkpoint covers about 2 seconds of changes, we will overestimate
the number of rows written since the last checkpoint by a little bit, but with
checkpoints taking a few minutes the extra overhead of this is only around 1%.

When we scan rows we check the global checkpoint id of the row; if
it is bigger than the global checkpoint id that the last checkpoint had fully
covered, we write the row as changed since the last checkpoint. We also have
the same information at the page level, so we can check
the page header and very quickly skip an entire page if it hasn't been
updated since the last checkpoint.
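
A minimal C++ sketch of this scan, using invented names (Page, Row,
last_checkpoint_gci); the real NDB code is structured quite differently, this
only illustrates the principle.

  #include <cstdint>
  #include <vector>

  struct Row  { uint64_t gci; /* ... columns ... */ };
  struct Page {
    uint64_t max_gci;                  /* highest GCI of any row on the page */
    std::vector<Row> rows;
  };

  /* Collect rows changed since the last checkpoint. last_checkpoint_gci is
     the highest global checkpoint id fully covered by that checkpoint. */
  void scan_for_changes(const std::vector<Page> &pages,
                        uint64_t last_checkpoint_gci,
                        std::vector<const Row*> &changed)
  {
    for (const Page &page : pages) {
      if (page.max_gci <= last_checkpoint_gci)
        continue;                      /* whole page unchanged: skip it quickly */
      for (const Row &row : page.rows) {
        if (row.gci > last_checkpoint_gci)
          changed.push_back(&row);     /* row written since the last checkpoint */
      }
    }
  }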

The same type of scanning is also used to bring a restarting node in
sync with the live node. This algorithm has been present in NDB since
MySQL 5.1.

Partial LCPs and Read-only tables

In MySQL Cluster 7.5 we use Complete Checkpoints. In MySQL Cluster 7.6
we implement an approach where we only checkpoint a part of the database
in each checkpoint.

A special case is a checkpoint of a table partition where no changes
at all have happened since the last checkpoint. In this case we implemented
a special optimisation such that it is not necessary to checkpoint anything
at all for this table partition. It is only necessary to write a new LCP
control file, which is 4 kBytes in size for each table partition (it can grow to
8 kBytes if recovery requires more than 980 checkpoints).

This means that if your database contains a large set of read-only tables,
there will be no need to checkpoint those tables at all. This feature
is also used when EnablePartialLcp is set to false.
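
A small sketch of this per-partition decision; the struct and function names
are hypothetical and only illustrate the logic described above.

  #include <cstdint>
  #include <cstdio>

  struct Partition {
    uint64_t max_gci;       /* highest GCI of any row in the partition */
    uint64_t last_lcp_gci;  /* GCI fully covered by the previous LCP   */
  };

  /* Stubs standing in for the real work. */
  static void write_lcp_control_file(const Partition &) {
    printf("write only a new LCP control file (4-8 kBytes)\n");
  }
  static void run_partial_lcp(const Partition &) {
    printf("run a normal Partial LCP for the partition\n");
  }

  void checkpoint_partition(Partition &p, uint64_t current_gci) {
    if (p.max_gci <= p.last_lcp_gci) {
      write_lcp_control_file(p);  /* read-only partition: no data to checkpoint */
    } else {
      run_partial_lcp(p);         /* changed partition: normal Partial LCP */
    }
    p.last_lcp_gci = current_gci;
  }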

Partial LCPs and disk space

One of the main objectives of the new Partial LCP algorithm in MySQL
Cluster 7.6 is to keep up with the development of modern HW.

I have already described in previous blogs how Partial LCP can nicely handle
even database sizes of 10 TBytes of memory with a very modest
load on the disk devices.

Now modern HW has shifted from using hard drives to using SSDs.

The original approach in NDB assumes that the checkpoints and
REDO logs are stored on hard drives. In MySQL Cluster 7.5 the
disk space required for the REDO log is a bit larger than the
DataMemory size. The reason is that we want to survive even when
loading massive amounts of data.

In MySQL Cluster 7.5 we also cannot remove any checkpoint files until
a checkpoint is fully completed. This means that we require around
4x the memory size in disk space for REDO logs and checkpoints together.

With hard drives this is not a problem at all. As an example my
development box has 32 GBytes of memory with 2 TByte of disk
space. Thus 64x more disk space compared to the memory space.

With modern servers this size difference between memory and
disks is decreasing. For example many cloud VMs only have
a bit more than 2x the disk size compared to the memory size.

So one goal of MySQL Cluster 7.6 is to fit in much less disk
space.

The aim is to solve this with a three-pronged approach; a small worked example
of the combined effect follows after the three points below.

1) Partial LCP means that we can execute the checkpoints much
faster. Since REDO logs only need to be kept for around two
checkpoints, this means a significant decrease in the size requirements
for REDO logs. The aim is to only need around 10% of the memory
size in disk space for the REDO logs. This work is not completed
in 7.6.4. As usual there are no guarantees when this work will be
completed.

2) Using Partial LCP we can throw away old LCP files as soon
as we have created a new recoverable LCP for the table partition.
Thus it is no longer necessary to store 2 LCPs on disk. At the
same time there is some overhead related to Partial LCPs. With the default
settings this overhead is 50% plus a bit more. Thus we should always
fit within about 1.6x the memory size.

It is possible to set EnablePartialLcp to false; in this case all
checkpoints will be Complete Checkpoints. This means more
disk writes for checkpoints, but it will decrease the storage
space to around the same as the memory size.

3) By setting CompressedLCP to 1 we can decrease LCP storage
by another factor of 2-3x (usually around 2.7x). This feature has
existed for a long time in NDB.
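
To see how the three points combine, here is a small worked example for a
hypothetical node with 512 GByte of DataMemory. The factors come straight from
the points above (around 4x in 7.5; around 10% for REDO, 1.6x for LCPs and
roughly 2.7x compression in 7.6) and the results are rough estimates only.

  #include <cstdio>

  int main() {
    const double data_memory = 512.0;             /* GByte, assumed node size */

    /* MySQL Cluster 7.5: REDO logs plus complete checkpoints require
       around 4x the memory size. */
    double total_75 = 4.0 * data_memory;

    /* MySQL Cluster 7.6 with EnablePartialLcp=true: the aim of about 10%
       of memory for REDO logs plus about 1.6x for the LCP files. */
    double redo_76  = 0.1 * data_memory;
    double lcp_76   = 1.6 * data_memory;
    double total_76 = redo_76 + lcp_76;

    /* Adding CompressedLCP=1: LCP files shrink by roughly 2.7x. */
    double total_76_compressed = redo_76 + lcp_76 / 2.7;

    printf("7.5                   : ~%.0f GByte\n", total_75);
    printf("7.6 (Partial LCP)     : ~%.0f GByte\n", total_76);
    printf("7.6 (+ CompressedLCP) : ~%.0f GByte\n", total_76_compressed);
    return 0;
  }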

Thus it should be possible to significantly decrease the requirements
on storage space when running NDB using MySQL Cluster 7.6.