Thursday, August 21, 2008

Some food for thoughts: How to make use of new SSD devices

The hardware guys are presenting new storage devices called
SSD's based on flash memory. At the moment I think they are
about 3-4 times cheaper than DRAM memory and the gap seems
to be increasing. They're still far from the price of hard
drives but also here the gap seems to be closing.

So as I'm now an employee of Sun that actually puts together
systems with this type of HW in it I get questioned what I
as a DBMS developer can do with those devices.

First some comments on performance. These new devices will be
able to perform reads and writes of a few kilobytes large pages
in about 25-100 microseconds compared to hard drives which
takes about 3-10 milliseconds for the same thing.

An obvious use is obviously to use them to speed up database
logging, particularly in commit situations. However this
doesn't really require any significant changes to the SW
already out there. So I won't spend any more time on this use.

Another use is for MySQL Cluster. MySQL Cluster stores most data
in memory and can store non-indexed data on disk. So how can
SSD devices be used to improve this.

First some facts about performance of MySQL Cluster. In the data
node where the data actually resides it takes about 10
microseconds of processing time to perform a key lookup and a
scan has about 20 microseconds of start-up costs whereafter each
record takes 1-2 microseconds to fetch.

So now for the idea. Let's assume we'll use an SSD device as swap
memory. We would then purposely set the swap to be e.g. 10x
larger than the memory. For this to work we need to be able to
allocate memory from different swap pools, memory used for
transaction state and things like this we don't want swapped out
(working for Sun has an advantage since we can work with the OS
guys directly, but naturally I hope Linux developers also take the
same opportunity).

So during a key lookup we need to get one page from the hash index
and one page with the record in it. Guestimating a 90% hit rate in
the hash index and 80% hit rate on the data page we find that we
will about 0.3 swap misses per key lookup. If we assume 50
microseconds for this it means that mean key lookup will increase
from 10 microseconds to 25 microseconds. This should be
acceptable, given that we can increase data size by a factor of
about 10.

A similar analysis can be made for scans as well, but I'm lazy so
will leave it to you to perform :)

So given todays sizes of memories and SSD's it should be possible
to use systems with 64 GBytes of memory and 640 GB of SSD memory
and clustering 8 of those with replication gives us a main memory
based system for a reasonable price providing 2.5 TByte of user
data in a highly available system with high degrees of parallelism
in the system.

New partitioning features

As burtonator pointed out parallelism is an important
feature that partitioning makes possible. So I thought
it might be a good idea to mention a little bit what
we're doing in the area of partitioning.

It's quite correct that parallelism is one of the main
advantages of partitioning (not the only one though since
also partition pruning and dividing large indexes and
being able to add and drop partitions efficiently are
important as well). In 5.1 we focused on the maintenance
features of partitioning but the intention to move on
to parallelisation was more or less the main goal from
the very start.

This is why it's such extra fun to actually get going on
this when one has worked on the foundation for this work
for almost 4 years (partitioning development started out
2004 H2 and most of the partitioning code in 5.1 was ready
about two years later).

There are also ideas to introduce parallelism for scans of
large partitioned tables and also a few more maintenance
features that are still missing.

Another feature in the works for partitioning is the
ability to use partition pruning on several fields. This
will be possible for PARTITION BY RANGE and LIST. The
syntax will look something like this:

CREATE TABLE t1 (a varchar(20), b int)
PARTITION BY RANGE (COLUMN_LIST(a,b))
(PARTITION p0 VALUES LESS THAN (COLUMN_LIST("a", 1)),
PARTITION p1 VALUES LESS THAN
(COLUMN_LIST(MAXVALUE, 4)));

In this case it is possible to partition on any field type
and it is also possible to do partition pruning on multiple
fields in much the same way as it is for indexes.

E.g.
select * from t1 where a = "a";
select * from t1 where a = "a" and b = 2;

will both be able to use for partition pruning with the
second obviously able to do more pruning then the first one.

Wednesday, August 20, 2008

Multi-threaded ALTER TABLE

Today I achieved something which is a first in the MySQL
server as far as I'm aware of. I managed to run a query
with multiple threads. The query was:
ALTER TABLE t1 ADD COLUMN b int;
and the table had 4 partitions in it. So it used 4 threads
that each thread handled the copying of data from old
table to new table of one partition.

Currently it's designed for use by partitioned tables but
it should be very straightforward to do minor parallelisation
also of non-partitioned tables by e.g. breaking up in a scan
thread and a write thread.

It's nice to get started on this track and see how one can
make use of modern computers with a great deal of CPU power
if one can parallelise the applications. As an example a
dual socket box T5220 (2 Niagara II CPU's) can handle 128
threads in parallel.

Friday, August 01, 2008

3: Thoughts on a new NDB API: Adaptive send algorithm

I thought a bit more on the adaptive send algorithm and kind of like
the following approach:

Keep track of how many sends we are at maximum allowed to wait
until we send in any ways. This is the state of the adaptive send
algorithm which is adapted through the following use of statistics
(we call this state variable max_waits):

For each send we calculate how long time has passed since the
send that was sent max_waits sends ago. We also do the same for
max_waits + 1. At certain intervals (e.g. every 10 milliseconds) we
calculate the mean wait that a send would have to do, if this lies
within half the desired maximum wait then we accept the current
state, if also the mean value using max_waits + 1 is acceptable
then we increase the state by one. If the state isn't acceptable
we decrease it by one.

In the actual decision making we will always send as soon as we
notify that more than the maximum wait time has occurred so this
means that the above algorithm is conservative. However the user
should have the ability to control how long he accepts a wait
through a configuration variable, thus increasing or decreasing
send buffering at the expense of extra delays.

This algorithm is applied on each socket and the actual decision
making is done within the critical section and also the statistics
calculation and from coding this it seems like the overhead should
be manageable.