Thursday, July 31, 2008

1: Making MySQL Cluster scale perfectly in the DBT2 benchmark: Initial discussion

Since the first half of 2006 I've been working on benchmarking MySQL
Cluster using the DBT2 test suite. Initially this meant a fair amount
of work on the test suite itself, and also on a set of scripts to
start and stop NDB data nodes, MySQL Servers and all the other
processes of a DBT2 test. (These scripts and the DBT2 tests I'm using
are available for download at www.iclaustron.com.)

Initially I worked with an early version of MySQL Cluster based on
version 5.1, which meant that I hit a number of the performance bugs
that had appeared during its development. Nowadays the stability is
really good, so for the most part I've spent my time focusing on what
is required in the operating system and in the benchmark application
to get optimum scalability.

Early on I discovered some basic features that were required to get
optimum performance from MySQL Cluster in this benchmark. One of them
is simply to use partitioning properly. In the case of DBT2 most
tables (every one except the ITEM table) can be partitioned on the
warehouse id, so the partitioning feature I developed as part of 5.1
came in handy here. It's possible to use either PARTITION BY KEY
(warehouse_id) or PARTITION BY HASH (warehouse_id). Personally I
prefer PARTITION BY HASH since it spreads the warehouses perfectly
amongst the data nodes. However, in 5.1 this isn't fully supported,
so one has to start the MySQL Server with the --new flag to use this
feature with MySQL Cluster.
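
To make this concrete, here is a minimal sketch, using the MySQL C
API, of how one of the DBT2 tables could be created partitioned on
the warehouse id. The columns are abbreviated, the connection
parameters and the number of partitions are just placeholders, and
the server must have been started with --new for the HASH variant to
be accepted on an NDB table in 5.1.

#include <stdio.h>
#include <mysql.h>

int main(void)
{
  MYSQL *conn = mysql_init(NULL);
  /* Placeholder connection parameters. */
  if (!mysql_real_connect(conn, "localhost", "root", "", "dbt2",
                          3306, NULL, 0))
  {
    fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }
  /* Abbreviated DBT2-style warehouse table, partitioned on the
     warehouse id so each warehouse lives in exactly one partition. */
  const char *ddl =
    "CREATE TABLE warehouse ("
    "  w_id INT NOT NULL,"
    "  w_name VARCHAR(10),"
    "  PRIMARY KEY (w_id)"
    ") ENGINE=NDBCLUSTER "
    "PARTITION BY HASH (w_id) PARTITIONS 4";
  if (mysql_query(conn, ddl))
    fprintf(stderr, "DDL failed: %s\n", mysql_error(conn));
  mysql_close(conn);
  return 0;
}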

The second one was the ability to place the transaction coordinator
on the same data node that stores the warehouse the transaction is
working on. This is handled by a new feature introduced in MySQL
Cluster Carrier Grade Edition 6.3 whereby the transaction coordinator
is started on the node where the first query is targeted. This works
perfectly for DBT2 and for many other applications, and it's fairly
easy to change your application if it doesn't fit immediately.
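
As an illustration (a sketch only, with DBT2-style table and column
names and a hypothetical helper), the change in the application is
simply to make sure the first statement of the transaction touches a
row keyed on the warehouse id, so the 6.3 feature can place the
transaction coordinator on the data node holding that warehouse:

#include <stdio.h>
#include <mysql.h>

/* Sketch: begin a DBT2-style transaction so that the first query
   hits the warehouse row; the transaction coordinator is then
   started on the data node that stores this warehouse. */
static int begin_warehouse_txn(MYSQL *conn, int w_id)
{
  char query[128];
  if (mysql_query(conn, "BEGIN"))
    return 1;
  snprintf(query, sizeof(query),
           "SELECT w_tax FROM warehouse WHERE w_id = %d", w_id);
  return mysql_query(conn, query) ? 1 : 0;
}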

The next feature was to ensure that sending uses buffers that are as
large as possible and also to avoid wake-up costs. Both of these
meant changes to the scheduler in the data nodes of MySQL Cluster.
These changes work very well in most cases where there are sufficient
CPU resources for the data nodes. This feature was also introduced in
MySQL Cluster CGE version 6.3.

Another feature that is very important for optimum scalability is to
ensure that the MySQL Server starts scans only on the data nodes
where it will actually find the data. This is done through partition
pruning, as introduced in MySQL version 5.1. Unfortunately a bug was
introduced late, which I recently discovered, that decreased
scalability for DBT2 (this is bug#37934, which contains a patch that
fixes the problem; the patch hasn't yet been pushed to any 6.3
version).
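
One can check that pruning actually happens with EXPLAIN PARTITIONS
(available in 5.1). A hedged sketch, again with DBT2-style table and
column names:

#include <stdio.h>
#include <mysql.h>

/* Sketch: ask the optimizer which partitions a warehouse-bound query
   touches. With pruning working, only the partition holding
   d_w_id = 7 should be listed. */
static void show_pruning(MYSQL *conn)
{
  if (mysql_query(conn,
        "EXPLAIN PARTITIONS SELECT d_next_o_id FROM district "
        "WHERE d_w_id = 7 AND d_id = 3"))
  {
    fprintf(stderr, "explain failed: %s\n", mysql_error(conn));
    return;
  }
  MYSQL_RES *res = mysql_store_result(conn);
  MYSQL_ROW row;
  while ((row = mysql_fetch_row(res)) != NULL)
    /* The 4th column of EXPLAIN PARTITIONS output lists partitions. */
    printf("partitions used: %s\n", row[3] ? row[3] : "NULL");
  mysql_free_result(res);
}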

With these features there were still a number of scalability issues
remaining in DBT2. One was the obvious one that the ITEM table is
spread over all data nodes, and thus reads of the ITEM table will use
network sockets that aren't so "hot". There are two solutions to
this. One is that MySQL Cluster implements some tables as fully
replicated on all data nodes; this might arrive some time in the
future. The other variant uses standard MySQL techniques: one places
the table in another storage engine, e.g. InnoDB, and uses
replication to spread the updates to all the MySQL Servers in the
cluster. This technique can be applied to many web applications where
some tables need to be in MySQL Cluster for availability and proper
transactional updates, while other tables can be updated in a lazy
manner.
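
A hedged sketch of the second variant: the ITEM table is moved out of
NDB into InnoDB on each MySQL Server, and ordinary MySQL replication
(for instance restricted with replicate-do-table to just this table)
propagates the rare updates to all servers.

#include <mysql.h>

/* Sketch: ITEM is read-mostly, so keep a local InnoDB copy on each
   MySQL Server and let MySQL replication spread the updates. */
static int localize_item_table(MYSQL *conn)
{
  return mysql_query(conn, "ALTER TABLE item ENGINE=InnoDB") ? 1 : 0;
}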

Finally, there is one more remaining issue, and that is when the
MySQL Servers don't work on partitioned data. In the case of DBT2, if
all MySQL Servers can access data in a certain node group, then the
data nodes will have more network sockets to work with, which
increases the cost of networking. This limits scalability as well.

In the case of DBT2 this can be avoided by using a spread parameter
that ensures that a certain MySQL Server only uses a certain node
group in the MySQL Cluster. In a generic application this would be
handled by an intelligent load balancer that ensures that the MySQL
Servers work on different partitions of the data in the application.
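
In code, such a load balancer can be as simple as a static mapping
from the partitioning key to a MySQL Server. A hedged sketch (the
modulo mapping and the connection array are illustrative and assume
one MySQL Server per node group; this is not the DBT2 implementation):

#include <mysql.h>

/* Sketch: route each warehouse to the MySQL Server in front of the
   node group that stores it, so each server mostly talks to a single
   node group. */
static MYSQL *pick_server(MYSQL **connections, unsigned int num_servers,
                          unsigned int warehouse_id)
{
  return connections[warehouse_id % num_servers];
}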

What I will present in future blogs is some data on how much effect
the issues mentioned above have on the scalability of the DBT2
benchmark for MySQL Cluster.

What is more surprising is that there are also a number of other
issues, related to the use of the operating system, which aren't
obvious at all. I will present those as well, and what they mean in
terms of scalability for MySQL Cluster using DBT2.

Finally, in a real application perfect scalability will seldom occur,
so in any real application it's also important to minimize the impact
of scalability issues. The main technology to use here is cluster
interconnects, and I will show how the use of cluster interconnects
affects scalability issues in MySQL Cluster.

Note that the numbers from these DBT2 runs are merely used here to
compare different configurations of MySQL Cluster.

Wednesday, July 30, 2008

2: Thoughts on a new NDB API: Send part

In the current API, when sending, one takes the Transporter mutex and
then sends all the signals generated towards one or many nodes. There
is also some handling of adaptive sends; however, this adaptive
algorithm covers all nodes, so the decision to wait before sending is
global across all nodes.

The new design uses one mutex for sending; however, this mutex only
protects the sending part of one socket. Also, the mutex is held just
long enough to check the state; no send operations are performed
while holding it.

The new adaptive algorithm keeps track of the last few messages sent
on this socket, and in principle the idea is that if there is at
least a 90-99% probability that waiting is a good idea, then it will
wait (unless the application has set the force-send flag).

So in principle the data structure protected by the mutex is:
struct ic_send_node_mutex
{
  IC_SEND_THREAD_MUTEX *send_thread_mutex; /* Used to wake the send thread for this socket */
  Mutex                mutex;              /* Protects the fields below */
  boolean              send_active;        /* TRUE while some thread is sending on this socket */
  IC_COMM_BUFFER       *first_cb;          /* First buffer in the queue of buffered messages */
  IC_COMM_BUFFER       *last_cb;           /* Last buffer in the queue of buffered messages */
  uint32               queued_bytes;       /* Number of bytes currently queued for sending */
  Timer                first_buffered_timer;  /* When the oldest buffered message was queued */
  Timer                last_sent_timers[8];   /* Timestamps of the last sends (adaptive algorithm) */
  uint32               last_sent_timer_index; /* Next slot to use in last_sent_timers */
};
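
To make the adaptive part concrete, here is a rough sketch of how the
decision to buffer rather than send could be taken from the last few
send timestamps. The time unit and the threshold are placeholders,
not the final iClaustron code.

#include <stdbool.h>
#include <stdint.h>

#define IC_NUM_SENT_TIMERS 8

/* Sketch: if the last few sends on this socket were closely spaced,
   another send is very likely to arrive soon, so buffering the
   message is worthwhile unless the application forced the send. */
static bool should_wait_with_send(const uint64_t last_sent_micros[IC_NUM_SENT_TIMERS],
                                  uint64_t now_micros,
                                  bool force_send)
{
  if (force_send)
    return false;
  uint64_t oldest = last_sent_micros[0];
  for (unsigned int i = 1; i < IC_NUM_SENT_TIMERS; i++)
    if (last_sent_micros[i] < oldest)
      oldest = last_sent_micros[i];
  /* All of the last 8 sends happened within the last 400 microseconds:
     treat that as a high probability that waiting will pay off. */
  return (now_micros - oldest) < 400;
}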

For each socket there is a specific send thread. This thread is
mostly sleeping, waiting for someone to wake it up. One reason to
wake it up is that one thread has started sending and other threads
have provided so much work that the sending needs to be offloaded to
a dedicated thread (the idea is that sending is normally done by an
application thread which is involved in user activity, and we cannot
keep this thread for longer than a few sends; thus we need to make it
possible to offload send activity to a dedicated send thread when a
high load appears). The send thread can also be woken up to send
buffered messages that have timed out.

The flag send_active is true whenever a thread is actively sending,
and thus a thread that needs to send while this flag is set can
simply return immediately; if the flag is not set, the thread can set
it and start sending itself (see the sketch below).

It would probably be possible to handle this without a mutex, but the
contention on this mutex should be small enough, and there is also
some wakeup logic for which a mutex makes sense.
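
Here is a sketch of how an application thread would use send_active
and the struct above. The mutex primitives and the helpers
queue_comm_buffer, ic_send_on_socket and ic_wake_send_thread are
placeholders for illustration, and the actual socket writes are left
out.

/* Sketch: an application thread either becomes the active sender or
   just queues its buffer and lets the current sender (or the send
   thread) take care of it later. */
static void ic_send(struct ic_send_node_mutex *snm, IC_COMM_BUFFER *cb)
{
  boolean do_send;
  mutex_lock(&snm->mutex);
  queue_comm_buffer(snm, cb);           /* link cb into first_cb/last_cb */
  do_send = !snm->send_active;          /* become the sender if nobody is */
  if (do_send)
    snm->send_active = TRUE;
  mutex_unlock(&snm->mutex);

  if (!do_send)
    return;                             /* someone else is already sending */

  ic_send_on_socket(snm);               /* at most a few sends here */

  mutex_lock(&snm->mutex);
  snm->send_active = FALSE;
  if (snm->queued_bytes > 0)                     /* more piled up meanwhile */
    ic_wake_send_thread(snm->send_thread_mutex); /* offload to send thread */
  mutex_unlock(&snm->mutex);
}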

The application thread can prepare the NDB Protocol messages
completely before acquiring the mutex; the only activity that
sometimes happens while holding the mutex is reading the current time
for the adaptive algorithm.

Sends normally go to an NDB data node, but they could also go to
another client node and could even go to another thread in the same
process. This is important for handling parallelisation: to
parallelise, it is sufficient to send a number of messages to other
nodes and/or threads. Each message can kick off at least one new
thread.

Tuesday, July 29, 2008

1: Thoughts on a new NDB API: Baseline thoughts

I spent some time during my vacation thinking about some new ideas. I
designed the first version of the NDB API about 10 years ago, and
obviously in those days the maximum number of CPUs in most systems
was 2, so it wasn't a big problem to have a single mutex protecting
send and receive in the NDB API. (The NDB API is the low-level API
used by the NDB storage engine, which is the storage engine in MySQL
Cluster.)

Another design criterion I adopted when designing the NDB API was
that most developers want to use a synchronous API. Thus the
asynchronous API was added afterwards and didn't cover all
operations. Most developers still develop using synchronous APIs;
however, most of the use of the NDB API is for specialised
applications such as telco servers, storage engine code and LDAP
servers. I'm also thinking of using it inside an operating system
kernel to design a clustered file system.

Thus today it seems like a better idea to use an asynchronous
API as the base and then put the synchronous API on top of
this.
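
A hedged sketch of what that layering can look like (the names and
the asynchronous operation are invented for illustration, not an
existing API): the asynchronous call takes a callback, and the
synchronous variant is a thin wrapper that blocks until the callback
has fired.

#include <pthread.h>

/* Hypothetical asynchronous operation: runs the callback when the
   result of the key lookup arrives from the data node. */
typedef void (*ic_op_callback)(void *user_data, int error_code);
extern int ic_key_read_async(void *op, ic_op_callback cb, void *user_data);

struct sync_waiter
{
  pthread_mutex_t mutex;
  pthread_cond_t  cond;
  int             done;
  int             error_code;
};

static void sync_callback(void *user_data, int error_code)
{
  struct sync_waiter *w = (struct sync_waiter*)user_data;
  pthread_mutex_lock(&w->mutex);
  w->done = 1;
  w->error_code = error_code;
  pthread_cond_signal(&w->cond);
  pthread_mutex_unlock(&w->mutex);
}

/* The synchronous API is implemented on top of the asynchronous one. */
static int ic_key_read_sync(void *op)
{
  struct sync_waiter w =
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };
  int ret = ic_key_read_async(op, sync_callback, &w);
  if (ret != 0)
    return ret;
  pthread_mutex_lock(&w.mutex);
  while (!w.done)
    pthread_cond_wait(&w.cond, &w.mutex);
  pthread_mutex_unlock(&w.mutex);
  return w.error_code;
}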

When designing the original NDB API it was sufficient to think of
simple key lookups; later it was extended to also handle scans of
tables and indexes. However, current design problems are related to
parallelising SQL queries, and there are also implementations of
things such as BLOBs that actually require multiple sequential and
parallel operations. Thus in the new design it's necessary to
consider the possibility of starting complex operations involving
multiple threads (sometimes even multiple processes), with multiple
operations in sequence and in parallel.

These ideas will be fed into the existing NDB API. They will also be
used in the iClaustron project, where I aim to build something that
can be used as a clustered file system. iClaustron is designed with
the aim of at some point becoming a useful product, but at the same
time I use it as my personal playground where I can test new ideas
and see how they turn out when turned into source code.

The original NDB API was designed in C++, as is all the rest of the
MySQL Cluster code. Within the data nodes I think we've found a good
compromise on what to use in the C++ language and what not to use.
However, in general I've found that debates around what should be
used in C++ tend to take a disproportionate amount of time compared
to the value of those discussions. For that reason I decided to use C
as the language of choice for iClaustron. Actually there were more
reasons for this: first, it makes it a lot easier to use the code
inside an operating system kernel such as Linux or FreeBSD, and
second, it makes it easier to write layers for other languages such
as Python, Perl, ...

Most of the thoughts on this new NDB API have been in my mind for
more than 2 years (some of them have already been implemented in
NDB); however, during my vacation I had some fun working out all the
details I hadn't considered previously.

My idea of a nice vacation is to relax on a beach or walk in the
mountains while inventing some new ideas based on an interesting
problem. I cheated by solving a Sudoku as well this vacation, but in
general I like mind games that are related to what I do for a living.
Inventing the idea is the fun part of innovation; then comes the hard
part of actually working out all the details, writing code, testing
code and selling the ideas. That is the work part.

I will follow this posting with a high-level view of the ideas as far
as they've been developed so far. In parallel I'll also "dump" the
ideas into code form. I like to think of my coding as a "brain dump";
I have a fairly unusual way of writing code. I think about the
problem for a long time, and when I'm satisfied I write the code for
it. I then write all the code with only a very minimal set of
compilation runs and test cases. The main idea of coding in this
phase is still design, so in principle I write a design in the form
of code. This also means that I try to write as many comments as
possible, since I know that I will otherwise forget my ideas. Working
for MySQL has also made me much more aware of software engineering
issues, so today I do a bit of thinking on software engineering in
the design as well.

An architecture for the design is obviously very important, and the
architecture borrows heavily from the way the Linux kernel is
designed, with lots of interfaces similar to the VFS interface in
Linux, each expressed as a struct containing a set of function
pointers.
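
As an illustration, here is a hedged sketch of what such an interface
can look like in C. The operation names are invented for illustration
and are not the actual iClaustron interfaces.

/* Sketch: an interface is a struct of function pointers, in the same
   spirit as the file_operations struct in the Linux VFS. */
struct ic_connection;

struct ic_connection_ops
{
  int  (*open_connection)  (struct ic_connection *conn);
  int  (*write_data)       (struct ic_connection *conn, const void *buf,
                            unsigned int size);
  int  (*read_data)        (struct ic_connection *conn, void *buf,
                            unsigned int size, unsigned int *read_size);
  void (*close_connection) (struct ic_connection *conn);
};

struct ic_connection
{
  struct ic_connection_ops ops;  /* Callers use conn->ops.write_data(conn, ...) */
  void *impl_state;              /* Implementation-specific state */
};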