Tuesday, November 01, 2011

MySQL Thread Pool: Storage Engines

I got a question from the NDB folks, who are currently adapting MySQL Cluster to
MySQL 5.5, about whether any special development is needed to adapt the NDB
storage engine for use with the thread pool. Then I realised there are more
people out there writing storage engines who want to know how to optimise
them for the thread pool.

So first of all, any storage engine will work with the thread pool as it is today,
without any modifications. It is however possible to improve the performance of the
MySQL Server when using the thread pool by adapting the storage engine to the
thread pool APIs.

The new API that has been added to the MySQL 5.5 server is the thd_wait interface.
This interface makes it possible for storage engines to report to a thread pool
plugin before starting a wait and after finishing a wait.

As an example, we have adapted the InnoDB storage engine by adding thd_wait
interface calls around row locks in InnoDB and before file IO caused by misses in
the InnoDB buffer pool. The InnoDB code has also been changed to make those
callbacks as part of the implementation of --innodb-thread-concurrency and
when waiting for flushes of the buffer pool as part of checkpoints and other
activities where writes are required to ensure proper operation of InnoDB.

The NDB storage engine has very different reasons for its waits. NDB implements
the actual data management in the NDB data nodes (these nodes run in processes
separate from the MySQL Server), so the only reason for waits in the MySQL Server
is when we're waiting for packets to return from the NDB data nodes.

Most third-party storage engines probably resemble InnoDB and/or NDB in how
they should integrate with the thread pool plugin. Some storage engines
perform all their work inside the MySQL Server. The more advanced of these are likely
to also have a buffer pool and should thus consider calling the thd_wait interface
when doing IO; such storage engines are also likely to acquire row locks or some
similar level of data lock that sometimes requires an extended wait. There are
also storage engines that are distributed in nature, such as NDB; these
storage engines will want to make the callbacks to the new thread pool API when
waiting for responses from the network.

For storage engines that implement some data structure similar to the THD object in the
MySQL Server, there is one additional thing to consider. When using a thread pool it
makes sense to consider pooling such objects, given that the thread pool
will pool threads. As an example, we have such an object called Ndb in the NDB API that
has the potential to be pooled. The benefits of pooling such objects are less time
spent creating them, lower memory usage and thus fewer CPU cache misses.

The thd_wait interface is really simple. It contains two calls: thd_wait_begin and
thd_wait_end. Both calls have the THD object as their first parameter. Often the THD
object isn't known in the storage engine code when needed. In this case one simply
passes NULL as the THD object. The thd_wait interface can even handle the case where
it is used from threads that are private to the storage engine.
The thread pool will discover that there is no THD object attached to the thread
and ignore the call.

The thd_wait_begin call also has a second parameter that specifies the type of
wait; this will show up in the thread pool information schema tables, where there
are statistics on waits per type. There are currently 10 wait types.

To see an example of usage of this interface, search for thd_wait in the InnoDB
storage engine source code in the MySQL 5.5 community server.

The MyISAM storage engine does not use this API because MyISAM relies on the
MySQL Server for locking. Also, MyISAM assumes that the OS takes care of caching
of pages. This means that there is a very high probability that writes to the
file system are handled directly in the file system cache without involving any
long waits.

What is the effect of not modifying a storage engine to implement the thd_wait
interface? The thread pool operates by trying to always have one thread active
per thread group. If the active thread blocks and the thread pool is informed
of the block, then the thread pool can start another thread to ensure that the
thread group is being used efficiently. If the storage engine is not modified to
implement the thd_wait interface, the thread pool is not informed of the block.
In this case, the thread group will be blocked for a while until the wait is
completed or until the query is declared as stalled. The throughput of the system
can to some extent be maintained in those cases by increasing the number of thread
groups.

So implementing the thd_wait interface means better throughput and also less
variance of the throughput and waiting times.

To use these interfaces in a file, include two header files (the thd_wait interface is
part of the plugin APIs in the MySQL 5.5 community and commercial servers).

#include "mysql/plugin.h"
#include "mysql/service_thd_wait.h"

Below is the most important information in these header files.

typedef enum _thd_wait_type_e {
  THD_WAIT_SLEEP= 1,
  THD_WAIT_DISKIO= 2,
  THD_WAIT_ROW_LOCK= 3,
  THD_WAIT_GLOBAL_LOCK= 4,
  THD_WAIT_META_DATA_LOCK= 5,
  THD_WAIT_TABLE_LOCK= 6,
  THD_WAIT_USER_LOCK= 7,
  THD_WAIT_BINLOG= 8,
  THD_WAIT_GROUP_COMMIT= 9,
  THD_WAIT_SYNC= 10,
  THD_WAIT_LAST= 11
} thd_wait_type;

void thd_wait_begin(MYSQL_THD thd, int wait_type);
void thd_wait_end(MYSQL_THD thd);

THD_WAIT_SLEEP: For uninterrupted sleeps.
THD_WAIT_DISKIO: For file IO operations that are very likely to cause an actual
disk read.
THD_WAIT_ROW_LOCK: For row locks/page locks in the storage engine.
THD_WAIT_GLOBAL_LOCK: For global locks such as the global read lock in the MySQL
Server.
THD_WAIT_META_DATA_LOCK: For waiting on a meta data lock which isn't a table lock.
THD_WAIT_TABLE_LOCK: When waiting for a table lock.
THD_WAIT_USER_LOCK: For special user-level locks, such as those taken with GET_LOCK().
THD_WAIT_BINLOG: When waiting for the replication binlog.
THD_WAIT_GROUP_COMMIT: When waiting for a group commit to complete.
THD_WAIT_SYNC: When waiting for an fsync operation.

It's quite likely that we will introduce more wait types, such as waits on the network.
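
As an illustration, here is a minimal sketch of how a storage engine might wrap a
blocking row-lock wait with these calls. The lock structure and the
wait_for_row_lock function are hypothetical placeholders standing in for the
engine's own locking code; the real InnoDB code referenced above looks different
in detail.

#include "mysql/plugin.h"
#include "mysql/service_thd_wait.h"

/* Hypothetical lock structure and blocking wait, standing in for the
   storage engine's own locking code. */
struct example_row_lock;
extern int wait_for_row_lock(example_row_lock *lock);

int example_lock_row(MYSQL_THD thd, example_row_lock *lock)
{
  /* Report the start of the wait; thd may be NULL, in which case the
     thread pool simply ignores the call. */
  thd_wait_begin(thd, THD_WAIT_ROW_LOCK);
  int error= wait_for_row_lock(lock);  /* may block for a long time */
  /* Report that the wait has finished. */
  thd_wait_end(thd);
  return error;
}

The same pattern applies to buffer pool misses (THD_WAIT_DISKIO) and, for a
distributed engine, to waits for network responses.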

Thursday, October 27, 2011

MySQL Thread Pool: Summary

A number of blogs have been written with the intent of describing how
the thread pool manages to meet its requirements.

These blogs are:
MySQL Thread Pool: Problem Definition
MySQL Thread Pool: Scalability Solution
MySQL Thread Pool: Limiting number of concurrent statement executions
Automated benchmark tool for DBT2, Sysbench and flexAsynch
MySQL Thread Pool: Limiting number of concurrent transactions
MySQL Thread Pool: When to use?
MySQL Thread Pool vs. Connection Pool
MySQL Thread Pool: Optimal configuration
MySQL Thread Pool: Benchmarking

There are some interesting discussions in the comments on the scalability solution blog,
on the blog about limiting the number of concurrent statement executions,
and on the blog about when to use the thread pool.

These discussions cover when to use it, what other features might be worth
considering, and the types of benchmarks that could be used to
evaluate solutions.

The requirements we had on the thread pool solution, and the solutions, were:
1) Split threads into individually managed groups to avoid making the
solution a problem in itself; the aim is to manage one active thread per
group.

Solution:
Connections are put into a thread group at connect time by round robin.
The number of thread groups is configurable. This ensures that the thread pool
itself isn't a scalability bottleneck.

2) Wait for execution of a query until the MySQL Server has sufficient
CPU and memory resources to execute it.

Solution:
Each thread group tries to keep the number of executing queries at one or
zero. If a query is already executing in the thread group, the connection is put
in a wait queue.

3) Prioritize queries on connections that have an ongoing transaction.

Solution:
Put waiting connections in high priority queue when a transaction is
already started on the connection.

4) Avoid deadlocks when queries are stalled or execute for a long time.

Solution:
Allow another query to execute when the executing query in the thread
group is declared as stalled (after a configurable time).

Wednesday, October 26, 2011

MySQL Thread Pool: Benchmarking

We have executed a number of benchmarks using the thread pool to
see how it operates in various workloads. A thorough study on this
can be found in Dimitri's blog here.

The optimal number of active connections is the number of active connections
needed to achieve the best throughput for the MySQL Server. For an InnoDB
workload this is usually around 32-128 active connections.

From all our benchmarks we've seen that the performance of the thread pool
when operated with fewer than the optimal number of active connections is
about 1-3% lower than without the thread pool, since the behaviour is the same
and the thread pool adds a little extra overhead. More or less all of
this overhead comes from handling KILL QUERY correctly.

When operated in the region of the optimal number of active connections
the performance is very similar. We have seen though that the thread pool
benefits very much from locking the MySQL Server to a number of CPUs
equal to the setting of the thread_pool_size configuration parameter.
When not locked to CPUs the performance is similar; when locked to CPUs
the thread pool gives 10-15% higher performance when using the optimal
number of active connections. The MySQL Server operated without the thread
pool and locked to CPUs shows no significant change of throughput compared
to not locking to CPUs.

When operating above the optimal number of connections the thread pool
provides a great benefit; we've seen numbers all the way up to 100x
better performance when operating with a few thousand concurrently
active connections.

MySQL Thread Pool: Information Schema Tables

The thread pool has three information schema tables. These are TP_THREAD_STATE,
TP_THREAD_GROUP_STATE and TP_THREAD_GROUP_STATS.

The TP_THREAD_STATE table contains one row per thread that is currently
maintained by the thread pool. The row only contains interesting information
if the thread is actively executing a statement. In that case it shows how
many 10 millisecond slots the query has consumed and, if the thread is blocked by
some event, which event it is blocked on. Both of these items reflect current state
and will change for each new query.

The TP_THREAD_GROUP_STATE table contains one row per thread group. It contains
information about the number of threads of various types. The first type is the
consumer thread: a thread not in use for the moment; at most one such thread
exists at any point in time. It is the next thread to use if the current threads
are not enough and a new thread is needed.

The second type is reserve threads; these are also threads not currently in
use. They are used when there is no consumer thread and a new thread needs to be
started.

The table also shows the current number of connections handled in this thread
group, as well as the current number of queued low priority
statements (QUEUED_QUERIES) and queued high priority statements (QUEUED_TRANS).

It contains information about configuration state: the stall limit, the priority
kickup timer and the algorithm used. It also shows the current number of threads in
the thread group, the current number of threads actively executing a statement in the
thread group and the current number of stalled statement executions.

Finally, it contains the thread number of a possible
waiter thread (the thread that listens for incoming statements) and information about
the oldest query that is still waiting to be executed.

The last table is TP_THREAD_GROUP_STATS, which contains statistics about the
thread group.

There are statistics about the number of connections, the number of connections closed,
the number of queries executed, stalled and queued, and the
number of queries that were kicked up in priority from low priority to high priority.

There are also statistics on threads: how many threads have been started, and how many
threads have become consumer threads, reserve threads or waiter threads, as well as
how many times the thread that checks for stalled queries decided to start a thread
to handle the possibility of executing a query.

Finally there are statistics about each blocking event coming from the MySQL Server
(meta data locks, row locks, file IO, sleeps and so forth).

One of the most important pieces of information here is the number of stalled queries
(STALLED_QUERIES_EXECUTED in TP_THREAD_GROUP_STATS). This counter
gives a good idea of whether we have many stalled queries; if there are too many such
queries, it is a good indication that one should consider increasing
thread_pool_stall_limit.

Another very important piece of information is the number of priority kickups
(PRIO_KICKUPS in TP_THREAD_GROUP_STATS). If this counter
grows too quickly it is an indication that thread_pool_prio_kickup_timer
might need to be set higher.

It might at times be important to check the number of threads started
(THREADS_STARTED in TP_THREAD_GROUP_STATS) as well.
If threads are started too often, it's a good indicator that we are being too
aggressive in stopping threads and should set thread_pool_max_unused_threads
a bit higher.
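
As a rough sketch of how one might poll these counters from a small monitoring
program, using the regular MySQL C API: the column names below follow the
descriptions above (with TP_GROUP_ID assumed as the group identifier column) and
should be checked against the tables in your own server; connection parameters
are placeholders.

#include <mysql.h>
#include <stdio.h>

int main()
{
  MYSQL *con= mysql_init(NULL);
  if (!mysql_real_connect(con, "127.0.0.1", "monitor", "secret",
                          "information_schema", 3306, NULL, 0))
  {
    fprintf(stderr, "connect failed: %s\n", mysql_error(con));
    return 1;
  }
  if (mysql_query(con,
        "SELECT TP_GROUP_ID, STALLED_QUERIES_EXECUTED, PRIO_KICKUPS, "
        "THREADS_STARTED FROM TP_THREAD_GROUP_STATS"))
  {
    fprintf(stderr, "query failed: %s\n", mysql_error(con));
    mysql_close(con);
    return 1;
  }
  MYSQL_RES *res= mysql_store_result(con);
  MYSQL_ROW row;
  while ((row= mysql_fetch_row(res)))
    printf("group %s: stalled=%s kickups=%s threads_started=%s\n",
           row[0], row[1], row[2], row[3]);
  mysql_free_result(res);
  mysql_close(con);
  return 0;
}

Sampling these counters periodically and looking at their rate of change gives a
better picture than a single snapshot.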

It might also be a good idea to track the current oldest waiting query to ensure that
we don't get longer waits than is acceptable. If the waits here become too long,
one can change some configuration variable, but it might also be an indicator
that the MySQL Server is constantly overloaded and that some action should be taken
to remedy this.

Tuesday, October 25, 2011

MySQL Thread Pool: Optimal configuration

The thread pool plugin has a number of configuration parameters that will affect
its performance. These are documented in the MySQL manual here.

To configure the thread pool for optimal operation the most important parameter is
the --thread_pool_size. This parameter specifies the number of thread groups that
the thread pool will create.

The default value of thread_pool_size=16 is very often a good starting point. We have
seen that for InnoDB Read Workloads it is sometimes possible to achieve even better
results when it is set to around 30-40. For write intensive InnoDB workloads the
optimum can be in the range 12-30. MyISAM workloads usually have an optimum a bit
lower in the range of 6-8. The default value of 16 will work well also for most
MyISAM workloads.

The next parameter to consider for optimal operation is --thread_pool_stall_limit.
This is set to 6 (=60ms) by default. The default is deliberately low to give good
operation in most cases. For workloads that don't have very many long queries
it is OK to set it much higher; setting it to 100 (=1 second) should be fine in most
cases.

In the information schema one can see how many queries are stalled, if there are too
many queries stalled, then it is a good idea to increase this parameter since stalled
queries lead to increased context switching activity and more threads to manage for the
operating system.

The next parameter, --thread_pool_prio_kickup_timer, is set rather high to 1000
(=1 second). This setting should be OK for most cases. In extremely loaded environments
where thousands of connections want to execute at the same time it's necessary to
increase this variable to ensure that queries aren't moved too early. At the same time,
setting it too high means that long-running transactions can block out short transactions
too much. Settings up to 10000 (=10 seconds) should in most cases be OK.

There is one parameter that isn't supported, --thread_pool_algorithm. This parameter
makes it possible to use a somewhat more aggressive scheduling algorithm in the thread pool.
In most cases this brings no benefit, although in some cases it achieves better results.
It has been left accessible in case someone wants to experiment with it and give us feedback
about it.

The last parameter is --thread_pool_max_unused_threads. This parameter specifies the
maximum number of unused threads the thread pool will keep around. It's possible to have
quite a few unused threads, and to ensure that we give memory back to the operating
system one can use this parameter. By default it's 0, which means that threads are never
released and are kept around for future use. Setting it to a nonzero value means that the
server will use less memory but can also contribute to higher CPU overhead for creating new
threads again later on.
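
To tie these parameters together, a configuration along the following lines could be
a reasonable starting point for an InnoDB workload with mostly short queries. The
values are illustrative rather than recommendations for any particular system, and
the plugin-load line assumes the plugin shared object is named thread_pool.so;
adjust to your installation.

[mysqld]
plugin-load=thread_pool.so
thread_pool_size=24
thread_pool_stall_limit=100         # 100 * 10ms = 1 second
thread_pool_prio_kickup_timer=1000  # milliseconds
thread_pool_max_unused_threads=40

As described above, the right thread_pool_size depends on the workload, so it is
worth benchmarking a few values around the default of 16.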

Monday, October 24, 2011

MySQL Thread Pool vs. Connection Pool

Given that threads and connections in the MySQL Server
have been so intertwined, it is easy to confuse the
purpose of the MySQL Thread Pool with the purpose of
a Connection Pool.

The aim of a Connection Pool is that the MySQL
clients should not be forced to constantly connect and
disconnect. Thus it is possible to cache a connection in
the MySQL client when a user of the connection no longer
needs it. Another user that needs a connection to the
same MySQL Server can then reuse this cached connection later on.

This saves execution time in both the client and the server.
It does however not change the dynamics of how many queries
are executed in parallel in the MySQL Server. This means that
the likelihood of too many concurrent queries executing in
the MySQL Server is the same with or without a Connection
Pool.

Also, a Connection Pool operates on the client side. This
means that it doesn't see the state of the MySQL Server when
deciding whether to send a query to the MySQL Server or not. Thus
it doesn't have the information required to decide whether to
queue a query or not. Only the MySQL Server has this information,
and thus the MySQL Thread Pool has to operate in the MySQL Server;
it cannot perform its task on the client side.

Thus it is easy to see that the MySQL Thread Pool and a
Connection Pool are orthogonal and can be used independently of
each other.

MySQL Thread Pool: When to use?

The most important variable to monitor is Threads_running. This
status variable keeps track of the number of concurrent statements
currently executing in the MySQL Server.

If this variable has spikes that put it in a region where the
server won't operate optimally (usually going beyond 40 for
InnoDB workloads), and most particularly if it goes well beyond
this into the hundreds or even thousands of concurrent
statements, then the thread pool will be beneficial
to protect the MySQL Server from trouble in overload
situations.

Another indicator that you will benefit from use of the thread
pool is if you already use the --innodb-thread-concurrency
variable today. This variable tries to solve a similar problem; the
thread pool solves it at a better place, even before query execution
has started, and also provides additional benefits.

Also, if your workload consists mainly of short queries then the thread
pool will be beneficial. Long queries aren't bad for the thread
pool but will decrease its positive impact.

Friday, October 21, 2011

MySQL Thread Pool: Limiting number of concurrent transactions

There are hot spots in the MySQL Server that become hotter when many
transactions are handled concurrently. This means that it is imperative
to avoid having too many concurrent transactions executing in parallel.

The thread pool handles this by prioritizing queued queries according
to whether they have already started executing a transaction or not.
It is also possible for the user to decide that a connection will be of
high priority independent of whether a transaction is started or not.

Such a prioritization can have issues with livelock if there are transactions
that are very long. To avoid this problem a query is moved to the high
priority queue after a configurable time has expired. This time is set by the
configuration parameter --thread_pool_prio_kickup_timer
(the number of milliseconds before a query is kicked up).

However to avoid too many movements in a short time, the thread pool will
at most move one query per 10 milliseconds per thread group.

It is possible for the user to define his connection as always being of
high priority to ensure that queries from that connection always move faster
through the server.
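
Conceptually, the queueing and kickup behaviour described in this post can be
sketched as follows. This is an illustrative model only, not the actual thread
pool source code, and all names are made up.

// Illustrative model only: decide which queue an incoming statement goes to,
// and move statements from the low priority queue after the kickup timeout.
#include <deque>
#include <cstdint>

struct QueuedStatement {
  bool     transaction_started;  // connection already has an open transaction
  bool     high_prio_connection; // user marked connection as high priority
  uint64_t queued_at_ms;         // time the statement was queued
};

struct ThreadGroupQueues {
  std::deque<QueuedStatement> high_prio;
  std::deque<QueuedStatement> low_prio;

  void enqueue(const QueuedStatement &stmt) {
    if (stmt.transaction_started || stmt.high_prio_connection)
      high_prio.push_back(stmt);
    else
      low_prio.push_back(stmt);
  }

  // Called periodically; moves at most one statement per call (the real
  // thread pool moves at most one query per 10 ms per thread group).
  void kickup_if_expired(uint64_t now_ms, uint64_t kickup_timer_ms) {
    if (!low_prio.empty() &&
        now_ms - low_prio.front().queued_at_ms >= kickup_timer_ms) {
      high_prio.push_back(low_prio.front());
      low_prio.pop_front();
    }
  }
};

Statements to execute are then always taken from the high priority queue first,
which is what gives already-started transactions precedence.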

Automated benchmark tool for DBT2, Sysbench and flexAsynch

A new benchmark tool is available here to enable automated
benchmark runs of DBT2, Sysbench and flexAsynch using MySQL
and MySQL Cluster.

The benchmark tool is based on dbt2-0.37 for DBT2, sysbench-0.4.12
for sysbench benchmarks and a flexAsynch program that is available in
MySQL Cluster source releases (the version needed for the automated
flexAsynch tests requires an updated version of flexAsynch.cpp which
hasn't been included in a MySQL Cluster release yet, a blog post
notifying when it arrives will be written).

The automation scripts are part of the dbt2-0.37.50.tar.gz package.
This package is needed to run all benchmarks. In addition a gzipped
source or binary tarball of MySQL or MySQL Cluster is required to
run the benchmarks. Finally to run sysbench benchmarks one also needs
to download the sysbench-0.4.12.5 tarball.

So assuming you have downloaded all those tarballs, how does one
run a sysbench benchmark on your local machine?

The first step is to create a benchmark directory, I usually use
$HOME/bench or /data1/bench. In this directory create a directory
tarballs. Place all three tarballs in this directory. Go into this
directory and unpack the dbt2 tarball through tar xfz dbt2-0.37.50.tar.gz.
Then copy the benchmark start script into the $HOME/bench directory
through the command:
cp $HOME/bench/tarballs/dbt2-0.37.50/scripts/bench_prepare.sh $HOME/bench/.

Then copy the example configuration file in the same manner using the
command:
cp $HOME/bench/tarballs/dbt2-0.37.50/examples/autobench.conf $HOME/bench/.

Edit the autobench.conf to be conformant to your file system environment.
The example configuration file assumes the use of /data1/bench as the
directory to use.

Now it is time to prepare a benchmark run. Create a directory under
$HOME/bench for the test run. For example, if you want to call it
test_sysbench, then run the command:
mkdir $HOME/bench/test_sysbench

Next step is to copy the autobench.conf file into this directory and
edit it.
cd $HOME/bench
cp autobench.conf test_sysbench/.

Now there are two ways to go about editing this configuration file. If
you want to go fast and unsafe, then go ahead and edit the file
directly; there is a fair amount of explanation of the various
parameters in this file. If you want more help, then read
dbt2-0.37.50/README-AUTOMATED to get more directions about how to
set up the configuration file properly.

Now everything is ready to run the benchmark; this is done through
the commands:
cd $HOME/bench
./bench_prepare.sh --default-directory $HOME/bench/test_sysbench

If you want to follow the development of the benchmark in real-time
you can do this by issuing tail -f on the proper file. For sysbench
RO benchmarks there will be a file called
$HOME/bench/test_sysbench/sysbench_results/oltp_complex_ro_1.res
for the first test run (you can tell the benchmark to do several
runs). Do tail -f on this while the benchmark is running and you'll
get printouts from the sysbench program written on your console.
Among other things you'll get a string like this:
Intermediate results: 128 threads, 3564 tps
if the current running test uses 128 concurrent connections. By
default the intermediate results are reported every 3 seconds.

The final result will be reported in the file
$HOME/bench/test_sysbench/final_result.txt

An additional note is that if you want to test the flexAsynch
tests, then it is necessary to use a source tarball of the
MySQL Cluster 7.x series. This is simply because the flexAsynch
program isn't distributed in binary tarballs.

The benchmark script will take care of the build process for all
source tarballs; all the important parameters you need to handle
are part of the autobench.conf file. You will however need to
install the proper compilers and build tools to enable builds of
the MySQL, sysbench and DBT2 programs.

If you want to benchmark a MySQL Server using the thread pool, then
it is necessary to download a MySQL Enterprise Edition of the MySQL
Server. If you already have a commercial license with Oracle, then
simply use this to download the MySQL binary tarball through
edelivery.oracle.com. If you don't have a commercial license, you
can use the Oracle Software Delivery Cloud Trial License Agreement,
which gives you a 30-day trial license. So to get the binary tarball
go to edelivery.oracle.com, register if necessary, log in, and accept
all required license agreements.

Next step is to select MySQL Database as the Product Pack and as
Platform select Linux x86-64. Finally download the TAR file for
generic Linux2.6 x86-64 platforms. When this download is completed
then unzip the file and you'll get the gzipped tarball you need
to run the thread pool benchmark.

The sysbench-0.4.12.5 version contains a few extra features compared to the
standard sysbench-0.4.12 version. It contains support for intermediate
result reporting, support for multiple tables in the sysbench
benchmark, support for partitioned tables, support for using
secondary indexes, support for using HANDLER statements instead
of SELECT statements, and also support for running sysbench at
fixed transaction rates with a certain jitter.

DBT2 can in addition to running with a single MySQL Server also
run with multiple MySQL Servers when used with MySQL Cluster.
It contains a few new features here to control partitioning,
possibility to place the ITEM table in each MySQL Server and
so forth.

All scripts and many programs have updated parameters and all
scripts have extensive help outputs to make it easy to understand
what they can do.

It is fairly easy to extend the benchmark scripts. As an example
if you need to change a parameter which isn't included in
autobench.conf then first add it in bench_prepare.sh and then
add code to handle it in start_ndb.sh. Also update the
autobench.conf example file if you want to keep the feature
for a longer time. If you want to suggest changes to the
scripts please report it in My Oracle Support (support.oracle.com)
or in bugs.mysql.com and assign it to Mikael Ronstrom.

MySQL Thread Pool: Limiting number of concurrent statement executions

The main task of the thread pool is to limit the number of
concurrent statement executions. The thread pool achieves
this by trying to always operate a thread group such that
only one or zero queries are concurrently executed per
thread group.

There is however a livelock issue to consider. A long-running
query in a thread group will in this manner block out all
other queries in this thread group until the query is completed.

To resolve this issue there is a configurable timer that
decides when a statement execution is declared as stalled. When
a query is declared as stalled, it is allowed to continue
executing until completed, but the thread group will handle the
connection as stalled and not count it as an active connection.
Thus new queries can be executed in the thread group again when a
query has been declared as stalled.

Another issue is when a statement execution is blocked for some
reason. Queries can be blocked e.g. by Row Locks, File IO, Table
Locks, Global Read Locks and so forth. If it is likely that the
blockage will continue for at least a millisecond or so, then it
makes sense to start up another statement execution in the thread
group to ensure that we continue to keep the number of concurrent
active connections at the right level.

To enable this the MySQL Server will make callbacks to the thread
pool stating when a block begins and when it ends. The thread
pool will use this to keep track of number of active statement
executions and this is used to decide when to start a new query
and when to allow an incoming query to start.

It is important that the wait is sufficiently long to be worth reporting,
since the query needs to continue executing immediately when the
blockage ends.

MySQL Thread Pool: Scalability solution

When implementing a thread pool or any other means of limiting concurrency in the MySQL Server, careful thought is required about how to divide the problem to ensure that we don't create any unnecessary new hot spots. It is very easy to make a design that manages all connections and threads in one pool. This design does however very quickly run into scalability issues due to the need to lock the common data structures every time a connection or thread needs to change its state.

To avoid this issue we decided to implement the thread pool using a set of thread groups. Each of those thread groups is independent of the other thread groups. Each thread group manages a set of connections and threads. It also handles a set of queues and other data structures required to implement the thread group operations. Each thread group will contain a minimum of one thread, and connections will be bound to a thread group at connect time using a simple round robin assignment. The thread pool aims to ensure that each thread group either has zero or one thread actively executing a statement. This means that the interactions between threads within one thread group are extremely limited. Also, the interactions won't grow as the MySQL Server gets more statements to process. Thus it is very hard to see this model becoming a scalability issue in itself.
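
To make the idea concrete, here is a simplified sketch of the two key points: round
robin assignment of connections to thread groups at connect time, and admitting a new
statement only when no other statement is active in the group. This is a model of the
behaviour, not the actual implementation, and the names are made up.

// Simplified model of the thread group concept described above.
#include <atomic>
#include <vector>

struct ThreadGroup {
  std::atomic<int> active_statements{0};   // aim: 0 or 1 per group

  // Returns true if the statement may start now, false if it must be queued.
  bool try_admit() {
    int expected = 0;
    return active_statements.compare_exchange_strong(expected, 1);
  }
  void statement_done() { active_statements.fetch_sub(1); }
};

class ThreadPoolModel {
public:
  explicit ThreadPoolModel(unsigned num_groups) : groups_(num_groups) {}

  // Round robin assignment at connect time; only this counter is shared
  // between groups, so cross-group interaction stays minimal.
  ThreadGroup &assign_connection() {
    unsigned idx = next_group_.fetch_add(1) % groups_.size();
    return groups_[idx];
  }

private:
  std::vector<ThreadGroup> groups_;
  std::atomic<unsigned> next_group_{0};
};

Note how the only state shared across groups in this model is the round robin
counter; everything else is per group, which is the point of the design.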

So we solved the scalability problem using a Divide-and-Conquer technique.

Tuesday, October 18, 2011

MySQL Thread Pool: Problem Definition

A new thread pool plugin is now a part of the MySQL Enterprise Edition.
In this blog we will cover the problem that the thread pool is solving
and give a high-level description of how it solves this problem.

In the traditional MySQL server model there is a one-to-one mapping between
thread and connection. Indeed, the MySQL server has lots of code where thread,
or some abbreviation of thread, actually represents a connection.
Obviously this mapping has served MySQL very well over the years, but there
are some cases where this model doesn't work so well.

One such case is where there are many more connections executing queries
simultaneously than the number of CPUs available in the server. The
MySQL Server also has scalability bottlenecks where performance suffers
when too many connections execute in parallel.

So effectively there are two reasons that can make performance suffer in
the original MySQL Server model.

The first is that many connections executing in parallel means that the
amount of data that the CPUs work on increases. This will decrease the
CPU cache hit rate, and lowering the CPU cache hit rate can have a significant
negative impact on server performance. Actually, in some cases the amount
of memory allocated by the connections executing in parallel can at times
even exceed the memory available in the server. In this case we enter a
state called swapping which is very detrimental to performance.

The second problem is that the number of parallel queries and transactions
can have a negative impact on throughput through the "critical sections"
of the MySQL Server (a critical section is where mutexes are applied to
ensure only one CPU changes a certain data structure at a time; when such
a critical section becomes a scalability problem we call it a hot spot).
Statements that write are more affected since they use more critical
sections.

Neither of those problems can be solved in the operating system scheduler.
However, some operating systems have attempted to solve this
problem for generic applications at a higher level in the operating system.

Both of those problems have the impact that performance suffers more and
more as the number of statements executed in parallel increases.

In addition there are hot spots where the mutex is held for a longer time
when many concurrent statements and/or transactions are executed in
parallel. One such example is the transaction list in InnoDB where each
transaction is listed in a linked list. Thus when the number of concurrent
transactions increases the time to scan the list increases and the time
holding the lock increases and thus the hot spot becomes even hotter
as the concurrency increases.

A current solution to these issues exists in InnoDB through the use of the
configuration parameter --innodb-thread-concurrency. When this parameter
is set to a nonzero value, it indicates how many threads are
able to run through InnoDB code concurrently. This solution has its
use cases where it works well. It does however have the drawback that
the solution itself contains a hot spot that limits the MySQL server
scalability. It also does not contain any solution for limiting the
number of concurrent transactions.

In a previous alpha version of the MySQL Server (MySQL 6.0) a thread
pool was developed. This thread pool solved the problem with limiting
the number of concurrent threads executing. It did nothing to solve
the problem with limiting the number of concurrent transactions.
It was also a scalability bottleneck in itself. Finally it didn't
solve all issues regarding long queries and blocked queries.
This made it possible for the MySQL Server to become completely
blocked.

When developing the thread pool extension now available in the MySQL
Enterprise Edition we decided to start from a clean slate with the
following requirements:

1) Limit the number of concurrently executing statements to ensure
that each statement execution has sufficient CPU and memory resources
to fulfill its task.

2) Split threads and connections into thread groups that are
independently managed. This is to ensure that the thread pool
plugin itself doesn't become a scalability bottleneck. The
aim is that each thread group has one or zero active threads
at any point in time.

3) Limit the number of concurrently executing transactions
through prioritizing queued connections dependent on if
they have started a transaction or not.

4) Avoid deadlocks when a statement execution becomes long or
when the statement is blocked for some reason for an extended
time.

If you are interested in knowing more details of how the new
thread pool solves these requirements there will be a
webinar on Thursday 20 Oct 2011 at 9.00 PDT. Check here
for details on how to access it.

If you want to try out the thread pool go here.

Thursday, May 26, 2011

Better than linear scaling is possible

As part of my research for my Ph.D. thesis, I spent a lot of time
understanding the impact of CPU caches on the performance of a DBMS.
I concluded that in a parallel data server it is actually possible
to get better than linear scaling in certain workloads.

Consider a benchmark executing on 2 machines, each with 8 cores
sharing a 2 MByte cache, for a total of 4 MByte of CPU cache.
Assuming that the benchmark executes with a data set of 4 GByte, then
about 0.1% of the data fits in the CPU cache. As the number of machines grows,
the available CPU cache also grows; this means that when we have
32 machines, we have 64 MByte of cache available and can
now store about 1.6% of the data set in the CPU cache.

For benchmarks one usually scales the data set size when increasing
the number of nodes in the system. This is however not necessarily true in
real-life applications. For a real-life application the working set is
more or less constant; it can grow over time as more customers join the service
or for other reasons, but it doesn't grow just because you grow the number
of machines in the database cluster.

It's very well known that there are many things that drive sublinear scaling,
the most important of which is the extra cost of communication in a larger
cluster. The number of communication lanes in a fully connected cluster is
n * (n - 1) / 2. This means that the number of communication lanes grows with
the square of the number of machines, O(n^2). The communication only
increases linearly with the number of machines, which means that each lane gets
linearly fewer bytes to communicate in a larger cluster. Given that
communication cost is fixed_cost + #bytes * cost_per_byte, this means
that the cost per byte sent will increase in a larger cluster, since there
will be smaller packets and thus fewer bytes over which to amortise the fixed cost.

The above is one reason why sharding is a good idea: it means that we
partition the problem, so we only use a subset of the communication lanes
and thus avoid the increased cost of communication as the number of
machines grows. Obviously sharding also imposes limitations on the type of
queries you can handle efficiently.

Now to some specific facts about MySQL Cluster and why we can obtain
better than linear scaling here (reported in earlier blogs here and here).
For reads here we got 1.13M on 8 nodes, 2.13M on 16 nodes and 4.33M
reads on 32 nodes. For updates we got 687k on 4 nodes, 987k on 8 nodes
and finally 2.46M on 16 nodes. All the data in this benchmark was also
replicated.

The data nodes in MySQL Cluster use an architecture where we have up to
4 threads that handle the local database work. These 4 threads handle
their own partitions. Next we have one thread that handles the transaction
coordinator role, and one thread that takes care of the receive
part of the communication. Finally we have a set of threads taking care
of file system communication. What this effectively means is that as we
grow the cluster size and the cost of communication grows, each data node
will consume more CPU power; however, the architecture of MySQL Cluster
is such that this extra CPU power is spent in its own
CPU cores. Thus we simply use a bit more of the CPU cores for
communication when the cluster size grows.

The benefit of this approach is that it is easy to scale the number
of CPU cores used for communication. Given that modern machines often
come with quite a high number of CPU cores, this means that as machines
get beefier, we can actually deliver better than linear scaling of
the workload one can achieve by growing the number of data nodes in
MySQL Cluster.

In MySQL Cluster each execution thread has its own scheduler.
This scheduler becomes more and more efficient as load grows, for
two reasons. The first is that as the load grows, the queue is
longer and thus we need to refill the queue fewer times.
This means that we spend more time executing the same code over
and over again, so the instruction cache for that
code will be very hot and we will train the branch predictor
subsystem in the CPUs very well. We get this benefit both in
the code refilling the queue and in the code executing the actual
database workload. When the load is high we also avoid
running code that checks for messages when there are no messages
around. Thus as load increases the efficiency increases and
the actual number of instructions to execute per message
decreases.

When I presented this theory at the defence of my
Ph.D. thesis it was only a theory. In the real world it's
very uncommon to see the effect of CPU caches and other effects
being greater than the added burden of a larger cluster. However,
I have seen it twice in my career. The first was a benchmark
performed in 2002 on a very large computer where we hosted 32 nodes
(single CPU nodes in those days) and 23 benchmark applications.
Here we scaled from 0.5 million to 1.5 million going from
16 to 32 nodes. Now also in the results presented at the
MySQL Users Conference and in my previous blogs we achieved better
than linear scaling, in particular for the write benchmark, but also
to some extent for the read benchmarks. I am sure the above isn't the
entire explanation of these effects, but the things
explained above certainly play a role in it.

Tuesday, April 12, 2011

MySQL Cluster: Designed for high-scale SQL and NoSQL web applications

As shown in a number of blogs, the MySQL Cluster SW already uses the type of features found in many NoSQL products. It has an extremely efficient API through which it is possible to shoot millions of reads and writes towards a Cluster per second. It contains partitions of its data similar to shards in NoSQL and supports both high availability of those partitions and also repartitioning of the data when new nodes are added to the Cluster. Advanced replication solutions, providing replication both inside a Cluster and between Clusters, make it possible to use MySQL Cluster in a very large number of replication configurations, even scaling across multiple global data centers.

Finally MySQL Cluster makes it possible for you to choose to keep your data in relational tables while still using NoSQL-like APIs, supporting on-line changes of partitioning and also adding new fields to tables while still reading and writing data in the tables. Using MySQL Cluster you can use the MySQL APIs, the NDB API, Cluster/J, JPA and the LDAP API, and even more APIs are being worked on and will soon be announced.

Most web data sees heavy use for generation of web pages, where the use is mostly simple queries, but very many of them. Most of the web data also requires analysis to make intelligent business decisions based on the web-generated data. A prototype of parallel query for MySQL Cluster was displayed at the MySQL Users Conference 2010. Tools such as this will also make it possible to analyse data efficiently in MySQL Cluster. Thus MySQL Cluster is a very efficient tool for working with many sorts of web data while retaining ACID compliance and a rich set of tools, expertise and best practices.

MySQL Cluster running 2.46M updates per second!

In a previous blog post we showed how MySQL Cluster achieved 6.82M reads per second. This is a high number. However, it is also very interesting to see how efficient MySQL Cluster is at executing updating transactions. We were able to push through the 1M transactions per second wall, then past 2M transactions per second, and all the way up to 2.46M transactions per second.

Monday, April 11, 2011

MySQL Cluster doing 6.82M reads per second

We ran a number of tests to see how many reads per second we could get from MySQL Cluster. We used a modified version of flexAsynch (as shown in previous blog), where each record read was 100 bytes in size.

With a cluster of 4 data nodes operating on 2 machines we were able to process 1.15M reads per second. On a cluster consisting of 8 data nodes executing on 4 machines we were able to process 2.13M reads per second. On a 16-data-node cluster with 8 machines used for data nodes, we were able to process 4.33M reads per second, and finally with a cluster of 32 data nodes distributed over 16 machines we executed 6.82M reads per second. The tests were run on MySQL Cluster 7.1; we're confident that similar numbers can be achieved with MySQL Cluster 7.0 and also with the new beta version MySQL Cluster 7.2.

This benchmark will give you a good idea of what can be achieved with direct usage of the NDB API, and with other APIs like Cluster/J, mod-ndb and NDB-memcached.

MySQL Cluster Benchmark

We had the opportunity to use a fair number of machines to run a benchmark to see what throughput MySQL Cluster can achieve on somewhat bigger clusters. The benchmark we use is one we developed for internal testing many years ago; it shows very well the performance aspects of MySQL Cluster discussed in some previous blogs of mine.

The benchmark is called flexAsynch. It's part of an internal series of benchmarks we call the flex series of benchmarks. Its first member was flexBench, which consisted of the following simple set of operations. First create the table with the set of attributes and the size of the attributes as specified by the startup options. Next, create a set of threads as specified by the startup options. Then each thread executes a number of transactions, the number of which is configurable, and each transaction can also run one or more operations as configured (one operation is either an insert of one record, an update of one record, a read of one record or a delete of one record). The flexBench benchmark always starts by doing a set of inserts, then reading those, updating each record, reading it again and finally deleting all records. The flexBench benchmark also includes a verify phase so that we could check that the cluster actually read and updated the records as it should.

The flexAsynch benchmark was a further development of this benchmark. flexBench uses the synchronous NDB API, where each transaction is sent and executed per thread. This means that we can have as many outstanding transactions to the cluster as we have threads. flexAsynch uses the asynchronous NDB API; this API provides the possibility to define multiple transactions and send and execute them all at once. This means that we can get tremendous parallelism in the application using this API. Due to the manner in which MySQL Cluster is designed, it is actually no more expensive to update 10 records in 10 different transactions than to update 10 records in 1 transaction using this API. Jonas Oreland showed in his blog post how one API process using this API can handle 1 million operations per second.

The main limitation to how many operations can be executed per second is the processing in the data nodes of MySQL Cluster for this benchmark. Thus we wanted to see how well the cluster scales for this benchmark as we add more and more data nodes.

A data node in MySQL Cluster operates best when threads are locked to CPUs as shown in a previous blog of mine. Currently the main threads that operate in a data node are the threads handling local database operations (there are up to four of those threads), the thread doing the transaction coordination and finally the thread handling receipt of messages on sockets connected to other data nodes or API nodes. Thus to achieve the best operation one needs at least 6 CPUs to execute a data node. Personally I often configure 8 CPUs to allow the other threads to perform their work without inhibiting query performance. Other threads handle replication, file system interaction and cluster control.

At our disposal when running this benchmark we had access to machines with dual Intel Xeon 5670 @ 2.93 GHz. This means 12 CPUs per socket. One thing to consider when running a benchmark like this is that the networking is an important part of the infrastructure. We had access to an Infiniband network here and used IP-over-Infiniband as the communication medium. It's most likely even better to use the Sockets Direct Protocol (SDP), but we had limited time to set things up and the bandwidth of IPoIB was quite sufficient. This made it possible to have more than one data node per machine.

In order to run flexAsynch on bigger clusters we also needed to handle multiple instances of flexAsynch running in parallel. To handle this I changed flexAsynch a little bit to enable one process to only create a table or only delete a table. I also made it possible to run flexAsynch doing only inserts, only reads or only updates. To make it easier to get proper numbers I used a set of timers for the read and update benchmarks. The first timer specifies the warmup time; operations are executed but not counted since we're still in the phase where multiple APIs are starting up. The next timer specifies the actual time to execute the benchmark, and finally a third timer specifies the cooldown time where transactions are again run but not counted, since not all APIs start and stop at exactly the same time. In this manner we get accurate numbers for read and update operations. For inserts we don't use timers and thus the insert numbers are less accurate.

The results of those benchmarks will be posted in blogs coming out soon.

MySQL Cluster and Sharding

Sharding is here defined as the ability to partition the data into partitions defined by a condition on a set of fields. This ability is central to the workings of MySQL Cluster. Within a Cluster we automatically partition the tables into fragments (shards in the internet world). By default there is a fixed amount of fragments per node. As mentioned we also use replication inside a Cluster, the replication happens per fragment. We define the number of replicas we want in the Cluster and then the MySQL Cluster SW maintains this number of fragment replicas per fragment. These fragment replicas are all kept in synch. Thus for MySQL Cluster the sharding is automatic and happens inside the Cluster even using commodity hardware.

One of the defining features of MySQL Cluster is to keep the fragments up and running at all times and to restore them after a Cluster crash. However, MySQL Cluster also supports adding nodes to the Cluster while it is operational; this means that we can add nodes to a running Cluster and repartition the tables during normal operation. This is part of the normal MySQL Cluster and is used by many users and customers to increase the size of their Clusters in production.

MySQL Cluster API, the NDB API

As mentioned in a previous blog, the programming API on the client side is a very important part of the performance of MySQL Cluster. Every API that is used to access the Data Server in MySQL Cluster uses the NDB API. The NDB API is used in the NDB storage handler to make it possible to access data residing in MySQL Cluster from the MySQL APIs.

The base of the good performance of the programming API is the ability to batch operations in various manners. In early MySQL Cluster history the MySQL Storage Engine API had very few interfaces that allowed for handling multiple records at a time. As we progressed, the Storage Engine API has added several interfaces that can handle multiple records at a time. There is even some development work, presented at the UC 2010, where the Storage Engine API can now push entire queries down to the storage engine, even join queries. This has also been presented at a recent webinar with engineers.

The NDB API uses a model where one first defines the operation to issue towards the database. The calls to build an operation don't interact with the actual database. The actual message is sent to the data node only after the execute method has been called. The NDB API is designed to handle batching of operations at two levels. The first level is that it is possible to batch inside one thread. This means that one can open several transactions in parallel within the same thread and execute them in parallel with one execute call. In addition it is also possible to have several threads working in parallel, and every one of those threads can also be executing multiple transactions in parallel.
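
For illustration, here is a rough sketch of batching several transactions in one thread with the asynchronous NDB API. It assumes a table t1 with an integer primary key pk and an integer column val, and an Ndb object already created and initialised elsewhere with enough parallel transactions; error handling is omitted and the exact calls should be checked against the NDB API documentation for your version.

#include <NdbApi.hpp>

static void transaction_done(int result, NdbTransaction *trans, void *)
{
  /* Called from sendPollNdb() when a transaction has completed. */
  (void)result;
  (void)trans;
}

void read_batch(Ndb *ndb, const NdbDictionary::Table *tab,
                int first_key, int batch_size)
{
  NdbTransaction *transactions[16];      /* assume batch_size <= 16 here */
  for (int i = 0; i < batch_size; i++)
  {
    NdbTransaction *trans = ndb->startTransaction();
    NdbOperation *op = trans->getNdbOperation(tab);
    op->readTuple(NdbOperation::LM_Read);
    op->equal("pk", first_key + i);
    op->getValue("val", NULL);
    /* Define and prepare the transaction, but do not send it yet. */
    trans->executeAsynchPrepare(NdbTransaction::Commit,
                                transaction_done, NULL);
    transactions[i] = trans;
  }
  /* Send all prepared transactions in as few packets as possible and
     wait for their completion. */
  ndb->sendPollNdb();
  for (int i = 0; i < batch_size; i++)
    ndb->closeTransaction(transactions[i]);
}

The point of the sketch is the separation between defining the transactions and sending them: all the prepared transactions travel to the data nodes together, which is what makes the batching so cheap.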

So the possibilities for parallelism using the NDB API are tremendous. Much of the cost of accessing a database is paid in the networking, so by using parallel transactions inside a thread (the asynchronous NDB API) and by using the multithreaded capabilities of the NDB API, it is possible to decrease the networking cost greatly by making TCP/IP packets larger. Roughly, the cost of sending a TCP/IP packet is Fixed_cost + #Bytes * Byte_cost. The fixed cost was in the past about the same as the cost of sending 60 bytes. This extra cost of small messages has to be paid both in the server part and in the client part. Thus it pays off very well to send larger messages. When the message size goes towards 1 kByte, the extra cost is down in the range of 6-7% compared to infinitely large messages, whereas a 200 byte message has an additional 30% added cost.

An additional benefit of batching is that there will be less context switches since handling of several messages in parallel can be handled without context switches.

You can learn more about performance optimization of your own applications by reading this whitepaper.

Saturday, April 09, 2011

MySQL Cluster performance aspects

MySQL Cluster was designed for high performance from the very beginning. To achieve high performance one has to understand many aspects of computing. As an example, the protocol is very important. In the original design work in 1994 we had a master thesis student build a prototype using a protocol which was based on BER encoding and other standard parts of many telecom protocols. After seeing the code in this prototype, which was several thousand lines of code just to handle the protocol, I realised that this type of protocol would simply cost too much on both the client side and the server side. This kind of prototype in early design work is extremely useful, since it would have been very difficult to change the protocol once we had started down the path of developing the Data Server.

Based on this work we instead opted for a protocol where almost everything was of fixed size and entirely based on sending 32-bit words. We didn't want a protocol which transferred bytes, to avoid the extra computational complexity this would require. The NDB protocol which is used for query processing uses a message called TCKEYREQ. This message has about 10 32-bit words describing various fixed parameters such as TableId, ConnectionId, PartitionId and so forth. There is also a 32-bit word that contains a set of bits that is used to interpret the message. Reading this protocol can actually be done while completely avoiding branches, since the bits can be used to address the proper words in the protocol message through some arithmetic. The only branching needed happens when taking care of keys and the actual query information, which is of variable size.

The next important component of performance is the execution model. The MySQL Cluster data nodes use an execution model which is extremely well suited to modern CPUs. The data nodes use a set of threads, where each thread implements its own little OS with a scheduler. All communication inside the data nodes is based on messages. From a SW point of view the code to receive internal messages is exactly the same as the handling of messages arriving over the network. When sending a message it's the address which defines the type of message. The address contains three parts: the node id, the thread id and the module id (block number in the code). If the message is sent to a module with the same node id and thread id as the sending thread, then the message is an internal message and it will be sent by putting the message in the local message buffer. If the node id is the same but the thread id differs, then the message will be sent to another thread. The communication between threads is extremely efficient, based on shared memory communication, and this code uses the most efficient ways to communicate given the HW and the OS. Finally, when the node id differs, the message is sent as a network packet over to another data node or client node. There is a TCP/IP link between all nodes (a fully connected mesh) and the data node will use mechanisms to ensure that the packets sent contain as many messages as possible without sacrificing latency (the user can affect the acceptable latency through a config parameter).
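
As a conceptual sketch of the addressing just described (not the actual block and signal code in the data nodes), the routing decision can be thought of like this; the helper functions are hypothetical stand-ins for the real transport layers.

// Conceptual sketch of how a message (signal) is routed by its address.
struct Address {
  unsigned node_id;
  unsigned thread_id;
  unsigned block_no;   // receiving module (block) within the thread
};

struct Message { Address dest; /* payload omitted */ };

// Hypothetical helpers standing in for the real transport layers.
void append_to_local_buffer(const Message &m);    // same node, same thread
void queue_to_other_thread(const Message &m);     // shared memory, same node
void queue_for_network_send(const Message &m);    // TCP/IP to another node

void send_message(const Message &m, unsigned own_node, unsigned own_thread)
{
  if (m.dest.node_id == own_node) {
    if (m.dest.thread_id == own_thread)
      append_to_local_buffer(m);     // executed later by the same scheduler
    else
      queue_to_other_thread(m);      // inter-thread, shared memory
  } else {
    queue_for_network_send(m);       // batched into as large packets as the
                                     // configured latency limit allows
  }
}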

Given this model, a thread can be actively executing thousands of queries without any need for context switches. This is one reason why MySQL Cluster benefits greatly when threads are locked to certain CPU cores and there is no contention from other programs for these CPU cores. The data nodes have their own local OS and thus work extremely efficiently when the OS scheduler stays out of the way.

This particular model of execution, where each thread of execution executes until it decides to send a message (the unit of execution is always the execution of a message), was very popular in the 70s because of its efficiency. It was replaced by the time-sharing model given the simplicity of the time-sharing model. When designing MySQL Cluster we decided that a Data Server meant to handle millions of queries per second has higher requirements on the efficiency of execution than on the simplicity of the design. Another great benefit of this execution model is that as the load on the Data Server increases, the throughput also grows. This is because the execution threads will execute for a longer time before they look at the sockets for incoming traffic; this means that more messages are gathered every time and thus the cost of each message byte decreases. The same happens with sending messages: as the number of messages to execute per round grows, more data is sent on each send call, thus decreasing the cost of each sent message byte.

The design is extremely modular even though it uses a more complex execution model. Each module can only communicate with other modules using messages and the modules share no data. Thus if an error occurs in a module it's either due to bugs in this module or due to bad input data to the module. To debug the data node we trace every important branch and every message executed along with its data. This means that if a crash occurs we have very detailed information about how the crash occurred, including the last thousand or so branches taken in the code and a few thousand of the last messages executed in the data node.

The final aspect of performance is the actual implementation of the database algorithms. To cover this in one blog message is obviously not possible, but it includes an efficient design of data structures (we implement a hash based index and an ordered index), an efficient implementation of the actual record storage with an efficient data structure to contain the record (including capabilities to handle variable sized data, handling NULLable fields in a storage efficient manner and even being able to add fields to a record through dynamic fields which are NULL when not present in the record). It includes an efficient model for recovery and finally an efficient model for transaction handling. In all of those aspects MySQL Cluster has added innovation to the world of databases with a particular focus on the performance aspects.

There is actually one more important part of the performance of MySQL Cluster and this is the programming API on the client side. I will discuss this in my next blog.

Friday, April 08, 2011

MySQL Cluster - NoSQL access with some SQL

As someone noted in a blog, the NDB API is a NoSQL API that was designed 15 years ago. When I wrote my Ph.D. thesis (which is the design document that NDB Cluster is based on) I called it Design and Modelling of a Parallel Data Server for Telecom Applications. The important name I used here is Data Server. It was never intended as a pure SQL DBMS. It was always intended for any needs of data storage. The requirements on this Data Server were also written up in my thesis for those who care to read it and included HLRs (the telecom database used to keep track of your mobile phone), News-on-Demand, Multimedia Email, Event Data Services (logging of events in telco and web applications, used for charging, billing and understanding the customers) and a genealogy application.

MySQL Cluster has been very successful in the telecom space and chances are very high that a MySQL Cluster solution is used whenever you place a mobile phone call. Also many ISPs use MySQL Cluster to handle DNS lookups, authentication and many other internet services. As an example, the ISP I use every day and through which I post this blog message is using MySQL Cluster for this type of service. So I invoke services of MySQL Cluster every time I access the web from my home. In addition, we have seen MySQL Cluster adopted in eCommerce, session management, content delivery, user profile management and on-line gaming applications.

MySQL Cluster was from the very start designed to handle many other applications as well in the web space. Today the internet environment contains quite a few different APIs to use for handling web data. MySQL Cluster already has a plethora of different APIs that can be used to access the basic Data Server. MySQL Cluster can be used with every possible API that can be used to access a MySQL Server. In addition we have the Cluster/J API, which is a low-level Java API with similar characteristics to the NDB API. Based on the Cluster/J API we have a standard JPA interface to MySQL Cluster. We even have an LDAP interface, which means that the same data can be accessed through LDAP, SQL, Cluster/J, JPA, the NDB API and many other interfaces based upon these, of which I am sure I don't know every one. Another interesting interface is mod-ndb, which makes it possible to query MySQL Cluster using a REST API and get results in JSON.

We are however not satisfied with the set of APIs we have towards MySQL Cluster, so we'll be adding even more as we go to make the Data Server capabilities available to you from even more surroundings; these will include additional APIs commonly used in the web space. Stay tuned for Tomas Ulin's keynote at the UC and Collaborate next week.