Wednesday, December 03, 2008

LOCK_open, THE mutex :)

In all my years of working at MySQL the
LOCK_open mutex has always been a
key mutex to understand. Now that I'm
working on scalability improvements of
the server it has become just as important
to change this mutex into something less contentious.

So last week I finally decided to start
thinking about how we can resolve this
mutex, which sits at the heart of the MySQL
Server. In principle, LOCK_open protects a
hash table of all the open tables in
the MySQL Server. However, it has been
used for many other purposes as well,
so it's not trivial to work around it.

However the main scalability problem
with LOCK_open is the hash lock it
provides. So what to do about it?

My current thinking is that a four-pronged
approach will do the trick (a rough code sketch
of these ideas follows the list below).

1) Divide and conquer: perform the hash calculation
outside of the mutex and divide the hash into
e.g. 16 smaller hashes. This creates one problem,
which is how to prune the open table cache.
Obviously there is no longer a single linked
list where I can find the oldest entry. This
problem I'm still contemplating; there are
probably already a number of known good solutions,
since I find it popping up in almost every similar
design. So it's a problem looking for a solution
pattern.

2) Shrink the amount of data it protects
by only allowing it to protect the hash table
and nothing more. This means e.g. that some
counters need to be updated with atomic
instructions instead.

3) Shrink the time it is protected by inserting
the table share into the hash rather than the
table object (this is actually Monty's idea).

4) Use a different technique for the lock that
works better for short-term locks (usually
spinlocks are more successful here).
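
As a rough illustration of ideas 1, 2 and 4, here is a minimal
sketch in C of how the hash could be split into 16 independently
locked parts, with the hash value computed before any lock is taken.
This is not actual server code; the names (tdc_part,
open_table_hash_search and so on) are invented for the example and
lock initialisation is omitted:

#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define TDC_PARTS 16                /* number of smaller hashes (idea 1) */

struct open_table_hash;             /* opaque per-part hash, details omitted */

typedef struct tdc_part
{
  pthread_mutex_t lock;             /* protects only this part's hash (idea 2) */
  struct open_table_hash *hash;
} TDC_PART;

static TDC_PART tdc_parts[TDC_PARTS]; /* lock initialisation omitted here */

/* Hypothetical lookup within one part, provided elsewhere */
extern void *open_table_hash_search(struct open_table_hash *hash,
                                    const char *key, size_t key_len);

/* The hash value is computed outside of any mutex */
static inline TDC_PART *tdc_get_part(uint32_t hash_value)
{
  return &tdc_parts[hash_value % TDC_PARTS];
}

static void *tdc_find_table(const char *key, size_t key_len,
                            uint32_t hash_value)
{
  TDC_PART *part= tdc_get_part(hash_value);
  void *table;

  pthread_mutex_lock(&part->lock);  /* short critical section; a spinlock
                                       could be used here instead (idea 4) */
  table= open_table_hash_search(part->hash, key, key_len);
  pthread_mutex_unlock(&part->lock);
  return table;
}

Counters that used to be updated under LOCK_open would instead use
atomic instructions such as GCC's __sync_add_and_fetch, as mentioned
in idea 2. How to prune the cache across 16 parts (idea 1) is left
open here, just as in the text above.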

A combination of these techniques will hopefully
make it possible to decrease the impact of
LOCK_open on the server code.

Recovery features for ALTER TABLE of partitioned tables

A feature of the partitioning implementation that hasn't received
much publicity is the support for atomicity of many ALTER TABLE
statements on partitioned tables.

This atomicity exists for
ALTER TABLE ADD PARTITION ....
ALTER TABLE REORGANIZE PARTITION ...
ALTER TABLE DROP PARTITION ...
ALTER TABLE COALESCE PARTITION

Given that partitioning often works with very large tables it
was desirable to have a higher level of safety for ALTER TABLE
of partitioned tables. To support this a DDL log was implemented.
This DDL log will in future versions also be used for many other
metadata statements. The DDL log records all files added,
renamed and dropped during an ALTER TABLE command as above.

The design is done in such a way that the ALTER TABLE will either
fail, in which case all temporary files are removed (even in the
presence of crashes of the MySQL Server), or succeed, even if not
all old files had been removed at the time of a crash. The DDL log
is checked at restart of the MySQL Server and all necessary changes
are redone or undone to complete the ALTER TABLE statement.
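
To make the recovery idea concrete, here is a purely hypothetical
sketch of what the pass over the DDL log at restart could look like.
The entry types and names are invented for illustration and do not
reflect the actual server implementation:

#include <stdio.h>

/* Hypothetical DDL log entry; the real on-disk format differs. */
typedef enum { DDL_DELETE_FILE, DDL_RENAME_FILE } ddl_action;

typedef struct ddl_log_entry
{
  ddl_action action;
  char from_name[512];
  char to_name[512];
} DDL_LOG_ENTRY;

/*
  Called at server restart: redo/undo the file operations recorded in
  the log so that the interrupted ALTER TABLE either completes or is
  rolled back. The file operations are idempotent, so replaying
  entries that already took effect before the crash is harmless.
*/
static void ddl_log_recovery_sketch(DDL_LOG_ENTRY *entries,
                                    unsigned num_entries)
{
  unsigned i;
  for (i= 0; i < num_entries; i++)
  {
    DDL_LOG_ENTRY *e= &entries[i];
    switch (e->action)
    {
    case DDL_DELETE_FILE:
      (void) remove(e->from_name);             /* removing twice is harmless */
      break;
    case DDL_RENAME_FILE:
      (void) rename(e->from_name, e->to_name); /* an already-done rename
                                                  simply fails quietly */
      break;
    }
  }
}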

Given that MySQL Server crashes aren't likely to happen very often
in customer environments it was also desirable to add error
injection to the MySQL Server for testing purposes.

Here is a short excerpt from the file sql_partition.cc that shows
what happens here:

  if (write_log_drop_shadow_frm(lpt) ||
      ERROR_INJECT_CRASH("crash_drop_partition_1") ||
      mysql_write_frm(lpt, WFRM_WRITE_SHADOW) ||
      ERROR_INJECT_CRASH("crash_drop_partition_2") ||
      write_log_drop_partition(lpt) ||
      ERROR_INJECT_CRASH("crash_drop_partition_3") ||

At each ERROR_INJECT_CRASH point it is possible to prepare the
MySQL Server so that it will crash at exactly this point in
the next statement, using dbug keywords that nowadays can
also be activated through SQL statements.
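
For illustration, such a macro could be defined along the following
lines. This is a hedged sketch of the pattern using the standard
DBUG facility; the actual definition in the server source may differ:

/*
  Sketch of an ERROR_INJECT_CRASH-style macro: if the named dbug
  keyword has been activated for the session, the server aborts right
  here; otherwise the macro evaluates to 0 and the if-chain above
  simply continues with the next step.
*/
#if !defined(DBUG_OFF)
#define ERROR_INJECT_CRASH(code) \
  DBUG_EVALUATE_IF(code, (abort(), 0), 0)
#else
#define ERROR_INJECT_CRASH(code) 0
#endif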

So here one can see that we first log preparatory
actions, insert a test point, continue with the
next step of ALTER TABLE, insert a new test point,
write the next log entry, insert new test point,
and so forth.

With this recovery mechanism these ALTER TABLE
statements should leave the partitioned table in a
consistent state even if the server crashes in the
middle of the ALTER TABLE statement.

Tuesday, November 25, 2008

Impressive numbers of Next Gen MySQL Cluster

I had a very interesting conversation on the phone with Jonas
Oreland today (he also blogged about it on his blog at
http://jonasoreland.blogspot.com).

There are a lot of interesting features coming up in MySQL Cluster
version 6.4. Online Add Node is one of those, which can be done
without any downtime and even with almost no additional memory
needed other than the memory in the new machines added into the
cluster. This is a feature I started thinking about almost 10 years
ago, so it's nice to see the fourth version of the solution actually
be implemented. It's a really neat solution to the problem,
definitely fitting the word innovative.

The next interesting feature is to use a more efficient protocol
for handling large operations towards the data nodes. This makes it
use fewer bits on the wire, but more importantly it saves a number
of copy stages internally in the NDB data nodes. So this has a
dramatic effect on performance of reads and writes of large records.
It doubles the throughput for large records.

In addition the new 6.4 version also adds multithreading to the
data nodes. Previously a data node was a very efficient single
thread which handled all the code blocks and also the send and
receive handling. In the new 6.4 version the data nodes are split
into at least 4 threads for database handling, one thread for send
and receive and the usual assistance threads for file writes and
so forth. This means that a data node will fit nicely into an 8-core
server, since 1-2 CPUs are also required for interrupt handling and
other operating system activity.

Jonas benchmarked using a benchmark from our popular flex series
of benchmarks. It started with flexBench, which I developed more
than 10 years ago; it has been followed by flexAsynch, flexTT and a
lot more variants of the same type. It can vary the number of
threads, the size of the records, the number of operations per
batch per thread and a number of other things. flexAsynch is
really good at generating extremely high loads on the database
without doing anything useful itself :)

So what Jonas demonstrated today was a flexAsynch run where he
managed to do more than 1 million reads per second using only
one data node. MySQL Cluster is a clustered system so you can
guess what happens when we have 16, 32 or 48 of those nodes
tied together. It will do many tens of millions of reads per
second. An interesting take on this is an article in
Datateknik 3.0 (a magazine no longer around) where I was
discussing how we had reached, or were about to reach, 1 million
reads per second. I think this was sometime in 2001 or 2002.
I was asked where we were going next and I said that
100 million reads per second was the next goal. We're
actually within range of achieving this now, since I also
have a patch lying around which can increase the number of
data nodes in the cluster to 128, whereby, with good
scalability, 100 million reads per second per cluster
is achievable.

When Jonas called he had achieved 950k reads per second, and I
told him to try out the Dolphin DX cards which were also
available on the machines. Then we managed to increase the
performance to just over 1 million, up to 1,070,000.
Quite nice. Maybe even more impressive is that it was also
possible to do more than 600,000 write operations per second
(these are all transactional).

This run of flexAsynch was focused on seeing how many operations
per second one could get through. I then decided I was also
interested in seeing how much bandwidth we could handle
in the system. So we changed the record size from 8 bytes to
2000 bytes. When trying it out with Gigabit Ethernet we reached
60,000 reads and 55,000 inserts/updates per second. A quick
calculation (60,000 x 2000 bytes) shows that we're doing almost
120 MBytes per second of reads and 110 MBytes per second of
writes to the data node. This is obviously where the limit of
Gigabit Ethernet lies, so the bottleneck was an easy catch.

Then we tried the same thing using the Dolphin DX cards. We got
250,000 reads per second and more than 200,000 writes per
second. This corresponds to almost 500 MBytes per second of
reads from the database and more than 400 MBytes per second of
writes to the data nodes.

I had to check whether this was actually the limit of the set-up
I had for the Dolphin cards (they can be set up to use either
x4 or x8 on the PCI Express bus). Interestingly enough, after
working in various ways with Dolphin cards for 15 years, this is
the first time I really cared about the bandwidth they could push
through. The performance of MySQL Cluster has never been close to
saturating the Dolphin links in the past.

However today we managed to saturate the links. The maximum
bandwidth achievable by a microbenchmark with a single process
was 510 MBytes per second and we achieved almost 95% of this
number. Very impressive indeed, I think. What's even more
interesting is that the Dolphin card used the x4 configuration,
so it can actually do 2x the bandwidth in the x8 setting, and
the CPUs were fairly lightly loaded on the system, so it's
likely that we could come very close to saturating the link
even using an x8 configuration of the Dolphin cards. So that's
a milestone to me, that MySQL Cluster has managed to
saturate even the bandwidth of a cluster interconnect with
very decent bandwidth.

This actually poses an interesting database recovery
problem for the MySQL Cluster architecture. How
does one handle 1 GByte per second of writes to each data node
in the system when used with persistent tables, which have
to be checkpointed and logged to disk? This requires
bandwidth to the disk subsystem of multiple GBytes per
second. It's only reasonable to even consider doing this
with the upcoming new high-performance SSD drives. I
heard an old colleague, nowadays working for a disk
company, mention that he had demonstrated 6 GBytes
per second to local disks, so this actually is a
very nice fit. It turns out that this problem can also be
solved.

Actually SSD drives are also a very nice fit for
the disk data part of MySQL Cluster. Here it makes all
the sense in the world to use SSD drives as the place
to put the tablespaces for the disk part of MySQL
Cluster. This way the disk data also becomes part of
the real-time system and you can fairly easily build a
terabyte database with exceedingly high
performance. Maybe this is to some extent a reply to
Mark Callaghan's request for a data warehouse based
on MySQL Cluster. Not that we really focused so
much on it, but the parallelism and performance
available in a large MySQL Cluster based on 6.4 will
be breathtaking even to me, with 15 years of thinking
about this behind me. A final word on this is that
we are actually also working on a parallel query
capability for MySQL Cluster. This is going to be
based on some new advanced additions to the storage
engine interface we're currently working on
(Pushdown Query Fragment for those that joined the
storage engine summit at Google in April this year).

A nice thing about being part of Sun is that they're
building the HW which is required to build these
very large systems and are very interested in doing
showcases for them. So all the technology to do
what has been discussed above is available within
Sun.

Sorry for writing a very long blog. I know it's
better to write short and to-the-point blogs,
however I found so many interesting angles on the
subject.

Monday, November 24, 2008

Poll set to handle poll, eventports, epoll, kqueue and Windows IO Completion

This blog describes background and implementation of a poll
set to monitor many sockets in one or several receive
threads. The blog is intended as a description but also to
enable some feedback on the design.

I've been spending some time working out all the gory details
of starting and stopping lots of send threads, connection
threads and receive threads over the last few months.

The receive threads will be monitoring a number of socket
connections and as soon as data is available ensure that
the data is received and forwarded to the proper user
thread for execution.

However listening on many sockets is a problem which needs
a scalable solution. Almost every operating system on the
planet has some solution to this problem. The problem is
that they all have different solutions. So I decided to
make a simple interface to those socket monitoring solutions.

First my requirements. I will handle a great number of socket
connections from each thread, and I will have the ability to
move socket connections from one receive thread to another
to dynamically adapt to the usage scenario. However, mostly
the receive thread will wait for events on a set of socket
connections, handle them as they arrive and go back to waiting
for more events. On the operating system side I aim at
supporting Linux, OpenSolaris, Mac OS X, FreeBSD and Windows.

One receive thread might be required to listen to socket
connections from many clusters. So there is no real limit
to the number of sockets a receive thread can handle. I
decided however to put a compile-time limit in there, since
e.g. epoll requires this at create time. This is currently
set to 1024. So if more sockets are needed, another receive
thread is required even if it isn't needed from a performance
point of view.

The interface aims to cover 5 different implementations.

epoll
-----
The epoll interface is a Linux-only interface which uses
epoll_create to create an epoll file descriptor; then
epoll_ctl is used to add/drop file descriptors to/from the
epoll set. Finally epoll_wait is used to wait for events to
arrive. Socket connections remain in the epoll set as
long as they are not explicitly removed or closed.
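
As a reminder of how these Linux calls fit together, here is a
minimal stand-alone epoll sketch (error handling mostly omitted):

#include <sys/epoll.h>
#include <unistd.h>

/* Wait for readability on one socket using epoll */
static int wait_for_data(int sock_fd, int timeout_ms)
{
  struct epoll_event ev, events[16];
  int epoll_fd= epoll_create(1024);   /* size hint, mostly ignored by
                                         modern kernels */
  int n;

  ev.events= EPOLLIN;
  ev.data.fd= sock_fd;
  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &ev); /* add to the set */

  n= epoll_wait(epoll_fd, events, 16, timeout_ms);  /* wait for events */

  close(epoll_fd);                    /* sockets stay in the set until
                                         removed or the set is closed */
  return n;                           /* number of ready descriptors */
}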

poll
----
Poll is the standard that is there simply to make sure it
works also on older platforms that have none of the other
mechanisms. Here there is only one system call,
the poll call, and all the state of the poll set needs to
be taken care of by this implementation.

kqueue
------
kqueue achieves more or less the same thing as epoll; it does
so however with a more complex interface that can support a
lot more things, such as polling for completed processes and
i-nodes and so forth. It has a kqueue method to create the
kqueue file descriptor and then a kevent call which is used
both to add, drop and listen to events on the kqueue file
descriptor. kqueue exists in BSD OSes such as FreeBSD and Mac OS X.

eventports
----------
eventports is again a very similar implementation to epoll. It
has a port_create call to create the eventport file descriptor,
a port_associate call to add a socket to the eventport set,
a port_dissociate call to drop a socket from the set and
a port_getn call to wait for the events arriving. There is
however a major difference in that after an event arrives in a
port_getn call, the socket is removed from the set and has to be
added back. From an implementation point of view this mainly
complicated the error handling in my design.

Windows IO Completion
---------------------
I have only skimmed this so far, and it differs mainly in being a
tad complex and also in that events from the set can be distributed
to more than one thread. However this feature will not be used in
this design. I have not implemented this part yet; I need to
get all the bits together for building on Windows first.

Implementation
--------------
The implementation is done in C but I still wanted to have
a clear object-oriented interface. To achieve this I
created two header files: ic_poll_set.h, which declares all
the public parts, and ic_poll_set_int.h, which defines the
private and public data structures used. This means that
the internals of the IC_POLL_SET object are hidden from the
user of this interface.

Here is the public part of the interface (the code is GPL
licensed but hasn't been released yet):

/* Copyright (C) 2008 iClaustron AB, All rights reserved */
struct ic_poll_connection
{
  int fd;
  guint32 index;
  void *user_obj;
  int ret_code;
};
typedef struct ic_poll_connection IC_POLL_CONNECTION;

struct ic_poll_set;
typedef struct ic_poll_set IC_POLL_SET;
struct ic_poll_operations
{
  /*
    The poll set implementation isn't multi-thread safe. It's intended to be
    used within one thread; the intention is that one can have several
    poll sets, but only one per thread. Thus no mutexes are needed to
    protect the poll set.

    ic_poll_set_add_connection is used to add a socket connection to the
    poll set. It requires only the file descriptor and a user object of
    any kind. The poll set implementation will ensure that this file
    descriptor is checked together with the other file descriptors in
    the poll set, independent of the implementation in the underlying OS.

    ic_poll_set_remove_connection is used to remove the file descriptor
    from the poll set.

    ic_check_poll_set is the method used to check which socket
    connections are ready to receive.

    ic_get_next_connection is used in a loop where it is called until it
    returns NULL after an ic_check_poll_set call; the output from
    ic_get_next_connection is prepared already at the time of the
    ic_check_poll_set call. ic_get_next_connection will return an
    IC_POLL_CONNECTION object. It is possible that ic_check_poll_set
    returns without error whereas the IC_POLL_CONNECTION can still
    have an error in the ret_code of the object. So it is important to
    check both this return code and the return code from the
    call to ic_check_poll_set (this is due to the implementation using
    eventports on Solaris).

    ic_free_poll_set is used to free the poll set; it will also, if the
    implementation so requires, close any file descriptor of the poll
    set.

    ic_is_poll_set_full can be used to check if there is room for more
    socket connections in the poll set. The poll set has a limited size
    (currently set to 1024) set by a compile time parameter.
  */
  int (*ic_poll_set_add_connection) (IC_POLL_SET *poll_set,
                                     int fd,
                                     void *user_obj);
  int (*ic_poll_set_remove_connection) (IC_POLL_SET *poll_set,
                                        int fd);
  int (*ic_check_poll_set) (IC_POLL_SET *poll_set,
                            int ms_time);
  const IC_POLL_CONNECTION*
    (*ic_get_next_connection) (IC_POLL_SET *poll_set);
  void (*ic_free_poll_set) (IC_POLL_SET *poll_set);
  gboolean (*ic_is_poll_set_full) (IC_POLL_SET *poll_set);
};
typedef struct ic_poll_operations IC_POLL_OPERATIONS;

/* Creates a new poll set */
IC_POLL_SET* ic_create_poll_set();

struct ic_poll_set
{
  IC_POLL_OPERATIONS poll_ops;
};
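
To show how the interface is intended to be used, here is a small
usage sketch of a receive loop built on top of it. The function name
and the way the data is consumed are of course just for illustration:

#include "ic_poll_set.h"

static int receive_loop(int sock_fd, void *conn_obj)
{
  IC_POLL_SET *poll_set;
  const IC_POLL_CONNECTION *poll_conn;
  int ret_code;

  if (!(poll_set= ic_create_poll_set()))
    return 1;
  if ((ret_code= poll_set->poll_ops.ic_poll_set_add_connection(poll_set,
                                                               sock_fd,
                                                               conn_obj)))
    goto end;
  for (;;)
  {
    /* Wait at most 10 milliseconds for events on the poll set */
    if ((ret_code= poll_set->poll_ops.ic_check_poll_set(poll_set, 10)))
      break;
    /* Loop over the connections that are ready to receive */
    while ((poll_conn= poll_set->poll_ops.ic_get_next_connection(poll_set)))
    {
      if (poll_conn->ret_code)
        continue;        /* per-connection error, see the comment above */
      /* read from poll_conn->fd and hand the data to poll_conn->user_obj */
    }
  }
end:
  poll_set->poll_ops.ic_free_poll_set(poll_set);
  return ret_code;
}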

Friday, November 21, 2008

DTrace, opensolaris and MySQL Performance

Currently I'm working hard to find and remove scalability
bottlenecks in the MySQL Server. MySQL was acquired by Sun
about 10 months ago now. Many people have wondered in blogs what
the impact of this acquisition has been. My personal
experience is that I now have a chance to work with Sun
experts in DBMS performance. As usual it takes time when
working on new challenges before the inspiration
starts flowing. However I've seen this flow of inspiration
starting to come now, and our joint work is
starting to bear fruit. I now have a much better understanding
of MySQL Server performance than I used to have. I know fairly
well where the bottlenecks are and I've started looking
into how they can be resolved.

Another interesting thing with Sun is the innovations they have
done in a number of areas. One such area is DTrace. This is a
really interesting tool which I already used to analyse some
behaviour of MySQL Cluster internals with some success. However
to analyse other storage engines inside MySQL requires a bit more
work on inserting DTrace probes at appropriate places.

To work with DTrace obviously means that you need to work with
an OS that supports DTrace. Solaris is one such OS; I actually
developed NDB Cluster (the storage engine for MySQL Cluster) on
Solaris for the first 5-6 years. So one would expect Solaris to be
familiar to me, but having worked mainly with Linux for 6-7 years
means that most of the Solaris memory is gone.

So how to go about developing on Solaris? I decided to install a
virtual machine on my desktop. As a well-behaved Sun citizen I
decided to opt for VirtualBox in my choice of VM. This was an
interesting challenge, very similar to my previous experiences of
installing a virtual machine. It's easy to get the VM up and
running, but how do you communicate with it? I found some
instructions on how to set up IP links to a virtual machine, but
to make life harder I have a fixed IP address on my desktop, so
this complicated life quite a bit. Finally I learned a lot about
how to set up virtual IP links, which I have already managed to
forget about :)

The next step was to get going on having a development environment
for opensolaris. I soon discovered that there is a package
manager in opensolaris which can be used to get all the needed
packages. However after downloading a number of packages I
stumbled into some serious issues. I learned from this experience
that using Developer Previews of OSes is even worse than newly
released OSes, which I already know by experience aren't for the
fainthearted.

So I decided to install a released opensolaris version instead
(the OpenSolaris2008.05 version). After some googling I discovered
a very helpful presentation at opensolaris developer how-to
which explained a lot about how to install a development
environment for opensolaris.

After installing opensolaris 2008.05 and following the
instructions on how to install a development environment,
I am now equipped to develop DTrace probes and scripts and
try them out on my desktop.

I definitely like the idea that opensolaris is looking more
like yet another Linux distribution since it makes it a
lot simpler to work with it. I would prefer GNU developer
tools to be there from scratch but I have the same issue
with Ubuntu.

That the system calls are different doesn't bother me as a
programmer, since different APIs for similar things are
something every programmer encounters when developing
for a multi-platform environment. I even look forward to
trying out a lot of Solaris system calls since there are
lots of cool features for locking processes to CPUs,
controlling which CPUs handle interrupts, resource groups,
scheduling algorithms and so forth. I recently noted that most
of these things are available on Linux as well. However
I am still missing the programming APIs to these
features.

Thursday, October 30, 2008

CACHE INDEX per partition for MyISAM

The newest development in the partitioning code
is WL#4571.

This new feature makes it possible to tie a
partition of a MyISAM table to a specific key cache.
The syntax for how to do this is available in the
above worklog entry.

We found this feature to be useful for enabling
higher performance of parallel ALTER TABLE
(WL#2550). When adding
a primary key to a MyISAM table, the key cache in
MyISAM severely limited the scalability of parallel
ALTER TABLE, so by adding several key caches, essentially
one per partition, we can ensure that the ALTER TABLE
can be fully parallelised (all other ALTER TABLE
operations on MyISAM already scale perfectly).

We also have some ideas on how to solve the underlying
problem of making the key cache more scalable,
by dividing the key cache mutex into one
mutex per range of key cache pages.

New launchpad tree for PARTITION BY RANGE COLUMN_LIST(a,b)

A colleague of mine at Sun/MySQL showed me how to get
statistics for my blog. This made for some interesting
reading. I noted that there was a great
interest in partitioning-related information and that
the new partitioning feature mentioned in my blog
2 years ago still attracts a lot of attention.

So I thought it was a good idea to blog a bit more
about what's going on in the partitioning
development. I decided to check out how easy it is
to externalize my development trees on launchpad.
It turned out to be really easy so I simply
put up the development tree for the new partitioning
feature which I described in my last blog.

Launchpad tree

I also externalized the Worklog entry for this
development, unfortunately not a very long
description but I'll try to work on that.
There is a new test case in the mysql-test/t
directory called partition_column.test which
shows how to use these new features (it might
take some time before this link works).

Worklog description

Tuesday, October 07, 2008

Further development on new partitioning feature

As mentioned in a blog 2 years ago I worked on a new
partitioning feature:

I've been busy with many other things but now I've taken this
work a step forward and will most likely set-up a preview tree
of this feature in a short time.

The new feature adds the possibility to perform partitioning
on any type of column also for range and list partitioning
(this has been possible for KEY partitioning all along). It also
adds a new function to the MySQL Server; this function is
monotonic, which means it gets a nice treatment by
partition pruning. This new function is TO_SECONDS, which
works very similarly to TO_DAYS.

So here are a couple of new examples of what one will be able to do:

create table t1 (d date)
partition by range column_list(d)
( partition p0 values less than (column_list('1999-01-01')),
partition p1 values less than (column_list('2000-01-01')));

create table t1 (a date)
partition by range(to_seconds(a))
(partition p0 values less than (to_seconds('2004-01-01')),
partition p1 values less than (to_seconds('2005-01-01')));

select * from t1 where a <= '2003-12-31';

This select will be discovered to only find values in p0 by
the partition pruning optimisation step.

create table t1 (a int, b int)
partition by range column_list(a,b)
(partition p2 values less than (column_list(99,99)),
partition p1 values less than (column_list(99,999)));

insert into t1 values (99,998);
select * from t1 where a = 99 and b = 998;

This select statement will discover that matching records
can only be in the p1 partition and avoid
scanning the p2 partition. Thus partitioning works
in very much the same manner as a first-level index.

Thursday, October 02, 2008

dbt2-0.37.37 uploaded and various other stuff

There was a small bug in the dbt2-0.37.36 version I uploaded which
I have now fixed in the new dbt2-0.37.37 version.

There have also been some interesting benchmark tests done where
we have run DBT2 on a T5220 box (Niagara II chips). We can show
the scalable performance benefits here as well. We've been able
to run with 20 data nodes on one box (these boxes can run up to
64 threads at a time) with a scalable performance increase from
4 nodes up.

We had a developer meeting a few weeks ago and there were lots of
activities. Personally I had the most fun seeing the demo of
parallel ALTER TABLE. We loaded a table with 10 million 70-80 byte
rows. We had access to a machine with 64 GB of memory and
16 cores. It was very interesting to run one SQL command and
see the load of mysqld in top go to 1600%. Altering a 10 million
row table in 2.5 seconds I thought was pretty good.

Another cool demo was to see the online add node in MySQL Cluster.
This is an interesting feature which I started thinking about
in 1999; I had a first design then, changed to a second variant
in 2001 and changed again around 2005, and the final version that
was implemented was the fourth version of the design. The nice
thing is that the fourth version actually contains some nice
innovations that none of the earlier designs had. So letting an
idea simmer for a long time can be really beneficial sometimes.
For a very brief description of this work see Jonas Oreland's
blog.

Jonas and Pekka are also working on another cool optimisation
of MySQL Cluster where the data node will become multithreaded.
There will be up to 6 threads in the first released version of
this. Jonas measured in a test today that one could do 370,000
inserts per second on one 8-core box with this feature (and this
is still a fairly unstable version where there are still some
performance issues remaining). We're getting close to measuring
computer speed in MDO (MegaDatabaseOperations per second)
instead of in MHz.

Jonas and I are also working on removing from MySQL Cluster
"the single transporter mutex", which will improve the scalability
of MySQL Servers using MySQL Cluster. We're working on this in
parallel using the same basic design but with small variations
in the details. It will be interesting to see which variant
works best.

Finally Frazer has optimised the handling of large records in
the data node to the extent that inserts of 5k records get
twice the speed. The interesting thing is that the benchmark
for this hits the limit of Gigabit Ethernet already with one
CPU working at 80%, which is quite interesting.

So as you can see there are a lot of interesting things cooking
at MySQL, and then I haven't even mentioned the work we're
doing together with other Sun folks on optimising MySQL. More
on that later.

Tuesday, September 09, 2008

Linear Scalability of MySQL Cluster using DBT2

Achieving linear scalability of MySQL Cluster using the DBT2
benchmark has been a goal of mine for a long time now. Last
week I finally found the last issue that limited the scalability.
As usual, once the issue was discovered it was trivial (in this
case it was fixed by inserting 3 zeros in the NDB handler code).
We can now achieve ~41k TPM on a 2-node cluster, ~81k on a
4-node cluster and ~159k TPM on an 8-node cluster, giving roughly
97% improved performance for each doubling of the number of nodes.
So there is nothing stopping us now from going all the way up to
1M TPM except lack of hardware :)

I've learned a lot about what affects scalability and what
affects performance of MySQL Cluster by performing these
experiments and I'll continue writing up those experiences on
my blog here. I have also uploaded a new DBT2 version where I
added a lot of new features to DBT2, improved the performance
of the benchmark itself and also ensured that running with many
parallel DBT2 drivers still provides correct results when
adding the results together. It can be downloaded from
www.iclaustron.com

Thursday, August 21, 2008

Some food for thoughts: How to make use of new SSD devices

The hardware guys are presenting new storage devices called
SSDs, based on flash memory. At the moment I think they are
about 3-4 times cheaper than DRAM memory and the gap seems
to be increasing. They're still far from the price of hard
drives, but here too the gap seems to be closing.

Now that I'm an employee of Sun, which actually puts together
systems with this type of HW in them, I get asked what I
as a DBMS developer can do with those devices.

First some comments on performance. These new devices will be
able to perform reads and writes of a few kilobytes large pages
in about 25-100 microseconds compared to hard drives which
takes about 3-10 milliseconds for the same thing.

An obvious use is to speed up database
logging, particularly in commit situations. However this
doesn't really require any significant changes to the SW
already out there. So I won't spend any more time on this use.

Another use is for MySQL Cluster. MySQL Cluster stores most data
in memory and can store non-indexed data on disk. So how can
SSD devices be used to improve this?

First some facts about performance of MySQL Cluster. In the data
node where the data actually resides it takes about 10
microseconds of processing time to perform a key lookup and a
scan has about 20 microseconds of start-up costs whereafter each
record takes 1-2 microseconds to fetch.

So now for the idea. Let's assume we use an SSD device as swap
memory. We would then purposely set the swap to be e.g. 10x
larger than the memory. For this to work we need to be able to
allocate memory from different swap pools; memory used for
transaction state and things like that we don't want swapped out
(working for Sun is an advantage here since we can work with the
OS guys directly, but naturally I hope Linux developers also take
the same opportunity).

So during a key lookup we need to get one page from the hash index
and one page with the record in it. Guesstimating a 90% hit rate in
the hash index and an 80% hit rate on the data page, we will incur
about 0.3 swap misses per key lookup (0.1 + 0.2). If we assume 50
microseconds per miss, the mean key lookup time will increase
from 10 microseconds to about 25 microseconds (10 + 0.3 x 50).
This should be acceptable, given that we can increase data size by
a factor of about 10.

A similar analysis can be made for scans as well, but I'm lazy so
will leave it to you to perform :)

So given today's sizes of memory and SSDs it should be possible
to use systems with 64 GBytes of memory and 640 GBytes of SSD;
clustering 8 of those with replication gives us a main-memory
based system for a reasonable price, providing 2.5 TBytes of user
data in a highly available system with a high degree of parallelism
in the system.

New partitioning features

As burtonator pointed out parallelism is an important
feature that partitioning makes possible. So I thought
it might be a good idea to mention a little bit what
we're doing in the area of partitioning.

It's quite correct that parallelism is one of the main
advantages of partitioning (not the only one though since
also partition pruning and dividing large indexes and
being able to add and drop partitions efficiently are
important as well). In 5.1 we focused on the maintenance
features of partitioning but the intention to move on
to parallelisation was more or less the main goal from
the very start.

This is why it's such extra fun to actually get going on
this when one has worked on the foundation for this work
for almost 4 years (partitioning development started out
2004 H2 and most of the partitioning code in 5.1 was ready
about two years later).

There are also ideas to introduce parallelism for scans of
large partitioned tables and also a few more maintenance
features that are still missing.

Another feature in the works for partitioning is the
ability to use partition pruning on several fields. This
will be possible for PARTITION BY RANGE and LIST. The
syntax will look something like this:

CREATE TABLE t1 (a varchar(20), b int)
PARTITION BY RANGE (COLUMN_LIST(a,b))
(PARTITION p0 VALUES LESS THAN (COLUMN_LIST("a", 1)),
PARTITION p1 VALUES LESS THAN
(COLUMN_LIST(MAXVALUE, 4)));

In this case it is possible to partition on any field type
and it is also possible to do partition pruning on multiple
fields in much the same way as it is for indexes.

E.g.
select * from t1 where a = "a";
select * from t1 where a = "a" and b = 2;

will both be able to use partition pruning, with the
second obviously able to do more pruning than the first one.

Wednesday, August 20, 2008

Multi-threaded ALTER TABLE

Today I achieved something which is a first in the MySQL
Server as far as I'm aware. I managed to run a query
with multiple threads. The query was:
ALTER TABLE t1 ADD COLUMN b int;
and the table had 4 partitions in it. So it used 4 threads,
where each thread handled the copying of data from the old
table to the new table for one partition.

Currently it's designed for use by partitioned tables, but
it should be very straightforward to do minor parallelisation
of non-partitioned tables as well, by e.g. splitting the work
into a scan thread and a write thread.

It's nice to get started on this track and see how one can
make use of modern computers with a great deal of CPU power
if one can parallelise the applications. As an example, a
dual-socket T5220 box (2 Niagara II CPUs) can handle 128
threads in parallel.

Friday, August 01, 2008

3: Thoughts on a new NDB API: Adaptive send algorithm

I thought a bit more about the adaptive send algorithm and kind of
like the following approach:

Keep track of how many sends we are at most allowed to wait for
before we send anyway. This is the state of the adaptive send
algorithm, which is adapted through the following use of statistics
(we call this state variable max_waits):

For each send we calculate how much time has passed since the
send that happened max_waits sends ago. We also do the same for
max_waits + 1. At certain intervals (e.g. every 10 milliseconds) we
calculate the mean wait that a send would have had to do. If this
lies within half the desired maximum wait then we accept the
current state, and if the mean value using max_waits + 1 is also
acceptable then we increase the state by one. If the state isn't
acceptable we decrease it by one.

In the actual decision making we will always send as soon as we
notice that more than the maximum wait time has passed, so this
means that the above algorithm is conservative. However the user
should have the ability to control how long a wait he accepts
through a configuration variable, thus increasing or decreasing
send buffering at the expense of extra delays.

This algorithm is applied per socket; the actual decision
making as well as the statistics calculation are done within the
critical section, and from coding this up it seems like the
overhead should be manageable.
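
A rough sketch in C of the bookkeeping described above; the names
and tuning constants are mine and not final code, and warm-up of an
empty send history is ignored:

#define MAX_SEND_HISTORY 64   /* ring buffer, > largest max_waits + 1 */

typedef struct ic_adaptive_send_state
{
  unsigned long long send_times[MAX_SEND_HISTORY]; /* microsecond timestamps */
  unsigned int next_index;          /* slot for the next send time */
  unsigned int max_waits;           /* current state: sends we may buffer */
  unsigned long long sum_wait;      /* accumulated wait using max_waits */
  unsigned long long sum_wait_plus; /* accumulated wait using max_waits + 1 */
  unsigned int num_samples;
} IC_ADAPTIVE_SEND_STATE;

/* Called for every send with the current time in microseconds */
static void adaptive_send_sample(IC_ADAPTIVE_SEND_STATE *s,
                                 unsigned long long now)
{
  unsigned int idx= s->next_index;
  s->send_times[idx]= now;
  s->sum_wait+=
    now - s->send_times[(idx + MAX_SEND_HISTORY - s->max_waits) %
                        MAX_SEND_HISTORY];
  s->sum_wait_plus+=
    now - s->send_times[(idx + MAX_SEND_HISTORY - s->max_waits - 1) %
                        MAX_SEND_HISTORY];
  s->num_samples++;
  s->next_index= (idx + 1) % MAX_SEND_HISTORY;
}

/* Called at regular intervals, e.g. every 10 milliseconds */
static void adaptive_send_adjust(IC_ADAPTIVE_SEND_STATE *s,
                                 unsigned long long max_wait_micros)
{
  unsigned long long mean, mean_plus;
  if (s->num_samples == 0)
    return;
  mean= s->sum_wait / s->num_samples;
  mean_plus= s->sum_wait_plus / s->num_samples;
  if (mean <= max_wait_micros / 2)
  {
    /* Current state acceptable; if max_waits + 1 also is, be bolder */
    if (mean_plus <= max_wait_micros / 2 &&
        s->max_waits + 2 < MAX_SEND_HISTORY)
      s->max_waits++;
  }
  else if (s->max_waits > 0)
    s->max_waits--;                 /* too much waiting: send more eagerly */
  s->sum_wait= s->sum_wait_plus= 0;
  s->num_samples= 0;
}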

Thursday, July 31, 2008

1: Making MySQL Cluster scale perfectly in the DBT2 benchmark: Initial discussion

Since 2006 H1 I've been working on benchmarking MySQL
Cluster using the DBT2 test suite. Initially this meant
a fair amount of work on the test suite itself and also
a set of scripts to start and stop NDB data nodes, MySQL
Servers and all the other processes of the DBT2 test.
(These scripts and the DBT2 tests I'm using are available
for download at www.iclaustron.com)

Initially I worked with an early version of MySQL Cluster
based on version 5.1 and this meant that I hit a number
of the performance bugs that had appeared in the
development process. Nowadays the stability is really good,
so for the most part I've spent my time focusing on what
is required in the operating system and the
benchmark application for optimum scalability.

Early on I discovered some basic features that were required
to get optimum performance of MySQL Cluster in those cases.
One of them is simply to use partitioning properly. In the
case of DBT2 most tables (all except the ITEM table) can
be partitioned on the warehouse id. So the new feature I
developed as part of 5.1 came in handy here. It's possible to
use either PARTITION BY KEY (warehouse_id) or PARTITION BY
HASH (warehouse_id). Personally I prefer PARTITION BY HASH
since it spreads the warehouses perfectly amongst the data
nodes. However in 5.1 this isn't fully supported, so one has
to start the MySQL Server using the flag --new to use this
feature with MySQL Cluster.

The second one was the ability to use the transaction
coordinator on the same node as the warehouse the
transaction is handling. This was handled by a new
feature introduced in MySQL Cluster Carrier Grade
Edition 6.3 whereby the transaction coordinator is
started on the node where the first query is targeted.
This works perfectly for DBT2 and for many other
applications, and it's fairly easy to change your
application if it doesn't fit immediately.

The next feature was to ensure that sending uses as
big buffers as possible and also to avoid wake-up
costs. Both those features meant changes to the
scheduler in the data nodes of MySQL Cluster.
These changes work very well in most cases where
there are sufficient CPU resources for the data nodes.
This feature was also introduced in MySQL Cluster CGE
version 6.3.

Another feature which is very important to achieve
optimum scalability is to ensure that the MySQL Server
starts scans only on the data nodes where it will
actually find the data. This is done through the use
of partition pruning as introduced in MySQL version
5.1. Unfortunately a bug was introduced late, which I
recently discovered, that decreased
scalability for DBT2 (this is bug#37934, which contains
a patch that fixes the bug; it hasn't been pushed yet
to any 6.3 version).

With these features there were still a number of scalability
issues remaining in DBT2. One was the obvious one that the
ITEM table is spread over all data nodes and thus reads of the
ITEM table will use network sockets that aren't so "hot".
There are two solutions to this. One is that MySQL Cluster
implements some tables as fully replicated on all data nodes;
this might arrive some time in the future. The other variant
uses standard MySQL techniques: one places the table in
another storage engine, e.g. InnoDB, and uses replication to
spread the updates to all the MySQL Servers in the cluster.
This technique can be applied to
many web applications where there are tables that need to be
in MySQL Cluster to handle availability and the requirement
that data be updated through proper transactions, but
there are also other tables which can be updated in a lazy
manner.

Finally there is one more remaining issue and this is when the
MySQL Server doesn't work on partitioned data. That is, in the
case of DBT2, if all MySQL Servers can access data in a certain
node group then the data nodes will have more network sockets to
work with, which increases the cost of networking. This limits
scalability as well.

In the case of DBT2 this can be avoided by using a spread
parameter that ensures that a certain MySQL Server only uses a
certain node group in the MySQL Cluster. In a generic application
this would be handled by an intelligent load balancer that
ensures that the MySQL Servers work on different partitions of
the data in the application.

What I will present in future blogs is some data on how much the
effects mentioned above have on the scalability of the DBT2
benchmark for MySQL Cluster.

What is more surprising is that there are also a number of other
issues related to the use of the operating system which aren't
obvious at all. I will present those as well and what they mean
in terms of scalability for MySQL Cluster using DBT2.

Finally, in a real application there will seldom be perfect
scalability, so in any real application it's also
important to minimize the impact of scalability issues. The
main technology to use here is cluster interconnects, and I
will show how the use of cluster interconnects affects
scalability issues in MySQL Cluster.

Note that the numbers from these DBT2 runs are merely used here to
compare different configurations of MySQL Cluster.

Wednesday, July 30, 2008

2: Thoughts on a new NDB API: Send part

In the current API, when sending, one takes the Transporter mutex
and then sends all the signals generated towards one or many nodes.
There is also some handling of adaptive sends; however this
adaptive algorithm covers all nodes, thus the decision to wait
with sending is global across all nodes.

The new design uses one mutex for the sending, however this mutex only
controls the sending part of one socket. Also the time for holding the
mutex is just enough to check the state, no send operations are done
while holding the mutex.

The new adaptive algorithm will keep track of the last sent
messages on this socket, and in principle the idea is that if
there is at least a 90-99% probability that it is a good idea to
wait, then it will wait (unless the application has provided the
force send flag). It will do so by keeping track of the last few
messages sent.

So in principle the data structure protected by the mutex is:
struct ic_send_node_mutex
{
  IC_SEND_THREAD_MUTEX *send_thread_mutex;
  Mutex mutex;
  boolean send_active;
  IC_COMM_BUFFER *first_cb;
  IC_COMM_BUFFER *last_cb;
  uint32 queued_bytes;
  Timer first_buffered_timer;
  Timer last_sent_timers[8];
  uint32 last_sent_timer_index;
};

For each socket there is a specific send thread. This thread is
mostly sleeping, waiting for someone to wake it up. One reason
to wake it up is if one thread has started sending and other threads
have provided so much work that it needs to offload this sending to
a specific thread (the idea is that the sending is normally done by
an application thread which is involved in user activity, and we
cannot keep this thread for longer than a few sends, thus we need
to make it possible to offload send activity to a specific send
thread when a high load appears). The send thread could also be
awakened to send buffered messages that have timed out.

The flag send_active is true whenever a thread is actively sending,
and thus a thread that needs to send when this flag is set can
simply return immediately; if it's not set then the thread can set
the flag and start sending.

It would probably be possible to handle this without a mutex, but the
contention on this mutex should be small enough and also there is some
wakeup logic that makes sense for a mutex.

The application thread can prepare the NDB Protocol messages completely
before acquiring the mutex, the only activity which sometimes happens
inside the mutex is reading the time for handling of the adaptive
algorithm.
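
A condensed sketch of this send path, using the struct above; the
lock primitives and the helper functions are placeholders invented
for this illustration:

/* Placeholder primitives assumed to exist elsewhere in the design */
void ic_mutex_lock(Mutex *mutex);
void ic_mutex_unlock(Mutex *mutex);
boolean time_to_send(struct ic_send_node_mutex *snm);
void send_queued_buffers(struct ic_send_node_mutex *snm);
void wake_send_thread(IC_SEND_THREAD_MUTEX *send_thread_mutex);

static void ic_try_send(struct ic_send_node_mutex *snm,
                        boolean force_send)
{
  boolean do_send= FALSE;

  /* The NDB Protocol messages were fully prepared before this point */
  ic_mutex_lock(&snm->mutex);
  if (!snm->send_active && (force_send || time_to_send(snm)))
  {
    snm->send_active= TRUE;      /* we become the sender */
    do_send= TRUE;
  }
  ic_mutex_unlock(&snm->mutex);

  if (!do_send)
    return;                      /* another thread is already sending, or
                                    the adaptive algorithm decided to buffer */

  send_queued_buffers(snm);      /* the actual socket writes happen here,
                                    outside the mutex */

  ic_mutex_lock(&snm->mutex);
  snm->send_active= FALSE;
  if (snm->queued_bytes)         /* more messages arrived while we sent */
    wake_send_thread(snm->send_thread_mutex);
  ic_mutex_unlock(&snm->mutex);
}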

Sends normally go to an NDB data node but could also go to another
client node and could even go to another thread in the same process.
This is important to handle parallelisation: to parallelise it
is sufficient to send a number of messages to other nodes and/or
threads. Each message can kick off at least one new thread.

Tuesday, July 29, 2008

1. Thoughts on a new NDB API, Baseline thoughts

I spent some time during my vacation thinking about some
new ideas. I designed the first version of the NDB API
about 10 years ago and obviously in those days the maximum
number of CPU's in most systems was 2 so it wasn't a big
problem having a single mutex protecting send and receive
in the NDB API (The NDB API is the low level API used by the
storage engine NDB which is the storage engine in MySQL
Cluster).

Another design decision I made when designing the NDB API
was that most developers want to use a synchronous API.
Thus the asynchronous API was made afterwards and didn't
cover all operations. Most developers still develop using
synchronous APIs; however most of the use of the NDB
API is for specialised applications such as telco servers,
storage engine code and LDAP servers. Also, I'm even thinking
of using it inside an operating system kernel to design
a clustered file system.

Thus today it seems like a better idea to use an asynchronous
API as the base and then put the synchronous API on top of
this.

When designing the original NDB API it was sufficient to think
of simple key lookups; later it was extended to also handle
scans of tables and indexes. However current design problems
are related to parallelising SQL queries, and there are also
implementations of things such as BLOBs that actually require
multiple sequential and parallel operations. Thus in the new
design it's necessary to consider the possibility of starting
complex operations involving multiple threads (sometimes even
multiple processes) and multiple operations in sequence and in
parallel.

These ideas will be fed into the existing NDB API. They will also
be used in the iClaustron project, where I aim to build something
that can be used as a clustered file system. iClaustron is
designed with the aim of at some point in time being a useful
thing, but at the same time I use it as my personal playground
where I can test new ideas and see how they turn out when turned
into source code.

The original NDB API was designed in C++, as was all the rest of the
MySQL Cluster code. Within the data nodes I think we've found
a good compromise about what to use in the C++ language and what
not to use. However in general I found that debates around what
should be used in C++ tend to take a disproportionate amount of
time compared to the value of those discussions. So for that
reason I decided to use C as the language of choice for iClaustron.
Actually there were more reasons for this: it makes it a lot easier
to use the code inside an operating system kernel such as Linux
or FreeBSD, and second it makes it easier to write layers for other
languages such as Python, Perl,...

Most of the thoughts on this new NDB API have been in my mind for
more than 2 years (some of the thoughts have actually already been
implemented in NDB); however during my vacation I had
some fun designing all the details I hadn't considered
previously.

My idea of a nice vacation is to relax on a beach or walk in the
mountains while inventing some new ideas based on an interesting
problem. I cheated by solving a Sudoku as well this vacation, but
in general I like mind games that are related to what I do for a
living. Inventing the idea is the fun part of innovation; then
comes the hard part of actually working out all the details,
writing code, testing code and selling the ideas. This is the
work part.

I will follow this posting with a high-level view of the ideas as
far as they've been developed so far. In parallel I'll also "dump"
the ideas into code format. I like to think of my coding as a
"brain dump"; I have a fairly unusual way of writing code. I think
about the problem for a long time and when I'm satisfied I write
the code for it. I then write all the code with only a very minimal
set of compilation runs and test cases. The main idea of coding in
this phase is still design, so in principle I write a design in the
form of code. This also means that I try to write as many comments
as possible since I know that I will otherwise forget my ideas.
Working for MySQL has made me much more aware of software
engineering issues, so today I also do a bit of thinking on
software engineering in the design.

An architecture for the design is obviously very important, and the
architecture has borrowed heavily from the way the Linux kernel is
designed with lots of interfaces similar to the VFS interface in
Linux using a struct of a set of function pointers.

Friday, June 06, 2008

DTrace probes in MySQL and MySQL Cluster

I've worked on DTrace probes for a while now. It's
a really interesting tool. I've worked on MySQL Cluster
code since 1996 but this is the most advanced tool
I've used to see exactly what's going on inside the
MySQL Server and the data nodes.

I'm still at an early stage of using these DTrace probes
and there is still some work before they are publishable
but one can see very well what's going on inside the
processes in real-time.

My first finding was that a CPU percentage reported as 1% in
prstat on Solaris actually means that the process uses 64% of one
CPU thread; the 1% is the percentage of the total CPU resources
of the machine, which is different from what I'm used to from top.

The benchmark I'm analysing is the same DBT2 I've used in
a fairly long line of analyses of MySQL Cluster performance
over the last 2 years. This benchmark can be downloaded
from www.iclaustron.com/downloads.html. It's a DBT2 based
on version 0.37 with lots of additions to make it work when
running multiple MySQL Server instances, as is the case
with MySQL Cluster. Currently I'm running this on Solaris, so
there will soon be a new release with some fixes needed to
run this benchmark on Solaris.

Wednesday, March 26, 2008

Visited Hadoop Conference

NOTE: Any comments in this blog entry are based on my personal thoughts after visiting the Hadoop conference and don't represent any current plans within MySQL.

I visited the Hadoop conference today, which was a very interesting event. The room was filled to its limit; people were even standing up for lack of chairs. Probably around 300 people or so.

It was interesting to see the wide scope of web-scale problems that could be attacked using Hadoop. The major disruptive feature in Hadoop is the MapReduce solution to solving parallel data analysis problems.

One thing that I started thinking about was how one could introduce MapReduce into SQL. A presentation of HIVE showed an interesting approach to solving this problem. I thought a bit about how one could integrate a MapReduce solution in MySQL; there are certainly a lot of problems to solve, but I got a few interesting ideas.

The concept of being able to query both business data stored in a database and web-based logs and other type of massive amounts of data is certainly an interesting problem to consider.

In principle what one can add by introducing MapReduce into MySQL is the ability to handle streaming queries (queries that use dataflows as input table(s) and dataflows as output table).

However the actual implementations of Hadoop and HBase were still very much in their infancy, so availability and reliability were far from always-on, and performance wasn't yet a focus.

Tuesday, March 25, 2008

MySQL Architecture Workshop

We had a workshop in Stockholm in early March to discuss what can be done to innovate MySQL in a number of areas. Most of the work here will not be usable code for another year or two, and it will take a lot longer before it's used in Enterprise-ready binaries. Obviously there is no guarantee that this early work will reach production binaries. This work is part of an aim to advance the MySQL architecture over the next few years.

One interesting topic we discussed was Pushdown of Query Fragments to Storage Engine.

A Query Fragment is a piece of an SQL query; for example in a 3-way join, any join of 2 tables in this query is a Query Fragment, and the full query is also a Query Fragment. As part of this interface the storage engine can decide to perform its own optimisation using a new interface, or it can rely on the MySQL Server to handle this optimisation. If the Storage Engine decides it can handle the Query Fragment and the optimiser decides to use it, then the Query Fragment will be executed using the traditional Storage Engine API as if it were a normal table.

There are many engines that could make use of this new interface. Another interesting use of this interface is to implement parallel query support for the MySQL Server using this new interface. We hope to build a prototype of this sometime this year.

Please provide comments on this development on this blog, the development is in such an early phase that input is very welcome.

Visiting Internal Sun Technology Conference

My first real chance to meet up with my new colleagues at Sun was an internal technology conference at Sun. It was interesting to listen to what's cooking within Sun.

We got a presentation of Data Centers and their impact on the environment and it immediately triggered me to start thinking of how we can interact with power save functions from the MySQL Code. It was also interesting to see slides on how computer architecture is developing, this can be put into thinking about how the MySQL architecture should progress over the next few years.

Visiting Family History Technology Workshop at BYU

On the 13th of March I attended an interesting workshop on technology for genealogy. My interest in this is based on an interest in genealogy itself (my family tree currently contains about 3000 persons from various parts of Sweden, down to some farmers in northern Sweden born around 1400) and my interest in technology, and in particular how MySQL and MySQL Cluster can be used for genealogy applications. Being an LDS member myself also adds to my interest in the subject.

The LDS church has developed a Web API, FamilySearchAPI, through which genealogists, via their genealogy software, can work on a common database where they can add and edit information about our ancestors. The system currently contains 2.2 PB of data and is going to grow significantly as images and more genealogy information are added.

There were quite a few interesting discussions on how to link information between the source information (scanned images of historical documents), transcribed information from sources and derived family trees. The most complex problem in this application is the fuzziness of the base data and the fact that different genealogists can have many different opinions about how to interpret the fuzzy base data. Thus in order to solve the problem one has to somehow handle the quality of genealogists in the model.

From a database point of view this application requires a huge system with large clusters of information. It contains one part which is the base data (the scanned images), typically stored in a large clustered file system containing many petabytes of data. The derived data is smaller, but given that all versions need to be stored it will still be a really huge data set, and this is a fairly traditional relational database with large numbers of relations between data.

So what I take home from the workshop is ideas on what MySQL and MySQL Cluster should support in 3-5 years from now to be able to work in applications like this one.

Speeding up MySQL by 36% on the T2000

This post will focus on the performance tuning that we've been doing since December 2007 on the Sun T2000 server. We got a nice speedup of 36% with fairly small effort, and we have good hope that we can improve performance a great deal more. This effort is part of a new push at MySQL to improve performance on the Solaris and Linux platforms, and to some extent Windows as well. This report focuses on the T2000 using Solaris.

T1000 and T2000 are the first CoolThreads servers from Sun with the UltraSPARC T1 processors. The T1 is very energy efficient, which is extremely important to modern datacenters. On the other hand, leveraging the massive amount of thread-level parallelism (32 concurrent threads) provided by the CoolThreads servers is the key to getting good performance. As the CoolThreads servers are used by many Sun customers to run web facing workloads, making sure that MySQL runs well on this platform is important to Sun and MySQL customers, and also to the success of the CoolThreads servers and MySQL.

Note: This work was started long before it was known that MySQL was to be acquired by Sun Microsystems. The actual tuning work was done by Rayson in the performance team at MySQL.

The workload that we used was sysbench, which is a very simple benchmark. In particular, we only ran read-only OLTP sysbench for this tuning work. The reason is that if MySQL does not scale well with a simple read-only OLTP workload, it will not scale well with more complex workloads either, yet a more complex workload would need more time to set up and run.

This is a list of things that we tried.

1) Hardware setup and software versions used
============================================
The compiler version:
> cc -V
cc: Sun C 5.9 SunOS_sparc Build47_dlight 2007/05/22
usage: cc [ options] files. Use 'cc -flags' for details

Solaris version:
> cat /etc/release
Solaris 10 11/06 s10s_u3wos_10 SPARC
Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 14 November 2006

For each run, 5 results were collected; we discarded the best and the worst results and averaged the remaining 3. sysbench was invoked as follows:
> ./sysbench --test=oltp --num-threads=32 --max-time=60 --max-requests=0 --oltp-read-only=on run

Using default configuration of MySQL 5.0.45 and read-only OLTP sysbench 0.4.8 on a Sun T2000 running at 1GHz, the throughput measured was 1209 transactions per second.

2) Compiling with -fast
=======================
Since the workload is CPU intensive with very few I/O operations, we knew that compiler optimizations would be very beneficial to performance. As Sun used the -fast flag for compiling other CPU intensive benchmarks (e.g. SPEC CPU), using -fast was the first thing we tried; this was done by setting CFLAGS and CXXFLAGS to -fast before we ran the configure script.

The throughput measured was 1241 transactions per second, or an improvement of 2.6%.

3) Fixing headers for Sun Studio
================================
As using a higher optimization level gave us a small but nice improvement, we then looked for other opportunities from compiler optimizations. The first thing we noticed was that there were compiler directives that were not recognized by Sun Studio, and that inlining was disabled as well.

As the Sun Studio compiler supports inlining, we enabled it in InnoDB by modifying the header file: univ.i

The throughput went up 3.1% to 1279 transactions per second.
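
As a rough illustration of the kind of change involved (the actual contents of univ.i differ; this sketch only shows the principle of not losing inlining when the build compiler is not gcc):

/* Sketch only: choose an inlining-capable definition of UNIV_INLINE
   for both gcc and Sun Studio instead of falling back to a plain
   "static", which disables inlining of InnoDB's *.ic functions.     */
#if defined(__GNUC__)
# define UNIV_INLINE  static __inline__
#elif defined(__SUNPRO_C) || defined(__SUNPRO_CC)
# define UNIV_INLINE  static inline
#else
# define UNIV_INLINE  static
#endif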

We also enabled prefetching by using "sparc_prefetch_read_many()" and "sparc_prefetch_write_many()". This actually gave a small performance degradation: the throughput decreased by 0.47% to 1273 transactions per second. Since we do enable prefetching on Linux when gcc is used as the build compiler, we believe that the Niagara has enough MLP (Memory Level Parallelism) that it does not need much help from prefetching. However, we will see if this could benefit other SPARC servers (UltraSPARC IV+ and SPARC64 come to mind), or x64 servers running Solaris (when Sun Studio is used as the build compiler).
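
For illustration, the prefetch calls can be hidden behind macros roughly as below; the UNIV_PREFETCH_* names are used only for this sketch, and the assumption that the Sun Studio intrinsics come from <sun_prefetch.h> is mine:

/* Sketch only: map generic prefetch macros to gcc's builtin, to the
   Sun Studio SPARC prefetch intrinsics, or to nothing at all.       */
#if defined(__GNUC__)
# define UNIV_PREFETCH_R(addr)  __builtin_prefetch((addr), 0, 3)
# define UNIV_PREFETCH_RW(addr) __builtin_prefetch((addr), 1, 3)
#elif defined(__SUNPRO_C) || defined(__SUNPRO_CC)
# include <sun_prefetch.h>
# define UNIV_PREFETCH_R(addr)  sparc_prefetch_read_many((void*) (addr))
# define UNIV_PREFETCH_RW(addr) sparc_prefetch_write_many((void*) (addr))
#else
# define UNIV_PREFETCH_R(addr)  ((void) 0)
# define UNIV_PREFETCH_RW(addr) ((void) 0)
#endif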

4) Locks in MySQL
=================
We then used plockstat to locate contended mutex locks in the system. Surprisingly, memory management in libc accounted for a lot of the lock contention. Since the default malloc/free is not optimized for threaded applications, we switched to mtmalloc. mtmalloc can be used without recompiling or relinking; we simply set the LD_PRELOAD environment variable in the shell that was used to start the MySQL server to interpose the malloc/free calls.

> setenv LD_PRELOAD /usr/lib/libmtmalloc.so

The gain was 8.1% to 1376 transactions per second.
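
For reference, a typical plockstat invocation for this kind of analysis looks like the line below (this is an illustrative command, not necessarily the exact options we used; -A watches both contention and hold events, -e sets how many seconds to sample):

> plockstat -A -e 30 -p `pgrep -x mysqld`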

5) Caching Memory Inside MySQL
==============================
After we switched to mtmalloc, we still found memory allocation and free patterns that were not efficient. We modified the code so that memory is cached inside MySQL instead of being repeatedly allocated and freed. The idea is that we trade memory usage for performance, but since most allocator implementations cache memory freed by the application instead of returning it to the operating system, caching the memory inside MySQL not only speeds up the code, it also has little impact on memory usage.

Using DTrace, we found that there were over 20 places where malloc and free were called repeatedly. We picked one of the hot spots and modified the code.

The change above gave us a gain of 1.5% to 1396 transactions per second.
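
As a rough sketch of the pattern (not the actual MySQL patch; the names and sizes here are made up for illustration), a hot code path can keep a small stack of freed fixed-size buffers and reuse them:

#include <stdlib.h>

/* Sketch only: reuse recently freed buffers instead of calling
   malloc()/free() over and over.  A real implementation would make
   the cache per-thread or protect it with a lock.                   */
#define CACHE_SLOTS 8
#define BUF_SIZE    512

struct buf_cache
{
  void *slot[CACHE_SLOTS];
  int   cached;
};

static void *get_buffer(struct buf_cache *c)
{
  if (c->cached > 0)
    return c->slot[--c->cached];   /* reuse a cached buffer */
  return malloc(BUF_SIZE);         /* fall back to malloc   */
}

static void release_buffer(struct buf_cache *c, void *buf)
{
  if (c->cached < CACHE_SLOTS)
    c->slot[c->cached++] = buf;    /* keep it for next time */
  else
    free(buf);
}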

6) Using Largepages
===================
Using largepages on the UltraSPARC T1 platform can be beneficial to performance, as the TLBs in the T1 processor are shared by the 32 hardware threads.

We used the environment variable MPSSHEAP to tell the operating system that we wanted to use largepages for the memory heap:

> setenv LD_PRELOAD mpss.so.1
> setenv MPSSHEAP 4M

This change gave us a gain of 4.2% in throughput to 1455 transactions per second.

7) Removing strdup() calls
==========================
Later on, we also found that there was an unnecessary strdup/free pattern in the code in mf_cache.c. Since the character string was not modified in the code, we removed the strdup call and simply passed the pointer to the string instead.

This change gave us a gain of 0.34% to 1460 transactions per second.
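
An illustrative before/after sketch of the kind of change (the real code is in mf_cache.c; register_name() here is just a stand-in for whatever the code does with the string, which only reads it):

#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the code that consumes the string. */
static void register_name(const char *name) { (void) name; }

/* Before: an unnecessary copy of a string that is never modified. */
static void open_cached_file_before(const char *name)
{
  char *copy= strdup(name);   /* extra malloc         */
  register_name(copy);        /* callee only reads it */
  free(copy);                 /* extra free           */
}

/* After: simply pass the pointer through. */
static void open_cached_file_after(const char *name)
{
  register_name(name);
}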

8) Feedback Profile and Link Time Optimizations
===============================================
We then compiled the MySQL server with profile feedback and link-time optimizations. We trained MySQL in a training run and then recompiled, so that the compiler could use the information (execution behavior) collected during the training run. The compiler flags used were: -fast -xipo -xprofile -xlinkopt.

The combination of the compiler flags gave us a gain of 10.5% to 1614 transactions per second.
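
Roughly, such a profile-feedback build with Sun Studio is a two-pass affair along these lines (the profile directory and the bare configure/make steps are illustrative, not the exact commands we used):

> setenv CFLAGS "-fast -xipo -xprofile=collect:/tmp/mysql.profile"
> setenv CXXFLAGS "-fast -xipo -xprofile=collect:/tmp/mysql.profile"
> ./configure; make
(run the sysbench training workload against this instrumented server)
> setenv CFLAGS "-fast -xipo -xlinkopt -xprofile=use:/tmp/mysql.profile"
> setenv CXXFLAGS "-fast -xipo -xlinkopt -xprofile=use:/tmp/mysql.profile"
> make clean; ./configure; make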

9) Configuration File Tuning
============================
While tuning values in the configuration file is the most common way to get higher performance from MySQL, we did not spend a lot of time on it. The reason is that we were more interested in finding the bottlenecks in the code. Nevertheless, we did use a few settings:

> cat my.cnf
[server]
query_cache_size=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=100M
innodb_additional_mem_pool_size=20M

And the final throughput was 1649 transactions per second.

10) Things That Did Not Work as Expected
========================================
We also tried to use atomic instructions and ISM (Intimate Shared Memory), but neither of them gave us a performance improvement.


Conclusion (for now)
====================
This was the initial work done to optimize MySQL on the Sun CoolThreads platform, and we got 36% better throughput than the default installation. As MySQL is now part of Sun, I expect that working with Sun engineers will allow MySQL to reach even better performance and throughput.

Currently, caching memory inside MySQL looks promising. We got 1.5% improvement by only modifying one place inside MySQL. Since there are quite a few places that we could apply this optimization, there is still room for further performance improvement!

Finally, I should mention that some of the optimizations above also improved MySQL on x64 Linux and Solaris. I will update everyone here in the near future. :-)

Performance Guide for MySQL Cluster@MySQL Users Conference

A new MySQL Users Conference is coming up again. MySQL was recently acquired by Sun Microsystems, and thus innovation will happen at an even faster rate than before. The Users Conference will contain a lot of interesting presentations on how to develop your MySQL applications. So come to Santa Clara 15-17 April to take part in the development and discuss with many MySQLers how MySQL will be developed over the next few years. I've prepared a set of blogs that I will publish over the next few days to give you an idea of what's cooking within MySQL, and I hope some of these blogs can persuade you to come there and give your opinion on where the future development should be heading.

Personally I'll contribute a talk at the MySQL Users Conference on what to think about when building a high-performance application based on MySQL Cluster. MySQL Cluster technology has matured over the last few years and is being used in more and more application categories. I even visited a conference on Family History Technology at BYU where I bumped into Matt Garner from FindMyPast, who told me about how they had used MySQL Cluster for their data mining application and sustained a continuous flow of 75,000 queries per second.

In my talk I'm planning to cover how partitioning your application data can improve performance, how the use of cluster interconnects can improve response time by as much as a factor of 8, when to use the native NDB APIs and when to use SQL, and how to use some new features recently developed.

For a few years MySQL Cluster development was very focused on a feature set for the telecom space; over the last year development has started focusing more on general features, to ensure we also get improved performance on complex SQL queries. We are also developing improved support for computers with a high number of cores and execution threads (e.g. the Niagara processor from Sun), along with a number of other performance improvements.

The talk will be very much focused on how you as an application developer can make use of the enormous performance capabilities that MySQL Cluster provides you with. I also hope to be able to present some impressive benchmark numbers using a large cluster that Intel has made available for our use.