Testing MySQL NDB Cluster with dbdeployer
A great way to install MySQL when you need to do quick tests is to use a sandbox tool. This allows you to perform all the installation steps with a single command, which makes the whole process very simple and also makes it easy to automate the test. Giuseppe Maxia (also known as the Data Charmer, @datacharmer on Twitter) has for many years maintained sandbox tools for MySQL, first with MySQL Sandbox and now with dbdeployer.
One of the most recent features of dbdeployer is the support for MySQL NDB Cluster. In this blog, I will take this feature and test it. First, I will briefly discuss what MySQL NDB Cluster is, then install dbdeployer, and finally set up a test cluster.
What is MySQL NDB Cluster?
MySQL NDB Cluster is primarily an in-memory database (though it also supports on-disk data) that has been designed from day one to be highly available and to provide consistent response times. A cluster consists of several nodes, each of which is one of three types:
- Data Nodes: This is where the actual data is stored. Currently there is support for up to 48 data nodes in a cluster with up to 1TiB of data memory for each node.
- API Nodes: These are the nodes where queries are executed. An API node can be a normal mysqld process (also known as an SQL node), or it can be a NoSQL node using the C++ (the native NDB API), Java (ClusterJ), memcached, or Node.js API.
- Management Nodes: These nodes hold the configuration of the cluster, and one of the management nodes is the most common choice as the arbitrator in case it is necessary to decide between two halves of data nodes to avoid a split-brain scenario.
You will typically have at least two data nodes in a cluster with two copies (replicas) of the data. This allows one data node to be offline while the cluster stays online. More data nodes can be added to increase the capacity or to add more data partitions. The data partitioning (sharding) and the replicas are all handled automatically, including when querying the data.
All of this means that you will end up with quite a few nodes. In a production cluster, you need at least two of each node type to have high availability. Even though you may not need high availability for your testing, you will still need at least two data nodes, one management node, and one SQL node. Being able to automate the installation of the cluster is a great help when you need to do a quick test – which brings us to dbdeployer. The first step is to install it.
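To give an idea of what a manual setup involves (and what dbdeployer automates for you later in this blog), here is a minimal sketch of the config.ini that the management node reads for a small cluster with one management node, two data nodes, and one SQL node. The host names and node ids are placeholders for illustration only:
[ndbd default]
# Two replicas of the data, so one data node per node group can be offline.
NoOfReplicas = 2

[ndb_mgmd]
# Management node. Ids 1-48 are best reserved for data nodes.
NodeId = 49
HostName = mgm1.example.com

[ndbd]
NodeId = 1
HostName = data1.example.com

[ndbd]
NodeId = 2
HostName = data2.example.com

[mysqld]
# SQL (API) node.
NodeId = 51
HostName = sql1.example.com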
Installing dbdeployer
It is simple to install dbdeployer. There are releases on dbdeployer's GitHub page that can be downloaded and easily installed. For this blog, I am using release 1.24.0 on Linux. I recommend that you use the latest release. In addition to Linux, dbdeployer is also available for macOS. Unfortunately, there is no Microsoft Windows support.
An example of downloading and installing dbdeployer is:
shell$ mkdir Downloads
shell$ cd Downloads/
shell$ wget https://github.com/datacharmer/dbdeployer/releases/download/v1.24.0/dbdeployer-1.24.0.linux.tar.gz
...
HTTP request sent, awaiting response... 200 OK
Length: 4888282 (4.7M) [application/octet-stream]
Saving to: ‘dbdeployer-1.24.0.linux.tar.gz’
100%[================================>] 4,888,282 1.70MB/s in 2.8s
2019-03-25 17:48:54 (1.70 MB/s) - ‘dbdeployer-1.24.0.linux.tar.gz’ saved [4888282/4888282]
shell$ tar -zxf dbdeployer-1.24.0.linux.tar.gz
shell$ mkdir ~/bin
shell$ mv dbdeployer-1.24.0.linux ~/bin/dbdeployer
shell$ export PATH=${PATH}:~/bin
This downloads and unpacks the 1.24.0 release into the Downloads directory. Then the dbdeployer binary is moved to the ~/bin directory and renamed to dbdeployer. Finally, the ~/bin directory is added to the path searched when executing a command, so it is not necessary to specify the full path each time dbdeployer is executed. There are other ways to perform these steps and other options for where to install it; see also the official documentation.
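If you want to verify the installation before continuing, you can ask dbdeployer for its version (a quick sanity check; the output simply shows the release you downloaded):
shell$ dbdeployer --version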
That is it. Now it is possible to install a test cluster.
Installing a Test Cluster
Since dbdeployer works on a single host, all of the nodes will be installed on the same host. While this is bad for a production cluster, it is perfectly fine for most test clusters.
Warning
While a single host cluster is great for most tests, for testing your application before a deployment to production, it is recommended to use a multi-host cluster that is as similar to your production cluster as possible.
The first step is to download MySQL NDB Cluster as a tarball. You can get the latest patch release of each version from https://dev.mysql.com/downloads/cluster/. If you need to test with an older release, you can get that from https://downloads.mysql.com/archives/cluster/. In this example, MySQL NDB Cluster 7.6.9 is downloaded from the latest releases and placed in the ~/Downloads directory:
shell$ cd ~/Downloads/
shell$ wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-7.6/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz
...
HTTP request sent, awaiting response... 200 OK
Length: 914236523 (872M) [application/x-tar-gz]
Saving to: ‘mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz’
100%[================================>] 914,236,523 699KB/s in 23m 52s
2019-03-25 18:49:29 (624 KB/s) - ‘mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz’ saved [914236523/914236523]
Once the download has completed, use the unpack command of dbdeployer to unpack the downloaded file:
shell$ dbdeployer unpack --prefix=ndb ~/Downloads/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz
Unpacking tarball /home/dbdeployer/Downloads/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64.tar.gz to $HOME/opt/mysql/ndb7.6.9
.........100.........200.........300.........400.........500........
...
.........20300.........20400.........20500.........20600.........2070020704
Renaming directory /home/dbdeployer/opt/mysql/mysql-cluster-gpl-7.6.9-linux-glibc2.12-x86_64 to /home/dbdeployer/opt/mysql/ndb7.6.9
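At this point, you can ask dbdeployer which binaries it knows about; the list should now include the unpacked ndb7.6.9 directory (the exact output depends on what else you have unpacked):
shell$ dbdeployer versions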
You are now ready for the actual creation of the test cluster. This is done using the deploy command:
shell$ dbdeployer deploy replication ndb7.6.9 --topology=ndb --concurrent
$HOME/sandboxes/ndb_msb_ndb7_6_9/initialize_nodes
MySQL Cluster Management Server mysql-5.7.25 ndb-7.6.9
2019-03-27 17:22:16 [ndbd] INFO -- Angel connected to 'localhost:20900'
2019-03-27 17:22:16 [ndbd] INFO -- Angel allocated nodeid: 2
2019-03-27 17:22:17 [ndbd] INFO -- Angel connected to 'localhost:20900'
2019-03-27 17:22:17 [ndbd] INFO -- Angel allocated nodeid: 3
executing 'start' on node 1
............ sandbox server started
executing 'start' on node 2
.... sandbox server started
executing 'start' on node 3
.... sandbox server started
NDB cluster directory installed in $HOME/sandboxes/ndb_msb_ndb7_6_9
run 'dbdeployer usage multiple' for basic instructions'
This creates a cluster with two data nodes, one management node, and three SQL nodes. The nodes have been installed in the ${HOME}/sandboxes/ndb_msb_ndb7_6_9/ directory:
shell$ ls sandboxes/ndb_msb_ndb7_6_9/
check_nodes ndb_conf node3 test_replication
clear_all ndb_mgm restart_all test_sb_all
cluster_initialized ndbnode1 sbdescription.json use_all
initialize_nodes ndbnode2 send_kill_all use_all_masters
n1 ndbnode3 start_all use_all_slaves
n2 node1 status_all
n3 node2 stop_all
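Several of these scripts work the same way as in other dbdeployer topologies. The status_all script gives a quick overview of whether the sandbox's MySQL nodes are running (I have omitted the output here, as it depends on the state of your cluster):
shell$ ./sandboxes/ndb_msb_ndb7_6_9/status_all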
Notice how there is, for example, an ndb_mgm script. This is a wrapper script around the ndb_mgm binary in the MySQL installation – the MySQL NDB Cluster management client. This makes it easy to connect to the management node, for example to check the status of the cluster:
shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "SHOW"
Connected to Management Server at: localhost:20900
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9, Nodegroup: 0, *)
id=3 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9, Nodegroup: 0)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9)
[mysqld(API)] 4 node(s)
id=4 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9)
id=5 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9)
id=6 @127.0.0.1 (mysql-5.7.25 ndb-7.6.9)
id=7 (not connected, accepting connect from localhost)
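The same wrapper can be used for other management client commands. For example, to get a quick report of the data nodes' memory usage (a handy check when testing; the numbers reported will of course depend on your data):
shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "ALL REPORT MEMORYUSAGE"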
Before wrapping up, let's see how you can connect to the different SQL nodes and see how they indeed query the same data.
Testing the Cluster
As a simple test, connect to the first SQL node and create a table. Then, connect to the second SQL node and insert a row. Finally, connect to the third SQL node and query the data.
The SQL nodes are in the node* directories in ${HOME}/sandboxes/ndb_msb_ndb7_6_9/. Each of those works in the same way as a standalone MySQL Server sandbox, so you can use the use wrapper script to connect with the MySQL command-line client:
shell$ ./sandboxes/ndb_msb_ndb7_6_9/node1/use
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 5.7.25-ndb-7.6.9-cluster-gpl-log MySQL Cluster Community Server (GPL)
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
node1 [localhost:27510] {msandbox} ((none)) >
Now, the table can be created (output has been reformatted):
node1 [localhost:27510] {msandbox} ((none)) > \R
Returning to default PROMPT of mysql>
mysql> CREATE SCHEMA db1;
Query OK, 1 row affected (0.42 sec)
mysql> CREATE TABLE db1.t1 (
id int unsigned NOT NULL auto_increment PRIMARY KEY,
val varchar(36)
) ENGINE=NDBCluster;
Query OK, 0 rows affected (2.64 sec)
I changed the prompt back to the default mysql> prompt. This is not because I don't like the prompt created by dbdeployer, but simply to make the formatting of the queries nicer. In general, I do prefer my prompt to tell me where I am connected, so otherwise the normal dbdeployer prompt will work well for me.
The table creation is just like normal, except that the engine is set to NDBCluster. This is the engine name that tells MySQL to create the table in the data nodes.
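If you want to confirm which engine a table ended up with, you can query the Information Schema from any of the SQL nodes; for NDB tables the ENGINE column shows ndbcluster:
mysql> SELECT TABLE_NAME, ENGINE
         FROM information_schema.TABLES
        WHERE TABLE_SCHEMA = 'db1';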
The second step is to connect to the second instance and insert a row:
node2 [localhost:27511] {msandbox} ((none)) > INSERT INTO db1.t1 (val) VALUES (UUID());
Query OK, 1 row affected (0.11 sec)
node2 [localhost:27511] {msandbox} ((none)) > SELECT * FROM db1.t1;
+----+--------------------------------------+
| id | val |
+----+--------------------------------------+
| 1 | 84f59369-5051-11e9-9078-08002709eea3 |
+----+--------------------------------------+
1 row in set (0.05 sec)
Notice how this worked without creating the table. Since the table was created in the data nodes, all SQL nodes that connect to these data nodes will automatically know about the table.
Finally, confirm the data is also available in the third node:
node3 [localhost:27512] {msandbox} ((none)) > SELECT * FROM db1.t1;
+----+--------------------------------------+
| id | val |
+----+--------------------------------------+
| 1 | 84f59369-5051-11e9-9078-08002709eea3 |
+----+--------------------------------------+
1 row in set (0.12 sec)
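As an extra check, you can also see how the rows are distributed across the table's partitions with the ndb_desc utility that ships with the cluster binaries. A sketch, assuming the binaries unpacked earlier into ~/opt/mysql/ndb7.6.9 and the management node port 20900 shown in the output above:
shell$ ~/opt/mysql/ndb7.6.9/bin/ndb_desc -c localhost:20900 -d db1 t1 -p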
Verdict
It is fantastic that dbdeployer now supports MySQL NDB Cluster as well. It will be a great help when performing tests. I do have some comments based on my testing. It is very likely that some of those are simply due to the fact that this is my initial use of dbdeployer, and I will not claim that I understand all the details of how it works yet, so do not take the following comments as the final word – nor are the comments meant as negative criticism:
- I find it a little confusing that a cluster is considered a replication topology. Yes, there is synchronous replication between the data nodes, but it is not related to the replication you have between two MySQL Server instances (which is also supported between two clusters). Personally, I would have treated a single cluster as a single sandbox, and then allowed for a (future) feature setting up two clusters with replication between them.
- The restart_all sandbox command literally shuts down the whole cluster, then starts it again (but see also two items later). For MySQL NDB Cluster there are essentially two different types of restarts (each of which can be either a normal or an initial restart):
  - System Restart: All data nodes (at least) are shut down together, then started together. This is what restart_all implements.
  - Rolling Restart: The cluster as a whole remains online throughout the restart phase. This is done by always leaving one data node in each node group online while restarting the data nodes. SQL nodes are restarted such that at least one SQL node is online at all times. This is the normal way to do most configuration changes as it avoids downtime (see the sketch after this list). I miss this restart type.
- There does not seem to be any way to choose between normal and initial restarts.
- The start_all script does not start the management and data nodes (only the SQL nodes are started). This may be on purpose, but it seems inconsistent with stop_all, which does shut down the management and data nodes. Actually, I have not been able to find a way to start the cluster cleanly. There is initialize_nodes, which will start the management and data nodes, but the script will also try to start the SQL nodes and load grants into the SQL nodes.
- The stop_all script first shuts down the management and data nodes, then the SQL nodes. It is better to do it in the opposite order, as this avoids errors on the SQL nodes if queries are executed during the shutdown. In older versions of MySQL NDB Cluster, it could also take a long time to shut down an SQL node that had lost the connection to the data nodes.
- The management node is given NodeId = 1 and the data nodes the subsequent ids. Data nodes can only have ids 1-48, so I always recommend reserving these ids for data nodes, making the first management node have NodeId = 49 and the SQL nodes later ids.
- There does not seem to be any way to change the number of management nodes. The --ndb-nodes option appears to be taken as one management node, with the rest as data nodes. Maybe a better way would be to have two options like:
  - --ndb-nodegroups: The number of node groups in the cluster. The number of data nodes can then be calculated as <# Node Groups> * NoOfReplicas.
  - --ndb-mgmnodes: The number of management nodes.
- There is no check whether the number of NDB nodes is valid. For example, with --ndb-nodes=4, dbdeployer tries to create a cluster with three data nodes, which is not valid with NoOfReplicas = 2.
. - I did not find any way to specify my preferred configuration of the cluster as part of the sandbox deployment.
- Consider adding the --reload option when starting ndb_mgmd (the management node). This will make the management node check whether there are any changes to the cluster configuration (stored in <path to sandbox>/ndb_conf/config.ini) and, if so, apply those changes.
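For reference, the rolling restart mentioned in the list above can be done manually with the management client by restarting one data node at a time and waiting for it to rejoin before restarting the next. A sketch, using the sandbox's ndb_mgm wrapper and the data node ids from the SHOW output earlier:
shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "2 RESTART"
shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "SHOW"    # wait until node 2 reports as started again
shell$ ./sandboxes/ndb_msb_ndb7_6_9/ndb_mgm -e "3 RESTART"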
This may seem like a long list of comments, but I am also very well aware that support for MySQL NDB Cluster has only just been added, and that it takes time to implement all the details. Rome was not built in one day.
So, I would very much like to conclude with a big thank you to the Data Charmer. This is a great initial implementation.
Jesper,
Many thanks for the thorough review!
Please consider the following three points to understand possible improvements:
1. I am not an expert on NDB cluster. I am not even a regular user. I barely know how to set it up. What I did is based on the requests and limited feedback I received from willing NDB users.
2. The naming and the nodes used are constraints that come from dbdeployer infrastructure, and the need to use the same commands to deal with very different entities, such as master/slave replication, group of independent MySQL servers, multi-source replication (fan-in, all-masters), group replication, Percona Xtradb cluster. Thus, NDB support should as much as possible fit into the existing paradigms. For example, the current script organization allows you to run “test_replication” in every topology, and get the expected result. For NDB, every SQL node creates a table with engine ndbcluster, and each node retrieves the tables from all nodes.
3. dbdeployer is built for configurability. There is a publicly viewable template for every script that you find in the sandbox. You can export the templates, edit them, re-import them, and eventually create your own improvements. (https://github.com/datacharmer/dbdeployer#Sandbox-customization)
The number 3 above is the key. By understanding how the system works, interested users can propose enhancements. Using this framework, users of TiDB (of which I know even less than NDB clusters) were able to propose features and submit code enhancements. I expect a similar interaction with NDB users to improve the tool.
Looking forward to seeing the wanted improvements converted into dbdeployer issues and eventually pull requests!
Hi Giuseppe,
Thanks for the comments.
Yes, I am fully aware that it is not easy to extend a framework to cover a completely new topology, so I understand that it is necessary to make compromises.
I will take a look at the link to the customization. Thanks for providing that. As mentioned, I have to admit that I am very new to dbdeployer.
Thanks,
Jesper
Hi,
I just had a doubt. I am a complete beginner btw.
But I am not getting all the nodes to connect simultaneously.
Can you help me out and tell me how to make them work?
Hi Antonio,
Can I get you to clarify what you mean when you say the nodes do not connect simultaneously? And which nodes (SQL, data, or management)?
You can start by taking a look at the output of the ndb_mgm -e “SHOW” command. Note that in that output it is common to have unconnected nodes, so that you have some spare nodes defined if you need them later.
You can also look at the error logs and see if anything stands out.
Best regards,
Jesper