MySQL NDB Cluster 7.6: Fast Import with ndb_import

A common problem when you need to set up a new database instance is to import the initial data. This can particularly be an issue, if you are converting an existing database so there is a large amount of data to import. In MySQL NDB Cluster 7.6 which just was announced as general available (GA) this week, there is a new tool to help with such imports: ndb_import.

The ndb_import utility takes a CSV file and imports the data into a table. (While C stands for comma, the tool is not limited to comma-separated files, in fact tabs is the default delimiter.) But hey, does that not sound familiar? It sure does. MySQL Server has “for ever” included the mysqlimport utility and LOAD DATA statement. So why do we need another utility and what does ndb_import do that we cannot already do with the existing tools?

The big advantage of ndb_import is that it is a native NDB tool. It does not connect to any of the SQL nodes; instead it connects directly to the data nodes as an NDB API program. This alone allows you to bypass the overhead of connecting to the SQL node and parse SQL statements. However, it is even better than that. The ndb_import tool can also take advantage of the parallelism offered by MySQL NDB Cluster. This means that in the end, the overall performance is expected to be quite a bit better than loading the data using mysqlimport or LOAD DATA.

Note

The speed-up will depend on your setup. A cluster with all nodes in a VM on a laptop may not experience any significant speed-up compared to other methods of importing data.

So, how does this work? Let us look at a simple example. We will import the following data from the data.txt file:

# ID    First   Surname Birthday
1	John	Doe	1945-10-15
2	Jane	Doe	1947-08-11
3	Mary	Smith	1974-12-04
4	Mike	Jones	1980-11-12

The data includes the first name, surname, and birthday for four persons separated by \t, and it will be imported into the db1.t1 table:

CREATE SCHEMA IF NOT EXISTS db1;

CREATE TABLE db1.t1 (
  id int unsigned NOT NULL auto_increment,
  FirstName varchar(20) NOT NULL,
  Surname varchar(20) NOT NULL,
  Birthday date NOT NULL,
  PRIMARY KEY (id)
) ENGINE=NDBCluster CHARACTER SET=utf8mb4;

Information

You still need to create the table through an SQL node.

With the data and table, the data can be imported using the following ndb_import command:

shell$ ndb_import --ndb-connectstring=localhost:1186 db1 t1.txt \
                  --connections=2 --db-workers=2 \
                  --fields-terminated-by="\t" --ignore-lines=1 \
                  --lines-terminated-by="\n"

The arguments in the first line of the command tells how to connect to the management node, the schema name, and the name of the file to import data from. Like for mysqlimport, the table name is derived from the file name by using the basename, t1 in this case, as the table name.

The --connections and --db-workers options defines the parallelism of the job. The --connections option is equivalent to the ndb-cluster-connection-pool option that can be set for SQL nodes and defines how many connections the ndb_import process creates to the data nodes. There must be at least as many free API node slots available as the number of connections requested. Otherwise the command will hang while waiting for enough connections to become available and eventually fail. The --db-workers option specifies the number of threads ndb_import will use per data node.

The remaining options are equivalent to the options available for mysqlimport, LOAD DATA, and SELECT ... INTO OUTFILE. They specify the format of the data inside the source file and how many rows to skip – one in this case due to the header.

If you need to load data into several tables in the same schema, you can specify a file name for each table, and they will all be processed.

There are many more options available. I will recommend you to check out the reference manual page for ndb_import for more information.

The output of the command is:

job-1 import db1.t1 from t1.txt
job-1 [running] import db1.t1 from t1.txt
job-1 [success] import db1.t1 from t1.txt
job-1 imported 4 rows in 0h0m0s at 9 rows/s
jobs summary: defined: 1 run: 1 with success: 1 with failure: 0

This includes some metadata information about the job such as the table and file names as well as the total number of rows imported, how long time it took, and the number of rows imported.

What happens, if something goes wrong? If we thought the data was comma-separated and had used --fields-terminated-by=",", an error would have occurred:

shell$ ndb_import --ndb-connectstring=localhost:1186 db1 t1.txt \
                  --connections=2 --db-workers=2 \
                  --fields-terminated-by="," \
                  --ignore-lines=1 --lines-terminated-by="\n"
job-1 import db1.t1 from t1.txt
job-1 [running] import db1.t1 from t1.txt
job-1 [error] import db1.t1 from t1.txt
job-1 imported 0 rows in 0h0m0s at 0 rows/s
job-1 rejected 1 rows (limit 0), see t1.rej
job-1 error[gen-0]: error in teams: 1-csv-input (source:740)
job-1 1-csv-input error[data-0]: reject limit 0 exceeded (source:534)
jobs summary: defined: 1 run: 1 with success: 0 with failure: 1

Notice here how the fifth line of the output says “job-1 rejected 1 rows (limit 0), see t1.rej”. Each input file is considered a job, so “job-1” refers to the first (and in this case only) input file. It rejected 1 row, and the limit refers to the –rejects option which specifies how many permanent errors must be encountered before failing the whole job. The default is to fail on the first error. Finally, “see t1.rej” tells us to look at the t1.rej file (created by ndb_import) for more information about the failed job:

shell$ cat t1.rej
runno   rowid   linenr  startpos        endpos  bytes   errortype       errorcode       sourceline      errortext       reject
0       0       2       32      54      22      data    0       1130    line 2: too few fields (1 < 4)  1\tJohn\tDoe\t1945-10-15\n

The t1.rej file is a tab-delimited file with information for each failed row. In this case, the error is that there are too few values: one value was found (because there are no commas in the data and we specified the wrong delimiter) but four values were expected. If --rejects had been set to 3, all four rows would have been included in t1.rej.

The ndb_import binary is included with the NDB Cluster installation and can be found in the bin directory. The next step is for you to try it out. Good luck importing data into MySQL NDB Cluster 7.6.

MySQL NDB Cluster 7.6: Fast Import with ndb_import

Related