MySQL NDB Cluster 7.6: Fast Import with ndb_import
A common problem when you need to set up a new database instance is to import the initial data. This can particularly be an issue, if you are converting an existing database so there is a large amount of data to import. In MySQL NDB Cluster 7.6 which just was announced as general available (GA) this week, there is a new tool to help with such imports:
The ndb_import utility takes a CSV file and imports the data into a table. (While C stands for comma, the tool is not limited to comma-separated files, in fact tabs is the default delimiter.) But hey, does that not sound familiar? It sure does. MySQL Server has “for ever” included the
mysqlimport utility and
LOAD DATA statement. So why do we need another utility and what does
ndb_import do that we cannot already do with the existing tools?
The big advantage of
ndb_import is that it is a native NDB tool. It does not connect to any of the SQL nodes; instead it connects directly to the data nodes as an NDB API program. This alone allows you to bypass the overhead of connecting to the SQL node and parse SQL statements. However, it is even better than that. The
ndb_import tool can also take advantage of the parallelism offered by MySQL NDB Cluster. This means that in the end, the overall performance is expected to be quite a bit better than loading the data using
So, how does this work? Let us look at a simple example. We will import the following data from the
# ID First Surname Birthday 1 John Doe 1945-10-15 2 Jane Doe 1947-08-11 3 Mary Smith 1974-12-04 4 Mike Jones 1980-11-12
The data includes the first name, surname, and birthday for four persons separated by \t, and it will be imported into the
CREATE SCHEMA IF NOT EXISTS db1; CREATE TABLE db1.t1 ( id int unsigned NOT NULL auto_increment, FirstName varchar(20) NOT NULL, Surname varchar(20) NOT NULL, Birthday date NOT NULL, PRIMARY KEY (id) ) ENGINE=NDBCluster CHARACTER SET=utf8mb4;
With the data and table, the data can be imported using the following
shell$ ndb_import --ndb-connectstring=localhost:1186 db1 t1.txt \ --connections=2 --db-workers=2 \ --fields-terminated-by="\t" --ignore-lines=1 \ --lines-terminated-by="\n"
The arguments in the first line of the command tells how to connect to the management node, the schema name, and the name of the file to import data from. Like for
mysqlimport, the table name is derived from the file name by using the basename, t1 in this case, as the table name.
--db-workers options defines the parallelism of the job. The
--connections option is equivalent to the
ndb-cluster-connection-pool option that can be set for SQL nodes and defines how many connections the
ndb_import process creates to the data nodes. There must be at least as many free API node slots available as the number of connections requested. Otherwise the command will hang while waiting for enough connections to become available and eventually fail. The
--db-workers option specifies the number of threads
ndb_import will use per data node.
The remaining options are equivalent to the options available for
LOAD DATA, and
SELECT ... INTO OUTFILE. They specify the format of the data inside the source file and how many rows to skip – one in this case due to the header.
If you need to load data into several tables in the same schema, you can specify a file name for each table, and they will all be processed.
There are many more options available. I will recommend you to check out the reference manual page for ndb_import for more information.
The output of the command is:
job-1 import db1.t1 from t1.txt job-1 [running] import db1.t1 from t1.txt job-1 [success] import db1.t1 from t1.txt job-1 imported 4 rows in 0h0m0s at 9 rows/s jobs summary: defined: 1 run: 1 with success: 1 with failure: 0
This includes some metadata information about the job such as the table and file names as well as the total number of rows imported, how long time it took, and the number of rows imported.
What happens, if something goes wrong? If we thought the data was comma-separated and had used
--fields-terminated-by=",", an error would have occurred:
shell$ ndb_import --ndb-connectstring=localhost:1186 db1 t1.txt \ --connections=2 --db-workers=2 \ --fields-terminated-by="," \ --ignore-lines=1 --lines-terminated-by="\n" job-1 import db1.t1 from t1.txt job-1 [running] import db1.t1 from t1.txt job-1 [error] import db1.t1 from t1.txt job-1 imported 0 rows in 0h0m0s at 0 rows/s job-1 rejected 1 rows (limit 0), see t1.rej job-1 error[gen-0]: error in teams: 1-csv-input (source:740) job-1 1-csv-input error[data-0]: reject limit 0 exceeded (source:534) jobs summary: defined: 1 run: 1 with success: 0 with failure: 1
Notice here how the fifth line of the output says “job-1 rejected 1 rows (limit 0), see t1.rej”. Each input file is considered a job, so “job-1” refers to the first (and in this case only) input file. It rejected 1 row, and the limit refers to the –rejects option which specifies how many permanent errors must be encountered before failing the whole job. The default is to fail on the first error. Finally, “see t1.rej” tells us to look at the t1.rej file (created by
ndb_import) for more information about the failed job:
shell$ cat t1.rej runno rowid linenr startpos endpos bytes errortype errorcode sourceline errortext reject 0 0 2 32 54 22 data 0 1130 line 2: too few fields (1 < 4) 1\tJohn\tDoe\t1945-10-15\n
t1.rej file is a tab-delimited file with information for each failed row. In this case, the error is that there are too few values: one value was found (because there are no commas in the data and we specified the wrong delimiter) but four values were expected. If
--rejects had been set to 3, all four rows would have been included in
ndb_import binary is included with the NDB Cluster installation and can be found in the bin directory. The next step is for you to try it out. Good luck importing data into MySQL NDB Cluster 7.6.