Mark Haney wrote:
On 12/01/2017 02:32 PM, hw wrote:
Hm. Foo is a program that imports data into a database from two CVS files, using a connection for each file and forking to import both files at once.
So this would mean that the database (running on a different server) takes almost two times as much as foo --- which I would consider kinda excruciatingly long because it´s merely inserting rows into two different tables after they were prepared by foo and then processes some queries to convert the data.
The queries after importing may take like 3 or 5 minutes. About 4.5 million rows are being imported.
Would you consider about 20 minutes for importing as long?
There are far too many variables you've not mentioned to determine if that's good or bad (or very bad). Is the connection a local connection (ie the import is done on the DB server) or a network connection?
Foo is running on a different machine than the database server.
What size are the CSV (CVS is a typo, correct?) files? 4.5M rows tells us nothing about how much data each row has. It could be 4.5M rows of one INT field or 4.5M rows of a hundred fields.
One CSV is 70745427, the other one is 536302424 bytes (68M and 512M). That´s 18 and 23 fields or so to insert for each row.
I'm a bit confused by the last two sentences. Based on how I read this:
- Foo is prepping (creating?) the tables
- Processes queries to convert the data (to CSV?)
- Runs more queries on those tables.
Or it could be:
- Foo preps the tables
... deletes the part of the rows that was imported the last time. The rows from last time are being imported again, plus new rows.
Only importing new rows would require checking every row that is being imported to figure out if it´s already there, which may not be so much faster as to be worthwhile, and since I usually don´t need to wait on the import to finish, it doesn´t really matter.
- Foo imports the CSV files
- Foo does post-processing of the tables.
right
It's not really clear the actual process, but I'll go on the assumption that Foo is creating the tables with the correct fields, data types, keys and hopefully indices. Then dumps the CSV files into the tables. Then does post-processing. (I've written similar scripts, so this is the most logical process to me.)
If we assume network bandwidth is fine, that still leaves far too many server variables to know if 20m is about right or not. Amount of data to import, TYPE of data, database AND server configuration, CPU, RAM, etc and DB config for tunable paramters like buffer pool, read/write I/O threads, etc.
The servers are connected by 4x1GB, using LACP.
IIRC, you posted some questions about tuning a DB server a while back, would this be data going into that server, perhaps?
right
I'd like to offer a helpful suggestion when asking for list help. It's better to provide TOO MUCH information, than too little. There's a big difference between 'my printer won't print' and 'my printer won't print because it's not feeding paper properly'.
Of course --- what gives me to think is that it takes relatively long for the database to insert the rows while foo converting them is relatively fast.
Foo is written in perl. I like to think that letting the database do as much of the work as possible is generally a better idea than doing things that the database could do in perl because the database is likely to be faster --- without overdoing either because for practical reasons, things need to be kept sufficiently simple, and unnecessary optimization is, well, unnecessary.
Now I wonder if my general assumption is false, though foo isn´t a good example to verify the assumption because it can´t really do anything else but import the rows, which takes as long as it takes, and the post processing is surprisingly fast (and brings the time that queries take which are working with the data once it has been imported down from many hours to a few minutes or to seconds and less because I optimized things).
So I don´t know ... I guess 45 minutes to import 600MB of data is reasonably fast, considering that 2.25 million rows times 40 fields yield 90 million fields, so that´s about 3333 fields/sec.