MySQL: Best way to warehouse older data

cpscdave · Sep 1, 2017

Hi everyone,
I have a MySQL DB with a couple of large tables (a couple hundred million rows). Each row has a timestamp associated with it. Since this is just historical data, and not needed we want to copy older data into another database, and then delete the copied records.

I've written a python script that accomplishes this by running a series of queries:

Code:

CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;

What is the most efficent way to run these queries?
I've tried running it so that we timestamp is the ultimate goal value, but the query takes a long time to run, and can timeout on some of the larger tables,
So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.
Or is there a third way that would be better??

Thanks,
-Dave

Greg Bernhardt · Sep 1, 2017

cpscdave said:

So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.

I'm sure there is a better way, but this is what I would do too

jack action · Sep 1, 2017

What you want to do seems to be a good candidate for partitioning - namely RANGE partitioning - with an ALTER TABLE statement. You don't need to create another table that you won't use, you basically separate your actual table into sub-tables. If you ever need to delete the older data, it will require a simple TRUNCATE PARTITION statement.

But I don't know if it will be better performance-wise compared to your method as the manual says «In effect, different portions of a table are stored as separate tables in different locations», so it looks to be the same, but mySQL might be better optimized for it and it will be a lot easier for future addition of data.

And if you still want to go your way, you should have an index on your timestamp column, if you don't have one already.

cpscdave · Sep 1, 2017

jack action said:

And if still want to go your way, you should have an index on your timestamp column, if you don't have one already.

Thanks for the feedbacks! I had thought about pre-adding the index, but been having some problems with the connection timing out. I just got it to work by running the query locally on the machine, vs from a remote computer.
Adding the index has drastically improved the performance of the script, to the point that it may not even be needed to do the warehousing.

The joys of supporting a DB schema that I didn't create. Get to find all these little foibles down the road :)

newjerseyrunner · Sep 5, 2017

Does this really need to be all that fast? It seems like this is going to be the type of thing you do only once, then have a cron perform daily maintenance to keep it right. I would absolutely not do it in python because then you are reliant on the connection. Use a .sql file and give that directly to mysql. From python, you can then call the mysql client with this file as the parameter (which will allow you to change the "someValue" parameter programatically.

Wait! I see some very scary sql there, here is the corrected version:

Code:

BEGIN;
CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;
COMMIT;

You have multiple queries in your file, and if mysql craps out halfway through, your database will be in a corrupted state so always wrap it in a transaction.

WWGD · Sep 5, 2017

Hi, hope not to derail or distract the post, with a quasi-related followup; the answer may provide info towards speeding up the OPs query .

Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ? I believe (SQL Server -wise) is more of a DDL (Data Definition ) statement, which will delete associated data objects ( triggers, integrity constraints, indexes, etc ), so, I guess, the table would not exist in the original script, while Truncate and Delete are more DML (Table will still exist, albeit without data in it)?

jack action · Sep 5, 2017

WWGD said:

Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ?

DELETE deletes rows (DML);
DROP TABLE deletes a table (DDL);
TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).

So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.

WWGD · Sep 5, 2017

jack action said:

DELETE deletes rows (DML);

DROP TABLE deletes a table (DDL);

TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).

So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.

Thanks, I have done some reading, but sources seem to contradict each other. What would then be the difference results-wise between DELETE and TRUNCATE? Don't we end up in both cases with an empty table? And I guess DROP TABLE does not allow for 'Where' clauses? Also, do you know ifMySQL and SQL Server coincide in these definitions? EDIT: And I guess we would require to drop "Child tables" before dropping Parent Tables?EDIT2: There also seems to be a difference of opinion on whether "Drops" can be rolled back.

jack action · Sep 5, 2017

WWGD said:

What would then be the difference results-wise between DELETE and TRUNCATE?

From the MySQL documentation:

Although TRUNCATE TABLE is similar to DELETE, it is classified as a DDL statement rather than a DML statement. It differs from DELETE in the following ways:

Truncate operations drop and re-create the table, which is much faster than deleting rows one by one, particularly for large tables.

Truncate operations cause an implicit commit, and so cannot be rolled back. See Section 13.3.3, “Statements That Cause an Implicit Commit”.

Truncation operations cannot be performed if the session holds an active table lock.

TRUNCATE TABLE fails for an InnoDB table or NDB table if there are any FOREIGN KEY constraints from other tables that reference the table. Foreign key constraints between columns of the same table are permitted.

Truncation operations do not return a meaningful value for the number of deleted rows. The usual result is “0 rows affected,” which should be interpreted as “no information.”

As long as the table format file tbl_name.frm is valid, the table can be re-created as an empty table with TRUNCATE TABLE, even if the data or index files have become corrupted.

Any AUTO_INCREMENT value is reset to its start value. This is true even for MyISAM and InnoDB, which normally do not reuse sequence values.

When used with partitioned tables, TRUNCATE TABLE preserves the partitioning; that is, the data and index files are dropped and re-created, while the partition definitions (.par) file is unaffected.

Note
As of MySQL 5.7.6, partition definition (.par) files are no longer created. Instead, partition definitions are stored in the internal data dictionary.

The TRUNCATE TABLE statement does not invoke ON DELETE triggers.

WWGD said:

Don't we end up in both cases with an empty table?

If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).

WWGD said:

And I guess DROP TABLE does not allow for 'Where' clauses?

All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).

WWGD said:

Also, do you know ifMySQL and SQL Server coincide in these definitions?

I'm not familiar with SQL Server.

WWGD said:

And I guess we would require to drop "Child tables" before dropping Parent Tables?

The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.

WWGD said:

There also seems to be a difference of opinion on whether "Drops" can be rolled back.

According to MySQL documentation:

Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines.

WWGD · Sep 5, 2017

jack action said:

From the MySQL documentation:

If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).

All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).

I'm not familiar with SQL Server.

The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.

According to MySQL documentation:

Thanks, Jack.

MySQL: Best way to warehouse older data

Related to MySQL: Best way to warehouse older data

1. What is MySQL and how is it used?

2. Why is it important to warehouse older data in MySQL?

3. What is the best way to warehouse older data in MySQL?

4. How can warehousing older data in MySQL benefit a business?

5. Are there any potential challenges or drawbacks to warehousing older data in MySQL?

Similar threads

Hot Threads

Recent Insights