MySQL: Best way to warehouse older data

In summary: Truncating a table also resets any AUTO_INCREMENT counter to zero.In summary, the conversation discusses the best way to efficiently run queries on a MySQL database with large tables. One solution proposed is to use partitioning, while others suggest adding an index on the timestamp column or using a SQL file instead of a Python script. The differences between DELETE, DROP, and TRUNCATE are also discussed, with TRUNCATE being the fastest option but not allowing for rollbacks. Ultimately, the best approach may vary depending on the specific database and its needs.
  • #1
cpscdave
403
120
Hi everyone,
I have a MySQL DB with a couple of large tables (a couple hundred million rows). Each row has a timestamp associated with it. Since this is just historical data, and not needed we want to copy older data into another database, and then delete the copied records.

I've written a python script that accomplishes this by running a series of queries:

Code:
CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;

What is the most efficent way to run these queries?
I've tried running it so that we timestamp is the ultimate goal value, but the query takes a long time to run, and can timeout on some of the larger tables,
So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.
Or is there a third way that would be better??

Thanks,
-Dave
 
Technology news on Phys.org
  • #2
cpscdave said:
So I've changed the script to try and do the same thing but looping through the oldest value + 1 month repeatedly till we hit the goal.
I'm sure there is a better way, but this is what I would do too :smile:
 
  • #3
What you want to do seems to be a good candidate for partitioning - namely RANGE partitioning - with an ALTER TABLE statement. You don't need to create another table that you won't use, you basically separate your actual table into sub-tables. If you ever need to delete the older data, it will require a simple TRUNCATE PARTITION statement.

But I don't know if it will be better performance-wise compared to your method as the manual says «In effect, different portions of a table are stored as separate tables in different locations», so it looks to be the same, but mySQL might be better optimized for it and it will be a lot easier for future addition of data.

And if you still want to go your way, you should have an index on your timestamp column, if you don't have one already.
 
Last edited:
  • Like
Likes stoomart
  • #4
jack action said:
And if still want to go your way, you should have an index on your timestamp column, if you don't have one already.

Thanks for the feedbacks! I had thought about pre-adding the index, but been having some problems with the connection timing out. I just got it to work by running the query locally on the machine, vs from a remote computer.
Adding the index has drastically improved the performance of the script, to the point that it may not even be needed to do the warehousing.

The joys of supporting a DB schema that I didn't create. Get to find all these little foibles down the road :)
 
  • #5
Does this really need to be all that fast? It seems like this is going to be the type of thing you do only once, then have a cron perform daily maintenance to keep it right. I would absolutely not do it in python because then you are reliant on the connection. Use a .sql file and give that directly to mysql. From python, you can then call the mysql client with this file as the parameter (which will allow you to change the "someValue" parameter programatically.

Wait! I see some very scary sql there, here is the corrected version:
Code:
BEGIN;
CREATE DATABASE IF NOT EXISTS `olddata`;
CREATE TABLE IF NOT EXISTS `oldTable`;
INSERT INTO `olddata`.oldeTable` SELECT * FROM `currData`.`currTable` WHERE timestamp < someValue;
DELETE FROM `currData`.`currTable` WHERE timestamp < someValue;
COMMIT;

You have multiple queries in your file, and if mysql craps out halfway through, your database will be in a corrupted state so always wrap it in a transaction.
 
Last edited:
  • Like
Likes stoomart, WWGD and jack action
  • #6
Hi, hope not to derail or distract the post, with a quasi-related followup; the answer may provide info towards speeding up the OPs query .

Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ? I believe (SQL Server -wise) is more of a DDL (Data Definition ) statement, which will delete associated data objects ( triggers, integrity constraints, indexes, etc ), so, I guess, the table would not exist in the original script, while Truncate and Delete are more DML (Table will still exist, albeit without data in it)?
 
  • #7
WWGD said:
Are there, in SQL Server ( or other) significant differences (resource- and otherwise? ) between Delete, Drop, Truncate ?
  • DELETE deletes rows (DML);
  • DROP TABLE deletes a table (DDL);
  • TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).
So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.
 
  • Like
Likes WWGD
  • #8
jack action said:
  • DELETE deletes rows (DML);
  • DROP TABLE deletes a table (DDL);
  • TRUNCATE TABLE is equivalent to DROP TABLE followed by CREATE TABLE to recreate the table without any data (DDL).
So TRUNCATE TABLE is much faster than DELETE all rows with large tables, but it's really not the same thing.
Thanks, I have done some reading, but sources seem to contradict each other. What would then be the difference results-wise between DELETE and TRUNCATE? Don't we end up in both cases with an empty table? And I guess DROP TABLE does not allow for 'Where' clauses? Also, do you know ifMySQL and SQL Server coincide in these definitions? EDIT: And I guess we would require to drop "Child tables" before dropping Parent Tables?EDIT2: There also seems to be a difference of opinion on whether "Drops" can be rolled back.
 
Last edited:
  • #9
WWGD said:
What would then be the difference results-wise between DELETE and TRUNCATE?
From the MySQL documentation:
Although TRUNCATE TABLE is similar to DELETE, it is classified as a DDL statement rather than a DML statement. It differs from DELETE in the following ways:
  • Truncate operations drop and re-create the table, which is much faster than deleting rows one by one, particularly for large tables.
  • Truncate operations cause an implicit commit, and so cannot be rolled back. See Section 13.3.3, “Statements That Cause an Implicit Commit”.
  • Truncation operations cannot be performed if the session holds an active table lock.
  • TRUNCATE TABLE fails for an InnoDB table or NDB table if there are any FOREIGN KEY constraints from other tables that reference the table. Foreign key constraints between columns of the same table are permitted.
  • Truncation operations do not return a meaningful value for the number of deleted rows. The usual result is “0 rows affected,” which should be interpreted as “no information.”
  • As long as the table format file tbl_name.frm is valid, the table can be re-created as an empty table with TRUNCATE TABLE, even if the data or index files have become corrupted.
  • Any AUTO_INCREMENT value is reset to its start value. This is true even for MyISAM and InnoDB, which normally do not reuse sequence values.
  • When used with partitioned tables, TRUNCATE TABLE preserves the partitioning; that is, the data and index files are dropped and re-created, while the partition definitions (.par) file is unaffected.

    Note
    As of MySQL 5.7.6, partition definition (.par) files are no longer created. Instead, partition definitions are stored in the internal data dictionary.
  • The TRUNCATE TABLE statement does not invoke ON DELETE triggers.
WWGD said:
Don't we end up in both cases with an empty table?
If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).
WWGD said:
And I guess DROP TABLE does not allow for 'Where' clauses?
All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).
WWGD said:
Also, do you know ifMySQL and SQL Server coincide in these definitions?
I'm not familiar with SQL Server.
WWGD said:
And I guess we would require to drop "Child tables" before dropping Parent Tables?
The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.
WWGD said:
There also seems to be a difference of opinion on whether "Drops" can be rolled back.
According to MySQL documentation:
Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines.
 
  • Like
Likes WWGD
  • #10
jack action said:
From the MySQL documentation:

If you specify deleting all rows, yes. (Like with «DELETE FROM table_name WHERE 1=1»).

All you can do is delete a table by specifying its name («DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name [, tbl_name] ...»).

I'm not familiar with SQL Server.

The concept of "Child" and "Parent" tables doesn't exist in MySQL, so you really need to specify their names. But I think if you have an appropriate foreign key constraint, it won't drop the table, but I'm not sure.

According to MySQL documentation:
Thanks, Jack.
 

Related to MySQL: Best way to warehouse older data

1. What is MySQL and how is it used?

MySQL is an open-source relational database management system (RDBMS) that is commonly used for storing and managing data. It is used for a variety of purposes, including web applications, data warehousing, and business intelligence.

2. Why is it important to warehouse older data in MySQL?

Warehousing older data in MySQL allows for efficient storage and retrieval of large amounts of data. It also helps to improve the performance of the database by reducing the size of active data, making it easier to query and analyze.

3. What is the best way to warehouse older data in MySQL?

The best way to warehouse older data in MySQL depends on the specific needs and requirements of the organization. Some common methods include partitioning the data, using archiving techniques, or creating separate databases for older data.

4. How can warehousing older data in MySQL benefit a business?

Warehousing older data in MySQL can provide numerous benefits for a business. It can help to reduce storage costs, improve database performance, and make it easier to analyze historical data for business insights and decision-making.

5. Are there any potential challenges or drawbacks to warehousing older data in MySQL?

Yes, there can be some challenges and drawbacks to warehousing older data in MySQL. These may include increased maintenance and management efforts, potential data redundancy, and the need for effective data aging strategies to ensure only relevant data is stored.

Similar threads

  • Programming and Computer Science
Replies
7
Views
535
Back
Top