Large Table cleanup
Posted: Thu Mar 05, 2020 12:00 pm
I want to cleanup a large postgresql table (9M records). At one time it was created as a 1-1 relation to another table (parent), but now we are using this child table a bit differently which results in me wanting to delete about 2/3 of its data. And also in the past years the data has become corrupted as well, there are records in the child table that no longer have corresponding records in its parent table so I want to cleanup this as well.
In the new setup I want only those record in the child table remaining where a specific flag in the parent table is set
Simplified Schema:
Child table: child_id: INTEGER, parent_id: INTEGER
Parent table: id: INTEGER, flag: INTEGER
I want to do this in a way that has as minimum impact as possible to the users.
I cannot find a single query that does what I want in a reasonable amount of time.
What I have come up with so far is this:
- created temp flag field in child table
- copy value from parent table
- delete all the records in child where the flag = null
- remove temp flag field
What I have found trying to run this locally as a test is that added the temp flag column takes a lot of time (still running after 1 hr) and I assume that when I run this query it places a lock on the table, preventing users from accessing the other data in this table (the records I want to delete are not used or accessed anymore by users)
Any suggestions on how to optimize this?
Does deleting the data in one go (after the tmp flag is set) result in the table being locked for as long as the query is running?
In the new setup I want only those record in the child table remaining where a specific flag in the parent table is set
Simplified Schema:
Child table: child_id: INTEGER, parent_id: INTEGER
Parent table: id: INTEGER, flag: INTEGER
I want to do this in a way that has as minimum impact as possible to the users.
I cannot find a single query that does what I want in a reasonable amount of time.
What I have come up with so far is this:
- created temp flag field in child table
- Code: Select all
alter table child add column flag integer default null;
- copy value from parent table
- Code: Select all
update child as c
set flag = p.flag
from parent p
where c.parent_id = p.id and p.flag = 1;
- delete all the records in child where the flag = null
- Code: Select all
delete from child where flag is null;
- remove temp flag field
- Code: Select all
alter table child drop column flag;
What I have found trying to run this locally as a test is that added the temp flag column takes a lot of time (still running after 1 hr) and I assume that when I run this query it places a lock on the table, preventing users from accessing the other data in this table (the records I want to delete are not used or accessed anymore by users)
Any suggestions on how to optimize this?
Does deleting the data in one go (after the tmp flag is set) result in the table being locked for as long as the query is running?