I want to cleanup a large postgresql table (9M records). At one time it was created as a 1-1 relation to another table (parent), but now we are using this child table a bit differently which results in me wanting to delete about 2/3 of its data. And also in the past years the data has become corrupted as well, there are records in the child table that no longer have corresponding records in its parent table so I want to cleanup this as well.
In the new setup I want only those record in the child table remaining where a specific flag in the parent table is set
Simplified Schema:
Child table: child_id: INTEGER, parent_id: INTEGER
Parent table: id: INTEGER, flag: INTEGER
I want to do this in a way that has as minimum impact as possible to the users.
I cannot find a single query that does what I want in a reasonable amount of time.
What I have come up with so far is this:
- created temp flag field in child table
alter table child add column flag integer default null;
- copy value from parent table
update child as c
set flag = p.flag
from parent p
where c.parent_id = p.id and p.flag = 1;
- delete all the records in child where the flag = null
delete from child where flag is null;
- remove temp flag field
alter table child drop column flag;
What I have found trying to run this locally as a test is that added the temp flag column takes a lot of time (still running after 1 hr) and I assume that when I run this query it places a lock on the table, preventing users from accessing the other data in this table (the records I want to delete are not used or accessed anymore by users)
Any suggestions on how to optimize this?
Does deleting the data in one go (after the tmp flag is set) result in the table being locked for as long as the query is running?