
I have a local PostgreSQL database that was created by a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only one of each record and delete the duplicates. I write SQL every day for work, but Rails is a hobby and I still struggle with ActiveRecord.

Here's how I found the duplicates (in Rails console):

Summary.select(:map_id).group(:map_id).having("count(*) > 1")

I don't think I can simply append destroy_all to that statement, as it would destroy every instance of each duplicated entry rather than just the extra copies.

Could you please tell me how to update this so that it removes the duplicates?

nonegiven72
    If you know your way around SQL why don't you just do it in SQL? – mu is too short May 30 '16 at 00:39
  • For some reason I thought using pure SQL in rails was hard. I've done this a few times in SQL. One would be order by map_id and select first. Other would be order by, creating a count column, and selecting where that count column = some_number (used in case you didn't want the first, but rather the 2nd or 3rd observation to be preserved). – nonegiven72 May 30 '16 at 00:44
  • Using raw SQL is easy in Rails, I do it all the time because ActiveRecord only understands baby talk SQL. – mu is too short May 30 '16 at 01:21
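
For reference, the raw-SQL route sketched in these comments could look like this run from Rails (a minimal sketch, not taken from the thread; it assumes the table is named summaries and keeps the lowest id per map_id):

ActiveRecord::Base.connection.execute(<<~SQL)
  -- remove every row that is not the first (lowest-id) row for its map_id
  DELETE FROM summaries
  WHERE id IN (
    SELECT id FROM (
      SELECT id, ROW_NUMBER() OVER (PARTITION BY map_id ORDER BY id) AS rn
      FROM summaries
    ) ranked
    WHERE rn > 1
  )
SQL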

3 Answers


This will destroy the duplicates in waves, selecting a single duplicate per map_id on each pass. The loop finishes automatically once no more duplicates exist.

loop do
  # Grab the highest id for each map_id that still appears more than once
  duplicates = Summary.select("MAX(id) AS id, map_id").group(:map_id).having("count(*) > 1")
  # Stop once a pass finds no remaining duplicates
  break if duplicates.length == 0
  # Destroys one record per duplicated map_id (the one with the highest id)
  duplicates.destroy_all
end

If the database looks like this:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |
|  4 |    235 |
|  5 |    235 |
|  6 |    299 |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the first wave, these records would be returned and destroyed:

| id | map_id |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the second wave, these records would be returned and destroyed:

| id | map_id |
|  5 |    235 |
|  6 |    299 |

The third wave would return and destroy this record:

| id | map_id |
|  4 |    235 |

The fourth wave would find no remaining duplicates and complete the process. Since each pass removes only one duplicate per map_id, the loop runs roughly as many times as the most-duplicated map_id has copies, so unless some map_id has numerous duplicates, the process should finish in single-digit loop iterations.

Given the approach, only duplicates are ever returned, and only the newest duplicates are removed, so the oldest record for each map_id survives. To remove the older duplicates instead (keeping the newest), change the query to this:

  duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")

In that case, wave 1 would return and destroy:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |

Wave 2 would return and destroy:

| id | map_id |
|  4 |    235 |
|  6 |    299 |

Wave 3 would return and destroy:

| id | map_id |
|  5 |    235 |

Wave 4 would complete the process.
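
After the loop completes, a quick sanity check (using the same model from the question) should come back empty:

# Returns a hash of { map_id => count } for any map_id still duplicated; expect {}
Summary.group(:map_id).having("count(*) > 1").count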

Michael Gaskill
  • Takes a little while to finish for 200k duplicates, but it works. I'm working on my scraping logic to reduce the amount of duplicates generated. – nonegiven72 May 30 '16 at 00:47
  • That's great to hear! When you use it on subsequent runs, it should be *much* faster. 200K is a lot of records to destroy. – Michael Gaskill May 30 '16 at 00:58

I would go to the db console (rails dbconsole) and do:

CREATE TABLE some_temp_name AS
  SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id;

Then rename the tables so some_temp_name takes the place of summaries.
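
Still in the console, that swap might look like this (a sketch; the backup table name is my choice, and indexes and constraints would need to be recreated on the new table):

ALTER TABLE summaries RENAME TO summaries_with_dupes;
ALTER TABLE some_temp_name RENAME TO summaries;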

EDIT - This seems like what you're looking for:

Summary.where.not(id: Summary.group(:map_id).pluck('min(summaries.id)')).delete_all

NOT TESTED. It was part of this answer here: Rails: Delete duplicate records based on multiple columns
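
Before running the destructive version, you can preview what it generates (as suggested in the comments below); to_sql prints the SELECT form of the relation, and delete_all issues a DELETE with the same WHERE clause:

keep_ids = Summary.group(:map_id).pluck('min(summaries.id)')  # one id kept per map_id
puts Summary.where.not(id: keep_ids).to_sql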

seph
  • Would definitely work, just wanted something less hackish since it'll be part of the web scraping code and thus be executed more than once. – nonegiven72 May 29 '16 at 22:06
  • @nonegiven72: Why would you be doing this more than once? Presumably you'd clean up the mess of duplicates, add a UNIQUE constraint to prevent them from happening again, and then check for duplicates (and catch the unique violation exception from the constraint) before adding/updating. – mu is too short May 30 '16 at 00:41
  • My other applications have had this, but the unique check was only when someone created an account, and honestly that wasn't too often. I worried that when scraping 500k records in an hour, the checking of uniqueness for each one might slow down the process, and that it'd be easier just to remove them at the end. – nonegiven72 May 30 '16 at 00:50
  • @nonegiven72: If you're worried about performance then you want to bypass the ORM as much as possible anyway and pump data straight into the database, then throw a little SQL at the database to clean things up en mass. Trying to do everything through Rails is politically correct and will make the fan boys smile but it is often silly and wasteful. IMO of course. – mu is too short May 30 '16 at 01:23
  • @nongiven72: Please see edit. I think this what you were looking for. You can do `to_sql` instead of `delete_all` to see what gets generated. – seph May 30 '16 at 03:04
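
Following the unique-constraint suggestion in the comments above, a hypothetical migration (the class name is a placeholder) would let the database reject future duplicates outright:

class AddUniqueIndexToSummariesMapId < ActiveRecord::Migration
  def change
    # Once the data is deduplicated, enforce uniqueness at the database level
    add_index :summaries, :map_id, unique: true
  end
end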

What I would suggest is to fetch all the records, ordered by the field that has duplicates.

Then loop over them, keeping only the first record for each value and destroying the rest.

value = nil
Summary.order("map_id ASC").each do |record|
  if record.map_id == value
    # Same map_id as the previous record: a duplicate, so remove it
    record.destroy
  else
    # First record for this map_id: keep it and remember the value
    value = record.map_id
  end
end
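
One caveat: Summary.order("map_id ASC").each loads all 600k records into memory at once. A more memory-friendly sketch of the same idea (my adaptation, not part of the original answer) walks the table in id order with find_each and tracks seen map_ids in a Set, which also guarantees the lowest-id record survives:

require 'set'

seen = Set.new
Summary.find_each do |record|     # batched iteration in primary-key order
  if seen.include?(record.map_id)
    record.destroy                # later copy of an already-seen map_id
  else
    seen.add(record.map_id)       # first record for this map_id: keep it
  end
end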
Simone Carletti