1

Test File

Sometimes, my lists of emails include duplicate addresses for the same person. For example, Jane's addresses are both "[email protected]" and "doe.jane@email". Her variants include replacing the "." with "-" or "_". At the moment, my duplicates script—upgraded ever so kindly by @Jordan Running and Ed Nelson—takes care of 'strict' duplicates, yet cannot detect that "[email protected]" is a 'complicated' duplicate of "[email protected]". Is there a way to delete even these duplicates such that I do not email more than one of Jane's addresses? All of them point to the same inbox, so I need only include one of her addresses.

Here is my current code:

function removeDuplicates() {
  const startTime = new Date();
  const newData = [];
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const seen = {};

  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  };

  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);

  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
                           numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}
Max Makhrov
  • 17,309
  • 5
  • 55
  • 81
Ed Dev
  • 105
  • 7

2 Answers2

1

Sample File

Fuzzy match test

enter image description here

Notes:

  • used without @email.com part, it distorts the result
  • use a the custom function: =removeDuplicatesFuzzy(B2:B12,0.66)
  • 0.66 is a percentage of fuzzy match.
  • the right column of a result (Column D) shows found matches with > 0.66 accuracies. Dash - is when matches are not found ("unique" values)

Background

You may try this library: https://github.com/Glench/fuzzyset.js

To install it, copy the code from here.

The usage is simple:

function similar_test(string1, string2)
{
  string1 = string1 || '[email protected]';
  string2 = string2 || '[email protected]'
  a = FuzzySet();
  a.add(string1);
  var result = a.get(string2);
  
  Logger.log(result);  // [[0.6666666666666667, [email protected]]]
  
  return result[0][0]; // 0.6666666666666667
}

There's also more info here: https://glench.github.io/fuzzyset.js/

Notes:

  • please google more info, look for javascript fuzzy string match. Here's related Q: Javascript fuzzy search that makes sense. Note: the solution should work in Google Sheets (no ECMA-6)
  • this algorithm is not smart like a human, it tests a string by char. If you have two similar strings like [email protected] it will be 84% similar to [email protected] but human detects it is completely another person.
Community
  • 1
  • 1
Max Makhrov
  • 17,309
  • 5
  • 55
  • 81
  • Thanks, @Max. I'm not sure I understand. I copied the code from the link and then pasted it into my test sheet but I am receiving errors I've never seen before. My test sheet is here: https://docs.google.com/spreadsheets/d/1RQxZ2yMeARdjGAcoJfx3Mtgdb7IM35WUICk7I-Q5AuU/edit?usp=sharing – Ed Dev Feb 01 '18 at 19:04
  • 1
    Hi, @ed-dev, thank you for the test file. I've made some tests and still the best result I have is far from your goal. Please try my sample, if it helps you. I can't find a better solution so far. But I like your question and want to propose bounty for it, to see the better approaches. – Max Makhrov Feb 02 '18 at 13:09
  • Thank you, @Max Makhrov. I'm not able to view the script code with your sample file given the permissions. (FYI, the link to 'Test Fuzzy Match' in your answer differs from the 'Test sheet 2' link in my post, which might cause confusion if you set a bounty.) – Ed Dev Feb 02 '18 at 13:47
  • for a start, I think it would even be an improvement to remove less complicated duplicates such as [email protected] whenever there is a [email protected]. – Ed Dev Feb 02 '18 at 18:21
  • Hi @Max. I wonder whether you've had any more thoughts on this question. – Ed Dev Feb 04 '18 at 10:41
0

Search for my Google Sheets add-on called Flookup. It should do what you want.

For your case, you can use this function:

ULIST(colArray, [threshold])

The parameter details are:

  1. colArray: the column from which unique values are to be returned.
  2. threshold: the minimum percentage similarity between the colArray values that are not unique.

Or you can simply use the Highlight duplicates or Remove duplicates from the add-on menu. The key feature is that you can adjust the level of strictness by changing the percentage similarity.

Bonus: It will easily catch swaps like [email protected] / [email protected]

You can find out more at the official website.