
I'm building a dataset from scraped web data. I have a 'company' column that contains company names. I would like to encode this column, but I don't know how to identify the strings that refer to the same company.

For example, "International Business Machines Corporation", "IBM", and "IBM India Pvt.Ltd" represent the same company.

Any suggestions? Thank you.

Devashish Prasad
Lydia

1 Answer


This kind of problem is called record linkage (or sometimes entity matching, among other variants). The task is to find, among a list of strings representing entities (persons or organizations), those that refer to the same real-world entity.

There are two main approaches (which can be combined):

  • String similarity matching methods. See for example this question or this one. Note that if the list of companies is large, efficiency can also become an issue, since naively comparing every pair of names scales quadratically: see this question.
  • Databases or third-party resources. See for example this related question.
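
As a rough illustration of the string similarity approach, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example rather than an established method: the `LEGAL_TOKENS` list, the 0.85 threshold, and the helper names `normalize`, `same_company` and `encode` are all hypothetical. The acronym and prefix rules are crude heuristics that will produce false positives on real data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to ignore; extend it for your data.
LEGAL_TOKENS = {"corporation", "corp", "inc", "ltd", "pvt", "llc",
                "co", "company", "limited"}

def normalize(name):
    """Lowercase, turn punctuation into spaces, drop legal-form tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return [t for t in tokens if t not in LEGAL_TOKENS]

def same_company(a, b, threshold=0.85):
    """Heuristic pairwise test; the threshold is an arbitrary starting point."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return False
    short, long_ = sorted((ta, tb), key=len)
    # Rule 1: one name is a token prefix of the other ("IBM" / "IBM India").
    if long_[:len(short)] == short:
        return True
    # Rule 2: a one-word name equals the initials of a multi-word name.
    if len(short) == 1 and len(long_) > 1:
        if short[0] == "".join(t[0] for t in long_):
            return True
    # Rule 3: fuzzy match on the normalized strings (catches typos).
    ratio = SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    return ratio >= threshold

def encode(names, threshold=0.85):
    """Assign the same integer code to names linked by same_company."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: quadratic in the number of names.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if same_company(names[i], names[j], threshold):
                parent[find(j)] = find(i)

    codes, labels = {}, []
    for i in range(len(names)):
        labels.append(codes.setdefault(find(i), len(codes)))
    return labels

names = [
    "International Business Machines Corporation",
    "IBM",
    "IBM India Pvt.Ltd",
    "Microsoft Corp.",
    "Microsft Corporation",
    "Google",
]
print(encode(names))  # [0, 0, 0, 1, 1, 2]
```

Matches are merged transitively, so "IBM India Pvt.Ltd" and the full corporate name end up with the same code through "IBM" even though they do not match each other directly. For a large list, the all-pairs loop would likely need to be replaced by some form of blocking, and a lookup against a curated alias table or third-party database is usually more reliable for well-known companies.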
Erwan