I analyse log files. Very often there are strings that differ in only a few places and are the same everywhere else. I am trying to find the generic string that they most likely belong to, e.g.:

UHGUYGUYGUYGUY id = U1234 UYAG*&^T*#@G*(&G@ id2 = 8767 ib97y79yh0978
UHGUYGUYGUYGUY id = Z1D#34 UYAG*&^T*#@G*(&G@ id2 = 98h ib97y79yh0978
Sss3ug87g87g78ghs837g8 obj { 876t7g }937hs937hs973h97sh397 jh7897y98h
Sss3ug87g87g78ghs837g8 obj { 98u2 }937hs937hs973h97sh397 ZN7897y98h

To me, the only difference within each pair is the id values, so the generic forms (groupings) would be:

UHGUYGUYGUYGUY id = * UYAG*&^T*#@G*(&G@ id2 = * ib97y79yh0978
Sss3ug87g87g78ghs837g8 obj { * }937hs937hs973h97sh397 *7897y98h

I am not sure what I should be looking up in machine learning for this problem, or even whether this problem has a name.

Of course, this is a very simplified example. The number and location of the ids can vary between generic cases, which is why I cannot write old-fashioned code for this: too many things can change.

Is there anything in machine learning that helps find groups for such strings? If yes, what is it called?

jimjim

2 Answers

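That sounds like a nearest-neighbor style clustering problem, because a large portion of each string is the same. (Strictly speaking, k-nearest neighbors is a supervised classifier; for unlabeled log lines, the related idea is to cluster on a string distance such as edit distance.) This answer to a similar question from 2014 has code you could try in Python.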

https://stats.stackexchange.com/a/158090
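To make the idea concrete, here is a minimal sketch of grouping by string similarity. It is not the code from the linked answer; it assumes whitespace tokenization, uses the standard library's `difflib.SequenceMatcher` as the similarity measure, and the names `similarity`, `group_lines`, `template` and the `0.6` threshold are illustrative choices only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two lines, compared token by token."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def group_lines(lines, threshold=0.6):
    """Greedy single-pass grouping (an assumed, simple strategy).

    Each line joins the first group whose first member is at least
    `threshold` similar; otherwise it starts a new group.
    """
    groups = []
    for line in lines:
        for group in groups:
            if similarity(line, group[0]) >= threshold:
                group.append(line)
                break
        else:
            groups.append([line])
    return groups


def template(a, b, wildcard="*"):
    """Build a generic form from two lines: shared tokens stay, differing runs become a wildcard."""
    tokens_a, tokens_b = a.split(), b.split()
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    out = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(tokens_a[i1:i2])
        else:
            out.append(wildcard)
    return " ".join(out)
```

Because this compares whole tokens, a partly shared token such as `jh7897y98h` / `ZN7897y98h` collapses to a bare `*` rather than `*7897y98h`; a character-level comparison would be needed to keep the common suffix, at the cost of noisier matches inside the ids themselves.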

CalZ

If you have a lot of instances, k-NN will quickly become very computationally expensive.

I would suggest using percent match instead. The computer can learn the distribution of the different types of strings that you have. When a new string arrives, it is compared with the other strings, and the likelihood that it came from each class is calculated. You then pick the class with the greatest likelihood, which is the class the new string most likely belongs to.
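A rough sketch of what such a percent-match lookup could look like is below. It rests on several assumptions: lines are split on whitespace, `difflib.SequenceMatcher` from the standard library supplies the match percentage, and the mean percentage against a class's examples stands in for the "likelihood" (it is a score, not a calibrated probability). The names `percent_match` and `best_class` are hypothetical.

```python
from difflib import SequenceMatcher


def percent_match(a, b):
    """Percentage (0..100) of tokens that two lines share, in order."""
    ratio = SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()
    return 100.0 * ratio


def best_class(new_line, classes):
    """Pick the class whose examples match `new_line` best on average.

    `classes` maps a class name to a non-empty list of example lines.
    Returns (class_name, mean_percent_match) for the highest-scoring class.
    """
    scores = {
        name: sum(percent_match(new_line, ex) for ex in examples) / len(examples)
        for name, examples in classes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Keeping only a handful of representative examples per class keeps the number of comparisons proportional to the number of classes rather than to the number of log lines seen so far.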

JahKnows