I am new to machine learning. I want to develop a curriculum vitae (CV) recommender system. I want to determine how similar two CVs are and, given a random CV, have the system suggest which cluster of CVs it belongs to.

This is what I've already done, following a blog post:

  1. I have a folder containing a lot of CVs (resumes) as plain-text documents (.txt).

  2. I have pre-processed this data: tokenization, stop-word removal and stemming.

  3. I extracted each candidate's name, email ID, contact number, education and experience.

I am confused about how to train on this data and how to create a model from it. More specifically, I have the following questions:

  1. How do I create a model on text data?

  2. Which algorithm should I apply to this data?

Any help would be appreciated.

Thanks.

mapto
Heena
  • What pre-processing have you done? What does your final dataset look like? What are you actually trying to achieve? – Dan Carter Feb 27 '19 at 09:09
  • I have applied pre-processing steps such as tokenization, stop-word removal and stemming, and I extracted the candidate's name, email ID, contact number, education and experience. What I do not know is how to create a model on text data. @Dan Carter – Heena Feb 27 '19 at 09:28
  • I'm not clear on what your model is trying to achieve. What is the end goal of your project? – Itachi Feb 27 '19 at 09:42
  • I want to develop a CV recommender system. @Itachi – Heena Feb 27 '19 at 09:54
  • Do you mean that you want to determine how similar two CVs are and, given a random CV, have the system suggest which cluster of CVs it belongs to? – Itachi Feb 27 '19 at 09:58
  • Yes, exactly. @Itachi – Heena Feb 27 '19 at 09:59
  • Of the information you say you've already extracted, names, email addresses and phone numbers are identifiers. There's no reason to expect them to be indicative for any clustering or similarity. Could you please clarify in more detail how you're representing the extracted education and experience? – mapto Feb 27 '19 at 10:30
  • I did it with the help of this blog post: https://www.omkarpathak.in/2018/12/18/writing-your-own-resume-parser/ – Heena Feb 27 '19 at 10:32
  • If your skills and education are represented by extracted words, your dataset is equivalent to tagged users. If you are still interested in a wider range of possible approaches, consider this and this question and the corresponding answers. – mapto Feb 27 '19 at 10:46

1 Answer

I have worked on a similar project with job descriptions (JDs). We trained a word2vec model on the words in the JDs, and the results were good because we had a lot of JDs. What word2vec does is convert each word to a vector representation that captures the contexts in which the word appears. You can check the gensim documentation here: https://radimrehurek.com/gensim/

You could extract skills or other fields from the CVs and compute a semantic similarity between them based on the word2vec model. You can use a custom formula for comparing similarity. Other features could be education, experience, similar projects, etc.
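One simple way to turn word vectors into a CV-to-CV score is to average the vectors of each CV's skill words and take the cosine similarity of the two averages. This is only one possible formula, not necessarily the one used in the project described above. The sketch below uses the standard library only; the `word_vectors` table and the skill lists are hypothetical stand-ins for vectors taken from a trained model.

```python
import math

# Hypothetical 3-dimensional word vectors; in practice these would come
# from a trained word2vec model.
word_vectors = {
    "python": [0.9, 0.1, 0.0],
    "pandas": [0.8, 0.2, 0.1],
    "java": [0.1, 0.9, 0.0],
    "spring": [0.0, 0.8, 0.2],
}


def average_vector(tokens, vectors):
    """Return the mean vector of the known tokens, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cv_similarity(skills_a, skills_b, vectors):
    """Score two skill lists by the cosine similarity of their averaged vectors."""
    vec_a = average_vector(skills_a, vectors)
    vec_b = average_vector(skills_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine_similarity(vec_a, vec_b)


cv_1 = ["python", "pandas"]
cv_2 = ["python", "java"]
cv_3 = ["java", "spring"]

print(cv_similarity(cv_1, cv_2, word_vectors))
print(cv_similarity(cv_1, cv_3, word_vectors))
```

With these toy vectors, `cv_1` scores higher against `cv_2` (which shares "python") than against `cv_3`. Education and experience could be scored separately and combined with weights of your choosing.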

Itachi