Train a model for unstructured data

Question

I am new to machine learning. I want to develop a Curriculum Vitae (CV) recommender system. I want to determine how similar two CVs are and, given a random CV, suggest which cluster of CVs it belongs to.

This is what I've already done, following a blog post:

I have a folder containing a lot of CVs (resumes) as plain-text (.txt) documents.
I have pre-processed this data with tokenization, stop-word removal, and stemming.
I extracted each candidate's name, email ID, contact number, education, and experience.

I am confused about how to train on this data and how to create a model from it. More specifically, I have the following questions:

How do I create a model on text data?
Which algorithm should I apply to this data?

Any help would be appreciated.

Thanks.

What pre-processing have you done? What does your final dataset look like? What are you actually trying to achieve? — Dan Carter, Feb 27 '19 at 09:09
I applied pre-processing steps such as tokenization, stop-word removal, and stemming, and I extracted the candidate's name, email ID, contact number, education, and experience. I have no idea how to create a model on text data from here. @Dan Carter — Heena, Feb 27 '19 at 09:28
I'm not clear on what your model is trying to achieve. What is the end goal of your project? — Itachi, Feb 27 '19 at 09:42
Do you mean, you want to determine how similar 2 CVs are, and if you give a random CV, it suggest which cluster of CVs it belongs to? — Itachi, Feb 27 '19 at 09:58
From the information that you say you've already extracted, names, emails, and phone numbers are identifiers. There's no reason to expect them to be indicative of any clustering or similarity. Could you please clarify in more detail how you're representing the extracted education and experience? — mapto, Feb 27 '19 at 10:30
I did it with the help of this blog post: https://www.omkarpathak.in/2018/12/18/writing-your-own-resume-parser/ — Heena, Feb 27 '19 at 10:32
If your skills and education are indicated by extracted words, your dataset is equivalent to tagged users. If you are still interested in a wider range of possible approaches, consider this and this questions and the corresponding answers. — mapto, Feb 27 '19 at 10:46

score 0 · Accepted Answer · answered Feb 27 '19 at 12:37

I have worked on a similar project with job descriptions (JDs). We trained a word2vec model on the words in the JDs, and the results were good because we had a lot of JDs. What word2vec does is convert each word to a vector that reflects the contexts the word appears in, so words used in similar contexts end up with similar vectors. You can check the gensim documentation here: https://radimrehurek.com/gensim/

You may extract skills or other fields from the CVs and compute a semantic similarity between them based on the w2v model. You may use a custom formula for comparing similarity. Other things to compare could be education, experience, similar projects, etc.
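As a rough illustration of the similarity step, here is a minimal sketch, not the exact method from that project. It assumes each CV is already a list of pre-processed tokens and that `vectors` maps a token to its word vector. With gensim, the trained model's `model.wv` could presumably be passed in its place, as long as the model was trained on the same stemmed tokens. The function names, the toy vectors, and the choice of averaging plus cosine similarity are all assumptions made for the example.

```python
import math


def document_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query_tokens, corpus, vectors):
    """Return (cv_id, score) pairs, most similar CV first."""
    query_vec = document_vector(query_tokens, vectors)
    if query_vec is None:
        return []
    scored = []
    for cv_id, tokens in corpus.items():
        vec = document_vector(tokens, vectors)
        if vec is not None:
            scored.append((cv_id, cosine_similarity(query_vec, vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional vectors standing in for a trained w2v model.
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "django": [0.8, 0.2, 0.1],
    "java": [0.7, 0.3, 0.0],
    "accounting": [0.0, 0.1, 0.9],
    "audit": [0.1, 0.0, 0.8],
}

# Hypothetical CVs, reduced to their extracted skill tokens.
cvs = {
    "cv_backend": ["python", "django"],
    "cv_finance": ["accounting", "audit"],
}

ranking = rank_similar(["java", "python"], cvs, toy_vectors)
print(ranking)
```

To recommend CVs for a new one, you would take the top few entries of `ranking`. The same document vectors could also be fed to a clustering algorithm if you want groups rather than a ranked list.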

Itachi

I will try this, and if I have any questions I'll ask you. @Itachi – Heena Feb 27 '19 at 12:55
I have got as far as w2v. What do I do after that? How do I recommend a resume? @Itachi – Heena Mar 04 '19 at 08:21
