Highest Voted Questions - Data Science Stack Exchange

34

votes

6 answers

Gini coefficient vs Gini impurity - decision trees

The problem concerns building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree; they can support our choices when splitting the…

data-mining

asked Sep 09 '14 at 12:44

Damien

341
1
3
3

33

votes

3 answers

Hypertuning XGBoost parameters

XGBoost has been doing a great job when it comes to dealing with both categorical and continuous dependent variables. But how do I select the optimized parameters for an XGBoost problem? This is how I applied the parameters for a recent Kaggle…

asked Dec 13 '15 at 14:19

Dawny33

8,296
12
48
104

33

votes

4 answers

Neural Network parse string data?

So, I'm just starting to learn how a neural network can operate to recognize patterns and categorize inputs, and I've seen how an artificial neural network can parse image data and categorize the images (demo with convnetjs), and the key there is to…

neural-network

asked Jul 30 '14 at 16:27

MidnightLightning

433
1
4
4

33

votes

6 answers

Are there any tools for feature engineering?

Specifically, I am looking for tools with functionality that is specific to feature engineering. I would like to be able to easily smooth, visualize, fill gaps, etc. Something similar to MS Excel, but that has R as the underlying…

asked Oct 03 '15 at 04:09

John

431
1
5
4

33

votes

6 answers

Validation loss is not decreasing

I am trying to train an LSTM model. Is this model suffering from overfitting? Here is the training and validation loss graph:

asked Dec 27 '18 at 08:23

DukeLover

581
1
6
14

33

votes

8 answers

Best practical algorithm for sentence similarity

I have two sentences, S1 and S2, both of which have a word count (usually) below 15. What are the most practically useful and successful (machine learning) algorithms that are also reasonably easy to implement (neural network is ok, unless the architecture…

asked Nov 23 '17 at 14:40

DaveTheAl

503
1
5
11

33

votes

1 answer

Ways to deal with longitude/latitude feature

I am working on a fictional dataset with 25 features. Two of the features are the latitude and longitude of a place, and the others are pH values, elevation, windSpeed, etc., with varying ranges. I can perform normalization on the other features but how do I…

asked Aug 20 '16 at 06:51

AllThingsScience

443
1
4
5

33

votes

5 answers

How can I get a measure of the semantic similarity of words?

What is the best way to figure out the semantic similarity of words? Word2Vec is okay, but not ideal: # Using the 840B word Common Crawl GloVe vectors with gensim: # 'hot' is closer to 'cold' than 'warm' In [7]: model.similarity('hot',…

asked Jul 19 '16 at 21:54

Thomas Johnson

665
1
7
11

32

votes

3 answers

How can I check the correlation between features and target variable?

I am trying to build a regression model and am looking for a way to check whether there is any correlation between the features and the target variable. This is my sample dataset: Loan_ID Gender Married Dependents Education Self_Employed…

asked Oct 03 '18 at 18:43

Jeeth

931
2
10
19

32

votes

4 answers

Role derivative of sigmoid function in neural networks

I am trying to understand the role of the derivative of the sigmoid function in neural networks. First, I plot the sigmoid function and its derivative at all points, computed from the definition, using Python. What exactly is the role of this derivative? import numpy as np import…

asked Apr 23 '18 at 09:38

lukassz

467
1
5
10

32

votes

2 answers

How to calculate the fold number (k-fold) in cross validation?

I am confused about how to choose the number of folds (in k-fold CV) when I apply cross validation to check the model. Does it depend on the data size or on other parameters?

asked Feb 22 '18 at 05:23

Taimur Islam

941
4
11
17

31

votes

4 answers

What algorithms should I use to perform job classification based on resume data?

Note that I am doing everything in R. The problem is as follows: I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is, based on the text of their CVs, to classify…

asked Jul 03 '14 at 16:11

user1769197

431
1
5
5

31

votes

7 answers

Can machine learning learn a function like finding maximum from a list?

I have an input which is a list, and the output is the maximum of the elements of the input list. Can machine learning learn a function that always selects the maximum of the elements in the input? This might seem like a pretty…

asked Jul 31 '19 at 11:06

user78739

319
1
3
3

31

votes

3 answers

General approach to extract key text from sentence (nlp)

Given a sentence like "Complimentary gym access for two for the length of stay ($12 value per person per day)", what general approach can I take to identify the word "gym" or "gym access"?

asked Mar 13 '15 at 16:41

William Falcon

421
1
6
7

31

votes

5 answers

Why underfitting is called high bias and overfitting is called high variance?

I have been using terms like underfitting/overfitting and bias-variance tradeoff for quite a while in data science discussions, and I understand that underfitting is associated with high bias and overfitting is associated with high variance. But…

asked Feb 14 '19 at 14:33

Vaibhav Thakur

2,363
3
12
9
