Highest Voted Questions - Data Science Stack Exchange

8

votes

4 answers

Why is there a difference between predicting on Validation set and Test set?

I have an XGBoost model trying to predict whether a currency will go up or down in the next period (5 min). I have a dataset from 2004 to 2018. I randomly split the data into 95% train and 5% validation, and the accuracy on the validation set is up to 55%. When…
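The excerpt is truncated, so the actual cause of the gap is not shown here. One commonly cited concern with a random split on time-ordered data is that validation rows sit between training rows in time. The sketch below shows a chronological split as an alternative; the function name `chronological_split`, the row layout, and the data are made up for illustration.

```python
from datetime import datetime, timedelta

def chronological_split(rows, train_fraction=0.95):
    """Split time-stamped rows so every validation row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up 5-minute bars: (timestamp, label) with label 1 = up, 0 = down.
start = datetime(2004, 1, 1)
rows = [(start + timedelta(minutes=5 * i), i % 2) for i in range(100)]

train, valid = chronological_split(rows)
print(len(train), len(valid), train[-1][0] < valid[0][0])
```

With this kind of split the validation score is measured only on periods the model has not seen any neighbours of, which may be closer to how a later test set behaves.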

asked Aug 24 '19 at 20:10

DBSE

221
2
3

8

votes

1 answer

Complex Chunking with NLTK

I am trying to figure out how to use NLTK's cascading chunker as per Chapter 7 of the NLTK book. Unfortunately, I'm running into a few issues when performing non-trivial chunking measures. Let's start with this phrase: "adventure movies between 2000…

asked May 16 '15 at 00:15

grill

234
3
7

8

votes

1 answer

Gensim LDA model: return keywords based on relevance (λ - lambda) value

I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…

asked Aug 21 '19 at 17:40

Tasos Lytos

81
4

8

votes

1 answer

Which classification algorithms to try for classifying text data into 300 categories

I have 40,000 rows of text data from the health care domain. The data has one column for the text (2-5 sentences) and one column for its category. I want to classify the text into 300 categories. Some categories are independent while some are somewhat related.…

asked May 07 '15 at 08:52

Alok Nayak

191
1
5

8

votes

2 answers

How to use Graph Neural Network to predict relationships between nodes with pytorch_geometric?

Let's say I have a partly connected graph that represents members of many unrelated communities. I would like to predict the possible friendships between members of the same community: on a sliding scale from 0 to 10, how likely would they like…

pytorch-geometric

asked Jul 31 '19 at 16:38

Soerendip

724
1
9
16

8

votes

5 answers

What is the state of the art in question generation with NLP?

I was trying out various question generation projects available on GitHub, namely NQG, question-generation, and many others, but I don't see good results from them: either they have very bad question formation or the questions generated are…

asked Jul 27 '19 at 07:39

Jack109

108
1
10

8

votes

2 answers

Why is taking the gradient of the average error in SGD not correct, but rather the average of the gradients of single errors?

I am a little confused about taking averages in cost functions and SGD. So far I always thought in SGD you would compute the average error for a batch and then backpropagate it. But then I was told in a comment on this question that that was wrong.…

asked Jul 25 '19 at 21:13

lo tolmencre

235
1
9

8

votes

2 answers

Which classification algorithms are negatively affected by class imbalances?

I've seen a few posts and papers floating around the web (mostly those related to over/undersampling, SMOTE, and cost-sensitive training) that, when discussing class imbalance, specify that certain algorithms are negatively impacted by class…

asked Jul 03 '19 at 19:45

Danny David Leybzon

180
2

8

votes

4 answers

What is the term for when a model acts on the thing being modeled and thus changes the concept?

I'm trying to see if there is a conventional term for this concept to help me in my literature research and writing. When a machine learning model causes an action to be taken in the real world that affects future instances, what is that called? …

asked Apr 02 '15 at 23:52

jsmith54

83
2

8

votes

1 answer

What are the input and output channels of a convolution in PyTorch?

From the PyTorch documentation for convolution, I saw that the function torch.nn.Conv1d requires users to pass the parameters "in_channels" and "out_channels". I know they refer to input channels and output channels, but I am not sure about what they…
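To make the two parameters concrete, here is a naive pure-Python sketch of a 1-D convolution (cross-correlation, stride 1, no padding, no bias). It is not PyTorch's implementation; it only mirrors the shape convention of torch.nn.Conv1d with groups=1, where the weight has shape (out_channels, in_channels, kernel_size). The function name `conv1d` and all sizes and values are made up.

```python
def conv1d(x, weight):
    """Naive 1-D cross-correlation without bias, stride 1, no padding.

    x:      in_channels lists, each of length L
    weight: out_channels x in_channels x kernel_size nested lists
    returns out_channels lists, each of length L - kernel_size + 1
    """
    in_channels, length = len(x), len(x[0])
    kernel_size = len(weight[0][0])
    out = []
    for filt in weight:                  # one filter per output channel
        assert len(filt) == in_channels  # each filter spans all input channels
        row = []
        for t in range(length - kernel_size + 1):
            row.append(sum(filt[c][k] * x[c][t + k]
                           for c in range(in_channels)
                           for k in range(kernel_size)))
        out.append(row)
    return out

# Hypothetical sizes: in_channels=2, out_channels=3, kernel_size=2, L=5.
x = [[1, 2, 3, 4, 5],
     [0, 1, 0, 1, 0]]
weight = [[[1, 0], [0, 0]],   # copies channel 0
          [[0, 0], [1, 1]],   # sums neighbouring values of channel 1
          [[1, 1], [1, 1]]]   # sums a window over both channels
y = conv1d(x, weight)
print(len(y), len(y[0]))
```

In this picture, in_channels is the number of input rows every filter reads at once, and out_channels is the number of filters, hence the number of output rows.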

asked Jun 18 '19 at 09:46

LastK7

101
1
1
3

8

votes

4 answers

XGBoost Huge Dataset ~1TB

Can a gradient boosting solution like XGBoost or LightGBM be used for a huge amount of data? I have a CSV file of 820 GB containing 1 billion observations, and each observation has 650 data points. Is this too much data for XGBoost? I have searched…

asked Jun 15 '19 at 08:05

Medz Benz

81
1
2

8

votes

3 answers

How to find out if two datasets are close to each other?

I have the following three datasets. data_a=[0.21,0.24,0.36,0.56,0.67,0.72,0.74,0.83,0.84,0.87,0.91,0.94,0.97] data_b=[0.13,0.21,0.27,0.34,0.36,0.45,0.49,0.65,0.66,0.90] data_c=[0.14,0.18,0.19,0.33,0.45,0.47,0.55,0.75,0.78,0.82] data_a is real data…
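One possible notion of "close" for one-dimensional samples like these is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. This is an assumption about what kind of comparison is wanted, since the excerpt is truncated; the helper name `ks_statistic` is introduced here.

```python
def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in points)

data_a = [0.21, 0.24, 0.36, 0.56, 0.67, 0.72, 0.74, 0.83, 0.84, 0.87, 0.91, 0.94, 0.97]
data_b = [0.13, 0.21, 0.27, 0.34, 0.36, 0.45, 0.49, 0.65, 0.66, 0.90]
data_c = [0.14, 0.18, 0.19, 0.33, 0.45, 0.47, 0.55, 0.75, 0.78, 0.82]

# A smaller statistic means the two empirical distributions are closer.
print(ks_statistic(data_a, data_b), ks_statistic(data_a, data_c))
```

The statistic lies between 0 and 1 and is 0 for identical samples. With samples this small it is only a rough indication.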

asked Jun 09 '19 at 05:10

Kartikeya Sharma

167
1
9

8

votes

1 answer

What makes binary cross entropy a better choice for binary classification than other loss functions?

I'm reading this post where I came across this quote "Cross-entropy is the default loss function to use for binary classification problems." But what about it makes it the default and presumably best loss function for binary classification?

asked Jun 07 '19 at 15:41

John Slaine

81
1
2

8

votes

3 answers

Why does logistic function use e rather than 2?

The sigmoid function can be used as an activation function in machine learning. $$S(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{e^{x}+1}.$$ If we substitute e with 2, def sigmoid2(z): return 1/(1+2**(-z)) x = np.arange(-9,9,dtype=float) y…
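One way to see how the base-2 variant relates to the standard logistic function: since 2^(-z) = e^(-z ln 2), replacing e with 2 only rescales the input by ln 2. The sketch below checks this numerically with the standard library; `sigmoid` is a helper name introduced here, while `sigmoid2` follows the excerpt.

```python
import math

def sigmoid(z):
    """Standard logistic function with base e."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid2(z):
    """Base-2 variant from the question excerpt."""
    return 1.0 / (1.0 + 2.0 ** (-z))

# 2**(-z) == exp(-z * ln 2), so sigmoid2(z) should equal sigmoid(z * ln 2).
zs = [float(z) for z in range(-9, 9)]
max_diff = max(abs(sigmoid2(z) - sigmoid(z * math.log(2))) for z in zs)
print(max_diff)
```

The printed difference is at the level of floating-point rounding, so the two curves have the same shape and differ only in horizontal scale.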

asked Jun 06 '19 at 07:55

JJJohn

623
10
23

8

votes

2 answers

Why is class weight outperforming oversampling?

I am applying both class_weight and oversampling (SMOTE) techniques on a multiclass classification problem and getting better results when using the class_weight technique. Could someone please explain what could be the cause of this difference?

asked May 26 '19 at 01:09

Sarah

611
2
5
17
