Questions tagged [data-mining]

Conceptually speaking, data-mining can be thought of as one item (or set of skills and applications) in the toolkit of the data scientist.

More specifically, data-mining is an activity that seeks patterns in large, complex data sets. It usually emphasizes algorithmic techniques, but may also involve any set of related skills, applications, or methodologies with that goal.

In US-English colloquial speech, data-mining and data-collection are often used interchangeably.

However, a main difference between these two related activities is intentionality.

Definition inspired mostly by the contributions of @statsRus to Data Science.SE

1181 questions

6 answers

Gini coefficient vs Gini impurity - decision trees

The problem concerns building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree - these can support our choices when splitting the…

data-mining

asked Sep 09 '14 at 12:44

Damien
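For readers comparing the two measures named in this question, here is a minimal sketch of each, using the usual textbook definitions. The function names are illustrative and are not taken from the question or its answers. Gini impurity is computed from class proportions, while the Gini coefficient is computed here as the mean absolute difference between all pairs of values, divided by twice the mean.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_coefficient(values):
    """Gini coefficient of non-negative values with a positive mean.

    Uses the mean absolute difference over all ordered pairs,
    divided by twice the mean.
    """
    n = len(values)
    mean = sum(values) / n
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * n * mean)


# A 50/50 class split is maximally impure for two classes.
print(gini_impurity(["a", "a", "b", "b"]))
# Perfectly equal values show no inequality.
print(gini_coefficient([1, 1, 1, 1]))
```

The first function measures how mixed the class labels in a node are; the second measures inequality in a distribution of numeric values, which is why the two should not be confused.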

3 answers

What is the use of user data collection besides serving ads?

Well, this looks like the most suitable place for this question. Every website collects data about its users, some just for usability and personalization, but the majority, like social networks, track every move on the web, some free apps on your phone…

data-mining

asked Jul 31 '14 at 18:52

GleissonGraca

2 answers

Method for finding top-k cosine similarity based closest item on large dataset

I have a dataset with 40 million items, where each item is a 400-dimensional vector of doubles. What I want to do is find the top-k (small k, about 3~10) most similar items to an arbitrary given input vector. The similarity measure is cosine similarity, since…

data-mining

asked Mar 25 '16 at 17:34

YKS
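As a baseline for the task described in this question, an exact brute-force top-k search by cosine similarity can be sketched as follows. This is only an illustration on toy data, with hypothetical function names; a linear scan like this is unlikely to be practical at 40 million items, and it is not the method proposed in the answers.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_cosine(items, query, k):
    """Indices of the k items most similar to query, best first."""
    scored = ((cosine_similarity(item, query), idx)
              for idx, item in enumerate(items))
    # heapq.nlargest keeps only k candidates in memory at a time.
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy data: four 2-dimensional items (the question uses 400 dimensions).
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k_cosine(items, [1.0, 0.1], k=2))
```

Because the scan streams through the items and holds only k candidates, memory use stays small, but the running time grows linearly with the number of items.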

5 answers

What kind of research can be done with an email data set?

I found a data set called Enron Email Dataset. It is possibly the only substantial collection of "real" email that is public. I found some prior analysis of this work: A paper describing the Enron data was presented at the 2004 CEAS…

data-mining

asked May 10 '15 at 09:58

Miller

1 answer

Amount of data needed and hypothesis for SVD

I was looking into the definition of SVD and trying to understand which conditions must be met in order to use it. Is there any hypothesis concerning the distribution of the data that I want to apply SVD to? Is the…

data-mining

asked Jan 31 '16 at 18:10

vphenix

2 answers

Calculating entropies of attributes

Can you please show the step-by-step calculation of Entropy(Ssun)? I do not understand how 0.918 is arrived at. I tried, but I get the values 0.521089678, 0.528771238, 0.521089678 for Sunny, Windy, Rainy. I was able to calculate the target…

data-mining

asked Feb 11 '15 at 16:46

user1744649
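The kind of calculation asked about here can be sketched with the standard Shannon entropy formula. The data set from the question is not shown in the excerpt, so the labels below are an assumption: a subset of three examples split 1:2 between two classes, which is one split that gives roughly 0.918. The values the asker reports match single terms of the form -p * log2(p) for p = 0.3 and p = 0.4, which suggests, though this is only a guess, that individual terms were computed rather than the full sum over the classes within the subset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


# Assumed subset: one "yes" and two "no" examples.
subset = ["yes", "no", "no"]
print(round(entropy(subset), 3))

# Single terms -p * log2(p), for comparison with the asker's values.
for p in (0.3, 0.4):
    print(p, -p * math.log2(p))
```

Working by hand for the assumed subset: -(1/3) * log2(1/3) - (2/3) * log2(2/3) is about 0.528 + 0.390, or about 0.918.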

1 answer

What tool to find expected and hidden patterns in data?

I have no background AT ALL in data science/stats/mathematics. However, I've always been interested in what data shows. I have a huge dataset right now - daily attendance figures for a factory of ~300 for the past 10 years. I'm interested in finding…

data-mining

asked Jan 03 '16 at 11:29

hasan

2 answers

Dealing with events that have not yet happened when building a model

I was building a model that predicts user churn for a website, where I have data on all users, both past and present. I can build a model that only uses those users that have left, but then I'm leaving 2/3 of the total user population unused. Is…

data-mining

asked Jul 04 '14 at 02:31

soandos

1 answer

How much app analytics data to collect?

Excuse the potentially dumb question; I've only just started learning about data science. How do we find out how much data we should collect before using it to start making decisions? Is there a way of knowing if we should wait to collect more…

data-mining

asked Feb 22 '17 at 16:31

tobinharris

1 answer

Non-parametric approach to healthcare dataset?

I have a healthcare dataset. I have been told to look at a non-parametric approach to solve certain questions related to the dataset. I am a little bit confused about the non-parametric approach. Do they mean a density-plot-based approach (such as looking at…

data-mining

asked Jul 24 '15 at 13:21

user62198

1 answer

What fields offer the most data science job opportunities?

I'm now transitioning to data science as a bioinformatics PhD. What fields need lots of data scientists? Or offer more opportunities? I guess business/finance and internet?

data-mining

asked Apr 29 '17 at 15:46

LookIntoEast

1 answer

How is the modulo number selected to build the hash table in DHP algorithm?

I'm trying to understand the DHP (Direct Hashing and Pruning) algorithm and I got stuck on explaining how the modulo number is selected. The paper shows an example of the hash function on page 7: h({x y}) = ((order of x)*10 + (order of y)) mod…

data-mining

asked Mar 17 '17 at 19:36

flamenco
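To make the role of the modulo number concrete, here is a minimal sketch of the hash function quoted in this question, with the modulus left as a parameter because the excerpt is cut off before its value. The names `dhp_hash`, `bucket_counts` and `num_buckets`, the item ordering, and the value 7 used in the demo are all illustrative assumptions, not details confirmed by the excerpt.

```python
from collections import Counter
from itertools import combinations


def dhp_hash(x_order, y_order, num_buckets):
    """Bucket index for a 2-itemset {x, y}, given each item's order."""
    return (x_order * 10 + y_order) % num_buckets


def bucket_counts(transactions, order, num_buckets):
    """Count how many 2-itemsets from the transactions fall in each bucket."""
    counts = Counter()
    for transaction in transactions:
        # Sort by item order so each pair is hashed as (smaller, larger).
        ordered = sorted(transaction, key=order.get)
        for x, y in combinations(ordered, 2):
            counts[dhp_hash(order[x], order[y], num_buckets)] += 1
    return counts


# Assumed toy data; num_buckets=7 is an arbitrary illustrative choice.
order = {"A": 1, "B": 2, "C": 3, "D": 4}
transactions = [{"A", "C"}, {"A", "C", "D"}]
print(bucket_counts(transactions, order, num_buckets=7))
```

The modulus fixes the number of buckets in the hash table: a smaller value uses less memory but makes more distinct itemsets collide in the same bucket, so bucket counts overestimate support more often and fewer candidates are pruned.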

1 answer

Why is sum a succinct constraint?

I'm new to data mining and have been going through constraint-based query mining lately. I came across the concept of succinctness, which basically defines a constraint as succinct if we can generate all the candidate item-sets precisely, based on…

data-mining

asked Feb 26 '17 at 09:52

Shubham Mittal

4 answers

Clustering categorical data

I have a dataset with categorical features. I want to segment the data using clustering techniques. What could be the possible choices for this scenario, given that the data has categorical features? Is there any variation of k-means which can be…

data-mining

asked Aug 07 '16 at 22:37

user3198880

0 answers

Which model for an ordinal-scaled dependent variable?

I want to use several models to find a relation between life satisfaction and several independent variables like children, unemployment, marriage and so on. My dependent variable has an ordinal scale [1:10] and the independent variables have nominal and ordinal…

data-mining

asked Dec 07 '15 at 11:08

René
