Questions tagged [data-mining]

Conceptually speaking, data-mining can be thought of as one item (or set of skills and applications) in the toolkit of the data scientist.

More specifically, data-mining is an activity that seeks patterns in large, complex data sets. It usually emphasizes algorithmic techniques, but may also involve any set of related skills, applications, or methodologies with that goal.

In US-English colloquial speech, data-mining and data-collection are often used interchangeably.

However, a main difference between these two related activities is intentionality.

Definition inspired mostly by the contributions of @statsRus to Data Science.SE

1181 questions
34 votes, 6 answers

Gini coefficient vs Gini impurity - decision trees

The problem concerns building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree - these can support our choices when splitting the…
Damien • 341 • 1 • 3 • 3
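To make the distinction in this question concrete, here is a minimal sketch of the two measures as they are commonly defined: Gini impurity over class labels in a node, and the Gini coefficient as an inequality measure over non-negative values. The function names are illustrative, and the mean-absolute-difference form of the coefficient is one of several equivalent formulations, chosen here as an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_coefficient(values):
    """Gini coefficient of non-negative values with a positive mean.

    Computed as the mean absolute difference over all ordered pairs,
    divided by twice the mean.
    """
    n = len(values)
    mean = sum(values) / n
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * n * mean)


if __name__ == "__main__":
    # A perfectly mixed two-class node is maximally impure for two classes.
    print(gini_impurity(["yes", "yes", "no", "no"]))
    # All the "wealth" held by one of four members is highly unequal.
    print(gini_coefficient([0, 0, 0, 4]))
```

The first function measures how mixed the labels in a node are; the second measures how unevenly a quantity is distributed, which is why the two are easy to confuse by name but answer different questions.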
8 votes, 3 answers

What is the use of user data collection besides serving ads?

Well, this looks like the most suitable place for this question. Every website collects user data, some just for usability and personalization, but the majority, like social networks, track every move on the web, some free apps on your phone…
8 votes, 2 answers

Method for finding top-k cosine similarity based closest item on large dataset

I have a dataset with 40 million items, where each item is a 400-dimensional vector of doubles. What I want to do is find the top-k (small k, about 3~10) most similar items to an arbitrary given input vector. The similarity measure is cosine similarity, since…
YKS • 83 • 1 • 3
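As a point of reference for this question, a brute-force baseline can be sketched as below. It assumes non-zero vectors and uses only the standard library; the function names are illustrative. At the scale described (40 million items), a plain scan like this would likely be too slow, so it is meant as a correctness baseline rather than a proposed solution.

```python
import heapq
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_cosine(query, items, k):
    """Return the k (similarity, index) pairs with the highest cosine similarity.

    After normalization, cosine similarity reduces to a dot product.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(item))), index)
        for index, item in enumerate(items)
    )
    # nlargest keeps only k candidates in memory while scanning.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    for similarity, index in top_k_cosine([1.0, 0.0], items, k=2):
        print(index, round(similarity, 4))
```

In practice the item vectors would be normalized once up front rather than on every query; the sketch normalizes inline only to stay short.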
5 votes, 5 answers

What kind of research can be done with an email data set?

I found a data set called Enron Email Dataset. It is possibly the only substantial collection of "real" email that is public. I found some prior analysis of this work: A paper describing the Enron data was presented at the 2004 CEAS…
Miller • 287 • 2 • 9
5 votes, 1 answer

Amount of data needed and hypothesis for SVD

I was looking into the definition of SVD and trying to understand which conditions need to be met in order to use it. Is there any hypothesis concerning the distribution of the data that I want to apply SVD to? Is the…
vphenix • 181 • 2
4 votes, 2 answers

Calculating entropies of attributes

Can you please show the step by step calculation of Entropy(Ssun)? I do not understand how 0.918 is arrived at. I tried but I get the values as 0.521089678, 0.528771238, 0.521089678 for Sunny, Windy, Rainy. I was able to calculate the target…
user1744649 • 41 • 1 • 2
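The excerpt does not show the asker's table, so the class counts below are an assumption: a value of 0.918 is what Shannon entropy (base 2) gives for a subset split 2:1 between two classes. A minimal sketch of the calculation, with an illustrative function name:

```python
import math


def entropy(counts):
    """Shannon entropy in bits for a list of class counts.

    Zero counts are skipped, following the convention 0 * log2(0) = 0.
    """
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    # Assumed split: two examples of one class, one of the other.
    print(round(entropy([2, 1]), 3))
```

Note that the entropy of a subset is the sum over all classes inside that subset, so a single term of the form -p * log2(p) on its own will not reproduce the subset's entropy.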
3 votes, 1 answer

What tool to find expected and hidden patterns in data?

I have no background AT ALL in data science/stats/mathematics. However, I've always been interested in what data shows. I have a huge dataset right now - daily attendance figures for a factory of ~300 for the past 10 years. I'm interested in finding…
hasan • 139 • 2
3 votes, 2 answers

Dealing with events that have not yet happened when building a model

I was building a model that predicts user churn for a website, where I have data on all users, both past and present. I can build a model that only uses those users that have left, but then I'm leaving 2/3 of the total user population unused. Is…
soandos • 133 • 5
3 votes, 1 answer

How much app analytics data to collect?

Excuse the potentially dumb question; I've only just started learning about data science. How do we find out how much data we should collect before using it to start making decisions? Is there a way of knowing if we should wait to collect more…
tobinharris • 131 • 1
2 votes, 1 answer

Non-parametric approach to healthcare dataset?

I have a healthcare dataset. I have been told to look at a non-parametric approach to solve certain questions related to the dataset. I am a little bit confused about the non-parametric approach. Do they mean a density-plot-based approach (such as looking at…
user62198 • 1,091 • 4 • 16 • 32
2 votes, 1 answer

What fields offer the most data science job opportunities?

I'm now transitioning to data science as a bioinformatics PhD. Which fields need lots of data scientists, or offer more opportunities? I guess business/finance and internet?
LookIntoEast • 121 • 2
2 votes, 1 answer

How is the modulo number selected to build the hash table in DHP algorithm?

I'm trying to understand the DHP (Direct Hashing and Pruning) algorithm and I got stuck on how the modulo number is selected. The paper shows an example of the hash function on page 7: h({x y}) = ((order of x)*10 + (order of y)) mod…
flamenco • 121 • 3
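The hash function quoted in this question can be sketched as follows. The modulus is cut off in the excerpt, so it is left as a parameter (`n_buckets`, a hypothetical name), and the value 5 used in the demo is purely illustrative, not the value from the paper. The `order` mapping and the sample transactions are likewise made up for the example.

```python
from collections import Counter
from itertools import combinations


def dhp_bucket(pair, order, n_buckets):
    """Hash a 2-itemset to a bucket: ((order of x) * 10 + (order of y)) mod n_buckets."""
    x, y = sorted(pair, key=order.__getitem__)
    return (order[x] * 10 + order[y]) % n_buckets


def bucket_counts(transactions, order, n_buckets):
    """Count how many 2-itemsets from all transactions land in each bucket."""
    counts = Counter()
    for transaction in transactions:
        ordered_items = sorted(transaction, key=order.__getitem__)
        for pair in combinations(ordered_items, 2):
            counts[dhp_bucket(pair, order, n_buckets)] += 1
    return counts


if __name__ == "__main__":
    order = {"A": 1, "B": 2, "C": 3}  # hypothetical item ordering
    transactions = [{"A", "B", "C"}, {"A", "C"}]  # hypothetical data
    print(bucket_counts(transactions, order, n_buckets=5))
```

A bucket count is an upper bound on the support of every pair hashed into it, so pairs in buckets below the minimum support can be dropped from the candidate set. A larger modulus means fewer collisions and tighter bounds at the cost of a bigger table, which is the trade-off behind the choice of modulo number.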
2 votes, 1 answer

Why is sum a succinct constraint?

I'm new to data mining and have been going through constraint-based query mining lately. I came across the concept of succinctness, which basically says a constraint is succinct if we can generate all the candidate item-sets precisely, based on…
2 votes, 4 answers

Clustering categorical data

I have a dataset with categorical features. I want to segment the data using clustering techniques. What are the possible choices for this scenario, given that the data has categorical features? Is there any variation of k-means which can be…
user3198880 • 29 • 1 • 1 • 2
1 vote, 0 answers

Which model for an ordinal-scaled dependent variable?

I want to use several models to find a relation between life satisfaction and several independent variables such as children, unemployment, marriage and so on. My dependent variable has an ordinal scale [1:10] and the independent variables have nominal and ordinal…
René • 11 • 1