Statistics is a scientific approach to inductive inference and prediction based on probabilistic models of the data. By extension, it covers the design of experiments and surveys to gather data for this purpose.
Questions tagged [statistics]
1129 questions
11 votes, 2 answers
Why use bootstrapping?
The wiki page for bootstrapping says that you use it in the case where the underlying distribution is unknown. Why is bootstrapping, or sampling with replacement, better than just calculating the variance and other properties from the data directly?

sebastianspiegel
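To make the question concrete, here is a minimal sketch of a bootstrap estimate of the standard error of a sample median, using only the Python standard library. The data, the helper name `bootstrap_se`, and the number of resamples are illustrative assumptions, not taken from the question.

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.median, n_resamples=2000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # draw n points with replacement
        estimates.append(stat(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9]  # made-up data
se_median = bootstrap_se(sample)
print(f"bootstrap SE of the median: {se_median:.3f}")
```

The median is used here because, unlike the mean, it has no simple textbook formula for its standard error, which is one kind of situation where resampling can be convenient.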
7 votes, 2 answers
Data Science as a Social Scientist?
As I am very interested in programming and statistics, Data Science seems like a great career path to me: I like both fields and would like to combine them. Unfortunately, I studied political science and hold a non-statistical-sounding Master's degree. I…

Christian Sauer
6 votes, 3 answers
Are Undergraduate Statistics Concepts Used in Practice?
A question for more experienced data scientists: have you ever used the t-test, ANOVA, Wilcoxon, etc.?
Basically, my question is: do you perform inference tasks, or purely prediction tasks (machine learning)?
5 votes, 2 answers
What is a logworth statistic, and how useful is it?
My teacher mentioned it today, and there are almost no good search results for it, other than one mention each in the SAS and JMP documentation.
It says it is -log10(p-value), but there are almost no explanations of this online. Also it seems like…

Gabriel Fair
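The definition quoted in the question, -log10(p-value), can be sketched in a few lines. The function name `logworth` and the example p-values are illustrative assumptions; this is not the SAS or JMP implementation.

```python
import math

def logworth(p_value):
    """Return -log10(p_value); larger values correspond to smaller p-values."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log10(p_value)

# Illustrative p-values only.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: logworth = {logworth(p):.3f}")
```

On this scale, a p-value of 0.01 corresponds to a logworth of 2, so very small p-values become moderately sized positive numbers that are easier to compare or plot.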
5 votes, 1 answer
Standardize numbers for ranking ratios
I'm trying to rank some percentages. I have numerators and denominators for each ratio. To give a concrete example, consider the ratio total graduates / total students in a school.
But the issue is that the total number of students varies over a wide range…

Rohit Mittal
5 votes, 2 answers
Methods for standardizing / normalizing different rank scales
I know the usual approach to standardizing data is to subtract the mean and divide by the standard deviation, but I'm interested to know if there are more appropriate methods for this kind of discrete data. Consider the following case.
I have 5…

Climbs_lika_Spyder
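For reference, the baseline the question mentions (subtract the mean, divide by the standard deviation) looks like the sketch below. The helper name `z_scores` and the example ranks are assumptions for illustration; the sketch does not address whether this is appropriate for discrete rank data, which is what the question asks.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]

ranks = [1, 2, 3, 4, 5]  # hypothetical ranks on a 1-to-5 scale
scaled = z_scores(ranks)
print([round(z, 3) for z in scaled])
```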
4 votes, 1 answer
How to find the probability of one or more events happening from an incomplete data set
I have a dataset that gives information of a population. For instance, I know the fraction of people that are males (M) and that are within a certain age range (A), P(M & A), and then I know the fraction of males that live in a certain area (L), P(M…

Brian
4 votes, 4 answers
Statistics - Train and test data split
How much data should we use for training, and how much for testing? Can anyone explain why it always seems to be a 70:30 or 80:20 ratio?

Shyama
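As an illustration of what an 80:20 split means in practice, here is a minimal standard-library sketch. The helper name `split_train_test`, the seed, and the toy data are assumptions; the split fraction is a parameter, not a recommendation.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # toy data set of 100 rows
train, test = split_train_test(rows, test_fraction=0.2)
print(len(train), len(test))
```

Changing `test_fraction` to 0.3 gives the 70:30 variant mentioned in the question.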
2 votes, 1 answer
Compare between similar and dissimilar couples of instances
I label couples of similar and dissimilar instances based on user behavior.
Each instance has a lot of features.
I have a few ways of labeling the couples.
I now want to evaluate which of the labeling methods produces the most homogeneous distribution in…

anat
2 votes, 1 answer
How can I show the relations between travel destinations?
I'm trying to do a project about email marketing. I work at a tourism company and I want to suggest the best destinations to our clients. But I need to see the relations between destinations.
Example: How many people visited Dublin and…

Uygar Yologlu
2 votes, 2 answers
How should I create a single score with two values as input?
I have two series of values, a and b, as inputs and I want to create a score, c, which reflects both of them equally. The distributions of a and b are below.
In both cases, the x-axis is just an index.
How should I go about creating an equation c =…

Eric Baldwin
2 votes, 2 answers
Correlation between time to event data and continuous data
I want to measure the correlation between survival time, which is time-to-event data, and the patient's activity count, which is measured on a continuous scale. What type of correlation coefficient is available to measure the strength of these two…

ASJRM
2 votes, 1 answer
How to test the influence of a feature on conversion?
I have a user journey where I have data of the format:
userID, did_interact_with_feature(0/1), did_convert(0/1)
I want to test the hypothesis that a user who engages with the feature is more likely to convert.
Now I can get the % of…

Ronak Agrawal
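One common way to compare conversion rates between two groups is a two-proportion z-test. The sketch below is a swapped-in illustration, not the asker's method; the function name `two_proportion_z_test` and all counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equality of two conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: users who interacted with the feature vs. users who did not.
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=90, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Because users choose whether to interact with the feature, a significant difference here would indicate association rather than a causal effect of the feature.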
2 votes, 0 answers
How to track user given some guaranteed unique but deletable data and some possibly conflicting but non-deletable data?
I am trying to track users reliably on my website so that if they are abusive, they can be banned and not come back easily (obviously this can be bypassed with TOR and such, but most trolls don't care that much). I have some data that can be set…

Robert Moore
2 votes, 1 answer
Using Diebold-Mariano test
I've got predicted results from two different types of neural networks. Now I would like to run significance testing on both of the results to prove that they do not have equal predictive accuracy. I've learnt that the only tool in the game for this…

JannaBotna