
I am trying to understand if averages of different variables can be combined together to produce "better" estimates. For example - suppose there are 100 high schools, and we randomly select students (i.e. sample) from each of these 100 high schools. Suppose we find that in these randomly selected students (this is all the information we have - we only have the aggregate summaries and not data on individual students):

  • The graduation rate for males is 55% and for females 65%
  • The graduation rate for students who study more than 10 hours a week is 80%, and for students who study less than 10 hours a week it is 60%

Suppose for one of these high schools that we randomly studied, we would like to "interpolate" and find out how many students might graduate. We know the population of this school:

  • There are 500 males and 500 females
  • There are 400 students that study more than 10 hours a week and 600 students that study less than 10 hours a week

We are interested in estimating how many students will graduate.

  • Using the gender as a variable, we could say that $500 \times 0.55 + 500 \times 0.65 = 600$ students are expected to graduate on average
  • Using hours studied as a variable, we could say that $400 \times 0.8 + 600 \times 0.6 = 680$ students are expected to graduate on average

But could we take the average for both of these numbers and say that $(600 + 680)/2 = 640$ students are expected to graduate on average? Would this be a more "reliable" estimate that averages out possible errors in the initial graduation rates we used to base our estimates on?
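The arithmetic above can be reproduced with a short Python sketch. The helper name `expected_graduates` is only illustrative, and the equal 0.5/0.5 weighting is the assumption being questioned here, not an established rule.

```python
def expected_graduates(group_sizes, group_rates):
    """Expected number of graduates: sum of (group size * group graduation rate)."""
    return sum(n * p for n, p in zip(group_sizes, group_rates))

# Estimate from the gender split: 500 males at 55%, 500 females at 65%.
by_gender = expected_graduates([500, 500], [0.55, 0.65])

# Estimate from the hours-studied split: 400 at 80%, 600 at 60%.
by_hours = expected_graduates([400, 600], [0.80, 0.60])

# Unweighted average of the two estimates (equal weights are an assumption).
simple_average = (by_gender + by_hours) / 2

print(round(by_gender), round(by_hours), round(simple_average))
```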

I am trying to figure out if this logic is correct (e.g. is this mathematically correct?) - can someone please comment on this? I also wonder if this method might somehow allow you to assign some measure of "risk" to this estimate, e.g. 640 plus/minus "c"? (Confidence interval without std?)

Thank you!

stats_noob
  • If that is correct, then you had to assume that "gender" and "hours studied" have equivalent contribution to graduation rate. My intuition insists "hours studied" is a more significant factor. I might be wrong, though. – AlvinL Sep 04 '22 at 07:16
  • @ AlvinL : thank you for your reply! I just thought of this general example - I agree, "hours studied" is a more significant factor. I also agree - the way I set up this problem assumes that "gender" and "hours studied" have equivalent contributions. – stats_noob Sep 04 '22 at 07:19
  • In the very general sense - suppose you have some generic variables X, Y, Z. Let's say that you have the breakdown of Z in terms of X, and the breakdown of Z in terms of Y. In such a case, is the method I have presented valid? thank you so much! – stats_noob Sep 04 '22 at 07:19
  • (A) The "collected" statistics seem wrong & you will get incorrect conclusions using that. (B) If the statistics were correctly collected, you can use the average of averages but the conclusions will still all be "equivalent" within some bounds. (C) You will get better conclusions & better bounds if you had collected individually "males + study less than 10 hours" & "females + study less than 10 hours" & "males + study more than 10 hours" & "females + study more than 10 hours" & not have to take average of averages. (D) When taking average of averages, you have to consider correct weights. – Prem Sep 04 '22 at 07:28
  • @ Prem: Thank you for your reply! Can you please explain the following points? – stats_noob Sep 04 '22 at 07:30
  • (A) : Why does the "collected" statistic seem wrong and why will this produce incorrect conclusions? – stats_noob Sep 04 '22 at 07:30
  • (B): How do you know that the average of the averages will still be "equivalent" within some bound? – stats_noob Sep 04 '22 at 07:31
  • (C) : I agree that this would be ideal, but for the sake of this question, let's assume that this information is missing. – stats_noob Sep 04 '22 at 07:31
  • (D) : Suppose you have no knowledge of these weights and you decide to use 0.5 and 0.5 . If you have no other knowledge, is this reasonable? – stats_noob Sep 04 '22 at 07:32
  • Thank you so much! – stats_noob Sep 04 '22 at 07:32
  • In real life, you wouldn't do this. This is something known as "Simpson's paradox" (https://www.britannica.com/topic/Simpsons-paradox). In real life, you would use a model that incorporates both variables. The discrete nature of the response suggests logistic regression. It also makes little sense to take the average of two different types of variables (in the same way physicists would discourage summing variables with different units; note that your variables aren't students but "expected number of males to graduate" and "expected number of students who study 10 or more hours to graduate"). – William M. Sep 04 '22 at 16:37
  • @ William M : Thank you for your reply! I agree with the points that you brought up. But in my example, all I have are these summary statistics - I do NOT have access to information on individual students. I just have access to the overall graduation rate with respect to these variables. In such a case (i.e. with this limited information), is the approach I have suggested ok? – stats_noob Sep 04 '22 at 16:55
  • Given any $m \times n$ matrix, the mean of the means of the $m$ rows is equal to the mean of the means of the $n$ columns and also the mean of all $mn$ numbers in the matrix. – Geoffrey Trang Sep 04 '22 at 17:20
  • @ Ross Millikan: thank you for the Dilbert comic! :) – stats_noob Sep 04 '22 at 17:22
  • If we had the model that incorporates both parameters, we could give the result in terms of an unknown (third? fourth?) parameter and see when the OP approximation is plausible and when it is not. And if we are smart, what is the substitute for the "naive approximation" in this case – Andrea Marino Sep 04 '22 at 20:15
  • Your comments towards me used "@ Prem" with a space , & I was not notified , @antonoyaro8 , I saw those comments now & I will respond in about 6 hours. I think the "Comment Box" will be too small , hence I will use the "Answer Box" to include Pictures to "Elaborate" my thoughts. – Prem Sep 05 '22 at 05:27
  • @ Prem: sorry for the typo - thank you so much for everything! I look forward to what you will write! :) – stats_noob Sep 06 '22 at 03:59
  • @ Geoffrey Trang: Thank you so much for your reply! Can you please tell me the relevance of this information regarding the "m x n" matrix? thank you so much! – stats_noob Sep 06 '22 at 03:59
  • There is this question (not sure it counts as a duplicate, though): https://math.stackexchange.com/q/95909/114279 Also lots of questions over at Cross Validated, e.g. https://stats.stackexchange.com/search?q=%5Bmean%5D+mean+of+means – Darren Cook Sep 07 '22 at 07:15

2 Answers


In a nutshell, you can do better than just taking the average of the two forecasts by using a weighted average that places a larger weight on the more informative model.

The solution in the general case where you have two unbiased forecasts $f_1, f_2$ for an unknown $Y$ is as follows. Let $e_1 = Y - f_1, e_2 = Y - f_2$ be the forecast errors. Suppose we have estimates for the following quantities: $\sigma_1^2 = Var(e_1)$, $\sigma_2^2 = Var(e_2)$, and $\rho$, the correlation of the errors $e_1$ and $e_2$. Even if we can only make rough guesses of these quantities, the solution may be able to improve on just taking the average of the forecasts.

Let $w$ be the weight for $f_1$, so our forecast is $f = wf_1 + (1 - w)f_2$. Our error is then $$e = Y - f = w(Y - f_1) + (1 - w)(Y - f_2) = we_1 + (1 - w)e_2.$$ We want to choose $w$ to minimise the variance of the error $$Var(e) = w^2 \sigma_1^2 + (1 - w)^2 \sigma_2^2 + 2w(1 - w)\rho\sigma_1\sigma_{2}.$$

Differentiating with respect to $w$ and then solving to find the minimum of $Var(e)$ gives $$w^* = \frac{\sigma_2^2 - \rho \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho \sigma_1 \sigma_2} = \frac{1 - \rho \sigma_1 / \sigma_2}{\sigma_1^2/\sigma_2^2 + 1 - 2\rho \sigma_1/ \sigma_2}.$$
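For completeness, the differentiation step can be written out. This only expands the expression for $Var(e)$ given above and adds nothing new: $$\frac{d}{dw} Var(e) = 2w\sigma_1^2 - 2(1 - w)\sigma_2^2 + 2(1 - 2w)\rho\sigma_1\sigma_2 = 0 \quad\Longrightarrow\quad w\left(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2\right) = \sigma_2^2 - \rho\sigma_1\sigma_2.$$ The second derivative is $2(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2)$, which is non-negative for any $|\rho| \le 1$ and strictly positive unless $\rho = 1$ and $\sigma_1 = \sigma_2$, so the stationary point is a minimum.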

You only have aggregate statistics so you won't be able to explicitly calculate the correlation $\rho$, but if you have aggregate statistics on the male-female split and the "hours studied" split then you can estimate $\sigma_1^2$ and $\sigma_2^2$. In your example, let $f_1$ be the forecast from gender, and $f_2$ the forecast from hours studied. Also let $Y$ be the number of students who graduated. If this school is like other schools, with each male having a $0.55$ chance of graduating and each female having a $0.65$ chance of graduating, then for $e_1$ we model $Y$ as a sum of two independent binomial random variables, $Bin(500, 0.55) + Bin(500, 0.65)$, for the males and females. The variance of a $Bin(n, p)$ distribution is $np(1-p)$, and the prediction $f_1$ is not random in this case if we consider the distribution of pupils into male-female to be fixed. This gives a variance estimate of $\sigma_1^2 = 500 \times 0.55 \times 0.45 + 500 \times 0.65 \times 0.35 = 237.5$. For $e_2$ we model $Y$ as $Bin(400, 0.8) + Bin(600, 0.6)$, so $\sigma_2^2 = 400 \times 0.8 \times 0.2 + 600 \times 0.6 \times 0.4 = 208$. This gives $\sigma_1/\sigma_2 = 1.069$.
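A minimal Python sketch of these variance estimates, under the same binomial model with independent groups and fixed group sizes. The function name `binomial_sum_variance` is only illustrative.

```python
def binomial_sum_variance(group_sizes, group_rates):
    """Variance of a sum of independent Bin(n, p) counts: sum of n * p * (1 - p)."""
    return sum(n * p * (1 - p) for n, p in zip(group_sizes, group_rates))

# Gender split: Bin(500, 0.55) + Bin(500, 0.65)
var_gender = binomial_sum_variance([500, 500], [0.55, 0.65])

# Hours-studied split: Bin(400, 0.8) + Bin(600, 0.6)
var_hours = binomial_sum_variance([400, 600], [0.80, 0.60])

# Ratio of standard deviations sigma_1 / sigma_2
sd_ratio = (var_gender / var_hours) ** 0.5

print(round(var_gender, 1), round(var_hours, 1), round(sd_ratio, 3))
```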

As you say, we would expect the "hours studied" factor to be more important than the gender factor. More important factors reduce the variance, so if we had no information beyond what you included in the first part of the question and could only make a guesstimate, I would have chosen something like $\sigma_1 / \sigma_2 \approx 1.5$. The estimate we computed above was $\sigma_1 / \sigma_2 \approx 1.069$. I would expect the gender and studying predictions to be correlated; we could guess $\rho \approx 0.1$. Obviously feel free to replace these numbers with your own if you have more on-the-ground experience. Using the computed ratio $\sigma_1/\sigma_2 = 1.069$ and $\rho = 0.1$ gives $w^* = 0.46$, so something like $0.46 \times 600 + 0.54 \times 680 = 643.2$ might be a better estimate than $640$.
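The weight calculation can be sketched in Python as follows. The function name `optimal_weight` is only illustrative, and $\rho = 0.1$ is the guess made above, not something estimated from data. Without rounding $w^*$ to two decimals, the combined estimate comes out at roughly 643.0 rather than 643.2.

```python
def optimal_weight(sd_ratio, rho):
    """Weight on forecast 1 that minimises Var(e), with sd_ratio = sigma_1 / sigma_2."""
    r = sd_ratio
    return (1 - rho * r) / (r**2 + 1 - 2 * rho * r)

# Ratio computed from the binomial variances, and a guessed correlation.
w = optimal_weight(1.069, 0.1)

# Weighted combination of the gender forecast (600) and hours-studied forecast (680).
combined = w * 600 + (1 - w) * 680

print(round(w, 2), round(combined, 1))
```

Calling `optimal_weight(1.5, 0.1)` instead shows how a larger guessed ratio shifts more weight towards the hours-studied forecast.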

You also asked about making a confidence interval for the estimate. The formula for $Var(e)$ would allow you to compute this, but you would need estimates for $\rho$, $\sigma_1^2$ and $\sigma_2^2$. Assuming the error $e$ is approximately normally distributed, the formula for an approximate 95% confidence interval would be $f \pm 1.96 \sqrt{Var(e)}$.
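As a rough sketch, the interval can be computed in Python by plugging in the numbers used in this answer. The inputs are assumptions rather than measured quantities: the two variances come from the binomial model, $\rho = 0.1$ is a guess, and the factor 1.96 relies on approximate normality. The function name `combined_error_variance` is only illustrative.

```python
import math

def combined_error_variance(w, var1, var2, rho):
    """Var(e) for the weighted forecast f = w * f1 + (1 - w) * f2."""
    return (w**2 * var1
            + (1 - w)**2 * var2
            + 2 * w * (1 - w) * rho * math.sqrt(var1 * var2))

w, rho = 0.46, 0.1          # weight from the answer, guessed correlation
var1, var2 = 237.5, 208.0   # binomial variance estimates

f = w * 600 + (1 - w) * 680
half_width = 1.96 * math.sqrt(combined_error_variance(w, var1, var2, rho))

print(f"{f:.1f} +/- {half_width:.1f}")
```

With these inputs the half-width is a little under 22 students, so the interval is roughly 643 plus or minus 22.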

Alex

I wasn't convinced at the beginning, but I would say yes.

Here is how I see it. There are four groups of people:

  • Diligent male students (group $A$)
  • Diligent female students (group $B$)
  • Non-diligent male students (group $C$)
  • Non-diligent female students (group $D$)

Let us denote by $a,b,c,d$ the respective cardinalities of the groups and by $g_a, g_b, g_c, g_d$ the graduation rates of the four groups. The latter are parameters of the model that we don't have access to and that we are estimating (or combinations of them). I don't consider $a,b,c,d$ as parameters of the model: they are equally unknown, but the randomness in the problem we are looking at is whether students graduate or not; it's not about how many diligent students (male or female) a school has.

If you want to estimate the total expected graduation rate $G$, the formula is $$ G = \frac{ a g_a + b g_b + c g_c + d g_d }{a+b+c+d } $$ You have estimated four variables $G_{ab}, G_{cd}, G_{ac}, G_{bd}$, which are the graduation rates in the various pairs of groups (for example, $G_{ab}$ is the expected graduation rate in the group A+B, that is diligent students, while $G_{ac}$ is the rate in the group A+C, that is male students). The following formula (and analogous others) holds: $$ G_{ab} = \frac{ a g_a + b g_b }{a+b} $$ You are proposing two estimates of $G$:

  1. Use $G_{ab}, G_{cd}$. We get $$ G_1 = \frac{ (a+b) G_{ab} + (c+d) G_{cd}}{a+b+c+d} $$ In terms of the intrinsic parameters of the model we have $$ G_1 = \frac{ (a+b) G_{ab} + (c+d) G_{cd}}{a+b+c+d} = \frac{ (ag_a + b g_b)+ (c g_c + d g_d ) }{a+b+c+d} = G $$
  2. Use $G_{ac}, G_{bd} $. We get $$ G_2 = \frac{ (a+c) G_{ac} + (b+d) G_{bd}}{a+b+c+d} $$ In terms of the intrinsic parameters of the model we have $$ G_2 = \frac{ (a+c) G_{ac} + (b+d) G_{bd}}{a+b+c+d} = \frac{ (ag_a + c g_c)+ (b g_b + d g_d ) }{a+b+c+d} = G $$

On balance, both your calculations are computing the same combination of the intrinsic parameters of the model. It definitely makes sense, when you have multiple estimates of the same variable, to average them.
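The identity $G_1 = G_2 = G$ can be checked numerically with a small Python sketch. The cell sizes and rates below are hypothetical, chosen only so that the margins match the question (A+C and B+D have 500 students each, A+B has 400, C+D has 600). Note that the identity relies on the pooled rates being computed from the same four cells; pooled rates taken from other schools need not satisfy it.

```python
# Hypothetical cell sizes and graduation rates for groups A, B, C, D.
sizes = {"A": 180, "B": 220, "C": 320, "D": 280}
rates = {"A": 0.78, "B": 0.82, "C": 0.55, "D": 0.60}

def pooled_rate(groups):
    """Graduation rate in the union of the given groups, e.g. "AB" for A+B."""
    total = sum(sizes[g] for g in groups)
    return sum(sizes[g] * rates[g] for g in groups) / total

def combine(split):
    """Size-weighted average of the pooled rates over a split such as ["AB", "CD"]."""
    n = sum(sizes.values())
    return sum(sum(sizes[g] for g in groups) * pooled_rate(groups)
               for groups in split) / n

G = pooled_rate("ABCD")      # overall rate from the four cells
G1 = combine(["AB", "CD"])   # estimate using the A+B / C+D split
G2 = combine(["AC", "BD"])   # estimate using the A+C / B+D split

print(round(G, 4), round(G1, 4), round(G2, 4))
```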

I have not written the whole story: to be precise, one should work with four binomial variables $X_a, X_b, X_c, X_d$ of sizes $a,b,c,d$ and probabilities $g_a, g_b, g_c, g_d$. In the end, you are always estimating the expected value of the average $$\frac{ X_a + X_b + X_c + X_d }{a+b+c+d}$$ I hope I am not losing some important detail along the way! This is not my field of expertise.