I am trying to understand whether averages based on different variables can be combined to produce "better" estimates. For example, suppose there are 100 high schools and we randomly sample students from each of these 100 schools. Suppose we find that, among these sampled students (this is all the information we have - we only have aggregate summaries, not data on individual students):
- The graduation rate for males is 55% and for females 65%
- The graduation rate for students who study more than 10 hours a week is 80% and for students who study less than 10 hours a week is 60%
Suppose that for one of the high schools we sampled, we would like to "interpolate" and estimate how many students might graduate. We know the population of this school:
- There are 500 males and 500 females
- There are 400 students that study more than 10 hours a week and 600 students that study less than 10 hours a week
We are interested in estimating how many students will graduate.
- Using gender as a variable, we could say that $500 \times 0.55 + 500 \times 0.65 = 600$ students are expected to graduate on average
- Using hours studied as a variable, we could say that $400 \times 0.8 + 600 \times 0.6 = 680$ students are expected to graduate on average
But could we take the average of these two numbers and say that $(600 + 680)/2 = 640$ students are expected to graduate on average? Would this be a more "reliable" estimate that averages out possible errors in the graduation rates our estimates were based on?
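For concreteness, here is the arithmetic above as a small Python sketch (the variable names are my own labels, and the simple average in the last step is just the naive combination I am asking about, not something I am claiming is statistically justified):

```python
# Aggregate graduation rates observed across the 100-school sample
grad_rate_male, grad_rate_female = 0.55, 0.65
grad_rate_high_study, grad_rate_low_study = 0.80, 0.60

# Known composition of the one school we want to "interpolate" for
n_male, n_female = 500, 500
n_high_study, n_low_study = 400, 600

# Expected number of graduates, estimated from each variable separately
est_by_gender = n_male * grad_rate_male + n_female * grad_rate_female                    # 600.0
est_by_study = n_high_study * grad_rate_high_study + n_low_study * grad_rate_low_study   # 680.0

# Naive combination: simple average of the two estimates
combined = (est_by_gender + est_by_study) / 2                                            # 640.0
print(est_by_gender, est_by_study, combined)
```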
I am trying to figure out whether this logic is correct (i.e. is it mathematically sound?) - can someone please comment on this? I also wonder whether this method might somehow allow you to attach some measure of "risk" to the estimate, e.g. 640 plus/minus "c"? (A confidence interval without a standard deviation?)
Thank you!