
How far are the Mode and the Median of the Log-Normal distribution from behaving as Linear functions?

Intro_______________

Recently I asked a question where I later figured out I was requiring that the mode $\nu[X]$ of a distribution behave as if it were a linear function, $\nu\left[\sum_i^N a_iX_i\right]=\sum_i^N a_i\nu[X_i]$, which I know is not true in general.

But if the random variables $X_i$ all follow the same symmetric distribution, then $\nu = \mu = m$, where $\mu[X]$ is the mean and $m[X]$ is the median ($m$ is the value that splits the probabilities as $P(X\geq m) = P(X\leq m) = \frac12$).

Since the mean is a linear operator, $\mu\left[\sum_i^N a_iX_i\right]=\sum_i^N a_i\mu[X_i]$, under this symmetric scenario I think it should also be true that $\nu\left[\sum_i^N a_iX_i\right]=\sum_i^N a_i\nu[X_i]$ and $m\left[\sum_i^N a_iX_i\right]=\sum_i^N a_im[X_i]$ (because $\mu$, $\nu$, and $m$ are all homogeneous of degree $1$), so there are at least some conditions under which they can be split as a weighted sum. A quick numerical check of this symmetric case is sketched below.
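A minimal Monte Carlo sketch of the symmetric case, using normal variables (where mean = median = mode); all weights and parameters here are invented purely for illustration:

```python
# Minimal Monte Carlo sketch: for symmetric RVs (normals here) the median
# of a weighted sum coincides with the weighted sum of medians.
# Weights and parameters below are arbitrary, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.2, 0.3, 0.5])      # weights, sum to 1
mu = np.array([1.0, -2.0, 4.0])    # means = medians = modes of each X_i
sd = np.array([0.5, 1.5, 1.0])

X = rng.normal(mu, sd, size=(1_000_000, 3))  # each row: (X_1, X_2, X_3)
S = X @ a                                    # weighted sum

print(np.median(S))   # empirical median of the sum, ~ 1.6
print(a @ mu)         # weighted sum of the medians = 1.6
```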

Question__________________

If each variable $X_i$ follows a (not necessarily identical) log-normal distribution with its own parameters, $X_i \sim \text{Lognormal}(\mu_i,\ \sigma_i)$ (so each has its individual $\nu[X_i]=\nu_i$ and $m[X_i]=m_i$), and for some real-valued weights $0\leq a_i\leq 1$ such that $\sum\limits_{i=1}^N a_i = 1$, I want to know how far each of the following is from behaving as a linear function:

  1. $$\nu\left[\sum_{i=1}^N a_i X_i\right] \overset{?}{\approx}\sum_{i=1}^N a_i\ \nu\left[X_i\right]=\sum_{i=1}^N a_i\ \nu_i$$
  2. $$m\left[\sum_{i=1}^N a_i X_i\right] \overset{?}{\approx}\sum_{i=1}^N a_i\ m\left[X_i\right]=\sum_{i=1}^N a_i\ m_i$$
  • Could we say something about the left-hand sides (LHS) always being bigger/smaller than the right-hand sides (RHS)?
  • Are there any inequalities bounding how far the LHS can spread from the RHS?
  • Is there any way to split the LHS expressions somehow, like a known formula?
  • Since the log-normal distribution is positively skewed (so non-symmetric): are there any conditions under which the RHS could be considered an approximation of the LHS?

PS: If you are currently an undergraduate student, I would really appreciate it if you could share this question with your probability/statistics teachers.


Added later

After the answer by @Amir I realized that, without assuming anything regarding independence, correlations, or unimodality, one can do the following: $$\begin{array}{r c l} \left|\sum a_i m[x_i]-m\left[\sum a_i x_i\right]\right| & = & \left|\sum a_i m[x_i]+E\left[\sum a_i x_i\right]-E\left[\sum a_i x_i\right]-m\left[\sum a_i x_i\right]\right| \\ & \overset{\text{triangle ineq.}}{\leq} & \left|E\left[\sum a_i x_i\right]-m\left[\sum a_i x_i\right]\right|+\left|\sum a_i m[x_i]-E\left[\sum a_i x_i\right]\right| \\ & \overset{|\mu - m|\leq \sigma}{\leq} & \sqrt{\text{Var}\left[\sum a_i x_i\right]}+\left|\sum a_i\left(m[x_i]-E[x_i]\right)\right| \end{array}$$ by the linearity of the expected value. All the values on the RHS can be taken from the individual variables' distributions. From this last result one can infer conditions for the two terms to be near each other, and it also tells us that the mistaken formula from this other question could be useful at least as a selection figure, since it strongly penalizes those variables that make $\sum a_i m[x_i]$ and $m\left[\sum a_i x_i\right]$ drift apart (note it was done for the mode, but an identical construction can be made for the median). A numerical check of this bound is sketched below.
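A minimal simulation sketch of this bound for *dependent* log-normals, built from correlated normals (the correlation and parameters are invented for illustration):

```python
# Sketch: checking the distribution-free bound above for dependent
# log-normals built from correlated normals. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.5, 0.5])
mu = np.array([0.0, 1.0])
sig = np.array([0.8, 0.4])
rho = 0.7                                      # correlation of Y_1, Y_2

cov = np.array([[sig[0]**2,             rho * sig[0] * sig[1]],
                [rho * sig[0] * sig[1], sig[1]**2            ]])
Y = rng.multivariate_normal(mu, cov, size=1_000_000)
S = np.exp(Y) @ a                              # weighted sum of log-normals

m_i = np.exp(mu)                               # individual medians
E_i = np.exp(mu + sig**2 / 2)                  # individual means

lhs = abs(a @ m_i - np.median(S))
rhs = np.sqrt(S.var()) + abs(a @ (m_i - E_i))
print(lhs, rhs, lhs <= rhs)                    # the bound should hold
```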

Do you think this formula could be improved for log-normally distributed variables without assuming independence? I already know it can be improved by assuming unimodality, but it would be better not to take that as an assumption.

Joako

1 Answer


I could obtain lower and upper bounds for the median of the sum of $X_1,\dots, X_N$ that have log-normal distributions.

By definition, for any $i$, $X_i=e^{Y_i}$ with $Y_i \sim N \left (\mu_i, \sigma^2_i \right )$. We also know that

$$\mathbb{E}[X_i]=e^{\mu_i+\frac{\sigma^2_i}{2}}, \text{var}[X_i]=(e^{\sigma^2_i}-1)e^{2\mu_i+\sigma^2_i}, m_i=m[X_i]=e^{\mu_i}.$$
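These closed forms are easy to tabulate; a small helper sketch (numpy only, the function names are mine):

```python
# Helpers for the closed-form quantities quoted above,
# for X ~ Lognormal(mu, sigma^2). Function names are mine.
import numpy as np

def ln_mean(mu, s2):
    return np.exp(mu + s2 / 2)                       # E[X]

def ln_var(mu, s2):
    return (np.exp(s2) - 1) * np.exp(2 * mu + s2)    # var[X]

def ln_median(mu, s2):
    return np.exp(mu)                                # m[X]
```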

Lower bound:

$$\sum_{i=1}^{N}a_im_i\le m \left [\sum_{i=1}^{N}a_iX_i \right] $$

For simplicity, I provide the proof for $N=2$.

Generally, by convexity of $e^y$ (the tangent-line bound at $\mu_i$), we have

$$ a_1e^{Y_1}+a_2e^{Y_2}\ge a_1e^{\mu_1}(Y_1-\mu_1+1)+a_2e^{\mu_2}(Y_2-\mu_2+1)=Z$$

Since $Z$ is a linear combination of the $Y_i$, it is normal whenever the $Y_i$ are jointly normal, so its median equals its mean:

$$m[Z]=\mathbb{E} \left[ a_1e^{\mu_1}(Y_1-\mu_1+1)+a_2e^{\mu_2}(Y_2-\mu_2+1) \right] =a_1e^{\mu_1}+a_2e^{\mu_2}=a_1m_1+a_2m_2 $$

Hence, since $a_1e^{Y_1}+a_2e^{Y_2}\ge Z$ pointwise,

$$\mathbb{P} \left ( a_1e^{Y_1}+a_2e^{Y_2}\ge m[Z] \right )\ge \mathbb{P} \left ( Z\ge m[Z] \right )=\frac{1}{2},$$

which yields $m[a_1X_1+a_2X_2]\ge m[Z]=a_1m_1+a_2m_2$.
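The lower bound is easy to probe numerically; a sketch for $N=2$, trying a few correlations between $Y_1$ and $Y_2$ (all parameters invented for illustration):

```python
# Sketch: Monte Carlo probe of the lower bound for N = 2, with the
# underlying normals correlated or not. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
a1, a2 = 0.3, 0.7
mu1, mu2, s1, s2 = 0.0, 0.5, 1.0, 0.6

for rho in (0.0, 0.9, -0.9):
    cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
    Y = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000)
    S = a1 * np.exp(Y[:, 0]) + a2 * np.exp(Y[:, 1])
    lower = a1 * np.exp(mu1) + a2 * np.exp(mu2)      # a1*m1 + a2*m2
    print(rho, np.median(S) >= lower)                # expect True
```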

Upper bound:

To derive a useful result, here I assume that $X_1,\dots, X_N$ are independent.

$$m \left [\sum_{i=1}^{N}a_iX_i \right] \le \sum_{i=1}^{N}a_i(1+B_i)m_i $$

with

$$B_i= e^{\frac{\sigma^2_i}{2}}\left (\sqrt{e^{\sigma^2_i}-1}+1 \right)-1$$

We know that

$$m[a_1X_1+a_2X_2] \le \mathbb{E} \left[ a_1X_1+a_2X_2 \right]+\sqrt {\text{var} \left[ a_1X_1+a_2X_2 \right]}$$

From

$$\mathbb{E} \left[ a_1X_1+a_2X_2 \right]= a_1e^{\frac{\sigma^2_1}{2}}m_1+a_2e^{\frac{\sigma^2_2}{2}}m_2$$

it follows that

$$m[a_1X_1+a_2X_2] \le a_1m_1+a_2m_2 + \sqrt {\text{var} \left[ a_1X_1+a_2X_2 \right]}+a_1 m_1 \left (e^{\frac{\sigma^2_1}{2}}-1 \right )+a_2 m_2 \left (e^{\frac{\sigma^2_2}{2}}-1 \right)$$

The final result is obtained by noting that

$$\text{var} \left[ a_1X_1+a_2X_2 \right]= a_1^2 m_1^2 (e^{\sigma^2_1} -1)e^{\sigma^2_1} +a_2^2 m_2^2 (e^{\sigma^2_2} -1)e^{\sigma^2_2}$$

and that $\sqrt{x+y}\le \sqrt{x}+\sqrt{y}$.

Hence, for the independent case we have (recall that the lower bound remains valid for the dependent case)

$$\sum_{i=1}^{N}a_im_i \le m \left [\sum_{i=1}^{N}a_iX_i \right] \le \sum_{i=1}^{N}a_i(1+B_i)m_i . $$

This shows that the error of approximating $m \left [\sum_{i=1}^{N}a_iX_i \right]$ by $\sum_{i=1}^{N}a_im_i$ is between $0$ and $\sum_{i=1}^{N}a_i B_i m_i$, which varies with the values of $m_1,\dots,m_N$. When all $\sigma^2_i$ are very small, we have $B_i \approx 0$, and so $\sum_{i=1}^{N}a_im_i$ is a good approximation for the median.
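For a quick feel of how tight this is, a simulation sketch for the independent case (parameters are arbitrary illustrations):

```python
# Sketch: the two-sided bound vs. a simulated median, independent case.
# Parameters are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(3)
a = np.array([0.25, 0.25, 0.5])
mu = np.array([0.0, 0.3, -0.2])
s2 = np.array([0.1, 0.5, 1.0])                  # sigma_i^2

m = np.exp(mu)                                  # medians m_i
B = np.exp(s2 / 2) * (np.sqrt(np.exp(s2) - 1) + 1) - 1

X = np.exp(rng.normal(mu, np.sqrt(s2), size=(1_000_000, 3)))
med = np.median(X @ a)

print(a @ m, "<=", med, "<=", a @ ((1 + B) * m))
```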

Let us use the Fenton-Wilkinson approximation, in which the distribution of a sum of log-normally distributed RVs is approximated by another log-normal whose parameters are set by the method of moments. Using this approximation, the approximate median is

$$\tilde{m}= e^\tilde{\mu}= \frac {\left (\sum_{i=1}^{N}a_im_iC_i\right )^2 }{ \sqrt { \left (\sum_{i=1}^{N}a_im_iC_i \right )^2 +\sum_{i=1}^{N}a^2_im^2_iC^2_i (C^2_i-1) }} \le \sum_{i=1}^{N}a_im_iC_i $$

with $C_i=e^{\frac{\sigma^2_i}{2}}.$ Again, when all $\sigma^2_i$ are very small, we have $C_i \approx 1$, and so $\sum_{i=1}^{N}a_im_i$ is a good approximation for the median.
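A sketch of this approximate median next to a simulated one (independent case; the numbers are illustrative):

```python
# Sketch: Fenton-Wilkinson approximate median vs. simulation.
# Parameters are illustrative; independence is assumed.
import numpy as np

rng = np.random.default_rng(4)
a = np.array([0.6, 0.4])
mu = np.array([0.0, 1.0])
s2 = np.array([0.3, 0.8])

m, C = np.exp(mu), np.exp(s2 / 2)
M = a @ (m * C)                                 # E[sum]
V = np.sum(a**2 * m**2 * C**2 * (C**2 - 1))     # var[sum]
m_fw = M**2 / np.sqrt(M**2 + V)                 # approximate median

X = np.exp(rng.normal(mu, np.sqrt(s2), size=(1_000_000, 2)))
print(m_fw, np.median(X @ a))                   # should be close
```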

Another interesting observation can be obtained by assuming that all $\sigma^2_i$ are the same and very large. In this case, using $\|x\|_2 \leq \|x\|_1 \leq \sqrt{N} \|x\|_2 $ (with $x_i = a_i m_i$), which is tight,

$$ \sum_{i=1}^{N}a_im_i \le \tilde{m} \approx \frac {\left (\sum_{i=1}^{N}a_im_i\right )^2 }{ \sqrt { \sum_{i=1}^{N}a^2_im^2_i}} \le \sqrt {N} \left ( \sum_{i=1}^{N}a_im_i \right ) $$

This extreme analysis shows that when $\sigma^2_i$ are not small, $\sum_{i=1}^{N}a_im_i$ is not a good approximation for the median.

Moreover, it leads us to guess that the following inequality generally holds for independent RVs with log-normal distributions:

$$ \sum_{i=1}^{N}a_im_i \le m \left [\sum_{i=1}^{N}a_iX_i \right] \le \sqrt {N} \left ( \sum_{i=1}^{N}a_im_i \right ). $$
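This guess is easy to spot-check by simulation; the sketch below probes it over a handful of random parameter draws (a probe, not a proof):

```python
# Sketch: spot-checking the conjectured sqrt(N) sandwich by simulation.
# Random draws only probe the claim; they do not prove it.
import numpy as np

rng = np.random.default_rng(5)
for trial in range(5):
    N = int(rng.integers(2, 6))
    a = rng.dirichlet(np.ones(N))               # random weights, sum to 1
    mu = rng.normal(0.0, 1.0, N)
    s2 = rng.uniform(0.1, 2.0, N)
    X = np.exp(rng.normal(mu, np.sqrt(s2), size=(200_000, N)))
    med, sm = np.median(X @ a), a @ np.exp(mu)
    print(sm <= med <= np.sqrt(N) * sm)         # expect True
```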

Amir
  • Thank you for taking the time to answer. I will be attentive to the explanation (actually, I am more interested in understanding how I could handle these terms with summations as arguments). – Joako Dec 14 '23 at 22:36
  • @Joako I just added more details. Hope they are helpful! – Amir Dec 15 '23 at 15:35
  • Indeed it has been useful for improving my understanding. I am interested in the case where the RVs are not necessarily independent, and after your answer I realized I could do the following: $$\begin{array}{r c l} \left| m\left[\sum a_i x_i\right]-\sum a_i m[x_i] \right| & = & \left| m\left[\sum a_i x_i\right]+E\left[\sum a_i m[x_i]\right]-E\left[\sum a_i m[x_i]\right]-\sum a_i m[x_i] \right| \\ &\leq & \left| m\left[\sum a_i x_i\right]-E\left[\sum a_i m[x_i]\right] \right| + \left| E\left[\sum a_i m[x_i]\right]-\sum a_i m[x_i]\right| \\ &\leq & b\sqrt{\text{Var}\left[\sum a_i m[x_i]\right]} +\left|\sum a_i(E[x_i]-m[x_i])\right| \end{array}$$ with $b=1$ in general, or $b = \sqrt{0.6}$ if I assume the weighted sum distrib. is unimodal – Joako Dec 15 '23 at 23:56
  • I could evaluate all terms on the RHS without knowing the distribution of the weighted sum. But on the other hand, I think these bounds are too wide, considering what is said in Wikipedia: that the distribution of the sum of LogNormal RVs can be approximated by another LogNormal with some specific mean and variance. – Joako Dec 15 '23 at 23:58
  • I just realized I messed up the ineqs of the first comment: it should be $E[\sum a_i x_i]$ instead of $E[\sum a_i m[x_i]]$ so at the end the upper bound is $$b\sqrt{\text{Var}[\sum a_i x_i]}+|\sum a_i (E[x_i]-m[x_i])|$$ – Joako Dec 16 '23 at 19:45
  • @Joako i) You can see that in the proof of the lower bound, the RVs are not necessarily independent. ii) Moreover, the sum of two independent RVs with unimodal distributions is not necessarily unimodal; you may see https://math.stackexchange.com/questions/70651/is-the-sum-of-independent-unimodal-random-variables-still-unimodal. Hence, you cannot use $b=\sqrt{.6}$ instead of $b=1$ to improve the upper bound unless the unimodality of the sum can be proven for log-normal distributions. Unfortunately, we know little about the distribution of the sum. – Amir Dec 17 '23 at 12:01
  • @Joako iii) The Fenton-Wilkinson approximation, in which the sum is approximated by another log-normal where the new parameters are set based on the moment method, or other methods can be used, see https://stats.stackexchange.com/questions/238529/the-sum-of-independent-lognormal-random-variables-appears-lognormal and https://www.soa.org/globalassets/assets/files/static-pages/research/arch/2009/arch-2009-iss1-dufresne.pdf for a summary; however, none of them is accurate in the sense that no error bound can be obtained. – Amir Dec 17 '23 at 12:03
  • @Joako iv) Based on the Fenton-Wilkinson approximation I just added more comments on when the sum of the medians can be a good approximation for the median of the sum. – Amir Dec 17 '23 at 12:04
  • Thanks for the comments. I think it is interesting that the difference is bounded by $\sqrt{\text{Var}\left[\sum a_i x_i\right]}+|\sum a_i\left(E[x_i]-m[x_i]\right)|$, since it tells me not only that I need small deviations and variances/correlations, but also, as I found in this other question (done for the mode, but it works the same way if I choose the median), that if I assume it can be approximated as a linear operator and use the resulting variance as a selection figure, it will punish those points where the difference is higher – Joako Dec 17 '23 at 21:29
  • @Joako hope the results are useful to determine when the median behaves as a linear operator for log-normal distributions. See also https://math.stackexchange.com/a/9621/1231520. – Amir Dec 18 '23 at 19:09
  • What I find surprising is that intuitively neither the median nor the mode should be linear at all, yet when the distributions are symmetric it somehow happens to be true - I was expecting to find some reason behind this, but it looks too hard to handle with enough accuracy: this is where the Fenton approx. comes in. I tried to look for the mode by differentiating the convolution of $2$ lognormal RVs as depicted by Dufresne, thinking that the result could be expanded as another combination against the newly found mode and then extended, but I got stuck in the math. – Joako Dec 18 '23 at 22:14
  • @Joako Yes, an exact study of the case of the mode seems more challenging, but it can be approximately analyzed following a similar approach to the one I used in the last part of my answer, based on the Fenton approx. – Amir Dec 18 '23 at 23:57