
It seems intuitive that the line of best fit for $\{(n,n+\sin n) : n\in \mathbb{Z}\}$ should be $y=x$.

More concretely, it seems like a reasonable conjecture would be:

If $y = m_k x + b_k$ is the line of best fit for the set of points $$\{ (n,n+\sin n) : n\in \mathbb{Z}, |n| \leq k \},$$ then $\lim_{k\to \infty} m_k = 1$ and $\lim_{k\to\infty} b_k = 0$.

Is this conjecture true? And if so, how would one go about proving it? Moreover, if $\{a_n\}$ is a sequence in $\mathbb{R}$ which is uniformly distributed in some compact interval $[A,B]$, how does the line of best fit change when considering the set $\{(n,n+a_n)\}$?

EDIT: Just to clarify, by "line of best fit" I mean using the method of least-squares.
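
For what it's worth, here is a small numerical experiment supporting the conjecture (a sketch using numpy, with `polyfit` as the least-squares fitter; the particular values of $k$ are arbitrary). Note that by oddness $\sum_{|n| \le k} \sin n = 0$, so for these symmetric ranges $b_k$ should vanish up to rounding.

```python
import numpy as np

# Fit y = m_k * x + b_k by least squares to {(n, n + sin n) : |n| <= k}
# and watch m_k -> 1; b_k is 0 up to floating-point error by symmetry.
for k in [10, 100, 1000, 10000]:
    n = np.arange(-k, k + 1)
    y = n + np.sin(n)
    m_k, b_k = np.polyfit(n, y, 1)  # degree-1 least-squares fit
    print(f"k = {k:5d}:  m_k = {m_k:.10f},  b_k = {b_k:.2e}")
```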

Patch
  • Just to confirm: do you mean the "least-squares line of best fit", i.e. the line which minimizes the sum of the squares of the errors in the y-coordinates? – diracdeltafunk Sep 07 '22 at 16:49
  • Yes, this is what I had in mind. Thank you for clarifying. – Patch Sep 07 '22 at 16:57
  • It turns out that there are closed-form sums for $\sin^2(n)$ and $\sin(n)$. This might help you out. (https://math.stackexchange.com/questions/3015509/find-the-sum-of-sin2n), (https://math.stackexchange.com/questions/1119043/where-does-the-sum-of-sinn-formula-come-from). – Doug Sep 07 '22 at 17:13
  • The squared residuals will be divergent, regardless of which line we try to fit to the data set. So, in the traditional sense of the method of least squares, there won't be a line of best fit. That said, I think it would be possible to redefine the problem in a way that is not especially unnatural or trivial so that you get the desired result. You could try looking at the lines of best fit of the sets$$D_N = \{(n, n + \sin(n)) : n \in \Bbb{Z}, |n| \le N\},$$and consider their limits as $N \to \infty$, for example. – Theo Bendit Sep 07 '22 at 17:31
  • @TheoBendit Aren't the $D_N$ sets you're considering exactly the sets I was asking about? I'm a little confused, sorry. – Patch Sep 07 '22 at 20:24
  • @Patch Your set is infinite, but mine are finite. The condition $|n| \le N$ is new; it means we only consider the line of best fit over a finite set, so that we can have finite squared residuals to minimise. – Theo Bendit Sep 07 '22 at 20:27
  • But that's what I said in the second line of my question, where I made my formal conjecture. – Patch Sep 07 '22 at 20:29
  • @Patch Oh right, I see that now. I missed the extra condition in the restatement. – Theo Bendit Sep 07 '22 at 21:20
  • Concerning the general question, which indeed is interesting, I would state it with $[A, B]=[-1, 1]$, so the expected result is the same as in the $\sin(n)$ case. – Giuseppe Negro Sep 08 '22 at 11:50
  • @GiuseppeNegro If you make any headway on the generalized problem, I'd love to see it! I just have no idea where to start with something like that. It would seem some probabilistic arguments would be helpful, and that is certainly not where my experience lies. – Patch Sep 08 '22 at 15:54
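
Regarding the generalised question discussed above, a rough simulation (my own sketch, using i.i.d. uniform $a_n$ as a convenient stand-in for a sequence equidistributed in $[A,B]$) suggests the slope still tends to $1$ while the intercept tends to the mean $(A+B)/2$, which is $0$ for $[A,B]=[-1,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = -1.0, 1.0  # the interval suggested in the comments above
for k in [100, 10_000, 1_000_000]:
    n = np.arange(-k, k + 1)
    a = rng.uniform(A, B, size=n.size)  # i.i.d. proxy for an equidistributed sequence
    m_k, b_k = np.polyfit(n, n + a, 1)
    print(f"k = {k:7d}:  m_k = {m_k:.8f},  b_k = {b_k:.6f}")
```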

2 Answers


To start with, we're going to need a few identities:

$$\begin{eqnarray} \sum_{j = 1}^n j & = & \frac{n(n+1)}{2} \\ \sum_{j = 1}^n j^2 & = & \frac{n(n+1)(2n+1)}{6} \\ \sum_{j = 1}^n \sin j & = & \frac{\sin n - \sin (n+1) + \sin 1}{2(1 - \cos 1)} \\ \sum_{j = 1}^n j \sin j & = & \frac{(n + 1) \sin n - n \sin (n + 1)}{2(1 - \cos 1)} \end{eqnarray}$$

The first two are the triangular and square pyramidal numbers respectively, the third is the sum-of-sines formula (see the links in the comments above), and the last is courtesy of Wolfram Alpha, although you can derive it by a similar approach to the sum-of-sines formula, applying a formula for $\sum_k k x^k$ (which in turn comes from differentiating the geometric series formula with respect to $x$).
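
All four identities are easy to sanity-check numerically; here is a minimal sketch (plain Python, standard library only) comparing each closed form against direct summation:

```python
import math

n = 1000
d = 2 * (1 - math.cos(1))
checks = [
    # (direct sum, closed form)
    (sum(range(1, n + 1)),                      n * (n + 1) / 2),
    (sum(j * j for j in range(1, n + 1)),       n * (n + 1) * (2 * n + 1) / 6),
    (sum(math.sin(j) for j in range(1, n + 1)),
     (math.sin(n) - math.sin(n + 1) + math.sin(1)) / d),
    (sum(j * math.sin(j) for j in range(1, n + 1)),
     ((n + 1) * math.sin(n) - n * math.sin(n + 1)) / d),
]
for direct, closed in checks:
    assert math.isclose(direct, closed, rel_tol=1e-9, abs_tol=1e-9)
print("all four identities check out at n =", n)
```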

Then, we're going to use this formula for the coefficients of a simple linear regression:

$$\begin{eqnarray} y & = & \alpha + \beta x \\ \beta & = & \frac{n \sum_j x_j y_j - \sum_j x_j \sum_j y_j}{n \sum_j x_j^2 - (\sum_j x_j)^2} \\ \alpha & = & \bar{y} - \beta \bar{x} \\ & = & \frac{1}{n}\left(\sum_j y_j - \beta \sum_j x_j \right) \end{eqnarray}$$
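
In code, that formula reads as follows (a direct, unoptimised translation; `least_squares_line` is just a name I picked):

```python
def least_squares_line(xs, ys):
    """Return (alpha, beta) for the least-squares line y = alpha + beta*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    alpha = (sy - beta * sx) / n
    return alpha, beta
```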

Now, we get to start substituting $x_j = j$ and $y_j = j + \sin j$ everywhere. Starting with the denominator of $\beta$:

$$\begin{eqnarray} n \sum_j j^2 - (\sum_j j)^2 & = & n \frac{n(n+1)(2n+1)}{6} - \left(\frac{n(n+1)}{2}\right)^2 \\ & = & \frac{n^2 (n+1)}{2}\left(\frac{2n+1}{3} - \frac{n+1}{2} \right) \\ & = & \frac{n^2 (n+1)(n-1)}{12}\end{eqnarray}$$

Next, the numerator:

$$\begin{eqnarray} n \sum_j j(j + \sin j) - \sum_j j \sum_j (j + \sin j) & = & n(\sum_j j^2 + \sum_j j \sin j) - \left( (\sum_j j)^2 + \sum_j j \sum_j \sin j \right) \\ & = & \frac{n^2 (n+1)(n-1)}{12} + n \frac{(n+1) \sin n - n \sin(n+1)}{2(1 - \cos 1)} \\ && - \frac{n(n+1)}{2} \frac{\sin n - \sin(n+1) + \sin 1}{2(1 - \cos 1)} \\ & = & \frac{n^2 (n+1)(n-1)}{12} \\ && + \frac{n \left((n+1) \sin n - (n - 1)\sin(n+1) - (n+1) \sin 1 \right)}{4(1 - \cos 1)} \end{eqnarray}$$

Putting those together, we get:

$$\begin{eqnarray} \beta & = & 1 + \frac{n \left((n+1) \sin n - (n - 1)\sin(n+1) - (n+1) \sin 1 \right)}{4(1 - \cos 1)}\frac{12}{n^2 (n+1)(n-1)} \\ & = & 1 + \frac{3 \left((n+1) \sin n - (n-1) \sin(n+1) - (n+1) \sin 1 \right)}{n(n-1)(n+1)(1 - \cos 1)} \end{eqnarray}$$

And:

$$\begin{eqnarray} \alpha & = & \frac{1}{n}\left(\sum_j (j + \sin j) - \beta \sum_j j\right) \\ & = & \frac{1}{n}\left(\frac{\sin n - \sin(n+1) + \sin 1}{2(1 - \cos 1)} \right. \\ && \left. - \frac{n(n+1)}{2} \frac{3 \left((n+1) \sin n - (n-1) \sin(n+1) - (n+1) \sin 1 \right)}{n(n-1)(n+1)(1 - \cos 1)} \right) \\ & = & \frac{(2n+1)\sin 1 + (n-1)\sin(n+1) - (n+2)\sin n}{n(n-1)(1 - \cos 1)} \end{eqnarray}$$

Or, at least, that's probably close to being right, but the probability of an algebraic error creeping in there is pretty high. However, assuming that that's all roughly accurate, we can see that $\beta = 1 + O(n^{-2})$ and $\alpha = O(n^{-1})$, so in the limit as $n \rightarrow \infty$ we do indeed get that $y = x$ is the least squares regression line.
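
To guard against exactly that kind of slip, here is a quick numerical check of the closed forms for $\beta$ and $\alpha$ against a brute-force fit (a sketch; the choice $n = 500$ is arbitrary):

```python
import math

n = 500
xs = list(range(1, n + 1))
ys = [j + math.sin(j) for j in xs]

# Brute-force least-squares fit.
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
beta_bf = (n * sxy - sx * sy) / (n * sxx - sx * sx)
alpha_bf = (sy - beta_bf * sx) / n

# Closed forms derived above.
c = 1 - math.cos(1)
s = (n + 1) * math.sin(n) - (n - 1) * math.sin(n + 1) - (n + 1) * math.sin(1)
beta = 1 + 3 * s / (n * (n - 1) * (n + 1) * c)
alpha = ((2 * n + 1) * math.sin(1) + (n - 1) * math.sin(n + 1)
         - (n + 2) * math.sin(n)) / (n * (n - 1) * c)

print(beta_bf, beta)    # should agree to within rounding
print(alpha_bf, alpha)
```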

ConMan

We could also consider the continuous analogue, with data points $(x, x+\sin x)$ for $0 \leq x \leq a$, and minimize $$\Phi(m,b)=\int_0^a \Big[(mx+b)-(x+\sin(x))\Big]^2\,dx$$ which evaluates to $$\Phi(m,b)=\frac{1}{6} \left(a \left(2 a^2 (m-1)^2+6 a b (m-1)+6 b^2+3\right)-3 \sin (a) (\cos (a)+4 m-4)\right)+2 \cos (a) (a (m-1)+b)-2 b$$

$$\frac{\partial \Phi(m,b)}{\partial m}=\frac{1}{6} \left(a \left(4 a^2 (m-1)+6 a b\right)-12 \sin (a)\right)+2 a \cos (a)\tag 1$$ $$\frac{\partial \Phi(m,b)}{\partial b}=\frac{1}{6} a (6 a (m-1)+12 b)+2 \cos (a)-2\tag 2$$

Solving the two linear equations in $(m,b)$ gives $$m=1-\frac{6 (a-2 \sin (a)+a \cos (a))}{a^3}\quad \to ~ 1$$ $$b=\frac{2 (2 a-3 \sin (a)+a \cos (a))}{a^2}\quad \to ~ 0^+$$ as $a \to \infty$, with $m-1 = O(a^{-2})$ and $b = O(a^{-1})$.
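
As a quick check on both the algebra and the claimed limits, one can evaluate these closed forms for increasing $a$ (a minimal sketch; the values of $a$ are arbitrary):

```python
import math

# Evaluate the closed forms for m and b at increasing a; m should
# approach 1 and b should approach 0 from above.
for a in [10.0, 100.0, 1000.0, 10000.0]:
    m = 1 - 6 * (a - 2 * math.sin(a) + a * math.cos(a)) / a ** 3
    b = 2 * (2 * a - 3 * math.sin(a) + a * math.cos(a)) / a ** 2
    print(f"a = {a:8.0f}:  m = {m:.10f},  b = {b:.8f}")
```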