13

I've been trying to speed up an algorithm where the most expensive operation is the square-root. However, I can often guarantee that the input value is a perfect-square. I'm curious to know if there are any algorithms that will find the square-root faster (constant time?) if it is known that the input is a perfect-square?

Thanks, Ryan

  • Is it a human executing the algorithm, or a machine? – Brian Tung Aug 04 '18 at 22:19
  • This is a computer program :) – Ryan Pierce Williams Aug 04 '18 at 22:20
  • I'm not aware of any faster algorithm in that situation (certainly not to constant time), but this probably isn't the best Stack Exchange to ask that question. – Brian Tung Aug 04 '18 at 22:21
  • Thanks for your input. I can stick a similar question on a programming Stack Exchange, but this is primarily a math question :) – Ryan Pierce Williams Aug 04 '18 at 22:23
  • Are you feeding it an int and expecting a float? – Brian Tung Aug 04 '18 at 22:24
  • No, I'm just wanting the integer sqrt – Ryan Pierce Williams Aug 04 '18 at 22:47
  • 1
    How large are your squares? 48 binary digits? 64? 1000? And since this is very applied numerical math: what machine are we running on? Or is this configurable hardware and we can implement our own logic? – Marcus Müller Aug 04 '18 at 23:56
  • 1
    And: do you need to make a decision after every square root calculation, or can you calculate multiple square roots in parallel (e.g. data parallelism)? – Marcus Müller Aug 05 '18 at 00:03
  • Currently I'm interested in calculating a square root and then making some decisions based upon it :) – Ryan Pierce Williams Aug 05 '18 at 00:13
  • In terms of size, I'm doing preliminary work up to 64-bit integers. However, I intend to use the GNU Multiple Precision Library for larger integers after my initial testing is complete. – Ryan Pierce Williams Aug 05 '18 at 00:14
  • For hardware, I'm just using a standard Windows 10 desktop computer. I would like to migrate to a cluster of Raspberry Pis later though – Ryan Pierce Williams Aug 05 '18 at 00:15
  • 2
    If you can run on a cluster, you have data parallelism (prob'ly it's better to discard the Pi cluster idea; it's very little bang for the initial buck and little bang for the power-bill buck; embedded computers are not optimized for number crunching, much unlike modern desktop CPUs) – Marcus Müller Aug 05 '18 at 00:34
  • The algorithm overall can be parallelized to execute over different input ranges. However, when evaluating a particular candidate, I don't need a bunch of different square-roots - I only need one or two per candidate. – Ryan Pierce Williams Aug 05 '18 at 00:38
  • Also, given the size of the numbers I'd like to work with, keeping a minimal/constant memory footprint is important. Can't have my memory usage increasing drastically as I try to work with larger numbers. I ran into this problem previously when trying to use prime numbers. Trying to store all prime numbers up to the sqrt of the target becomes prohibitive very quickly. – Ryan Pierce Williams Aug 05 '18 at 00:39
  • 1
    Practical idea, depending on use case: you may be able to compute your square root earlier and have it ready for the bottleneck, assuming you're doing some earlier computation to ensure this number is a square. Memoization is also an option, but perhaps that's too much memory. To elaborate: get n^2, split a thread. One end does your usual thing, the other computes the square root. Savings will depend on whether you are doing more than just finding root n. – Artimis Fowl Aug 05 '18 at 00:44
  • @ArtimisFowl - That won't work in this case. As with pre-computing prime numbers, trying to pre-compute and store square-roots will prove to be a bottleneck when trying to process large integers. – Ryan Pierce Williams Aug 05 '18 at 00:46
  • 2
    By the way, gmp comes with an integer sqrt function. And I don't think you'll be much faster than that on large integers. – Marcus Müller Aug 05 '18 at 02:00
  • Please explain how the number is stored. As a native 32-bit int (in that case there can be optimization specific to 32-bit number), or as a general big-integer? – user202729 Aug 05 '18 at 03:41
  • This question is more suited for [codereview.se] or [so] (as always, read help before asking) [math.se] is more about pure math than actual computer runtime (so if you ask about asymptotic complexity, it would suit here (probably [cs.se] is better), but real computer implementation would not) – user202729 Aug 05 '18 at 03:46
  • Assume that the number is a big-integer, it's impossible to do anything on it in constant-time, as just reading the number takes at least logarithmic time. – user202729 Aug 05 '18 at 03:47
  • Very closely related: Fast algorithm for computing integer square roots on machines that doesn't support floating-point arithmetic? Given that the answers here did not exploit anything about $n$ being a perfect square, and the answers in the linked question are much better (faster), I'd say that it should be closed as duplicate. Also https://math.stackexchange.com/questions/296102/fastest-square-root-algorithm and https://math.stackexchange.com/questions/1599481/integer-square-root-algorithm . – user202729 Aug 05 '18 at 03:53
  • @user202729 All good points. The major reason I put it here on the Math Stack Exchange rather than a programming forum is that I am interested in solutions that take advantage of the special property I noted: the input is (likely) a perfect-square. Tiwa Aina is the only one who has attempted an answer taking advantage of this property so far, however. It may be that an algorithm that assumes a perfect-square as its input doesn't work very efficiently, or at all, with a non-perfect-square. – Ryan Pierce Williams Aug 05 '18 at 08:02
  • More related questions: 4 5 – user202729 Aug 05 '18 at 10:58
  • @RyanPierceWilliams I don't have a proof, but for all we've been able to figure out, discrete logarithms are hard; i.e. by exploiting the fact that your number is a perfect square, you basically win nothing; all integer square root algorithms can already assume that inherently (if the number sqrti'ed isn't a perfect square, you don't get an integer result, anyway). If I wanted to put this harshly, "you just claim there should be a benefit of knowing it's a perfect square, but fail to demonstrate any evidence or even indication for that" :) But putting it like that would be unfair, because: – Marcus Müller Aug 05 '18 at 12:51
  • Your actual question is "are there algorithms that exploit that fact", and I must admit I have failed at providing an existence proof! – Marcus Müller Aug 05 '18 at 12:55
  • There are a lot of different suggestions in the answers below, and people seem to vote based on how 'cute' / simple the idea is. This is not a good way of judging what is the best method in practice. What this question needs is for someone to do a proper comparison by implementing and timing the different proposals. Would be very nice to see this. – Winther Aug 05 '18 at 14:26
  • @Winther no, I disagree: [in a 300! voice]: This is math! also, too many variables: how large are the integers we're talking about, what hardware do we run on, how many of them do we need the square root of, what's the actual memory restriction, how is memory clocked compared to CPU and especially ALU/FPU… – Marcus Müller Aug 05 '18 at 14:29
  • @MarcusMüller You strongly disagree that timing the different proposals is a good idea? How would this hurt, and how else would you determine what works best? Sure, this is a site about math, not computing, but this question comes from a practical implementation, and only testing would reveal what works best. The fact that the worst (for all but the smallest integers) algorithm is the highest-upvoted answer suggests that people here would benefit from seeing that. – Winther Aug 05 '18 at 14:41
  • Is your "standard Windows 10 desktop computer" a 64 bit machine? I assume it is, but you didn't say. Also, are you using a fully compiled language? Or are you using a scripting language? The best algorithm running on bare metal may not be the best running in a scripting environment. When you go up to the big numbers after your preliminary 64-bit testing phase how big will these numbers be? If you have a set size limit, there may be some possible optimizations. – PM 2Ring Aug 05 '18 at 15:39
  • @Winther I don't disagree that if you have different approaches, a benchmark should be constructed. But just timing something will have little to no information that is actually useful in comparing algorithms for the use case OP has. Construction of a proper benchmark is so much more than "let this function run 10000 times and count the milliseconds". It's primarily setting the conditions for a fair comparison. Implementation and running is the lesser problem. But: let's be honest, OP explicitly wanted to ask this here and not on a CS or programmer's Stack Exchange site, and this is where – Marcus Müller Aug 05 '18 at 15:58
  • … I'd draw the line between on- and off-topic. – Marcus Müller Aug 05 '18 at 16:01
  • Here's a method to find that an integer $N$ is a square without using the square root function. I don't know how fast the method is compared to other methods but it's certainly worth trying.
    https://math.stackexchange.com/questions/4226869/how-well-does-this-method-of-checking-if-an-integer-n-is-a-square-perform
    – user25406 Aug 19 '21 at 13:12

5 Answers

14

The sum of the first $k$ odd numbers is $k^2$. Knowing this, you can calculate the square root by summing successive odd numbers (starting from one); once you reach the input value, return the number of summations you made.

For example, $16 = 1 + 3 + 5 + 7$; that's $4$ addends, so $\sqrt{16}=4$. This process will always work, since our input is guaranteed to be of the form $k^2$ with $k \in \mathbb N$.

I think this method would run in $O(\sqrt n)$.
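For concreteness, here's a minimal C sketch of this summation (the function name and the 64-bit types are mine, not part of the answer; it assumes the input really is a perfect square):

#include <stdint.h>

/* Sum successive odd numbers until we reach n; the number of addends
   is the square root. O(sqrt(n)) additions. If n is not a perfect
   square, the loop overshoots and the result is meaningless. */
static uint64_t isqrt_odd_sum(uint64_t n) {
    uint64_t sum = 0, odd = 1, k = 0;
    while (sum < n) {
        sum += odd;   /* after k iterations, sum == k^2 */
        odd += 2;
        k++;
    }
    return k;         /* sum == n here exactly when n == k^2 */
}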

actinidia
  • 3,365
  • 1
    +1. I would edit your answer to make your use of $n$ consistent between the first and last sentences. (Perhaps prefer changing $n$ to $k$ in your first sentence to match the second paragraph.) – Brian Tung Aug 04 '18 at 22:30
  • 1
    Good answer :) I actually tried this myself. Unfortunately, as n gets larger this approach takes a lot more time than more general approaches like Newton's Method. – Ryan Pierce Williams Aug 04 '18 at 22:50
  • 5
    However, I wonder if this is actually faster than the existing sqrt implementation in most compiled languages. – Brian Tung Aug 04 '18 at 22:50
  • 1
    Newton's algorithm converges much faster (on average; worst case: your integer is the square of two large primes, then complexity approaches $\mathcal O(\sqrt n)$, I think). – Marcus Müller Aug 04 '18 at 23:47
  • 7
    Terrible. Remember that $n$ here is the number itself, which is of order $2^s$ where $s$ is number of bits used to represent the number; so $\sqrt n = \sqrt{2^s} = (\sqrt 2)^s$, which is exponential in the input size. Newton's algorithm is polynomial in $s$. – user202729 Aug 05 '18 at 03:39
  • @user202729 I had a feeling it wasn't the best one available; I definitely didn't know that modern square root algos run in $\log n$! – actinidia Aug 05 '18 at 03:56
  • This is a cool idea though, for listing the first $k$-many perfect squares. – goblin GONE Aug 05 '18 at 10:43
  • 1
    Newton's method for square roots isn't exactly modern, it was known to the ancient Babylonians. :) Newton's method (or more properly the Newton–Raphson method) is a generalization for finding the root of any function of which you can compute the derivative. – PM 2Ring Aug 05 '18 at 15:32
  • -1 Cool idea, but this will almost certainly be slower than whatever the OP is already using (probably floating point square root and rounding). – Tuomas Laakkonen Dec 26 '19 at 12:58
6

With integers within sensible bounds of what your CPU can natively compute, it's quite easy to restrict the range of numbers over which you have to binary search to find the square root of $x$.

(0. Remove pairs of trailing 0 bits from your binary number. Each pair you remove is one factor of 2 to be multiplied onto the result of the following steps. This can be done in constant time, if I'm not mistaken: observe the structure of "subtract 1 and XOR with the input" for numbers with $t$ trailing 0s, then use the POPCNT (Hamming weight) instruction of most serious CPUs. After removing these 0s, i.e. dividing by $4^n$, you'll end up with an odd number; if you instead end up with an even number (i.e. the count of trailing 0s was odd), your number is not a perfect square.)
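For illustration, a small C sketch of that step 0, assuming a 64-bit input and the GCC/Clang popcount builtin (the helper names are mine):

#include <stdint.h>

/* (x - 1) ^ x sets exactly the trailing zero bits plus the lowest set
   bit, so its popcount minus 1 is the number of trailing zeros. */
static inline int trailing_zeros(uint64_t x) {
    return __builtin_popcountll((x - 1) ^ x) - 1;
}

/* Strip an even number of trailing zeros; *half_shift is the shift to
   apply to the square root at the end. If the returned value is even,
   the zero count was odd and x was not a perfect square. */
static inline uint64_t strip_square_zeros(uint64_t x, int *half_shift) {
    int t = trailing_zeros(x);
    *half_shift = t / 2;
    return x >> (t & ~1);
}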

  1. Find $k=\lfloor\log_2 x\rfloor $, see https://graphics.stanford.edu/~seander/bithacks.html
  2. $a=\lfloor k/2\rfloor$
  3. Thus, $2^a$ becomes a lower limit for $\sqrt x$ and $2^{a+1}$ an upper one. Both values can be obtained by bit-shifting $1$.
  4. From here, do a binary search¹ (steps 1 to 3 are sketched just below).
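A minimal sketch of steps 1 to 3, assuming a nonzero 64-bit input and the GCC/Clang count-leading-zeros builtin (names are mine):

#include <stdint.h>

/* With k = floor(log2 x), sqrt(x) lies in [2^(k/2), 2^(k/2 + 1)];
   the binary search of step 4 then runs inside this bracket. */
static inline void sqrt_bounds(uint64_t x, uint64_t *lo, uint64_t *hi) {
    int k = 63 - __builtin_clzll(x);   /* floor(log2 x), x != 0 */
    *lo = (uint64_t)1 << (k / 2);
    *hi = *lo << 1;
}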

I doubt you'd be much faster than converting to floating point and letting the FPU do it in hardware, giving you an approximate value, convertible back to integer, from which you only need to search a small range (namely, the lost precision) for the actual integer square root.
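As a sketch of that floating-point route (assuming n < 2^62 so the fix-up squarings cannot overflow a uint64_t; the function name is mine):

#include <math.h>
#include <stdint.h>

/* The FPU result is within a few units of the true root, since a
   double's 53-bit mantissa cannot represent every 64-bit integer
   exactly; a short correction loop finishes the job. */
static inline uint64_t isqrt_fpu(uint64_t n) {
    uint64_t r = (uint64_t)sqrt((double)n);
    while (r > 0 && r * r > n)
        r--;
    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}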

Note that in problems such as yours, algorithmic elegance often plays a minor role - the code needs to be fast on actual hardware, so execution avoiding a lot of memory interaction is a good thing, and with SIMD instructions, doing four to 16 operations of the same type takes about as long as doing one; so if you just need to test a few integers for their square, modifying your algorithm to try four candidates in parallel is way more efficient than saving half of the operations.
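As an illustration of the SIMD point, GCC/Clang generic vector extensions let you square and compare four 64-bit candidates at once (a sketch only; a tuned version would use target-specific intrinsics, and the names are mine):

/* Four lanes per operation; each result lane is all-ones where
   cand * cand == n, else zero. */
typedef unsigned long long v4u64 __attribute__((vector_size(32)));
typedef long long v4i64 __attribute__((vector_size(32)));

static inline v4i64 squares_equal(v4u64 cand, unsigned long long n) {
    v4u64 target = { n, n, n, n };
    return (cand * cand) == target;   /* lane-wise multiply and compare */
}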

You have a technological problem, not so much a numerical one.


¹ binary search assumes that you can do one squaring and one comparison at once; as hinted at before, you might very well be able to divide your interval into five search chunks by calculating four products at once and comparing four numbers at once using SIMD. This further hints that even if there should be no constant time algorithm (and I'm pretty sure there's none), you can be better than $\mathcal O(n^2·\log_2 x)$; compare Fürer's algorithm.

  • 1
    Step 1 assumes the input is a machine-sized integer, which may not be the case. – user202729 Aug 05 '18 at 03:44
  • Well if it is some kind of big integer: whatever big integer library you're using somehow has to keep track of the length in binary digits, anyway. – Marcus Müller Aug 05 '18 at 08:37
  • 1
    Did you try to time it? The binary search requires a lot of multiplications for large numbers, so my bet is that this is much slower than the built-in square root function (this is what I find with a simple test in C++ using long long integers, but there's always the caveat that there may be a better implementation than what I did) – Winther Aug 05 '18 at 14:23
  • "built-in square root function": of what, operating on what data types? – Marcus Müller Aug 05 '18 at 14:26
  • (and that confirms my suspicion that you'll have a hard time beating "the usual implementations") – Marcus Müller Aug 05 '18 at 14:28
  • Binary search is ok for small $n$, but Newton's method (aka Heron's method) is faster for large $n$. For random 64 bit unsigned numbers, Newton's converges in 5 iterations or less (mean around 4.6), and each iteration requires a division, a 1 bit shift, a subtraction and an addition, and a pair of comparisons. – PM 2Ring Aug 05 '18 at 16:21
3

I think the only advantage gained by having a perfect square in analytic methods is that you know an iterative algorithm will actually terminate. So instead here is a number theoretic solution that'll work for numbers less than $2^{66}$.

Fact 1: If $p$ is a prime with $p \equiv 3 \mod 4$ and $x$ is a perfect square $\mod p$, then $$x \equiv \left(x^{(p+1)/4}\right)^2 \mod p,$$ i.e. you can compute the modular square root by exponentiating by $(p+1)/4$. (See https://crypto.stackexchange.com/a/20994/18112)
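Fact 1 is cheap to implement. Here is a hedged C sketch (names mine) that suffices for the Mersenne moduli below, since they are under $2^{32}$ and intermediate products then fit in 64 bits; inputs wider than 64 bits would first be reduced modulo $p$, e.g. with GMP:

#include <stdint.h>

/* Square-and-multiply modular exponentiation; requires m < 2^32 so
   that the products cannot overflow a uint64_t. */
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    b %= m;
    while (e) {
        if (e & 1)
            r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

/* Fact 1: for a prime p == 3 (mod 4) and x a square mod p,
   x^((p+1)/4) is a square root of x modulo p. */
static uint64_t mod_sqrt_3mod4(uint64_t x, uint64_t p) {
    return modpow(x, (p + 1) / 4, p);
}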

Fact 2: The numbers $m_{17}=2^{17}-1$, $m_{19}=2^{19}-1$, and $m_{31}=2^{31}-1$ are (Mersenne) primes whose product is greater than $2^{66}$.

Method: Let $S$ be the square whose root $t$ you'd like to find. Compute the following $$t_{17} \equiv S^{2^{15}} \mod m_{17}$$ $$t_{19} \equiv S^{2^{17}} \mod m_{19}$$ $$t_{31} \equiv S^{2^{29}} \mod m_{31}$$ Then the Chinese Remainder Theorem gives $$t \equiv \pm 31207 t_{17} m_{19} m_{31} \pm 298611 m_{17} t_{19} m_{31} \pm 413071270 m_{17} m_{19} t_{31} \mod m_{17}m_{19}m_{31}$$ Then check these 8 possibilities.

Remarks: I don't know how computationally efficient this is; it's more of a mathematical solution taking advantage of knowing that $S$ is a square. I would venture to guess it's about as "constant time" as you could get as the number of steps is essentially fixed, but that constant may be larger than the $\sqrt{n}$ of other methods for this range of $n$.

Aeryk
  • 679
  • (1) I've never heard of a practical algorithm computing square root that may not terminate. (2) (as I pointed out in a comment under the question) it's impossible to do this in constant time, as storing a number $n$ takes at least $O(\log n)$ bits. – user202729 Aug 05 '18 at 12:38
  • (3) Note that for $n$ being in a constant range (in this case $[1 \dots 2^{66}]$, any (deterministic, correct) algorithm must run in constant asymptotic time complexity. (4) It's only necessary for $p$ to be larger than $2\sqrt n$ instead of $n$, the proof is left to the reader. ... and ... – user202729 Aug 05 '18 at 12:40
  • @user202729 You're totally right about the 8 solution ambiguity. I made the correction in the exposition above. Thanks! – Aeryk Aug 05 '18 at 19:35
  • Activity comment: see this for theory and link to python shifting code. – CopyPasteIt Jul 27 '21 at 13:10
1

I think a binary search type algorithm would be quite efficient for large input values if we know the input is a perfect square.

Let $n$ be the input value.

  1. Begin with two integers $a$ and $b$ such that $a^2 < n$ and $b^2 > n$. We could use $a=0$ and $b=n$ (for $n>1$).

  2. Find the midpoint of $a$ and $b$ and round down to the nearest integer if necessary. Call this $m$.

  3. Find $m^2$. If $m^2=n$ then $m$ is the square root. If $m^2>n$ then $m$ is too high, so we return to step 2 with $a$ and $m$ as our two integers. If $m^2<n$ then $m$ is too low, so we return to step 2 with $m$ and $b$ as our two integers. Repeat until the square root is found.

The squaring of $m$ may be what slows the algorithm down; however, I believe that multiplication is implemented in processor hardware and is therefore very efficient. In terms of the number of operations, I believe the binary search would run in logarithmic time and therefore be preferable to $O(\sqrt n)$ for large input values. However, I am no expert on algorithm efficiency...
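For concreteness, a C sketch of the bisection (names mine; the comparison uses a division so that it cannot overflow where m * m would for large n):

#include <stdint.h>

/* Floor square root by bisection, O(log n) iterations; for a perfect
   square the result r satisfies r * r == n exactly. */
static uint64_t isqrt_bisect(uint64_t n) {
    if (n < 2)
        return n;
    uint64_t a = 1, b = n;           /* invariant: a <= sqrt(n) < b */
    while (b - a > 1) {
        uint64_t m = a + (b - a) / 2;
        if (m <= n / m)              /* equivalent to m * m <= n */
            a = m;
        else
            b = m;
    }
    return a;
}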

alcana
  • 120
  • Thanks for your answer alcana :) This is a more general algorithm for the square root (would work for non-perfect squares as well with some minor modifications to the logic for determining when we are done). It doesn't take advantage of the special property of the input value: that it is a perfect-square. It is faster than the previous answer as the input size increases however :) – Ryan Pierce Williams Aug 04 '18 at 23:35
  • @RyanPierceWilliams - also note that on many current Intel processors, multiplies are optimized and only take 1 to 3 cycles, so for larger n, the binary search using multiply and compare should be faster. – rcgldr Aug 05 '18 at 02:20
  • 3
    Isn't this very similar to Newton's method, just less efficient because it doesn't use deltas to correct the prediction, just blind binary splitting? – ktb Aug 05 '18 at 02:35
  • 2
    Yes, this is less efficient than Newton's method. – user202729 Aug 05 '18 at 03:55
0

You can get good results using a 2-adic version of Newton–Raphson. The algorithmic complexity will be no better than with the usual adaptation of Newton–Raphson to the domain of integers, but convergence is easier to establish and computation modulo a power of $2$ is especially well-suited to modern computers.

As an example of the sort of performance possible, in a couple of dozen lines of division-free, almost branch-free C code I can compute the 32-bit square root of any 64-bit perfect square in around 14.9 CPU clock cycles (around 3.3 ns at 4.5 GHz) on my Intel Core i7-8559U laptop. I've included the code below so that you can do timings for yourself.

I think this fits the "known perfect square" criterion described by the questioner, since for non-perfect squares it doesn't give a useful result (outside the presumably highly specialised application of actually wanting a 2-adic approximation to the square root of a non-square integer).

Details: It's enough to be able to compute square roots of odd perfect squares: handling zero is trivial, and for positive even perfect squares we can shift out trailing zero bits (of which there must be an even number) to get an odd perfect square, take the square root, and shift back appropriately.

So suppose that $n$ is an odd perfect square integer and define a function $f$ (on the real numbers for now) by $$f(x) = n - \frac 1 {(2x+1)^2},$$ valid for all real $x$ except $x = -1/2$. The roots of $f$ are $((\pm 1/\sqrt n) - 1) / 2$. Working through the algebra, the Newton–Raphson method applied to $f$ says that if we have a sufficiently good approximation $x$ to one of those roots then $$x - \left((x^2 + x)n + \frac{n-1}4\right) (2x + 1) \tag{1}\label{eq1}$$ should be a better approximation, and moreover that convergence of the repeated iteration should be quadratic once we get close enough.

Now here's the key point: the expression \eqref{eq1} can be evaluated using only integer arithmetic. (Note that since $n$ is an odd square, it must be congruent to $1$ modulo $4$, so $(n - 1)/4$ is an integer.) There are exactly two roots of $f$ in the 2-adic integers, and we can use \eqref{eq1} to compute successive $2$-adic approximations to the roots. Moreover, we continue to get the expected quadratic convergence to the roots; while the normal proof of this goes via calculus, for this particular $f$ we can see the quadratic convergence purely algebraically. Suppose that $x_0$ is a (rational) integer that satisfies $$(x_0^2 + x_0)n + \frac{n-1}4 \equiv 0 \pmod{2^j}$$ for some positive integer $j$. This is an integer-only reformulation of the statement that $x_0$ is congruent to $((\pm 1/\sqrt n) - 1) / 2$ modulo $2^j$ in the 2-adics, where now $\sqrt n$ represents one of the square roots of $n$ in the ring of 2-adic integers.

Let $x_1$ be the next approximation computed using the formula \eqref{eq1}: $$x_1 = x_0 - \left((x_0^2 + x_0)n + \frac{n-1}4\right) (2x_0 + 1).$$ Then simple but tedious algebra (expanding both sides) shows that $$(x_1^2 + x_1)n + \frac{n-1}4 = \left((x_0^2 + x_0)n + \frac{n-1}4\right)^2 ((2x_0+1)^2 n - 4)$$ from which we immediately have $$(x_1^2 + x_1)n + \frac{n-1}4 \equiv 0 \pmod{2^{2j}}$$ So we double the number of good bits in our approximation on every iteration. If $n < 4^j$ for some $j$ then it's enough to iterate until we have an integer $x$ satisfying $$(x^2 + x)n + \frac{n-1}4 \equiv 0 \pmod{2^j}$$ Then $a = (2x+1)n$ is an integer solution to the congruence $$a^2 \equiv n \pmod{2^{j+2}}.$$ Note that we have to be a bit careful: there are four solutions in total to this congruence, but only one of them lies in the interval $(0, 2^j)$ (after reduction modulo $2^{j+2}$), and that's the square root that we're after. Given any one solution, the others are easy to find via negation and via addition of $2^{j+1}$.

As a final speed trick, while we could start with a solution to the congruence modulo $2^1$ and work our way up from there, on most machines it will make more sense to use a small lookup table to enable us to start with a solution modulo $2^8$, say. That lookup table will typically be small enough to fit into a couple of level-1 cache lines on a modern processor.

Here's some C code that applies the above ideas to the special case of computing the 32-bit square root of a 64-bit perfect square integer. It's mostly portable, but it does make use of the GCC / Clang intrinsic function __builtin_ctzl for counting trailing zero bits in a nonzero integer.

#include <stdint.h>

static const uint8_t lut[128] = {
      0,  85,  83, 102,  71,   2,  36, 126,  15,  37,  28,  22,  87,  50, 107,  46,
     31,  10, 115,  57, 103,  98,   4,  33,  47,  58,   3, 118, 119, 109, 116, 113,
     63, 106, 108,  38, 120,  61,  27,  62,  79, 101,  35,  41, 104,  13,  84,  17,
     95,  53,  76, 121,  88,  34,  59,  97, 111,   5,  67,  54,  72,  82,  52,  78,
    127,  42,  44,  25,  56, 125,  91,   1, 112,  90,  99, 105,  40,  77,  20,  81,
     96, 117,  12,  70,  24,  29, 123,  94,  80,  69, 124,   9,   8,  18,  11,  14,
     64,  21,  19,  89,   7,  66, 100,  65,  48,  26,  92,  86,  23, 114,  43, 110,
     32,  74,  51,   6,  39,  93,  68,  30,  16, 122,  60,  73,  55,  45,  75,  49,
};

uint32_t isqrt64_exact(uint64_t n) {
    uint32_t m, k, x, b;

    if (n == 0)
        return 0;

    /* Strip the trailing zero bits (an even count for a perfect
       square); half of them are shifted back on at the end. */
    int j = __builtin_ctzl(n);
    n >>= j;

    /* n is now an odd square, so n == 1 (mod 4). */
    m = (uint32_t)n;          /* n mod 2^32 */
    k = (uint32_t)(n >> 2);   /* (n - 1) / 4 mod 2^32 */

    /* The table gives x solving (x^2 + x) n + (n-1)/4 == 0 (mod 2^8);
       each step below doubles the number of correct bits. Note that
       ~x == -x - 1, so x * ~x == -(x^2 + x) and x - ~x == 2x + 1. */
    x = lut[k >> 1 & 127];
    x += (m * x * ~x - k) * (x - ~x);   /* now good mod 2^16 */
    x += (m * x * ~x - k) * (x - ~x);   /* now good mod 2^32 */

    /* 2b + 1 == (2x + 1) n (mod 2^33) is a square root of n modulo
       2^34; the conditional complement picks, among the four roots,
       the one lying in the right interval. */
    b = m * x + 2 * k;
    b ^= -(b >> 31);

    /* b - ~b == 2b + 1; restore half of the removed zeros. */
    return (b - ~b) << (j >> 1);
}

I have a fuller version of this code available, including exhaustive tests and explanations, as a GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40