In statistical computing we often have linear models that involve computing many vector dot-products, and when you break those operations down there's a high amount of redundancy. As an example, let X
be the following 12x3 matrix:
[[-1,-1,-1]
,[-1,-1,1]
,[-1,1,-1]
,[-1,1,1]
,[0,-1,-1]
,[0,-1,1]
,[0,1,-1]
,[0,1,1]
,[1,-1,-1]
,[1,-1,1]
,[1,1,-1]
,[1,1,1]]
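For reference, in R this X can be generated with tidyr's expand_grid (the column names a/b/c below are just placeholders):

```
library(tidyr)
X = as.matrix(expand_grid(a = c(-1,0,1), b = c(-1,1), c = c(-1,1)))
```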
As you can see, there's high redundancy across rows, and in particular the only values present are the integers -1, 0, and 1.
In linear models (particularly hierarchical linear models) we need to compute the dot-product of each row of X with a vector of real values Y:
[0.2796520, -0.2959688, 2.2617949]
As in (pseudo-code):
for(row in 1:12)
    out[row] = dot_product(X[row,],Y)
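In R, that whole loop is just the matrix-vector product (a one-line sketch, assuming X is stored as a numeric matrix and Y as a numeric vector):

```
out = as.vector(X %*% Y)
```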
But with the assumption that X contains only the integers -1, 0, and 1, the multiplications are unnecessary, amounting to mere sign-flips (in the case of -1) or exclusion from the sum (in the case of 0). So an equivalent output is achieved by:
for(row in 1:12)
    out[row] = 0
    for(col in 1:3)
        if(X[row,col]==0):
            pass
        elif(X[row,col]<0):
            out[row] -= Y[col]
        else:
            out[row] += Y[col]
Which should be faster (correct?).
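For concreteness, here's a direct R translation of that branchy version (just a sketch; the function name dot_no_mult is mine):

```
dot_no_mult = function(X, Y){
  out = numeric(nrow(X))
  for(row in 1:nrow(X)){
    for(col in 1:ncol(X)){
      x = X[row, col]
      if(x > 0){
        out[row] = out[row] + Y[col]   #+1 entry: add
      }else if(x < 0){
        out[row] = out[row] - Y[col]   #-1 entry: subtract
      }
      #0 entries contribute nothing, so they're skipped entirely
    }
  }
  return(out)
}
```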
And yet this approach still ignores the redundancy across rows of X. I have a vague but itchy suspicion that further speedup could be achieved by capitalizing on that redundancy, and that this is probably analogous to some standard CS problem with an elegant solution that, being a stats guy, I just haven't been exposed to. Any ideas?
(note: I'm arguably over-optimizing here, but the full compute context is Bayesian computing where these operations occur millions of times and dominate overall compute time; and these models are used fairly ubiquitously in science, so a speedup would improve many scientific workflows).
After playing a bit I worked out an approach that treats the sum creation as a branching process to minimize the number of unique sum operations. Here's R code showing the approach; it's not "elegant" in implementation, but it tends to yield roughly an order of magnitude fewer operations (counting only adds/multiplies and treating multiplies as 3x more costly than adds):
```
library(tidyverse)
#make data
(
expand_grid(
a = c(-1,0,1)
, b = c(-1,0,1)
, c = c(-1,0,1)
, d = c(-1,0,1)
, e = c(-1,0,1)
)
%>% as.matrix()
) ->
X
#example parameter vector
beta = rnorm(ncol(X))
#compute the standard row-wise dot-product
expected_dot_out = t(X%*%beta)
dot_ops_cost = nrow(X)*ncol(X)*4
#standard dot product should take nrow(X)*ncol(X) multiplies & the same
#number of adds, and multiplies are about 3x slower than adds
#precompute some indices
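#for each prefix length i, the loop below records (a) the first row of X carrying
#each unique pattern of X[,1:i] and (b) for every row, which unique pattern it
#maps to, so partial sums can be shared across rows with identical prefixes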
X_tmp_xrow_list = list()
for(i in 1:ncol(X)){
(
X[,1:i,drop=FALSE] #drop=FALSE keeps the single-column case a matrix for as_tibble()
%>% as_tibble()
%>% mutate(xrow=1:n())
) ->
X_with_xrow
(
X_with_xrow
%>% group_by_at(.vars=vars(-xrow))
%>% summarise(
first_xrow = xrow[1]
, .groups = 'drop'
)
%>% mutate(
xrow_in_X_tmp = 1:n()
)
) ->
tmp
(
X_with_xrow
%>% select(-xrow)
%>% left_join(tmp)
%>% nest(xrow_in_X_tmp=xrow_in_X_tmp)
) ->
X_tmp_xrow_list[[i]]
}
#helper function for the no-multiply dot product
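#arguments: X = one column's values (in -1/0/1), one per unique prefix;
#Z = the partial sums accumulated so far; beta = the scalar coefficient
#for this column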
dot_like = function(X,Z,beta){
out = Z
for(row in 1:length(X)){
if(X[row]!=0){
if(X[row]>0){
out[row] = out[row] + beta
}else{
out[row] = out[row] - beta
}
add_count <<- add_count + 1
}
}
return(out)
}
#initialize a count of additions performed
add_count = 0
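#Z holds the running partial dot-products, one entry per row of X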
Z = rep(0,nrow(X))
#loop over columns to do branched sum
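#at step i, beta[i] gets added/subtracted just once per unique prefix of X[,1:i]
#(and skipped entirely where that column's value is 0), then the shared partial
#sums are copied back out to all rows carrying that prefix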
for(i in 1:ncol(X)){
X_tmp = X[X_tmp_xrow_list[[i]]$first_xrow,1]
if(ncol(X)>1){
X = as.matrix(X[,2:ncol(X)])
}
Z_tmp = Z[X_tmp_xrow_list[[i]]$first_xrow]
Z_new = dot_like(X_tmp,Z_tmp,beta[i])
Z = Z_new[unlist(X_tmp_xrow_list[[i]]$xrow_in_X_tmp)]
}
#show that we get the same result
all(near(Z,expected_dot_out))
#compare operation cost (ignores overhead of my ugly method)
print(c(dot_ops_cost,add_count,dot_ops_cost/add_count))
```