
In statistical computing, we often have linear models that involve computing many vector dot products, and there's a lot of redundancy in the operations once you break them down. As an example, let X be the following 12x3 matrix:

[[-1,-1,-1]
,[-1,-1,1]
,[-1,1,-1]
,[-1,1,1]
,[0,-1,-1]
,[0,-1,1]
,[0,1,-1]
,[0,1,1]
,[1,-1,-1]
,[1,-1,1]
,[1,1,-1]
,[1,1,1]]

As you can see, there's considerable redundancy across rows, and in particular the only values present are the integers -1, 0, and 1.

In linear models (particularly hierarchical linear models) we need to compute the dot product of each row of X with a vector of real values Y:

[0.2796520, -0.2959688,  2.2617949]

As in (pseudo-code):

for(row in 1:12)
    out[row] = dot_product(X[row,],Y)
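
For concreteness, here's a small R sketch of that loop using the toy X and Y above (the variable names are just for illustration):

```
# the 12x3 X from above
X = matrix(
    c( -1,-1,-1,  -1,-1, 1,  -1, 1,-1,  -1, 1, 1
     ,  0,-1,-1,   0,-1, 1,   0, 1,-1,   0, 1, 1
     ,  1,-1,-1,   1,-1, 1,   1, 1,-1,   1, 1, 1 )
    , ncol = 3
    , byrow = TRUE
)

# the example Y from above
Y = c(0.2796520, -0.2959688, 2.2617949)

# row-wise dot products as an explicit loop ...
out = numeric(nrow(X))
for(row in 1:nrow(X)){
    out[row] = sum(X[row, ] * Y)
}

# ... which is the same thing as the matrix-vector product
all.equal(out, as.vector(X %*% Y))
```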

But given the assumption that X contains only the integers -1, 0, and 1, the multiplications are unnecessary: an entry of -1 is just a sign flip, and an entry of 0 simply drops that term from the sum. So an equivalent output is achieved by:

for(row in 1:12)
    out[row] = 0
    for(col in 1:3)
        if(X[row,col]==0):
            pass
        elif(X[row,col]<0):
            out[row] -= Y[col]
        else:
            out[row] += Y[col]

Which should be faster (correct?).
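
As a runnable R sketch of that branch-on-sign version (reusing the X and Y defined in the sketch above; again just illustrative, not an optimized implementation):

```
# no-multiply version: relies on every entry of X being -1, 0, or 1
# (X and Y as defined in the earlier sketch)
out2 = numeric(nrow(X))
for(row in 1:nrow(X)){
    for(col in 1:ncol(X)){
        if(X[row, col] > 0){
            out2[row] = out2[row] + Y[col]   # +1 entry: plain add
        } else if(X[row, col] < 0){
            out2[row] = out2[row] - Y[col]   # -1 entry: sign-flipped add
        }
        # 0 entries contribute nothing and are skipped
    }
}

# same result as the multiply-based version
all.equal(out2, as.vector(X %*% Y))
```

In interpreted R the loop overhead swamps any savings, so this only illustrates the operation count rather than making a timing claim; the payoff, if any, would be in compiled code.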

And yet this approach still ignores the redundancy across rows of X. I have a vague but itchy suspicion that further speedup could be achieved by capitalizing on that redundancy, and that there's probably an analogy to a standard problem in CS with an elegant solution that, being a stats guy, I haven't been exposed to. Any ideas?

(note: I'm arguably over-optimizing here, but the full compute context is Bayesian computing where these operations occur millions of times and dominate overall compute time; and these models are used fairly ubiquitously in science, so a speedup would improve many scientific workflows).

After playing a bit I worked out an approach that treats the sum creation as a branching process to minimize the number of unique sum operations. Here's R code showing the approach; it's not "elegant" in implementation, but it does tend to yield an order of magnitude-ish fewer operations (counting only adds/multiplies and treating multiplies as 3x more costly than adds):

```
library(tidyverse)

# make data
(
    expand_grid(
        a = c(-1,0,1)
        , b = c(-1,0,1)
        , c = c(-1,0,1)
        , d = c(-1,0,1)
        , e = c(-1,0,1)
    )
    %>% as.matrix()
) -> X

# example parameter vector
beta = rnorm(ncol(X))

# compute the standard row-wise dot-product
expected_dot_out = t(X %*% beta)

# standard dot product should take nrow(X)*ncol(X) multiplies & the same
# number of adds, and multiplies are about 3x slower than adds
dot_ops_cost = nrow(X)*ncol(X)*4

# precompute some indices
X_tmp_xrow_list = list()
for(i in 1:ncol(X)){
    (
        X[,1:i]
        %>% as_tibble()
        %>% mutate(xrow = 1:n())
    ) -> X_with_xrow
    (
        X_with_xrow
        %>% group_by_at(.vars = vars(-xrow))
        %>% summarise(
            first_xrow = xrow[1]
            , .groups = 'drop'
        )
        %>% mutate(
            xrow_in_X_tmp = 1:n()
        )
    ) ->
        tmp
    (
        X_with_xrow
        %>% select(-xrow)
        %>% left_join(tmp)
        %>% nest(xrow_in_X_tmp = xrow_in_X_tmp)
    ) ->
        X_tmp_xrow_list[[i]]
}

# helper function for the no-multiply dot product
dot_like = function(X, Z, beta){
    out = Z
    for(row in 1:length(X)){
        if(X[row] != 0){
            if(X[row] > 0){
                out[row] = out[row] + beta
            }else{
                out[row] = out[row] - beta
            }
            add_count <<- add_count + 1
        }
    }
    return(out)
}

# initialize a count of additions performed
add_count = 0
Z = rep(0, nrow(X))

# loop over columns to do branched sum
for(i in 1:ncol(X)){
    X_tmp = X[X_tmp_xrow_list[[i]]$first_xrow, 1]
    if(ncol(X) > 1){
        X = as.matrix(X[, 2:ncol(X)])
    }
    Z_tmp = Z[X_tmp_xrow_list[[i]]$first_xrow]
    Z_new = dot_like(X_tmp, Z_tmp, beta[i])
    Z = Z_new[unlist(X_tmp_xrow_list[[i]]$xrow_in_X_tmp)]
}

# show that we get the same result
all(near(Z, expected_dot_out))

# compare operation cost (ignores overhead of my ugly method)
print(c(dot_ops_cost, add_count, dot_ops_cost/add_count))
```
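
To make the shared-prefix idea concrete on the small 12x3 example at the top of the post, here's a hedged standalone sketch (names like `level1` are purely illustrative, and this is not the code above): partial sums are built column by column, and all rows sharing the same prefix of X-values reuse the same partial sum, so each level only creates as many new sums as there are distinct prefixes.

```
# Y from the example at the top of the post
Y = c(0.2796520, -0.2959688, 2.2617949)

# level 1: column 1 takes only 3 distinct values, so only 3 partial sums exist
# (the multiplications by -1/0/1 stand in for sign flips / drops)
level1 = c(-1, 0, 1) * Y[1]                              #  3 partial sums

# level 2: each level-1 sum branches into two by adding +/- Y[2]
level2 = as.vector(outer(c(-1, 1) * Y[2], level1, `+`))  #  6 partial sums

# level 3: each level-2 sum branches again by adding +/- Y[3]
level3 = as.vector(outer(c(-1, 1) * Y[3], level2, `+`))  # 12 partial sums

# only 3 + 6 + 12 = 21 partial sums are ever formed, versus 12*3 = 36
# multiply-add pairs in the naive row-wise dot product; level3 comes out in
# the same row order as the 12x3 X above, so with X from the first sketch
# all.equal(level3, as.vector(X %*% Y)) is TRUE
```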
