In statistical computing we often have linear models that involve computing many vector dot-products, and when you break those operations down there's a high amount of redundancy. As an example, let X
be the following 12x3 matrix:
[[-1,-1,-1]
,[-1,-1,1]
,[-1,1,-1]
,[-1,1,1]
,[0,-1,-1]
,[0,-1,1]
,[0,1,-1]
,[0,1,1]
,[1,-1,-1]
,[1,-1,1]
,[1,1,-1]
,[1,1,1]]
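For reference, in R this X can be generated with tidyr's expand_grid (the column names a/b/c below are just placeholders):

```
library(tidyr)
X = as.matrix(expand_grid(a = c(-1,0,1), b = c(-1,1), c = c(-1,1)))
```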
As you can see, there's high redundancy across rows, and in particular the only values present are the integers -1, 0, and 1.
In linear models (particularly hierarchical linear models) we need to compute the dot-product of each row of X with a vector of real values Y:
[0.2796520, -0.2959688, 2.2617949]
As in (pseudo-code):
for(row in 1:12)
    out[row] = dot_product(X[row,],Y)
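In R, that whole loop is just the matrix-vector product (a one-line sketch, assuming X is stored as a numeric matrix and Y as a numeric vector):

```
out = as.vector(X %*% Y)
```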
But with the assumption that X contains only the integers -1, 0, and 1, the multiplications are unnecessary, amounting to mere sign-flips (in the case of -1) or exclusion from the sum (in the case of 0). So an equivalent output is achieved by:
for(row in 1:12)
    out[row] = 0
    for(col in 1:3)
        if(X[row,col]==0):
            pass
        elif(X[row,col]<0):
            out[row] -= Y[col]
        else:
            out[row] += Y[col]
Which should be faster (correct?).
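For concreteness, here's a direct R translation of that branchy version (just a sketch; the function name dot_no_mult is mine):

```
dot_no_mult = function(X, Y){
  out = numeric(nrow(X))
  for(row in 1:nrow(X)){
    for(col in 1:ncol(X)){
      x = X[row, col]
      if(x > 0){
        out[row] = out[row] + Y[col]   #+1 entry: add
      }else if(x < 0){
        out[row] = out[row] - Y[col]   #-1 entry: subtract
      }
      #0 entries contribute nothing, so they're skipped entirely
    }
  }
  return(out)
}
```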
And yet this approach still ignores the redundancy across rows of X. I have a vague but itchy suspicion that further speedup could be achieved by capitalizing on that redundancy, and that this is probably analogous to some standard CS problem with an elegant solution that, being a stats guy, I just haven't been exposed to. Any ideas?
(note: I'm arguably over-optimizing here, but the full compute context is Bayesian computing where these operations occur millions of times and dominate overall compute time; and these models are used fairly ubiquitously in science, so a speedup would improve many scientific workflows).
After playing a bit I worked out an approach that treats the sum creation as a branching process to minimize the number of unique sum operations. Here's R code showing the approach; it's not "elegant" in implementation, but it tends to yield roughly an order of magnitude fewer operations (counting only adds/multiplies and treating multiplies as 3x more costly than adds):
```
library(tidyverse)
#make data
(
expand_grid(
a = c(-1,0,1)
, b = c(-1,0,1)
, c = c(-1,0,1)
, d = c(-1,0,1)
, e = c(-1,0,1)
)
%>% as.matrix()
) ->
X
#example parameter vector
beta = rnorm(ncol(X))
#compute the standard row-wise dot-product
expected_dot_out = t(X%*%beta)
dot_ops_cost = nrow(X)*ncol(X)*4
#standard dot product should take nrow(X)*ncol(X) multiplies & the same
#number of adds, and multiplies are about 3x slower than adds
#precompute some indices
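#for each prefix length i, the loop below records (a) the first row of X carrying
#each unique pattern of X[,1:i] and (b) for every row, which unique pattern it
#maps to, so partial sums can be shared across rows with identical prefixes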
X_tmp_xrow_list = list()
for(i in 1:ncol(X)){
(
X[,1:i,drop=FALSE] #drop=FALSE keeps the single-column case a matrix for as_tibble()
%>% as_tibble()
%>% mutate(xrow=1:n())
) ->
X_with_xrow
(
X_with_xrow
%>% group_by_at(.vars=vars(-xrow))
%>% summarise(
first_xrow = xrow[1]
, .groups = 'drop'
)
%>% mutate(
xrow_in_X_tmp = 1:n()
)
) ->
tmp
(
X_with_xrow
%>% select(-xrow)
%>% left_join(tmp)
%>% nest(xrow_in_X_tmp=xrow_in_X_tmp)
) ->
X_tmp_xrow_list[[i]]
}
#helper function for the no-multiply dot product
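#arguments: X = one column's values (in -1/0/1), one per unique prefix;
#Z = the partial sums accumulated so far; beta = the scalar coefficient
#for this column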
dot_like = function(X,Z,beta){
out = Z
for(row in 1:length(X)){
if(X[row]!=0){
if(X[row]>0){
out[row] = out[row] + beta
}else{
out[row] = out[row] - beta
}
add_count <<- add_count + 1
}
}
return(out)
}
#initialize a count of additions performed
add_count = 0
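#Z holds the running partial dot-products, one entry per row of X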
Z = rep(0,nrow(X))
#loop over columns to do branched sum
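#at step i, beta[i] gets added/subtracted just once per unique prefix of X[,1:i]
#(and skipped entirely where that column's value is 0), then the shared partial
#sums are copied back out to all rows carrying that prefix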
for(i in 1:ncol(X)){
X_tmp = X[X_tmp_xrow_list[[i]]$first_xrow,1]
if(ncol(X)>1){
X = as.matrix(X[,2:ncol(X)])
}
Z_tmp = Z[X_tmp_xrow_list[[i]]$first_xrow]
Z_new = dot_like(X_tmp,Z_tmp,beta[i])
Z = Z_new[unlist(X_tmp_xrow_list[[i]]$xrow_in_X_tmp)]
}
#show that we get the same result
all(near(Z,expected_dot_out))
#compare operation cost (ignores overhead of my ugly method)
print(c(dot_ops_cost,add_count,dot_ops_cost/add_count))
```