
As the title suggests, I have been looking into an application of D-optimal design. I read this thread

What does it mean to have a determinant equal to zero?

and found some of the answers interesting, so I thought I would post some related questions.

The data this study is based on consist of 2019 rows and 43 floating-point columns. All of the inputs are normalized by z-score and then scaled to the range 0 to 1. In this example, D-optimal design theory is used to select a subset of the data that achieves the maximal coverage of the 43-dimensional input space (the n-space) possible with a specified number of rows. The metric used to describe the n-space coverage of a given set of rows is the determinant of X'X (Det-X'X). This is essentially the n-space volume delineated by the numerical values contained in the set of rows: the larger the determinant, the greater the volume and the greater the coverage in n-space. Det-X'X was computed in the usual way: the transpose of X was created as X', the product matrix X'X was formed, and the determinant of X'X was evaluated.
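
In case it helps, here is a rough NumPy sketch of how Det-X'X was computed; the random array and the variable names are just stand-ins for the real 2019x43 matrix, not the actual code:

    import numpy as np

    # Stand-in for the real data: the actual X is the 2019 x 43 matrix of
    # z-scored inputs rescaled to the 0-1 range.
    rng = np.random.default_rng(0)
    X = rng.random((2019, 43))

    XtX = X.T @ X                    # the 43 x 43 product matrix X'X
    det_full = np.linalg.det(XtX)    # Det-X'X for all 2019 rows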

The first method investigated was a leave-one-out algorithm. The Det-X'X of the entire 2019x43 matrix was taken as a baseline (Det-X'X_n2019). The first row of data was then zeroed out, Det-X'X was re-evaluated, and the difference between Det-X'X_n2019 and the Det-X'X with row 1 zeroed out was recorded. This difference gives the "coverage loss", meaning the shrinkage in the volume of n-space when the information from the first row is removed. The process was repeated for each row, giving a coverage-loss value for every row. Rows were then ranked on coverage loss, so that the first row in the ranking is the row that showed the greatest loss of coverage when its information was removed. For every row, there was some loss of coverage when the row's information was zeroed out, although the magnitude of the loss varied considerably.
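
Roughly, the leave-one-out pass looks like the sketch below (it continues from the stand-in X above; this is the logic, not the actual code):

    # Leave-one-out: zero out each row in turn and record the drop in Det-X'X.
    baseline = np.linalg.det(X.T @ X)                # Det-X'X_n2019

    coverage_loss = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        X_i = X.copy()
        X_i[i, :] = 0.0                              # zero out row i
        coverage_loss[i] = baseline - np.linalg.det(X_i.T @ X_i)

    # Rank rows so the first entry is the row whose removal costs the most coverage.
    loo_rank = np.argsort(coverage_loss)[::-1]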

A second, step-forward procedure was implemented to see whether the results would differ. The first two rows from the leave-one-out rank-ordered list were placed in a new matrix with the same input columns (2x43). Det-X'X of this matrix returned as zero, so additional rows were sequentially added from the leave-one-out rank-ordered list until Det-X'X became non-zero; it took 35 rows to reach this point. Once a non-zero Det-X'X was established, each row not already in the matrix was added to the 35 rows one at a time and Det-X'X was re-evaluated. This identifies the row that produces the largest increase in the coverage volume when added to the initial 35 rows. That row was then added to the matrix and the procedure was repeated to find the next row to add. As with leave-one-out, the step-forward method showed an increase in Det-X'X for every row, meaning that no row was found whose addition failed to increase the coverage volume. As expected, the magnitude of the increase diminished with each addition.
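
The step-forward procedure, again only as a sketch of the logic (it reuses X and loo_rank from the sketches above and takes the 35 seed rows straight from the top of the leave-one-out ranking; as written it is far too slow for 2019 rows, but it shows the idea):

    # Step-forward: start from the seed rows and repeatedly add whichever
    # remaining row gives the largest increase in Det-X'X.
    selected = list(loo_rank[:35])                   # seed rows from leave-one-out
    remaining = [i for i in range(X.shape[0]) if i not in selected]

    while remaining:
        current = np.linalg.det(X[selected].T @ X[selected])
        # Gain in Det-X'X from adding each candidate row j to the current set.
        gains = [(np.linalg.det(X[selected + [j]].T @ X[selected + [j]]) - current, j)
                 for j in remaining]
        best_gain, best_row = max(gains)             # row with the largest increase
        selected.append(best_row)
        remaining.remove(best_row)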

I find it surprising that both methods found that all rows contribute to the coverage volume. My understanding is that this volume should be defined by the minimum and maximum numerical values in the input columns. At some point, there should be rows whose values fall entirely within the bounds defined by other rows. Suppose a matrix contains the following data:

0.0  1.0
1.0  0.0
0.5  0.5

The first two rows would define an area that encompasses the third row, so the third row would not contribute to the coverage volume. According to the two methods described above, zeroing out the third row should not decrease Det-X'X, since the information in row 3 (the point 0.5, 0.5) already falls within the area defined by rows 1 and 2. Likewise, the third row should not add to the area as long as rows 1 and 2 are already in the matrix.
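
For anyone who wants to reproduce the toy case, this is the check I have in mind (if my reasoning is right, the two determinants should come out the same):

    import numpy as np

    A = np.array([[0.0, 1.0],
                  [1.0, 0.0],
                  [0.5, 0.5]])

    det_with_row3    = np.linalg.det(A.T @ A)            # all three rows
    det_without_row3 = np.linalg.det(A[:2].T @ A[:2])    # rows 1 and 2 only
    print(det_with_row3, det_without_row3)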

I would be interested to know if there are opinions on these findings. Specifically,

  1. In the step-forward method, why does Det-X'X for two rows return as zero? I don't think there is any possibility that these two rows are linearly dependent.

  2. The magnitude of the determinants. Det-X'X for all 2019 rows is 3.17752e+28, which is a rather large number. Det-X'X for the initial 35 rows used in the step-forward method is 3.79668e-150, which is an unbelievably small number. If I use principal components of the data, the numbers are even larger, with Det-X'X for all 2019 rows being on the order of e+122, which is sort of on the scale of the mass of the universe. These numbers make me question whether I have set this up properly.

  3. Why are there no rows that do not contribute to the coverage volume?

  4. Is there anything in my current understanding that is not correct?

I don't have a great deal of experience in linear algebra, so I don't have much of a mental compass to let me judge if my results are reasonable. I can post any of the data described here if that would be helpful.

Thanks for reading at least this far,

LMHmedchem
