I've been trying to implement matrix multiplication for square matrices whose size is a multiple of 4 (e.g. 4x4, 8x8, 16x16). My program works for 4x4 matrices, but for every larger size it only fills the first 4 rows of the result (those rows do contain the correct values).
My guess is that I haven't implemented the proper algorithm, so I did some searching and came across this link:
http://www.mathcs.emory.edu/~cheung/Courses/561/Syllabus/90-parallel/SIMD.html
While I understand the first step of the algorithm, I would appreciate it if someone could explain why the final row of 12 18 18 in the example becomes 13 19 19. I know these are the correct values for the result matrix, but I seem to be missing a step, and I think that if I understood it I could come up with a proper algorithm to solve my problem.
I can send the code I have so far by private message if anyone is interested, but it's basically an adaptation of this algorithm: https://codereview.stackexchange.com/questions/101144/simd-matrix-multiplication
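
For reference, here is a minimal sketch of the kind of generalization described above. It assumes row-major `float` matrices of size n x n with n a multiple of 4. The name `matmul_n4` is hypothetical, and this is neither the code from the Code Review post nor the algorithm from the Emory page. One plausible cause of "only the first 4 rows are filled" is a row loop that is still hard-coded to 4, so the sketch loops over all n rows and all n values of k, and only steps the column index by 4. A scalar fallback is included for compilers that do not define `__SSE__`.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: C = A * B for row-major n x n float matrices.
 * Assumption: n is a multiple of 4. Unaligned loads/stores are used,
 * so the buffers do not need 16-byte alignment. */
void matmul_n4(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {            /* every row, not just 4 */
        for (size_t j = 0; j < n; j += 4) {     /* 4 output columns at a time */
#if defined(__SSE__)
            __m128 acc = _mm_setzero_ps();
            for (size_t k = 0; k < n; ++k) {    /* full inner dimension */
                /* Broadcast A[i][k] to all 4 lanes. */
                __m128 a_ik = _mm_set1_ps(a[i * n + k]);
                /* Load B[k][j..j+3]. */
                __m128 b_kj = _mm_loadu_ps(&b[k * n + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a_ik, b_kj));
            }
            _mm_storeu_ps(&c[i * n + j], acc);
#else
            /* Scalar fallback with the same 4-lane structure. */
            float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
            for (size_t k = 0; k < n; ++k) {
                for (size_t lane = 0; lane < 4; ++lane) {
                    acc[lane] += a[i * n + k] * b[k * n + j + lane];
                }
            }
            for (size_t lane = 0; lane < 4; ++lane) {
                c[i * n + j + lane] = acc[lane];
            }
#endif
        }
    }
}
```

Calling `matmul_n4(a, b, c, 8)` on three 64-element arrays should fill all 8 rows of `c`. If a version like this fills every row but the original does not, the difference in loop bounds is probably the place to look.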