
There were multiple assignments where we were required to utilize SIMD intrinsics to achieve greater performance for various calculations used frequently in game programming.
While trying to piece together each function, I essentially lived out of the Intel Intrinsics Guide and Google Sheets, putting together spreadsheets that look like this in order to find the solution costing the fewest cycles:

Above shows one of the solutions I found for Matrix * Vector, which finishes in ~20 cycles by taking advantage of the large throughput of _mm_hadd_ps to compute the other half of the eventual return vector via regular _mm_add_ps.
Below are the test results for each function, demonstrating the improvement between standard arithmetic and SIMD arithmetic, as well as a brief comment explaining each function’s purpose.
// Basic 4×4 matrix multiplication
——— Matrix * Matrix ———
Matrix Orig: 1.55125 s
SIMD: 0.24054 s
Ratio: 6.449126
// Basic 4×4 matrix * 1×4 vector, with vector on right-hand side
——— Matrix * Vect ———
Matrix * Vect Orig: 1.304800 s
SIMD: 0.718984 s
Ratio: 1.814782
// Basic 1×4 vector * 4×4 matrix, with matrix on right-hand side
——— Vect * Matrix ———
Vect * Matrix Orig: 1.308423 s
SIMD: 0.560002 s
Ratio: 2.336460
// Standard LERP, given two points and a T value
——— LERP ———
LERP Orig: 1.753906 s
SIMD: 0.481867 s
Ratio: 3.639814
// Computing the intersection point between a line and a plane
——— LinePlane ———
LinePlane Orig: 1.282856 s
SIMD: 0.159019 s
Ratio: 8.067318
// Computing the shortest distance between a point and a line
——— LinePoint ———
LinePoint Orig: 2.656251 s
SIMD: 0.289358 s
Ratio: 9.179821
// Determining if a BSphere and OBB collide
——— Collide ———
collide Orig: 2.606204 s
SIMD: 0.898253 s
Ratio: 2.901413
/* Simplifying an expression containing multiple matrices and a single vector to perform operations right-to-left instead of left-to-right in order to take advantage of the reduced operations needed for M*V. In other words, transforming this:
Vector vA = mB * mC * mD * mE * mF * vG;
Into this:
Vector vA = (mB * (mC * (mD * (mE * (mF * vG)))));
This required the use of proxy object chaining and inlining to shortcut directly to a single conversion operation, which calculated each subsequent M*V as a series of SIMD intrinsic operations. */
——— colMajor ———
colMajor Orig: 4.136135 s
SIMD: 0.762374
Ratio: 5.425333