The goal of this project was to take a middleware particle effect and optimize it as much as possible. It also doubled as a class-wide competition to see who could squeeze the most performance out of it.
In the video below, you can see a (very) short side-by-side demonstration of the visual output and timings for my optimized version and the original baseline version, with the program’s run-time parameters set to 20k total particles and a 17-second loop.
I really, really enjoyed this project.
I pored over my past assignments and notes. I scrutinized every formula. I walked through the code line-by-line. I used every ounce of knowledge that I had to take this puzzle apart and piece it back together. In the end, I was able to achieve a significant increase in performance compared to the baseline program.
Below are the areas of improvement I was able to identify and address, in roughly chronological order (based on when each change was made). While every adjustment listed was beneficial, three were especially impactful and are highlighted below.
- Properly defining the Big Four across every class, whether defaulted, deleted, or user-defined
- Moving variable initializations to the init list whenever possible
- Making use of ‘const’ in both function signatures and parameters
- Altering many function parameters to be pass-by-reference instead of pass-by-value
- Converting all Doubles to Floats
- Converting temporaries into members, global statics, or removing them entirely
- Converting the original STL List of Particles into a standard array, which:
  - Eliminated constant resizing and reallocation of the List
  - Removed all particle constructors beyond the initial run-time setup, and all particle destructors outside of closing the program
  - Allowed me to convert loops using pointers and iterators to simple indexed for-loops
- Simplifying and reducing operations inside functions and loops as much as possible, eliminating many of them altogether
- Facilitating compiler-led optimization by ensuring 16-byte alignment for objects frequently used in arithmetic operations (specifically for SIMD)
- Reducing the size of objects for improved caching: specifically, shrinking a single Particle object from 512 bytes to 112 bytes
  - This allowed me to further split Particles into Hot (48-byte) and Cold (64-byte) structs, as many loops involved Particles, but only a single loop required the Matrix transform, which became the Cold struct
- Adjusting functions to make use of RVO, especially for Vector and Matrix arithmetic
- Identifying a number of Matrix elements that were uniformly set to 0.0f for every Particle, allowing me to eliminate a substantial number of operations
  - I was fortunate enough to be taking Applied 3D Geometry concurrently with this course, which was massively helpful when it came to reducing, and ultimately eliminating, the GetAdjugate(), Inverse(), and Determinant() functions
- Substituting in SIMD intrinsics for certain complex arithmetic operations
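To make the array conversion and the Hot/Cold split more concrete, here is a minimal sketch of how such a layout could look. It is not the project's actual code: the names `Vec4`, `ParticleHot`, `ParticleCold`, and `ParticleSystem`, the individual fields, and the use of a `std::vector` sized once at start-up in place of a raw array are all assumptions made for illustration. Only the 48-byte and 64-byte sizes and the 16-byte alignment come from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 16-byte alignment lets the compiler use aligned SIMD loads and stores.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Hot data: read and written by every per-frame loop (3 * 16 = 48 bytes).
// The field choice is hypothetical.
struct ParticleHot {
    Vec4 position;
    Vec4 velocity;
    Vec4 misc;  // hypothetical: x holds the particle's age in seconds
};

// Cold data: only the draw path needs the transform (4 * 16 = 64 bytes).
struct ParticleCold {
    Vec4 row[4];
};

static_assert(sizeof(ParticleHot) == 48, "hot struct should stay at 48 bytes");
static_assert(sizeof(ParticleCold) == 64, "cold struct should stay at 64 bytes");
static_assert(alignof(ParticleHot) == 16, "hot struct should be 16-byte aligned");

class ParticleSystem {
public:
    // Both arrays are allocated once, up front. Nothing is constructed,
    // destroyed, or reallocated while the simulation is running.
    explicit ParticleSystem(const std::size_t count)
        : hot_(count), cold_(count) {}

    std::size_t size() const { return hot_.size(); }

    ParticleHot& hot(const std::size_t i) {
        assert(i < hot_.size());
        return hot_[i];
    }

    const ParticleCold& cold(const std::size_t i) const {
        assert(i < cold_.size());
        return cold_[i];
    }

    // Touches only the hot array, so each cache line carries more useful data.
    void update(const float dt) {
        const std::size_t n = hot_.size();
        for (std::size_t i = 0; i < n; ++i) {
            ParticleHot& p = hot_[i];
            p.position.x += p.velocity.x * dt;
            p.position.y += p.velocity.y * dt;
            p.position.z += p.velocity.z * dt;
            p.misc.x += dt;
        }
    }

    // The only loop that writes the cold array. For simplicity this builds a
    // translation-only transform; a real system would also scale and rotate.
    void buildTransforms() {
        const std::size_t n = hot_.size();
        for (std::size_t i = 0; i < n; ++i) {
            const Vec4& pos = hot_[i].position;
            ParticleCold& c = cold_[i];
            c.row[0] = Vec4{1.0f, 0.0f, 0.0f, 0.0f};
            c.row[1] = Vec4{0.0f, 1.0f, 0.0f, 0.0f};
            c.row[2] = Vec4{0.0f, 0.0f, 1.0f, 0.0f};
            c.row[3] = Vec4{pos.x, pos.y, pos.z, 1.0f};
        }
    }

private:
    std::vector<ParticleHot> hot_;
    std::vector<ParticleCold> cold_;
};
```

The `static_assert` lines are a cheap way to keep the struct sizes from creeping back up as fields are added, and the indexed loops replace the iterator-based traversal a linked list would need.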
Here are the full testing results when running the program at the intended volume of 300k particles for a full 17-second loop.
TESTING
Sample Size
Baseline: 267 output loops
Optimized: 279 output loops
Test Settings
Number of Particles — 300k
Loop Duration — 17 seconds
Update() Time
Baseline: 85.43086111 ms
Optimized: 3.814734409 ms
Delta: -81.6161267 ms
Improvement: 22.3949696 times faster
Draw() Time
Baseline: 99.40260173 ms
Optimized: 6.82981972 ms
Delta: -92.57278201 ms
Improvement: 14.55420579 times faster
Total Time
Baseline: 184.8334631 ms
Optimized: 10.6445542 ms
Delta: -174.1889089 ms
Improvement: 17.36413377 times faster
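For reference, the Delta and Improvement figures above are consistent with taking optimized minus baseline for the delta and baseline divided by optimized for the speedup. A small sketch of that arithmetic follows; the `Timing` struct and both function names are hypothetical and not part of the project.

```cpp
#include <cassert>

// Average time per output loop, in milliseconds, for one measured stage.
struct Timing {
    double baselineMs;
    double optimizedMs;
};

// Negative when the optimized build is faster.
double deltaMs(const Timing& t) {
    return t.optimizedMs - t.baselineMs;
}

// How many times faster the optimized build is than the baseline.
double speedup(const Timing& t) {
    assert(t.optimizedMs > 0.0);
    return t.baselineMs / t.optimizedMs;
}
```

Passing the Update() averages, `Timing{85.43086111, 3.814734409}`, gives a delta of about -81.616 ms and a speedup of about 22.39.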
Sample Results
Optimized

Baseline

Class Contest Results
Victory.

Note: The actual testing computer was faster than my own and produced better results.