Particle Project

The goal of this project was to take a middleware particle effect and optimize it as much as possible. It also doubled as a class-wide competition to see who could squeeze the most performance out of it.

In the video below, you can see a (very) short demonstration of the side-by-side visual output and timings for my optimized version and the original baseline version, with the program’s run-time parameters set to 20k total particles and a 17-second loop.

I really, really enjoyed this project.

I pored over my past assignments and notes. I scrutinized every formula. I walked through the code line-by-line. I used every ounce of knowledge that I had to take this puzzle apart and piece it back together. In the end, I was able to achieve a significant increase in performance compared to the baseline program.

Below are a few areas of improvement I was able to identify and address, in mostly chronological order (based on when a particular alteration was made). While each listed adjustment was beneficial, there were three that were especially impactful, and are highlighted below.

  • Properly defining the Big Four across every class, whether defaulted, deleted, or user-defined
  • Moving variable initializations to the init list whenever possible
  • Making use of ‘const’ in both function signatures and parameters
  • Altering many function parameters to be pass-by-reference instead of pass-by-value
  • Converting all Doubles to Floats
  • Converting temporaries into members, global statics, or removing them entirely
  • Converting the original STL List of Particles into a standard array, which:
    • Eliminated constant resizing and reallocation of the List
    • Removed all particle constructors beyond the initial run-time set up, and all particle destructors outside of closing the program
    • Allowed me to convert loops using pointers and iterators to simple indexed for-loops
  • Simplifying and reducing operations inside functions and loops as much as possible, eliminating many of them altogether
  • Facilitating compiler-led optimization by ensuring 16-byte alignment for objects frequently used in arithmetic operations (specifically for SIMD)
  • Reducing the size of objects for improved caching—specifically, reducing the size of a single Particle object from 512 bytes to 112 bytes
    • This allowed me to further split Particles into Hot (48-byte) and Cold (64-byte) structs, as many loops involved Particles, but only a single loop required the Matrix transform which became the Cold struct
  • Adjusting functions to make use of RVO, especially for Vector and Matrix arithmetic
  • Identifying a number of Matrix elements that were uniformly set to 0.0f for every Particle, allowing me to eliminate a substantial number of operations
    • I was fortunate enough to be taking Applied 3D Geometry concurrently with this course, which was massively helpful when it came to reducing—and ultimately eliminating—the GetAdjugate(), Inverse(), and Determinant() functions
  • Substituting in SIMD intrinsics for certain complex arithmetic operations 
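
To make the Hot/Cold split and the array conversion from the list above more concrete, here is a minimal sketch of how such a layout could look. It is not the project's actual code: the field names, the helper functions UpdateParticles() and BuildTransforms(), and the choice to build only a translation-and-scale matrix are all assumptions made for illustration. Only the sizes (48-byte Hot, 64-byte Cold) and the 16-byte alignment come from the write-up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 16-byte-aligned vector type, friendly to SIMD loads and stores.
struct alignas(16) Vec4 { float x, y, z, w; };

// Hot data: touched by most per-frame loops (field names are assumptions).
struct alignas(16) ParticleHot {
    Vec4  position;
    Vec4  velocity;
    float life;
    float rotation;
    float rotationVelocity;
    float scale;
};

// Cold data: the 4x4 transform, needed by only one loop.
struct alignas(16) ParticleCold {
    float m[16];
};

static_assert(sizeof(ParticleHot) == 48, "Hot struct should stay at 48 bytes");
static_assert(sizeof(ParticleCold) == 64, "Cold struct should stay at 64 bytes");

// Simple indexed loop over a plain array: no iterators, no reallocation,
// and the 64-byte matrices never enter the cache during the update.
inline void UpdateParticles(ParticleHot* hot, std::size_t count, float dt) {
    assert(hot != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        ParticleHot& p = hot[i];
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
        p.position.z += p.velocity.z * dt;
        p.rotation   += p.rotationVelocity * dt;
        p.life       += dt;
    }
}

// The single loop that writes the cold transform. Rotation is omitted for
// brevity; elements known to be zero are written once and never recomputed.
inline void BuildTransforms(const ParticleHot* hot, ParticleCold* cold,
                            std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const ParticleHot& p = hot[i];
        float* m = cold[i].m;
        for (int k = 0; k < 16; ++k) m[k] = 0.0f;
        m[0]  = p.scale;        // uniform scale on the diagonal
        m[5]  = p.scale;
        m[10] = p.scale;
        m[12] = p.position.x;   // translation in the last row
        m[13] = p.position.y;
        m[14] = p.position.z;
        m[15] = 1.0f;
    }
}
```

Keeping the two structs in separate arrays means a loop that only reads positions and velocities strides through 48 bytes per particle instead of 112, which is where the caching benefit described above would come from.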

Here are the full testing results when running the program at the intended volume of 300k particles for a full 17-second loop.

TESTING
Sample Size
Baseline: 267 output loops
Optimized: 279 output loops

Test Settings
Number of Particles — 300k
Loop Duration — 17 seconds

Update() Time
Baseline: 85.43086111 ms
Optimized: 3.814734409 ms
Delta: -81.6161267 ms
Improvement: 22.3949696 times faster

Draw() Time
Baseline: 99.40260173 ms
Optimized: 6.82981972 ms
Delta: -92.57278201 ms
Improvement: 14.55420579 times faster

Total Time
Baseline: 184.8334631 ms
Optimized: 10.6445542 ms
Delta: -174.1889089 ms
Improvement: 17.36413377 times faster
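
For reference, the Delta and Improvement rows above follow from the two averaged timings: Delta is the optimized time minus the baseline time, and Improvement is the baseline time divided by the optimized time. A small sketch of that arithmetic is below; the Speedup struct and ComputeSpeedup() are hypothetical names, not part of the project.

```cpp
#include <cassert>

// Hypothetical helper type holding the two derived figures.
struct Speedup {
    double deltaMs;      // optimized - baseline (negative means faster)
    double timesFaster;  // baseline / optimized
};

// Derives Delta and Improvement from two average timings in milliseconds.
inline Speedup ComputeSpeedup(double baselineMs, double optimizedMs) {
    assert(optimizedMs > 0.0);
    return Speedup{ optimizedMs - baselineMs, baselineMs / optimizedMs };
}
```

Calling ComputeSpeedup(85.43086111, 3.814734409) with the Update() timings reproduces the Delta and Improvement figures listed for that section, to within rounding.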

Sample Results

Optimized

Baseline

Class Contest Results
Victory.

Note: The actual testing computer was faster than my own and produced better results.