GPU-based particle system

by Eele Roet

The goal of this research and development project is to create a particle system that is updated by the GPU. Performance is gained by offloading particle calculations from the CPU to the GPU, hopefully freeing enough headroom to increase the number of particles for effects like this:

In this blog post, I’ll be taking you on my journey as I iteratively work towards creating the final product. I’ll be sharing the techniques I’ve tried and tested for spawning and rendering particles. Sources are linked for extra information.

Problems we are trying to solve in each iteration

  1. Storing the data of millions of particles efficiently.
  2. Updating the particles using the GPU.
  3. Rendering large amounts of objects efficiently.

Quick definitions

  1. “Compute shader” – a piece of code that runs on the GPU. Code on the GPU runs in parallel (multiple at the same time), while code on the CPU runs in series (one piece after the other).
  2. “Instancing” – a rendering technique that draws 1 object lots of times, instead of drawing lots of objects 1 time. This is efficient when there are lots of instances of one object.
  3. “Transform matrix” – a 4×4 mathematical matrix that describes the position, rotation, and scale of an object.
  4. “Buffer” – a list of data stored in GPU memory.

Iteration 1

In iteration 1 I decided to tackle the three problems from above as follows:

  1. Storing particle position, rotation, scale, and velocity in one array in C# and storing transform matrices in another C# array.
  2. Initializing the positions with a compute shader.
  3. Rendering the particles using GPU instancing.

I chose this setup because both methods were mentioned in class (compute shaders and instancing) and I saw no reason why I could not combine these into a particle system.

After looking at existing particle systems with high particle counts I found Unity VFX. [1]

In their documentation they mention that this system uses compute shaders, presumably for updating the particles. [2]

On top of this, VFX supports the option for GPU instancing of the particles. [3]

These methods are consistently advertised as fast, so I knew they were a good starting point for what I was trying to accomplish.

When we are done with this iteration we should have a system that works like this:

Above we can see that the GPU gives the particles random positions in the compute shader, then the CPU makes transform matrices out of these positions and sends them off to the GPU to be rendered. (We will see later why this is a horrible setup that should never be used.)

Note the GREAT WALL OF DATA SEPARATION. This just represents the fact that the CPU and GPU cannot access each other's memory. Any data that the GPU needs from the CPU needs to be moved across the wall before it can be used.

Setup of compute shader

Compute shaders [4] were new to me and shader language [5] can be a little abstract.

So below I have the most basic setup of a compute shader with explanations about how to increase all ints in an array by one.

// data in GPU memory is called a buffer.
// this buffer contains integer numbers.

RWStructuredBuffer<int> ExampleBuffer;

// the pragma tells Unity which function is the entry point (kernel) of this compute shader.

#pragma kernel ExamplePragma

// below we set the thread group size.
// the dimensions that we use, 256 by 1 by 1, represent how many threads will run the example code per group.
// so, 256 threads will execute this code in parallel. the number of threads is equal to the product of the dimensions.
// threads in the same group can share data with each other, which is helpful when, for example, a pixel needs to know information about the pixels around it.

[numthreads(256, 1, 1)]

// the id parameter is provided automatically and tells us which thread is being processed.

void ExamplePragma (uint3 id : SV_DispatchThreadID)
{
    // here we use the id to get an int from our buffer.
    // every thread will get a different int because the id is being used as the index into the buffer.

    int myThreadsInt = ExampleBuffer[id.x];

    // we can now perform whatever calculations we want.

    myThreadsInt += 1;

    // and save the result in the buffer.

    ExampleBuffer[id.x] = myThreadsInt;
}

Of course, for our particle system we would have a buffer of particles, and every frame we would add each particle's velocity to its position.
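The update rule itself is simple; here is a minimal Python sketch of the same logic run in series (the names and values are illustrative, and on the GPU each loop iteration would be its own thread):

```python
# Minimal sketch of the per-frame particle update the compute shader performs.
# On the GPU, each thread runs the body for one index i in parallel;
# here we loop over the indices in series instead.

def update_particles(positions, velocities, dt):
    """Advance every particle by its velocity (simple Euler integration)."""
    for i in range(len(positions)):            # one GPU thread per index i
        px, py, pz = positions[i]
        vx, vy, vz = velocities[i]
        positions[i] = (px + vx * dt, py + vy * dt, pz + vz * dt)

positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
velocities = [(1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
update_particles(positions, velocities, dt=0.5)
print(positions)  # [(0.5, 0.0, 0.0), (1.0, 1.5, 3.0)]
```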

Problems with implementation

  1. C# and shader language (HLSL [5]) have different data structures, so not all C# data can be used in buffers without conversion.
  2. Passing single structs to the compute shader is not really possible; this needs to be a buffer of length one.
  3. Structs inside structs seem impossible when sending data from CPU to GPU; structs need to be of “depth” one.
  4. I was unsure about the thread size parameter of compute shaders. How many threads are optimal? How do thread IDs work? A new framework for programming is always confusing.


Solutions

  1. Have a copy of structs on both the CPU and GPU side. This will ensure successful transfer of data.
  2. Use “constant buffers” with a length of one for settings. [6]
  3. Use descriptive names for variables in structs (“emissionBoxSize” not “emissionBox.size”) to keep the depth to one.
  4. AMD GPUs run in “waves” of 64 threads and NVIDIA GPUs in “warps” of 32, so the thread group size should be a multiple of 64 for optimal use of GPU performance. [7]
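To illustrate solution 1, here is a small Python sketch that uses the standard struct module to compute the byte stride a matching buffer would need. The particle layout below is hypothetical, not the system's actual struct:

```python
import struct

# Hypothetical particle layout mirroring an HLSL struct of "depth one":
# float3 position, float3 velocity, float lifetime  ->  7 floats.
# The C# side must create its buffer with exactly this stride, or the
# GPU will read garbage at the wrong offsets.
PARTICLE_FORMAT = "<7f"   # little-endian, 7 x 32-bit float

stride = struct.calcsize(PARTICLE_FORMAT)
print(stride)  # 28 bytes: this number must match on both sides of the wall

# Packing one particle exactly as the GPU expects to read it:
packed = struct.pack(PARTICLE_FORMAT,
                     0.0, 1.0, 2.0,   # position
                     0.5, 0.0, 0.0,   # velocity
                     5.0)             # lifetime
print(len(packed))  # 28
```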

Setup of GPU instancing

The idea of instancing is simple. [8]

When we are drawing the same object lots of times, some things will not change about that object from instance to instance. For instance, the mesh will have the same number of vertices no matter how often we render it.

So why don’t we just keep one copy of this static data? Getting rid of the duplicate data will save a lot of space and time.

Problems of implementation

  1. Instancing methods usually want 4×4 matrices as input, [9] and my particles only have three Vector3s for: position, rotation and scale.
  2. The maximum instance batch size is only 1023, and I want one million instances. [9]


Solutions

  1. I used the Matrix4x4.TRS [10] method on the CPU to make the matrices for instancing.
  2. Copy the array of Matrix4x4s into a 2D array, where the second dimension has a length of up to 1000; this creates batches of 1000 for instancing.
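A minimal Python sketch of the batching workaround, assuming a flat list of transform matrices (the batch size of 1000 matches the text; Unity's documented cap is 1023):

```python
MAX_BATCH = 1000  # stay under Unity's DrawMeshInstanced limit of 1023

def make_batches(matrices, batch_size=MAX_BATCH):
    """Split one flat list of transform matrices into batches that each
    fit in a single instanced draw call."""
    return [matrices[i:i + batch_size] for i in range(0, len(matrices), batch_size)]

# e.g. 2500 dummy "matrices" -> 3 draw calls of sizes 1000, 1000, 500
matrices = list(range(2500))
batches = make_batches(matrices)
print([len(b) for b in batches])  # [1000, 1000, 500]
```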

And after all that work what do we have?

Well, not a lot. We are just drawing static squares.

100 thousand quads being rendered.

What matters is that we implemented the compute shader and instancing using the GPU, so now we can start adding features onto this base.

Performance test – iteration 1

As I have said before, this system is not even close to optimal. I decided to get the old Unity profiler out and have a look at the performance.

We are only testing the instanced rendering here, since the positions of our particles are not being updated yet.

We will render an increasing amount of particles while monitoring the frames per second.


Test conclusion

  • Instancing is doable up to around 100k particles with around 80 fps.
  • Far from viable for 1M particles with around 15 fps. This is using the standard Unity unlit particle shader.
  • Instancing is CPU bottlenecked; passing transforms across “the great wall of data separation” is costing more time than the instancing itself.

Iteration 2

I’m going to be honest. Iteration 2 is not that interesting.

Yes, I put a lot of thought into it. “How am I going to store the data most efficiently?”, “How can I send the least amount of data and still get a good result?”, “Why am I still doing expensive matrix math on the CPU?” etc.

But there is an old saying about polishing a turd, that applies well here.

The truth is that sending any data from the GPU to the CPU is not only killing performance, but is also, as it turns out, unnecessary.

Which is why I want to save your precious reader attention for iteration 3 and beyond. This is much more visually pleasing and actually SAVES on performance.

Looking into alternatives

In the profiling session from iteration 2 I saw that ALL my performance problems were coming from too much communication between the CPU and GPU.

Currently two types of data need to be passed from the GPU to the CPU:

  • The lifetime of all particles to know which particles are active.
  • The transformation matrices of all active particles, which are used to render using instancing.

Just getting this data from the GPU is taking eight of the total ten milliseconds of that frame.

The reason I haven’t thrown out this data transfer yet is that I use the built-in instancing system from Unity (Graphics.DrawMeshInstanced). [9] This is called from the CPU with all the transforms as a parameter.

I have reached a point where further optimisations of the current setup have plateaued.

Optimisations can obviously be made to decrease CPU frame times, and thus achieve higher FPS. However, these gains are rather small compared to what we can achieve by changing the architecture of the system.

We need a way to be virtually independent of the CPU. A setup where all the data is stored in GPU land and the bulk of calculations take place on the GPU. From updating particles to rendering.

Possible Solutions:

Unity ECS:

The Entity Component System (ECS) [11] is an alternative method of programming that is data oriented, instead of the usual object oriented way.

What this essentially boils down to is that all the data is nicely grouped together, while the behavior of that data is scattered. In contrast to object oriented programming, where data is scattered, while behavior is grouped.

The trick is that calculations with grouped data are a lot faster than calculations with scattered data.

I will not use this method because it is an optimisation on the CPU side. The calculations will still be done by the CPU, which will always be slower than the GPU for this kind of massively parallel workload. I want to find a method that uses the power of the GPU.

Draw Mesh Instanced INDIRECT:

Indirect instancing [12] works similarly to our current setup. The difference is that we can use an existing GPU buffer to render the particles, instead of having to pass this data to the GPU from the CPU every frame.

This ensures that we don’t lose time moving data around.

Looks promising, exactly what I needed.

Iteration 3 – indirect instancing

This iteration was dedicated to moving as much unnecessary load off the CPU as possible.
I discovered during research in iteration 2 that there was a built-in “indirect instancing” method in Unity [12] that magically solves all of my problems by just keeping the buffers on the GPU. You know, like a sane person.

So in iteration 3 I decided to tackle the three core problems as follows:

  1. Storing particle data and transforms in buffers in GPU land.
  2. Updating the particles with a compute shader on the GPU.
  3. Rendering the particles using indirect instancing on the GPU.

When using indirect instancing our system should look like this:

This required me to do the following:

  • Rewriting the C# script to remove batching and the read-back of data from the GPU.
  • Writing a C# Instancer class that calls for indirect instancing of the transforms buffer.
  • Making a custom shader to support indirect instancing.

Now, on the CPU side we no longer:

  • Get any data from the GPU; we only set data when spawning new particles.
  • Keep track of active particles. Particles have a lifetime; when that lifetime reaches zero they are inactive and should be ignored.
  • Batch transforms for rendering, since we can instance very large numbers of particles at once straight from our buffer.

This leaves the CPU only with the tasks of:

  • Initializing buffers at the start (once).
  • Passing system settings for the compute shader to use.
  • Spawning particles (give each particle a random position, reset its rotation, reset its lifetime).

And then we have a working particle system?

Yes, indeed we do. Now take a look at this moving image of the system in action:

The small feature list consists of: a starting velocity, a rotation over time and a starting scale.

But don’t worry, the cool effects I promised come in the next chapters.

Performance test of this iteration

Let’s see how this GPU-based system fares against the standard Unity particle system, which is CPU based.

Due to word count I will only list a summary of results, along with the conclusion of the test.

Setup of the test:

I tried to use as many similar settings for both particle systems as possible. They use similar: textures, particle sizes, emission shapes, and particle velocities. I also disabled as many unused features from the standard system as I could.

We will test the systems based on two parameters: spawnrate per second and particle lifetime.

We will try different values for our parameters while monitoring the framerate.

Unity system (left) and custom system (right)


The Unity system runs at 2 FPS at the highest test settings.

Test conclusion:

The GPU-based system can handle large numbers of particles that can all behave differently. Performance stays good because the GPU updates all of them in parallel.

Anywhere where we have a lot of particles doing different things like:

  • Emergent behavior algorithms.
  • Fluid simulations with particles.

These are the areas where this custom system starts to outperform the standard Unity system.

Vector fields for forces on particles

Now for the fun stuff.

The particles only have a static velocity from the emitter at this moment. This seems boring. I want to use the particles to make complex patterns.

The method I chose to move the large amount of particles in interesting ways was vector fields.

What are vector fields

Vector fields [13] are 2D or 3D grids. Each cell in the grid contains a Vector. The vector represents a direction and magnitude in space. This can be used to record an approximation of the flow of things in that cell.

You can imagine a 2D vector field of a river. Where the river flows you see arrows pointing to where the water flows. The length of the arrow represents the flow speed. On land the arrows have no length, since there is no water flow there. This would look like this:

A 3D vector field is just a 2D field with an extra axis. Since my particle system is 3D this is what I used to add forces to my particles.

How to move particles with a vector field

There are only a few things to calculate if you want to use a vector field as an acceleration field (which is how I am using it):

  1. Calculate the position of the particle in the vector field.
  2. Get the value from that position out of the vector field.
  3. Add the value to the velocity of the particle.
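The three steps above can be sketched in Python (the tiny 1D field and all values here are illustrative, not from the actual system):

```python
# Sketch of the three steps, using a tiny 1D "field" for brevity.
# field[i] holds the acceleration for cell i (here just an x component).
field = [0.0, 1.0, -2.0, 0.5]
bounds = 8.0                      # the field spans x in [0, 8)

def apply_field(pos_x, vel_x, dt=1.0):
    # step 1: the position of the particle in the vector field
    cell = int(pos_x / bounds * len(field))
    # step 2: get the value at that position out of the field
    accel = field[cell]
    # step 3: add the value to the velocity of the particle
    return vel_x + accel * dt

print(apply_field(pos_x=3.0, vel_x=0.25))  # cell 1, accel 1.0 -> 1.25
print(apply_field(pos_x=5.0, vel_x=0.0))   # cell 2, accel -2.0 -> -2.0
```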

At step 1 we have the possibility of altering the calculations a little to create different effects.

different ways of getting the position

The vector field I am using is a 3D array of values.

So, we need to know which position in the array to use for a specific particle.

We can do this by:

  • Setting up system bounds, within which our field exists.
  • Dividing the position of the particle by the size of the bounds. This gives us a Vector3 of values from zero to one; these values are the relative positions in our vector field.
  • We multiply this vector by the dimensions of the vector field to get the absolute position in the vector field.
  • We can then use this position to get the value from the vector field at the position of our particle.
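As a sketch of these bullet points, here is the position-to-index mapping in Python (the bounds and field dimensions are example values):

```python
def field_index(position, bounds, dims):
    """Map a world-space position inside `bounds` to integer cell
    coordinates in a vector field with `dims` cells per axis."""
    index = []
    for p, b, d in zip(position, bounds, dims):
        relative = p / b          # 0..1 relative position inside the bounds
        absolute = relative * d   # scaled up to the field dimensions
        index.append(int(absolute))
    return tuple(index)

# A 16x16x16 field stretched over 10x10x10 world units:
print(field_index(position=(5.0, 2.5, 9.9),
                  bounds=(10.0, 10.0, 10.0),
                  dims=(16, 16, 16)))  # (8, 4, 15)
```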

In 2D:

The variation we can add is how we restrain the position when it goes outside of the system bounds.

When our system bounds are, for example, ten units on the X-axis but our particle is at position x = 11, we can’t just take the eleventh value of our vector field, because it only has ten.

We can restrain the position in a couple of ways:

  • Keep the particle inside the system bounds. This will avoid the issue, but will also restrict the system to just inside the bounds.
  • Clamp the position when doing the vector field calculations, this way the value on the edge of your vector field will go on forever.
  • Take the modulo of the position, this will cause the vector field to repeat.
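The clamp and modulo options differ only in how the cell index is computed; here is a minimal Python sketch, using a 1D axis with ten cells to match the example above:

```python
def clamp_index(i, size):
    """Clamp: positions past the edge keep reading the edge cell,
    so the value on the edge of the field 'goes on forever'."""
    return max(0, min(i, size - 1))

def wrap_index(i, size):
    """Modulo: positions past the edge wrap around,
    so the vector field repeats (tiles) through space."""
    return i % size

size = 10
print(clamp_index(11, size))  # 9 -> stuck on the edge cell
print(wrap_index(11, size))   # 1 -> field repeats
print(wrap_index(-1, size))   # 9 (Python's % wraps negatives too)
```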

The last two options have different effects, which can be the answer to more complex patterns that repeat, like this one:

Multiple spawners

Let’s close this project off with two last features that will really give us the ability to create some good effects.

The first relatively straightforward feature I really wanted was multiple emitters in the same system.

This went mostly without trouble because of the way my data was set up.

The emitter writes new particles in one big array of particles, so support for multiple emitters just meant that the second emitter writes the particles after the first emitter.

Since the update compute shader just updates all the active particles in the big array, it did not matter that they were spawned by different emitters.

The only problem I ran into was that there was a setting in the emitter settings that was being used in the compute shader: the rotation over time. I want to have separate rotations per emitter. On top of that, I wanted to add coloring later too, so I already knew there was data from each separate emitter that I needed to know on the GPU side.

I solved this problem by giving each particle an emitterID, this id then corresponds to the settings of that emitter. This way I can have as many emitters as I want with only 1 extra int in the particle data.
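A minimal Python sketch of the emitterID lookup (the settings fields and values here are examples, not the system's actual data):

```python
# One settings entry per emitter, living in its own small buffer.
# The fields are examples; the real system stores e.g. rotation over time.
emitter_settings = [
    {"rotation_per_second": 90.0},   # emitter 0
    {"rotation_per_second": -45.0},  # emitter 1
]

# Each particle carries only one extra int: the id of its emitter.
particles = [
    {"emitter_id": 0},
    {"emitter_id": 1},
    {"emitter_id": 1},
]

# In the update shader, every particle looks up its own emitter's settings:
rotations = [emitter_settings[p["emitter_id"]]["rotation_per_second"]
             for p in particles]
print(rotations)  # [90.0, -45.0, -45.0]
```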

And here we see the results:

Adding color modes

Finally we can add some color to these boring white particles.

And while we can just give the shader an adjustable tint and call it a day, I thought it would be more interesting if every particle could have its own color that changes based on some values we can pick.

And so with my shader experience I dug into my brain to find fun ways of easily creating some nice color patterns.

I came up with six color modes that I wanted to build into the particle system:

  1. Emit a solid color.
  2. Emit a gradient of colors over time.
  3. Color particles based on how fast they are going.
  4. Color particles based on what direction they are going.
  5. Color particles based on how fast they are accelerating.
  6. Color particles based on the direction of acceleration.

The implementation of these can be categorized into: coloring by emission, and coloring based on particle data.

With modes one and two we can just set the color when we spawn the particle and not worry about it after that.

That pseudo-code would look like this:

SetEmissionColor(color, emitter)
     // color with gradient
     if (emitter's colorMode == emit a gradient over time)
         point in gradient = (current time / ( 1 / gradient frequency)) % 1;

         for(every colorkey in the gradient)
             time1 = current colorkey's point in gradient;
             time2 = next colorkey's point in gradient;

             if (point in gradient > time1 AND point in gradient < time2)
                 color = blend of colors( current colorkey, next colorkey );
     // single color
     else if (emitter's colorMode == emit one color)
        color = gradient's first color;

Because the “point in gradient” value changes over time, every frame will have a different shade, thus creating the gradient in the particles.
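A runnable Python version of the pseudo-code above, assuming a simple two-key gradient with linear blending:

```python
def point_in_gradient(current_time, frequency):
    """Where we are in the repeating gradient, as a value in [0, 1)."""
    return (current_time / (1.0 / frequency)) % 1.0

def sample_gradient(keys, t):
    """keys: list of (point, color) pairs sorted by point.
    Linearly blend between the two keys surrounding t."""
    for (t1, c1), (t2, c2) in zip(keys, keys[1:]):
        if t1 <= t <= t2:
            blend = (t - t1) / (t2 - t1)
            return tuple(a + (b - a) * blend for a, b in zip(c1, c2))
    return keys[-1][1]

keys = [(0.0, (1.0, 0.0, 0.0)),   # red at the start of the gradient
        (1.0, (0.0, 0.0, 1.0))]   # blue at the end
t = point_in_gradient(current_time=2.5, frequency=1.0)  # -> 0.5
print(sample_gradient(keys, t))  # (0.5, 0.0, 0.5)
```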

Now we can look at the results:

We can see the code for the other four coloring options below; this is written in HLSL:

float4 GetColorWithMagnitude(float3 v)
{
    float maxMagnitude = 8.0;

    // length(v) computes sqrt(v.x^2 + v.y^2 + v.z^2)
    float magnitude = clamp(length(v), 0, maxMagnitude);
    float t = magnitude / maxMagnitude;

    return lerp(color1, color2, t);
}

float4 GetColorAxisBased(float3 v)
{
    float maxValue = 8.0;

    float tX = clamp(abs(v.x), 0, maxValue) / maxValue;
    float tY = clamp(abs(v.y), 0, maxValue) / maxValue;
    float tZ = clamp(abs(v.z), 0, maxValue) / maxValue;

    return float4(tX, tY, tZ, 1.0);
}

The magnitude-based and axis-based options use the same calculation whether the input is velocity or acceleration.

The axis-based mode links each axis to a color channel: red, green, and blue. A higher speed on that axis means a higher value in that channel.

The axis based approach gives us this nice result:

In conclusion

We now have a robust particle system with enough features to make very good looking particle effects.

A developer/designer can use the following settings when making an effect:

  • One or more emitters to spawn particles.
  • Six coloring modes for each emitter.
  • A spawnrate per emitter.
  • Start velocity of particles.
  • Rotation over time for particles.
  • Scale of particles.
  • Using a vector field to add forces.
  • Clamping or repeating the vector field.
  • Choice between opaque and transparent particles.
  • Using custom meshes for particles.
  • Using lighting on particles (no shadows).

I use a GPU-based approach: particles are updated with a compute shader, then rendered with instancing. This approach allows even lower-end PCs to render a million particles in real time.


[1] U. Technologies, “Unity Visual Effect Graph,” Unity.

[2] U. Technologies, “Introduction to the VFX Graph in Unity,” Unity.

[3] U. Technologies, “How to use VFX Graph Instancing,” (accessed Jul. 04, 2022).

[4] U. Technologies, “Unity – Manual:  Compute shaders.”

[5] Stevewhims, “Reference for HLSL – Win32 apps,” Microsoft Learn, Aug. 23, 2019.

[6] U. Technologies, “Unity – Scripting API: ComputeShader.SetConstantBuffer.”

[7] X. Gong, X. Gong, L. Yu, and D. Kaeli, “HAWS,” ACM Transactions on Architecture and Code Optimization, Apr. 18, 2019. (accessed Apr. 06, 2023).

[8] U. Technologies, “Unity – Manual:  GPU instancing.”

[9] U. Technologies, “Unity – Scripting API: Graphics.DrawMeshInstanced.”

[10] U. Technologies, “Unity – Scripting API: Matrix4x4.TRS.”

[11] U. Technologies, “ECS for Unity,” Unity.

[12] U. Technologies, “Unity – Scripting API: Graphics.DrawMeshInstancedIndirect.”

[13] Tu, “Eos, Transactions, American Geophysical Union Volume 92, Number 32, 9 August 2011,” Eos, Transactions American Geophysical Union, vol. 92, no. 32, p. 149, Aug. 2011, doi: 10.1029/eost2011eo32.