
An educational example for learning the Navier-Stokes equations. Also included is a C++ and CUDA shared object library, buildable with CMake, for use in your personal projects.


danieljvickers/fluid_simulation


Fluid Simulation

Author: Daniel J. Vickers

About

This repository has been created for me to explore fluid simulation, as a step towards magnetohydrodynamics simulations. I have opted to start by solving the incompressible Navier-Stokes equations for fluid flow before adding the complexity of electromagnetism. This is also an opportunity for me to demonstrate an ability to write software in Python, C++, and CUDA.

The overall goal of this project is easy-to-read Python code that will allow someone to properly learn Navier-Stokes, a C++/CUDA shared object library with Python bindings for use in personal projects, and an exploration of performance benchmarks for the various implementations of the library.

All of the Python code is written to leverage the performance of NumPy. It is intended to serve as the unit-test reference for the higher-performance implementations of the simulation, and as a performance baseline.

The C++ code is meant to be a serial implementation of Navier-Stokes.

The CUDA code is a large-scale parallel implementation of Navier-Stokes. I am including several versions of the simulation. The data presented here was generated using an NVIDIA RTX 2080 Super and will likely not match the performance on your machine.

Citations

The Python implementation is based upon a solver that I found here: https://github.com/Ceyron/machine-learning-and-simulation/blob/main/english/simulation_scripts/lid_driven_cavity_python_simple.py The results of that solver are captured in a YouTube video here: https://www.youtube.com/watch?v=BQLvNLgMTQE I want to be clear that the Python code is largely influenced by the contents of that repository; the C++ and CUDA implementations are my own.

Plentiful descriptions of Navier-Stokes are easy to find online. Here is the Wikipedia page for the equations: https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations

"GPU Gems", published by NVIDIA, is one of the best resources available for physics GPU compute optimization. I found (only after making an initial implementation of Navier-Stokes) that GPU Gems volume 1 already published an article about a Navier-Stokes solver: https://developer.nvidia.com/gpugems/gpugems/part-vi-beyond-triangles/chapter-38-fast-fluid-dynamics-simulation-gpu

Python

The most basic implementation is a Python version that leverages linear algebra in NumPy. Because the C++ and CUDA versions are implemented raw (without external libraries), the Python implementation is quite well optimized by comparison.

Code Explained

We will be tracking the pressure and velocity (stored as two matrices, u and v) of the fluid on the grid. To start, we instantiate our mesh grid:

# define the x and y values
element_length = domain_size / (num_elements - 1)
x = np.linspace(0., domain_size, num_elements)
y = np.linspace(0., domain_size, num_elements)

# create the vector fields
X, Y = np.meshgrid(x, y)
u_previous = np.zeros_like(X)
v_previous = np.zeros_like(X)
p_previous = np.zeros_like(X)

We need the ability to compute central differences and the Laplacian on this grid. For that, we define them as functions:

def central_difference_x(field, element_length):
    diff = np.zeros_like(field)
    diff[1:-1, 1:-1] = (field[1:-1, 2:] - field[1:-1, :-2]) / (2. * element_length)
    return diff


def central_difference_y(field, element_length):
    diff = np.zeros_like(field)
    diff[1:-1, 1:-1] = (field[2:, 1:-1] - field[:-2, 1:-1]) / (2. * element_length)
    return diff


def laplace(field, element_length):
    diff = np.zeros_like(field)
    diff[1:-1, 1:-1] = field[1:-1, :-2] + field[:-2, 1:-1] - 4 * field[1:-1, 1:-1] + field[1:-1, 2:] + field[2:, 1:-1]
    diff = diff / (element_length ** 2)
    return diff

We then compute these derivatives for our velocities, which appear in the Cauchy momentum equation:

# get tentative velocity
d_u_previous_d_x = central_difference_x(u_previous, element_length)
d_u_previous_d_y = central_difference_y(u_previous, element_length)
d_v_previous_d_x = central_difference_x(v_previous, element_length)
d_v_previous_d_y = central_difference_y(v_previous, element_length)
u_previous_laplacian = laplace(u_previous, element_length)
v_previous_laplacian = laplace(v_previous, element_length)

This gives us the differential information to plug into the Navier-Stokes momentum equation, which produces a tentative estimate of the new state that we will use in the Poisson steps:

du_dt = kinematic_viscosity * u_previous_laplacian - u_previous * d_u_previous_d_x - v_previous * d_u_previous_d_y
dv_dt = kinematic_viscosity * v_previous_laplacian - u_previous * d_v_previous_d_x - v_previous * d_v_previous_d_y
u_tentative = u_previous + time_step * du_dt
v_tentative = v_previous + time_step * dv_dt

du_tentative_dx = central_difference_x(u_tentative, element_length)
dv_tentative_dy = central_difference_y(v_tentative, element_length)
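For reference, the update above is an explicit Euler step of the incompressible momentum equation, with the pressure gradient deferred to the correction step later on:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}, \qquad \nabla \cdot \mathbf{u} = 0$$

Dropping the pressure term yields the tentative velocity computed above. Its divergence then feeds the pressure Poisson equation solved in the next step,

$$\nabla^2 p = \frac{\rho}{\Delta t}\, \nabla \cdot \mathbf{u}^{*}$$

where u* is the tentative velocity, so that the corrected velocity ends up divergence-free.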

We then use these velocities to solve the pressure Poisson equation. We perform a number of Poisson iterations to allow the pressure field to settle over multiple steps.

# solve pressure poisson equation
right_hand_side = (density / time_step) * (du_tentative_dx + dv_tentative_dy)
p_next = np.zeros_like(p_previous)
for j in range(num_poisson_iterations):
    p_next[1:-1, 1:-1] = ((right_hand_side[1:-1, 1:-1] * element_length**2) - (
        p_previous[1:-1, :-2] + p_previous[1:-1, 2:] + p_previous[:-2, 1:-1] + p_previous[2:, 1:-1]
    )) * -0.25

    p_next[:, -1] = p_next[:, -2]
    p_next[:, 0] = p_next[:, 1]
    p_next[0, :] = p_next[1, :]
    p_next[-1:, :] = 0.
    p_previous = p_next

We compute the pressure so that we can use it in the next equation to correct the velocities:

dp_next_dx = central_difference_x(p_previous, element_length)
dp_next_dy = central_difference_y(p_previous, element_length)

# correct velocities
u_next = u_tentative - (time_step / density) * dp_next_dx
v_next = v_tentative - (time_step / density) * dp_next_dy

Finally, we apply our boundary conditions. Since the Python version only models an open box, we set the left, right, and bottom velocities to zero. We also set the top velocity to a constant.

Once we have our final velocities, we step forward in time.

# apply the boundary conditions
u_next[0, :] = 0.  # bottom
u_next[-1, :] = top_velocity  # top
u_next[:, 0] = 0.  # left
u_next[:, -1] = 0.  # right
v_next[0, :] = 0.
v_next[-1, :] = 0.
v_next[:, 0] = 0.
v_next[:, -1] = 0.

# step forward in time
u_previous = u_next
v_previous = v_next
p_previous = p_next

Results

The result of the simulation is the quiver plot below, which captures the fluid velocity and pressure.

screenshot

I then tested the performance of our Navier-Stokes solver. The parameters below are used for all of the implementations in this repo:

  • 41x41 bins in our grid
  • 1000 time steps forward
  • 50 Poisson steps to let the fluid settle
  • 5000 time trials

The results of 5000 time trials for the Python version are shown below.

screenshot

...But can it go faster?

C++

Serial Implementation

Code Explained

For the C++ implementation, we first create a templated class that stores the useful data of the simulation and has useful getters and setters. All of the per-cell variables used to make the simulation fast are given in fluid_cpp/cpp/NavierStokesCell.h as:

template <typename T>
struct NavierStokesCell {
    // the x velocity (u), y velocity (v), and pressure (p)
    T u = 0.;
    T v = 0.;
    T p = 0.;

    // some values saved in the cell during computation
    T du_dx = 0.;
    T du_dy = 0.;
    T dv_dx = 0.;
    T dv_dy = 0.;
    T u_laplacian = 0.;
    T v_laplacian = 0.;
    T du_dt = 0.;
    T dv_dt = 0.;
    T right_hand_size = 0.;

    // placeholder for the updated values of the sim
    T u_next = 0.;
    T v_next = 0.;
    T p_next = 0.;
    T du_next_dx = 0.;
    T dv_next_dy = 0.;
    T dp_dx = 0.;
    T dp_dy = 0.;

    // boundary conditions (BCs). The set bool tells us if it has been set
    T u_boundary = 0.;
    T v_boundary = 0.;
    T p_boundary = 0.;
    bool u_boundary_set = false;
    bool v_boundary_set = false;
    bool p_boundary_set = false;
};

Notice that I keep track of some da_next_db values because that allows us to perform some of the computation without iterating through an entire loop again. I then instantiate our dimension space in the NavierStokesSolver() constructor, which has a private cells variable to hold the parameters. The instantiation code can be found in the fluid_cpp/cpp/NavierStokesSolver.cpp file as:

this->cells = new NavierStokesCell<T>*[box_dim_x];
for (int x = 0; x < box_dim_x; x++) {
    this->cells[x] = new NavierStokesCell<T>[box_dim_y];
}

There is a good bit of code here, but the easiest way to show the steps of the computation is the safeSolve() method, which calls each calculation in series. Inside each of those methods is a pair of nested for loops that iterate over every cell in the space. It turns out we can reduce this to only 4 pairs of loops, which is done in the solve() method (not shown here). Below is the code for safeSolve():

template <class T>
void SerialNavierStokes<T>::safeSolve() {
    // loop over each time step
    for (int i = 0; i < this->num_iterations; i++) {
        // compute useful derivatives
        this->computeCentralDifference();
        this->computeLaplacian();
        this->computeTimeDerivitive();

        // take a tentative step forward in time
        this->takeTimeStep();
        this->computeNextCentralDifference(); // recompute the central difference
        this->computeRightHandSide();

        // take a series of poisson steps to approximate the pressure in each cell
        for (int j = 0; j < this->num_poisson_iterations; j++) {
            // compute the Poisson step, enforce BCs, and enforce the pressure
            this->computePoissonStepApproximation();
            this->enforcePressureBoundaryConditions();
            this->updatePressure();
        }

        // get the pressure central difference, correct the u and v values, and enforce BCs
        this->computePressureCentralDifference();
        this->correctVelocityEstimates();
        this->enforceVelocityBoundaryConditions();
    }
}

When we are ready, we call the code by creating a SerialNavierStokes object, setting all of the parameters, establishing boundary conditions, and running the solve() method:

// set up the solver
int num_x_bins = 41;
int num_y_bins = 41;
float width = 1.0;
float height = 1.0;
SerialNavierStokes<float> solver(num_x_bins, num_y_bins, width, height);

// enter the specific constants of the simulation
solver.density = 1.0;
solver.kinematic_viscosity = 0.1;
solver.num_iterations = 1000;
solver.num_poisson_iterations = 50;
solver.time_step = 0.001;
solver.stability_safety_factor = 0.5;

// establish boundary conditions
for (int x = 0; x < num_x_bins; x++) {
    solver.setUBoundaryCondition(x, 0, 0.);  // no flow along the floor
    solver.setVBoundaryCondition(x, 0, 0.);  // no flow into the floor
    solver.setUBoundaryCondition(x, num_y_bins - 1, 1.);  // water flowing to the right on top
    solver.setVBoundaryCondition(x, num_y_bins - 1, 0.);  // no flow into the top
    solver.setPBoundaryCondition(x, num_y_bins - 1, 0.);  // no pressure at the top
}
for (int y = 1; y < num_y_bins - 1; y++) {
    solver.setUBoundaryCondition(0, y, 0.);  // no flow into the left wall
    solver.setVBoundaryCondition(0, y, 0.);  // no flow along the left wall
    solver.setUBoundaryCondition(num_x_bins - 1, y, 0.);  // no flow into the right wall
    solver.setVBoundaryCondition(num_x_bins - 1, y, 0.);  // no flow along the right wall
}

// Solve and retrieve the solution
solver.solve();

Results

For performance benchmarks, we can view the single-precision performance on my machine below:

screenshot

We see that the C++ version improved performance by about a factor of 2.5x. That is a nice performance gain in our small case. Since we are mostly going to be testing large-scale parallelization, I also opted to run all tests from here on out on a 1000x1000 grid. Because this is a hobby for me, and it is running on my poor little home computer that isn't dedicated to compute, I only ran 20 time trials. This serial C++ implementation serves as the benchmark for our compute moving forward. The results are shown here:

screenshot

...But can it go even faster?

Threaded Implementation

Code Explained

The most obvious next step is to multi-thread the code. There are a few problems we will need to tackle. The first is that we want to keep the number of thread creations and destructions to a minimum, as that overhead can hurt performance. For reference, I did try launching a new set of threads for each function call, but the result was that our execution time shot up to over 10 seconds for this simple test case. This leads to the first design decision: we only thread the solve function. This was done by converting the solve function into something that can be threaded, and then writing a thin wrapper around it that launches and joins our threads:

template <class T>
void ThreadedNavierStokes<T>::solve() {
    for (int t = 0; t < THREADED_GRID_SIZE; t++) {
        // begin a thread to work on the unified time step approximation functions
        this->worker_threads[t] = std::thread(&ThreadedNavierStokes::solveThread, this, t);
    }
    for (int t = 0; t < THREADED_GRID_SIZE; t++) {
        // join all of the worker threads when they are complete
        this->worker_threads[t].join();
    }
}

This launches a number of threads equal to THREADED_GRID_SIZE, which runs well on my machine when set to 8. Each thread calls a threaded version of the original solve function with an additional integer argument: the thread index. Each thread can then be restricted to a select region of the array using that index. For example, let us consider a step in our Poisson approximation using the index:

template <class T>
void ThreadedNavierStokes<T>::computePoissonStepApproximation(int thread_index) {
    int start_x = thread_index + 1;

    for (int x = start_x; x < this->box_dimension_x - 1; x += THREADED_GRID_SIZE) {
        for (int y = 1; y < this->box_dimension_y - 1; y++) {
            // compute the Poisson step
            this->cells[x][y].p_next = this->cells[x][y].right_hand_size * this->element_length_x * this->element_length_y;
            this->cells[x][y].p_next -= this->cells[x-1][y].p + this->cells[x+1][y].p + this->cells[x][y+1].p + this->cells[x][y-1].p;
            this->cells[x][y].p_next *= -0.25;
        }
    }
}

Note that here we generate a starting index that depends on the index number assigned to the executing thread. Since this number is unique, and we only jump up by a stride of THREADED_GRID_SIZE, we guarantee that each thread iterates over a unique set of x values. After that, the rest of the compute is identical. I feel the need to emphasize that one must be careful when threading like this to avoid false sharing (https://en.wikipedia.org/wiki/False_sharing) when reading and writing closely-situated regions of memory. One way to guarantee safety here would be to remove our concept of a NavierStokesCell data structure in favor of a series of arrays, as sketched below. I will perhaps return to this code at some point to test whether that improves our performance.
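For reference, a struct-of-arrays layout along those lines could look like the sketch below. This is not code from this repository; the names and layout are illustrative only.

// Hypothetical struct-of-arrays layout (not part of this repository).
// Each field lives in its own contiguous array, so a thread writing p_next
// does not share cache lines with the u, v, and p values other threads read.
#include <vector>

template <typename T>
struct NavierStokesGrid {
    int width;
    int height;
    std::vector<T> u, v, p;                  // current state
    std::vector<T> u_next, v_next, p_next;   // write targets for the next step

    NavierStokesGrid(int w, int h)
        : width(w), height(h),
          u(w * h), v(w * h), p(w * h),
          u_next(w * h), v_next(w * h), p_next(w * h) {}

    // flatten a 2D coordinate, matching the x-major ordering used elsewhere
    int index(int x, int y) const { return x * height + y; }
};

Padding or chunking each thread's write range to cache-line boundaries would reduce false sharing further, at the cost of a little extra bookkeeping.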

Note that if we simply run the original solve function across threads, we will certainly run into race conditions, with reads and writes misordered whenever we need to compute values from neighboring cells. The way to avoid this is to synchronize the threads every so often. In languages like CUDA, there are built-in functions that cause all threads to synchronize at once. Here, however, we have to write our own synchronization. After much debate, I decided to solve this problem with a mutex and a counter that wait for all of our threads to check in. I capture this in a syncThreads function:

template <class T>
void ThreadedNavierStokes<T>::syncThreads() {
    this->sync_mutex.lock(); // acquire the lock
    if (this->lock_count >= THREADED_GRID_SIZE) this->lock_count = 0;  // reset the counter if this thread is the first to arrive
    this->lock_count += 1; // increment the counter
    this->sync_mutex.unlock();  // release the mutex to other threads

    while(this->lock_count < THREADED_GRID_SIZE) {}  // block the thread until all threads have checked in
}

This synchronization is quite simple. First, the thread grabs the mutex. A thread knows it is the first to reach the synchronization point because lock_count will still equal THREADED_GRID_SIZE from the last synchronization; if that is the case, it resets lock_count to zero. It then increments the counter before unlocking the mutex. From there, the thread blocks by spinning in a while loop that repeatedly checks whether lock_count is high enough. Because there are THREADED_GRID_SIZE threads, the condition only evaluates to false once all threads have checked in. Once this happens, all of the threads are free to pass the synchronization point.

We can then rewrite the solve function to add synchronization after each function call and to pass around the thread index of the executing thread:

template <class T>
void ThreadedNavierStokes<T>::solveThread(int index) {
    // loop over each time step
    for (int i = 0; i < this->num_iterations; i++) {
        this->unifiedApproximateTimeStep(index);
        this->syncThreads();

        this->unifiedComputeRightHand(index);
        this->syncThreads();

        // take a series of poisson steps to approximate the pressure in each cell
        for (int j = 0; j < this->num_poisson_iterations; j++) {
            this->computePoissonStepApproximation(index);
            this->syncThreads();

            this->enforcePressureBoundaryConditions(index);
            this->updatePressure(index);
            this->syncThreads();
        }

        this->unifiedVelocityCorrection(index);
        this->syncThreads();

        this->enforceVelocityBoundaryConditions(index);
        this->syncThreads();
    }
}

This works quite well and passes all of my unit tests, but it leaves a very small chance of a race condition if one of the threads has not cleared the while loop before the first thread arrives at the next synchronization and resets the counter. It turns out there is a C++20 primitive, which I was unaware of, that does exactly what syncThreads does without busy waiting and without the potential race condition: std::barrier. I decided to upgrade from C++17 to C++20 and added a barrier, replacing all of the syncThreads calls with sync_barrier.arrive_and_wait();.
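As a minimal sketch of that change (paraphrased rather than copied from the repository), the barrier is constructed with the expected thread count and each synchronization point becomes a call to arrive_and_wait():

// Minimal C++20 std::barrier sketch; names mirror the text above, but this is
// not the repository's exact code.
#include <barrier>
#include <thread>
#include <vector>

constexpr int THREADED_GRID_SIZE = 8;

// one barrier shared by all workers; arrive_and_wait() blocks until all
// THREADED_GRID_SIZE threads have arrived, then releases them together
std::barrier<> sync_barrier(THREADED_GRID_SIZE);

void solveThread(int index) {
    for (int i = 0; i < 3; i++) {
        // ... this thread's slice of one solver phase, selected by index ...
        sync_barrier.arrive_and_wait();  // replaces syncThreads()
        // ... the next phase can now safely read values written above ...
        sync_barrier.arrive_and_wait();
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < THREADED_GRID_SIZE; t++) workers.emplace_back(solveThread, t);
    for (auto& worker : workers) worker.join();
    return 0;
}

Unlike the mutex-and-counter version, the barrier automatically resets itself for the next phase, which removes the window for the reset race described above.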

Let us see how the performance improved!

Results

Let us take a look at our 41x41 test case first:

screenshot

You will immediately notice that the threaded implementation performs quite poorly at this small scale compared to the serial implementation. This result seems logical, since there is a large amount of overhead associated with thread synchronization, creation, and joining. Add in the fact that the original code was already so fast, and that the compiler may have a harder time optimizing certain regions of a parallel algorithm, and the result seems reasonable. But do we see improvement in our larger 1000x1000 case?

screenshot

The answer is a lukewarm yes. I believe we are hitting an issue with false sharing. Because the memory each thread writes to sits in the same struct, close to the memory it is reading from, parallel threads end up forcing redundant cache-line refreshes, which greatly slows down performance. This effect should be particularly strong wherever cells must check the values of their neighbors.

The net result is still about a 9% performance improvement, but that feels underwhelming when you consider the fact that we are using 8x more compute resources. I may come back and optimize this by separating the memory read and write locations, but for now I would like to continue to the CUDA version, as I anticipate that being significantly faster than my most-optimized threaded implementation.

But can it go EVEN faster?

CUDA

Initial Implementation

When we launch a kernel in CUDA, we need to select the grid size we are going to use. The optimal grid size will depend upon your hardware. In the end, I found that my optimal choice of grid dimensions for the RTX 2080 Super was:

#define KERNEL_2D_WIDTH 16
#define KERNEL_2D_HEIGHT 16
#define GRID_2D_WIDTH 4
#define GRID_2D_HEIGHT 4

We can convert this to CUDA grid and block dim3 objects as

dim3 block_size(KERNEL_2D_WIDTH, KERNEL_2D_HEIGHT);
dim3 grid_size(GRID_2D_WIDTH, GRID_2D_HEIGHT);

All of our kernels are scheduled on the CUDA stream via a kernel call, where we pass in our data along with our grid dimension objects. For example, here is the launch of the kernel that updates the pressure. We pass in our array of cells, d_cells, as well as the width and height of our simulation, box_dimension_x and box_dimension_y:

update_pressure_kernel<<<grid_size, block_size>>>(d_cells, box_dimension_x, box_dimension_y);

Note that because our cells are arranged in 2D space, I am launching the kernels as a 2D grid. This is optional and does not come with any performance improvement; it only makes the code easier to work with and read. We can get a thread's location in the grid via blockIdx.x * blockDim.x + threadIdx.x. Note that this is the location along the x axis, since we are using the .x value. The same can be done in the y dimension with .y.

Now we have our threads loop over the array. Since we are launching threads in a grid, each thread only needs to iterate in step sizes of the total grid width, which is found as gridDim.x * blockDim.x. Finally, inside this loop, we get the index into the 1D array via int index = c * height + r;. I demonstrate all of this in a kernel meant to update the pressure from its temporary placeholder value, here:

template <typename T>
__global__ void update_pressure_kernel(NavierStokesCell<T>* cells, int width, int height) {
    // get our location in the grid
    const int column = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;

    for (int c = column; c < width; c += gridDim.x * blockDim.x) {
        for (int r = row; r < height; r += gridDim.y * blockDim.y) {
            int index = c * height + r;
            cells[index].p = cells[index].p_next;
        }
    }
}

Notice the __global__ keyword, denoting it as a CUDA kernel.

I converted every function from the serial implementation into a CUDA kernel that launches on the GPU in this manner. For a detailed look at all of these changes, see fluid_cpp/cuda/ParallelNavierStokes.cu.
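For context, the host-side pattern looks roughly like the sketch below: allocate device memory once, copy the cells over, launch the kernels each iteration, and copy the results back at the end. This is a simplified illustration rather than the exact code in ParallelNavierStokes.cu (which manages d_cells as a class member), and it assumes the NavierStokesCell struct, the grid #defines, and update_pressure_kernel shown above are in scope.

#include <cuda_runtime.h>
#include <vector>

// Simplified host-side flow; run_on_gpu and h_cells are illustrative names.
void run_on_gpu(std::vector<NavierStokesCell<float>>& h_cells,
                int box_dimension_x, int box_dimension_y, int num_iterations) {
    const size_t bytes = h_cells.size() * sizeof(NavierStokesCell<float>);

    // allocate device memory and copy the initial state over once
    NavierStokesCell<float>* d_cells = nullptr;
    cudaMalloc((void**) &d_cells, bytes);
    cudaMemcpy(d_cells, h_cells.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block_size(KERNEL_2D_WIDTH, KERNEL_2D_HEIGHT);
    dim3 grid_size(GRID_2D_WIDTH, GRID_2D_HEIGHT);

    for (int i = 0; i < num_iterations; i++) {
        // ... derivative, time step, Poisson, and correction kernels go here ...
        update_pressure_kernel<<<grid_size, block_size>>>(d_cells, box_dimension_x, box_dimension_y);
    }

    // copy the final state back to the host and release the device memory
    cudaMemcpy(h_cells.data(), d_cells, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_cells);
}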

Results

So, how did all of that work pay off? First up are my recorded benchmarks in the 41x41 case:

screenshot

You will notice that this version has a performance loss relative to the serial C++ version. At this grid size, the framework used to allow parallelization simply has too much overhead associated with it to yield a performance improvement. I often tell people that using a GPU to speed up your code is similar to using a shotgun to improve your accuracy: your approach is simply to throw more stuff at the problem and hope that it makes up for you being bad at shooting. However, if you are shooting at such a close and easy target that you don't need the shotgun, then it is simply a lot of extra bulk carried around for no reason. The same is true for our GPU. GPUs are happiest when they have a lot of compute to keep them occupied so that they can run at full power. There is simply not enough compute in the 41x41 grid to keep my GPU happy and running. The result is that we carry a lot of bulk just to move memory around, set up kernels, and synchronize our threads.

However, for the right target, the shotgun is not only a reasonable choice, but can be the only correct choice. This becomes true in our larger grid of 1000x1000 cells. In this case, we have enough compute to keep the GPU working at a higher occupancy, where we can get more work out of our threads. The results are shown below:

screenshot

Now THAT is pod racing!

This is a 10x speedup in the 1000x1000 grid case versus our serial C++ implementation. The performance here is HIGHLY hardware dependent. I was allowed to run some trial benchmarks of this code on a proper server with higher-performance GPUs, which resulted in run times closer to 16 seconds total for this test case. The results are clear: once you pass a certain compute threshold, it simply makes sense to swap to our shotgun (CUDA) approach. The good news is that it only slows us down by about 20%-30% relative to the efficient C++ version in the 41x41 case. This means I will almost always want to pick the CUDA version for all but the smallest of grids in a head-to-head comparison.

BUT CAN IT GO EVEN FASTER???

Tiled Implementation

What is tiling

There are a few optimizations we can make from here, and they get progressively more GPU dependent in their effectiveness. The main remaining issue is that we are not utilizing cache memory well, and are instead making many reads and writes to global device memory. In particular, in steps like the Poisson step approximation and the computation of the Laplacians, multiple threads access overlapping regions of memory. The result is that we cannot properly utilize either the L1 or the L2 cache, but we would still like to avoid the expensive global device memory accesses. It turns out that CUDA has a level of memory in between that can help us, called "shared memory", which is shared between all threads in a block. If you ensure that all of the threads in a block operate on closely overlapping memory, then you can have all of them perform more-optimal shared memory reads. Unlike the L1 and L2 caches, shared memory must be explicitly allocated and managed through CUDA.

The use of shared memory to let the threads in a block read from an overlapping region of memory is known as "tiling". In this section, I would like to implement one final version of this solver utilizing tiling to speed up some kernels. As you may have been able to tell, up to this point we have been taking the cave-man approach to software optimization: we just keep trying to hit the problem with a bigger stick (or in this case, more threads). This optimization is the first step that requires some consideration to implement well. It is also the case that this type of optimization sometimes reduces performance, which we will have to keep an eye on.

Code Explained

The best kernel to optimize with this technique is our Poisson step, as it is called multiple times every time step of our loop, and it is the only pressure step that involves computations on neighboring cells. So we will explore the implementation with it. The optimization is performed in three steps:

  1. Copy global memory information to shared memory
  2. Synchronize the threads so that no thread tries to use a neighbor before all of the copies are done
  3. Perform the compute with the data stored in shared memory, but write the output to global memory.

Step one is easy: we can declare memory with the __shared__ keyword. This can be done either statically or dynamically. Since we want exactly one entry per thread in the block, and thus know the amount of shared memory required beforehand, we allocate it statically. This can be done with:

__shared__ NavierStokesCell<T> s_cells[KERNEL_2D_WIDTH * KERNEL_2D_HEIGHT];
int s_index = threadIdx.x * KERNEL_2D_HEIGHT + threadIdx.y;

Notice the prefix s_, which is a common naming convention for shared memory in CUDA. We also compute an index into the shared memory array for each thread to access. We perform step 2 in a single line by calling __syncthreads(), which accomplishes the same thing as the std::barrier we used in the multi-threaded version of this code. Finally, we read the neighbors out of the array:

NavierStokesCell<T> up;
NavierStokesCell<T> down;
NavierStokesCell<T> right;
NavierStokesCell<T> left;
if (threadIdx.y == 0) down = cells[index - 1];
else down = s_cells[s_index - 1];
if (threadIdx.y == KERNEL_2D_HEIGHT - 1) up = cells[index + 1];
else up = s_cells[s_index + 1];
if (threadIdx.x == 0) left = cells[index - height];
else left = s_cells[s_index - KERNEL_2D_HEIGHT];
if (threadIdx.x == KERNEL_2D_WIDTH - 1) right = cells[index + height];
else right = s_cells[s_index + KERNEL_2D_HEIGHT];
__syncthreads();

Notice that we have to catch the edge cases where the indexing would go outside of the tile. In those cases, we only access that section of global memory a single time in the block, so there is no speedup to be gained from first copying it to shared memory. Because of this, we read the edge cases straight from global memory (I did test a version that first copies them to shared memory, and the additional synchronization time made the code MUCH slower). Notice also the one final thread synchronization, as we must finish reading before we allow threads to loop around and overwrite the shared memory.

That is it! From here, we simply perform our compute as normal. You can see a full version of our kernel here:

template <typename T>
__global__ void tiled_poisson_step_kernel(NavierStokesCell<T>* cells, int width, int height, T element_length_x, T element_length_y) {
    // get our location in the grid
    const int column = blockIdx.x * blockDim.x + threadIdx.x + 1;
    const int row = blockIdx.y * blockDim.y + threadIdx.y + 1;

    __shared__ NavierStokesCell<T> s_cells[KERNEL_2D_WIDTH * KERNEL_2D_HEIGHT];
    int s_index = threadIdx.x * KERNEL_2D_HEIGHT + threadIdx.y;

    for (int c = column; c < width - 1; c += gridDim.x * blockDim.x) {
        for (int r = row; r < height - 1; r += gridDim.y * blockDim.y) {
            int index = c * height + r;
            s_cells[s_index] = cells[index];
            __syncthreads();

            NavierStokesCell<T> up;
            NavierStokesCell<T> down;
            NavierStokesCell<T> right;
            NavierStokesCell<T> left;
            if (threadIdx.y == 0) down = cells[index - 1];
            else down = s_cells[s_index - 1];
            if (threadIdx.y == KERNEL_2D_HEIGHT - 1) up = cells[index + 1];
            else up = s_cells[s_index + 1];
            if (threadIdx.x == 0) left = cells[index - height];
            else left = s_cells[s_index - KERNEL_2D_HEIGHT];
            if (threadIdx.x == KERNEL_2D_WIDTH - 1) right = cells[index + height];
            else right = s_cells[s_index + KERNEL_2D_HEIGHT];
            __syncthreads();

            T p_next = s_cells[s_index].right_hand_size * element_length_x * element_length_y;
            p_next -= left.p + right.p + up.p + down.p;
            cells[index].p_next = p_next * -0.25;

        }
    }
}

I decided to implement this on only two kernels: the Poisson step kernel and our combined derivative kernel (since a lot of compute is done with each cell before exiting). If these go well, we can always return and add it to the 3 other candidate kernels.

Results

Without delay, here are the results in our 41x41 case:

screenshot

As you can tell, this is another slowdown relative to our simpler pure CUDA case. This follows the trend we have seen of the small case slowing down as we add optimizations aimed at the larger cases. So how does this perform in the 1000x1000 case?

screenshot

Now that is a disappointing result. As you can see, we received a pretty massive slowdown. I have seen this result before. While shared-memory cache hits are nice, global memory (VRAM) accesses aren't so slow that the use of shared memory is THAT significant. Conversely, the __syncthreads() call is particularly expensive, forcing all of our threads to wait for the slowest thread to finish. What is more, because our tile size is only 16x16, we always perform 64 global memory reads on the edges of each tile. A lower surface-area-to-volume ratio would allow the threads to perform more compute per synchronization and reduce the number of extra global memory reads per cycle. All in all, I believe the issue comes down to the number of threads we can throw at the problem on my GPU, since we are capped at 16x16 by my GPU's shared memory constraints.

Notice again that we are using my bulky NavierStokesCell data structure, which is still causing performance problems due to its size. If I were only copying in two or three arrays of floats, I could perhaps raise the tile size to 32x32 and see a speedup. This result is quite common, though. Speaking from personal experience, I have encountered situations on better GPUs, where I was not limited by the size of shared memory, that still saw slight performance degradation when adding tiling. While my older GPU and the excessive size of the cell data structure don't help, tiling is simply not a win in all cases.

Conclusion

In this project, we explored five separate implementations of a 2D Navier-Stokes fluid solver. Our Python version was very simple, resulting in tens of lines of code with modest performance at low grid sizes. To optimize our code, we implemented a serial C++ version, which saw a factor of 2.5x improvement over the Python implementation. Since we are interested in grid sizes of about 1000x1000, the later versions were optimized for that use case.

The first optimization considered was multi-threading the serial C++ version. That resulted in a very mild speedup, with performance limited by false sharing due to our cell data structure in fluid_cpp/cpp/NavierStokesCell.h. Our second optimization was moving compute to the GPU in CUDA, which resulted in a 10x speedup over the serial C++ implementation. The CUDA version is quite efficient, and could perhaps only be improved by reducing the amount of global memory traffic, replacing the cells data structure with plain float/double arrays. Finally, we attempted an implementation of tiling in CUDA to improve shared memory cache hits. Unfortunately, this use case was not ideal, and it caused a noticeable slowdown. This optimization may not have been a win in general, but it was particularly hurt by limited shared memory and the size of our cell structure.

In summary, we succeeded in implementing a very efficient Navier-Stokes solver by porting the code to the GPU, seeing an order of magnitude improvement in our 2D case. The common problem across all of the optimizations was the poor structure of the NavierStokesCell data type, which held all of the cell information, including derivatives, boundary conditions, etc. While this data structure was convenient, it carried a significant amount of bulk and aligned our memory in a suboptimal way for multi-threaded implementations. It is my belief that a noticeable performance gain could be had in all three of our attempted optimizations by discarding the cell structure in favor of arrays for the data; small preliminary tests have shown me that this is true in the CUDA cases. Because of the amount I learned in this project, and because I would like to move on to 3D MHD simulation, I am not currently going to optimize these memory accesses here. Optimization must stop somewhere in favor of readability and finishing the task, or else we would all be writing in Assembly. I am content that I have identified the optimization and know how to implement it if I ever need these simulations to run faster.

I hope this was an enjoyable read and that it demonstrated my familiarity with mathematical and scientific compute in Python, C++, and CUDA. I also hope that this document demonstrates familiarity with data and metric analysis in a scientific setting. I encourage you to use this library in your own projects if you have need of an optimized fluid simulation. Please reach out to me if you have questions about this code or my results. You can find my contact information on my website: www.danieljvickers.com
