ParAMD is a shared memory parallel implementation of the approximate minimum degree (AMD) algorithm with multiple elimination via distance-2 independent sets.
For more information, please read our preprint *Parallelizing the Approximate Minimum Degree Ordering Algorithm: Strategies and Evaluation* on arXiv.
To build the shared library `libparamd.so` and the binary `paramd`, run the following commands from the root of this repository:
mkdir build && cd build
cmake ..
cmake --build .
This will place the output files in the `build/` directory.
To install the header file `paramd.h`, the shared library `libparamd.so`, and the binary `paramd` to a custom location, run:
cmake --install . --prefix /path/to/install/
Replace `/path/to/install/` with your desired installation directory.
If you have SuiteSparse AMD installed, you can enable support for it by specifying its path during the configuration step. Run the following instead of `cmake ..`:
cmake .. -DSUITESPARSE_ROOT=/path/to/amd/
Make sure the following files exist:

- Header: `/path/to/amd/include/suitesparse/amd.h`
- Library: one of `/path/to/amd/lib64/libamd.so`, `/path/to/amd/lib64/libamd.a`, `/path/to/amd/lib/libamd.so`, or `/path/to/amd/lib/libamd.a`
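A quick way to confirm the expected layout is to list the files before configuring (the `/path/to/amd/` prefix is a placeholder; substitute your actual SuiteSparse location):

```shell
# Placeholder prefix; adjust /path/to/amd/ to your SuiteSparse installation.
ls /path/to/amd/include/suitesparse/amd.h
ls /path/to/amd/lib64/libamd.* /path/to/amd/lib/libamd.* 2>/dev/null
```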
The easiest way to use ParAMD is via the command-line binary `paramd`. To view all available options, run:
./paramd -h # in the build/ directory, or paramd -h if installed
You will see the following usage message:
Usage: paramd [OPTIONS]
Options:
-h, --help Show this help message
--matrix <path> Path to the input matrix in mtx format (required)
--algo <paramd, amd (if available)> Algorithm to benchmark (default: paramd)
--mult <double> Multiplicative relaxation factor (default: 1.1)
--lim <double> Limitation factor (default: 8192), will be divided by #threads
--mem <double> Extra memory factor (default: 1.5)
--seed <64-bit integer> Seed used to randomly permute the matrix (default: 1)
--breakdown Enable time breakdown (default: false)
--stat Enable statistics (default: false)
--sym Specify sym if the matrix is known to be symmetric (default: false)
The argument descriptions are as follows:
- `-h, --help`: Display the help message and exit.
- `--matrix <path>`: Required. The path to the input matrix file in Matrix Market (.mtx) format.
- `--algo <paramd, amd>`: Select the ordering algorithm: `paramd` uses ParAMD; `amd` uses SuiteSparse AMD if available. (Default: `paramd`)
- `--mult <double>`: The multiplicative relaxation factor applied to the approximate minimum degree when selecting pivots. Only applies to ParAMD. (Default: `1.1`)
- `--lim <double>`: The limitation factor used when selecting pivots; it is divided by the number of threads. Only applies to ParAMD. (Default: `8192`)
- `--mem <double>`: The extra memory factor for the graph structure. Only applies to ParAMD. (Default: `1.5`)
- `--seed <64-bit integer>`: The seed used to randomly permute the matrix in advance to decouple tie-breaking issues. Setting this to `0` disables the permutation. (Default: `1`)
- `--breakdown`: Enable time breakdown. Only applies to ParAMD. (Default: `false`)
- `--stat`: Enable statistics on the distribution of the sizes of distance-2 independent sets. Only applies to ParAMD. (Default: `false`)
- `--sym`: Specify this if the matrix is known to be symmetric. Only applies to ParAMD. (Default: `false`)
The fill-reducing ordering will be stored as a permutation at `<path>.perm`, where `<path>` is the path to the input matrix.
./paramd --matrix ../examples/test.mtx --algo paramd \
--mult 1.1 --lim 8192 --mem 1.5 --seed 1 \
--breakdown --stat --sym # running from the build/ directory
We recommend configuring your environment for OpenMP based on your hardware by setting environment variables, including but not limited to `OMP_NUM_THREADS` and `OMP_PLACES`.
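For example, on a 64-core machine one might pin one thread per core as sketched below; the exact values depend on your hardware and are illustrative, not prescriptive:

```shell
# Illustrative OpenMP settings for a 64-core machine; tune for your system.
export OMP_NUM_THREADS=64   # one thread per physical core
export OMP_PLACES=cores     # bind threads to cores
export OMP_PROC_BIND=close  # keep threads near their master thread
```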
ParAMD provides a header file and a shared library to allow easy integration into custom workflows. Once installed and your environment is properly configured, you can include ParAMD in your project with:
#include "paramd/paramd.h"
and link against the shared library `libparamd.so`.
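For example, assuming ParAMD was installed with `--prefix /path/to/install/` (so that headers live under `include/` and the library under `lib/`; the paths here are placeholders), a compile-and-link command might look like the following sketch:

```shell
# Placeholder paths; substitute your actual install prefix.
g++ -O2 -fopenmp main.cpp \
    -I/path/to/install/include \
    -L/path/to/install/lib -lparamd \
    -Wl,-rpath,/path/to/install/lib \
    -o main
```

The `-rpath` flag embeds the library location so the binary finds `libparamd.so` at runtime without modifying `LD_LIBRARY_PATH`.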
- `paramd::config config()`: A configuration structure; its default constructor initializes `config` with default values corresponding to the command-line options. You can modify its fields as needed, and use the member function `print()` to display the current configuration.
- `paramd::matrix A(std::string fp)`: The matrix structure, with a constructor that builds a sparse matrix `A` by reading the Matrix Market (.mtx) file specified by the file path `fp`.
- `uint64_t paramd::paramd(const vtype n, const vtype *rowptr, const etype *colidx, vtype *perm, const config &config)`: Run the ParAMD algorithm with:
  - `n`: the number of rows/columns in the matrix.
  - `rowptr`, `colidx`: the CSR/CSC representation (which will be symmetrized) of matrix `A`; `A.rowptr.data()` and `A.colidx.data()` can be used if the matrix structure is constructed. The length of `rowptr` is `n+1` and the length of `colidx` is `rowptr[n]`.
  - `perm`: an output array of length `n` that will contain the computed fill-reducing permutation.
  - `config`: the configuration options.
The function returns the number of fill-ins in the Cholesky factorization (excluding the diagonal). The datatypes `vtype` and `etype` are both `int32_t`.
For practical usage examples, helper functions for randomly permuting the matrix and reverting the resulting permutation, and the full integration workflow, refer to bin/main.cpp.
A minimal example is shown below:
#include <iostream>
#include "paramd/paramd.h"
#include <vector>
int main(int argc, char* argv[]) {
paramd::config config;
paramd::matrix A("examples/test.mtx");
std::vector<paramd::vtype> perm(A.n);
uint64_t fill = paramd::paramd(A.n, A.rowptr.data(), A.colidx.data(), perm.data(), config);
std::cout << "#Fill = " << fill << std::endl;
paramd::write_perm("examples/test.mtx.perm", perm);
return 0;
}
Our benchmarks require a CPU with at least 64 physical cores, 256 GB of RAM, and 128 GB of available disk space.
- CMake >= v3.22
- Python >= v3.13, with the following packages:
  - Matplotlib >= v3.10.1
  - NumPy >= v2.2.5
  - pandas >= v2.2.3
  - seaborn >= v0.13.2
We recommend setting up the environment using conda with the provided environment.yml configuration file. If conda is not already installed on your system, please refer to its official documentation for installation instructions.
Additionally, ensure that you have GCC installed. It must support `#pragma omp atomic compare`. In our experiments, we used version v13.2.1.
To create and activate the environment, run:
conda env create -f environment.yml
conda activate paramd-benchmarks
We provide a single script, `all.sh`, that runs all benchmark stages, including `download.sh`, `build.sh`, `scaling.sh`, `breakdown.sh`, `distribution.sh`, and `tuning.sh`, reproducing the key tables and figures. The entire workflow is expected to take about 4 hours.
To run all benchmarks, navigate to the root of this repository and execute the following command:
./scripts/all.sh /path/to/a/clean/scratch/dir/
The `/path/to/a/clean/scratch/dir/` directory will be used to store the dataset, benchmark results, and intermediate files.
Once the benchmarks are complete, the following output files will be generated:
- `/path/to/a/clean/scratch/dir/plots/scaling.csv`: An end-to-end comparison between SuiteSparse AMD and our ParAMD, using 1 thread (1t) and 64 threads (64t). Matrices are randomly permuted five times, with all methods evaluated on the same set of permutations to decouple tie-breaking issues. We report the median ordering time and fill-in across the five runs. Speedups and fill-in ratios are calculated based on the medians.
- `/path/to/a/clean/scratch/dir/plots/scaling.html`: The same data as in the CSV file, but formatted as an HTML table for easier viewing.
- `/path/to/a/clean/scratch/dir/plots/breakdown.png`: A runtime breakdown of our ParAMD as the number of threads scales from 1 to 64.
- `/path/to/a/clean/scratch/dir/plots/distribution.png`: Violin plots showing the distribution of the sizes of distance-2 independent sets across elimination steps for ParAMD. The plots are cropped to reflect the actual range of the data. The dotted line marks the threshold of 64, representing the minimum size needed to fully utilize all 64 threads.
- `/path/to/a/clean/scratch/dir/plots/tuning.png`: The impact of the relaxation factor $mult$ and the limitation factor $lim$ on core AMD runtime, distance-2 independent set selection time, and ordering quality for our ParAMD. Results are shown for matrices `nd24k` and `nlpkkt240`, representing the worst and best scalability cases, respectively. This experiment is conducted using 64 threads. Lighter colors indicate better performance.
Example outputs produced on an AMD EPYC 7763 CPU are available in the plots directory.
For completeness, a detailed description of each script is provided below.
To download the dataset, execute the following command from the root of this repository:
./scripts/download.sh /path/to/a/clean/scratch/dir/
This will download the benchmark matrices to `/path/to/a/clean/scratch/dir/dataset/`. This takes about 30 minutes, depending on your internet bandwidth. The matrices used in our benchmarks come from the SuiteSparse Matrix Collection; details are provided below, sorted by the number of nonzeros:
| Matrix Name | #rows | #nonzeros | Symmetric | Problem Type |
|---|---|---|---|---|
| pre2 | 659K | 5.83M | No | Circuit Simulation |
| nd24k | 72.0K | 28.7M | Yes | 3D Mesh |
| ldoor | 952K | 42.5M | Yes | Structural |
| dielFilterV3real | 1.10M | 89.3M | Yes | Electromagnetics |
| Flan_1565 | 1.56M | 114M | Yes | Structural |
| HV15R | 2.02M | 283M | No | Computational Fluid Dynamics |
| Queen_4147 | 4.15M | 317M | Yes | Structural |
| stokes | 11.4M | 349M | No | Semiconductor Process |
| nlpkkt240 | 28.0M | 761M | Yes | Optimization |
To build SuiteSparse AMD and ParAMD, run the following command from the root of this repository:
./scripts/build.sh /path/to/a/clean/scratch/dir/
This will first clone the SuiteSparse repository, build AMD, and install the shared library to `/path/to/a/clean/scratch/dir/`. Afterward, ParAMD will be built with SuiteSparse AMD enabled. The estimated compilation time is 2 minutes.
To run the scaling experiment and end-to-end comparison, execute the following command from the root of this repository:
./scripts/scaling.sh /path/to/a/clean/scratch/dir/
This will generate two tables: `/path/to/a/clean/scratch/dir/plots/scaling.csv` and `/path/to/a/clean/scratch/dir/plots/scaling.html`. Both tables contain the same data, with the latter formatted for better human readability. The experiment performs an end-to-end comparison between SuiteSparse AMD and our ParAMD, using 1 thread (1t) and 64 threads (64t). Matrices are randomly permuted five times, and all methods are evaluated on the same set of permutations to eliminate tie-breaking issues. We report the median ordering time and fill-in across the five runs, with speedups and fill-in ratios calculated from the median values. This takes about 2 hours.
To run the time breakdown experiment, execute the following command from the root of this repository:
./scripts/breakdown.sh /path/to/a/clean/scratch/dir/
This will generate a figure at `/path/to/a/clean/scratch/dir/plots/breakdown.png`. The experiment provides a runtime breakdown of our ParAMD as the number of threads scales from 1 to 64. This takes about 40 minutes.
To run the distribution experiment, execute the following command from the root of this repository:
./scripts/distribution.sh /path/to/a/clean/scratch/dir/
This will generate a figure at `/path/to/a/clean/scratch/dir/plots/distribution.png`. The figure displays violin plots showing the distribution of the sizes of distance-2 independent sets across elimination steps in ParAMD. The plots are cropped to show the actual range of the data, with a dotted line marking the threshold of 64, which represents the minimum size required to fully utilize all 64 threads. This takes about 2 minutes.
To run the tuning experiment, execute the following command from the root of this repository:
./scripts/tuning.sh /path/to/a/clean/scratch/dir/
This will generate a figure at `/path/to/a/clean/scratch/dir/plots/tuning.png`. The figure illustrates the impact of the relaxation factor $mult$ and the limitation factor $lim$ on core AMD runtime, distance-2 independent set selection time, and ordering quality for our ParAMD. Results are shown for matrices `nd24k` and `nlpkkt240`, representing the worst and best scalability cases, respectively. This experiment is conducted using 64 threads. Lighter colors indicate better performance. This takes about 40 minutes.
If you find this code useful, please cite our paper:
@misc{paramd,
title={Parallelizing the Approximate Minimum Degree Ordering Algorithm: Strategies and Evaluation},
author={Yen-Hsiang Chang and Aydın Buluç and James Demmel},
year={2025},
eprint={2504.17097},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2504.17097},
}
This software optionally uses AMD (Approximate Minimum Degree) from the SuiteSparse project, primarily for benchmarking; SuiteSparse AMD is licensed under the BSD 3-Clause License. Our codebase extracts the workflow from SuiteSparse AMD, which is a sequential implementation, and enables parallelism via multiple elimination using distance-2 independent sets to build our ParAMD.
Copyright (c) 2025, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at [email protected].
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.