
Commit 2273286: "first commit" (0 parents)

29 files changed: +3769 -0 lines

MainPage.h (+71 lines)

/** @mainpage Compressed Sparse Blocks (CSB) Library (Cilk Plus implementation)
 *
 * @author <a href="http://gauss.cs.ucsb.edu/~aydin"> Aydın Buluç </a>
 * (in collaboration with <a href="http://crd.lbl.gov/about/staff/amsc/scientific-computing-group-scg/hasan-metin-aktulga/">Hasan Metin Aktulga</a>, <a href="http://www.cs.berkeley.edu/~demmel/">James Demmel</a>, <a href="http://www.cs.georgetown.edu/~jfineman/">Jeremy Fineman</a>, <a href="http://www.fftw.org/~athena/">Matteo Frigo</a>, <a href="http://www.cs.ucsb.edu/~gilbert/">John Gilbert</a>, <a href="http://people.csail.mit.edu/cel/">Charles Leiserson</a>, <a href="http://crd.lbl.gov/about/staff/cds/ftg/leonid-oliker/">Lenny Oliker</a>, <a href="http://crd.lbl.gov/about/staff/cds/ftg/samuel-williams/">Sam Williams</a>).
 *
 * <i> This material is based upon work supported by the National Science Foundation under Grants No. 0540248, 0615215, 0712243, 0822896, and 0709385, by MIT Lincoln Laboratory under contract 7000012980, and by the Department of Energy, Office of Science, ASCR Contract No. DE-AC05-00OR22725. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF) and the Department of Energy (DOE). This software is released under <a href="http://en.wikipedia.org/wiki/MIT_License">the MIT license</a>.</i>
 *
 * @section intro Introduction
 * Compressed Sparse Blocks (CSB) is a storage format for sparse matrices that favors neither rows nor columns, hence offering performance symmetry on shared-memory parallel systems for Ax and A'x. The format was originally described in
 * <a href="http://gauss.cs.ucsb.edu/~aydin/csb2009.pdf">this paper</a> [1]. It was later improved through the incorporation of bitmasked register blocks in <a href="http://gauss.cs.ucsb.edu/~aydin/ipdps2011.pdf">this paper</a> [2], which also proposes an algorithm for symmetric matrices. Finally, <a href="http://gauss.cs.ucsb.edu/~aydin/ipdps14aktulga.pdf">this recent paper</a> [3] includes performance results for the multiple-vector cases.
 *
 * This library targets shared-memory parallel systems (ideally a single NUMA domain for best performance) and implements:
 * - Sparse Matrix-Vector Multiplication (SpMV)
 * - Sparse Matrix-Transpose-Vector Multiplication (SpMV_T)
 * - Sparse Matrix-Multiple-Vector Multiplication (SpMM)
 * - Sparse Matrix-Transpose-Multiple-Vector Multiplication (SpMM_T)
 *
 * Download the <a href="csb2014.tgz">library and drivers as a tarball including the source code</a>.
 *
 * All operations can be performed on an arbitrary semiring by overloading add() and multiply(), though some optimizations might not work for
 * specialized semirings. While the code is implemented using Intel Cilk Plus (available in the Intel compilers and GCC), it can
 * be ported to any concurrency platform that supports efficient task stealing, such as OpenMP and TBB.
 *
 * The driver accepts matrices in a text-based triples format and in a binary format for faster benchmarking (created using
 * <a href="http://gauss.cs.ucsb.edu/~aydin/csb/dumpbinsparse.m">this MATLAB script</a>). The library also includes functions to convert from the common CSC format,
 * though the conversion is serial and not yet optimized for performance.
 * An example input in (compressed) <a href="http://gauss.cs.ucsb.edu/~aydin/csb/asic_320k.mtx.bz2"> ascii </a> and in (compressed) <a href="http://gauss.cs.ucsb.edu/~aydin/csb/asic_320k.bin.bz2">binary</a>. <br>
 *
 * <b> How to run it? </b>
 *
 * Read the <a href="http://gauss.cs.ucsb.edu/~aydin/csb/Makefile-2013">example makefile</a>. Here is a <a href="http://gauss.cs.ucsb.edu/~aydin/csb/README">README</a> file. <br>
 * Running this code on an 8-core Intel processor works as follows (similarly for the other executables):
 * - make parspmv/parspmv_nobm/parspmvt (the tarball includes sample makefiles as well)
 * - CILK_NWORKERS=8 ./parspmvt ../BinaryMatrices/asic_320k.bin nosym binary <br>
 *
 * If your machine has multiple sockets (NUMA domains), you need to constrain the memory space to a single NUMA node (CSB is not designed for multiple NUMA domains; it will run, but slower):
 *
 * - export CILK_NWORKERS=8 (or 16 if hyperthreading turns out to be beneficial)
 * - numactl --cpunodebind=0 ./parspmvt ../BinaryMatrices/asic_320k.bin nosym binary <br>
 *
 * If you don't set CILK_NWORKERS, the code runs with as many workers as there are hardware threads on your machine (or within the numactl-constrained domain).
 *
 * - ./parspmv ../BinaryMatrices/kkt_power.bin nosym binary (using the binary format for fast I/O)
 * - ./parspmv ../TextMatrices/kkt_power.mtx nosym text (using the matrix market format)
 * - ./spmm_d$$number runs on $$number right-hand-side vectors that are randomly generated using double precision
 * - ./spmm_s$$number uses single precision for the same case
 * - ./both_d runs both parspmv and parspmv_t one after the other (simulating iterative methods such as BiCG and QMR)
 *
 * <b> What do those numbers mean? </b>
 * - BiCSB: the original CSB code with minor performance fixes; nonsymmetric and without register blocking. Quite robust.
 * - BmCSB: bitmasked register blocks in action. Modify RBDIM in utility.h to try different blocking sizes (8x8, 4x4, etc.). May perform better.
 * - CSC: a serial CSC implementation, for reference only.
 *
 * Release notes:
 * - 1.2: Current version. Multiple-vector support.
 *   - A performance bug affecting A'x scaling on certain matrices is fixed.
 * - 1.1: Bitmasked register blocks, a symmetric algorithm using half the bandwidth, and a port to Intel Cilk Plus.
 *   - A performance bug affecting Ax scaling on certain matrices is fixed.
 *   - Minor: a bug in the parspmvt test driver is fixed; a new parspmv_nobm compilation target is added for those who don't have SSE.
 * - 1.0: Initial version. Support for Ax and A'x using Cilk++.
 *
 * <b> Citation: </b>
 *
 * - [1] Aydın Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. <i>Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks.</i> In SPAA'09: Proceedings of the 21st Annual ACM Symposium on Parallel Algorithms and Architectures, 2009.
 * - [2] Aydın Buluç, Samuel Williams, Leonid Oliker, and James Demmel. <i>Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication.</i> In Proceedings of the IPDPS. IEEE Computer Society, 2011.
 * - [3] H. Metin Aktulga, Aydın Buluç, Samuel Williams, and Chao Yang. <i>Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations.</i> In Proceedings of the IPDPS. IEEE Computer Society, 2014.
 */

Makefile (+12 lines)

GCCOPT = -O2 -fno-rtti -fno-exceptions # -ftree-vectorize
INTELOPT = -O2 -no-ipo -fno-rtti -fno-exceptions -parallel -restrict -std=c++11 -xAVX -no-prec-div #-fno-inline-functions
DEB = -g -DNOBM -O0 -parallel -restrict -std=c++11

seqspmv: csb_spmv_test.cpp bicsb.cpp bicsb.h friends.h utility.h
	g++ $(INCADD) $(GCCOPT) -o seqspmv csb_spmv_test.cpp

clean:
	rm -f seqspmv
	rm -f *.o
README (+61 lines)

========================================================================
    APPLICATION : CSB Overview
========================================================================

Author: Aydin Buluc, LBNL, [email protected]
Date: 2/28/2014

Classes
-------

CSC:
- Class that implements the standard "compressed sparse column" format.
- Used for baseline comparisons.

BiCSB:
- Production (final) class that implements "compressed sparse blocks".
- Nonzeros within a block are stored in "bit-interleaved" order.
- Described in http://dx.doi.org/10.1145/1583991.1584053

BmCSB:
- Class that implements bitmasked register blocks on top of CSB.
- Change the register block dimension inside utility.h (RBDIM); options are 2, 4, 8 (default is 8).
- Described in http://doi.ieeecomputersociety.org/10.1109/IPDPS.2011.73

CSBSYM:
- Class that implements the symmetric algorithm.
- Described in http://doi.ieeecomputersociety.org/10.1109/IPDPS.2011.73

SYM/CSBSYM [do not use]:
- Experimental class that implements a variant of "compressed sparse blocks".
- Nonzeros within a block are stored in row-major order.
- Various optimizations are tried in this class, such as SSE, prefetching, etc.

Files
-----

csb_spmv(t)_test.cpp :
- Driver programs for both sequential and parallel Ax and A'x runs.
- Usage: "./executable matrixfile nosym/sym ascii/binary", or "./executable" alone, in
  which case it reads the ascii file matrix.txt if it exists (only nosym works for now;
  special support for symmetric matrices will be available soon).
- Executables are parspmv, parspmvt, seqspmv, and seqspmvt, where the names are
  self-explanatory.
- For parallel execution, you can specify the number of workers by setting
  the environment variable CILK_NWORKERS.

spmm_test.cpp :
- Driver program for the multiple-vector cases of Ax and A'x (i.e., SpMM for AX and A'X).

bwtest-mimd.cpp :
- Usage: "./bwtest-mimd -n file_1 file_2 ... file_n"
- Bandwidth test program that does SpMV's on n different matrices simultaneously.
- pthreads implementation.

oskispmv(t).cpp :
- Usage: "./oskispmv(t) matrixfile"
- Compares the performance of our serial implementations with plain OSKI to reveal any anomalies.

utility.h :
- Includes constants, preprocessor directives, and utility functions.

/////////////////////////////////////////////////////////////////////////////

Semirings.h (+129 lines)

#ifndef _SEMIRINGS_H_
#define _SEMIRINGS_H_

#include <utility>
#include <climits>
#include <limits>	// std::numeric_limits, used by inf_plus
#include <cmath>
#include <tr1/array>
#include "promote.h"

template <typename T>
struct inf_plus{
	T operator()(const T& a, const T& b) const {
		T inf = std::numeric_limits<T>::max();
		if (a == inf || b == inf){
			return inf;
		}
		return a + b;
	}
};

// (+,*) on scalars
template <class T1, class T2>
struct PTSR
{
	typedef typename promote_trait<T1,T2>::T_promote T_promote;

	static T_promote add(const T1 & arg1, const T2 & arg2)
	{
		return (static_cast<T_promote>(arg1) +
			static_cast<T_promote>(arg2) );
	}
	static T_promote multiply(const T1 & arg1, const T2 & arg2)
	{
		return (static_cast<T_promote>(arg1) *
			static_cast<T_promote>(arg2) );
	}
	// y += a*x overload with a=1
	static void axpy(const T2 & x, T_promote & y)
	{
		y += x;
	}

	static void axpy(T1 a, const T2 & x, T_promote & y)
	{
		y += a*x;
	}
};


template<int Begin, int End, int Step>
struct UnrollerL {
	template<typename Lambda>
	static void step(Lambda& func) {
		func(Begin);
		UnrollerL<Begin+Step, End, Step>::step(func);
	}
};

template<int End, int Step>
struct UnrollerL<End, End, Step> {
	template<typename Lambda>
	static void step(Lambda& func) {
		// base case is when Begin=End; do nothing
	}
};


// (+,*) on std::array's
template<class T1, class T2, unsigned D>
struct PTSRArray
{
	typedef typename promote_trait<T1,T2>::T_promote T_promote;

	// y <- a*x + y overload with a=1
	static void axpy(const array<T2, D> & b, array<T_promote, D> & c)
	{
		const T2 * __restrict barr = b.data();
		T_promote * __restrict carr = c.data();
		__assume_aligned(barr, ALIGN);
		__assume_aligned(carr, ALIGN);

		#pragma simd
		for(unsigned i=0; i<D; ++i)
		{
			carr[i] += barr[i];
		}
		// auto multadd = [&] (int i) { c[i] += b[i]; };
		// UnrollerL<0, D, 1>::step ( multadd );
	}

	// Todo: do partial unrolling; this code will bloat for D > 32
	static void axpy(T1 a, const array<T2,D> & b, array<T_promote,D> & c)
	{
		const T2 * __restrict barr = b.data();
		T_promote * __restrict carr = c.data();
		__assume_aligned(barr, ALIGN);
		__assume_aligned(carr, ALIGN);

		#pragma simd
		for(unsigned i=0; i<D; ++i)
		{
			carr[i] += a * barr[i];
		}
		// auto multadd = [&] (int i) { carr[i] += a * barr[i]; };
		// UnrollerL<0, D, 1>::step ( multadd );
	}
};

// (min,+) on scalars
template <class T1, class T2>
struct MPSR
{
	typedef typename promote_trait<T1,T2>::T_promote T_promote;

	static T_promote add(const T1 & arg1, const T2 & arg2)
	{
		return std::min<T_promote>
			(static_cast<T_promote>(arg1), static_cast<T_promote>(arg2));
	}
	static T_promote multiply(const T1 & arg1, const T2 & arg2)
	{
		// inf_plus is a functor: construct an instance, then call it
		return inf_plus< T_promote >()
			(static_cast<T_promote>(arg1), static_cast<T_promote>(arg2));
	}
};


#endif
