Commit 156a12c
Author: Henry Jin
Commit message: synced with the 4.5.0 implementation of the examples-internal repo
1 parent: c65fe47

File tree: 447 files changed, +3166 -788 lines changed

Changes.log

+6
@@ -1,3 +1,9 @@
[20-May-2016] Version 4.5.0
Changes from 4.0.2ltx

1. Reorganization into topic chapters
2. Change file suffixes (f/f90 => Fixed/Free format; C++ => cpp)

[2-Feb-2015] Version 4.0.2
Changes from 4.0.1ltx

Chap_SIMD.tex

+48
@@ -0,0 +1,48 @@
\pagebreak
\chapter{SIMD}
\label{chap:simd}

Single instruction, multiple data (SIMD) is a form of parallel execution
in which the same operation is performed on multiple data elements
independently in hardware vector processing units (VPU), also called SIMD units.
The addition of two vectors to form a third vector is a SIMD operation.
Many processors have SIMD (vector) units that can perform 2, 4, 8 or more
executions of the same operation simultaneously (in a single SIMD unit).

Loops without loop-carried backward dependences (or with dependences preserved using
\code{ordered} \code{simd}) are candidates for vectorization by the compiler for
execution with SIMD units. In addition, with state-of-the-art vectorization
technology and the \code{declare simd} construct extensions for function vectorization
in the OpenMP 4.5 specification, loops with function calls can be vectorized as well.
The basic idea is that a scalar function call in a loop can be replaced by a vector version
of the function, and the loop can then be vectorized by combining a loop
vectorization (\code{simd} directive on the loop) with a function
vectorization (\code{declare simd} directive on the function).

A \code{simd} construct states that SIMD operations are to be performed on the
data within the loop. A number of clauses are available to provide
data-sharing attributes (\code{private}, \code{linear}, \code{reduction} and
\code{lastprivate}). Other clauses provide vector length preference/restrictions
(\code{simdlen} / \code{safelen}), loop collapsing (\code{collapse}), and data
alignment (\code{aligned}).

The \code{declare simd} directive designates
that a vector version of the function should also be constructed for
execution within loops that contain the function and have a \code{simd}
directive. Clauses provide argument specifications (\code{linear},
\code{uniform}, and \code{aligned}), a requested vector length
(\code{simdlen}), and designate whether the function is always or never
called conditionally in a loop (\code{inbranch}/\code{notinbranch}).
The latter is for optimizing performance.

Also, the \code{simd} construct has been combined with the worksharing loop
constructs (\code{for simd} and \code{do simd}) to enable simultaneous thread
execution in different SIMD units.
%Hence, the \code{simd} construct can be
%used alone on a loop to direct vectorization (SIMD execution), or in
%combination with a parallel loop construct to include thread parallelism
%(a parallel loop sequentially followed by a \code{simd} construct,
%or a combined construct such as \code{parallel do simd} or
%\code{parallel for simd}).

Chap_affinity.tex

+118
@@ -0,0 +1,118 @@
\pagebreak
\chapter{OpenMP Affinity}
\label{chap:openmp_affinity}

OpenMP Affinity consists of a \code{proc\_bind} policy (thread affinity policy) and a specification of
places (\texttt{"}location units\texttt{"} or \plc{processors} that may be cores, hardware
threads, sockets, etc.).
OpenMP Affinity enables users to bind computations to specific places.
The placement will hold for the duration of the parallel region.
However, the runtime is free to migrate the OpenMP threads
to different cores (hardware threads, sockets, etc.) prescribed within a given place,
if two or more cores (hardware threads, sockets, etc.) have been assigned to a given place.

Often the binding can be managed without resorting to explicitly setting places.
Without the specification of places in the \code{OMP\_PLACES} variable,
the OpenMP runtime will distribute and bind threads using the entire range of processors for
the OpenMP program, according to the \code{OMP\_PROC\_BIND} environment variable
or the \code{proc\_bind} clause. When places are specified, the OpenMP runtime
binds threads to the places according to a default distribution policy, or
those specified in the \code{OMP\_PROC\_BIND} environment variable or the
\code{proc\_bind} clause.

In the OpenMP Specifications document a processor refers to an execution unit that
is enabled for an OpenMP thread to use. A processor is a core when there is
no SMT (Simultaneous Multi-Threading) support or SMT is disabled. When
SMT is enabled, a processor is a hardware thread (HW-thread). (This is the
usual case; but actually, the execution unit is implementation defined.) Processors
are numbered sequentially from 0 to the number of cores less one (without SMT), or
from 0 to the number of HW-threads less one (with SMT). OpenMP places use the processor number to designate
binding locations (unless an \texttt{"}abstract name\texttt{"} is used).

The processors available to a process may be a subset of the system's
processors. This restriction may be the result of a
wrapper process controlling the execution (such as \code{numactl} on Linux systems),
compiler options, library-specific environment variables, or default
kernel settings. For instance, multiple MPI processes
launched on a single compute node will each have a subset of processors as
determined by the MPI launcher or set by MPI affinity environment
variables for the MPI library.
%Forked threads within an MPI process
%(for a hybrid execution of MPI and OpenMP code) inherit the valid
%processor set for execution from the parent process (the initial task region)
%when a parallel region forks threads. The binding policy set in
%\code{OMP\_PROC\_BIND} or the \code{proc\_bind} clause will be applied to
%the subset of processors available to \plc{the particular} MPI process.

%Also, setting an explicit list of processor numbers in the \code{OMP\_PLACES}
%variable before an MPI launch (which involves more than one MPI process) will
%result in unspecified behavior (and doesn't make sense) because the set of
%processors in the places list must not contain processors outside the subset
%of processors for an MPI process. A separate \code{OMP\_PLACES} variable must
%be set for each MPI process, and is usually accomplished by launching a script
%which sets \code{OMP\_PLACES} specifically for the MPI process.

Threads of a team are positioned onto places in a compact manner, a
scattered distribution, or onto the master's place, by setting the
\code{OMP\_PROC\_BIND} environment variable or the \code{proc\_bind} clause to
\plc{close}, \plc{spread}, or \plc{master}, respectively. When
\code{OMP\_PROC\_BIND} is set to FALSE no binding is enforced; and
when the value is TRUE, the binding is implementation defined to
a set of places in the \code{OMP\_PLACES} variable or to places
defined by the implementation if the \code{OMP\_PLACES} variable
is not set.

The \code{OMP\_PLACES} variable can also be set to an abstract name
(\plc{threads}, \plc{cores}, \plc{sockets}) to specify that a place is
either a single hardware thread, a core, or a socket, respectively.
This description of \code{OMP\_PLACES} is most useful when the
number of threads is equal to the number of hardware threads, cores,
or sockets. It can also be used with a \plc{close} or \plc{spread}
distribution policy when the equality doesn't hold.

% We need an example of using sockets, cores and threads:

% case 1 threads:
%
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=threads
%
% core #      0   1   2   3
% processor # 0,1 2,3 4,5 6,7
% thread # 0  * _ _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ * _ _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ * _ _ _   #mask for thread 2
% thread # 3  _ _ _ _ _ _ * _   #mask for thread 3

% case 2 cores:
%
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=cores
%
% core #      0   1   2   3
% processor # 0,1 2,3 4,5 6,7
% thread # 0  * * _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ * * _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ * * _ _   #mask for thread 2
% thread # 3  _ _ _ _ _ _ * *   #mask for thread 3

% case 3 sockets:
%
% No Hyper-Threads
% 3 sockets x 4 cores
%
% export OMP_NUM_THREADS=3
% export OMP_PLACES=sockets
%
% socket #    0        1        2
% processor # 0,1,2,3  4,5,6,7  8,9,10,11
% thread # 0  * * * * _ _ _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ _ _ * * * * _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ _ _ _ _ * * * *   #mask for thread 2

Chap_data_environment.tex

+75
@@ -0,0 +1,75 @@
\pagebreak
\chapter{Data Environment}
\label{chap:data_environment}
The OpenMP \plc{data environment} contains data attributes of variables and
objects. Many constructs (such as \code{parallel}, \code{simd}, \code{task})
accept clauses to control \plc{data-sharing} attributes
of referenced variables in the construct, where \plc{data-sharing} applies to
whether the attribute of the variable is \plc{shared},
is \plc{private} storage, or has special operational characteristics
(as found in the \code{firstprivate}, \code{lastprivate}, \code{linear}, or \code{reduction} clause).

The data environment for a device (distinguished as a \plc{device data environment})
is controlled on the host by \plc{data-mapping} attributes, which determine the
relationship of the data on the host, the \plc{original} data, and the data on the
device, the \plc{corresponding} data.

\bigskip
DATA-SHARING ATTRIBUTES

Data-sharing attributes of variables can be classified as being \plc{predetermined},
\plc{explicitly determined} or \plc{implicitly determined}.

Certain variables and objects have predetermined attributes.
A commonly found case is the loop iteration variable in associated loops
of a \code{for} or \code{do} construct. It has a private data-sharing attribute.
Variables with predetermined data-sharing attributes cannot be listed in a data-sharing clause; but there are some
exceptions (mainly concerning loop iteration variables).

Variables with explicitly determined data-sharing attributes are those that are
referenced in a given construct and are listed in a data-sharing attribute
clause on the construct. Some of the common data-sharing clauses are:
\code{shared}, \code{private}, \code{firstprivate}, \code{lastprivate},
\code{linear}, and \code{reduction}. % Are these all of them?

Variables with implicitly determined data-sharing attributes are those
that are referenced in a given construct, do not have predetermined
data-sharing attributes, and are not listed in a data-sharing
attribute clause of an enclosing construct.
For a complete list of variables and objects with predetermined and
implicitly determined attributes, please refer to the
\plc{Data-sharing Attribute Rules for Variables Referenced in a Construct}
subsection of the OpenMP Specifications document.

\bigskip
DATA-MAPPING ATTRIBUTES

The \code{map} clause on a device construct explicitly specifies how the list items in
the clause are mapped from the encountering task's data environment (on the host)
to the corresponding item in the device data environment (on the device).
The common \plc{list items} are arrays, array sections, scalars, pointers, and
structure elements (members).

Procedures and global variables have predetermined data mapping if they appear
within the list or block of a \code{declare target} directive. Also, a C/C++ pointer
is mapped as a zero-length array section, as is a C++ variable that is a reference to a pointer.
% Waiting for response from Eric on this.
Without explicit mapping, non-scalar and non-pointer variables within the scope of the \code{target}
construct are implicitly mapped with a \plc{map-type} of \code{tofrom}.
Without explicit mapping, scalar variables within the scope of the \code{target}
construct are not mapped, but have an implicit firstprivate data-sharing
attribute. (That is, the value of the original variable is given to a private
variable of the same name on the device.) This behavior can be changed with
the \code{defaultmap} clause.

The \code{map} clause can appear on \code{target}, \code{target data} and
\code{target enter/exit data} constructs. The operations of creation and
removal of device storage as well as assignment of the original list item
values to the corresponding list items may be complicated when the list
item appears on multiple constructs or when the host and device storage
is shared. In these cases the item's reference count, the number of times
it has been referenced (+1 on entry and -1 on exit) in nested (structured)
map regions and/or accumulative (unstructured) mappings, determines the operation.
Details of the \code{map} clause and reference count operation are specified
in the \plc{map Clause} subsection of the OpenMP Specifications document.

Chap_devices.tex

+53
@@ -0,0 +1,53 @@
\pagebreak
\chapter{Devices}
\label{chap:devices}

The \code{target} construct consists of a \code{target} directive
and an execution region. The \code{target} region is executed on
the default device or the device specified in the \code{device}
clause.

In OpenMP version 4.0, by default, all variables within the lexical
scope of the construct are copied \plc{to} and \plc{from} the
device, unless the device is the host, or the data exists on the
device from a previously executed data-type construct that
has created space on the device and possibly copied host
data to the device storage.

The constructs that explicitly
create storage, transfer data, and free storage on the device
are categorized as structured and unstructured. The
\code{target} \code{data} construct is structured. It creates
a data region around \code{target} constructs, and is
convenient for providing persistent data throughout multiple
\code{target} regions. The \code{target} \code{enter} \code{data} and
\code{target} \code{exit} \code{data} constructs are unstructured, because
they can occur anywhere and do not support a \texttt{"}structure\texttt{"}
(a region) for enclosing \code{target} constructs, as does the
\code{target} \code{data} construct.

The \code{map} clause is used on \code{target}
constructs and the data-type constructs to map host data. It
specifies the device storage and data movement \code{to} and \code{from}
the device, and controls the storage duration.

There is an important change in the OpenMP 4.5 specification
that alters the data model for scalar variables and C/C++ pointer variables.
The default behavior for scalar variables and C/C++ pointer variables
in a 4.5-compliant code is \code{firstprivate}. Example
codes that have been updated to reflect this new behavior are
annotated with a description of the changes required
for correct execution. Often it is a simple matter of mapping
the variable as \code{tofrom} to obtain the intended 4.0 behavior.

In OpenMP version 4.5 the mechanism for target
execution is specified as occurring through a \plc{target task}.
When the \code{target} construct is encountered a new
\plc{target task} is generated. The \plc{target task}
completes after the \code{target} region has executed and all data
transfers have finished.

This new specification does not affect the execution of
pre-4.5 code; it is a necessary element for asynchronous
execution of the \code{target} region when using the new \code{nowait}
clause introduced in OpenMP 4.5.
