Commit 156a12c
Author: Henry Jin
Commit message: synced with the 4.5.0 implementation of the examples-internal repo
1 parent: c65fe47

File tree: 447 files changed, +3166 -788 lines changed

Changes.log

+6
@@ -1,3 +1,9 @@
[20-May-2016] Version 4.5.0
Changes from 4.0.2ltx

1. Reorganization into topic chapters
2. Change file suffixes (f/f90 => Fixed/Free format; C++ => cpp)

[2-Feb-2015] Version 4.0.2
Changes from 4.0.1ltx

Chap_SIMD.tex

+48
@@ -0,0 +1,48 @@
\pagebreak
\chapter{SIMD}
\label{chap:simd}

Single instruction, multiple data (SIMD) is a form of parallel execution
in which the same operation is performed on multiple data elements
independently in hardware vector processing units (VPU), also called SIMD units.
The addition of two vectors to form a third vector is a SIMD operation.
Many processors have SIMD (vector) units that can perform 2, 4, 8 or more
executions of the same operation simultaneously (in a single SIMD unit).

Loops without loop-carried backward dependences (or with dependences preserved using
\code{ordered} \code{simd}) are candidates for vectorization by the compiler for
execution with SIMD units. In addition, with state-of-the-art vectorization
technology and the \code{declare simd} construct extensions for function vectorization
in the OpenMP 4.5 specification, loops with function calls can be vectorized as well.
The basic idea is that a scalar function call in a loop can be replaced by a vector version
of the function, and the loop can then be vectorized by combining a loop
vectorization (\code{simd} directive on the loop) with a function
vectorization (\code{declare simd} directive on the function).

A \code{simd} construct states that SIMD operations are to be performed on the
data within the loop. A number of clauses are available to provide
data-sharing attributes (\code{private}, \code{linear}, \code{reduction} and
\code{lastprivate}). Other clauses provide vector length preference/restrictions
(\code{simdlen} / \code{safelen}), loop collapsing (\code{collapse}), and data
alignment (\code{aligned}).

The \code{declare simd} directive designates
that a vector version of the function should also be constructed for
execution within loops that contain the function and have a \code{simd}
directive. Clauses provide argument specifications (\code{linear},
\code{uniform}, and \code{aligned}), a requested vector length
(\code{simdlen}), and designate whether the function is always or never
called conditionally in a loop (\code{inbranch}/\code{notinbranch}).
The latter is for optimizing performance.

Also, the \code{simd} construct has been combined with the worksharing loop
constructs (\code{for simd} and \code{do simd}) to enable simultaneous thread
execution in different SIMD units.
%Hence, the \code{simd} construct can be
%used alone on a loop to direct vectorization (SIMD execution), or in
%combination with a parallel loop construct to include thread parallelism
%(a parallel loop sequentially followed by a \code{simd} construct,
%or a combined construct such as \code{parallel do simd} or
%\code{parallel for simd}).

Chap_affinity.tex

+118
@@ -0,0 +1,118 @@
\pagebreak
\chapter{OpenMP Affinity}
\label{chap:openmp_affinity}

OpenMP Affinity consists of a \code{proc\_bind} policy (thread affinity policy) and a specification of
places (\texttt{"}location units\texttt{"} or \plc{processors} that may be cores, hardware
threads, sockets, etc.).
OpenMP Affinity enables users to bind computations to specific places.
The placement will hold for the duration of the parallel region.
However, the runtime is free to migrate the OpenMP threads
to different cores (hardware threads, sockets, etc.) prescribed within a given place,
if two or more cores (hardware threads, sockets, etc.) have been assigned to a given place.

Often the binding can be managed without resorting to explicitly setting places.
Without the specification of places in the \code{OMP\_PLACES} variable,
the OpenMP runtime will distribute and bind threads using the entire range of processors for
the OpenMP program, according to the \code{OMP\_PROC\_BIND} environment variable
or the \code{proc\_bind} clause. When places are specified, the OpenMP runtime
binds threads to the places according to a default distribution policy, or
those specified in the \code{OMP\_PROC\_BIND} environment variable or the
\code{proc\_bind} clause.

In the OpenMP Specifications document a processor refers to an execution unit that
is enabled for an OpenMP thread to use. A processor is a core when there is
no SMT (Simultaneous Multi-Threading) support or SMT is disabled. When
SMT is enabled, a processor is a hardware thread (HW-thread). (This is the
usual case; but actually, the execution unit is implementation defined.) Processors
are numbered sequentially from 0 to the number of cores less one (without SMT), or
from 0 to the number of HW-threads less one (with SMT). OpenMP places use the processor number to designate
binding locations (unless an \texttt{"}abstract name\texttt{"} is used).

The processors available to a process may be a subset of the system's
processors. This restriction may be the result of a
wrapper process controlling the execution (such as \code{numactl} on Linux systems),
compiler options, library-specific environment variables, or default
kernel settings. For instance, multiple MPI processes
launched on a single compute node will each have a subset of processors as
determined by the MPI launcher or set by MPI affinity environment
variables for the MPI library.
%Forked threads within an MPI process
%(for a hybrid execution of MPI and OpenMP code) inherit the valid
%processor set for execution from the parent process (the initial task region)
%when a parallel region forks threads. The binding policy set in
%\code{OMP\_PROC\_BIND} or the \code{proc\_bind} clause will be applied to
%the subset of processors available to \plc{the particular} MPI process.

%Also, setting an explicit list of processor numbers in the \code{OMP\_PLACES}
%variable before an MPI launch (which involves more than one MPI process) will
%result in unspecified behavior (and doesn't make sense) because the set of
%processors in the places list must not contain processors outside the subset
%of processors for an MPI process. A separate \code{OMP\_PLACES} variable must
%be set for each MPI process, and is usually accomplished by launching a script
%which sets \code{OMP\_PLACES} specifically for the MPI process.

Threads of a team are positioned onto places in a compact manner, a
scattered distribution, or onto the master's place, by setting the
\code{OMP\_PROC\_BIND} environment variable or the \code{proc\_bind} clause to
\plc{close}, \plc{spread}, or \plc{master}, respectively. When
\code{OMP\_PROC\_BIND} is set to FALSE no binding is enforced; and
when the value is TRUE, the binding is implementation defined to
a set of places in the \code{OMP\_PLACES} variable or to places
defined by the implementation if the \code{OMP\_PLACES} variable
is not set.

The \code{OMP\_PLACES} variable can also be set to an abstract name
(\plc{threads}, \plc{cores}, \plc{sockets}) to specify that a place is
either a single hardware thread, a core, or a socket, respectively.
This description of \code{OMP\_PLACES} is most useful when the
number of threads is equal to the number of hardware threads, cores,
or sockets. It can also be used with a \plc{close} or \plc{spread}
distribution policy when the equality doesn't hold.

% We need an example of using sockets, cores and threads:

% case 1 threads:
%
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=threads
%
% core #      0   1   2   3
% processor # 0,1 2,3 4,5 6,7
% thread # 0  * _ _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ * _ _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ * _ _ _   #mask for thread 2
% thread # 3  _ _ _ _ _ _ * _   #mask for thread 3

% case 2 cores:
%
% Hyper-Threads on (2 hardware threads per core)
% 1 socket x 4 cores x 2 HW-threads
%
% export OMP_NUM_THREADS=4
% export OMP_PLACES=cores
%
% core #      0   1   2   3
% processor # 0,1 2,3 4,5 6,7
% thread # 0  * * _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ * * _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ * * _ _   #mask for thread 2
% thread # 3  _ _ _ _ _ _ * *   #mask for thread 3

% case 3 sockets:
%
% No Hyper-Threads
% 3 sockets x 4 cores
%
% export OMP_NUM_THREADS=3
% export OMP_PLACES=sockets
%
% socket #    0        1        2
% processor # 0,1,2,3  4,5,6,7  8,9,10,11
% thread # 0  * * * * _ _ _ _ _ _ _ _   #mask for thread 0
% thread # 1  _ _ _ _ * * * * _ _ _ _   #mask for thread 1
% thread # 2  _ _ _ _ _ _ _ _ * * * *   #mask for thread 2

Chap_data_environment.tex

+75
@@ -0,0 +1,75 @@
\pagebreak
\chapter{Data Environment}
\label{chap:data_environment}
The OpenMP \plc{data environment} contains data attributes of variables and
objects. Many constructs (such as \code{parallel}, \code{simd}, \code{task})
accept clauses to control \plc{data-sharing} attributes
of referenced variables in the construct, where \plc{data-sharing} applies to
whether the attribute of the variable is \plc{shared},
is \plc{private} storage, or has special operational characteristics
(as found in the \code{firstprivate}, \code{lastprivate}, \code{linear}, or \code{reduction} clause).

The data environment for a device (distinguished as a \plc{device data environment})
is controlled on the host by \plc{data-mapping} attributes, which determine the
relationship of the data on the host, the \plc{original} data, and the data on the
device, the \plc{corresponding} data.

\bigskip
DATA-SHARING ATTRIBUTES

Data-sharing attributes of variables can be classified as being \plc{predetermined},
\plc{explicitly determined} or \plc{implicitly determined}.

Certain variables and objects have predetermined attributes.
A commonly found case is the loop iteration variable in associated loops
of a \code{for} or \code{do} construct. It has a private data-sharing attribute.
Variables with predetermined data-sharing attributes cannot be listed in a data-sharing clause; but there are some
exceptions (mainly concerning loop iteration variables).

Variables with explicitly determined data-sharing attributes are those that are
referenced in a given construct and are listed in a data-sharing attribute
clause on the construct. Some of the common data-sharing clauses are:
\code{shared}, \code{private}, \code{firstprivate}, \code{lastprivate},
\code{linear}, and \code{reduction}. % Are these all of them?

Variables with implicitly determined data-sharing attributes are those
that are referenced in a given construct, do not have predetermined
data-sharing attributes, and are not listed in a data-sharing
attribute clause of an enclosing construct.
For a complete list of variables and objects with predetermined and
implicitly determined attributes, please refer to the
\plc{Data-sharing Attribute Rules for Variables Referenced in a Construct}
subsection of the OpenMP Specifications document.

\bigskip
DATA-MAPPING ATTRIBUTES

The \code{map} clause on a device construct explicitly specifies how the list items in
the clause are mapped from the encountering task's data environment (on the host)
to the corresponding item in the device data environment (on the device).
The common \plc{list items} are arrays, array sections, scalars, pointers, and
structure elements (members).

Procedures and global variables have predetermined data mapping if they appear
within the list or block of a \code{declare target} directive. Also, a C/C++ pointer
is mapped as a zero-length array section, as is a C++ variable that is a reference to a pointer.
% Waiting for response from Eric on this.
Without explicit mapping, non-scalar and non-pointer variables within the scope of the \code{target}
construct are implicitly mapped with a \plc{map-type} of \code{tofrom}.
Without explicit mapping, scalar variables within the scope of the \code{target}
construct are not mapped, but have an implicit firstprivate data-sharing
attribute. (That is, the value of the original variable is given to a private
variable of the same name on the device.) This behavior can be changed with
the \code{defaultmap} clause.

The \code{map} clause can appear on \code{target}, \code{target data} and
\code{target enter/exit data} constructs. The operations of creation and
removal of device storage as well as assignment of the original list item
values to the corresponding list items may be complicated when the list
item appears on multiple constructs or when the host and device storage
is shared. In these cases the item's reference count, the number of times
it has been referenced (+1 on entry and -1 on exit) in nested (structured)
map regions and/or accumulative (unstructured) mappings, determines the operation.
Details of the \code{map} clause and reference count operation are specified
in the \plc{map Clause} subsection of the OpenMP Specifications document.

Chap_devices.tex

+53
@@ -0,0 +1,53 @@
\pagebreak
\chapter{Devices}
\label{chap:devices}

The \code{target} construct consists of a \code{target} directive
and an execution region. The \code{target} region is executed on
the default device or the device specified in the \code{device}
clause.

In OpenMP version 4.0, by default, all variables within the lexical
scope of the construct are copied \plc{to} and \plc{from} the
device, unless the device is the host, or the data exists on the
device from a previously executed data-type construct that
has created space on the device and possibly copied host
data to the device storage.

The constructs that explicitly
create storage, transfer data, and free storage on the device
are categorized as structured and unstructured. The
\code{target} \code{data} construct is structured. It creates
a data region around \code{target} constructs, and is
convenient for providing persistent data throughout multiple
\code{target} regions. The \code{target} \code{enter} \code{data} and
\code{target} \code{exit} \code{data} constructs are unstructured, because
they can occur anywhere and do not support a \texttt{"}structure\texttt{"}
(a region) for enclosing \code{target} constructs, as does the
\code{target} \code{data} construct.

The \code{map} clause is used on \code{target}
constructs and the data-type constructs to map host data. It
specifies the device storage and data movement \code{to} and \code{from}
the device, and controls the storage duration.

There is an important change in the OpenMP 4.5 specification
that alters the data model for scalar variables and C/C++ pointer variables.
The default behavior for scalar variables and C/C++ pointer variables
in a 4.5-compliant code is \code{firstprivate}. Example
codes that have been updated to reflect this new behavior are
annotated with a description of the changes required
for correct execution. Often it is a simple matter of mapping
the variable as \code{tofrom} to obtain the intended 4.0 behavior.

In OpenMP version 4.5 the mechanism for target
execution is specified as occurring through a \plc{target task}.
When the \code{target} construct is encountered a new
\plc{target task} is generated. The \plc{target task}
completes after the \code{target} region has executed and all data
transfers have finished.

This new specification does not affect the execution of
pre-4.5 code; it is a necessary element for asynchronous
execution of the \code{target} region when using the new \code{nowait}
clause introduced in OpenMP 4.5.
