
Thursday, July 2, 2020

openMPI understanding

https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php

B1. General run-time tuning

https://tinyurl.com/yabwy5en
MCA: The Modular Component Architecture (MCA) is the backbone for much of Open MPI's functionality. It is a series of frameworks, components, and modules that are assembled at run-time to create an MPI implementation.

MCA parameters are the basic unit of run-time tuning for Open MPI. They are simple "key = value" pairs that are used extensively throughout the code base.

BTL options (examples):
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl=self,vader,openib
export OMPI_MCA_btl=self,vader,uct
export OMPI_MCA_btl=^tcp


B2. How to set the value of MCA parameters?
https://docs.oracle.com/cd/E19708-01/821-1319-10/mca-params.html
There are 3 ways to set MCA parameters:
1. Command line: The highest-precedence method is setting MCA parameters on the command line; the format is "--mca <param_name> <value>"
$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
$ mpirun --mca param "value with multiple words" ...

2. Environment variable: OMPI_MCA_<param_name>
$ export OMPI_MCA_mpi_show_handle_leaks=1
$ mpirun -np 4 a.out
3. Aggregate MCA parameter files:
Q11: https://www.open-mpi.org/faq/?category=tuning
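An aggregate MCA parameter file is a plain-text file of "key = value" pairs, read from $HOME/.openmpi/mca-params.conf or $OPAL_PREFIX/etc/openmpi-mca-params.conf. A minimal sketch, reusing the example parameters from above:
# $HOME/.openmpi/mca-params.conf -- one "key = value" pair per line, '#' starts a comment
btl = self,vader,tcp
mpi_show_handle_leaks = 1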

B3. Setting MCA Parameters
Q13: https://www.open-mpi.org/faq/?category=tuning
Each MCA framework has a top-level MCA parameter that can be used to include or exclude 
components from a given run.
# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest:
$ mpirun --mca btl ^tcp,openib ...

# Tell Open MPI to include *only* the components listed here and implicitly
# ignore all the rest (i.e., the loopback, shared memory, and OpenFabrics
# (a.k.a., "OpenIB") MPI point-to-point components):
$ mpirun --mca btl self,sm,openib ...
Note that ^ can only be the prefix of the entire value because the inclusive and exclusive 
behavior are mutually exclusive. Specifically, since the exclusive behavior means "use all 
components except these", it does not make sense to mix it with the inclusive behavior of 
not specifying it (i.e., "use all of these components"). 
https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#sect20
MPI shared memory communications
The vader BTL is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.
Beginning with the v1.8 series, the vader BTL replaces the sm BTL
# Show all the MCA parameters for all BTL components that ompi_info finds:
$ ompi_info --param btl all
$ ompi_info --param btl all --level 9
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)

NOTE: the UCX PML is now the preferred method of InfiniBand support in OpenMPI 4.0.x (the BTLs are the built-in transport methods).

Exported Environment Variables

All environment variables that are named in the form OMPI_* will automatically be exported to new processes on the local and remote nodes. Environment variables can also be set/forwarded to the new processes using the MCA parameter mca_base_env_list. The -x option to mpirun has been deprecated, but the syntax of the MCA param follows that prior example. While the syntax of the -x option and MCA param allows the definition of new variables, note that the parser for these options is currently not very sophisticated - it does not even understand quoted values. Users are advised to set variables in the environment and use the option to export them, not to define them.
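For example, a minimal sketch of exporting a variable to all ranks (OMP_NUM_THREADS is just an example variable, and ./a.out a generic executable):
# Set the variable locally, then ask mpirun to export it (-x is deprecated but still accepted):
export OMP_NUM_THREADS=4
mpirun -x OMP_NUM_THREADS -np 8 ./a.out
# Roughly equivalent using the MCA parameter mentioned above:
mpirun --mca mca_base_env_list "OMP_NUM_THREADS" -np 8 ./a.out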

Setting MCA Parameters

The -mca switch allows the passing of parameters to various MCA (Modular Component Architecture) modules. MCA modules have direct impact on MPI programs because they allow tunable parameters to be set at run time (such as which BTL communication device driver to use, what parameters to pass to that BTL, etc.).
The -mca switch takes two arguments: <key> and <value>. The <key> argument generally specifies which MCA module will receive the value. For example, the <key> "btl" is used to select which BTL to be used for transporting MPI messages. The <value> argument is the value that is passed. For example:
mpirun -mca btl tcp,self -np 1 foo
Tells Open MPI to use the "tcp" and "self" BTLs, and to run a single copy of "foo" an allocated node.
mpirun -mca btl self -np 1 foo
Tells Open MPI to use the "self" BTL, and to run a single copy of "foo" an allocated node.
The -mca switch can be used multiple times to specify different <key> and/or <value> arguments. If the same <key> is specified more than once, the <value>s are concatenated with a comma (",") separating them.
Note that the -mca switch is simply a shortcut for setting environment variables. The same effect may be accomplished by setting corresponding environment variables before running mpirun. The form of the environment variables that Open MPI sets is:
OMPI_MCA_<key>=<value>
Thus, the -mca switch overrides any previously set environment variables. The -mca settings similarly override MCA parameters set in the $OPAL_PREFIX/etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf file.
Unknown <key> arguments are still set as environment variables -- they are not checked (by mpirun) for correctness. Illegal or incorrect <value> arguments may or may not be reported -- it depends on the specific MCA module.
To find the available component types under the MCA architecture, or to find the available parameters for a specific component, use the ompi_info command. See the ompi_info(1) man page for detailed information on the command.
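For example, to list the parameters of a single component rather than all of them:
# Show only the tcp BTL's parameters, at maximum verbosity:
$ ompi_info --param btl tcp --level 9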


B4.  What is processor affinity? 
Open MPI supports processor affinity on a variety of systems through process binding, in which each MPI process, along with its threads, is "bound" to a specific subset of processing resources (cores, sockets, etc.). 
Affinity can improve performance by inhibiting excessive process movement — for example, away from "hot" caches or NUMA memory. Judicious bindings can improve performance by reducing resource contention (by spreading processes apart from one another) or improving interprocess communications (by placing processes close to one another).
Note that processor affinity probably should not be used when a node is over-subscribed (i.e., more processes are launched than there are processors). 
What is memory affinity? Simply: some memory will be faster to access (for a given process) than other memory.
To see whether your system supports processor/memory affinity, check that hwloc is present:
$ ompi_info | grep hwloc
         MCA hwloc: hwloc191 (MCA v2.0, API v2.0, Component v1.8.4)

B5 tell Open MPI to use processor and/or memory affinity
Q19 https://www.open-mpi.org/faq/?category=tuning
  • --byslot: Alias for --bycore.
  • --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. *(Default)*
  • --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  • --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.
The use of processor and memory affinity has evolved rapidly across Open MPI versions; newer releases express the same layouts with the --map-by/--rank-by/--bind-to options described in B6 below.
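For example, a minimal sketch of requesting a socket-wise layout with the older flag and with the newer syntax (assuming a generic ./a.out executable):
# Older option syntax from the list above:
$ mpirun --bysocket -np 8 ./a.out
# Newer, equivalent style (see B6 below):
$ mpirun --map-by socket -np 8 ./a.out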
B6 Mapping, Ranking, and Binding: Oh My! Open MPI employs a three-phase procedure for assigning process locations and ranks:
mapping
Assigns a default location to each process
ranking
Assigns an MPI_COMM_WORLD rank value to each process
binding
Constrains each process to run on specific processors
The mapping step is used to assign a default location to each process based on the mapper being employed. Mapping by slot, node, and sequentially results in the assignment of the processes to the node level. In contrast, mapping by object allows the mapper to assign the process to an actual object on each node.
Note: the location assigned to the process is independent of where it will be bound - the assignment is used solely as input to the binding algorithm.
The mapping of processes to nodes can be defined not just with general policies but also, if necessary, using arbitrary mappings that cannot be described by a simple policy. One can use the "sequential mapper," which reads the hostfile line by line, assigning processes to nodes in whatever order the hostfile specifies. Use the -mca rmaps seq option. For example, using the same hostfile as before:
mpirun -hostfile myhostfile -mca rmaps seq ./a.out
will launch three processes, one on each of nodes aa, bb, and cc, respectively. The slot counts don’t matter; one process is launched per line on whatever node is listed on the line.
Another way to specify arbitrary mappings is with a rankfile, which gives you detailed control over process binding as well. Rankfiles are discussed below.
The second phase focuses on the ranking of the process within the job's MPI_COMM_WORLD. Open MPI separates this from the mapping procedure to allow more flexibility in the relative placement of MPI processes. This is best illustrated by considering the following two cases where we used the --map-by ppr:2:socket option:

                         node aa         node bb
    rank-by core         0 1 ! 2 3       4 5 ! 6 7
    rank-by socket       0 2 ! 1 3       4 6 ! 5 7
    rank-by socket:span  0 4 ! 1 5       2 6 ! 3 7
Ranking by core and by slot provide the identical result - a simple progression of MPI_COMM_WORLD ranks across each node. Ranking by socket does a round-robin ranking within each node until all processes have been assigned an MCW rank, and then progresses to the next node. Adding the span modifier to the ranking directive causes the ranking algorithm to treat the entire allocation as a single entity - thus, the MCW ranks are assigned across all sockets before circling back around to the beginning.
The binding phase actually binds each process to a given set of processors. This can improve performance if the operating system is placing processes suboptimally. For example, it might oversubscribe some multi-core processor sockets, leaving other sockets idle; this can lead processes to contend unnecessarily for common resources. Or, it might spread processes out too widely; this can be suboptimal if application performance is sensitive to interprocess communication costs. Binding can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.
The processors to be used for binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a --bind-to l3cache directive will cause the process to be bound to the processors that share a single L3 cache within that socket.
Alternatively, processes can be assigned to processors based on their local rank on a node using the --bind-to cpu-list:ordered option with an associated --cpu-list "0,2,5". In this example, the first process on a node will be bound to cpu 0, the second process on the node will be bound to cpu 2, and the third process on the node will be bound to cpu 5. --bind-to will also accept cpulist:ordered as a synonym for cpu-list:ordered. Note that an error will result if more processes are assigned to a node than cpus are provided.
To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding algorithm will assign each process located to a socket to a unique core in a round-robin manner.
Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located. In this manner, users can exert detailed control over relative MCW rank location and binding.
Finally, --report-bindings can be used to report bindings.
As an example, consider a node with two processor sockets, each comprising four cores. We run mpirun with -np 4 --report-bindings and the following additional options:
% mpirun ... --map-by core --bind-to core
[...] ... binding child [...,0] to cpus 0001
[...] ... binding child [...,1] to cpus 0002
[...] ... binding child [...,2] to cpus 0004
[...] ... binding child [...,3] to cpus 0008

% mpirun ... --map-by socket --bind-to socket
[...] ... binding child [...,0] to socket 0 cpus 000f
[...] ... binding child [...,1] to socket 1 cpus 00f0
[...] ... binding child [...,2] to socket 0 cpus 000f
[...] ... binding child [...,3] to socket 1 cpus 00f0

% mpirun ... --map-by core:PE=2 --bind-to core
[...] ... binding child [...,0] to cpus 0003
[...] ... binding child [...,1] to cpus 000c
[...] ... binding child [...,2] to cpus 0030
[...] ... binding child [...,3] to cpus 00c0
% mpirun ... --bind-to none
Here, --report-bindings shows the binding of each process as a mask. In the first case, the processes bind to successive cores as indicated by the masks 0001, 0002, 0004, and 0008. In the second case, processes bind to all cores on successive sockets as indicated by the masks 000f and 00f0. The processes cycle through the processor sockets in a round-robin fashion as many times as are needed. In the third case, the masks show us that 2 cores have been bound per process. In the fourth case, binding is turned off and no bindings are reported.
Open MPI’s support for process binding depends on the underlying operating system. Therefore, certain process binding options may not be available on every system.
Process binding can also be set with MCA parameters. Their usage is less convenient than that of mpirun options. On the other hand, MCA parameters can be set not only on the mpirun command line, but alternatively in a system or user mca-params.conf file or as environment variables, as described in the MCA section below. Some examples include:
mpirun option        MCA parameter key              value
--map-by core        rmaps_base_mapping_policy      core
--map-by socket      rmaps_base_mapping_policy      socket
--rank-by core       rmaps_base_ranking_policy      core
--bind-to core       hwloc_base_binding_policy      core
--bind-to socket     hwloc_base_binding_policy      socket
--bind-to none       hwloc_base_binding_policy      none
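Since these are ordinary MCA parameters, they can also be set through environment variables as described in B2, e.g.:
# Equivalent to "mpirun --bind-to core ...":
export OMPI_MCA_hwloc_base_binding_policy=core
mpirun -np 8 ./a.out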

Portable Hardware Locality(HWLOC) (included in OpenMPI)


The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs.
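hwloc also ships the lstopo tool, which prints the detected topology; a quick sketch of inspecting a node:
# Text-only topology summary (lstopo-no-graphics is the console variant of lstopo):
lstopo-no-graphics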

MPI InfiniBand, RoCE, and iWARP communications

https://www.open-mpi.org/faq/?category=openfabrics#ib-components

support for high-speed interconnect networks

https://www.open-mpi.org/faq/?category=openfabrics#run-ucx
https://www.open-mpi.org/faq/?category=building#build-p2p

Compare transport mechanism: https://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy


II. OMPI_MCA_btl

$ ompi_info --param btl all
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.0)
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.0)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.0)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.0)

What is the vader BTL?

The vader BTL is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.

Beginning with the v1.8 series, the vader BTL replaces the sm BTL


III. Using UCX with OpenMPI

http://openucx.github.io/ucx/faq.html
#1.a. See all available transports of OMPI:
module load openmpi
ompi_info |grep btl
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3) 
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3) 
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
#1.b. See all available transports/Devices of UCX:
module load openmpi
ucx_info -d
# Transport: tcp
#   Device: eth0
#      capabilities:
#            bandwidth: 113.16 MB/sec
#   Device: eth1
#      capabilities:
#            bandwidth: 113.16 MB/sec
#   Device: ib0
#      capabilities:
#            bandwidth: 4457.00 MB/sec
# Transport: self
#   Device: self
#      capabilities:
#            bandwidth: 6911.00 MB/sec
# Transport: mm
#   Device: sysv
#      capabilities:
#            bandwidth: 12179.00 MB/sec
#   Device: posix
#      capabilities:
#            bandwidth: 12179.00 MB/sec
#   Transport: ud   (or ud_verbs)
#   Device: mlx4_0:1
#      capabilities:
#            bandwidth: 3774.15 MB/sec

#   Transport: rc   (or rc_verbs)
#   Device: mlx4_0:1
#      capabilities:
#            bandwidth: 3774.15 MB/sec
#   Transport: cm
#   Device: mlx4_0:1
#      capabilities:
#            bandwidth: 2985.42 MB/sec

#   Transport: knem
#   Device: knem
#      capabilities:
#            bandwidth: 13862.00 MB/sec

2.a. Force Open MPI to use UCX

export OMPI_MCA_btl=^vader,tcp,openib,uct
export OMPI_MCA_pml=ucx
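The same selection can also be made per run on the mpirun command line instead of via exports:
mpirun --mca pml ucx --mca btl ^vader,tcp,openib,uct -np 4 ./a.out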

2.b. Choose a specific transport/device

export UCX_TLS=self,mm,knem,sm,ud,rc,tcp
export UCX_NET_DEVICES=mlx4_0:1,ib0

Wednesday, May 13, 2020

Threading Building Blocks (TBB)

Why: need for Kokkos
Intel® Threading Building Blocks (Intel® TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
Many of the library interfaces employ generic programming, in which interfaces are defined by requirements on types and not specific types. The C++ Standard Template Library (STL) is an example of generic programming. Generic programming enables Intel TBB to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs.
The net result is that Intel TBB enables you to specify parallelism far more conveniently than using raw threads, and at the same time can improve performance.
https://tinyurl.com/yallm39a
##-----------
Intel(R) Threading Building Blocks is available commercially (see http://software.intel.com/en-us/intel-tbb) as a binary distribution, and in open source, in both source and binary forms (see https://github.com/intel/tbb).

1. Require: glibc 2.17
check: ldd --version
The GNU C Library, commonly known as glibc, is the GNU Project's implementation of the C standard library. 

2. Compile TBB
Download:
Manual configuration is described in: oneTBB-2020.2/cmake/README.rst
#--
tar zxvf oneTBB-2020.2.tar.gz
cd oneTBB-2020.2/build
#-- comment out these lines from the "build/linux.gcc.inc" file:
# gcc 4.8 and later support RTM intrinsics, but require command line switch to enable them
ifneq (,$(shell gcc -dumpversion | egrep  "^4\.[8-9]"))
    RTM_KEY = -mrtm
#endif
#--
module load compiler/gcc-9.2.0 
module load tool_dev/cmake-3.17.2
module load tool_dev/glibc-2.19
export LD_LIBRARY_PATH=/home1/p001cao/local/app/tool_dev/glibc-2.19/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

######## 
cd oneTBB-2020.2/src
make -j 8

This produces a new folder "build/linux_intel64_gcc_cc9.2.0_libc2.12_kernel2.6.32_release" containing the TBB libraries.
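A hypothetical sketch of compiling a program against the freshly built library (my_app.cpp and the exact build-folder name are placeholders and depend on your system):
cd oneTBB-2020.2
export TBB_BUILD=$PWD/build/linux_intel64_gcc_cc9.2.0_libc2.12_kernel2.6.32_release   # adjust to the folder actually produced
g++ -std=c++11 my_app.cpp -Iinclude -L$TBB_BUILD -ltbb -o my_app      # headers live in oneTBB-2020.2/include
export LD_LIBRARY_PATH=$TBB_BUILD:$LD_LIBRARY_PATH
./my_app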

Tuesday, May 5, 2020

MVAPICH2

Which MPI implementation?
MVAPICH2 (MPI-3 over InfiniBand) is an MPI-3 implementation based on MPICH ADI3 layer. 
wget http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3.2.tar.gz 
tar -xzf mvapich2-2.3.2.tar.gz
module load conda/py37mvapichSupp 
cd mvapich2-2.3.2
./autogen.sh

I. install MVAPICH2 + GCC (USC)

1. Supporting: 
module load conda/conda3
conda create -n py37mvapichSupp python=3.7
source activate py37mvapichSupp 
conda install autoconf automake 
conda install -c sas-institute libnuma
conda install -c conda-forge lld=9.0.1 binutils    # llvm & gold linker
#--
prepend-path PKG_CONFIG_PATH $topdir/lib/pkgconfig

2. Configuration

#2.2. USC 2:
module load compiler/gcc-9.2.0   
module load conda/py37mvapichSupp         # to use gold linker or lld linker
./configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold" \
--with-device=ch3:mrail --with-rdma=gen2 --enable-hybrid \
--prefix=/home1/p001cao/local/app/mvapich2/2.3.2-gcc9.2.0
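Then build and install in the usual way:
make -j 8
make install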

II. MVAPICH and SGE
http://gridscheduler.sourceforge.net/howto/mvapich/MVAPICH_Integration.html

The job example 'mvapich.sh' starts the 'xhpl' program. Please note that the MPI job has to start 'mpirun_rsh' with the option "-np $NSLOTS" so that it runs with the correct number of slots ($NSLOTS is set by Grid Engine).
To tell it where to start the MPI tasks, pass "-hostfile $TMPDIR/machines" as the second argument.

Additionally, for tight integration remember to use "-rsh", and optionally you can use "-nowd" to prevent mvapich from doing 'cd $wd' on the remote hosts.
This leaves SGE in charge of the working directory.
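A minimal sketch of such an SGE job script (the parallel environment name 'mvapich' and the slot count are hypothetical and cluster-specific):
#!/bin/bash
#$ -pe mvapich 16        # hypothetical PE name and slot count
#$ -cwd
mpirun_rsh -rsh -np $NSLOTS -hostfile $TMPDIR/machines ./xhpl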


Friday, November 29, 2019

Installing mpi4py & Voro++

I. Install mpi4py

1. on Window

Note: 
- install Visual Studio Build Tools at this link: http://go.microsoft.com/fwlink/?LinkId=691126&fixForIE=.exe
- download Microsoft MPI v10.1, install both msmpisdk.msi and msmpisetup.exe
https://docs.microsoft.com/en-us/message-passing-interface/microsoft-mpi
#check: mpiexec -help

1.1a. install mpi4py with conda: 
# for intel MPI 
pip install mpi4py impi

1.1b.  install mpi4py on Python without Conda: (being used)
install another Python 3.7, independent of Conda
https://www.python.org/downloads/windows/
pip install mpi4py numpy scipy 
pip search mpi 
pip install impi

# register mpiexec
mpiexec -register
enter your Windows user name & password
test:       mpiexec -np 2 python -m mpi4py.bench helloworld                       

1.2 Run:

- with anaconda: open Anaconda_Prompt
 $ mpiexec -np 8 python script.py

- without anaconda: open cmd as administrator

 $ mpiexec -np 8 python script.py


2. on Linux

make sure mpi4py links to the right MPI library (OpenMPI, MPICH, or Intel MPI), then use the matching command to run:
mpirun -np 5 python
# or
mpiexec -np 5 python

2.1. Install together with a conda-provided MPI:
To choose the right mpi4py build, attach the option [-c CHANNEL] to conda install; available channels include:
    conda-forge
    intel
    bioconda
    anaconda
Note: conda-forge contains mpi4py builds for both OpenMPI and MPICH
conda search -c conda-forge mpi4py     # find package
conda search -c intel       mpi4py   
conda search -c anaconda    mpi4py   

conda install [-c channel] <package_name>=<version>=<build_string>

conda create --name new_name --clone old_name
conda remove --name old_name --all
# for mpi4py with OpenMPI(being used)
module load conda/conda3
conda create  -n  py37ompi python=3.7 scipy numpy scikit-learn

source activate   py37ompi
pip install tess ovito                  # voro++ 

conda install -c conda-forge mpi4py=3.0.3=py37hd0bea5a_0


for mpich 
conda create --name py37mpich --clone py37ompi
source activate    py37mpich
conda uninstall mpi4py mpi

conda install -c conda-forge mpi4py=3.0.3=py37hcf07815_0  

# for intel MPI (just support python 3.6)
conda create  -n  py36impi python=3.6 scipy numpy scikit-learn
source activate    py36impi 
conda install -c intel mpi4py=3.0.0=py36_intel_0 
pip install tess ovito 

TEST:
mpirun -np 5 python -m mpi4py.bench helloworld
## ------
Hello, World! I am process 0 of 5 on leopard.
Hello, World! I am process 1 of 5 on leopard.
Hello, World! I am process 2 of 5 on leopard.
Hello, World! I am process 3 of 5 on leopard.
Hello, World! I am process 4 of 5 on leopard.

NOTE: there are 3 Python environments that include a conda-provided MPI: py36mpi, py37mpi, py27mpi
 - but loading the conda-provided MPI may cause unexpected conflicts with other MPI installations. So consider installing mpi4py alone, without the conda MPI --> use pip install

2.2. Install without installing conda-mpi:  (being used)
Note: installing mpi4py with conda also installs a conda-provided MPI, which we cannot control and which may conflict with other MPI installations. To use the OpenMPI we want, we must use pip install (this normally fails to link the MPI compiler with python3.6 but works with python3.7):
module load conda/conda3

conda create  -n  py37 python=3.7 scipy numpy 


source activate   py37
pip install mpi4py tess

TEST: (this works on CentOS 7)
module load mpi/openmpi4.0.2-Intel2019xeU4
module load conda/py37

mpirun -np 5 python -m mpi4py.bench helloworld
## ------

TEST: (on CentOS 6 --> glibc error)
module load mpi/openmpi4.0.2-Intel2019xe     
module load conda/py37

mpirun -np 5 python -m mpi4py.bench helloworld

## ------


II. Install Voro++

module load conda3
source activate py37 
pip install     tess                                           #  voro++ library


Ref : https://github.com/abria/TeraStitcher/wiki/Multi-CPU-parallelization-using-MPI-and-Python-scripts
https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/

Friday, July 26, 2019

Compiling OpenMPI 4

- Some applications require C++11, which is only supported by GCC 4.8 or newer; this is not always available on the system, so a newer GCC may need to be installed before compiling OpenMPI.
- Make sure to build OpenMPI with 64-bit support. To check whether the currently available OpenMPI supports 64-bit or not, type this:
                ompi_info -a | grep 'Fort integer size'

If the output is 8, then it supports 64-bit. If the output is 4, then it only supports 32-bit.

* Configuration for 64-bit support:

+ For Intel compilers use:
           FFLAGS=-i8 FCFLAGS=-i8 CFLAGS=-m64 CXXFLAGS=-m64
+ For GNU compilers type:
           FFLAGS="-m64 -fdefault-integer-8" FCFLAGS="-m64 -fdefault-integer-8" CFLAGS=-m64 CXXFLAGS=-m64
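For example, a sketch of passing the GNU-compiler flags to configure (the prefix here is hypothetical):
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran \
CFLAGS=-m64 CXXFLAGS=-m64 \
FFLAGS="-m64 -fdefault-integer-8" FCFLAGS="-m64 -fdefault-integer-8" \
--prefix=$HOME/local/app/openmpi/4.0.2-gcc-64bit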

- Keep the source tree after compiling.


Install UCX and libfabric:
################
Installation OPTIONS in README.txt
or ./configure -h

tar xvzf openmpi-4.0.2.tar.gz
cd openmpi-4.0.2
mkdir build
cd build

I. OpenMPI-4.0.4 + Intel-2020xe (USC)

Note: Intel 2019 on the cluster has a wrong path and cannot work

- "--with-verbs" to use InfiniBand
- Using InfiniBand may limit the number of nodes?
- Instead of configuring --without-verbs/--without-ucx, components can also be turned off at run time:
mpirun --mca btl ^tcp,openib -np 4 a.out
export OMPI_MCA_btl=^tcp,openib
export OMPI_MCA_btl=^tcp
So just compile the whole thing.
#--
# USC1
module load intel/compiler-xe19u5
module load compiler/gcc/9.1.0

check: icpc -v
Configure OpenMPI
../configure CC=icc CXX=icpc FC=ifort F77=ifort \
--with-sge --with-verbs --without-ucx --without-cma \
--prefix=/uhome/p001cao/local/app/openmpi/4.0.2-intelxe19u5-IB 
make -j 8
make install

3. Test openMPI
the only thing that users need to do to use Open MPI is ensure that: 
<prefix>/bin is in their PATH, and
<prefix>/lib is in their LD_LIBRARY_PATH.
Users may need to ensure to set the PATH and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc) so that non-interactive rsh/ssh-based logins will be able to find the Open MPI executables.
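If a module file is not used, a minimal sketch of setting these paths directly (MPI_HOME is just a local shell variable; the prefix is the one from the configure step above):
export MPI_HOME=/uhome/p001cao/local/app/openmpi/4.0.2-intelxe19u5-IB
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH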

Create Module file:
.............

TEST: mpic++ -v

#2.2. USC 2: 
# use the lld linker (included in the Intel bin directory, requires GLIBC > 2.15)
module load compiler/gcc-10.1.0
module load intel/compiler-xe19u5       # lld
#--
export myUCX=/home1/p001cao/local/app/tool_dev/ucx-1.8-intel  
../configure CC=icc CXX=icpc FC=ifort F77=ifort LDFLAGS="-fuse-ld=lld -lrt" \
--with-sge --without-verbs --with-ucx=${myUCX} \
--prefix=/home1/p001cao/local/app/openmpi/4.0.4-intelxe19u5
## consider linking to the Intel compiler libs:
export myIntel=/home1/p001cao/local/app/intel/xe19u5/compilers_and_libraries_2019.5.281/linux/compiler/lib
LDFLAGS="-L${myIntel}/intel64_lin -Wl,-rpath,${myIntel}/intel64_lin" \
##--
export PATH=/home1/p001cao/local/app/intel/xe19u5/compilers_and_libraries_2019.5.281/linux/bin/intel64:$PATH
export CC=icc; export CXX=icpc; export FORTRAN=ifort

II. install OpenMPI + GCC (USC)

NOTE: 
* install libfabric, knem, ... in UCX (OpenMPI 4.0.3 --> ucx-1.7 or older), or install them with OpenMPI
* using UCX is recommended: --without-verbs

## consider lld linker: 
module load llvm/llvm-gcc10-lld                   # to use lld  
LDFLAGS="-fuse-ld=lld -lrt"    

## gold linker:
module load tool_dev/binutils-2.32                                         
LDFLAGS="-fuse-ld=gold -lrt"     

# 2.1. USC 1:
## 1. not use UCX
module load tool_dev/binutils-2.36                       # gold, should use to avoid link-error
module load compiler/gcc-11.2
export myKNEM=/uhome/p001cao/local/app/tool_dev/knem-1.1.4
    
## IB cluster
mkdir build_eagle && cd build_eagle 
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold -lrt" \
--with-sge --without-ucx --with-verbs --with-knem=${myKNEM} \
--prefix=/uhome/p001cao/local/app/openmpi/4.1.1-gcc11.2-noUCX-eagle
## noIB cluster
mkdir build_lion && cd build_lion
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold -lrt" \
--with-sge --without-ucx --without-verbs --with-knem=${myKNEM} \
--prefix=/uhome/p001cao/local/app/openmpi/4.1.1-gcc11.2-noUCX-lion

### 2. with ucx
# ucx Error: ib_md.c:329  UCX  ERROR ibv_reg_mr(address=0x145cb580, length=263504, access=0xf) failed: Resource temporarily unavailable

## use the same procedure to compile on Lion and Eagle
module load tool_dev/binutils-2.35                        # gold 
module load compiler/gcc-10.2                 
export myUCX=/uhome/p001cao/local/app/tool_dev/ucx-1.9
export myKNEM=/uhome/p001cao/local/app/tool_dev/knem-1.1.4
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold -lrt" \
--with-sge --without-verbs --with-ucx=${myUCX} --with-knem=${myKNEM} \
--prefix=/uhome/p001cao/local/app/openmpi/4.1.1-gcc10.3-eagle


##2.2. USC 2: 
## with UCX: on Tacheon, UCX gives better performance (but raises the possibility of errors)
cd openmpi-4.1.1
mkdir buildGCC && cd buildGCC
#--
module load tool_dev/binutils-2.35                        # gold
module load compiler/gcc-10.3
export myUCX=/home1/p001cao/local/app/tool_dev/ucx-1.10               ## UCX
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold -lrt" \
--with-sge --without-verbs --with-ucx=${myUCX}  \
--prefix=/home1/p001cao/local/app/openmpi/4.1.1-gcc10.3
#not use
export myKNEM=/home1/p001cao/local/app/tool_dev/knem-1.1.4
--with-knem=${myKNEM}

## without UCX: 
module load tool_dev/binutils-2.35                        # gold
module load compiler/gcc-10.3
../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran LDFLAGS="-fuse-ld=gold -lrt" \
--with-sge --with-verbs --without-ucx \
--prefix=/home1/p001cao/local/app/openmpi/4.1.1-gcc10.3-noUCX

##2.3. CAN: 

module load gcc/gcc-7.4.0
check:   g++   -v

2. Configuration

cd openmpi-4.0.2
mkdir build
cd build

../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran \
--with-sge --without-verbs --without-ucx  \
--prefix=/home/thang/local/app/openmpi/4.0.2-gcc7.4.0

##2.4. CAN-GPU: 

I. Install Cuda with Intel-xe19: (Runfile Installation)
Download:  wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_rhel6.run

Install: (root acc)
1. disable the graphical target, to update Nvidia driver
systemctl isolate multi-user.target
modprobe -r nvidia-drm

module load compiler/gcc-7.4
sh cuda_10.2.89_440.33.01_rhel6.run --toolkitpath=/home/thang/local/app/cuda-10.2

2. after install Cuda, start the graphical environment again
systemctl start graphical.target

II. Install OpenMPI
# need binutils 2.22 or newer to link cuda

cd openmpi-4.1.1
mkdir build && cd build

Load compilers:
module load compiler/gcc-7.4   # cuda-10 only supports up to gcc-8
module load binutils-2.35 

../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran \
--with-sge --without-ucx \
--with-cuda=/home/thang/local/app/cuda-10.2 \
--prefix=/home/thang/local/app/openmpi/4.1.1-gcc7.4-cuda
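After installing, the CUDA-aware build can be verified with ompi_info (a check suggested by the Open MPI FAQ):
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value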