# Practical session: Studying the behavior of parallel applications and identifying bottlenecks by using performance analysis tools

# Lines starting with the mark % are commands that you should execute

1  PAPI

1.1 Discover which are the PAPI metrics and which ones are available on your CPU.

   % papi_avail

    Available events and hardware information.
    --------------------------------------------------------------------------------
   PAPI Version             : 4.2.0.0
   Vendor string and code   : GenuineIntel (1)
   Model string and code    : Intel(R) Xeon(R) CPU       X5675  @ 3.07GHz (44)
   CPU Revision             : 2.000000
   CPUID Info               : Family: 6  Model: 44  Stepping: 2
   CPU Megahertz            : 3066.216064
   CPU Clock Megahertz      : 3066
   Hdw Threads per core     : 2
   Cores per Socket         : 6
   NUMA Nodes               : 2
   CPU's per Node           : 12
   Total CPU's              : 24
   Number Hardware Counters : 7
   Max Multiplex Counters   : 64
   --------------------------------------------------------------------------------
   The following correspond to fields in the PAPI_event_info_t structure.


       Name        Code    Avail Deriv Description (Note)
   PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
   ....

   -------------------------------------------------------------------------
   Of 107 possible events, 57 are available, of which 14 are derived.

   The above command gives us some information about the CPU: its model, clock frequency, etc.
   It also informs us that there are 2 hardware threads per core and 6 cores per
   CPU. The node consists of 2 CPUs, so we have 24 threads, and we can measure at most
   64 multiplexed counters.

   When a metric is marked "Avail", it is supported by the CPU; when Deriv is "No", the
   metric can be measured directly, without combining more than one of the other
   metrics. This CPU can measure 7 hardware counters without multiplexing, which means that
   when we measure more than 7 hardware counters we lose some accuracy. So if we have 3-4 derived
   metrics we could still use all 7 hardware counters.
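
   The accuracy trade-off of multiplexing can be illustrated with a small sketch (Python, all numbers made up): when more events are requested than there are physical counters, each event is counted for only part of the run and PAPI extrapolates the partial count, so the reported value is an estimate rather than an exact count.

   ```python
   # Sketch of how multiplexing estimates a count (hypothetical numbers).
   # With 7 physical counters and, say, 10 requested events, each event is
   # only counted for part of the run; the partial count is scaled up.

   def multiplex_estimate(observed, active_time, total_time):
       """Extrapolate a partially observed count to the whole run."""
       return observed * (total_time / active_time)

   # An event counted for 0.5 s of a 1.0 s run that observed 5,000,000
   # occurrences is reported as ~10,000,000 -- an estimate, which is the
   # accuracy we lose beyond 7 counters.
   print(multiplex_estimate(5_000_000, 0.5, 1.0))  # 10000000.0
   ```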


1.2 Overhead caused by PAPI measurement

    % papi_cost

    Cost of execution for PAPI start/stop, read and accum.
    This test takes a while. Please be patient...

    Performing loop latency test...

    Total cost for loop latency over 1000000 iterations
    min cycles   : 46
    max cycles   : 45149
    mean cycles  : 46.360995
    std deviation: 71.696173
 
    Performing start/stop test...

    Total cost for PAPI_start/stop (2 counters) over 1000000 iterations
    min cycles   : 22800
    max cycles   : 695076
    mean cycles  : 23253.136683
    std deviation: 5051.375835
    ...

   You can observe, for several functions, how many cycles it costs to
   start/stop or read the value of a PAPI hardware counter, etc.
   Keep in mind that during the profiling of an application this happens
   all the time.
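
   The figures that papi_cost prints are just the min/max/mean/standard deviation over the per-iteration cycle costs; a minimal Python sketch with made-up samples:

   ```python
   import statistics

   # The same statistics papi_cost reports, over made-up per-iteration
   # cycle costs for a start/stop pair.
   samples = [46, 46, 47, 46, 120, 46, 46]

   print("min cycles   :", min(samples))
   print("max cycles   :", max(samples))
   print("mean cycles  :", statistics.mean(samples))
   print("std deviation:", statistics.pstdev(samples))
   ```

   Note how one outlier (120 cycles) barely moves the mean but dominates the standard deviation, which is why papi_cost reports both.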



2 NAS Parallel Benchmarks (NPB)
  
    Copy the NAS benchmarks to your home folder:
 
    % cd
    % cp -r /srv/app/data/tutorial .

    We have one folder with the MPI version of the benchmarks
    (NPB3.3-MPI) and another one with the serial version (NPB3.3-SER) 

    Compilation

    Before compilation, one needs to check the configuration file
    'make.def' in the config directory and modify the file if necessary.

       make <benchmark-name> NPROCS=<number> CLASS=<class> \
         [SUBTYPE=<type>] [VERSION=VEC]

    where <benchmark-name>  is "bt", "cg", "dt", "ep", "ft", "is",
                              "lu", "mg", or "sp"
         <number>          is the number of processes
         <class>           is "S", "W", "A", "B", "C", "D", or "E"

 
    We are going to use the LU benchmark with a small number of processes
    and the classes A and B.

    IMPORTANT: There are not many available cores, so it is normal to see
    delays during the execution.

 
    2.1 Go to the MPI folder of the NPB
      % cd ~/tutorial/NPB3.3-MPI

    - Check your configuration file
 
      % vim config/make.def

      Make sure that MPIF77 is set to mpif77 and that all the other lines related to
      MPIF77 are commented out. Close the file (Esc, then :q and Enter; or, if you
      made changes, Esc, then :wq and Enter)

    - Compile the LU benchmark for class A and 4 processes
      % make clean
      % make LU NPROCS=4 CLASS=A

      If everything is OK, you will find the executable lu.A.4 in your bin folder.
      You should always execute the make clean command first in order to clean up the
      previous compilation.

    2.2 Let's execute the benchmark

      % cd bin
      % mpirun --bind-to-core -np 4 lu.A.4

      We use the flag --bind-to-core in order to pin one process to each core and avoid
      any core-sharing issues. However, since there are many users on the machine, drop
      this flag if it causes an error; otherwise keep it.

      You have now executed the benchmark without any instrumentation. In any case,
      execute it twice just to observe whether the duration varies, since there are many users.
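
      One way to quantify the run-to-run variation just mentioned (Python sketch, hypothetical timings):

      ```python
      # Relative difference between two wall-clock times (made-up values
      # for two runs of lu.A.4).
      t1, t2 = 12.4, 13.1  # seconds

      variation = abs(t1 - t2) / min(t1, t2)
      print(f"run-to-run variation: {variation:.1%}")  # 5.6%
      ```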

3  Scalasca


    Program instrumentation: scalasca -instrument or skin
    Summary measurement collection & analysis: scan [-s]
    Summary analysis report examination: square
    Summary experiment scoring: square -s
    Event trace collection & analysis: scan -t
    Event trace analysis report examination: square

    3.1 Compile
       
      We have to tell the NPB build system to use Scalasca in order to instrument the benchmark

      Go to the MPI folder of the NPB and edit the makefile
      % cd ~/tutorial/NPB3.3-MPI
      % vim config/make.def

      Comment the first line about MPIF77
      MPIF77=mpif77  -> #MPIF77=mpif77

      and uncomment the following
      MPIF77 = scalasca -instrument mpif77

      Save and close the makefile (:wq)
      
      Let's compile the LU benchmark for 8 processes (or use 4 processes)

      % make clean
      % make LU NPROCS=8 CLASS=A

      Now the executable is saved in bin.scalasca folder

    3.2 Execute the benchmark

      % cd bin.scalasca
      % scalasca --analyze mpirun --bind-to-core -np 8 lu.A.8
       or
      % skin mpirun --bind-to-core -np 8 lu.A.8

      Scalasca now profiles the application and reports what it is currently doing

      When it finishes there is a folder epik_lu_8_sum or epik_ 

      In order to visualize the data execute the following

      % scalasca -examine epik_lu_8_sum

      Wait, and if everything is right a window will pop up (the network may be slow)

      Start playing with the environment. You can see three browser trees (more on the slides)

      
    3.3 Let's analyze the summary

      Execute the following command

      % scalasca -examine -s epik_lu_8_sum 

     /srv/app/scalasca/bin/cube3_score -r ./epik_lu_8_sum/summary.cube.gz > 
     ./epik_lu_8_sum/epik.score
     Reading ./epik_lu_8_sum/summary.cube.gz... done.
     Estimated aggregate size of event trace (total_tbc): 130098176 bytes
     Estimated size of largest process trace (max_tbc):   17275190 bytes
     (Hint: When tracing set ELG_BUFFER_SIZE > max_tbc to avoid 
     intermediate flushes or reduce requirements using file listing 
     names of USR regions to be filtered.)

     INFO: Score report written to ./epik_lu_8_sum/epik.score

     Check the estimated size of the traces when we trace the application,
     and the size of the largest process trace. The hint informs us which
     environment variable we should set in order to avoid flushes to
     the hard disk.
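
     The buffer-size rule from the hint can be sketched as follows (Python; the round-up-to-the-next-megabyte convention is ours for headroom, not Scalasca's):

     ```python
     # Choose an ELG_BUFFER_SIZE from the reported max_tbc. The hint only
     # requires ELG_BUFFER_SIZE > max_tbc; rounding up to the next
     # megabyte gives a little headroom.
     MB = 1024 * 1024

     def buffer_size(max_tbc):
         return (max_tbc // MB + 1) * MB

     print(buffer_size(17275190))  # 17825792, comfortably above max_tbc
     ```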

   3.4  Check the summary analysis report

      % less epik_lu_8_sum/epik.score

   3.5 Buffer

     Set ELG_BUFFER_SIZE to 17300000 in the ~/.bashrc
     Do not forget to always execute
     % source ~/.bashrc
     after every change to the .bashrc file
    
   3.6 Filtering

     This procedure is one of the most basic ones: the application may call a small
     function whose share of the total time, compared to the other functions, is
     negligible, yet it still inflates the trace. From the epik.score we choose the
     routine with a large max_tbc and a small duration. Here this is exact_
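
     The selection rule just described, large max_tbc but small time share, can be sketched in Python with made-up score rows (the thresholds are arbitrary):

     ```python
     # Pick filter candidates: routines whose trace-buffer cost (max_tbc,
     # in bytes) is large while their time share is small. Rows and
     # thresholds are made up, in the spirit of an epik.score listing.
     rows = [
         ("exact_",   9_000_000,  0.2),  # huge buffer cost, tiny time
         ("ssor_",      400_000, 35.0),
         ("MPI_Recv",   120_000, 12.0),
     ]

     def filter_candidates(rows, tbc_min=1_000_000, time_max=1.0):
         return [name for name, tbc, t in rows
                 if tbc >= tbc_min and t <= time_max]

     print(filter_candidates(rows))  # ['exact_']
     ```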

   3.7 Create the filter file

     % echo "exact_" > lu.filtering

   3.8 Apply the filter without executing the benchmark again

     %  scalasca -examine -s -f lu.filtering ./epik_lu_8_sum

      Be careful about the paths of the files

    /srv/app/scalasca/bin/cube3_score -f lu.filtering -r 
    ./epik_lu_8_sum/summary.cube.gz > 
    ./epik_lu_8_sum/epik.score_lu.filtering
    Reading ./epik_lu_8_sum/summary.cube.gz... done.
    Applying filter "lu.filtering":
    Estimated aggregate size of event trace (total_tbc): 54560192 bytes
    Estimated size of largest process trace (max_tbc):   7582262 bytes
    (Hint: When tracing set ELG_BUFFER_SIZE > max_tbc to avoid 
    intermediate flushes.)

    INFO: Score report written to 
    ./epik_lu_8_sum/epik.score_lu.filtering 

    Now the estimated size of the traces is only 54.6 MB!

   3.9 Check the new filtered analysis report

    % less epik_lu_8_sum/epik.score_lu.filtering

      The mark + indicates the filtered routines

   3.10 Execute the benchmark with filtering 

      Save the current measurement 
      % mv epik_lu_8_sum epik_lu_8_sum_no_filter

      Change the environment variable EPK_FILTER in the ~/.bashrc

      EPK_FILTER=lu.filtering (be careful with the paths)

      % source ~/.bashrc
      
      % scalasca -analyze mpirun -np 8 lu.A.8

  3.11 Examine the scoring of the new measurement

     % scalasca -examine -s epik_lu_8_sum

  3.12 View the score file

     % less epik_lu_8_sum/epik.score


  3.13 TRACING

     If you want to trace an application then add the flag -t

    % scalasca -analyze -t mpirun -np 8 lu.A.8

     Now the experiment directory is called epik_lu_8_trace

    A tool called SCOUT is also invoked automatically, which is one
    of Scalasca's big advantages. This tool analyzes the traces
    and finds communication issues.

   % square epik_lu_8_trace

    Observe the communication issues by expanding the Metric tree.
   Select events and, in the next tree, see where these events occur when
   you expand it.

   3.14  Selective instrumentation with PDT

    Sometimes we want to exclude functions or specific parts of the code because
  they cause more overhead than they provide useful information about the behavior
  of the application. This is called selective instrumentation. For example, maybe
  you don't want to instrument the initialization phases.

    Go to the root MPI folder
   % cd ..
    Edit the make.def file
   % vim config/make.def
   
    Uncomment only the following
    MPIF77 = scalasca -instrument -pdt -optTauSelectFile=/path/NPB3.3-MPI/lu.pdt mpif77 -comp=none -ffixed-line-length-0

    Create the lu.pdt file

   % vim lu.pdt
     BEGIN_EXCLUDE_LIST
     EXACT
     END_EXCLUDE_LIST

     Now we have declared which function should be excluded from the tracing

    Let's compile
    % make clean
    % make LU NPROCS=8 CLASS=A
 
    Execute the benchmark
    % cd bin.scalasca
    % scalasca -analyze mpirun -np 8 lu.A.8

    Apply the scoring

    % scalasca -examine -s epik_lu_8_sum
    % cat epik_lu_8_sum/epik.score | grep EXACT

     If the cat command returns nothing, then the EXACT function was not instrumented

   3.15  Instrument PAPI Hardware counters

      We want to instrument the total instructions (PAPI_TOT_INS), the floating-point
   operations (PAPI_FP_OPS), the L2 cache misses (PAPI_L2_TCM) and the cycles stalled
   on any resource (PAPI_RES_STL). We have to declare them in the environment variable
   EPK_METRICS in the ~/.bashrc as follows

     EPK_METRICS=PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL

     % source ~/.bashrc

     Execute again the benchmark as before
     % scalasca -analyze mpirun -np 8 lu.A.8

     % square epik_lu_8_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL

     Now you can see the values of the hardware counters in the Metric tree. Expand the Call tree
     and observe how the instructions are distributed across the functions. Also expand the System
     tree to check for any variation between the processors.
 
   3.16 CUBE3 Algebra utilities

     We want to create a new CUBE3 file which includes only the SSOR function and its subroutines

     % cube3_cut -r 'SSOR' epik_lu_8_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL/epitome.cube
  
     Reading epik_lu_8_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL/epitome.cube ... done.
     ++++++++++++ Cut operation begins ++++++++++++++++++++++++++
     ++++++++++++ Cut operation ends successfully ++++++++++++++++
     Writing cut.cube.gz ... done.

     % square cut.cube.gz

     Now there is only the SSOR function and its subroutines in the new CUBE3 file

     - Compare two executions

       The current version of the NPB is compiled with the -O3 flag. We want to change this to -O2
    and observe whether there is any difference. First we save the measurement that we just made,
    because Scalasca would otherwise try to create a measurement with the same name and fail.

      % mv epik_lu_8_sum_PAPI_TOT_INS\:PAPI_FP_OPS\:PAPI_L2_TCM\:PAPI_RES_STL/ \
       epik_lu_b_8_o3_sum_PAPI_TOT_INS\:PAPI_FP_OPS\:PAPI_L2_TCM\:PAPI_RES_STL/

     Go to your NPB-MPI root folder and edit the make.def
      % cd ..
      % vim config/make.def

     Change the FFLAGS to -O2 . Save and exit

     Compile again

     % make clean
     % make LU NPROCS=8 CLASS=A
     
     Execute
     % cd bin.scalasca
     % scalasca -analyze mpirun -np 8 lu.A.8

    Compare the two measurements

    % cube3_diff epik_lu_b_8_o3_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:\
     PAPI_RES_STL/epitome.cube epik_lu_8_sum_PAPI_TOT_INS:PAPI_FP_OPS:\
     PAPI_L2_TCM:PAPI_RES_STL/epitome.cube

     Reading epik_lu_b_8_o3_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:\
     PAPI_RES_STL/epitome.cube ... done.
     Reading epik_lu_8_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:\
     PAPI_RES_STL/epitome.cube ... done.
     ++++++++++++ Diff operation begins ++++++++++++++++++++++++++
     INFO::Merging metric dimension... done.
     INFO::Merging program dimension... done.
     INFO::Merging system dimension... done.
     INFO::Mapping severities... done.
     INFO::Merging topologies... done.
     INFO::Diff operation... done.
     ++++++++++++ Diff operation ends successfully ++++++++++++++++
     Writing diff.cube.gz ... done.

    % square diff.cube.gz

    
     Observe the differences in time and metrics

   3.17  Serial version

    Go to the root folder of the serial version of NPB
    % cd ~/tutorial/NPB3.3-SER

    Change the config file

    % vim config/make.def

    MPIF77 = scalasca -instrument -pdt -optTauSelectFile=/path/lu.pdt mpif77 -comp=none -ffixed-line-length-0

    Compile the LU benchmark for class A
    % make clean
    % make LU CLASS=A

    Execute the benchmark
    % cd bin.scalasca
    % scalasca -analyze ./lu.A.x

    % square epik_lu_O_sum_PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL

    Observe the browser trees for the serial version

    The following two cases may be difficult to execute because there are not enough cores.
    Just be aware that the intelnode does not provide 32 cores.
    Compile and execute the MPI versions for class A for 2, 4, 8, 16, 32 processes with profiling

    Similarly, compile and execute the MPI versions for class A for 2, 4, 8, 16, 32 processes with tracing


4  TAU

  4.1  Compile the LU benchmark for class A and 4 processes

    % cd ~/tutorial/NPB3.3-MPI
    % vim config/make.def
     
     Uncomment the line 
     MPIF77=tau_f90.sh
     
     Comment any other MPIF77 line and exit the file
    
     Declare the appropriate TAU Makefile
     
     % vim ~/.bashrc 
     export TAU_MAKEFILE=/srv/app/tau/x86_64/lib/Makefile.tau-papi-mpi-pdt
     Exit the file
   
     % source ~/.bashrc    

     % make clean
     % make LU NPROCS=4 CLASS=A

     Execute the benchmark

     % cd bin.tau
     % mpirun --bind-to-core -np 4 lu.A.4
    
    Pack the measurement data
  
     % paraprof --pack app.ppk
     % paraprof app.ppk

      Now the Paraprof manager window and a sub-window open. Study their contents

  4.2   Loop level profile

      Declare the following to ~/.bashrc

      export TAU_PROFILE=1
      export TAU_PROFILE_FORMAT=profile
      export TAU_OPTIONS='-optTauSelectFile=select.tau'

      Exit and save

      % source ~/.bashrc

      Create the select.tau file

      % vim select.tau
 
      BEGIN_INSTRUMENT_SECTION
      loops routine="#"
      END_INSTRUMENT_SECTION 
     
      Compile and execute

      % make clean
      % make LU NPROCS=4 CLASS=A

      Execute the benchmark

      % cd bin.tau
      % rm -r MULTI*   # always delete previous experiments or save them somewhere else
      % mpirun -np 4 lu.A.4

      % paraprof --pack lu_a_4.ppk
      % paraprof lu_a_4.ppk

       Observe the bar charts with the duration of the functions

   4.3  Profiling with PAPI

        Declare the PAPI metrics

       % vim ~/.bashrc
       export TAU_METRICS=TIME:PAPI_FP_OPS:PAPI_TOT_INS

       Save and exit

       % source ~/.bashrc
       % rm -r MULTI*
       % mpirun -np 4 lu.A.4

       % paraprof --pack lu_a_4_papi.ppk
       % paraprof lu_a_4_papi.ppk

      Click Options -> Show Derived Metric Panel -> click PAPI_TOT_INS, click "/",
      click TIME, click Apply, and choose the new metric by double-clicking
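
      What the Derived Metric Panel computes is simply a ratio of two measured metrics per function; a Python sketch with hypothetical profile values:

      ```python
      # Derived metric PAPI_TOT_INS / TIME per function (made-up values),
      # i.e. instructions per second, as built in the panel above.
      profiles = {
          "ssor": {"PAPI_TOT_INS": 4_000_000_000, "TIME": 2.0},
          "rhs":  {"PAPI_TOT_INS":   900_000_000, "TIME": 0.6},
      }

      derived = {f: m["PAPI_TOT_INS"] / m["TIME"]
                 for f, m in profiles.items()}
      for name, ips in derived.items():
          print(f"{name}: {ips:.2e} instructions/s")
      ```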
  
  4.4  Profile with Callpath

      Declare in the ~/.bashrc
      export TAU_CALLPATH=1
      export TAU_CALLPATH_DEPTH=10

      Execute the benchmark
      % rm -r MULTI*
      % mpirun -np 4 lu.A.4 

      Pack the data
      % paraprof --pack lu_a_4_papi_callpath.ppk
      % paraprof lu_a_4_papi_callpath.ppk

      From the Paraprof window click Windows -> Click Thread -> Click Call Graph, 
      select one process 

  4.5 Communication Matrix Display

      Declare in the ~/.bashrc
      export TAU_COMM_MATRIX=1

      Execute the benchmark
      % rm -r MULTI*
      % mpirun -np 4 lu.A.4

      Pack the data
      % paraprof --pack lu_a_4_papi_comm.ppk
      % paraprof lu_a_4_papi_comm.ppk

      Click Windows -> Click 3D Communication Matrix

      Play with the Height, Color and the metric in order to see how the plot changes.
      Also change the Function and the node from the menu at the right.
      Exit Paraprof
   
  4.6  Tracing + Jumpshot  

       Enable the tracing feature
   
       % vim ~/.bashrc
       export TAU_TRACE=1
       Save and exit

       % source ~/.bashrc

       Execute the benchmark
       % rm -r MULTI*
       % mpirun -np 4 lu.A.4

       Merge the tracefiles
       % tau_treemerge.pl

       Convert the traces to SLOG2 format
       % tau2slog2 tau.trc tau.edf -o app.slog2
       
       Execute Jumpshot
       % jumpshot app.slog2

  4.7  Experiments

       Declare the following in the ~/.bashrc (if they are not there already)
       export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM:PAPI_RES_STL
       export TAU_CALLPATH=0
       export TAU_PROFILE_FORMAT=profile
       export TAU_TRACE=0

 
       % source ~/.bashrc
   
       Compile the LU benchmark for classes A,B and 2-32 processes

       % make clean; make LU NPROCS=2 CLASS=A
       % make clean; make LU NPROCS=4 CLASS=A
       % make clean; make LU NPROCS=8 CLASS=A
       % make clean; make LU NPROCS=16 CLASS=A
       % make clean; make LU NPROCS=32 CLASS=A
       % make clean; make LU NPROCS=2 CLASS=B
       % make clean; make LU NPROCS=4 CLASS=B
       % make clean; make LU NPROCS=8 CLASS=B
       % make clean; make LU NPROCS=16 CLASS=B
       % make clean; make LU NPROCS=32 CLASS=B

       Execute them (if it is feasible)

       % cd bin.tau
       % rm -r MULTI*
       % mpirun --bind-to-core -np 4 lu.A.4
       % paraprof --pack lu_a_4.ppk
       % rm -r MULTI*
       % mpirun --bind-to-core -np 8 lu.A.8
       % paraprof --pack lu_a_8.ppk
         ...

     

      % paraprof lu_a_4.ppk
        Click Options -> Uncheck Stack Bars Together
        Move your mouse over the areas in order to figure out
        the names and the info
        Click on an area or on a function on the right-hand side

        Click Windows -> Click Thread -> Click User Event Statistics ->
        Select a thread


   4.8  Compare executions

       From the Paraprof Manager click File -> Click Open ... -> Click Select File(s)
    and select one of the previous files. Repeat the procedure to add all of them.
    Now in Paraprof select each PPK file that belongs to the same class, right
    click on it and select Add Mean to Comparison Window. A new window will pop up.
    Do not close it; add the rest of the PPK files of the class with the same procedure.
    In the Paraprof window click Options -> Click Select Metric -> Click Select Exclusive
    and choose a metric in order to see the differences.


   4.9 Dynamic phases

       This feature is useful when you want to profile an iterative method and observe
     the behavior of the application at every iteration.


     The loop that we want to measure is in the ssor.f file between the lines 83 and 202.
     So we create the following file:

     % vim dyn_phase.pdt
       BEGIN_INSTRUMENT_SECTION
       dynamic phase name="iteration" file="ssor.f" line=83 to line=202
       END_INSTRUMENT_SECTION

     Declare the appropriate Makefile
     % vim ~/.bashrc
     export TAU_MAKEFILE=/srv/app/tau/x86_64/lib/Makefile.tau-phase-papi-mpi-pdt

     export TAU_OPTIONS='-optPDTInst -optTauSelectFile=/path/dyn_phase.pdt'

     % source ~/.bashrc

     Compile

     % make clean; make LU NPROCS=4 CLASS=A

     Execute

     % cd bin.tau
     % rm -r MULTI*
     % mpirun -np 4 --bind-to-core lu.A.4

     Pack the data
     % paraprof --pack lu_a_4_phases.ppk

     View the data
     % paraprof lu_a_4_phases.ppk

     Now we can see a lot of areas, since TAU profiles each call per iteration and there are
     250 iterations.
     - From the Paraprof Manager click Options -> Click Show Derived Metric Panel.
     Expand the lu_a_4_phases experiment, select PAPI_TOT_INS, then click the "/" symbol
     in the Derived Metric Panel and select the metric PAPI_TOT_CYC. Click Apply.
     - Double click on the TIME metric
     - Right click on any iteration (the small contiguous areas) and select Open Profile for this phase
     - Select a function in order to see the bar chart
     - Change metrics: do the values vary across different processors?

    From the Paraprof Manager double click on the metric PAPI_TOT_INS, right click on any node, for
    example node 3, and select Show Thread Statistics Table. Now you can see the behavior per iteration;
    expand an iteration for more information.
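
     The per-iteration analysis above boils down to computing a ratio such as IPC (PAPI_TOT_INS / PAPI_TOT_CYC) for each dynamic phase instance; a Python sketch with made-up counter values:

     ```python
     # IPC per dynamic-phase instance: PAPI_TOT_INS / PAPI_TOT_CYC for
     # each iteration (made-up values for two of the 250 iterations).
     iterations = [
         {"PAPI_TOT_INS": 2_000_000, "PAPI_TOT_CYC": 1_600_000},
         {"PAPI_TOT_INS": 2_000_000, "PAPI_TOT_CYC": 2_500_000},
     ]

     ipc = [it["PAPI_TOT_INS"] / it["PAPI_TOT_CYC"] for it in iterations]
     print(ipc)  # [1.25, 0.8]
     ```

     A drop in IPC between iterations, as in the second entry here, is the kind of per-phase variation the Thread Statistics Table makes visible.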
