# Profiling¶

PETSc includes a consistent, lightweight scheme to allow the profiling of application programs. The PETSc routines automatically log performance data if certain options are specified at runtime. The user can also log information about application codes for a complete picture of performance. In addition, as described in Interpreting -log_view Output: The Basics, PETSc provides a mechanism for printing informative messages about computations. Basic Profiling Information introduces the various profiling options in PETSc, while the remainder of the chapter focuses on details such as monitoring application codes and tips for accurate profiling.

## Basic Profiling Information¶

If an application code and the PETSc libraries have been configured with --with-log=1, the default, then various kinds of profiling of code between calls to PetscInitialize() and PetscFinalize() can be activated at runtime. The profiling options include the following:

• -log_view - Prints an ASCII version of performance data at program’s conclusion. These statistics are comprehensive and concise and require little overhead; thus, -log_view is intended as the primary means of monitoring the performance of PETSc codes.

• -info [infofile] - Prints verbose information about code to stdout or an optional file. This option provides details about algorithms, data structures, etc. Since the overhead of printing such output slows a code, this option should not be used when evaluating a program’s performance.

• -log_trace [logfile] - Traces the beginning and ending of all PETSc events. This option, which can be used in conjunction with -info, is useful to see where a program is hanging without running in the debugger.

As discussed in Using -log_mpe with Jumpshot, additional profiling can be done with MPE.

### Interpreting -log_view Output: The Basics¶

As shown in the listing in Profiling Programs, the option -log_view activates printing of profile data to standard output at the conclusion of a program. Profiling data can also be printed at any time within a program by calling PetscLogView().

We print performance data for each routine, organized by PETSc libraries, followed by any user-defined events (discussed in Profiling Application Codes). For each routine, the output data include the maximum time and floating point operation (flop) rate over all processes. Information about parallel performance is also included, as discussed in the following section.

For the purpose of PETSc floating point operation counting, we define one flop as one operation of any of the following types: multiplication, division, addition, or subtraction. For example, one VecAXPY() operation, which computes $$y = \alpha x + y$$ for vectors of length $$N$$, requires $$2N$$ flop (consisting of $$N$$ additions and $$N$$ multiplications). Bear in mind that flop rates present only a limited view of performance, since memory loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting profile data for the KSP library, which provides the linear solvers at the heart of the PETSc package. Recall the hierarchical organization of the PETSc library, as shown in Numerical Libraries in PETSc. Each KSP solver is composed of a PC (preconditioner) and a KSP (Krylov subspace) part, which are in turn built on top of the Mat (matrix) and Vec (vector) modules. Thus, operations in the KSP module are composed of lower-level operations in these packages. Note also that the nonlinear solvers library, SNES, is built on top of the KSP module, and the timestepping library, TS, is in turn built on top of SNES.

We briefly discuss interpretation of the sample output in the listing, which was generated by solving a linear system on one process using restarted GMRES and ILU preconditioning. The linear solvers in KSP consist of two basic phases, KSPSetUp() and KSPSolve(), each of which consists of a variety of actions, depending on the particular solution technique. For the case of using the PCILU preconditioner and KSPGMRES Krylov subspace method, the breakdown of PETSc routines is listed below. As indicated by the levels of indentation, the operations in KSPSetUp() include all of the operations within PCSetUp(), which in turn include MatILUFactor(), and so on.

• KSPSetUp - Set up linear solver

• PCSetUp - Set up preconditioner

• MatILUFactor - Factor preconditioning matrix

• MatILUFactorSymbolic - Symbolic factorization phase

• MatLUFactorNumeric - Numeric factorization phase

• KSPSolve - Solve linear system

• PCApply - Apply preconditioner

• MatSolve - Forward/backward triangular solves

• KSPGMRESOrthog - Orthogonalization in GMRES

• VecDot or VecMDot - Inner products

• MatMult - Matrix-vector product

• MatMultAdd - Matrix-vector product + vector addition

• VecScale, VecNorm, VecAXPY, VecCopy, …

The summaries printed via -log_view reflect this routine hierarchy. For example, the performance summaries for a particular high-level routine such as KSPSolve() include all of the operations accumulated in the lower-level components that make up the routine.

Admittedly, we do not currently present the output with -log_view so that the hierarchy of PETSc operations is completely clear, primarily because we have not determined a clean and uniform way to do so throughout the library. Improvements may follow. However, for a particular problem, the user should generally have an idea of the basic operations that are required for its implementation (e.g., which operations are performed when using GMRES and ILU, as described above), so that interpreting the -log_view data should be relatively straightforward.

### Interpreting -log_view Output: Parallel Performance¶

We next discuss performance summaries for parallel programs, as shown within the listings below , which present the combined output generated by the -log_view option. The program that generated this data is KSP Tutorial ex10. The code loads a matrix and right-hand-side vector from a binary file and then solves the resulting linear system; the program then repeats this process for a second linear system. This particular case was run on four processors of an Intel x86_64 Linux cluster, using restarted GMRES and the block Jacobi preconditioner, where each block was solved with ILU. The two input files medium and arco6 can be downloaded from this FTP link.

The first listing presents an overall performance summary, including times, floating-point operations, computational rates, and message-passing activity (such as the number and size of messages sent and collective operations). Summaries for various user-defined stages of monitoring (as discussed in Profiling Multiple Sections of Code) are also given. Information about the various phases of computation then follow (as shown separately here in the second listing). Finally, a summary of memory usage and object creation and destruction is presented.

mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
-matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using Petsc Development GIT revision: v3.9-163-gbe3efd42  GIT Date: 2018-04-16 10:45:40 -0500

Max       Max/Min        Avg      Total
Time (sec):           1.849e-01      1.00002   1.849e-01
Objects:              1.060e+02      1.00000   1.060e+02
Flop:                 2.361e+08      1.00684   2.353e+08  9.413e+08
Flop/sec:            1.277e+09      1.00685   1.273e+09  5.091e+09
MPI Messages:         2.360e+02      1.34857   2.061e+02  8.245e+02
MPI Message Lengths:  1.256e+07      2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02      1.00000

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02        0.1%  2.200e+01  10.2%
2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02        0.1%  3.900e+01  18.1%
4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05       62.4%  2.200e+01  10.2%
5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04       37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.
...
--- Event Stage 3: KSPSolve 0

Matrix     0              4        23024     0.
Vector    20             30        60048     0.
Index Set     0              3         2568     0.
Vec Scatter     0              1         1264     0.
Krylov Solver     0              2        19592     0.
Preconditioner     0              2         1912     0.


We next focus on the summaries for the various phases of the computation, as given in the table within the following listing. The summary for each phase presents the maximum times and flop rates over all processes, as well as the ratio of maximum to minimum times and flop rates for all processes. A ratio of approximately 1 indicates that computations within a given phase are well balanced among the processes; as the ratio increases, the balance becomes increasingly poor. Also, the total computational rate (in units of MFlop/sec) is given for each phase in the final column of the phase summary table.

${\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})$

Note: Total computational rates $$<$$ 1 MFlop are listed as 0 in this column of the phase summary table. Additional statistics for each phase include the total number of messages sent, the average message length, and the number of global reductions.

mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
-matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
Count: number of times phase was executed
Time and Flop/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length
Reduct: number of global reductions
Global: entire computation
Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase         %F - percent flop in this phase
%M - percent messages in this phase     %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase              Count      Time (sec)       Flop/sec                          --- Global ---  --- Stage ----  Total
Max    Ratio      Max    Ratio  Mess AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum         1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym        1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ            1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp               2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp                2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks        1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult               60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve              61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot               59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm               61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale              61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy                2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY                3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY              61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin       60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd         60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize          61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog        59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve               1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks        1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply               61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...


As discussed in the preceding section, the performance summaries for higher-level PETSc routines include the statistics for the lower levels of which they are made up. For example, the communication within matrix-vector products MatMult() consists of vector scatter operations, as given by the routines VecScatterBegin() and VecScatterEnd().

The final data presented are the percentages of the various statistics (time (%T), flop/sec (%F), messages(%M), average message length (%L), and reductions (%R)) for each event relative to the total computation and to any user-defined stages (discussed in Profiling Multiple Sections of Code). These statistics can aid in optimizing performance, since they indicate the sections of code that could benefit from various kinds of tuning. Hints for Performance Tuning gives suggestions about achieving good performance with PETSc codes.

### Using -log_mpe with Jumpshot¶

It is also possible to use the Jumpshot package [HL91] to visualize PETSc events. This package comes with the MPE software, which is part of the MPICH [Getal] implementation of MPI. The option

-log_mpe [logfile]


creates a logfile of events appropriate for viewing with Jumpshot. The user can either use the default logging file or specify a name via logfile. Events can be deactivated as described in Restricting Event Logging.

The user can also log MPI events. To do this, simply consider the PETSc application as any MPI application, and follow the MPI implementation’s instructions for logging MPI calls. For example, when using MPICH, this merely required adding -llmpich to the library list before -lmpich.

## Profiling Application Codes¶

PETSc automatically logs object creation, times, and floating-point counts for the library routines. Users can easily supplement this information by monitoring their application codes as well. The basic steps involved in logging a user-defined portion of code, called an event, are shown in the code fragment below:

PetscLogEvent  USER_EVENT;
PetscClassId   classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);


One must register the event by calling PetscLogEventRegister(), which assigns a unique integer to identify the event for profiling purposes:

PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);


Here string is a user-defined event name, and color is an optional user-defined event color (for use with Jumpshot logging; see Using -log_mpe with Jumpshot); one should see the manual page for details. The argument returned in e should then be passed to the PetscLogEventBegin() and PetscLogEventEnd() routines.

Events are logged by using the pair

PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);


The four objects are the PETSc objects that are most closely associated with the event. For instance, in a matrix-vector product they would be the matrix and the two vectors. These objects can be omitted by specifying 0 for o1 - o4. The code between these two routine calls will be automatically timed and logged as part of the specified event.

The user can log the number of floating-point operations for this segment of code by calling

PetscLogFlops(number of flop for this code segment);


between the calls to PetscLogEventBegin() and PetscLogEventEnd(). This value will automatically be added to the global flop counter for the entire program.

## Profiling Multiple Sections of Code¶

By default, the profiling produces a single set of statistics for all code between the PetscInitialize() and PetscFinalize() calls within a program. One can independently monitor up to ten stages of code by switching among the various stages with the commands

PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();


where stage is an integer (0-9); see the manual pages for details. The command

PetscLogStageRegister(const char *name,PetscLogStage *stage)


allows one to associate a name with a stage; these names are printed whenever summaries are generated with -log_view or PetscLogView(). The following code fragment uses three profiling stages within an program.

PetscInitialize(int *argc,char ***args,0,0);
/* stage 0 of code here */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
for (i=0; i<ntimes; i++) {
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStagePush(stagenum1);
/* stage 1 of code here */
PetscLogStagePop();
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
PetscLogStagePush(stagenum2);
/* stage 2 of code here */
PetscLogStagePop();
}
PetscFinalize();


The listings above Figures [fig_exparprof] and show output generated by -log_view for a program that employs several profiling stages. In particular, this program is subdivided into six stages except the Main stage: loading a matrix and right-hand-side vector from a binary file, setting up the preconditioner, and solving the linear system; this sequence is then repeated for a second linear system. For simplicity, the second listing contains output only for stages 5 and 6 (linear solve of the second system), which comprise the part of this computation of most interest to us in terms of performance monitoring. This code organization (solving a small linear system followed by a larger system) enables generation of more accurate profiling statistics for the second system by overcoming the often considerable overhead of paging, as discussed in Accurate Profiling and Paging Overheads.

## Restricting Event Logging¶

By default, all PETSc operations are logged. To enable or disable the PETSc logging of individual events, one uses the commands

PetscLogEventActivate(int event);
PetscLogEventDeactivate(int event);


## Accurate Profiling and Paging Overheads¶

One factor that often plays a significant role in profiling a code is paging by the operating system. Generally, when running a program, only a few pages required to start it are loaded into memory rather than the entire executable. When the execution proceeds to code segments that are not in memory, a pagefault occurs, prompting the required pages to be loaded from the disk (a very slow process). This activity distorts the results significantly. (The paging effects are noticeable in the log files generated by -log_mpe, which is described in Using -log_mpe with Jumpshot.)

To eliminate the effects of paging when profiling the performance of a program, we have found an effective procedure is to run the exact same code on a small dummy problem before running it on the actual problem of interest. We thus ensure that all code required by a solver is loaded into memory during solution of the small problem. When the code proceeds to the actual (larger) problem of interest, all required pages have already been loaded into main memory, so that the performance numbers are not distorted.

When this procedure is used in conjunction with the user-defined stages of profiling described in Profiling Multiple Sections of Code, we can focus easily on the problem of interest. For example, we used this technique in the program KSP Tutorial ex10 to generate the timings within the listings above In this case, the profiled code of interest (solving the linear system for the larger problem) occurs within event stages 5 and 6. Interpreting -log_view Output: Parallel Performance provides details about interpreting such profiling data.

In particular, the macros

PetscPreLoadBegin(PetscBool flag,char* stagename)


and

PetscPreLoadEnd()


can be used to easily convert a regular PETSc program to one that uses preloading. The command line options -preload true and -preload false may be used to turn on and off preloading at run time for PETSc programs that use these macros.

Getal

William Gropp and et. al. MPICH Web page. http://www.mpich.org. URL: http://www.mpich.org.

HL91

Virginia Herrarte and Ewing Lusk. Studying parallel program behavior with Upshot. Technical Report ANL-91/15, Argonne National Laboratory, August 1991.