Workflow

The core of the PASTA framework, which underlies AccelProf, is a modular, stage-based workflow designed for scalable and extensible program analysis. Between the target application and the analysis results, it consists of three major components:

Application → [Event Handler] → [Event Processor] → [Tool Collection]

Each component plays a distinct role in the profiling pipeline, enabling clean separation of concerns and easy extensibility.
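To make the data flow concrete, here is a minimal Python sketch of the three stages wired in sequence. It is not PASTA's API. The `Event` record, its fields, and `run_pipeline` are hypothetical names chosen for illustration, and each stage is reduced to a plain callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event record (fields are illustrative)."""
    kind: str          # e.g. "kernel_launch" or "mem_access"
    name: str          # kernel or operator name
    timestamp_ns: int


def run_pipeline(
    raw_events: Iterable[Dict[str, Any]],
    handle: Callable[[Dict[str, Any]], Event],
    process: Callable[[List[Event]], List[Event]],
    tools: List[Callable[[List[Event]], Any]],
) -> List[Any]:
    """Pass raw events through handler -> processor -> tool collection."""
    events = [handle(raw) for raw in raw_events]   # Event Handler: capture and normalize
    processed = process(events)                    # Event Processor: filter and enrich
    return [tool(processed) for tool in tools]     # Tool Collection: one result per tool


raw = [
    {"type": "kernel_launch", "name": "gemm", "t": 100},
    {"type": "mem_access", "name": "gemm", "t": 120},
    {"type": "kernel_launch", "name": "relu", "t": 200},
]
results = run_pipeline(
    raw,
    handle=lambda r: Event(r["type"], r["name"], r["t"]),
    process=lambda evs: [e for e in evs if e.kind == "kernel_launch"],
    tools=[len],  # a trivial "tool": count the kernel launches
)
print(results)  # [2]
```

Because each stage is passed in separately, replacing one of them (for example, a different handler backend) leaves the other two untouched, which is the separation of concerns described above.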

Components

🔹 Application

The target workload can be:

- A deep learning model executed via a framework such as PyTorch
- A compiled GPU-accelerated binary that uses libraries such as cuDNN

This application is the source of events captured during runtime.

🔹 Event Handler

The event handler is responsible for intercepting execution events. It integrates:

- Low-level profiling APIs, e.g., NVIDIA Compute Sanitizer and AMD ROCm ROCProfiler
- High-level callbacks, e.g., from DL frameworks such as PyTorch

This component abstracts vendor-specific and API-level complexity through a unified internal interface, offering consistent behavior across platforms.
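One way to picture the unified interface is an abstract handler that every backend adapter implements, with each adapter translating its own raw record into a shared event type. The sketch below assumes this adapter design. The class names and the raw field names (`kernel`, `ts`, `kernel_name`, `begin_ns`, `op_name`, `time_ns`) are hypothetical and do not reflect the actual Compute Sanitizer, ROCProfiler, or PyTorch data layouts.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    """Vendor-neutral event record shared by all backends (illustrative)."""
    kind: str
    name: str
    timestamp_ns: int
    device: str


class EventHandler(ABC):
    """Common interface each backend adapter implements."""

    @abstractmethod
    def normalize(self, raw: Dict[str, Any]) -> Event:
        ...


class NvidiaLikeHandler(EventHandler):
    # Hypothetical raw layout, NOT the real Compute Sanitizer callback data.
    def normalize(self, raw: Dict[str, Any]) -> Event:
        return Event("kernel_launch", raw["kernel"], raw["ts"], "nvidia")


class AmdLikeHandler(EventHandler):
    # Hypothetical raw layout, NOT the real ROCProfiler record format.
    def normalize(self, raw: Dict[str, Any]) -> Event:
        return Event("kernel_launch", raw["kernel_name"], raw["begin_ns"], "amd")


class FrameworkCallbackHandler(EventHandler):
    # Hypothetical payload from a high-level framework callback (e.g. PyTorch).
    def normalize(self, raw: Dict[str, Any]) -> Event:
        return Event("operator", raw["op_name"], raw["time_ns"], raw.get("device", "unknown"))


def capture(handler: EventHandler, raw_records: List[Dict[str, Any]]) -> List[Event]:
    """Downstream stages only ever see Event objects, whatever the backend."""
    return [handler.normalize(r) for r in raw_records]


nv = capture(NvidiaLikeHandler(), [{"kernel": "gemm", "ts": 10}])
amd = capture(AmdLikeHandler(), [{"kernel_name": "gemm", "begin_ns": 12}])
print(nv[0].name == amd[0].name)  # True
```

Adding support for a new vendor or framework then amounts to writing one more adapter class. The processor and tools do not change.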

🔹 Event Processor

The event processor is the pre-processing stage that prepares collected data for analysis:

- Can run on either the CPU or the GPU
- Transforms raw events into enriched, normalized structures
- Performs filtering, bucketing, or timing-based correlation

It then routes the processed data to the registered tools for final analysis.
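The three processing steps can be sketched as small functions over a normalized event list. This is an illustrative assumption, not PASTA's processor. The event kinds (`op_begin`, `op_end`, `kernel_launch`) and the function names are hypothetical, and the correlation step assumes operator begin/end events are properly nested so that each kernel launch can be attributed to the innermost open operator.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Event:
    kind: str          # hypothetical kinds: "op_begin", "op_end", "kernel_launch", ...
    name: str
    timestamp_ns: int


def filter_events(events: Iterable[Event], keep: Set[str]) -> List[Event]:
    """Filtering: drop event kinds that no registered tool needs."""
    return [e for e in events if e.kind in keep]


def bucket_by_window(events: Iterable[Event], window_ns: int) -> Dict[int, int]:
    """Bucketing: count events per fixed-width time window."""
    return dict(Counter(e.timestamp_ns // window_ns for e in events))


def correlate(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Timing-based correlation: attribute each kernel launch to the
    innermost operator open at that time (assumes nested begin/end)."""
    open_ops: List[str] = []
    kernels_per_op: Dict[str, List[str]] = defaultdict(list)
    for e in sorted(events, key=lambda ev: ev.timestamp_ns):
        if e.kind == "op_begin":
            open_ops.append(e.name)
        elif e.kind == "op_end" and open_ops:
            open_ops.pop()
        elif e.kind == "kernel_launch":
            owner = open_ops[-1] if open_ops else "<unattributed>"
            kernels_per_op[owner].append(e.name)
    return dict(kernels_per_op)


trace = [
    Event("op_begin", "conv2d", 0),
    Event("kernel_launch", "implicit_gemm", 5),
    Event("op_end", "conv2d", 20),
    Event("op_begin", "relu", 25),
    Event("kernel_launch", "elementwise", 30),
    Event("op_end", "relu", 40),
    Event("debug_marker", "x", 41),
]
useful = filter_events(trace, {"op_begin", "op_end", "kernel_launch"})
print(correlate(useful))             # {'conv2d': ['implicit_gemm'], 'relu': ['elementwise']}
print(bucket_by_window(useful, 20))  # {0: 2, 1: 3, 2: 1}
```

Filtering first keeps the later steps cheaper. Whether these steps run on the CPU or the GPU is an implementation choice this sketch does not model.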

🔹 Tool Collection

The tool collection is the analysis backend of PASTA:

- Hosts user-defined profiling tools
- Operates on preprocessed events
- Generates profiling results such as:
  - Kernel launch frequency
  - Memory access hotness maps
  - Tensor allocation patterns
  - Operator-level performance summaries

Each tool is modular, and developers can add new tools by inheriting from a simple interface and registering their implementation.
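A minimal sketch of the inherit-and-register pattern, using a kernel launch frequency counter as the example tool. The `Tool` base class, the `register_tool` decorator, `TOOL_REGISTRY`, and `run_tools` are hypothetical names. PASTA's real interface may differ in its method names and in its implementation language.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Type


@dataclass(frozen=True)
class Event:
    kind: str
    name: str
    timestamp_ns: int


class Tool(ABC):
    """Hypothetical base interface that every analysis tool inherits."""

    @abstractmethod
    def on_events(self, events: List[Event]) -> None:
        ...

    @abstractmethod
    def report(self) -> Any:
        ...


TOOL_REGISTRY: Dict[str, Type[Tool]] = {}


def register_tool(name: str) -> Callable[[Type[Tool]], Type[Tool]]:
    """Class decorator that records a Tool subclass under a name."""
    def wrap(cls: Type[Tool]) -> Type[Tool]:
        TOOL_REGISTRY[name] = cls
        return cls
    return wrap


@register_tool("kernel_frequency")
class KernelFrequencyTool(Tool):
    """Counts how often each kernel is launched."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_events(self, events: List[Event]) -> None:
        self.counts.update(e.name for e in events if e.kind == "kernel_launch")

    def report(self) -> Dict[str, int]:
        return dict(self.counts.most_common())


def run_tools(names: List[str], events: List[Event]) -> Dict[str, Any]:
    """Instantiate the requested tools, feed them events, collect reports."""
    tools = {n: TOOL_REGISTRY[n]() for n in names}
    for tool in tools.values():
        tool.on_events(events)
    return {n: tool.report() for n, tool in tools.items()}


events = [
    Event("kernel_launch", "gemm", 1),
    Event("kernel_launch", "relu", 2),
    Event("kernel_launch", "gemm", 3),
    Event("mem_alloc", "buf0", 4),
]
print(run_tools(["kernel_frequency"], events))
# {'kernel_frequency': {'gemm': 2, 'relu': 1}}
```

With a registry, the pipeline only needs a tool's name to instantiate it, so a new tool can be added without editing the dispatch code.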

This architecture enables flexibility, cross-vendor support, and fine-grained customization, making PASTA a powerful backend for performance engineering on modern heterogeneous systems.