Overview

AccelProf is a modular, extensible, and low-overhead framework for performance analysis on heterogeneous accelerators such as NVIDIA and AMD GPUs. It offers a unified profiling interface that bridges low-level hardware event tracing with high-level deep learning (DL) framework insights, making it an effective tool for analyzing modern workloads.

Architecture

Figure 1: Architecture of the PASTA Framework.

AccelProf is built on top of the PASTA (Program AnalysiS Tool Architecture) framework, which is composed of three core, decoupled components:

Event Handler: Interfaces with vendor-specific profiling APIs and DL framework callbacks to collect runtime data.
Processor: Performs pre-processing of collected data—either on the CPU or GPU—and routes it to analysis modules.
Tool Collection: Hosts a variety of user-defined analysis tools that implement specific profiling features.

This clean separation of responsibilities supports easy extension, flexible integration, and compatibility across vendors and platforms.

Key Features

✅ Modular architecture separating handler, processor, and tool logic
🔄 Cross-vendor support for NVIDIA and AMD accelerators
🧠 Deep learning framework integration, currently supporting PyTorch
⚡ GPU-accelerated in-situ preprocessing (optional but highly efficient)
🎯 Fine-grained instrumentation using annotation APIs (e.g., start()/end() wrappers)

Typical Use Cases

AccelProf is suited for a wide range of performance analysis scenarios, including:

🔍 Kernel frequency profiling to identify performance-critical code regions
🚀 UVM memory optimization through fine-grained access pattern analysis
🧩 Operator-level DL analysis to capture tensor allocations, operations, and kernel execution
📊 Custom tool development for research or production use

Whether you’re debugging memory bottlenecks, tuning kernel launches, or analyzing large DL models, AccelProf provides a flexible platform to support your goals.