Template Profiling Tool

Overview

The Template Profiling Tool is a pytest-based framework for automated testing and profiling of MLIR operations across different hardware targets. It enables developers to validate operator implementations with various input shapes, data types, and compiler configurations while collecting detailed performance metrics.

Prerequisites

Important

Before using the Template Profiling Tool, you must complete the build and setup instructions in the Getting Started Guide, specifically the Build and Setup section. This ensures:

  • The TORQ compiler environment is properly configured

  • All required dependencies are installed

  • Python virtual environment is activated

  • The build is completed successfully

Hardware Setup for Remote Testing

For testing on Astra Machina hardware:

  • Ensure the Astra Machina board is connected to your local network

  • Obtain the IP address of the board (e.g., 10.3.120.54)

  • Verify SSH access to the board:

    ssh root@<board-ip-address>
    
  • Ensure you can authenticate (via SSH keys or password)

Quick Start

Run template profiling tests on remote SoC hardware with a single command:

(venv) ~/torq-compiler-dev$ pytest tests/test_template_mlir.py -k add_bf16.mlir -v \
  --torq-addr=root@10.3.120.54 \
  --torq-runtime-hw-type=astra_machina \
  --torq-runtime-profiling-output-dir=./result/ \
  --template-profiling-enabled \
  --recompute-cache

This command will:

  • Run the sample template MLIR file matching the name “add_bf16.mlir” in torq-compiler-dev/tests/testdata/template_ops

  • Execute on the remote Astra Machina at 10.3.120.54

  • Enable profiling and save results to the ./result/ directory

  • Recompute all cached results

Tool Output

Template Tool Example Output (figure): the template profiling tool running tests across multiple shapes and compiler configurations, with detailed profiling metrics.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Template Profiling Tool                      │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │   MLIR Template Files (.mlir)           │
        │   - Placeholders: {shape_1_i8}          │
        │   - Placeholders: {shape_2_bf16}        │
        │   - Placeholders: {shape_3_i32}         │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │   Dynamic Shape Generation              │
        │   - Rank: 1D, 2D, 3D, 4D                │
        │   - Dtypes: i8, i16, i32, bf16, f32...  │
        │   - LRAM size constraints               │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │   Compiler Variants                     │
        │   - NSS (no CSS, no Host)               │
        │   - Host (no NSS, no CSS)               │
        │   - CSS (no NSS, no Host)               │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │   Torq Execution                        │
        │   - Remote SoC via SSH                  │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │   Performance Profiling                 │
        │   - Execution time                      │
        │   - Hardware utilization                │
        │   - Performance visualization           │
        └─────────────────────────────────────────┘

How It Works

Template MLIR Files

Template files contain placeholders for dynamic shape substitution:

module {
  func.func @main(%arg0: !torch.vtensor<{shape_4_bf16},bf16>, %arg1: !torch.vtensor<{shape_4_bf16},bf16>) -> !torch.vtensor<{shape_4_bf16},bf16> attributes {torch.onnx_meta.ir_version = 10 : si64, torch.onnx_meta.opset_version = 22 : si64, torch.onnx_meta.producer_name = "", torch.onnx_meta.producer_version = ""} {
    %0 = torch.operator "onnx.Add"(%arg0, %arg1) : (!torch.vtensor<{shape_4_bf16},bf16>, !torch.vtensor<{shape_4_bf16},bf16>) -> !torch.vtensor<{shape_4_bf16},bf16> 
    return %0 : !torch.vtensor<{shape_4_bf16},bf16>
  }
}

Placeholder Syntax:

  • {shape_1_i8} - 1D shape with int8 dtype

  • {shape_2_bf16} - 2D shape with bfloat16 dtype

  • {shape_3_i32} - 3D shape with int32 dtype

  • {shape_4_f32} - 4D shape with float32 dtype
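
To make the substitution concrete, here is a minimal Python sketch of how a placeholder could be filled in; the helper name render_template is hypothetical, and the real tool's internals may differ:

# Hypothetical sketch of placeholder substitution.
def render_template(template_text: str, placeholder: str, shape: list[int]) -> str:
    # {shape_4_bf16} -> "[1,8,16,64]", so the template's
    # !torch.vtensor<{shape_4_bf16},bf16> becomes !torch.vtensor<[1,8,16,64],bf16>
    dims = "[" + ",".join(str(d) for d in shape) + "]"
    return template_text.replace("{" + placeholder + "}", dims)

with open("tests/testdata/template_ops/add_bf16.mlir") as f:
    rendered = render_template(f.read(), "shape_4_bf16", [1, 8, 16, 64])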

Warning

Current Limitation: Each template file may use only one placeholder type throughout; all placeholders in a template must share the same rank and dtype combination for now. For example, you cannot combine {shape_3_bf16} and {shape_4_i32} in a single template file.

Shape Generation Algorithm

┌──────────────────────────────────────────────────────────────┐
│  Shape Generation Flow                                       │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │  Calculate Max Elements                 │
        │  max_elements = (LRAM_KB * 1024)        │
        │                / (dtype_size * 3)       │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │  Generate Size Factors                  │
        │  - Quadratic distribution               │
        │  - More samples at small sizes          │
        │  - 1% to 100% range                     │
        └─────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │  Rank-Specific Shape Generation         │
        │  - 1D: [size]                           │
        │  - 2D: [1, features]                    │
        │  - 3D: [1, seq, features]               │
        │  - 4D: [1, ch, height, width]           │
        └─────────────────────────────────────────┘

Key Parameters:

  • lram_size: Maximum tensor size in KB (default: 500 KB)

  • num_samples: Number of shape variations per (rank, dtype) (default: 1)

  • Size constraint: product(shape) × dtype_size × 3 < lram_size × 1024 bytes (lram_size is given in KB)
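
As a rough illustration of the flow above, the following Python sketch mirrors the max-elements calculation, the quadratic size distribution, and the rank-specific shape patterns. The function name generate_shapes and the exact splitting heuristics are assumptions, not the tool's actual code:

import math

DTYPE_SIZE = {"i8": 1, "i16": 2, "i32": 4, "i64": 8, "bf16": 2, "f16": 2, "f32": 4}

def generate_shapes(rank, dtype, lram_kb=500, num_samples=1):
    # Factor of 3 reserves LRAM room for two inputs plus one output.
    max_elements = (lram_kb * 1024) // (DTYPE_SIZE[dtype] * 3)
    shapes = []
    for i in range(1, num_samples + 1):
        # Quadratic distribution: denser sampling at small sizes (1%..100%).
        n = max(1, int(max_elements * (i / num_samples) ** 2))
        if rank == 1:
            shapes.append([n])
        elif rank == 2:
            shapes.append([1, n])
        elif rank == 3:
            seq = max(1, math.isqrt(n))
            shapes.append([1, seq, max(1, n // seq)])
        else:  # rank == 4
            side = max(1, round(n ** (1 / 3)))
            shapes.append([1, side, side, max(1, n // (side * side))])
    return shapes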

Note

To modify lram_size or num_samples, edit the module-level constants in torq-compiler-dev/tests/test_template_mlir.py:

# Shape generation defaults
DEFAULT_LRAM_SIZE = 500  # KB
DEFAULT_NUM_SAMPLES = 1

These constants are used throughout the test file, in both the case_config fixture and the pytest_generate_tests hook.

Compiler Variants

Three compiler configuration variants are automatically tested:

Variant   Description           Compiler Flags
-------   -----------           --------------
NSS       NSS-only execution    --torq-disable-css --torq-disable-host
Host      Host-only execution   --torq-disable-slices --torq-disable-css
CSS       CSS-only execution    --torq-disable-slices --torq-disable-host
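
In code form, the mapping reads as follows (a sketch for reference; the actual parametrization lives in test_template_mlir.py and may be structured differently):

# Variant -> compiler flags, per the table above.
COMPILER_VARIANTS = {
    "nss":  ["--torq-disable-css", "--torq-disable-host"],
    "host": ["--torq-disable-slices", "--torq-disable-css"],
    "css":  ["--torq-disable-slices", "--torq-disable-host"],
}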

Features and Capabilities

What It Can Do

1. Multi-Rank Testing

Test operators with 1D, 2D, 3D, and 4D tensors automatically:

  • 1D: Vector operations [size]

  • 2D: Matrix operations [batch=1, features]

  • 3D: Sequence operations [batch=1, seq, features]

  • 4D: Image/Conv operations [batch=1, channels, height, width]

2. Multi-Dtype Support

Comprehensive data type coverage:

  • Integers: i8, i16, i32, i64

  • Unsigned: ui8, ui16, ui32, ui64

  • Floating Point: f16, f32, f64, bf16

3. Dynamic Shape Generation

  • Automatic generation of valid shapes based on LRAM constraints

  • Quadratic distribution with more samples at small sizes

  • 64-byte alignment for the last dimension (see the sketch after this list)

  • Configurable by editing case_config in test_template_mlir.py
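
A hedged sketch of what the 64-byte alignment constraint could look like in Python; align_last_dim is a hypothetical helper, and whether the tool rounds up or down is not specified here:

def align_last_dim(shape, dtype_size, alignment_bytes=64):
    # Round the last dimension up so that last_dim * dtype_size is a
    # multiple of 64 bytes (hypothetical rounding direction).
    elems = max(1, alignment_bytes // dtype_size)
    shape = list(shape)
    shape[-1] = ((shape[-1] + elems - 1) // elems) * elems
    return shape

align_last_dim([1, 1, 29, 60], dtype_size=2)  # bf16 -> [1, 1, 29, 64]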

4. Performance Profiling

Enable detailed profiling with --template-profiling-enabled:

  • Execution time per operator

  • Hardware utilization metrics

  • Memory bandwidth analysis

  • Profiling data exported to CSV/PNG
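
If you prefer to drive a profiling run from a Python script (for example in CI), a minimal wrapper around the documented command line could look like this; it uses only the flags shown in the Quick Start section:

import subprocess

# Run the template profiling tests with profiling enabled;
# adjust the board address and output directory for your setup.
subprocess.run(
    [
        "pytest", "tests/test_template_mlir.py", "-k", "add_bf16.mlir", "-v",
        "--torq-addr=root@10.3.120.54",
        "--torq-runtime-hw-type=astra_machina",
        "--torq-runtime-profiling-output-dir=./result/",
        "--template-profiling-enabled",
    ],
    check=True,
)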

5. Selective Test Execution

Use pytest’s -k flag for focused testing:

# Test only addition operators
pytest tests/test_template_mlir.py -k add.mlir ...

6. Custom Compiler Options

Add extra compiler flags:

pytest tests/test_template_mlir.py \
  --extra-torq-compiler-options="..." ...

Understanding Test Output

Profiling Output

Profiling data is saved in the specified output directory:

./profiling_results/
├── test_run_templates_on_soc[r4_bf16_1x1x29x64-css-add_bf16.mlir-astra_machina-default].csv
├── test_run_templates_on_soc[r4_bf16_1x1x29x64-nss-add_bf16.mlir-astra_machina-default].csv
├── test_run_templates_on_soc[r4_bf16_1x1x29x64-host-add_bf16.mlir-astra_machina-default].csv
├── ...
└── latency_by_shape_variant_bar.png
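
Since each test writes its own CSV, a small Python snippet can aggregate them for offline analysis; the column layout of the CSVs is not documented here, so this sketch only concatenates the files:

import glob
import pandas as pd

# Collect every per-test profiling CSV from the output directory.
frames = [pd.read_csv(p) for p in glob.glob("./profiling_results/*.csv")]
combined = pd.concat(frames, ignore_index=True)
print(combined.head())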

Limitations and Current Constraints

What It Currently Cannot Do

1. Arbitrary Shape Relationships

  • Templates cannot express relationships like “output shape = input1 + input2”

  • Each placeholder is independently generated

  • Workaround: Create multiple template files with explicit shapes

2. Complex Dtype Interactions

  • Limited support for mixed-precision operations

  • All operands of a placeholder must use the same dtype

  • Example: Cannot easily test i8 × i8 → i32 accumulation patterns

3. Memory Constraint Validation

  • LRAM size checks are heuristic-based (factor of 3)

  • Does not account for intermediate buffer allocations

  • Risk: May generate shapes that OOM at runtime

4. Multi-Input Shape Broadcasting

  • No automatic generation of broadcast-compatible shapes

  • All inputs must have identical shapes

  • Workaround: Manually create templates with specific broadcast patterns