Profiling
Profiling helps you understand the performance characteristics of your models on the Synaptics backend. This includes:
Compile-time profiling: Estimates of approximate clock cycles per operation.
Runtime profiling: Actual execution times for each block during inference.
Compile-Time Profiling
Use profiling to identify performance bottlenecks, optimize memory usage, and better understand hardware behavior.
Torq profiling adds memory footprint and approximate clock-cycle information to the MLIR output. It is disabled by default; to enable it, add the option --torq-enable-profiling to the torq-compile command. By default, the profiling data is written to timeline.csv. To write it to a different file, use --torq-dump-profiling=path/to/trace.csv.
Compile the model using torq-compile with the profiling flag:
$ torq-compile tests/testdata/tosa_ops/add.mlir -o model.vmfb --torq-enable-profiling --torq-dump-profiling=./trace.csv --dump-compilation-phases-to=./compilation_phases
Note: The --dump-compilation-phases-to flag dumps debug information into the specified directory. These debug files are later used to annotate the runtime profiling results.
Understanding trace.csv Output
The trace.csv file contains a timeline of estimated execution steps, memory operations, and kernel invocations emitted during compile-time profiling.
Column Breakdown:
Column     Description
ID         Unique identifier. Convention: DI# (DMA_IN), DO# (DMA_OUT), S# (compute op).
Start      Start time in cycles of this operation.
End        End time in cycles.
Location   MLIR source location that generated this operation. Helpful for tracing.
Op Type    Type of operation, e.g., DMA_IN, DMA_OUT, fully_connected, etc.
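As a rough illustration of how the columns above can be used, the following Python sketch ranks steps by their estimated duration (End minus Start). It assumes the CSV header names match the table exactly and that the file is comma-separated; the sample rows, including the locations and the add op type, are made up for the example and are not real compiler output.

```python
import csv
import io

# Hypothetical sample in the column layout described above. A real file would
# come from --torq-dump-profiling; header names are assumed to match the table.
SAMPLE_TRACE = """ID,Start,End,Location,Op Type
DI0,0,120,add.mlir:3:10,DMA_IN
S0,120,520,add.mlir:3:10,add
DO0,520,600,add.mlir:3:10,DMA_OUT
"""

def longest_steps(csv_file, top=3):
    """Return (ID, Op Type, cycles) tuples sorted by estimated duration."""
    rows = csv.DictReader(csv_file)
    steps = [(r["ID"], r["Op Type"], int(r["End"]) - int(r["Start"])) for r in rows]
    return sorted(steps, key=lambda s: s[2], reverse=True)[:top]

for step_id, op_type, cycles in longest_steps(io.StringIO(SAMPLE_TRACE)):
    print(f"{step_id:>4} {op_type:<10} {cycles} cycles")
```

To run it on an actual dump, replace io.StringIO(SAMPLE_TRACE) with open("trace.csv", newline="").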
Runtime Profiling
Warning
This release only supports host-based simulation, which is not cycle accurate. The reported timing is not representative of real performance; treat this section as a preview of the profiling mode that will be available in the upcoming release.
Runtime profiling records the actual execution time of each code block when the model is run in simulation, reporting how long each individual block takes to execute.
Run the model using torq-run-module with the profiling flag:
$ torq-run-module --module=model.vmfb --input="1x56x56x24xi8=1" --torq_profile_host=./runtime.csv
Understanding runtime.csv Output
The runtime.csv file contains detailed timing information per execution block. This helps analyze actual performance, identify slow paths, and correlate them with model structure.
Column Breakdown:
Column            Description
actionIndex       Runtime action index (starting from 0).
elapsed_time(us)  Time elapsed (μs) since the previous operation.
timestamp(us)     Timestamp (μs) of the event.
event             Type of action.
location          MLIR source location that generated this operation. Helpful for tracing.
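One way to get a quick overview from these columns is to sum the elapsed time per event type. The sketch below assumes the header names match the table and attributes each row's elapsed_time(us) to that row's event, which is an interpretation rather than documented behavior. The event names and locations in the sample are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; real files are produced by --torq_profile_host.
SAMPLE_RUNTIME = """actionIndex,elapsed_time(us),timestamp(us),event,location
0,0,1000,START,model.mlir:1:1
1,35,1035,DMA,model.mlir:4:8
2,120,1155,COMPUTE,model.mlir:4:8
3,20,1175,DMA,model.mlir:5:8
"""

def time_per_event(csv_file):
    """Sum elapsed microseconds for each distinct value of the event column."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["event"]] += float(row["elapsed_time(us)"])
    return dict(totals)

totals = time_per_event(io.StringIO(SAMPLE_RUNTIME))
for event, micros in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:<8} {micros:8.1f} us")
```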
To annotate the runtime profiling results in runtime.csv, run annotate_profiling.py, passing the runtime.csv file along with the executable-targets phase dump file, which you can obtain using the --dump-compilation-phases-to flag during compilation.
$ annotate_profiling.py ./compilation_phases/add.9.executable-targets.mlir ./runtime.csv ./annotated_runtime.csv
This enriches the trace with hardware-level details such as actual DMA operations, kernel launch times, and usage of slices.
Annotated CSV Columns:
Column                   Description
action_id                Index of the runtime action
job_id                   Index of the NSS job associated with the action (if applicable)
operation                Action being performed
invocation_names         Name of the slice programs invoked during the action (if applicable)
original_operator        Original operator from input model
total_time               Total duration of the operation in μs
slice_used_0_in_program  Flag indicating if slice 0 was used by this action
slice_used_1_in_program  Flag indicating if slice 1 was used by this action
dma_in_used_in_program   Flag indicating if DMA input was used by this action
dma_out_used_in_program  Flag indicating if DMA output was used by this action
cdma_used_in_program     Flag indicating if CDMA was used by this action
css_used_in_program      Flag indicating if CSS was used by this action
timestamp_start          Start timestamp in μs
timestamp_end            End timestamp in μs
location                 MLIR source location that generated this operation
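The annotated columns make it possible to group time by the operator in the original model and to see how often both slices are busy together. The following is a minimal sketch under several assumptions: the header names match the table, the flag columns are encoded as text such as True/False or 1/0 (the actual encoding is not stated here), and the sample rows, which show only a subset of the columns, are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical subset of an annotated CSV; values and operator names are made up.
SAMPLE_ANNOTATED = """action_id,operation,original_operator,total_time,slice_used_0_in_program,slice_used_1_in_program
0,dma,tosa.add,12.0,False,False
1,compute,tosa.add,40.0,True,False
2,compute,tosa.conv2d,90.0,True,True
"""

def is_set(flag):
    """Interpret a flag cell; the accepted spellings are an assumption."""
    return flag.strip().lower() in {"1", "true", "yes"}

def summarize(csv_file):
    """Return (total time per original operator, count of actions using both slices)."""
    time_by_op = defaultdict(float)
    both_slices = 0
    for row in csv.DictReader(csv_file):
        time_by_op[row["original_operator"]] += float(row["total_time"])
        if is_set(row["slice_used_0_in_program"]) and is_set(row["slice_used_1_in_program"]):
            both_slices += 1
    return dict(time_by_op), both_slices

time_by_op, both_slices = summarize(io.StringIO(SAMPLE_ANNOTATED))
for op, micros in sorted(time_by_op.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<12} {micros:8.1f} us")
print(f"actions using both slices: {both_slices}")
```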
Profiling with the Test Framework
When running tests on real hardware via pytest, you can combine runtime profiling with
--update-astra-runtime to get a profiling summary printed at the end of the session:
pytest tests/test_onnx_model.py -k example-matmul_layer \
--torq-runtime-hw-type=astra_machina \
--torq-addr root@10.46.130.17 \
--torq-runtime-profiling-output-dir=./profile \
--update-astra-runtime \
--recompute-cache
At the end of the session, a Host Profile Overview section is printed with a
summary extracted from the Perfetto .pb trace files:
====================== Host Profile Overview =======================
Model: conv2d_f4_s4_64x64x16_i16
────────────────────────────────────────────────────────
WALL_TIME 207.000ms
OVERALL 12.845ms
Dma 453.000µs (3.53%)
Dma Only 46.000µs (0.36%)
Dma Total 453.000µs (3.53%)
Cdma 0ns (0.00%)
Compute 407.000µs (3.17%)
Compute Only 0ns (0.00%)
Slice 407.000µs (3.17%)
Slice 0 407.000µs (3.17%)
Slice 1 407.000µs (3.17%)
Css 0ns (0.00%)
Overlap 407.000µs (3.17%)
Idle 6.000µs (0.05%)
Host 0ns (0.00%)
Host Copy 12.386ms (96.43%)
The summary displays every metric available in the trace; there is no hard-coded
list of keys. WALL_TIME is the end-to-end time measured on the host (including SSH
overhead), while OVERALL and the breakdown rows come from the on-device trace.
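The numbers in the sample overview can be cross-checked with a little arithmetic. In that sample, each percentage matches the row value divided by OVERALL, and Dma Total, Compute Only, Idle, Host and Host Copy add up to OVERALL. This is an observation about the sample output only, not a documented definition of the metrics; the helper name below is hypothetical.

```python
# Values copied from the sample overview above, converted to microseconds.
overall_us = 12_845.0
rows_us = {
    "Dma Total": 453.0,
    "Compute Only": 0.0,
    "Idle": 6.0,
    "Host": 0.0,
    "Host Copy": 12_386.0,
}

def percent_of_overall(value_us, overall_us):
    """Share of OVERALL as a percentage, rounded to two decimals."""
    return round(100.0 * value_us / overall_us, 2)

for name, value in rows_us.items():
    print(f"{name:<13} {percent_of_overall(value, overall_us):6.2f}%")

# In this sample the listed rows account for the whole OVERALL time.
print("rows sum to OVERALL:", sum(rows_us.values()) == overall_us)
```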
Wall-time measurement
When --update-astra-runtime is active, the framework first syncs your locally built
torq-run-module into your user-specific board path
(/home/root/iree-build-soc/<username>/torq-run-module) and then measures wall-clock time
around that remote invocation using Python’s time.monotonic(). The result is:
Logged during the test (Wall time: 1.234s)
Saved as wall_time.txt in the test results directory
Attached to the test report via record_property("wall_time", ...)
Printed at the top of the Host Profile Overview summary
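The measurement pattern described above, timing an external invocation with time.monotonic(), can be sketched as follows. This is not the framework's code: timed_run is a hypothetical helper, and a trivial local Python process stands in for the remote torq-run-module invocation.

```python
import subprocess
import sys
import time

def timed_run(command):
    """Run a command and return (return code, wall-clock seconds)."""
    start = time.monotonic()  # monotonic clock is unaffected by system clock changes
    completed = subprocess.run(command, check=False)
    wall_time = time.monotonic() - start
    return completed.returncode, wall_time

# Stand-in command; the framework would invoke torq-run-module on the board instead.
returncode, wall_time = timed_run([sys.executable, "-c", "pass"])
print(f"Wall time: {wall_time:.3f}s")
```

Because the timer wraps the whole invocation, the result includes process startup and, in the remote case, connection overhead, which is consistent with WALL_TIME being much larger than OVERALL in the sample overview.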