Torq Runtime Python API (Beta)
The torq-runtime Python package provides bindings for loading and running compiled .vmfb models on a Torq device directly from Python.
Warning
torq-runtime is currently in beta and is not yet available on PyPI.
Installation
The torq-runtime package is included in the GitHub release. Install the runtime wheel directly from any release snapshot.
Quick Start
import numpy as np
from torq.runtime import VMFBInferenceRunner
# Load the compiled model
runner = VMFBInferenceRunner("mobilenetv2.vmfb", device_uri="torq")
# Prepare input data
input_data = np.random.randint(0, 255, size=(1, 224, 224, 3), dtype=np.int8)
# Run inference
outputs = runner.infer([input_data])
print(f"Inference took {runner.infer_time_ms:.2f} ms")
Examples
Inspecting Model Inputs and Outputs
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")
if runner.inputs_info:
for i, info in enumerate(runner.inputs_info):
print(f"Input {i}: dtype={info.dtype}, shape={info.shape}")
if runner.outputs_info:
for i, info in enumerate(runner.outputs_info):
print(f"Output {i}: dtype={info.dtype}, shape={info.shape}")
Profiling Inference Latency
from torq.runtime import profile_vmfb_inference_time
avg_ms = profile_vmfb_inference_time(
"model.vmfb",
n_iters=10,
do_warmup=True,
device="torq",
)
print(f"Average inference time: {avg_ms:.2f} ms")
Profiling System Resources
from torq.runtime import profile_vmfb_resources
stats = profile_vmfb_resources(
"model.vmfb",
n_iters=10,
do_warmup=True,
device="torq",
)
print(f"Average inference time: {stats.avg_inference_time_ms:.2f} ms")
print(f"Average DRAM footprint: {stats.avg_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Peak DRAM footprint: {stats.peak_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Average memory usage: {stats.avg_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {stats.peak_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Average CPU usage: {stats.avg_cpu_percent:.1f}%")
Running with Custom Inputs
import numpy as np
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")
# Load preprocessed input from a .npy file
input_data = np.load("preprocessed_input.npy")
outputs = runner.infer([input_data])
Zero-Copy Device Outputs
When model outputs are fed back as inputs (e.g. KV-cache tensors in autoregressive decoding), use device_outputs=True to keep data on the device and avoid unnecessary host transfers:
import numpy as np
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq", device_outputs=True)
# Allocate an initial on-device buffer (e.g. for a KV cache)
kv_cache = runner.allocate_device_array(np.zeros((1, 8, 256, 64), dtype=np.float16))
# Example token input (shape and dtype are illustrative)
input_data = np.zeros((1, 1), dtype=np.int32)
# Run inference — outputs stay on device as DeviceArray objects
results = runner.infer([input_data, kv_cache])
logits, new_kv = results[0], results[1]
# Use the output DeviceArray directly as next step's input
kv_cache = new_kv
# Only bring the logits to host for sampling
logits_np = logits.to_host()
API Reference
VMFBInferenceRunner
The main class for loading and running .vmfb models via the IREE runtime.
VMFBInferenceRunner(
model_path,
*,
function="main",
device_uri="torq",
n_threads=None,
load_method="preload",
load_model_to_mem=True,
runtime_flags=None,
device_outputs=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device_uri` | `str` | `"torq"` | IREE device identifier. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_method` | `str` | `"preload"` | Module load method. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_outputs` | `bool` | `False` | If `True`, `infer()` returns on-device `DeviceArray` objects instead of NumPy arrays. |
Properties:
| Property | Type | Description |
|---|---|---|
| `model_path` | `str` | Path to the loaded model file. |
| `infer_time_ms` | `float` | Elapsed time in milliseconds for the last call to `infer()`. |
| `device` | `HalDevice` | The underlying IREE HAL device. |
| `inputs_info` | `list[TensorInfo] \| None` | Input tensor metadata extracted from the model, or `None` if unavailable. |
| `outputs_info` | `list[TensorInfo] \| None` | Output tensor metadata extracted from the model, or `None` if unavailable. |
Methods:
infer(inputs)
Run inference and return the output arrays.
inputs — Either an iterable of NumPy arrays (or `DeviceArray` objects) or a mapping of input name to array.
Returns — A list of NumPy arrays by default. When `device_outputs=True`, returns a list of on-device `DeviceArray` objects instead.
allocate_device_array(array)
Allocate a device buffer and copy a host NumPy array into it.
array — A NumPy array to upload to the device.
Returns — A `DeviceArray` that can be passed directly to `infer()` without further copies.
profile_vmfb_inference_time
Load a .vmfb model and run inference multiple times for profiling.
profile_vmfb_inference_time(
model_path,
inputs=None,
*,
n_iters=5,
do_warmup=True,
function="main",
device="torq",
n_threads=None,
load_model_to_mem=True,
runtime_flags=None,
device_io=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[np.ndarray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Exclude some host copies from profiling by using IREE `DeviceArray` objects for I/O. |
Returns: Average wall-clock inference time in milliseconds.
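The warmup-then-average pattern described above can be sketched in plain Python. This is an illustrative stand-in only: `average_latency_ms` is a hypothetical helper, and the stand-in workload replaces a real `.vmfb` inference call.

```python
import time

def average_latency_ms(fn, n_iters=5, do_warmup=True):
    """Sketch of a warmup-then-average timing loop (hypothetical helper)."""
    if do_warmup:
        fn()  # untimed warmup pass absorbs first-run costs (caches, lazy allocation)
    total = 0.0
    for _ in range(n_iters):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / n_iters * 1000.0  # average in milliseconds

# Stand-in workload instead of a real .vmfb inference call
avg_ms = average_latency_ms(lambda: sum(range(100_000)), n_iters=3)
```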
profile_vmfb_resources
Load a .vmfb model and run inference multiple times, collecting DRAM and CPU statistics alongside timing. Requires a Linux target (reads from /proc).
from torq.runtime import profile_vmfb_resources
profile_vmfb_resources(
model_path,
inputs=None,
*,
n_iters=5,
do_warmup=True,
function="main",
device="torq",
n_threads=None,
load_method="preload",
load_model_to_mem=True,
runtime_flags=None,
device_io=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[np.ndarray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_method` | `str` | `"preload"` | Module load method. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Exclude some host copies from profiling by using IREE `DeviceArray` objects for I/O. |
Returns: A ProfileStats dataclass (see below).
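As a rough illustration of where the DRAM numbers come from: on Linux, `/proc/<pid>/status` reports both the total resident set (`VmRSS`, which includes file-backed pages such as an mmap'd model) and the anonymous portion (`RssAnon`, heap/stack only). The parser below is a hypothetical sketch, not torq-runtime's actual implementation; field availability also depends on the kernel version.

```python
def parse_rss_fields(status_text):
    """Extract VmRSS and RssAnon (in bytes) from /proc/<pid>/status content.

    Illustrative sketch only; RssAnon requires a reasonably recent kernel.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "RssAnon"):
            fields[key] = int(rest.split()[0]) * 1024  # values are reported in kB
    return fields

# A literal sample in the /proc status format, so this runs on any platform
sample = "Name:\tpython\nVmRSS:\t  87300 kB\nRssAnon:\t  24000 kB\n"
print(parse_rss_fields(sample))
# {'VmRSS': 89395200, 'RssAnon': 24576000}
```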
ProfileStats
Dataclass returned by profile_vmfb_resources containing profiling results. All resource metrics are process-wide and include overhead from the Python interpreter, the sampling thread, and any other activity in the process.
from torq.runtime.profiling import ProfileStats
| Field | Type | Description |
|---|---|---|
| `avg_inference_time_ms` | `float` | Average wall-clock inference time in milliseconds. |
| `avg_dram_footprint_bytes` | `float` | Average total DRAM footprint in bytes, including file-backed pages (e.g. mmap’d model data). |
| `peak_dram_footprint_bytes` | `int` | Peak total DRAM footprint in bytes. |
| `avg_anon_mem_bytes` | `float` | Average anonymous memory (heap/stack) in bytes, excluding file-backed pages. |
| `peak_anon_mem_bytes` | `int` | Peak anonymous memory in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilization as a percentage (across all cores) during timed iterations. |
Methods:
summary()
Return a human-readable summary string with formatted values (times in ms, memory in MB, CPU as a percentage).
stats = profile_vmfb_resources("model.vmfb")
print(stats.summary())
# Avg inference time: 12.345 ms
#
# Process-wide resource usage (includes Python overhead):
# Avg DRAM footprint: 85.2 MB (includes mmap'd model file)
# Peak DRAM footprint: 91.7 MB
# Avg memory usage: 23.4 MB (heap/stack only, excludes model file)
# Peak memory usage: 28.1 MB
# Avg CPU usage: 47.3%
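The MB figures above use the same bytes-to-MB conversion shown in the earlier `profile_vmfb_resources` example. A tiny helper (illustrative, not part of the package) makes the arithmetic explicit:

```python
def bytes_to_mb(n_bytes):
    """Format a raw byte count the way the summary strings do (illustrative helper)."""
    return f"{n_bytes / 1024 / 1024:.1f} MB"

print(bytes_to_mb(89_395_200))  # 85.3 MB
```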
ResourceSampler
A reusable background sampler for collecting process-wide DRAM and CPU metrics around arbitrary workloads. Used internally by profile_vmfb_resources but available for custom profiling scenarios.
from torq.runtime.profiling import ResourceSampler
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pid` | `int` | (required) | Process ID to monitor. |
| `interval` | `float` | — | Sampling interval in seconds. |
Properties (available after `stop()`):
| Property | Type | Description |
|---|---|---|
| `avg_dram_footprint_bytes` | `float` | Average total RSS (DRAM footprint) in bytes. |
| `peak_dram_footprint_bytes` | `int` | Peak total RSS in bytes. |
| `avg_anon_mem_bytes` | `float` | Average anonymous RSS in bytes. |
| `peak_anon_mem_bytes` | `int` | Peak anonymous RSS in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilization percentage. |
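The start/stop sampling pattern can be sketched with a plain background thread. `MiniSampler` below is a simplified, hypothetical stand-in: it samples an arbitrary callable instead of reading per-process metrics from `/proc` the way the real `ResourceSampler` does.

```python
import threading
import time

class MiniSampler:
    """Simplified stand-in for the start/stop sampling pattern (hypothetical class)."""

    def __init__(self, read_metric, interval=0.001):
        self._read = read_metric      # callable returning the current metric value
        self._interval = interval     # sampling interval in seconds
        self._samples = []
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._samples.append(self._read())
            self._stop.wait(self._interval)  # wakes early when stop() is called

    def stop(self):
        self._stop.set()
        self._thread.join()

    @property
    def avg(self):
        return sum(self._samples) / len(self._samples)

    @property
    def peak(self):
        return max(self._samples)

# Sample the size of a growing list while some "work" runs
data = []
sampler = MiniSampler(read_metric=lambda: len(data))
sampler.start()
for _ in range(50):
    data.extend(range(100))
    time.sleep(0.001)
sampler.stop()
```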
run_vmfb
Run a .vmfb model via the iree-run-module CLI and return wall-clock time.
run_vmfb(
model_path,
inputs,
outputs,
device="torq",
n_threads=None,
iree_binary=None,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[str]` | (required) | Input descriptors forwarded as `--input` flags. |
| `outputs` | `list[str]` | (required) | Output descriptors forwarded as `--output` flags. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `iree_binary` | `str \| None` | `None` | Path to the `iree-run-module` binary. |
Returns: Elapsed wall-clock time in milliseconds.
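The wall-clock measurement around a CLI invocation can be sketched with `subprocess` and `time.perf_counter`. This is an illustrative, hypothetical helper; the stand-in command replaces a real `iree-run-module` invocation.

```python
import subprocess
import sys
import time

def run_cli_timed(cmd):
    """Sketch: run a CLI command and return elapsed wall-clock milliseconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return (time.perf_counter() - start) * 1000.0

# Harmless stand-in for an `iree-run-module` invocation
elapsed_ms = run_cli_timed([sys.executable, "-c", "pass"])
```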
TensorInfo
Dataclass holding dtype and shape metadata for a tensor.
@dataclass
class TensorInfo:
dtype: DTypeLike
shape: list[int | str]
| Field | Type | Description |
|---|---|---|
| `dtype` | `DTypeLike` | NumPy-compatible dtype. |
| `shape` | `list[int \| str]` | Tensor dimensions. |
Methods:
`is_valid()` — Returns `True` if every dimension is an integer (i.e., no dynamic dimensions).
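A minimal sketch of this check, using a simplified local copy of the dataclass above and assuming dynamic dimensions are represented as strings (consistent with the `list[int | str]` annotation):

```python
from dataclasses import dataclass, field

@dataclass
class TensorInfo:  # simplified local sketch of the dataclass above
    dtype: str
    shape: list = field(default_factory=list)

    def is_valid(self) -> bool:
        # True only when every dimension is a concrete integer
        return all(isinstance(dim, int) for dim in self.shape)

print(TensorInfo("int8", [1, 224, 224, 3]).is_valid())   # True
print(TensorInfo("int8", [1, "seq_len", 64]).is_valid())  # False
```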
Utility Functions
random_inputs_from_info(inputs_info)
Generate random NumPy arrays matching the given tensor metadata. Useful for testing.
inputs_info — Iterable of `TensorInfo`.
Returns — List of NumPy arrays with appropriate shapes and dtypes.
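One plausible implementation sketch (hypothetical; the shipped version may differ), assuming all dimensions are concrete integers:

```python
import numpy as np

def random_inputs_from_info(inputs_info):
    """Sketch: build random arrays from (dtype, shape) metadata.

    Illustrative only; assumes no dynamic (string) dimensions.
    """
    arrays = []
    for info in inputs_info:
        dtype = np.dtype(info.dtype)
        if dtype.kind in "iu":  # signed/unsigned integers: sample the representable range
            lo, hi = np.iinfo(dtype).min, np.iinfo(dtype).max
            arrays.append(np.random.randint(lo, hi, size=info.shape, dtype=dtype))
        else:  # floating point: uniform [0, 1)
            arrays.append(np.random.random(size=info.shape).astype(dtype))
    return arrays

# Usage with stand-in metadata (anything exposing .dtype and .shape works here)
from types import SimpleNamespace
batch = random_inputs_from_info([
    SimpleNamespace(dtype="int8", shape=[1, 4]),
    SimpleNamespace(dtype="float32", shape=[2, 2]),
])
```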