Torq Runtime Python API (Beta)

The torq-runtime Python package provides bindings for loading and running compiled .vmfb models on a Torq device directly from Python.

Warning

torq-runtime is currently in beta and is not yet available on PyPI.

Installation

The torq-runtime package is included in the GitHub release. Install the runtime wheel directly from any release snapshot.

Quick Start

import numpy as np
from torq.runtime import VMFBInferenceRunner

# Load the compiled model
runner = VMFBInferenceRunner("mobilenetv2.vmfb", device_uri="torq")

# Prepare input data
input_data = np.random.randint(0, 255, size=(1, 224, 224, 3), dtype=np.int8)

# Run inference
outputs = runner.infer([input_data])
print(f"Inference took {runner.infer_time_ms:.2f} ms")

Examples

Inspecting Model Inputs and Outputs

from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")

if runner.inputs_info:
    for i, info in enumerate(runner.inputs_info):
        print(f"Input {i}: dtype={info.dtype}, shape={info.shape}")

if runner.outputs_info:
    for i, info in enumerate(runner.outputs_info):
        print(f"Output {i}: dtype={info.dtype}, shape={info.shape}")

Profiling Inference Latency

from torq.runtime import profile_vmfb_inference_time

avg_ms = profile_vmfb_inference_time(
    "model.vmfb",
    n_iters=10,
    do_warmup=True,
    device="torq",
)
print(f"Average inference time: {avg_ms:.2f} ms")

Profiling System Resources

from torq.runtime import profile_vmfb_resources

stats = profile_vmfb_resources(
    "model.vmfb",
    n_iters=10,
    do_warmup=True,
    device="torq",
)
print(f"Average inference time: {stats.avg_inference_time_ms:.2f} ms")
print(f"Average DRAM footprint: {stats.avg_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Peak DRAM footprint: {stats.peak_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Average memory usage: {stats.avg_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {stats.peak_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Average CPU usage: {stats.avg_cpu_percent:.1f}%")

Running with Custom Inputs

import numpy as np
from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")

# Load preprocessed input from a .npy file
input_data = np.load("preprocessed_input.npy")
outputs = runner.infer([input_data])

Zero-Copy Device Outputs

When model outputs are fed back as inputs (e.g. KV-cache tensors in autoregressive decoding), use device_outputs=True to keep data on the device and avoid unnecessary host transfers:

import numpy as np
from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq", device_outputs=True)

# Allocate an initial on-device buffer (e.g. for a KV cache)
kv_cache = runner.allocate_device_array(np.zeros((1, 8, 256, 64), dtype=np.float16))

# Run inference — outputs stay on device as DeviceArray objects
results = runner.infer([input_data, kv_cache])
logits, new_kv = results[0], results[1]

# Use the output DeviceArray directly as next step's input
kv_cache = new_kv

# Only bring the logits to host for sampling
logits_np = logits.to_host()

API Reference

VMFBInferenceRunner

The main class for loading and running .vmfb models via the IREE runtime.

VMFBInferenceRunner(
    model_path,
    *,
    function="main",
    device_uri="torq",
    n_threads=None,
    load_method="preload",
    load_model_to_mem=True,
    runtime_flags=None,
    device_outputs=False,
)

Parameters:

Parameter

Type

Default

Description

model_path

str | PathLike

(required)

Path to the .vmfb file.

function

str

"main"

Exported function name inside the module.

device_uri

str

"torq"

IREE device identifier.

n_threads

int | None

None

Worker thread count (only for llvm-cpu device).

load_method

"preload" | "mmap"

"preload"

"preload" copies into memory; "mmap" memory-maps the file.

load_model_to_mem

bool

True

Whether to load the model into memory during initialization.

runtime_flags

Iterable[str] | None

None

Extra IREE runtime flags.

device_outputs

bool

False

If True, infer() returns on-device DeviceArray objects instead of NumPy arrays, avoiding device-to-host transfers. Useful for pipelines where outputs are fed back as inputs (e.g. KV-cache in autoregressive decoding).

Properties:

Property

Type

Description

model_path

PathLike

Path to the loaded model file.

infer_time_ms

float

Elapsed time in milliseconds for the last call to infer().

device

HalDevice

The underlying IREE HAL device.

inputs_info

list[TensorInfo] | None

Input tensor metadata extracted from the model, or None if unavailable.

outputs_info

list[TensorInfo] | None

Output tensor metadata extracted from the model, or None if unavailable.

Methods:

infer(inputs)

Run inference and return the output arrays.

  • inputs — Either an iterable of NumPy arrays (or DeviceArray objects) or a mapping of name to array.

  • Returns — A list of NumPy arrays by default. When device_outputs=True, returns a list of on-device DeviceArray objects instead.

allocate_device_array(array)

Allocate a device buffer and copy a host NumPy array into it.

  • array — A NumPy array to upload to the device.

  • Returns — A DeviceArray that can be passed directly to infer() without further copies.

profile_vmfb_inference_time

Load a .vmfb model and run inference multiple times for profiling.

profile_vmfb_inference_time(
    model_path,
    inputs=None,
    *,
    n_iters=5,
    do_warmup=True,
    function="main",
    device="torq",
    n_threads=None,
    load_model_to_mem=True,
    runtime_flags=None,
    device_io=False,
)

Parameters:

Parameter

Type

Default

Description

model_path

str | PathLike

(required)

Path to the .vmfb file.

inputs

Iterable[NDArray] | None

None

Input arrays. Generated randomly from model metadata when None.

n_iters

int

5

Number of timed inference iterations.

do_warmup

bool

True

Whether to run one untimed warmup pass first.

function

str

"main"

Exported function name inside the module.

device

str

"torq"

IREE device URI.

n_threads

int | None

None

Worker thread count (only for llvm-cpu device).

load_method

"preload" | "mmap"

"preload"

"preload" copies into memory; "mmap" memory-maps the file.

load_model_to_mem

bool

True

Whether to load the model into memory during initialization.

runtime_flags

Iterable[str] | None

None

Extra IREE runtime flags.

device_io

bool

False

Exclude some host copies from profiling by using iree DeviceArray objects for I/O.

Returns: Average wall-clock inference time in milliseconds.

profile_vmfb_resources

Load a .vmfb model and run inference multiple times, collecting DRAM and CPU statistics alongside timing. Requires a Linux target (reads from /proc).

from torq.runtime import profile_vmfb_resources

profile_vmfb_resources(
    model_path,
    inputs=None,
    *,
    n_iters=5,
    do_warmup=True,
    function="main",
    device="torq",
    n_threads=None,
    load_method="preload",
    load_model_to_mem=True,
    runtime_flags=None,
    device_io=False,
)

Parameters:

Parameter

Type

Default

Description

model_path

str | PathLike

(required)

Path to the .vmfb file.

inputs

Iterable[NDArray] | None

None

Input arrays. Generated randomly from model metadata when None.

n_iters

int

5

Number of timed inference iterations.

do_warmup

bool

True

Whether to run one untimed warmup pass first.

function

str

"main"

Exported function name inside the module.

device

str

"torq"

IREE device URI.

n_threads

int | None

None

Worker thread count (only for llvm-cpu device).

load_method

"preload" | "mmap"

"preload"

"preload" copies into memory; "mmap" memory-maps the file.

load_model_to_mem

bool

True

Whether to load the model into memory during initialization.

runtime_flags

Iterable[str] | None

None

Extra IREE runtime flags.

device_io

bool

False

Exclude some host copies from profiling by using iree DeviceArray objects for I/O.

Returns: A ProfileStats dataclass (see below).

ProfileStats

Dataclass returned by profile_vmfb_resources containing profiling results. All resource metrics are process-wide and include overhead from the Python interpreter, the sampling thread, and any other activity in the process.

from torq.runtime.profiling import ProfileStats

Field

Type

Description

avg_inference_time_ms

float

Average wall-clock inference time in milliseconds.

avg_dram_footprint_bytes

int

Average total DRAM footprint in bytes, including file-backed pages (e.g. mmap’d model data).

peak_dram_footprint_bytes

int

Peak total DRAM footprint in bytes.

avg_anon_mem_bytes

int

Average anonymous memory (heap/stack) in bytes, excluding file-backed pages.

peak_anon_mem_bytes

int

Peak anonymous memory in bytes.

avg_cpu_percent

float

Average CPU utilisation as a percentage (across all cores) during timed iterations.

Methods:

summary()

Return a human-readable summary string with formatted values (times in ms, memory in MB, CPU as a percentage).

stats = profile_vmfb_resources("model.vmfb")
print(stats.summary())
# Avg inference time:  12.345 ms
#
# Process-wide resource usage (includes Python overhead):
#   Avg DRAM footprint:  85.2 MB  (includes mmap'd model file)
#   Peak DRAM footprint: 91.7 MB
#   Avg memory usage:    23.4 MB  (heap/stack only, excludes model file)
#   Peak memory usage:   28.1 MB
#   Avg CPU usage:       47.3%

ResourceSampler

A reusable background sampler for collecting process-wide DRAM and CPU metrics around arbitrary workloads. Used internally by profile_vmfb_resources but available for custom profiling scenarios.

from torq.runtime.profiling import ResourceSampler

Parameter

Type

Default

Description

pid

int

(required)

Process ID to monitor.

interval

float

0.01

Sampling interval in seconds.

Properties (available after stop()):

Property

Type

Description

avg_rss

int

Average total RSS (DRAM footprint) in bytes.

peak_rss

int

Peak total RSS in bytes.

avg_anon_rss

int

Average anonymous RSS in bytes.

peak_anon_rss

int

Peak anonymous RSS in bytes.

avg_cpu_percent

float

Average CPU utilisation percentage.

run_vmfb

Run a .vmfb model via the iree-run-module CLI and return wall-clock time.

run_vmfb(
    model_path,
    inputs,
    outputs,
    device="torq",
    n_threads=None,
    iree_binary=None,
)

Parameters:

Parameter

Type

Default

Description

model_path

str | PathLike

(required)

Path to the .vmfb file.

inputs

Iterable[str]

(required)

Input descriptors forwarded as --input flags.

outputs

Iterable[str]

(required)

Output descriptors forwarded as --output flags.

device

str

"torq"

IREE device URI.

n_threads

int | None

None

Worker thread count (only for llvm-cpu device, defaults to os.cpu_count()).

iree_binary

str | PathLike | None

None

Path to the iree-run-module binary. Resolved from PATH if not provided.

Returns: Elapsed wall-clock time in milliseconds.

TensorInfo

Dataclass holding dtype and shape metadata for a tensor.

@dataclass
class TensorInfo:
    dtype: DTypeLike
    shape: list[int | str]

Field

Type

Description

dtype

DTypeLike

NumPy-compatible dtype.

shape

list[int | str]

Tensor dimensions.

Methods:

  • is_valid() — Returns True if every dimension is an integer (i.e., no dynamic dimensions).

Utility Functions

random_inputs_from_info(inputs_info)

Generate random NumPy arrays matching the given tensor metadata. Useful for testing.

  • inputs_info — Iterable of TensorInfo.

  • Returns — List of NumPy arrays with appropriate shapes and dtypes.