Torq Runtime Python API (Beta)
The torq-runtime Python package provides bindings for loading and running compiled .vmfb models on a Torq device directly from Python.
Warning
torq-runtime is currently in beta and is not yet available on PyPI.
Installation
The torq-runtime package is included in the GitHub release. Install the runtime wheel directly from any release snapshot.
Quick Start
import numpy as np
from torq.runtime import VMFBInferenceRunner
# Load the compiled model
runner = VMFBInferenceRunner("mobilenetv2.vmfb", device_uri="torq")
# Prepare input data
input_data = np.random.randint(0, 255, size=(1, 224, 224, 3), dtype=np.int8)
# Run inference
outputs = runner.infer([input_data])
print(f"Inference took {runner.infer_time_ms:.2f} ms")
Examples
Inspecting Model Inputs and Outputs
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")
if runner.inputs_info:
for i, info in enumerate(runner.inputs_info):
print(f"Input {i}: dtype={info.dtype}, shape={info.shape}")
if runner.outputs_info:
for i, info in enumerate(runner.outputs_info):
print(f"Output {i}: dtype={info.dtype}, shape={info.shape}")
Profiling Inference Latency
from torq.runtime import profile_vmfb_inference_time
avg_ms = profile_vmfb_inference_time(
"model.vmfb",
n_iters=10,
do_warmup=True,
device="torq",
)
print(f"Average inference time: {avg_ms:.2f} ms")
Profiling System Resources
from torq.runtime import profile_vmfb_resources
stats = profile_vmfb_resources(
"model.vmfb",
n_iters=10,
do_warmup=True,
device="torq",
)
print(f"Average inference time: {stats.avg_inference_time_ms:.2f} ms")
print(f"Average DRAM footprint: {stats.avg_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Peak DRAM footprint: {stats.peak_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Average memory usage: {stats.avg_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {stats.peak_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Average CPU usage: {stats.avg_cpu_percent:.1f}%")
Running with Custom Inputs
import numpy as np
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")
# Load preprocessed input from a .npy file
input_data = np.load("preprocessed_input.npy")
outputs = runner.infer([input_data])
Zero-Copy Device Outputs
When model outputs are fed back as inputs (e.g. KV-cache tensors in autoregressive decoding), use device_outputs=True to keep data on the device and avoid unnecessary host transfers:
import numpy as np
from torq.runtime import VMFBInferenceRunner
runner = VMFBInferenceRunner("model.vmfb", device_uri="torq", device_outputs=True)
# Allocate an initial on-device buffer (e.g. for a KV cache)
kv_cache = runner.allocate_device_array(np.zeros((1, 8, 256, 64), dtype=np.float16))
# Example token input (shape and dtype are illustrative)
input_data = np.zeros((1, 1), dtype=np.int32)
# Run inference — outputs stay on device as DeviceArray objects
results = runner.infer([input_data, kv_cache])
logits, new_kv = results[0], results[1]
# Use the output DeviceArray directly as next step's input
kv_cache = new_kv
# Only bring the logits to host for sampling
logits_np = logits.to_host()
API Reference
VMFBInferenceRunner
The main class for loading and running .vmfb models via the IREE runtime.
VMFBInferenceRunner(
model_path,
*,
function="main",
device_uri="torq",
n_threads=None,
load_method="preload",
load_model_to_mem=True,
runtime_flags=None,
device_outputs=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device_uri` | `str` | `"torq"` | IREE device identifier. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_method` | `str` | `"preload"` | Module load method. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_outputs` | `bool` | `False` | If `True`, `infer()` returns on-device `DeviceArray` objects instead of NumPy arrays. |
Properties:
| Property | Type | Description |
|---|---|---|
| `model_path` | `str` | Path to the loaded model file. |
| `infer_time_ms` | `float` | Elapsed time in milliseconds for the last call to `infer()`. |
| `device` | `HalDevice` | The underlying IREE HAL device. |
| `inputs_info` | `list[TensorInfo] \| None` | Input tensor metadata extracted from the model, or `None` if unavailable. |
| `outputs_info` | `list[TensorInfo] \| None` | Output tensor metadata extracted from the model, or `None` if unavailable. |
Methods:
infer(inputs)
Run inference and return the output arrays.
inputs — Either an iterable of NumPy arrays (or `DeviceArray` objects) or a mapping of input name to array.
Returns — A list of NumPy arrays by default. When `device_outputs=True`, returns a list of on-device `DeviceArray` objects instead.
allocate_device_array(array)
Allocate a device buffer and copy a host NumPy array into it.
array — A NumPy array to upload to the device.
Returns — A `DeviceArray` that can be passed directly to `infer()` without further copies.
profile_vmfb_inference_time
Load a .vmfb model and run inference multiple times for profiling.
profile_vmfb_inference_time(
model_path,
inputs=None,
*,
n_iters=5,
do_warmup=True,
function="main",
device="torq",
n_threads=None,
load_model_to_mem=True,
runtime_flags=None,
device_io=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[np.ndarray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Exclude some host copies from profiling by using IREE `DeviceArray` objects for I/O. |
Returns: Average wall-clock inference time in milliseconds.
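The warmup-then-average pattern described above can be sketched in plain Python. This is an illustrative stand-in only: `average_latency_ms` is a hypothetical helper, and the stand-in workload replaces a real `.vmfb` inference call.

```python
import time

def average_latency_ms(fn, n_iters=5, do_warmup=True):
    """Sketch of a warmup-then-average timing loop (hypothetical helper)."""
    if do_warmup:
        fn()  # untimed warmup pass absorbs first-run costs (caches, lazy allocation)
    total = 0.0
    for _ in range(n_iters):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / n_iters * 1000.0  # average in milliseconds

# Stand-in workload instead of a real .vmfb inference call
avg_ms = average_latency_ms(lambda: sum(range(100_000)), n_iters=3)
```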
profile_vmfb_resources
Load a .vmfb model and run inference multiple times, collecting DRAM and CPU statistics alongside timing. Requires a Linux target (reads from /proc).
from torq.runtime import profile_vmfb_resources
profile_vmfb_resources(
model_path,
inputs=None,
*,
n_iters=5,
do_warmup=True,
function="main",
device="torq",
n_threads=None,
load_method="preload",
load_model_to_mem=True,
runtime_flags=None,
device_io=False,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[np.ndarray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `load_method` | `str` | `"preload"` | Module load method. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `list[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Exclude some host copies from profiling by using IREE `DeviceArray` objects for I/O. |
Returns: A ProfileStats dataclass (see below).
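As a rough illustration of where the DRAM numbers come from: on Linux, `/proc/<pid>/status` reports both the total resident set (`VmRSS`, which includes file-backed pages such as an mmap'd model) and the anonymous portion (`RssAnon`, heap/stack only). The parser below is a hypothetical sketch, not torq-runtime's actual implementation; field availability also depends on the kernel version.

```python
def parse_rss_fields(status_text):
    """Extract VmRSS and RssAnon (in bytes) from /proc/<pid>/status content.

    Illustrative sketch only; RssAnon requires a reasonably recent kernel.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "RssAnon"):
            fields[key] = int(rest.split()[0]) * 1024  # values are reported in kB
    return fields

# A literal sample in the /proc status format, so this runs on any platform
sample = "Name:\tpython\nVmRSS:\t  87300 kB\nRssAnon:\t  24000 kB\n"
print(parse_rss_fields(sample))
# {'VmRSS': 89395200, 'RssAnon': 24576000}
```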
ProfileStats
Dataclass returned by profile_vmfb_resources containing profiling results. All resource metrics are process-wide and include overhead from the Python interpreter, the sampling thread, and any other activity in the process.
from torq.runtime.profiling import ProfileStats
| Field | Type | Description |
|---|---|---|
| `avg_inference_time_ms` | `float` | Average wall-clock inference time in milliseconds. |
| `avg_dram_footprint_bytes` | `float` | Average total DRAM footprint in bytes, including file-backed pages (e.g. mmap’d model data). |
| `peak_dram_footprint_bytes` | `int` | Peak total DRAM footprint in bytes. |
| `avg_anon_mem_bytes` | `float` | Average anonymous memory (heap/stack) in bytes, excluding file-backed pages. |
| `peak_anon_mem_bytes` | `int` | Peak anonymous memory in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilization as a percentage (across all cores) during timed iterations. |
Methods:
summary()
Return a human-readable summary string with formatted values (times in ms, memory in MB, CPU as a percentage).
stats = profile_vmfb_resources("model.vmfb")
print(stats.summary())
# Avg inference time: 12.345 ms
#
# Process-wide resource usage (includes Python overhead):
# Avg DRAM footprint: 85.2 MB (includes mmap'd model file)
# Peak DRAM footprint: 91.7 MB
# Avg memory usage: 23.4 MB (heap/stack only, excludes model file)
# Peak memory usage: 28.1 MB
# Avg CPU usage: 47.3%
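The MB figures above use the same bytes-to-MB conversion shown in the earlier `profile_vmfb_resources` example. A tiny helper (illustrative, not part of the package) makes the arithmetic explicit:

```python
def bytes_to_mb(n_bytes):
    """Format a raw byte count the way the summary strings do (illustrative helper)."""
    return f"{n_bytes / 1024 / 1024:.1f} MB"

print(bytes_to_mb(89_395_200))  # 85.3 MB
```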
ResourceSampler
A reusable background sampler for collecting process-wide DRAM and CPU metrics around arbitrary workloads. Used internally by profile_vmfb_resources but available for custom profiling scenarios.
from torq.runtime.profiling import ResourceSampler
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pid` | `int` | (required) | Process ID to monitor. |
| `interval` | `float` | — | Sampling interval in seconds. |
Properties (available after `stop()`):
| Property | Type | Description |
|---|---|---|
| `avg_dram_footprint_bytes` | `float` | Average total RSS (DRAM footprint) in bytes. |
| `peak_dram_footprint_bytes` | `int` | Peak total RSS in bytes. |
| `avg_anon_mem_bytes` | `float` | Average anonymous RSS in bytes. |
| `peak_anon_mem_bytes` | `int` | Peak anonymous RSS in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilization percentage. |
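The start/stop sampling pattern can be sketched with a plain background thread. `MiniSampler` below is a simplified, hypothetical stand-in: it samples an arbitrary callable instead of reading per-process metrics from `/proc` the way the real `ResourceSampler` does.

```python
import threading
import time

class MiniSampler:
    """Simplified stand-in for the start/stop sampling pattern (hypothetical class)."""

    def __init__(self, read_metric, interval=0.001):
        self._read = read_metric      # callable returning the current metric value
        self._interval = interval     # sampling interval in seconds
        self._samples = []
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._samples.append(self._read())
            self._stop.wait(self._interval)  # wakes early when stop() is called

    def stop(self):
        self._stop.set()
        self._thread.join()

    @property
    def avg(self):
        return sum(self._samples) / len(self._samples)

    @property
    def peak(self):
        return max(self._samples)

# Sample the size of a growing list while some "work" runs
data = []
sampler = MiniSampler(read_metric=lambda: len(data))
sampler.start()
for _ in range(50):
    data.extend(range(100))
    time.sleep(0.001)
sampler.stop()
```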
run_vmfb
Run a .vmfb model via the iree-run-module CLI and return wall-clock time.
run_vmfb(
model_path,
inputs,
outputs,
device="torq",
n_threads=None,
iree_binary=None,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to the `.vmfb` file. |
| `inputs` | `list[str]` | (required) | Input descriptors forwarded as `--input` flags. |
| `outputs` | `list[str]` | (required) | Output descriptors forwarded as `--output` flags. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for the llvm-cpu device). |
| `iree_binary` | `str \| None` | `None` | Path to the `iree-run-module` binary. |
Returns: Elapsed wall-clock time in milliseconds.
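The wall-clock measurement around a CLI invocation can be sketched with `subprocess` and `time.perf_counter`. This is an illustrative, hypothetical helper; the stand-in command replaces a real `iree-run-module` invocation.

```python
import subprocess
import sys
import time

def run_cli_timed(cmd):
    """Sketch: run a CLI command and return elapsed wall-clock milliseconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return (time.perf_counter() - start) * 1000.0

# Harmless stand-in for an `iree-run-module` invocation
elapsed_ms = run_cli_timed([sys.executable, "-c", "pass"])
```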
TensorInfo
Dataclass holding dtype and shape metadata for a tensor.
@dataclass
class TensorInfo:
dtype: DTypeLike
shape: list[int | str]
| Field | Type | Description |
|---|---|---|
| `dtype` | `DTypeLike` | NumPy-compatible dtype. |
| `shape` | `list[int \| str]` | Tensor dimensions. |
Methods:
`is_valid()` — Returns `True` if every dimension is an integer (i.e., no dynamic dimensions).
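A minimal sketch of this check, using a simplified local copy of the dataclass above and assuming dynamic dimensions are represented as strings (consistent with the `list[int | str]` annotation):

```python
from dataclasses import dataclass, field

@dataclass
class TensorInfo:  # simplified local sketch of the dataclass above
    dtype: str
    shape: list = field(default_factory=list)

    def is_valid(self) -> bool:
        # True only when every dimension is a concrete integer
        return all(isinstance(dim, int) for dim in self.shape)

print(TensorInfo("int8", [1, 224, 224, 3]).is_valid())   # True
print(TensorInfo("int8", [1, "seq_len", 64]).is_valid())  # False
```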
Utility Functions
random_inputs_from_info(inputs_info)
Generate random NumPy arrays matching the given tensor metadata. Useful for testing.
inputs_info — Iterable of `TensorInfo`.
Returns — List of NumPy arrays with appropriate shapes and dtypes.
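One plausible implementation sketch (hypothetical; the shipped version may differ), assuming all dimensions are concrete integers:

```python
import numpy as np

def random_inputs_from_info(inputs_info):
    """Sketch: build random arrays from (dtype, shape) metadata.

    Illustrative only; assumes no dynamic (string) dimensions.
    """
    arrays = []
    for info in inputs_info:
        dtype = np.dtype(info.dtype)
        if dtype.kind in "iu":  # signed/unsigned integers: sample the representable range
            lo, hi = np.iinfo(dtype).min, np.iinfo(dtype).max
            arrays.append(np.random.randint(lo, hi, size=info.shape, dtype=dtype))
        else:  # floating point: uniform [0, 1)
            arrays.append(np.random.random(size=info.shape).astype(dtype))
    return arrays

# Usage with stand-in metadata (anything exposing .dtype and .shape works here)
from types import SimpleNamespace
batch = random_inputs_from_info([
    SimpleNamespace(dtype="int8", shape=[1, 4]),
    SimpleNamespace(dtype="float32", shape=[2, 2]),
])
```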