# Torq Runtime Python API (Beta)

The `torq-runtime` Python package provides bindings for loading and running compiled `.vmfb` models on a Torq device directly from Python.

```{warning}
`torq-runtime` is currently in beta and is not yet available on PyPI.
```

## Installation

The `torq-runtime` wheel is included in the GitHub release. Because the package is not on PyPI, install the wheel directly from a release snapshot, for example by pointing `pip install` at the downloaded `.whl` file.

## Quick Start

```python
import numpy as np
from torq.runtime import VMFBInferenceRunner

# Load the compiled model
runner = VMFBInferenceRunner("mobilenetv2.vmfb", device_uri="torq")

# Prepare input data (random int8 values; match the dtype and shape your model expects)
input_data = np.random.randint(-128, 128, size=(1, 224, 224, 3), dtype=np.int8)

# Run inference
outputs = runner.infer([input_data])
print(f"Inference took {runner.infer_time_ms:.2f} ms")
```

## Examples

### Inspecting Model Inputs and Outputs

```python
from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")

if runner.inputs_info:
    for i, info in enumerate(runner.inputs_info):
        print(f"Input {i}: dtype={info.dtype}, shape={info.shape}")

if runner.outputs_info:
    for i, info in enumerate(runner.outputs_info):
        print(f"Output {i}: dtype={info.dtype}, shape={info.shape}")
```

### Profiling Inference Latency

```python
from torq.runtime import profile_vmfb_inference_time

avg_ms = profile_vmfb_inference_time(
    "model.vmfb",
    n_iters=10,
    do_warmup=True,
    device="torq",
)
print(f"Average inference time: {avg_ms:.2f} ms")
```

### Profiling System Resources

```python
from torq.runtime import profile_vmfb_resources

stats = profile_vmfb_resources(
    "model.vmfb",
    n_iters=10,
    do_warmup=True,
    device="torq",
)
print(f"Average inference time: {stats.avg_inference_time_ms:.2f} ms")
print(f"Average DRAM footprint: {stats.avg_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Peak DRAM footprint: {stats.peak_dram_footprint_bytes / 1024 / 1024:.1f} MB")
print(f"Average memory usage: {stats.avg_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {stats.peak_anon_mem_bytes / 1024 / 1024:.1f} MB")
print(f"Average CPU usage: {stats.avg_cpu_percent:.1f}%")
```

### Running with Custom Inputs

```python
import numpy as np
from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq")

# Load preprocessed input from a .npy file
input_data = np.load("preprocessed_input.npy")
outputs = runner.infer([input_data])
```

### Zero-Copy Device Outputs

When model outputs are fed back as inputs (e.g. KV-cache tensors in autoregressive decoding), use `device_outputs=True` to keep data on the device and avoid unnecessary host transfers:

```python
import numpy as np
from torq.runtime import VMFBInferenceRunner

runner = VMFBInferenceRunner("model.vmfb", device_uri="torq", device_outputs=True)

# Allocate an initial on-device buffer (e.g. for a KV cache)
kv_cache = runner.allocate_device_array(np.zeros((1, 8, 256, 64), dtype=np.float16))

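# Placeholder model input; load or build whatever your model expects
input_data = np.load("preprocessed_input.npy")

# Run inference; outputs stay on device as DeviceArray objects
results = runner.infer([input_data, kv_cache])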
logits, new_kv = results[0], results[1]

# Use the output DeviceArray directly as the next step's input
kv_cache = new_kv

# Only bring the logits to host for sampling
logits_np = logits.to_host()
```

## API Reference

### `VMFBInferenceRunner`

The main class for loading and running `.vmfb` models via the IREE runtime.

```python
VMFBInferenceRunner(
    model_path,
    *,
    function="main",
    device_uri="torq",
    n_threads=None,
    load_method="preload",
    load_model_to_mem=True,
    runtime_flags=None,
    device_outputs=False,
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | `str \| PathLike` | *(required)* | Path to the `.vmfb` file. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device_uri` | `str` | `"torq"` | IREE device identifier. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for llvm-cpu device). |
| `load_method` | `"preload" \| "mmap"` | `"preload"` | `"preload"` copies into memory; `"mmap"` memory-maps the file. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `Iterable[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_outputs` | `bool` | `False` | If `True`, `infer()` returns on-device `DeviceArray` objects instead of NumPy arrays, avoiding device-to-host transfers. Useful for pipelines where outputs are fed back as inputs (e.g. KV-cache in autoregressive decoding). |

**Properties:**

| Property | Type | Description |
|----------|------|-------------|
| `model_path` | `PathLike` | Path to the loaded model file. |
| `infer_time_ms` | `float` | Elapsed time in milliseconds for the last call to `infer()`. |
| `device` | `HalDevice` | The underlying IREE HAL device. |
| `inputs_info` | `list[TensorInfo] \| None` | Input tensor metadata extracted from the model, or `None` if unavailable. |
| `outputs_info` | `list[TensorInfo] \| None` | Output tensor metadata extracted from the model, or `None` if unavailable. |

**Methods:**

#### `infer(inputs)`

Run inference and return the output arrays.

- **inputs** — Either an iterable of NumPy arrays (or `DeviceArray` objects) or a mapping of name to array.
- **Returns** — A list of NumPy arrays by default. When `device_outputs=True`, returns a list of on-device `DeviceArray` objects instead.

#### `allocate_device_array(array)`

Allocate a device buffer and copy a host NumPy array into it.

- **array** — A NumPy array to upload to the device.
- **Returns** — A `DeviceArray` that can be passed directly to `infer()` without further copies.

### `profile_vmfb_inference_time`

Load a `.vmfb` model and run inference multiple times for profiling.

```python
profile_vmfb_inference_time(
    model_path,
    inputs=None,
    *,
    n_iters=5,
    do_warmup=True,
    function="main",
    device="torq",
    n_threads=None,
    load_method="preload",
    load_model_to_mem=True,
    runtime_flags=None,
    device_io=False,
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | `str \| PathLike` | *(required)* | Path to the `.vmfb` file. |
| `inputs` | `Iterable[NDArray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for llvm-cpu device). |
| `load_method` | `"preload" \| "mmap"` | `"preload"` | `"preload"` copies into memory; `"mmap"` memory-maps the file. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `Iterable[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Use IREE `DeviceArray` objects for I/O so that some host copies are excluded from the measurement. |

**Returns:** Average wall-clock inference time in milliseconds.
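
The warmup and averaging behavior described by `n_iters` and `do_warmup` can be illustrated with a generic timing helper. This is a minimal sketch and not the package's implementation; `average_ms` and `workload` are hypothetical names, and the workload here is a trivial stand-in for an inference call.

```python
import time
from typing import Callable


def average_ms(workload: Callable[[], object], n_iters: int = 5, do_warmup: bool = True) -> float:
    """Return the mean wall-clock time of `workload` in milliseconds."""
    if n_iters <= 0:
        raise ValueError("n_iters must be positive")
    if do_warmup:
        workload()  # one untimed pass, excluded from the average
    total = 0.0
    for _ in range(n_iters):
        start = time.perf_counter()
        workload()
        total += time.perf_counter() - start
    return total / n_iters * 1000.0


# Stand-in workload that records each call so the call count can be checked.
calls = []
avg = average_ms(lambda: calls.append(1), n_iters=10, do_warmup=True)
print(f"{len(calls)} calls, average {avg:.4f} ms")
```

With warmup enabled the workload runs `n_iters + 1` times, but only `n_iters` of those runs contribute to the reported average.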

### `profile_vmfb_resources`

Load a `.vmfb` model and run inference multiple times, collecting DRAM and CPU statistics alongside timing. Requires a Linux target (reads from `/proc`).

```python
from torq.runtime import profile_vmfb_resources

profile_vmfb_resources(
    model_path,
    inputs=None,
    *,
    n_iters=5,
    do_warmup=True,
    function="main",
    device="torq",
    n_threads=None,
    load_method="preload",
    load_model_to_mem=True,
    runtime_flags=None,
    device_io=False,
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | `str \| PathLike` | *(required)* | Path to the `.vmfb` file. |
| `inputs` | `Iterable[NDArray] \| None` | `None` | Input arrays. Generated randomly from model metadata when `None`. |
| `n_iters` | `int` | `5` | Number of timed inference iterations. |
| `do_warmup` | `bool` | `True` | Whether to run one untimed warmup pass first. |
| `function` | `str` | `"main"` | Exported function name inside the module. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for llvm-cpu device). |
| `load_method` | `"preload" \| "mmap"` | `"preload"` | `"preload"` copies into memory; `"mmap"` memory-maps the file. |
| `load_model_to_mem` | `bool` | `True` | Whether to load the model into memory during initialization. |
| `runtime_flags` | `Iterable[str] \| None` | `None` | Extra IREE runtime flags. |
| `device_io` | `bool` | `False` | Use IREE `DeviceArray` objects for I/O so that some host copies are excluded from the measurement. |

**Returns:** A `ProfileStats` dataclass (see below).

### `ProfileStats`

Dataclass returned by `profile_vmfb_resources` containing profiling results. All resource metrics are **process-wide** and include overhead from the Python interpreter, the sampling thread, and any other activity in the process.

```python
from torq.runtime.profiling import ProfileStats
```

| Field | Type | Description |
|-------|------|-------------|
| `avg_inference_time_ms` | `float` | Average wall-clock inference time in milliseconds. |
| `avg_dram_footprint_bytes` | `int` | Average total DRAM footprint in bytes, including file-backed pages (e.g. mmap'd model data). |
| `peak_dram_footprint_bytes` | `int` | Peak total DRAM footprint in bytes. |
| `avg_anon_mem_bytes` | `int` | Average anonymous memory (heap/stack) in bytes, excluding file-backed pages. |
| `peak_anon_mem_bytes` | `int` | Peak anonymous memory in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilisation as a percentage (across all cores) during timed iterations. |

**Methods:**

#### `summary()`

Return a human-readable summary string with formatted values (times in ms, memory in MB, CPU as a percentage).

```python
stats = profile_vmfb_resources("model.vmfb")
print(stats.summary())
# Avg inference time:  12.345 ms
#
# Process-wide resource usage (includes Python overhead):
#   Avg DRAM footprint:  85.2 MB  (includes mmap'd model file)
#   Peak DRAM footprint: 91.7 MB
#   Avg memory usage:    23.4 MB  (heap/stack only, excludes model file)
#   Peak memory usage:   28.1 MB
#   Avg CPU usage:       47.3%
```
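
The fields can also be combined by hand. Since the DRAM footprint includes file-backed pages and the anonymous figure excludes them, their difference gives a rough estimate of the non-anonymous share (such as an mmap'd model file). The sketch below uses a hypothetical stand-in dataclass, `StatsSketch`, with made-up values rather than the real `ProfileStats`, so it runs without a device.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class StatsSketch:
    """Hypothetical stand-in carrying two of the ProfileStats fields."""
    avg_dram_footprint_bytes: int
    avg_anon_mem_bytes: int


def non_anon_mib(stats: StatsSketch) -> float:
    """Estimate the footprint not backed by anonymous memory, in MiB."""
    return (stats.avg_dram_footprint_bytes - stats.avg_anon_mem_bytes) / MIB


# Made-up example values: 80 MiB total footprint, 20 MiB anonymous memory.
example = StatsSketch(avg_dram_footprint_bytes=80 * MIB, avg_anon_mem_bytes=20 * MIB)
print(f"Non-anonymous pages: {non_anon_mib(example):.1f} MiB")  # prints 60.0 MiB
```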

### `ResourceSampler`

A reusable background sampler for collecting process-wide DRAM and CPU metrics around arbitrary workloads. Used internally by `profile_vmfb_resources` but available for custom profiling scenarios.

```python
from torq.runtime.profiling import ResourceSampler
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `pid` | `int` | *(required)* | Process ID to monitor. |
| `interval` | `float` | `0.01` | Sampling interval in seconds. |

**Properties (available after `stop()`):**

| Property | Type | Description |
|----------|------|-------------|
| `avg_rss` | `int` | Average total RSS (DRAM footprint) in bytes. |
| `peak_rss` | `int` | Peak total RSS in bytes. |
| `avg_anon_rss` | `int` | Average anonymous RSS in bytes. |
| `peak_anon_rss` | `int` | Peak anonymous RSS in bytes. |
| `avg_cpu_percent` | `float` | Average CPU utilisation percentage. |
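
For intuition about where RSS figures like these can come from on Linux, the sketch below parses the `VmRSS` and `RssAnon` fields of `/proc/<pid>/status`. How `ResourceSampler` actually gathers its samples is not documented here, so treat this as an assumption; `parse_rss_bytes` and `read_rss_bytes` are hypothetical helper names.

```python
from typing import Dict


def parse_rss_bytes(status_text: str) -> Dict[str, int]:
    """Extract VmRSS and RssAnon (reported in kB) from /proc/<pid>/status text."""
    wanted = {"VmRSS", "RssAnon"}
    result: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The value looks like "   87244 kB"; convert kB to bytes.
            result[key] = int(rest.split()[0]) * 1024
    return result


def read_rss_bytes(pid: int) -> Dict[str, int]:
    """Read the live values for a process (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="utf-8", errors="replace") as f:
        return parse_rss_bytes(f.read())


# Parse a captured sample so the example runs on any platform.
sample = "Name:\tpython3\nVmRSS:\t   87244 kB\nRssAnon:\t   23960 kB\nRssFile:\t   63284 kB\n"
parsed = parse_rss_bytes(sample)
print(parsed)
```

A background sampler would call something like `read_rss_bytes` once per `interval` and then reduce the samples to averages and peaks.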

### `run_vmfb`

Run a `.vmfb` model via the `iree-run-module` CLI and return wall-clock time.

```python
run_vmfb(
    model_path,
    inputs,
    outputs,
    device="torq",
    n_threads=None,
    iree_binary=None,
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | `str \| PathLike` | *(required)* | Path to the `.vmfb` file. |
| `inputs` | `Iterable[str]` | *(required)* | Input descriptors forwarded as `--input` flags. |
| `outputs` | `Iterable[str]` | *(required)* | Output descriptors forwarded as `--output` flags. |
| `device` | `str` | `"torq"` | IREE device URI. |
| `n_threads` | `int \| None` | `None` | Worker thread count (only for llvm-cpu device, defaults to `os.cpu_count()`). |
| `iree_binary` | `str \| PathLike \| None` | `None` | Path to the `iree-run-module` binary. Resolved from `PATH` if not provided. |

**Returns:** Elapsed wall-clock time in milliseconds.
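
To show what forwarding descriptors as `--input` and `--output` flags amounts to, here is a minimal sketch of assembling and timing an `iree-run-module` command. It is not the package's implementation: `build_command` and `timed_run_ms` are hypothetical helpers, the `@file.npy` descriptors are example values, and thread-count handling is omitted.

```python
import subprocess
import time
from typing import Iterable, List


def build_command(model_path: str, inputs: Iterable[str], outputs: Iterable[str],
                  device: str = "torq", iree_binary: str = "iree-run-module") -> List[str]:
    """Assemble the argument list for one iree-run-module invocation."""
    cmd = [iree_binary, f"--module={model_path}", f"--device={device}"]
    cmd += [f"--input={item}" for item in inputs]
    cmd += [f"--output={item}" for item in outputs]
    return cmd


def timed_run_ms(cmd: List[str]) -> float:
    """Run the command and return elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return (time.perf_counter() - start) * 1000.0


# Only the command is built here; timed_run_ms needs the binary and a device.
cmd = build_command("model.vmfb", ["@input.npy"], ["@output.npy"])
print(" ".join(cmd))
```

Timing a subprocess this way includes process startup and model loading, so the result will generally read higher than `infer_time_ms` from `VMFBInferenceRunner`.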

### `TensorInfo`

Dataclass holding dtype and shape metadata for a tensor.

```python
@dataclass
class TensorInfo:
    dtype: DTypeLike
    shape: list[int | str]
```

| Field | Type | Description |
|-------|------|-------------|
| `dtype` | `DTypeLike` | NumPy-compatible dtype. |
| `shape` | `list[int \| str]` | Tensor dimensions. |

**Methods:**

- `is_valid()` — Returns `True` if every dimension is an integer (i.e., no dynamic dimensions).
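
The check that `is_valid()` describes can be sketched with a local stand-in class. `TensorInfoSketch` is hypothetical, and representing a dynamic dimension as the string `"?"` is an assumption; the real class may use a different placeholder.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TensorInfoSketch:
    """Hypothetical stand-in mirroring the documented TensorInfo fields."""
    dtype: str
    shape: List[Union[int, str]]

    def is_valid(self) -> bool:
        # Valid means fully static: every dimension is an integer.
        return all(isinstance(dim, int) for dim in self.shape)


static = TensorInfoSketch("int8", [1, 224, 224, 3])
dynamic = TensorInfoSketch("float32", ["?", 128])  # "?" is an assumed placeholder
print(static.is_valid(), dynamic.is_valid())  # True False
```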

### Utility Functions

#### `random_inputs_from_info(inputs_info)`

Generate random NumPy arrays matching the given tensor metadata. Useful for testing.

- **inputs_info** — Iterable of `TensorInfo`.
- **Returns** — List of NumPy arrays with appropriate shapes and dtypes.
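
When building random inputs by hand, the value range has to fit the dtype; for example, `int8` cannot hold values up to 255. The sketch below shows one dtype-aware approach with NumPy. It is not the implementation of `random_inputs_from_info`; `random_array` is a hypothetical helper that takes a dtype and a static shape directly.

```python
import numpy as np


def random_array(dtype, shape, seed=None):
    """Return a random array whose values fit the given dtype."""
    rng = np.random.default_rng(seed)
    dt = np.dtype(dtype)
    if np.issubdtype(dt, np.integer):
        info = np.iinfo(dt)
        # endpoint=True makes the upper bound inclusive, covering the full range.
        return rng.integers(info.min, info.max, size=tuple(shape), dtype=dt, endpoint=True)
    if np.issubdtype(dt, np.floating):
        # Uniform samples drawn from [0, 1), then cast to the target precision.
        return rng.random(size=tuple(shape)).astype(dt)
    raise TypeError(f"unsupported dtype: {dt}")


a = random_array("int8", [1, 4, 4, 3], seed=0)
b = random_array(np.float16, (2, 3), seed=0)
print(a.dtype, a.shape, b.dtype, b.shape)
```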