Quickstart

Two complete, runnable examples covering the two main pipeline modes.

Skip ROI setup

Use WholeSlideProvider() instead of RectROIProvider to patch the entire slide with no setup required.


Example 1: Stream patches to NumPy

Best for inference pipelines where you process patches in memory without writing to disk.

from pathlib import Path
from wsi_patching import AttachROIs, NumpyStreamWriter, PatchExtractor, RectROIProvider, WSIGrid

slides = [
    "./data/slide_a.tiff",
    "./data/slide_b.tiff",
]

# Map slide stem → list of (x, y, width, height) ROI tuples (pixel coords at selected resolution)
rois_dict = {Path(s).stem: [(0, 0, 18000, 10000)] for s in slides}

p = (
    WSIGrid(slides=slides, resolution=0, unit="level")
    .then(AttachROIs(providers=[RectROIProvider(rois_dict)]))
    .then(PatchExtractor(tile_size=256, stride=256))
    .to(NumpyStreamWriter(layout="NCHW"))
)

for wsi_id, images, coords, meta in p.stream(num_workers=4):
    # images: np.ndarray of shape (N, 3, 256, 256), dtype float32
    # coords: np.ndarray of shape (N, 2), pixel (x, y) top-left of each patch
    # meta:   list of dicts, one per patch, with slide metadata
    print(f"{wsi_id}: {images.shape}")

Notes:

  • resolution=0, unit="level" selects native level 0 (full resolution). See Concepts — Resolution for other options.
  • With num_workers > 1, batches are not guaranteed to arrive in slide order. Use wsi_id to identify the slide each batch belongs to, or set num_workers=1 for ordered output.
  • layout="NCHW" transposes from the native NHWC read order. Use layout="NHWC" to skip the transpose.
  • NumpyStreamWriter also accepts a dtype parameter (default np.float32).
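
The loop above can be extended to collect results per slide. The following is a minimal sketch and not part of wsi_patching: fake_stream is a hypothetical stand-in that yields tuples in the same (wsi_id, images, coords, meta) form described above, with batches from two slides mixed as they might be when num_workers > 1. It groups batches by wsi_id and shows the NCHW to NHWC transpose using plain NumPy.

```python
from collections import defaultdict

import numpy as np


def fake_stream():
    """Hypothetical stand-in for p.stream(): yields NCHW batches out of slide order."""
    rng = np.random.default_rng(0)
    for wsi_id, n in [("slide_a", 4), ("slide_b", 3), ("slide_a", 2)]:
        images = rng.random((n, 3, 256, 256), dtype=np.float32)
        coords = np.zeros((n, 2), dtype=np.int64)  # placeholder (x, y) values
        meta = [{"wsi_id": wsi_id} for _ in range(n)]
        yield wsi_id, images, coords, meta


def collect_by_slide(stream):
    """Group batches by wsi_id, then concatenate each slide's batches."""
    images_by_slide = defaultdict(list)
    coords_by_slide = defaultdict(list)
    for wsi_id, images, coords, _meta in stream:
        images_by_slide[wsi_id].append(images)
        coords_by_slide[wsi_id].append(coords)
    return {
        wsi_id: (
            np.concatenate(images_by_slide[wsi_id]),
            np.concatenate(coords_by_slide[wsi_id]),
        )
        for wsi_id in images_by_slide
    }


per_slide = collect_by_slide(fake_stream())
for wsi_id, (images, coords) in per_slide.items():
    # NCHW -> NHWC, e.g. for plotting code that expects channels last.
    images_nhwc = np.transpose(images, (0, 2, 3, 1))
    print(wsi_id, images.shape, images_nhwc.shape, coords.shape)
```

Holding every patch in memory is only practical for small inputs. In a real inference loop it is usually preferable to accumulate per-slide predictions instead of the raw image arrays.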

Example 2: Materialize to WebDataset

Best for creating large training datasets on disk as shuffled WebDataset shards.

from pathlib import Path
from wsi_patching import AttachROIs, PatchExtractor, PNGEncoder, RectROIProvider, WebDatasetWriter, WSIGrid

slides = [
    "./data/slide_a.tiff",
    "./data/slide_b.tiff",
]

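# Map slide stem → list of (x, y, width, height) ROI tuples (pixel coords at selected resolution)
rois_dict = {Path(s).stem: [(0, 0, 18000, 10000)] for s in slides}

p = (
    WSIGrid(slides=slides, resolution=0, unit="level")
    .then(AttachROIs(providers=[RectROIProvider(rois_dict)]))
    .then(PatchExtractor(tile_size=224, stride=224, max_batch_size=200))
    .then(PNGEncoder())  # required before WebDatasetWriter
    .to(WebDatasetWriter(outdir=Path("./output/"), shard_size=300, shuffle_buffer_size=500))
)

# Run the pipeline and write shards to disk; profile=True records stage timings
p.materialize(num_workers=4, profile=True)
p.print_profile()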

Notes:

  • PNGEncoder is required before WebDatasetWriter. The pipeline checks this at construction time and raises a TypeError immediately if the types do not match.
  • Output shards are written to ./output/shard-000000.tar, shard-000001.tar, etc.
  • Each shard entry has keys __key__ (e.g. slide_a_1024_2048), png (PNG bytes), and meta (JSON-encoded metadata dict).
  • shuffle_buffer_size controls how many patches accumulate before a random flush to disk. Larger values improve shuffle quality at the cost of memory.
  • After materialize(), p.failed_slides contains the names of any slides that were skipped due to errors (empty list if all succeeded).
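
The effect of shuffle_buffer_size can be illustrated with a generic shuffle buffer. This is a sketch of the general technique, assuming the behavior described above (accumulate, then flush in random order). It is not the WebDatasetWriter implementation, and shuffle_buffered and its parameters are hypothetical names.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_buffered(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in buffer-sized chunks, shuffling each chunk before it is flushed."""
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)
            yield from buffer
            buffer.clear()
    # Flush whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer


patches = [f"slide_a_{i}" for i in range(6)] + [f"slide_b_{i}" for i in range(6)]
small = list(shuffle_buffered(patches, buffer_size=3))
large = list(shuffle_buffered(patches, buffer_size=12))

# With a small buffer, the first 6 outputs are all still from slide_a;
# with a buffer covering the whole input, the two slides can mix freely.
print(small)
print(large)
```

A patch can only be reordered among the patches that share its buffer, so a small buffer leaves the output close to input order. Every buffered patch is held in memory until its flush, which is the memory cost mentioned above.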

Sample profile output:

=== Pipeline Profile (isolated timings only) ===
Stage                                Yields         Wall (s)   Avg (ms/yield)
PNGEncoder.isolated                    640            1.440s          2.412ms

--- Per slide breakdown ---
[slide_a]
  PNGEncoder.isolated          yields=  320    wall=  0.762s    avg=  2.382ms
[slide_b]
  PNGEncoder.isolated          yields=  320    wall=  0.778s    avg=  2.432ms