Reads multiple Parquet objects from MinIO and concatenates them into a single result. Objects can be given explicitly via objects, or discovered by listing prefix and optionally filtering the keys with the regex pattern. Objects can additionally be filtered by a date-like suffix in the key (_YYYY, _YYYYMM, or _YYYYMMDD) using the inclusive range date_from to date_to.

Usage

minio_read_many(
  bucket,
  objects = NULL,
  prefix = NULL,
  pattern = NULL,
  date_from = NULL,
  date_to = NULL,
  warn_bytes = 1e+09,
  out_dir = NULL,
  verbose = TRUE,
  use_https = TRUE,
  region = ""
)

Arguments

bucket

Character. Name of the MinIO bucket.

objects

Character vector or NULL. Explicit object keys to read.

prefix

Character or NULL. Prefix for listing objects when objects is NULL.

pattern

Character or NULL. Regex pattern to filter keys when using prefix.

date_from, date_to

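Character/numeric or NULL. Inclusive bounds for filtering keys by their date-like suffix. The number of digits sets the granularity: 4 digits is a year (YYYY), 6 digits is a year and month (YYYYMM), and 8 digits is a full date (YYYYMMDD). Example: date_from = 2020, date_to = 2024 keeps keys with a _YYYY suffix from 2020 through 2024.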

warn_bytes

Numeric. If total remote size exceeds this threshold, a warning is emitted. Defaults to 1e9 (≈ 1 GB).

out_dir

Character or NULL. If provided, results are streamed to this local directory as a Parquet dataset (one file per input object) and the function returns the directory path.

verbose

Logical. Whether to print progress messages. Defaults to TRUE.

use_https

Logical. Whether to use HTTPS when connecting to MinIO. Defaults to TRUE.

region

Character. Region string required by aws.s3. Defaults to an empty string.
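The date_from/date_to filter described above can be illustrated with a small sketch. This is not the package's implementation: filter_keys_by_suffix is a hypothetical helper, and it assumes that both bounds share the same number of digits and that only keys whose suffix has that same number of digits are candidates for matching.

```r
# Hypothetical helper (not part of the package): keep the keys whose
# date-like suffix lies inside the inclusive range [date_from, date_to].
filter_keys_by_suffix <- function(keys, date_from = NULL, date_to = NULL) {
  if (is.null(date_from) && is.null(date_to)) return(keys)

  # Normalise numeric or character bounds to plain digit strings.
  to_digits <- function(x) {
    if (is.null(x)) NULL else format(x, scientific = FALSE, trim = TRUE)
  }
  from <- to_digits(date_from)
  to <- to_digits(date_to)

  # Assumption: both bounds use the same granularity (4, 6 or 8 digits).
  width <- unique(nchar(c(from, to)))
  stopifnot(length(width) == 1, width %in% c(4, 6, 8))

  # Drop the file extension, then look for "_<digits>" at the end of the key.
  stem <- sub("\\.[^./]*$", "", keys)
  pat <- sprintf("_([0-9]{%d})$", width)
  hit <- grepl(pat, stem)

  value <- rep(NA_real_, length(keys))
  value[hit] <- as.numeric(sub(paste0("^.*", pat), "\\1", stem[hit]))

  keep <- hit
  if (!is.null(from)) keep <- keep & value >= as.numeric(from)
  if (!is.null(to)) keep <- keep & value <= as.numeric(to)
  keys[which(keep)]
}

keys <- c(
  "raw/x_202212.parquet", "raw/x_202301.parquet",
  "raw/x_202306.parquet", "raw/x_2023.parquet"
)
monthly <- filter_keys_by_suffix(keys, date_from = 202301, date_to = 202312)
```

Under these assumptions, the yearly key raw/x_2023.parquet is ignored when the bounds are monthly, because its suffix has a different number of digits.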

Value

If out_dir is NULL, returns a data.frame whose columns are the union of the columns found across all files read. If out_dir is provided, writes the dataset and returns out_dir (invisibly).

Details

The function computes total remote bytes using minio_get_metadata and warns if the estimated size exceeds warn_bytes. For large reads, you can stream results to a local Parquet dataset directory via out_dir.
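The size check can be pictured roughly as follows. This is a minimal sketch rather than the package's code: warn_if_large is a hypothetical helper, and the per-object sizes are passed in directly instead of being fetched with minio_get_metadata.

```r
# Hypothetical helper (not part of the package): sum object sizes and warn
# when the total exceeds the threshold.
warn_if_large <- function(sizes_bytes, warn_bytes = 1e9) {
  total <- sum(sizes_bytes, na.rm = TRUE)
  if (total > warn_bytes) {
    warning(
      sprintf(
        "Total remote size is %.2f GB (threshold %.2f GB); consider using out_dir.",
        total / 1e9, warn_bytes / 1e9
      ),
      call. = FALSE
    )
  }
  invisible(total)
}

# Three objects of 400 MB each exceed the default 1 GB threshold.
sizes <- c(4e8, 4e8, 4e8)
total <- suppressWarnings(warn_if_large(sizes))
```

Raising warn_bytes only silences the warning; it does not change how much data is read into memory.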

Concatenation is schema-union: the output includes all columns across all files. Missing columns in individual files are filled with NA.
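To make the schema-union behaviour concrete, here is a small base R sketch. bind_union is a hypothetical helper written for illustration only; the function itself may concatenate files by other means.

```r
# Hypothetical helper (not part of the package): row-bind data frames over
# the union of their columns, filling absent columns with NA.
bind_union <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, names)))
  aligned <- lapply(frames, function(df) {
    # Add every column this frame lacks, filled with NA.
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    # Put columns in a common order so rbind lines them up.
    df[all_cols]
  })
  do.call(rbind, aligned)
}

a <- data.frame(id = 1:2, x = c("a", "b"), stringsAsFactors = FALSE)
b <- data.frame(id = 3L, y = 2.5)
out <- bind_union(list(a, b))
```

Here out has the columns id, x and y: the rows from a have NA in y, and the row from b has NA in x.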

Examples

if (FALSE) { # \dontrun{
# Explicit keys
df <- minio_read_many(
  bucket = "assets",
  objects = c("raw/x_202301.parquet", "raw/x_202302.parquet")
)

# Prefix + pattern + date range by year
df <- minio_read_many(
  bucket = "assets",
  prefix = "raw/x/",
  pattern = "\\.parquet$",
  date_from = 2020,
  date_to = 2024
)

# Stream to local dataset (recommended for large volumes)
out <- minio_read_many(
  bucket = "assets",
  prefix = "raw/x/",
  pattern = "\\.parquet$",
  date_from = 202301,
  date_to = 202312,
  out_dir = "output/many_parquet_dataset"
)
} # }