K-Nearest Neighbor (KNN) Sampling from Candidates

Given a set of candidate vectors and a target vector, selects the k candidates nearest to the target by weighted Euclidean distance, then samples n indices from those neighbors. Sampling is either uniform or weighted by rank or by distance. A random seed can optionally be set for reproducibility.

Usage

knn_sample(
  candidates,
  target,
  k,
  n = 1,
  prob = FALSE,
  weights = NULL,
  seed = NULL,
  sampling = c("rank", "distance"),
  bandwidth = NULL,
  epsilon = 1e-08
)

Arguments

candidates: Numeric matrix or data frame. Each row is a candidate vector.
target: Numeric vector of length ncol(candidates). The target vector that candidates are compared against.
k: Integer. Number of nearest neighbors to consider (k <= number of candidates).
n: Integer. Number of samples to draw (default = 1).
prob: Logical. If TRUE, sampling probabilities favor closer neighbors; if FALSE (default), sampling is uniform among k nearest neighbors.
weights: Optional numeric vector of length equal to ncol(candidates). Feature weights for distance calculation. Default is equal weights.
seed: Optional integer. If provided, sets the random seed for reproducible sampling. The previous RNG state is restored on exit.
sampling: Character. Sampling method when prob = TRUE. Either "rank" (default; probability proportional to 1/rank, i.e. weights 1, 1/2, 1/3, ... before normalization) or "distance" (probability based on a Gaussian kernel of the distances). Ignored when prob = FALSE.
bandwidth: Numeric. Bandwidth parameter for distance-based sampling kernel. If NULL (default), uses median nearest neighbor distance. Only used when sampling = "distance" and prob = TRUE.
epsilon: Numeric. Small constant added to distance-based probabilities to prevent zero probabilities (default = 1e-8). Only used when sampling = "distance" and prob = TRUE.

Value

Integer vector of length n, giving indices of sampled candidates (rows of candidates).

Details

The function computes the weighted Euclidean distance between each candidate and the target. The k nearest neighbors (smallest distances) are identified.
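The distance step can be illustrated with a minimal sketch. This is not the package's implementation: nearest_k is a hypothetical helper, and it assumes the weights multiply the squared per-feature differences, which this page does not state explicitly.

```r
# Hypothetical helper (not part of the package): indices and distances
# of the k rows of `candidates` closest to `target`.
nearest_k <- function(candidates, target, k, weights = NULL) {
  x <- as.matrix(candidates)
  if (is.null(weights)) weights <- rep(1, ncol(x))  # equal weights by default
  # Subtract the target from every row, column by column.
  diffs <- sweep(x, 2, target, "-")
  # Assumption: weights scale the squared differences before summing.
  d <- sqrt(drop(diffs^2 %*% weights))
  idx <- order(d)[seq_len(k)]  # smallest distances first
  list(index = idx, distance = d[idx])
}

cand <- rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2))
nn <- nearest_k(cand, target = c(0, 0), k = 3)
nn$index     # 1 3 4
nn$distance  # 0 1 2
```

Because order() sorts ascending, the returned indices are already in rank order, which is the ordering the rank-based sampling mode relies on.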

Sampling modes:

prob = FALSE: All k neighbors are sampled with equal probability (uniform).
prob = TRUE, sampling = "rank": Probability proportional to 1/rank, normalized to sum to 1. The closest neighbor has the highest probability.
prob = TRUE, sampling = "distance": Probability based on a Gaussian kernel of the distance, exp(-distance^2 / (2 * bandwidth^2)), with epsilon added to prevent zero probabilities.
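As a rough sketch of the three modes above, the following computes sampling probabilities for k neighbors whose distances are already sorted from nearest to farthest. neighbor_probs is a hypothetical helper, not the package's code. Two details are assumptions: that epsilon is added before normalizing, and that the default bandwidth is the median of the k neighbor distances.

```r
# Hypothetical helper: sampling probabilities for neighbors sorted by distance.
neighbor_probs <- function(d, prob = FALSE, sampling = c("rank", "distance"),
                           bandwidth = NULL, epsilon = 1e-8) {
  sampling <- match.arg(sampling)
  k <- length(d)
  if (!prob) {
    p <- rep(1, k)                      # uniform
  } else if (sampling == "rank") {
    p <- 1 / seq_len(k)                 # 1, 1/2, 1/3, ...
  } else {
    # Assumption: default bandwidth is the median of the neighbor distances.
    if (is.null(bandwidth)) bandwidth <- median(d)
    # Assumption: epsilon is added before normalization.
    p <- exp(-d^2 / (2 * bandwidth^2)) + epsilon
  }
  p / sum(p)                            # normalize to sum to 1
}

d <- c(0.5, 1.0, 2.0)
neighbor_probs(d)                # 1/3 each
neighbor_probs(d, prob = TRUE)   # 6/11, 3/11, 2/11
p_dist <- neighbor_probs(d, prob = TRUE, sampling = "distance", bandwidth = 1)

# Draw 5 indices with replacement from hypothetical neighbor indices.
idx <- c(12L, 7L, 30L)
idx[sample.int(length(idx), size = 5, replace = TRUE, prob = p_dist)]
```

Indexing with sample.int() rather than calling sample(idx, ...) directly avoids base R's special case where a single number x is treated as the range 1:x, which matters when k = 1.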

Sampling is with replacement, so the same index can appear more than once in the result. If seed is set, the random seed is changed temporarily and the previous RNG state is restored on exit.
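The seed handling can be pictured with the common save-and-restore idiom around .Random.seed. The sketch below is illustrative only; with_temp_seed is a hypothetical helper, and the package may implement this differently.

```r
# Hypothetical helper: evaluate `expr` under a temporary seed, then put the
# global RNG state back the way it was.
with_temp_seed <- function(seed, expr) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (has_seed) old <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  expr  # lazily evaluated here, after the seed has been set
}

set.seed(1)
a <- runif(1)

set.seed(1)
s1 <- with_temp_seed(123, sample(10, 3))
b <- runif(1)

identical(a, b)  # TRUE: the outer random stream was not advanced
```

The practical consequence is that calling a function with a fixed seed inside a larger simulation does not disturb the random numbers drawn afterwards.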

Examples

set.seed(42)
candidates <- matrix(rnorm(50), ncol = 2)
target <- c(0, 0)

# Sample 1 index from 5 nearest neighbors, uniform probability
knn_sample(candidates, target, k = 5, n = 1)
#> [1] 21

# Sample 3 indices from 5 nearest neighbors, rank-weighted probability
knn_sample(candidates, target, k = 5, n = 3, prob = TRUE, seed = 123)
#> [1] 15  6 15

# Using feature weights (weight first dimension more heavily)
knn_sample(candidates, target, k = 5, n = 2, weights = c(2, 1), seed = 10)
#> [1] 17 10

# Distance-based sampling with custom bandwidth
knn_sample(candidates, target, k = 10, n = 5, prob = TRUE,
           sampling = "distance", bandwidth = 1.5)
#> [1]  6  6 10  8  8