Given a set of candidate vectors and a target vector, selects the indices of
the k candidates nearest to the target (by weighted Euclidean distance),
then samples n indices from these neighbors, either uniformly or with
rank-based or distance-based probabilities. Optionally, a random seed can be
set for reproducibility.
Usage
knn_sample(
  candidates,
  target,
  k,
  n = 1,
  prob = FALSE,
  weights = NULL,
  seed = NULL,
  sampling = c("rank", "distance"),
  bandwidth = NULL,
  epsilon = 1e-08
)
Arguments
- candidates
Numeric matrix or data frame. Each row is a candidate vector.
- target
Numeric vector. The target vector for comparison.
- k
Integer. Number of nearest neighbors to consider (k <= number of candidates).
- n
Integer. Number of samples to draw (default = 1).
- prob
Logical. If TRUE, sampling probabilities favor closer neighbors; if FALSE (default), sampling is uniform among the k nearest neighbors.
- weights
Optional numeric vector of length equal to ncol(candidates). Feature weights for distance calculation. Default is equal weights.
- seed
Optional integer. If provided, sets random seed for reproducible sampling. The seed state is restored on exit.
- sampling
Character. Sampling method when prob = TRUE. Either "rank" (default, probability decreases with neighbor rank: 1, 1/2, 1/3, ...) or "distance" (probability based on Gaussian kernel of distances). Ignored when prob = FALSE.
- bandwidth
Numeric. Bandwidth parameter for the distance-based sampling kernel. If NULL (default), the median nearest-neighbor distance is used. Only used when sampling = "distance" and prob = TRUE.
- epsilon
Numeric. Small constant added to distance-based probabilities to prevent zero probabilities (default = 1e-8). Only used when sampling = "distance" and prob = TRUE.
Details
The function computes the weighted Euclidean distance between each candidate
and the target. The k nearest neighbors (smallest distances) are
identified.
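The distance step above can be sketched roughly as follows. This is an illustration only, not the package's source: `weighted_dist` is a hypothetical helper, and it assumes that each weight multiplies the squared difference of its feature (other conventions, such as scaling the differences before squaring, are possible).

```r
# Hypothetical helper, not part of the package: weighted Euclidean distance
# from each row of `candidates` to `target`.
# Assumption: weights multiply the squared per-feature differences.
weighted_dist <- function(candidates, target, weights = NULL) {
  x <- as.matrix(candidates)
  if (is.null(weights)) weights <- rep(1, ncol(x))  # equal weights by default
  diffs <- sweep(x, 2, target, FUN = "-")           # subtract target from every row
  sqrt(drop(diffs^2 %*% weights))                   # one distance per candidate
}

candidates <- rbind(c(0, 1), c(3, 4), c(1, 0), c(2, 2))
target <- c(0, 0)

d <- weighted_dist(candidates, target)
# Row indices of the k = 2 nearest candidates, closest first
nearest <- order(d)[seq_len(2)]
```

`order()` returns row indices sorted by increasing distance, so taking the first k entries gives the neighbor set from which indices are later sampled.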
Sampling modes:
- prob = FALSE
All k neighbors are sampled with equal probability (uniform).
- prob = TRUE, sampling = "rank"
Probability = 1/rank, normalized. Closest neighbor has highest probability.
- prob = TRUE, sampling = "distance"
Probability based on Gaussian kernel: exp(-distance^2 / (2 * bandwidth^2)), with epsilon added to prevent zero probabilities.
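As a rough sketch of the three sampling modes listed above, the probabilities for k neighbors sorted by increasing distance might be computed like this. `neighbor_probs` is a hypothetical helper, not the package's code. It assumes that the default bandwidth is the median of the k neighbor distances and that epsilon is added before normalizing; the actual implementation may differ on both points.

```r
# Hypothetical helper: sampling probabilities for k neighbors whose
# distances `d_sorted` are given in increasing order.
neighbor_probs <- function(d_sorted, prob = FALSE,
                           sampling = c("rank", "distance"),
                           bandwidth = NULL, epsilon = 1e-8) {
  sampling <- match.arg(sampling)
  k <- length(d_sorted)
  if (!prob) return(rep(1 / k, k))        # uniform over the k neighbors
  if (sampling == "rank") {
    p <- 1 / seq_len(k)                   # weights 1, 1/2, 1/3, ...
  } else {
    # Assumption: default bandwidth is the median of the k neighbor distances
    if (is.null(bandwidth)) bandwidth <- stats::median(d_sorted)
    p <- exp(-d_sorted^2 / (2 * bandwidth^2)) + epsilon
  }
  p / sum(p)                              # normalize to sum to 1
}

d_sorted <- c(0.5, 1, 2)
p_uniform  <- neighbor_probs(d_sorted)
p_rank     <- neighbor_probs(d_sorted, prob = TRUE)
p_distance <- neighbor_probs(d_sorted, prob = TRUE,
                             sampling = "distance", bandwidth = 1)
```

With three neighbors, the rank weights 1, 1/2, 1/3 normalize to 6/11, 3/11 and 2/11. A probability vector like these could then be passed to `sample(..., replace = TRUE, prob = p)` to draw the n indices.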
Sampling is with replacement. If seed is set, the random seed will be
temporarily changed and restored on exit.
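One common way to get this "set, then restore" behavior in R is to save `.Random.seed` before calling `set.seed()` and put it back with `on.exit()`. The sketch below uses a hypothetical wrapper, `with_temp_seed`; it is offered as an illustration of the pattern and is not taken from the package.

```r
# Hypothetical helper: evaluate `expr` under a temporary seed, then restore
# the global RNG state to what it was before the call.
with_temp_seed <- function(seed, expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) old_seed <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  expr  # lazily evaluated here, after the temporary seed is in place
}

set.seed(1)
a <- runif(1)

set.seed(1)
draws <- with_temp_seed(123, sample(c(4, 9, 17), size = 5, replace = TRUE))
b <- runif(1)

# The seeded draw did not disturb the caller's random stream
identical(a, b)
```

Because the caller's RNG state is restored, code that runs after a seeded call produces the same random numbers it would have produced without that call.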
Examples
set.seed(42)
candidates <- matrix(rnorm(50), ncol = 2)
target <- c(0, 0)
# Sample 1 index from 5 nearest neighbors, uniform probability
knn_sample(candidates, target, k = 5, n = 1)
#> [1] 21
# Sample 3 indices from 5 nearest neighbors, rank-weighted probability
knn_sample(candidates, target, k = 5, n = 3, prob = TRUE, seed = 123)
#> [1] 15 6 15
# Using feature weights (weight first dimension more heavily)
knn_sample(candidates, target, k = 5, n = 2, weights = c(2, 1), seed = 10)
#> [1] 17 10
# Distance-based sampling with custom bandwidth
knn_sample(candidates, target, k = 10, n = 5, prob = TRUE,
           sampling = "distance", bandwidth = 1.5)
#> [1] 6 6 10 8 8