Getting Started with weathergenr

Introduction

The weathergenr package provides a comprehensive toolkit for stochastic weather generation and climate data analysis. It implements sophisticated statistical methods to create synthetic daily weather series that preserve the statistical properties of observed climate data while enabling scenario-based climate impact assessments.

This vignette demonstrates the core workflow:

Reading gridded meteorological data from NetCDF files
Generating synthetic weather realizations using coupled statistical methods
Evaluating the quality of synthetic weather against historical observations

Load Required Packages

Code

library(weathergenr)
library(dplyr)
library(ggplot2)

Related functions

[read_netcdf()]

[generate_weather()]

[evaluate_weather_generator()]

[write_netcdf()]

[apply_climate_perturbations()]

[analyze_wavelet_spectrum()]

Reading Gridded Multivariate Weather Data

About the Example Dataset

For this tutorial, we use the ERA5 dataset (ECMWF Reanalysis v5) included with the package. The data covers the Ntoum Basin in Gabon, a tropical watershed in West Central Africa. The dataset includes daily precipitation, mean temperature, minimum temperature, and maximum temperature at a 0.25-degree spatial resolution.

Map of Ntoum Basin, Gabon {width=“65%”, fig-align=“left”}

Loading NetCDF Data

The read_netcdf() function provides an efficient interface for extracting meteorological data from NetCDF files. It handles the complexities of NetCDF format and returns data in a tidy, analysis-ready structure optimized for the weathergenr workflow.

Key Features

The function performs several important operations:

Extracts spatial coordinates and temporal information
Organizes data into per-grid-cell time series
Handles calendar systems including leap day management
Provides variable renaming capabilities
Optionally drops variables with all missing values to reduce memory usage

Code

# Locate the example NetCDF file
ncfile <- system.file("extdata", "ntoum_era5_data.nc", package = "weathergenr")

# Read the NetCDF file with default settings
ncdata <- read_netcdf(ncfile)

Understanding the Output Structure

The read_netcdf() function returns a list with five components that organize different aspects of the climate data:

Code

# Display the top-level structure
names(ncdata)

[1] "data"       "grid"       "date"       "dimensions" "attributes"

1. Time Series Data

The data element contains a list of data frames, one for each grid cell. Each data frame represents a daily time series with columns for each meteorological variable:

Code

# Examine the first grid cell's time series
head(ncdata$data[[1]])

# A tibble: 6 × 7
  press_msl   kin temp_min temp_max  temp  kout precip
      <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl>  <dbl>
1     1008.  174.     24.4     28.1  25.9  410.   9.12
2     1008.  162.     23.7     28.1  24.9  410.  15.6 
3     1008.  201.     23.6     27.9  25.6  411.   4.84
4     1008.  177.     25.0     28.3  26.1  411.   5.20
5     1007.  218.     24.6     28.1  26.1  411.   1.54
6     1007.  199.     25.0     28.2  26.3  411.   8.35

This structure allows efficient access to complete time series for individual locations while maintaining a compact memory footprint.

2. Spatial Grid Information

The grid element provides spatial metadata for each grid cell:

Code

# View grid metadata
head(ncdata$grid)

# A tibble: 6 × 5
     id  xind  yind         x         y
  <int> <int> <int> <dbl[1d]> <dbl[1d]>
1     1     1     1      9.5       0.5 
2     2     2     1      9.75      0.5 
3     3     3     1     10         0.5 
4     4     1     2      9.5       0.25
5     5     2     2      9.75      0.25
6     6     3     2     10         0.25

Each row describes a grid cell with:

id: Sequential identifier for the grid cell
xind, yind: Array indices in the original NetCDF file
x, y: Spatial coordinates (longitude/latitude or projected coordinates)

3. Temporal Information

The date element contains the complete time axis as a vector of Date objects:

Code

# Display the first and last few dates
head(ncdata$date)

[1] "2000-01-01" "2000-01-02" "2000-01-03" "2000-01-04" "2000-01-05"
[6] "2000-01-06"

Code

tail(ncdata$date)

[1] "2020-12-26" "2020-12-27" "2020-12-28" "2020-12-29" "2020-12-30"
[6] "2020-12-31"

Code

# Confirm the total number of days
length(ncdata$date)

[1] 7671

4. Dimension Vectors

The dimensions element stores the raw dimension vectors from the NetCDF file:

Code

# View available dimensions
names(ncdata$dimensions)

[1] "longitude" "latitude"  "time"

5. Metadata Attributes

The attributes element preserves variable attributes, spatial reference information, and global metadata from the NetCDF file, ensuring traceability and documentation.

Advanced Usage

The read_netcdf() function supports several optional parameters for customized data loading:

Code

# Example: Load specific variables and rename them
ncdata <- read_netcdf(
  nc_path = ncfile,
  var = c("precip", "temp"),
  var_name = c(precip = "precipitation", temp = "temperature")
)

# Example: Handle 365-day calendar data (no leap days)
ncdata <- read_netcdf(
  nc_path = ncfile,
  keep_leap_day = FALSE
)

# Example: Round values to reduce precision and memory
ncdata <- read_netcdf(
  nc_path = ncfile,
  signif_digits = 4,
  drop_all_na = TRUE
)

Generating Stochastic Weather Realizations

Overview of the Weather Generation Method

The generate_weather() function implements a sophisticated multi-scale weather generation framework that combines several statistical methods:

Statistical Components

Wavelet Autoregressive Modeling (WARM): Captures low-frequency (annual to multi-year) climate variability by decomposing annual precipitation signals into wavelet components and simulating their temporal evolution using ARIMA models.
Markov Chain Model: Controls the persistence of wet and dry spells at the daily scale by modeling transitions between three precipitation states (dry, wet, extreme) with monthly-varying transition probabilities.
K-Nearest Neighbors (KNN) Resampling: Selects historical analogue years based on similarity in annual precipitation patterns and resamples daily weather sequences from these analogues while preserving spatial and temporal correlations.

This coupled approach ensures that synthetic weather series preserve both:

Temporal structure: Day-to-day persistence, seasonal cycles, and interannual variability
Spatial coherence: Correlations between grid cells and co-occurrence patterns of weather events
Statistical properties: Means, variances, extremes, and distributional characteristics

The method is described in detail by Steinschneider and Brown (2013).

Setting Up the Simulation

First, define the output directory and specify which meteorological variables to simulate:

Code

# Create a temporary directory for outputs
output_dir <- tempdir()

# Define variables to simulate
variables <- c("precip", "temp", "temp_min", "temp_max")

# Labels for plotting (optional)
variable_labels <- c("Precipitation", "Temperature (mean)", 
                     "Temperature (min)", "Temperature (max)")

# Number of stochastic realizations to generate
realization_num <- 3

Understanding Simulation Parameters

Key parameters control different aspects of the weather generation:

Temporal scope: n_years and start_year define the simulation period
Realization count: n_realizations specifies how many independent synthetic series to create
WARM parameters: Control annual-scale variability simulation
- warm_var: The primary driver variable (typically precipitation)
- warm_signif: Significance threshold for retaining wavelet components (0.90 = 90% confidence)
- warm_pool_size: Number of candidate annual traces to generate before filtering
KNN parameters: Control daily-scale resampling
- annual_knn_n: Number of historical analogue years to consider
Markov Chain parameters: Define precipitation state thresholds
- wet_q: Threshold for wet day classification (0.3 = 30th percentile)
- extreme_q: Threshold for extreme precipitation (0.8 = 80th percentile)

Running the Weather Generator

Code

# Set a seed for reproducibility
seed <- 123

# Generate synthetic weather series
stochastic_weather <- generate_weather(
  obs_data = ncdata$data,
  obs_grid = ncdata$grid,
  obs_dates = ncdata$date,
  vars = variables,
  n_years = 20,
  start_year = 2020,
  year_start_month = 1,
  n_realizations = realization_num,
  warm_var = "precip",
  warm_signif = 0.90,
  warm_pool_size = 10000,
  annual_knn_n = 100,
  wet_q = 0.2,
  extreme_q = 0.8,
  out_dir = output_dir,
  seed = seed,
  parallel = FALSE,
  n_cores = NULL,
  verbose = FALSE
)

Understanding the Output

The function returns a list containing:

resampled: The historical dates selected as analogues for each simulation day
dates: The simulated daily time axis (always 365-day calendar, Feb 29 excluded)

Code

# View the structure of the output
names(stochastic_weather)

[1] "resampled" "dates"

Code

# Check the simulated date range
head(stochastic_weather$dates)

[1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
[6] "2020-01-06"

Code

tail(stochastic_weather$dates)

[1] "2039-12-26" "2039-12-27" "2039-12-28" "2039-12-29" "2039-12-30"
[6] "2039-12-31"

Code

# Examine resampled dates for the first realization
head(stochastic_weather$resampled[[1]])

[1] "2001-01-01" "2001-01-02" "2001-01-03" "2016-01-02" "2016-01-03"
[6] "2016-01-04"

Interpreting Resampled Dates

The resampled dates show which historical days were selected as analogues. This information reveals:

Whether the generator is drawing from wet or dry historical periods
The diversity of historical conditions being sampled
Temporal clustering patterns in the resampling scheme

Code

# Display resampled dates for all realizations
stochastic_weather$resampled

# A tibble: 7,300 × 3
   rlz_1      rlz_2      rlz_3     
   <date>     <date>     <date>    
 1 2001-01-01 2002-01-01 2011-01-01
 2 2001-01-02 2002-01-01 2011-01-05
 3 2001-01-03 2002-01-01 2011-01-07
 4 2016-01-02 2002-01-02 2011-01-08
 5 2016-01-03 2002-01-03 2011-01-06
 6 2016-01-04 2002-01-07 2011-01-06
 7 2006-01-07 2002-12-19 2011-01-07
 8 2006-01-11 2007-01-08 2011-01-10
 9 2011-01-09 2007-01-09 2011-01-13
10 2011-01-13 2007-01-10 2011-01-09
# ℹ 7,290 more rows

Extracting Synthetic Weather Data

To access the actual weather values (not just the resampled dates), you need to extract the corresponding data from the historical record:

Code

# Example: Extract synthetic weather for all realizations
synthetic_data <- list()

for (i in 1:realization_num) {
  # Get the day order for this realization
  day_order <- match(stochastic_weather$resampled[[i]], ncdata$date)
  
  # Extract corresponding weather data for each grid cell
  synthetic_data[[i]] <- lapply(ncdata$data, function(cell_data) {
    cell_data[day_order, variables] %>%
      mutate(date = stochastic_weather$dates, .before = 1)
  })
}

Evaluating Synthetic Weather Series

Purpose of Evaluation

Validating the weather generator’s performance is crucial to ensure that synthetic weather series are suitable for impact modeling. The evaluate_weather_generator() function provides comprehensive diagnostic comparisons between simulated and observed weather across multiple statistical dimensions.

Preparing Data for Evaluation

The evaluation function requires:

Observed weather data in list format (one element per grid cell)
Simulated weather data structured similarly
Matching date columns in both datasets

Code

# Create day order indices for each realization
day_order <- sapply(
  1:realization_num,
  function(n) match(stochastic_weather$resampled[[n]], ncdata$date)
)

# Extract synthetic weather samples
rlz_sample <- list()
for (n in 1:realization_num) {
  rlz_sample[[n]] <- lapply(ncdata$data[ncdata$grid$id], function(x) {
    x[day_order[, n], ] %>%
      select(precip, temp, temp_min, temp_max) %>%
      mutate(date = stochastic_weather$dates, .before = 1)
  })
}

# Prepare observed data
obs_sample <- lapply(ncdata$data[ncdata$grid$id], function(x) {
  x %>%
    select(precip, temp, temp_min, temp_max) %>%
    mutate(date = ncdata$date, .before = 1)
})

Running the Evaluation

Code

# Evaluate weather generator performance
evaluation_plots <- evaluate_weather_generator(
  daily_sim = rlz_sample,
  daily_obs = obs_sample,
  vars = variables,
  variable_labels = variable_labels,
  n_realizations = realization_num,
  wet_q = 0.3,
  extreme_q = 0.8,
  output_dir = NULL,
  save_plots = FALSE
)

Understanding Evaluation Metrics

The evaluation function computes and visualizes several key aspects of weather generator performance:

1. Daily Statistics

Comparison of mean, standard deviation, and skewness for each variable across all grid cells. This assesses whether the generator preserves the central tendency, variability, and distributional shape of observed weather.

2. Wet and Dry Day Frequencies

For precipitation, the evaluation examines:

Total number of wet days (above the wet threshold)
Number of extreme precipitation days (above the extreme threshold)
Dry day frequencies

These metrics reveal whether the generator correctly reproduces precipitation occurrence patterns.

3. Spell Length Distributions

Analysis of consecutive wet and dry periods:

Mean wet spell length
Mean dry spell length
Distribution of spell durations

This assesses the generator’s ability to capture persistence in weather patterns.

4. Spatial Correlations

Inter-site correlation matrices for each variable, comparing observed versus simulated spatial coherence. This is critical for applications requiring realistic spatial patterns (e.g., regional hydrology).

Viewing Evaluation Results

Code

# Display all diagnostic plots
evaluation_plots

$daily_mean


$daily_sd


$spell_length


$wetdry_days_count


$crossgrid


$intergrid


$precip_cond_cor


$annual_pattern_precip


$annual_pattern_temp


$annual_pattern_temp_min


$annual_pattern_temp_max


$monthly_cycle


$annual_precip


attr(,"class")
[1] "weather_assessment" "list"              
attr(,"fit_summary")
# A tibble: 3 × 17
  rlz    rank overall_score mae_mean_precip mae_mean_temp mae_mean_temp_max
  <chr> <int>         <dbl>           <dbl>         <dbl>             <dbl>
1 3         1         0.360            2.67          17.0              19.2
2 1         2         0.412            2.70          17.0              19.2
3 2         3         0.695            2.67          17.0              19.2
# ℹ 11 more variables: mae_mean_temp_min <dbl>, mae_sd_precip <dbl>,
#   mae_sd_temp <dbl>, mae_sd_temp_max <dbl>, mae_sd_temp_min <dbl>,
#   mae_days_Dry <dbl>, mae_days_Wet <dbl>, mae_spell_Dry <dbl>,
#   mae_spell_Wet <dbl>, mae_cor_crossgrid <dbl>, mae_cor_intervariable <dbl>
attr(,"metadata")
attr(,"metadata")$n_grids
[1] 6

attr(,"metadata")$n_realizations
[1] 3

attr(,"metadata")$variables
[1] "precip"   "temp"     "temp_min" "temp_max"

attr(,"metadata")$assessment_date
[1] "2026-01-23"

Interpreting the Plots

Good weather generator performance is indicated by:

Scatter plots near the 1:1 line: Simulated statistics closely match observed values
Overlapping distributions: Similar shapes and ranges between observed and simulated
Consistent correlation patterns: Spatial correlations preserved in synthetic data
Balanced residuals: No systematic bias in any direction

Discrepancies may indicate:

Parameter tuning needed - adjust WARM or Markov chain thresholds
Insufficient historical analogue pool - increase KNN sample size
Structural limitations for specific climate regimes

Other Features

The weathergenr package supports additional capabilities not covered in this basic tutorial:

Climate Scenario Analysis

Use apply_climate_perturbations() to modify synthetic weather based on climate change scenarios:

Code

# Example: Apply temperature increase and precipitation changes
grid_with_lat <- ncdata$grid
grid_with_lat$lat <- grid_with_lat$y

perturbed_weather <- apply_climate_perturbations(
  data = synthetic_data[[1]],
  grid = grid_with_lat,
  date = stochastic_weather$dates,
  precip_mean_factor = rep(1.1, 12),
  precip_var_factor = rep(1.0, 12),
  temp_delta = rep(2.0, 12),
  temp_transient = FALSE,
  precip_transient = FALSE,
  compute_pet = FALSE
)

Wavelet Spectral Analysis

Explore low-frequency climate variability using wavelet transforms:

Code

# Analyze spectral characteristics of annual precipitation
wavelet_results <- analyze_wavelet_spectrum(
  series = annual_precip,
  signif = 0.90,
  noise = "red",
  min_period = 2,
  detrend = FALSE,
  mode = "fast"
)

Parallel Computing

For large spatial domains, enable parallel processing to accelerate computation:

Code

# Generate weather using multiple cores
stochastic_weather <- generate_weather(
  # ... other parameters ...
  parallel = TRUE,
  n_cores = 4
)

Saving and Exporting Results

Results can be written back to NetCDF format for use in other tools:

Code

# Example: Write synthetic weather to NetCDF
data_to_write <- lapply(synthetic_data[[1]], function(df) {
  as.list(df[variables])
})

write_netcdf(
  data = data_to_write,
  grid = ncdata$grid,
  out_dir = output_dir,
  origin_date = stochastic_weather$dates[1],
  calendar = "noleap",
  template_path = ncfile,
  file_prefix = "synthetic_weather",
  file_suffix = "rlz1"
)

Introduction

Load Required Packages

Reading Gridded Multivariate Weather Data

About the Example Dataset

Loading NetCDF Data

Key Features

Understanding the Output Structure

1. Time Series Data

2. Spatial Grid Information

3. Temporal Information

4. Dimension Vectors

5. Metadata Attributes

Advanced Usage

Generating Stochastic Weather Realizations

Overview of the Weather Generation Method

Statistical Components

Setting Up the Simulation

Understanding Simulation Parameters

Running the Weather Generator

Understanding the Output

Interpreting Resampled Dates

Extracting Synthetic Weather Data

Evaluating Synthetic Weather Series

Purpose of Evaluation

Preparing Data for Evaluation

Running the Evaluation

Understanding Evaluation Metrics

1. Daily Statistics

2. Wet and Dry Day Frequencies

3. Spell Length Distributions

4. Spatial Correlations

Viewing Evaluation Results

Interpreting the Plots

Other Features

Climate Scenario Analysis

Wavelet Spectral Analysis

Parallel Computing

Saving and Exporting Results

Further Reading