Spatial clustering based on correlation or other metrics.

cluster_locid(
  x,
  varname,
  locid = "locid",
  time = "UTC",
  locid_info = NULL,
  weight = NULL,
  group = NULL,
  k = NULL,
  max_loss = 0.05,
  verbose = TRUE,
  distance = "cor",
  cores = 1,
  ...
)

Arguments

x

`data.frame` (merra subset) with location and time identifiers, and a time-series variable to cluster.

varname

name of column with data to be used to cluster locations.

locid

name of column of location identifiers.

time

name of column with time dimension.

locid_info

(optional) `data.frame` or `sf` object with weights and/or spatial groups (regions) of location identifiers.

weight

(optional) name of column with (positive) weights in `locid_info`, used in calculating weighted `mean` and `sd` metrics.
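As an illustration of weighted metrics, the sketch below computes a weighted mean and a weighted standard deviation in base R. The values and weights are made up, and the population-style weighted `sd` formula is an assumption: it is one common definition and may differ from what `cluster_locid` uses internally.

```r
# Hypothetical values for four locations and their weights (e.g. cell area).
values  <- c(2, 4, 6, 8)
weights <- c(1, 1, 2, 4)

# Weighted mean: stats::weighted.mean() normalises the weights internally.
w_mean <- weighted.mean(values, weights)

# Weighted standard deviation around the weighted mean
# (population-style; assumed definition, not taken from the package).
w_sd <- sqrt(sum(weights * (values - w_mean)^2) / sum(weights))
```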

group

(optional) name of column with group-names of locations (such as regions). If provided, clustering will be made for each group separately.

k

(optional) integer vector of numbers of clusters to test. By default (`NULL`), the clustering process starts from `1`, increases towards the number of locations, and terminates when the `max_loss` condition is met.

max_loss

maximum loss of variation (standard deviation) of the clustered variable, measured as `1 - sd(clustered_variable) / sd(original_variable)`. The default value is `0.05`, meaning that clustering may lose up to 5 percent of the variability of the original, non-clustered variable.
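The loss formula can be checked by hand on a small synthetic example. The sketch below assumes that the clustered variable is obtained by replacing each location's series with the mean series of its cluster; that construction, the data, and the names `original`, `clustered`, and `cluster` are illustrative assumptions, not the package's internals.

```r
set.seed(1)
n_time <- 200

# Hypothetical series for four locations (columns): two similar pairs.
base_a <- rnorm(n_time)
base_b <- rnorm(n_time)
original <- cbind(
  loc1 = base_a + rnorm(n_time, sd = 0.1),
  loc2 = base_a + rnorm(n_time, sd = 0.1),
  loc3 = base_b + rnorm(n_time, sd = 0.1),
  loc4 = base_b + rnorm(n_time, sd = 0.1)
)

# Assumed cluster assignment with k = 2.
cluster <- c(loc1 = 1, loc2 = 1, loc3 = 2, loc4 = 2)

# Replace each location's series with the mean series of its cluster.
clustered <- original
for (cl in unique(cluster)) {
  members <- names(cluster)[cluster == cl]
  clustered[, members] <- rowMeans(original[, members, drop = FALSE])
}

# Loss of variability, as defined for `max_loss`.
sd_loss <- 1 - sd(clustered) / sd(original)

# TRUE when this clustering would satisfy the default threshold.
sd_loss <= 0.05
```

Averaging within a cluster removes only the within-cluster variation, so the loss stays small when the members of each cluster are highly similar.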

verbose

logical, whether the clustering process should be reported; `TRUE` by default.

distance

character name of the distance measure to use in `TSdist::KMedoids`. The default metric is `"cor"` - Pearson's correlation between the time-series variable in different locations. Allowed measures: `"euclidean", "manhattan", "minkowski", "infnorm", "ccor", "sts", "dtw", "keogh_lb", "edr", "erp", "lcss", "fourier", "tquest", "dissimfull", "dissimapprox", "acf", "pacf", "ar.lpc.ceps", "ar.mah", "ar.mah.statistic", "ar.mah.pvalue", "ar.pic", "cdm", "cid", "cor", "cort", "wav", "int.per", "per", "mindist.sax", "ncd", "pred", "spec.glk", "spec.isd", "spec.llr", "pdc", "frechet"`.
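To show what a correlation-based measure compares, the sketch below reshapes a long-format table into one column per location and derives a simple dissimilarity from Pearson's correlation in base R. The data, the variable name `w10m`, the integer time steps, and the `1 - cor` transform are assumptions for illustration; `TSdist` may transform the correlation differently.

```r
set.seed(42)
n_time <- 100
signal <- sin(seq_len(n_time) / 5)

# Hypothetical long-format input using the default column names `locid`
# and `UTC`; `w10m` is a made-up variable name, and the time steps are
# plain integers rather than real timestamps.
x <- data.frame(
  locid = rep(1:3, each = n_time),
  UTC   = rep(seq_len(n_time), times = 3),
  w10m  = c(signal + rnorm(n_time, sd = 0.1),   # location 1
            signal + rnorm(n_time, sd = 0.1),   # location 2, similar to 1
            -signal + rnorm(n_time, sd = 0.1))  # location 3, opposite phase
)

# Wide matrix: one row per time step, one column per location.
wide <- tapply(x$w10m, list(x$UTC, x$locid), mean)

cor_mat <- cor(wide)     # Pearson correlation between locations
dis_mat <- 1 - cor_mat   # simple dissimilarity: 0 means identical shape

round(dis_mat, 2)
```

Locations with similar temporal profiles end up close to each other in `dis_mat`, which is the kind of input a k-medoids routine groups into clusters.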

cores

integer number of processor cores to use, currently ignored.

...

additional parameters to pass to `TSdist::KMedoids`, might be required for some distance measures.

Value

`data.frame` covering the alternative numbers of clusters, with columns:

k

Number of clusters

N

Total number of time series

locid

location identifier in `merra2ools` datasets

"group"

(if provided) column with locid-groups

cluster

cluster number in every `k`-group

weight

weight of the cluster in the `k`-group

sd_N

standard deviation of the whole sample of (N) time-series

sd_k

standard deviation of clustered time series with `k` clusters

sd_loss

loss of standard deviation as a result of clustering, for each `k`
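A returned table with these columns can be used to pick the smallest number of clusters that stays within a chosen loss threshold. The sketch below uses a made-up table holding only the `k` and `sd_loss` columns, one row per `k`; the real output presumably has more rows and columns, and the numbers here are invented for illustration.

```r
# Made-up summary with the documented columns `k` and `sd_loss` only.
res <- data.frame(
  k       = c(1, 2, 3, 4, 5),
  sd_loss = c(0.40, 0.18, 0.07, 0.03, 0.01)
)

max_loss <- 0.05

# Smallest k whose loss of standard deviation is within the threshold.
k_best <- min(res$k[res$sd_loss <= max_loss])

# Rows of the result that belong to the selected k.
selected <- res[res$k == k_best, ]
```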

Examples

# see "Cluster locations" in "Get started"