High-Level Synthetic Data Generation with Data Set Archetypes (2024)

Michael J. Zellinger (corresponding author: zellinger@caltech.edu), Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA
Peter Bühlmann, Seminar for Statistics, ETH Zürich, Zürich, Switzerland

Abstract

Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the variation in cluster shapes, can be effectively varied. Unfortunately, curating evaluation scenarios is often laborious, as practitioners must find lower-level geometric parameters (like cluster covariance matrices) to match a higher-level scenario description like “clusters with very different shapes.” To make benchmarks more convenient and informative, we propose synthetic data generation based on data set archetypes. In this paradigm, the user describes an evaluation scenario in a high-level manner, and the software automatically generates data sets with the desired characteristics. Combining such data set archetypes with large language models (LLMs), it is possible to set up benchmarks purely from verbal descriptions of the evaluation scenarios. We provide an open-source Python package, repliclust, that implements this workflow. A demo of data generation from verbal inputs is available at demo.repliclust.org.

Keywords: Synthetic Data, Simulation, Validation, Clustering, Unsupervised Learning, Artificial Intelligence, Natural Language Processing

1 Introduction

The goal of clustering is to separate data points into groups such that points within a group are more similar to each other than to those outside the group (Cormack, 1971). In practice, it is often not clear what constitutes a cluster (Hennig, 2015). As a result, many practitioners evaluate cluster analysis algorithms on synthetic data (Milligan, 1980; Milligan et al., 1983; Tibshirani et al., 2001; Steinley and Brusco, 2008, 2011; Van Mechelen et al., 2023).

Synthetic data is valuable for two reasons. First, it clearly stipulates which data points belong to which cluster, allowing objective evaluation. Second, it allows independently manipulating different aspects of the data (such as the overlap between clusters or the variability of cluster shapes), which is critical for drawing scientific conclusions about the relative merits of different cluster analysis techniques (Milligan, 1996).

Unfortunately, setting up benchmarks with synthetic data can be laborious. The process typically involves creating data sets for a number of different scenarios. For example, on benchmarks with convex clusters drawn from probabilistic mixture models, the scenarios may involve “clusters of very different shapes and sizes”, “highly overlapping oblong clusters”, “high-dimensional spherical clusters”, etc. (Tibshirani et al., 2001).

Existing data generators do not cater directly to such high-level scenarios. Instead, the user must carefully tune simulation parameters to arrive at the desired scenarios (Steinley and Henson, 2005; Schubert and Zimek, 2019; Iglesias et al., 2019). While some generators make it easy to control the overlaps between clusters, such high-level control typically does not extend to other aspects like the desired diversity of cluster shapes and sizes (Qiu and Joe, 2006; Shand et al., 2019).

In this paper, we explore generating synthetic data directly from high-level descriptions. Our Python package, repliclust, accomplishes this goal by summarizing the overall geometry of probabilistic mixture models with a few high-level parameters. We use a large language model to map a user’s verbal description of a scenario onto these parameters. Although our approach is based on ellipsoidal clusters, we have implemented two post-processing functions for generating more irregular cluster shapes. The first makes clusters non-convex by passing them through a randomly initialized neural network. The second makes a $p$-dimensional data set directional by wrapping it around the $(p+1)$-dimensional sphere through an inverse stereographic projection.

2 Generating Data from High-Level Archetypes

[Figure 1: Overview of archetype-based data generation: an archetype first yields a probabilistic mixture model whose geometry matches it, and data sets are then sampled from that model.]

Our data generator repliclust is based on data set archetypes. A data set archetype is a high-level description of the overall geometry of a data set with clusters. For example, the class of all data sets with “three oblong and slightly overlapping clusters in two dimensions with some class imbalance” is a data set archetype.

We implement archetype-based generation by summarizing the overall geometry of probabilistic mixture models in terms of a few high-level parameters, as listed in Table 1. To create an archetype, users can either specify these parameters directly or verbally describe the archetype in English. To map an English description to a precise parameter setting, we use few-shot prompting of a large language model (Brown et al., 2020; OpenAI et al., 2024).

Table 1: High-level parameters defining a data set archetype.

Parameter(s): Purpose
n_clusters / dim / n_samples: select the number of clusters / dimensions / data points
aspect_ref / aspect_maxmin: determine how elongated vs. spherical clusters are / how much this varies between clusters
radius_maxmin: determine the variation in cluster volumes
max_overlap / min_overlap: set the maximum / minimum overlaps between clusters
imbalance_ratio: make some clusters have more data points than others
distributions / distribution_proportions: select the probability distributions appearing in each data set / how many clusters have each distribution¹

¹ Specifically, the proportion of clusters having a given distribution is the same in each data set. We are interested in adding a spread parameter that controls the variability of the distribution mix between data sets.
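To make these parameters concrete, the sketch below defines an archetype and samples a data set from it. The keyword arguments mirror the archetype listings in Section 4.1, but the exact signatures of the Archetype and DataGenerator classes are assumptions, so treat this as a usage sketch rather than an authoritative API reference.

```python
from repliclust import Archetype, DataGenerator

# Hypothetical usage sketch; parameter names follow Table 1.
archetype = Archetype(
    n_clusters=5,         # number of clusters
    dim=10,               # number of dimensions
    n_samples=500,        # total number of data points
    aspect_ref=1.5,       # typical cluster elongation
    aspect_maxmin=2.0,    # variation in elongation across clusters
    radius_maxmin=3.0,    # variation in cluster volumes
    max_overlap=0.05,     # no pair of clusters overlaps by more than 5%
    min_overlap=0.001,    # every cluster has a neighbor it overlaps with
    imbalance_ratio=2.0,  # largest cluster has twice the points of the smallest
    distributions=["normal", "exponential"],
)

X, y, _ = DataGenerator(archetype).synthesize()  # data matrix and cluster labels
```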

Most of the high-level geometric parameters describing an archetype are based on what we call “max-min sampling.” In this approach, the user controls a geometric attribute by specifying a reference value and a max-min ratio. In addition, a constraint ensures that the reference value is indeed typical for every data set. For example, the aspect ratio of a cluster measures how elongated it is. The reference value aspect_ref sets the typical aspect ratio among all clusters in a data set, while aspect_maxmin sets the ratio of the highest to the lowest aspect ratio. To make sure that aspect_ref is indeed the typical aspect ratio, max-min sampling enforces the location constraint $(\prod_{j=1}^{k} \alpha_j)^{1/k} = \texttt{aspect\_ref}$, where $\alpha_j$ is the aspect ratio of the $j$-th cluster. Appendix A gives more details on how we manage different geometric attributes using max-min sampling.
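As an illustration of max-min sampling, the following NumPy sketch (not the package's internal code) draws log-values spread according to the max-min ratio and re-centers them so the geometric mean equals the reference value exactly:

```python
import numpy as np

def sample_maxmin(k, ref_value, maxmin_ratio, seed=None):
    """Sample k positive values whose max/min ratio is roughly `maxmin_ratio`
    and whose geometric mean equals `ref_value` (the location constraint)."""
    rng = np.random.default_rng(seed)
    half_width = np.log(maxmin_ratio) / 2
    log_values = rng.uniform(-half_width, half_width, size=k)
    # Re-center so the mean log-value is zero; the geometric mean of the
    # returned values is then exactly `ref_value`.
    log_values -= log_values.mean()
    return ref_value * np.exp(log_values)

aspect_ratios = sample_maxmin(k=6, ref_value=1.5, maxmin_ratio=2.0, seed=42)
# Geometric mean equals aspect_ref = 1.5:
assert np.isclose(np.exp(np.log(aspect_ratios).mean()), 1.5)
```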

Once an archetype has been defined, sampling concrete data sets proceeds in two steps. First, the algorithm samples a new probabilistic mixture model whose geometric structure matches the archetype. Second, we draw i.i.d. samples from this mixture model to generate a data set. Figure 1 illustrates this flow.

To accommodate use cases in which variation of an archetype’s hyperparameters (n_clusters, dim, n_samples) is desired, we have implemented a function Archetype.sample_hyperparams for generating a list of archetypes with hyperparameters sampled from Poisson distributions centered on the original values (subject to rejection sampling based on user-specified minimum and maximum values).
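The idea can be sketched as follows, with a hypothetical helper (not the package's implementation) that draws each hyperparameter from a Poisson distribution centered on its original value and rejects draws outside user-specified bounds:

```python
import numpy as np

def sample_hyperparam(center, min_value, max_value, rng):
    """Poisson draw centered on `center`, rejecting out-of-range values."""
    while True:
        draw = rng.poisson(lam=center)
        if min_value <= draw <= max_value:
            return draw

rng = np.random.default_rng(0)
# Five archetype variants derived from (n_clusters=6, dim=10, n_samples=500).
variants = [
    {
        "n_clusters": sample_hyperparam(6, 2, 12, rng),
        "dim": sample_hyperparam(10, 2, 50, rng),
        "n_samples": sample_hyperparam(500, 100, 2000, rng),
    }
    for _ in range(5)
]
```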

3 Sampling Probabilistic Mixture Models

Generating a synthetic dataset with repliclust starts with sampling a probabilistic mixture model that matches the desired archetype.

Sampling a mixture model proceeds through the following steps:

1. Draw random cluster covariance matrices based on the archetype.

2. Randomly initialize the cluster centers. Adjust their placement using stochastic gradient descent to meet the desired constraints on the overlaps between clusters.

3. Sample the number $n_j$ of data points per cluster, based on the extent of class imbalance specified by the archetype.

4. Construct a data set $X$ and cluster labels $y$ by sampling $n_j$ data points i.i.d. from the mixture component describing cluster $j$. Return $(X, y, \mathcal{A})$, where $\mathcal{A}$ is the archetype.

5. Optionally, make cluster shapes non-convex by either

   (a) passing $X$ through a randomly initialized neural network (repliclust.distort), or

   (b) wrapping $X \in \mathbb{R}^{n \times p}$ around the $(p+1)$-dimensional sphere to create directional data (repliclust.wrap_around_sphere).

In the following sections, we give more details on each step involved in data generation.

3.1 Defining Clusters and Mixture Models

In the first four steps of data generation, repliclust models clusters as ellipsoidal probability distributions characterized by a central point. Specifically, each cluster $\mathcal{C}$ is defined by a cluster center $\mathbf{x}_{\mathcal{C}} \in \mathbb{R}^p$, orthogonal principal axes $\hat{\boldsymbol{u}}^{(1)}_{\mathcal{C}}, \hat{\boldsymbol{u}}^{(2)}_{\mathcal{C}}, \ldots, \hat{\boldsymbol{u}}^{(p)}_{\mathcal{C}}$ pointing in arbitrary directions, the lengths $\sigma^{(1)}_{\mathcal{C}}, \sigma^{(2)}_{\mathcal{C}}, \ldots, \sigma^{(p)}_{\mathcal{C}}$ of the principal axes, and a univariate probability distribution $f_{\mathcal{C}}(\cdot)$.

To generate an i.i.d. sample from a cluster, we 1) sample the direction $\hat{\mathbf{x}}$ from the ellipsoid defined by the cluster’s principal axes, and 2) sample the length $\|\mathbf{x}\|_2$ according to the cluster’s univariate distribution $f_{\mathcal{C}}$ (which can be one of many supported distributions, including normal, lognormal, exponential, Student’s t, gamma, chi-square, Weibull, Gumbel, F, Pareto, beta, and uniform). To make the spread of each cluster depend only on the lengths of the principal axes, we normalize each univariate distribution so that the 68.2% quantile of its absolute value is unity. For example, if the univariate distribution is exponential with rate $\lambda$, we would actually sample from the re-scaled random variable $\text{Exp}(\lambda)/q_{0.682}$, where the quantile $q_{0.682}$ satisfies $\mathbb{P}(|\text{Exp}(\lambda)| \leq q_{0.682}) = 0.682$.²

² This rescaling puts all distributions on the same scale as the standard normal distribution, which natively satisfies $\mathbb{P}(|\mathcal{N}(0,1)| \leq 1) \approx 0.682$.
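A simplified NumPy sketch of this two-step scheme, assuming an exponential radial distribution (whose 68.2% quantile has the closed form -log(1 - 0.682)):

```python
import numpy as np

def sample_cluster(n, center, axes, lengths, seed=None):
    """Sample n points from an ellipsoidal cluster: draw directions uniformly
    on the unit sphere, stretch them by the principal axes, and scale by a
    quantile-normalized radius (exponential, for illustration)."""
    rng = np.random.default_rng(seed)
    p = len(center)
    # 1) Uniform random directions on the unit sphere.
    directions = rng.standard_normal((n, p))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # 2) Radii from Exp(1), rescaled so the 68.2% quantile equals 1,
    #    matching the scale of a standard normal distribution.
    radii = rng.exponential(scale=1.0, size=n) / -np.log(1.0 - 0.682)
    # Stretch by axis lengths, rotate into the principal-axis frame (rows of
    # `axes` are the orthonormal principal axes), and shift to the center.
    return center + radii[:, None] * ((directions * lengths) @ axes)

X = sample_cluster(500, np.zeros(2), np.eye(2), np.array([3.0, 1.0]), seed=1)
```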

Figure 2(a) visualizes clusters with different base distributions. Note that using heavy-tailed distributions can lead to far outliers (not shown). By contrast, univariate distributions with bounded support give rise to clusters with crisply defined boundaries. Stretches where a probability density function vanishes give rise to concentric holes.

[Figure 2: (a) Clusters with different base distributions. (b) A two-dimensional mixture model with four clusters, together with a data set sampled from it.]

A mixture model (repliclust.MixtureModel) represents several clusters. Figure 2(b) shows a two-dimensional mixture model with four clusters, and a data set sampled from it. Note that our mixture models do not model class probabilities. Instead, the number of data points per cluster is drawn using max-min sampling, based on the archetype’s imbalance ratio (the desired ratio between the numbers of data points in the most and least populous clusters). Appendix B lists the formal attributes of a mixture model.

3.2 Managing Cluster Overlaps

Managing the degree of overlap between clusters is one of the most important tasks of a cluster generator. In repliclust, we quantify the pairwise overlap between two clusters as the irreducible error rate when classifying a new data point as belonging to one of the two clusters (assuming equal class probabilities). On the level of entire data sets, the user controls overlap by specifying the maximum and minimum pairwise overlaps.

The maximum overlap max_overlap states that no pair of clusters may overlap more than a certain amount. By contrast, the minimum overlap min_overlap prevents isolated clusters by enforcing that each cluster has some neighbor with which it overlaps at least min_overlap. We construct a loss function that enforces the minimum and maximum overlap conditions and optimize it using stochastic gradient descent (SGD).

In this section, we first describe our notion of pairwise overlap. Second, we explain the loss function we use to enforce the maximum and minimum overlap for an entire dataset.

3.2.1 Measuring Pairwise Overlap

Viewing clusters as probability distributions, we formally define the overlap between two clusters as twice the minimax error rate when classifying new data points using a linear decision boundary between the two clusters. Figure 3 illustrates this definition. To explain what we mean by “minimax” in this context, observe that any linear classifier $\hat{y}: \mathbb{R}^p \to \{1, 2\}$ depends on an axis $\boldsymbol{a} \in \mathbb{R}^p$ and a threshold $c \in \mathbb{R}$ such that

$$\hat{y}(\mathbf{x}) = \begin{cases} 1, & \text{if } \boldsymbol{a}^\top \mathbf{x} \leq c \\ 2, & \text{if } \boldsymbol{a}^\top \mathbf{x} > c. \end{cases} \qquad (1)$$

By definition, the minimax classifier $\hat{y}^*$ minimizes the worst-case loss. In symbols,

$$\max_{y} \, \mathbb{P}(y \neq \hat{y}^*(\mathbf{x}) \mid y) = \min_{\hat{y}} \, \max_{y} \, \mathbb{P}(y \neq \hat{y}(\mathbf{x}) \mid y), \qquad (2)$$

where $y \in \{1, 2\}$ is the true cluster label corresponding to a new data point $\mathbf{x} \in \mathbb{R}^p$. The outer minimum on the right-hand side ranges over all linear classifiers $\hat{y}$, including $\hat{y}^*$. Rewriting (2) in terms of the classification axes $\boldsymbol{a}$ and thresholds $c$ yields

$$\boldsymbol{a}^*, c^* = \operatorname*{arg\,min}_{\boldsymbol{a} \in \mathbb{R}^p,\, c \in \mathbb{R}} \, \max\big\{ \mathbb{P}(\boldsymbol{a}^\top \mathbf{x} > c \mid y = 1), \ \mathbb{P}(\boldsymbol{a}^\top \mathbf{x} \leq c \mid y = 2) \big\}. \qquad (3)$$

It is not hard to see that the minimax condition requires the cluster-specific error rates $\mathbb{P}(\boldsymbol{a}^{*\top}\mathbf{x} > c^* \mid y = 1)$ and $\mathbb{P}(\boldsymbol{a}^{*\top}\mathbf{x} \leq c^* \mid y = 2)$ to be equal. Consequently, the cluster overlap $\alpha$ becomes

$$\alpha = 2\, \mathbb{P}(\boldsymbol{a}^{*\top}\mathbf{x} > c^* \mid y = 1). \qquad (4)$$

Geometrically, our definition means that two clusters overlap at level $\alpha$ if their marginal distributions along the minimax classification axis $\boldsymbol{a}^*$ intersect at the $1 - \alpha/2$ and $\alpha/2$ quantiles. The left panel of Figure 3 highlights the probability mass bounded by these quantiles in gray.

Note that our formulation of minimax classification error explicitly does not take into account class probabilities for the clusters. Equation (2) depends only on the class-conditional probabilities. Our reasoning is that the underlying reality of each cluster depends on its class-conditional probability distribution, not on its class probability.

Steinley and Henson (2005) quantify cluster overlap by computing the full distributional overlap in $p$ dimensions. We prefer our one-dimensional notion in terms of minimax classification error, since high-dimensional Euclidean geometry exhibits a number of counterintuitive phenomena. Specifically, as $p \to \infty$, the majority of a sphere’s volume becomes concentrated in a thin shell near its surface, and so $p$-dimensional overlap goes to zero unless the marginal overlaps in each dimension approach 100% (Blum et al., 2020; Steinley and Henson, 2005).

[Figure 3: Pairwise cluster overlap, defined as twice the minimax error rate of a linear classifier separating the two clusters; the left panel highlights the corresponding probability mass in gray.]

Computing our cluster overlap $\alpha$ requires finding the minimax classification axis $\boldsymbol{a}^*$. Anderson and Bahadur (1962) describe an algorithm for computing $\boldsymbol{a}^*$ exactly in the case of multivariate normal distributions. Unfortunately, their method requires computing the matrix inverse $(t\boldsymbol{\Sigma}_1 + (1-t)\boldsymbol{\Sigma}_2)^{-1}$ for $O(\log(1/\epsilon))$ values of $t$, where $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2$ are the clusters’ covariance matrices and $\epsilon$ is a numerical tolerance. To avoid these matrix inversions, we propose an approximation of the minimax classification axis based on linear discriminant analysis (LDA).

For a pair of multivariate normal clusters with means $\boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_2$ and the same covariance matrix $\boldsymbol{\Sigma}$, the axis $\boldsymbol{a}_{\text{LDA}} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ minimizes the overall misclassification rate, assuming equal class probabilities (see Hastie et al., 2009). Unfortunately, the result does not hold for unequal covariance matrices $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$. Thus, we propose the approximation $\boldsymbol{a}_{\text{LDA}} = \big(\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\big)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$. Figure 4 verifies that this LDA-based approximation closely matches the exact overlap, as compared with a cruder “center-to-center” (C2C) approximation that uses the difference between the cluster centers as the classification axis (i.e., $\mathbf{a}_{\text{C2C}} = \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$).

[Figure 4: The LDA-based approximation closely matches the exact cluster overlap, while the center-to-center (C2C) approximation is noticeably cruder.]

Theorem 1 gives a formula for computing the LDA-approximate cluster overlap. The proof is in Appendix C.

Theorem 1 (LDA-Based Cluster Overlap)

For two multivariate normal clusters with means $\boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_2$ and covariance matrices $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2$, the approximate cluster overlap $\alpha_{\text{LDA}}$ based on the linear separator $\boldsymbol{a}_{\text{LDA}} = \big(\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\big)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ is

$$\alpha_{\text{LDA}} = 2\Big(1 - \Phi\Big(\frac{\boldsymbol{a}_{\text{LDA}}^\top(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)}{\sqrt{\boldsymbol{a}_{\text{LDA}}^\top \boldsymbol{\Sigma}_1 \boldsymbol{a}_{\text{LDA}}} + \sqrt{\boldsymbol{a}_{\text{LDA}}^\top \boldsymbol{\Sigma}_2 \boldsymbol{a}_{\text{LDA}}}}\Big)\Big), \qquad (5)$$

where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution. Moreover, if $\boldsymbol{\Sigma}_1 = \lambda \boldsymbol{\Sigma}_2$ for some $\lambda$, then $\alpha_{\text{LDA}}$ equals the exact cluster overlap $\alpha$.

Theorem 1 shows that cluster overlap depends on quantiles $q$ of the form

$$q(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2; \boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2; \boldsymbol{a}) = \frac{\boldsymbol{a}^\top(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)}{\sqrt{\boldsymbol{a}^\top \boldsymbol{\Sigma}_1 \boldsymbol{a}} + \sqrt{\boldsymbol{a}^\top \boldsymbol{\Sigma}_2 \boldsymbol{a}}}, \qquad (6)$$

where $\boldsymbol{a}$ is a classification axis. Since these quantiles are inversely related to cluster overlap, they quantify cluster separation.

When generating synthetic data, repliclust uses Theorem 1 even when cluster distributions are non-normal. Figure 12 in Appendix D suggests that controlling overlap between non-normal distributions works well, except for distributions with bounded support (such as beta).
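For concreteness, Theorem 1 translates directly into NumPy; the following is a sketch, not the package's internal implementation:

```python
import numpy as np
from scipy.stats import norm

def lda_overlap(mu1, mu2, sigma1, sigma2):
    """Approximate cluster overlap per Theorem 1, using the LDA axis
    a = ((Sigma1 + Sigma2) / 2)^{-1} (mu2 - mu1)."""
    a = np.linalg.solve((sigma1 + sigma2) / 2, mu2 - mu1)
    # Separation quantile q from Equation (6).
    q = (a @ (mu2 - mu1)) / (np.sqrt(a @ sigma1 @ a) + np.sqrt(a @ sigma2 @ a))
    return 2 * (1 - norm.cdf(q))

# Two Gaussian clusters in 2D with unequal covariance matrices.
mu1, mu2 = np.zeros(2), np.array([3.0, 0.0])
sigma1 = np.diag([1.0, 0.5])
sigma2 = np.array([[2.0, 0.3], [0.3, 1.0]])
print(lda_overlap(mu1, mu2, sigma1, sigma2))  # roughly 0.21
```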

3.3 Adjusting Cluster Centers to Meet Overlap Constraints

To enforce maximum and minimum overlaps for a mixture model, we optimize a loss function using stochastic gradient descent (SGD). Given mixture model parameters $\boldsymbol{\theta}$ (including the cluster centers, principal axes, and principal axis lengths), the overlap loss is

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{k}\sum_{i=1}^{k} \ell_i, \qquad (7)$$

where the loss on the $i$-th cluster is

$$\ell_i = p_\lambda\big((\min_{j \neq i} q_{ij} - q_{\text{max}})^{+}\big) + \sum_{j \neq i} p_\lambda\big((q_{\text{min}} - q_{ij})^{+}\big). \qquad (8)$$

Here, $q_{\text{min}}$, $q_{\text{max}}$, and $q_{ij}$ measure cluster separation as expressed in Equation (6): $q_{\text{min}}$ is the minimum allowed separation (corresponding to overlap $\alpha = \texttt{max\_overlap}$); $q_{\text{max}}$ is the maximum allowed separation (corresponding to $\alpha = \texttt{min\_overlap}$); and $q_{ij}$ is the separation between the $i$-th and $j$-th clusters. The penalty function $p_\lambda$ is the polynomial $p_\lambda(x) := \lambda x + (1 - \lambda)x^2$, where $\lambda$ is a tuning parameter. Finally, $x^{+} := \max(x, 0)$ is a filter that passes on only positive inputs (corresponding to a violation of the user-specified constraints).

3.3.1 Intuition behind the Overlap Loss

By design, the loss (8) vanishes when the cluster centers, principal axes, and principal axis lengths satisfy the user-specified overlap constraints. The first term penalizes violation of the minimum overlap condition. Indeed, if cluster $i$ is too far away from the other clusters, the separation $\min_{j \neq i} q_{ij}$ between cluster $i$ and its closest neighbor exceeds the maximum allowed separation $q_{\text{max}}$. A penalty on the excess $(\min_{j \neq i} q_{ij} - q_{\text{max}})^{+}$ yields the first term in (8). The second term measures violation of the maximum overlap condition. If the separation $q_{ij}$ between clusters $i$ and $j$ falls short of the smallest allowed separation $q_{\text{min}}$, the shortfall $(q_{\text{min}} - q_{ij})^{+}$ incurs a penalty that serves to push these clusters apart.

The penalty $p_\lambda$ in (8) ranges from quadratic to linear based on the value of $\lambda$. Keeping the penalty partly linear ($\lambda > 0$) helps gradient descent drive the overlap loss exactly to zero, because a purely quadratic loss would have a vanishing derivative as the overlap constraints approach satisfaction.
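Given a matrix of the pairwise separations $q_{ij}$ from Equation (6), the overlap loss can be evaluated as in this sketch:

```python
import numpy as np

def overlap_loss(Q, q_min, q_max, lam=0.5):
    """Mean per-cluster loss from Equations (7)-(8), given a symmetric matrix
    Q of pairwise separations q_ij (diagonal entries are ignored)."""
    k = Q.shape[0]
    p_lambda = lambda x: lam * x + (1 - lam) * x**2  # applied to x >= 0
    total = 0.0
    for i in range(k):
        q_i = np.delete(Q[i], i)  # separations from cluster i to all others
        # Minimum-overlap term: the nearest neighbor must not be too far away.
        total += p_lambda(max(q_i.min() - q_max, 0.0))
        # Maximum-overlap terms: no neighbor may be too close.
        total += p_lambda(np.maximum(q_min - q_i, 0.0)).sum()
    return total / k
```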

3.3.2 Running the Minimization in Practice

When minimizing (7) using SGD, we initially place the cluster centers randomly within a sphere. The volume $V$ of this sphere influences the initial overlaps between clusters. To select an appropriate value, we fix the ratio $\rho$ of the sum of cluster volumes to $V$; essentially, $\rho$ is the density of clusters within the sampling volume. Values of $\rho$ around 10% work well in low dimensions. In higher dimensions, however, results from the mathematics of sphere packing motivate a downward adjustment. A lower bound by Ball (1992) states that the maximum achievable density when packing non-overlapping spheres inside $\mathbb{R}^p$ is at least $p\,2^{1-p}$. Thus, in $p$ dimensions we use an adjusted density $\rho^{\text{adj}}$ defined by

$$\rho^{\text{adj}}(p) = p\,2^{1-p}\rho^{2D},$$

where $\rho^{2D} \approx 10\%$ is the equivalent density in 2D.
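A small sketch of this initialization computation, under the simplifying assumption that each cluster's volume is summarized by an effective $p$-ball radius:

```python
import numpy as np
from scipy.special import gamma

def init_sphere_radius(effective_radii, p, rho_2d=0.10):
    """Radius of the sphere for initial center placement, chosen so that the
    ratio of total cluster volume to sampling volume equals
    rho_adj(p) = p * 2**(1 - p) * rho_2d."""
    rho_adj = p * 2.0 ** (1 - p) * rho_2d
    unit_ball = np.pi ** (p / 2) / gamma(p / 2 + 1)  # volume of the unit p-ball
    total_cluster_volume = np.sum(unit_ball * np.asarray(effective_radii) ** p)
    sampling_volume = total_cluster_volume / rho_adj
    return (sampling_volume / unit_ball) ** (1 / p)

print(init_sphere_radius([1.0, 2.0, 1.5], p=2))   # 2D: density 10%
print(init_sphere_radius([1.0, 2.0, 1.5], p=10))  # 10D: much lower density
```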

Following initialization, we optimize the cluster centers $\{\boldsymbol{\mu}_i\}_{i=1}^{k}$ using stochastic gradient descent. During this process, the principal axes and their lengths are fixed. Each iteration performs the update

$$\big[\boldsymbol{\mu}_1 \,|\, \boldsymbol{\mu}_2 \,|\, \ldots \,|\, \boldsymbol{\mu}_k\big] \leftarrow \big[\boldsymbol{\mu}_1 \,|\, \boldsymbol{\mu}_2 \,|\, \ldots \,|\, \boldsymbol{\mu}_k\big] - \eta\,\big[\nabla_{\boldsymbol{\mu}_1}\ell_i \,|\, \nabla_{\boldsymbol{\mu}_2}\ell_i \,|\, \ldots \,|\, \nabla_{\boldsymbol{\mu}_k}\ell_i\big] \qquad (9)$$

on the single-cluster loss $\ell_i$, where $\eta$ is the learning rate. For each epoch, we randomly permute the order $i = 1, 2, \ldots, k$ of the clusters and apply the update (9) in turn for each cluster.

Experiments suggest that our minimization procedure drives the overlap loss to zero at an exponential rate (linear convergence rate), as expected for gradient descent. The number of epochs required seems largely independent of the number of clusters, though it increases slightly with the number of dimensions.

3.4 Making Cluster Shapes More Irregular

In many application domains, ellipsoidal center-based clusters are good models for the data. This is often the case when clusters can be characterized by a “typical” element. For example, in single-cell RNA sequencing, Chen et al. (2020) represent the differences in transcriptional activity across cell types using Gaussian mixture models. In this case, the cluster centers correspond to each cell type’s typical gene expression profile. Similarly, in deep learning, a large language model’s hidden states often form compact clusters that can effectively classify text (Howard and Ruder, 2018), assess the truthfulness of a response (Azaria and Mitchell, 2023), and help detect out-of-distribution inputs (Ren et al., 2023).

However, convex clusters can be inappropriate. For example, Ester et al. (1996) developed the density-based DBSCAN algorithm specifically to handle spatial (GPS) data. Characterized by thin loops and irregular shapes, such data does not fit a center-based clustering paradigm at all. Another example is directional data lying on a sphere (Banerjee et al., 2005; Salah and Nadif, 2019). In this case, the clusters have centers but are not convex.

To accommodate use cases where ellipsoidal clusters are inappropriate, repliclust provides two post-processing functions for creating non-convex clusters. The first, repliclust.distort, passes a data set through a randomly initialized neural network, as shown in Figure 5. The second, repliclust.wrap_around_sphere, wraps a $p$-dimensional data set around the $(p+1)$-dimensional sphere by applying an inverse stereographic projection, as shown in Figure 6. Appendix D describes the default architecture of the neural network used in repliclust.distort.
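The inverse stereographic projection itself admits a compact sketch; the formula below is one standard variant, and repliclust's exact scaling conventions may differ:

```python
import numpy as np

def wrap_around_sphere(X):
    """Map an (n, p) data set onto the unit sphere in R^(p+1) via the inverse
    stereographic projection x -> (2x, ||x||^2 - 1) / (||x||^2 + 1)."""
    sq_norms = np.sum(X**2, axis=1, keepdims=True)
    top = 2 * X / (sq_norms + 1)              # first p coordinates
    last = (sq_norms - 1) / (sq_norms + 1)    # new (p+1)-th coordinate
    return np.hstack([top, last])

X = np.random.default_rng(2).normal(size=(100, 2))
S = wrap_around_sphere(X)
# All projected points lie on the unit sphere in R^3.
assert np.allclose(np.linalg.norm(S, axis=1), 1.0)
```

How much of the sphere the data covers depends on the scale of $X$, since points far from the origin map close to the north pole.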

[Figure 5: Non-convex clusters created by passing data sets through repliclust.distort.]

[Figure 6: Directional data created by repliclust.wrap_around_sphere.]

3.5 Generating Archetypes from Natural Language

To fully realize the potential of a high-level, archetype-based approach to synthetic data generation, our package repliclust draws on the OpenAI API (OpenAI, 2023) to allow users to create data set archetypes directly from verbal descriptions in English.

To map a user’s natural language input to the high-level geometric parameters of Table 1, we use few-shot prompting of OpenAI’s GPT-4o large language model (Brown et al., 2020; OpenAI et al., 2024). We use the same approach to automatically create short identifiers for archetypes, such as “ten_highly_oblong_very_different_shapes_moderate_overlap”. These identifiers contain no white space and obey the naming rules for Python variables.

Version 1.0.0 of repliclust uses 22-shot prompting to map archetype descriptions to parameters, and 11-shot prompting to generate archetype identifiers, which appears to work well. In the rare case that the natural language workflow fails, repliclust throws an error. We release all few-shot examples and prompt templates in the natural_language module of our code base. For convenience, they are reproduced in Appendix F.
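Conceptually, the few-shot mapping can be sketched with the OpenAI chat API as follows; the system prompt and the single few-shot example are illustrative placeholders, not the templates shipped in the natural_language module:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One placeholder few-shot example; repliclust uses 22 such pairs.
few_shot = [
    {"role": "user",
     "content": "three oblong, slightly overlapping clusters in 2D"},
    {"role": "assistant",
     "content": json.dumps({
         "n_clusters": 3, "dim": 2, "n_samples": 300,
         "aspect_ref": 3.0, "aspect_maxmin": 1.5, "radius_maxmin": 2,
         "imbalance_ratio": 1.5, "max_overlap": 0.08, "min_overlap": 0.02,
         "distributions": ["normal"]})},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Map the user's description of a clustered data set "
                    "to JSON archetype parameters."},
        *few_shot,
        {"role": "user",
         "content": "five highly overlapping spherical clusters in 10D"},
    ],
)
params = json.loads(response.choices[0].message.content)
```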

4 Results

In this section, we first illustrate the convenience of archetype-based data generation by conducting a mock benchmark between several clustering algorithms. Second, we verify that our notion of cluster overlap effectively captures clustering difficulty (even in higher dimensions), and study the distribution of pairwise overlaps in data sets with multiple clusters.

4.1 Mock Benchmark

We demonstrate the convenience and informativeness of our archetype-based data generator by running a mock benchmark comparing several clustering algorithms on data drawn from different archetypes. The benchmark is not meant to comprehensively evaluate the strengths and weaknesses of the clustering algorithms we consider. Instead, our goal is to demonstrate the advantages (in convenience and informativeness) of an archetype-based approach to synthetic data generation.

First, we construct six archetypes by providing repliclust with the following verbal descriptions:

1. ‘twelve clusters of different distributions’

2. ‘twelve clusters of different distributions and high class imbalance’

3. ‘seven highly separated clusters in 10D with very different shapes’

4. ‘seven clusters in 10D with very different shapes and significant overlap’

5. ‘four clusters in 100D with 100 samples each’

6. ‘four clusters in 100D with 1000 samples each’

Repliclust maps these descriptions to the following data set archetypes³:

³ With one exception: the automatically generated names for archetypes 5 and 6 did not end in the suffix “_each.” We added this suffix for clarity.

1. {‘name’: ‘twelve_clusters_different_distributions’, ‘n_clusters’: 12, ‘dim’: 2, ‘n_samples’: 1200, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 2, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.05, ‘min_overlap’: 0.001, ‘distributions’: [‘normal’, ‘exponential’, ‘gamma’, ‘weibull’, ‘lognormal’]}

2. {‘name’: ‘twelve_different_distributions_high_class_imbalance’, ‘n_clusters’: 12, ‘dim’: 2, ‘n_samples’: 1200, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 2, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 5, ‘max_overlap’: 0.05, ‘min_overlap’: 0.001, ‘distributions’: [‘normal’, ‘exponential’, ‘gamma’, ‘weibull’, ‘lognormal’]}

3. {‘name’: ‘seven_highly_separated_10d_very_different_shapes’, ‘n_clusters’: 7, ‘dim’: 10, ‘n_samples’: 700, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 3.0, ‘radius_maxmin’: 3.0, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.0001, ‘min_overlap’: 1e-05, ‘distributions’: [‘normal’, ‘exponential’]}

4. {‘name’: ‘seven_very_different_shapes_significant_overlap_10d’, ‘n_clusters’: 7, ‘dim’: 10, ‘n_samples’: 700, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 3.0, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.2, ‘min_overlap’: 0.1, ‘distributions’: [‘normal’, ‘exponential’]}

5. {‘name’: ‘four_clusters_100d_100_samples_each’, ‘n_clusters’: 4, ‘dim’: 100, ‘n_samples’: 400, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 2, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.05, ‘min_overlap’: 0.001, ‘distributions’: [‘normal’, ‘exponential’]}

6. {‘name’: ‘four_clusters_100d_1000_samples_each’, ‘n_clusters’: 4, ‘dim’: 100, ‘n_samples’: 4000, ‘aspect_ref’: 1.0, ‘aspect_maxmin’: 1.0, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.05, ‘min_overlap’: 0.001, ‘distributions’: [‘normal’, ‘exponential’]}

4.1.1 Examining the Generated Data Sets

Figure 7 shows four representative data sets with convex clusters drawn from each archetype. Figure 8 shows non-convex clusters resulting from applying the distort function (as described in Section 3.4). For archetypes with dimensionality greater than two, we use t-SNE to visualize the data sets in 2D (van der Maaten and Hinton, 2008).

The figures show that repliclust generates data sets with consistent geometric characteristics within each archetype. Moreover, the data sets match up well with the user-specified verbal descriptions. Applying distort seems to make some clusters very long and thin, but otherwise results in satisfying non-convex clusters. The distort function seems to roughly preserve cluster overlaps; we leave a careful study of this to future work. The 10- and 100-dimensional archetypes indicate that our overlap control effectively handles different numbers of dimensions.

[Figure 7: Representative data sets with convex clusters drawn from each of the six archetypes.]

[Figure 8: Non-convex clusters resulting from applying the distort function to data sets drawn from the six archetypes.]

4.1.2 Comparing the Clustering Algorithms

We compare the following clustering algorithms: K-Means, hierarchical (with Ward linkage), spectral (with radial-basis function affinity), HDBSCAN, and expectation maximization for a Gaussian mixture model (EM-GMM). We originally intended to include DBSCAN as well. However, the heuristics we tried for choosing the neighborhood radius (including the one suggested by Sander et al., 1998) did not work well across the range of dimensionalities we consider, resulting in too many noise points (see also the discussion by Schubert et al., 2017).

We measure performance in terms of adjusted mutual information (AMI) and adjusted Rand index (ARI), based on the ground truth cluster labels (Vinh et al., 2010; Hubert and Arabie, 1985). To carry out the benchmark, we sample 10 times from each archetype, resulting in 60 distinct data sets. We repeat this process twice to evaluate performance on convex and non-convex clusters (the latter resulting from applying the distort function on new data sets). Thus, the full benchmark is based on 120 distinct data sets. Table 2 shows the results for K-Means, hierarchical, spectral, and EM-GMM. All these algorithms receive the true number of clusters as an input.

Table 3 separately lists the performance of HDBSCAN.⁴ We do this for two reasons. First, as a density-based algorithm, HDBSCAN cannot make use of the true number of clusters, putting it at a disadvantage. Second, HDBSCAN reports a large number of noise points (about 60% on average), so its results are not directly comparable to those of the other algorithms. We report the performance of HDBSCAN based only on the non-noise points. This leads to higher performance numbers, which may compensate for the disadvantage HDBSCAN faces in not knowing the true number of clusters.

⁴ We used min_samples=5 in the scikit-learn implementation (Pedregosa et al., 2011).

Table 2: Mean AMI and ARI (standard errors in parentheses) on data sets drawn from each archetype, for convex and non-convex clusters.

AMI, convex clusters (EM-GMM | K-Means | Spectral | Hierarchical):
twelve_clusters_different_distributions: 0.884 (.007) | 0.848 (.008) | 0.859 (.010) | 0.853 (.007)
twelve_different_distributions_high_class_imbalance: 0.854 (.008) | 0.836 (.007) | 0.852 (.010) | 0.834 (.009)
seven_highly_separated_10d_very_different_shapes: 0.983 (.009) | 0.976 (.005) | 0.986 (.003) | 0.990 (.002)
seven_very_different_shapes_significant_overlap: 0.349 (.019) | 0.452 (.009) | 0.443 (.014) | 0.380 (.010)
four_clusters_100d_100_samples_each: 0.063 (.008) | 0.512 (.028) | 0.074 (.023) | 0.654 (.022)
four_clusters_100d_1000_samples_each: 0.548 (.061) | 0.664 (.023) | 0.205 (.014) | 0.868 (.025)
Average: 0.614 (.044) | 0.715 (.025) | 0.570 (.046) | 0.763 (.026)

AMI, non-convex clusters (EM-GMM | K-Means | Spectral | Hierarchical):
twelve_clusters_different_distributions: 0.576 (.017) | 0.563 (.012) | 0.565 (.015) | 0.565 (.011)
twelve_different_distributions_high_class_imbalance: 0.544 (.021) | 0.542 (.023) | 0.547 (.025) | 0.553 (.024)
seven_highly_separated_10d_very_different_shapes: 0.750 (.035) | 0.606 (.027) | 0.709 (.027) | 0.637 (.029)
seven_very_different_shapes_significant_overlap: 0.173 (.014) | 0.149 (.008) | 0.165 (.010) | 0.145 (.010)
four_clusters_100d_100_samples_each: 0.077 (.011) | 0.100 (.012) | 0.079 (.008) | 0.092 (.008)
four_clusters_100d_1000_samples_each: 0.307 (.030) | 0.145 (.014) | 0.109 (.012) | 0.132 (.016)
Average: 0.404 (.032) | 0.351 (.030) | 0.362 (.033) | 0.354 (.031)

ARI, convex clusters (EM-GMM | K-Means | Spectral | Hierarchical):
twelve_clusters_different_distributions: 0.855 (.011) | 0.795 (.012) | 0.807 (.018) | 0.798 (.014)
twelve_different_distributions_high_class_imbalance: 0.800 (.019) | 0.760 (.012) | 0.787 (.020) | 0.756 (.017)
seven_highly_separated_10d_very_different_shapes: 0.967 (.020) | 0.977 (.005) | 0.988 (.003) | 0.992 (.002)
seven_very_different_shapes_significant_overlap: 0.238 (.023) | 0.375 (.012) | 0.385 (.016) | 0.291 (.015)
four_clusters_100d_100_samples_each: 0.004 (.007) | 0.404 (.036) | 0.053 (.014) | 0.554 (.028)
four_clusters_100d_1000_samples_each: 0.408 (.048) | 0.601 (.040) | 0.185 (.011) | 0.874 (.032)
Average: 0.545 (.047) | 0.652 (.030) | 0.534 (.044) | 0.711 (.031)

ARI, non-convex clusters (EM-GMM | K-Means | Spectral | Hierarchical):
twelve_clusters_different_distributions: 0.368 (.021) | 0.365 (.015) | 0.364 (.015) | 0.362 (.012)
twelve_different_distributions_high_class_imbalance: 0.341 (.022) | 0.347 (.021) | 0.345 (.025) | 0.360 (.024)
seven_highly_separated_10d_very_different_shapes: 0.684 (.045) | 0.505 (.035) | 0.628 (.036) | 0.523 (.039)
seven_very_different_shapes_significant_overlap: 0.109 (.011) | 0.090 (.006) | 0.089 (.007) | 0.081 (.007)
four_clusters_100d_100_samples_each: 0.018 (.006) | 0.062 (.011) | 0.053 (.008) | 0.051 (.005)
four_clusters_100d_1000_samples_each: 0.252 (.029) | 0.088 (.014) | 0.065 (.012) | 0.072 (.016)
Average: 0.295 (.029) | 0.243 (.023) | 0.264 (.025) | 0.241 (.025)

Table 3: Performance of HDBSCAN: AMI, ARI, and the proportion of points labeled noise (p_noise), on convex and non-convex clusters.

Convex clusters (AMI | ARI | p_noise):
twelve_clusters_different_distributions: 0.796 (.045) | 0.705 (.066) | 0.269 (.033)
twelve_different_distributions_high_class_imbalance: 0.733 (.054) | 0.572 (.083) | 0.238 (.042)
seven_highly_separated_10d_very_different_shapes: 0.992 (.006) | 0.984 (.014) | 0.175 (.014)
seven_very_different_shapes_significant_overlap: 0.595 (.110) | 0.617 (.124) | 0.909 (.020)
four_clusters_100d_100_samples_each: 1.000 (.000) | 1.000 (.000) | 1.000 (.000)
four_clusters_100d_1000_samples_each: 0.800 (.133) | 0.800 (.133) | 0.999 (.000)
Average: 0.819 (.035) | 0.780 (.040) | 0.599 (.049)

Non-convex clusters (AMI | ARI | p_noise):
twelve_clusters_different_distributions: 0.605 (.016) | 0.400 (.026) | 0.353 (.007)
twelve_different_distributions_high_class_imbalance: 0.560 (.024) | 0.356 (.036) | 0.346 (.016)
seven_highly_separated_10d_very_different_shapes: 0.747 (.106) | 0.700 (.117) | 0.417 (.043)
seven_very_different_shapes_significant_overlap: 0.163 (.050) | 0.149 (.066) | 0.744 (.072)
four_clusters_100d_100_samples_each: 0.794 (.137) | 0.804 (.131) | 0.990 (.007)
four_clusters_100d_1000_samples_each: 0.031 (.018) | 0.023 (.016) | 0.687 (.070)
Average: 0.483 (.047) | 0.405 (.047) | 0.590 (.036)

The results show that hierarchical clustering performs best on the convex cluster shapes, as long as there is sufficient separation between clusters. On the non-convex clusters, EM-GMM exhibits the strongest performance even though the clusters are no longer multivariate normal after applying distort (see Figure 8).

K-Means and hierarchical clustering both hold up well on the high-dimensional data, including in the low-sample regime of 100 samples per cluster in 100D. By contrast, EM-GMM dramatically benefits from more samples on the high-dimensional data.

Spectral clustering does not display competitive performance in this benchmark. While it improves on K-Means in some scenarios, it delivers weaker performance in high dimensions and is never the best-performing algorithm.

HDBSCAN shows great difficulty in handling high dimensionality or significant overlap between clusters. However, the algorithm shows strong performance on the 12 non-convex clusters drawn from diverse distributions. We note again that HDBSCAN does not have access to the true number of clusters, in contrast with the other algorithms.

4.2 Minimax Classification Error Captures Clustering Difficulty

In Section 3.2, we defined the overlap between two clusters in terms of the error rate of the minimax linear classifier. We verify that this notion of overlap conveys clustering difficulty by measuring clustering performance on data sets with different degrees of overlap. For this simulation, we consider data sets with two clusters drawn from an archetype we described as “two clusters with very different shapes in $p$D”⁵, where the dimensionality $p$ ranges across [2, 10, 30, 100]. We vary max_overlap from $10^{-7}$ to 0.5, while setting min_overlap = max_overlap/10. For each overlap setting, we generate 100 distinct data sets and evaluate the average clustering performance of hierarchical clustering, quantified in terms of adjusted mutual information (AMI) and adjusted Rand index (ARI), as in the benchmark of Section 4.1. We choose hierarchical clustering because it is computationally efficient and performed well in the benchmark. We repeat this process twice, where in the second run we make the clusters non-convex by applying the distort function.

⁵ This verbal description yields an archetype with parameters {‘name’: ‘two_very_different_shapes_pd’, ‘n_clusters’: 2, ‘dim’: $p$, ‘n_samples’: 200, ‘aspect_ref’: 1.5, ‘aspect_maxmin’: 3.0, ‘radius_maxmin’: 3, ‘imbalance_ratio’: 2, ‘max_overlap’: 0.05, ‘min_overlap’: 0.001, ‘distributions’: [‘normal’, ‘exponential’]}.

Figure 9 confirms that clustering difficulty rises with increasing overlap. Figure 10 shows the same in the case of non-convex clusters, suggesting that applying distort maintains the desired relationship between overlap and clustering difficulty. Additionally, both figures show how our cluster overlap relates to the silhouette score, a popular metric for quantifying clustering difficulty (Rousseeuw, 1987; Shand et al., 2019). At a fixed value of max_overlap, the silhouette score decreases markedly with a rise in dimensionality. This is not an artifact of our overlap measure, since plotting clustering performance versus silhouette score shows a similar dependence on dimensionality (not shown). This makes sense because the silhouette score is based on differences of Euclidean distances, and distances between points tend to become more similar in high dimensions (Aggarwal et al., 2001).

[Figure 9: Clustering difficulty rises with increasing overlap (convex clusters).]

[Figure 10: Clustering difficulty rises with increasing overlap (non-convex clusters produced by distort).]

4.2.1 Examining the Distribution of Pairwise Overlaps

Since repliclust controls cluster overlap at the level of entire data sets by setting two global parameters (max_overlap and min_overlap), it is worth investigating the distributions of pairwise overlaps on data sets with multiple clusters. Figure 11 shows the distribution of pairwise overlaps for six data set archetypes, confirming that setting max_overlap and min_overlap effectively controls the pairwise overlaps between clusters.


5 Related Work

Simulations on synthetic data play an important role in cluster analysis. Accordingly, many synthetic data generators for cluster analysis have been proposed. However, the key idea in this paper is to specify the overall geometric characteristics of synthetic data in a high-level manner via data set archetypes. By contrast, previous data generators have put the onus on the user to design the overall geometric structure by carefully tuning lower-level properties of individual clusters.

In the following overview, we mainly focus on general-purpose generators. However, the literature has also proposed more specialized solutions to fill specific needs in the community. For example, Beer et al., (2019) present a data generator for subspace clustering, Gan and Tao, (2015) evaluate density-based clustering using a data generation process based on random walks, and Handl and Knowles, (2005) focus on creating long and thin ellipsoidal clusters in higher dimensions. None of these contributions provides high-level control over the overall geometry of the data sets.

Milligan and Cooper, (1985) implement a generator that creates several clusters in up to 8 dimensions. The method enforces an upper bound on cluster overlap by limiting overlap in the first dimension, but does not otherwise provide control over high-level geometric structure.

Pei and Zaïane, (2006) present a compelling software system for generating two-dimensional clusters, which creates data sets with a specified clustering difficulty of "easy", "medium", or "hard"; "easy" data consists only of spherical/convex clusters, whereas "medium" and "hard" data include curve segments and special shapes like letters of the alphabet. Apart from this difficulty scale and the density of noise points to add to the data, the generator does not offer high-level control over data set geometry. The software comes with an appealing graphical user interface.

The popular scikit-learn library for machine learning (Pedregosa et al., (2011)) offers several functions for creating synthetic clusters. Among these, some are aimed at reproducing canonical 2D toy data sets like concentric circles and nested moons (make_moons, make_circles), while others focus on sampling multivariate normal clusters. These functions offer some valuable features, such as the ability to create data sets with informative and redundant features, as in the make_classification function. However, they do not control the overall geometric characteristics of the data.
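For reference, these scikit-learn generators are invoked as follows (the parameter values here are arbitrary):

```python
from sklearn.datasets import make_moons, make_circles, make_classification

X_moons, y_moons = make_moons(n_samples=500, noise=0.05)
X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5)
# informative vs. redundant features, one Gaussian cluster per class
X_cls, y_cls = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, n_redundant=2,
                                   n_clusters_per_class=1)
```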

The data mining framework ELKI (Schubert and Zimek, (2019)) provides a synthetic data generator based on specifying probabilistic models in XML files. This XML-based approach makes it easy to reproduce benchmarks by sharing the underlying XML files. Drawing inspiration from this work, we have implemented an Archetype.describe function allowing users to easily share collections of data set archetypes as JSONL files.
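As an illustration of the format, such a JSONL file would hold one archetype per line; the line below reuses the parameters of the archetype from Section 4.2 with $p = 10$ (the exact field set that repliclust serializes may differ):

{"name": "two_very_different_shapes_10d", "n_clusters": 2, "dim": 10, "n_samples": 200, "aspect_ref": 1.5, "aspect_maxmin": 3.0, "radius_maxmin": 3, "imbalance_ratio": 2, "max_overlap": 0.05, "min_overlap": 0.001, "distributions": ["normal", "exponential"]}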

The generators OCLUS (Steinley and Henson, (2005)) and GenRandomClust (Qiu and Joe, (2006)) focus on providing more sophisticated overlap control compared to previous generators. GenRandomClust extends the generator of Milligan and Cooper, (1985) by managing overlaps between clusters with different ellipsoidal shapes and arbitrary spatial orientations. Similar to our classification error-based notion of cluster overlap, their method finds an optimal separation direction between two clusters. To enforce the desired degree of overlap, the algorithm initially places cluster centers on a scaffold, then scales individual edges of the scaffold up or down to meet the overlap constraint. The method supports making all cluster shapes more or less elongated, but does not otherwise provide high-level control over data set geometry. Moreover, the scaling operations undertaken to manage cluster overlaps implicitly sacrifice control over cluster volumes (which, in repliclust, can be managed independently).

OCLUS (Steinley and Henson, (2005)) quantifies cluster overlap in terms of the shared density between two clusters. The generator uses analytical formulas for integrals of several interesting probability distributions (including exponential, gamma, and chi-square), thereby effectively managing overlaps between non-normal clusters. However, reliance on these analytical formulas limits the method to treating all dimensions independently, so that cluster distributions factor as products of marginals. The paper contains a valuable discussion of the distinction between marginal and joint cluster overlaps (the software supports both).

Like other existing generators, OCLUS does not aspire to help the user establish the overall geometric characteristics of synthetic data sets. To generate a data set, the user must provide a covariance matrix for each cluster, the desired overlaps between all pairs of clusters, and a design matrix specifying which clusters overlap at all. In sum, generating 10 clusters in 10D requires supplying over 500 numbers, compared to only a handful in repliclust (or none, if the user chooses to describe the archetype in English).
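To make the count explicit: each of the 10 symmetric covariance matrices in 10D contributes 10 · 11/2 = 55 free entries, or 550 numbers in total, and the 10 · 9/2 = 45 pairwise overlaps bring the tally to roughly 595, before counting the design matrix.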

MDCGen (Iglesias et al., (2019)) is a feature-rich generator that supports many desiderata in cluster analysis, such as overlap control, different probability distributions, subspace clusters, and the ability to add noise points. In particular, its grid-based strategy for placing cluster centers makes it possible to place noise points away from the clusters. However, MDCGen does not target the overall geometric characteristics of synthetic data sets, instead giving users low-level control enabling extensive configurability. For example, managing the overlap between clusters involves setting compactness coefficients, grid granularity, and overall scale, compared to tweaking only max_overlap in repliclust.⁶ In the words of the authors, "to enable covering a broad range of dataset possibilities, the parameters are multiple and some training for tuning the tool is required."

⁶ min_overlap may also need tweaking, but it can usually stay at max_overlap/10 or a similar value.

Finally, the HAWKS generator (Shand etal., (2019)) controls cluster overlaps using an evolutionary algorithm that evolves the means and covariance matrices of multivariate normal distributions. The paper applies this framework to create data sets with a user-specified silhouette score representing clustering difficulty (Rousseeuw, (1987)). In principle, the evolutionary framework can be extended to attain desired high-level geometric characteristics. Of these, the authors consider two examples, cluster overlaps and elongations (the latter relating to our notion of cluster aspect ratio, as listed in Table 1). An interesting aspect of HAWKS is the ability to generate data sets that maximize the performance difference between two clustering algorithms. This feature is especially useful in two dimensions, since we can then visually examine the data sets to better understand when each algorithm succeeds or fails.

6 Conclusion

In this paper, we have presented an archetype-based approach to synthetic data generation for cluster analysis. Our method works by summarizing the overall geometric characteristics of a probabilistic mixture model with a few high-level parameters, and sampling mixture models subject to these constraints. To convey the convenience and informativeness of such an archetype-based approach, we have implemented a natural language interface allowing users to create archetypes purely from verbal descriptions in English. Thus, our software repliclust makes it possible to run an entire benchmark by describing the desired evaluation scenarios in English.

Although our data generator relies on creating a skeleton of convex, ellipsoidal clusters, we have implemented ways to make the cluster shapes more irregular and complex. The first method passes convex clusters through a randomly initialized neural network, making their shapes non-convex and irregular. The second method creates directional datasets by wrapping p𝑝pitalic_p-dimensional convex clusters around the (p+1)𝑝1(p+1)( italic_p + 1 )-dimensional sphere.

In future work, we are most interested in learning data set archetypes that mimic the geometric characteristics of real data. In application domains where multivariate Gaussians provide a good model for the data, it would suffice to fit a Gaussian mixture model, empirically measure its high-level geometric parameters (as listed in Table 1), and then create a synthetic data set archetype with these parameters. For application domains with non-convex clusters, it would be interesting to implement a generalized version of this approach. We can imagine training an autoencoder-type neural network that maps non-convex clusters into a hidden space where they become multivariate Gaussian, then undoes the transformation. Defining an archetype based on the high-level geometric characteristics in the hidden space would presumably allow us to sample irregularly shaped clusters that look similar to those found in real data.

Acknowledgements

MJZ would like to acknowledge support from Matt Thomson and members of the Thomson Lab at Caltech. In addition, MJZ thanks Dante Roy Calapate for providing his computer monitor during the early stages of this project.

References

  • Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Van den Bussche, J. and Vianu, V., editors, Database Theory — ICDT 2001, pages 420–434, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal distributions with different covariance matrices. The Annals of Mathematical Statistics, 33(2):420–431.
  • Azaria, A. and Mitchell, T. (2023). The internal state of an LLM knows when it's lying. In Bouamor, H., Pino, J., and Bali, K., editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
  • Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
  • Ball, K. (1992). A lower bound for the optimal density of lattice packings. International Mathematics Research Notices, 1992(10):217–221.
  • Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(46):1345–1382.
  • Beer, A., Schüler, N.-S., and Seidl, T. (2019). A generator for subspace clusters.
  • Blum, A., Hopcroft, J., and Kannan, R. (2020). Foundations of Data Science. Cambridge University Press.
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Chen, S., Rivaud, P., Park, J. H., Tsou, T., Charles, E., Haliburton, J. R., Pichiorri, F., and Thomson, M. (2020). Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proceedings of the National Academy of Sciences, 117(46):28784–28794.
  • Cormack, R. M. (1971). A review of classification. Journal of the Royal Statistical Society, Series A (General), 134(3):321–367.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
  • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press.
  • Gan, J. and Tao, Y. (2015). DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 519–530, New York, NY, USA. Association for Computing Machinery.
  • Handl, J. and Knowles, J. (2005). Improvements to the scalability of multiobjective clustering. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC '05, pages 2372–2379. IEEE.
  • Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York.
  • Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64:53–62. Philosophical Aspects of Pattern Recognition.
  • Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Gurevych, I. and Miyao, Y., editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  • Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.
  • Iglesias, F., Zseby, T., and Ferreira, D. (2019). MDCGen: Multidimensional dataset generator for clustering. Journal of Classification, 36:599–618.
  • Inan, H., Khosravi, K., and Socher, R. (2017). Tying word vectors and word classifiers: A loss framework for language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.
  • LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient BackProp, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California, Berkeley, pages 281–297.
  • McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11):205.
  • Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3):325–342.
  • Milligan, G. W. (1996). Clustering validation: Results and implications for applied analyses. In Clustering and Classification, pages 341–375.
  • Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.
  • Milligan, G. W., Soon, S. C., and Sokol, L. M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(1):40–47.
  • Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems, volume 14. MIT Press.
  • OpenAI (2023). OpenAI API. https://platform.openai.com/. Accessed: 2024-09-18.
  • OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., et al. (2024). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Pei, Y. and Zaïane, O. (2006). A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, Edmonton.
  • Qiu, W. and Joe, H. (2006). Generation of random clusters with specified degree of separation. Journal of Classification, 23:315–334.
  • Ren, J., Luo, J., Zhao, Y., Krishna, K., Saleh, M., Lakshminarayanan, B., and Liu, P. J. (2023). Out-of-distribution detection and selective generation for conditional language models.
  • Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
  • Salah, A. and Nadif, M. (2019). Directional co-clustering. Advances in Data Analysis and Classification, 13(3):591–620.
  • Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169–194.
  • Schubert, E., Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems, 42(3).
  • Schubert, E. and Zimek, A. (2019). ELKI: A large open-source library for data analysis - ELKI release 0.7.5 "Heidelberg". CoRR, abs/1902.03616.
  • Shand, C., Allmendinger, R., Handl, J., Webb, A., and Keane, J. (2019). Evolving controllably difficult datasets for clustering. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pages 463–471. ACM.
  • Shi, J. and Malik, J. (1997). Normalized cuts and image segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 731–737.
  • Steinley, D. and Brusco, M. J. (2008). Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika, 73(1):125–144.
  • Steinley, D. and Brusco, M. J. (2011). Evaluating mixture modeling for clustering: Recommendations and cautions. Psychological Methods, 16(1):63–79.
  • Steinley, D. L. and Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22:221–250.
  • Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
  • van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605.
  • Van Mechelen, I., Boulesteix, A.-L., Dangl, R., Dean, N., Hennig, C., Leisch, F., Steinley, D., and Warrens, M. J. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. WIREs Data Mining and Knowledge Discovery, 13(6):e1511.
  • Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854.
  • Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272.

Appendix A

We give more detail on how repliclust manages various geometric attributes using max-min parameters. Table 4 lists all geometric attributes managed with max-min sampling and names the corresponding parameters in repliclust. The reference value for each geometric attribute serves as a location parameter, while the max-min ratio determines the spread. The attribute-specific constraint ensures that the distribution of geometric parameters within a data set is similar across data sets drawn from the same archetype.

Table 4: Geometric attributes managed with max-min sampling.

Geometric Attribute    | Max-Min Ratio                | Reference Value                | Constraint
cluster volumes        | radius_maxmin                | scale                          | cluster volumes average to reference volume
group sizes            | imbalance_ratio              | average group size             | group sizes sum to number of samples
cluster aspect ratios  | aspect_maxmin                | aspect_ref                     | geometric mean of aspect ratios equals reference
cluster axis lengths   | aspect ratio of the cluster  | dim-th root of cluster volume  | geometric mean of lengths equals reference length


To enforce the attribute-specific constraints, repliclust samples new values of a geometric parameter in pairs. The first value is randomly drawn from a triangular distribution whose mode and endpoints are determined by the reference value and max-min ratio. The second value is then computed deterministically to maintain the constraint. For reference, see the sample_with_maxmin function in the maxmin.utils module (Version 1.0.0 of repliclust).
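The following minimal sketch illustrates the pairing scheme, under the assumption that the constraint fixes the geometric mean of each pair at the reference value; the actual sample_with_maxmin implementation may handle endpoints and constraints differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(ref, maxmin_ratio):
    """Draw one value from a triangular distribution around `ref`,
    then return a second value that restores the geometric mean."""
    lo, hi = ref / maxmin_ratio, ref * maxmin_ratio
    first = rng.triangular(left=lo, mode=ref, right=hi)
    second = ref**2 / first  # geometric mean of (first, second) equals ref
    return first, second
```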

Appendix B

Table 5 lists the formal attributes of a mixture model in repliclust. A data set archetype provides a way to randomly sample mixture models with similar overall geometric characteristics. Thus, an archetype implicitly defines a probability distribution over the attributes in Table 5.

Table 5: Formal attributes of a mixture model in repliclust.

Attribute                    | Meaning                                                                                   | Mathematical Definition
cluster centers              | the positions of cluster centers in space                                                 | $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^p$
principal axis orientations  | the spatial orientation of each cluster's ellipsoidal shape (different for each cluster)  | orthonormal matrices $\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_k \in \mathbb{R}^{p \times p}$
principal axis lengths       | the lengths of each cluster's principal axes (axes differ between and within clusters)    | $\boldsymbol{\sigma}_1, \boldsymbol{\sigma}_2, \ldots, \boldsymbol{\sigma}_k \in (\mathbb{R}^{>0})^p$
cluster distributions        | multivariate probability distributions for generating data (different for each cluster)   | distributions $\mathbb{P}_1, \mathbb{P}_2, \ldots, \mathbb{P}_k$
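To make the notation concrete, the following sketch (assuming normal clusters, for simplicity) assembles a cluster's covariance matrix from its principal axis orientations and lengths, then samples points:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_normal_cluster(mu, U, sigma, n_samples):
    """Sample points from N(mu, U diag(sigma^2) U^T)."""
    cov = U @ np.diag(sigma**2) @ U.T
    return rng.multivariate_normal(mu, cov, size=n_samples)
```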

Appendix C

We prove Theorem 1 of Section 3.2. Additionally, we provide an analogous result for the cruder “center-to-center” approximation of cluster overlap.

Theorem 1 (LDA-Based Cluster Overlap)

For two multivariate normal clusters with means $\boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_2$ and covariance matrices $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2$, the approximate cluster overlap $\alpha_{\mathrm{LDA}}$ based on the linear separator $\boldsymbol{a}_{\mathrm{LDA}} = \left(\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ is

$$\alpha_{\mathrm{LDA}} = 2\left(1 - \Phi\left(\frac{\boldsymbol{a}_{\mathrm{LDA}}^{\top}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)}{\sqrt{\boldsymbol{a}_{\mathrm{LDA}}^{\top}\boldsymbol{\Sigma}_1\boldsymbol{a}_{\mathrm{LDA}}} + \sqrt{\boldsymbol{a}_{\mathrm{LDA}}^{\top}\boldsymbol{\Sigma}_2\boldsymbol{a}_{\mathrm{LDA}}}}\right)\right), \tag{10}$$

where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution. Moreover, if $\boldsymbol{\Sigma}_1 = \lambda\boldsymbol{\Sigma}_2$ for some $\lambda$, then $\alpha_{\mathrm{LDA}}$ equals the exact cluster overlap $\alpha$.

Proof. Let $\boldsymbol{a}_{\mathrm{LDA}}$ be the classification axis. Minimax optimality requires that the cluster-specific misclassification probabilities be equal. Since $\boldsymbol{a}_{\mathrm{LDA}}$ is the classification axis, these probabilities correspond to the tails of the marginal distributions along $\boldsymbol{a}_{\mathrm{LDA}}$. Specifically, let

$$\sigma_1 = \sqrt{\boldsymbol{a}_{\mathrm{LDA}}^{\top}\boldsymbol{\Sigma}_1\boldsymbol{a}_{\mathrm{LDA}}} \tag{11}$$

be the standard deviation of cluster 1's marginal distribution along $\boldsymbol{a}_{\mathrm{LDA}}$, where $\boldsymbol{\Sigma}_1$ is the cluster's covariance matrix; $\sigma_2$ is defined analogously. If $\boldsymbol{a}_{\mathrm{LDA}}$ is oriented to point from cluster 1 to cluster 2, then the $1 - \alpha/2$ quantile of cluster 1's marginal distribution meets the $\alpha/2$ quantile of cluster 2's marginal distribution at the decision boundary, where $\alpha$ is the unknown cluster overlap. This intersection implies

$$\boldsymbol{\mu}_1^{\top}\boldsymbol{a}_{\mathrm{LDA}} + q_{1-\alpha/2}\,\sigma_1 = \boldsymbol{\mu}_2^{\top}\boldsymbol{a}_{\mathrm{LDA}} + q_{\alpha/2}\,\sigma_2, \tag{12}$$

where $q_{\xi}$ is the $\xi$-quantile of the standard normal distribution. Rearranging this equation, and using $q_{\alpha/2} = -q_{1-\alpha/2}$ and $\Phi(q_{\xi}) = \xi$, gives (10).

Next, suppose that $\boldsymbol{\Sigma}_1 = \lambda\boldsymbol{\Sigma}_2$ for some $\lambda$. In this case, maximum likelihood classification results in a linear decision boundary that coincides with the LDA solution. Hence, the minimax-optimal linear classifier uses the LDA-based classification axis $\boldsymbol{a}_{\mathrm{LDA}}$.
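For readers who want to evaluate equation (10) numerically, a direct NumPy/SciPy transcription looks as follows; this is a sanity-check sketch, not the repliclust implementation:

```python
import numpy as np
from scipy.stats import norm

def lda_overlap(mu1, mu2, Sigma1, Sigma2):
    """Approximate cluster overlap alpha_LDA of equation (10)."""
    delta = mu2 - mu1
    a = np.linalg.solve((Sigma1 + Sigma2) / 2, delta)  # LDA separator a_LDA
    s1 = np.sqrt(a @ Sigma1 @ a)  # marginal std of cluster 1 along a_LDA
    s2 = np.sqrt(a @ Sigma2 @ a)  # marginal std of cluster 2 along a_LDA
    return 2 * (1 - norm.cdf(a @ delta / (s1 + s2)))
```

Replacing $\boldsymbol{a}_{\mathrm{LDA}}$ with $\boldsymbol{\delta} = \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$ in this computation yields the center-to-center overlap of Theorem 2 below.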

Theorem 2 (Center-to-Center Cluster Overlap)

For two multivariate normal clusters with means $\boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_2$ and covariance matrices $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2$, the center-to-center cluster overlap $\alpha_{\mathrm{C2C}}$, based on a classification boundary perpendicular to the line connecting the cluster centers, is

$$\alpha_{\mathrm{C2C}} = 2\left(1 - \Phi\left(\frac{\boldsymbol{\delta}^{\top}\boldsymbol{\delta}}{\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_1\boldsymbol{\delta}} + \sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_2\boldsymbol{\delta}}}\right)\right), \tag{13}$$

where $\boldsymbol{\delta} := \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$ is the difference between cluster centers and $\Phi(z)$ is the cumulative distribution function of the standard normal distribution.

Moreover, if the covariance matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ are both multiples of the identity matrix, then $\alpha_{\mathrm{C2C}}$ equals the exact cluster overlap $\alpha$.

Proof. The proof proceeds along the same lines as the proof of Theorem 1, except that the classification axis is $\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$. If both covariance matrices are multiples of the identity matrix, then $\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1$ is a scalar multiple of the LDA-based classification axis $\boldsymbol{a}_{\mathrm{LDA}}$. Hence, the second part of Theorem 1 applies, establishing equality between $\alpha_{\mathrm{C2C}}$ and the exact overlap.

Appendix D

Figure 12 visualizes two-dimensional data sets created from the same archetype but with different radial probability distributions. The results suggest that pegging the 68.15% quantile of the radial distribution at unity leads to satisfactory overlap control for distributions with infinite support. However, for distributions with bounded support (such as the beta distribution), this approach leads to greater separation between the clusters, as shown in the rightmost column of Figure 12.
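The pegging itself is a one-line computation. For example, the scale factor that moves the 68.15% quantile of a radial distribution to unity can be obtained from the inverse CDF; the distributions and shape parameters below are illustrative only:

```python
from scipy.stats import expon, beta

# scale factor that moves the 68.15% quantile to 1
scale_exp = 1.0 / expon.ppf(0.6815)
scale_beta = 1.0 / beta.ppf(0.6815, a=2, b=5)  # shape parameters are arbitrary
```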


Appendix E

The default architecture for the neural network used in the distort function of Section 3.4 is a feed-forward network consisting (in order) of a linear embedding, 16 repeated feed-forward blocks (each consisting of a fully connected linear layer followed by layer normalization and Tanh activation), and a linear projection. Importantly, we tie the weights of the embedding and projection layers, so that the projection weights are the transpose of the embedding weights (Inan etal., (2017)). The default hidden dimensionality is 128, so that the embedding layer maps a data set archetype’s dimension to 128. The internal feed-forward blocks preserve this hidden dimension, and the final projection layer maps it back to the archetype’s dimension.
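A compact PyTorch sketch of this default architecture follows; the class name is our own, and repliclust's implementation may differ in details such as bias handling:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistortNet(nn.Module):
    """Feed-forward net with tied embedding/projection weights."""

    def __init__(self, data_dim: int, hidden_dim: int = 128, n_blocks: int = 16):
        super().__init__()
        self.embed = nn.Linear(data_dim, hidden_dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.Tanh(),
            )
            for _ in range(n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for block in self.blocks:
            h = block(h)
        # projection tied to the transpose of the embedding weights
        return F.linear(h, self.embed.weight.t())
```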

Appendix F

Below we list the prompt templates (including few-shot examples) used in Version 1.0.0 of repliclust. Up-to-date versions are available in the code base (see repliclust.org).

1. Prompt template mapping archetype description to high-level geometric parameters (a usage sketch follows the template):

Your task is to turn a verbal description of a data set archetype from Repliclust into a precise JSON that specifies which parameter settings to use to create the desired data set archetype in Repliclust. These are the allowed parameters:

n_clusters: int >= 1, the number of clusters to generate

dim: int >= 2, the dimensionality of the data

n_samples: int >= 1, the number of data samples to generate

aspect_ref: float >= 1, the eccentricity of a typical cluster (how oblong vs spherical it is)

aspect_maxmin: float >= 1, how much the eccentricity varies across clusters in a data set

radius_maxmin: float >= 1, how much cluster radius (and thereby cluster volume) varies across the clusters

max_overlap: float > 0, the maximum allowed overlap between any pair of clusters (0.1-0.2 is significant overlap, 0.01-0.05 is little overlap, 0.001 is very little overlap, and 0.0001 and lower is well-separated)

min_overlap: float > 0, the minimum amount of overlap each cluster should have with some other cluster, preventing a cluster from being too far away from all other clusters

imbalance_ratio: float >= 1, specifies how imbalanced the number of data points per cluster is

distributions: list[str], determines the probability distributions to use for the clusters; the available distributions are 'normal', 'standard_t', 'exponential', 'beta', 'uniform', 'chisquare', 'gumbel', 'weibull', 'gamma', 'pareto', 'f', and 'lognormal'

IMPORTANT NOTES:

Any words like "separated", "far away", "close together", or "overlapping" refer to the overlap between clusters. Far apart means that max_overlap is 1e-4 or less

Always make min_overlap smaller than max_overlap, usually ten times smaller!

ONLY include the Pareto (’pareto’) distribution if the user specifically asks for heavy tails!

EXAMPLES:

Description: five oblong clusters in two dimensions

Archetype JSON: {

"n_clusters": 5,

"dim": 2,

"n_samples": 500,

"aspect_ref": 3,

"aspect_maxmin": 1.5,

}

Description: three spherical clusters with significant overlap in two dimensions

Archetype JSON: {

"n_clusters": 3,

"dim": 2,

"n_samples": 300,

"max_overlap": 0.2,

"min_overlap": 0.1,

"aspect_ref": 1.0,

"aspect_maxmin": 1.0

}

Description: eight spherical clusters of different sizes with significant overlap in two dimensions

Archetype JSON: {

"n_clusters": 8,

"dim": 2,

"n_samples": 800,

"max_overlap": 0.25,

"min_overlap": 0.1,

"aspect_ref": 1.0,

"aspect_maxmin": 1.0,

"radius_maxmin": 2.0

}

Description: ten clusters which are all highly oblong (and equally so) but of very different sizes, with moderate overlap

Archetype JSON: {

"n_clusters": 10,

"n_samples": 1000,

"aspect_ref": 5,

"aspect_maxmin": 1.0,

"max_overlap": 0.10,

"min_overlap": 0.05,

"radius_maxmin": 4.0

}

Description: five clusters with significant class imbalance

Archetype JSON: {

"n_clusters": 5,

"n_samples": 500,

"imbalance_ratio": 5,

"aspect_ref": 1.5,

"aspect_maxmin": 1.5

}

Description: five clusters with perfect class balance

Archetype JSON: {

"n_clusters": 5,

"n_samples": 500,

"imbalance_ratio": 1.0,

"aspect_ref": 1.4,

"aspect_maxmin": 1.6

}

Description: eight clusters of which 70% are exponentially distributed and 30% are normally distributed

Archetype JSON: {

"n_clusters": 8,

"n_samples": 800,

"aspect_ref": 1.7,

"aspect_maxmin": 1.5,

"distributions": ["exponential", "normal"],

"distribution_proportions": [0.7, 0.3],

}

Description: eight clusters with 1000 total samples of which half are exponentially distributed and half are normally distributed

Archetype JSON: {

"n_clusters": 8,

"n_samples": 1000,

"aspect_ref": 1.7,

"aspect_maxmin": 1.5,

"distributions": ["exponential", "normal"],

"distribution_proportions": [0.5, 0.5]

}

Description: two clusters of different sizes in 10 dimensions that are well-separated

Archetype JSON: {

"n_clusters": 2,

"dim": 10,

"n_samples": 200,

"aspect_ref": 2

"aspect_maxmin": 2,

"radius_maxmin": 4.0,

"max_overlap": 0.001,

"min_overlap": 0.0001

}

Description: very oblong clusters that overlap heavily

Archetype JSON: {

"n_clusters": 6,

"n_samples": 600,

"aspect_ref": 7,

"aspect_maxmin": 1.4,

"max_overlap": 0.4,

"min_overlap": 0.3

}

Description: highly separated and very oblong clusters

Archetype JSON: {

"n_clusters": 4,

"n_samples": 400,

"aspect_ref": 6,

"aspect_maxmin": 1.6,

"max_overlap": 1e-4,

"min_overlap": 1e-5

}

Description: ten clusters with very different shapes

Archetype JSON: {

"n_clusters": 10,

"n_samples": 1000,

"aspect_ref": 1.5,

"aspect_maxmin": 3.0,

"radius_maxmin": 3.0

}

Description: twelve well-separated clusters with very different shapes

Archetype JSON: {

"n_clusters": 12,

"n_samples": 1200,

"aspect_ref": 1.5,

"aspect_maxmin": 5.0,

"radius_maxmin": 5.0,

"max_overlap": 1e-4,

"min_overlap": 1e-5

}

Description: twelve highly separated Gaussian clusters with very different shapes

Archetype JSON: {

"n_clusters": 12,

"n_samples": 1200,

"aspect_ref": 1.5,

"aspect_maxmin": 5.0,

"radius_maxmin": 5.0,

"max_overlap": 1e-4,

"min_overlap": 1e-5

"distributions": ["normal"]}}

Description: five heavy-tailed clusters

Archetype JSON: {

"n_clusters": 5,

"n_samples": 500,

"aspect_ref": 1.5,

"distributions": ["standard_t", "lognormal", "pareto"]}}

Description: clusters with holes

Archetype JSON: {"distributions": ["f"]}

Description: clusters from a variety of distributions

Archetype JSON: {"distributions": ["normal", "exponential", "gamma", "weibull", "lognormal"]}

Description: clusters from all different distributions

Archetype JSON: {"distributions": [’normal’, standard_t’, exponential’, beta’, uniform’, chisquare’, gumbel’, weibull’, gamma’, f’, and lognormal’]}

Description: clusters from different distributions

Archetype JSON: {"distributions": [’normal’, exponential’, beta’, uniform’, chisquare’, gumbel’, weibull’, gamma’, f’, and lognormal’]}

Description: highly separated clusters from all different distributions but no heavy tails

Archetype JSON: {"max_overlap": 1e-4,

"min_overlap": 1e-5,

"distributions": [’normal’, exponential’, beta’, uniform’, chisquare’, gumbel’, weibull’, gamma’, f’, and lognormal’]}

Description: seven clusters with uniform distribution with light overlap

Archetype JSON: { "max_overlap": 0.025,

"min_overlap": 0.0025,

"distributions": ["uniform"]}

Description: clusters with bounded support

Archetype JSON: {"distributions": ["beta", "uniform"]}

Description: {description}

Archetype JSON:
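To show how template 1 might be consumed, here is a hedged sketch using the OpenAI Python client; the model name and message framing are our choices, not necessarily what repliclust 1.0.0 uses:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def archetype_from_description(description: str, template: str) -> dict:
    """Fill the {description} slot and parse the model's JSON reply."""
    prompt = template.replace("{description}", description)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; repliclust may use another
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```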

2. Prompt template mapping archetype description to a descriptive identifier:

Your task is to turn a description of a data set archetype into an identifier for the archetype. The identifier should be short yet descriptive, and not contain any whitespace (but underscores are OK). IMPORTANT: the identifier should be a valid Python variable name. Specifically, it may NOT start with a number, nor contain any special character except for underscores.

EXAMPLES:

Description: five oblong clusters in two dimensions

Archetype identifier: five_oblong_2d

Description: three spherical clusters with significant overlap in two dimensions

Archetype identifier: three_spherical_significant_overlap_2d

Description: eight spherical clusters of different sizes with significant overlap in two dimensions

Archetype identifier: eight_spherical_different_sizes_significant_overlap_2d

Description: ten clusters which are all highly oblong (and equally so) but of very different sizes, with moderate overlap

Archetype identifier: ten_highly_oblong_very_different_shapes_moderate_overlap

Description: five clusters with significant class imbalance

Archetype identifier: five_significant_class_imbalance

Description: five clusters with perfect class balance

Archetype identifier: five_perfect_class_balance

Description: eight clusters of which 70% are exponentially distributed and 30% are normally distributed

Archetype identifier: eight_exp_and_norm

Description: eight clusters with 1000 total samples of which half are exponentially distributed and half are normally distributed

Archetype identifier: eight_exp_and_norm_1000_samples

Description: two clusters of different sizes in 10 dimensions that are well-separated

Archetype identifier: two_different_sizes_well_separated_10d

Description: very oblong clusters that overlap heavily

Archetype identifier: very_oblong_heavy_overlap

Description: ten clusters with very different shapes

Archetype identifier: ten_very_different_shapes

Description: {description}

Archetype identifier:
