Michael J. Zellinger (corresponding author: zellinger@caltech.edu), Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA
Peter Bühlmann, Seminar for Statistics, ETH Zürich, Zürich, Switzerland
Abstract
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the variation in cluster shapes, can be effectively varied. Unfortunately, curating evaluation scenarios is often laborious, as practitioners must find lower-level geometric parameters (like cluster covariance matrices) to match a higher-level scenario description like “clusters with very different shapes.” To make benchmarks more convenient and informative, we propose synthetic data generation based on data set archetypes. In this paradigm, the user describes an evaluation scenario in a high-level manner, and the software automatically generates data sets with the desired characteristics. Combining such data set archetypes with large language models (LLMs), it is possible to set up benchmarks purely from verbal descriptions of the evaluation scenarios. We provide an open-source Python package, repliclust, that implements this workflow. A demo of data generation from verbal inputs is available at demo.repliclust.org.
Keywords: Synthetic Data, Simulation, Validation, Clustering, Unsupervised Learning, Artificial Intelligence, Natural Language Processing
1 Introduction
The goal of clustering is to separate data points into groups such that points within a group are more similar to each other than to those outside the group (Cormack, 1971). In practice, it is often not clear what constitutes a cluster (Hennig, 2015). As a result, many practitioners evaluate cluster analysis algorithms on synthetic data (Milligan, 1980; Milligan et al., 1983; Tibshirani et al., 2001; Steinley and Brusco, 2008, 2011; Van Mechelen et al., 2023).
Synthetic data is valuable for two reasons. First, it clearly stipulates which data points belong to which cluster, allowing objective evaluation. Second, it allows independently manipulating different aspects of the data (such as the overlap between clusters or the variability of cluster shapes), which is critical for drawing scientific conclusions about the relative merits of different cluster analysis techniques (Milligan, 1996).
Unfortunately, setting up benchmarks with synthetic data can be laborious. The process typically involves creating data sets for a number of different scenarios. For example, on benchmarks with convex clusters drawn from probabilistic mixture models, the scenarios may involve “clusters of very different shapes and sizes”, “highly overlapping oblong clusters”, “high-dimensional spherical clusters”, etc. (Tibshirani et al., 2001).
Existing data generators do not cater directly to such high-level scenarios. Instead, the user must carefully tune simulation parameters to arrive at the desired scenarios (Steinley and Henson, 2005; Schubert and Zimek, 2019; Iglesias et al., 2019). While some generators make it easy to control the overlaps between clusters, such high-level control typically does not extend to other aspects like the desired diversity of cluster shapes and sizes (Qiu and Joe, 2006; Shand et al., 2019).
In this paper, we explore generating synthetic data directly from high-level descriptions. Our Python package, repliclust, accomplishes this goal by summarizing the overall geometry of probabilistic mixture models with a few high-level parameters. We use a large language model to map a user’s verbal description of a scenario onto these parameters. Although our approach is based on ellipsoidal clusters, we have implemented two post-processing functions for generating more irregular cluster shapes. The first makes clusters non-convex by passing them through a randomly initialized neural network. The second makes a $d$-dimensional data set directional by wrapping it around the $d$-dimensional sphere through an inverse stereographic projection.
2 Generating Data from High-Level Archetypes
Our data generator repliclust is based on data set archetypes. A data set archetype is a high-level description of the overall geometry of a data set with clusters. For example, the class of all data sets with “three oblong and slightly overlapping clusters in two dimensions with some class imbalance” is a data set archetype.
We implement archetype-based generation by summarizing the overall geometry of probabilistic mixture models in terms of a few high-level parameters, as listed in Table 1. To create an archetype, users can either specify these parameters directly or verbally describe the archetype in English. To map an English description to a precise parameter setting, we use few-shot prompting of a large language model (Brown et al., 2020; OpenAI et al., 2024).
Most of the high-level geometric parameters describing an archetype are based on what we call “max-min sampling.” In this approach, the user controls a geometric attribute by specifying a reference value and a max-min ratio. In addition, a constraint ensures that the reference value is indeed typical for every data set. For example, the aspect ratio of a cluster measures how elongated it is. The reference value aspect_ref sets the typical aspect ratio among all clusters in a data set, while aspect_maxmin sets the ratio of the highest to the lowest aspect ratio. To make sure that aspect_ref is indeed the typical aspect ratio, max-min sampling enforces the location constraint that the geometric mean of the aspect ratios within each data set equals aspect_ref. Appendix A gives more details on how we manage different geometric attributes using max-min sampling.
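To make this concrete, the sketch below implements max-min sampling in NumPy, following the pair-sampling scheme described in Appendix A. The function name and details are ours for illustration, not the package’s API (compare sample_with_maxmin in repliclust’s maxmin.utils module):

```python
import numpy as np

def sample_maxmin_values(n, ref, maxmin_ratio, rng):
    """Sample n positive values whose geometric mean equals `ref` and whose
    max/min ratio is controlled by `maxmin_ratio` (pair-sampling sketch)."""
    lo, hi = ref / np.sqrt(maxmin_ratio), ref * np.sqrt(maxmin_ratio)
    values = []
    while len(values) < n:
        x = rng.triangular(left=lo, mode=ref, right=hi)
        values.append(x)
        if len(values) < n:
            values.append(ref ** 2 / x)  # partner value restores the geometric mean
    return np.array(values[:n])

rng = np.random.default_rng(42)
# typical aspect ratio 2.0; largest/smallest aspect ratio around 4.0
aspect_ratios = sample_maxmin_values(5, ref=2.0, maxmin_ratio=4.0, rng=rng)
```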
Once an archetype has been defined, sampling concrete data sets proceeds in two steps. First, the algorithm samples a new probabilistic mixture model whose geometric structure matches the archetype. Second, we draw i.i.d. samples from this mixture model to generate a data set. Figure 1 illustrates this flow.
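In code, this flow looks roughly as follows. The snippet is a hypothetical sketch: the Archetype class and the parameter names appear in this paper (Table 1 and Section 4.1), but the exact constructor and the synthesize call are assumptions that may not match the released API:

```python
from repliclust import Archetype  # assumed import path

# define an archetype with high-level geometric parameters
# (parameter names as in the archetypes of Section 4.1)
archetype = Archetype(
    n_clusters=5, dim=2, n_samples=500,
    aspect_ref=1.5, aspect_maxmin=2.0, radius_maxmin=3.0,
    imbalance_ratio=2.0, max_overlap=0.05, min_overlap=0.001,
    distributions=["normal", "exponential"],
)

# sample a mixture model matching the archetype, then draw an
# i.i.d. data set from it (method name is an assumption)
X, y, _ = archetype.synthesize()
```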
To accommodate use cases in which variation of an archetype’s hyperparameters (n_clusters, dim, n_samples) is desired, we have implemented a function Archetype.sample_hyperparams for generating a list of archetypes with hyperparameters sampled from Poisson distributions centered on the original values (subject to rejection sampling based on user-specified minimum and maximum values).
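The sketch below illustrates the underlying sampling logic (a Poisson draw centered on the original value, rejected until it falls within user-specified bounds); it is a simplified stand-in for Archetype.sample_hyperparams, whose actual signature may differ:

```python
import numpy as np

def sample_hyperparam(center, min_value, max_value, rng):
    """Draw a Poisson value centered on `center`, rejecting draws
    outside [min_value, max_value]."""
    while True:
        value = rng.poisson(lam=center)
        if min_value <= value <= max_value:
            return value

rng = np.random.default_rng(0)
# vary the number of clusters around 6, keeping it between 2 and 12
n_clusters_variants = [sample_hyperparam(6, 2, 12, rng) for _ in range(5)]
```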
3 Sampling Probabilistic Mixture Models
Generating a synthetic dataset with repliclust starts with sampling a probabilistic mixture model that matches the desired archetype.
Sampling a mixture model proceeds through the following steps:
1. Draw random cluster covariance matrices based on the archetype.
2. Randomly initialize the cluster centers, then adjust their placement using stochastic gradient descent to meet the desired constraints on the overlaps between clusters.
3. Sample the number of data points per cluster, based on the extent of class imbalance specified by the archetype.
4. Construct a data set $X$ and cluster labels $\mathbf{y}$ by sampling data points i.i.d. from the mixture component describing each cluster. Return $(X, \mathbf{y}, \mathcal{A})$, where $\mathcal{A}$ is the archetype.
5. Optionally, make cluster shapes non-convex by either
   (a) passing $X$ through a randomly initialized neural network (repliclust.distort), or
   (b) wrapping $X$ around the $d$-dimensional sphere to create directional data (repliclust.wrap_around_sphere).
In the following sections, we give more details on each step involved in data generation.
3.1 Defining Clusters and Mixture Models
In the first four steps of data generation, repliclust models clusters as ellipsoidal probability distributions characterized by a central point. Specifically, each cluster is defined by a cluster center $\mu \in \mathbb{R}^d$, orthogonal principal axes $u_1, \dots, u_d$ pointing in arbitrary directions, the lengths $\sigma_1, \dots, \sigma_d$ of these principal axes, and a univariate probability distribution $f$.
To generate an i.i.d. sample from a cluster, we 1) sample a direction from the ellipsoid defined by the cluster’s principal axes, and 2) sample a length according to the cluster’s univariate distribution (which can be one of many supported distributions, including normal, lognormal, exponential, Student’s t, gamma, chi-square, Weibull, Gumbel, F, Pareto, beta, and uniform). To make the spread of each cluster depend only on the lengths of the principal axes, we normalize each univariate distribution so that the 68.2% quantile of its absolute value is unity. For example, if the univariate distribution is exponential with rate $\lambda$, we would actually sample from the re-scaled random variable $\tilde{X} = X / q$, where the quantile $q$ satisfies $\mathbb{P}(|X| \leq q) = 68.2\%$. (This rescaling puts all distributions on the same scale as the multivariate normal distribution, which natively satisfies $\mathbb{P}(|Z| \leq 1) \approx 68.2\%$.)
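As an illustration, the following sketch implements one plausible reading of this two-step scheme, with the symbols of Section 3.1; the quantile normalization uses a Monte Carlo estimate for simplicity, and repliclust’s internal implementation may differ:

```python
import numpy as np
from scipy import stats

def sample_cluster(n, center, axes, lengths, base_dist, rng):
    """Draw n points from an ellipsoidal cluster with the given center,
    orthonormal principal axes (columns of `axes`), axis lengths, and
    univariate base distribution (a frozen scipy.stats distribution)."""
    d = len(center)
    # 1) directions: uniform on the unit sphere, mapped onto the ellipsoid
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    directions = (u * lengths) @ axes.T
    # 2) lengths: |X| from the base distribution, rescaled so that the
    #    68.2% quantile of |X| equals one (Monte Carlo estimate of q)
    q = np.quantile(np.abs(base_dist.rvs(size=100_000, random_state=rng)), 0.682)
    radii = np.abs(base_dist.rvs(size=n, random_state=rng)) / q
    return center + radii[:, None] * directions

rng = np.random.default_rng(0)
X = sample_cluster(500, center=np.zeros(2), axes=np.eye(2),
                   lengths=np.array([3.0, 1.0]), base_dist=stats.expon(), rng=rng)
```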
Figure 2(a) visualizes clusters with different base distributions. Note that using heavy-tailed distributions can lead to far outliers (not shown). By contrast, univariate distributions with bounded support give rise to clusters with crisply defined boundaries. Stretches where a probability density function vanishes give rise to concentric holes.
A mixture model (repliclust.MixtureModel) represents several clusters. Figure 2(b) shows a two-dimensional mixture model with four clusters, and a data set sampled from it. Note that our mixture models do not model class probabilities. Instead, the number of data points per cluster are drawn using max-min sampling, based on the archetype’s imbalance ratio (desired ratio between the number of data points in the most and least populous clusters). Appendix B lists the formal attributes of a mixture model.
3.2 Managing Cluster Overlaps
Managing the degree of overlap between clusters is one of the most important tasks of a cluster generator. In repliclust, we quantify the pairwise overlap between two clusters as the irreducible error rate when classifying a new data point as belonging to one of the two clusters (assuming equal class probabilities). On the level of entire data sets, the user controls overlap by specifying the maximum and minimum pairwise overlaps.
The maximum overlap max_overlap states that no pair of clusters may overlap more than a certain amount. By contrast, the minimum overlap min_overlap prevents isolated clusters by enforcing that each cluster has some neighbor with which it overlaps at least min_overlap. We construct a loss function that enforces the minimum and maximum overlap conditions and optimize it using stochastic gradient descent (SGD).
In this section, we first describe our notion of pairwise overlap. Second, we explain the loss function we use to enforce the maximum and minimum overlap for an entire dataset.
3.2.1 Measuring Pairwise Overlap
Viewing clusters as probability distributions, we formally define the overlap between two clusters as twice the minimax error rate when classifying new data points using a linear decision boundary between the two clusters. Figure 3 illustrates this definition. To explain what we mean by “minimax” in this context, observe that any linear classifier $\hat{y}_{a,t}$ depends on an axis $a$ and threshold $t$ such that

$$\hat{y}_{a,t}(x) = \begin{cases} 1, & a^\top x \leq t \\ 2, & a^\top x > t. \end{cases} \tag{1}$$
By definition, the minimax classifier minimizes the worst-case loss. In symbols,

$$\hat{y}_{\text{minimax}} = \arg\min_{\hat{y}_{a,t}} \; \max_{i \in \{1,2\}} \; \mathbb{P}\big(\hat{y}_{a,t}(X) \neq i \mid Y = i\big), \tag{2}$$
where $Y$ is the true cluster label corresponding to a new data point $X$. The outer minimum on the right-hand side ranges over all linear classifiers $\hat{y}_{a,t}$. Rewriting (2) in terms of the classification axes and thresholds yields

$$(a^*, t^*) = \arg\min_{(a,\,t)} \; \max\big\{ \mathbb{P}(a^\top X > t \mid Y = 1),\; \mathbb{P}(a^\top X \leq t \mid Y = 2) \big\}. \tag{3}$$
It is not hard to see that the minimax condition requires the cluster-specific error rates $\mathbb{P}(a^\top X > t \mid Y = 1)$ and $\mathbb{P}(a^\top X \leq t \mid Y = 2)$ to be equal. Consequently, the cluster overlap becomes

$$\text{overlap} = 2\, \mathbb{P}\big((a^*)^\top X > t^* \mid Y = 1\big). \tag{4}$$
Geometrically, our definition means that two clusters overlap at level $\alpha$ if their marginal distributions along the minimax classification axis intersect at the $\alpha/2$ and $1 - \alpha/2$ quantiles. The left panel of Figure 3 highlights the probability mass bounded by these quantiles in gray.
Note that our formulation of the minimax classification error explicitly does not take into account class probabilities for the clusters; Equation (2) depends only on the class-conditional probabilities. Our reasoning is that the underlying reality of each cluster is captured by its class-conditional probability distribution, not its class probability.
Steinley and Henson (2005) quantify cluster overlap by computing the full distributional overlap in $d$ dimensions. We prefer our one-dimensional notion in terms of minimax classification error, since high-dimensional Euclidean geometry exhibits a number of counterintuitive phenomena. Specifically, as $d \to \infty$, the majority of a sphere’s volume becomes concentrated in a thin shell near its surface, so that $d$-dimensional overlap goes to zero unless the marginal overlaps in each dimension approach 100% (Blum et al., 2020; Steinley and Henson, 2005).
Computing our cluster overlap requires finding the minimax classification axis $a^*$. Anderson and Bahadur (1962) describe an algorithm for computing $a^*$ exactly in the case of multivariate normal distributions. Unfortunately, their method requires computing a matrix inverse of the form $(t\,\Sigma_1 + (1-t)\,\Sigma_2)^{-1}$ for $O(1/\epsilon)$ values of $t$, where $\Sigma_1, \Sigma_2$ are the clusters’ covariance matrices and $\epsilon$ is a numerical tolerance. To avoid these matrix inversions, we propose an approximation of the minimax classification axis based on linear discriminant analysis (LDA).
For a pair of multivariate normal clusters with means $\mu_1, \mu_2$ and the same covariance matrix $\Sigma$, the axis $a_{\text{LDA}} = \Sigma^{-1}(\mu_2 - \mu_1)$ minimizes the overall misclassification rate, assuming equal class probabilities (see Hastie et al., 2009). Unfortunately, the result does not hold for unequal covariance matrices $\Sigma_1 \neq \Sigma_2$. Thus, we propose the approximation $a_{\text{LDA}} = \big(\tfrac{\Sigma_1 + \Sigma_2}{2}\big)^{-1}(\mu_2 - \mu_1)$. Figure 4 verifies that this LDA-based approximation closely matches the exact overlap, compared with a cruder “center-to-center” (C2C) approximation that uses the difference between the cluster centers as the classification axis (i.e., $a_{\text{C2C}} = \mu_2 - \mu_1$).
Theorem 1 gives a formula for computing the LDA-approximate cluster overlap. The proof is in Appendix C.
Theorem 1 (LDA-Based Cluster Overlap)
For two multivariate normal clusters with means $\mu_1, \mu_2$ and covariance matrices $\Sigma_1, \Sigma_2$, the approximate cluster overlap based on the linear separator $a_{\text{LDA}} = \big(\tfrac{\Sigma_1 + \Sigma_2}{2}\big)^{-1}(\mu_2 - \mu_1)$ is

$$\text{overlap}_{\text{LDA}} = 2\,\Phi\!\left(-\,\frac{a_{\text{LDA}}^\top(\mu_2 - \mu_1)}{\sqrt{a_{\text{LDA}}^\top \Sigma_1\, a_{\text{LDA}}} + \sqrt{a_{\text{LDA}}^\top \Sigma_2\, a_{\text{LDA}}}}\right), \tag{5}$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution. Moreover, if $\Sigma_2 = c\,\Sigma_1$ for some $c > 0$, then $\text{overlap}_{\text{LDA}}$ equals the exact cluster overlap.
Theorem 1 shows that cluster overlap depends on quantiles of the form

$$q(a) = \frac{a^\top(\mu_2 - \mu_1)}{\sqrt{a^\top \Sigma_1\, a} + \sqrt{a^\top \Sigma_2\, a}}, \tag{6}$$

where $a$ is a classification axis. Since these quantiles are inversely related to cluster overlap, they quantify cluster separation.
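For concreteness, here is a short NumPy/SciPy sketch that evaluates the LDA-based overlap of Theorem 1; it mirrors Equations (5) and (6) rather than repliclust’s internal code:

```python
import numpy as np
from scipy.stats import norm

def lda_overlap(mu1, Sigma1, mu2, Sigma2):
    """LDA-based approximation of the overlap between two normal clusters,
    following Equation (5): overlap = 2 * Phi(-q(a)) with the LDA axis a."""
    a = np.linalg.solve((Sigma1 + Sigma2) / 2, mu2 - mu1)  # LDA axis
    separation = a @ (mu2 - mu1) / (
        np.sqrt(a @ Sigma1 @ a) + np.sqrt(a @ Sigma2 @ a)
    )  # the quantile q(a) of Equation (6)
    return 2 * norm.cdf(-separation)

# two 2D clusters with unequal covariance matrices
overlap = lda_overlap(np.zeros(2), np.eye(2),
                      np.array([3.0, 0.0]), np.diag([2.0, 0.5]))
```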
When generating synthetic data, repliclust uses Theorem 1 even when cluster distributions are non-normal. Figure 12 in Appendix D suggests that controlling overlap between non-normal distributions works well, except for distributions with bounded support (such as beta).
3.3 Adjusting Cluster Centers to Meet Overlap Constraints
To enforce maximum and minimum overlaps for a mixture model, we optimize a loss function using stochastic gradient descent (SGD). Given mixture model parameters $\theta$ (including the cluster centers, principal axes, and principal axis lengths), the overlap loss is

$$L(\theta) = \sum_{i=1}^{k} L_i(\theta), \tag{7}$$

where the loss on the $i$-th cluster is

$$L_i(\theta) = \phi\Big(\big[\min_{j \neq i} q_{ij} - q_{\max}\big]_+\Big) + \sum_{j \neq i} \phi\Big(\big[q_{\min} - q_{ij}\big]_+\Big). \tag{8}$$

Here, $q_{\min}$, $q_{\max}$, and $q_{ij}$ measure cluster separation as expressed in Equation (6): $q_{\min}$ is the minimum allowed separation (corresponding to the overlap max_overlap); $q_{\max}$ is the maximum allowed separation (corresponding to min_overlap); and $q_{ij}$ is the separation between the $i$-th and $j$-th clusters. The penalty function $\phi$ is the polynomial $\phi(x) = \gamma x^2 + (1 - \gamma)\,x$, where $\gamma \in [0, 1]$ is a tuning parameter. Finally, $[\,\cdot\,]_+ = \max(\cdot, 0)$ is a filter that passes on only positive inputs (corresponding to a violation of the user-specified constraints).
3.3.1 Intuition behind the Overlap Loss
By design, the loss (8) vanishes when the cluster centers, principal axes, and principal axis lengths satisfy the user-specified overlap constraints. The first term penalizes violation of the minimum overlap condition. Indeed, if cluster $i$ is too far away from the other clusters, the separation between cluster $i$ and its closest neighbor exceeds the maximum allowed separation $q_{\max}$. A penalty of the excess yields the first term in (8). The second term measures violation of the maximum overlap condition. If the separation $q_{ij}$ between clusters $i$ and $j$ falls short of the smallest allowed separation $q_{\min}$, the shortfall incurs a penalty that serves to push these clusters apart.
The penalty in (8) ranges from quadratic to linear based on the value of $\gamma$. Keeping the penalty partly linear ($\gamma < 1$) helps gradient descent drive the overlap loss to exactly zero, because a purely quadratic loss would have a vanishing derivative as the overlap constraints approach satisfaction.
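The sketch below computes the single-cluster loss (8) from a matrix of pairwise separations, using the penalty $\phi$ defined above; it is illustrative only, not the package’s implementation:

```python
import numpy as np

def penalty(x, gamma=0.5):
    # phi(x) = gamma * x^2 + (1 - gamma) * x: partly linear, so the gradient
    # does not vanish as the overlap constraints approach satisfaction
    return gamma * x ** 2 + (1 - gamma) * x

def cluster_loss(i, q, q_min, q_max, gamma=0.5):
    """Loss (8) for cluster i, given a k x k matrix q of pairwise separations."""
    q_i = np.delete(q[i], i)  # separations from cluster i to all other clusters
    # minimum-overlap term: penalize cluster i for drifting too far from
    # its closest neighbor (separation above q_max)
    loss = penalty(max(q_i.min() - q_max, 0.0), gamma)
    # maximum-overlap terms: penalize any separation below q_min
    loss += penalty(np.maximum(q_min - q_i, 0.0), gamma).sum()
    return loss

q = np.array([[0.0, 2.0, 5.0],
              [2.0, 0.0, 3.0],
              [5.0, 3.0, 0.0]])
print(cluster_loss(0, q, q_min=1.5, q_max=4.0))  # no violations -> 0.0
```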
3.3.2 Running the Minimization in Practice
When minimizing (7) using SGD, we initially place cluster centers randomly within a sphere. The volume of this sphere influences the initial overlaps between clusters. To select an appropriate value, we fix the ratio $\rho$ of the sum of cluster volumes to the volume of the sphere; essentially, $\rho$ is the density of clusters within the sampling volume. Values of $\rho$ around 10% work well in low dimensions. In higher dimensions, however, results from the mathematics of sphere packing motivate a downward adjustment. A lower bound by Ball (1992) states that the maximum achievable density when placing non-overlapping spheres inside $\mathbb{R}^d$ is at least $2(d-1)\,2^{-d}$. Thus, in $d$ dimensions we use an adjusted density defined by

$$\rho_d = \rho_2 \cdot \frac{2(d-1)\,2^{-d}}{2(2-1)\,2^{-2}},$$

where $\rho_2$ is the equivalent density in 2D.
Following initialization, we optimize the cluster centers using stochastic gradient descent. During this process, the principal axes and their lengths are fixed. Each iteration performs the update
$$\mu_i \leftarrow \mu_i - \eta\, \nabla_{\mu_i} L_i \tag{9}$$

on the single-cluster loss $L_i$, where $\eta > 0$ is the learning rate. For each epoch, we randomly permute the order of the clusters and apply the update (9) in turn for each cluster.
Experiments suggest that our minimization procedure drives the overlap loss to zero at an exponential rate (linear convergence rate), as expected for gradient descent. The number of epochs required seems largely independent of the number of clusters, though it increases slightly with the number of dimensions.
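To illustrate the optimization loop, here is a self-contained toy version: it assumes unit spherical clusters (so the separation reduces to half the distance between centers), uses a purely quadratic penalty, and replaces analytic gradients with finite differences:

```python
import numpy as np

def separations(centers):
    # toy stand-in for the quantile q of Equation (6), assuming unit
    # spherical clusters: distance between centers divided by two
    diff = centers[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=-1) / 2.0

def total_loss(centers, q_min, q_max):
    q = separations(centers)
    loss = 0.0
    for i in range(len(centers)):
        q_i = np.delete(q[i], i)
        loss += max(q_i.min() - q_max, 0.0) ** 2          # minimum-overlap term
        loss += (np.maximum(q_min - q_i, 0.0) ** 2).sum()  # maximum-overlap terms
    return loss

rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(5, 2))
lr, eps = 0.1, 1e-5
for epoch in range(200):
    for i in rng.permutation(len(centers)):   # random cluster order per epoch
        grad = np.zeros(2)
        for dim in range(2):                  # finite-difference gradient
            bump = np.zeros(2)
            bump[dim] = eps
            plus, minus = centers.copy(), centers.copy()
            plus[i] += bump
            minus[i] -= bump
            grad[dim] = (total_loss(plus, 0.5, 2.0)
                         - total_loss(minus, 0.5, 2.0)) / (2 * eps)
        centers[i] -= lr * grad               # update (9)
```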
3.4 Making Cluster Shapes More Irregular
In many application domains, ellipsoidal center-based clusters are good models for the data. This is often the case when clusters can be characterized by a “typical” element. For example, in single-cell RNA sequencing, Chen et al. (2020) represent the differences in transcriptional activity across cell types using Gaussian mixture models. In this case, the cluster centers correspond to each cell type’s typical gene expression profile. Similarly, in deep learning, a large language model’s hidden states often form compact clusters that can effectively classify text (Howard and Ruder, 2018), assess the truthfulness of a response (Azaria and Mitchell, 2023), and help detect out-of-distribution inputs (Ren et al., 2023).
However, convex clusters can be inappropriate. For example, Ester et al. (1996) developed the density-based DBSCAN algorithm specifically to handle spatial (GPS) data. Characterized by thin loops and irregular shapes, such data does not fit a center-based clustering paradigm at all. Another example is directional data lying on a sphere (Banerjee et al., 2005; Salah and Nadif, 2019). In this case, the clusters have centers but are not convex.
To accommodate use cases where ellipsoidal clusters are inappropriate, repliclust provides two post-processing functions for creating non-convex clusters. The first, repliclust.distort, passes a data set through a randomly initialized neural network, as shown in Figure 5. The second, repliclust.wrap_around_sphere, wraps a $d$-dimensional data set around the $d$-dimensional sphere by applying an inverse stereographic projection, as shown in Figure 6. Appendix D describes the default architecture of the neural network used in repliclust.distort.
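To illustrate both post-processing steps, here is a minimal NumPy sketch: the inverse stereographic projection follows the standard formula (repliclust.wrap_around_sphere may use different scaling conventions), and the random MLP is a stand-in for the package’s default architecture described in Appendix D:

```python
import numpy as np

def wrap_around_sphere(X):
    """Map d-dimensional points onto the d-sphere in R^(d+1) via the
    standard inverse stereographic projection."""
    s = (X ** 2).sum(axis=1, keepdims=True)
    return np.hstack([2 * X, s - 1]) / (s + 1)

def distort(X, hidden=32, rng=None):
    """Pass data through a small randomly initialized MLP (one tanh hidden
    layer) to make cluster shapes non-convex; the package's default
    architecture differs."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, hidden))
    W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, d))
    return np.tanh(X @ W1) @ W2
```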
3.5 Generating Archetypes from Natural Language
To fully realize the potential of a high-level, archetype-based approach to synthetic data generation, our package repliclust draws on the OpenAI API (OpenAI, 2023) to allow users to create data set archetypes directly from verbal descriptions in English.
To map a user’s natural language input to the high-level geometric parameters of Table 1, we use few-shot prompting of OpenAI’s GPT-4o large language model (Brown et al., 2020; OpenAI et al., 2024). We use the same approach to automatically create short identifiers for archetypes, such as “ten_highly_oblong_very_different_shapes_moderate_overlap”. These identifiers contain no white space and obey the naming rules for Python variables.
Version 1.0.0 of repliclust uses 22-shot prompting to map archetype descriptions to parameters, and 11-shot prompting to generate archetype identifiers; both appear to work well. In the rare case that the natural language workflow fails, repliclust throws an error. We release all few-shot examples and prompt templates in the natural_language module of our code base. For convenience, they are reproduced in Appendix F.
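A stripped-down sketch of this workflow with the OpenAI Python client follows; the system prompt and the single few-shot example below are invented placeholders, whereas the real templates and 22 few-shot examples live in the natural_language module:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    # invented placeholder pair; repliclust ships 22 such examples
    {"role": "user", "content": "three oblong, slightly overlapping clusters in 2D"},
    {"role": "assistant", "content":
        '{"n_clusters": 3, "dim": 2, "aspect_ref": 3.0, "max_overlap": 0.05}'},
]

def archetype_from_description(description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": "Map the scenario description to archetype parameters as JSON."}]
                 + FEW_SHOT
                 + [{"role": "user", "content": description}],
    )
    return response.choices[0].message.content
```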
4 Results
In this section, we first illustrate the convenience of archetype-based data generation by conducting a mock benchmark of several clustering algorithms. Second, we verify that our notion of cluster overlap effectively captures clustering difficulty (even in higher dimensions), and study the distribution of pairwise overlaps in data sets with multiple clusters.
4.1 Mock Benchmark
We demonstrate the convenience and informativeness of our archetype-based data generator by running a mock benchmark comparing several clustering algorithms on data drawn from different archetypes. The benchmark is not meant to comprehensively evaluate the strengths and weaknesses of the clustering algorithms we consider. Instead, our goal is to demonstrate the advantages (in convenience and informativeness) of an archetype-based approach to synthetic data generation.
First, we construct six archetypes by providing repliclust with the following verbal descriptions:

1. ‘twelve clusters of different distributions’
2. ‘twelve clusters of different distributions and high class imbalance’
3. ‘seven highly separated clusters in 10D with very different shapes’
4. ‘seven clusters in 10D with very different shapes and significant overlap’
5. ‘four clusters in 100D with 100 samples each’
6. ‘four clusters in 100D with 1000 samples each’
Repliclust maps these descriptions to the following data set archetypes (with one exception: the automatically generated names for archetypes 5 and 6 did not end in the suffix “_each”; we added this suffix for clarity):
1. {'name': 'twelve_clusters_different_distributions', 'n_clusters': 12, 'dim': 2, 'n_samples': 1200, 'aspect_ref': 1.5, 'aspect_maxmin': 2, 'radius_maxmin': 3, 'imbalance_ratio': 2, 'max_overlap': 0.05, 'min_overlap': 0.001, 'distributions': ['normal', 'exponential', 'gamma', 'weibull', 'lognormal']}

2. {'name': 'twelve_different_distributions_high_class_imbalance', 'n_clusters': 12, 'dim': 2, 'n_samples': 1200, 'aspect_ref': 1.5, 'aspect_maxmin': 2, 'radius_maxmin': 3, 'imbalance_ratio': 5, 'max_overlap': 0.05, 'min_overlap': 0.001, 'distributions': ['normal', 'exponential', 'gamma', 'weibull', 'lognormal']}

3. {'name': 'seven_highly_separated_10d_very_different_shapes', 'n_clusters': 7, 'dim': 10, 'n_samples': 700, 'aspect_ref': 1.5, 'aspect_maxmin': 3.0, 'radius_maxmin': 3.0, 'imbalance_ratio': 2, 'max_overlap': 0.0001, 'min_overlap': 1e-05, 'distributions': ['normal', 'exponential']}

4. {'name': 'seven_very_different_shapes_significant_overlap_10d', 'n_clusters': 7, 'dim': 10, 'n_samples': 700, 'aspect_ref': 1.5, 'aspect_maxmin': 3.0, 'radius_maxmin': 3, 'imbalance_ratio': 2, 'max_overlap': 0.2, 'min_overlap': 0.1, 'distributions': ['normal', 'exponential']}

5. {'name': 'four_clusters_100d_100_samples_each', 'n_clusters': 4, 'dim': 100, 'n_samples': 400, 'aspect_ref': 1.5, 'aspect_maxmin': 2, 'radius_maxmin': 3, 'imbalance_ratio': 2, 'max_overlap': 0.05, 'min_overlap': 0.001, 'distributions': ['normal', 'exponential']}

6. {'name': 'four_clusters_100d_1000_samples_each', 'n_clusters': 4, 'dim': 100, 'n_samples': 4000, 'aspect_ref': 1.0, 'aspect_maxmin': 1.0, 'radius_maxmin': 3, 'imbalance_ratio': 2, 'max_overlap': 0.05, 'min_overlap': 0.001, 'distributions': ['normal', 'exponential']}
4.1.1 Examining the Generated Data Sets
Figure 7 shows four representative data sets with convex clusters drawn from each archetype. Figure 8 shows non-convex clusters resulting from applying the distort function (as described in Section 3.4). For archetypes with dimensionality greater than two, we use t-SNE to visualize the data sets in 2D (van der Maaten and Hinton, 2008).
The figures show that repliclust effectively generates data sets with similar geometric characteristics. Moreover, the data sets match the user-specified verbal descriptions well. Applying distort makes some clusters very long and thin, but otherwise results in satisfactory non-convex clusters. The distort function seems to roughly preserve cluster overlaps; we leave a careful study of this to future work. The 10- and 100-dimensional archetypes indicate that our overlap control effectively handles different numbers of dimensions.
4.1.2 Comparing the Clustering Algorithms
We compare the following clustering algorithms: K-Means, hierarchical (with Ward linkage), spectral (with radial-basis-function affinity), HDBSCAN, and expectation maximization for a Gaussian mixture model (EM-GMM). We originally intended to include DBSCAN as well. However, the heuristics we tried for choosing the neighborhood radius (including the one suggested by Sander et al. (1998)) did not work well across the range of dimensionalities we consider, resulting in too many noise points (see also the discussion by Schubert et al. (2017)).
We measure performance in terms of adjusted mutual information (AMI) and adjusted Rand index (ARI), based on the ground-truth cluster labels (Vinh et al., 2010; Hubert and Arabie, 1985). To carry out the benchmark, we sample 10 times from each archetype, resulting in 60 distinct data sets. We repeat this process twice to evaluate performance on convex and non-convex clusters (the latter resulting from applying the distort function to new data sets). Thus, the full benchmark is based on 120 distinct data sets. Table 2 shows the results for K-Means, hierarchical, spectral, and EM-GMM. All these algorithms receive the true number of clusters as an input.
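As an illustration of the evaluation protocol, the sketch below scores the four algorithms with scikit-learn’s AMI and ARI implementations; make_blobs stands in for a repliclust archetype to keep the example self-contained:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# stand-in data; in the benchmark, (X, y) come from a repliclust archetype
X, y = make_blobs(n_samples=700, centers=7, n_features=10, random_state=0)
k = len(np.unique(y))  # every algorithm receives the true number of clusters

models = {
    "K-Means": KMeans(n_clusters=k, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=k, linkage="ward"),
    "Spectral": SpectralClustering(n_clusters=k, affinity="rbf", random_state=0),
    "EM-GMM": GaussianMixture(n_components=k, random_state=0),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name}: AMI={adjusted_mutual_info_score(y, labels):.3f}, "
          f"ARI={adjusted_rand_score(y, labels):.3f}")
```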
Table 3 separately lists the performance of HDBSCAN (we used min_samples=5 in the scikit-learn implementation; Pedregosa et al., 2011). We do this for two reasons. First, as a density-based algorithm, HDBSCAN cannot make use of the true number of clusters, putting it at a disadvantage. Second, HDBSCAN reports a large number of noise points (60% on average), so its results are not directly comparable to those of the other algorithms. We report the performance of HDBSCAN based only on the non-noise points. This leads to higher performance numbers, which may compensate for the disadvantage HDBSCAN faces in not knowing the true number of clusters.
Table 2. Benchmark results for EM-GMM, K-Means, spectral, and hierarchical clustering (standard errors in parentheses).

AMI, convex clusters:

| Archetype | EM-GMM | K-Means | Spectral | Hierarchical |
|---|---|---|---|---|
| twelve_clusters_different_distributions | 0.884 (.007) | 0.848 (.008) | 0.859 (.010) | 0.853 (.007) |
| twelve_different_distributions_high_class_imbalance | 0.854 (.008) | 0.836 (.007) | 0.852 (.010) | 0.834 (.009) |
| seven_highly_separated_10d_very_different_shapes | 0.983 (.009) | 0.976 (.005) | 0.986 (.003) | 0.990 (.002) |
| seven_very_different_shapes_significant_overlap | 0.349 (.019) | 0.452 (.009) | 0.443 (.014) | 0.380 (.010) |
| four_clusters_100d_100_samples_each | 0.063 (.008) | 0.512 (.028) | 0.074 (.023) | 0.654 (.022) |
| four_clusters_100d_1000_samples_each | 0.548 (.061) | 0.664 (.023) | 0.205 (.014) | 0.868 (.025) |
| Average | 0.614 (.044) | 0.715 (.025) | 0.570 (.046) | 0.763 (.026) |

AMI, non-convex clusters:

| Archetype | EM-GMM | K-Means | Spectral | Hierarchical |
|---|---|---|---|---|
| twelve_clusters_different_distributions | 0.576 (.017) | 0.563 (.012) | 0.565 (.015) | 0.565 (.011) |
| twelve_different_distributions_high_class_imbalance | 0.544 (.021) | 0.542 (.023) | 0.547 (.025) | 0.553 (.024) |
| seven_highly_separated_10d_very_different_shapes | 0.750 (.035) | 0.606 (.027) | 0.709 (.027) | 0.637 (.029) |
| seven_very_different_shapes_significant_overlap | 0.173 (.014) | 0.149 (.008) | 0.165 (.010) | 0.145 (.010) |
| four_clusters_100d_100_samples_each | 0.077 (.011) | 0.100 (.012) | 0.079 (.008) | 0.092 (.008) |
| four_clusters_100d_1000_samples_each | 0.307 (.030) | 0.145 (.014) | 0.109 (.012) | 0.132 (.016) |
| Average | 0.404 (.032) | 0.351 (.030) | 0.362 (.033) | 0.354 (.031) |

ARI, convex clusters:

| Archetype | EM-GMM | K-Means | Spectral | Hierarchical |
|---|---|---|---|---|
| twelve_clusters_different_distributions | 0.855 (.011) | 0.795 (.012) | 0.807 (.018) | 0.798 (.014) |
| twelve_different_distributions_high_class_imbalance | 0.800 (.019) | 0.760 (.012) | 0.787 (.020) | 0.756 (.017) |
| seven_highly_separated_10d_very_different_shapes | 0.967 (.020) | 0.977 (.005) | 0.988 (.003) | 0.992 (.002) |
| seven_very_different_shapes_significant_overlap | 0.238 (.023) | 0.375 (.012) | 0.385 (.016) | 0.291 (.015) |
| four_clusters_100d_100_samples_each | 0.004 (.007) | 0.404 (.036) | 0.053 (.014) | 0.554 (.028) |
| four_clusters_100d_1000_samples_each | 0.408 (.048) | 0.601 (.040) | 0.185 (.011) | 0.874 (.032) |
| Average | 0.545 (.047) | 0.652 (.030) | 0.534 (.044) | 0.711 (.031) |

ARI, non-convex clusters:

| Archetype | EM-GMM | K-Means | Spectral | Hierarchical |
|---|---|---|---|---|
| twelve_clusters_different_distributions | 0.368 (.021) | 0.365 (.015) | 0.364 (.015) | 0.362 (.012) |
| twelve_different_distributions_high_class_imbalance | 0.341 (.022) | 0.347 (.021) | 0.345 (.025) | 0.360 (.024) |
| seven_highly_separated_10d_very_different_shapes | 0.684 (.045) | 0.505 (.035) | 0.628 (.036) | 0.523 (.039) |
| seven_very_different_shapes_significant_overlap | 0.109 (.011) | 0.090 (.006) | 0.089 (.007) | 0.081 (.007) |
| four_clusters_100d_100_samples_each | 0.018 (.006) | 0.062 (.011) | 0.053 (.008) | 0.051 (.005) |
| four_clusters_100d_1000_samples_each | 0.252 (.029) | 0.088 (.014) | 0.065 (.012) | 0.072 (.016) |
| Average | 0.295 (.029) | 0.243 (.023) | 0.264 (.025) | 0.241 (.025) |
Table 3. Benchmark results for HDBSCAN (standard errors in parentheses). AMI and ARI are computed on the non-noise points; the noise columns give the fraction of points labeled as noise.

| Archetype | AMI (convex) | ARI (convex) | Noise (convex) | AMI (non-convex) | ARI (non-convex) | Noise (non-convex) |
|---|---|---|---|---|---|---|
| twelve_clusters_different_distributions | 0.796 (.045) | 0.705 (.066) | 0.269 (.033) | 0.605 (.016) | 0.400 (.026) | 0.353 (.007) |
| twelve_different_distributions_high_class_imbalance | 0.733 (.054) | 0.572 (.083) | 0.238 (.042) | 0.560 (.024) | 0.356 (.036) | 0.346 (.016) |
| seven_highly_separated_10d_very_different_shapes | 0.992 (.006) | 0.984 (.014) | 0.175 (.014) | 0.747 (.106) | 0.700 (.117) | 0.417 (.043) |
| seven_very_different_shapes_significant_overlap | 0.595 (.110) | 0.617 (.124) | 0.909 (.020) | 0.163 (.050) | 0.149 (.066) | 0.744 (.072) |
| four_clusters_100d_100_samples_each | 1.000 (.000) | 1.000 (.000) | 1.000 (.000) | 0.794 (.137) | 0.804 (.131) | 0.990 (.007) |
| four_clusters_100d_1000_samples_each | 0.800 (.133) | 0.800 (.133) | 0.999 (.000) | 0.031 (.018) | 0.023 (.016) | 0.687 (.070) |
| Average | 0.819 (.035) | 0.780 (.040) | 0.599 (.049) | 0.483 (.047) | 0.405 (.047) | 0.590 (.036) |
The results show that hierarchical clustering performs best on the convex cluster shapes, as long as there is sufficient separation between clusters. On the non-convex clusters, EM-GMM exhibits the strongest performance even though the clusters are no longer multivariate normal after applying distort (see Figure 8).
K-Means and hierarchical clustering both hold up well on the high-dimensional data, including in the low-sample regime of 100 samples per cluster in 100D. By contrast, EM-GMM dramatically benefits from more samples on the high-dimensional data.
Spectral clustering does not display competitive performance in this benchmark. While it improves over K-Means in some scenarios, it delivers weaker performance in high dimensions and never performs best across all algorithms.
HDBSCAN shows great difficulty in handling high dimensionality or significant overlap between clusters. However, the algorithm shows strong performance on the 12 non-convex clusters drawn from diverse distributions. We note again that HDBSCAN does not have access to the true number of clusters, in contrast with the other algorithms.
4.2 Minimax Classification Error Captures Clustering Difficulty
In Section 3.2, we defined the overlap between two clusters in terms of the error rate of the best minimax linear classifier. We verify that this notion of overlap conveys clustering difficulty by measuring clustering performance on data sets with different degrees of overlap. For this simulation, we consider data sets with two clusters drawn from an archetype we described as “two clusters with very different shapes in $d$D”, where we vary the dimensionality $d$. This verbal description yields an archetype with parameters {'name': 'two_very_different_shapes_d', 'n_clusters': 2, 'dim': d, 'n_samples': 200, 'aspect_ref': 1.5, 'aspect_maxmin': 3.0, 'radius_maxmin': 3, 'imbalance_ratio': 2, 'max_overlap': 0.05, 'min_overlap': 0.001, 'distributions': ['normal', 'exponential']}. We vary max_overlap up to 0.5, while setting min_overlap = max_overlap/10. For each overlap setting, we generate 100 distinct data sets and evaluate the average clustering performance of hierarchical clustering, quantified in terms of adjusted mutual information (AMI) and adjusted Rand index (ARI), as in the benchmark of Section 4.1. We choose hierarchical clustering because it is computationally efficient and performed well in the benchmark. We repeat this process twice, where in the second run we make the clusters non-convex by applying the distort function.
Figure 9 confirms that clustering difficulty rises with increasing overlap. Figure 10 shows the same in the case of non-convex clusters, suggesting that applying distort maintains the desired relationship between overlap and clustering difficulty. Additionally, both figures show how our cluster overlap relates to the silhouette score, a popular metric for quantifying clustering difficulty (Rousseeuw, 1987; Shand et al., 2019). At a fixed value of max_overlap, the silhouette score decreases markedly with a rise in dimensionality. This is not an artifact of our overlap measure, since plotting clustering performance versus silhouette score shows a similar dependence on dimensionality (not shown). This makes sense because the silhouette score is based on differences of Euclidean distances, and distances between points tend to become more similar in high dimensions (Aggarwal et al., 2001).
4.2.1 Examining the Distribution of Pairwise Overlaps
Since repliclust controls cluster overlap on the level of entire data sets by setting two global parameters (max_overlap and min_overlap), it is worthwhile to investigate the distribution of pairwise overlaps on data sets with multiple clusters. Figure 11 shows the distribution of pairwise overlaps for six data set archetypes, confirming that setting max_overlap and min_overlap effectively controls the pairwise overlaps between clusters.
5 Related Work
Simulations on synthetic data play an important role in cluster analysis. Accordingly, many synthetic data generators for cluster analysis have been proposed. However, the key idea in this paper is to specify the overall geometric characteristics of synthetic data in a high-level manner via data set archetypes. By contrast, previous data generators have put the onus on the user to design the overall geometric structure by carefully tuning lower-level properties of individual clusters.
In the following overview, we mainly focus on general-purpose generators. However, the literature has also proposed more specialized solutions to fill specific needs in the community. For example, Beer et al. (2019) present a data generator for subspace clustering, Gan and Tao (2015) evaluate density-based clustering using a data generation process based on random walks, and Handl and Knowles (2005) focus on creating long and thin ellipsoidal clusters in higher dimensions. None of these contributions focuses on giving high-level control over the overall geometry of the data sets.
Milligan and Cooper (1985) implement a generator that creates several clusters in up to 8 dimensions. The method enforces an upper bound on cluster overlap by limiting overlap in the first dimension, but does not otherwise provide control over high-level geometric structure.
Pei and Zaïane (2006) present a compelling software system for generating two-dimensional clusters, which creates data sets with specified clustering difficulty (“easy”, “medium”, or “hard”); “easy” data consists only of spherical or convex clusters, whereas “medium” and “hard” data include curve segments and special shapes like letters of the alphabet. The generator does not offer high-level control over data set geometry beyond the difficulty scale and the density of noise points added to the data. The software comes with an appealing graphical user interface.
The popular scikit-learn library for machine learning (Pedregosa et al., 2011) offers several functions for creating synthetic clusters. Among these, some are aimed at reproducing canonical 2D toy data sets like concentric circles and nested moons (make_moons, make_circles), while others focus on sampling multivariate normal clusters. These functions offer some valuable features, such as the ability to create data sets with informative and redundant features, as in the make_classification function. However, they do not control the overall geometric characteristics of the data.
The data mining framework ELKI (Schubert and Zimek, 2019) provides a synthetic data generator based on specifying probabilistic models in XML files. This XML-based approach makes it easy to reproduce benchmarks by sharing the underlying XML files. Drawing inspiration from this work, we have implemented an Archetype.describe function allowing users to easily share collections of data set archetypes as JSONL files.
The generators OCLUS (Steinley and Henson, 2005) and GenRandomClust (Qiu and Joe, 2006) focus on providing more sophisticated overlap control compared to previous generators. GenRandomClust extends the generator of Milligan and Cooper (1985) by managing overlaps between clusters with different ellipsoidal shapes and arbitrary spatial orientations. Similar to our classification error-based notion of cluster overlap, their method finds an optimal separation direction between two clusters. To enforce the desired degree of overlap, the algorithm initially places cluster centers on a scaffold, then scales individual edges of the scaffold up or down to meet the overlap constraint. The method supports making all cluster shapes more or less elongated, but does not otherwise provide high-level control over data set geometry. Moreover, the scaling operations undertaken to manage cluster overlaps implicitly sacrifice control over cluster volumes (which, in repliclust, can be managed independently).
OCLUS (Steinley and Henson, 2005) quantifies cluster overlap in terms of the shared density between two clusters. The generator uses analytical formulas for integrals of several interesting probability distributions (including exponential, gamma, and chi-square), thereby effectively managing overlaps between non-normal clusters. However, reliance on these analytical formulas limits the method to treating all dimensions independently, so that cluster distributions factor into products of marginals. The paper contains a valuable discussion of the distinction between marginal and joint cluster overlaps, both of which their software supports.
Like other existing generators, OCLUS does not aspire to help the user establish the overall geometric characteristics of synthetic data sets. To generate a data set, the user must provide a covariance matrix for each cluster, the desired overlaps between all pairs of clusters, and a design matrix specifying which clusters overlap at all. In sum, generating 10 clusters in 10D requires supplying over 500 numbers, compared to only a handful in repliclust (or none, if the user chooses to describe the archetype in English).
MDCGen (Iglesias et al., 2019) is a feature-rich generator that supports many desiderata in cluster analysis, such as overlap control, different probability distributions, subspace clusters, and the ability to add noise points. In particular, it is convenient to be able to place noise points away from the clusters, which is made possible by the grid-based strategy for placing cluster centers. MDCGen does not target the overall geometric characteristics of synthetic data sets, instead giving users low-level control enabling extensive configurability. For example, managing the overlap between clusters involves setting compactness coefficients, grid granularity, and overall scale, compared to tweaking only max_overlap in repliclust (min_overlap may also need adjustment, but it can usually stay at max_overlap/10 or a similar value). In the words of the authors, “to enable covering a broad range of dataset possibilities, the parameters are multiple and some training for tuning the tool is required.”
Finally, the HAWKS generator (Shand et al., 2019) controls cluster overlaps using an evolutionary algorithm that evolves the means and covariance matrices of multivariate normal distributions. The paper applies this framework to create data sets with a user-specified silhouette score representing clustering difficulty (Rousseeuw, 1987). In principle, the evolutionary framework can be extended to attain desired high-level geometric characteristics. Of these, the authors consider two examples: cluster overlaps and elongations (the latter relating to our notion of cluster aspect ratio, as listed in Table 1). An interesting aspect of HAWKS is the ability to generate data sets that maximize the performance difference between two clustering algorithms. This feature is especially useful in two dimensions, since we can then visually examine the data sets to better understand when each algorithm succeeds or fails.
6 Conclusion
In this paper, we have presented an archetype-based approach to synthetic data generation for cluster analysis. Our method works by summarizing the overall geometric characteristics of a probabilistic mixture model with a few high-level parameters, and sampling mixture models subject to these constraints. To convey the convenience and informativeness of such an archetype-based approach, we have implemented a natural language interface allowing users to create archetypes purely from verbal descriptions in English. Thus, our software repliclust makes it possible to run an entire benchmark by describing the desired evaluation scenarios in English.
Although our data generator relies on creating a skeleton of convex, ellipsoidal clusters, we have implemented ways to make the cluster shapes more irregular and complex. The first method passes convex clusters through a randomly initialized neural network, making their shapes non-convex and irregular. The second method creates directional data sets by wrapping $d$-dimensional convex clusters around the $d$-dimensional sphere.
In future work, we are most interested in learning data set archetypes that mimic the geometric characteristics of real data. In application domains where multivariate Gaussians provide a good model for the data, it would suffice to fit a Gaussian mixture model, empirically measure its high-level geometric parameters (as listed in Table 1), then create a synthetic data set archetype with these parameters. For application domains with non-convex clusters, it would be interesting to implement a generalized version of this approach. We can imagine training an auto-encoder type neural network that maps non-convex clusters into a hidden space where they become multivariate Gaussian, then undoes the transformation. Defining an archetype based on the high-level geometric characteristics in the hidden space would presumably allow us to sample irregularly shaped clusters that look similar to those found in real data.
Acknowledgements
MJZ would like to acknowledge support from Matt Thomson and members of the Thomson Lab at Caltech. In addition, MJZ thanks Dante Roy Calapate for providing his computer monitor during the early stages of this project.
References
- Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In Van den Bussche, J. and Vianu, V., editors, Database Theory — ICDT 2001, pages 420–434, Berlin, Heidelberg. Springer.
- Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal distributions with different covariance matrices. The Annals of Mathematical Statistics, 33(2):420–431.
- Azaria, A. and Mitchell, T. (2023). The internal state of an LLM knows when it’s lying. In Bouamor, H., Pino, J., and Bali, K., editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Ball, K. (1992). A lower bound for the optimal density of lattice packings. International Mathematics Research Notices, 1992(10):217–221.
- Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(46):1345–1382.
- Beer, A., Schüler, N.-S., and Seidl, T. (2019). A generator for subspace clusters.
- Blum, A., Hopcroft, J., and Kannan, R. (2020). Foundations of Data Science. Cambridge University Press.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Chen, S., Rivaud, P., Park, J. H., Tsou, T., Charles, E., Haliburton, J. R., Pichiorri, F., and Thomson, M. (2020). Dissecting heterogeneous cell populations across drug and disease conditions with PopAlign. Proceedings of the National Academy of Sciences, 117(46):28784–28794.
- Cormack, R. M. (1971). A review of classification. Journal of the Royal Statistical Society, Series A (General), 134(3):321–367.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–22.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press.
- Gan, J. and Tao, Y. (2015). DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 519–530, New York, NY, USA. Association for Computing Machinery.
- Handl, J. and Knowles, J. (2005). Improvements to the scalability of multiobjective clustering. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC ’05, pages 2372–2379. IEEE.
- Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). Array programming with NumPy. Nature, 585(7825):357–362.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer, New York.
- Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64:53–62.
- Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
- Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.
- Iglesias, F., Zseby, T., and Ferreira, D. (2019). MDCGen: Multidimensional dataset generator for clustering. Journal of Classification, 36:599–618.
- Inan, H., Khosravi, K., and Socher, R. (2017). Tying word vectors and word classifiers: A loss framework for language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France. OpenReview.net.
- LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient BackProp, pages 9–48. Springer, Berlin, Heidelberg.
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California, Berkeley.
- McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11):205.
- Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3):325–342.
- Milligan, G. W. (1996). Clustering validation: Results and implications for applied analyses. In Clustering and Classification, pages 341–375.
- Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.
- Milligan, G. W., Soon, S. C., and Sokol, L. M. (1983). The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(1):40–47.
- Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14. MIT Press.
- OpenAI (2023). OpenAI API. https://platform.openai.com/. Accessed: 2024-09-18.
- OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., et al. (2024). GPT-4 technical report.
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Pei, Y. and Zaïane, O. (2006). A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, Edmonton.
- Qiu, W. and Joe, H. (2006). Generation of random clusters with specified degree of separation. Journal of Classification, 23:315–334.
- Ren, J., Luo, J., Zhao, Y., Krishna, K., Saleh, M., Lakshminarayanan, B., and Liu, P. J. (2023). Out-of-distribution detection and selective generation for conditional language models.
- Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
- Salah, A. and Nadif, M. (2019). Directional co-clustering. Advances in Data Analysis and Classification, 13(3):591–620.
- Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169–194.
- Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems, 42(3).
- Schubert, E. and Zimek, A. (2019). ELKI: A large open-source library for data analysis — ELKI release 0.7.5 “Heidelberg”. CoRR, abs/1902.03616.
- Shand, C., Allmendinger, R., Handl, J., Webb, A., and Keane, J. (2019). Evolving controllably difficult datasets for clustering. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’19, pages 463–471. ACM.
- Shi, J. and Malik, J. (1997). Normalized cuts and image segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 731–737.
- Steinley, D. and Brusco, M. J. (2008). Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika, 73(1):125–144.
- Steinley, D. and Brusco, M. J. (2011). Evaluating mixture modeling for clustering: Recommendations and cautions. Psychological Methods, 16(1):63–79.
- Steinley, D. L. and Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22:221–250.
- Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63(2):411–423.
- van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605.
- Van Mechelen, I., Boulesteix, A.-L., Dangl, R., Dean, N., Hennig, C., Leisch, F., Steinley, D., and Warrens, M. J. (2023). A white paper on good research practices in benchmarking: The case of cluster analysis. WIREs Data Mining and Knowledge Discovery, 13(6):e1511.
- Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(95):2837–2854.
- Virtanen, P., Gommers, R., Oliphant, T. E., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272.
Appendix A
We give more detail on how repliclust manages geometric attributes using max-min parameters. Table 4 lists all geometric attributes managed with max-min sampling and names the corresponding parameters in repliclust. For each attribute, the reference value acts as a location constraint, while the max-min ratio determines the spread. Enforcing this constraint ensures that the distribution of geometric parameters within a data set is similar across data sets drawn from the same archetype.
Geometric Attribute | Max-Min Ratio | Reference Value | Constraint
cluster volumes | radius_maxmin | scale | cluster volumes average to the reference volume
group sizes | imbalance_ratio | average group size | group sizes sum to the number of samples
cluster aspect ratios | aspect_maxmin | aspect_ref | geometric mean of aspect ratios equals the reference
cluster axis lengths | aspect ratio of the cluster | dim-th root of the cluster volume | geometric mean of axis lengths equals the reference length
To enforce the attribute-specific constraints, repliclust samples new values of a geometric parameter in pairs. The first value is randomly drawn from a triangular distribution whose mode and endpoints are determined by the reference value and the max-min ratio. The second value is then computed deterministically to maintain the constraint. For reference, see the sample_with_maxmin function in the maxmin.utils module (Version 1.0.0 of repliclust).
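As a rough illustration of this pairing scheme, consider the sketch below. The function name and the geometric-mean constraint are our assumptions for illustration; the package's actual sample_with_maxmin may differ, and some attributes in Table 4 instead use an arithmetic average or a sum as their constraint.

import numpy as np

def sample_maxmin_pair(ref: float, maxmin_ratio: float, rng=None):
    """Draw one value from a triangular distribution on
    [ref / maxmin_ratio, ref * maxmin_ratio] with mode at the reference
    value, then set its partner so the pair's geometric mean equals ref."""
    rng = rng or np.random.default_rng()
    first = rng.triangular(ref / maxmin_ratio, ref, ref * maxmin_ratio)
    second = ref**2 / first  # deterministic partner enforces the constraint
    return first, second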
Appendix B
Table 5 lists the formal attributes of a mixture model in repliclust. A data set archetype provides a way to randomly sample mixture models with similar overall geometric characteristics. Thus, an archetype implicitly defines a probability distribution over the attributes in Table 5.
Attribute | Meaning | Mathematical Definition
cluster centers | the positions of cluster centers in space | vectors $\mu_1, \ldots, \mu_k \in \mathbb{R}^{\mathrm{dim}}$
principal axis orientations | the spatial orientation of each cluster's ellipsoidal shape (different for each cluster) | orthonormal matrices $U_1, \ldots, U_k$
principal axis lengths | the lengths of each cluster's principal axes (axes have different lengths between and within clusters) | positive scalars, one per axis and cluster
cluster distributions | multivariate probability distributions for generating data (different for each cluster) | probability distributions $\mathcal{D}_1, \ldots, \mathcal{D}_k$
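To make these attributes concrete, the sketch below samples one random instance of each geometric attribute in Table 5. This is our illustration only, independent of repliclust's internals; the per-cluster probability distributions would additionally be drawn from a list of named families, as in Appendix F.

import numpy as np

def sample_mixture_attributes(k: int, dim: int, rng=None):
    """Sample illustrative values of the mixture-model attributes in Table 5."""
    rng = rng or np.random.default_rng()
    centers = rng.normal(scale=5.0, size=(k, dim))  # cluster centers
    # Principal axis orientations: random orthonormal matrices via QR.
    orientations = [np.linalg.qr(rng.normal(size=(dim, dim)))[0] for _ in range(k)]
    # Principal axis lengths: positive, varying between and within clusters.
    axis_lengths = rng.lognormal(mean=0.0, sigma=0.3, size=(k, dim))
    return centers, orientations, axis_lengths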
Appendix C
We prove Theorem 1 of Section 3.2. Additionally, we provide an analogous result for the cruder “center-to-center” approximation of cluster overlap.
Theorem 1 (LDA-Based Cluster Overlap)
For two multivariate normal clusters with means $\mu_1, \mu_2$ and covariance matrices $\Sigma_1, \Sigma_2$, the approximate cluster overlap $\hat{\alpha}_{\mathrm{LDA}}$ based on the linear separator with classification axis $a = (\Sigma_1 + \Sigma_2)^{-1}(\mu_2 - \mu_1)$ is
$$\hat{\alpha}_{\mathrm{LDA}} \;=\; 2\,\Phi\!\left( -\,\frac{a^\top (\mu_2 - \mu_1)}{\sqrt{a^\top \Sigma_1 a} + \sqrt{a^\top \Sigma_2 a}} \right), \tag{10}$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. Moreover, if $\Sigma_2 = c\,\Sigma_1$ for some $c > 0$, then $\hat{\alpha}_{\mathrm{LDA}}$ equals the exact cluster overlap $\alpha$.
Proof. Let $a = (\Sigma_1 + \Sigma_2)^{-1}(\mu_2 - \mu_1)$ be the classification axis. Minimax optimality requires that the cluster-specific misclassification probabilities are equal. Since $a$ is the classification axis, these probabilities correspond to the tails of the marginal distributions along $a$. Specifically, let
$$\sigma_1 = \sqrt{a^\top \Sigma_1 a} \tag{11}$$
be the standard deviation of cluster 1's marginal distribution along $a$, where $\Sigma_1$ is the cluster's covariance matrix; $\sigma_2$ is defined analogously. If $a$ is oriented to point from cluster 1 to cluster 2, then the $(1 - \alpha/2)$-quantile of cluster 1's marginal distribution meets the $(\alpha/2)$-quantile of cluster 2's marginal distribution at the decision boundary, where $\alpha$ is the unknown cluster overlap. This intersection implies
$$a^\top \mu_1 + z_{1 - \alpha/2}\,\sigma_1 \;=\; a^\top \mu_2 + z_{\alpha/2}\,\sigma_2, \tag{12}$$
where $z_q$ is the $q$-quantile of the standard normal distribution. Rearranging this equation, and using $z_{\alpha/2} = -z_{1-\alpha/2}$ and $z_q = \Phi^{-1}(q)$, gives (10).
Next, suppose that $\Sigma_2 = c\,\Sigma_1$ for some $c > 0$. In this case, maximum likelihood classification results in a linear decision boundary that coincides with the LDA solution. Hence, the minimax-optimal linear classifier uses the LDA-based classification axis $a = (\Sigma_1 + \Sigma_2)^{-1}(\mu_2 - \mu_1)$.
Theorem 2 (Center-to-Center Cluster Overlap)
For two multivariate normal clusters with means $\mu_1, \mu_2$ and covariance matrices $\Sigma_1, \Sigma_2$, the center-to-center cluster overlap $\hat{\alpha}_{\mathrm{C2C}}$, based on a classification boundary perpendicular to the line connecting the cluster centers, is
$$\hat{\alpha}_{\mathrm{C2C}} \;=\; 2\,\Phi\!\left( -\,\frac{\lVert \Delta \rVert^2}{\sqrt{\Delta^\top \Sigma_1 \Delta} + \sqrt{\Delta^\top \Sigma_2 \Delta}} \right), \tag{13}$$
where $\Delta = \mu_2 - \mu_1$ is the difference between cluster centers and $\Phi$ is the cumulative distribution function of the standard normal distribution.
Moreover, if the covariance matrices $\Sigma_1$ and $\Sigma_2$ are both multiples of the identity matrix, then $\hat{\alpha}_{\mathrm{C2C}}$ equals the exact cluster overlap $\alpha$.
Proof. The proof proceeds along the same lines as the proof of Theorem 1, except that the classification axis is $a = \Delta$. If both covariance matrices are multiples of the identity matrix, $\Delta$ is a scalar multiple of the LDA-based classification axis $(\Sigma_1 + \Sigma_2)^{-1}\Delta$. Hence, the second part of Theorem 1 applies, establishing equality between $\hat{\alpha}_{\mathrm{C2C}}$ and the exact overlap $\alpha$.
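For concreteness, both approximations can be evaluated numerically. The sketch below is our illustration (not part of repliclust's API) of equations (10) and (13) as stated above; note that each formula is invariant to positive rescaling of the classification axis.

import numpy as np
from scipy.stats import norm

def overlap_lda(mu1, mu2, Sigma1, Sigma2):
    """LDA-based overlap approximation of Theorem 1, equation (10)."""
    a = np.linalg.solve(Sigma1 + Sigma2, mu2 - mu1)  # LDA classification axis
    s1 = np.sqrt(a @ Sigma1 @ a)  # std. dev. of cluster 1's marginal along a
    s2 = np.sqrt(a @ Sigma2 @ a)  # std. dev. of cluster 2's marginal along a
    return 2 * norm.cdf(-(a @ (mu2 - mu1)) / (s1 + s2))

def overlap_c2c(mu1, mu2, Sigma1, Sigma2):
    """Center-to-center overlap approximation of Theorem 2, equation (13)."""
    delta = mu2 - mu1
    s1 = np.sqrt(delta @ Sigma1 @ delta)
    s2 = np.sqrt(delta @ Sigma2 @ delta)
    return 2 * norm.cdf(-(delta @ delta) / (s1 + s2))

# Two spherical unit-variance clusters two units apart: both approximations
# coincide, and each is exact by the theorems above.
mu1, mu2, I = np.zeros(2), np.array([2.0, 0.0]), np.eye(2)
print(overlap_lda(mu1, mu2, I, I), overlap_c2c(mu1, mu2, I, I))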
Appendix D
Figure 12 visualizes two-dimensional data sets created from the same archetype but with different radial probability distributions. The results suggest that pegging the 68.15% quantile of the radial distribution at unity leads to satisfactory overlap control for distributions with infinite support. However, for distributions with bounded support (such as the beta distribution), this approach leads to greater separation between the clusters, as shown in the rightmost column of Figure 12.
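As a minimal sketch of this pegging step (our code, not the package's), one can rescale sampled radii so that their empirical 68.15% quantile sits at unity:

import numpy as np

def peg_radial_quantile(radii, q=0.6815):
    """Rescale sampled radii so that their empirical q-quantile equals one."""
    return radii / np.quantile(radii, q)

rng = np.random.default_rng(0)
pegged = peg_radial_quantile(rng.exponential(size=1000))  # infinite support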
Appendix E
The default architecture for the neural network used in the distort function of Section 3.4 is a feed-forward network consisting (in order) of a linear embedding, 16 repeated feed-forward blocks (each consisting of a fully connected linear layer followed by layer normalization and Tanh activation), and a linear projection. Importantly, we tie the weights of the embedding and projection layers, so that the projection weights are the transpose of the embedding weights (Inan et al., (2017)). The default hidden dimensionality is 128, so that the embedding layer maps data from the archetype's dimensionality to 128 dimensions. The internal feed-forward blocks preserve this hidden dimension, and the final projection layer maps it back to the archetype's dimensionality.
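The paper does not tie this architecture to a specific framework. The PyTorch sketch below shows one way to realize it; the class name DistortNet is ours, and omitting a bias in the tied projection is an assumption of this sketch.

import torch
import torch.nn as nn

class DistortNet(nn.Module):
    """Sketch of the default distortion architecture: a linear embedding,
    16 feed-forward blocks (Linear -> LayerNorm -> Tanh), and a final
    projection whose weight matrix is the transpose of the embedding's."""

    def __init__(self, dim: int, hidden_dim: int = 128, n_blocks: int = 16):
        super().__init__()
        self.embed = nn.Linear(dim, hidden_dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.Tanh(),
            )
            for _ in range(n_blocks)
        ])

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        h = self.embed(X)  # map the archetype's dimension to hidden_dim
        for block in self.blocks:
            h = block(h)  # hidden dimension preserved throughout
        # Tied projection back to the input dimension: equivalent to a
        # bias-free Linear(hidden_dim, dim) whose weight is embed.weight.T.
        return h @ self.embed.weight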
Appendix F
Below we list the prompt templates (including few-shot examples) used in Version 1.0.0 of repliclust. Up-to-date versions are available in the code base (see repliclust.org).
1. Prompt template mapping archetype description to high-level geometric parameters:
Your task is to turn a verbal description of a data set archetype from Repliclust into a precise JSON that specifies which parameter settings to use to create the desired data set archetype in Repliclust. These are the allowed parameters:
n_clusters: int >= 1, the number of clusters to generate
dim: int >= 2, the dimensionality of the data
n_samples: int >= 1, the number of data samples to generate
aspect_ref: float >= 1, the eccentricity of a typical cluster (how oblong vs spherical it is)
aspect_maxmin: float >= 1, how much the eccentricity varies across clusters in a data set
radius_maxmin: float >= 1, how much cluster radius (and thereby cluster volume) varies across the clusters
max_overlap: float > 0, the maximum allowed overlap between any pair of clusters (0.1-0.2 is significant overlap, 0.01-0.05 is little overlap, 0.001 is very little overlap, and 0.0001 and lower is well-separated)
min_overlap: float > 0, the minimum amount of overlap each cluster should have with some other cluster, preventing a cluster from being too far away from all other clusters
imbalance_ratio: float >= 1, specifies how imbalanced the number of data points per cluster is
distributions: list[str], determines the probability distributions to use for the clusters; the available distributions are 'normal', 'standard_t', 'exponential', 'beta', 'uniform', 'chisquare', 'gumbel', 'weibull', 'gamma', 'pareto', 'f', and 'lognormal'
IMPORTANT NOTES:
Any words like "separated", "far away", "close together", or "overlapping" refer to the overlap between clusters. Far apart means that max_overlap is 1e-4 or less
Always make min_overlap smaller than max_overlap, usually ten times smaller!
ONLY include the Pareto ('pareto') distribution if the user specifically asks for heavy tails!
EXAMPLES:
Description: five oblong clusters in two dimensions
Archetype JSON: {
"n_clusters": 5,
"dim": 2,
"n_samples": 500,
"aspect_ref": 3,
"aspect_maxmin": 1.5,
}
Description: three spherical clusters with significant overlap in two dimensions
Archetype JSON: {
"n_clusters": 3,
"dim": 2,
"n_samples": 300,
"max_overlap": 0.2,
"min_overlap": 0.1,
"aspect_ref": 1.0,
"aspect_maxmin": 1.0
}
Description: eight spherical clusters of different sizes with significant overlap in two dimensions
Archetype JSON: {
"n_clusters": 8,
"dim": 2,
"n_samples": 800,
"max_overlap": 0.25,
"min_overlap": 0.1,
"aspect_ref": 1.0,
"aspect_maxmin": 1.0,
"radius_maxmin": 2.0
}
Description: ten clusters which are all highly oblong (and equally so) but of very different sizes, with moderate overlap
Archetype JSON: {
"n_clusters": 10,
"n_samples": 1000,
"aspect_ref": 5,
"aspect_maxmin": 1.0,
"max_overlap": 0.10,
"min_overlap": 0.05,
"radius_maxmin": 4.0
}
Description: five clusters with significant class imbalance
Archetype JSON: {
"n_clusters": 5,
"n_samples": 500,
"imbalance_ratio": 5,
"aspect_ref": 1.5,
"aspect_maxmin": 1.5
}
Description: five clusters with perfect class balance
Archetype JSON: {
"n_clusters": 5,
"n_samples": 500,
"imbalance_ratio": 1.0,
"aspect_ref": 1.4,
"aspect_maxmin": 1.6
}
Description: eight clusters of which 70% are exponentially distributed and 30% are normally distributed
Archetype JSON: {
"n_clusters": 8,
"n_samples": 800,
"aspect_ref": 1.7,
"aspect_maxmin": 1.5,
"distributions": ["exponential", "normal"],
"distribution_proportions": [0.7, 0.3],
}
Description: eight clusters with 1000 total samples of which half are exponentially distributed and half are normally distributed
Archetype JSON: {
"n_clusters": 8,
"n_samples": 1000,
"aspect_ref": 1.7,
"aspect_maxmin": 1.5,
"distributions": ["exponential", "normal"],
"distribution_proportions": [0.5, 0.5]
}
Description: two clusters of different sizes in 10 dimensions that are well-separated
Archetype JSON: {
"n_clusters": 2,
"dim": 10,
"n_samples": 200,
"aspect_ref": 2
"aspect_maxmin": 2,
"radius_maxmin": 4.0,
"max_overlap": 0.001,
"min_overlap": 0.0001
}
Description: very oblong clusters that overlap heavily
Archetype JSON: {
"n_clusters": 6,
"n_samples": 600,
"aspect_ref": 7,
"aspect_maxmin": 1.4,
"max_overlap": 0.4,
"min_overlap": 0.3
}
Description: highly separated and very oblong clusters
Archetype JSON: {
"n_clusters": 4,
"n_samples": 400,
"aspect_ref": 6,
"aspect_maxmin": 1.6,
"max_overlap": 1e-4,
"min_overlap": 1e-5
}
Description: ten clusters with very different shapes
Archetype JSON: {
"n_clusters": 10,
"n_samples": 1000,
"aspect_ref": 1.5,
"aspect_maxmin": 3.0,
"radius_maxmin": 3.0
}
Description: twelve well-separated clusters with very different shapes
Archetype JSON: {
"n_clusters": 12,
"n_samples": 1200,
"aspect_ref": 1.5,
"aspect_maxmin": 5.0,
"radius_maxmin": 5.0,
"max_overlap": 1e-4,
"min_overlap": 1e-5
}
Description: twelve highly separated Gaussian clusters with very different shapes
Archetype JSON: {
"n_clusters": 12,
"n_samples": 1200,
"aspect_ref": 1.5,
"aspect_maxmin": 5.0,
"radius_maxmin": 5.0,
"max_overlap": 1e-4,
"min_overlap": 1e-5
"distributions": ["normal"]}}
Description: five heavy-tailed clusters
Archetype JSON: {
"n_clusters": 5,
"n_samples": 500,
"aspect_ref": 1.5,
"distributions": ["standard_t", "lognormal", "pareto"]}}
Description: clusters with holes
Archetype JSON: {"distributions": ["f"]}
Description: clusters from a variety of distributions
Archetype JSON: {"distributions": ["normal", "exponential", "gamma", "weibull", "lognormal"]}
Description: clusters from all different distributions
Archetype JSON: {"distributions": [’normal’, ’standard_t’, ’exponential’, ’beta’, ’uniform’, ’chisquare’, ’gumbel’, ’weibull’, ’gamma’, ’f’, and ’lognormal’]}
Description: clusters from different distributions
Archetype JSON: {"distributions": [’normal’, ’exponential’, ’beta’, ’uniform’, ’chisquare’, ’gumbel’, ’weibull’, ’gamma’, ’f’, and ’lognormal’]}
Description: highly separated clusters from all different distributions but no heavy tails
Archetype JSON: {"max_overlap": 1e-4,
"min_overlap": 1e-5,
"distributions": [’normal’, ’exponential’, ’beta’, ’uniform’, ’chisquare’, ’gumbel’, ’weibull’, ’gamma’, ’f’, and ’lognormal’]}
Description: seven clusters with uniform distribution with light overlap
Archetype JSON: { "max_overlap": 0.025,
"min_overlap": 0.0025,
"distributions": ["uniform"]}
Description: clusters with bounded support
Archetype JSON: {"distributions": ["beta", "uniform"]}
Description: {description}
Archetype JSON:
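To illustrate how this template might be invoked (a hypothetical sketch: the helper function, model choice, and parsing are ours, not repliclust's), one can substitute the user's description for the {description} placeholder and parse the JSON reply:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def archetype_params(template: str, description: str) -> dict:
    """Fill the prompt template with a verbal description and parse the
    JSON parameter dictionary returned by the language model."""
    prompt = template.replace("{description}", description)
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# e.g. archetype_params(TEMPLATE, "three well-separated oblong clusters in 2D")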
2. Prompt template mapping archetype description to a descriptive identifier:
Your task is to turn a description of a data set archetype into an identifier for
the archetype. The identifier should be short yet descriptive, and not contain any whitespace
(but underscores are OK). IMPORTANT: the identifier should be a valid Python variable name.
Specifically, it may NOT start with a number, nor contain any special character except for
underscores.
EXAMPLES:
Description: five oblong clusters in two dimensions
Archetype identifier: five_oblong_2d
Description: three spherical clusters with significant overlap in two dimensions
Archetype identifier: three_spherical_significant_overlap_2d
Description: eight spherical clusters of different sizes with significant overlap in two dimensions
Archetype identifier: eight_spherical_different_sizes_significant_overlap_2d
Description: ten clusters which are all highly oblong (and equally so) but of very different sizes, with moderate overlap
Archetype identifier: ten_highly_oblong_very_different_sizes_moderate_overlap
Description: five clusters with significant class imbalance
Archetype identifier: five_significant_class_imbalance
Description: five clusters with perfect class balance
Archetype identifier: five_perfect_class_balance
Description: eight clusters of which 70% are exponentially distributed and 30% are normally distributed
Archetype identifier: eight_exp_and_norm
Description: eight clusters with 1000 total samples of which half are exponentially distributed and half are normally distributed
Archetype identifier: eight_exp_and_norm_1000_samples
Description: two clusters of different sizes in 10 dimensions that are well-separated
Archetype identifier: two_different_sizes_well_separated_10d
Description: very oblong clusters that overlap heavily
Archetype identifier: very_oblong_heavy_overlap
Description: ten clusters with very different shapes
Archetype identifier: ten_very_different_shapes
Description: {description}
Archetype identifier: