pyckmeans.core package

Submodules

pyckmeans.core.ckmeans module

ckmeans module

class pyckmeans.core.ckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Consensus K-Means.

Parameters
kint

Number of clusters.

n_repint, optional

Number of K-Means to fit, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).

metricsIterable[str]

Clustering quality metrics to calculate while training. Available metrics are

  • “sil” (Silhouette Index)

  • “bic” (Bayesian Information Criterion)

  • “db” (Davies-Bouldin Index)

  • “ch” (Calinski-Harabasz)

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.
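The rounding rule described for p_samp and p_feat can be illustrated with a short sketch. The helper name n_drawn is hypothetical and not part of pyckmeans; the sketch assumes that "rounded up" means a plain ceiling of proportion * count.

```python
import math


def n_drawn(n_total: int, proportion: float) -> int:
    """Number of items drawn when proportion * n_total is rounded up.

    Hypothetical helper for illustration; not part of pyckmeans.
    """
    return math.ceil(proportion * n_total)


# 0.72 * 10 = 7.2, which is rounded up to 8.
print(n_drawn(10, 0.72))  # 8
```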

Methods

fit(x[, progress_callback])

Fit CKmeans.

predict(x[, linkage_type, return_cls, ...])

Predict cluster membership of new data from fitted CKmeans.

AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')
fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.ckmeans.CKmeansResult

Predict cluster membership of new data from fitted CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns
CKmeansResult

Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
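To make the n * n consensus matrix more concrete, here is a minimal sketch. It assumes the usual co-association definition, where each entry is the fraction of single K-Means runs in which two samples share a cluster; how pyckmeans computes the matrix internally is not described here. The function name consensus_matrix_sketch is hypothetical.

```python
from typing import List


def consensus_matrix_sketch(km_cls: List[List[int]]) -> List[List[float]]:
    """Fraction of runs in which each pair of samples shares a cluster.

    km_cls is an m * n nested list: m single K-Means runs, n samples.
    Assumed definition for illustration; not the pyckmeans implementation.
    """
    m = len(km_cls)
    n = len(km_cls[0])
    counts = [[0] * n for _ in range(n)]
    for run in km_cls:
        for i in range(n):
            for j in range(n):
                if run[i] == run[j]:
                    counts[i][j] += 1
    # Convert co-clustering counts to proportions of runs.
    return [[count / m for count in row] for row in counts]


runs = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
cm = consensus_matrix_sketch(runs)
print(cm[0][1])  # samples 0 and 1 co-cluster in 3 of 4 runs: 0.75
```

Note that cluster labels do not need to agree between runs: only whether two samples carry the same label within a run matters.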

class pyckmeans.core.ckmeans.CKmeansResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: int, bic: Optional[float] = None, sil: Optional[float] = None, db: Optional[float] = None, ch: Optional[float] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)

Bases: object

Result of CKmeans.predict.

Parameters
consensus_matrixnumpy.ndarray

n * n consensus matrix.

cluster_membershipnumpy.ndarray

n-length vector of cluster memberships.

kint

number of clusters.

bicOptional[float]

BIC score of the consensus clustering.

silOptional[float]

Silhouette score of the consensus clustering.

dbOptional[float]

Davies-Bouldin score of the consensus clustering.

chOptional[float]

Calinski-Harabasz score of the consensus clustering.

namesOptional[Iterable(str)]

Sample names.

km_clsOptional[numpy.ndarray]

m*n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Attributes
cmatrixnumpy.ndarray

Consensus matrix.

clnumpy.ndarray

Cluster membership.

namesOptional[numpy.ndarray]

Sample names.

kint

Number of clusters.

bicOptional[float]

Bayesian Information Criterion score of the clustering.

silOptional[float]

Silhouette score of the clustering.

dbOptional[float]

Davies-Bouldin score of the clustering.

chOptional[float]

Calinski-Harabasz score of the clustering.

km_clsOptional[numpy.ndarray]

m*n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Methods

copy()

Get a deep copied CKmeansResult.

from_dict(ckm_res_dict)

Construct CKmeansResult from dictionary.

from_dir(directory)

Construct CKmeansResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.

from_json(file, **kwargs)

Construct CKmeansResult from JSON file.

from_json_str(json_str, **kwargs)

Construct CKmeansResult from JSON string.

order([method, linkage_type])

Get optimal order according to hierarchical clustering.

plot([names, order, cmap_cm, cmap_clbar, ...])

Plot pyckmeans result consensus matrix with consensus clusters.

recalculate_cluster_memberships(x, linkage_type)

ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.

reorder(order[, in_place])

Reorder samples according to provided order.

save_km_cls(out_file[, one_hot, row_names, ...])

Save predicted cluster membership for the single K-Means runs to a file.

sort([method, linkage_type, in_place])

Sort CKmeansResult using hierarchical clustering.

to_dict()

Convert CKmeansResult to dictionary.

to_dir(out_dir[, force])

Save CKmeansResult to directory.

to_json([file])

Convert CKmeansResult to JSON string or file.

copy() pyckmeans.core.ckmeans.CKmeansResult

Get a deep copied CKmeansResult.

Returns
CKmeansResult

A deep copy of self.

classmethod from_dict(ckm_res_dict: Dict) pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from dictionary.

Parameters
ckm_res_dictDict

CKmeansResult as dictionary.

Returns
CKmeansResult

CKmeansResult

classmethod from_dir(directory: str) pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.ckmeans.CKmeansResult.to_dir().

Parameters
directorystr

CKmeansResult directory.

Returns
CKmeansResult

CKmeansResult

Raises
Exception

Raised if there is a problem with directory.

classmethod from_json(file: str, **kwargs: Dict[str, Any]) pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from JSON file.

Parameters
filestr

JSON file

kwargsDict[str, Any]

Additional keyword arguments passed to json.loads.

Returns
CKmeansResult

CKmeansResult

classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from JSON string.

Parameters
json_str: str

JSON string.

kwargsDict[str, Any]

Additional keyword arguments passed to json.loads.

Returns
CKmeansResult

CKmeansResult

order(method: str = 'GW', linkage_type: str = 'average') numpy.ndarray

Get optimal order according to hierarchical clustering.

Parameters
methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., H., Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns
numpy.ndarray

Optimal sample order.

plot(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure

Plot pyckmeans result consensus matrix with consensus clusters.

Parameters
namesOptional[Iterable[str]]

Sample names to be plotted.

orderOptional[Union[str, numpy.ndarray]]

Sample plotting order. Either a string determining the ordering method to use (see CKmeansResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.

cmap_cmUnion[str, matplotlib.colors.Colormap], optional

Colormap for the consensus matrix, by default ‘Blues’

cmap_clbarUnion[str, matplotlib.colors.Colormap], optional

Colormap for the cluster bar, by default ‘tab20’

figsizeTuple[float, float], optional

Figure size for the matplotlib figure, by default (7, 7).

Returns
matplotlib.figure.Figure

Matplotlib figure.

recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) pyckmeans.core.ckmeans.CKmeansResult

ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.

Recalculate cluster memberships using hierarchical clustering based on the given linkage type.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

The data that was used to predict the present CKmeansResult. An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new CKmeansResult object will be returned. If True, the object will be modified in place and self will be returned.

Returns
CKmeansResult

CKmeansResult with recalculated cluster memberships.

reorder(order: numpy.ndarray, in_place: bool = False) pyckmeans.core.ckmeans.CKmeansResult

Reorder samples according to provided order.

Parameters
ordernumpy.ndarray

New sample order.

in_placebool

If False, a new, sorted CKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns
CKmeansResult

Reordered CKmeansResult
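Reordering has to permute both the rows and the columns of the consensus matrix, together with the membership vector, so that all three stay aligned. The sketch below illustrates this on plain Python lists. It assumes that order[i] is the original index of the sample that ends up at position i (the convention of NumPy index arrays); the function name reorder_sketch is hypothetical and this is not the pyckmeans implementation.

```python
from typing import List, Sequence, Tuple


def reorder_sketch(
    cmatrix: List[List[float]],
    cl: List[int],
    order: Sequence[int],
) -> Tuple[List[List[float]], List[int]]:
    """Apply the same sample order to the consensus matrix and memberships."""
    # Permute rows and columns with the same index sequence.
    new_cmatrix = [[cmatrix[i][j] for j in order] for i in order]
    new_cl = [cl[i] for i in order]
    return new_cmatrix, new_cl


cmatrix = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
cl = [0, 1, 0]
new_cmatrix, new_cl = reorder_sketch(cmatrix, cl, [0, 2, 1])
print(new_cl)  # [0, 0, 1]
```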

save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)

Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.

Parameters
out_filestr

Output file path.

one_hotbool

If False, a tab-delimited text file will be written containing a n*m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.

If True, a file comprising n one-hot encoded m*k cluster membership matrices in tab-delimited text format, separated by an empty line, will be written, where k is the number of clusters.

row_namesbool

If True, row names will be written.

col_namesbool

If True, column names will be written.
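The one-hot layout described for one_hot=True can be sketched as follows. This is an illustration only: it assumes cluster labels are the integers 0 to k-1 and that no row or column names are written; the exact file layout produced by save_km_cls may differ. Both function names are hypothetical.

```python
from typing import List


def one_hot(cl: List[int], k: int) -> List[List[int]]:
    """Encode an m-length membership vector as an m * k 0/1 matrix."""
    return [[1 if label == c else 0 for c in range(k)] for label in cl]


def format_one_hot_runs(km_cls: List[List[int]], k: int) -> str:
    """Tab-delimited one-hot matrices, one per run, separated by an empty line."""
    blocks = []
    for run in km_cls:
        rows = ["\t".join(str(v) for v in row) for row in one_hot(run, k)]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"


print(one_hot([0, 2, 1], 3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```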

sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) pyckmeans.core.ckmeans.CKmeansResult

Sort CKmeansResult using hierarchical clustering.

Parameters
methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., H., Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new, sorted CKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns
CKmeansResult

Sorted CKmeansResult

to_dict() Dict

Convert CKmeansResult to dictionary.

Returns
Dict

CKmeansResult as dictionary.

to_dir(out_dir: str, force: bool = False)

Save CKmeansResult to directory. The directory will contain the three files ‘cmatrix.csv’, comprising the consensus matrix, ‘clusters.csv’, comprising the consensus cluster membership, and ‘metrics.csv’, comprising the clustering metrics. If the CKmeansResult contains clustering information considering the single K-Means runs, those will be written to ‘km_clusters.csv’.

Parameters
out_dirstr

Output directory. Will be created if it does not exist.

forcebool, optional

Write into out_dir even if it does already exist, by default False.

Raises
Exception

Raised if there is a problem with out_dir.

to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) Optional[str]

Convert CKmeansResult to JSON string or file.

Parameters
fileOptional[str], optional

File path to write the CKmeansResult to or None. If None, the JSON string will be returned.

kwargsDict[str, Any]

Additional keyword arguments passed to json.dump or json.dumps.

Returns
Optional[str]

None or JSON string.

exception pyckmeans.core.ckmeans.InvalidClusteringMetric

Bases: Exception

Error signalling that an invalid clustering metric was provided.

pyckmeans.core.ckmeans.bic_kmeans(x: numpy.ndarray, cl: numpy.ndarray, centers: Optional[numpy.ndarray] = None) float

Calculate the Bayesian Information Criterion (BIC) for a KMeans result. The formula is using the BIC calculation for the Gaussian special case.

Parameters
xnumpy.ndarray

n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).

clIterable[int]

Iterable of length n, containing cluster membership coded as integer.

centersOptional[numpy.ndarray]

k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors). If None, centers will be calculated from cl and x.

Returns
float

BIC
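For orientation, the general Gaussian special case of the BIC is n * ln(RSS / n) + k * ln(n), where RSS is the residual sum of squares, n the number of observations, and k the number of parameters. The sketch below evaluates that expression. Treating the within cluster sum of squares as RSS and the number of clusters as k is an assumption made for illustration; the exact terms used by bic_kmeans are not spelled out here, and the function name bic_gaussian_sketch is hypothetical.

```python
import math


def bic_gaussian_sketch(rss: float, n: int, k: int) -> float:
    """Gaussian special case of the BIC: n * ln(rss / n) + k * ln(n).

    Assumption for illustration: rss is the within cluster sum of squares
    and k is the number of clusters. Not the pyckmeans implementation.
    """
    return n * math.log(rss / n) + k * math.log(n)


# With this form, a tighter clustering (smaller rss) gives a lower BIC
# for the same n and k.
loose = bic_gaussian_sketch(rss=10.0, n=10, k=2)
tight = bic_gaussian_sketch(rss=5.0, n=10, k=2)
print(tight < loose)  # True
```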

pyckmeans.core.ckmeans.wss(x: numpy.ndarray, centers: numpy.ndarray, cl: Iterable[int]) float

Calculate within cluster sum of squares.

Parameters
xnumpy.ndarray

n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).

centersnumpy.ndarray

k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors).

clIterable[int]

Iterable of length n, containing cluster membership coded as integer.

Returns
float

Within cluster sum of squares.
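The within cluster sum of squares is the sum, over all samples, of the squared Euclidean distance between a sample and the center of its assigned cluster. A minimal pure-Python sketch using nested lists instead of numpy arrays follows; the function name wss_sketch is hypothetical and the code is not taken from pyckmeans.

```python
from typing import List, Sequence


def wss_sketch(
    x: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
    cl: Sequence[int],
) -> float:
    """Sum of squared distances of each sample to its cluster center."""
    total = 0.0
    for row, label in zip(x, cl):
        center = centers[label]
        total += sum((a - b) ** 2 for a, b in zip(row, center))
    return total


x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
centers = [[1.0, 0.0], [10.0, 10.0]]
cl = [0, 0, 1]
print(wss_sketch(x, centers, cl))  # 1 + 1 + 0 = 2.0
```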

pyckmeans.core.multickmeans module

multickmeans module

class pyckmeans.core.multickmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.

Parameters
kIterable[int]

List of cluster counts for CKmeans.

n_repint, optional

Number of K-Means to fit for each single CKmeans, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).

metricsIterable[str]

Clustering quality metrics to calculate while training. Available metrics are

  • “sil” (Silhouette Index)

  • “bic” (Bayesian Information Criterion)

  • “db” (Davies-Bouldin Index)

  • “ch” (Calinski-Harabasz)

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.

Methods

fit(x[, progress_callback])

Fit MultiCKMeans.

predict(x[, linkage_type, return_cls, ...])

Predict cluster membership of new data from all fitted CKmeans.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit MultiCKMeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.multickmeans.MultiCKmeansResult

Predict cluster membership of new data from all fitted CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns
MultiCKmeansResult

Object comprising one CKmeansResult for each fitted CKmeans. Each CKmeansResult holds an n * n consensus matrix and an n-length vector of predicted cluster memberships.

class pyckmeans.core.multickmeans.MultiCKmeansResult(ckmeans_results: List[pyckmeans.core.ckmeans.CKmeansResult], names: Optional[Iterable[str]] = None)

Bases: object

Result of MultiCKMeans.predict.

Parameters
ckmeans_results: List[CKmeansResult]

List of CKmeansResults.

names: Optional[Iterable(str)]

Sample names.

Methods

order(by[, method, linkage_type])

Get optimal sample order according to hierarchical clustering of the CKmeansResult at index "by".

plot_metrics([figsize])

Plot MultiCKmeansResult metrics.

reorder(order[, in_place])

Reorder samples in all CKmeansResults according to provided order.

sort(by[, method, linkage_type, in_place])

Sort samples according to hierarchical clustering of the CKmeansResult at index "by".

order(by: int, method: str = 'GW', linkage_type: str = 'average') numpy.ndarray

Get optimal sample order according to hierarchical clustering of the CKmeansResult at index “by”.

Parameters
byint

Index of the CKmeansResult to order by.

methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., H., Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns
numpy.ndarray

Optimal sample order.

plot_metrics(figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure

Plot MultiCKmeansResult metrics.

Parameters
figsizeTuple[float, float], optional

Figure size for the matplotlib figure, by default (7, 7).

Returns
matplotlib.figure.Figure

Matplotlib Figure of the metrics plot.
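Besides inspecting the metrics plot, the metrics can be compared programmatically to pick a number of clusters. The sketch below selects the k with the highest Silhouette score, since higher Silhouette values indicate better separated clusters. It works on plain (k, sil) pairs, which are assumed to have been collected from the individual CKmeansResult objects (each documents k and sil attributes); the function name best_k_by_silhouette is hypothetical and not part of pyckmeans.

```python
from typing import Iterable, Optional, Tuple


def best_k_by_silhouette(results: Iterable[Tuple[int, Optional[float]]]) -> int:
    """Return the k with the highest Silhouette score.

    results holds (k, sil) pairs; pairs whose score is None are skipped.
    """
    scored = [(k, sil) for k, sil in results if sil is not None]
    if not scored:
        raise ValueError("no Silhouette scores available")
    return max(scored, key=lambda pair: pair[1])[0]


pairs = [(2, 0.41), (3, 0.63), (4, 0.55), (5, None)]
print(best_k_by_silhouette(pairs))  # 3
```

A single metric rarely settles the choice of k on its own, so the selected value is best checked against the other metrics and the consensus matrix plot.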

reorder(order: numpy.ndarray, in_place: bool = False) pyckmeans.core.multickmeans.MultiCKmeansResult

Reorder samples in all CKmeansResults according to provided order.

Parameters
ordernumpy.ndarray

New sample order.

in_placebool

If False, a new, sorted MultiCKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns
MultiCKmeansResult

Reordered MultiCKmeansResult

sort(by: int, method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) pyckmeans.core.multickmeans.MultiCKmeansResult

Sort samples according to hierarchical clustering of the CKmeansResult at index “by”.

Parameters
byint

Index of the CKmeansResult to sort by.

methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., H., Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new, sorted MultiCKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns
MultiCKmeansResult

Sorted MultiCKmeansResult

pyckmeans.core.utils module

core utilities

class pyckmeans.core.utils.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

Special json encoder for numpy types

Methods

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

encode(o)

Return a JSON string representation of a Python data structure.

iterencode(o[, _one_shot])

Encode the given object and yield each string representation as available.

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

pyckmeans.core.wecr module

Weighted Ensemble Consensus of Random K-Means (WECR K-Means)

exception pyckmeans.core.wecr.InvalidConstraintsError

Bases: Exception

exception pyckmeans.core.wecr.InvalidKError

Bases: Exception

class pyckmeans.core.wecr.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])

Bases: object

WECR K-Means

A class representing a Weighted Ensemble Consensus of Random K-Means [1].

Parameters
kUnion[int, Iterable[int]]

Number of clusters to draw from for each K-Means run.

n_repint, optional

Number of K-Means to fit, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.

References

1

Lai, Y., S., He, Z., Lin, F., Yang, Q., Zhou, X., Zhou. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.

Methods

fit(x[, progress_callback])

Fit the WECR K-Means.

predict(x[, must_link, must_not_link, ...])

Predict from WECR.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit the WECR K-Means.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.wecr.WECRResult

Predict from WECR.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

must_linkOptional[Iterable], optional

Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.

must_not_linkOptional[Iterable], optional

Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.

gammafloat, optional

Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.

scale_consensus_matrixbool

If True, the consensus matrix will be scaled in such a way that the diagonal entries are all 1.

linkage_typestr

Linkage type of the hierarchical clustering that is used for final consensus cluster calculation.

One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable], optional

Optional callback function for progress reporting.

Returns
WECRResult

WECRResult object.
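The description of gamma suggests that constraints are weighted rather than strictly enforced, so it can be useful to check a predicted membership vector against them. The sketch below does this for index-based constraints in the [[1, 2], [3, 4]] form shown above. The function name violated_constraints is hypothetical and not part of pyckmeans, and constraints given as sample names would first need to be mapped to indices.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def violated_constraints(
    cl: Sequence[int],
    must_link: Optional[Iterable[Sequence[int]]] = None,
    must_not_link: Optional[Iterable[Sequence[int]]] = None,
) -> List[Tuple[str, int, int]]:
    """List the constraints that a membership vector does not satisfy."""
    violated = []
    for a, b in (must_link or []):
        # Must-link pairs should share a cluster.
        if cl[a] != cl[b]:
            violated.append(("must_link", a, b))
    for a, b in (must_not_link or []):
        # Must-not-link pairs should be in different clusters.
        if cl[a] == cl[b]:
            violated.append(("must_not_link", a, b))
    return violated


cl = [0, 0, 1, 1, 0]
problems = violated_constraints(
    cl,
    must_link=[[0, 1], [2, 4]],
    must_not_link=[[0, 2], [2, 3]],
)
print(problems)  # [('must_link', 2, 4), ('must_not_link', 2, 3)]
```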

class pyckmeans.core.wecr.WECRResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: numpy.ndarray, bic: Optional[numpy.ndarray] = None, sil: Optional[numpy.ndarray] = None, db: Optional[numpy.ndarray] = None, ch: Optional[numpy.ndarray] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)

Bases: object

Result of WECR.predict.

Parameters
consensus_matrixnumpy.ndarray

n * n weighted consensus (co-association) matrix, where n is the number of samples (observations, data points)

cluster_membershipnumpy.ndarray

m * n matrix of cluster memberships, where m is the number of different k values and n is the number of samples (observations, data points)

kIterable[int]

Vector of cluster numbers.

bicOptional[numpy.ndarray]

m-length vector of BIC scores of the consensus clustering for each k.

silOptional[numpy.ndarray]

m-length vector of Silhouette scores of the consensus clustering for each k.

dbOptional[numpy.ndarray]

m-length vector of Davies-Bouldin scores of the consensus clustering for each k.

chOptional[numpy.ndarray]

m-length vector of Calinski-Harabasz scores of the consensus clustering for each k.

namesOptional[Iterable(str)]

Sample names.

km_clsOptional[numpy.ndarray]

m*n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Attributes
cmatrixnumpy.ndarray

Consensus matrix.

clnumpy.ndarray

Cluster memberships for each k.

namesOptional[numpy.ndarray]

Sample names.

knumpy.ndarray

Number of clusters.

bicOptional[numpy.ndarray]

Bayesian Information Criterion score of the clustering.

silOptional[numpy.ndarray]

Silhouette score of the clustering.

dbOptional[numpy.ndarray]

Davies-Bouldin score of the clustering.

chOptional[numpy.ndarray]

Calinski-Harabasz score of the clustering.

km_clsOptional[numpy.ndarray]

m*n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Methods

copy()

Get a deep copied WECRResult.

from_dict(wecr_res_dict)

Construct WECRResult from dictionary.

from_dir(directory)

Construct WECRResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.

from_json(file, **kwargs)

Construct WECRResult from JSON file.

from_json_str(json_str, **kwargs)

Construct WECRResult from JSON string.

get_cl(k[, with_names])

Return cluster memberships from hierarchical clustering at a specified k.

get_cl_affinity_propagation([with_names])

Get cluster membership according to Affinity Propagation clustering.

order([method, linkage_type])

Get optimal sample order according to hierarchical clustering.

plot(k[, names, order, cmap_cm, cmap_clbar, ...])

Plot wecr result consensus matrix with consensus clusters.

plot_affinity_propagation([names, order, ...])

plot

plot_metrics([figsize])

Plot WECRResult metrics.

recalculate_cluster_memberships(x, linkage_type)

ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.

reorder(order[, in_place])

Reorder samples according to provided order.

save_km_cls(out_file[, one_hot, row_names, ...])

Save predicted cluster membership for the single K-Means runs to a file.

sort([method, linkage_type, in_place])

Sort WECRResult using hierarchical clustering.

to_dict()

Convert WECRResult to dictionary.

to_dir(out_dir[, force])

Save WECRResult to directory.

to_json([file])

Convert WECRResult to JSON string or file.

copy() pyckmeans.core.wecr.WECRResult

Get a deep copied WECRResult.

Returns
WECRResult

A deep copy of self.

classmethod from_dict(wecr_res_dict: Dict) pyckmeans.core.wecr.WECRResult

Construct WECRResult from dictionary.

Parameters
wecr_res_dictDict

WECRResult as dictionary.

Returns
WECRResult

WECRResult

classmethod from_dir(directory: str) pyckmeans.core.wecr.WECRResult

Construct WECRResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.wecr.WECRResult.to_dir().

Parameters
directorystr

WECRResult directory.

Returns
WECRResult

WECRResult

Raises
Exception

Raised if there is a problem with directory.

classmethod from_json(file: str, **kwargs: Dict[str, Any]) pyckmeans.core.wecr.WECRResult

Construct WECRResult from JSON file.

Parameters
filestr

JSON file

kwargsDict[str, Any]

Additional keyword arguments passed to json.loads.

Returns
WECRResult

WECRResult

classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) pyckmeans.core.wecr.WECRResult

Construct WECRResult from JSON string.

Parameters
json_str: str

JSON string.

kwargsDict[str, Any]

Additional keyword arguments passed to json.loads.

Returns
WECRResult

WECRResult

get_cl(k: int, with_names: bool = False) Union[numpy.ndarray, pandas.core.series.Series]

Return cluster memberships from hierarchical clustering at a specified k.

Parameters
kint

Number of clusters to return the cluster memberships for.

with_namesbool, optional

Return cluster memberships including sample names. If True, a pandas.Series will be returned.

Returns
Union[numpy.ndarray, pandas.Series]

Cluster memberships.

Raises
wecr.InvalidKError

Raised if an invalid k argument is provided.

get_cl_affinity_propagation(with_names: bool = False, **kwargs: Dict[str, Any]) Union[numpy.ndarray, pandas.core.series.Series]

Get cluster membership according to Affinity Propagation clustering.

Parameters
with_namesbool, optional

Return cluster memberships including sample names. If True, a pandas.Series will be returned.

kwargsDict[str, Any]

Additional keywords passed to sklearn.cluster.AffinityPropagation

Returns
Union[numpy.ndarray, pandas.Series]

Cluster memberships.

order(method: str = 'GW', linkage_type: str = 'average') numpy.ndarray

Get optimal sample order according to hierarchical clustering.

Parameters
methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns
numpy.ndarray

Optimal sample order.

References

1

Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25.

plot(k: int, names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure

Plot the WECR result consensus matrix with consensus clusters.

Parameters
k: int

The number of clusters k to use for plotting.

namesOptional[Iterable[str]]

Sample names to be plotted. If None, self.names will be used.

orderOptional[Union[str, numpy.ndarray]]

Sample plotting order. Either a string determining the order method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.

cmap_cmUnion[str, matplotlib.colors.Colormap], optional

Colormap for the consensus matrix, by default ‘Blues’.

cmap_clbarUnion[str, matplotlib.colors.Colormap], optional

Colormap for the cluster bar, by default ‘tab20’.

figsizeTuple[float, float], optional

Figure size for the matplotlib figure, by default (7, 7).

Returns
matplotlib.figure.Figure

Matplotlib figure.

plot_affinity_propagation(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7), **kwargs: Dict[str, Any]) matplotlib.figure.Figure

Plot the WECR result consensus matrix with consensus clusters calculated using Affinity Propagation.

Parameters
namesOptional[Iterable[str]]

Sample names to be plotted. If None, self.names will be used.

orderOptional[Union[str, numpy.ndarray]]

Sample plotting order. Either a string determining the order method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.

cmap_cmUnion[str, matplotlib.colors.Colormap], optional

Colormap for the consensus matrix, by default ‘Blues’.

cmap_clbarUnion[str, matplotlib.colors.Colormap], optional

Colormap for the cluster bar, by default ‘tab20’.

figsizeTuple[float, float], optional

Figure size for the matplotlib figure, by default (7, 7).

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.AffinityPropagation.

Returns
matplotlib.figure.Figure

Matplotlib figure.

plot_metrics(figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure

Plot WECRResult metrics.

Parameters
figsizeTuple[float, float], optional

Figure size for the matplotlib figure, by default (7, 7).

Returns
matplotlib.figure.Figure

Matplotlib Figure of the metrics plot.

recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) pyckmeans.core.wecr.WECRResult

ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.

Recalculate cluster memberships using hierarchical clustering based on the given linkage type.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

The data that was used to predict the present WECRResult. A n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new WECRResult object will be returned. If True, the object will be modified in place and self will be returned.

Returns
WECRResult

WECRResult with recalculated cluster memberships.

reorder(order: numpy.ndarray, in_place: bool = False) pyckmeans.core.wecr.WECRResult

Reorder samples according to provided order.

Parameters
ordernumpy.ndarray

New sample order.

in_placebool

If False, a new, reordered WECRResult object will be returned. If True, the object will be reordered in place and self will be returned.

Returns
WECRResult

Reordered WECRResult
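
As a rough illustration of what reordering means for a square consensus matrix, the sketch below applies one index vector to both rows and columns and to the sample names. It uses plain Python lists and a hypothetical helper, reorder_square; it is not the pyckmeans implementation, and the matrix values are made up.

```python
from typing import List, Sequence


def reorder_square(
    matrix: Sequence[Sequence[float]], order: Sequence[int]
) -> List[List[float]]:
    """Apply the same permutation to the rows and columns of a square matrix."""
    return [[matrix[i][j] for j in order] for i in order]


# Toy 3 x 3 consensus matrix (symmetric, ones on the diagonal).
cmatrix = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
names = ["a", "b", "c"]
order = [0, 2, 1]  # place the two most similar samples next to each other

reordered = reorder_square(cmatrix, order)
reordered_names = [names[i] for i in order]
```

Any data that is later combined with a reordered result, such as the x passed to recalculate_cluster_memberships, would need the same permutation applied to its rows.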

save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)

Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.

Parameters
out_filestr

Output file path.

one_hotbool

If False, a tab-delimited text file will be written containing a n*m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.

If True, a file comprising n one-hot encoded m*k cluster membership matrices in tab-delimited text format, separated by an empty line, will be written, where k is the number of clusters.

row_namesbool

If True, row names will be written.

col_namesbool

If True, column names will be written.
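
The two layouts described for one_hot can be sketched as follows. The helper format_km_cls is hypothetical and only mirrors the description above; it assumes every sample carries a label in every run and ignores row and column names, so the exact bytes written by pyckmeans may differ.

```python
from typing import List, Sequence


def format_km_cls(
    km_cls: Sequence[Sequence[int]], k: int, one_hot: bool = False
) -> str:
    """Render an n x m label matrix (runs x samples) as tab-delimited text."""
    if not one_hot:
        # One line per K-Means run, one column per sample.
        return "\n".join("\t".join(str(c) for c in run) for run in km_cls) + "\n"
    blocks: List[str] = []
    for run in km_cls:
        # One m x k block per run: row = sample, column = cluster.
        rows = ["\t".join("1" if c == j else "0" for j in range(k)) for c in run]
        blocks.append("\n".join(rows) + "\n")
    # Consecutive blocks are separated by an empty line.
    return "\n".join(blocks)


km_cls = [[0, 1, 1], [1, 0, 1]]  # 2 runs, 3 samples, k = 2 (toy labels)
plain = format_km_cls(km_cls, k=2)
encoded = format_km_cls(km_cls, k=2, one_hot=True)
```

In the plain layout each run occupies one line. In the one-hot layout each run becomes a block of m lines with k columns.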

sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) pyckmeans.core.wecr.WECRResult

Sort WECRResult using hierarchical clustering.

Parameters
methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

linkage_typestr

Linkage type for the hierarchical clustering. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new, sorted WECRResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns
WECRResult

Sorted WECRResult

References

1

Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25.

to_dict() Dict

Convert WECRResult to dictionary.

Returns
Dict

WECRResult as dictionary.

to_dir(out_dir: str, force: bool = False)

Save WECRResult to directory. The directory will contain the three files ‘cmatrix.csv’ (the consensus matrix), ‘clusters.csv’ (the consensus cluster memberships), and ‘metrics.csv’ (the clustering metrics). If the WECRResult contains clustering information about the single K-Means runs, it will be written to ‘km_clusters.csv’.

Parameters
out_dirstr

Output directory. Will be created if it does not exist.

forcebool, optional

Write into out_dir even if it already exists, by default False.

Raises
Exception

Raised if there is a problem with out_dir.
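
One plausible reading of the force parameter is sketched below with a hypothetical helper, prepare_out_dir. The exact conditions under which pyckmeans raises are not spelled out here, so the checks are assumptions that merely agree with the parameter descriptions.

```python
import os
import tempfile


def prepare_out_dir(out_dir: str, force: bool = False) -> str:
    """Create out_dir; refuse to reuse an existing one unless force is set."""
    if os.path.exists(out_dir):
        if not os.path.isdir(out_dir):
            raise Exception(f"{out_dir} exists and is not a directory.")
        if not force:
            raise Exception(f"{out_dir} already exists; pass force=True to reuse it.")
    else:
        os.makedirs(out_dir)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "wecr_res")
    prepare_out_dir(target)  # the directory is created on first use
    try:
        prepare_out_dir(target)  # a second call without force is rejected
        second_call_failed = False
    except Exception:
        second_call_failed = True
    reused = prepare_out_dir(target, force=True)  # accepted with force
    reused_ok = os.path.isdir(reused)
```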

to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) Optional[str]

Convert WECRResult to JSON string or file.

Parameters
fileOptional[str], optional

File path to write the WECRResult to or None. If None, the JSON string will be returned.

kwargsDict[str, Any]

Additional keyword arguments passed to json.dump or json.dumps.

Returns
Optional[str]

None or JSON string.
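
The "string or file" behaviour can be pictured with the standard json module. In the sketch below, dump_result is a hypothetical stand-in operating on a plain dictionary with made-up values; it shows how keyword arguments would be forwarded to json.dumps or json.dump, and it does not reflect the actual layout of a serialized WECRResult.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def dump_result(
    result: Dict[str, Any], file: Optional[str] = None, **kwargs: Any
) -> Optional[str]:
    """Return a JSON string when file is None, otherwise write to file."""
    if file is None:
        return json.dumps(result, **kwargs)
    with open(file, "w") as handle:
        json.dump(result, handle, **kwargs)
    return None


toy = {"k": [2, 3], "sil": [0.41, 0.37]}  # made-up values, hypothetical keys
as_string = dump_result(toy, indent=2)  # indent is forwarded to json.dumps

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "res.json")
    returned = dump_result(toy, file=path)  # writes the file, returns None
    with open(path) as handle:
        round_trip = json.load(handle)
```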

Module contents

pyckmeans core module