pyckmeans.core package

Submodules

pyckmeans.core.ckmeans module

ckmeans module

class pyckmeans.core.ckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Consensus K-Means.

Parameters

k: int: Number of clusters.
n_rep: int, optional: Number of K-Means to fit, by default 100.
p_samp: float, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. E.g., if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_feat: float, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. E.g., if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
metrics: Iterable[str]: Clustering quality metrics to calculate while training. Available metrics are
* “sil” (Silhouette Index)
* “bic” (Bayesian Information Criterion)
* “db” (Davies-Bouldin Index)
* “ch” (Calinski-Harabasz Index)
kwargs: Dict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.
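The rounding rule for p_samp and p_feat can be illustrated with a short sketch. The helpers `subsample_size` and `draw_indices` are hypothetical names and not part of pyckmeans; the sketch only assumes that the drawn count is the ceiling of proportion times total, as described in the parameter documentation.

```python
import math
import random


def subsample_size(n_total: int, proportion: float) -> int:
    """Number of items to draw: proportion * n_total, rounded up."""
    return math.ceil(proportion * n_total)


def draw_indices(n_total: int, proportion: float, rng: random.Random) -> list:
    """Draw a sorted random subset of indices without replacement."""
    return sorted(rng.sample(range(n_total), subsample_size(n_total, proportion)))


rng = random.Random(0)
sample_idx = draw_indices(10, 0.72, rng)   # 8 of 10 samples
feature_idx = draw_indices(10, 0.72, rng)  # 8 of 10 features
```

Because the count is rounded up, a run always uses at least one sample and one feature for any positive proportion.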

Methods

`fit`(x[, progress_callback])	Fit CKmeans.
`predict`(x[, linkage_type, return_cls, ...])	Predict cluster membership of new data from fitted CKmeans.

AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit CKmeans.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callback: Optional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.ckmeans.CKmeansResult

Predict cluster membership of new data from fitted CKmeans.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_type: str

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_cls: bool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callback: Optional[Callable]

Optional callback function for progress reporting.

Returns

CKmeansResult: Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
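To make the notion of an n * n consensus matrix concrete, here is a minimal pure-Python sketch. It assumes that every single K-Means run assigns a label to each of the n samples and that an entry is the fraction of runs in which two samples share a cluster; `consensus_matrix` is a hypothetical helper, not the pyckmeans implementation.

```python
from typing import List


def consensus_matrix(km_cls: List[List[int]]) -> List[List[float]]:
    """Fraction of runs in which each pair of samples shares a cluster.

    km_cls[r][i] is the cluster label of sample i in run r.
    """
    n_runs = len(km_cls)
    n = len(km_cls[0])
    together = [[0] * n for _ in range(n)]
    for labels in km_cls:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / n_runs for count in row] for row in together]


# Three runs over four samples; label values are arbitrary per run.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
cmat = consensus_matrix(runs)
```

Samples 2 and 3 are grouped together in every run, so their entry is 1.0, while samples 0 and 2 never share a cluster and get 0.0. The consensus clusters are then obtained by hierarchical clustering of such a matrix with the chosen linkage type.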

class pyckmeans.core.ckmeans.CKmeansResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: int, bic: Optional[float] = None, sil: Optional[float] = None, db: Optional[float] = None, ch: Optional[float] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)

Bases: object

Result of CKmeans.predict.

Parameters

consensus_matrix: numpy.ndarray: n * n consensus matrix.
cluster_membership: numpy.ndarray: n-length vector of cluster memberships.
k: int: Number of clusters.
bic: Optional[float]: BIC score of the consensus clustering.
sil: Optional[float]: Silhouette score of the consensus clustering.
db: Optional[float]: Davies-Bouldin score of the consensus clustering.
ch: Optional[float]: Calinski-Harabasz score of the consensus clustering.
names: Optional[Iterable[str]]: Sample names.
km_cls: Optional[numpy.ndarray]: m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Attributes

cmatrix: numpy.ndarray: Consensus matrix.
cl: numpy.ndarray: Cluster membership.
names: Optional[numpy.ndarray]: Sample names.
k: int: Number of clusters.
bic: Optional[float]: Bayesian Information Criterion score of the clustering.
sil: Optional[float]: Silhouette score of the clustering.
db: Optional[float]: Davies-Bouldin score of the clustering.
ch: Optional[float]: Calinski-Harabasz score of the clustering.
km_cls: Optional[numpy.ndarray]: m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Methods

`copy`()	Get a deep copied CKmeansResult.
`from_dict`(ckm_res_dict)	Construct CKmeansResult from dictionary.
`from_dir`(directory)	Construct CKmeansResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.
`from_json`(file, **kwargs)	Construct CKmeansResult from JSON file.
`from_json_str`(json_str, **kwargs)	Construct CKmeansResult from JSON string.
`order`([method, linkage_type])	Get optimal order according to hierarchical clustering.
`plot`([names, order, cmap_cm, cmap_clbar, ...])	Plot pyckmeans result consensus matrix with consensus clusters.
`recalculate_cluster_memberships`(x, linkage_type)	ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.
`reorder`(order[, in_place])	Reorder samples according to provided order.
`save_km_cls`(out_file[, one_hot, row_names, ...])	Save predicted cluster membership for the single K-Means runs to a file.
`sort`([method, linkage_type, in_place])	Sort CKmeansResult using hierarchical clustering.
`to_dict`()	Convert CKmeansResult to dictionary.
`to_dir`(out_dir[, force])	Save CKmeansResult to directory.
`to_json`([file])	Convert CKmeansResult to JSON string or file.

copy() → pyckmeans.core.ckmeans.CKmeansResult

Get a deep copied CKmeansResult.

Returns

CKmeansResult: A deep copy of self.

classmethod from_dict(ckm_res_dict: Dict) → pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from dictionary.

Parameters

ckm_res_dict: Dict: CKmeansResult as dictionary.

Returns

CKmeansResult: CKmeansResult

classmethod from_dir(directory: str) → pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.ckmeans.CKmeansResult.to_dir().

Parameters

directory: str: CKmeansResult directory.

Returns

CKmeansResult: CKmeansResult

Raises

Exception: Raised if there is a problem with directory.

classmethod from_json(file: str, **kwargs: Dict[str, Any]) → pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from JSON file.

Parameters

file: str: JSON file.
kwargs: Dict[str, Any]: Additional keyword arguments passed to json.loads.

Returns

CKmeansResult: CKmeansResult

classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) → pyckmeans.core.ckmeans.CKmeansResult

Construct CKmeansResult from JSON string.

Parameters

json_str: str: JSON string.
kwargs: Dict[str, Any]: Additional keyword arguments passed to json.loads.

Returns

CKmeansResult: CKmeansResult

order(method: str = 'GW', linkage_type: str = 'average') → numpy.ndarray

Get optimal order according to hierarchical clustering.

Parameters

method: str

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.

linkage_type: str

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns

numpy.ndarray: Optimal sample order.

plot(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) → matplotlib.figure.Figure

Plot pyckmeans result consensus matrix with consensus clusters.

Parameters

names: Optional[Iterable[str]]: Sample names to be plotted.
order: Optional[Union[str, numpy.ndarray]]: Sample plotting order. Either a string determining the order method to use (see CKmeansResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
cmap_cm: Union[str, matplotlib.colors.Colormap], optional: Colormap for the consensus matrix, by default ‘Blues’.
cmap_clbar: Union[str, matplotlib.colors.Colormap], optional: Colormap for the cluster bar, by default ‘tab20’.
figsize: Tuple[float, float], optional: Figure size for the matplotlib figure, by default (7, 7).

Returns

matplotlib.figure.Figure: Matplotlib figure.

recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) → pyckmeans.core.ckmeans.CKmeansResult

ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.

Recalculate cluster memberships using hierarchical clustering based on the given linkage type.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

The data that was used to predict the present CKmeansResult. An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_type: str

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_place: bool

If False, a new CKmeansResult object will be returned. If True, the object will be modified in place and self will be returned.

Returns

CKmeansResult: CKmeansResult with recalculated cluster memberships.

reorder(order: numpy.ndarray, in_place: bool = False) → pyckmeans.core.ckmeans.CKmeansResult

Reorder samples according to provided order.

Parameters

order: numpy.ndarray: New sample order.
in_place: bool: If False, a new, sorted CKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns

CKmeansResult: Reordered CKmeansResult
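What reordering means for the stored data can be sketched in plain Python: the same permutation is applied to both the rows and the columns of the consensus matrix and to the membership vector. The function `reorder_result` is a hypothetical illustration that works on nested lists, not the method itself.

```python
from typing import List, Sequence, Tuple


def reorder_result(
    cmatrix: List[List[float]], cl: List[int], order: Sequence[int]
) -> Tuple[List[List[float]], List[int]]:
    """Apply one sample order to matrix rows, matrix columns and labels."""
    new_cmatrix = [[cmatrix[i][j] for j in order] for i in order]
    new_cl = [cl[i] for i in order]
    return new_cmatrix, new_cl


cmatrix = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
]
cl = [0, 1, 0]

# Put the two samples of cluster 0 next to each other.
sorted_cmatrix, sorted_cl = reorder_result(cmatrix, cl, [0, 2, 1])
```

The inputs are left untouched here, which corresponds to in_place=False. Any external data matrix x has to be permuted with the same order before it is used together with a reordered result.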

save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)

Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.

Parameters

out_file: str

Output file path.

one_hot: bool

If False, a tab-delimited text file will be written containing a n*m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.

If True, a file comprising n one-hot encoded m*k cluster membership matrices in tab-delimited text format, separated by an empty line, will be written, where k is the number of clusters.

row_names: bool

If True, row names will be written.

col_names: bool

If True, column names will be written.
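The two layouts controlled by one_hot can be sketched as follows. This is an illustration under assumptions: labels are taken to be 0-based integers, row and column names are omitted, and `km_cls_to_text` is a hypothetical helper that returns the text instead of writing a file.

```python
from typing import List


def km_cls_to_text(km_cls: List[List[int]], k: int, one_hot: bool = False) -> str:
    """Render per-run cluster memberships as tab-delimited text.

    km_cls[r][i] is the label of sample i in K-Means run r.
    """
    if not one_hot:
        # One line per run, one column per sample.
        return "\n".join("\t".join(str(c) for c in run) for run in km_cls) + "\n"
    blocks = []
    for run in km_cls:
        # One line per sample, one 0/1 column per cluster.
        rows = ["\t".join("1" if c == j else "0" for j in range(k)) for c in run]
        blocks.append("\n".join(rows) + "\n")
    # Blocks of consecutive runs are separated by an empty line.
    return "\n".join(blocks)


km_cls = [[0, 1, 1], [1, 0, 1]]  # 2 runs, 3 samples, k = 2
plain = km_cls_to_text(km_cls, k=2)
encoded = km_cls_to_text(km_cls, k=2, one_hot=True)
```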

sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) → pyckmeans.core.ckmeans.CKmeansResult

Sort CKmeansResult using hierarchical clustering.

Parameters

method: str

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.

linkage_type: str

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_place: bool

If False, a new, sorted CKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns

CKmeansResult: Sorted CKmeansResult

to_dict() → Dict

Convert CKmeansResult to dictionary.

Returns

Dict: CKmeansResult as dictionary.

to_dir(out_dir: str, force: bool = False)

Save CKmeansResult to directory. The directory will contain the three files ‘cmatrix.csv’, comprising the consensus matrix, ‘clusters.csv’, comprising the consensus cluster membership, and ‘metrics.csv’, comprising the clustering metrics. If the CKmeansResult contains clustering information considering the single K-Means runs, those will be written to ‘km_clusters.csv’.

Parameters

out_dir: str: Output directory. Will be created if it does not exist.
force: bool, optional: Write into out_dir even if it already exists, by default False.

Raises

Exception: Raised if there is a problem with out_dir.
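The interplay of out_dir and force might look like the following sketch. The helper `prepare_out_dir` is hypothetical, and raising FileExistsError is an assumption; the documentation above only states that an Exception is raised if there is a problem with out_dir.

```python
import os
import tempfile


def prepare_out_dir(out_dir: str, force: bool = False) -> str:
    """Create out_dir, refusing to reuse an existing one unless force is set."""
    if os.path.exists(out_dir) and not force:
        raise FileExistsError(f"{out_dir} already exists; pass force=True to reuse it")
    os.makedirs(out_dir, exist_ok=True)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ckm_result")
    prepare_out_dir(target)                 # directory is created
    created = os.path.isdir(target)
    try:
        prepare_out_dir(target)             # exists already, force is False
    except FileExistsError:
        refused = True
    else:
        refused = False
    prepare_out_dir(target, force=True)     # accepted because force is True
```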

to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) → Optional[str]

Convert CKmeansResult to JSON string or file.

Parameters

file: Optional[str], optional: File path to write the CKmeansResult to, or None. If None, the JSON string will be returned.
kwargs: Dict[str, Any]: Additional keyword arguments passed to json.dump or json.dumps.

Returns

Optional[str]: None or JSON string.
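The string-or-file behaviour can be sketched with the standard json module. The function below is a hypothetical stand-in that serializes a plain dictionary rather than a CKmeansResult; it only mirrors the documented contract that file=None returns the JSON string and that kwargs are forwarded to json.dumps or json.dump.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def to_json(result: Dict[str, Any], file: Optional[str] = None, **kwargs: Any) -> Optional[str]:
    """Return a JSON string if file is None, else write to file and return None."""
    if file is None:
        return json.dumps(result, **kwargs)
    with open(file, "w") as handle:
        json.dump(result, handle, **kwargs)
    return None


result = {"k": 3, "cl": [0, 1, 2, 2]}
as_text = to_json(result)                        # JSON string
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")
    returned = to_json(result, path, indent=2)   # writes the file, returns None
    with open(path) as handle:
        from_file = json.load(handle)
```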

exception pyckmeans.core.ckmeans.InvalidClusteringMetric

Bases: Exception

Error signalling that an invalid clustering metric was provided.

pyckmeans.core.ckmeans.bic_kmeans(x: numpy.ndarray, cl: numpy.ndarray, centers: Optional[numpy.ndarray] = None) → float

Calculate the Bayesian Information Criterion (BIC) for a KMeans result. The formula uses the BIC calculation for the Gaussian special case.

Parameters

x: numpy.ndarray: n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).
cl: Iterable[int]: Iterable of length n, containing cluster membership coded as integers.
centers: Optional[numpy.ndarray]: k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors). If None, centers will be calculated from cl and x.

Returns

float: BIC
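For orientation, one common form of the BIC in the Gaussian special case is n * ln(RSS / n) + p * ln(n), where RSS is the residual (within-cluster) sum of squares, n the number of samples, and p the number of free parameters. The sketch below implements that textbook form; whether bic_kmeans uses exactly this expression, and how it counts p, is not stated in this reference, so `bic_gaussian` and its n_params argument are assumptions.

```python
import math


def bic_gaussian(rss: float, n: int, n_params: int) -> float:
    """Textbook Gaussian-case BIC: n * ln(rss / n) + n_params * ln(n)."""
    return n * math.log(rss / n) + n_params * math.log(n)


# Same data size and parameter count, different fit quality.
bic_loose = bic_gaussian(rss=10.0, n=10, n_params=2)
bic_tight = bic_gaussian(rss=5.0, n=10, n_params=2)
```

With this form, a smaller residual sum of squares lowers the score, while each additional parameter adds a penalty of ln(n).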

pyckmeans.core.ckmeans.wss(x: numpy.ndarray, centers: numpy.ndarray, cl: Iterable[int]) → float

Calculate within cluster sum of squares.

Parameters

x: numpy.ndarray: n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).
centers: numpy.ndarray: k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors).
cl: Iterable[int]: Iterable of length n, containing cluster membership coded as integers.

Returns

float: Within cluster sum of squares.
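The within-cluster sum of squares is the sum of squared Euclidean distances between each sample and the center of its cluster. A minimal pure-Python sketch, using nested lists instead of numpy arrays and the hypothetical helper names `cluster_centers` and `within_ss`:

```python
from typing import List, Sequence


def cluster_centers(x: List[List[float]], cl: Sequence[int], k: int) -> List[List[float]]:
    """Mean of the samples in each cluster (assumes no cluster is empty)."""
    m = len(x[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for row, c in zip(x, cl):
        counts[c] += 1
        for j, value in enumerate(row):
            sums[c][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(k)]


def within_ss(x: List[List[float]], centers: List[List[float]], cl: Sequence[int]) -> float:
    """Sum of squared distances of each sample to its cluster center."""
    total = 0.0
    for row, c in zip(x, cl):
        total += sum((value - mu) ** 2 for value, mu in zip(row, centers[c]))
    return total


x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
cl = [0, 0, 1, 1]
centers = cluster_centers(x, cl, k=2)   # [[1.0, 0.0], [10.0, 11.0]]
total_wss = within_ss(x, centers, cl)   # each sample is at distance 1 from its center
```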

pyckmeans.core.multickmeans module

multickmeans module

class pyckmeans.core.multickmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.

Parameters

k: Iterable[int]: List of cluster counts for CKmeans.
n_rep: int, optional: Number of K-Means to fit for each single CKmeans, by default 100.
p_samp: float, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. E.g., if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_feat: float, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. E.g., if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
metrics: Iterable[str]: Clustering quality metrics to calculate while training. Available metrics are
* “sil” (Silhouette Index)
* “bic” (Bayesian Information Criterion)
* “db” (Davies-Bouldin Index)
* “ch” (Calinski-Harabasz Index)
kwargs: Dict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.

Methods

`fit`(x[, progress_callback])	Fit MultiCKmeans.
`predict`(x[, linkage_type, return_cls, ...])	Predict cluster membership of new data from all fitted CKmeans.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit MultiCKmeans.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callback: Optional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.multickmeans.MultiCKmeansResult

Predict cluster membership of new data from all fitted CKmeans.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_type: str

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_cls: bool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callback: Optional[Callable]

Optional callback function for progress reporting.

Returns

MultiCKmeansResult: Object comprising one CKmeansResult per fitted CKmeans, each with an n * n consensus matrix and an n-length vector of predicted cluster memberships.

class pyckmeans.core.multickmeans.MultiCKmeansResult(ckmeans_results: List[pyckmeans.core.ckmeans.CKmeansResult], names: Optional[Iterable[str]] = None)

Bases: object

Result of MultiCKMeans.predict.

Parameters

ckmeans_results: List[CKmeansResult]: List of CKmeansResults.
names: Optional[Iterable(str)]: Sample names.

Methods

`order`(by[, method, linkage_type])	Get optimal sample order according to hierarchical clustering of the CKmeansResult at index "by".
`plot_metrics`([figsize])	Plot MultiCKmeansResult metrics.
`reorder`(order[, in_place])	Reorder samples in all CKmeansResults according to provided order.
`sort`(by[, method, linkage_type, in_place])	Sort samples according to hierarchical clustering of the CKmeansResult at index "by".

order(by: int, method: str = 'GW', linkage_type: str = 'average') → numpy.ndarray

Get optimal sample order according to hierarchical clustering of the CKmeansResult at index “by”.

Parameters

by: int

Index of the CKmeansResult to order by.

method: str

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.

linkage_type: str

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns

numpy.ndarray: Optimal sample order.

plot_metrics(figsize: Tuple[float, float] = (7, 7)) → matplotlib.figure.Figure

Plot MultiCKmeansResult metrics.

Parameters

figsize: Tuple[float, float], optional: Figure size for the matplotlib figure, by default (7, 7).

Returns

matplotlib.figure.Figure: Matplotlib Figure of the metrics plot.

reorder(order: numpy.ndarray, in_place: bool = False) → pyckmeans.core.multickmeans.MultiCKmeansResult

Reorder samples in all CKmeansResults according to provided order.

Parameters

order: numpy.ndarray: New sample order.
in_place: bool: If False, a new, sorted MultiCKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns

MultiCKmeansResult: Reordered MultiCKmeansResult

sort(by: int, method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) → pyckmeans.core.multickmeans.MultiCKmeansResult

Sort samples according to hierarchical clustering of the CKmeansResult at index “by”.

Parameters

by: int

Index of the CKmeansResult to sort by.

method: str

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.

linkage_type: str

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_place: bool

If False, a new, sorted MultiCKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns

MultiCKmeansResult: Sorted MultiCKmeansResult

pyckmeans.core.utils module

core utilities

class pyckmeans.core.utils.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

Special json encoder for numpy types

Methods

`default`(obj)	Implement this method in a subclass such that it returns a serializable object for `o`, or calls the base implementation (to raise a `TypeError`).
`encode`(o)	Return a JSON string representation of a Python data structure.
`iterencode`(o[, _one_shot])	Encode the given object and yield each string representation as available.

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
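The same hook can be used to serialize array-like objects. The following is a minimal sketch of that idea and not the actual NumpyEncoder source: `ToListEncoder` is a hypothetical class that converts any object exposing a tolist() method (numpy arrays and numpy scalars do, as does the standard library's array.array used here) into plain Python values.

```python
import json
from array import array


class ToListEncoder(json.JSONEncoder):
    """Encode objects with a tolist() method as plain Python values."""

    def default(self, o):
        if hasattr(o, "tolist"):
            return o.tolist()
        # Let the base class default method raise the TypeError
        return super().default(o)


# array.array stands in for a numpy array to keep the sketch dependency-free.
encoded = json.dumps({"cl": array("i", [0, 1, 1])}, cls=ToListEncoder)
decoded = json.loads(encoded)
```

Passing the encoder via the cls argument of json.dumps or json.dump is the standard way to plug in such a subclass.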

pyckmeans.core.wecr module

Weighted Ensemble Consensus of Random K-Means (WECR K-Means)

exception pyckmeans.core.wecr.InvalidConstraintsError: Bases: Exception

exception pyckmeans.core.wecr.InvalidKError: Bases: Exception

class pyckmeans.core.wecr.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])

Bases: object

WECR K-Means

A class representing a Weighted Ensemble Consensus of Random K-Means [1].

Parameters

k: Union[int, Iterable[int]]: Number of clusters to draw from for each K-Means run.
n_rep: int, optional: Number of K-Means to fit, by default 100.
p_samp: float, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. E.g., if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_feat: float, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. E.g., if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
kwargs: Dict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.

References

1: Lai, Y., S. He, Z. Lin, F. Yang, Q. Zhou, and X. Zhou. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.

Methods

`fit`(x[, progress_callback])	Fit the WECR K-Means.
`predict`(x[, must_link, must_not_link, ...])	Predict from WECR.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit the WECR K-Means.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callback: Optional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.wecr.WECRResult

Predict from WECR.

Parameters

x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

must_link: Optional[Iterable], optional

Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([['A', 'B'], ['A', 'D']]). Can be None for no constraints.

must_not_link: Optional[Iterable], optional

Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([['A', 'B'], ['A', 'D']]). Can be None for no constraints.

gamma: float, optional

Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.

scale_consensus_matrix: bool

If True, the consensus matrix will be scaled in such a way that the diagonal entries are all 1.

linkage_type: str

Linkage type of the hierarchical clustering that is used for final consensus cluster calculation.

One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_cls: bool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callback: Optional[Callable], optional

Optional callback function for progress reporting.

Returns

WECRResult: WECRResult object.
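The two accepted constraint notations (sample indices or sample names) can be brought into one form as sketched below. This is purely illustrative: `resolve_constraints` is a hypothetical helper, and how WECR.predict processes constraints internally is not described in this reference.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def resolve_constraints(
    constraints: Optional[Iterable[Sequence]],
    names: Optional[Sequence[str]] = None,
) -> List[Tuple[int, int]]:
    """Turn pairs of sample indices or sample names into index pairs."""
    if constraints is None:
        return []  # no constraints given
    index_of = {name: i for i, name in enumerate(names or [])}
    resolved = []
    for first, second in constraints:
        pair = tuple(
            index_of[item] if isinstance(item, str) else int(item)
            for item in (first, second)
        )
        resolved.append(pair)
    return resolved


names = ["A", "B", "C", "D"]
by_name = resolve_constraints([["A", "B"], ["A", "D"]], names)
by_index = resolve_constraints([[1, 2], [3, 4]])
no_constraints = resolve_constraints(None)
```

Each resolved pair states that the two samples must (must_link) or must not (must_not_link) end up in the same cluster; gamma then controls how strongly such pairs influence the final result.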

class pyckmeans.core.wecr.WECRResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: numpy.ndarray, bic: Optional[numpy.ndarray] = None, sil: Optional[numpy.ndarray] = None, db: Optional[numpy.ndarray] = None, ch: Optional[numpy.ndarray] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)

Bases: object

Result of WECR.predict.

Parameters

consensus_matrix: numpy.ndarray: n * n weighted consensus (co-association) matrix, where n is the number of samples (observations, data points).
cluster_membership: numpy.ndarray: m * n matrix of cluster memberships, where m is the number of different k values and n is the number of samples (observations, data points).
k: Iterable[int]: Vector of cluster numbers.
bic: Optional[numpy.ndarray]: m-length vector of BIC scores of the consensus clustering for each k.
sil: Optional[numpy.ndarray]: m-length vector of Silhouette scores of the consensus clustering for each k.
db: Optional[numpy.ndarray]: m-length vector of Davies-Bouldin scores of the consensus clustering for each k.
ch: Optional[numpy.ndarray]: m-length vector of Calinski-Harabasz scores of the consensus clustering for each k.
names: Optional[Iterable[str]]: Sample names.
km_cls: Optional[numpy.ndarray]: m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Attributes

cmatrix: numpy.ndarray: Consensus matrix.
cl: numpy.ndarray: Cluster memberships for each k.
names: Optional[numpy.ndarray]: Sample names.
k: numpy.ndarray: Number of clusters.
bic: Optional[numpy.ndarray]: Bayesian Information Criterion score of the clustering.
sil: Optional[numpy.ndarray]: Silhouette score of the clustering.
db: Optional[numpy.ndarray]: Davies-Bouldin score of the clustering.
ch: Optional[numpy.ndarray]: Calinski-Harabasz score of the clustering.
km_cls: Optional[numpy.ndarray]: m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.

Methods

`copy`()	Get a deep copied WECRResult.
`from_dict`(wecr_res_dict)	Construct WECRResult from dictionary.
`from_dir`(directory)	Construct WECRResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.
`from_json`(file, **kwargs)	Construct WECRResult from JSON file.
`from_json_str`(json_str, **kwargs)	Construct WECRResult from JSON string.
`get_cl`(k[, with_names])	Return cluster memberships from hierarchical clustering at a specified k.
`get_cl_affinity_propagation`([with_names])	Get cluster membership according to Affinity Propagation clustering.
`order`([method, linkage_type])	Get optimal sample order according to hierarchical clustering.
`plot`(k[, names, order, cmap_cm, cmap_clbar, ...])	Plot wecr result consensus matrix with consensus clusters.
`plot_affinity_propagation`([names, order, ...])	plot
`plot_metrics`([figsize])	Plot WECRResult metrics.
`recalculate_cluster_memberships`(x, linkage_type)	ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.
`reorder`(order[, in_place])	Reorder samples according to provided order.
`save_km_cls`(out_file[, one_hot, row_names, ...])	Save predicted cluster membership for the single K-Means runs to a file.
`sort`([method, linkage_type, in_place])	Sort WECRResult using hierarchical clustering.
`to_dict`()	Convert WECRResult to dictionary.
`to_dir`(out_dir[, force])	Save WECRResult to directory.
`to_json`([file])	Convert WECRResult to JSON string or file.

copy() → pyckmeans.core.wecr.WECRResult

Get a deep copied WECRResult.

Returns

WECRResult: A deep copy of self.

classmethod from_dict(wecr_res_dict: Dict) → pyckmeans.core.wecr.WECRResult

Construct WECRResult from dictionary.

Parameters

wecr_res_dict: Dict: WECRResult as dictionary.

Returns

WECRResult: WECRResult

classmethod from_dir(directory: str) → pyckmeans.core.wecr.WECRResult

Construct WECRResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.wecr.WECRResult.to_dir().

Parameters

directory: str: WECRResult directory.

Returns

WECRResult: WECRResult

Raises

Exception: Raised if there is a problem with directory.

classmethod from_json(file: str, **kwargs: Dict[str, Any]) → pyckmeans.core.wecr.WECRResult

Construct WECRResult from JSON file.

Parameters

file: str: JSON file.
kwargs: Dict[str, Any]: Additional keyword arguments passed to json.loads.

Returns

WECRResult: WECRResult

classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) → pyckmeans.core.wecr.WECRResult

Construct WECRResult from JSON string.

Parameters

json_str: str: JSON string.
kwargs: Dict[str, Any]: Additional keyword arguments passed to json.loads.

Returns

WECRResult: WECRResult

get_cl(k: int, with_names: bool = False) → Union[numpy.ndarray, pandas.core.series.Series]

Return cluster memberships from hierarchical clustering at a specified k.

Parameters

kint: Number of clusters to return the cluster memberships for.
with_namesbool, optional: Return cluster memberships including sample names. If True, a pandas.Series will be returned.

Returns

Union[numpy.ndarray, pandas.Series]: Cluster memberships.

Raises

wecr.InvalidKError: Raised if an invalid k argument is provided.
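
As an illustration of what "cluster memberships from hierarchical clustering at a specified k" means, the following minimal sketch cuts a dendrogram built from a toy consensus matrix into k clusters. It is not the pyckmeans implementation: the helper name cut_consensus, the toy matrix, and the use of 1 - consensus as the distance are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy consensus matrix (assumed data): samples 0-2 co-cluster often,
# as do samples 3-4.
cmatrix = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
])


def cut_consensus(cmatrix, k, linkage_type="average"):
    """Hypothetical helper: cluster a consensus matrix and cut at k clusters."""
    # Assumption: high consensus means low distance.
    dist = 1.0 - cmatrix
    np.fill_diagonal(dist, 0.0)
    # linkage expects a condensed (upper-triangle) distance vector.
    condensed = squareform(dist, checks=False)
    z = linkage(condensed, method=linkage_type)
    # 'maxclust' requests at most k flat clusters; shift labels to start at 0.
    return fcluster(z, t=k, criterion="maxclust") - 1


cl = cut_consensus(cmatrix, k=2)
print(cl)
```

With this toy matrix, samples 0-2 end up in one cluster and samples 3-4 in the other; which of the two receives label 0 depends on the dendrogram.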

get_cl_affinity_propagation(with_names: bool = False, **kwargs: Dict[str, Any]) → Union[numpy.ndarray, pandas.core.series.Series]

Get cluster membership according to Affinity Propagation clustering.

Parameters

with_namesbool, optional: Return cluster memberships including sample names. If True, a pandas.Series will be returned.
kwargsDict[str, Any]: Additional keyword arguments passed to sklearn.cluster.AffinityPropagation.

Returns

Union[numpy.ndarray, pandas.Series]: Cluster memberships.

order(method: str = 'GW', linkage_type: str = 'average') → numpy.ndarray

Get optimal sample order according to hierarchical clustering.

Parameters

methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

linkage_typestr

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

Returns

numpy.ndarray: Optimal sample order.

References

1: Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25.
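
The ‘OLO’ option can be pictured with the following minimal sketch, which derives a sample order from a toy consensus matrix using SciPy's optimal leaf ordering. It is an illustration under stated assumptions, not the pyckmeans code: the helper name olo_order, the toy matrix, and the 1 - consensus distance are hypothetical, and the ‘GW’ method is not shown because SciPy does not provide it.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage, optimal_leaf_ordering
from scipy.spatial.distance import squareform

# Toy consensus matrix (assumed data): samples 0 and 2 belong together,
# as do samples 1 and 3, but they are interleaved in the input order.
cmatrix = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.9],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.9, 0.1, 1.0],
])


def olo_order(cmatrix, linkage_type="average"):
    """Hypothetical helper: leaf order after optimal leaf ordering."""
    # Assumption: high consensus means low distance.
    dist = 1.0 - cmatrix
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    z = linkage(condensed, method=linkage_type)
    # Flip subtrees so that adjacent leaves are as similar as possible.
    z_opt = optimal_leaf_ordering(z, condensed)
    return leaves_list(z_opt)


order = olo_order(cmatrix)
print(order)
```

The returned array is a permutation of the sample indices in which the members of each pair sit next to each other.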

plot(k: int, names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) → matplotlib.figure.Figure

Plot wecr result consensus matrix with consensus clusters.

Parameters

k: int: The number of clusters k to use for plotting.
namesOptional[Iterable[str]]: Sample names to be plotted. If None, self.names will be used.
orderOptional[Union[str, numpy.ndarray]]: Sample plotting order. Either a string determining the ordering method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
cmap_cmUnion[str, matplotlib.colors.Colormap], optional: Colormap for the consensus matrix, by default ‘Blues’.
cmap_clbarUnion[str, matplotlib.colors.Colormap], optional: Colormap for the cluster bar, by default ‘tab20’.
figsizeTuple[float, float], optional: Figure size for the matplotlib figure, by default (7, 7).

Returns

matplotlib.figure.Figure: Matplotlib figure.

plot_affinity_propagation(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7), **kwargs: Dict[str, Any]) → matplotlib.figure.Figure

Plot wecr result consensus matrix with consensus clusters calculated using Affinity Propagation.

Parameters

namesOptional[Iterable[str]]: Sample names to be plotted. If None, self.names will be used.
orderOptional[Union[str, numpy.ndarray]]: Sample plotting order. Either a string determining the ordering method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
cmap_cmUnion[str, matplotlib.colors.Colormap], optional: Colormap for the consensus matrix, by default ‘Blues’.
cmap_clbarUnion[str, matplotlib.colors.Colormap], optional: Colormap for the cluster bar, by default ‘tab20’.
figsizeTuple[float, float], optional: Figure size for the matplotlib figure, by default (7, 7).
kwargsDict[str, Any]: Additional keyword arguments passed to sklearn.cluster.AffinityPropagation.

Returns

matplotlib.figure.Figure: Matplotlib figure.

plot_metrics(figsize: Tuple[float, float] = (7, 7)) → matplotlib.figure.Figure

Plot WECRResult metrics.

Parameters

figsizeTuple[float, float], optional: Figure size for the matplotlib figure, by default (7, 7).

Returns

matplotlib.figure.Figure: Matplotlib Figure of the metrics plot.

recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) → pyckmeans.core.wecr.WECRResult

ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.

Recalculate cluster memberships using hierarchical clustering based on the given linkage type.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

The data that was used to predict the present WECRResult. A n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new WECRResult object will be returned. If True, the object will be modified in place and self will be returned.

Returns

WECRResult: WECRResult with recalculated cluster memberships.

reorder(order: numpy.ndarray, in_place: bool = False) → pyckmeans.core.wecr.WECRResult

Reorder samples according to provided order.

Parameters

ordernumpy.ndarray: New sample order.
in_placebool: If False, a new, reordered WECRResult object will be returned. If True, the object will be reordered in place and self will be returned.

Returns

WECRResult: Reordered WECRResult
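
Reordering samples means that rows and columns of the consensus matrix, and any per-sample vectors such as names, are permuted by the same index array. The sketch below shows this with plain NumPy on assumed toy data; it illustrates the idea and does not reproduce the internals of WECRResult.reorder.

```python
import numpy as np

# Toy consensus matrix and sample names (assumed data).
cmatrix = np.array([
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
])
names = np.array(["a", "b", "c"])

# New sample order: keep sample 0 first, then swap samples 1 and 2.
order = np.array([0, 2, 1])

# Apply the same permutation to rows and columns so the matrix stays symmetric.
cmatrix_re = cmatrix[np.ix_(order, order)]
# Per-sample vectors are permuted with the same index array.
names_re = names[order]

print(cmatrix_re)
print(names_re)
```

The same consideration explains the warning on recalculate_cluster_memberships: if the result was reordered, the rows of x have to be permuted with the same order before they line up with the result again.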

save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)

Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.

Parameters

out_filestr

Output file path.

one_hotbool

If False, a tab-delimited text file will be written containing a n*m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.

If True, a file comprising n one-hot encoded m*k cluster membership matrices in tab-delimited text format, separated by an empty line, will be written, where k is the number of clusters.

row_namesbool

If True, row names will be written.

col_namesbool

If True, column names will be written.
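
The two layouts selected by one_hot can be sketched as follows. This is a minimal illustration using only the standard library, not the pyckmeans writer: the function write_km_cls is hypothetical, row and column names are omitted, and every sample is assumed to have a label in every run.

```python
import io


def write_km_cls(km_cls, k, out, one_hot=False):
    """Hypothetical writer. km_cls is a list of n runs, each holding m labels."""
    if not one_hot:
        # One row per K-Means run, one tab-separated label per sample (n*m).
        for run in km_cls:
            out.write("\t".join(str(c) for c in run) + "\n")
        return
    blocks = []
    for run in km_cls:
        # One m*k block per run: row i has a 1 in the column of sample i's cluster.
        rows = ["\t".join("1" if c == j else "0" for j in range(k)) for c in run]
        blocks.append("\n".join(rows) + "\n")
    # Blocks are separated by a single empty line.
    out.write("\n".join(blocks))


# Two runs, three samples, two clusters (assumed toy data).
km_cls = [[0, 0, 1], [1, 0, 1]]

plain = io.StringIO()
write_km_cls(km_cls, k=2, out=plain)

encoded = io.StringIO()
write_km_cls(km_cls, k=2, out=encoded, one_hot=True)

print(plain.getvalue())
print(encoded.getvalue())
```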

sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) → pyckmeans.core.wecr.WECRResult

Sort WECRResult using hierarchical clustering.

Parameters

methodstr

Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.

linkage_typestr

Linkage type for the hierarchical clustering. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

in_placebool

If False, a new, sorted WECRResult object will be returned. If True, the object will be sorted in place and self will be returned.

Returns

WECRResult: Sorted WECRResult

References

1: Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25.

to_dict() → Dict

Convert WECRResult to dictionary.

Returns

Dict: WECRResult as dictionary.

to_dir(out_dir: str, force: bool = False)

Save WECRResult to directory. The directory will contain the three files ‘cmatrix.csv’, comprising the consensus matrix, ‘clusters.csv’, comprising the consensus cluster memberships, and ‘metrics.csv’, comprising the clustering metrics. If the WECRResult contains clustering information concerning the single K-Means runs, it will be written to ‘km_clusters.csv’.

Parameters

out_dirstr: Output directory. Will be created if it does not exist.
forcebool, optional: Write into out_dir even if it already exists, by default False.

Raises

Exception: Raised if there is a problem with out_dir.
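
One plausible reading of the force parameter is sketched below: an existing directory is only reused when force is True, and a missing directory is created. The helper prepare_out_dir and its error messages are hypothetical, and the exact conditions under which to_dir raises are an assumption, since the documentation above only states that an Exception signals a problem with out_dir.

```python
import os
import tempfile


def prepare_out_dir(out_dir, force=False):
    """Hypothetical helper mirroring the documented out_dir/force behaviour."""
    if os.path.exists(out_dir):
        if not os.path.isdir(out_dir):
            raise Exception(f"{out_dir} exists and is not a directory.")
        if not force:
            # Assumption: refusing to write into an existing directory.
            raise Exception(f"{out_dir} already exists. Use force=True to write into it.")
    else:
        # Create the directory, including missing parents.
        os.makedirs(out_dir)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "wecr_result")
    prepare_out_dir(target)              # created because it does not exist
    created = os.path.isdir(target)
    try:
        prepare_out_dir(target)          # exists now and force is False
        raised = False
    except Exception:
        raised = True
    prepare_out_dir(target, force=True)  # accepted because force is True
```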

to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) → Optional[str]

Convert WECRResult to JSON string or file.

Parameters

fileOptional[str], optional: File path to write the WECRResult to or None. If None, the JSON string will be returned.
kwargsDict[str, Any]: Additional keyword arguments passed to json.dump or json.dumps.

Returns

Optional[str]: None or JSON string.
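
The "string or file" behaviour described for to_json follows a common pattern, sketched below with the standard json module. The free function to_json and the dictionary res_dict are stand-ins invented for this example (the real method operates on a WECRResult), so this is an illustration under those assumptions, not the library code.

```python
import json
import os
import tempfile


def to_json(res_dict, file=None, **kwargs):
    """Hypothetical stand-in: return a JSON string, or write to file and return None."""
    if file is None:
        # No file given: serialize to a string (kwargs go to json.dumps).
        return json.dumps(res_dict, **kwargs)
    with open(file, "w") as handle:
        # File given: write to it (kwargs go to json.dump).
        json.dump(res_dict, handle, **kwargs)
    return None


# Assumed toy content, not the actual WECRResult dictionary layout.
res_dict = {"k": [2, 3], "names": ["a", "b", "c"]}

json_str = to_json(res_dict, indent=2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")
    ret = to_json(res_dict, file=path)
    with open(path) as handle:
        from_file = json.load(handle)
```

Both routes round-trip to the same dictionary, which is the property the from_json and from_json_str constructors rely on.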

Module contents

pyckmeans core module