pyckmeans.core package
Submodules
pyckmeans.core.ckmeans module
ckmeans module
- class pyckmeans.core.ckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])
Bases: object
Consensus K-Means.
- Parameters
- k: int
Number of clusters.
- n_rep: int, optional
Number of K-Means to fit, by default 100.
- p_samp: float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat: float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- metrics: Iterable[str]
Clustering quality metrics to calculate while training. Available metrics are
“sil” (Silhouette Index)
“bic” (Bayesian Information Criterion)
“db” (Davies-Bouldin Index)
“ch” (Calinski-Harabasz Index)
- kwargs: Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
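The rounding rule described for p_samp and p_feat can be illustrated with a small standard-library sketch. The helper names `n_drawn` and `draw_indices` are hypothetical and not part of pyckmeans. The only behaviour taken from the text above is that the product is rounded up; how the library actually draws the subset is an assumption here.

```python
import math
import random


def n_drawn(n_total: int, proportion: float) -> int:
    """Return proportion * n_total, rounded up to the next integer."""
    return math.ceil(proportion * n_total)


def draw_indices(n_total: int, proportion: float, rng: random.Random) -> list:
    """Draw a random subset of indices without replacement (assumed scheme)."""
    return sorted(rng.sample(range(n_total), n_drawn(n_total, proportion)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample_idx = draw_indices(10, 0.72, rng)   # 0.72 * 10 = 7.2, rounded up to 8
feature_idx = draw_indices(10, 0.8, rng)   # 0.8 * 10 = 8.0, stays 8
print(len(sample_idx), len(feature_idx))
```

Each K-Means run would then be fitted on the rows in `sample_idx` and the columns in `feature_idx` only.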
Methods
fit(x[, progress_callback]): Fit CKmeans.
predict(x[, linkage_type, return_cls, ...]): Predict cluster membership of new data from fitted CKmeans.
- AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit CKmeans.
- Parameters
- x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- progress_callback: Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) -> pyckmeans.core.ckmeans.CKmeansResult
Predict cluster membership of new data from fitted CKmeans.
- Parameters
- x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- linkage_type: str
Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls: bool
If True, the cluster memberships of the single K-Means runs will be present in the output.
- progress_callback: Optional[Callable]
Optional callback function for progress reporting.
- Returns
- CKmeansResult
Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
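To make the n * n consensus matrix more concrete, here is a minimal pure-Python sketch of a co-association matrix built from several single-run clusterings. It is an illustration of the general idea, not the pyckmeans implementation: the function name `consensus_matrix` is hypothetical, and the sketch assumes that every run assigns a label to every sample, which the text above does not state.

```python
from typing import List, Sequence


def consensus_matrix(km_cls: Sequence[Sequence[int]]) -> List[List[float]]:
    """Fraction of runs in which each pair of samples shares a cluster.

    km_cls is a runs-by-samples table of integer cluster labels.
    """
    n_runs = len(km_cls)
    n = len(km_cls[0])
    counts = [[0] * n for _ in range(n)]
    for run in km_cls:
        for i in range(n):
            for j in range(n):
                if run[i] == run[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


# Three runs over four samples; label values are arbitrary per run,
# only co-membership matters.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
cmat = consensus_matrix(runs)
for row in cmat:
    print(row)
```

The final cluster memberships are then obtained by hierarchical clustering of this matrix with the chosen linkage_type, as described for the linkage_type parameter.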
- class pyckmeans.core.ckmeans.CKmeansResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: int, bic: Optional[float] = None, sil: Optional[float] = None, db: Optional[float] = None, ch: Optional[float] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)
Bases: object
Result of CKmeans.predict.
- Parameters
- consensus_matrix: numpy.ndarray
n * n consensus matrix.
- cluster_membership: numpy.ndarray
n-length vector of cluster memberships.
- k: int
Number of clusters.
- bic: Optional[float]
BIC score of the consensus clustering.
- sil: Optional[float]
Silhouette score of the consensus clustering.
- db: Optional[float]
Davies-Bouldin score of the consensus clustering.
- ch: Optional[float]
Calinski-Harabasz score of the consensus clustering.
- names: Optional[Iterable[str]]
Sample names.
- km_cls: Optional[numpy.ndarray]
m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.
- Attributes
- cmatrix: numpy.ndarray
Consensus matrix.
- cl: numpy.ndarray
Cluster membership.
- names: Optional[numpy.ndarray]
Sample names.
- k: int
Number of clusters.
- bic: Optional[float]
Bayesian Information Criterion score of the clustering.
- sil: Optional[float]
Silhouette score of the clustering.
- db: Optional[float]
Davies-Bouldin score of the clustering.
- ch: Optional[float]
Calinski-Harabasz score of the clustering.
- km_cls: Optional[numpy.ndarray]
m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.
Methods
copy(): Get a deep copied CKmeansResult.
from_dict(ckm_res_dict): Construct CKmeansResult from dictionary.
from_dir(directory): Construct CKmeansResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.
from_json(file, **kwargs): Construct CKmeansResult from JSON file.
from_json_str(json_str, **kwargs): Construct CKmeansResult from JSON string.
order([method, linkage_type]): Get optimal order according to hierarchical clustering.
plot([names, order, cmap_cm, cmap_clbar, ...]): Plot pyckmeans result consensus matrix with consensus clusters.
recalculate_cluster_memberships(x, linkage_type): ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.
reorder(order[, in_place]): Reorder samples according to provided order.
save_km_cls(out_file[, one_hot, row_names, ...]): Save predicted cluster membership for the single K-Means runs to a file.
sort([method, linkage_type, in_place]): Sort CKmeansResult using hierarchical clustering.
to_dict(): Convert CKmeansResult to dictionary.
to_dir(out_dir[, force]): Save CKmeansResult to directory.
to_json([file]): Convert CKmeansResult to JSON string or file.
- copy() -> pyckmeans.core.ckmeans.CKmeansResult
Get a deep copied CKmeansResult.
- Returns
- CKmeansResult
A deep copy of self.
- classmethod from_dict(ckm_res_dict: Dict) -> pyckmeans.core.ckmeans.CKmeansResult
Construct CKmeansResult from dictionary.
- Parameters
- ckm_res_dict: Dict
CKmeansResult as dictionary.
- Returns
- CKmeansResult
CKmeansResult
- classmethod from_dir(directory: str) -> pyckmeans.core.ckmeans.CKmeansResult
Construct CKmeansResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.ckmeans.CKmeansResult.to_dir().
- Parameters
- directory: str
CKmeansResult directory.
- Returns
- CKmeansResult
CKmeansResult
- Raises
- Exception
Raised if there is a problem with directory.
- classmethod from_json(file: str, **kwargs: Dict[str, Any]) -> pyckmeans.core.ckmeans.CKmeansResult
Construct CKmeansResult from JSON file.
- Parameters
- file: str
JSON file.
- kwargs: Dict[str, Any]
Additional keyword arguments passed to json.loads.
- Returns
- CKmeansResult
CKmeansResult
- classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) -> pyckmeans.core.ckmeans.CKmeansResult
Construct CKmeansResult from JSON string.
- Parameters
- json_str: str
JSON string.
- kwargs: Dict[str, Any]
Additional keyword arguments passed to json.loads.
- Returns
- CKmeansResult
CKmeansResult
- order(method: str = 'GW', linkage_type: str = 'average') -> numpy.ndarray
Get optimal order according to hierarchical clustering.
- Parameters
- method: str
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.
- linkage_type: str
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- Returns
- numpy.ndarray
Optimal sample order.
- plot(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) -> matplotlib.figure.Figure
Plot pyckmeans result consensus matrix with consensus clusters.
- Parameters
- names: Optional[Iterable[str]]
Sample names to be plotted.
- order: Optional[Union[str, numpy.ndarray]]
Sample plotting order. Either a string determining the ordering method to use (see CKmeansResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
- cmap_cm: Union[str, matplotlib.colors.Colormap], optional
Colormap for the consensus matrix, by default ‘Blues’.
- cmap_clbar: Union[str, matplotlib.colors.Colormap], optional
Colormap for the cluster bar, by default ‘tab20’.
- figsize: Tuple[float, float], optional
Figure size for the matplotlib figure, by default (7, 7).
- Returns
- matplotlib.figure.Figure
Matplotlib figure.
- recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) -> pyckmeans.core.ckmeans.CKmeansResult
ATTENTION: This method may only be used if the CKmeansResult was not reordered, or if x was reordered the same way as the CKmeansResult.
Recalculate cluster memberships using hierarchical clustering based on the given linkage type.
- Parameters
- x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
The data that was used to predict the present CKmeansResult. An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- linkage_type: str
Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- in_place: bool
If False, a new CKmeansResult object will be returned. If True, the object will be modified in place and self will be returned.
- Returns
- CKmeansResult
CKmeansResult with recalculated cluster memberships.
- reorder(order: numpy.ndarray, in_place: bool = False) -> pyckmeans.core.ckmeans.CKmeansResult
Reorder samples according to provided order.
- Parameters
- order: numpy.ndarray
New sample order.
- in_place: bool
If False, a new, reordered CKmeansResult object will be returned. If True, the object will be reordered in place and self will be returned.
- Returns
- CKmeansResult
Reordered CKmeansResult
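What reordering means for the stored data can be sketched without the library: the consensus matrix has to be permuted along both rows and columns, and the membership vector along its single axis. The function `reorder_sketch` below is a hypothetical stand-in operating on plain lists; it is an assumption about the effect of reorder, not its actual code.

```python
from typing import List, Sequence, Tuple


def reorder_sketch(
    cmatrix: Sequence[Sequence[float]],
    cl: Sequence[int],
    order: Sequence[int],
) -> Tuple[List[List[float]], List[int]]:
    """Apply the same sample permutation to matrix rows, columns, and labels."""
    new_cmatrix = [[cmatrix[i][j] for j in order] for i in order]
    new_cl = [cl[i] for i in order]
    return new_cmatrix, new_cl


cmatrix = [
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
]
cl = [0, 1, 0]
order = [0, 2, 1]  # put the two samples of cluster 0 next to each other

new_cmatrix, new_cl = reorder_sketch(cmatrix, cl, order)
print(new_cmatrix)
print(new_cl)
```

Because the inputs are not modified, this corresponds to the in_place=False behaviour.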
- save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)
Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.
- Parameters
- out_file: str
Output file path.
- one_hot: bool
If False, a tab-delimited text file will be written containing an n * m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.
If True, a file comprising n one-hot encoded m * k cluster membership matrices in tab-delimited text format, separated by an empty line, will be written, where k is the number of clusters.
- row_names: bool
If True, row names will be written.
- col_names: bool
If True, column names will be written.
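The one-hot layout described for one_hot=True can be pictured with the following sketch. It is only an illustration of the described format: `one_hot` and `format_one_hot_runs` are hypothetical helpers, labels are assumed to be integers in 0..k-1, and row and column names are left out.

```python
from typing import List, Sequence


def one_hot(cl: Sequence[int], k: int) -> List[List[int]]:
    """Encode a length-m label vector as an m * k matrix of 0/1 entries."""
    return [[1 if label == c else 0 for c in range(k)] for label in cl]


def format_one_hot_runs(km_cls: Sequence[Sequence[int]], k: int) -> str:
    """Tab-delimited one-hot block per run, blocks separated by an empty line."""
    blocks = []
    for run in km_cls:
        rows = one_hot(run, k)
        blocks.append("\n".join("\t".join(str(v) for v in row) for row in rows))
    return "\n\n".join(blocks) + "\n"


km_cls = [
    [0, 1, 1],  # run 1: labels for three samples
    [1, 0, 1],  # run 2
]
text = format_one_hot_runs(km_cls, k=2)
print(text)
```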
- sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) -> pyckmeans.core.ckmeans.CKmeansResult
Sort CKmeansResult using hierarchical clustering.
- Parameters
- method: str
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.
- linkage_type: str
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- in_place: bool
If False, a new, sorted CKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.
- Returns
- CKmeansResult
Sorted CKmeansResult
- to_dict() -> Dict
Convert CKmeansResult to dictionary.
- Returns
- Dict
CKmeansResult as dictionary.
- to_dir(out_dir: str, force: bool = False)
Save CKmeansResult to directory. The directory will contain the three files ‘cmatrix.csv’, comprising the consensus matrix, ‘clusters.csv’, comprising the consensus cluster membership, and ‘metrics.csv’, comprising the clustering metrics. If the CKmeansResult contains clustering information considering the single K-Means runs, those will be written to ‘km_clusters.csv’.
- Parameters
- out_dir: str
Output directory. Will be created if it does not exist.
- force: bool, optional
Write into out_dir even if it already exists, by default False.
- Raises
- Exception
Raised if there is a problem with out_dir.
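The interplay of out_dir and force can be sketched with the standard library. This is an assumed reading of the description above (create the directory if missing, refuse an existing one unless force is set), and `prepare_out_dir` is a hypothetical helper. The library documents a generic Exception, so the specific exception types used here are illustrative only.

```python
import os
import tempfile


def prepare_out_dir(out_dir: str, force: bool = False) -> str:
    """Create out_dir, or accept an existing directory only when force is True."""
    if os.path.exists(out_dir):
        if not os.path.isdir(out_dir):
            raise NotADirectoryError(f"{out_dir} exists and is not a directory")
        if not force:
            raise FileExistsError(f"{out_dir} already exists; pass force=True")
    else:
        os.makedirs(out_dir)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ckmeans_result")
    prepare_out_dir(target)                  # first call creates the directory
    created = os.path.isdir(target)
    try:
        prepare_out_dir(target)              # second call without force fails
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True
    prepare_out_dir(target, force=True)      # force=True accepts the existing one
print(created, second_call_failed)
```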
- to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) -> Optional[str]
Convert CKmeansResult to JSON string or file.
- Parameters
- file: Optional[str], optional
File path to write the CKmeansResult to or None. If None, the JSON string will be returned.
- kwargs: Dict[str, Any]
Additional keyword arguments passed to json.dump or json.dumps.
- Returns
- Optional[str]
None or JSON string.
- exception pyckmeans.core.ckmeans.InvalidClusteringMetric
Bases: Exception
Error signalling that an invalid clustering metric was provided.
- pyckmeans.core.ckmeans.bic_kmeans(x: numpy.ndarray, cl: numpy.ndarray, centers: Optional[numpy.ndarray] = None) -> float
Calculate the Bayesian Information Criterion (BIC) for a KMeans result. The formula uses the BIC calculation for the Gaussian special case.
- Parameters
- x: numpy.ndarray
n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).
- cl: Iterable[int]
Iterable of length n, containing cluster membership coded as integers.
- centers: Optional[numpy.ndarray]
k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors). If None, centers will be calculated from cl and x.
- Returns
- float
BIC
- pyckmeans.core.ckmeans.wss(x: numpy.ndarray, centers: numpy.ndarray, cl: Iterable[int]) -> float
Calculate within cluster sum of squares.
- Parameters
- x: numpy.ndarray
n * m matrix, where n is the number of samples (observations) and m is the number of features (predictors).
- centers: numpy.ndarray
k * m matrix of cluster centers (centroids), where k is the number of clusters and m is the number of features (predictors).
- cl: Iterable[int]
Iterable of length n, containing cluster membership coded as integers.
- Returns
- float
Within cluster sum of squares.
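As a worked illustration, the within cluster sum of squares in its usual definition (sum of squared Euclidean distances between each sample and the center of its cluster) can be computed in a few lines of plain Python. The helpers `centers_from_labels` and `wss_sketch` are hypothetical and operate on nested lists; whether the library's wss matches this definition in every detail is an assumption.

```python
from typing import List, Sequence


def centers_from_labels(x: Sequence[Sequence[float]], cl: Sequence[int], k: int) -> List[List[float]]:
    """Mean of the samples assigned to each cluster (labels assumed 0..k-1)."""
    m = len(x[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for row, label in zip(x, cl):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(k)]


def wss_sketch(x: Sequence[Sequence[float]], centers: Sequence[Sequence[float]], cl: Sequence[int]) -> float:
    """Sum of squared Euclidean distances from each sample to its cluster center."""
    total = 0.0
    for row, label in zip(x, cl):
        total += sum((a - b) ** 2 for a, b in zip(row, centers[label]))
    return total


x = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
cl = [0, 0, 1, 1]
centers = centers_from_labels(x, cl, k=2)
print(centers)
print(wss_sketch(x, centers, cl))
```

Every sample in this toy data set lies at squared distance 1 from its center, so the four contributions add up to 4.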
pyckmeans.core.multickmeans module
multickmeans module
- class pyckmeans.core.multickmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])
Bases: object
Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.
- Parameters
- k: Iterable[int]
List of cluster counts for CKmeans.
- n_rep: int, optional
Number of K-Means to fit for each single CKmeans, by default 100.
- p_samp: float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat: float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- metrics: Iterable[str]
Clustering quality metrics to calculate while training. Available metrics are
“sil” (Silhouette Index)
“bic” (Bayesian Information Criterion)
“db” (Davies-Bouldin Index)
“ch” (Calinski-Harabasz Index)
- kwargs: Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
Methods
fit(x[, progress_callback]): Fit MultiCKmeans.
predict(x[, linkage_type, return_cls, ...]): Predict cluster membership of new data from all fitted CKmeans.
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit MultiCKmeans.
- Parameters
- x: Union[numpy.ndarray, PCOAResult]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- progress_callback: Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) -> pyckmeans.core.multickmeans.MultiCKmeansResult
Predict cluster membership of new data from all fitted CKmeans.
- Parameters
- x: Union[numpy.ndarray, PCOAResult]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- linkage_type: str
Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls: bool
If True, the cluster memberships of the single K-Means runs will be present in the output.
- progress_callback: Optional[Callable]
Optional callback function for progress reporting.
- Returns
- MultiCKmeansResult
Object wrapping one CKmeansResult per fitted CKmeans, each comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
- class pyckmeans.core.multickmeans.MultiCKmeansResult(ckmeans_results: List[pyckmeans.core.ckmeans.CKmeansResult], names: Optional[Iterable[str]] = None)
Bases: object
Result of MultiCKMeans.predict.
- Parameters
- ckmeans_results: List[CKmeansResult]
List of CKmeansResults.
- names: Optional[Iterable(str)]
Sample names.
Methods
order(by[, method, linkage_type]): Get optimal sample order according to hierarchical clustering of the CKmeansResult at index "by".
plot_metrics([figsize]): Plot MultiCKmeansResult metrics.
reorder(order[, in_place]): Reorder samples in all CKmeansResults according to provided order.
sort(by[, method, linkage_type, in_place]): Sort samples according to hierarchical clustering of the CKmeansResult at index "by".
- order(by: int, method: str = 'GW', linkage_type: str = 'average') -> numpy.ndarray
Get optimal sample order according to hierarchical clustering of the CKmeansResult at index “by”.
- Parameters
- by: int
Index of the CKmeansResult to order by.
- method: str
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.
- linkage_type: str
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- Returns
- numpy.ndarray
Optimal sample order.
- plot_metrics(figsize: Tuple[float, float] = (7, 7)) -> matplotlib.figure.Figure
Plot MultiCKmeansResult metrics.
- Parameters
- figsize: Tuple[float, float], optional
Figure size for the matplotlib figure, by default (7, 7).
- Returns
- matplotlib.figure.Figure
Matplotlib Figure of the metrics plot.
- reorder(order: numpy.ndarray, in_place: bool = False) -> pyckmeans.core.multickmeans.MultiCKmeansResult
Reorder samples in all CKmeansResults according to provided order.
- Parameters
- order: numpy.ndarray
New sample order.
- in_place: bool
If False, a new, reordered MultiCKmeansResult object will be returned. If True, the object will be reordered in place and self will be returned.
- Returns
- MultiCKmeansResult
Reordered MultiCKmeansResult
- sort(by: int, method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) -> pyckmeans.core.multickmeans.MultiCKmeansResult
Sort samples according to hierarchical clustering of the CKmeansResult at index “by”.
- Parameters
- by: int
Index of the CKmeansResult to sort by.
- method: str
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
Gruvaeus, G., and H. Wainer. 1972. Two Additions to Hierarchical Cluster Analysis. The British Psychological Society 25.
- linkage_type: str
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- in_place: bool
If False, a new, sorted MultiCKmeansResult object will be returned. If True, the object will be sorted in place and self will be returned.
- Returns
- MultiCKmeansResult
Sorted MultiCKmeansResult
pyckmeans.core.utils module
core utilities
- class pyckmeans.core.utils.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
Bases: json.encoder.JSONEncoder
Special JSON encoder for numpy types.
Methods
default(obj): Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
encode(o): Return a JSON string representation of a Python data structure.
iterencode(o[, _one_shot]): Encode the given object and yield each string representation as available.
- default(obj)
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
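For array-like values, one possible way to write such an encoder is sketched below. This is not the NumpyEncoder source: `ToListEncoder` and the stand-in class `FakeArray` are hypothetical, and the sketch relies on duck typing (any object with a `tolist()` method, which numpy arrays and numpy scalars provide) so that it runs with the standard library alone.

```python
import json


class ToListEncoder(json.JSONEncoder):
    """Serialize objects that expose tolist(); defer everything else to the base class."""

    def default(self, o):
        tolist = getattr(o, "tolist", None)
        if callable(tolist):
            return tolist()
        # Let the base class default method raise the TypeError
        return super().default(o)


class FakeArray:
    """Minimal stand-in for an array type, used only for this sketch."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


encoded = json.dumps({"cl": FakeArray([0, 1, 1])}, cls=ToListEncoder)
print(encoded)
```

The encoder is passed through the `cls` argument of json.dumps or json.dump, which is how a custom JSONEncoder subclass is normally plugged in.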
pyckmeans.core.wecr module
Weighted Ensemble Consensus of Random K-Means (WECR K-Means)
- exception pyckmeans.core.wecr.InvalidConstraintsError
Bases: Exception
- exception pyckmeans.core.wecr.InvalidKError
Bases: Exception
- class pyckmeans.core.wecr.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])
Bases: object
WECR K-Means
A class representing a Weighted Ensemble Consensus of Random K-Means [1].
- Parameters
- k: Union[int, Iterable[int]]
Number of clusters to draw from for each K-Means run.
- n_rep: int, optional
Number of K-Means to fit, by default 100.
- p_samp: float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat: float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- kwargs: Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
References
- 1
Lai, Y., S. He, Z. Lin, F. Yang, Q. Zhou, and X. Zhou. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.
Methods
fit(x[, progress_callback]): Fit the WECR K-Means.
predict(x[, must_link, must_not_link, ...]): Predict from WECR.
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit the WECR K-Means.
- Parameters
- x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- progress_callback: Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) -> pyckmeans.core.wecr.WECRResult
Predict from WECR.
- Parameters
- x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- must_link: Optional[Iterable], optional
Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([['A', 'B'], ['A', 'D']]). Can be None for no constraints.
- must_not_link: Optional[Iterable], optional
Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([['A', 'B'], ['A', 'D']]). Can be None for no constraints.
- gamma: float, optional
Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.
- scale_consensus_matrix: bool
If True, the consensus matrix will be scaled in such a way that the diagonal entries are all 1.
- linkage_type: str
Linkage type of the hierarchical clustering that is used for final consensus cluster calculation. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls: bool
If True, the cluster memberships of the single K-Means runs will be present in the output.
- progress_callback: Optional[Callable], optional
Optional callback function for progress reporting.
- Returns
- WECRResult
WECRResult object.
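The constraint format accepted by must_link and must_not_link (pairs of sample indices or names) can be explored with a small sketch that checks how well a given clustering agrees with a set of constraints. The helpers `to_index_pairs` and `satisfied_fraction` are hypothetical. This is not the weighting scheme of WECR or of Lai et al., only a simple way to see what a constraint pair expresses.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def to_index_pairs(constraints: Optional[Iterable], names: Optional[Sequence[str]] = None) -> List[Tuple[int, int]]:
    """Convert constraint pairs given as indices or sample names to index pairs."""
    if constraints is None:
        return []
    lookup = {name: i for i, name in enumerate(names)} if names is not None else {}
    pairs = []
    for a, b in constraints:
        ia = lookup[a] if isinstance(a, str) else int(a)
        ib = lookup[b] if isinstance(b, str) else int(b)
        pairs.append((ia, ib))
    return pairs


def satisfied_fraction(cl: Sequence[int], must_link: List[Tuple[int, int]], must_not_link: List[Tuple[int, int]]) -> float:
    """Fraction of constraints that the clustering cl respects."""
    checks = [cl[i] == cl[j] for i, j in must_link]
    checks += [cl[i] != cl[j] for i, j in must_not_link]
    return sum(checks) / len(checks) if checks else 1.0


names = ["A", "B", "C", "D"]
cl = [0, 0, 1, 1]  # A and B share a cluster, C and D share a cluster
ml = to_index_pairs([["A", "B"]], names)
mnl = to_index_pairs([["A", "D"], ["C", "D"]], names)
print(ml, mnl, satisfied_fraction(cl, ml, mnl))
```

Here the must-link pair and the first must-not-link pair are respected, while C and D end up together despite the second must-not-link pair, so two of three constraints hold.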
- class pyckmeans.core.wecr.WECRResult(consensus_matrix: numpy.ndarray, cluster_membership: numpy.ndarray, k: numpy.ndarray, bic: Optional[numpy.ndarray] = None, sil: Optional[numpy.ndarray] = None, db: Optional[numpy.ndarray] = None, ch: Optional[numpy.ndarray] = None, names: Optional[Iterable[str]] = None, km_cls: Optional[numpy.ndarray] = None)
Bases: object
Result of WECR.predict.
- Parameters
- consensus_matrix: numpy.ndarray
n * n weighted consensus (co-association) matrix, where n is the number of samples (observations, data points).
- cluster_membership: numpy.ndarray
m * n matrix of cluster memberships, where m is the number of different k values and n is the number of samples (observations, data points).
- k: Iterable[int]
Vector of cluster numbers.
- bic: Optional[numpy.ndarray]
m-length vector of BIC scores of the consensus clustering for each k.
- sil: Optional[numpy.ndarray]
m-length vector of Silhouette scores of the consensus clustering for each k.
- db: Optional[numpy.ndarray]
m-length vector of Davies-Bouldin scores of the consensus clustering for each k.
- ch: Optional[numpy.ndarray]
m-length vector of Calinski-Harabasz scores of the consensus clustering for each k.
- names: Optional[Iterable[str]]
Sample names.
- km_cls: Optional[numpy.ndarray]
m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.
- Attributes
- cmatrix: numpy.ndarray
Consensus matrix.
- cl: numpy.ndarray
Cluster memberships for each k.
- names: Optional[numpy.ndarray]
Sample names.
- k: numpy.ndarray
Number of clusters.
- bic: Optional[numpy.ndarray]
Bayesian Information Criterion score of the clustering.
- sil: Optional[numpy.ndarray]
Silhouette score of the clustering.
- db: Optional[numpy.ndarray]
Davies-Bouldin score of the clustering.
- ch: Optional[numpy.ndarray]
Calinski-Harabasz score of the clustering.
- km_cls: Optional[numpy.ndarray]
m * n matrix of predicted cluster memberships for each single K-Means run, where m is the number of single K-Means runs and n is the number of samples.
Methods
copy(): Get a deep copied WECRResult.
from_dict(wecr_res_dict): Construct WECRResult from dictionary.
from_dir(directory): Construct WECRResult from a directory containing the three files 'cmatrix.csv', 'clusters.csv', 'metrics.csv', and optionally 'km_clusters.csv'.
from_json(file, **kwargs): Construct WECRResult from JSON file.
from_json_str(json_str, **kwargs): Construct WECRResult from JSON string.
get_cl(k[, with_names]): Return cluster memberships from hierarchical clustering at a specified k.
get_cl_affinity_propagation([with_names]): Get cluster membership according to Affinity Propagation clustering.
order([method, linkage_type]): Get optimal sample order according to hierarchical clustering.
plot(k[, names, order, cmap_cm, cmap_clbar, ...]): Plot WECR result consensus matrix with consensus clusters.
plot_affinity_propagation([names, order, ...]): plot
plot_metrics([figsize]): Plot WECRResult metrics.
recalculate_cluster_memberships(x, linkage_type): ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.
reorder(order[, in_place]): Reorder samples according to provided order.
save_km_cls(out_file[, one_hot, row_names, ...]): Save predicted cluster membership for the single K-Means runs to a file.
sort([method, linkage_type, in_place]): Sort WECRResult using hierarchical clustering.
to_dict(): Convert WECRResult to dictionary.
to_dir(out_dir[, force]): Save WECRResult to directory.
to_json([file]): Convert WECRResult to JSON string or file.
- copy() -> pyckmeans.core.wecr.WECRResult
Get a deep copied WECRResult.
- Returns
- WECRResult
A deep copy of self.
- classmethod from_dict(wecr_res_dict: Dict) -> pyckmeans.core.wecr.WECRResult
Construct WECRResult from dictionary.
- Parameters
- wecr_res_dict: Dict
WECRResult as dictionary.
- Returns
- WECRResult
WECRResult
- classmethod from_dir(directory: str) -> pyckmeans.core.wecr.WECRResult
Construct WECRResult from a directory containing the three files ‘cmatrix.csv’, ‘clusters.csv’, ‘metrics.csv’, and optionally ‘km_clusters.csv’. See pyckmeans.core.wecr.WECRResult.to_dir().
- Parameters
- directory: str
WECRResult directory.
- Returns
- WECRResult
WECRResult
- Raises
- Exception
Raised if there is a problem with directory.
- classmethod from_json(file: str, **kwargs: Dict[str, Any]) -> pyckmeans.core.wecr.WECRResult
Construct WECRResult from JSON file.
- Parameters
- file: str
JSON file.
- kwargs: Dict[str, Any]
Additional keyword arguments passed to json.loads.
- Returns
- WECRResult
WECRResult
- classmethod from_json_str(json_str: str, **kwargs: Dict[str, Any]) -> pyckmeans.core.wecr.WECRResult
Construct WECRResult from JSON string.
- Parameters
- json_str: str
JSON string.
- kwargs: Dict[str, Any]
Additional keyword arguments passed to json.loads.
- Returns
- WECRResult
WECRResult
- get_cl(k: int, with_names: bool = False) -> Union[numpy.ndarray, pandas.core.series.Series]
Return cluster memberships from hierarchical clustering at a specified k.
- Parameters
- k: int
Number of clusters to return the cluster memberships for.
- with_names: bool, optional
Return cluster memberships including sample names. If True, a pandas.Series will be returned.
- Returns
- Union[numpy.ndarray, pandas.Series]
Cluster memberships.
- Raises
- wecr.InvalidKError
Raised if an invalid k argument is provided.
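The idea of reading cluster memberships off a hierarchical clustering at a chosen k can be sketched with SciPy. This is a minimal illustration, not the pyckmeans implementation: the helper `cut_at_k`, the toy consensus matrix, the use of 1 - consensus as a distance, and the zero-based labels are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy consensus matrix for 4 samples (assumed data): entry (i, j) is the
# fraction of runs in which samples i and j were clustered together.
cmatrix = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])


def cut_at_k(cmatrix, k, linkage_type="average"):
    """Hypothetical helper: cut a hierarchical clustering into k groups."""
    n = cmatrix.shape[0]
    if not 1 <= k <= n:
        raise ValueError(f"k must be between 1 and {n}, got {k}")
    # Assumption: treat 1 - consensus as the pairwise distance.
    dist = 1.0 - cmatrix
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=linkage_type)
    # 'maxclust' asks for at most k flat clusters; shift labels to start at 0.
    return fcluster(tree, t=k, criterion="maxclust") - 1


labels = cut_at_k(cmatrix, k=2)
```

In this sketch an out-of-range k raises a plain ValueError; the library documents its own wecr.InvalidKError for that case.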
- get_cl_affinity_propagation(with_names: bool = False, **kwargs: Dict[str, Any]) Union[numpy.ndarray, pandas.core.series.Series]
Get cluster membership according to Affinity Propagation clustering.
- Parameters
- with_namesbool, optional
Return cluster memberships including sample names. If True, a pandas.Series will be returned.
- kwargsDict[str, Any]
Additional keyword arguments passed to sklearn.cluster.AffinityPropagation.
- Returns
- Union[numpy.ndarray, pandas.Series]
Cluster memberships.
- order(method: str = 'GW', linkage_type: str = 'average') numpy.ndarray
Get optimal sample order according to hierarchical clustering.
- Parameters
- methodstr
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
- linkage_typestr
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- Returns
- numpy.ndarray
Optimal sample order.
References
- 1
Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.
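For the ‘OLO’ option, the underlying SciPy routine can be tried directly. The sketch below is an assumption about how a sample order might be derived from a consensus matrix, not the library's code: the helper `olo_order`, the toy matrix, and the 1 - consensus distance are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage, optimal_leaf_ordering
from scipy.spatial.distance import squareform

# Toy consensus matrix (assumed data): samples 0 and 2 co-cluster often,
# as do samples 1 and 3.
cmatrix = np.array([
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
])


def olo_order(cmatrix, linkage_type="average"):
    """Hypothetical helper: leaf order after optimal leaf ordering."""
    dist = 1.0 - cmatrix  # assumption: consensus converted to a distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=linkage_type)
    # Flip subtrees so that adjacent leaves are as similar as possible.
    ordered_tree = optimal_leaf_ordering(tree, condensed)
    return leaves_list(ordered_tree)


order = olo_order(cmatrix)
```

The returned array is a permutation of the sample indices in which strongly co-clustered samples sit next to each other.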
- plot(k: int, names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure
Plot the WECR result consensus matrix with consensus clusters.
- Parameters
- k: int
The number of clusters k to use for plotting.
- namesOptional[Iterable[str]]
Sample names to be plotted. If None, self.names will be used.
- orderOptional[Union[str, numpy.ndarray]]
Sample plotting order. Either a string determining the order method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
- cmap_cmUnion[str, matplotlib.colors.Colormap], optional
Colormap for the consensus matrix, by default ‘Blues’.
- cmap_clbarUnion[str, matplotlib.colors.Colormap], optional
Colormap for the cluster bar, by default ‘tab20’.
- figsizeTuple[float, float], optional
Figure size for the matplotlib figure, by default (7, 7).
- Returns
- matplotlib.figure.Figure
Matplotlib figure.
- plot_affinity_propagation(names: Optional[Iterable[str]] = None, order: Optional[Union[str, numpy.ndarray]] = 'GW', cmap_cm: Union[str, matplotlib.colors.Colormap] = 'Blues', cmap_clbar: Union[str, matplotlib.colors.Colormap] = 'tab20', figsize: Tuple[float, float] = (7, 7), **kwargs: Dict[str, Any]) matplotlib.figure.Figure
Plot the WECR result consensus matrix with consensus clusters calculated using Affinity Propagation.
- Parameters
- namesOptional[Iterable[str]]
Sample names to be plotted. If None, self.names will be used.
- orderOptional[Union[str, numpy.ndarray]]
Sample plotting order. Either a string determining the order method to use (see WECRResult.order), a numpy.ndarray giving the sample order, or None to apply no reordering.
- cmap_cmUnion[str, matplotlib.colors.Colormap], optional
Colormap for the consensus matrix, by default ‘Blues’.
- cmap_clbarUnion[str, matplotlib.colors.Colormap], optional
Colormap for the cluster bar, by default ‘tab20’.
- figsizeTuple[float, float], optional
Figure size for the matplotlib figure, by default (7, 7).
- kwargsDict[str, Any]
Additional keyword arguments passed to sklearn.cluster.AffinityPropagation.
- Returns
- matplotlib.figure.Figure
Matplotlib figure.
- plot_metrics(figsize: Tuple[float, float] = (7, 7)) matplotlib.figure.Figure
Plot WECRResult metrics.
- Parameters
- figsizeTuple[float, float], optional
Figure size for the matplotlib figure, by default (7, 7).
- Returns
- matplotlib.figure.Figure
Matplotlib Figure of the metrics plot.
- recalculate_cluster_memberships(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str, in_place: bool = False) pyckmeans.core.wecr.WECRResult
ATTENTION: This method may only be used if the WECRResult was not reordered, or if x was reordered the same way as the WECRResult.
Recalculate cluster memberships using hierarchical clustering based on the given linkage type.
- Parameters
- xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
The data that was used to predict the present WECRResult. An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
- linkage_typestr
Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- in_placebool
If False, a new WECRResult object will be returned. If True, the object will be modified in place and self will be returned.
- Returns
- WECRResult
WECRResult with recalculated cluster memberships.
- reorder(order: numpy.ndarray, in_place: bool = False) pyckmeans.core.wecr.WECRResult
Reorder samples according to provided order.
- Parameters
- ordernumpy.ndarray
New sample order.
- in_placebool
If False, a new, reordered WECRResult object will be returned. If True, the object will be reordered in place and self will be returned.
- Returns
- WECRResult
Reordered WECRResult
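What reordering means for the stored data can be shown with plain NumPy. This is a sketch under assumptions: the arrays `cmatrix`, `cl`, and `names` are toy stand-ins, not the actual attributes of a WECRResult.

```python
import numpy as np

# Toy stand-ins (assumed): a consensus matrix, cluster labels, sample names.
cmatrix = np.array([
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
])
cl = np.array([0, 1, 0])
names = np.array(["a", "b", "c"])

# New sample order: keep sample 0 first, then sample 2, then sample 1.
order = np.array([0, 2, 1])

# A square matrix must be permuted along rows AND columns with the same order.
cmatrix_re = cmatrix[np.ix_(order, order)]
# Per-sample vectors are permuted along their single axis.
cl_re = cl[order]
names_re = names[order]
```

Permuting only the rows of the matrix would break its symmetry, which is why the same order is applied to both axes.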
- save_km_cls(out_file: str, one_hot: bool = False, row_names: bool = False, col_names: bool = False)
Save predicted cluster membership for the single K-Means runs to a file. The file format depends on the one_hot parameter.
- Parameters
- out_filestr
Output file path.
- one_hotbool
If False, a tab-delimited text file will be written containing an n*m cluster membership matrix, where n is the number of K-Means runs and m is the number of samples.
If True, a tab-delimited text file will be written comprising n one-hot encoded m*k cluster membership matrices, each separated from the next by an empty line, where k is the number of clusters.
- row_namesbool
If True, row names will be written.
- col_namesbool
If True, column names will be written.
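The two file layouts described for one_hot can be illustrated with a small standalone writer. This is only a sketch of the format as described above: the function `write_km_cls` and the toy data are hypothetical, and row and column names are left out.

```python
import io

# Toy data (assumed): n = 2 K-Means runs (rows), m = 3 samples (columns).
km_cls = [
    [0, 1, 1],
    [1, 0, 0],
]
k = 2  # number of clusters


def write_km_cls(km_cls, k, out, one_hot=False):
    """Hypothetical writer mirroring the two layouts described above."""
    if not one_hot:
        # One line per run: an n*m matrix of cluster labels.
        for run in km_cls:
            out.write("\t".join(str(c) for c in run) + "\n")
        return
    # One m*k block per run, blocks separated by an empty line.
    for i, run in enumerate(km_cls):
        if i > 0:
            out.write("\n")
        for c in run:
            row = ["1" if j == c else "0" for j in range(k)]
            out.write("\t".join(row) + "\n")


plain_buf = io.StringIO()
write_km_cls(km_cls, k, plain_buf, one_hot=False)
plain_text = plain_buf.getvalue()

one_hot_buf = io.StringIO()
write_km_cls(km_cls, k, one_hot_buf, one_hot=True)
one_hot_text = one_hot_buf.getvalue()
```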
- sort(method: str = 'GW', linkage_type: str = 'average', in_place: bool = False) pyckmeans.core.wecr.WECRResult
Sort WECRResult using hierarchical clustering.
- Parameters
- methodstr
Reordering method. Either ‘GW’ (Gruvaeus & Wainer, 1972) [1] or ‘OLO’ for scipy.cluster.hierarchy.optimal_leaf_ordering.
- linkage_typestr
Linkage type for the hierarchical clustering. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- in_placebool
If False, a new, sorted WECRResult object will be returned. If True, the object will be sorted in place and self will be returned.
- Returns
- WECRResult
Sorted WECRResult
References
- 1
Gruvaeus, G., Wainer, H. 1972. Two Additions to Hierarchical Cluster Analysis. British Journal of Mathematical and Statistical Psychology 25(2): 200-206.
- to_dict() Dict
Convert WECRResult to dictionary.
- Returns
- Dict
WECRResult as dictionary.
- to_dir(out_dir: str, force: bool = False)
Save WECRResult to directory. The directory will contain three files: ‘cmatrix.csv’ (the consensus matrix), ‘clusters.csv’ (the consensus cluster memberships), and ‘metrics.csv’ (the clustering metrics). If the WECRResult contains clustering information for the single K-Means runs, it will be written to ‘km_clusters.csv’.
- Parameters
- out_dirstr
Output directory. Will be created if it does not exist.
- forcebool, optional
Write into out_dir even if it already exists, by default False.
- Raises
- Exception
Raised if there is a problem with out_dir.
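The interplay of out_dir and force can be pictured with a small standalone helper. This is a sketch under assumptions: `prepare_out_dir` is hypothetical, and the exact conditions under which the library raises are not spelled out above, so the checks here are one plausible reading.

```python
import os
import tempfile


def prepare_out_dir(out_dir, force=False):
    """Hypothetical helper: create out_dir, or reuse it only if force is set."""
    if os.path.exists(out_dir):
        if not os.path.isdir(out_dir):
            raise Exception(f"'{out_dir}' exists and is not a directory.")
        if not force:
            # Assumed behavior: refuse to write into an existing directory.
            raise Exception(f"'{out_dir}' already exists; use force=True.")
    else:
        os.makedirs(out_dir)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "wecr_out")
    prepare_out_dir(target)              # directory is created
    prepare_out_dir(target, force=True)  # existing directory is reused
```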
- to_json(file: Optional[str] = None, **kwargs: Dict[str, Any]) Optional[str]
Convert WECRResult to JSON string or file.
- Parameters
- fileOptional[str], optional
File path to write the WECRResult to or None. If None, the JSON string will be returned.
- kwargsDict[str, Any]
Additional keyword arguments passed to json.dump or json.dumps.
- Returns
- Optional[str]
None or JSON string.
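The "string or file" behavior follows a common pattern around json.dumps and json.dump. The sketch below shows that pattern on a plain dictionary. The function `to_json` here is a standalone illustration with assumed toy data, not the method itself.

```python
import json
import os
import tempfile


def to_json(data, file=None, **kwargs):
    """Illustrative stand-in: return a JSON string, or write to file."""
    if file is None:
        # No file given: serialize to a string and return it.
        return json.dumps(data, **kwargs)
    with open(file, "w") as fh:
        # File given: write to it; extra kwargs go to json.dump.
        json.dump(data, fh, **kwargs)
    return None


# Toy dictionary (assumed content, not the real WECRResult layout).
result = {"k": [2, 3], "cl": [[0, 0, 1], [0, 1, 2]]}

json_str = to_json(result, indent=2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")
    returned = to_json(result, file=path)
    with open(path) as fh:
        loaded = json.load(fh)
```

Keyword arguments such as indent are passed through unchanged, which matches the kwargs description above.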
Module contents
pyckmeans core module