pyckmeans package
Subpackages
Module contents
pyckmeans
pyckmeans, a Python package for Consensus K-Means clustering.
- class pyckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])
Bases: object
Consensus K-Means.
- Parameters
- k : int
Number of clusters.
- n_rep : int, optional
Number of K-Means to fit, by default 100.
- p_samp : float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples is rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat : float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features is rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- metrics : Iterable[str]
Clustering quality metrics to calculate while training, by default ('sil', 'bic'). Available metrics are
“sil” (Silhouette Index)
“bic” (Bayesian Information Criterion)
“db” (Davies-Bouldin Index)
“ch” (Calinski-Harabasz Index)
- kwargs : Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
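The rounding rule for p_samp and p_feat can be illustrated with a short standard-library sketch. It is not the pyckmeans implementation: the helper names n_drawn and draw_subset are hypothetical, and drawing without replacement is an assumption.

```python
import math
import random


def n_drawn(n_total: int, proportion: float) -> int:
    """Number of items drawn for one run: proportion * n_total, rounded up."""
    return math.ceil(proportion * n_total)


def draw_subset(n_samples, n_features, p_samp=0.8, p_feat=0.8, seed=None):
    """Draw sorted row and column indices for a single K-Means run.

    Assumption: indices are drawn without replacement.
    """
    rng = random.Random(seed)
    rows = sorted(rng.sample(range(n_samples), n_drawn(n_samples, p_samp)))
    cols = sorted(rng.sample(range(n_features), n_drawn(n_features, p_feat)))
    return rows, cols


# 10 samples with p_samp = 0.72 gives ceil(7.2) = 8 rows per run.
rows, cols = draw_subset(n_samples=10, n_features=10, p_samp=0.72, p_feat=0.8, seed=0)
print(len(rows), len(cols))
```

Because the count is rounded up, every run uses at least one sample and one feature, however small the proportion.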
Methods
fit(x[, progress_callback]): Fit CKmeans.
predict(x[, linkage_type, return_cls, ...]): Predict cluster membership of new data from fitted CKmeans.
- AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit CKmeans.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- progress_callback : Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.ckmeans.CKmeansResult
Predict cluster membership of new data from fitted CKmeans.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- linkage_type : str
Linkage type of the hierarchical clustering that is used for consensus cluster calculation, by default ‘average’. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls : bool
If True, the cluster memberships of the single K-Means runs will be present in the output. By default False.
- progress_callback : Optional[Callable]
Optional callback function for progress reporting.
- Returns
- CKmeansResult
Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
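To make the n * n consensus matrix concrete, here is a minimal sketch of one common definition from the consensus clustering literature: entry (i, j) is the fraction of runs, among those in which both samples were drawn, that placed samples i and j in the same cluster. Whether pyckmeans computes its matrix exactly this way is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional


def consensus_matrix(runs: List[List[Optional[int]]]) -> List[List[float]]:
    """Fraction of runs that co-cluster each pair of samples.

    runs[r][i] is the cluster label of sample i in run r, or None if
    sample i was not drawn in that run.
    """
    n = len(runs[0])
    together = [[0] * n for _ in range(n)]
    co_drawn = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                co_drawn[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / co_drawn[i][j] if co_drawn[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Three toy runs over four samples; sample 3 was not drawn in the second run.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, None],
    [0, 1, 1, 1],
]
cmat = consensus_matrix(runs)
```

Cluster labels are only compared within a run, so the arbitrary numbering of clusters across runs does not matter. The final memberships are then obtained by hierarchical clustering of this matrix with the chosen linkage_type.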
- class pyckmeans.DistanceMatrix(dist_mat: numpy.ndarray, names: Optional[Iterable[str]] = None)
Bases: object
Distance Matrix, optionally named.
- Parameters
- dist_mat : numpy.ndarray
n*n distance matrix.
- names : Optional[Iterable[str]]
Names, by default None.
- Raises
- IncompatibleNamesError
Raised if dimension of names and dist_mat are incompatible.
- Attributes
shapeshape
Methods
from_csv(file_path[, header, index_col, sep]): Read distance matrix from CSV file.
from_phylip(file_path): Read PHYLIP distance matrix.
to_csv(file_path[, force]): Write DistanceMatrix object to CSV.
to_phylip(file_path[, force]): Write distance matrix to file in PHYLIP matrix format.
- static from_csv(file_path: str, header: Optional[int] = 0, index_col: Optional[int] = 0, sep: str = ',', **kwargs) pyckmeans.distance.DistanceMatrix
Read distance matrix from CSV file.
- Parameters
- file_path : str
Path to CSV file.
- header : Optional[int]
Row of the CSV file containing sample names. Passed to pandas.read_csv(). By default 0, meaning the first row.
- index_col : Optional[int]
Column of the CSV file containing sample names. Passed to pandas.read_csv(). By default 0, meaning the first column.
- sep : str
Column separator, by default ‘,’. Passed to pandas.read_csv().
- **kwargs
Additional keyword arguments passed to pandas.read_csv().
- Returns
- pyckmeans.distance.DistanceMatrix
DistanceMatrix object.
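The CSV layout implied by the defaults header=0 and index_col=0 is a square table whose first row and first column both carry the sample names. The sketch below parses such a table with the standard library only. It illustrates the layout, not pyckmeans.DistanceMatrix.from_csv itself, and read_distance_csv is a hypothetical helper.

```python
import csv
import io

# Example file content: names in the first row and in the first column.
csv_text = """,A,B,C
A,0.0,0.1,0.4
B,0.1,0.0,0.3
C,0.4,0.3,0.0
"""


def read_distance_csv(handle, sep=","):
    """Parse a named square distance matrix from an open text handle."""
    rows = list(csv.reader(handle, delimiter=sep))
    header_names = rows[0][1:]            # header=0: first row holds names
    names = [row[0] for row in rows[1:]]  # index_col=0: first column holds names
    matrix = [[float(v) for v in row[1:]] for row in rows[1:]]
    if len(matrix) != len(header_names) or any(len(r) != len(matrix) for r in matrix):
        raise ValueError("distance matrix must be square")
    return names, matrix


names, matrix = read_distance_csv(io.StringIO(csv_text))
```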
- static from_phylip(file_path: str) pyckmeans.distance.DistanceMatrix
Read PHYLIP distance matrix.
- Returns
- DistanceMatrix
DistanceMatrix object.
- property shape: Tuple[int]
Get matrix shape.
- Returns
- Tuple[int]
Matrix shape.
- to_csv(file_path: str, force: bool = False)
Write DistanceMatrix object to CSV.
- Parameters
- file_path : str
CSV file path.
- force : bool, optional
Force overwrite if file_path already exists, by default False.
- to_phylip(file_path: str, force: bool = False)
Write distance matrix to file in PHYLIP matrix format.
- Parameters
- file_path : str
Output file path.
- force : bool, optional
Force overwrite if file_path already exists, by default False.
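As a rough illustration of the PHYLIP matrix layout and of the force flag, the sketch below writes a square matrix with the number of entries on the first line and one named row per entry. The number formatting, the whitespace-separated ("relaxed") names, and the use of FileExistsError are assumptions; the actual to_phylip output and error type may differ. Strict PHYLIP pads or truncates names to 10 characters.

```python
import os
import tempfile


def write_phylip_matrix(file_path, names, matrix, force=False):
    """Write a square distance matrix in a relaxed PHYLIP layout."""
    # Refuse to overwrite an existing file unless force is set.
    if os.path.exists(file_path) and not force:
        raise FileExistsError(f"{file_path} exists; pass force=True to overwrite")
    with open(file_path, "w") as handle:
        handle.write(f"{len(names)}\n")  # first line: number of entries
        for name, row in zip(names, matrix):
            handle.write(name + " " + " ".join(f"{d:.4f}" for d in row) + "\n")


names = ["A", "B", "C"]
matrix = [[0.0, 0.1, 0.4], [0.1, 0.0, 0.3], [0.4, 0.3, 0.0]]
out_path = os.path.join(tempfile.mkdtemp(), "dist.phy")
write_phylip_matrix(out_path, names, matrix)
with open(out_path) as handle:
    lines = handle.read().splitlines()
```

Calling write_phylip_matrix a second time on the same path fails unless force=True is passed, mirroring the documented default of force=False.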
- class pyckmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])
Bases: object
Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.
- Parameters
- k : Iterable[int]
List of cluster counts for CKmeans.
- n_rep : int, optional
Number of K-Means to fit for each single CKmeans, by default 100.
- p_samp : float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples is rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat : float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features is rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- metrics : Iterable[str]
Clustering quality metrics to calculate while training, by default ('sil', 'bic'). Available metrics are
“sil” (Silhouette Index)
“bic” (Bayesian Information Criterion)
“db” (Davies-Bouldin Index)
“ch” (Calinski-Harabasz Index)
- kwargs : Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
Methods
fit(x[, progress_callback]): Fit MultiCKmeans.
predict(x[, linkage_type, return_cls, ...]): Predict cluster membership of new data from all fitted CKmeans.
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit MultiCKmeans.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- progress_callback : Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.multickmeans.MultiCKmeansResult
Predict cluster membership of new data from all fitted CKmeans.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- linkage_type : str
Linkage type of the hierarchical clustering that is used for consensus cluster calculation, by default ‘average’. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls : bool
If True, the cluster memberships of the single K-Means runs will be present in the output. By default False.
- progress_callback : Optional[Callable]
Optional callback function for progress reporting.
- Returns
- CKmeansResult
Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
- class pyckmeans.NucleotideAlignment(names: Iterable[str], sequences: numpy.ndarray, copy: bool = False, fast_encoding: bool = False)
Bases: object
Class for nucleotide alignments.
- Parameters
- names : List[str]
Sequence identifiers/names.
- sequences : numpy.ndarray
n*m alignment matrix, where n is the number of entries and m is the number of sites.
- copy : bool
If True, sequences will be copied. If False, the NucleotideAlignment will use the original sequences, potentially modifying them.
- fast_encoding : bool
If True, a fast nucleotide encoding method without error checking will be used. ATTENTION: This will modify sequences in place.
- Attributes
shapeshape
Methods
copy(): Return a copy of the NucleotideAlignment object.
distance([distance_type, pairwise_deletion]): Calculate genetic distance.
drop_invariant_sites([in_place]): Remove invariant sites from alignment.
from_bp_seqio_records(records[, fast_encoding]): Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord.
from_file(file_path[, file_format, ...]): Read nucleotide alignment from file.
- copy() pyckmeans.io.nucleotide_alignment.NucleotideAlignment
Return a copy of the NucleotideAlignment object.
- Returns
- NucleotideAlignment
Copy of self.
- distance(distance_type: str = 'p', pairwise_deletion: bool = True) pyckmeans.distance.DistanceMatrix
Calculate genetic distance.
- Parameters
- distance_type : str, optional
Type of genetic distance to calculate, by default ‘p’. Available distance types are p-distances (‘p’), Jukes-Cantor distances (‘jc’), and Kimura 2-parameter distances (‘k2p’).
- pairwise_deletion : bool
Use pairwise deletion to deal with missing data, by default True. If False, complete deletion is applied. Gaps (“-”, “~”, “ ”), “?”, and ambiguous bases are treated as missing data.
- Returns
- pyckmeans.distance.DistanceMatrix
n*n distance matrix.
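The p-distance with pairwise deletion and the Jukes-Cantor correction can be sketched in a few lines. This is a plain-Python illustration on strings, not the pyckmeans implementation, which operates on an encoded alignment matrix. Treating every symbol other than A, C, G, T as missing is a simplifying assumption that covers gaps, “?”, and ambiguity codes.

```python
import math

VALID = set("ACGT")  # assumption: anything else counts as missing data


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among sites present in both sequences."""
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in VALID or b not in VALID:
            continue  # pairwise deletion: skip sites missing in either sequence
        compared += 1
        differing += a != b
    if compared == 0:
        return float("nan")
    return differing / compared


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance, d = -3/4 * ln(1 - 4p/3); only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p_full = p_distance("ACGT", "ACGA")  # 1 of 4 sites differs
p_gap = p_distance("AC-T", "ACGA")   # gapped site skipped: 1 of 3 sites differs
d_jc = jc_distance(p_full)
```

With complete deletion, a site containing missing data in any sequence would instead be removed from the whole alignment before any pair is compared.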
- drop_invariant_sites(in_place: bool = False) pyckmeans.io.nucleotide_alignment.NucleotideAlignment
Remove invariant sites from alignment. Invariant sites are sites where every entry has the same symbol.
- Parameters
- in_place : bool, optional
Modify self in place, by default False.
- Returns
- NucleotideAlignment
NucleotideAlignment without invariant sites. If in_place is set to True, self is returned.
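A minimal sketch of the definition above, on plain strings rather than a NucleotideAlignment: a site is kept only if at least two different symbols occur in that column. How the library treats gaps or ambiguous bases when testing for invariance is not stated here, so this sketch simply compares raw symbols.

```python
def drop_invariant_sites(sequences):
    """Return the sequences restricted to variable sites, plus the kept indices."""
    n_sites = len(sequences[0])
    # A column is variable if it contains more than one distinct symbol.
    keep = [j for j in range(n_sites) if len({seq[j] for seq in sequences}) > 1]
    return ["".join(seq[j] for j in keep) for seq in sequences], keep


aln = ["ACGT", "ACGA", "ATGA"]
variable, kept = drop_invariant_sites(aln)
print(variable, kept)
```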
- classmethod from_bp_seqio_records(records: Iterable[Bio.SeqRecord.SeqRecord], fast_encoding: bool = False) NucleotideAlignment
Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().
- Parameters
- records: Iterable[‘Bio.SeqRecord.SeqRecord’]
Iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().
- fast_encoding : bool
If true, a fast nucleotide encoding method without error checking will be used.
- Returns
- NucleotideAlignment
NucleotideAlignment object.
- Raises
- InvalidSeqIORecordsError
Raised if sequences have different lengths.
- classmethod from_file(file_path: str, file_format='auto', fast_encoding=False) pyckmeans.io.nucleotide_alignment.NucleotideAlignment
Read nucleotide alignment from file.
- Parameters
- file_path: str
Path to alignment file.
- file_format: str
Alignment file format. Either “auto”, “fasta” or “phylip”. When “auto” the file format will be inferred based on the file extension.
- fast_encoding : bool
If true, a fast nucleotide encoding method without error checking will be used.
- Returns
- NucleotideAlignment
NucleotideAlignment object.
- Raises
- InvalidAlignmentFileExtensionError
Raised if file_format is “auto” and the file extension is not understood.
- InvalidAlignmentFileFormatError
Raised if an invalid file_format is passed.
- property shape: Tuple[int, int]
Get alignment dimensions/shapes.
- Returns
- Tuple[int, int]
Number of samples n, number of sites m
- class pyckmeans.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])
Bases: object
WECR K-Means.
A class representing a Weighted Ensemble Consensus of Random K-Means [1].
- Parameters
- k : Union[int, Iterable[int]]
Number of clusters to draw from for each K-Means run.
- n_rep : int, optional
Number of K-Means to fit, by default 100.
- p_samp : float, optional
Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples is rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
- p_feat : float, optional
Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features is rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
- kwargs : Dict[str, Any]
Additional keyword arguments passed to sklearn.cluster.KMeans.
References
- [1]
Lai, Y., He, S., Lin, Z., Yang, F., Zhou, Q., Zhou, X. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.
Methods
fit(x[, progress_callback]): Fit the WECR K-Means.
predict(x[, must_link, must_not_link, ...]): Predict from WECR.
- fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)
Fit the WECR K-Means.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- progress_callback : Optional[Callable]
Optional callback function for progress reporting.
- predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.wecr.WECRResult
Predict from WECR.
- Parameters
- x : Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]
An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively, a pyckmeans.ordination.PCOAResult as returned by pyckmeans.pcoa.
- must_link : Optional[Iterable], optional
Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.
- must_not_link : Optional[Iterable], optional
Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.
- gamma : float, optional
Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.
- scale_consensus_matrix : bool
If True, the consensus matrix will be scaled so that all diagonal entries are 1. By default True.
- linkage_type : str
Linkage type of the hierarchical clustering that is used for final consensus cluster calculation, by default ‘average’. One of
‘average’
‘complete’
‘single’
‘weighted’
‘centroid’
See scipy.cluster.hierarchy.linkage for details.
- return_cls : bool
If True, the cluster memberships of the single K-Means runs will be present in the output. By default False.
- progress_callback : Optional[Callable], optional
Optional callback function for progress reporting.
- Returns
- WECRResult
WECRResult object.
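The constraint format described above, one pair of sample indices per constraint, can be illustrated with a small check of how many constraints a given labelling satisfies. This only demonstrates the shape and meaning of must_link and must_not_link. It is not the weighting scheme WECR applies internally, and constraint_satisfaction is a hypothetical helper.

```python
def constraint_satisfaction(labels, must_link=None, must_not_link=None):
    """Fraction of pairwise constraints that a cluster labelling satisfies."""
    must_link = must_link or []
    must_not_link = must_not_link or []
    # A must-link pair is satisfied when both samples share a cluster.
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    # A must-not-link pair is satisfied when the samples are separated.
    satisfied += sum(labels[i] != labels[j] for i, j in must_not_link)
    total = len(must_link) + len(must_not_link)
    return satisfied / total if total else 1.0


labels = [0, 0, 1, 1, 2]
score = constraint_satisfaction(
    labels,
    must_link=[[0, 1], [2, 4]],  # samples 0 and 1 together, 2 and 4 together
    must_not_link=[[1, 2]],      # samples 1 and 2 apart
)
```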
- pyckmeans.pcoa(dist: Union[numpy.ndarray, pyckmeans.distance.DistanceMatrix], correction: Optional[str] = None, eps: float = 1e-08) pyckmeans.ordination.PCOAResult
Principal Coordinate Analysis.
- Parameters
- dist : Union[numpy.ndarray, pyckmeans.distance.DistanceMatrix]
n*n distance matrix, either as numpy ndarray or as pyckmeans DistanceMatrix.
- correction : Optional[str]
Correction for negative eigenvalues, by default None. Available corrections are:
None: negative eigenvalues are set to 0
lingoes: Lingoes correction
cailliez: Cailliez correction
- eps : float, optional
Eigenvalues smaller than eps will be dropped, by default 1e-08.
- Returns
- PCOAResult
PCoA result object.
- Raises
- InvalidCorrectionTypeError
Raised if an unknown correction type is passed.
- NegativeEigenvaluesCorrectionError
Raised if correction parameter is set and correction of negative eigenvalues is not successful.
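For orientation, classical PCoA starts by double-centering the squared distance matrix; the principal coordinates are then the eigenvectors of the centered matrix scaled by the square roots of their eigenvalues. The sketch below shows only the centering step in plain Python and assumes a symmetric input. It follows the textbook formulation, not the pyckmeans source.

```python
def double_center(dist):
    """Gower double-centering: B[i][j] = -0.5 * (d2 - row_i - row_j + grand)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    # For a symmetric matrix, row means and column means coincide.
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]


# Distances between three points on a line at positions 0, 1 and 3.
dist = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
b = double_center(dist)
```

For Euclidean distances the centered matrix has no negative eigenvalues. Non-Euclidean distances can produce negative ones, which is the case the correction parameter addresses.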