pyckmeans package

Module contents

pyckmeans

pyckmeans, a Python package for Consensus K-Means clustering.

class pyckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Consensus K-Means.

Parameters

kint: Number of clusters.
n_repint, optional: Number of K-Means to fit, by default 100
p_sampfloat, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_featfloat, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
metricsIterable[str]: Clustering quality metrics to calculate while training. Available metrics are “sil” (Silhouette Index), “bic” (Bayesian Information Criterion), “db” (Davies-Bouldin Index), and “ch” (Calinski-Harabasz Index).
kwargsDict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.
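
The rounding rule described for p_samp and p_feat can be sketched in a few lines. This is only an illustration of "multiply, then round up"; the helper name subsample_size is hypothetical and is not part of pyckmeans, and the library's own rounding code is not shown in this reference.

```python
import math


def subsample_size(n_total: int, proportion: float) -> int:
    """Number of items drawn when proportion * n_total is rounded up.

    Hypothetical helper for illustration; not a pyckmeans function.
    """
    return math.ceil(proportion * n_total)


# 0.72 * 10 = 7.2, which is rounded up to 8
n_drawn = subsample_size(10, 0.72)
print(n_drawn)
```

Because the product is a floating-point number, values that are mathematically whole can occasionally land just above an integer and be rounded up by one.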

Methods

`fit`(x[, progress_callback])	Fit CKmeans.
`predict`(x[, linkage_type, return_cls, ...])	Predict cluster membership of new data from fitted CKmeans.

AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit CKmeans.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callbackOptional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.ckmeans.CKmeansResult

Predict cluster membership of new data from fitted CKmeans.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns

CKmeansResult: Object comprising a n * n consensus matrix and a n-length vector of predicted cluster memberships.
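
To make the term "consensus matrix" concrete, the following sketch computes one from the labels of a few subsampled clustering runs using plain Python. It follows the common consensus clustering definition (fraction of runs in which two samples share a cluster, among the runs that drew both). Whether pyckmeans normalizes in exactly this way is an assumption; the function name and the toy labels are made up for illustration.

```python
from typing import List, Optional


def consensus_matrix(runs: List[List[Optional[int]]]) -> List[List[float]]:
    """Co-clustering frequency for every pair of samples.

    runs[r][i] is the cluster label of sample i in run r, or None if
    sample i was not drawn in that run.
    """
    n = len(runs[0])
    together = [[0] * n for _ in range(n)]  # runs in which i and j share a cluster
    sampled = [[0] * n for _ in range(n)]   # runs in which both i and j were drawn
    for labels in runs:
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Three toy runs over four samples; sample 3 was not drawn in the second run.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, None],
    [0, 1, 1, 1],
]
cmat = consensus_matrix(runs)
```

A hierarchical clustering with the chosen linkage_type is then applied to such a matrix to obtain the final cluster memberships.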

class pyckmeans.DistanceMatrix(dist_mat: numpy.ndarray, names: Optional[Iterable[str]] = None)

Bases: object

Distance Matrix, optionally named.

Parameters

dist_matnumpy.ndarray: n*n distance matrix.
namesOptional[Iterable[str]]: Names, by default None.

Raises

IncompatibleNamesError: Raised if dimension of names and dist_mat are incompatible.

Attributes

shape: shape

Methods

`from_csv`(file_path[, header, index_col, sep])	Read distance matrix from CSV file.
`from_phylip`(file_path)	Read PHYLIP distance matrix.
`to_csv`(file_path[, force])	Write DistanceMatrix object to CSV.
`to_phylip`(file_path[, force])	Write distance matrix to file in PHYLIP matrix format.

static from_csv(file_path: str, header: Optional[int] = 0, index_col: Optional[int] = 0, sep: str = ',', **kwargs) → pyckmeans.distance.DistanceMatrix

Read distance matrix from CSV file.

Parameters

file_pathstr: Path to CSV file.
headerOptional[int]: Determines the row in the CSV file containing sample names. Is passed to pandas.read_csv(). By default 0, meaning the first row.
index_colOptional[int]: Determines the index column. By default, the first column is expected to contain sample names. Passed to pandas.read_csv().
sepstr: Column separator, by default ‘,’. Passed to pandas.read_csv().
**kwargs: Additional keyword arguments passed to pandas.read_csv().

Returns

pyckmeans.distance.DistanceMatrix: DistanceMatrix object.
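
The defaults header=0 and index_col=0 imply a file whose first row and first column both carry the sample names. The sketch below writes such a layout with the standard library only. The sample names and distances are made up, and that DistanceMatrix.from_csv reads exactly this file is an assumption based on the parameter descriptions above.

```python
import csv
import io

names = ["s1", "s2", "s3"]  # made-up sample names
dist = [
    [0.0, 0.1, 0.4],
    [0.1, 0.0, 0.3],
    [0.4, 0.3, 0.0],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# Header row: empty corner cell, then the sample names (header=0).
writer.writerow([""] + names)
# Each data row starts with its sample name (index_col=0).
for name, row in zip(names, dist):
    writer.writerow([name] + row)

csv_text = buffer.getvalue()
print(csv_text)
```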

static from_phylip(file_path: str) → pyckmeans.distance.DistanceMatrix

Read PHYLIP distance matrix.

Returns

DistanceMatrix: DistanceMatrix object.

property shape: Tuple[int]

Get matrix shape.

Returns

Tuple[int]: Matrix shape.

to_csv(file_path: str, force: bool = False)

Write DistanceMatrix object to CSV.

Parameters

file_pathstr: CSV file path.
forcebool, optional: Force overwrite if file_path already exists, by default False

to_phylip(file_path: str, force: bool = False)

Write distance matrix to file in PHYLIP matrix format.

Parameters

file_pathstr: Output file path.
forcebool, optional: Force overwrite if file exists, by default False

class pyckmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.

Parameters

kIterable[int]: List of cluster counts for CKmeans.
n_repint, optional: Number of K-Means to fit for each single CKmeans, by default 100
p_sampfloat, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_featfloat, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
metricsIterable[str]: Clustering quality metrics to calculate while training. Available metrics are “sil” (Silhouette Index), “bic” (Bayesian Information Criterion), “db” (Davies-Bouldin Index), and “ch” (Calinski-Harabasz Index).
kwargsDict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.

Methods

`fit`(x[, progress_callback])	Fit MultiCKmeans.
`predict`(x[, linkage_type, return_cls, ...])	Predict cluster membership of new data from all fitted CKmeans.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit MultiCKmeans.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callbackOptional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.multickmeans.MultiCKmeansResult

Predict cluster membership of new data from all fitted CKmeans.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns

MultiCKmeansResult: Object comprising, for each fitted CKmeans, a n * n consensus matrix and a n-length vector of predicted cluster memberships.

class pyckmeans.NucleotideAlignment(names: Iterable[str], sequences: numpy.ndarray, copy: bool = False, fast_encoding: bool = False)

Bases: object

Class for nucleotide alignments.

Parameters

namesList[str]: Sequence identifiers/names.
sequencesnumpy.ndarray: n*m alignment matrix, where n is the number of entries and m is the number of sites.
copybool: If True, sequences will be copied. If false, the NucleotideAlignment will use the original sequences, potentially modifying them.
fast_encodingbool: If true, a fast nucleotide encoding method without error checking will be used. ATTENTION: This will modify sequences in place.

Attributes

shape: shape

Methods

`copy`()	Return a copy of the NucleotideAlignment object.
`distance`([distance_type, pairwise_deletion])	Calculate genetic distance.
`drop_invariant_sites`([in_place])	Remove invariant sites from alignment.
`from_bp_seqio_records`(records[, fast_encoding])	Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord.
`from_file`(file_path[, file_format, ...])	Read nucleotide alignment from file.

copy() → pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Return a copy of the NucleotideAlignment object.

Returns

NucleotideAlignment: Copy of self.

distance(distance_type: str = 'p', pairwise_deletion: bool = True) → pyckmeans.distance.DistanceMatrix

Calculate genetic distance.

Parameters

distance_typestr, optional: Type of genetic distance to calculate, by default ‘p’. Available distance types are p-distances (‘p’), Jukes-Cantor distances (‘jc’), and Kimura 2-parameter distances (‘k2p’).
pairwise_deletionbool: Use pairwise deletion as action to deal with missing data. If False, complete deletion is applied. Gaps (“-”, “~”, “ ”), “?”, and ambiguous bases are treated as missing data.

Returns

pyckmeans.distance.DistanceMatrix: n*n distance matrix.
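
For orientation, the three distance types can be written out for a single pair of sequences using their textbook formulas. This is a minimal pure-Python sketch with pairwise deletion, not the pyckmeans implementation; all function names are hypothetical, and only A, C, G and T are treated as valid here.

```python
import math

VALID = set("ACGT")
PURINES = set("AG")


def _comparable_sites(a: str, b: str):
    """Pairwise deletion: keep sites where both sequences have A, C, G or T."""
    return [(x, y) for x, y in zip(a.upper(), b.upper()) if x in VALID and y in VALID]


def p_distance(a: str, b: str) -> float:
    """Proportion of comparable sites that differ."""
    sites = _comparable_sites(a, b)
    return sum(x != y for x, y in sites) / len(sites)


def jc_distance(a: str, b: str) -> float:
    """Jukes-Cantor: d = -3/4 * ln(1 - 4/3 * p)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(a: str, b: str) -> float:
    """Kimura 2-parameter: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)),
    where P is the transition and Q the transversion proportion."""
    sites = _comparable_sites(a, b)
    n = len(sites)
    # Transition: both bases are purines or both are pyrimidines, but they differ.
    transitions = sum(x != y and (x in PURINES) == (y in PURINES) for x, y in sites)
    # Transversion: one purine, one pyrimidine.
    transversions = sum((x in PURINES) != (y in PURINES) for x, y in sites)
    p_ts, q_tv = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p_ts - q_tv) * math.sqrt(1 - 2 * q_tv))


seq_a = "ACGTACGT"
seq_b = "ACGTGCTT"  # one transition (A/G) and one transversion (G/T)
print(p_distance(seq_a, seq_b), jc_distance(seq_a, seq_b), k2p_distance(seq_a, seq_b))
```

The sketch raises ZeroDivisionError when two sequences share no comparable site; a real implementation needs to handle that case explicitly.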

drop_invariant_sites(in_place: bool = False) → pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Remove invariant sites from alignment. Invariant sites are sites where each entry has the same symbol.

Parameters

in_placebool, optional: Modify self in place, by default False

Returns

NucleotideAlignment: NucleotideAlignment without invariant sites. If in_place is set to True, self is returned.
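
The idea of an invariant site can be illustrated on a plain list of strings. This sketch only mirrors the definition given above (a site where each entry has the same symbol); the helper name is hypothetical, and it works on Python strings rather than on a NucleotideAlignment.

```python
from typing import List


def drop_invariant_columns(sequences: List[str]) -> List[str]:
    """Return the sequences with all invariant columns removed."""
    if not sequences:
        return []
    n_sites = len(sequences[0])
    # A column is kept if it contains more than one distinct symbol.
    keep = [j for j in range(n_sites) if len({s[j] for s in sequences}) > 1]
    return ["".join(s[j] for j in keep) for s in sequences]


alignment = ["ACGT", "ACGA", "ATGA"]  # columns 0 and 2 are invariant
reduced = drop_invariant_columns(alignment)
print(reduced)
```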

classmethod from_bp_seqio_records(records: Iterable[Bio.SeqRecord.SeqRecord], fast_encoding: bool = False) → NucleotideAlignment

Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().

Parameters

records: Iterable[‘Bio.SeqRecord.SeqRecord’]: Iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().
fast_encodingbool: If true, a fast nucleotide encoding method without error checking will be used.

Returns

NucleotideAlignment: NucleotideAlignment object.

Raises

InvalidSeqIORecordsError: Raised if sequences have different lengths.

classmethod from_file(file_path: str, file_format='auto', fast_encoding=False) → pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Read nucleotide alignment from file.

Parameters

file_path: str: Path to alignment file.
file_format: str: Alignment file format. Either “auto”, “fasta” or “phylip”. When “auto” the file format will be inferred based on the file extension.
fast_encodingbool: If true, a fast nucleotide encoding method without error checking will be used.

Returns

NucleotideAlignment: NucleotideAlignment object.

Raises

InvalidAlignmentFileExtensionError: Raised if file_format is “auto” and the file extension is not understood.
InvalidAlignmentFileFormatError: Raised if an invalid file_format is passed.

property shape: Tuple[int, int]

Get alignment dimensions/shapes.

Returns

Tuple[int, int]: Number of samples n, number of sites m

class pyckmeans.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])

Bases: object

WECR K-Means

A class representing a Weighted Ensemble Consensus of Random K-Means [1].

Parameters

kUnion[int, Iterable[int]]: Number of clusters to draw from for each K-Means run.
n_repint, optional: Number of K-Means to fit, by default 100
p_sampfloat, optional: Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).
p_featfloat, optional: Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).
kwargsDict[str, Any]: Additional keyword arguments passed to sklearn.cluster.KMeans.

References

1: Lai, Y., He, S., Lin, Z., Yang, F., Zhou, Q., Zhou, X. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.

Methods

`fit`(x[, progress_callback])	Fit the WECR K-Means.
`predict`(x[, must_link, must_not_link, ...])	Predict from WECR.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit the WECR K-Means.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]: a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.
progress_callbackOptional[Callable]: Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) → pyckmeans.core.wecr.WECRResult

Predict from WECR.

Parameters

xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

a n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

must_linkOptional[Iterable], optional

Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.

must_not_linkOptional[Iterable], optional

Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]] or np.array([[‘A’, ‘B’], [‘A’, ‘D’]]). Can be None for no constraints.

gammafloat, optional

Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.

scale_consensus_matrixbool

If true, the consensus matrix will be scaled in such a way that the diagonal entries are all 1.

linkage_typestr

Linkage type of the hierarchical clustering that is used for final consensus cluster calculation.

One of

‘average’
‘complete’
‘single’
‘weighted’
‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable], optional

Optional callback function for progress reporting.

Returns

WECRResult: WECRResult object.
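
When group membership is known for a few samples, the two constraint lists can be derived from those labels. The sketch below builds them in the 2-dimensional format described for must_link and must_not_link. The helper name, the sample names and the labels are hypothetical; only the final, commented-out predict call refers to the signature documented above.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def constraints_from_labels(known: Dict[str, str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Turn partial labels into must-link and must-not-link pairs.

    Hypothetical helper for illustration; not a pyckmeans function.
    """
    must_link: List[List[str]] = []
    must_not_link: List[List[str]] = []
    for (a, label_a), (b, label_b) in combinations(sorted(known.items()), 2):
        if label_a == label_b:
            must_link.append([a, b])
        else:
            must_not_link.append([a, b])
    return must_link, must_not_link


known_labels = {"A": "group1", "B": "group1", "C": "group2"}  # made-up labels
ml, mnl = constraints_from_labels(known_labels)
print(ml, mnl)

# With a fitted WECR instance `wecr` and data `x` (both assumed to exist):
# result = wecr.predict(x, must_link=ml, must_not_link=mnl, gamma=0.5)
```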

pyckmeans.pcoa(dist: Union[numpy.ndarray, pyckmeans.distance.DistanceMatrix], correction: Optional[str] = None, eps: float = 1e-08) → pyckmeans.ordination.PCOAResult

Principal Coordinate Analysis.

Parameters

distUnion[numpy.ndarray, pyckmeans.distance.DistanceMatrix]

n*n distance matrix either as numpy ndarray or as pyckmeans DistanceMatrix.

correction: Optional[str]

Correction for negative eigenvalues, by default None. Available corrections are:

None: negative eigenvalues are set to 0

lingoes: Lingoes correction

cailliez: Cailliez correction

epsfloat, optional

Eigenvalues smaller than eps will be dropped. By default 1e-08.

Returns

PCOAResult: PCoA result object.

Raises

InvalidCorrectionTypeError: Raised if an unknown correction type is passed.
NegativeEigenvaluesCorrectionError: Raised if correction parameter is set and correction of negative eigenvalues is not successful.
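
The pieces documented in this module can be chained into one workflow: read an alignment, compute distances, run PCoA, then fit and predict with CKmeans. The sketch below uses only the signatures listed above. The wrapper function cluster_alignment and its defaults are hypothetical, and the choice of ‘p’ distances with the Lingoes correction is just one possible configuration.

```python
def cluster_alignment(file_path: str, k: int = 3, n_rep: int = 100):
    """Alignment file -> distances -> PCoA -> consensus K-Means result.

    Hypothetical wrapper for illustration; not part of pyckmeans.
    """
    # Imported here so that the sketch can be loaded without pyckmeans installed.
    from pyckmeans import CKmeans, NucleotideAlignment, pcoa

    alignment = NucleotideAlignment.from_file(file_path, file_format="auto")
    dist = alignment.distance(distance_type="p", pairwise_deletion=True)
    pcoa_res = pcoa(dist, correction="lingoes")

    ckm = CKmeans(k=k, n_rep=n_rep, p_samp=0.8, p_feat=0.8)
    ckm.fit(pcoa_res)
    return ckm.predict(pcoa_res, linkage_type="average")
```

Calling it requires pyckmeans and an alignment file with a recognized extension, since file_format is left at “auto”.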
