pyckmeans package

Subpackages

Module contents

pyckmeans

pyckmeans, a Python package for Consensus K-Means clustering.

class pyckmeans.CKmeans(k: int, n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Consensus K-Means.

Parameters
kint

Number of clusters.

n_repint, optional

Number of K-Means to fit, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).

metricsIterable[str]

Clustering quality metrics to calculate while training. Available metrics are

  • “sil” (Silhouette Index)

  • “bic” (Bayesian Information Criterion)

  • “db” (Davies-Bouldin Index)

  • “ch” (Calinski-Harabasz Index)

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.
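The subsample sizes described for p_samp and p_feat can be sketched as below. This is a minimal illustration, assuming the rounding is a plain ceiling of proportion times count; the helper names subsample_size and subsample_indices are hypothetical and not part of pyckmeans.

```python
import math
import random


def subsample_size(n_total: int, proportion: float) -> int:
    """Number of items drawn when `proportion` of `n_total` is rounded up."""
    return math.ceil(proportion * n_total)


def subsample_indices(n_total: int, proportion: float, rng: random.Random) -> list:
    """Draw a sorted random subset of indices without replacement (hypothetical helper)."""
    return sorted(rng.sample(range(n_total), subsample_size(n_total, proportion)))


rng = random.Random(0)
sample_idx = subsample_indices(10, 0.72, rng)   # 0.72 * 10 = 7.2, rounded up to 8
feature_idx = subsample_indices(10, 0.8, rng)   # 0.8 * 10 = 8
print(len(sample_idx), len(feature_idx))
```

Each K-Means run would then be fitted on the rows in sample_idx and the columns in feature_idx only.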

Methods

fit(x[, progress_callback])

Fit CKmeans.

predict(x[, linkage_type, return_cls, ...])

Predict cluster membership of new data from fitted CKmeans.

AVAILABLE_METRICS = ('sil', 'bic', 'db', 'ch')
fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.ckmeans.CKmeansResult

Predict cluster membership of new data from fitted CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns
CKmeansResult

Object comprising an n * n consensus matrix and an n-length vector of predicted cluster memberships.
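To make the n * n consensus matrix concrete, the sketch below computes one from the labels of several subsampled runs, using the common consensus clustering definition: for each pair of samples, the fraction of runs containing both samples in which they share a cluster. Whether pyckmeans normalizes in exactly this way is an assumption, and consensus_matrix is a hypothetical helper, not a pyckmeans function.

```python
from itertools import combinations


def consensus_matrix(runs, n):
    """Fraction of co-sampled runs in which two samples share a cluster.

    `runs` is a list of dicts mapping sample index -> cluster label for the
    samples drawn in that run. Pairs never drawn together get 0.0.
    """
    together = [[0] * n for _ in range(n)]   # co-clustered counts
    sampled = [[0] * n for _ in range(n)]    # co-sampled counts
    for labels in runs:
        for i in labels:
            sampled[i][i] += 1
            together[i][i] += 1
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


runs = [
    {0: 0, 1: 0, 2: 1},   # sample 3 not drawn in this run
    {0: 1, 1: 1, 3: 0},   # sample 2 not drawn in this run
    {0: 0, 2: 1, 3: 1},   # sample 1 not drawn in this run
]
cmat = consensus_matrix(runs, n=4)
```

Cluster labels are only compared within a run, so the arbitrary numbering of clusters across runs does not matter.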

class pyckmeans.DistanceMatrix(dist_mat: numpy.ndarray, names: Optional[Iterable[str]] = None)

Bases: object

Distance Matrix, optionally named.

Parameters
dist_matnumpy.ndarray

n*n distance matrix.

namesOptional[Iterable[str]]

Names, by default None.

Raises
IncompatibleNamesError

Raised if dimension of names and dist_mat are incompatible.

Attributes
shape

Matrix shape.

Methods

from_csv(file_path[, header, index_col, sep])

Read distance matrix from CSV file.

from_phylip(file_path)

Read PHYLIP distance matrix.

to_csv(file_path[, force])

Write DistanceMatrix object to CSV.

to_phylip(file_path[, force])

Write distance matrix to file in PHYLIP matrix format.

static from_csv(file_path: str, header: Optional[int] = 0, index_col: Optional[int] = 0, sep: str = ',', **kwargs) pyckmeans.distance.DistanceMatrix

Read distance matrix from CSV file.

Parameters
file_pathstr

Path to CSV file.

headerOptional[int]

Determines the row in the CSV file containing sample names. Is passed to pandas.read_csv(). By default 0, meaning the first row.

index_colOptional[int]

Determines the index column. By default, the first column is expected to contain sample names. Passed to pandas.read_csv().

sepstr

Column separator, by default ‘,’. Passed to pandas.read_csv().

**kwargs

Additional keyword arguments passed to pandas.read_csv().

Returns
pyckmeans.distance.DistanceMatrix

DistanceMatrix object.
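The expected CSV layout (sample names in the first row and the first column, matching header=0 and index_col=0) can be illustrated with a standard-library parser. This is only a sketch of the file layout; parse_distance_csv is a hypothetical helper and does not reproduce how pyckmeans or pandas read the file.

```python
import csv
import io


def parse_distance_csv(handle, sep=","):
    """Parse a square distance matrix with names in the first row and column."""
    rows = list(csv.reader(handle, delimiter=sep))
    header_names = rows[0][1:]                 # first row, skipping the corner cell
    names = [row[0] for row in rows[1:]]       # first column
    matrix = [[float(v) for v in row[1:]] for row in rows[1:]]
    if names != header_names or any(len(r) != len(names) for r in matrix):
        raise ValueError("row names, column names and matrix shape must agree")
    return names, matrix


csv_text = ",a,b,c\na,0.0,0.1,0.3\nb,0.1,0.0,0.2\nc,0.3,0.2,0.0\n"
names, matrix = parse_distance_csv(io.StringIO(csv_text))
```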

static from_phylip(file_path: str) pyckmeans.distance.DistanceMatrix

Read PHYLIP distance matrix.

Returns
DistanceMatrix

DistanceMatrix object.

property shape: Tuple[int]

Get matrix shape.

Returns
Tuple[int]

Matrix shape.

to_csv(file_path: str, force: bool = False)

Write DistanceMatrix object to CSV.

Parameters
file_pathstr

CSV file path.

forcebool, optional

Force overwrite if file_path already exists, by default False

to_phylip(file_path: str, force: bool = False)

Write distance matrix to file in PHYLIP matrix format.

Parameters
file_pathstr

Output file path.

forcebool, optional

Force overwrite if file exists, by default False
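For orientation, a PHYLIP distance matrix file starts with the number of taxa, followed by one line per taxon holding its name and its row of distances. The sketch below writes a relaxed, whitespace-separated variant and mimics the force flag. The function name, the four-decimal formatting, and the use of FileExistsError are assumptions for illustration, not the pyckmeans implementation.

```python
import os
import tempfile


def write_phylip_distmat(file_path, names, matrix, force=False):
    """Write a square distance matrix in a relaxed PHYLIP layout."""
    if os.path.exists(file_path) and not force:
        raise FileExistsError(f"{file_path} exists; pass force=True to overwrite")
    with open(file_path, "w") as handle:
        handle.write(f"{len(names)}\n")                      # taxon count
        for name, row in zip(names, matrix):
            values = " ".join(f"{value:.4f}" for value in row)
            handle.write(f"{name} {values}\n")               # name, then distances


names = ["a", "b", "c"]
matrix = [[0.0, 0.1, 0.3], [0.1, 0.0, 0.2], [0.3, 0.2, 0.0]]
out_path = os.path.join(tempfile.mkdtemp(), "dist.phy")
write_phylip_distmat(out_path, names, matrix)
with open(out_path) as handle:
    lines = handle.read().splitlines()
```

Calling write_phylip_distmat a second time on the same path raises unless force=True is passed.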

class pyckmeans.MultiCKMeans(k: Iterable[int], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, metrics: Iterable[str] = ('sil', 'bic'), **kwargs: Dict[str, Any])

Bases: object

Convenience class wrapping Consensus K-Means runs for multiple different numbers of clusters.

Parameters
kIterable[int]

List of cluster counts for CKmeans.

n_repint, optional

Number of K-Means to fit for each single CKmeans, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. For example, if the number of samples is 10 and p_samp is 0.72, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. For example, if the number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.2, 7.2 -> 8).

metricsIterable[str]

Clustering quality metrics to calculate while training. Available metrics are

  • “sil” (Silhouette Index)

  • “bic” (Bayesian Information Criterion)

  • “db” (Davies-Bouldin Index)

  • “ch” (Calinski-Harabasz Index)

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.

Methods

fit(x[, progress_callback])

Fit MultiCKmeans.

predict(x[, linkage_type, return_cls, ...])

Predict cluster membership of new data from all fitted CKmeans.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit MultiCKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.multickmeans.MultiCKmeansResult

Predict cluster membership of new data from all fitted CKmeans.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

linkage_typestr

Linkage type of the hierarchical clustering that is used for consensus cluster calculation. One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

Returns
MultiCKmeansResult

Object comprising, for each fitted CKmeans, an n * n consensus matrix and an n-length vector of predicted cluster memberships.

class pyckmeans.NucleotideAlignment(names: Iterable[str], sequences: numpy.ndarray, copy: bool = False, fast_encoding: bool = False)

Bases: object

Class for nucleotide alignments.

Parameters
namesList[str]

Sequence identifiers/names.

sequencesnumpy.ndarray

n*m alignment matrix, where n is the number of entries and m is the number of sites.

copybool

If True, sequences will be copied. If False, the NucleotideAlignment will use the original sequences, potentially modifying them.

fast_encodingbool

If true, a fast nucleotide encoding method without error checking will be used. ATTENTION: This will modify sequences in place.

Attributes
shape

Alignment dimensions.

Methods

copy()

Return a copy of the NucleotideAlignment object.

distance([distance_type, pairwise_deletion])

Calculate genetic distance.

drop_invariant_sites([in_place])

Remove invariant sites from alignment.

from_bp_seqio_records(records[, fast_encoding])

Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord.

from_file(file_path[, file_format, ...])

Read nucleotide alignment from file.

copy() pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Return a copy of the NucleotideAlignment object.

Returns
NucleotideAlignment

Copy of self.

distance(distance_type: str = 'p', pairwise_deletion: bool = True) pyckmeans.distance.DistanceMatrix

Calculate genetic distance.

Parameters
distance_typestr, optional

Type of genetic distance to calculate, by default ‘p’. Available distance types are p-distances (‘p’), Jukes-Cantor distances (‘jc’), and Kimura 2-parameter distances (‘k2p’).

pairwise_deletionbool

Use pairwise deletion as action to deal with missing data. If False, complete deletion is applied. Gaps (“-”, “~”, ” “), “?”, and ambiguous bases are treated as missing data.

Returns
pyckmeans.distance.DistanceMatrix

n*n distance matrix.
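The three distance types follow standard textbook formulas, sketched here for a single pair of sequences with pairwise deletion: any site where either sequence has something other than A, C, G or T is skipped. The function names are hypothetical, and the sketch makes no claim about how pyckmeans implements or vectorizes these calculations.

```python
import math

PURINES, PYRIMIDINES = set("AG"), set("CT")


def site_counts(seq_a, seq_b):
    """Count compared sites, transitions and transversions (pairwise deletion)."""
    n = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # gap, '?', or ambiguous base: skip site
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1              # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1            # purine<->pyrimidine
    return n, transitions, transversions


def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ."""
    n, ts, tv = site_counts(seq_a, seq_b)
    return (ts + tv) / n


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    n, ts, tv = site_counts(seq_a, seq_b)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


seq_a, seq_b = "ACGTACGT-A", "GCGTACGAAN"   # 8 comparable sites, 2 differences
print(p_distance(seq_a, seq_b), jc_distance(seq_a, seq_b), k2p_distance(seq_a, seq_b))
```

The sketch does not guard against pairs with no comparable sites, or against distances that are undefined because the logarithm's argument is not positive.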

drop_invariant_sites(in_place: bool = False) pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Remove invariant sites from alignment. Invariant sites are sites where each entry has the same symbol.

Parameters
in_placebool, optional

Modify self in place, by default False

Returns
NucleotideAlignment

NucleotideAlignment without invariant sites. If in_place is set to True, self is returned.
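The definition of an invariant site can be illustrated on plain strings. This is a minimal sketch that compares raw symbols column by column; the standalone function below is hypothetical and works on a list of equal-length strings rather than on a NucleotideAlignment.

```python
def drop_invariant_sites(sequences):
    """Return sequences restricted to columns holding more than one symbol."""
    keep = [
        col for col in range(len(sequences[0]))
        if len({seq[col] for seq in sequences}) > 1
    ]
    return ["".join(seq[col] for col in keep) for seq in sequences]


alignment = ["ACGTA", "ACCTA", "ATGTA"]
variable = drop_invariant_sites(alignment)   # columns 1 and 2 vary, the rest do not
```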

classmethod from_bp_seqio_records(records: Iterable[Bio.SeqRecord.SeqRecord], fast_encoding: bool = False) NucleotideAlignment

Build NucleotideAlignment from iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().

Parameters
records: Iterable[‘Bio.SeqRecord.SeqRecord’]

Iterable of Bio.SeqRecord.SeqRecord. Such an iterable is, for example, returned by Bio.SeqIO.parse() or can be constructed using Bio.Align.MultipleSequenceAlignment().

fast_encodingbool

If true, a fast nucleotide encoding method without error checking will be used.

Returns
NucleotideAlignment

NucleotideAlignment object.

Raises
InvalidSeqIORecordsError

Raised if sequences have different lengths.

classmethod from_file(file_path: str, file_format='auto', fast_encoding=False) pyckmeans.io.nucleotide_alignment.NucleotideAlignment

Read nucleotide alignment from file.

Parameters
file_path: str

Path to alignment file.

file_format: str

Alignment file format. Either “auto”, “fasta” or “phylip”. When “auto” the file format will be inferred based on the file extension.

fast_encodingbool

If true, a fast nucleotide encoding method without error checking will be used.

Returns
NucleotideAlignment

NucleotideAlignment object.

Raises
InvalidAlignmentFileExtensionError

Raised if file_format is “auto” and the file extension is not understood.

InvalidAlignmentFileFormatError

Raised if an invalid file_format is passed.
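The “auto” behaviour and the two error cases can be sketched as a small resolver. The extension table below is an assumption (the extensions pyckmeans actually recognizes may differ), resolve_format is a hypothetical helper, and ValueError stands in for the library's own exception classes.

```python
import os

# Assumed extension table; the extensions pyckmeans accepts may differ.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".phylip": "phylip",
    ".phy": "phylip",
}


def resolve_format(file_path, file_format="auto"):
    """Return 'fasta' or 'phylip', inferring from the extension when 'auto'."""
    if file_format == "auto":
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in EXTENSION_TO_FORMAT:
            # stands in for InvalidAlignmentFileExtensionError
            raise ValueError(f"cannot infer format from extension {ext!r}")
        return EXTENSION_TO_FORMAT[ext]
    if file_format not in ("fasta", "phylip"):
        # stands in for InvalidAlignmentFileFormatError
        raise ValueError(f"unknown file format {file_format!r}")
    return file_format


print(resolve_format("alignment.fasta"))          # inferred from the extension
print(resolve_format("alignment.txt", "phylip"))  # explicit format wins
```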

property shape: Tuple[int, int]

Get alignment dimensions/shapes.

Returns
Tuple[int, int]

Number of samples n, number of sites m

class pyckmeans.WECR(k: Union[int, Iterable[int]], n_rep: int = 100, p_samp: float = 0.8, p_feat: float = 0.8, **kwargs: Dict[str, Any])

Bases: object

WECR K-Means

A class representing a Weighted Ensemble Consensus of Random K-Means [1].

Parameters
kUnion[int, Iterable[int]]

Number of clusters to draw from for each K-Means run.

n_repint, optional

Number of K-Means to fit, by default 100

p_sampfloat, optional

Proportion of samples (observations) to randomly draw per K-Means run, by default 0.8. The resulting number of samples will be rounded up. I.e. if number of samples is 10 and p_samp is 0.75, each K-Means will use 8 randomly drawn samples (0.72 * 10 = 7.2, 7.2 -> 8).

p_featfloat, optional

Proportion of features (predictors) to randomly draw per K-Means run, by default 0.8. The resulting number of features will be rounded up. I.e. if number of features is 10 and p_feat is 0.72, each K-Means will use 8 randomly drawn features (0.72 * 10 = 7.5, 7.2 -> 8).

kwargsDict[str, Any]

Additional keyword arguments passed to sklearn.cluster.KMeans.
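When k is an iterable, each K-Means run uses one cluster count drawn from it. The sketch below assumes a uniform random draw per run, which the documentation does not state explicitly; draw_cluster_counts is a hypothetical helper.

```python
import random


def draw_cluster_counts(k, n_rep, rng):
    """Pick one cluster count per K-Means run from an int or an iterable of ints."""
    candidates = [k] if isinstance(k, int) else list(k)
    return [rng.choice(candidates) for _ in range(n_rep)]   # uniform draw (assumption)


rng = random.Random(42)
ks = draw_cluster_counts(range(2, 6), n_rep=10, rng=rng)    # values from 2 to 5
fixed = draw_cluster_counts(3, n_rep=4, rng=rng)            # a single int is used as-is
```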

References

1

Lai, Y., He, S., Lin, Z., Yang, F., Zhou, Q., Zhou, X. 2019. “An Adaptive Robust Semi-Supervised Clustering Framework Using Weighted Consensus of Random K-Means Ensemble”. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1877-1890. doi: 10.1109/TKDE.2019.2952596.

Methods

fit(x[, progress_callback])

Fit the WECR K-Means.

predict(x[, must_link, must_not_link, ...])

Predict from WECR.

fit(x: Union[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.core.frame.DataFrame], progress_callback: Optional[Callable] = None)

Fit the WECR K-Means.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

progress_callbackOptional[Callable]

Optional callback function for progress reporting.

predict(x: Union[numpy.ndarray, pandas.core.frame.DataFrame, pyckmeans.ordination.PCOAResult], must_link: Optional[Iterable] = None, must_not_link: Optional[Iterable] = None, gamma: float = 0.5, scale_consensus_matrix: bool = True, linkage_type: str = 'average', return_cls: bool = False, progress_callback: Optional[Callable] = None) pyckmeans.core.wecr.WECRResult

Predict from WECR.

Parameters
xUnion[numpy.ndarray, pyckmeans.ordination.PCOAResult, pandas.DataFrame]

An n * m matrix (numpy.ndarray) or dataframe (pandas.DataFrame), where n is the number of samples (observations) and m is the number of features (predictors). If x is a dataframe, the index will be used as sample names. Alternatively a pyckmeans.ordination.PCOAResult as returned from pyckmeans.pcoa.

must_linkOptional[Iterable], optional

Must-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]], np.array([[‘A’, ‘B’], [‘A’, ‘D’]]) Can be None for no constraints.

must_not_linkOptional[Iterable], optional

Must-not-link constraints. Any 2-dimensional iterable object with constraints as first dimension and sample indices (or names) as second dimension. For example: [[1, 2], [3, 4]], np.array([[‘A’, ‘B’], [‘A’, ‘D’]]) Can be None for no constraints.

gammafloat, optional

Weight parameter for the constraints. Must be between 0.0 and 1.0, by default 0.5. Higher values increase the weight of the constraints on the final result.

scale_consensus_matrixbool

If true, the consensus matrix will be scaled in such a way that the diagonal entries are all 1.

linkage_typestr

Linkage type of the hierarchical clustering that is used for final consensus cluster calculation.

One of

  • ‘average’

  • ‘complete’

  • ‘single’

  • ‘weighted’

  • ‘centroid’

See scipy.cluster.hierarchy.linkage for details.

return_clsbool

If True, the cluster memberships of the single K-Means runs will be present in the output.

progress_callbackOptional[Callable], optional

Optional callback function for progress reporting.

Returns
WECRResult

WECRResult object.
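The constraint format accepted by must_link and must_not_link (a sequence of sample pairs) can be illustrated by checking a clustering against such pairs. This sketch only counts satisfied constraints; it is not the weighting scheme WECR applies, and constraint_satisfaction is a hypothetical helper.

```python
def constraint_satisfaction(labels, must_link=None, must_not_link=None):
    """Fraction of pairwise constraints that a clustering satisfies."""
    must_link = must_link or []
    must_not_link = must_not_link or []
    satisfied = sum(labels[a] == labels[b] for a, b in must_link)
    satisfied += sum(labels[a] != labels[b] for a, b in must_not_link)
    total = len(must_link) + len(must_not_link)
    return satisfied / total if total else 1.0   # no constraints: trivially satisfied


labels = {"A": 0, "B": 0, "C": 1, "D": 1}        # sample name -> cluster label
score = constraint_satisfaction(
    labels,
    must_link=[["A", "B"], ["A", "D"]],          # A-B holds, A-D is violated
    must_not_link=[["B", "C"]],                  # B-C holds
)
```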

pyckmeans.pcoa(dist: Union[numpy.ndarray, pyckmeans.distance.DistanceMatrix], correction: Optional[str] = None, eps: float = 1e-08) pyckmeans.ordination.PCOAResult

Principal Coordinate Analysis.

Parameters
distUnion[numpy.ndarray, pyckmeans.distance.DistanceMatrix]

n*n distance matrix either as numpy ndarray or as pyckmeans DistanceMatrix.

correction: Optional[str]

Correction for negative eigenvalues, by default None. Available corrections are:

  • None: negative eigenvalues are set to 0

  • lingoes: Lingoes correction

  • cailliez: Cailliez correction

epsfloat, optional

Eigenvalues smaller than eps will be dropped. By default 1e-08.

Returns
PCOAResult

PCoA result object.

Raises
InvalidCorrectionTypeError

Raised if an unknown correction type is passed.

NegativeEigenvaluesCorrectionError

Raised if correction parameter is set and correction of negative eigenvalues is not successful.
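As background, classical PCoA starts by double centering the squared distance matrix (Gower centering); the eigenvalues referred to above are those of the resulting matrix. The sketch below shows only that first step in pure Python. The function name is hypothetical, and the eigendecomposition, the eps filter, and the Lingoes and Cailliez corrections are deliberately left out.

```python
def double_center(dist):
    """Gower double centering: B = -1/2 * J * D^2 * J with J = I - (1/n) * 1 1^T."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]               # squared distances
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]


# Three points on a line at positions 0, 3 and 6
dist = [[0.0, 3.0, 6.0], [3.0, 0.0, 3.0], [6.0, 3.0, 0.0]]
gram = double_center(dist)
```

For Euclidean input such as this, gram equals the matrix of inner products of the centered coordinates (-3, 0, 3). Distances that are not Euclidean can yield negative eigenvalues, which is the case the correction parameter addresses.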