Parameters and defaults

For methods and functions in sistem which utilize parameters, you can first create an initial Parameters object and pass it to the methods/functions using the params keyword. This is more convenient than repetitively inputting each parameter individually, and the Parameters class additionally contains some useful functionality to check types and more. Note that some or all of the individual parameters can be set manually even if a custom Parameters object is provided.

class sistem.Parameters(N0: int = 10, growth_rate: float = 0.0051, max_growth_rate_multiplier: int | float = 5, capacities: int | ~typing.List[int] = <factory>, nsites: int = 1, epsilon: float = 1e-08, lifespan_mean: int | float = 1, ref: str | None = None, alt_ref: str | None = None, chrom_names: ~typing.List[str] | None = None, chrom_lens: int | ~typing.Dict | None = None, num_chroms: int = 22, region_len: int = 5000000, arm_ratios: float | ~typing.Dict = 0.5, focal_driver_rate: float = 0.0005, SNV_driver_rate: float = 0.0005, arm_rate: float = 0.0001, chromosomal_rate: float = 5e-05, WGD_rate: float = 1e-08, focal_gain_rate: float = 0.5, chrom_dup_rate: float = 0.5, length_mean: int | float = 1.5, mag_mean: int | float = 1.2, focal_pass_rate: float = 0.01, SNV_pass_rate: float = 0.01, CN_coeff: float = 0.25, SNV_coeff: float = 0.1, OG_r: float = 0.05, TSG_r: float = 0.05, EG_r: float = 0.05, alter_prop: float = 0.3, max_region_CN: int = 10, max_region_SNV: int = 10, max_ploidy: int | float = 8, min_ploidy: int | float = 1.5, max_distinct_driv_ratio: float = 0.8, t_max: int = 6000, min_detectable: int = 500000.0, ncells_prim: int = 100, ncells_meta: int = 100, ncells_normal: int = 1, min_mut_fraction: float = 0.05, bin_size: int | None = None, coverage: int | float = 0.1, read_len: int = 150, seq_error: float = 0.02, lorenz_y: float = 0.5, out_dir: str = 'out', log_path: str | None = None, num_processors: int = 1)

Global parameter values for a simulation experiment.

N0

The initial number of cells at time t=0 and upon site seeding. Default: 10.

Type:

int

growth_rate

The primary site exponential growth rate. Default: 0.0051.

Type:

float

max_growth_rate_multiplier

This parameter is used to increase the exponential growth rates in metastatic sites. Metastatic growth rates are at minimum growth_rate but are increased based on the fitness of the cell which initiates seeding, up to a maximum of growth_rate * max_growth_rate_multiplier. Default: 5.

Type:

int, float

capacities

The carrying capacity (max number of cells) of each anatomical site. If an integer is provided, all sites have that same carrying capacity. Also accepts a list of carrying capacities, one for each site. Default: 1e7.

Type:

int, list

epsilon

A per-cell per-generation baseline migration probability to a site. Default: 1e-8.

Type:

float

lifespan_mean

The mean cell lifespan in number of generations drawn from an exponential distribution. Corresponds to 1/r, where r is the rate parameter of the exponential. A value of 1 will fix the lifespan to 1 generation for all cells. Default: 1.

Type:

int, float

ref

Path to an input reference genome in fasta format. Can be used to initialize chromosome sizes, but is required for generating synthetic sequencing reads. Default: None.

Type:

str, optional

alt_ref

Path to an optionally alternate reference genome in fasta format. Allele 0 utilizes ref, while Allele 1 utilizes alt_ref. Use if the goal is to generate allele-specific synthetic sequencing reads. Default: None.

Type:

str, optional

chrom_names

A list of chromosome names to use in the reference sequence(s). Only necessary to specify if the reference genome contains superfluous chromosomes which should not be included in the genome. Default: None.

Type:

list, optional

chrom_lens

A dictionary describing the size of the genome where keys are chromosome names and values are chromosome lengths in number of base pairs. If kept as None, will automatically populate with chr1-chr22 and lengths derived from the hg38 human reference genome. If ref is specified, will automatically populate based on the provided sequences. Additionally, if an integer is passed, then all chromosomes 1-num_chrom (see below) will have the same given length. Default: None.

Type:

dict, int, optional

num_chrom

The number of chromosomes in the genome. Use only if an int is passed to chrom_lens, or if you want to use the first num_chrom human reference chromosomes if chrom_lens is set to None. Otherwise, will update accordingly. Default: 22.

Type:

int

region_len

SISTEM utilizes a simplified genome representation whereby chromosome sequences are partitioned into non-overlapping regions of uniform size region_len (in base pairs). Higher values reduce memory burden, while lower values increase simulation resolution. Default: 5e6.

Type:

int

arm_ratios

The ratio of the small chromosome arm length to the total chrosome length. If chrom_lens is None, will utilize arm ratios derived from the hg38 human reference genome. Can pass a dictionary where keys are chromosome names and values are ratios, or a single ratio used by all chromosomes. Default: 0.5.

Type:

float, dict

focal_driver_rate

The probability of acquiring a driver focal (segmental) CNA at each generation. Default: 1e-4.

Type:

float

focal_pass_rate

The probability of acquiring a passenger focal (segmental) CNA at each generation. Default: = 0.01.

Type:

float

length_mean

The mean number of regions a focal CNA spans. Length is drawn from an exponential distribution. Default = 1.5.

Type:

int, float

focal_gain_rate

The probability that a focal CNA is amplification (gain) versus a deletion (loss) Default = 0.5.

Type:

float

mag_mean

The mean number of additional copies gained during a focal amplification CNA. Amplification magnitude is drawn from a geometric distribution. Default = 1.2.**

Type:

int, float

SNV_driver_rate

The probability of acquiring a driver SNV at each generation. Default: 1e-4

Type:

float

SNV_pass_rate

The probability of acquiring a passenger SNV at each generation. Default: 0.01.

Type:

float

arm_rate

The probability of acquiring a chromosome-arm CNA at each generation. Default: 1e-5.

Type:

float

chromosomal_rate

The probability of acquiring a whole-chromosomal CNA at each generation. Default: 1e-6.

Type:

float

chrom_dup_rate

The probability that a chromosome-arm CNA or a whole-chromosomal CNA is a duplication versus a deletion. Default: 1e-5.

Type:

float

WGD_rate

The probability of acquiring a WGD at each generation. Default: 1e-8.

Type:

float

CN_coeff

The maximum CN selection coefficient magnitude. Used only for random initialization. Default: 0.25.

Type:

float

SNV_coeff

The maximum region SNV selection coefficient magnitude. Used only for random initialization. Default: 0.1.

Type:

float

OG_r

The ratio of regions which are OGs. Used only for random initialization in the Region/Hybrid Selection Model. Default: 0.05.

Type:

float

TSG_r

The ratio of regions which are TSGs. Used only for random initialization in the Region/Hybrid Selection Model.. Default: 0.05.

Type:

float

EG_r

The ratio of regions which are EGs (essential genes). Used only for random initialization in the Hybrid Selection Model. Default: 0.05.

Type:

float

alter_prop

Parameter for creating site-specific metastatic libraries. Represents the fraction of driver selection coefficients to alter in each site if method is ‘random’, or of the farthest site if method is ‘distance’, with the number of altered coefficients scaled accordingly for the rest. Default: 0.3.

Type:

float

max_region_CN

Viability checkpoint parameter. Represents the maximum CN of a driver region. Default: 10.

Type:

int

max_region_SNV

Viability checkpoint parameter. Represents the maximum number of driver SNVs in a single region. Default: 10.

Type:

int

max_ploidy

Viability checkpoint parameter. Represents the maximum ploidy of a cell. Default: 8.

Type:

int, float

min_ploidy

Viability checkpoint parameter. Represents the minimum ploidy of a cell. Default: 1.5.

Type:

int, float

max_distinct_driv_ratio

Viability checkpoint parameter. Represents the maximum number of distinct drivers which can be mutated. Default: 0.8.

Type:

float

t_max

The maximum number of generations to run. Default: 6000.

Type:

int

min_detectable

Terminates the growth simulator when the number of cells present in each anatomical site is atleast min_detectable. Default: 5e-5.

Type:

int

ncells_prim

The number of cells to sample from the primary site. Default: 100.

Type:

int

ncells_meta

The number of cells to sample from the metastatic sites. Default: 100.

Type:

int

ncells_normal

The number of normal cells to diluate the primary site with. If generating clonal lineages, a relative number of normal cells will be sampled from the metastatic sites as well. Passing 1 will add a convenient normal cell outgroup to the simulated lineage tree. Default: 1.

Type:

int

min_mut_fraction

In SISTEM, clones are defined by cells with a unique sequence of driver mutations, but this loose definition means that distinct ‘clones’ appearing in the clonal lineage tree may differ by a just a few small mutations. The min_mut_fraction parameter can help make the clones present in the tree more distinct. It describes the minimum frequency a clone’s genotype must occur in the sampled cells to remain in the tree. If possible, multiple clones in a subtree will merge together with a common genotype to remain above min_mut_fraction, otherwise they are pruned. Default: 0.05.

Type:

float

bin_size

The size in base pairs of the copy number windows/segments in the final output profiles. Essentially groups consecutive regions together into larger bins and computes the mean. If kept as None, the bin size will be set equal to the region_len. This will be desirable for the majority of cases. Only specify if simulating with a region length that is smaller than practical (e.g. if region_len is less than 100kbp for single-cell data). Default: None.

Type:

int, optional

coverage

The average number of reads which cover any given base pair in the genome. When used to generate single-cell read counts or DNA-seq reads, it is recommended to use a low value (<0.2), whereas if used to generate bulk read counts, it is recommended to use a high value (>50). Default: 0.1.

Type:

int, float

read_len

The length of the paired-end reads. Together with coverage, used to compute expected read counts. Default: 150.

Type:

int

seq_error

Per-base pair sequencing error rate. Only used for generating scDNA-seq reads Default: 0.02.

Type:

float

lorenz_y

Used to introduce coverage non-uniformity when generating single-cell read counts and DNA-seq. In a nutshell, coverage uniformity is parameterized by a point on the lorenz curve (0.5, lorenz_y). Default value of 0.5 means maximally uniform, and decreasing lorenz_y down to 0 will decrease uniformity. Only specify if evaluating conditions under non-uniform coverage. Be aware that this parameter is extremely sensitive, and values <=0.4 will lead to highly non-uniform distributions. Default: 0.5.

Type:

float

out_dir

The path to the output directly. Default: ‘out’.

Type:

str

log_path

The path to the log file. By default, will write to out_dir/sim.log. Default: None.

Type:

str, optional

num_processors

Number of processors to use when generating synthetic scDNA-seq reads. Will not speed up the other steps of the simulator. Default: 1.

Type:

int