Configuration Guide¶
SynthBioData provides a flexible configuration system that allows you to customize data generation parameters for different types of synthetic biological data. This doc page covers the available configuration options and how to use them effectively.
Overview¶
The configuration system is built on Pydantic models with full type validation and automatic parameter checking. All configurations inherit from BaseConfig
and extend it with data-type-specific parameters.
General Configuration¶
All data types share common configuration parameters defined in BaseConfig
:
Basic Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
n_samples |
int | 10000 | Number of samples to generate |
positive_ratio |
float | 0.03 | Ratio of positive samples in the dataset (between 0 and 1) |
test_size |
float | 0.2 | Fraction of data to use for testing (between 0 and 1) |
val_size |
float | 0.2 | Fraction of data to use for validation (between 0 and 1) |
random_state |
int | 42 | Random seed for reproducible data generation |
imbalanced |
bool | False | Whether to generate an imbalanced dataset |
Example¶
Click to expand example code: Base configuration
How to use the basic parameters to create different types of datasets:from synthbiodata import create_config, generate_sample_data
# Same parameters work for both data types
# Molecular descriptors configuration
config_molecular = create_config(
data_type="molecular-descriptors",
n_samples=1000, # Dataset with 1000 samples
positive_ratio=0.5, # 50% positive samples (balanced)
test_size=0.2, # 20% for testing
val_size=0.2, # 20% for validation
random_state=123 # Set seed for reproducibility
)
# ADME configuration with identical basic parameters
config_adme = create_config(
data_type="adme",
n_samples=1000, # Same sample size
positive_ratio=0.5, # Same positive ratio
test_size=0.2, # Same test split
val_size=0.2, # Same validation split
random_state=123 # Same random seed
)
# Generate data with each configuration
df_molecular = generate_sample_data(config=config_molecular)
df_adme = generate_sample_data(config=config_adme)
# Check the dataset - same parameters produce consistent splits
print(f"Molecular dataset: {len(df_molecular)} samples, {df_molecular['binds_target'].mean():.1%} positive")
print(f"ADME dataset: {len(df_adme)} samples, {df_adme['bioavailable'].mean():.1%} positive")
Validation¶
The configuration system automatically validates all parameters to ensure they are within realistic biological ranges and mathematically valid. For detailed validation rules and error handling, see the Validation Rules page.
Molecular Descriptors Configuration¶
The MolecularConfig
class extends BaseConfig
with parameters specific to molecular descriptor generation.
Molecular Weight Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
mw_mean |
float | 350.0 | Mean molecular weight in Daltons |
mw_std |
float | 100.0 | Standard deviation of molecular weight |
mw_min |
float | 150.0 | Minimum molecular weight |
mw_max |
float | 600.0 | Maximum molecular weight |
LogP Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
logp_mean |
float | 2.5 | Mean LogP (octanol-water partition coefficient) |
logp_std |
float | 1.5 | Standard deviation of LogP |
logp_min |
float | -2.0 | Minimum LogP value |
logp_max |
float | 6.0 | Maximum LogP value |
TPSA Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
tpsa_mean |
float | 80.0 | Mean topological polar surface area |
tpsa_std |
float | 40.0 | Standard deviation of TPSA |
tpsa_min |
float | 0.0 | Minimum TPSA value |
tpsa_max |
float | 200.0 | Maximum TPSA value |
Target Family Parameters¶
Parameter | Type | Default Value | Description |
---|---|---|---|
target_families |
list[str] | ['GPCR', 'Kinase', 'Protease', 'Nuclear Receptor', 'Ion Channel'] | List of target protein families to sample from |
target_family_probs |
list[float] | [0.3, 0.25, 0.2, 0.15, 0.1] | Probability distribution for selecting each target family (must sum to 1.0) |
Example¶
Click to expand example code: Molecular descriptors configuration
from synthbiodata import create_config, generate_sample_data
# Create a custom molecular descriptors configuration
config = create_config(
data_type="molecular-descriptors",
n_samples=2000, # Generate 2000 samples
positive_ratio=0.15, # 15% positive samples
mw_mean=400.0, # Mean molecular weight
mw_std=80.0, # Standard deviation
mw_min=200.0, # Minimum weight
mw_max=500.0, # Maximum weight
logp_mean=3.0, # Mean LogP
logp_std=1.2, # LogP standard deviation
logp_min=0.0, # Minimum LogP
logp_max=5.0, # Maximum LogP
tpsa_mean=90.0, # Mean TPSA
tpsa_std=35.0, # TPSA standard deviation
target_families=['GPCR', 'Kinase'], # Target families
target_family_probs=[0.6, 0.4], # Family probabilities
random_state=123 # Reproducible results
)
# Generate the data
df = generate_sample_data(config=config)
# Inspect the results
print(f"Generated {len(df)} samples")
print(f"Positive ratio: {df['binds_target'].mean():.1%}")
print(f"Target families: {df['target_family'].value_counts().to_dict()}")
print(f"Molecular weight range: {df['molecular_weight'].min():.1f} - {df['molecular_weight'].max():.1f} Da")
ADME Data Configuration¶
The ADMEConfig
class extends BaseConfig
with parameters for generating ADME (Absorption, Distribution, Metabolism, Excretion) data.
Absorption Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
absorption_mean |
float | 70.0 | Mean absorption percentage (0-100) |
absorption_std |
float | 20.0 | Standard deviation of absorption |
plasma_protein_binding_mean |
float | 85.0 | Mean plasma protein binding percentage (0-100) |
plasma_protein_binding_std |
float | 15.0 | Standard deviation of plasma protein binding |
Metabolism Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
clearance_mean |
float | 5.0 | Mean clearance rate in L/h |
clearance_std |
float | 2.0 | Standard deviation of clearance |
half_life_mean |
float | 12.0 | Mean half-life in hours |
half_life_std |
float | 6.0 | Standard deviation of half-life |
Excretion Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
renal_clearance_ratio |
float | 0.3 | Ratio of renal to total clearance (0-1) |
Example: Custom ADME Configuration¶
from synthbiodata.config.schema.v1.adme import ADMEConfig
# Create a configuration for high-absorption compounds
config = ADMEConfig(
n_samples=3000,
positive_ratio=0.05,
absorption_mean=90.0,
absorption_std=10.0,
plasma_protein_binding_mean=95.0,
plasma_protein_binding_std=5.0,
clearance_mean=3.0,
clearance_std=1.0,
half_life_mean=18.0,
half_life_std=8.0,
renal_clearance_ratio=0.2,
random_state=456
)
Using Configurations¶
Factory Function¶
The easiest way to create configurations is using the factory function:
from synthbiodata import create_config
# Create molecular configuration with custom parameters
config = create_config(
data_type="molecular-descriptors",
n_samples=2000,
positive_ratio=0.15,
mw_mean=300.0,
random_state=789
)
# Create ADME configuration
config = create_config(
data_type="adme",
n_samples=1500,
absorption_mean=80.0,
random_state=789
)
Direct Class Instantiation¶
For full control, instantiate configuration classes directly:
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.config.schema.v1.adme import ADMEConfig
# Molecular configuration
mol_config = MolecularConfig(
n_samples=1000,
mw_mean=400.0,
target_families=['GPCR', 'Kinase', 'Protease'],
target_family_probs=[0.5, 0.3, 0.2]
)
# ADME configuration
adme_config = ADMEConfig(
n_samples=1000,
absorption_mean=75.0,
half_life_mean=15.0
)