Data Generators
The core module of synthbiodata provides a clean, modular architecture for generating synthetic biological data. The module is organized around a base class with specialized implementations for different types of biological data.
Generates synthetic molecular descriptor data including molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, and chemical fingerprints. Creates realistic binding probabilities based on molecular properties.
synthbiodata.core.molecular.MolecularGenerator
Bases: BaseGenerator
Generator for synthetic molecular descriptor data.
The MolecularGenerator class creates synthetic datasets of molecular descriptors and related features for use in cheminformatics, drug discovery, and machine learning applications. It generates realistic distributions of molecular properties such as molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, rotatable bonds, aromatic rings, formal charge, and chemical fingerprints. The generator can also simulate target protein features and compute binding probabilities based on molecular properties.
⚠️ Note that this generator inherits from synthbiodata.core.base.BaseGenerator.
Attributes:
Name | Type | Description |
---|---|---|
config | MolecularConfig | Configuration object specifying the statistical parameters and options for molecular data generation. |
Methods:
Name | Description |
---|---|
__init__ | Initialize the molecular generator with the provided configuration. |
_generate_molecular_descriptors | Generate arrays of molecular descriptor values for the specified number of samples. |
_generate_target_features | Generate arrays of target protein features for the specified number of samples. |
_generate_chemical_fingerprints | Generate binary chemical fingerprint features. |
_calculate_binding_probabilities | Compute binding probabilities based on generated molecular descriptors and target features. |
generate_data | Generate a complete synthetic molecular dataset as a polars DataFrame. |
Code Example
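The following is a minimal, dependency-free sketch of the idea behind this generator: draw descriptor values, build a binary fingerprint, and map the descriptors to a binding probability. It does not call synthbiodata. Every function name, distribution parameter, and threshold here is a hypothetical assumption chosen for illustration, and the standard-library random.Random stands in for the NumPy generator that the library uses.

```python
import math
import random


def sketch_descriptors(rng: random.Random) -> dict:
    """Draw one molecule's descriptors; every distribution parameter is assumed."""
    return {
        "molecular_weight": max(50.0, rng.gauss(350.0, 100.0)),
        "logp": rng.gauss(2.5, 1.5),
        "tpsa": max(0.0, rng.gauss(80.0, 30.0)),
        "hbd": rng.randint(0, 8),   # hydrogen bond donors
        "hba": rng.randint(0, 14),  # hydrogen bond acceptors
    }


def sketch_fingerprint(rng: random.Random, n_bits: int = 8) -> list:
    """Binary fingerprint: each bit is set independently with probability 0.3."""
    return [1 if rng.random() < 0.3 else 0 for _ in range(n_bits)]


def sketch_binding_probability(d: dict) -> float:
    """Hypothetical logistic score that rewards drug-like property ranges."""
    score = 0.0
    score += 1.0 if d["molecular_weight"] <= 500 else -1.0
    score += 1.0 if d["logp"] <= 5 else -1.0
    score += 0.5 if d["hbd"] <= 5 else -0.5
    score += 0.5 if d["hba"] <= 10 else -0.5
    return 1.0 / (1.0 + math.exp(-score))


def sketch_dataset(n_samples: int, seed: int) -> list:
    """Build rows of descriptors, fingerprint bits, and a binding probability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        row = sketch_descriptors(rng)
        row["binding_probability"] = sketch_binding_probability(row)
        row["fingerprint"] = sketch_fingerprint(rng)
        rows.append(row)
    return rows


rows = sketch_dataset(n_samples=5, seed=42)
```

The library's generate_data returns a polars DataFrame; the list of dictionaries here only keeps the sketch free of third-party dependencies. The actual formula used by _calculate_binding_probabilities may differ from this logistic score.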
synthbiodata.core.adme.ADMEGenerator
Bases: BaseGenerator
Generator for synthetic ADME (Absorption, Distribution, Metabolism, Excretion) data. Creates binary bioavailability labels based on realistic pharmaceutical criteria.
The ADMEGenerator class creates synthetic datasets simulating pharmacokinetic properties relevant to drug discovery and pharmaceutical research. It generates realistic distributions for features such as absorption percentage, plasma protein binding, clearance rate, and half-life. The generator can also simulate imbalanced datasets for classification tasks by controlling the proportion of positive (bioavailable) samples.
This generator is useful for benchmarking machine learning models, simulating clinical trial data, and educational purposes where realistic ADME data is required without using sensitive patient information.
Attributes:
Name | Type | Description |
---|---|---|
config | ADMEConfig | Configuration object specifying statistical parameters and options for ADME data generation. |
Methods:
Name | Description |
---|---|
__init__ | Initialize the ADME generator with the provided configuration. |
generate_data | Generate a complete synthetic ADME dataset as a polars DataFrame, including binary bioavailability labels. |
Code Example
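The sketch below illustrates one way to produce an imbalanced ADME dataset with a fixed share of positive labels. It does not call synthbiodata. The function name, column names, distribution parameters, and scoring rule are hypothetical assumptions, and the library's own labelling criteria may differ.

```python
import random


def sketch_adme_rows(n_samples: int, positive_ratio: float, seed: int) -> list:
    """Draw ADME-like features and label the top-scoring share as bioavailable."""
    rng = random.Random(seed)  # stand-in for the library's NumPy generator
    rows = []
    for _ in range(n_samples):
        rows.append({
            # All means and spreads below are assumed values for illustration.
            "absorption_pct": min(100.0, max(0.0, rng.gauss(70.0, 20.0))),
            "plasma_protein_binding_pct": min(100.0, max(0.0, rng.gauss(85.0, 15.0))),
            "clearance_rate": max(0.1, rng.gauss(5.0, 2.0)),
            "half_life": max(0.1, rng.gauss(12.0, 6.0)),
        })

    # Hypothetical criterion: high absorption and low clearance look favourable.
    def score(row: dict) -> float:
        return row["absorption_pct"] - 5.0 * row["clearance_rate"]

    # Label exactly round(n_samples * positive_ratio) rows as positive.
    n_positive = round(n_samples * positive_ratio)
    ranked = sorted(range(n_samples), key=lambda i: score(rows[i]), reverse=True)
    positives = set(ranked[:n_positive])
    for i, row in enumerate(rows):
        row["bioavailable"] = 1 if i in positives else 0
    return rows


rows = sketch_adme_rows(n_samples=200, positive_ratio=0.1, seed=0)
n_pos = sum(r["bioavailable"] for r in rows)
```

Ranking by score and labelling a fixed number of rows gives exact control over class balance, which is convenient when benchmarking classifiers on imbalanced data. A probabilistic labelling rule would be an equally plausible design.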
synthbiodata.core.base.BaseGenerator
Bases: ABC
Abstract base class for all synthetic data generators.
This class provides common functionality for all data generators, including:
- Storing the configuration object.
- Initializing a NumPy random number generator with a fixed random state for reproducibility.
- Providing a seeded Faker instance for generating fake data.
- Defining the abstract interface for data generation.
Attributes:
Name | Type | Description |
---|---|---|
config | BaseConfig | Configuration object containing generator parameters and random state. |
rng | Generator | NumPy random number generator initialized with the provided random state. |
fake | Faker | Faker instance seeded for reproducible fake data generation. |
Methods:
Name | Description |
---|---|
generate_data | Abstract method to generate synthetic data. Must be implemented by subclasses. |
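The pattern described for BaseGenerator (store the configuration, seed a random number generator, and declare an abstract generate_data) can be sketched as follows. This is not the library's source. The class names and config fields are hypothetical, random.Random stands in for the NumPy Generator, and the seeded Faker instance is omitted to keep the example dependency-free.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for BaseConfig; the field names are assumptions."""
    n_samples: int = 5
    random_state: int = 42


class SketchBaseGenerator(ABC):
    """Stores the config, seeds an RNG, and declares the generation interface."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config
        # Seeding from the config makes repeated runs reproducible.
        self.rng = random.Random(config.random_state)

    @abstractmethod
    def generate_data(self) -> list:
        """Subclasses must return the generated records."""


class UniformGenerator(SketchBaseGenerator):
    """Hypothetical subclass that draws one uniform value per sample."""

    def generate_data(self) -> list:
        return [{"value": self.rng.random()} for _ in range(self.config.n_samples)]


# Two generators built from equal configs produce identical data.
first = UniformGenerator(SketchConfig()).generate_data()
second = UniformGenerator(SketchConfig()).generate_data()
```

Because generate_data is abstract, instantiating the base class directly raises TypeError, so every concrete generator has to supply its own implementation.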