Data Generators

The core module of synthbiodata provides a clean, modular architecture for generating synthetic biological data. The module is organized around a base class with specialized implementations for different types of biological data.

The molecular generator produces synthetic molecular descriptor data, including molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, and chemical fingerprints, and derives binding probabilities from those molecular properties.

`synthbiodata.core.molecular.MolecularGenerator`

Bases: BaseGenerator

Generator for synthetic molecular descriptor data.

The MolecularGenerator class creates synthetic datasets of molecular descriptors and related features for use in cheminformatics, drug discovery, and machine learning applications. It generates realistic distributions of molecular properties such as molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, rotatable bonds, aromatic rings, formal charge, and chemical fingerprints. The generator can also simulate target protein features and compute binding probabilities based on molecular properties.
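As a rough illustration of how descriptors, fingerprints, and binding probabilities could fit together, the sketch below samples a few descriptors with NumPy and maps them to a probability with a logistic function. The distribution parameters, the scoring rule, and all variable names are assumptions made for this example; they are not taken from the synthbiodata source.

```python
import numpy as np

rng = np.random.default_rng(123)  # fixed seed so the sketch is reproducible
n_samples = 1000

# Hypothetical descriptor distributions; the parameters are illustrative only.
mol_weight = rng.normal(loc=350.0, scale=75.0, size=n_samples)
logp = rng.normal(loc=2.5, scale=1.2, size=n_samples)
tpsa = np.clip(rng.normal(loc=80.0, scale=25.0, size=n_samples), 0.0, None)

# Binary fingerprint: each of 8 bits is set independently with probability 0.3.
fingerprints = (rng.random((n_samples, 8)) < 0.3).astype(np.int8)

# Assumed scoring rule: penalize distance from the centre of each descriptor
# distribution, then squash the score into (0, 1) with a logistic function.
score = (
    1.0
    - np.abs(mol_weight - 350.0) / 100.0
    - np.abs(logp - 2.5) / 2.0
    - np.abs(tpsa - 80.0) / 50.0
)
binding_probability = 1.0 / (1.0 + np.exp(-score))

# Draw a binary binding label from each per-sample probability.
binds = rng.random(n_samples) < binding_probability
```

Under these assumptions, molecules whose descriptors sit near the centre values receive the highest binding probability, so labels correlate with the features instead of being pure noise.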

⚠️ Note that this generator inherits from `synthbiodata.core.base.BaseGenerator`.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `config` | `MolecularConfig` | Configuration object specifying the statistical parameters and options for molecular data generation. |

Methods:

| Name | Description |
| --- | --- |
| `__init__` | Initialize the molecular generator with the provided configuration. |
| `_generate_molecular_descriptors` | Generate arrays of molecular descriptor values for the specified number of samples. |
| `_generate_target_features` | Generate arrays of target protein features for the specified number of samples. |
| `_generate_chemical_fingerprints` | Generate binary chemical fingerprint features. |
| `_calculate_binding_probabilities` | Compute binding probabilities based on generated molecular descriptors and target features. |
| `generate_data` | Generate a complete synthetic molecular dataset as a polars DataFrame. |

Code Example:

from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.core.molecular import MolecularGenerator

config = MolecularConfig(n_samples=100, random_state=123)
gen = MolecularGenerator(config)
df = gen.generate_data()
print(df.head())

Functions

`generate_data() -> pl.DataFrame`

Generate synthetic molecular descriptor data.

`synthbiodata.core.adme.ADMEGenerator`

Bases: BaseGenerator

Generator for synthetic ADME (Absorption, Distribution, Metabolism, Excretion) data. Creates binary bioavailability labels based on realistic pharmaceutical criteria.

The ADMEGenerator class creates synthetic datasets simulating pharmacokinetic properties relevant to drug discovery and pharmaceutical research. It generates realistic distributions for features such as absorption percentage, plasma protein binding, clearance rate, and half-life. The generator can also simulate imbalanced datasets for classification tasks by controlling the proportion of positive (bioavailable) samples.
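One simple way to control class imbalance is to fix the number of positive labels up front and shuffle them. The sketch below shows that idea with NumPy. The function name `sample_imbalanced_labels` and the `positive_ratio` parameter are hypothetical, and the library may derive its bioavailability labels differently (for example from thresholds on the generated features).

```python
import numpy as np


def sample_imbalanced_labels(n_samples: int, positive_ratio: float, seed: int) -> np.ndarray:
    """Return a shuffled 0/1 label array with a fixed share of positives.

    Hypothetical helper for illustration; not part of synthbiodata.
    """
    rng = np.random.default_rng(seed)
    n_positive = int(round(n_samples * positive_ratio))
    labels = np.zeros(n_samples, dtype=np.int8)
    labels[:n_positive] = 1   # mark the first n_positive entries as bioavailable
    rng.shuffle(labels)       # spread positives randomly, in place
    return labels


labels = sample_imbalanced_labels(n_samples=100, positive_ratio=0.2, seed=123)
print(labels.sum())  # 20 positives out of 100 samples
```

Fixing the count (instead of drawing each label independently) guarantees the exact requested proportion, which can make small benchmark datasets easier to compare.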

This generator is useful for benchmarking machine learning models, simulating clinical trial data, and educational purposes where realistic ADME data is required without using sensitive patient information.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `config` | `ADMEConfig` | Configuration object specifying statistical parameters and options for ADME data generation. |

Methods:

| Name | Description |
| --- | --- |
| `__init__` | Initialize the ADME generator with the provided configuration. |
| `generate_data` | Generate a complete synthetic ADME dataset as a polars DataFrame, including binary bioavailability labels. |

Code Example:

from synthbiodata.config.schema.v1.adme import ADMEConfig
from synthbiodata.core.adme import ADMEGenerator

config = ADMEConfig(n_samples=100, random_state=123)
gen = ADMEGenerator(config)
df = gen.generate_data()
print(df.head())

Functions

`generate_data() -> pl.DataFrame`

Generate synthetic ADME data.

`synthbiodata.core.base.BaseGenerator`

Bases: ABC

Abstract base class for all synthetic data generators.

This class provides common functionality for all data generators, including:

- Storing the configuration object.
- Initializing a NumPy random number generator with a fixed random state for reproducibility.
- Providing a seeded Faker instance for generating fake data.
- Defining the abstract interface for data generation.
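The pattern described above can be sketched with the standard `abc` module and NumPy. This is a simplified stand-in, not the library's implementation: `SketchConfig`, `SketchBaseGenerator`, and `UniformGenerator` are hypothetical names, the Faker instance is omitted, and the method returns a plain dict of arrays where the real generators return a polars DataFrame.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass
class SketchConfig:
    """Hypothetical stand-in for BaseConfig."""
    n_samples: int = 5
    random_state: int = 42


class SketchBaseGenerator(ABC):
    """Stores the config and a seeded RNG; subclasses supply generate_data."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config
        # Seeding from the config makes every run with the same config identical.
        self.rng = np.random.default_rng(config.random_state)

    @abstractmethod
    def generate_data(self) -> dict:
        """Return the generated columns (a dict here, for simplicity)."""


class UniformGenerator(SketchBaseGenerator):
    """Toy subclass that draws one column of uniform values."""

    def generate_data(self) -> dict:
        return {"value": self.rng.uniform(0.0, 1.0, self.config.n_samples)}


first = UniformGenerator(SketchConfig()).generate_data()
second = UniformGenerator(SketchConfig()).generate_data()
print(np.array_equal(first["value"], second["value"]))  # True: same seed, same data
```

Because `generate_data` is declared with `@abstractmethod`, instantiating `SketchBaseGenerator` directly raises `TypeError`, which forces every concrete generator to implement it.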

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `config` | `BaseConfig` | Configuration object containing generator parameters and random state. |
| `rng` | `Generator` | NumPy random number generator initialized with the provided random state. |
| `fake` | `Faker` | Faker instance seeded for reproducible fake data generation. |

Methods:

| Name | Description |
| --- | --- |
| `generate_data` | Abstract method to generate synthetic data. Must be implemented by subclasses. |

Functions

`generate_data() -> pl.DataFrame` `abstractmethod`

Generate synthetic data.
