Data Generators

The core module of synthbiodata organizes synthetic biological data generation around an abstract base class, BaseGenerator, with a specialized subclass for each type of data: MolecularGenerator for molecular descriptors and ADMEGenerator for ADME properties.

Generates synthetic molecular descriptor data including molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, and chemical fingerprints. Creates realistic binding probabilities based on molecular properties.

synthbiodata.core.molecular.MolecularGenerator

Bases: BaseGenerator

Generator for synthetic molecular descriptor data.

The MolecularGenerator class creates synthetic datasets of molecular descriptors and related features for use in cheminformatics, drug discovery, and machine learning applications. It generates realistic distributions of molecular properties such as molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, rotatable bonds, aromatic rings, formal charge, and chemical fingerprints. The generator can also simulate target protein features and compute binding probabilities based on molecular properties.

⚠️ Note that this generator inherits from synthbiodata.core.base.BaseGenerator.

Attributes:

Name Type Description
config MolecularConfig

Configuration object specifying the statistical parameters and options for molecular data generation.

Methods:

Name Description
__init__

Initialize the molecular generator with the provided configuration.

_generate_molecular_descriptors

Generate arrays of molecular descriptor values for the specified number of samples.

_generate_target_features

Generate arrays of target protein features for the specified number of samples.

_generate_chemical_fingerprints

Generate binary chemical fingerprint features.

_calculate_binding_probabilities

Compute binding probabilities based on generated molecular descriptors and target features.

generate_data

Generate a complete synthetic molecular dataset as a polars DataFrame.

Code Example
from synthbiodata.config.schema.v1.molecular import MolecularConfig
from synthbiodata.core.molecular import MolecularGenerator

# Request 100 samples; a fixed random_state makes the output reproducible.
config = MolecularConfig(n_samples=100, random_state=123)
gen = MolecularGenerator(config)

# generate_data() returns a polars DataFrame.
df = gen.generate_data()
print(df.head())

Functions

generate_data() -> pl.DataFrame

Generate synthetic molecular descriptor data.
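
To illustrate how binding probabilities could be derived from molecular descriptors, here is a minimal NumPy sketch. It is not synthbiodata's implementation: the distribution parameters, the scoring formula, and every variable name are assumptions chosen for this example only.

```python
import numpy as np

# Hypothetical sketch; none of these names or numbers come from synthbiodata.
rng = np.random.default_rng(123)
n_samples = 1000

# Draw illustrative descriptor values (all parameters are assumptions).
mol_weight = rng.normal(loc=350.0, scale=75.0, size=n_samples)
logp = rng.normal(loc=2.5, scale=1.0, size=n_samples)
tpsa = rng.normal(loc=80.0, scale=25.0, size=n_samples)

# Penalize distance from assumed "drug-like" optima; the score is always <= 0.
score = (
    -((mol_weight - 350.0) / 100.0) ** 2
    - ((logp - 2.5) / 1.5) ** 2
    - ((tpsa - 80.0) / 40.0) ** 2
)

# Squash the shifted score into (0, 1) with a logistic function.
binding_probability = 1.0 / (1.0 + np.exp(-(score + 1.0)))

# Sample a binary binding label for each molecule from its probability.
binds = rng.random(n_samples) < binding_probability
```

Molecules close to the assumed optima receive the highest probabilities, and the final sampling step turns each probability into a binary label that a classifier could be trained on.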

synthbiodata.core.adme.ADMEGenerator

Bases: BaseGenerator

Generator for synthetic ADME (Absorption, Distribution, Metabolism, Excretion) data. Creates binary bioavailability labels based on realistic pharmaceutical criteria.

The ADMEGenerator class creates synthetic datasets simulating pharmacokinetic properties relevant to drug discovery and pharmaceutical research. It generates realistic distributions for features such as absorption percentage, plasma protein binding, clearance rate, and half-life. The generator can also simulate imbalanced datasets for classification tasks by controlling the proportion of positive (bioavailable) samples.

This generator is useful for benchmarking machine learning models, simulating clinical trial data, and teaching, in settings where realistic ADME data is needed without using sensitive patient information.
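
One way to control the proportion of positive samples, as described above, is to threshold a continuous score at a chosen quantile. The sketch below shows that idea with NumPy alone. It is an assumption about how such labels might be produced, not the library's code: `positive_ratio`, the feature distributions, and the scoring formula are all hypothetical.

```python
import numpy as np

# Hypothetical sketch; names and formulas are not taken from synthbiodata.
rng = np.random.default_rng(123)
n_samples = 1000
positive_ratio = 0.2  # assumed target fraction of bioavailable samples

# Illustrative pharmacokinetic features (distributions are assumptions).
absorption = rng.uniform(0.0, 100.0, size=n_samples)  # percent absorbed
half_life = rng.lognormal(mean=1.5, sigma=0.5, size=n_samples)  # hours

# Combine the features into a single made-up bioavailability score.
score = absorption / 100.0 + 0.1 * np.log(half_life)

# Label the top `positive_ratio` fraction of scores as bioavailable (1).
threshold = np.quantile(score, 1.0 - positive_ratio)
bioavailable = (score > threshold).astype(np.int8)
```

Lowering `positive_ratio` raises the threshold and yields a more imbalanced dataset, which is useful for testing how a classifier behaves when positives are rare.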

Attributes:

Name Type Description
config ADMEConfig

Configuration object specifying statistical parameters and options for ADME data generation.

Methods:

Name Description
__init__

Initialize the ADME generator with the provided configuration.

generate_data

Generate a complete synthetic ADME dataset as a polars DataFrame, including binary bioavailability labels.

Code Example
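from synthbiodata.config.schema.v1.adme import ADMEConfig
from synthbiodata.core.adme import ADMEGenerator

# Request 100 samples; a fixed random_state makes the output reproducible.
config = ADMEConfig(n_samples=100, random_state=123)
gen = ADMEGenerator(config)

# generate_data() returns a polars DataFrame with bioavailability labels.
df = gen.generate_data()
print(df.head())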

Functions

generate_data() -> pl.DataFrame

Generate synthetic ADME data.

synthbiodata.core.base.BaseGenerator

Bases: ABC

Abstract base class for all synthetic data generators.

This class provides common functionality for all data generators, including:

  • Storing the configuration object.
  • Initializing a NumPy random number generator with a fixed random state for reproducibility.
  • Providing a seeded Faker instance for generating fake data.
  • Defining the abstract interface for data generation.
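
The responsibilities listed above can be sketched as a small abstract base class. This is a simplified illustration of the pattern, not the actual BaseGenerator: `ToyConfig`, `ToyBaseGenerator`, and `ToyWeightGenerator` are hypothetical names, the Faker instance is left out to keep the example dependency-light, and the subclass returns a plain dict of arrays where the real generators return a polars DataFrame.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a configuration object."""

    n_samples: int = 5
    random_state: int = 42


class ToyBaseGenerator(ABC):
    """Simplified illustration of the base-class pattern."""

    def __init__(self, config: ToyConfig) -> None:
        # Store the configuration and seed the RNG for reproducibility.
        self.config = config
        self.rng = np.random.default_rng(config.random_state)

    @abstractmethod
    def generate_data(self) -> dict[str, np.ndarray]:
        """Subclasses must implement the actual data generation."""


class ToyWeightGenerator(ToyBaseGenerator):
    """Concrete subclass that draws made-up molecular weights."""

    def generate_data(self) -> dict[str, np.ndarray]:
        weights = self.rng.normal(350.0, 75.0, size=self.config.n_samples)
        return {"weight": weights}


# Two generators built from equal configs produce identical data.
first = ToyWeightGenerator(ToyConfig()).generate_data()
second = ToyWeightGenerator(ToyConfig()).generate_data()
```

Because the random number generator is seeded from the configuration, two generators with the same random state produce the same values. The abstract method also prevents the base class from being instantiated directly.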

Attributes:

Name Type Description
config BaseConfig

Configuration object containing generator parameters and random state.

rng Generator

NumPy random number generator initialized with the provided random state.

fake Faker

Faker instance seeded for reproducible fake data generation.

Methods:

Name Description
generate_data

Abstract method to generate synthetic data. Must be implemented by subclasses.

Functions

generate_data() -> pl.DataFrame (abstract method)

Generate synthetic data.