Getting Started with SynthBioData

Welcome to SynthBioData! This guide will help you get up and running with synthetic biological data generation for drug discovery and machine learning applications.

Important Notice

This package generates synthetic data for testing and educational purposes only.

The data produced does not represent real biological or chemical measurements and should not be used for clinical, regulatory, or production applications.

What is SynthBioData?

SynthBioData is a Python package that generates realistic synthetic drug discovery data, including:

Molecular Descriptors: Molecular weight, LogP, TPSA, hydrogen bond donors/acceptors, and more
ADME Data: Absorption, Distribution, Metabolism, and Excretion properties
Target Families: Support for GPCR, Kinase, Protease, and other protein families
Chemical Fingerprints: Binary chemical fingerprints as features

Quick Installation

Install SynthBioData using your preferred package manager:

uv (Recommended):

uv pip install synthbiodata

pip:

pip install synthbiodata

conda:

conda install -c conda-forge synthbiodata

How to use `synthbiodata`

1. Basic Usage

Start with the simplest approach by generating data with the default settings:

from synthbiodata import generate_sample_data

# Generate molecular descriptor data
df = generate_sample_data(data_type="molecular-descriptors")
print(f"Generated {len(df)} samples with {len(df.columns)} features")

# Generate ADME data
df_adme = generate_sample_data(data_type="adme")
print(f"Generated {len(df_adme)} samples with {len(df_adme.columns)} features")

2. Custom Configuration

For more control over data generation, provide a custom configuration. In the example below, custom_config sets n_samples, positive_ratio, and imbalanced to generate an imbalanced dataset with 1000 samples:

from synthbiodata import create_config, generate_sample_data

# Create a custom configuration
custom_config = create_config(
    data_type="molecular-descriptors",
    n_samples=1000,
    positive_ratio=0.1,
    imbalanced=True,
)

# Generate data with your custom configuration
df = generate_sample_data(config=custom_config)
print(f"Total samples: {len(df)}")
print(f"Features: {len(df.columns) - 1}")  # Exclude target column
print(f"Positive ratio: {df['binds_target'].mean():.1%}")
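To make the meaning of positive_ratio concrete, the following is a minimal standalone sketch of building a label column with a fixed share of positives. It uses only the Python standard library and is not how SynthBioData generates labels internally; make_labels is a hypothetical helper introduced for illustration.

```python
import random


def make_labels(n_samples: int, positive_ratio: float, seed: int = 0) -> list[int]:
    """Return a shuffled list of 0/1 labels with a fixed share of positives.

    Hypothetical helper for illustration; not part of synthbiodata.
    """
    n_positive = round(n_samples * positive_ratio)
    labels = [1] * n_positive + [0] * (n_samples - n_positive)
    random.Random(seed).shuffle(labels)  # shuffle so positives are not clustered
    return labels


labels = make_labels(n_samples=1000, positive_ratio=0.1)
print(f"Positive ratio: {sum(labels) / len(labels):.1%}")
```

With a ratio of 0.1, only one sample in ten is positive, which is the kind of class imbalance a model trained on such data has to cope with.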

3. Reproducible Datasets

Ensure reproducible data generation by setting the random_state parameter:

from synthbiodata import generate_sample_data

# Generate reproducible data
df1 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321,
)

# Same seed = identical results
df2 = generate_sample_data(
    data_type="molecular-descriptors",
    random_state=321,
)

# equals() compares the two frames as a whole and returns a single bool
assert df1.equals(df2)
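The principle behind seeded generation can be seen in isolation with the standard library. The sketch below is an assumption-laden illustration, not SynthBioData's implementation: sample_weights is a hypothetical helper, and the mean and spread it uses are arbitrary values chosen for the example.

```python
import random


def sample_weights(n: int, seed: int) -> list[float]:
    """Draw n illustrative molecular-weight values from a seeded generator.

    Hypothetical helper; the mean (350) and spread (75) are arbitrary.
    """
    rng = random.Random(seed)  # private generator, global random state is untouched
    return [rng.gauss(350.0, 75.0) for _ in range(n)]


same_a = sample_weights(5, seed=321)
same_b = sample_weights(5, seed=321)
different = sample_weights(5, seed=322)

print(same_a == same_b)     # same seed, same values
print(same_a == different)  # different seed, different values
```

Using a private generator per call, rather than seeding the global one, keeps the result independent of any other random draws made elsewhere in the program.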

Data Types Overview

Molecular Descriptors

Generate synthetic molecular data with features like:

Physical Properties: Molecular weight, LogP, TPSA
Structural Features: Hydrogen bond donors/acceptors, rotatable bonds
Chemical Properties: Aromatic rings, chemical fingerprints
Target Information: Protein families (GPCR, Kinase, Protease, etc.)

ADME Data

Generate ADME (Absorption, Distribution, Metabolism, Excretion) data with:

Absorption: Bioavailability percentages and absorption rates
Distribution: Plasma protein binding and volume of distribution
Metabolism: Clearance rates and half-life predictions
Excretion: Renal clearance and elimination parameters
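Clearance, volume of distribution, and half-life are linked in the standard one-compartment pharmacokinetic model by t1/2 = ln(2) * Vd / CL. The sketch below computes that relationship for illustrative values. Whether the columns SynthBioData generates obey it is not stated here, and half_life_hours is a hypothetical helper, not part of the package.

```python
import math


def half_life_hours(volume_of_distribution_l: float, clearance_l_per_h: float) -> float:
    """Elimination half-life for a one-compartment model with first-order elimination.

    Hypothetical helper for illustration; not part of synthbiodata.
    """
    if clearance_l_per_h <= 0:
        raise ValueError("clearance must be positive")
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


# Illustrative values only: Vd = 42 L, CL = 7 L/h
t_half = half_life_hours(42.0, 7.0)
print(f"Half-life: {t_half:.2f} h")
```

A check like this can be useful when deciding whether a synthetic ADME table is internally consistent enough for a given test.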

Next Steps

Now that you have the basics, explore these detailed guides:

Quick Start Tutorial: Step-by-step tutorial with practical examples
Configuration Guide: Learn about all configuration options and customization
User Guide: Comprehensive usage examples and advanced features
API Reference: Complete API documentation and class references

Key Features

Realistic Data: Generate data that mimics real-world molecular properties and distributions
Multiple Target Types: Support for various protein families and target types
Configurable Parameters: Customize data generation to match your specific needs
High Performance: Built on Polars for fast data manipulation and processing
Type Safe: Full type hints and Pydantic validation for robust configuration
Reproducible: Deterministic data generation with random seed support

Common Use Cases

Machine Learning Research: Generate training data for drug discovery ML models
Algorithm Testing: Test and validate ML algorithms with controlled synthetic data
Educational Purposes: Learn about molecular properties and drug discovery concepts
Benchmarking: Create standardized datasets for comparing different approaches
Prototype Development: Quickly generate data for proof-of-concept applications

Need Help?

📖 Check out the User Guide for detailed examples
🔧 Visit the API Reference for complete documentation
🐛 Report issues on GitHub
💬 Join the conversation in GitHub Discussions

Ready to dive deeper? Start with the Quick Start Tutorial for a hands-on walkthrough!