Examples

Best Practices

Test Configurations

Test configurations with small datasets first:

# Test with minimal configuration
test_config = create_config(
    data_type="molecular-descriptors",
    n_samples=10,  # Very small sample size, for quick testing
    random_state=42
)

# Generate and inspect data
df = generate_sample_data(config=test_config)
print(f"Generated {len(df)} samples successfully")

Reproducibility

Always set random_state for reproducible results:

config = create_config(
    data_type="molecular-descriptors",
    random_state=123  # Ensures reproducible data
)
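The effect of a fixed seed can be shown without the library itself. The sketch below uses Python's standard `random` module as a stand-in generator. `fake_descriptors` is a hypothetical helper, not part of this package, and the sketch only assumes that `random_state` seeds the underlying generator in a comparable way.

```python
import random

def fake_descriptors(n_samples, random_state=None):
    """Hypothetical stand-in generator: n_samples pseudo-random molecular weights in Da."""
    rng = random.Random(random_state)  # private generator, seeded once
    return [round(rng.uniform(150.0, 600.0), 2) for _ in range(n_samples)]

first = fake_descriptors(5, random_state=123)
second = fake_descriptors(5, random_state=123)
other = fake_descriptors(5, random_state=124)

print(first == second)  # same seed: identical values
print(first == other)   # different seed: values are expected to differ
```

Recording the seed alongside any results makes it possible to regenerate exactly the same dataset later.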

Validate Before Use

Create the configuration inside a try block so that invalid parameters are caught before any expensive operation runs:

# RangeError, DistributionError, and ValidationError must be imported from the library first
try:
    config = create_config(
        data_type="adme",
        absorption_mean=75.0,
        clearance_mean=5.0
    )
    print("Configuration is valid")
except (RangeError, DistributionError, ValidationError) as e:
    print(f"Configuration error: {e}")

Use Realistic Ranges

Choose parameters within realistic physicochemical and pharmacokinetic ranges:

  • Molecular Weight: 150-600 Da for drug-like molecules
  • LogP: -2 to 6 for good drug-like properties
  • TPSA: 0-200 Ų for membrane permeability
  • Absorption: 20-100% for oral bioavailability
  • Protein Binding: 50-99% for plasma distribution
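The ranges above can be checked programmatically before a configuration is built. The following is a minimal sketch in plain Python. The dictionary keys and the `out_of_range` helper are hypothetical names chosen for illustration and are not part of the library's API.

```python
# Ranges copied from the list above; the keys are hypothetical parameter names.
REALISTIC_RANGES = {
    "molecular_weight": (150.0, 600.0),  # Da
    "logp": (-2.0, 6.0),
    "tpsa": (0.0, 200.0),                # square angstroms
    "absorption": (20.0, 100.0),         # percent
    "protein_binding": (50.0, 99.0),     # percent
}

def out_of_range(params):
    """Return {name: (value, low, high)} for each known parameter outside its range."""
    problems = {}
    for name, value in params.items():
        if name not in REALISTIC_RANGES:
            continue  # parameters without a listed range are not checked
        low, high = REALISTIC_RANGES[name]
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems

print(out_of_range({"molecular_weight": 350.0, "logp": 7.5}))
# {'logp': (7.5, -2.0, 6.0)}
```

An empty result means every checked value falls inside its listed range. The bounds are inclusive here, which is a design choice of this sketch.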

Performance Considerations

  • Larger n_samples values require more memory and computation time
  • Complex target family distributions may slow down generation
  • Consider using imbalanced=True for more realistic class distributions
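To get a rough sense of how memory grows with n_samples, the sketch below computes a lower bound for a dense table of 64-bit floats (8 bytes per value). The column count of 20 is an assumption made purely for illustration, and `estimate_table_bytes` is a hypothetical helper. The real footprint depends on the columns and data types the library actually produces.

```python
def estimate_table_bytes(n_samples, n_columns, bytes_per_value=8):
    """Lower-bound size in bytes of a dense numeric table (8 bytes per float64 value)."""
    return n_samples * n_columns * bytes_per_value

# n_columns=20 is an illustrative assumption, not a property of the library.
for n in (10, 10_000, 1_000_000):
    mib = estimate_table_bytes(n, n_columns=20) / 2**20
    print(f"n_samples={n:>9,}: about {mib:.2f} MiB")
```

The estimate scales linearly with n_samples, so increasing the sample count tenfold increases the estimate tenfold.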