DeepSTARR Local Analysis Tutorial
This tutorial demonstrates how to use SEAM to analyze local regulatory mechanisms in the DeepSTARR model, reproducing Figure 2 from our paper.
Note
Expected runtime: about 3.2 minutes on a Google Colab A100 GPU.
Setup
First, let’s import the required packages:
import time
import random
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import squid
import seam
from seam import MetaExplainer, Compiler, Attributer, Clusterer
from seam import suppress_warnings, get_device
# Optional: suppress warnings
suppress_warnings()
# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
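As a quick sanity check that seeding behaves as expected, the sketch below reseeds Python's and NumPy's global generators and confirms that the draws repeat. The helper name `set_seeds` is hypothetical and not part of SEAM; TensorFlow is left out so the snippet runs without a GPU stack.

```python
import random
import numpy as np

def set_seeds(seed=42):
    """Seed Python's and NumPy's global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)

# Draw once, reseed, and draw again
set_seeds(42)
first = (random.random(), np.random.rand(3))
set_seeds(42)
second = (random.random(), np.random.rand(3))

same_draws = first[0] == second[0] and np.array_equal(first[1], second[1])
print("Draws repeat after reseeding:", same_draws)
```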
Loading Data and Model
We’ll use the DeepSTARR model and a local sequence library. The model predicts developmental and housekeeping enhancer activity in Drosophila S2 cells.
# Load model
model = tf.keras.models.load_model('deepstarr.h5')
# Load sequence data
sequences = np.load('sequences.npy')
predictions = model.predict(sequences)
print(f"Sequences shape: {sequences.shape}")
print(f"Predictions shape: {predictions.shape}")
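The tutorial loads a precomputed library from `sequences.npy`. For readers without that file, the following is a minimal sketch of how a local library could be built: a reference sequence is one-hot encoded and each position is mutated independently with a fixed probability. The function names, the 10% mutation rate, and the random placeholder reference are illustrative assumptions, not the settings used for the paper's figure; the 249 bp length matches DeepSTARR's input size.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array in A, C, G, T order."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(4, dtype=np.float32)[idx]

def mutate_library(x_ref, n_seqs, mut_rate, rng):
    """Return (n_seqs, L, 4) one-hot mutants of x_ref.

    Each position mutates independently with probability mut_rate,
    and a mutated position always changes to a different base.
    """
    seq_len = x_ref.shape[0]
    ref_idx = x_ref.argmax(axis=1)                         # (L,)
    idx = np.tile(ref_idx, (n_seqs, 1))                    # (n_seqs, L)
    hits = rng.random((n_seqs, seq_len)) < mut_rate
    shifts = rng.integers(1, 4, size=(n_seqs, seq_len))    # 1..3 guarantees a new base
    idx = np.where(hits, (idx + shifts) % 4, idx)
    return np.eye(4, dtype=np.float32)[idx]

rng = np.random.default_rng(42)
ref_seq = "".join(rng.choice(list(ALPHABET), size=249))   # placeholder reference
x_ref = one_hot(ref_seq)
library = mutate_library(x_ref, n_seqs=1000, mut_rate=0.1, rng=rng)

# Fraction of positions that differ from the reference
mut_fraction = (library.argmax(axis=2) != x_ref.argmax(axis=1)).mean()
print(library.shape, f"mutated fraction: {mut_fraction:.3f}")
```

The same one-position-one-base check (`library.sum(axis=2) == 1`) can be applied to the loaded `sequences` array to confirm it is one-hot encoded before passing it to the model.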
Data Preprocessing
We’ll use the Compiler class to organize our data:
# Initialize compiler
compiler = Compiler(x=sequences, y=predictions)
# Compile data into MAVE format
mave = compiler.compile()
print("MAVE dataframe head:")
print(mave.head())
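The columns that `compile()` returns are not listed in this tutorial. As a rough picture of what a MAVE-style table holds, the sketch below pairs each sequence string with its score and its Hamming distance from a reference. The column names and the helper `to_mave_rows` are assumptions for illustration, not SEAM's schema.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))

def to_mave_rows(x, y, x_ref):
    """Build one record per sequence from one-hot x (N, L, 4) and scores y (N,)."""
    idx = x.argmax(axis=2)                      # (N, L) base indices
    ref_idx = x_ref.argmax(axis=1)              # (L,)
    hamming = (idx != ref_idx).sum(axis=1)      # mismatches vs. the reference
    rows = []
    for i in range(len(x)):
        rows.append({
            "Sequence": "".join(ALPHABET[idx[i]]),
            "Score": float(y[i]),
            "Hamming": int(hamming[i]),
        })
    return rows

# Toy data: a 6 bp reference (ACGTAC) and one variant (ACGTTT)
x_ref = np.eye(4)[[0, 1, 2, 3, 0, 1]]
x = np.stack([x_ref, np.eye(4)[[0, 1, 2, 3, 3, 3]]])
y = np.array([1.5, -0.3])

rows = to_mave_rows(x, y, x_ref)
for row in rows:
    print(row)
```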
Attribution Map Generation
Next, we’ll generate attribution maps using the saliency method:
# Initialize attributer
attributer = Attributer(
    model,
    method='saliency',
    gpu=True,  # Use GPU if available
    batch_size=32
)
# Generate maps
t1 = time.time()
maps = attributer.generate(sequences)
print(f"Attribution time: {time.time() - t1:.2f} seconds")
print(f"Maps shape: {maps.shape}")
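A saliency map is the gradient of the model output with respect to the input. The sketch below illustrates this on a toy differentiable model written in NumPy, which stands in for DeepSTARR so the snippet runs without TensorFlow. The model `toy_model` and its weights are made up for illustration, and the analytic gradient is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN = 12
W = 0.3 * rng.normal(size=(SEQ_LEN, 4))   # toy weights, not DeepSTARR parameters

def toy_model(x):
    """Scalar score for one (L, 4) input: tanh of a weighted sum."""
    return np.tanh(np.sum(W * x))

def saliency(x):
    """Analytic gradient of toy_model with respect to x, same shape as x."""
    z = np.sum(W * x)
    return (1.0 - np.tanh(z) ** 2) * W

def finite_difference(x, eps=1e-6):
    """Numerical gradient by central differences, for verification only."""
    grad = np.zeros_like(x)
    for pos in range(x.shape[0]):
        for base in range(x.shape[1]):
            step = np.zeros_like(x)
            step[pos, base] = eps
            grad[pos, base] = (toy_model(x + step) - toy_model(x - step)) / (2 * eps)
    return grad

x = np.eye(4)[rng.integers(0, 4, size=SEQ_LEN)]   # random one-hot sequence
sal_map = saliency(x)
max_err = np.max(np.abs(sal_map - finite_difference(x)))
print(f"Saliency map shape: {sal_map.shape}, max deviation: {max_err:.2e}")
```

Note that the gradient is defined at all four bases of every position, not only the observed one, which is why the map has the same shape as the one-hot input.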
Clustering Analysis
We’ll embed the attribution maps with UMAP, then use hierarchical clustering to group similar regulatory mechanisms:
# Initialize clusterer
clusterer = Clusterer(maps)
# Generate UMAP embedding
embedding = clusterer.embed(
    method='umap',
    n_components=2,
    n_neighbors=15,
    min_dist=0.1
)
# Perform hierarchical clustering
labels = clusterer.cluster(
    method='hierarchical',
    n_clusters=10,
    metric='euclidean',
    linkage='ward'
)
# Plot results
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
# Plot dendrogram
clusterer.plot_dendrogram(ax=ax[0])
# Plot embedding
clusterer.plot_embedding(
    embedding=embedding,
    labels=labels,
    ax=ax[1]
)
plt.tight_layout()
plt.show()
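To make the clustering step concrete, here is a minimal NumPy sketch of agglomerative clustering with Ward's criterion applied to flattened attribution maps. It is a naive implementation for illustration, not what `Clusterer` runs internally; for real workloads, `scipy.cluster.hierarchy.linkage` with `method='ward'` is the usual choice. The toy maps and the function `ward_labels` are assumptions.

```python
import numpy as np

def ward_labels(X, n_clusters):
    """Greedy agglomerative clustering using Ward's minimum-variance criterion.

    At every step, merge the pair of clusters whose union gives the smallest
    increase in within-cluster sum of squares.
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                n_a, n_b = len(clusters[a]), len(clusters[b])
                diff = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                cost = n_a * n_b / (n_a + n_b) * np.dot(diff, diff)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

rng = np.random.default_rng(0)
toy_maps = rng.normal(scale=0.05, size=(20, 10, 4))
toy_maps[:10, 2, 0] += 1.0     # first mechanism: strong signal at position 2
toy_maps[10:, 7, 3] += 1.0     # second mechanism: strong signal at position 7

flat = toy_maps.reshape(len(toy_maps), -1)   # (N, L*4) feature vectors
toy_labels = ward_labels(flat, n_clusters=2)
print(toy_labels)
```

Flattening each (L, 4) map into a single vector means the Euclidean distance compares maps position by position, so two maps cluster together only when their signal sits at the same coordinates.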
Mechanism Analysis
Now we’ll use MetaExplainer to analyze the identified mechanisms:
# Initialize meta-explainer
meta = MetaExplainer(
    maps,
    alphabet=['A', 'C', 'G', 'T'],
    window_size=20
)
# Generate Mechanism Summary Matrix (MSM)
msm = meta.generate_msm(
    gpu=True  # Use GPU if available
)
# Plot MSM
meta.plot_msm(
    column='Entropy',
    square_cells=True,
    view_window=[50, 170],
    cmap='rocket_r'
)
# Generate sequence logos for each cluster
logos = meta.generate_logos(
    center_values=True,
    figsize=(20, 2.5)
)
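One way to read the `Entropy` column is as the per-position Shannon entropy of the sequences assigned to each cluster: low entropy marks positions that a cluster holds fixed, and high entropy marks positions that are free to vary. The sketch below computes such a cluster-by-position matrix with NumPy. This is an assumption about what the MSM summarizes, and `entropy_matrix` is a hypothetical helper, not SEAM's implementation.

```python
import numpy as np

def entropy_matrix(x, labels, n_clusters):
    """Shannon entropy (bits) per cluster and position.

    x: one-hot sequences, shape (N, L, 4); labels: cluster index per sequence.
    Returns an array of shape (n_clusters, L).
    """
    out = np.zeros((n_clusters, x.shape[1]))
    for k in range(n_clusters):
        members = x[labels == k]
        freqs = members.mean(axis=0)                    # (L, 4) base frequencies
        # Treat 0 * log2(0) as 0 by skipping zero frequencies
        logs = np.log2(freqs, out=np.zeros_like(freqs), where=freqs > 0)
        out[k] = -(freqs * logs).sum(axis=1)
    return out

# Toy example with two positions:
# cluster 0 fixes position 0 and varies position 1; cluster 1 does the reverse
idx = np.array([
    [0, 0], [0, 1], [0, 2], [0, 3],     # cluster 0
    [0, 0], [1, 0], [2, 0], [3, 0],     # cluster 1
])
x = np.eye(4)[idx]
cluster_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])

H = entropy_matrix(x, cluster_ids, n_clusters=2)
print(H)
```

With a four-letter alphabet the entropy ranges from 0 bits (one base only) to 2 bits (all four bases equally frequent).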
Interpreting Results
The results show:
Clustering: The dendrogram reveals distinct groups of regulatory mechanisms
UMAP: The embedding shows how mechanisms are related in 2D space
MSM: The entropy matrix highlights regions of mechanistic importance
Logos: Sequence logos reveal the specific patterns in each cluster
Advanced Visualization
For more detailed analysis, we can customize the visualizations:
# Plot MSM with different options
meta.plot_msm(
    column='Frequency',  # Use frequency instead of entropy
    square_cells=True,
    view_window=[50, 170],
    cmap='viridis'
)
# Generate logos with different settings
meta.generate_logos(
    indices=[0, 1, 2],  # Only show first 3 clusters
    center_values=True,
    figsize=(15, 2)
)
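Cluster logos are commonly drawn from an average attribution map per cluster. As a hedged illustration of what selecting a few clusters and centering values might involve, the sketch below averages toy maps within each selected cluster and subtracts the per-position mean across the four bases. Whether `center_values` does exactly this in SEAM is an assumption, and `cluster_average_maps` is a hypothetical helper.

```python
import numpy as np

def cluster_average_maps(maps, labels, indices, center=True):
    """Average (N, L, 4) maps within each requested cluster.

    If center is True, subtract the mean over the four bases at every
    position so that each position's values sum to zero.
    """
    averaged = {}
    for k in indices:
        mean_map = maps[labels == k].mean(axis=0)        # (L, 4)
        if center:
            mean_map = mean_map - mean_map.mean(axis=1, keepdims=True)
        averaged[k] = mean_map
    return averaged

rng = np.random.default_rng(1)
toy_maps = rng.normal(size=(30, 8, 4))       # placeholder attribution maps
toy_labels = np.repeat([0, 1, 2], 10)        # three clusters of ten maps each

avg = cluster_average_maps(toy_maps, toy_labels, indices=[0, 1])
for k, mean_map in avg.items():
    print(k, mean_map.shape, np.abs(mean_map.sum(axis=1)).max())
```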
Saving Results
Finally, we can save our results:
# Save MSM data
np.save('msm_data.npy', msm)
# Save cluster labels
np.save('cluster_labels.npy', labels)
# Save embedding
np.save('umap_embedding.npy', embedding)
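A quick round trip confirms that arrays saved with `np.save` load back unchanged. The sketch writes placeholder arrays to a temporary directory so it does not touch your results; the array contents are stand-ins for the tutorial's `labels` and `embedding`.

```python
import os
import tempfile
import numpy as np

toy_labels = np.array([0, 2, 1, 1, 0])
toy_embedding = np.random.default_rng(0).normal(size=(5, 2))

with tempfile.TemporaryDirectory() as tmp:
    labels_path = os.path.join(tmp, "cluster_labels.npy")
    embedding_path = os.path.join(tmp, "umap_embedding.npy")
    np.save(labels_path, toy_labels)
    np.save(embedding_path, toy_embedding)

    # Load while the temporary directory still exists
    loaded_labels = np.load(labels_path)
    loaded_embedding = np.load(embedding_path)

round_trip_ok = (np.array_equal(loaded_labels, toy_labels)
                 and np.array_equal(loaded_embedding, toy_embedding))
print("Round trip OK:", round_trip_ok)
```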
Note
For more examples and advanced usage, please refer to our GitHub repository and the API documentation.