DeepSTARR Local Analysis Tutorial =============================== This tutorial demonstrates how to use SEAM to analyze local regulatory mechanisms in the DeepSTARR model, reproducing Figure 2 from our paper. .. note:: Expected runtime: ~3.2 minutes using Google Colab A100 GPU Setup ----- First, let's import the required packages: .. code-block:: python import time import random import numpy as np import matplotlib.pyplot as plt import tensorflow as tf import squid import seam from seam import MetaExplainer, Compiler, Attributer, Clusterer from seam import suppress_warnings, get_device # Optional: suppress warnings suppress_warnings() # Set random seeds for reproducibility random.seed(42) np.random.seed(42) tf.random.set_seed(42) Loading Data and Model -------------------- We'll use the DeepSTARR model and a local sequence library. The model predicts developmental and housekeeping enhancer activity in Drosophila S2 cells. .. code-block:: python # Load model model = tf.keras.models.load_model('deepstarr.h5') # Load sequence data sequences = np.load('sequences.npy') predictions = model.predict(sequences) print(f"Sequences shape: {sequences.shape}") print(f"Predictions shape: {predictions.shape}") Data Preprocessing --------------- We'll use the Compiler class to organize our data: .. code-block:: python # Initialize compiler compiler = Compiler(x=sequences, y=predictions) # Compile data into MAVE format mave = compiler.compile() print("MAVE dataframe head:") print(mave.head()) Attribution Map Generation ----------------------- Next, we'll generate attribution maps using the saliency method: .. code-block:: python # Initialize attributer attributer = Attributer( model, method='saliency', gpu=True, # Use GPU if available batch_size=32 ) # Generate maps t1 = time.time() maps = attributer.generate(sequences) print(f"Attribution time: {time.time() - t1:.2f} seconds") print(f"Maps shape: {maps.shape}") Clustering Analysis ---------------- We'll use hierarchical clustering to group similar regulatory mechanisms: .. code-block:: python # Initialize clusterer clusterer = Clusterer(maps) # Generate UMAP embedding embedding = clusterer.embed( method='umap', n_components=2, n_neighbors=15, min_dist=0.1 ) # Perform hierarchical clustering labels = clusterer.cluster( method='hierarchical', n_clusters=10, metric='euclidean', linkage='ward' ) # Plot results fig, ax = plt.subplots(1, 2, figsize=(12, 5)) # Plot dendrogram clusterer.plot_dendrogram(ax=ax[0]) # Plot embedding clusterer.plot_embedding( embedding=embedding, labels=labels, ax=ax[1] ) plt.tight_layout() plt.show() Mechanism Analysis --------------- Now we'll use MetaExplainer to analyze the identified mechanisms: .. code-block:: python # Initialize meta-explainer meta = MetaExplainer( maps, alphabet=['A', 'C', 'G', 'T'], window_size=20 ) # Generate Mechanism Summary Matrix (MSM) msm = meta.generate_msm( gpu=True # Use GPU if available ) # Plot MSM meta.plot_msm( column='Entropy', square_cells=True, view_window=[50,170], cmap='rocket_r' ) # Generate sequence logos for each cluster logos = meta.generate_logos( center_values=True, figsize=(20, 2.5) ) Interpreting Results ----------------- The results show: 1. **Clustering**: The dendrogram reveals distinct groups of regulatory mechanisms 2. **UMAP**: The embedding shows how mechanisms are related in 2D space 3. **MSM**: The entropy matrix highlights regions of mechanistic importance 4. **Logos**: Sequence logos reveal the specific patterns in each cluster Advanced Visualization ------------------- For more detailed analysis, we can customize the visualizations: .. code-block:: python # Plot MSM with different options meta.plot_msm( column='Frequency', # Use frequency instead of entropy square_cells=True, view_window=[50,170], cmap='viridis' ) # Generate logos with different settings meta.generate_logos( indices=[0,1,2], # Only show first 3 clusters center_values=True, figsize=(15, 2) ) Saving Results ------------ Finally, we can save our results: .. code-block:: python # Save MSM data np.save('msm_data.npy', msm) # Save cluster labels np.save('cluster_labels.npy', labels) # Save embedding np.save('umap_embedding.npy', embedding) .. note:: For more examples and advanced usage, please refer to our GitHub repository and the API documentation.