API Reference
Compiler
- class seam.compiler.Compiler(x, y, x_ref=None, y_bg=None, alphabet=None, gpu=False)[source]
Bases: object
Compiler: A utility for compiling sequence analysis data into a standardized format
This implementation processes sequence data and associated metrics into a pandas DataFrame containing:
DNN predictions
Hamming distances (if reference sequence provided in x_ref)
Global Importance Analysis (GIA) scores (if background predictions provided in y_bg)
Sequence strings
- Requirements:
numpy
pandas
scipy
- __init__(x, y, x_ref=None, y_bg=None, alphabet=None, gpu=False)[source]
Initialize the Compiler.
- Parameters:
x – One-hot sequences of shape (N, L, A)
y – DNN predictions of shape (N, 1)
x_ref – Optional reference sequence of shape (1, L, A)
y_bg – Optional background predictions of shape (N, 1)
alphabet – List of characters for sequence conversion (e.g., [‘A’, ‘C’, ‘G’, ‘T’])
gpu – Whether to use GPU-accelerated sequence conversion (default: False)
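As a rough illustration of two of the compiled columns, the sketch below decodes one-hot arrays into sequence strings and counts mismatches against a reference using NumPy only. The helper names onehot_to_strings and hamming_to_ref are hypothetical and not part of seam; the Compiler's actual implementation may differ.

```python
import numpy as np

def onehot_to_strings(x, alphabet=("A", "C", "G", "T")):
    """Decode one-hot sequences of shape (N, L, A) into N strings (hypothetical helper)."""
    letters = np.asarray(alphabet)
    idx = np.argmax(x, axis=-1)              # (N, L) index of the active character
    return ["".join(letters[row]) for row in idx]

def hamming_to_ref(x, x_ref):
    """Count positions where each sequence differs from the reference of shape (1, L, A)."""
    idx = np.argmax(x, axis=-1)              # (N, L)
    ref_idx = np.argmax(x_ref, axis=-1)      # (1, L), broadcasts against (N, L)
    return np.sum(idx != ref_idx, axis=1)    # (N,)

# Toy data: reference "ACGT" plus one single mutant "ACGA"
eye = np.eye(4)
x_ref = eye[[0, 1, 2, 3]][None, :, :]                  # shape (1, 4, 4)
x = np.stack([eye[[0, 1, 2, 3]], eye[[0, 1, 2, 0]]])   # shape (2, 4, 4)

seqs = onehot_to_strings(x)
dists = hamming_to_ref(x, x_ref)
```

Here seqs holds the decoded strings and dists the per-sequence Hamming distances, which is the kind of information the description above says ends up in the DataFrame.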
Attributer
- class seam.attributer.Attributer(model, method='saliency', task_index=None, batch_size=None, num_shuffles=20, compress_fun=<function reduce_mean>, pred_fun=None, gpu=True)[source]
Bases: object
Attributer: A unified interface for computing attribution maps in TensorFlow 2.x
This implementation is optimized for TensorFlow 2.x and provides GPU-accelerated implementations of common attribution methods:
- Saliency Maps
- SmoothGrad
- Integrated Gradients
- DeepSHAP (via the SHAP package; requires TensorFlow setup before initialization, see below)
- ISM (In-Silico Mutagenesis)
Requirements:
- tensorflow
- numpy
- tqdm
- shap (for DeepSHAP only)
Key Features:
- Batch processing for saliency, smoothgrad, integrated gradients, and ISM
- DeepSHAP processes sequences one at a time (no batch mode)
- GPU-optimized implementations for saliency, smoothgrad, and integrated gradients
- Consistent interface across methods
- Support for multi-head models
- Memory-efficient processing of large datasets
- Flexible sequence windowing for long sequences
- Example usage:
import tensorflow as tf
from seam.attributer import Attributer

# Basic usage with output reduction function
attributer = Attributer(
    model,
    method='saliency',
    task_index=0,                      # Select first output head
    compress_fun=tf.math.reduce_mean,  # Reduce selected output to scalar
    pred_fun=None                      # Not used for gradient-based methods
)

# Example with ChromBPNet compression functions
attributer = Attributer(
    model,
    method='deepshap',
    task_index=0,                                    # Select first output head
    compress_fun=Attributer.bpnet_profile_deepshap,  # ChromBPNet profile compression with stop_gradient
    pred_fun=None
)

# Example with ISM (forward-pass method)
attributer = Attributer(
    model,
    method='ism',
    task_index=0,                      # Select first output head
    compress_fun=tf.math.reduce_mean,  # Reduce selected output to scalar
    pred_fun=model.predict_on_batch    # Optional: use predict_on_batch for ISM
)

# Computing attributions for a specific window while maintaining full context
attributions = attributer.compute(
    x=input_sequences,         # Shape: (N, window_size, A)
    x_ref=reference_sequence,  # Shape: (1, full_length, A)
    save_window=[100, 200],    # Compute attributions for positions 100-200
    batch_size=128
)

# Method-specific parameters
attributions = attributer.compute(
    x=input_sequences,
    num_steps=20,              # for intgrad
    num_samples=20,            # for smoothgrad
    multiply_by_inputs=False,  # for intgrad
    log2fc=False               # for ism
)
Note: For optimal performance, ensure TensorFlow is configured to use GPU acceleration.
DeepSHAP Requirements: DeepSHAP requires specific TensorFlow setup that must be done BEFORE creating the Attributer (because DeepSHAP was designed for earlier TensorFlow versions):
1. Disable TensorFlow eager execution: tf.compat.v1.disable_eager_execution()
2. Disable TensorFlow v2 behavior: tf.compat.v1.disable_v2_behavior()
3. Load/reload the model from file after disabling eager execution
4. Rebuild the model graph by passing a dummy input through it
5. Configure SHAP op handlers for TensorFlow compatibility
- Example setup sequence:
tf.compat.v1.disable_eager_execution()
tf.compat.v1.disable_v2_behavior()
import shap
shap.explainers.deep.deep_tf.op_handlers["AddV2"] = shap.explainers.deep.deep_tf.passthrough
model = tf.keras.models.load_model(model_path, custom_objects=custom_objects)
_ = model(tf.keras.Input(shape=model.input_shape[1:]))
# Now create Attributer with the prepared model
- SUPPORTED_METHODS = {'deepshap', 'intgrad', 'ism', 'saliency', 'smoothgrad'}
- DEFAULT_BATCH_SIZES = {'intgrad': 128, 'ism': 32, 'saliency': 128, 'smoothgrad': 64}
- __init__(model, method='saliency', task_index=None, batch_size=None, num_shuffles=20, compress_fun=<function reduce_mean>, pred_fun=None, gpu=True)[source]
Initialize the Attributer.
- Parameters:
model – TensorFlow model to explain
method – Attribution method (default: ‘saliency’)
task_index – Index of output head to explain (optional)
- For single-output models: leave as None (default)
- For multi-output models: specify index (e.g., 0 for first output)
- Setting task_index=0 with single-output models will cause errors
batch_size – Batch size for computing attributions (optional, defaults to method-specific size)
num_shuffles – Number of shuffles for DeepSHAP background (default: 20, matches ChromBPNet)
compress_fun – Function to compress model output to scalar (default: tf.math.reduce_mean)
pred_fun – Function to use for model predictions in forward-pass methods like ISM. Not used for gradient-based methods (saliency, smoothgrad, intgrad). Default: model.__call__
gpu – Whether to use GPU-optimized implementation (default: True)
- smoothgrad(X, num_samples=20, mean=0.0, stddev=0.1, gpu=True, **kwargs)[source]
Compute SmoothGrad attribution maps.
- Parameters:
X – Input tensor of shape (batch_size, L, A)
num_samples – Number of noisy samples
mean – Mean of noise
stddev – Standard deviation of noise
gpu – Whether to use GPU-optimized implementation
**kwargs – Additional arguments (ignored)
- Returns:
Attribution maps of shape (batch_size, L, A)
- Return type:
numpy.ndarray
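The idea behind SmoothGrad, averaging gradients over noisy copies of the input, can be sketched without TensorFlow. The example below is a minimal NumPy illustration, not seam's implementation: smoothgrad_sketch and the toy gradient function are hypothetical, and the analytic gradient stands in for what a real model would supply through automatic differentiation.

```python
import numpy as np

def smoothgrad_sketch(grad_fn, X, num_samples=20, mean=0.0, stddev=0.1, seed=0):
    """Average gradients over noisy copies of X (hypothetical helper, NumPy only)."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(X, dtype=float)
    for _ in range(num_samples):
        noise = rng.normal(mean, stddev, size=X.shape)  # Gaussian noise per sample
        total += grad_fn(X + noise)
    return total / num_samples

# Toy "model": f(X) = sum(W * X**2), whose gradient is 2 * W * X
W = np.arange(8, dtype=float).reshape(1, 2, 4)

def toy_grad(X):
    return 2.0 * W * X

X = np.ones((3, 2, 4))                      # (batch_size, L, A)
attr = smoothgrad_sketch(toy_grad, X, num_samples=20, stddev=0.1)
attr_no_noise = smoothgrad_sketch(toy_grad, X, num_samples=20, stddev=0.0)
```

With stddev=0.0 the result reduces to the plain gradient, which is a convenient sanity check when experimenting with the noise level.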
- intgrad(X, baseline_type='zeros', num_steps=20, gpu=True, multiply_by_inputs=False, seed=None)[source]
Compute Integrated Gradients attribution maps.
- Parameters:
X (array-like) – Input sequences
baseline_type (str) – Type of baseline to use:
- 'zeros': Zero baseline (default)
- 'random_shuffle': Random shuffle of input sequence
- 'dinuc_shuffle': Dinucleotide-preserved shuffle of input sequence
num_steps (int) – Number of steps for integration
gpu (bool) – Whether to use GPU-optimized implementation
multiply_by_inputs (bool) – Whether to multiply gradients by inputs
seed (int, optional) – Random seed for reproducibility in shuffling methods
- Returns:
Attribution maps
- Return type:
array-like
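For readers unfamiliar with Integrated Gradients, the sketch below shows the path integral approximated with a midpoint Riemann sum over num_steps points between the baseline and the input. It is a NumPy-only illustration under stated assumptions: intgrad_sketch and the toy model are hypothetical, only the zero baseline is covered, and the integration rule seam uses is not specified in this reference.

```python
import numpy as np

def intgrad_sketch(grad_fn, X, baseline=None, num_steps=20, multiply_by_inputs=False):
    """Approximate Integrated Gradients with a midpoint Riemann sum (hypothetical helper)."""
    if baseline is None:
        baseline = np.zeros_like(X, dtype=float)       # 'zeros' baseline
    alphas = (np.arange(num_steps) + 0.5) / num_steps  # midpoints in (0, 1)
    avg_grad = np.zeros_like(X, dtype=float)
    for a in alphas:
        avg_grad += grad_fn(baseline + a * (X - baseline))
    avg_grad /= num_steps
    # Scaling by (X - baseline) yields attributions that sum to f(X) - f(baseline)
    return avg_grad * (X - baseline) if multiply_by_inputs else avg_grad

# Toy "model": f(X) = sum(W * X**2) per sequence, gradient 2 * W * X
W = np.arange(1, 9, dtype=float).reshape(2, 4)

def toy_f(X):
    return np.sum(W * X**2, axis=(1, 2))

def toy_grad(X):
    return 2.0 * W * X

X = np.linspace(0.1, 1.0, 16).reshape(2, 2, 4)
attr = intgrad_sketch(toy_grad, X, num_steps=20, multiply_by_inputs=True)
```

With multiply_by_inputs=True and a zero baseline, the attributions for each sequence sum to the model output, which is the completeness property of the method.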
- ism(X, log2fc=False, gpu=True, snv_window=None)[source]
Compute In-Silico Mutagenesis attribution maps.
- Parameters:
X – Input tensor of shape (batch_size, L, A)
log2fc – Whether to compute log2 fold change instead of difference
gpu – Whether to attempt GPU-optimized implementation
snv_window – Optional [start, end] positions to compute variants for. If None, compute for all positions.
- Returns:
Attribution maps of shape (batch_size, L, A)
- Return type:
numpy.ndarray
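A minimal sketch of in-silico mutagenesis follows: every position in the window is set to each character in turn, and the change in prediction relative to the unmutated input is recorded. The names ism_sketch and toy_pred are hypothetical, and the half-open treatment of snv_window is an assumption; seam's batched implementation may differ.

```python
import numpy as np

def ism_sketch(pred_fn, X, log2fc=False, snv_window=None):
    """Score every single-character substitution (hypothetical helper, NumPy only)."""
    N, L, A = X.shape
    start, end = (0, L) if snv_window is None else snv_window  # assumed half-open
    ref_pred = pred_fn(X)                                      # (N,)
    attr = np.zeros((N, L, A))
    for pos in range(start, end):
        for a in range(A):
            X_mut = X.copy()
            X_mut[:, pos, :] = 0          # clear the original character
            X_mut[:, pos, a] = 1          # substitute character a
            mut_pred = pred_fn(X_mut)
            if log2fc:
                attr[:, pos, a] = np.log2(mut_pred / ref_pred)
            else:
                attr[:, pos, a] = mut_pred - ref_pred
    return attr

# Toy additive predictor: each (position, character) pair contributes W[pos, char]
W = np.arange(20, dtype=float).reshape(5, 4)

def toy_pred(X):
    return np.sum(X * W, axis=(1, 2))

ref_chars = [0, 1, 2, 3, 0]
X = np.eye(4)[ref_chars][None, :, :]       # shape (1, 5, 4)
attr_full = ism_sketch(toy_pred, X)
attr_win = ism_sketch(toy_pred, X, snv_window=[1, 3])
```

For an additive predictor like this one, each score is simply the weight of the substituted character minus the weight of the original one, so the character already present always scores zero.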
- compute(x, x_ref=None, batch_size=128, save_window=None, **kwargs)[source]
Compute attribution maps.
- Parameters:
x – One-hot sequences (shape: (N, L, A))
x_ref – One-hot reference sequence (shape: (1, L, A)) for windowed analysis. Not used for DeepSHAP background data, which is handled during initialization.
batch_size – Number of attribution maps per batch (ignored for DeepSHAP)
save_window – Window [start, stop] for computing attributions. If provided along with x_ref, the input sequences will be padded with the reference sequence outside this window. This allows computing attributions for a subset of positions while maintaining the full sequence context.
**kwargs – Additional arguments for specific attribution methods
- gpu: Whether to use GPU implementation (default: True)
- log2fc (bool): Whether to compute log2 fold change (for ISM)
- num_steps: Steps for integrated gradients (default: 50)
- num_samples: Samples for smoothgrad (default: 50)
- mean, stddev: Parameters for smoothgrad noise
- multiply_by_inputs: Whether to multiply gradients by inputs (default: False)
- baseline_type: Background type for intgrad and deepshap ('zeros', 'random_shuffle', 'dinuc_shuffle')
- background: Background sequences for DeepSHAP (shape: (N, L, A)); overrides baseline_type
- snv_window: Window [start, end] for ISM to compute variants (default: None)
- Returns:
Attribution maps (shape: (N, L, A))
- Return type:
numpy.ndarray
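The padding described for save_window can be pictured as pasting each windowed input into a copy of the reference. The sketch below is a NumPy illustration only: pad_with_reference is a hypothetical name, and treating stop as exclusive is an assumption not confirmed by this reference.

```python
import numpy as np

def pad_with_reference(x, x_ref, save_window):
    """Embed windowed inputs (N, window, A) into copies of x_ref (1, full_length, A)."""
    start, stop = save_window                      # assumed half-open: [start, stop)
    if x.shape[1] != stop - start:
        raise ValueError("window length must match x.shape[1]")
    x_full = np.repeat(x_ref, x.shape[0], axis=0)  # new array, shape (N, full_length, A)
    x_full[:, start:stop, :] = x                   # overwrite only the window
    return x_full

eye = np.eye(4)
x_ref = eye[[0, 1, 2, 3, 0, 1]][None, :, :]        # reference, shape (1, 6, 4)
x = np.stack([eye[[3, 3]], eye[[2, 2]]])           # two windowed inputs, shape (2, 2, 4)
x_full = pad_with_reference(x, x_ref, save_window=[2, 4])
```

Positions outside the window carry the reference sequence, so a model sees full-length context while only the window varies between inputs.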
- show_params(method=None)[source]
Show available parameters for attribution methods.
- Parameters:
method – Specific method to show params for. If None, shows all methods.
- static bpnet_profile(x)[source]
ChromBPNet profile compression function.
This function implements the ChromBPNet profile task compression. For DeepSHAP, x should be the model. For other methods, x should be the output tensor.
- Parameters:
x – Model output tensor (profile logits) or model (for DeepSHAP)
- Returns:
Weighted sum of mean-normalized logits
- Return type:
tf.Tensor
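To make "weighted sum of mean-normalized logits" concrete, the sketch below follows the recipe used in ChromBPNet-style interpretation code: subtract the per-example mean from the logits, then weight them by their softmax probabilities. That the weights are softmax probabilities is an assumption here, and profile_compress_sketch is a hypothetical NumPy stand-in; in TensorFlow the weights would typically be wrapped in tf.stop_gradient.

```python
import numpy as np

def profile_compress_sketch(logits):
    """Weighted sum of mean-normalized logits along the last axis (hypothetical helper)."""
    meannormed = logits - logits.mean(axis=-1, keepdims=True)
    # Numerically stable softmax; treated as constant weights (no gradient) in practice
    z = meannormed - meannormed.max(axis=-1, keepdims=True)
    weights = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return np.sum(weights * meannormed, axis=-1)

logits = np.array([[0.0, 1.0, 2.0, 3.0]])
out = profile_compress_sketch(logits)
out_shifted = profile_compress_sketch(logits + 10.0)   # constant shift of the logits
out_flat = profile_compress_sketch(np.zeros((2, 5)))   # uniform profile
```

Mean normalization makes the scalar invariant to adding a constant to all logits, and a flat profile compresses to zero.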
- static bpnet_counts(x)[source]
ChromBPNet counts compression function.
This function implements the ChromBPNet counts task compression. For DeepSHAP, x should be the model. For other methods, x should be the output tensor.
- Parameters:
x – Model output tensor (counts logits) or model (for DeepSHAP)
- Returns:
- For DeepSHAP: sum of counts across output dimension
For other methods: tensor as-is (no reduction)
- Return type:
tf.Tensor
Clusterer
- class seam.clusterer.Clusterer(attribution_maps, method='umap', gpu=True)[source]
Bases: object
Clusterer: A unified interface for embedding and clustering attribution maps
This class provides common embedding and clustering methods for attribution maps:
Embedding Methods:
- UMAP (requires umap-learn)
- PHATE (requires phate)
- t-SNE (requires openTSNE)
- PCA (GPU-accelerated with cuML, CPU fallback with scikit-learn)
- Diffusion Maps (not yet implemented)
Clustering Methods:
- Hierarchical (GPU-optimized implementation available)
- K-means (GPU-accelerated with kmeanstf, CPU fallback with scikit-learn)
- DBSCAN (requires scikit-learn)
Requirements:
- numpy
- scipy
- scikit-learn (for PCA, K-means, DBSCAN)
Optional Requirements:
- tensorflow (for GPU-accelerated hierarchical clustering)
- cuml (for GPU-accelerated PCA)
- kmeanstf (for GPU-accelerated K-means clustering)
- umap-learn (for UMAP)
- phate (for PHATE)
- openTSNE (for t-SNE)
Additional Requirements:
- scikit-learn (for clustering)
- matplotlib (for visualization)
- Example usage:
# Initialize clusterer with attribution maps
clusterer = Clusterer(
    maps,
    method='umap',
    n_components=2
)
# Compute embedding
embedding = clusterer.embed()
# For K-means or DBSCAN:
clusters = clusterer.cluster(embedding, method='kmeans', n_clusters=10)
# For hierarchical clustering:
linkage = clusterer.cluster(method='hierarchical')
# Then get cluster labels using different criteria:
labels, cut_level = clusterer.get_cluster_labels(linkage, criterion='distance', max_distance=8)
# or
labels, cut_level = clusterer.get_cluster_labels(linkage, criterion='maxclust', n_clusters=100)
- SUPPORTED_METHODS = {'diffmap', 'pca', 'phate', 'tsne', 'umap'}
- SUPPORTED_CLUSTERERS = {'dbscan', 'hierarchical', 'kmeans'}
- __init__(attribution_maps, method='umap', gpu=True)[source]
Initialize the Clusterer.
- Parameters:
attribution_maps – numpy array of shape (N, L, A) containing attribution maps
method – Embedding method (default: ‘umap’)
gpu – Whether to use GPU acceleration when available (default: True)
- embed(**kwargs)[source]
Compute embedding using specified method.
- Parameters:
**kwargs – Method-specific parameters. Can be passed directly or as a ‘kwargs’ dictionary.
- Returns:
Embedded coordinates
- Return type:
numpy.ndarray
- cluster(embedding=None, method='kmeans', n_clusters=10, **kwargs)[source]
Cluster the embedded data.
- Parameters:
embedding – Optional pre-computed embedding. If None, uses stored embedding
method – Clustering method (‘kmeans’, ‘dbscan’, or ‘hierarchical’)
n_clusters – Number of clusters for kmeans
**kwargs – Additional clustering parameters.
For DBSCAN:
- eps: Maximum distance between samples (default: 0.01)
- min_samples: Minimum samples per cluster (default: 10)
For KMeans:
- random_state: Random seed (default: 0)
- n_init: Number of initializations (default: 10)
- max_iter: Maximum iterations (default: 300 for GPU, sklearn default for CPU)
For Hierarchical:
- batch_size: Batch size for GPU computation (default: 10000)
- link_method: Linkage method (default: 'ward')
- dist_fname: Temporary file for distance matrix
- store_distances: Whether to return distances (default: False)
- Returns:
For kmeans/dbscan: numpy.ndarray of cluster labels for each sample.
For hierarchical: linkage matrix in the format of scipy.cluster.hierarchy.linkage (use get_cluster_labels() to obtain cluster assignments).
For hierarchical with store_distances=True: tuple (linkage_matrix, distance_matrix).
- Return type:
numpy.ndarray, or tuple when store_distances=True
- normalize(embedding, to_sum=False, copy=True)[source]
Normalize embedding vectors to [0,1] range.
- Parameters:
embedding – Array of shape (n_samples, n_dimensions)
to_sum – If True, normalize to sum=1. If False, normalize to range [0,1]
copy – If True, operate on a copy of the data
- Returns:
Normalized embedding
- Return type:
numpy.ndarray
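The two normalization modes can be sketched as follows. This is an illustration under assumptions: normalize_sketch is a hypothetical name, the normalization is applied per embedding dimension (column), and the sketch always returns a new array rather than honoring copy.

```python
import numpy as np

def normalize_sketch(embedding, to_sum=False):
    """Per-dimension normalization of an (n_samples, n_dimensions) array (hypothetical helper)."""
    emb = np.array(embedding, dtype=float)   # work on a float copy
    if to_sum:
        return emb / emb.sum(axis=0)         # each column sums to 1
    emb = emb - emb.min(axis=0)              # shift each column to start at 0
    return emb / emb.max(axis=0)             # scale each column to end at 1

embedding = np.array([[1.0, 10.0], [2.0, 30.0], [5.0, 20.0]])
scaled = normalize_sketch(embedding)
summed = normalize_sketch(embedding, to_sum=True)
```

Min-max scaling is the usual choice before plotting, while sum normalization only makes sense for non-negative coordinates.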
- plot_embedding(embedding, labels=None, dims=[0, 1], normalize=False, cmap='jet', s=2.5, alpha=1.0, linewidth=0.1, colorbar_label=None, sort_order=None, ref_index=None, legend_loc='upper left', figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot embedding and optionally color by labels/values.
- Parameters:
embedding – Array of shape (n_samples, n_dimensions)
labels – Values for coloring points. Can be:
- numpy array of shape (N,) or (N,1)
- pandas Series/DataFrame column (e.g., mave['DNN'])
- None (points will be single color)
dims – Which dimensions to plot [dim1, dim2]
normalize – Whether to normalize embedding to [0,1] range
cmap – Colormap for points (e.g., 'viridis', 'jet', 'tab10')
- Use 'viridis'/'jet' for continuous values
- Use 'tab10'/'Set3' for discrete clusters
s – Point size (default: 2.5)
alpha – Point transparency (default: 1.0)
linewidth – Width of point edges (default: 0.1)
colorbar_label – Label for colorbar (if None, no colorbar shown)
sort_order – Order to plot points ('ascending', 'descending', or None)
- Useful for ensuring important points are plotted on top
- Points are sorted based on their label values
- Works with both numpy arrays and pandas Series/DataFrames
ref_index – Index of reference/wild-type sequence to highlight (default: None)
- Will be shown as a black star on the plot
legend_loc – Location of legend for reference sequence ('best', 'upper left', 'upper right', etc.)
figsize – Figure size (width, height) in inches (default: None, uses matplotlib default)
save_path – Path to save figure (if None, displays plot)
dpi – DPI for saved figure (default: 200)
file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
- Example usage:
# Basic plot with reference sequence
clusterer.plot_embedding(embedding, ref_index=0)
# Color by DNN predictions with colorbar and reference
clusterer.plot_embedding(
    embedding,
    labels=mave['DNN'],              # or y_mut numpy array
    colorbar_label='DNN prediction',
    sort_order='descending',         # high predictions on top
    ref_index=ref_idx
)
- plot_histogram(embedding, dims=[0, 1], bins=101, cmap='viridis', colorbar_label='Count', figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot 2D histogram of embedding points.
- Parameters:
embedding – Array of shape (n_samples, n_dimensions)
dims – Which dimensions to plot [dim1, dim2]
bins – Number of bins for histogram (default: 101)
cmap – Colormap for histogram (default: ‘viridis’)
colorbar_label – Label for colorbar (if None, shows ‘Count’)
figsize – Figure size (width, height) in inches (default: None, uses matplotlib default)
save_path – Path to save figure (if None, displays plot)
dpi – DPI for saved figure (default: 200)
file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
- plot_dendrogram(linkage, figsize=(15, 10), leaf_rotation=90, leaf_font_size=8, cut_level=None, save_path=None, dpi=200, file_format='png', ax=None, truncate=True, cut_level_truncate=None, criterion=None, n_clusters=None, gui=False)[source]
Plot dendrogram from hierarchical clustering linkage matrix.
- Parameters:
linkage – Hierarchical clustering linkage matrix
figsize – Figure size (width, height)
leaf_rotation – Rotation of leaf labels
leaf_font_size – Font size for leaf labels
cut_level – Optional height at which to draw horizontal cut line
save_path – Path to save figure (if None, displays plot)
dpi – DPI for saved figure (default: 200)
file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
ax – Matplotlib axis to plot on (for GUI use). If provided, plots on existing axis instead of creating new figure
truncate – Whether to truncate dendrogram for large datasets (for GUI use). Only used when ax is provided
cut_level_truncate – Height at which to truncate dendrogram (for GUI use). Used with truncate=True
criterion – Clustering criterion (‘maxclust’ or ‘distance’) for truncation calculation. Used with truncate=True
n_clusters – Number of clusters (for maxclust criterion) for truncation calculation. Used with truncate=True and criterion=’maxclust’
gui – Whether to apply GUI-specific styling (smaller fonts, removed spines, etc.) (default: False)
- get_cluster_labels(linkage, criterion='maxclust', max_distance=10, n_clusters=200)[source]
Get cluster labels from a linkage matrix.
- Parameters:
linkage – Linkage matrix from scipy.cluster.hierarchy.linkage
criterion – How to form flat clusters ('distance' or 'maxclust')
- 'distance': Cut tree at specified height
- 'maxclust': Produce specified number of clusters
max_distance – Maximum cophenetic distance within clusters (only used if criterion=’distance’)
n_clusters – Desired number of clusters to produce (only used if criterion=’maxclust’)
- Returns:
Tuple (labels, cut_level), where labels are the zero-indexed cluster labels and cut_level is the cut level as a float (max_distance if criterion='distance', or the computed level if criterion='maxclust')
- Return type:
tuple (numpy.ndarray, float)
MetaExplainer
- class seam.meta_explainer.MetaExplainer(clusterer, mave_df, attributions, ref_idx=0, background_separation=False, mut_rate=0.1, sort_method='median', alphabet=None)[source]
Bases: object
A class for analyzing and visualizing attribution map clusters.
This class builds on the Clusterer class to provide detailed analysis and visualization of attribution map clusters.
Features
- Analysis
Mechanism Summary Matrix (MSM) generation
Sequence logos and attribution logos
Cluster membership tracking
Background separation and noise reduction of attribution maps
- Visualization
DNN score distributions per cluster
Sequence logos (PWM and enrichment)
Attribution logos (fixed and adaptive scaling)
Mechanism Summary Matrices
Cluster profile plots
Requirements
All requirements from Clusterer class
Biopython
Logomaker
Seaborn
SQUID-NN
- __init__(clusterer, mave_df, attributions, ref_idx=0, background_separation=False, mut_rate=0.1, sort_method='median', alphabet=None)[source]
Initialize MetaExplainer with clusterer and data.
- Parameters:
clusterer (Clusterer) – Initialized Clusterer object with clustering results.
mave_df (pandas.DataFrame) – DataFrame containing sequences and their scores. Must have columns:
- 'Sequence': DNA/RNA sequences
- 'Score' or 'DNN': Model predictions
- 'Cluster': Cluster assignments
attributions (numpy.ndarray) – Attribution maps for sequences. Shape should be (n_sequences, seq_length, n_characters).
ref_idx (int, default=0) – Index of reference sequence in mave_df.
background_separation (bool, default=False) – Whether to separate background signal from logos.
mut_rate (float, default=0.10) – Mutation rate used for background sequence generation.
sort_method ({'median', 'visual', None}, default='median') – How to sort clusters in all visualizations and analyses.
- 'median': Sort by median DNN score
- 'visual': Sort based on hierarchical clustering of the MSM pattern
- None: Use original cluster indices
alphabet (list of str, optional) – List of characters to use in sequence logos. Default is [‘A’, ‘C’, ‘G’, ‘T’].
- get_cluster_order(sort_method='median', sort_indices=None)[source]
Get cluster ordering based on specified method.
- plot_cluster_stats(plot_type='box', metric='prediction', save_path=None, show_ref=True, show_fliers=False, compact=False, fontsize=8, dpi=200, figsize=None, file_format='png')[source]
Plot cluster statistics with various visualization options.
- Parameters:
plot_type ({'box', 'bar'}) – Type of visualization:
- 'box': Show distribution as box plots (predictions only)
- 'bar': Show bar plot of predictions or counts
metric ({'prediction', 'counts'}) – What to visualize (only used for bar plots):
- 'prediction': DNN prediction scores
- 'counts': Cluster occupancy/size
save_path (str, optional) – Path to save figure. If None, display instead
show_ref (bool) – If True and reference sequence exists, highlight its cluster
show_fliers (bool) – If True and plot_type=’box’, show outlier points
compact (bool) – If False (default), shows full boxplots. If True, uses a compact representation for boxplots with dots and IQR lines.
fontsize (int) – Font size for tick labels
dpi (int) – DPI for saved figure
figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)
file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
- generate_msm(n_seqs=1000, batch_size=50, gpu=False)[source]
Generate a Mechanism Summary Matrix (MSM) from cluster attribution maps.
- Parameters:
n_seqs (int, default=1000) – Number of sequences to generate per cluster.
batch_size (int, default=50) – Number of sequences to process in each batch.
gpu (bool, default=False) – Whether to use GPU acceleration if available.
- Returns:
The Mechanism Summary Matrix with shape (n_clusters, n_clusters). Each entry [i,j] represents the average DNN score when applying cluster i’s mechanism to sequences from cluster j.
- Return type:
numpy.ndarray
- plot_msm(column='Entropy', delta_entropy=False, square_cells=False, view_window=None, show_tfbs_clusters=False, tfbs_clusters=None, entropy_multiplier=0.5, cov_matrix=None, row_order=None, revels=None, save_path=None, dpi=200, figsize=None, file_format='png', gui=False, gui_figure=None)[source]
Visualize the Mechanism Summary Matrix (MSM) as a heatmap.
- Parameters:
column (str) – Which MSM metric to visualize:
- 'Entropy': Shannon entropy of characters at each position per cluster
- 'Reference': Percentage of mismatches to reference sequence
- 'Consensus': Percentage of matches to cluster consensus sequence
delta_entropy (bool) – If True and column=’Entropy’, show change in entropy from background expectation (based on mutation rate)
square_cells (bool) – If True, set cells in MSM to be perfectly square
view_window (list of [start, end], optional) – If provided, crop the x-axis to this window of positions
show_tfbs_clusters (bool) – Whether to show TFBS cluster rectangles (default: False)
tfbs_clusters (dict, optional) – Dictionary mapping cluster IDs to lists of positions. Required if show_tfbs_clusters is True.
entropy_multiplier (float, optional) – Multiplier for entropy threshold when identifying background (default: 0.5)
cov_matrix (numpy.ndarray, optional) – Covariance matrix for TFBS cluster plotting. Required if show_tfbs_clusters is True.
row_order (list of int, optional) – Order of rows in cov_matrix. Required if show_tfbs_clusters is True.
revels (pandas.DataFrame, optional) – Revels matrix for entropy calculations. Required if show_tfbs_clusters is True.
save_path (str, optional) – Path to save figure. If None, display instead
dpi (int) – DPI for saved figure
figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)
file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
gui (bool) – If True, return data for GUI processing without plotting
gui_figure (matplotlib.figure.Figure, optional) – Existing figure to plot on when gui=True. If None, creates a new figure.
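The 'Entropy' metric and the delta_entropy option can be illustrated with a short NumPy sketch. It computes per-position Shannon entropy for the one-hot sequences of one cluster and the entropy expected from random mutagenesis at a given rate. The function names are hypothetical, and two points are assumptions: mutations are spread uniformly over the non-reference characters, and the delta is taken as observed minus background.

```python
import numpy as np

def positional_entropy(x):
    """Shannon entropy in bits at each position for one-hot sequences of shape (N, L, A)."""
    freqs = x.mean(axis=0)                          # (L, A) character frequencies
    safe = np.where(freqs > 0, freqs, 1.0)          # log2(1) = 0 covers absent characters
    return -np.sum(freqs * np.log2(safe), axis=1)   # (L,)

def background_entropy(mut_rate, n_chars=4):
    """Expected entropy when a position mutates at mut_rate (with 0 < mut_rate < 1)."""
    p = np.array([1.0 - mut_rate] + [mut_rate / (n_chars - 1)] * (n_chars - 1))
    return -np.sum(p * np.log2(p))

# Toy cluster of 4 sequences: position 0 is always A, position 1 is A, C, G, T once each
eye = np.eye(4)
x = np.stack([eye[[0, 0]], eye[[0, 1]], eye[[0, 2]], eye[[0, 3]]])  # shape (4, 2, 4)

entropy = positional_entropy(x)
delta = entropy - background_entropy(mut_rate=0.1)
```

A fully conserved position has zero entropy and a uniformly mixed one has 2 bits for a four-letter alphabet, so conserved positions stand out as negative values after subtracting the background.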
- generate_logos(logo_type='average', background_separation=False, mut_rate=0.01, entropy_multiplier=0.5, adaptive_background_scaling=False, figsize=(20, 2.5), batch_size=50, font_name='sans', stack_order='big_on_top', center_values=True, color_scheme='classic', font_weight=None, fade_below=0.5, shade_below=0.5, width=0.9)[source]
Generate sequence or attribution logos for each cluster.
This method creates visualization logos that represent either the average attribution patterns or sequence patterns within each cluster. It can optionally remove background signal to highlight cluster-specific patterns.
- Parameters:
logo_type ({'average', 'pwm', 'enrichment'}, default='average') – Type of logo to generate:
- 'average': Shows average attribution values (based on attribution maps)
- 'pwm': Shows position weight matrix of nucleotide frequencies (based on sequence statistics)
- 'enrichment': Shows nucleotide enrichment relative to background (based on sequence statistics)
background_separation (bool, default=False) – Whether to remove background signal from logos. Only applies to 'average' logos. When True, subtracts the background pattern computed by compute_background(), focused on highly variable positions.
mut_rate (float, default=0.01) – Mutation rate for background entropy calculation. Only used if background_separation=True.
entropy_multiplier (float, default=0.5) – Controls stringency of background position identification via a multiplier on the background entropy. Only used if background_separation=True.
adaptive_background_scaling (bool, default=False) – If True and background_separation=True, uniformly scales the background pattern differently for each cluster based on the magnitude of its background signal. This is useful when clusters have similar background patterns but at different scales.
figsize (tuple, default=(20, 2.5)) – Figure size in inches (width, height).
batch_size (int, default=50) – Number of logos to process in each batch.
font_name (str, default='sans') – Font name for logo text.
stack_order ({'big_on_top', 'small_on_top', 'fixed'}, default='big_on_top') – How to order nucleotides in each stack:
- 'big_on_top': Largest values on top
- 'small_on_top': Smallest values on top
- 'fixed': Fixed order (A, C, G, T)
center_values (bool, default=True) – Whether to center values in each position. Only applies to ‘average’ logos.
color_scheme (str or dict, default='classic') – Color scheme for logo characters.
font_weight (str or int, optional) – Font weight for logo text. Can be string (‘normal’, ‘bold’, etc.) or numeric (0-1000).
fade_below (float, default=0.5) – Controls alpha transparency for negative values. Higher values make negative values more transparent.
shade_below (float, default=0.5) – Controls color darkening for negative values. Higher values make negative values darker.
width (float, default=0.9) – Controls the horizontal width of each character.
- show_sequences(cluster_idx)[source]
Show sequences belonging to a specific cluster.
- Parameters:
cluster_idx (int) – Index of cluster to show sequences for. If sorting was specified during initialization, this index refers to the sorted order (e.g., 0 is the first cluster after sorting).
- Returns:
DataFrame containing sequences and scores for the specified cluster.
- Return type:
pandas.DataFrame
- plot_cluster_profiles(profiles, save_dir=None, dpi=200, figsize=None, file_format='png')[source]
Plot overlay of profiles associated with each cluster.
- Parameters:
profiles (np.ndarray) – Array of profile data corresponding to sequences in mave_df
save_dir (str, optional) – Directory to save profile plots. If None, displays instead.
dpi (int) – DPI for saved figures
figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)
file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
- compute_background(mut_rate=0.01, entropy_multiplier=0.5, adaptive_background_scaling=False, process_logos=True)[source]
Compute background signal based on entropic positions.
This method identifies and computes background signal patterns for each cluster based on positions with high entropy (high variability). The background can be computed either uniformly across all clusters or with cluster-specific scaling.
- Parameters:
mut_rate (float, default=0.01) – Mutation rate used to calculate background entropy threshold. Higher values will identify more positions as entropic.
entropy_multiplier (float, default=0.5) – Factor to multiply background entropy by for threshold. Lower values make the threshold more stringent (fewer positions identified as entropic).
adaptive_background_scaling (bool, default=False) – If True, computes a scaling factor for each cluster that best matches the magnitude of that cluster’s background signal. This is useful when different clusters have similar background patterns but at different scales. If False, uses the same background scale for all clusters.
process_logos (bool, default=True) – If True, creates and processes BatchLogo instances for background visualization. If False, skips logo processing to save time and memory.
Notes
The background computation process:
1. Identifies entropic (highly variable) positions in each cluster
2. Computes the average attribution pattern at these positions
3. If adaptive_background_scaling is True, computes a scaling factor for each cluster based on positions that are entropic in both that cluster and the global background
- get_cluster_maps(cluster_idx)[source]
Get attribution maps belonging to a specific cluster.
- Parameters:
cluster_idx (int) – Index of cluster to get maps for. If sorting was specified during initialization, this index refers to the sorted order (e.g., 0 is the first cluster after sorting).
- Returns:
Attribution maps for the specified cluster.
- Return type:
numpy.ndarray
- plot_attribution_variation(scope='all', metric='std', save_path=None, view_window=None, figsize=None, dpi=600, colors=None, xtick_spacing=5, file_format='png')[source]
Visualize the variation in attribution values across attribution maps for each nucleotide position.
- Parameters:
scope ({'all', 'clusters'}, default='all') – Scope of variation calculation:
- 'all': Use all individual attribution maps
- 'clusters': Use cluster-averaged attribution maps
metric ({'std', 'var'}, default='std') – Metric to use for variation calculation:
- 'std': Standard deviation
- 'var': Variance
save_path (str, optional) – Path to save figure. If None, display instead.
view_window (list of [start, end], optional) – If provided, crop the x-axis to this window of positions.
figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)
dpi (int, default=600) – DPI for saved figure.
colors (dict, optional) – Dictionary mapping nucleotide indices to RGB colors. Default: {0: [0, .5, 0], 1: [0, 0, 1], 2: [1, .65, 0], 3: [1, 0, 0]} for A, C, G, T respectively.
xtick_spacing (int, default=5) – Show x-axis labels every nth position. Set to 1 to show all positions.
file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’
- Returns:
Array of variation values (std or var) for each position and nucleotide
- Return type:
numpy.ndarray
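The returned array can be understood as a per-position, per-nucleotide spread across maps. The sketch below is a NumPy illustration of the two scopes and two metrics; attribution_variation and cluster_averages are hypothetical names, and the random maps and labels are placeholder data.

```python
import numpy as np

def cluster_averages(maps, labels):
    """Average attribution maps (N, L, A) within each cluster, giving (n_clusters, L, A)."""
    return np.stack([maps[labels == k].mean(axis=0) for k in np.unique(labels)])

def attribution_variation(maps, metric="std"):
    """Spread across maps for each position and nucleotide, shape (L, A)."""
    if metric == "std":
        return maps.std(axis=0)
    if metric == "var":
        return maps.var(axis=0)
    raise ValueError("metric must be 'std' or 'var'")

rng = np.random.default_rng(0)
maps = rng.normal(size=(12, 6, 4))                     # placeholder attribution maps
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # placeholder cluster labels

var_all = attribution_variation(maps, metric="var")                   # scope='all'
std_all = attribution_variation(maps, metric="std")
std_clusters = attribution_variation(cluster_averages(maps, labels))  # scope='clusters'
```

Using cluster averages first suppresses within-cluster noise, so the 'clusters' scope highlights positions that differ between mechanisms rather than between individual sequences.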
Identifier
- class seam.identifier.Identifier(msm_df, meta_explainer, column='Entropy')[source]
Bases: object
Class for identifying and analyzing transcription factor binding sites (TFBSs) from attribution maps.
The Identifier class takes attribution maps from a MetaExplainer and identifies distinct TFBSs by analyzing patterns of activity across clusters. It uses a multi-step process:
1. Covariance Analysis:
   - Analyzes the covariance between positions in the attribution maps
   - Identifies regions that show coordinated activity across clusters
   - Uses hierarchical clustering to group positions into potential TFBSs
2. TFBS Identification:
   - Defines TFBS regions based on clustered covariance patterns
   - Determines which clusters are active for each TFBS using entropy-based thresholds
   - Creates a binary or continuous binding configuration matrix showing TFBS activity levels in each cluster
3. Binding Configuration Assignment:
   - Assigns clusters to specific TFBS binding configurations (e.g., A only, A+B, background)
   - Uses a distance-based scoring system to find the best cluster for each configuration
   - For background configuration, finds clusters with minimal TFBS activity across all TFBSs
Key Concepts:
- TFBS Activity: Measured as 1 - (normalized entropy), where higher values indicate stronger TFBS activity in a cluster
- Binding Configuration Matrix: Shows binary or continuous activity levels (0-1) for each TFBS in each cluster
- Binding Configuration Assignments: Maps each possible TFBS combination to its optimal cluster
- Parameters:
msm_df (pandas.DataFrame) – Mechanism Summary Matrix (MSM) data from MetaExplainer, containing entropy or other activity measures for each position in each cluster
meta_explainer (MetaExplainer) – Instance of MetaExplainer class that generated the attribution maps
column (str, optional) – Column from MSM to use for analysis (default: ‘Entropy’)
- revels
Pivoted MSM data with clusters as rows and positions as columns
- Type:
pandas.DataFrame
- cov_matrix
Covariance matrix between positions, used for TFBS identification
- Type:
pandas.DataFrame
- tfbs_clusters
Dictionary mapping TFBS labels to their constituent positions
- Type:
dict
- entropy_multiplier
Threshold multiplier for determining active clusters
- Type:
float
- active_clusters_by_tfbs
Dictionary mapping TFBS labels to their active clusters
- Type:
dict
Notes
The class uses entropy-based measures to identify TFBS activity, where:
- Lower entropy indicates more specific, TFBS-like activity
- Higher entropy indicates more background-like activity
- Activity is normalized relative to background entropy to account for mutation rate and sequence composition
- __init__(msm_df, meta_explainer, column='Entropy')[source]
Initialize Identifier with MSM data and MetaExplainer instance.
- Parameters:
msm_df (pandas.DataFrame) – MSM data from MetaExplainer
meta_explainer (MetaExplainer) – Instance of MetaExplainer class
column (str, optional) – Column to use for analysis (default: ‘Entropy’)
- cluster_msm_covariance(method='average', n_clusters=None, cut_height=None)[source]
Cluster the covariance matrix using hierarchical clustering.
- Parameters:
method (str, optional) – Linkage method for hierarchical clustering (default: ‘average’)
n_clusters (int, optional) – Number of clusters to form. If None, will use cut_height or automatic detection. Note: This is the number of clusters BEFORE removing the largest cluster.
cut_height (float, optional) – Height at which to cut the dendrogram. If None and n_clusters is None, will use automatic gap detection.
- Returns:
Dictionary mapping cluster labels to positions
- Return type:
dict
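The covariance-then-clustering idea can be sketched on toy data as below. This is only an illustration under stated assumptions: the long-format column names (`Cluster`, `Position`, `Entropy`), the use of each position's covariance profile as the clustering input, and the `maxclust` cut are choices made for the example and may differ from what SEAM does internally.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Toy MSM: 12 clusters x 10 positions of entropy values (made-up data).
rng = np.random.default_rng(0)
n_msm_clusters, n_positions = 12, 10
entropy = rng.uniform(1.8, 2.0, size=(n_msm_clusters, n_positions))
entropy[:6, 2:5] -= 1.5   # site at positions 2-4, low entropy in clusters 0-5
entropy[3:9, 6:9] -= 1.5  # site at positions 6-8, low entropy in clusters 3-8

# Long format, one row per (cluster, position); column names are assumptions.
msm_df = pd.DataFrame({
    "Cluster": np.repeat(np.arange(n_msm_clusters), n_positions),
    "Position": np.tile(np.arange(n_positions), n_msm_clusters),
    "Entropy": entropy.ravel(),
})

# Clusters as rows, positions as columns, then covariance between positions.
pivot = msm_df.pivot(index="Cluster", columns="Position", values="Entropy")
cov_matrix = pivot.cov()

# Hierarchical clustering of positions by their covariance profiles.
Z = linkage(cov_matrix.values, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Group positions by label, then drop the largest group, which is treated
# here as background (the number of clusters is counted before this removal).
groups = {}
for pos, lab in zip(cov_matrix.index, labels):
    groups.setdefault(int(lab), []).append(int(pos))
largest = max(groups, key=lambda k: len(groups[k]))
tfbs_clusters = {k: v for k, v in groups.items() if k != largest}
```

On this toy input the two remaining groups are the two planted sites; real MSM data will rarely separate this cleanly, which is where cut_height or automatic gap detection comes in.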
- plot_pairwise_matrix(theta_lclc, view_window=None, threshold=None, cbar_title='Pairwise', gridlines=True, xtick_spacing=1, figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot pairwise matrix visualization. Adapted from https://github.com/jbkinney/mavenn/blob/master/mavenn/src/visualization.py Original authors: Tareen, A. and Kinney, J.
- Parameters:
theta_lclc (np.ndarray) – Pairwise matrix parameters (shape: (L,C,L,C))
view_window (tuple, optional) – (start, end) positions to view
threshold (float, optional) – Threshold for matrix values
cbar_title (str, optional) – Title for colorbar
gridlines (bool, optional) – Whether to show gridlines
xtick_spacing (int, optional) – Show every nth x-tick label (default: 1)
figsize (tuple, optional) – Figure size (width, height) in inches
save_path (str, optional) – Path to save the figure
dpi (int, optional) – DPI for saved figure (default: 200)
file_format (str, optional) – Format for saved figure (default: ‘png’)
- plot_msm_covariance_triangular(view_window=None, xtick_spacing=5, show_clusters=False, figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot the covariance matrix in triangular format.
- Parameters:
view_window (tuple, optional) – (start, end) positions to view
xtick_spacing (int, optional) – Show every nth x-tick label (default: 5)
show_clusters (bool, optional) – Whether to show TFBS cluster rectangles (default: False)
figsize (tuple, optional) – Figure size (width, height) in inches
save_path (str, optional) – Directory to save the plot
dpi (int, optional) – DPI for saved figure (default: 200)
file_format (str, optional) – Format for saved figure (default: ‘png’)
- plot_msm_covariance_dendrogram(figsize=(15, 10), leaf_rotation=90, leaf_font_size=8, save_path=None, dpi=200, file_format='png')[source]
Plot the dendrogram from hierarchical clustering.
- Parameters:
figsize (tuple, optional) – Figure size (width, height) in inches
leaf_rotation (float, optional) – Rotation angle for leaf labels (default: 90)
leaf_font_size (int, optional) – Font size for leaf labels (default: 8)
save_path (str, optional) – Path to save figure (if None, displays plot)
dpi (int, optional) – DPI for saved figure (default: 200)
file_format (str, optional) – Format for saved figure (default: ‘png’)
- plot_msm_covariance_square(view_window=None, show_clusters=True, view_linkage_space=False, figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot covariance matrix in square format using seaborn heatmap.
- Parameters:
view_window (tuple, optional) – (start, end) positions to view in nucleotide position space. Note: Disabled when view_linkage_space is True.
show_clusters (bool, optional) – Whether to show TFBS cluster rectangles. Only available in nucleotide position space.
view_linkage_space (bool, optional) – If True, shows matrix reordered by hierarchical clustering linkage. If False (default), shows matrix in original nucleotide position space. Note: cluster visualization and view_window are disabled in linkage space.
figsize (tuple, optional) – Figure size (width, height) in inches
save_path (str, optional) – Path to save figure
dpi (int, optional) – DPI for saved figure (default: 200)
file_format (str, optional) – Format for saved figure (default: ‘png’)
- set_entropy_multiplier(entropy_multiplier)[source]
Set the entropy multiplier for TFBS activity detection.
This value is used to determine which clusters are considered active for each TFBS region based on their entropy values.
- Parameters:
entropy_multiplier (float) – Multiplier for background entropy threshold. Lower values result in more clusters being considered active.
- get_tfbs_positions(active_clusters)[source]
Get the start and stop positions for each TFBS cluster.
- Parameters:
active_clusters (dict) – Dictionary mapping TFBS labels to active clusters
- Returns:
DataFrame containing start, stop, length, positions, and active clusters for each TFBS, sorted by start position and labeled alphabetically (A, B, C, etc.)
- Return type:
pd.DataFrame
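A rough sketch of how such a table could be assembled from a position dictionary is shown below. The helper name `summarize_tfbs`, the input dictionaries, and the `Length` column name are hypothetical; the other column names follow those listed for get_binding_config_assignments.

```python
import string
import pandas as pd

def summarize_tfbs(tfbs_clusters, active_clusters):
    """Build a per-TFBS table sorted by start position (illustrative only).

    tfbs_clusters: dict mapping a cluster label to its positions.
    active_clusters: dict mapping the same labels to active cluster indices.
    """
    rows = []
    for label, positions in tfbs_clusters.items():
        positions = sorted(positions)
        rows.append({
            "Start": positions[0],
            "Stop": positions[-1],
            "Length": positions[-1] - positions[0] + 1,  # inclusive span
            "Positions": positions,
            "Active_Clusters": sorted(active_clusters.get(label, [])),
        })
    df = pd.DataFrame(rows).sort_values("Start").reset_index(drop=True)
    # Relabel alphabetically (A, B, C, ...) in order of start position.
    df.insert(0, "TFBS", list(string.ascii_uppercase[:len(df)]))
    return df

tfbs_df = summarize_tfbs(
    tfbs_clusters={3: [12, 13, 14, 16], 1: [2, 3, 4]},
    active_clusters={3: [0, 5], 1: [2]},
)
```

Note that the span runs from the first to the last position, so a TFBS with a gap (position 15 above) still has a length of 5.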
- get_binding_config_matrix(active_clusters, mode='binary')[source]
Create a binding configuration matrix showing TFBS activity in each cluster.
- Parameters:
active_clusters (dict) – Dictionary mapping TFBS labels to active clusters
mode (str, default='binary') – Type of values in the matrix:
  - ‘binary’: 0/1 for inactive/active
  - ‘continuous’: normalized activity values (1 - normalized entropy), where higher values indicate more activity
- Returns:
Binding configuration matrix with clusters as rows and TFBSs as columns
- Return type:
pd.DataFrame
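The two modes can be illustrated with a small sketch. It assumes that "normalized entropy" means the mean entropy over a TFBS's positions divided by a single background entropy value (`bg_entropy`), clipped to the range 0-1; that normalization, the function name, and the toy data are assumptions for illustration, not SEAM's confirmed behavior.

```python
import numpy as np
import pandas as pd

def binding_config_matrix(pivot, tfbs_positions, active_clusters,
                          bg_entropy, mode="binary"):
    """Clusters x TFBS activity matrix (illustrative sketch).

    pivot: DataFrame of entropies, clusters as rows, positions as columns.
    tfbs_positions: dict mapping TFBS label to its list of positions.
    active_clusters: dict mapping TFBS label to its active cluster indices.
    bg_entropy: assumed scalar background entropy used for normalization.
    """
    matrix = pd.DataFrame(0.0, index=pivot.index, columns=list(tfbs_positions))
    for label, positions in tfbs_positions.items():
        if mode == "binary":
            matrix.loc[active_clusters[label], label] = 1.0
        elif mode == "continuous":
            mean_entropy = pivot[positions].mean(axis=1)
            # Activity = 1 - normalized entropy, kept within [0, 1].
            matrix[label] = (1.0 - mean_entropy / bg_entropy).clip(0.0, 1.0)
        else:
            raise ValueError("mode must be 'binary' or 'continuous'")
    return matrix

pivot = pd.DataFrame(np.full((4, 6), 2.0))  # 4 clusters, 6 positions
pivot.iloc[1, 0:3] = 0.5                    # TFBS A is specific in cluster 1
pivot.iloc[3, 3:6] = 1.0                    # TFBS B is weaker, in cluster 3
tfbs_positions = {"A": [0, 1, 2], "B": [3, 4, 5]}
active = {"A": [1], "B": [3]}

binary = binding_config_matrix(pivot, tfbs_positions, active, 2.0, "binary")
continuous = binding_config_matrix(pivot, tfbs_positions, active, 2.0, "continuous")
```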
- plot_binding_config_matrix(active_clusters, mode='binary', orientation='vertical', figsize=None, save_path=None, dpi=200, file_format='png')[source]
Plot binding configuration matrix showing TFBS activity in each cluster.
- Parameters:
active_clusters (dict) – Dictionary mapping TFBS labels to active clusters
mode (str) – Type of values to plot:
  - ‘binary’: dark gray/white for active/inactive
  - ‘continuous’: grayscale for activity level
orientation (str) – Layout of the axes:
  - ‘vertical’: Clusters on y-axis, TFBS on x-axis (default)
  - ‘horizontal’: TFBS on y-axis, Clusters on x-axis
figsize (tuple, optional) – Figure size (width, height) in inches
save_path (str, optional) – Path to save the figure. If None, displays plot.
dpi (int, optional) – DPI for saved figure (default: 200)
file_format (str, optional) – Format for saved figure (default: ‘png’)
- get_binding_config_assignments(tfbs_positions, mode='auto', print_template=False)[source]
Assign clusters to specific TFBS binding configurations based on their activity patterns.
This function analyzes the continuous activity levels of TFBSs across clusters to find the optimal cluster for each possible TFBS binding configuration. For example, it will find:
- Which cluster best represents TFBS A alone
- Which cluster best represents TFBS B alone
- Which cluster best represents the combined presence of TFBSs A and B
- Which cluster best represents the background configuration (no TFBSs active)
The scoring system works by:
1. For each binding configuration, defining an “ideal” activity pattern where:
   - Desired TFBS(s) have maximum observed activity
   - Other TFBSs have minimum observed activity
2. Calculating how far each cluster’s activity pattern is from this ideal
3. Selecting the cluster that minimizes this distance
For example, when finding a cluster for TFBS A:
- The ideal configuration would have maximum activity for A and minimum for others
- Each cluster’s score is based on how close it comes to this ideal
- The cluster with the highest score (smallest distance from ideal) is selected
- Parameters:
tfbs_positions (pd.DataFrame) – DataFrame from get_tfbs_positions containing TFBS information. Must have columns: ‘TFBS’, ‘Start’, ‘Stop’, ‘Positions’, ‘Active_Clusters’
mode (str, optional) – Assignment mode:
  - ‘auto’: Automatically assign clusters based on activity patterns
  - ‘template’: Print a template for manual assignment
print_template (bool, optional) – If True and mode=’template’, prints a formatted template showing all possible TFBS combinations and their current cluster assignments
- Returns:
If mode=’auto’: Dictionary mapping TFBS binding configurations to cluster indices. For example:
{
    (): 5,            # Background configuration (no TFBSs)
    ('A',): 1,        # TFBS A alone
    ('B',): 3,        # TFBS B alone
    ('A', 'B'): 7,    # Interaction of TFBSs A and B
    …
}
If mode=’template’: None, but prints template for manual assignment
- Return type:
dict or None
Notes
The function internally uses the continuous binding configuration matrix (normalized entropy-based activity levels) to make assignments, ensuring consistent scoring across all binding configurations. This means:
- Activity levels are normalized relative to background entropy
- Higher values (closer to 1) indicate stronger TFBS activity
- Lower values (closer to 0) indicate weaker or no TFBS activity
The scoring system prioritizes finding clusters that:
1. Have high activity for the desired TFBS(s)
2. Have low activity for other TFBSs
3. Show balanced activity when multiple TFBSs are desired
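The distance-to-ideal scoring described above might look like the following sketch. It assumes Euclidean distance and a continuous activity matrix with clusters as rows and TFBSs as columns; the distance measure, the function name `assign_configurations`, and the example values are assumptions, and the actual scoring may weight terms differently.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def assign_configurations(activity):
    """Map each TFBS combination to the cluster closest to its ideal pattern.

    activity: DataFrame of continuous activity, clusters x TFBSs.
    """
    labels = list(activity.columns)
    hi, lo = activity.max(axis=0), activity.min(axis=0)
    assignments = {}
    for k in range(len(labels) + 1):
        for combo in combinations(labels, k):
            # Ideal pattern: observed maximum for desired TFBSs,
            # observed minimum for all others.
            ideal = pd.Series({t: hi[t] if t in combo else lo[t] for t in labels})
            distance = np.sqrt(((activity - ideal) ** 2).sum(axis=1))
            assignments[combo] = distance.idxmin()
    return assignments

activity = pd.DataFrame(
    {"A": [0.05, 0.90, 0.10, 0.85], "B": [0.02, 0.10, 0.80, 0.75]},
    index=[0, 1, 2, 3],  # cluster indices
)
assignments = assign_configurations(activity)
```

Here cluster 0 is chosen for the background, clusters 1 and 2 for A and B alone, and cluster 3 for the A+B configuration. Nothing in this sketch prevents one cluster from being assigned to several configurations.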
- get_additive_params(tfbs_positions, specific_clusters=None, zero_out_inactive=False, separate_background=True)[source]
Extract additive parameters for each TFBS by cropping from meta-attribution maps.
- Parameters:
tfbs_positions (pd.DataFrame) – DataFrame containing TFBS information (from get_tfbs_positions)
specific_clusters (list of int, optional) – List of one cluster per TFBS to use for cropping (e.g., [5, 17, 20, 23] for TFBSs A, B, C, D). If None, uses the average of all active clusters for each TFBS.
zero_out_inactive (bool, optional) – Controls how to handle positions within the cropped region:
  - False (default): Return the full cropped region (start to stop) with all positions
  - True: Return the full cropped region (start to stop), with inactive positions set to zero
separate_background (bool, optional) – Whether to use background-separated cluster maps (default: True). If True, uses meta_explainer.cluster_maps_no_bg if available. If False or if background-separated maps aren’t available, uses regular cluster maps.
- Returns:
Dictionary mapping TFBS IDs (A, B, C, etc.) to their 4xL parameter matrices. For each TFBS, the matrix is cropped from either:
- The cluster-averaged attribution map for the specified cluster, or
- The average of cluster-averaged attribution maps from all active clusters
The matrix always spans the full region (start to stop), with L = stop - start + 1. If zero_out_inactive=True, positions not in the TFBS’s Positions list are set to zero.
- Return type:
dict
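The cropping step can be sketched as follows, assuming a cluster-averaged map stored with positions as rows and nucleotides as columns (shape (L_total, 4)). The array layout and the helper name `crop_additive_params` are assumptions made for the example.

```python
import numpy as np

def crop_additive_params(cluster_map, start, stop, positions,
                         zero_out_inactive=False):
    """Crop a 4 x L block from a cluster-averaged map (illustrative).

    cluster_map: array of shape (L_total, 4), an assumed layout.
    start, stop: inclusive bounds of the TFBS region.
    positions: positions that belong to the TFBS.
    """
    cropped = cluster_map[start:stop + 1].T.copy()  # 4 x (stop - start + 1)
    if zero_out_inactive:
        keep = np.zeros(stop - start + 1, dtype=bool)
        keep[np.asarray(positions) - start] = True
        cropped[:, ~keep] = 0.0  # zero positions not in the TFBS
    return cropped

cluster_map = np.arange(1, 81, dtype=float).reshape(20, 4)
full = crop_additive_params(cluster_map, 5, 9, [5, 6, 9])
zeroed = crop_additive_params(cluster_map, 5, 9, [5, 6, 9], zero_out_inactive=True)
```

Both results have the same 4 x 5 shape; only the columns for positions 7 and 8 differ, being zero in `zeroed`.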
- get_epistatic_params(tfbs_positions, binding_config_assignments=None)[source]
Calculate epistatic interactions between TFBSs using Möbius inversion.
For each combination of TFBSs, calculates the interaction using the inclusion-exclusion principle. For example, for a 3-way interaction ABC:
I_ABC = y_ABC - y_AB - y_AC - y_BC + y_A + y_B + y_C - y₀
where y₀ is the value for the background configuration (no TFBSs).
- Parameters:
tfbs_positions (pd.DataFrame) – DataFrame containing TFBS information (from get_tfbs_positions)
binding_config_assignments (dict, optional) – Dictionary mapping TFBS binding configurations to cluster indices. If None, will use get_binding_config_assignments() with mode=’auto’ to get assignments.
- Returns:
Dictionary mapping TFBS combinations to their epistatic interaction values. Keys are tuples of TFBS IDs (e.g., (‘A’, ‘B’) for 2-way, (‘A’, ‘B’, ‘C’) for 3-way). Values are the calculated interaction terms using Möbius inversion.
- Return type:
dict
Notes
The epistatic interactions are calculated using Möbius inversion: for an n-way interaction, the term for each subset of size k enters with coefficient (-1)^(n-k). This ensures that:
- The interaction term captures the deviation from additivity
- Higher-order interactions are properly decomposed into their constituent terms
- The background configuration (empty set) is properly accounted for
For example:
- 2-way: I_AB = y_AB - y_A - y_B + y₀
- 3-way: I_ABC = y_ABC - y_AB - y_AC - y_BC + y_A + y_B + y_C - y₀
- 4-way: I_ABCD = y_ABCD - y_ABC - y_ABD - y_ACD - y_BCD + y_AB + y_AC + y_AD + y_BC + y_BD + y_CD - y_A - y_B - y_C - y_D + y₀
A positive interaction indicates synergy (combined effect > sum of individual effects), while a negative interaction indicates antagonism (combined effect < sum of individual effects).
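The inclusion-exclusion sums in the examples above can be reproduced with a short sketch. It assumes a dictionary `y` that maps each binding configuration (a sorted tuple of TFBS labels, with `()` for the background) to a scalar value; how those values are obtained from the assigned clusters is not shown, and the function name and numbers are illustrative.

```python
from itertools import combinations

def epistatic_interactions(y):
    """Interaction terms via inclusion-exclusion (Möbius inversion).

    y: dict mapping tuples of TFBS labels to scalars; () is the background.
    Returns a dict with one entry per combination of two or more TFBSs.
    """
    labels = sorted({t for combo in y for t in combo})
    interactions = {}
    for n in range(2, len(labels) + 1):
        for combo in combinations(labels, n):
            total = 0.0
            for k in range(n + 1):
                sign = (-1) ** (n - k)  # subset of size k in an n-way term
                for subset in combinations(combo, k):
                    total += sign * y[subset]
            interactions[combo] = total
    return interactions

# Hypothetical values: background 1.0, A alone 3.0, B alone 2.0, A+B 6.0.
y = {(): 1.0, ("A",): 3.0, ("B",): 2.0, ("A", "B"): 6.0}
interactions = epistatic_interactions(y)  # I_AB = 6 - 3 - 2 + 1
```

In this toy case I_AB is positive, so A and B together exceed the sum of their individual effects over background. For a purely additive set of values, every interaction term comes out as zero.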
Saving and Loading
If the epistatic parameters are saved as a NumPy file (.npy), these parameters can be loaded as follows:
```python
import numpy as np

# Load the saved parameters
epistatic_params = np.load('path/to/identified_parameters/epistatic_params.npy',
                           allow_pickle=True).item()

# The loaded data will be a dictionary where:
# - Keys are tuples of TFBS IDs (e.g., ('A', 'B') for pairwise interactions)
# - Values are the interaction terms (float values)

# Example usage:
# Get a pairwise interaction
ab_interaction = epistatic_params[('A', 'B')]

# Get a higher-order interaction
abc_interaction = epistatic_params[('A', 'B', 'C')]
```
Note: allow_pickle=True is required because the dictionary is stored as a pickled Python object inside a zero-dimensional NumPy array; .item() then extracts the dictionary from that array.
- plot_epistatic_interactions(epistatic_params, tfbs_positions=None, pairwise_only=False, annotate=True, cmap='RdBu_r', figsize=(10, 8), save_path=None, dpi=200, file_format='png')[source]
Plot epistatic interactions between TFBSs.
Creates two visualizations:
1. A lower triangular heatmap showing pairwise interactions (excluding diagonal)
2. A bar plot showing higher-order interactions (if any exist)
- Parameters:
epistatic_params (dict) – Dictionary mapping TFBS combinations to their interaction values
tfbs_positions (pandas.DataFrame, optional) – DataFrame containing TFBS positions, used for consistent ordering
pairwise_only (bool, default=False) – If True, only plot pairwise interactions
annotate (bool, default=True) – Whether to show interaction values on the heatmap
cmap (str, default='RdBu_r') – Colormap for the heatmap
figsize (tuple, default=(10, 8)) – Figure size for the heatmap
save_path (str, optional) – Directory to save the plots
dpi (int, default=200) – DPI for saved figures
file_format (str, default='png') – Format for saved figures
- Returns:
- (fig_heatmap, ax_heatmap) if pairwise_only=True
- ((fig_heatmap, ax_heatmap), (fig_bar, ax_bar)) if pairwise_only=False
- Return type:
tuple
Utils
Utility functions for the SEAM-NN package: core functionality for sequence processing, data handling, and computation.
- seam.utils.suppress_warnings()[source]
Suppress common warnings for cleaner output.
- Return type:
None
- seam.utils.get_device(gpu=False)[source]
Get appropriate compute device.
- Return type:
Optional[str]
- seam.utils.arr2pd(x, alphabet=['A', 'C', 'G', 'T'])[source]
Convert array to pandas DataFrame with proper column headings.
- Return type:
DataFrame
- seam.utils.oh2seq(one_hot, alphabet=['A', 'C', 'G', 'T'], encoding=1)[source]
Convert one-hot encoding to sequence.
- Return type:
str
- seam.utils.seq2oh(seq, alphabet=['A', 'C', 'G', 'T'], encoding=1)[source]
Convert sequence to one-hot encoding.
- Return type:
ndarray
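The two conversions are inverses of each other for unambiguous sequences. The sketch below shows the idea with plain NumPy; the function names `seq_to_one_hot` and `one_hot_to_seq` are hypothetical stand-ins rather than the SEAM functions, and the `encoding` argument is not modeled because its meaning is not documented here.

```python
import numpy as np

ALPHABET = ["A", "C", "G", "T"]

def seq_to_one_hot(seq, alphabet=ALPHABET):
    """Encode a sequence string as an (L, A) one-hot array."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    one_hot = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    one_hot[np.arange(len(seq)), [index[ch] for ch in seq]] = 1.0
    return one_hot

def one_hot_to_seq(one_hot, alphabet=ALPHABET):
    """Decode an (L, A) one-hot array back into a sequence string."""
    return "".join(alphabet[i] for i in np.argmax(one_hot, axis=-1))

encoded = seq_to_one_hot("ACGTTGCA")
decoded = one_hot_to_seq(encoded)
```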