API Reference

Compiler

class seam.compiler.Compiler(x, y, x_ref=None, y_bg=None, alphabet=None, gpu=False)[source]

Bases: object

Compiler: A utility for compiling sequence analysis data into a standardized format

This implementation processes sequence data and associated metrics into a pandas DataFrame containing:

  • DNN predictions

  • Hamming distances (if reference sequence provided in x_ref)

  • Global Importance Analysis (GIA) scores (if background predictions provided in y_bg)

  • Sequence strings

Requirements:
  • numpy

  • pandas

  • scipy

__init__(x, y, x_ref=None, y_bg=None, alphabet=None, gpu=False)[source]

Initialize the Compiler.

Parameters:
  • x – One-hot sequences of shape (N, L, A)

  • y – DNN predictions of shape (N, 1)

  • x_ref – Optional reference sequence of shape (1, L, A)

  • y_bg – Optional background predictions of shape (N, 1)

  • alphabet – List of characters for sequence conversion (e.g., [‘A’, ‘C’, ‘G’, ‘T’])

  • gpu – Whether to use GPU-accelerated sequence conversion (default: False)

compile()[source]

Compile data into pandas DataFrame.
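
As a rough illustration of two of the columns listed above, the sketch below derives sequence strings and Hamming distances from one-hot arrays with plain NumPy. It is not the library's implementation: the helper names onehot_to_strings and hamming_to_ref are hypothetical, and the GIA column is left out because its exact definition is not given in this reference.

```python
import numpy as np

def onehot_to_strings(x, alphabet):
    """Convert one-hot sequences of shape (N, L, A) to a list of N strings."""
    idx = np.argmax(x, axis=-1)                  # (N, L) character indices
    letters = np.array(alphabet)
    return ["".join(letters[row]) for row in idx]

def hamming_to_ref(x, x_ref):
    """Count positions where each sequence differs from the reference (1, L, A)."""
    return np.sum(np.argmax(x, axis=-1) != np.argmax(x_ref, axis=-1), axis=-1)

alphabet = ["A", "C", "G", "T"]
eye = np.eye(4)
x_ref = eye[[0, 1, 2, 3]][None]                        # "ACGT", shape (1, 4, 4)
x = np.stack([eye[[0, 1, 2, 3]], eye[[0, 0, 2, 2]]])   # "ACGT" and "AAGG"

seqs = onehot_to_strings(x, alphabet)
dists = hamming_to_ref(x, x_ref)
```

Here the second sequence differs from the reference at two positions, so its Hamming distance is 2.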

Attributer

class seam.attributer.Attributer(model, method='saliency', task_index=None, batch_size=None, num_shuffles=20, compress_fun=<function reduce_mean>, pred_fun=None, gpu=True)[source]

Bases: object

Attributer: A unified interface for computing attribution maps in TensorFlow 2.x

This implementation is optimized for TensorFlow 2.x and provides GPU-accelerated implementations of common attribution methods:

  • Saliency Maps

  • SmoothGrad

  • Integrated Gradients

  • DeepSHAP (via the SHAP package; requires TensorFlow setup before initialization, see below)

  • ISM (In-Silico Mutagenesis)

Requirements:

  • tensorflow

  • numpy

  • tqdm

  • shap (for DeepSHAP only)

Key Features:

  • Batch processing for saliency, smoothgrad, integrated gradients, and ISM

  • DeepSHAP processes sequences one at a time (no batch mode)

  • GPU-optimized implementations for saliency, smoothgrad, and integrated gradients

  • Consistent interface across methods

  • Support for multi-head models

  • Memory-efficient processing of large datasets

  • Flexible sequence windowing for long sequences

Example usage:

    import tensorflow as tf
    from seam.attributer import Attributer

    # Basic usage with output reduction function
    attributer = Attributer(
        model,
        method='saliency',
        task_index=0,                       # Select first output head
        compress_fun=tf.math.reduce_mean,   # Reduce selected output to scalar
        pred_fun=None                       # Not used for gradient-based methods
    )

    # Example with ChromBPNet compression functions
    attributer = Attributer(
        model,
        method='deepshap',
        task_index=0,                       # Select first output head
        compress_fun=Attributer.bpnet_profile_deepshap,  # ChromBPNet profile compression with stop_gradient
        pred_fun=None
    )

    # Example with ISM (forward-pass method)
    attributer = Attributer(
        model,
        method='ism',
        task_index=0,                       # Select first output head
        compress_fun=tf.math.reduce_mean,   # Reduce selected output to scalar
        pred_fun=model.predict_on_batch     # Optional: use predict_on_batch for ISM
    )

    # Computing attributions for a specific window while maintaining full context
    attributions = attributer.compute(
        x=input_sequences,        # Shape: (N, window_size, A)
        x_ref=reference_sequence, # Shape: (1, full_length, A)
        save_window=[100, 200],   # Compute attributions for positions 100-200
        batch_size=128
    )

    # Method-specific parameters
    attributions = attributer.compute(
        x=input_sequences,
        num_steps=20,              # for intgrad
        num_samples=20,            # for smoothgrad
        multiply_by_inputs=False,  # for intgrad
        log2fc=False               # for ism
    )

Note: For optimal performance, ensure TensorFlow is configured to use GPU acceleration.

DeepSHAP Requirements: DeepSHAP requires specific TensorFlow setup that must be done BEFORE creating the Attributer (because DeepSHAP was designed for earlier TensorFlow versions):

  1. Disable TensorFlow eager execution: tf.compat.v1.disable_eager_execution()

  2. Disable TensorFlow v2 behavior: tf.compat.v1.disable_v2_behavior()

  3. Load/reload the model from file after disabling eager execution

  4. Rebuild the model graph by passing a dummy input through it

  5. Configure SHAP op handlers for TensorFlow compatibility

Example setup sequence:

    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()
    tf.compat.v1.disable_v2_behavior()

    import shap
    shap.explainers.deep.deep_tf.op_handlers["AddV2"] = shap.explainers.deep.deep_tf.passthrough

    # model_path and custom_objects are supplied by the caller
    model = tf.keras.models.load_model(model_path, custom_objects=custom_objects)
    _ = model(tf.keras.Input(shape=model.input_shape[1:]))

    # Now create Attributer with the prepared model

SUPPORTED_METHODS = {'deepshap', 'intgrad', 'ism', 'saliency', 'smoothgrad'}

DEFAULT_BATCH_SIZES = {'intgrad': 128, 'ism': 32, 'saliency': 128, 'smoothgrad': 64}

__init__(model, method='saliency', task_index=None, batch_size=None, num_shuffles=20, compress_fun=<function reduce_mean>, pred_fun=None, gpu=True)[source]

Initialize the Attributer.

Parameters:
  • model – TensorFlow model to explain

  • method – Attribution method (default: ‘saliency’)

  • task_index – Index of output head to explain (optional):

      - For single-output models: leave as None (default)
      - For multi-output models: specify index (e.g., 0 for first output)
      - Setting task_index=0 with single-output models will cause errors

  • batch_size – Batch size for computing attributions (optional, defaults to method-specific size)

  • num_shuffles – Number of shuffles for DeepSHAP background (default: 20, matches ChromBPNet)

  • compress_fun – Function to compress model output to scalar (default: tf.math.reduce_mean)

  • pred_fun – Function to use for model predictions in forward-pass methods like ISM. Not used for gradient-based methods (saliency, smoothgrad, intgrad). Default: model.__call__

  • gpu – Whether to use GPU-optimized implementation (default: True)

saliency(X, batch_size=None)[source]

Compute saliency maps in batches.

smoothgrad(X, num_samples=20, mean=0.0, stddev=0.1, gpu=True, **kwargs)[source]

Compute SmoothGrad attribution maps.

Parameters:
  • X – Input tensor of shape (batch_size, L, A)

  • num_samples – Number of noisy samples

  • mean – Mean of noise

  • stddev – Standard deviation of noise

  • gpu – Whether to use GPU-optimized implementation

  • **kwargs – Additional arguments (ignored)

Returns:

Attribution maps of shape (batch_size, L, A)

Return type:

numpy.ndarray
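
To make the roles of num_samples, mean, and stddev concrete, here is a minimal NumPy sketch of the SmoothGrad idea: gradients are averaged over noisy copies of the input. It is an illustration under stated assumptions, not the library's TensorFlow implementation; the toy model and its analytic gradient grad_fn are hypothetical stand-ins for a DNN.

```python
import numpy as np

def smoothgrad(x, grad_fn, num_samples=20, mean=0.0, stddev=0.1, seed=0):
    """Average grad_fn over num_samples noisy copies of x (shape (L, A))."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(num_samples):
        noise = rng.normal(loc=mean, scale=stddev, size=x.shape)
        total += grad_fn(x + noise)
    return total / num_samples

# Toy differentiable "model": f(x) = sum(w * x**2), so df/dx = 2 * w * x.
w = np.arange(8, dtype=float).reshape(4, 2)
grad_fn = lambda x: 2.0 * w * x

x = np.ones((4, 2))
attr = smoothgrad(x, grad_fn, num_samples=20, stddev=0.1)
```

With stddev=0.0 the noise vanishes and the result reduces to the plain gradient, i.e. an ordinary saliency map.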

intgrad(X, baseline_type='zeros', num_steps=20, gpu=True, multiply_by_inputs=False, seed=None)[source]

Compute Integrated Gradients attribution maps.

Parameters:
  • X (array-like) – Input sequences

  • baseline_type (str) – Type of baseline to use:

      - ‘zeros’: Zero baseline (default)
      - ‘random_shuffle’: Random shuffle of input sequence
      - ‘dinuc_shuffle’: Dinucleotide-preserving shuffle of input sequence

  • num_steps (int) – Number of steps for integration

  • gpu (bool) – Whether to use GPU-optimized implementation

  • multiply_by_inputs (bool) – Whether to multiply gradients by inputs

  • seed (int, optional) – Random seed for reproducibility in shuffling methods

Returns:

Attribution maps

Return type:

array-like
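
The sketch below shows how num_steps and a zeros baseline enter the standard Integrated Gradients formulation, using a midpoint Riemann sum over the straight path from baseline to input. It is a NumPy illustration with a hypothetical toy model (f and grad_fn), not the library's code. It always scales the averaged gradient by (x - baseline); how multiply_by_inputs changes this in the library is not specified in this reference.

```python
import numpy as np

def integrated_gradients(x, grad_fn, baseline=None, num_steps=20):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    if baseline is None:
        baseline = np.zeros_like(x)              # 'zeros' baseline
    # Midpoints of num_steps equal sub-intervals of [0, 1]
    alphas = (np.arange(num_steps) + 0.5) / num_steps
    avg_grad = np.zeros_like(x, dtype=float)
    for a in alphas:
        avg_grad += grad_fn(baseline + a * (x - baseline))
    avg_grad /= num_steps
    # Scale the path-averaged gradient by the input difference
    return (x - baseline) * avg_grad

# Toy model: f(x) = sum(w * x**2) with analytic gradient 2 * w * x.
w = np.array([[1.0, 2.0], [3.0, 4.0]])
f = lambda x: np.sum(w * x**2)
grad_fn = lambda x: 2.0 * w * x

x = np.array([[1.0, 0.0], [0.0, 1.0]])
attr = integrated_gradients(x, grad_fn)
```

A useful sanity check is completeness: the attributions should sum to f(x) - f(baseline), which holds for this toy model.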

ism(X, log2fc=False, gpu=True, snv_window=None)[source]

Compute In-Silico Mutagenesis attribution maps.

Parameters:
  • X – Input tensor of shape (batch_size, L, A)

  • log2fc – Whether to compute log2 fold change instead of difference

  • gpu – Whether to attempt GPU-optimized implementation

  • snv_window – Optional [start, end] positions to compute variants for. If None, compute for all positions.

Returns:

Attribution maps of shape (batch_size, L, A)

Return type:

numpy.ndarray
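
For intuition, the following NumPy sketch scores every single-character substitution of one sequence against the unmutated prediction, either as a difference or as a log2 fold change. It is a simplified, unbatched illustration, not the library's implementation; the linear predict function is a hypothetical stand-in for a model's forward pass, and snv_window is not modeled.

```python
import numpy as np

def ism(x, predict, log2fc=False):
    """Score every single-character substitution of one sequence x (L, A)."""
    L, A = x.shape
    ref_pred = predict(x)
    scores = np.zeros((L, A))
    for pos in range(L):
        for char in range(A):
            mutant = x.copy()
            mutant[pos] = 0.0          # clear the original character
            mutant[pos, char] = 1.0    # substitute the new character
            pred = predict(mutant)
            if log2fc:
                scores[pos, char] = np.log2(pred / ref_pred)
            else:
                scores[pos, char] = pred - ref_pred
    return scores

# Toy predictor: a positive linear score over the one-hot input.
w = np.arange(1, 13, dtype=float).reshape(3, 4)
predict = lambda seq: float(np.sum(w * seq))

x = np.eye(4)[[0, 1, 2]]               # one-hot "ACG", shape (3, 4)
scores = ism(x, predict)
```

Entries that correspond to the original character are zero by construction, since the "mutant" is identical to the input.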

compute(x, x_ref=None, batch_size=128, save_window=None, **kwargs)[source]

Compute attribution maps.

Parameters:
  • x – One-hot sequences (shape: (N, L, A))

  • x_ref – One-hot reference sequence (shape: (1, L, A)) for windowed analysis. Not used for DeepSHAP background data, which is handled during initialization.

  • batch_size – Number of attribution maps per batch (ignored for DeepSHAP)

  • save_window – Window [start, stop] for computing attributions. If provided along with x_ref, the input sequences will be padded with the reference sequence outside this window. This allows computing attributions for a subset of positions while maintaining the full sequence context.

  • **kwargs – Additional arguments for specific attribution methods:

      - gpu: Whether to use GPU implementation (default: True)
      - log2fc (bool): Whether to compute log2 fold change (for ISM)
      - num_steps: Steps for integrated gradients (default: 50)
      - num_samples: Samples for smoothgrad (default: 50)
      - mean, stddev: Parameters for smoothgrad noise
      - multiply_by_inputs: Whether to multiply gradients by inputs (default: False)
      - baseline_type: Background type for intgrad and deepshap (‘zeros’, ‘random_shuffle’, ‘dinuc_shuffle’)
      - background: Background sequences for DeepSHAP (shape: (N, L, A)); overrides baseline_type
      - snv_window: Window [start, end] for ISM to compute variants (default: None)

Returns:

Attribution maps (shape: (N, L, A))

Return type:

numpy.ndarray
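
The padding behavior described for save_window can be pictured with the NumPy sketch below, which embeds windowed sequences into copies of the reference. This is an assumption-laden illustration rather than the library's code: the helper name pad_with_reference is hypothetical, and the window is assumed to be half-open, i.e. [start, stop).

```python
import numpy as np

def pad_with_reference(x, x_ref, save_window):
    """Embed windowed sequences x (N, W, A) into copies of x_ref (1, L, A)."""
    start, stop = save_window
    if stop - start != x.shape[1]:
        raise ValueError("window length must match x.shape[1]")
    x_full = np.repeat(x_ref, x.shape[0], axis=0)   # (N, L, A), new array
    x_full[:, start:stop, :] = x
    return x_full

A, L = 4, 10
x_ref = np.eye(A)[np.zeros(L, dtype=int)][None]     # all-"A" reference, (1, 10, 4)
x = np.eye(A)[np.full((3, 4), 2)]                   # three all-"G" windows, (3, 4, 4)
x_full = pad_with_reference(x, x_ref, save_window=[2, 6])
```

Each padded sequence then carries the full-length context, while only positions inside the window differ from the reference.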

show_params(method=None)[source]

Show available parameters for attribution methods.

Parameters:

method – Specific method to show params for. If None, shows all methods.

static bpnet_profile(x)[source]

ChromBPNet profile compression function.

This function implements the ChromBPNet profile task compression. For DeepSHAP, x should be the model. For other methods, x should be the output tensor.

Parameters:

x – Model output tensor (profile logits) or model (for DeepSHAP)

Returns:

Weighted sum of mean-normalized logits

Return type:

tf.Tensor

static bpnet_counts(x)[source]

ChromBPNet counts compression function.

This function implements the ChromBPNet counts task compression. For DeepSHAP, x should be the model. For other methods, x should be the output tensor.

Parameters:

x – Model output tensor (counts logits) or model (for DeepSHAP)

Returns:

For DeepSHAP: sum of counts across output dimension

For other methods: tensor as-is (no reduction)

Return type:

tf.Tensor

Clusterer

class seam.clusterer.Clusterer(attribution_maps, method='umap', gpu=True)[source]

Bases: object

Clusterer: A unified interface for embedding and clustering attribution maps

This class provides common embedding and clustering methods for attribution maps.

Embedding Methods:

  • UMAP (requires umap-learn)

  • PHATE (requires phate)

  • t-SNE (requires openTSNE)

  • PCA (GPU-accelerated with cuML, CPU fallback with scikit-learn)

  • Diffusion Maps (not yet implemented)

Clustering Methods:

  • Hierarchical (GPU-optimized available)

  • K-means (GPU-accelerated with kmeanstf, CPU fallback with scikit-learn)

  • DBSCAN (requires scikit-learn)

Requirements:

  • numpy

  • scipy

  • scikit-learn (for PCA, K-means, DBSCAN)

Optional Requirements:

  • tensorflow (for GPU-accelerated hierarchical clustering)

  • cuml (for GPU-accelerated PCA)

  • kmeanstf (for GPU-accelerated K-means clustering)

  • umap-learn (for UMAP)

  • phate (for PHATE)

  • openTSNE (for t-SNE)

Additional Requirements:

  • scikit-learn (for clustering)

  • matplotlib (for visualization)

Example usage:

    # Initialize clusterer with attribution maps
    clusterer = Clusterer(
        maps,
        method='umap',
        n_components=2
    )

    # Compute embedding
    embedding = clusterer.embed()

    # For K-means or DBSCAN:
    clusters = clusterer.cluster(embedding, method='kmeans', n_clusters=10)

    # For hierarchical clustering:
    linkage = clusterer.cluster(method='hierarchical')
    # Then get cluster labels using different criteria:
    labels = clusterer.get_cluster_labels(linkage, criterion='distance', max_distance=8)
    # or
    labels, cut_level = clusterer.get_cluster_labels(linkage, criterion='maxclust', n_clusters=100)

SUPPORTED_METHODS = {'diffmap', 'pca', 'phate', 'tsne', 'umap'}

SUPPORTED_CLUSTERERS = {'dbscan', 'hierarchical', 'kmeans'}

__init__(attribution_maps, method='umap', gpu=True)[source]

Initialize the Clusterer.

Parameters:
  • attribution_maps – numpy array of shape (N, L, A) containing attribution maps

  • method – Embedding method (default: ‘umap’)

  • gpu – Whether to use GPU acceleration when available (default: True)

embed(**kwargs)[source]

Compute embedding using specified method.

Parameters:

**kwargs – Method-specific parameters. Can be passed directly or as a ‘kwargs’ dictionary.

Returns:

Embedded coordinates

Return type:

numpy.ndarray

cluster(embedding=None, method='kmeans', n_clusters=10, **kwargs)[source]

Cluster the embedded data.

Parameters:
  • embedding – Optional pre-computed embedding. If None, uses stored embedding

  • method – Clustering method (‘kmeans’, ‘dbscan’, or ‘hierarchical’)

  • n_clusters – Number of clusters for kmeans

  • **kwargs – Additional clustering parameters.

    For DBSCAN:

      - eps: Maximum distance between samples (default: 0.01)
      - min_samples: Minimum samples per cluster (default: 10)

    For KMeans:

      - random_state: Random seed (default: 0)
      - n_init: Number of initializations (default: 10)
      - max_iter: Maximum iterations (default: 300 for GPU, sklearn default for CPU)

    For Hierarchical:

      - batch_size: Batch size for GPU computation (default: 10000)
      - link_method: Linkage method (default: ‘ward’)
      - dist_fname: Temporary file for distance matrix
      - store_distances: Whether to return distances (default: False)

Returns:

  • For kmeans/dbscan: cluster labels for each sample

  • For hierarchical: linkage matrix for hierarchical clustering (use get_cluster_labels() to obtain cluster assignments)

  • For hierarchical with store_distances=True: tuple (linkage_matrix, distance_matrix)

Return type:

numpy.ndarray for kmeans/dbscan; linkage matrix (as from scipy.cluster.hierarchy.linkage) for hierarchical; tuple if store_distances=True

normalize(embedding, to_sum=False, copy=True)[source]

Normalize embedding vectors to [0,1] range.

Parameters:
  • embedding – Array of shape (n_samples, n_dimensions)

  • to_sum – If True, normalize to sum=1. If False, normalize to range [0,1]

  • copy – If True, operate on a copy of the data

Returns:

Normalized embedding

Return type:

numpy.ndarray

plot_embedding(embedding, labels=None, dims=[0, 1], normalize=False, cmap='jet', s=2.5, alpha=1.0, linewidth=0.1, colorbar_label=None, sort_order=None, ref_index=None, legend_loc='upper left', figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot embedding and optionally color by labels/values.

Parameters:
  • embedding – Array of shape (n_samples, n_dimensions)

  • labels – Values for coloring points. Can be:

      - numpy array of shape (N,) or (N,1)
      - pandas Series/DataFrame column (e.g., mave[‘DNN’])
      - None (points will be single color)

  • dims – Which dimensions to plot [dim1, dim2]

  • normalize – Whether to normalize embedding to [0,1] range

  • cmap – Colormap for points (e.g., ‘viridis’, ‘jet’, ‘tab10’):

      - Use ‘viridis’/‘jet’ for continuous values
      - Use ‘tab10’/‘Set3’ for discrete clusters

  • s – Point size (default: 2.5)

  • alpha – Point transparency (default: 1.0)

  • linewidth – Width of point edges (default: 0.1)

  • colorbar_label – Label for colorbar (if None, no colorbar shown)

  • sort_order – Order to plot points (‘ascending’, ‘descending’, or None):

      - Useful for ensuring important points are plotted on top
      - Points are sorted based on their label values
      - Works with both numpy arrays and pandas Series/DataFrames

  • ref_index – Index of reference/wild-type sequence to highlight (default: None). Shown as a black star on the plot.

  • legend_loc – Location of legend for reference sequence (‘best’, ‘upper left’, ‘upper right’, etc.)

  • figsize – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • save_path – Path to save figure (if None, displays plot)

  • dpi – DPI for saved figure (default: 200)

  • file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

Example usage:

    # Basic plot with reference sequence
    clusterer.plot_embedding(embedding, ref_index=0)

    # Color by DNN predictions with colorbar and reference
    clusterer.plot_embedding(
        embedding,
        labels=mave['DNN'],               # or y_mut numpy array
        colorbar_label='DNN prediction',
        sort_order='descending',          # high predictions on top
        ref_index=ref_idx
    )

plot_histogram(embedding, dims=[0, 1], bins=101, cmap='viridis', colorbar_label='Count', figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot 2D histogram of embedding points.

Parameters:
  • embedding – Array of shape (n_samples, n_dimensions)

  • dims – Which dimensions to plot [dim1, dim2]

  • bins – Number of bins for histogram (default: 101)

  • cmap – Colormap for histogram (default: ‘viridis’)

  • colorbar_label – Label for colorbar (if None, shows ‘Count’)

  • figsize – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • save_path – Path to save figure (if None, displays plot)

  • dpi – DPI for saved figure (default: 200)

  • file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

plot_dendrogram(linkage, figsize=(15, 10), leaf_rotation=90, leaf_font_size=8, cut_level=None, save_path=None, dpi=200, file_format='png', ax=None, truncate=True, cut_level_truncate=None, criterion=None, n_clusters=None, gui=False)[source]

Plot dendrogram from hierarchical clustering linkage matrix.

Parameters:
  • linkage – Hierarchical clustering linkage matrix

  • figsize – Figure size (width, height)

  • leaf_rotation – Rotation of leaf labels

  • leaf_font_size – Font size for leaf labels

  • cut_level – Optional height at which to draw horizontal cut line

  • save_path – Path to save figure (if None, displays plot)

  • dpi – DPI for saved figure (default: 200)

  • file_format – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

  • ax – Matplotlib axis to plot on (for GUI use). If provided, plots on existing axis instead of creating new figure

  • truncate – Whether to truncate dendrogram for large datasets (for GUI use). Only used when ax is provided

  • cut_level_truncate – Height at which to truncate dendrogram (for GUI use). Used with truncate=True

  • criterion – Clustering criterion (‘maxclust’ or ‘distance’) for truncation calculation. Used with truncate=True

  • n_clusters – Number of clusters (for maxclust criterion) for truncation calculation. Used with truncate=True and criterion=’maxclust’

  • gui – Whether to apply GUI-specific styling (smaller fonts, removed spines, etc.) (default: False)

get_cluster_labels(linkage, criterion='maxclust', max_distance=10, n_clusters=200)[source]

Get cluster labels from a linkage matrix.

Parameters:
  • linkage – Linkage matrix from scipy.cluster.hierarchy.linkage

  • criterion – How to form flat clusters (‘distance’ or ‘maxclust’):

      - ‘distance’: Cut tree at specified height
      - ‘maxclust’: Produce specified number of clusters

  • max_distance – Maximum cophenetic distance within clusters (only used if criterion=’distance’)

  • n_clusters – Desired number of clusters to produce (only used if criterion=’maxclust’)

Returns:

  • Cluster labels (zero-indexed)

  • Cut level (max_distance if criterion=’distance’, or computed level if criterion=’maxclust’)

Return type:

tuple of (numpy.ndarray, float)
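
The two criteria correspond to the options of the same names in SciPy's fcluster. The sketch below uses SciPy directly on toy data to show the difference. It is not the library's implementation, and in particular it does not reproduce how the cut level is computed for ‘maxclust’. SciPy labels start at 1, so 1 is subtracted here to obtain zero-indexed labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated groups of points in 2-D
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = linkage(points, method="ward")

# criterion='maxclust': request a number of flat clusters
labels_maxclust = fcluster(Z, t=2, criterion="maxclust") - 1

# criterion='distance': cut the tree at a fixed height
labels_distance = fcluster(Z, t=1.0, criterion="distance") - 1
```

On this toy data both criteria recover the same two groups, because the cut height of 1.0 lies between the small within-group merge heights and the large between-group one.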

MetaExplainer

class seam.meta_explainer.MetaExplainer(clusterer, mave_df, attributions, ref_idx=0, background_separation=False, mut_rate=0.1, sort_method='median', alphabet=None)[source]

Bases: object

A class for analyzing and visualizing attribution map clusters.

This class builds on the Clusterer class to provide detailed analysis and visualization of attribution map clusters.

Features

Analysis
  • Mechanism Summary Matrix (MSM) generation

  • Sequence logos and attribution logos

  • Cluster membership tracking

  • Background separation and noise reduction of attribution maps

Visualization
  • DNN score distributions per cluster

  • Sequence logos (PWM and enrichment)

  • Attribution logos (fixed and adaptive scaling)

  • Mechanism Summary Matrices

  • Cluster profile plots

Requirements

  • All requirements from Clusterer class

  • Biopython

  • Logomaker

  • Seaborn

  • SQUID-NN

__init__(clusterer, mave_df, attributions, ref_idx=0, background_separation=False, mut_rate=0.1, sort_method='median', alphabet=None)[source]

Initialize MetaExplainer with clusterer and data.

Parameters:
  • clusterer (Clusterer) – Initialized Clusterer object with clustering results.

  • mave_df (pandas.DataFrame) – DataFrame containing sequences and their scores. Must have columns:

      - ‘Sequence’: DNA/RNA sequences
      - ‘Score’ or ‘DNN’: Model predictions
      - ‘Cluster’: Cluster assignments

  • attributions (numpy.ndarray) – Attribution maps for sequences. Shape should be (n_sequences, seq_length, n_characters).

  • ref_idx (int, default=0) – Index of reference sequence in mave_df.

  • background_separation (bool, default=False) – Whether to separate background signal from logos.

  • mut_rate (float, default=0.10) – Mutation rate used for background sequence generation.

  • sort_method ({'median', 'visual', None}, default='median') – How to sort clusters in all visualizations and analyses:

      - ‘median’: Sort by median DNN score
      - ‘visual’: Sort based on hierarchical clustering of the MSM pattern
      - None: Use original cluster indices

  • alphabet (list of str, optional) – List of characters to use in sequence logos. Default is [‘A’, ‘C’, ‘G’, ‘T’].

get_cluster_order(sort_method='median', sort_indices=None)[source]

Get cluster ordering based on specified method.

plot_cluster_stats(plot_type='box', metric='prediction', save_path=None, show_ref=True, show_fliers=False, compact=False, fontsize=8, dpi=200, figsize=None, file_format='png')[source]

Plot cluster statistics with various visualization options.

Parameters:
  • plot_type ({'box', 'bar'}) – Type of visualization:

      - ‘box’: Show distribution as box plots (predictions only)
      - ‘bar’: Show bar plot of predictions or counts

  • metric ({'prediction', 'counts'}) – What to visualize (only used for bar plots):

      - ‘prediction’: DNN prediction scores
      - ‘counts’: Cluster occupancy/size

  • save_path (str, optional) – Path to save figure. If None, display instead

  • show_ref (bool) – If True and reference sequence exists, highlight its cluster

  • show_fliers (bool) – If True and plot_type=’box’, show outlier points

  • compact (bool) – If False (default), shows full boxplots. If True, uses a compact representation for boxplots with dots and IQR lines.

  • fontsize (int) – Font size for tick labels

  • dpi (int) – DPI for saved figure

  • figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

generate_msm(n_seqs=1000, batch_size=50, gpu=False)[source]

Generate a Mechanism Summary Matrix (MSM) from cluster attribution maps.

Parameters:
  • n_seqs (int, default=1000) – Number of sequences to generate per cluster.

  • batch_size (int, default=50) – Number of sequences to process in each batch.

  • gpu (bool, default=False) – Whether to use GPU acceleration if available.

Returns:

The Mechanism Summary Matrix with shape (n_clusters, n_clusters). Each entry [i,j] represents the average DNN score when applying cluster i’s mechanism to sequences from cluster j.

Return type:

numpy.ndarray

plot_msm(column='Entropy', delta_entropy=False, square_cells=False, view_window=None, show_tfbs_clusters=False, tfbs_clusters=None, entropy_multiplier=0.5, cov_matrix=None, row_order=None, revels=None, save_path=None, dpi=200, figsize=None, file_format='png', gui=False, gui_figure=None)[source]

Visualize the Mechanism Summary Matrix (MSM) as a heatmap.

Parameters:
  • column (str) – Which MSM metric to visualize:

      - ‘Entropy’: Shannon entropy of characters at each position per cluster
      - ‘Reference’: Percentage of mismatches to reference sequence
      - ‘Consensus’: Percentage of matches to cluster consensus sequence

  • delta_entropy (bool) – If True and column=’Entropy’, show change in entropy from background expectation (based on mutation rate)

  • square_cells (bool) – If True, set cells in MSM to be perfectly square

  • view_window (list of [start, end], optional) – If provided, crop the x-axis to this window of positions

  • show_tfbs_clusters (bool) – Whether to show TFBS cluster rectangles (default: False)

  • tfbs_clusters (dict, optional) – Dictionary mapping cluster IDs to lists of positions. Required if show_tfbs_clusters is True.

  • entropy_multiplier (float, optional) – Multiplier for entropy threshold when identifying background (default: 0.5)

  • cov_matrix (numpy.ndarray, optional) – Covariance matrix for TFBS cluster plotting. Required if show_tfbs_clusters is True.

  • row_order (list of int, optional) – Order of rows in cov_matrix. Required if show_tfbs_clusters is True.

  • revels (pandas.DataFrame, optional) – Revels matrix for entropy calculations. Required if show_tfbs_clusters is True.

  • save_path (str, optional) – Path to save figure. If None, display instead

  • dpi (int) – DPI for saved figure

  • figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

  • gui (bool) – If True, return data for GUI processing without plotting

  • gui_figure (matplotlib.figure.Figure, optional) – Existing figure to plot on when gui=True. If None, creates a new figure.
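
The ‘Entropy’ metric is described above as the Shannon entropy of characters at each position within a cluster. A minimal standard-library sketch of that quantity follows. The function name positional_entropy and the choice of bits (log base 2) are assumptions for illustration; the library's exact normalization is not documented here.

```python
import math
from collections import Counter

def positional_entropy(seqs):
    """Shannon entropy (bits) of the characters at each position of aligned sequences."""
    entropies = []
    for column in zip(*seqs):                    # iterate over alignment columns
        total = len(column)
        probs = [count / total for count in Counter(column).values()]
        entropies.append(sum(-p * math.log2(p) for p in probs))
    return entropies

# Sequences assigned to one hypothetical cluster
cluster_seqs = ["ACGT", "ACGA", "ACTC", "ACAG"]
entropy = positional_entropy(cluster_seqs)
```

Fully conserved positions (the first two here) have zero entropy, while the last position, where all four characters appear equally often, reaches the 2-bit maximum for a four-letter alphabet.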

generate_logos(logo_type='average', background_separation=False, mut_rate=0.01, entropy_multiplier=0.5, adaptive_background_scaling=False, figsize=(20, 2.5), batch_size=50, font_name='sans', stack_order='big_on_top', center_values=True, color_scheme='classic', font_weight=None, fade_below=0.5, shade_below=0.5, width=0.9)[source]

Generate sequence or attribution logos for each cluster.

This method creates visualization logos that represent either the average attribution patterns or sequence patterns within each cluster. It can optionally remove background signal to highlight cluster-specific patterns.

Parameters:
  • logo_type ({'average', 'pwm', 'enrichment'}, default='average') – Type of logo to generate:

      - ‘average’: Shows average attribution values (based on attribution maps)
      - ‘pwm’: Shows position weight matrix of nucleotide frequencies (based on sequence statistics)
      - ‘enrichment’: Shows nucleotide enrichment relative to background (based on sequence statistics)

  • background_separation (bool, default=False) – Whether to remove background signal from logos. Only applies to ‘average’ logos. When True, subtracts the background pattern computed by compute_background(), focused on highly variable positions.

  • mut_rate (float, default=0.01) – Mutation rate for background entropy calculation. Only used if background_separation=True.

  • entropy_multiplier (float, default=0.5) – Controls stringency of background position identification via a multiplier on the background entropy. Only used if background_separation=True.

  • adaptive_background_scaling (bool, default=False) – If True and background_separation=True, uniformly scales the background pattern differently for each cluster based on the magnitude of its background signal. This is useful when clusters have similar background patterns but at different scales.

  • figsize (tuple, default=(20, 2.5)) – Figure size in inches (width, height).

  • batch_size (int, default=50) – Number of logos to process in each batch.

  • font_name (str, default='sans') – Font name for logo text.

  • stack_order ({'big_on_top', 'small_on_top', 'fixed'}, default='big_on_top') – How to order nucleotides in each stack:

      - ‘big_on_top’: Largest values on top
      - ‘small_on_top’: Smallest values on top
      - ‘fixed’: Fixed order (A, C, G, T)

  • center_values (bool, default=True) – Whether to center values in each position. Only applies to ‘average’ logos.

  • color_scheme (str or dict, default='classic') – Color scheme for logo characters.

  • font_weight (str or int, optional) – Font weight for logo text. Can be string (‘normal’, ‘bold’, etc.) or numeric (0-1000).

  • fade_below (float, default=0.5) – Controls alpha transparency for negative values. Higher values make negative values more transparent.

  • shade_below (float, default=0.5) – Controls color darkening for negative values. Higher values make negative values darker.

  • width (float, default=0.9) – Controls the horizontal width of each character.

show_sequences(cluster_idx)[source]

Show sequences belonging to a specific cluster.

Parameters:

cluster_idx (int) – Index of cluster to show sequences for. If sorting was specified during initialization, this index refers to the sorted order (e.g., 0 is the first cluster after sorting).

Returns:

DataFrame containing sequences and scores for the specified cluster.

Return type:

pandas.DataFrame

plot_cluster_profiles(profiles, save_dir=None, dpi=200, figsize=None, file_format='png')[source]

Plot overlay of profiles associated with each cluster.

Parameters:
  • profiles (np.ndarray) – Array of profile data corresponding to sequences in mave_df

  • save_dir (str, optional) – Directory to save profile plots. If None, displays instead.

  • dpi (int) – DPI for saved figures

  • figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

compute_background(mut_rate=0.01, entropy_multiplier=0.5, adaptive_background_scaling=False, process_logos=True)[source]

Compute background signal based on entropic positions.

This method identifies and computes background signal patterns for each cluster based on positions with high entropy (high variability). The background can be computed either uniformly across all clusters or with cluster-specific scaling.

Parameters:
  • mut_rate (float, default=0.01) – Mutation rate used to calculate background entropy threshold. Higher values will identify more positions as entropic.

  • entropy_multiplier (float, default=0.5) – Factor to multiply background entropy by for threshold. Lower values make the threshold more stringent (fewer positions identified as entropic).

  • adaptive_background_scaling (bool, default=False) – If True, computes a scaling factor for each cluster that best matches the magnitude of that cluster’s background signal. This is useful when different clusters have similar background patterns but at different scales. If False, uses the same background scale for all clusters.

  • process_logos (bool, default=True) – If True, creates and processes BatchLogo instances for background visualization. If False, skips logo processing to save time and memory.

Notes

The background computation process:

  1. Identifies entropic (highly variable) positions in each cluster

  2. Computes the average attribution pattern at these positions

  3. If adaptive_background_scaling is True, computes a scaling factor for each cluster based on positions that are entropic in both that cluster and the global background

get_cluster_maps(cluster_idx)[source]

Get attribution maps belonging to a specific cluster.

Parameters:

cluster_idx (int) – Index of cluster to get maps for. If sorting was specified during initialization, this index refers to the sorted order (e.g., 0 is the first cluster after sorting).

Returns:

Attribution maps for the specified cluster.

Return type:

numpy.ndarray

plot_attribution_variation(scope='all', metric='std', save_path=None, view_window=None, figsize=None, dpi=600, colors=None, xtick_spacing=5, file_format='png')[source]

Visualize the variation in attribution values across attribution maps for each nucleotide position.

Parameters:
  • scope ({'all', 'clusters'}, default='all') – Scope of variation calculation:

      - ‘all’: Use all individual attribution maps
      - ‘clusters’: Use cluster-averaged attribution maps

  • metric ({'std', 'var'}, default='std') – Metric to use for variation calculation:

      - ‘std’: Standard deviation
      - ‘var’: Variance

  • save_path (str, optional) – Path to save figure. If None, display instead.

  • view_window (list of [start, end], optional) – If provided, crop the x-axis to this window of positions.

  • figsize (tuple, optional) – Figure size (width, height) in inches (default: None, uses matplotlib default)

  • dpi (int, default=600) – DPI for saved figure.

  • colors (dict, optional) – Dictionary mapping nucleotide indices to RGB colors. Default: {0: [0, .5, 0], 1: [0, 0, 1], 2: [1, .65, 0], 3: [1, 0, 0]} for A, C, G, T respectively.

  • xtick_spacing (int, default=5) – Show x-axis labels every nth position. Set to 1 to show all positions.

  • file_format (str, optional) – Format for saved figure (default: ‘png’). Common formats: ‘png’, ‘pdf’, ‘svg’, ‘eps’

Returns:

Array of variation values (std or var) for each position and nucleotide

Return type:

numpy.ndarray
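
The returned array can be pictured with a minimal NumPy sketch. It assumes the attribution maps are stacked as an array of shape (N, L, A); the helper name attribution_variation is hypothetical and is not part of the seam API, and the plotting step is left out.

```python
import numpy as np

def attribution_variation(maps, metric="std"):
    """Per-position, per-nucleotide variation across a stack of maps.

    maps: array of shape (N, L, A). Returns an array of shape (L, A).
    Hypothetical helper for illustration only.
    """
    if metric == "std":
        return maps.std(axis=0)
    if metric == "var":
        return maps.var(axis=0)
    raise ValueError("metric must be 'std' or 'var'")

# Toy data: 50 maps, 10 positions, 4 nucleotides
rng = np.random.default_rng(0)
toy_maps = rng.normal(size=(50, 10, 4))

variation = attribution_variation(toy_maps, metric="std")
print(variation.shape)
```

With scope='clusters', the same reduction would presumably be applied to the stack of cluster-averaged maps instead of the individual maps.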

Identifier

class seam.identifier.Identifier(msm_df, meta_explainer, column='Entropy')[source]

Bases: object

Class for identifying and analyzing transcription factor binding sites (TFBSs) from attribution maps.

The Identifier class takes attribution maps from a MetaExplainer and identifies distinct TFBSs by analyzing patterns of activity across clusters. It uses a multi-step process:

  1. Covariance Analysis:

    • Analyzes the covariance between positions in the attribution maps

    • Identifies regions that show coordinated activity across clusters

    • Uses hierarchical clustering to group positions into potential TFBSs

  2. TFBS Identification:

    • Defines TFBS regions based on clustered covariance patterns

    • Determines which clusters are active for each TFBS using entropy-based thresholds

    • Creates a binary or continuous binding configuration matrix showing TFBS activity levels in each cluster

  3. Binding Configuration Assignment:

    • Assigns clusters to specific TFBS binding configurations (e.g., A only, A+B, background)

    • Uses a distance-based scoring system to find the best cluster for each configuration

    • For background configuration, finds clusters with minimal TFBS activity across all TFBSs

Key Concepts:

  • TFBS Activity: Measured as 1 - (normalized entropy), where higher values indicate stronger TFBS activity in a cluster

  • Binding Configuration Matrix: Shows binary or continuous activity levels (0-1) for each TFBS in each cluster

  • Binding Configuration Assignments: Maps each possible TFBS combination to its optimal cluster

Parameters:
  • msm_df (pandas.DataFrame) – Mechanism Summary Matrix (MSM) data from MetaExplainer, containing entropy or other activity measures for each position in each cluster

  • meta_explainer (MetaExplainer) – Instance of MetaExplainer class that generated the attribution maps

  • column (str, optional) – Column from MSM to use for analysis (default: ‘Entropy’)

revels

Pivoted MSM data with clusters as rows and positions as columns

Type:

pandas.DataFrame

cov_matrix

Covariance matrix between positions, used for TFBS identification

Type:

pandas.DataFrame

tfbs_clusters

Dictionary mapping TFBS labels to their constituent positions

Type:

dict

entropy_multiplier

Threshold multiplier for determining active clusters

Type:

float

active_clusters_by_tfbs

Dictionary mapping TFBS labels to their active clusters

Type:

dict

Notes

The class uses entropy-based measures to identify TFBS activity, where:

  • Lower entropy indicates more specific, TFBS-like activity

  • Higher entropy indicates more background-like activity

  • Activity is normalized relative to background entropy to account for mutation rate and sequence composition

__init__(msm_df, meta_explainer, column='Entropy')[source]

Initialize Identifier with MSM data and MetaExplainer instance.

Parameters:
  • msm_df (pandas.DataFrame) – MSM data from MetaExplainer

  • meta_explainer (MetaExplainer) – Instance of MetaExplainer class

  • column (str, optional) – Column to use for analysis (default: ‘Entropy’)

cluster_msm_covariance(method='average', n_clusters=None, cut_height=None)[source]

Cluster the covariance matrix using hierarchical clustering.

Parameters:
  • method (str, optional) – Linkage method for hierarchical clustering (default: ‘average’)

  • n_clusters (int, optional) – Number of clusters to form. If None, will use cut_height or automatic detection. Note: This is the number of clusters BEFORE removing the largest cluster.

  • cut_height (float, optional) – Height at which to cut the dendrogram. If None and n_clusters is None, will use automatic gap detection.

Returns:

Dictionary mapping cluster labels to positions

Return type:

dict
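
As a rough illustration of how n_clusters and cut_height could select a flat clustering, the sketch below uses scipy.cluster.hierarchy directly. It is not the library's implementation: the helper name cluster_positions is hypothetical, rows of the covariance matrix are treated as observations with SciPy's default Euclidean metric, and neither the automatic gap detection nor the removal of the largest cluster is reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_positions(cov, method="average", n_clusters=None, cut_height=None):
    """Group positions by hierarchical clustering of covariance rows.

    Returns {cluster_label: [position indices]}. Hypothetical helper.
    """
    Z = linkage(cov, method=method)  # each row of cov is one observation
    if n_clusters is not None:
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    elif cut_height is not None:
        labels = fcluster(Z, t=cut_height, criterion="distance")
    else:
        raise ValueError("this sketch needs n_clusters or cut_height")
    groups = {}
    for pos, lab in enumerate(labels):
        groups.setdefault(int(lab), []).append(pos)
    return groups

# Toy covariance: positions 0-2 covary, positions 3-5 covary, 6-7 are flat
cov = np.zeros((8, 8))
cov[0:3, 0:3] = 1.0
cov[3:6, 3:6] = 1.0

groups = cluster_positions(cov, n_clusters=3)
print(sorted(groups.values()))
```

In this toy case the two covarying blocks and the flat remainder come out as three separate groups; the numeric labels assigned by SciPy are arbitrary.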

plot_pairwise_matrix(theta_lclc, view_window=None, threshold=None, cbar_title='Pairwise', gridlines=True, xtick_spacing=1, figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot pairwise matrix visualization. Adapted from https://github.com/jbkinney/mavenn/blob/master/mavenn/src/visualization.py Original authors: Tareen, A. and Kinney, J.

Parameters:
  • theta_lclc (np.ndarray) – Pairwise matrix parameters (shape: (L,C,L,C))

  • view_window (tuple, optional) – (start, end) positions to view

  • threshold (float, optional) – Threshold for matrix values

  • cbar_title (str, optional) – Title for colorbar

  • gridlines (bool, optional) – Whether to show gridlines

  • xtick_spacing (int, optional) – Show every nth x-tick label (default: 1)

  • figsize (tuple, optional) – Figure size (width, height) in inches

  • save_path (str, optional) – Path to save the figure

  • dpi (int, optional) – DPI for saved figure (default: 200)

  • file_format (str, optional) – Format for saved figure (default: ‘png’)

plot_msm_covariance_triangular(view_window=None, xtick_spacing=5, show_clusters=False, figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot the covariance matrix.

Parameters:
  • view_window (tuple, optional) – (start, end) positions to view

  • xtick_spacing (int, optional) – Show every nth x-tick label (default: 5)

  • show_clusters (bool, optional) – Whether to show TFBS cluster rectangles (default: False)

  • figsize (tuple, optional) – Figure size (width, height) in inches

  • save_path (str, optional) – Directory to save the plot

  • dpi (int, optional) – DPI for saved figure (default: 200)

  • file_format (str, optional) – Format for saved figure (default: ‘png’)

plot_msm_covariance_dendrogram(figsize=(15, 10), leaf_rotation=90, leaf_font_size=8, save_path=None, dpi=200, file_format='png')[source]

Plot the dendrogram from hierarchical clustering.

Parameters:
  • figsize (tuple, optional) – Figure size (width, height) in inches

  • leaf_rotation (float, optional) – Rotation angle for leaf labels (default: 90)

  • leaf_font_size (int, optional) – Font size for leaf labels (default: 8)

  • save_path (str, optional) – Path to save figure (if None, displays plot)

  • dpi (int, optional) – DPI for saved figure (default: 200)

  • file_format (str, optional) – Format for saved figure (default: ‘png’)

plot_msm_covariance_square(view_window=None, show_clusters=True, view_linkage_space=False, figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot covariance matrix in square format using seaborn heatmap.

Parameters:
  • view_window (tuple, optional) – (start, end) positions to view in nucleotide position space. Note: Disabled when view_linkage_space is True.

  • show_clusters (bool, optional) – Whether to show TFBS cluster rectangles. Only available in nucleotide position space.

  • view_linkage_space (bool, optional) – If True, shows matrix reordered by hierarchical clustering linkage. If False (default), shows matrix in original nucleotide position space. Note: cluster visualization and view_window are disabled in linkage space.

  • figsize (tuple, optional) – Figure size (width, height) in inches

  • save_path (str, optional) – Path to save figure

  • dpi (int, optional) – DPI for saved figure (default: 200)

  • file_format (str, optional) – Format for saved figure (default: ‘png’)

set_entropy_multiplier(entropy_multiplier)[source]

Set the entropy multiplier for TFBS activity detection.

This value is used to determine which clusters are considered active for each TFBS region based on their entropy values.

Parameters:

entropy_multiplier (float) – Multiplier for background entropy threshold. Lower values result in more clusters being considered active.

get_tfbs_positions(active_clusters)[source]

Get the start and stop positions for each TFBS cluster.

Parameters:

active_clusters (dict) – Dictionary mapping TFBS labels to active clusters

Returns:

DataFrame containing start, stop, length, positions, and active clusters for each TFBS, sorted by start position and labeled alphabetically (A, B, C, etc.)

Return type:

pd.DataFrame

get_binding_config_matrix(active_clusters, mode='binary')[source]

Create a binding configuration matrix showing TFBS activity in each cluster.

Parameters:
  • active_clusters (dict) – Dictionary mapping TFBS labels to active clusters

  • mode (str) –

    • ‘binary’: 0/1 for inactive/active

    • ‘continuous’: normalized activity values (1 - normalized entropy), where higher values indicate more activity

Returns:

Binding configuration matrix with clusters as rows and TFBSs as columns

Return type:

pd.DataFrame
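
The two modes can be sketched with plain NumPy arrays. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the binary matrix is built directly from the active_clusters dictionary, and "normalized entropy" is assumed to mean the mean entropy of a TFBS region divided by a background entropy and clipped to [0, 1].

```python
import numpy as np

def binary_config_matrix(active_clusters, n_clusters):
    """Rows are clusters, columns are TFBS labels; 1 marks an active cluster."""
    labels = sorted(active_clusters)
    mat = np.zeros((n_clusters, len(labels)), dtype=int)
    for j, label in enumerate(labels):
        mat[list(active_clusters[label]), j] = 1
    return labels, mat

def continuous_config_matrix(mean_entropy, background_entropy):
    """Activity = 1 - normalized entropy (normalization is an assumption)."""
    normalized = np.clip(mean_entropy / background_entropy, 0.0, 1.0)
    return 1.0 - normalized

# Toy input: TFBS A is active in clusters 1 and 2, TFBS B in clusters 2 and 3
active = {"A": [1, 2], "B": [2, 3]}
labels, binary = binary_config_matrix(active, n_clusters=4)

# Toy mean entropies per cluster (rows) and TFBS (columns)
mean_entropy = np.array([[1.0, 1.0], [0.25, 1.0], [0.25, 0.5], [1.0, 0.5]])
continuous = continuous_config_matrix(mean_entropy, background_entropy=1.0)
print(labels)
print(binary)
print(continuous)
```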

plot_binding_config_matrix(active_clusters, mode='binary', orientation='vertical', figsize=None, save_path=None, dpi=200, file_format='png')[source]

Plot binding configuration matrix showing TFBS activity in each cluster.

Parameters:
  • active_clusters (dict) – Dictionary mapping TFBS labels to active clusters

  • mode (str) –

    • ‘binary’: dark gray/white for active/inactive

    • ‘continuous’: grayscale for activity level

  • orientation (str) –

    • ‘vertical’: Clusters on y-axis, TFBS on x-axis (default)

    • ‘horizontal’: TFBS on y-axis, Clusters on x-axis

  • figsize (tuple, optional) – Figure size (width, height) in inches

  • save_path (str, optional) – Path to save the figure. If None, displays plot.

  • dpi (int, optional) – DPI for saved figure (default: 200)

  • file_format (str, optional) – Format for saved figure (default: ‘png’)

get_binding_config_assignments(tfbs_positions, mode='auto', print_template=False)[source]

Assign clusters to specific TFBS binding configurations based on their activity patterns.

This function analyzes the continuous activity levels of TFBSs across clusters to find the optimal cluster for each possible TFBS binding configuration. For example, it will find:

  • Which cluster best represents TFBS A alone

  • Which cluster best represents TFBS B alone

  • Which cluster best represents the combined presence of TFBSs A and B

  • Which cluster best represents the background configuration (no TFBSs active)

The scoring system works by:

  1. For each binding configuration, defining an “ideal” activity pattern where:

    • Desired TFBS(s) have maximum observed activity

    • Other TFBSs have minimum observed activity

  2. Calculating how far each cluster’s activity pattern is from this ideal

  3. Selecting the cluster that minimizes this distance

For example, when finding a cluster for TFBS A:

  • The ideal configuration would have maximum activity for A and minimum for others

  • Each cluster’s score is based on how close it comes to this ideal

  • The cluster with the highest score (smallest distance from ideal) is selected
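
The scoring steps described above can be sketched as follows. This is a minimal illustration, assuming a continuous activity matrix with clusters as rows and TFBSs as columns and a Euclidean distance to the ideal pattern; the distance measure and the helper name assign_configurations are assumptions, not the library's implementation.

```python
import numpy as np
from itertools import combinations

def assign_configurations(activity, labels):
    """Map each TFBS combination to the cluster closest to its ideal pattern.

    activity: (n_clusters, n_tfbs) array of activity levels in [0, 1].
    Returns {tuple of TFBS labels: cluster index}. Hypothetical helper.
    """
    col_max = activity.max(axis=0)  # maximum observed activity per TFBS
    col_min = activity.min(axis=0)  # minimum observed activity per TFBS
    n = len(labels)
    assignments = {}
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            wanted = np.array([i in combo for i in range(n)])
            ideal = np.where(wanted, col_max, col_min)
            dist = np.linalg.norm(activity - ideal, axis=1)
            assignments[tuple(labels[i] for i in combo)] = int(dist.argmin())
    return assignments

# Toy activity levels for 4 clusters and TFBSs A and B
activity = np.array([
    [0.05, 0.05],  # cluster 0: neither site active
    [0.90, 0.10],  # cluster 1: A only
    [0.10, 0.80],  # cluster 2: B only
    [0.85, 0.90],  # cluster 3: A and B
])
assignments = assign_configurations(activity, ["A", "B"])
print(assignments)
# {(): 0, ('A',): 1, ('B',): 2, ('A', 'B'): 3}
```

The empty tuple plays the role of the background configuration, matching the key convention used in the return value of this method.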

Parameters:
  • tfbs_positions (pd.DataFrame) – DataFrame from get_tfbs_positions containing TFBS information. Must have columns: ‘TFBS’, ‘Start’, ‘Stop’, ‘Positions’, ‘Active_Clusters’

  • mode (str, optional) –

    • ‘auto’: Automatically assign clusters based on activity patterns

    • ‘template’: Print a template for manual assignment

  • print_template (bool, optional) – If True and mode=’template’, prints a formatted template showing all possible TFBS combinations and their current cluster assignments

Returns:

If mode=’auto’: Dictionary mapping TFBS binding configurations to cluster indices. For example:

    {
        (): 5,            # Background configuration (no TFBSs)
        ('A',): 1,        # TFBS A alone
        ('B',): 3,        # TFBS B alone
        ('A', 'B'): 7,    # Interaction of TFBSs A and B
        …
    }

If mode=’template’: None, but prints template for manual assignment

Return type:

dict or None

Notes

The function internally uses the continuous binding configuration matrix (normalized entropy-based activity levels) to make assignments, ensuring consistent scoring across all binding configurations. This means:

  • Activity levels are normalized relative to background entropy

  • Higher values (closer to 1) indicate stronger TFBS activity

  • Lower values (closer to 0) indicate weaker or no TFBS activity

The scoring system prioritizes finding clusters that:

  1. Have high activity for the desired TFBS(s)

  2. Have low activity for other TFBSs

  3. Show balanced activity when multiple TFBSs are desired

get_additive_params(tfbs_positions, specific_clusters=None, zero_out_inactive=False, separate_background=True)[source]

Extract additive parameters for each TFBS by cropping from meta-attribution maps.

Parameters:
  • tfbs_positions (pd.DataFrame) – DataFrame containing TFBS information (from get_tfbs_positions)

  • specific_clusters (list of int, optional) – List of one cluster per TFBS to use for cropping (e.g., [5, 17, 20, 23] for TFBSs A, B, C, D). If None, uses the average of all active clusters for each TFBS.

  • zero_out_inactive (bool, optional) – Controls how to handle positions within the cropped region:

    • False (default): Return the full cropped region (start to stop) with all positions

    • True: Return the full cropped region (start to stop), with inactive positions set to zero

  • separate_background (bool, optional) – Whether to use background-separated cluster maps (default: True). If True, uses meta_explainer.cluster_maps_no_bg if available. If False or if background-separated maps aren’t available, uses regular cluster maps.

Returns:

Dictionary mapping TFBS IDs (A, B, C, etc.) to their 4xL parameter matrices. For each TFBS, the matrix is cropped from either:

  • The cluster-averaged attribution map for the specified cluster, or

  • The average of cluster-averaged attribution maps from all active clusters

The matrix always spans the full region (start to stop), with L = stop - start + 1. If zero_out_inactive=True, positions not in the TFBS’s Positions list are set to zero.

Return type:

dict
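
The cropping rule, including the effect of zero_out_inactive, can be sketched with NumPy. The sketch assumes a cluster-averaged map stored with shape (L_total, 4) and inclusive start and stop indices, consistent with L = stop - start + 1; the helper name crop_additive_params is hypothetical.

```python
import numpy as np

def crop_additive_params(cluster_map, start, stop, positions=None,
                         zero_out_inactive=False):
    """Crop a 4 x L block (L = stop - start + 1) from a (L_total, 4) map.

    Hypothetical helper; the map orientation is an assumption.
    """
    region = cluster_map[start:stop + 1].T.copy()  # shape (4, L)
    if zero_out_inactive and positions is not None:
        keep = np.isin(np.arange(start, stop + 1), positions)
        region[:, ~keep] = 0.0  # zero the positions not listed for this TFBS
    return region

# Toy map: 10 positions, 4 nucleotides
cluster_map = np.arange(40, dtype=float).reshape(10, 4)

full = crop_additive_params(cluster_map, start=2, stop=5)
masked = crop_additive_params(cluster_map, start=2, stop=5,
                              positions=[2, 3, 5], zero_out_inactive=True)
print(full.shape)
print(masked[:, 2])
```

Here position 4 lies inside the cropped region but is not in the positions list, so its column is zeroed in the masked result while the region keeps its full width.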

get_epistatic_params(tfbs_positions, binding_config_assignments=None)[source]

Calculate epistatic interactions between TFBSs using Möbius inversion.

For each combination of TFBSs, calculates the interaction using the inclusion-exclusion principle. For example, for a 3-way interaction ABC: I_ABC = y_ABC - y_AB - y_AC - y_BC + y_A + y_B + y_C - y₀, where y₀ is the value of the background configuration (no TFBSs active).

Parameters:
  • tfbs_positions (pd.DataFrame) – DataFrame containing TFBS information (from get_tfbs_positions)

  • binding_config_assignments (dict, optional) – Dictionary mapping TFBS binding configurations to cluster indices. If None, will use get_binding_config_assignments() with mode=’auto’ to get assignments.

Returns:

Dictionary mapping TFBS combinations to their epistatic interaction values. Keys are tuples of TFBS IDs (e.g., (‘A’, ‘B’) for 2-way, (‘A’, ‘B’, ‘C’) for 3-way). Values are the calculated interaction terms using Möbius inversion.

Return type:

dict

Notes

The epistatic interactions are calculated using Möbius inversion: for an interaction among n TFBSs, the term for each subset of size k carries the coefficient (-1)^(n-k). This ensures that:

  1. The interaction term captures the deviation from additivity

  2. Higher-order interactions are properly decomposed into their constituent terms

  3. The background configuration (empty set) is properly accounted for

For example:

  • 2-way: I_AB = y_AB - y_A - y_B + y₀

  • 3-way: I_ABC = y_ABC - y_AB - y_AC - y_BC + y_A + y_B + y_C - y₀

  • 4-way: I_ABCD = y_ABCD - y_ABC - y_ABD - y_ACD - y_BCD + y_AB + y_AC + y_AD + y_BC + y_BD + y_CD - y_A - y_B - y_C - y_D + y₀

A positive interaction indicates synergy (combined effect > sum of individual effects), while a negative interaction indicates antagonism (combined effect < sum of individual effects).
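
The expansions above follow a single rule that is easy to check in code: for a combination of n TFBSs, each subset of size k contributes with sign (-1)^(n-k). The sketch below is a generic implementation of that rule on toy values; the function name mobius_interaction and the numbers are illustrative and are not taken from the library.

```python
from itertools import combinations

def mobius_interaction(y, combo):
    """Interaction term for `combo` given values y for every subset.

    y: dict mapping tuples of TFBS IDs (sorted, () for background) to values.
    Hypothetical helper for illustration.
    """
    n = len(combo)
    total = 0.0
    for k in range(n + 1):
        sign = (-1) ** (n - k)
        for subset in combinations(combo, k):
            total += sign * y[subset]
    return total

# Toy values: A and B interact synergistically; C adds 0.5 independently
y = {
    (): 1.0,
    ("A",): 2.0, ("B",): 3.0, ("C",): 1.5,
    ("A", "B"): 6.0, ("A", "C"): 2.5, ("B", "C"): 3.5,
    ("A", "B", "C"): 6.5,
}

i_ab = mobius_interaction(y, ("A", "B"))        # 6.0 - 2.0 - 3.0 + 1.0
i_abc = mobius_interaction(y, ("A", "B", "C"))
print(i_ab, i_abc)
# 2.0 0.0
```

In the toy data the pair A, B shows a positive (synergistic) interaction, while the 3-way term vanishes because C contributes additively.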

Saving and Loading

If the epistatic parameters are saved as a NumPy file (.npy), these parameters can be loaded as follows:

```python
import numpy as np

# Load the saved parameters
epistatic_params = np.load('path/to/identified_parameters/epistatic_params.npy',
                           allow_pickle=True).item()

# The loaded data will be a dictionary where:
# - Keys are tuples of TFBS IDs (e.g., ('A', 'B') for pairwise interactions)
# - Values are the interaction terms (float values)

# Example usage:
# Get a pairwise interaction
ab_interaction = epistatic_params[('A', 'B')]

# Get a higher-order interaction
abc_interaction = epistatic_params[('A', 'B', 'C')]
```

Note: The allow_pickle=True parameter is required because the data is stored as a dictionary, and .item() is needed to convert the NumPy array back into a dictionary format.
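
For completeness, a save-and-reload round trip might look like the sketch below. It uses only standard NumPy calls; the dictionary values are toy numbers and the temporary file location is chosen for the example, so neither reflects actual output of the library.

```python
import os
import tempfile
import numpy as np

# Toy dictionary in the same key format (tuples of TFBS IDs)
epistatic_params = {("A", "B"): 0.42, ("A", "B", "C"): -0.10}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "epistatic_params.npy")
    # np.save wraps the dict in a 0-d object array, which requires pickling
    np.save(path, epistatic_params, allow_pickle=True)
    loaded = np.load(path, allow_pickle=True).item()

print(loaded == epistatic_params)
# True
```

Because loading with allow_pickle=True can execute arbitrary code from the file, it is best reserved for files from a trusted source.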

plot_epistatic_interactions(epistatic_params, tfbs_positions=None, pairwise_only=False, annotate=True, cmap='RdBu_r', figsize=(10, 8), save_path=None, dpi=200, file_format='png')[source]

Plot epistatic interactions between TFBSs.

Creates two visualizations:

  1. A lower triangular heatmap showing pairwise interactions (excluding diagonal)

  2. A bar plot showing higher-order interactions (if any exist)

Parameters:
  • epistatic_params (dict) – Dictionary mapping TFBS combinations to their interaction values

  • tfbs_positions (pandas.DataFrame, optional) – DataFrame containing TFBS positions, used for consistent ordering

  • pairwise_only (bool, default=False) – If True, only plot pairwise interactions

  • annotate (bool, default=True) – Whether to show interaction values on the heatmap

  • cmap (str, default='RdBu_r') – Colormap for the heatmap

  • figsize (tuple, default=(10, 8)) – Figure size for the heatmap

  • save_path (str, optional) – Directory to save the plots

  • dpi (int, default=200) – DPI for saved figures

  • file_format (str, default='png') – Format for saved figures

Returns:

  • (fig_heatmap, ax_heatmap) if pairwise_only=True

  • ((fig_heatmap, ax_heatmap), (fig_bar, ax_bar)) if pairwise_only=False

Return type:

tuple

Utils

Utility functions for SEAM-NN package. Core functionality for sequence processing, data handling, and computation.

seam.utils.suppress_warnings()[source]

Suppress common warnings for cleaner output.

Return type:

None

seam.utils.get_device(gpu=False)[source]

Get appropriate compute device.

Return type:

Optional[str]

seam.utils.arr2pd(x, alphabet=['A', 'C', 'G', 'T'])[source]

Convert array to pandas DataFrame with proper column headings.

Return type:

DataFrame

seam.utils.oh2seq(one_hot, alphabet=['A', 'C', 'G', 'T'], encoding=1)[source]

Convert one-hot encoding to sequence.

Return type:

str

seam.utils.seq2oh(seq, alphabet=['A', 'C', 'G', 'T'], encoding=1)[source]

Convert sequence to one-hot encoding.

Return type:

ndarray
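
The conversion performed by oh2seq and seq2oh can be illustrated with a small stand-alone round trip. The functions below are hypothetical stand-ins written for this example, not the seam.utils implementations; they assume a (L, A) one-hot layout and do not model the encoding parameter.

```python
import numpy as np

ALPHABET = ["A", "C", "G", "T"]

def seq_to_one_hot(seq, alphabet=ALPHABET):
    """Encode a string as an (L, A) one-hot array (illustrative stand-in)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    one_hot = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(seq):
        one_hot[pos, index[ch]] = 1.0
    return one_hot

def one_hot_to_seq(one_hot, alphabet=ALPHABET):
    """Decode an (L, A) one-hot array back to a string (illustrative stand-in)."""
    return "".join(alphabet[i] for i in one_hot.argmax(axis=1))

x = seq_to_one_hot("ACGTTG")
print(x.shape)
print(one_hot_to_seq(x))
# (6, 4)
# ACGTTG
```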

seam.utils.calculate_background_entropy(mut_rate, alphabet_size)[source]

Calculate background entropy given mutation rate.

Return type:

float
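
This entry does not state the formula, so the sketch below shows one plausible reading rather than the library's definition. It assumes each position keeps its reference character with probability 1 - mut_rate, spreads mut_rate evenly over the other characters, and reports the Shannon entropy of that distribution in bits; the function name is hypothetical.

```python
import numpy as np

def background_entropy_sketch(mut_rate, alphabet_size=4):
    """Shannon entropy (bits) of a per-position mutation model.

    Assumed model, not confirmed by the reference: the reference character
    has probability 1 - mut_rate, every other character mut_rate / (A - 1).
    """
    p = np.array([1.0 - mut_rate]
                 + [mut_rate / (alphabet_size - 1)] * (alphabet_size - 1))
    p = p[p > 0]  # drop zero-probability outcomes so log2 stays finite
    return float(-(p * np.log2(p)).sum())

print(background_entropy_sketch(0.75))  # uniform over 4 characters
# 2.0
```

Under this model the entropy is zero when nothing mutates and reaches log2(alphabet_size) when all characters are equally likely.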

seam.utils.safe_file_path(directory, filename, extension)[source]

Generate safe file path, creating directories if needed.

Return type:

str