API reference

Public surface re-exported from sparse_nmf.

Standalone NMF

class sparse_nmf.SparseNMF(n_components=256, max_iter=500, device='cuda', batch_size=None, verbose=True, random_state=None, tol=0.0001, mse_weight=1.0, r2_weight=0.0, learning_rate=0.01, nonzero_mse_weight=0.0, nonzero_r2_weight=0.0, patience=None)[source]

Bases: object

GPU-accelerated Non-negative Matrix Factorization for sparse matrices.

This class implements NMF using multiplicative update rules while working directly with sparse matrices, avoiding memory-intensive dense conversions.

When R² loss is enabled (r2_weight > 0), the optimization switches to a gradient-based Adam optimizer to handle the normalized R² loss function.

Parameters:
  • n_components (int, default 256) – Number of components (latent dimensions) to extract.

  • max_iter (int, default 500) – Maximum number of iterations for optimization.

  • device (str, default 'cuda') – Device to use for computation (‘cuda’, ‘cuda:0’, ‘cpu’, etc.). If ‘cuda’ is specified but CUDA is not available, falls back to CPU.

  • batch_size (int, optional) – Batch size for processing rows. If None, auto-determines based on GPU memory and matrix size.

  • verbose (bool, default True) – Whether to print progress information.

  • random_state (int, optional) – Random seed for initialization. If None, uses random initialization.

  • tol (float, default 1e-4) – Tolerance for convergence checking. Training stops if the change in reconstruction error falls below this threshold.

  • patience (int, optional) – Number of iterations to wait without improvement before early stopping. If None, only uses tolerance-based convergence. If specified, stops training if the error doesn’t improve for patience consecutive iterations.

  • mse_weight (float, default 1.0) – Weight for the MSE (Mean Squared Error) loss component. The combined objective is total_loss = mse_weight * MSE + r2_weight * (1 - R²); see the sketch after this parameter list.

  • r2_weight (float, default 0.0) – Weight for R² (coefficient of determination) loss component. When > 0, switches to gradient-based optimization. R² loss is computed as (1 - R²) so that minimizing loss maximizes R².

  • learning_rate (float, default 0.01) – Learning rate for gradient-based optimizer (used when r2_weight > 0).

  • nonzero_mse_weight (float, default 0.0) – Controls whether MSE loss includes zeros or only non-zero values. When > 0: MSE computed only on non-zero positions (ignores zeros). When 0: MSE computed on all positions including zeros (learns sparsity patterns). When > 0, forces gradient descent (multiplicative updates always include zeros).

  • nonzero_r2_weight (float, default 0.0) – Controls whether R² loss includes zeros or only non-zero values. When > 0: R² computed only on non-zero positions (ignores zeros). When 0: R² computed on all positions including zeros (learns sparsity patterns). Only affects training when r2_weight > 0. For final reporting, both R² (all values) and R² (non-zero only) are always computed.
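
For reference, the combined objective defined under mse_weight and r2_weight can be written as a one-line function. This is an illustrative sketch of the formula above, not the library's internal code:

>>> def total_loss(mse, r2, mse_weight=1.0, r2_weight=0.0):
...     # weighted objective from the parameter docs above
...     return mse_weight * mse + r2_weight * (1.0 - r2)
>>> total_loss(0.02, 0.85, mse_weight=0.5, r2_weight=0.5)  # 0.5*0.02 + 0.5*(1 - 0.85) ≈ 0.085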

W

Basis matrix of shape (n_samples, n_components).

Type:

torch.Tensor

H

Coefficient matrix of shape (n_components, n_features).

Type:

torch.Tensor

reconstruction_error_

Final reconstruction error (Frobenius norm).

Type:

float

r2_score_

R² (coefficient of determination) score on all values (including zeros), computed on a sample of the data. Values closer to 1.0 indicate better reconstruction quality.

Type:

float

r2_score_nonzero_

R² (coefficient of determination) score on non-zero values only, computed on a sample of the data. More meaningful for highly sparse data.

Type:

float

n_iter_

Number of iterations completed.

Type:

int

Examples

>>> from sparse_nmf import SparseNMF
>>> from scipy.sparse import random
>>>
>>> # Create a sparse matrix
>>> X = random(1000, 5000, density=0.01, format='csr')
>>>
>>> # Fit NMF with MSE only (default, uses multiplicative updates)
>>> nmf = SparseNMF(n_components=128, max_iter=200, device='cuda:0')
>>> X_reduced = nmf.fit_transform(X)
>>>
>>> # Fit NMF with weighted MSE + R² loss (uses gradient descent)
>>> nmf = SparseNMF(
...     n_components=128, max_iter=200, device='cuda:0',
...     mse_weight=0.5, r2_weight=0.5  # Equal weighting
... )
>>> X_reduced = nmf.fit_transform(X)
>>> print(f"R² score: {nmf.r2_score_:.4f}")
fit_transform(X_sparse)[source]

Fit NMF model and return transformed matrix.

Parameters:

X_sparse (scipy.sparse matrix) – Sparse input matrix of shape (n_samples, n_features). Must be non-negative (values >= 0).

Returns:

Transformed matrix W of shape (n_samples, n_components). This is the reduced representation ready for further processing (e.g., autoencoder).

Return type:

np.ndarray

transform(X_sparse)[source]

Transform new data using the fitted model.

Note: this is a simplified version that refits W for the new data while keeping H fixed. A full out-of-sample transformation solves W_new = argmin_{W_new >= 0} ||X_new - W_new @ H||^2; a sketch of one multiplicative update for this problem follows below.

Parameters:

X_sparse (scipy.sparse matrix) – New sparse matrix to transform.

Returns:

Transformed matrix of shape (n_samples, n_components).

Return type:

np.ndarray
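
The fixed-H problem described in the note above can be approached with standard multiplicative updates. A minimal NumPy sketch of the update (illustrative only, not the library's implementation; the array names are hypothetical):

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> X_new = np.abs(rng.standard_normal((10, 50)))  # hypothetical new data (non-negative)
>>> H = np.abs(rng.standard_normal((4, 50)))       # coefficient matrix held fixed
>>> W_new = np.abs(rng.standard_normal((10, 4)))   # non-negative initialization
>>> for _ in range(100):
...     W_new *= (X_new @ H.T) / (W_new @ H @ H.T + 1e-10)  # Lee-Seung update for W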

sparse_nmf.sparse_nmf(X_sparse, n_components=256, max_iter=500, device='cuda', batch_size=None, verbose=True, random_state=None, mse_weight=1.0, r2_weight=0.0, learning_rate=0.01, nonzero_mse_weight=0.0, nonzero_r2_weight=0.0)[source]

Convenience function for sparse NMF (without saving).

Parameters:
  • X_sparse (scipy.sparse matrix) – Sparse input matrix of shape (n_samples, n_features).

  • n_components (int, default 256) – Number of components to extract.

  • max_iter (int, default 500) – Maximum number of iterations.

  • device (str, default 'cuda') – Device to use (‘cuda’, ‘cpu’, etc.).

  • batch_size (int, optional) – Batch size for processing. If None, auto-determines.

  • verbose (bool, default True) – Whether to print progress.

  • random_state (int, optional) – Random seed for reproducibility.

  • mse_weight (float, default 1.0) – Weight for MSE loss component.

  • r2_weight (float, default 0.0) – Weight for R² loss component. When > 0, uses gradient descent.

  • learning_rate (float, default 0.01) – Learning rate for gradient descent (used when r2_weight > 0 or nonzero_mse_weight > 0).

  • nonzero_mse_weight (float, default 0.0) – Controls whether MSE loss includes zeros or only non-zero values. When > 0: MSE computed only on non-zero positions (ignores zeros). When 0: MSE computed on all positions including zeros (learns sparsity patterns).

  • nonzero_r2_weight (float, default 0.0) – Controls whether R² loss includes zeros or only non-zero values. When > 0: R² computed only on non-zero positions (ignores zeros). When 0: R² computed on all positions including zeros (learns sparsity patterns). Only affects training when r2_weight > 0.

Returns:

Transformed matrix of shape (n_samples, n_components).

Return type:

np.ndarray

Examples

>>> from sparse_nmf import sparse_nmf
>>> # MSE only (default, fast multiplicative updates)
>>> X_reduced = sparse_nmf(X, n_components=256, device='cuda:0')
>>>
>>> # Weighted MSE + R² loss
>>> X_reduced = sparse_nmf(X, n_components=256, mse_weight=0.5, r2_weight=0.5)
sparse_nmf.train_sparse_nmf(X_sparse=None, n_components=256, max_iter=500, device='cuda', batch_size=None, verbose=True, random_state=None, mse_weight=1.0, r2_weight=0.0, learning_rate=0.01, nonzero_mse_weight=0.0, nonzero_r2_weight=0.0, normalize_inputs=False, normalize_outputs=False, patience=None, embeddings_save_path=None, model_save_path=None, force=False)[source]

Train sparse NMF model with automatic saving of embeddings and model.

This is a convenience wrapper that handles model creation, training, and saving. If save paths are provided and files exist, loads from disk unless force=True. When loading from disk, X_sparse is not required.

Parameters:
  • X_sparse (scipy.sparse matrix, optional) – Sparse input matrix of shape (n_samples, n_features). Required when training a new model (force=True or files don’t exist). Optional when loading from disk (files exist and force=False).

  • n_components (int, default 256) – Number of components to extract.

  • max_iter (int, default 500) – Maximum number of iterations.

  • device (str, default 'cuda') – Device to use (‘cuda’, ‘cpu’, etc.).

  • batch_size (int, optional) – Batch size for processing. If None, auto-determines.

  • verbose (bool, default True) – Whether to print progress.

  • random_state (int, optional) – Random seed for reproducibility.

  • mse_weight (float, default 1.0) – Weight for MSE loss component.

  • r2_weight (float, default 0.0) – Weight for R² loss component. When > 0, uses gradient descent.

  • learning_rate (float, default 0.01) – Learning rate for gradient descent (used when r2_weight > 0 or nonzero_mse_weight > 0).

  • nonzero_mse_weight (float, default 0.0) – Controls whether MSE loss includes zeros or only non-zero values. When > 0: MSE computed only on non-zero positions (ignores zeros). When 0: MSE computed on all positions including zeros (learns sparsity patterns).

  • nonzero_r2_weight (float, default 0.0) – Controls whether R² loss includes zeros or only non-zero values. When > 0: R² computed only on non-zero positions (ignores zeros). When 0: R² computed on all positions including zeros (learns sparsity patterns). Only affects training when r2_weight > 0.

  • normalize_inputs (bool, default False) – Whether to L2-normalize the input X matrix before training. When True, each row of X will be normalized to unit length before NMF. This helps balance dense vs sparse rows and prevents dense datasets from dominating the NMF factors. Useful when datasets have varying sparsity levels.

  • normalize_outputs (bool, default False) – Whether to L2-normalize the output W matrix (embeddings) before returning. When True, each row of the output will have unit length. This is useful if you plan to pass the embeddings to an autoencoder with normalize_input=True or use_cosine_loss=True. When False, preserves the original magnitude information.

  • patience (int, optional) – Number of iterations to wait without improvement before early stopping. If None, only uses tolerance-based convergence. If specified, stops training if the error doesn’t improve for patience consecutive iterations.

  • embeddings_save_path (str, optional) – Path to save the transformed embeddings (W matrix) as a .npy file. If provided and file exists and force=False, loads embeddings from disk.

  • model_save_path (str, optional) – Path to save the model (W and H matrices) as a .pkl file. If provided and file exists and force=False, loads model from disk.

  • force (bool, default False) – If True, force retraining even if save files exist. If False and both save files exist, loads from disk instead of retraining.

Return type:

Tuple[ndarray, SparseNMF]

Returns:

  • X_reduced (np.ndarray) – Transformed matrix of shape (n_samples, n_components).

  • model (SparseNMF) – Trained SparseNMF model.

Examples

>>> from sparse_nmf import train_sparse_nmf
>>>
>>> # Train and save model
>>> X_reduced, model = train_sparse_nmf(
...     X_sparse,
...     n_components=256,
...     device='cuda:0',
...     embeddings_save_path='embeddings.npy',
...     model_save_path='model.pkl'
... )
>>>
>>> # Load from disk (if files exist and force=False)
>>> # X_sparse is not required when loading from disk
>>> X_reduced, model = train_sparse_nmf(
...     embeddings_save_path='embeddings.npy',
...     model_save_path='model.pkl'
... )
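>>>
>>> # Hedged sketch: combining the normalization options documented above
>>> X_reduced, model = train_sparse_nmf(
...     X_sparse,
...     normalize_inputs=True,    # L2-normalize rows of X before training
...     normalize_outputs=True    # L2-normalize the returned embeddings
... )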

Joint model

class sparse_nmf.SparseNMF_Autoencoder(n_samples, n_features, nmf_components=256, latent_dim=2, hidden_dims=(256, 128, 64, 16), activation='relu', dropout=0.0, use_vae=False, use_feature_attention=False, feature_attention_weight=1.0, feature_attention_temperature=1.0, normalize_nmf_components=False, device='cuda', random_state=None)[source]

Bases: Module

Joint SparseNMF + Autoencoder model for end-to-end training.

This model combines sparse NMF with an autoencoder in a single trainable architecture. The key advantage is that sparse operations are used throughout NMF, and only the reduced W matrix (n_samples × n_components) is converted to dense for the autoencoder.

Architecture:

Sparse X (n_samples, n_features)
  → NMF: X ≈ W @ H (sparse operations)
  → Dense W (n_samples, n_components)
  → Autoencoder: W → z (n_samples, latent_dim)

Parameters:
  • n_samples (int) – Number of samples (rows) in the input matrix. Can be inferred from X_sparse.shape[0] in train_joint_model.

  • n_features (int) – Number of features (columns) in the input matrix. Can be inferred from X_sparse.shape[1] in train_joint_model.

  • nmf_components (int, default 256) – Number of NMF components (intermediate dimension).

  • latent_dim (int, default 2) – Final latent dimension (output of autoencoder).

  • hidden_dims (tuple of int, default (256, 128, 64, 16)) – Hidden layer dimensions for autoencoder. Matches two-step approach default.

  • activation (str, default "relu") – Activation function for autoencoder.

  • dropout (float, default 0.0) – Dropout rate for autoencoder.

  • use_vae (bool, default False) – Whether to use Variational Autoencoder.

  • use_feature_attention (bool, default False) – If True, learn attention weights for each NMF component based on reconstruction importance. Disabled by default for joint training; use the two-step approach for feature attention.

  • feature_attention_weight (float, default 1.0) – Weight for mixing original input with attended input (0 = no attention, 1 = full attention).

  • feature_attention_temperature (float, default 1.0) – Temperature for attention weights. Use higher values (1.0+) for joint training stability.

  • normalize_nmf_components (bool, default False) – Whether to L2-normalize NMF components (W) before passing to autoencoder. When True, matches normalize_input=True in two-step approach. This is crucial for proper clustering and prevents radial patterns. When False, preserves the original magnitude information in W.

  • device (str, default 'cuda') – Device to use for computation.

  • random_state (int | None)

Examples

>>> from sparse_nmf import SparseNMF_Autoencoder, train_joint_model
>>> from scipy.sparse import csr_matrix
>>>
>>> # Create and train model (dimensions inferred automatically)
>>> z, model = train_joint_model(
...     X_sparse,  # n_samples and n_features inferred from X_sparse.shape
...     nmf_components=256,
...     latent_dim=2,
...     device='cuda:0',
...     n_epochs=100
... )
encode(X_sparse_torch=None)[source]

Encode to latent space.

Parameters:

X_sparse_torch (torch.sparse.FloatTensor, optional) – If provided, uses current W. Otherwise uses stored W.

Returns:

z – Latent embeddings (n_samples, latent_dim)

Return type:

torch.Tensor

forward(X_sparse_torch)[source]

Forward pass through the joint model.

Parameters:

X_sparse_torch (torch.sparse.FloatTensor) – Sparse input matrix (n_samples, n_features)

Returns:

  • z (torch.Tensor) – Latent embeddings (n_samples, latent_dim)

  • W_recon (torch.Tensor) – Reconstructed W from autoencoder (n_samples, nmf_components)

  • X_recon (torch.Tensor) – NMF reconstruction (n_samples, n_features), used for loss computation

  • W (torch.Tensor) – Current W matrix (n_samples, nmf_components)
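
A hedged usage sketch (assumes X_sparse_torch is a torch sparse tensor on the model's device; the return order follows the documentation above):

>>> z, W_recon, X_recon, W = model(X_sparse_torch)
>>> z.shape         # (n_samples, latent_dim)
>>> W_recon.shape   # (n_samples, nmf_components)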

reparameterize(mu, logvar)[source]

Reparameterization trick for VAE.
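
The standard reparameterization draws z = mu + sigma * eps with eps ~ N(0, I). A sketch of the usual formula (the library's exact implementation may differ):

>>> import torch
>>> def reparameterize(mu, logvar):
...     std = torch.exp(0.5 * logvar)  # logvar = log(sigma^2), so std = exp(logvar / 2)
...     eps = torch.randn_like(std)    # noise sampled from N(0, I)
...     return mu + eps * std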

sparse_nmf.train_joint_model(X_sparse, model=None, n_samples=None, n_features=None, nmf_components=256, latent_dim=2, hidden_dims=(256, 128, 64, 16), activation='relu', dropout=0.0, use_vae=True, use_feature_attention=False, feature_attention_weight=1.0, feature_attention_temperature=1.0, normalize_nmf_components=False, device='cuda', n_epochs=200, learning_rate=0.0005, nmf_weight=1.0, ae_weight=1.0, kl_weight=0.01, use_contrastive=False, contrastive_weight=0.25, contrastive_temperature=0.5, use_cosine_loss=True, dimension_reg_weight=0.0, weight_decay=0.0001, batch_size=None, verbose=True, random_state=None, save_path=None, force=False)[source]

Train the joint SparseNMF + Autoencoder model.

Parameters:
  • X_sparse (scipy.sparse matrix) – Sparse input matrix (n_samples, n_features).

  • model (SparseNMF_Autoencoder, optional) – Pre-initialized model. If None, creates a new model.

  • n_samples (int, optional) – Number of samples. If None, automatically inferred from X_sparse.shape[0]. Only specify if you want to validate the shape matches.

  • n_features (int, optional) – Number of features. If None, automatically inferred from X_sparse.shape[1]. Only specify if you want to validate the shape matches.

  • nmf_components (int, default 256) – Number of NMF components.

  • latent_dim (int, default 2) – Final latent dimension.

  • hidden_dims (tuple, default (256, 128, 64, 16)) – Autoencoder hidden dimensions. Matches two-step approach default.

  • activation (str, default "relu") – Activation function.

  • dropout (float, default 0.0) – Dropout rate.

  • use_vae (bool, default True) – Whether to use VAE.

  • use_feature_attention (bool, default False) – If True, learn attention weights for each NMF component. Disabled by default for joint training because randomly initialized attention can destabilize training; use the two-step approach if you need feature attention.

  • feature_attention_weight (float, default 1.0) – Weight for mixing original input with attended input (0 = no attention, 1 = full).

  • feature_attention_temperature (float, default 1.0) – Temperature for attention weights. Higher values for joint training stability.

  • normalize_nmf_components (bool, default False) – Whether to L2-normalize NMF components (W) before passing to autoencoder. When True, matches normalize_input=True in two-step approach. This is crucial for proper clustering and prevents radial patterns. When False, preserves the original magnitude information in W.

  • device (str, default 'cuda') – Device to use.

  • n_epochs (int, default 200) – Number of training epochs.

  • learning_rate (float, default 0.0005) – Learning rate.

  • nmf_weight (float, default 1.0) – Weight for NMF loss.

  • ae_weight (float, default 1.0) – Weight for AE loss.

  • kl_weight (float, default 0.01) – Weight for KL loss (VAE only).

  • use_contrastive (bool, default False) – Whether to use contrastive loss. Default False matches working two-step config.

  • contrastive_weight (float, default 0.25) – Weight for contrastive loss term.

  • contrastive_temperature (float, default 0.5) – Temperature for contrastive loss (lower = sharper distinctions).

  • use_cosine_loss (bool, default True) – Whether to use cosine loss for autoencoder reconstruction. Works better with normalized inputs (matches two-step approach).

  • dimension_reg_weight (float, default 0.0) – Weight for dimension regularization (prevents collapse to 1D).

  • weight_decay (float, default 1e-4) – L2 weight regularization for optimizer.

  • batch_size (int, optional) – Batch size for autoencoder training. If None, defaults to 256.

  • verbose (bool, default True) – Whether to print progress.

  • random_state (int, optional) – Random seed.

  • save_path (str, optional) – Path to save model and embeddings.

  • force (bool, default False) – If True, retrain even if save_path exists.

Return type:

tuple

Returns:

  • z (np.ndarray) – Final latent embeddings (n_samples, latent_dim).

  • model (SparseNMF_Autoencoder) – Trained model.

sparse_nmf.compute_joint_loss(model, X_sparse_torch, z, W_recon, X_recon, W, H, mu=None, logvar=None, nmf_weight=1.0, ae_weight=1.0, kl_weight=0.01, use_sparse_loss=True, use_contrastive=True, contrastive_weight=0.25, contrastive_temperature=0.5, use_cosine_loss=True, dimension_reg_weight=0.1)[source]

Compute combined loss for joint model.

Parameters:
  • model (SparseNMF_Autoencoder) – The model instance.

  • X_sparse_torch (torch.sparse.FloatTensor) – Sparse input matrix.

  • z (torch.Tensor) – Latent embeddings.

  • W_recon (torch.Tensor) – Reconstructed W from autoencoder.

  • X_recon (torch.Tensor) – NMF reconstruction.

  • W (torch.Tensor) – Current W matrix.

  • mu (torch.Tensor, optional) – VAE mean (if using VAE).

  • logvar (torch.Tensor, optional) – VAE log variance (if using VAE).

  • nmf_weight (float, default 1.0) – Weight for NMF reconstruction loss.

  • ae_weight (float, default 1.0) – Weight for autoencoder reconstruction loss.

  • kl_weight (float, default 0.01) – Weight for KL divergence (VAE only).

  • use_sparse_loss (bool, default True) – If True, compute NMF loss only on non-zero elements (more efficient).

  • H (torch.Tensor) – Current H matrix.

  • use_contrastive (bool, default True) – Whether to include the contrastive loss term.

  • contrastive_weight (float, default 0.25) – Weight for the contrastive loss term.

  • contrastive_temperature (float, default 0.5) – Temperature for the contrastive loss.

  • use_cosine_loss (bool, default True) – Whether to use cosine loss for autoencoder reconstruction.

  • dimension_reg_weight (float, default 0.1) – Weight for dimension regularization (prevents collapse to 1D).

Return type:

tuple

Returns:

  • total_loss (torch.Tensor) – Total combined loss.

  • loss_dict (dict) – Dictionary of individual loss components.
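
Examples

A hedged call sketch, assuming a forward pass of a SparseNMF_Autoencoder as documented above (model.H holds the NMF coefficient matrix):

>>> z, W_recon, X_recon, W = model(X_sparse_torch)
>>> total_loss, loss_dict = compute_joint_loss(
...     model, X_sparse_torch, z, W_recon, X_recon, W, model.H,
...     nmf_weight=1.0, ae_weight=1.0, kl_weight=0.01
... )
>>> total_loss.backward()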

Attention analysis

sparse_nmf.extract_attention_weights(model, X_nmf, batch_size=256, device=None, verbose=False)[source]

Extract per-sample feature attention weights from trained autoencoder.

Supports both feature attention and transformer attention modes. Optimized for very large datasets (1M+ samples) with memory-efficient processing.

Parameters:
  • model (Autoencoder) – Trained autoencoder model with either use_feature_attention=True (feature-wise, MLP-based attention) or use_transformer=True (transformer self-attention between features).

  • X_nmf (np.ndarray or torch.Tensor) – NMF-transformed embeddings, shape (n_samples, n_nmf_components). These are the inputs to the autoencoder. For very large datasets, keep as numpy array on CPU to avoid OOM.

  • batch_size (int, default 256) – Batch size for processing. Auto-increased for large datasets. For 1M+ samples, uses 4096-8192 for efficiency.

  • device (str, optional) – Device to use. If None, uses model’s device.

  • verbose (bool, default False) – If True, print progress information.

Returns:

attention_weights – Attention weights for each sample and NMF component, shape (n_samples, n_nmf_components). Values are in the [0, 1] range in both modes: sigmoid outputs for feature attention, normalized attention scores for transformer attention.

Return type:

np.ndarray

Examples

>>> from sparse_nmf import extract_attention_weights
>>> # Works with feature attention
>>> attention_weights = extract_attention_weights(model, X_nmf, batch_size=1024)
>>> # Also works with transformer attention
>>> attention_weights = extract_attention_weights(model, X_nmf, batch_size=4096)
>>> print(f"Attention shape: {attention_weights.shape}")  # (n_samples, n_components)
sparse_nmf.extract_and_aggregate_attention(model, X_nmf, nmf_H, batch_size=256, device=None, normalize=True, gene_feature_names=None, nmf_feature_names=None, sample_names=None, metadata=None, verbose=True, nonzero_threshold=None, save_dir=None, force=False, return_attention_matrices=False)[source]

Extract attention weights for all samples, trace to genes, and aggregate statistics.

This is a high-level wrapper that:

1. Extracts attention weights for all samples on NMF components.
2. Traces attention weights back to original gene features.
3. Aggregates statistics across samples for both gene and NMF features.

Supports both feature attention and transformer attention modes. Optimized for very large datasets (1M+ samples): uses GPU operations, larger batches, memory-efficient processing, and vectorized aggregations.

Parameters:
  • model (Autoencoder) – Trained autoencoder model with either feature attention or transformer attention enabled.

  • X_nmf (np.ndarray or torch.Tensor) – NMF-transformed embeddings, shape (n_samples, n_nmf_components).

  • nmf_H (np.ndarray or torch.Tensor) – NMF H matrix mapping components to genes, shape (n_nmf_components, n_genes).

  • batch_size (int, default 256) – Batch size for extracting attention weights. Auto-increased for large datasets.

  • device (str, optional) – Device to use. If None, uses model’s device.

  • normalize (bool, default True) – If True, normalize attention weights per sample before aggregation.

  • gene_feature_names (array-like, optional) – Names for gene features. If None, uses integer indices. If metadata is provided, this will be inferred from metadata[‘var’].index.

  • nmf_feature_names (array-like, optional) – Names for NMF features. If None, uses integer indices.

  • sample_names (array-like, optional) – Names/IDs for samples. If None, uses integer indices. Should match the order of samples in X_nmf. Can be obtained from metadata[‘obs’].index or metadata[‘obs’][‘sourceId’]. If metadata is provided, this will be inferred from metadata[‘obs’].

  • metadata (dict or AnnData-like object, optional) – Metadata object with ‘var’ and ‘obs’ keys (e.g., an AnnData object). If provided, gene_feature_names is inferred from metadata[‘var’].index, and sample_names from metadata[‘obs’][‘obs_id’] if available, otherwise from metadata[‘obs’].index. This parameter takes precedence over gene_feature_names and sample_names if provided.

  • verbose (bool, default True) – If True, print progress and summary information.

  • nonzero_threshold (float, optional) – Threshold for counting “nonzero” attention. If None, uses a percentile-based threshold: the 1st percentile of all attention values (treating the bottom 1% as noise). If normalize=True, all values are > 0 after normalization, so a threshold is needed to distinguish meaningful attention from a uniform noise distribution.

  • save_dir (str, optional) – Directory path to save the aggregated dataframes as parquet files. If provided, saves gene_aggregated_df as ‘gene_attention_aggregated.parquet’ and nmf_aggregated_df as ‘nmf_attention_aggregated.parquet’. If None, dataframes are not saved.

  • force (bool, default False) – If True, overwrite existing parquet files. If False and files exist, raises an error.

  • return_attention_matrices (bool, default False) – If True, also return the pre-aggregated attention matrices: gene_attention_matrix of shape (n_samples, n_genes), with attention scores for each sample-gene pair, and nmf_attention_matrix of shape (n_samples, n_nmf_components), with attention scores for each sample-NMF component pair. Note: when True, the full matrices must be held in memory, which may be prohibitive for very large datasets. If loading from existing files, matrices cannot be returned unless they were previously saved.

Returns:

  • gene_aggregated_df (pd.DataFrame) – Aggregated statistics per gene feature with columns:
    - feature_index: gene feature index
    - feature_name: gene feature name (if provided)
    - mean_attention: mean attention across all samples
    - min_attention: minimum attention across all samples
    - max_attention: maximum attention across all samples
    - n_samples_nonzero: number of samples with nonzero attention for this gene
    - pct_samples_nonzero: percentage of samples with nonzero attention
    - max_attention_sample: sample name/ID with highest attention for this gene (if sample_names provided)

  • nmf_aggregated_df (pd.DataFrame) – Aggregated statistics per NMF feature with columns:
    - feature_index: NMF feature index
    - feature_name: NMF feature name (if provided)
    - mean_attention: mean attention across all samples
    - min_attention: minimum attention across all samples
    - max_attention: maximum attention across all samples
    - n_samples_nonzero: number of samples with nonzero attention for this NMF feature
    - pct_samples_nonzero: percentage of samples with nonzero attention
    - max_attention_sample: sample name/ID with highest attention for this NMF feature (if sample_names provided)

  • gene_attention_matrix (np.ndarray, optional) – Pre-aggregated attention matrix, shape (n_samples, n_genes). Only returned if return_attention_matrices=True. Contains continuous attention scores for each sample-gene pair.

  • nmf_attention_matrix (np.ndarray, optional) – Pre-aggregated attention matrix, shape (n_samples, n_nmf_components). Only returned if return_attention_matrices=True. Contains continuous attention scores for each sample-NMF component pair.

Examples

>>> from sparse_nmf import extract_and_aggregate_attention
>>>
>>> gene_df, nmf_df = extract_and_aggregate_attention(
...     model, X_nmf, nmf.H,
...     gene_feature_names=gene_names,
...     sample_names=sample_names
... )
>>>
>>> # Get top genes by mean attention
>>> top_genes = gene_df.nlargest(10, 'mean_attention')
sparse_nmf.trace_attention_to_genes(attention_weights_nmf, nmf_H, normalize=True)[source]

Trace attention weights from NMF components back to original gene features.

This uses the principled approach of matrix multiplication through the linear transformation (NMF H matrix). Since NMF decomposition is:

X ≈ W @ H

where W is (n_samples, n_components) and H is (n_components, n_genes), the attention on NMF components can be propagated to genes via:

gene_attention = attention_weights_nmf @ H

This is mathematically sound because:

1. Attention weights represent the importance of each NMF component.
2. The H matrix maps each NMF component to the original gene features.
3. Matrix multiplication properly aggregates attention across components.

This approach is equivalent to computing gradients through linear layers and is the standard method for propagating importance/attention through linear transformations in neural networks.

Parameters:
  • attention_weights_nmf (np.ndarray) – Attention weights on NMF components, shape (n_samples, n_nmf_components). Typically obtained from extract_attention_weights().

  • nmf_H (np.ndarray or torch.Tensor) – NMF H matrix mapping components to genes, shape (n_nmf_components, n_genes). This is the coefficient matrix from NMF decomposition (X ≈ W @ H). Can be obtained from SparseNMF.H or SparseNMF_Autoencoder.H.

  • normalize (bool, default True) – If True, normalize attention weights per sample so they sum to 1. This makes the weights interpretable as a probability distribution.

Returns:

gene_attention_weights – Attention weights for each sample and gene, shape (n_samples, n_genes). Higher values indicate genes that are more important for that sample.

Return type:

np.ndarray

Examples

>>> from sparse_nmf import extract_attention_weights, trace_attention_to_genes
>>> import numpy as np
>>>
>>> # Extract attention on NMF components
>>> attention_nmf = extract_attention_weights(model, X_nmf)
>>>
>>> # Trace back to genes
>>> gene_attention = trace_attention_to_genes(attention_nmf, nmf.H)
>>>
>>> # Get top genes for a sample
>>> sample_idx = 0
>>> top_genes = np.argsort(gene_attention[sample_idx])[-10:][::-1]
sparse_nmf.compute_attention_correlation(gene_attention_matrix, X, obs_mask=None, stratify_by_unique_values=True, verbose=True)[source]

Compute correlations between gene attention matrix and original data matrix.

Computes Pearson and Spearman correlations between gene attention weights and the original gene expression/association matrix. Can stratify samples by the number of unique values per sample (useful for binary vs. continuous data).

Parameters:
  • gene_attention_matrix (np.ndarray) – Gene attention matrix, shape (n_samples, n_genes). Contains continuous attention scores for each sample-gene pair.

  • X (np.ndarray or scipy.sparse matrix) – Original data matrix, shape (n_samples, n_genes). Can be sparse (scipy.sparse) or dense (numpy array).

  • obs_mask (np.ndarray, optional) – Boolean mask to select subset of samples. If provided, X[obs_mask] is used. Should match the samples in gene_attention_matrix.

  • stratify_by_unique_values (bool, default True) – If True, stratify samples by the number of unique values per sample:
    - “2_unique”: binary data (2 unique values)
    - “3_unique”: ternary data (3 unique values)
    - “4+_unique”: continuous data (4+ unique values)
    If False, only computes global correlations.

  • verbose (bool, default True) – If True, print summary information.

Returns:

correlation_results_df – DataFrame with columns:
  - stratum: “2_unique”, “3_unique”, “4+_unique”, or “all”
  - subset: “all” (all values) or “nonzero” (only nonzero values in X)
  - n_samples: number of samples (for strata) or values (for global)
  - pearson: Pearson correlation coefficient
  - spearman: Spearman rank correlation coefficient
  - spearman_p: Spearman correlation p-value

Return type:

pd.DataFrame

Examples

>>> from sparse_nmf import compute_attention_correlation
>>> import numpy as np
>>>
>>> # Compute correlations
>>> results_df = compute_attention_correlation(
...     gene_attention_matrix=gene_attention_matrix,
...     X=X,
...     obs_mask=obs_mask,
...     stratify_by_unique_values=True
... )
>>>
>>> # Display results
>>> print(results_df)

Visualization

sparse_nmf.plot_nmf_factor_distributions(W, n_factors_to_plot=None, figsize=None, bins=50, kde=True, title=None, factor_names=None, max_cols=4, max_samples=50000, sharex=True, sharey=False, log_x=False, log_y=False, filter_zeros=False, zero_threshold=1e-10, return_fig=False)[source]

Visualize the distribution of each NMF factor across all samples using a faceted plot.

Creates a grid of subplots where each subplot shows the distribution (histogram + KDE) of one NMF factor across all samples. This helps identify factors with different distributions, sparsity patterns, or outliers.

Optimized for large datasets by sampling when n_samples > max_samples.

Parameters:
  • W (np.ndarray or torch.Tensor) – NMF factor matrix of shape (n_samples, n_components). Each column represents one NMF factor, each row represents one sample.

  • n_factors_to_plot (int, optional) – Number of factors to plot. If None, plots all factors. If specified, plots the first n_factors_to_plot factors.

  • figsize (tuple, optional) – Figure size in inches (width, height). If None, auto-calculated based on grid size.

  • bins (int, default 50) – Number of bins for histogram.

  • kde (bool, default True) – If True, overlay a kernel density estimate (KDE) curve on the histogram. Automatically disabled for very large datasets (>100k samples) for performance.

  • title (str, optional) – Overall plot title. If None, auto-generated.

  • factor_names (list, optional) – Custom names for factors. If None, uses “Factor 1”, “Factor 2”, etc.

  • max_cols (int, default 4) – Maximum number of columns in the subplot grid.

  • max_samples (int, default 50000) – Maximum number of samples to use for plotting. If n_samples > max_samples, randomly samples max_samples for faster computation. Statistics (mean, std, etc.) are still computed on full dataset.

  • sharex (bool, default True) – If True, share x-axis across all subplots. Makes it easier to compare value ranges.

  • sharey (bool, default False) – If True, share y-axis across all subplots. Makes it easier to compare densities.

  • log_x (bool, default False) – If True, use logarithmic scale for x-axis. Useful for visualizing sparse distributions with long tails. Automatically adds small epsilon to handle zeros.

  • log_y (bool, default False) – If True, use logarithmic scale for y-axis (density). Useful when density values span multiple orders of magnitude.

  • filter_zeros (bool, default False) – If True, exclude zero values from the distribution plot. Useful for focusing on the non-zero tail of sparse distributions. Statistics are still computed on full dataset.

  • zero_threshold (float, default 1e-10) – Values below this threshold are considered “zero” when filter_zeros=True. Useful for filtering out numerical noise.

  • return_fig (bool, default False) – If True, return the matplotlib figure for further customization. If False, display the plot and return None.

Returns:

fig – The matplotlib figure if return_fig=True, otherwise None.

Return type:

matplotlib.figure.Figure or None

Examples

>>> from sparse_nmf import train_sparse_nmf, plot_nmf_factor_distributions
>>>
>>> # Train NMF
>>> X_nmf, nmf_model = train_sparse_nmf(X_sparse, n_components=256)
>>>
>>> # Visualize first 12 factors (fast, even for large datasets)
>>> plot_nmf_factor_distributions(X_nmf, n_factors_to_plot=12)
>>>
>>> # Customize with factor names
>>> factor_names = [f"Component {i+1}" for i in range(12)]
>>> fig = plot_nmf_factor_distributions(
...     X_nmf,
...     n_factors_to_plot=12,
...     factor_names=factor_names,
...     kde=True,
...     return_fig=True
... )
>>> fig.savefig("nmf_distributions.png")

Sample data

sparse_nmf.data.generate_synthetic_sparse(n_samples=500, n_features=1000, n_components=8, density=0.05, noise=0.1, seed=0)[source]

Build a (n_samples, n_features) CSR matrix with rank-n_components structure plus sparse noise.

The matrix is constructed as W @ H + noise, then thresholded to keep only the top density fraction of entries — emulating the sparsity pattern of, e.g., gene-association count data.

Parameters:
  • n_samples (int) – Number of rows in the output matrix.

  • n_features (int) – Number of columns in the output matrix.

  • n_components (int) – Rank of the underlying low-rank structure. NMF with n_components should recover (close to) this.

  • density (float) – Fraction of non-zero entries in the output.

  • noise (float) – Standard deviation of additive Gaussian noise on the dense product before thresholding. Larger noise makes recovery harder.

  • seed (int) – RNG seed for reproducibility.

Returns:

Shape (n_samples, n_features), dtype float32, non-negative.

Return type:

scipy.sparse.csr_matrix
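
Examples

A minimal sketch using the defaults documented above (the exact non-zero count depends on the thresholding step):

>>> from sparse_nmf.data import generate_synthetic_sparse
>>> X = generate_synthetic_sparse(seed=0)  # defaults: 500 x 1000, density 0.05
>>> X.shape
(500, 1000)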

sparse_nmf.data.load_synthetic_sparse()[source]

Load the bundled synthetic_sparse.npz, generating it first if missing.

Returns a matrix with the same shape as generate_synthetic_sparse()’s defaults, so callers can switch between the two without changing downstream code.
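
Examples

A hedged sketch; per the note above, the result matches generate_synthetic_sparse()’s default shape:

>>> from sparse_nmf.data import load_synthetic_sparse
>>> X = load_synthetic_sparse()
>>> X.shape
(500, 1000)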