All functions in this module take in data as boolean design matrices (i.e. observations x features), and return a feature association measure (i.e. features x features).
Note that some of these functions return valid adjacency matrices (e.g. a feature is not associated to itself), while others return covariance or correlations (features are partially or fully correlated to themselves). Many of the more basic association measures are given in terms of conditional probabilities via a contingency table, for which we adopt the following notation, where A and B are individual feature columns of X:
Where appropriate, the methods here allow for additive (Laplace) smoothing through the pseudocts parameter, even in cases where this is not traditionally done (e.g. cosine similarity).
Where we can, we give interpretations of each measure that allow for this.
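As a rough illustration of the contingency-table view with additive smoothing, the sketch below tabulates the four joint outcomes of two feature columns and derives a conditional probability from them. The helper name `smoothed_joint`, the example columns, and the choice to add the pseudocount to every cell are assumptions made for illustration, not the module's implementation.

```python
def smoothed_joint(a, b, pseudocts=0.5):
    """Smoothed joint probabilities P(A=i, B=j) for two boolean columns."""
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for ai, bi in zip(a, b):
        counts[(int(ai), int(bi))] += 1
    # Assumption of this sketch: the pseudocount is added to all four cells.
    total = len(a) + 4 * pseudocts
    return {cell: (n + pseudocts) / total for cell, n in counts.items()}


# Two hypothetical feature columns A and B over six observations.
A = [1, 1, 0, 1, 0, 0]
B = [1, 0, 0, 1, 1, 0]
joint = smoothed_joint(A, B)

# Conditional probability P(A=1 | B=1) read off the smoothed table.
p_a_given_b = joint[(1, 1)] / (joint[(1, 1)] + joint[(0, 1)])
```

With a positive pseudocount no cell of the table is ever exactly zero, which keeps ratios and logarithms of these probabilities finite.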
associations
Functions:
-
coocur_prob–probability of a co-occurrence per-observation
-
odds_ratio–Ratio of the odds of a true pos/neg to false pos/neg
-
mutual_information–Mutual Information over binary random variables
-
chow_liu–Chow-Liu Tree on the features of a (binary) design matrix
-
yule_y–a.k.a. Coefficient of Colligation.
-
yule_q–a.k.a. Goodman & Kruskal's gamma for 2x2.
-
ochiai–a.k.a. cosine similarity on binary sets
-
binary_cosine_similarity–alias of ochiai(X), provided for user convenience
-
hyperbolic_project–Bipartite projection; see Newman (2001).
-
resource_project–Bipartite projection due to Zhou et al (2007).
-
high_salience_skeleton–Backboning technique from Grady et al. (2012).
-
doubly_stochastic_filter–From P.B. Slater(2009),
-
forest_pursuit–Structure estimation under spreading-process assumption (Sexton, 2025)
coocur_prob
probability of a co-occurrence per-observation
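A minimal sketch of the idea, assuming the smoothed estimate takes the form (n_ij + α) / (N + 2α), where n_ij counts the observations in which features i and j are both active. The function name `cooccurrence_prob` and the smoothing form are hypothetical; the module may smooth differently.

```python
def cooccurrence_prob(X, pseudocts=0.5):
    """Per-observation probability that two features are active together."""
    n_obs, n_feat = len(X), len(X[0])
    probs = [[0.0] * n_feat for _ in range(n_feat)]
    for i in range(n_feat):
        for j in range(n_feat):
            both = sum(1 for row in X if row[i] and row[j])
            # Assumed smoothing: Beta(a, a)-style pseudocounts on a binomial rate.
            probs[i][j] = (both + pseudocts) / (n_obs + 2 * pseudocts)
    return probs


# Hypothetical design matrix: 4 observations x 3 features.
X = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]
P = cooccurrence_prob(X)
```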
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Source code in affinis/associations.py
odds_ratio
Ratio of the odds of a true pos/neg to false pos/neg
For associations, we replace pos/neg and true/false with a=1/0 and b=1/0, which gives the odds ratio as \(OR = \frac{P(a=1,b=1)\,P(a=0,b=0)}{P(a=1,b=0)\,P(a=0,b=1)}\).
Additive smoothing is applied via the initial calculation of conditional probabilities.
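For a single pair of columns, the odds ratio can be sketched directly from the smoothed 2x2 cell counts, since the normalizing constants cancel. The helper `odds_ratio_pair` and the per-cell smoothing are assumptions for illustration only.

```python
def odds_ratio_pair(a, b, pseudocts=0.5):
    """Odds ratio of two boolean columns from smoothed 2x2 cell counts."""
    # Start every cell at the pseudocount (an assumption of this sketch).
    n = {(i, j): pseudocts for i in (0, 1) for j in (0, 1)}
    for ai, bi in zip(a, b):
        n[(int(ai), int(bi))] += 1
    # Concordant cells over discordant cells.
    return (n[(1, 1)] * n[(0, 0)]) / (n[(1, 0)] * n[(0, 1)])


# Hypothetical feature columns over six observations.
A = [1, 1, 0, 1, 0, 0]
B = [1, 0, 0, 1, 1, 0]
or_ab = odds_ratio_pair(A, B)
```

Without smoothing, an empty discordant cell would make the ratio undefined; a positive pseudocount avoids the division by zero.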
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
mutual_information
Mutual Information over binary random variables
For use in e.g. Chow-Liu Trees
An estimate for the mutual information (i.e., between the sample distributions) can be derived from the marginals, as with OR/Yule's Q/Y/etc., though it is more compactly represented as a pairwise sum over the domains of each distribution being compared: \(I(A;B) = \sum_{a \in \{0,1\}} \sum_{b \in \{0,1\}} P(a,b) \log \frac{P(a,b)}{P(a)\,P(b)}\).
Additive smoothing is applied via the initial calculation of conditional probabilities.
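The pairwise sum can be sketched for two binary columns as follows. The helper `mutual_information_pair`, the use of natural logarithms, and the per-cell smoothing are assumptions of this example rather than the module's implementation.

```python
import math


def mutual_information_pair(a, b, pseudocts=0.5):
    """Mutual information (in nats) between two boolean columns."""
    n = {(i, j): pseudocts for i in (0, 1) for j in (0, 1)}
    for ai, bi in zip(a, b):
        n[(int(ai), int(bi))] += 1
    total = sum(n.values())
    p = {cell: c / total for cell, c in n.items()}
    # Marginals of each variable from the joint table.
    p_a = {i: p[(i, 0)] + p[(i, 1)] for i in (0, 1)}
    p_b = {j: p[(0, j)] + p[(1, j)] for j in (0, 1)}
    return sum(
        p[(i, j)] * math.log(p[(i, j)] / (p_a[i] * p_b[j]))
        for i in (0, 1)
        for j in (0, 1)
        if p[(i, j)] > 0  # empty cells contribute nothing
    )


# Unsmoothed checks on hypothetical columns: independent vs. identical.
mi_independent = mutual_information_pair([1, 1, 0, 0], [1, 0, 1, 0], pseudocts=0)
mi_identical = mutual_information_pair([1, 1, 0, 0], [1, 1, 0, 0], pseudocts=0)
```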
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns:
chow_liu
Chow-Liu Tree on the features of a (binary) design matrix
Computes mutual information over all pairs of features, and returns the maximum spanning tree on them. Assumes a symmetric adjacency is wanted.
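The procedure can be sketched with a pairwise mutual information score and Kruskal's algorithm run on edges sorted from strongest to weakest. All names here (`pair_mi`, `chow_liu_sketch`) and the smoothing details are hypothetical; this is an illustration of the technique, not the module's code.

```python
import math
from itertools import combinations


def pair_mi(a, b, pseudocts=0.5):
    """Smoothed mutual information between two boolean columns."""
    n = {(i, j): pseudocts for i in (0, 1) for j in (0, 1)}
    for ai, bi in zip(a, b):
        n[(int(ai), int(bi))] += 1
    total = sum(n.values())
    p = {cell: c / total for cell, c in n.items()}
    pa = {i: p[(i, 0)] + p[(i, 1)] for i in (0, 1)}
    pb = {j: p[(0, j)] + p[(1, j)] for j in (0, 1)}
    return sum(
        p[(i, j)] * math.log(p[(i, j)] / (pa[i] * pb[j]))
        for i in (0, 1) for j in (0, 1) if p[(i, j)] > 0
    )


def chow_liu_sketch(X, pseudocts=0.5):
    """Symmetric adjacency of the maximum spanning tree over pairwise MI."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    scored = sorted(
        ((pair_mi(cols[i], cols[j], pseudocts), i, j)
         for i, j in combinations(range(n_feat), 2)),
        reverse=True,  # strongest associations first
    )
    parent = list(range(n_feat))

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    adj = [[0.0] * n_feat for _ in range(n_feat)]
    for mi, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two components
            parent[ri] = rj
            adj[i][j] = adj[j][i] = mi
    return adj


# Hypothetical data: features 0 and 1 are identical, feature 2 is noisier.
X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
tree = chow_liu_sketch(X)
```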
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns: Adjacency matrix of the Chow Liu MST
yule_y
a.k.a. Coefficient of Colligation.
Möbius transform of the Odds Ratio to the range [-1,1], scaled so that the transformed contingency table of each feature pair has unit off-diagonals and diagonal entries (associations) equal to \(\sqrt{OR}\).
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns: square matrix containing Yule's Y
yule_q
a.k.a. Goodman & Kruskal's gamma for 2x2.
Möbius transform of the Odds Ratio to the range [-1,1]: \(Q = \frac{OR - 1}{OR + 1}\).
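Both Yule coefficients are the same Möbius map applied to the odds ratio, with Yule's Y applying it to the square root. A small sketch on scalar odds ratios follows; the helper names `q_from_or` and `y_from_or` are hypothetical, and the module's functions operate on a full feature matrix instead.

```python
import math


def q_from_or(odds_ratio):
    """Yule's Q: maps OR in (0, inf) to (-1, 1), with OR = 1 mapped to 0."""
    return (odds_ratio - 1) / (odds_ratio + 1)


def y_from_or(odds_ratio):
    """Yule's Y: the same map applied to sqrt(OR)."""
    root = math.sqrt(odds_ratio)
    return (root - 1) / (root + 1)


q = q_from_or(9.0)
y = y_from_or(9.0)
```

Both coefficients are zero when the features are independent (OR = 1) and change sign when the odds ratio is inverted.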
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns: square matrix containing Yule's Q
ochiai
a.k.a. cosine similarity on binary sets
Effectively an uncentered correlation, but for binary observations the "cosine similarity" is also called the Ochiai Coefficient between two sets \(A,B\), where binary "1" stands for an element belonging to the set. See Janson and Vegelius (1981).
In our use case, we define the measure as \(\sqrt{P(A|B)\,P(B|A)}\).
This interpretation of cosine similarity as the geometric mean of conditional probabilities is particularly useful when trying to approximate interaction rates, and especially to apply additive smoothing.
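The geometric-mean reading can be checked numerically against the usual cosine formula. In this sketch the helper `ochiai_pair` and the form of the smoothed conditionals, (n_AB + α) / (n_B + 2α), are assumptions for illustration.

```python
import math


def ochiai_pair(a, b, pseudocts=0.0):
    """Geometric mean of P(A|B) and P(B|A) for two boolean columns."""
    n_a, n_b = sum(a), sum(b)
    n_ab = sum(1 for ai, bi in zip(a, b) if ai and bi)
    # Assumed smoothing form for each conditional probability.
    p_a_given_b = (n_ab + pseudocts) / (n_b + 2 * pseudocts)
    p_b_given_a = (n_ab + pseudocts) / (n_a + 2 * pseudocts)
    return math.sqrt(p_a_given_b * p_b_given_a)


# Hypothetical feature columns over six observations.
A = [1, 1, 0, 1, 0, 0]
B = [1, 0, 0, 1, 1, 0]

unsmoothed = ochiai_pair(A, B)
# Classic cosine similarity of the two binary vectors, for comparison.
cosine = sum(x * y for x, y in zip(A, B)) / math.sqrt(sum(A) * sum(B))
smoothed = ochiai_pair(A, B, pseudocts=0.5)
```

With `pseudocts=0` the two expressions agree; with a positive pseudocount the estimate stays defined even when a feature never occurs.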
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns: square cosine similarity matrix (incl. ones in the diagonal)
binary_cosine_similarity
alias of ochiai(X), provided for user convenience
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
Returns: cosine similarity on binary feature vectors
hyperbolic_project
Newman (2001) TODO: add to tests and document
Note: passing a pseudocount currently does not have an effect, as there is not a trivial way to interpret this bipartite projection as a probability.
Parameters:
-
X(FeatMat) –feature matrix
resource_project
Bipartite projection due to Zhou et al (2007).
Goes one step further than hyperbolic projection, by viewing each node as having some “amount” of a resource to spend, which gets re-allocated by observational unit. Can be interpreted as one step of iterative proportional fitting on the bipartite adjacency matrix.
NOTE: For additive smoothing to work, we assume no smoothing is needed for the "forward" projection (features->observations), since we assume all observations have some feature participation, while some features may have no (observed) observations.
By default, we symmetrize with "maximum", meaning that association is considered as the strongest of the directions it could take. This can be overridden with any function of two same-shaped arrays.
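A minimal, unsmoothed sketch of the two-step resource allocation follows: each feature splits one unit of resource evenly among its observations, and each observation then splits what it received evenly among its features. The function name `resource_project_sketch` is hypothetical, and the sketch assumes every row and column of X has at least one active entry.

```python
def resource_project_sketch(X, sym_func=max):
    """Directed resource-allocation weights W and their symmetrization S."""
    n_obs, n_feat = len(X), len(X[0])
    obs_deg = [sum(row) for row in X]
    feat_deg = [sum(row[j] for row in X) for j in range(n_feat)]
    # W[i][j]: share of feature j's resource that ends up at feature i.
    W = [[0.0] * n_feat for _ in range(n_feat)]
    for i in range(n_feat):
        for j in range(n_feat):
            W[i][j] = sum(
                1.0 / (obs_deg[l] * feat_deg[j])
                for l in range(n_obs)
                if X[l][i] and X[l][j]
            )
    # Symmetrize elementwise; "maximum" keeps the stronger direction.
    S = [[sym_func(W[i][j], W[j][i]) for j in range(n_feat)]
         for i in range(n_feat)]
    return W, S


# Hypothetical design matrix: 4 observations x 3 features.
X = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
]
W, S = resource_project_sketch(X)
```

Each column of `W` sums to one because a feature's resource is conserved through both steps, which is why the directed weights are generally asymmetric before `sym_func` is applied.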
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
-
sym_func–function to symmetrize the resulting association measure
Returns: symmetrized "resource projection" similarities
high_salience_skeleton
high_salience_skeleton(
X: FeatMat,
prior_dists: SimsMat | None = None,
pseudocts: PsdCts = "min-connect",
)
Backboning technique from Grady et al. (2012). Calculates shortest paths from every node, and counts the number of trees each edge ended up being used in.
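The counting step can be sketched with Dijkstra's algorithm run from every node over a given distance matrix. The helper names and the toy distances below are hypothetical; this sketch takes distances directly instead of deriving them from X.

```python
import heapq
import math
from collections import Counter


def shortest_path_tree(dists, root):
    """Edges of the shortest-path tree rooted at `root` (Dijkstra)."""
    n = len(dists)
    best = [math.inf] * n
    parent = [None] * n
    best[root] = 0.0
    heap = [(0.0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v in range(n):
            if v == u or math.isinf(dists[u][v]):
                continue
            nd = d + dists[u][v]
            if nd < best[v]:
                best[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return {tuple(sorted((v, p))) for v, p in enumerate(parent) if p is not None}


def salience_counts(dists):
    """Number of shortest-path trees (one per root) that use each edge."""
    counts = Counter()
    for root in range(len(dists)):
        counts.update(shortest_path_tree(dists, root))
    return counts


# Toy distances: going 0 -> 1 -> 2 (length 2) beats the direct 0 -> 2 (length 3).
dists = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
counts = salience_counts(dists)
```

Edges that appear in nearly every tree form the skeleton, while edges such as (0, 2) here, which no shortest path uses, receive a count of zero.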
Parameters:
-
X(FeatMat) –feature matrix
-
prior_dists(SimsMat | None, default:None) –prior distances for shortest paths (default: -log(ochiai))
-
pseudocts(PsdCts, default:'min-connect') –additive smoothing parameter
doubly_stochastic_filter
doubly_stochastic_filter(
X: FeatMat,
reg: float = 0.1,
pseudocts: PsdCts = 0.5,
prior_dists: SimsMat | None = None,
**sink_kws
) -> SimsMat
From P.B. Slater (2009),
“A two-stage algorithm for extracting the multiscale backbone of complex weighted networks”
This implementation is effectively a wrapper around sinkhorn and min_connected_filter (affinis.filter.min_connected_filter) to accomplish the "two stage" process used.
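The two stages can be sketched as (1) alternating row and column normalization until the similarity matrix is approximately doubly stochastic, and (2) keeping the strongest edges until the graph becomes connected. Both helpers below are hypothetical stand-ins written for illustration; they are not the module's `sinkhorn` or `min_connected_filter`, and the plain Sinkhorn-Knopp iteration used here ignores the `reg` parameter.

```python
def sinkhorn_knopp(S, n_iter=500):
    """Alternately normalize rows and columns of a nonnegative matrix."""
    P = [row[:] for row in S]
    n = len(P)
    for _ in range(n_iter):
        for i in range(n):
            r = sum(P[i])
            P[i] = [v / r for v in P[i]]
        for j in range(n):
            c = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] /= c
    return P


def strongest_connected_edges(P):
    """Add edges from strongest to weakest until the graph is connected."""
    n = len(P)
    edges = sorted(
        ((max(P[i][j], P[j][i]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    kept, components = [], n
    for _, i, j in edges:
        if components == 1:
            break
        kept.append((i, j))  # every edge above the final threshold is kept
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return kept


# Hypothetical symmetric similarity matrix with an empty diagonal.
S = [
    [0.0, 5.0, 1.0, 1.0],
    [5.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 3.0],
    [1.0, 1.0, 3.0, 0.0],
]
P = sinkhorn_knopp(S)
backbone = strongest_connected_edges(P)
```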
Parameters:
-
X(FeatMat) –feature matrix
-
pseudocts(PsdCts, default:0.5) –additive smoothing parameter
-
prior_dists–(Default will calculate cosine similarity)
-
**sink_kws–kwargs to pass to sinkhorn()
forest_pursuit
forest_pursuit(
X: FeatMat,
mode: FPopts = "edge-prob",
pseudocts: PsdCts = "min-connect",
prior_dists: SimsMat | None = None,
**efm_kws
) -> SimsMat
Structure estimation under spreading-process assumption (Sexton, 2025)
This is the original implementation, as demonstrated in Rachael Sexton's dissertation (2025). For more details on how it works, see Approximate Recovery in Near-linear Time by Forest Pursuit, and subsequent chapters for modifications and descriptions of the "modes" presented here.
Which mode you select determines how you will interpret the resulting association scores.
- If "edge-prob", FP will estimate the probability of an edge being activated, given that the two nodes it connects are known to be active. This is a formalization of what it means for an edge to "exist" in the Desire-Path Density framing. See forest_pursuit_edge.
- If "counts", FP simply reports the estimated number of edge activations found from spanning trees in rows of X. See forest_pursuit_cts.
- If "forest-max", use Expected Forest Maximization, an E-M scheme to estimate the posterior probability of both the edge activations and the underlying graph structure, via alternating minimization. See expected_forest_maximization.
- If "interaction", re-weight the "edge-prob" values to be the likelihood of observing an edge activation, given the observed data (useful e.g. for Bayesian inference). See forest_pursuit_interaction.
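To make the "counts" and "edge-prob" modes concrete, here is a rough, unsmoothed sketch. It assumes that each observation's activation tree is approximated by the minimum spanning tree over its active nodes under the prior distances, and that the edge probability is the edge count divided by the co-occurrence count. The names `mst_edges` and `forest_pursuit_sketch`, the toy distances, and these simplifications are assumptions for illustration, not the module's implementation.

```python
from itertools import combinations


def mst_edges(nodes, dists):
    """Prim's algorithm on the subgraph induced by `nodes`."""
    if len(nodes) < 2:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        u, v = min(
            ((u, v) for u in in_tree for v in nodes if v not in in_tree),
            key=lambda e: dists[e[0]][e[1]],
        )
        edges.append((min(u, v), max(u, v)))
        in_tree.add(v)
    return edges


def forest_pursuit_sketch(X, dists):
    """Edge-activation counts and edge probability given co-occurrence."""
    n_feat = len(X[0])
    edge_cts = {pair: 0 for pair in combinations(range(n_feat), 2)}
    cooc_cts = dict.fromkeys(edge_cts, 0)
    for row in X:
        active = [j for j, x in enumerate(row) if x]
        for pair in combinations(active, 2):
            cooc_cts[pair] += 1
        for edge in mst_edges(active, dists):
            edge_cts[edge] += 1
    edge_prob = {
        pair: edge_cts[pair] / cooc_cts[pair]
        for pair in edge_cts
        if cooc_cts[pair] > 0
    }
    return edge_cts, edge_prob


# Toy prior distances: nodes 0 and 2 are far apart, node 1 sits between them.
dists = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
X = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
edge_cts, edge_prob = forest_pursuit_sketch(X, dists)
```

In this toy example nodes 0 and 2 co-occur twice, but the direct edge between them is used only when node 1 is absent, so its estimated probability is lower than that of the other two edges.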
Parameters:
-
X(FeatMat) –feature matrix
-
mode(FPopts, default:'edge-prob') –inference mode
-
pseudocts(PsdCts, default:'min-connect') –additive smoothing parameter
-
prior_dists(SimsMat | None, default:None) –previously calculated inter-node distances (default: -log(cos))
-
**efm_kws–kwargs to pass to expected_forest_maximization if that mode is selected.
Forest Pursuit Modes
For the sake of completeness, we provide the underlying functions for each of the forest_pursuit modes here.
forest_pursuit_cts
forest_pursuit_cts(
X: FeatMat, prior_dists: SimsMat | None = None
) -> SimsMat
Point estimate for the number of actual edge activations, rather than node-node co-occurrences. Uses the Empirical Bayes estimate of the Spanning Forest Density.
Parameters:
Returns: counts for approximate Steiner tree occurrences.
forest_pursuit_edge
forest_pursuit_edge(
X: FeatMat,
prior_dists: SimsMat | None = None,
pseudocts: PsdCts = "min-connect",
) -> SimsMat
Point estimate for edge-activation probability, conditional on both nodes being a priori activated. Uses the Spanning Forest Density non-parametric estimator.
Parameters:
-
X(FeatMat) –feature matrix
-
prior_dists(SimsMat | None, default:None) –default uses -log(cos)
-
pseudocts(PsdCts, default:'min-connect') –additive smoothing parameter
Returns: probability of edge traversal given a co-occurrence.
forest_pursuit_interaction
forest_pursuit_interaction(
X: FeatMat,
prior_dists: SimsMat | None = None,
precalc_prob: SimsMat | None = None,
pseudocts: PsdCts = "min-connect",
) -> SimsMat
Point estimate for the probability of observing an edge traversal, using the Spanning Forest Density non-parametric estimator. Weights the conditional edge traversal probability by the base co-occurrence probability.
If you have already calculated forest_pursuit_edge, you can pass it as precalc_prob.
Parameters:
-
X(FeatMat) –feature matrix
-
prior_dists(SimsMat | None, default:None) –default uses -log(cos)
-
precalc_prob(SimsMat | None, default:None) –to avoid re-computing the edge probability if you already have it
-
pseudocts(PsdCts, default:'min-connect') –additive smoothing parameter
Returns: probability of observing an edge traversal
expected_forest_maximization
expected_forest_maximization(
X: FeatMat,
prior_struct: SimsMat = None,
beta: float = 0.001,
eps: float = 1e-05,
max_iter: int = 100,
verbose: bool = False,
) -> SimsMat
Expectation Maximization Scheme to recover structure.
Shown to give a minor accuracy improvement over vanilla Forest Pursuit, at the cost of significantly decreased scalability, as it uses an alternating-minimization scheme to jointly estimate edge activation probabilities and network structure.
For more detail, see Expected Forest Maximization.
Parameters:
-
X(FeatMat) –feature matrix
-
prior_struct(SimsMat, default:None) –
-
beta(float, default:0.001) –
-
eps(float, default:1e-05) –
-
max_iter(int, default:100) –
-
verbose(bool, default:False) –
Returns: