Interpretability and Shapley Values
Written by James Camp and Jacob Collier
Motivation
Deep Learning/Machine Learning (DL/ML) is increasingly embedded in critical domains like healthcare, finance, law, and policing. Models often achieve impressive predictive accuracy, yet for real-world adoption we also need trust and transparency: simply put, stakeholders must understand why an ML model outputs a particular decision.
While some simpler models (like linear regression) are intrinsically interpretable, state-of-the-art models (like deep neural networks) appear to us as “black boxes”: we have a hard time understanding why they make the predictions they do. Shapley values provide a formal, axiom-based framework for attributing a model’s prediction to its input features, giving us a principled way to peer inside the “black box.”
Shapley Value
We define the Shapley value of feature \(i\) as:
\[ \phi_i = \frac{1}{n} \sum_{S \subseteq [n] \setminus \{i\}} \frac{v\bigl(S \cup \{i\}\bigr) - v(S)}{\binom{n-1}{|S|}}. \]
Equivalently, this is the average over all coalition sizes \(k\) of the average marginal contribution over sets \(S\) of size \(k\):
\[ \phi_i = \frac{1}{n} \sum_{k=0}^{n-1} \frac{1}{\binom{n-1}{k}} \sum_{\substack{S \subseteq [n] \setminus \{i\}:|S|=k}} \bigl(v(S \cup \{i\}) - v(S)\bigr). \]
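For example, with \(n = 2\) and \(i = 1\), the sum contains only two marginal contributions:
\[ \phi_1 = \frac{1}{2}\Bigl[\bigl(v(\{1\}) - v(\emptyset)\bigr) + \bigl(v(\{1, 2\}) - v(\{2\})\bigr)\Bigr]. \]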

Let’s try computing some Shapley Values:
Case \(i = 1\) and \(n = 4\): the sum ranges over all \(2^{3} = 8\) subsets \(S \subseteq \{2, 3, 4\}\), i.e. 8 marginal contributions \(v(S \cup \{1\}) - v(S)\) to evaluate.

Case \(i = 1\) and \(n = 5\): the sum ranges over all \(2^{4} = 16\) subsets \(S \subseteq \{2, 3, 4, 5\}\).

Case \(i = 1\) and \(n = 6\): the sum ranges over all \(2^{5} = 32\) subsets \(S \subseteq \{2, 3, 4, 5, 6\}\), and each additional feature doubles the count again.
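To make the growth concrete, here is a minimal brute-force sketch in Python (the helper name exact_shapley and the toy value function at the bottom are illustrative choices, not fixed by anything above) that computes \(\phi_i\) by enumerating every subset:

```python
from itertools import combinations
from math import comb

def exact_shapley(v, n, i):
    """Exact Shapley value of feature i (features are 1..n).

    v maps a frozenset of features to a number. The loop enumerates all
    2^(n-1) subsets that exclude i, so the cost grows exponentially in n.
    """
    others = [j for j in range(1, n + 1) if j != i]
    phi = 0.0
    for k in range(n):                       # coalition sizes 0 .. n-1
        for S in combinations(others, k):
            S = frozenset(S)
            marginal = v(S | {i}) - v(S)     # marginal contribution of i
            phi += marginal / comb(n - 1, k)
    return phi / n

# Illustrative toy game: feature 1 is worth 2, feature 2 is worth 1,
# they share an interaction bonus of 0.5, and features 3 and 4 are dummies.
def toy_v(S):
    return 2.0 * (1 in S) + 1.0 * (2 in S) + 0.5 * (1 in S and 2 in S)

print(exact_shapley(toy_v, n=4, i=1))  # 2.25: its own 2.0 plus half the 0.5 bonus
```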

Observation
Directly computing all these terms requires \(O(2^n)\) evaluations of \(v\), which quickly becomes infeasible for large \(n\).
Key question: How do we efficiently compute Shapley values without enumerating all \(2^n\) subsets?
Regression Formulation
Rather than computing every Shapley value exactly, it is possible to approximate them with a linear regression (a standard weighted least-squares formulation is written out just after this list); however, this approach still has some issues:
Linear regression cannot effectively capture non-linear relationships, so if the model being explained is highly non-linear, a regression formulation is unlikely to be accurate.
Linear regression can also struggle to cope with significant outliers in the data.
Fitting this regression exactly still requires evaluating the value function on every coalition, so the \(O(2^n)\) time complexity remains.
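For reference, the weighted least-squares formulation mentioned above, with the Shapley kernel weights that make its exact solution equal the Shapley values, is
\[ \min_{\phi_1, \ldots, \phi_n} \; \sum_{\emptyset \subsetneq S \subsetneq [n]} \frac{n-1}{\binom{n}{|S|}\,|S|\,(n-|S|)} \Bigl( v(S) - v(\emptyset) - \sum_{i \in S} \phi_i \Bigr)^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} \phi_i = v([n]) - v(\emptyset). \]
Solving it over all \(2^n - 2\) proper, non-empty coalitions recovers the exact Shapley values, which is why the regression view by itself does not remove the exponential cost.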

Kernel Shap
Motivation: The \(O(2^n)\) time complexity scales poorly and is infeasible for most neural networks; however, by fitting the linear regression on only a subset of the inputs, we can reduce the cost.
Kernel Shap works by approximating the regression using only a subset of the rows, fitting it to the corresponding subset of value-function evaluations rather than to all of them.
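As a rough sketch of this idea (a simplified Monte-Carlo variant, not a faithful reproduction of any particular implementation: it draws coalitions from the Shapley kernel distribution and solves an ordinary least-squares problem with the efficiency constraint eliminated):

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_shap(v, n, num_samples=2048):
    """Monte-Carlo, Kernel-Shap-style estimate of all n Shapley values.

    v maps a set of features {1..n} to a number. Coalitions are drawn from
    the (normalized) Shapley kernel distribution, so an unweighted least
    squares fit on the sample approximates the kernel-weighted regression.
    """
    v_empty = v(set())
    delta = v(set(range(1, n + 1))) - v_empty   # v([n]) - v(empty set)

    # Under the Shapley kernel, P(|S| = k) is proportional to 1 / (k (n - k)).
    sizes = np.arange(1, n)
    size_probs = 1.0 / (sizes * (n - sizes))
    size_probs /= size_probs.sum()

    # Eliminate phi_n via the constraint sum(phi) = delta, then do plain OLS.
    X = np.zeros((num_samples, n - 1))
    y = np.zeros(num_samples)
    for t in range(num_samples):
        k = rng.choice(sizes, p=size_probs)
        S = {int(j) for j in rng.choice(np.arange(1, n + 1), size=k, replace=False)}
        z = np.array([float(i in S) for i in range(1, n + 1)])
        X[t] = z[:-1] - z[-1]
        y[t] = v(S) - v_empty - z[-1] * delta

    phi_head, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.append(phi_head, delta - phi_head.sum())
```

With the toy value function from the brute-force sketch above, the estimate should approach \((2.25, 1.25, 0, 0)\) as the sample budget grows, while evaluating \(v\) far fewer than \(2^n\) times once \(n\) is large.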

The advantage of this approach is that we can fit the regression with significantly less compute, since we only evaluate the value function on a subset of coalitions; however, there are some disadvantages.
The first is that some rows are more impactful than others: the approximation may suffer if it samples a subset that is largely non-representative of the dataset, or one with an abnormal slope in a given dimension.
Key Question: How can we ensure that we choose a good subset of rows, without significantly increasing our time complexity?
Leverage Scores
Goals: We would like to maintain performance while getting a theoretical guarantee of accuracy.
Kernel Shap approximates the linear regression, but it suffers from the fact that some points matter more than others for preserving the fitted line. One way to address this is to assign each point a leverage score that measures how important it is to maintaining the slope.
To make this precise, let \(A\) be the design matrix of the regression (rows indexed by coalitions \(S\), entries indicating which features are in \(S\)) and \(b\) the vector of corresponding values \(v(S)\). The leverage score of row \(S\), with row vector \(a_S\), is
\[ \ell_S = a_S^{\top} \bigl(A^{\top} A\bigr)^{-1} a_S, \]
the diagonal entry of the projection (hat) matrix for that row; rows with higher leverage have more influence on the fitted coefficients.

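A small sketch of how leverage scores can be computed for any design matrix \(A\), using a thin QR factorization instead of forming \(A^{\top} A\) explicitly (numerically safer, but equivalent):

```python
import numpy as np

def leverage_scores(A):
    """Leverage score of each row of A: l_i = a_i^T (A^T A)^{-1} a_i.

    Equal to the squared row norms of Q from a thin QR factorization,
    i.e. the diagonal of the projection matrix A (A^T A)^{-1} A^T.
    """
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q * Q, axis=1)

# The scores sum to the rank of A (here 2), and the unusual fourth row
# carries almost all of the leverage in its direction.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [10.0, 0.0]])
print(leverage_scores(A), leverage_scores(A).sum())
```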
Leverage Shap
Idea: Kernel Shap, but rather than sampling a random subset of rows, we sample rows with probability proportional to their leverage scores. This method achieves significantly lower loss than either Kernel Shap or Optimized Kernel Shap.
Leverage Shap also comes with a theoretical guarantee: letting \(\phi\) denote the exact Shapley values and \(\gamma = \frac{\|A \phi - b\|_2^2}{\|A \phi\|_2^2}\) measure how far the game is from exactly linear, for any \(\epsilon > 0\), using \(O(n \log(n) + \frac{n}{\epsilon})\) samples it returns, with probability \(\frac{99}{100}\), an estimate \(\tilde{\phi}\) satisfying a bound of the form \(\|\tilde{\phi} - \phi\|_2^2 \leq \epsilon \, \gamma \, \|\phi\|_2^2\).
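A minimal sketch of the sampling step, in the same simplified unconstrained setting as the Kernel Shap sketch above (the real Leverage Shap works with the specific Shapley regression matrix, where the leverage scores can be computed efficiently, and handles the efficiency constraint as well):

```python
import numpy as np

rng = np.random.default_rng(0)

def leverage_sample_lstsq(A, b, num_samples):
    """Approximate argmin_x ||A x - b||_2 by leverage score sampling.

    Rows are sampled with probability proportional to their leverage score
    and rescaled by 1 / sqrt(num_samples * p_i), so the subsampled problem
    is an unbiased sketch of the full least-squares problem.
    """
    Q, _ = np.linalg.qr(A, mode="reduced")   # leverage scores via thin QR
    probs = np.sum(Q * Q, axis=1)
    probs = probs / probs.sum()

    idx = rng.choice(len(A), size=num_samples, replace=True, p=probs)
    scale = 1.0 / np.sqrt(num_samples * probs[idx])
    A_s = A[idx] * scale[:, None]
    b_s = b[idx] * scale
    x, *_ = np.linalg.lstsq(A_s, b_s, rcond=None)
    return x
```

High-leverage rows are rarely missed by this sampler, which directly addresses the failure mode described for Kernel Shap above.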

