{"title": "Towards Robust Interpretability with Self-Explaining Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7775, "page_last": 7784, "abstract": "Most recent work on interpretability of complex machine learning models has focused on estimating a-posteriori explanations for previously trained models around specific predictions. Self-explaining models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general -- explicitness, faithfulness, and stability -- and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.", "full_text": "Towards Robust Interpretability\n\nwith Self-Explaining Neural Networks\n\nDavid Alvarez-Melis\n\nCSAIL, MIT\n\ndalvmel@mit.edu\n\nTommi S. Jaakkola\n\nCSAIL, MIT\n\ntommi@csail.mit.edu\n\nAbstract\n\nMost recent work on interpretability of complex machine learning models has\nfocused on estimating a posteriori explanations for previously trained models\naround speci\ufb01c predictions. Self-explaining models where interpretability plays a\nkey role already during learning have received much less attention. We propose\nthree desiderata for explanations in general \u2013 explicitness, faithfulness, and stability\n\u2013 and show that existing methods do not satisfy them. In response, we design\nself-explaining models in stages, progressively generalizing linear classi\ufb01ers to\ncomplex yet architecturally explicit models. 
Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.

1 Introduction

Interpretability, or lack thereof, can limit the adoption of machine learning methods in decision-critical (e.g., medical or legal) domains. Ensuring interpretability would also contribute to other pertinent criteria such as fairness, privacy, or causality [5]. Our focus in this paper is on complex self-explaining models where interpretability is built in architecturally and enforced through regularization. Such models should satisfy three desiderata for interpretability: explicitness, faithfulness, and stability, where, for example, stability ensures that similar inputs yield similar explanations. Most post-hoc interpretability frameworks are not stable in this sense, as shown in detail in Section 5.4.

High modeling capacity is often necessary for competitive performance. For this reason, recent work on interpretability has focused on producing a posteriori explanations for performance-driven deep learning approaches. The interpretations are derived locally, around each example, on the basis of limited access to the inner workings of the model such as gradients or reverse propagation [4, 18], or through oracle queries to estimate simpler models that capture the local input-output behavior [16, 2, 14]. Known challenges include the definition of locality (e.g., for structured data [2]), identifiability [12] and computational cost (with some of these methods requiring a full-fledged optimization subroutine [24]). 
However, point-wise interpretations generally do not compare explanations obtained for nearby inputs, leading to unstable and often contradictory explanations [1].

A posteriori explanations may be the only option for already-trained models. Otherwise, we would ideally design the models from the start to provide human-interpretable explanations of their predictions. In this work, we build highly complex interpretable models bottom up, maintaining the desirable characteristics of simple linear models in terms of features and coefficients, without limiting performance. For example, to ensure stability (and, therefore, interpretability), coefficients in our model vary slowly around each input, keeping it effectively a linear model, albeit locally. In other words, our model operates as a simple interpretable model locally (allowing for point-wise interpretation) but not globally (which would entail sacrificing capacity). We achieve this with a regularization scheme that ensures our model not only looks like a linear model, but (locally) behaves like one.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our main contributions in this work are:
• A rich class of interpretable models where the explanations are intrinsic to the model
• Three desiderata for explanations together with an optimization procedure that enforces them
• Quantitative metrics to empirically evaluate whether models adhere to these three principles, and showing the advantage of the proposed self-explaining models under these metrics

2 Interpretability: linear and beyond

To motivate our approach, we start with a simple linear regression model and successively generalize it towards the class of self-explaining models. For input features x_1, . . . , x_n ∈ R and associated parameters θ_0, . . . , θ_n ∈ R, the linear regression model is given by f(x) = ∑_{i=1}^n θ_i x_i + θ_0. 
This model is arguably interpretable for three specific reasons: i) input features (the x_i's) are clearly anchored in the available observations, e.g., arising from empirical measurements; ii) each parameter θ_i provides a quantitative positive/negative contribution of the corresponding feature x_i to the predicted value; and iii) the aggregation of feature-specific terms θ_i x_i is additive, without conflating the feature-by-feature interpretation of impact. We progressively generalize the model in the following subsections and discuss how this mechanism of interpretation is preserved.

2.1 Generalized coefficients

We can substantially enrich the linear model while keeping its overall structure if we permit the coefficients themselves to depend on the input x. Specifically, we define (offset function omitted) f(x) = θ(x)^T x, and choose θ from a complex model class Θ, realized for example via deep neural networks. Without further constraints, the model is nearly as powerful as, and surely no more interpretable than, any deep neural network. However, in order to maintain interpretability, at least locally, we must ensure that for close inputs x and x_0 in R^n, θ(x) and θ(x_0) do not differ significantly. More precisely, we can, for example, regularize the model in such a manner that ∇_x f(x) ≈ θ(x_0) for all x in a neighborhood of x_0. In other words, the model acts locally, around each x_0, as a linear model with a vector of stable coefficients θ(x_0). The individual values θ(x_0)_i act, and are interpretable, as coefficients of a linear model with respect to the final prediction, but adapt dynamically to the input, albeit varying more slowly than x. 
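To make the locally-linear reading concrete, here is a minimal numeric sketch; the tiny tanh network producing θ(x) and all weights are illustrative stand-ins, not the architectures used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network producing input-dependent coefficients theta(x).
W1 = rng.normal(scale=0.1, size=(8, 3))
W2 = rng.normal(scale=0.1, size=(3, 8))

def theta(x):
    """Input-dependent coefficient vector theta(x) in R^3."""
    return W2 @ np.tanh(W1 @ x)

def f(x):
    """Generalized linear model f(x) = theta(x)^T x."""
    return theta(x) @ x

x0 = np.array([1.0, -0.5, 2.0])
x1 = x0 + 1e-3   # a nearby input

# Locally, f behaves like a linear model with (approximately) fixed
# coefficients: f(x1) is close to f(x0) + theta(x0)^T (x1 - x0)
# whenever theta(.) varies slowly around x0.
lin_approx = f(x0) + theta(x0) @ (x1 - x0)
print(abs(f(x1) - lin_approx))  # small when theta(.) is smooth around x0
```

The gap between `f(x1)` and the linear approximation is exactly the effect of θ drifting between x0 and x1, which is what the regularizer in Section 3 suppresses.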
We will discuss specific regularizers that maintain this interpretation in Section 3.

2.2 Beyond raw features – feature basis

Typical interpretable models tend to consider each variable (one feature or one pixel) as the fundamental unit of which explanations consist. However, pixels are rarely the basic units used in human image understanding; instead, we rely on strokes and other higher-order features. We refer to these more general features as interpretable basis concepts and use them in place of raw inputs in our models. Formally, we consider functions h(x) : X → Z ⊆ R^k, where Z is some space of interpretable atoms. Naturally, k should be small so as to keep the explanations easily digestible. Alternatives for h(·) include: (i) subset aggregates of the input (e.g., h(x) = Ax for a boolean mask matrix A); (ii) predefined, pre-grounded feature extractors designed with expert knowledge (e.g., filters for image processing); (iii) prototype-based concepts, e.g., h(x)_i = ‖x − ξ_i‖ for some ξ_i ∈ X [12]; or (iv) learnt representations with specific constraints to ensure grounding [19]. Naturally, we can let h(x) = x to recover raw-input explanations if desired. The generalized model is now:

f(x) = θ(x)^T h(x) = ∑_{i=1}^k θ(x)_i h(x)_i    (1)

Since each h(x)_i remains a scalar, it can still be interpreted as the degree to which a particular feature is present. In turn, with constraints similar to those discussed above, θ(x)_i remains interpretable as a local coefficient. Note that the notion of locality must now take into account how the concepts, rather than the inputs, vary, since the model is interpreted as being linear in the concepts rather than in x.

2.3 Further generalization

The final generalization we propose considers how the elements θ(x)_i h(x)_i are aggregated. 
We can achieve a more flexible class of functions by replacing the sum in (1) with a more general aggregation function g(z_1, . . . , z_k), where z_i := θ(x)_i h(x)_i. Naturally, in order for this function to preserve the desired interpretation of θ(x) in relation to h(x), it should: (i) be permutation invariant, so as to eliminate higher-order uninterpretable effects caused by the relative position of the arguments; (ii) isolate the effect of the individual h(x)_i's in the output (e.g., avoiding multiplicative interactions between them); and (iii) preserve the sign and relative magnitude of the impact of the relevance values θ(x)_i. We formalize these intuitive desiderata in the next section.

Note that we can naturally extend the framework presented in this section to multivariate functions with range in Y ⊆ R^m by considering θ_i : X → R^m, so that θ_i(x) ∈ R^m is a vector corresponding to the relevance of concept i with respect to each of the m output dimensions. For classification, however, we are mainly interested in the explanation for the predicted class, i.e., θ_ŷ(x) for ŷ = argmax_y p(y|x).

3 Self-explaining models

We now formalize the class of models obtained through subsequent generalization of the simple linear predictor in the previous section. We begin by discussing the properties we wish to impose on θ in order for it to act as coefficients of a linear model on the basis concepts h(x). The intuitive notion of robustness discussed in Section 2.2 suggests using a condition bounding ‖θ(x) − θ(y)‖ by L‖h(x) − h(y)‖ for some constant L. Note that this resembles, but is not exactly equivalent to, Lipschitz continuity, since it bounds θ's variation with respect to a different, and indirect, measure of change, provided by the geometry induced implicitly by h on X. Specifically,

Definition 3.1. We say that a function f : X ⊆ R^n → R^m is difference-bounded by h : X ⊆ R^n → R^k if there exists L ∈ R such that ‖f(x) − f(y)‖ ≤ L‖h(x) − h(y)‖ for every x, y ∈ X.

Imposing such a global condition might be undesirable in practice. The data arising in applications often lies on low-dimensional manifolds of irregular shape, so a uniform bound might be too restrictive. Furthermore, we specifically want θ to be consistent for neighboring inputs. Thus, we seek instead a local notion of stability. Analogous to the local Lipschitz condition, we propose a pointwise, neighborhood-based version of Definition 3.1:

Definition 3.2. f : X ⊆ R^n → R^m is locally difference-bounded by h : X ⊆ R^n → R^k if for every x_0 there exist δ > 0 and L ∈ R such that ‖x − x_0‖ < δ implies ‖f(x) − f(x_0)‖ ≤ L‖h(x) − h(x_0)‖.

Note that, in contrast to Definition 3.1, this second notion of stability allows L (and δ) to depend on x_0; that is, the "Lipschitz" constant can vary throughout the space. With this, we are ready to define the class of functions which form the basis of our approach.

Definition 3.3. Let x ∈ X ⊆ R^n and Y ⊆ R^m be the input and output spaces. We say that f : X → Y is a self-explaining prediction model if it has the form

f(x) = g(θ(x)_1 h(x)_1, . . . , θ(x)_k h(x)_k)    (2)

where:
P1) g is monotone and completely additively separable
P2) For every z_i := θ(x)_i h(x)_i, g satisfies ∂g/∂z_i ≥ 0
P3) θ is locally difference-bounded by h
P4) h(x) is an interpretable representation of x
P5) k is small.

In that case, for a given input x, we define the explanation of f(x) to be the set E_f(x) := {(h(x)_i, θ(x)_i)}_{i=1}^k of basis concepts and their influence scores.

Besides the linear predictors that provided a starting point in Section 2, well-known families such as generalized linear models and nearest-neighbor classifiers are contained in this class of functions. However, the true power of the models described in Definition 3.3 comes when θ(·) (and potentially h(·)) are realized by architectures with large modeling capacity, such as deep neural networks. When θ(·) is realized with a neural network, we refer to f as a self-explaining neural network (SENN). If g depends on its arguments in a continuous way, f can be trained end-to-end with back-propagation. Since our aim is to maintain model richness even in the case where the concepts are chosen to be raw inputs (i.e., h is the identity), we rely predominantly on θ for modeling capacity, realizing it with larger, higher-capacity architectures.

It remains to discuss how the properties (P1)-(P5) in Definition 3.3 are to be enforced. The first two depend entirely on the choice of aggregating function g. Besides trivial addition, other options include affine functions g(z_1, . . . , z_k) = ∑_i A_i z_i where the A_i are constrained to be positive. On the other hand, the last two conditions in Definition 3.3 are application-dependent: what and how many basis concepts are adequate should be informed by the problem and goal at hand.

The only condition in Definition 3.3 that warrants further discussion is (P3): the stability of θ with respect to h. 
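A minimal sketch of the form (2), taking g to be plain summation (so P1 and P2 hold trivially); the concept map h and relevance map θ below are toy stand-ins, not learned components:

```python
import numpy as np

# Toy self-explaining model in the sense of Definition 3.3 (illustrative only).
def h(x):                      # k = 2 interpretable basis concepts
    return np.array([x[0] + x[1], x[2] ** 2])

def theta(x):                  # input-dependent relevance scores
    return np.tanh(np.array([x[0], x[1] - x[2]]))

def f(x):
    z = theta(x) * h(x)        # z_i = theta(x)_i h(x)_i
    return z.sum()             # g: additively separable and monotone

x = np.array([0.5, 1.0, -1.0])
# The explanation E_f(x) is the set of (concept value, relevance) pairs:
explanation = list(zip(h(x), theta(x)))
print(f(x), explanation)
```

Because g is a sum, each pair in `explanation` contributes to the output exactly as a linear-model term does, which is the interpretation that (P3) then protects locally.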
For this, let us consider what f would look like if the θ_i's were indeed (constant) parameters. Looking at f as a function of h(x), i.e., f(x) = g(h(x)), let z = h(x). Using the chain rule we get ∇_x f = ∇_z f · J^h_x, where J^h_x denotes the Jacobian of h with respect to x. At a given point x_0, we want θ(x_0) to behave as the derivative of f with respect to the concept vector h(x) around x_0, i.e., we seek θ(x_0) ≈ ∇_z f. Since this is hard to enforce directly, we can instead plug this ansatz into ∇_x f = ∇_z f · J^h_x to obtain a proxy condition:

L_θ(f(x)) := ‖∇_x f(x) − θ(x)^T J^h_x(x)‖ ≈ 0    (3)

All three terms in L_θ(f) can be computed, and when using differentiable architectures h(·) and θ(·), we obtain gradients with respect to (3) through automatic differentiation and thus use it as a regularization term in the optimization objective. With this, we obtain a gradient-regularized objective of the form L_y(f(x), y) + λ L_θ(f(x)), where the first term is a classification loss and λ a parameter that trades off performance against stability, and therefore interpretability, of θ(x).

4 Learning interpretable basis concepts

Raw input features are the natural basis for interpretability when the input is low-dimensional and individual features are meaningful. For high-dimensional inputs, raw features (such as individual pixels in images) tend to be hard to analyze coherently and often lead to unstable explanations that are sensitive to noise or imperceptible artifacts in the data [1], and not robust to simple transformations such as constant shifts [9]. The results in the next section confirm this phenomenon, where we observe that the lack of robustness of methods that rely on raw inputs is amplified for high-dimensional inputs. To avoid some of these shortcomings, we can instead operate on higher level features. 
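The penalty (3) can be sketched numerically, with central finite differences standing in for the automatic differentiation used in practice; the functions h, θ, and f below are toy illustrations:

```python
import numpy as np

# Toy components (illustrative): f(x) = theta(x)^T h(x).
def h(x):      return np.array([x[0] + x[1], x[1] * x[2]])
def theta(x):  return np.array([0.5, np.tanh(x[0])])
def f(x):      return theta(x) @ h(x)

def grad(fn, x, eps=1e-6):
    """Central-difference gradient of a scalar function fn at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return g

def jacobian(fn, x, eps=1e-6):
    """Numerical Jacobian J^h_x (rows: outputs, columns: inputs)."""
    return np.stack([grad(lambda v, j=j: fn(v)[j], x, eps)
                     for j in range(len(fn(x)))])

def robustness_loss(x):
    # || grad_x f(x) - theta(x)^T J^h_x(x) ||
    return np.linalg.norm(grad(f, x) - theta(x) @ jacobian(h, x))

x = np.array([0.3, -1.2, 0.7])
print(robustness_loss(x))  # zero only where theta is locally constant
```

Here the loss is nonzero exactly because θ itself depends on x; driving it towards zero forces θ(x) to act like the derivative of f with respect to the concepts, as the ansatz requires.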
In the context of images, we might be interested in the effect of textures or shapes, rather than single pixels, on predictions. For example, in medical image processing, higher-level visual aspects such as tissue ruggedness, irregularity or elongation are strong predictors of cancerous tumors, and are among the first aspects that doctors look for when diagnosing, so they are natural "units" of explanation.

Ideally, these basis concepts would be informed by expert knowledge, such as the doctor-provided features mentioned above. However, in cases where such prior knowledge is not available, the basis concepts could be learnt instead. Interpretable concept learning is a challenging task in its own right [8] and, as other aspects of interpretability, remains ill-defined. We posit that a reasonable minimal set of desiderata for interpretable concepts is:

i) Fidelity: the representation of x in terms of concepts should preserve relevant information,
ii) Diversity: inputs should be representable with few non-overlapping concepts, and
iii) Grounding: concepts should have an immediate human-understandable interpretation.

Here, we enforce these conditions upon the concepts learnt by SENN by: (i) training h as an autoencoder, (ii) enforcing diversity through sparsity, and (iii) providing interpretation of the concepts by prototyping (e.g., by providing a small set of training examples that maximally activate each concept, as described below). Learning of h is done end-to-end in conjunction with the rest of the model. If we denote by h_dec(·) : R^k → R^n the decoder associated with h, and x̂ := h_dec(h(x)) the reconstruction of x, we use an additional penalty L_h(x, x̂) on the objective, yielding the loss:

L_y(f(x), y) + λ L_θ(f(x)) + ξ L_h(x, x̂)    (4)

Achieving (iii), i.e., the grounding of h(x), is more subjective. A simple approach consists of representing each concept by the elements in a sample of data that maximize its value; that is, we can represent concept i through the set X^i = argmax_{X̂ ⊆ X, |X̂| = l} ∑_{x ∈ X̂} h(x)_i, where l is small. Similarly, one could construct (by optimizing h) synthetic inputs that maximally activate each concept (and do not activate others), i.e., argmax_{x ∈ X} h_i(x) − ∑_{j ≠ i} h_j(x). Alternatively, when available, one might want to represent concepts via their learnt weights, e.g., by looking at the filters associated with each concept in a CNN-based h(·). In our experiments, we use the first of these approaches (i.e., maximally activated prototypes), leaving exploration of the other two for future work.

Figure 1: A SENN consists of three components: a concept encoder (green) that transforms the input into a small set of interpretable basis features; an input-dependent parametrizer (orange) that generates relevance scores; and an aggregation function that combines these to produce a prediction. The robustness loss on the parametrizer encourages the full model to behave locally as a linear function on h(x) with parameters θ(x), yielding immediate interpretation of both concepts and relevances.

5 Experiments

The notion of interpretability is notorious for eluding easy quantification [5]. Here, however, the motivation in Section 2 produced a set of desiderata according to which we can validate our models. Throughout this section, we base the evaluation on four main criteria. 
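The prototype-based grounding described above can be sketched directly: for each concept i, keep the l sample points with the largest activation h(x)_i. The concept map and random sample below are illustrative:

```python
import numpy as np

# Toy concept map (illustrative): two concepts over 4-dimensional inputs.
def h(x):
    return np.array([x.sum(), (x ** 2).sum()])

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                # a sample of 100 inputs
H = np.stack([h(x) for x in X])              # concept activations, shape (100, 2)

l = 3
# For each concept, the indices of the l most-activating examples (descending).
prototypes = {i: np.argsort(H[:, i])[-l:][::-1] for i in range(H.shape[1])}
print(prototypes)
```

Displaying the inputs `X[prototypes[i]]` as a small grid is then the "defining prototypes" view used for the concepts in Figure 2.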
First and foremost, for all datasets we investigate whether our models perform on par with their non-modular, non-interpretable counterparts. After establishing that this is indeed the case, we focus our evaluation on the interpretability of our approach, in terms of three criteria:

(i) Explicitness/Intelligibility: Are the explanations immediate and understandable?
(ii) Faithfulness: Are relevance scores indicative of "true" importance?
(iii) Stability: How consistent are the explanations for similar/neighboring examples?

Below, we address these criteria one at a time, proposing qualitative assessment of (i) and quantitative metrics for evaluating (ii) and (iii).

5.1 Datasets and Methods

Datasets. We carry out quantitative evaluation on three classification settings: (i) MNIST digit recognition, (ii) benchmark UCI datasets [13] and (iii) Propublica's COMPAS Recidivism Risk Score datasets (github.com/propublica/compas-analysis/). In addition, we provide some qualitative results on CIFAR10 [10] in the supplement (§A.5). The COMPAS data consists of demographic features labeled with criminal recidivism ("relapse") risk scores produced by a private company's proprietary algorithm, currently used in the criminal justice system to aid in bail-granting decisions. Propublica's study showing racially biased scores sparked a flurry of interest in the COMPAS algorithm both in the media and in the fairness in machine learning community [25, 7]. Details on data pre-processing for all datasets are provided in the supplement.

Comparison methods. We compare our approach against various interpretability frameworks: three popular "black-box" methods, LIME [16], kernel Shapley values (SHAP) [14] and perturbation-based occlusion sensitivity (OCCLUSION) [26]; and various gradient- and saliency-based methods: gradient×input (GRAD*INPUT) as proposed by Shrikumar et al. [20], saliency maps (SALIENCY) [21], Integrated Gradients (INT.GRAD) [23] and ε-Layerwise Relevance Propagation (E-LRP) [4].

Figure 2: A comparison of traditional input-based explanations (positive values depicted in red) and SENN's concept-based ones for the predictions of an image classification model on MNIST. The explanation for SENN includes a characterization of concepts in terms of defining prototypes.

5.2 Explicitness/Intelligibility: How understandable are SENN's explanations?

When taking h(x) to be the identity, the explanations provided by our method take the same surface-level form (i.e., heat maps on inputs) as those of common saliency and gradient-based methods, but differ substantially when concepts are used as the unit of explanation (i.e., h is learnt). In Figure 2 we contrast these approaches in the context of digit classification. To highlight the difference, we use only a handful of concepts, forcing the model to encode digits into meta-types sharing higher-level information. Naturally, it is necessary to describe each concept to understand what it encodes, as we do here through a grid of the most representative prototypes (as discussed in §4), shown in Fig. 2, right. While pixel-based methods provide more granular information, SENN's explanation is (by construction) more parsimonious. For both of these digits, Concept 3 had a strong positive influence towards the prediction. Indeed, that concept seems to be associated with diagonal strokes (predominantly occurring in 7's), which both of these inputs share. 
However, for the second prediction there is another relevant concept, C4, which is characterized largely by stylized 2's, a concept that in contrast has a negative influence on the top row's prediction.

Figure 3: Left: Aggregated correlation between feature relevance scores and true importance, as described in Section 5.3. Right: Faithfulness evaluation of SENN on MNIST with learnt concepts.

5.3 Faithfulness: Are "relevant" features truly relevant?

Assessing the correctness of estimated feature relevances requires a reference "true" influence to compare against. Since this is rarely available, a common approach to measuring the faithfulness of relevance scores with respect to the model they are explaining relies on a proxy notion of importance: observing the effect of removing features on the model's prediction. For example, for a probabilistic classification model, we can obscure or remove features, measure the drop in probability of the predicted class, and compare against the interpreter's own prediction of relevance [17, 3]. Here, we further compute the correlations of these probability drops and the relevance scores on various points, and show the aggregate statistics in Figure 3 (left) for LIME, SHAP and SENN (without learnt concepts) on various UCI datasets. We note that this evaluation naturally extends to the case where the concepts are learnt (Fig. 3, right). The additive structure of our model allows for removal of features h(x)_i, regardless of their form (inputs or concepts), simply by setting their coefficients θ_i to zero. 
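For an additive model, this removal-based faithfulness test can be sketched in a few lines; the relevances and activations below are toy numbers (in practice the drops would come from the trained classifier's predicted probabilities):

```python
import numpy as np

# Toy locally-linear model at some fixed input x (illustrative values).
theta = np.array([0.8, -0.3, 0.1, 0.5])      # relevance scores theta(x)
hx    = np.array([1.0,  2.0, 0.5, 1.0])      # concept activations h(x)

full = theta @ hx
# Remove each feature by zeroing its coefficient and record the drop in score.
drops = np.array([full - (np.delete(theta, i) @ np.delete(hx, i))
                  for i in range(len(theta))])

# For an additive model the drop for feature i is exactly theta_i * h(x)_i,
# so the relevance-weighted activations correlate perfectly with the drops.
corr = np.corrcoef(drops, theta * hx)[0, 1]
print(corr)
```

For non-additive models (or post-hoc explainers) this correlation is generally below 1, which is what the aggregate statistics in Figure 3 measure.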
Indeed, while feature removal is not always meaningful for other prediction models (e.g., one must replace pixels with black or averaged values to simulate removal in a CNN), the definition of our model allows for targeted removal of features, rendering an evaluation based on it more reliable.

Figure 4: Explaining a CNN's prediction on a true MNIST digit (top row) and a perturbed version with added Gaussian noise. Although the model's prediction is mostly unaffected by this perturbation (change in prediction probability ≤ 10^-4), the explanations for post-hoc methods vary considerably.

Figure 5: (A) SENN on COMPAS and (B) SENN on BREAST-CANCER: effect of regularization on SENN's performance. (C) All methods on UCI/MNIST: robustness comparison.

5.4 Stability: How coherent are explanations for similar inputs?

As argued throughout this work, a crucial property that interpretability methods should satisfy to generate meaningful explanations is that of robustness with respect to local perturbations of the input. Figure 4 shows that this is not the case for popular interpretability methods; even adding minimal white noise to the input introduces visible changes in the explanations. But to formally quantify this phenomenon, we appeal again to Definition 3.2, as we seek a worst-case (adversarial) notion of robustness. 
Thus, we can quantify the stability of an explanation generation model f_expl(x) by estimating, for a given input x_i and neighborhood size ε:

L̂(x_i) = argmax_{x_j ∈ B_ε(x_i)} ‖f_expl(x_i) − f_expl(x_j)‖_2 / ‖h(x_i) − h(x_j)‖_2    (5)

where for SENN we have f_expl(x) := θ(x), and for raw-input methods we replace h(x) with x, turning (5) into an estimate of the Lipschitz constant (in the usual sense) of f_expl. We can directly estimate this quantity for SENN since the explanation generation is end-to-end differentiable with respect to concepts, and thus we can rely on direct automatic differentiation and back-propagation to optimize for the maximizing argument x_j, as often done for computing adversarial examples for neural networks [6]. Computing (5) for post-hoc explanation frameworks is, however, much more challenging, since they are not end-to-end differentiable. Thus, we need to rely on black-box optimization instead of gradient ascent. Furthermore, evaluation of f_expl for methods like LIME and SHAP is expensive (as it involves model estimation for each query), so we need to do so with a restricted evaluation budget. In our experiments, we rely on Bayesian optimization [22].

The continuous notion of local stability (5) might not be suitable for discrete inputs or settings where adversarial perturbations are overly restrictive (e.g., when the true data manifold has regions of flatness in some dimensions). In such cases, we can instead define a (weaker) sample-based notion of stability. For any x in a finite sample X = {x_i}_{i=1}^n, let its ε-neighborhood within X be N_ε(x) = {x′ ∈ X : ‖x − x′‖ ≤ ε}. Then, we consider an alternative version of (5) with N_ε(x) in lieu of B_ε(x_i). 
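The sample-based variant amounts to maximizing the ratio in (5) over the ε-neighborhood of a point within a finite sample. A sketch, taking h to be the identity (as for raw-input methods) and a toy explainer in place of a real one:

```python
import numpy as np

# Toy explanation generator (illustrative stand-in for theta(x) or a saliency map).
def f_expl(x):
    return np.array([np.sin(3 * x[0]), x[1] ** 2])

# A deterministic grid standing in for the data sample X.
g = np.linspace(-1.0, 1.0, 15)
X = np.array([[a, b] for a in g for b in g])
eps = 0.3

def discrete_stability(x0):
    """Max ratio ||f_expl(x0)-f_expl(x)|| / ||x0-x|| over the eps-neighborhood in X."""
    ratios = [np.linalg.norm(f_expl(x0) - f_expl(x)) / np.linalg.norm(x0 - x)
              for x in X if 0.0 < np.linalg.norm(x0 - x) <= eps]
    return max(ratios)

print(discrete_stability(np.array([0.0, 0.0])))  # larger means less stable
```

Because it only loops over existing sample points, no optimization (gradient-based or black-box) is needed, at the cost of being a weaker, best-effort lower bound on the adversarial quantity (5).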
Unlike the former, its computation is trivial since it involves a finite sample.

[Figure 4 annotations: estimated L̂ per method - Saliency 1.45, Grad*Input 1.36, Int.Grad. 0.91, e-LRP 1.35, Occlusion 1.66, LIME 6.23, SENN 0.01.]

We first use this evaluation metric to validate the usefulness of the proposed gradient regularization approach for enforcing explanation robustness. The results on the COMPAS and BREAST-CANCER datasets (Fig. 5 A/B) show that there is a natural tradeoff between stability and prediction accuracy through the choice of regularization parameter λ. Somewhat surprisingly, we often observe a boost in performance brought by the gradient penalty, likely caused by the additional regularization it imposes on the prediction model. We observe a similar pattern on MNIST (Figure 8, in the Appendix). Next, we compare all methods in terms of robustness on various datasets (Fig. 5C), where we observe SENN to consistently and substantially outperform all other methods on this metric.

It is interesting to visualize the inputs and corresponding explanations that maximize criterion (5) – or its discrete counterpart, when appropriate – for different methods and datasets, since these succinctly exhibit the issue of lack of robustness that our work seeks to address. We provide many such "adversarial" examples in Appendix A.7. These examples show the drastic effect that minimal perturbations can have on most methods, particularly LIME and SHAP. 
The pattern is clear: most\ncurrent interpretability approaches are not robust, even when the underlying model they are trying to\nexplain is. The class of models proposed here offers a promising avenue to remedy this shortcoming.\n\n6 Related Work\nInterpretability methods for neural networks. Beyond the gradient and perturbation-based meth-\nods mentioned here [21, 26, 4, 20, 23], various other methods of similar spirit exist [15]. These\nmethods have in common that they do not modify existing architectures, instead relying on a-posteriori\ncomputations to reverse-engineer importance values or sensitivities of inputs. Our approach differs\nboth in what it considers the units of explanation\u2014general concepts, not necessarily raw inputs\u2014and\nhow it uses them, intrinsically relying on the relevance scores it produces to make predictions, obviat-\ning the need for additional computation. More related to our approach is the work of Lei et al. [11]\nand Al-Shedivat et al. [19]. The former proposes a neural network architecture for text classi\ufb01cation\nwhich \u201cjusti\ufb01es\u201d its predictions by selecting relevant tokens in the input text. But this interpretable\nrepresentation is then operated on by a complex neural network, so the method is transparent as\nto what aspect of the input it uses for prediction, but not how it uses it. Contextual Explanation\nNetworks [19] are also inspired by the goal of designing a class of models that learns to predict and\nexplain jointly, but differ from our approach in their formulation (through deep graphical models) and\nrealization of the model (through variational autoencoders). Furthermore, our approach departs from\nthat work in that we explicitly enforce robustness with respect to the units of explanation and we\nformulate concepts as part of the explanation, thus requiring them to be grounded and interpretable.\nExplanations through concepts and prototypes. Li et al. 
[12] propose an interpretable neural network architecture whose predictions are based on the similarity of the input to a small set of prototypes, which are learnt during training. Our approach can be understood as generalizing this approach beyond similarities to prototypes into more general interpretable concepts, while differing in how these higher-level representations of the inputs are used. More similar in spirit to our approach of explaining by means of learnable interpretable concepts is the work of Kim et al. [8]. They propose a technique for learning concept activation vectors representing human-friendly concepts of interest, by relying on a set of human-annotated examples characterizing these concepts. By computing directional derivatives along these vectors, they gauge the sensitivity of predictors with respect to semantic changes in the direction of the concept. Their approach differs from ours in that it explains a (fixed) external classifier and uses a predefined set of concepts, while we learn both of these intrinsically.

7 Discussion and future work
Interpretability and performance currently stand in apparent conflict in machine learning. Here, we make progress towards showing this to be a false dichotomy by drawing inspiration from classic notions of interpretability to inform the design of modern complex architectures, and by explicitly enforcing basic desiderata for interpretability—explicitness, faithfulness and stability—during training of our models. We demonstrate how the fusion of these ideas leads to a class of rich, complex models that are able to produce robust explanations, a key property that we show is missing from various popular interpretability frameworks. There are various possible extensions beyond the model choices discussed here, particularly in terms of interpretable basis concepts.
As for applications, the natural next step would be to evaluate interpretable models in more complex domains, such as larger image datasets, speech recognition or natural language processing tasks.

Acknowledgments
The authors would like to thank the anonymous reviewers and Been Kim for helpful comments. The work was partially supported by an MIT-IBM grant on deep rationalization and by Graduate Fellowships from Hewlett Packard and CONACYT.

References
[1] D. Alvarez-Melis and T. S. Jaakkola. “On the Robustness of Interpretability Methods”. In: Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning. 2018. arXiv: 1806.08049.
[2] D. Alvarez-Melis and T. S. Jaakkola. “A causal framework for explaining the predictions of black-box sequence-to-sequence models”. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). 2017, pp. 412–421.
[3] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. “What is relevant in a text document?: An interpretable machine learning approach”. In: PLoS ONE 12.8 (2017), pp. 1–23.
[4] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation”. In: PLoS ONE 10.7 (2015).
[5] F. Doshi-Velez and B. Kim. “Towards a Rigorous Science of Interpretable Machine Learning”. In: arXiv e-prints (2017), pp. 1–12. arXiv: 1702.08608.
[6] I. J. Goodfellow, J. Shlens, and C. Szegedy. “Explaining and Harnessing Adversarial Examples”. In: International Conference on Learning Representations. 2015.
[7] N. Grgic-Hlaca, M. B. Zafar, K. P. Gummadi, and A. Weller. “Beyond Distributive Fairness in Algorithmic Decision Making: Feature Selection for Procedurally Fair Learning”. In: AAAI Conference on Artificial Intelligence. 2018.
[8] B.
Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. “Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)”. In: International Conference on Machine Learning (ICML). 2018.
[9] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. Schütt, S. Dähne, D. Erhan, and B. Kim. “The (Un)reliability of saliency methods”. In: NIPS Workshop on Explaining and Visualizing Deep Learning (2017).
[10] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech. rep. Citeseer, 2009.
[11] T. Lei, R. Barzilay, and T. Jaakkola. “Rationalizing Neural Predictions”. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). 2016, pp. 107–117. arXiv: 1606.04155.
[12] O. Li, H. Liu, C. Chen, and C. Rudin. “Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions”. In: AAAI Conference on Artificial Intelligence. 2018. arXiv: 1710.04806.
[13] M. Lichman and K. Bache. UCI Machine Learning Repository. 2013.
[14] S. Lundberg and S.-I. Lee. “A unified approach to interpreting model predictions”. In: Advances in Neural Information Processing Systems 30. 2017, pp. 4768–4777. arXiv: 1705.07874.
[15] G. Montavon, W. Samek, and K.-R. Müller. “Methods for interpreting and understanding deep neural networks”. In: Digital Signal Processing (2017).
[16] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why Should I Trust You?: Explaining the Predictions of Any Classifier”. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). New York, NY, USA: ACM, 2016, pp. 1135–1144. arXiv: 1602.04938.
[17] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. “Evaluating the visualization of what a deep neural network has learned”.
In: IEEE Transactions on Neural Networks and Learning Systems 28.11 (2017), pp. 2660–2673. arXiv: 1509.06321.
[18] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. “Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization”. In: ICCV. 2017. arXiv: 1610.02391.
[19] M. Al-Shedivat, A. Dubey, and E. P. Xing. “Contextual Explanation Networks”. In: arXiv preprint arXiv:1705.10301 (2017).
[20] A. Shrikumar, P. Greenside, and A. Kundaje. “Learning Important Features Through Propagating Activation Differences”. In: International Conference on Machine Learning (ICML). Ed. by D. Precup and Y. W. Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, June 2017, pp. 3145–3153. arXiv: 1704.02685.
[21] K. Simonyan, A. Vedaldi, and A. Zisserman. “Deep inside convolutional networks: Visualising image classification models and saliency maps”. In: International Conference on Learning Representations (Workshop Track). 2014.
[22] J. Snoek, H. Larochelle, and R. P. Adams. “Practical Bayesian Optimization of Machine Learning Algorithms”. In: Advances in Neural Information Processing Systems (NIPS). 2012.
[23] M. Sundararajan, A. Taly, and Q. Yan. “Axiomatic attribution for deep networks”. In: arXiv preprint arXiv:1703.01365 (2017).
[24] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. “Understanding neural networks through deep visualization”. In: arXiv preprint arXiv:1506.06579 (2015).
[25] M. B. Zafar, I. Valera, M. Rodriguez, K. Gummadi, and A. Weller. “From parity to preference-based notions of fairness in classification”. In: Advances in Neural Information Processing Systems (NIPS). 2017, pp. 228–238.
[26] M. D. Zeiler and R. Fergus. “Visualizing and understanding convolutional networks”. In: European conference on computer vision.
Springer. 2014, pp. 818–833.