RESEARCH ARTICLE Open Access
Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions
Faizan Sahigara, Davide Ballabio, Roberto Todeschini and Viviana Consonni*
Abstract
Background: With the growing popularity of QSAR predictions for regulatory purposes, such predictive models are now required to be strictly validated, an essential feature of which is to have the model's Applicability Domain (AD) clearly defined. Although several different approaches have been proposed in recent years to address this goal, no optimal approach to define the model's AD has yet been recognized.
Results: This study proposes a novel descriptor-based AD method which accounts for the data distribution and exploits the k-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a three-stage procedure that addresses several key aspects relevant to judging the reliability of QSAR predictions. Inspired by the adaptive kernel method for probability density function estimation, the first stage defines a pattern of thresholds corresponding to the various training samples; these thresholds are later used to derive the decision rule. The criterion deciding whether a given test sample is retained within the AD is defined in the second stage. Finally, the last stage reflects upon the reliability of the derived results, taking model statistics and prediction error into account.
Conclusions: The proposed approach introduces a novel strategy that integrates the kNN principle to define the AD of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to the local density of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the smoothing parameter k; and d) versatility to implement various distance measures. The results derived on a case study provided a clear understanding of how the approach works and defines the model's AD for reliable predictions.
Keywords: QSAR, Applicability domain, kNN, Nearest neighbour, Model validation
Background
The popularity of QSARs has grown over time, complemented by the availability of more sophisticated and efficient model development techniques. This growth was further supported by the consideration of QSAR predictions for regulatory purposes. To deal with the risk assessment of chemicals for their safe use, a new European legislation, REACH (Registration, Evaluation, Authorization and restriction of Chemicals), was approved in recent years [1]. To reduce animal testing and replace it with cost-effective methods, this law encourages the use of QSARs as a possible alternative when sufficient experimental data are not available, provided that the model has been strictly validated for regulatory consideration [2].
There are several aspects that must be taken into account before considering a QSAR model reliable enough. In other words, the validity of a model has to be evaluated. The existing literature has often emphasized validating QSAR models to reflect their robustness and predictive ability. In 2004, the following five OECD principles were adopted for validating a QSAR model for regulatory consideration: a) a defined endpoint; b) an unambiguous algorithm; c) a defined domain of applicability; d) appropriate measures of goodness-of-fit, robustness and predictivity; and e) a mechanistic interpretation, if possible [3].
* Correspondence: viviana.consonni@unimib.it
Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.za della Scienza 1, Milano 20126, Italy
© 2013 Sahigara et al.; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sahigara et al. Journal of Cheminformatics 2013, 5:27
http://www.jcheminf.com/content/5/1/27
The applicability domain (AD) of a QSAR model defines the model's limitation in its structural domain and response space. In other words, this principle for model validation restricts the applicability of a model to reliably predicting those test samples that are structurally similar to the training samples used to build that model [4-6]. Several approaches have been proposed in past years to define the AD of QSAR models. These approaches mainly differ in the algorithm used to characterise the AD within the descriptor space, where the model can predict reliably [7,8]. For instance, some classical approaches suggest defining the domain of applicability by a) considering the range of descriptor values; b) enclosing the training space in a convex hull; c) calculating the distance of a query compound from a defined point within the model's descriptor space; and d) estimating the probability density function for the given data. All these approaches have their own advantages and limitations [2,7-10]. Over time, several approaches have been proposed that aimed to be more efficient or to overcome limitations of existing approaches.

This article proposes a new heuristic approach towards defining the AD of QSAR models. The basis of this novel strategy is inspired by the k-Nearest Neighbours (kNN) approach and adaptive kernel methods for probability density estimation (kernel density estimators, KDE) [11]. Due to its simplicity and easy implementation, kNN has been a preferred choice in several QSAR studies [6,12-18].

In the classical kNN approach for AD evaluation [6,18], the average distances of all the training samples from their k nearest neighbours are calculated and used to define a unique threshold (for example, the 95th percentile) to decide if a test sample is inside or outside the model's AD. Moreover, in the framework of probability density function estimation, the nearest neighbour method provides density estimates depending on the Euclidean distance to the k-th nearest data point [19]. Following the same concept, the proposed method tries to integrate the kNN principle with the salient features of adaptive kernel methods [11], which define local bandwidth factors corresponding to the training data points and use them to build the density estimate at a given point.

The novelty of the kNN-based AD approach proposed in this article lies in the overall strategy, which is executed in a three-stage procedure to encapsulate and reflect upon several significant aspects of model validation. Moreover, some features common to most AD approaches are dealt with differently in this approach. For instance, rather than defining a general threshold as in all the distance-based approaches, each training sample is associated with its individual threshold. To find an optimal smoothing parameter k, the approach performs a k-optimization procedure based on Monte Carlo validation. Additionally, the model's statistical parameters and other relevant aspects are considered simultaneously to reflect upon the reliability of the derived results.

To better illustrate the strategy behind this approach, it was implemented on a dataset from the literature. The dataset was chosen from the CAESAR project to predict the bioconcentration factor (BCF) [20,21]. The derived results were discussed in comparison with those derived from other literature AD approaches.
Methods
k-Nearest Neighbours principle from AD perspective
The kNN principle basically reflects the structural similarity of a test sample to the training samples used to build the model. In theory, the distance of a query sample is considered from its k closest data points in the chemical space. Lower distance values correspond to higher similarity, while increasing distances signify higher levels of structural mismatch. The k value plays a significant role in defining how restrictive the approach will be and thus can be referred to as the smoothing parameter.
A stepwise execution of the following three stages characterises the workflow of this approach:
1) defining thresholds for training samples
2) evaluating AD for new/test samples
3) optimizing the smoothing parameter k
To allow a better interpretation of the proposed approach, results on a two-dimensional simulated dataset will be considered throughout the major part of this discussion, wherever applicable. As shown in Figure 1, this dataset has a cluster of 48 training samples, while the remaining two training samples (49 and 50) are located at the extremities of the space with respect to these clustered samples.
Defining thresholds for training samples
Thresholds have a great influence in characterising the AD for reliable predictions; a test sample that exceeds the threshold condition is associated with an unreliable prediction.
Like the adaptive kernel methods, instead of defining a general unique threshold as seen with several classical AD approaches, the proposed approach allocates a set of thresholds corresponding to the various training samples. For a given value of k, the threshold allocation process can be summarised as follows:
a) First of all, the distances of each training sample from the remaining n − 1 samples are calculated and ranked in increasing order, n being the total number of training samples. This results in an n × (n − 1) neighbour table D; an entry $D_{ij}$ of the table corresponds to the distance of the i-th sample from its j-th nearest neighbour:

$$D_{i1} \le D_{i2} \le \dots \le D_{i,n-1}$$

b) The average distance of each i-th sample from its k nearest neighbours is calculated considering the first k entries in the i-th row of the neighbour table:

$$\bar{d}_i(k) = \frac{\sum_{j=1}^{k} D_{ij}}{k}, \quad \text{where } 1 \le k \le n-1 \text{ and } \bar{d}_i(k) \le \bar{d}_i(k+1) \qquad (1)$$

A vector $\bar{d}(k)$ of average distance values is then derived considering all the samples in the training set.

c) Next, a reference value (from now on referred to as Ref Val), $\tilde{d}(k)$, is determined as follows:

$$\tilde{d}(k) = Q_3\big(\bar{d}(k)\big) + 1.5\,\big[Q_3\big(\bar{d}(k)\big) - Q_1\big(\bar{d}(k)\big)\big] \qquad (2)$$

where $Q_1(\bar{d}(k))$ and $Q_3(\bar{d}(k))$ are the values corresponding to the 25th and 75th percentiles in the vector $\bar{d}(k)$, respectively [22].

d) Next, the ordered distances of each i-th training sample from all other n − 1 training samples are compared with the Ref Val. If the distance of the i-th sample from its j-th training neighbour (where 1 ≤ j ≤ n − 1) is less than or equal to the Ref Val, that distance value is retained; otherwise it is discarded. The number $K_i$ of neighbours satisfying this condition, with a minimum of zero and a maximum of n − 1, defines the density of the i-th sample neighbourhood:

$$K_i = \big|\{\, j : D_{ij} \le \tilde{d}(k),\; 1 \le j \le n-1 \,\}\big| \qquad (3)$$

e) Finally, each i-th training sample is associated with a threshold $t_i$, which defines the width of its neighbourhood as:

$$t_i = \frac{\sum_{j=1}^{K_i} D_{ij}}{K_i} \qquad (4)$$

If no distance value is retained for a given i-th training sample ($K_i = 0$), its threshold $t_i$ would theoretically be set to 0, but a pragmatic solution is to set it equal to the smallest threshold of the training set.
The plot in Figure 2 provides an overview of the thresholds for all 50 samples in the simulated dataset. As expected, most of the training samples within the cluster (for instance, samples 2, 33 and 39) were associated with higher $K_i$ values. On the other hand, the obvious potential outliers (samples 49 and 50) would have had thresholds equal to 0, since they could not satisfy the threshold criterion even for a single training neighbour (i.e. $K_i = 0$), and thus no distance values contributed to their threshold calculation. Instead, they were assigned the minimum threshold, equal to 0.42, i.e. the threshold of sample 43.
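The threshold allocation in steps a) to e) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `knn_ad_thresholds`, the use of NumPy, the Euclidean metric and NumPy's default (linear interpolation) percentile rule are all assumptions made for illustration. The fallback for samples with $K_i = 0$ follows the pragmatic solution described above.

```python
import numpy as np

def knn_ad_thresholds(X_train, k):
    """Sketch of steps a)-e): returns (t, K, ref_val).

    Hypothetical helper; Euclidean distance and NumPy's default
    percentile interpolation are assumptions, not taken from the paper.
    """
    X = np.asarray(X_train, dtype=float)
    n = X.shape[0]
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")

    # a) n x (n - 1) neighbour table: drop self-distances, sort each row
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    D = np.sort(dist[~np.eye(n, dtype=bool)].reshape(n, n - 1), axis=1)

    # b) average distance to the k nearest neighbours (Equation 1)
    d_bar = D[:, :k].mean(axis=1)

    # c) reference value Q3 + 1.5 * (Q3 - Q1) (Equation 2)
    q1, q3 = np.percentile(d_bar, [25, 75])
    ref_val = q3 + 1.5 * (q3 - q1)

    # d) number of neighbours within the reference value (Equation 3)
    keep = D <= ref_val
    K = keep.sum(axis=1)

    # e) threshold = mean of the retained distances (Equation 4)
    t = np.zeros(n)
    has_neighbours = K > 0
    t[has_neighbours] = (D * keep).sum(axis=1)[has_neighbours] / K[has_neighbours]

    # Samples with K_i = 0 receive the smallest threshold of the training set
    if has_neighbours.any():
        t[~has_neighbours] = t[has_neighbours].min()
    return t, K, ref_val

# Toy data (not the paper's simulated dataset): a 3 x 3 grid plus one outlier
grid = [(x, y) for x in range(3) for y in range(3)]
X_demo = np.array(grid + [(50, 50)], dtype=float)
t_demo, K_demo, ref_demo = knn_ad_thresholds(X_demo, k=2)
print(K_demo, t_demo, ref_demo)
```

On this toy grid the isolated point ends up with no retained neighbours and inherits the smallest threshold of the other samples, mirroring the treatment of samples 49 and 50 in the text.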
Figure 1 Scatter plot of the simulated dataset.
Figure 2 Simulated data set. Thresholds $t_i$ vs. number of training neighbours $K_i$ plot (k = 12).
Evaluating AD for new/test samples
Until this point, each training sample has been associated with its individual threshold. The next step is to characterise the AD, which usually relies upon a set of conditions that decide whether a given test sample can be associated with a reliable prediction.
The criterion used by this approach to place a given test sample within the domain of applicability can be summarised as follows.
Given a test sample, its distance from each of the n training samples is calculated and compared with the threshold associated with that training sample. If the distance is less than or equal to the threshold for at least one training sample, the test sample is considered inside the domain of applicability for that model. Otherwise, the prediction for that test sample is considered unreliable.
More formally, given the training set TR, for each test sample j, the AD decision rule is:

$$j \in \mathrm{AD} \iff \exists\, i \in \mathrm{TR} : D_{ij} \le t_i \qquad (5)$$

where $D_{ij}$ is the distance between the j-th test sample and the i-th training sample and $t_i$ is the individual threshold of the latter. In addition, each test/new sample is associated with the number $K_j$ of nearest training neighbours for which the previous condition holds true. This number can be taken as a measure of prediction reliability; indeed, high values of $K_j$ indicate that the new sample falls within a dense training region of the model's space, while low values of $K_j$ denote that the new sample still belongs to the model's space but is located in a sparse training region. $K_j$ equal to zero rejects the sample as being outside the model's AD, since no training neighbours are identified.

Figure 3 provides the contour plot for the simulated dataset, derived by projecting enough data points to fill the training space. Thresholds were calculated using 12 nearest neighbours and the Euclidean distance. The choice of k = 12 was based on the results of an internal k-optimization, discussed later in this article. The black line enclosing the cluster indicates that all the data points within the enclosed region were inside the AD. Thus, this region reflects how the AD was characterised for this two-dimensional dataset. The area of the enclosed region tends to expand or shrink depending on the number of nearest neighbours used for threshold calculation.

As explained earlier, extreme outliers in the training space are associated with a number $K_i$ of neighbours equal to zero and the lowest threshold in the training set. Consider sample 49 from the simulated dataset, an extreme outlier with its threshold equal to 0.42. If a test sample lies in the vicinity of this potential outlier within the descriptor space, it will be associated with an unreliable prediction, since its distance from sample 49 will likely exceed the small threshold. Now consider a case where the descriptor values of another test sample exactly overlap, or are very similar to, those of this potential outlier. In this situation, the distance of that sample from the outlier will be less than the threshold, and thus it will be considered within the domain of applicability. In theory, this is not wrong, because the potential outlier is still part of the training space. In practice, the approach retains all the training samples to characterise the AD while minimizing the role of potential outliers. That is why the first test sample was excluded from being reliably predicted while the second was not. However, for the latter, the number $K_j$ of nearest training neighbours will likely be equal to one, indicating that its prediction carries some degree of uncertainty. In conclusion, there is a relation between the defined AD and the impact of training samples in characterising it, based on their threshold values.
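The decision rule of Equation 5 and the neighbour count $K_j$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_ad_evaluate` is hypothetical, the Euclidean distance is assumed, and the training points and thresholds in the example are invented toy values (the 0.42 threshold merely echoes the outlier discussed above).

```python
import numpy as np

def knn_ad_evaluate(X_train, thresholds, X_test):
    """Sketch of Equation 5: returns (inside_ad, K_test).

    Hypothetical helper; assumes the Euclidean distance and that
    `thresholds` holds one threshold t_i per training sample.
    """
    X_tr = np.asarray(X_train, dtype=float)
    X_te = np.asarray(X_test, dtype=float)
    t = np.asarray(thresholds, dtype=float)

    # dist[j, i] = distance between test sample j and training sample i
    diff = X_te[:, None, :] - X_tr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # K_j = number of training samples whose threshold covers test sample j
    K_test = (dist <= t[None, :]).sum(axis=1)
    return K_test > 0, K_test

# Toy example: two clustered training points and one isolated point
X_train_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
t_train_demo = np.array([1.0, 1.0, 0.42])   # invented thresholds
X_test_demo = np.array([
    [0.5, 0.0],    # inside the cluster
    [10.0, 10.3],  # almost overlapping the isolated training point
    [10.0, 11.0],  # near the isolated point, but beyond its threshold
])
inside_demo, K_test_demo = knn_ad_evaluate(X_train_demo, t_train_demo, X_test_demo)
print(inside_demo, K_test_demo)
```

The second and third test points reproduce the two situations described in the text: a sample nearly overlapping an outlier is retained with a single neighbour, whereas a sample merely in its vicinity is rejected.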
Optimizing the smoothing parameter k
Another important aspect concerns the choice of an appropriate smoothing parameter k, whose theoretical range is between 1 and n − 1.
Very low k values restrict the domain of applicability much more strictly than larger k values. This is because the chosen k value has a direct impact on the threshold calculations, which in turn makes it harder or easier for test samples to satisfy the threshold criterion. In this article, an appropriate k value was selected by Monte Carlo validation, maximizing the percentage of test samples considered within the AD, i.e. satisfying the AD criterion (Equation 5).
Figure 3 Simulated data set. Contour plot to demonstrate how the AD was characterised. Metric used: Euclidean distance; k = 12.
Box-and-whisker plots (box plots) were produced to get an overview of all these results. For instance, consider the plot in Figure 4, derived for the simulated dataset, showing the percentage of test samples retained within the AD for different k values (optimization carried out with 20% of samples in the test set and 1000 iterations). The median, marked in red in the middle of each box, indicates how many test samples were typically retained during the optimization process for a given k value. The top and bottom edges of each box (quartiles Q3 and Q1) correspond to the 75th and 25th percentiles, respectively. The whiskers can extend from Q1 − w(Q3 − Q1) to Q3 + w(Q3 − Q1), with w equal to 1.5 [23]. Points falling outside this range are considered outliers and are marked with a red '+'. For the proposed AD approach, box plots that show limited spread and retain the majority of test samples within the AD are favoured, and the corresponding range of k values can be considered when choosing the most appropriate k. Additionally, a line plot integrated in the same figure indicates the mean percentage of test samples considered within the AD for each k value.
A simultaneous interpretation of both these plots can make it easier for a user to decide upon an appropriate k value.
Figure 4 shows that the spread of the box plots for the initial k values is quite large. This may be due to the restrictive training thresholds, which excluded several test samples from the AD. As k increased, the spread narrowed; however, outliers were still present until k = 17. After this point, the box plots remained unchanged, with no outliers. Similar observations were derived from the mean line plot, which showed a significant initial rise followed by a stable curve over the first half of the k values. The plot did not show any major changes for the second half of the k values. To avoid very high k values, which would unnecessarily expand the defined AD, k = 12 was chosen as the appropriate value for this dataset. The earlier plots for this dataset (Figures 2 and 3) were thus derived using this k value.
We also performed an extended analysis on several diverse data sets (results not reported in this paper) to study the influence of the smoothing parameter k on the model's AD definition. It was concluded that the optimization of k can be a time-demanding procedure, especially with a very large number of samples, but it was also observed that the approach is quite insensitive to the smoothing parameter k, except for very small k values, which led to results influenced by local noise. Therefore, for many applications the optimization of the smoothing parameter can be avoided, and reasonable results can instead be obtained with a fixed k value empirically calculated as $n^{1/3}$.
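A Monte Carlo k-optimization of the kind described in this section might be sketched as follows. The code is an illustrative assumption, not the authors' procedure: the function names (`percent_in_ad`, `rule_of_thumb_k`), the random splitting scheme, the Euclidean metric, NumPy's default percentile rule and the rounding of $n^{1/3}$ to the nearest integer are all choices made for this sketch. For each random split, thresholds are derived from the training part and the percentage of test samples satisfying Equation 5 is recorded for every candidate k.

```python
import numpy as np

def _pairwise(A, B):
    # Euclidean distances between the rows of A and the rows of B
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def _neighbour_table(X):
    # n x (n - 1) table of sorted distances, self-distances removed
    n = X.shape[0]
    dist = _pairwise(X, X)
    return np.sort(dist[~np.eye(n, dtype=bool)].reshape(n, n - 1), axis=1)

def _thresholds(D, k):
    # Equations 1-4, with the smallest-threshold fallback for K_i = 0
    d_bar = D[:, :k].mean(axis=1)
    q1, q3 = np.percentile(d_bar, [25, 75])
    keep = D <= q3 + 1.5 * (q3 - q1)
    K = keep.sum(axis=1)
    t = np.zeros(D.shape[0])
    ok = K > 0
    t[ok] = (D * keep).sum(axis=1)[ok] / K[ok]
    if ok.any():
        t[~ok] = t[ok].min()
    return t

def percent_in_ad(X, k_values, test_fraction=0.2, n_iter=1000, seed=0):
    """Percentage of test samples inside the AD, per k and per iteration."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_test = max(1, int(round(test_fraction * n)))
    if max(k_values) > n - n_test - 1:
        raise ValueError("k must not exceed n_train - 1")
    rng = np.random.default_rng(seed)
    out = np.empty((len(k_values), n_iter))
    for it in range(n_iter):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        D = _neighbour_table(X[train])
        dist = _pairwise(X[test], X[train])
        for a, k in enumerate(k_values):
            t = _thresholds(D, k)
            # Equation 5: inside the AD if within at least one threshold
            out[a, it] = 100.0 * (dist <= t[None, :]).any(axis=1).mean()
    return out

def rule_of_thumb_k(n):
    # Fixed k = n^(1/3); rounding to the nearest integer is an assumption
    return max(1, int(round(n ** (1.0 / 3.0))))

# Small demonstration on random data (not the paper's dataset)
X_demo = np.random.default_rng(1).normal(size=(50, 2))
k_grid = list(range(1, 11))
res_demo = percent_in_ad(X_demo, k_grid, n_iter=100)
print(np.median(res_demo, axis=1))  # medians, as in the box plots
print(res_demo.mean(axis=1))        # means, as in the line plot
print(rule_of_thumb_k(50))
```

Each row of the returned array holds the values behind one box of a box plot such as Figure 4, so the per-k median, quartiles and mean can be read directly from it.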
Reflecting the reliability in derived results
After the AD approach has been applied to the model of interest, several features are taken into account to reflect upon the derived results. Moreover, as stated earlier, the response domain is taken into account to address the reliability of the results derived by characterising the AD of a model in its descriptor space.
To reflect a model's predictive ability, the predictive squared correlation coefficient ($Q^2$) was used. Since the test samples excluded from the model's AD are unreliably predicted, in theory they should not be accounted for when calculating the model's statistics ($Q^2$).
The following key parameters were evaluated:
a) Number of test samples retained within the AD.
b) $Q^2$ calculated from the test samples retained within the AD [24,25]:

$$Q^2 = 1 - \frac{\left[\sum_{j=1}^{n_{TS}} \left(\hat{y}_j - y_j\right)^2\right] / n_{TS}}{\left[\sum_{i=1}^{n_{TR}} \left(y_i - \bar{y}_{TR}\right)^2\right] / n_{TR}} \qquad (6)$$

where $y_j$ is the measured response value for the j-th sample and $\hat{y}_j$ its predicted value; $n_{TR}$ and $n_{TS}$ represent the total number of training and test
Figure 4 Simulated data set. Box-and-whisker plot of test samples (%) retained within the AD for different k values during k-optimization.
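As an illustration of Equation 6, the following Python sketch computes $Q^2$ using only the test samples retained within the AD. The function name `q2_within_ad` and the boolean mask `inside_ad` are hypothetical, and $\bar{y}_{TR}$ is assumed here to be the mean of the training responses, as its notation suggests. The numbers in the example are invented.

```python
import numpy as np

def q2_within_ad(y_test, y_pred, inside_ad, y_train):
    """Sketch of Equation 6 restricted to test samples inside the AD.

    Hypothetical helper; y_bar_TR is assumed to be the mean of the
    training responses.
    """
    mask = np.asarray(inside_ad, dtype=bool)
    y_te = np.asarray(y_test, dtype=float)[mask]
    y_hat = np.asarray(y_pred, dtype=float)[mask]
    y_tr = np.asarray(y_train, dtype=float)

    n_ts = y_te.size
    if n_ts == 0:
        return float("nan")  # no test sample retained within the AD

    # Mean squared prediction error over the retained test samples
    press = ((y_hat - y_te) ** 2).sum() / n_ts
    # Mean squared deviation of training responses from their mean
    tss = ((y_tr - y_tr.mean()) ** 2).sum() / y_tr.size
    return 1.0 - press / tss

# Invented numbers: the third test sample lies outside the AD and is ignored
q2_demo = q2_within_ad(
    y_test=[2.0, 4.0, 10.0],
    y_pred=[2.5, 3.5, 0.0],
    inside_ad=[True, True, False],
    y_train=[1.0, 2.0, 3.0, 4.0, 5.0],
)
print(q2_demo)
```

Excluding the poorly predicted third sample is what keeps $Q^2$ high in this toy case, which is the effect the AD is meant to capture.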