pROC 1.15.0
The latest version of pROC, 1.15.0, has just been released. It features significant speed improvements, many bug fixes, new methods for use in dplyr pipelines, and increased verbosity, and it prepares the way for some backwards-incompatible changes coming in pROC 1.16.0.
Verbosity
Since its initial release, pROC has been detecting the levels of the positive and negative classes (cases and controls), as well as the direction of the comparison, that is, whether values are higher in case or in control observations. Until now it has been doing so silently, but this has led to several issues and misunderstandings in the past. In particular, because of the detection of direction, ROC curves in pROC will nearly always have an AUC higher than 0.5, which can at times hide problems with certain classifiers, or cause bias in resampling operations such as bootstrapping or cross-validation.
In order to increase transparency, pROC 1.15.0 now prints a message on the command line when it auto-detects one of these two arguments.
> roc(aSAH$outcome, aSAH$ndka)
Setting levels: control = Good, case = Poor
Setting direction: controls < cases
Call:
roc.default(response = aSAH$outcome, predictor = aSAH$ndka)
Data: aSAH$ndka in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
Area under the curve: 0.612
If you run pROC repeatedly in loops, you may want to turn off these diagnostic messages. The recommended way is to specify both arguments explicitly:
roc(aSAH$outcome, aSAH$ndka, levels = c("Good", "Poor"), direction = "<")
Alternatively, you can pass quiet = TRUE to the roc function to silently suppress them.
roc(aSAH$outcome, aSAH$ndka, quiet = TRUE)
As mentioned earlier, this last option should be avoided when you are resampling, such as in bootstrap or cross-validation, as it could silently hide biases due to changing directions.
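To make the resampling advice concrete, here is a minimal bootstrap sketch that fixes levels and direction in every iteration, so that no resample can silently flip the comparison. The seed, the number of resamples (n_boot) and the variable names are arbitrary choices for illustration, not part of pROC.

```r
library(pROC)
data(aSAH)

set.seed(42)    # arbitrary seed, for reproducibility only
n_boot <- 200   # hypothetical number of resamples

boot_auc <- replicate(n_boot, {
  # Resample observations with replacement
  idx <- sample(nrow(aSAH), replace = TRUE)
  resampled <- aSAH[idx, ]
  # Explicit levels and direction: nothing is auto-detected,
  # so an AUC below 0.5 stays visible instead of being flipped
  r <- roc(resampled$outcome, resampled$ndka,
           levels = c("Good", "Poor"), direction = "<")
  as.numeric(auc(r))
})

summary(boot_auc)
```

Because both arguments are set, no diagnostic message is printed inside the loop.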
Speed
Several bottlenecks have been removed, yielding significant speedups in the roc function with algorithm = 2 (see issue 44), as well as in the coords function, which is now vectorized much more efficiently (see issue 52) and scales much better with the number of coordinates to calculate. With these improvements pROC is now as fast as other ROC R packages such as ROCR.
With Big Data becoming more and more prevalent, every speed up matters and making pROC faster has very high priority. If you think that a particular computation is abnormally slow, for instance with a particular combination of arguments, feel free to submit a bug report.
As a consequence, algorithm = 2 is now used by default for numeric predictors, and is automatically selected by the new algorithm = 6 meta algorithm. algorithm = 3 remains slightly faster with very low numbers of thresholds (below 50) and is still the default with ordered factor predictors.
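If you want to check how the two algorithms behave on your own data, a rough comparison could look like the sketch below. The simulated data set and its size are assumptions made purely for illustration, and timings will vary between machines.

```r
library(pROC)
data(aSAH)

# Explicit levels and direction keep the calls silent and comparable
roc_alg2 <- roc(aSAH$outcome, aSAH$s100b,
                levels = c("Good", "Poor"), direction = "<",
                algorithm = 2)
roc_alg3 <- roc(aSAH$outcome, aSAH$s100b,
                levels = c("Good", "Poor"), direction = "<",
                algorithm = 3)

# Both algorithms should describe the same curve; only run time differs
same_auc <- isTRUE(all.equal(as.numeric(auc(roc_alg2)),
                             as.numeric(auc(roc_alg3))))
same_auc

# Rough timing on simulated data (size chosen arbitrarily)
set.seed(1)
response <- rbinom(5000, 1, 0.5)
predictor <- rnorm(5000, mean = response)
system.time(roc(response, predictor, levels = c(0, 1),
                direction = "<", algorithm = 2))
system.time(roc(response, predictor, levels = c(0, 1),
                direction = "<", algorithm = 3))
```

A continuous predictor like this one has many distinct thresholds, which is the situation where algorithm = 2 is described as the better choice.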
Pipelines
The roc function can be used in pipelines, for instance with dplyr or magrittr. This is still a highly experimental feature and will change significantly in future versions (see issue 54 for instance). Here is an example of usage:
library(dplyr)
aSAH %>%
    filter(gender == "Female") %>%
    roc(outcome, s100b)
The roc.data.frame method supports both standard and non-standard evaluation (NSE), and the roc_ function supports standard evaluation only. By default it returns the roc object, which can then be piped to the coords function to extract coordinates that can be used in further pipelines:
aSAH %>%
    filter(gender == "Female") %>%
    roc(outcome, s100b) %>%
    coords(transpose=FALSE) %>%
    filter(sensitivity > 0.6, specificity > 0.6)
More details and use cases are available in the ?roc help page.
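For programmatic use, where column names are stored in variables, the standard-evaluation roc_ function may be more convenient. The sketch below assumes that roc_ takes the data frame first and the column names as character strings; the variable names response_col and predictor_col are made up for the example.

```r
library(pROC)
data(aSAH)

# Column names held as strings, e.g. coming from a configuration or a loop
response_col <- "outcome"
predictor_col <- "s100b"

# Standard evaluation: names are passed as character strings
roc_se <- roc_(aSAH, response_col, predictor_col,
               levels = c("Good", "Poor"), direction = "<")

# Reference curve built the classic way, for comparison
roc_ref <- roc(aSAH$outcome, aSAH$s100b,
               levels = c("Good", "Poor"), direction = "<")

all.equal(as.numeric(auc(roc_se)), as.numeric(auc(roc_ref)))
```

Since this interface is described as experimental, check ?roc in your installed version before relying on it.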
Transposing coordinates
Since the initial release of pROC, the coords function has been returning a matrix with thresholds in columns, and the coordinate variables in rows.
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
coords(rocobj, c(0.05, 0.2, 0.5))
#                   0.05       0.2       0.5
# threshold   0.05000000 0.2000000 0.5000000
# specificity 0.06944444 0.8055556 0.9722222
# sensitivity 0.97560976 0.6341463 0.2926829
This format doesn't conform to the grammar of the tidyverse, outlined by Hadley Wickham in his 2014 Tidy Data paper, which has become prevalent in modern R code. In addition, the dropping of dimensions by default makes it difficult to guess what type of data coords is going to return.
coords(rocobj, "best")
#   threshold specificity sensitivity
#   0.2050000   0.8055556   0.6341463
# A numeric vector
Although it is possible to pass drop = FALSE, the fact that it is not the default makes the behaviour unintuitive. In an upcoming version of pROC, this will be changed and coords will return a data.frame with the thresholds in rows and measurements in columns by default.
Changes in 1.15
- Addition of the transpose argument.
- Display a warning if transpose is missing. Pass transpose explicitly to silence the warning.
- Deprecation of as.list.
With transpose = FALSE, the output is a tidy data.frame suitable for use in pipelines:
coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE)
#      threshold specificity sensitivity
# 0.05      0.05  0.06944444   0.9756098
# 0.2       0.20  0.80555556   0.6341463
# 0.5       0.50  0.97222222   0.2926829
It is recommended that new developments set transpose = FALSE explicitly. Currently these changes are neutral to the API and do not affect functionality, apart from the warning.
Upcoming backwards-incompatible changes in the next version (1.16)
The next version of pROC will change the default transpose to FALSE. This is a backwards-incompatible change that will break any script that did not previously set transpose, and it will initially come with a warning to make debugging easier. Scripts that set transpose explicitly will be unaffected.
Recommendations
If you are writing a script calling the coords function, set transpose = FALSE to silence the warning and make sure your script keeps running smoothly once the default transpose is changed to FALSE. It is also possible to set transpose = TRUE to keep the current behavior; however, this option is likely to be deprecated in the long term, and ultimately dropped.
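As an illustration of this recommendation, the sketch below sets transpose explicitly and then selects thresholds by column name, which does not depend on the default layout. The 0.6 cut-offs are arbitrary values chosen for the example.

```r
library(pROC)
data(aSAH)

rocobj <- roc(aSAH$outcome, aSAH$s100b,
              levels = c("Good", "Poor"), direction = "<")

# Setting transpose explicitly avoids the warning in 1.15 and keeps the
# layout stable when the default changes
res <- coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE)

# One row per threshold, one column per measure: select by column name
keep <- res[, "sensitivity"] > 0.6 & res[, "specificity"] > 0.6
res[keep, , drop = FALSE]
```

With the values shown earlier for these three thresholds, only the 0.2 threshold satisfies both conditions.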
New coords return values
The coords function can now return two new values, "youden" and "closest.topleft". They can be returned regardless of whether input = "best" and of the value of the best.method argument, although, where possible, they will not be recalculated. They follow the best.weights argument as expected. See issue 48 for more information.
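A short sketch of how the new values might be requested through the ret argument, here for fixed thresholds rather than input = "best". The thresholds are the same arbitrary ones used earlier in this post.

```r
library(pROC)
data(aSAH)

rocobj <- roc(aSAH$outcome, aSAH$s100b,
              levels = c("Good", "Poor"), direction = "<")

# Ask for both criteria alongside the threshold itself
crit <- coords(rocobj, c(0.05, 0.2, 0.5),
               ret = c("threshold", "youden", "closest.topleft"),
               transpose = FALSE)
crit
```

Each row then holds the value of both criteria at the corresponding threshold, so the two can be compared side by side.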
Bug fixes
Several small bugs have been fixed in this version of pROC. Most of them were identified thanks to an increased unit test coverage. 65% of the code is now unit tested, up from 46% a year ago. The main weak points remain the testing of all bootstrapping and resampling operations. If you notice any unexpected or wrong behavior in those, or in any other function, feel free to submit a bug report.
Getting the update
The update is available on CRAN now. You can update your installation by simply typing:
install.packages("pROC")
Here is the full changelog:
- roc now prints messages when autodetecting levels and direction by default. Turn off with quiet = TRUE or set these values explicitly.
- Speedup with algorithm = 2 (issue 44) and in coords (issue 52).
- New algorithm = 6 (used by default) uses algorithm = 2 for numeric data, and algorithm = 3 for ordered vectors.
- New roc.data.frame method and roc_ function for use in pipelines.
- coords can now return "youden" and "closest.topleft" values (issue 48).
- New transpose argument for coords, TRUE by default (issue 54).
- Use text instead of Tcl/Tk progress bar by default (issue 51).
- Fix method = "density" smoothing when called directly from roc (issue 49).
- Renamed roc argument n to smooth.n.
- Fixed 'are.paired' ignoring smoothing arguments of roc2 with return.paired.rocs.
- New ret option "all" in coords (issue 47).
- drop in coords now drops the dimension of ret too (issue 43).
Xavier Robin
Published on Saturday, June 1, 2019 at 09:33 CEST
Permalink: /blog/2019/06/01/proc-1.15.0
Tags: pROC