Designing an R interface and dealing with S3 “classes”… any suggestions?

I (Bob) am spending some of this summer working on the esci package for R. I’ve got a rough draft cobbled together, but now I want to go back through to make a beautiful, maintainable codebase. Most importantly, I want to design and name the functions to provide an interface that will work well and as-expected for regular R users.

This post describes the approach I’m considering–if you have any feedback, I’d be very happy to have it.

The goals

I’m writing functions that will provide estimates with CIs for different effect sizes (means, mean differences, R, proportions, etc.).

Goal 1 – I’d like all functions to work with both summary and raw data. For example, I want users to be able to do:

estimate_mean(mtcars, mpg, conf.level = 0.99)

As well as

estimate_mean(m = 10, s = 2, n = 5)

Goal 2 – I’d like to try to make the functions compatible with tidyverse style and tidy evaluation (to the extent I understand either of these). So, for tidy style my goal is to use snake_case for function names. For tidy evaluation, it should be possible to send in column names unquoted, and to arbitrarily expand the number of columns to be processed. As in:

estimate_mean(data = mtcars, y = mpg, cyl, wt, gear)

Goal 3 – I’d like the package to be clean and easy to use. I don’t want different functions for slightly different styles of input (raw data vs. summary data). I do want the auto-fill for a function to be informative.

Goal 4 – I’d like the code to be efficient and easy to maintain. My approach has been to write a basic function that deals with summary data only, and then to use that as the basis for processing raw data, hopefully avoiding code duplication.
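
A minimal sketch of that layering in base R, using hypothetical helper names (ci_from_summary and ci_from_vector are placeholders for illustration, not functions in esci). The raw-data helper only reduces the vector to m, s, and n and then hands off to the summary helper, so the interval arithmetic lives in one place:

```r
# Hypothetical helper: CI for a mean from summary statistics only.
ci_from_summary <- function(m, s, n, conf.level = 0.95) {
    sem <- s / sqrt(n)                                  # standard error of the mean
    t_crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)  # two-sided critical t value
    moe <- t_crit * sem                                 # margin of error
    list(m = m, sem = sem, moe = moe,
         ci.low = m - moe, ci.high = m + moe)
}

# Hypothetical helper: reduce raw data to summary statistics, then reuse the above.
ci_from_vector <- function(y, conf.level = 0.95) {
    y <- y[!is.na(y)]                                   # drop missing values once, up front
    ci_from_summary(mean(y), sd(y), length(y), conf.level = conf.level)
}

# Both routes should agree, because the second is built on the first.
from_raw <- ci_from_vector(mtcars$mpg)
from_summary <- ci_from_summary(mean(mtcars$mpg), sd(mtcars$mpg), nrow(mtcars))
all.equal(from_raw, from_summary)
```

Comparing the two routes like this is also a cheap regression test: if the raw-data path ever drifts from the summary path, the comparison fails.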

Current Implementation

To meet these goals, I’ve been tinkering around with R’s S3 “classes” and the UseMethod dispatcher, which lets you route function calls to different implementations based on the class of an object. I’ve found this to be quite a journey. I’ve devised a working approach (code listing is below), but I’m eager for feedback, as it feels a bit icky: I am not confident that I’m dealing with unquoted arguments well/properly.

  • Detecting unquoted arguments… is there a better way? To dispatch properly, I’d like to detect if the user has passed in a column name as an unquoted argument (rather than, say, a vector). However, there doesn’t seem to be a way to dispatch by an unquoted argument (they don’t have a natural class, and checking the class before quoting causes an error). You can quote first, but literally anything can be quoted, so you don’t end up with a substantively different object if the original argument was unquoted vs. not. There does not seem to be an ‘is.unquoted’ or the like to check. So I’ve hit upon testing by using:
qpassed <- try(eval(y), silent = TRUE)

If y is an unquoted column name, and no object with that name exists in the calling environment, this eval throws an error and qpassed obtains the class “try-error”. This feels a bit icky. It also doesn’t distinguish an unquoted column name from a typo in an otherwise valid expression.

  • Is it ok to dispatch based on a calculated switch rather than what is actually passed? In the set of functions I envision, the real key to how to dispatch is in the format of y. If y is a vector, I want to dispatch to a function that estimates from a vector. If y is an unquoted column name, I want to dispatch to a function that handles a column from a data frame. If y is the beginning of a list of unquoted column names, I want to dispatch to a function that handles that list. And finally, if y is null it means summary data has been passed in, so I want to dispatch accordingly. The thing is, though, you can’t dispatch on y because it might be an unquoted column name, and these don’t have a class, can’t be evaluated before quoting, and therefore won’t be dispatched properly. I was stuck, and then it dawned on me that I could use the dispatching function to determine what type of y has been passed, and dispatch accordingly. What’s really strange about this is that I realized I could code the class of y in another variable (I called it switch) and ask UseMethod to use this calculated variable even though it wasn’t passed, and it does! And yet it does not actually pass the dispatching variable to the function that gets called! That’s not like any other type of class dispatch method I’ve ever seen… seems to work, but feels kind of wrong.
  • Is there no way to tell a vector from a single number? I was surprised when writing the dispatch function to pass in a vector and have the class show up as numeric. It looks like R doesn’t assign a vector class to vectors. Is there some way to distinguish single numbers from vectors? (length?)
  • as_label rather than quo_name — it looks like quo_name is being retired in favor of as_label… Not really a question, just a note to myself to update my code for this.
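
One possible alternative to the try(eval(y)) test, sketched under assumptions: capture y with rlang::enquo() and inspect the captured expression before evaluating anything. The helper name classify_y and its return labels are hypothetical, and the sketch assumes data is either NULL or a data frame. Since a single number in R is just a numeric vector of length one, length() is what separates the two cases.

```r
library(rlang)

# Hypothetical helper: describe what kind of 'y' the caller supplied.
# Assumes 'data' is either NULL or a data frame.
classify_y <- function(data = NULL, y = NULL) {
    y_quo <- enquo(y)                      # capture the expression without evaluating it
    if (quo_is_null(y_quo)) {
        return("missing")                  # y left at its NULL default
    }
    if (quo_is_symbol(y_quo) && !is.null(data) &&
        as_name(y_quo) %in% names(data)) {
        return("column")                   # bare name that matches a column of 'data'
    }
    value <- eval_tidy(y_quo)              # anything else is an ordinary R expression
    if (is.numeric(value)) {
        if (length(value) == 1) "single number" else "vector"
    } else {
        "other"
    }
}

classify_y(mtcars, mpg)          # bare column name
classify_y(y = c(2, 4, 6))       # numeric vector
classify_y(y = 3)                # length-one numeric
classify_y()                     # nothing passed for y
```

With this shape, a mistyped column name is no longer swallowed: it fails the column check, falls through to eval_tidy(), and raises the usual “object not found” error. The sketch uses as_name() because it is documented for symbols and strings only; as_label() is aimed at producing human-readable labels for arbitrary expressions.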

Code listing:

Here’s the dispatch function I’ve generated for estimating a single mean. It dispatches to one of 4 functions: 1) to one that works with summary data (.numeric), 2) to one that works with a vector (.vector), 3) to one that works with a single column name of data (.character), 4) or to one that works with a list of column names (.list). Although the dispatch function itself feels icky to me, I do like that these 4 functions are built stacked on top of each other (the list calls the single column, which calls the vector, which calls the numeric). I *think* that’s the way to go to avoid code duplication and make maintenance easy, but I’m happy for feedback on this as well.

This is just a skeleton that I will use/elaborate for other functions. I am still planning on adding assertive-based input checking, and providing a rich, well-named, and consistent output object.

estimate_mean <- function(data = NULL,
                         y = NULL,
                         ...,
                         m = NULL,
                         s = NULL,
                         n = NULL,
                         conf.level = 0.95) {


    switch <- 1

    if (!is.null(m)) {
        if(!is.null(data))  stop("You have passed summary statistics, so don't pass the 'data' parameter used for raw data.")
        if(!is.null(y)) stop("You have passed summary statistics, so don't pass the 'y' parameter used for raw data.")
    } else {
        # m is NULL in this branch, so only s and n can conflict with raw data
        if(!is.null(s))  stop("You have passed raw data, so don't pass the 's' parameter used for summary data.")
        if(!is.null(n))  stop("You have passed raw data, so don't pass the 'n' parameter used for summary data.")

        qpassed <- try(eval(y), silent = TRUE)

        if(!inherits(qpassed, "try-error") && is.numeric(qpassed)) {
            if(!is.null(data))  stop("You have passed y as a vector of data, so don't pass the 'data' parameter used for data frames.")
            class(switch) <- "vector"
        } else {
            dotlist <- rlang::quos(...)
            if(length(dotlist) != 0) {
                switch <- list()
            } else {
                switch <- "character"
            }
        }
    }

    UseMethod("estimate_mean", switch)

}

#' @export
estimate_mean.numeric <- function(m, s, n, conf.level = 0.95) {
    sem <- s / sqrt(n)
    moe <- sem * qt(1 - (1 - conf.level) / 2, df = n - 1)  # margin of error = critical t * standard error
    ci.low <- m - moe
    ci.high <- m + moe
    res <- list(m = m,
                sem = sem,
                moe = moe,
                ci.low = ci.low,
                ci.high = ci.high)
    formatted_mean <- stringr::str_interp(
        "mean = $[.2f]{m} ${conf.level*100}% CI [$[.2f]{ci.low}, $[.2f]{ci.high}]"
    )
    class(res) <- "estimate"
    return(res)
}

#' @export
estimate_mean.vector <- function(y, conf.level = 0.95) {
    y <- na.omit(y)
    m <- mean(y, na.rm = TRUE)
    s <- sd(y, na.rm = TRUE)
    n <- length(y)
    res <- estimate_mean.numeric(m, s, n, conf.level = conf.level)
    return(res)
}

#' @export
estimate_mean.character <- function(data, y, conf.level = 0.95) {
    y_enquo <- rlang::enquo(y)
    y_quoname <- rlang::quo_name(y_enquo)

    res <- estimate_mean.vector(data[[y_quoname]], conf.level = conf.level)
    return(res)
}


#' @export
estimate_mean.list <- function(data, ..., conf.level = 0.95) {

    res <- list()
    dotlist <- rlang::quos(...)
    for (y_var in dotlist) {
        y_name <- rlang::quo_name(y_var)
        res[[y_name]] <- estimate_mean.character(data = data, y = !!y_name, conf.level = conf.level)
    }

    class(res) <- "estimate list"

    return(res)

}
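
On the question of dispatching on a calculated value: the second argument of UseMethod is documented as the object whose class determines the method, and it defaults to the first argument of the enclosing function. The chosen method is then called with the arguments of the original call, which would explain why the calculated variable never reaches it. A minimal base R sketch of that behaviour, using a hypothetical generic named describe that has nothing to do with esci:

```r
# Hypothetical generic that dispatches on a computed object.
describe <- function(x = NULL) {
    # Build a throwaway object whose class encodes the decision
    kind <- structure(list(), class = if (is.null(x)) "summary_input" else "raw_input")
    UseMethod("describe", kind)   # dispatch on 'kind', not on the first argument
}

# Neither method has a 'kind' argument; each receives only the original call's arguments.
describe.summary_input <- function(x = NULL) "summary path"
describe.raw_input <- function(x = NULL) paste("raw path, n =", length(x))

describe()            # dispatches to describe.summary_input
describe(c(1, 2, 3))  # dispatches to describe.raw_input
```

If that still feels wrong, a plain if/else in the user-facing function that calls internal helpers directly would do the same routing without S3, at the cost of losing the method naming convention.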

So… if any R package developers have thoughts or suggestions, I’d be glad to have them. Thanks in advance.

About

I'm a teacher, researcher, and gadfly of neuroscience. My research interests are in the neural basis of learning and memory, the history of neuroscience, computational neuroscience, bibliometrics, and the philosophy of science. I teach courses in neuroscience, statistics, research methods, learning and memory, and happiness. In my spare time I'm usually tinkering with computers, writing programs, or playing ice hockey.
