Wednesday, July 5
 

11:00am CEST

Analysis of German Fuel Prices with R
Keywords: Analytics, Marketing, tidyverse, purrr, ggplot2, rgdal, sp and more
Webpages: https://creativecommons.tankerkoenig.de (sic), https://www.openstreetmap.org/
We present an R-based analysis to measure the impact of different market drivers on fuel prices in Germany. The analysis is based on the open dataset on German fuel prices, bringing in many additional open data sets along the way.
  • Overview of the dataset
    1. History, Legal framework and data collection
    2. Current uses in “price-finder apps”
    3. Structure of the dataset
    4. Preparation of the data
    5. A first graphical analysis
    • price levels
    • weekly and daily pricing patterns
  • Overview of potential price drivers and corresponding data sources
    1. A purrr workflow for preparing regional data from Destatis (see the sketch after this list)
    • Number of registered cars
    • Number of fuel stations
    • Number of inhabitants
    • Mean income, etc.
    2. Determining geographical market drivers with OSM data using sp, rgdal, geosphere
    • Branded vs. independent
    • Location: highway, close to highway exit (“Autohof”), etc.
    • Proximity to competitors, etc.
    3. Cost drivers
    • Market prices for crude oil
    • Distance of fuel station to fuel depot
    • Land lease and property prices
    4. Outlook:
    • Weather
    • Traffic density
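To give a flavour of the purrr step referenced in the outline above, a hypothetical preparation sketch follows; the Destatis file names and column names are invented placeholders, not material from the talk.

```r
## Hypothetical purrr workflow: read several regional Destatis extracts
## (file and column names invented for illustration) and combine them
## into one table keyed by region.
library(purrr)
library(readr)
library(dplyr)
library(tidyr)

files <- c(cars        = "kfz_bestand.csv",
           stations    = "tankstellen.csv",
           inhabitants = "bevoelkerung.csv",
           income      = "einkommen.csv")

regional <- files %>%
  imap(~ read_csv2(.x) %>%                            # Destatis exports use ';' separators
         transmute(region_id, indicator = .y, value)) %>%
  bind_rows() %>%
  pivot_wider(names_from = indicator, values_from = value)
```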
Based on this data, we will present different modelling approaches to quantify the impact of the above drivers on average price levels. We will also give an outlook and first results on temporal pricing patterns and indicators for competitive or anti-competitive behaviour.
This talk is a condensed version of an online R-workshop that I am currently preparing and which I expect to be fully available at the time of UseR 2017.

Speakers


Wednesday July 5, 2017 11:00am - 11:18am CEST
PLENARY Wild Gallery

11:00am CEST

How we built a Shiny App for 700 users?
Keywords: Shiny, shiny.semantic, UI, UX, application performance, analytics consulting
Shiny has proved itself a great tool for communicating data science teams’ results. However, developing a Shiny app for a large-scope project that will be used commercially by more than a few dozen users is not easy. The first challenge is the User Interface (UI): the expectation is that the app should not differ from modern web pages. Secondly, performance directly impacts user experience (UX), and it’s difficult to maintain efficiency with a growing set of requirements and user base.
In this talk, we will share our experience from a real-life case study of building an app used daily by 700 users where our data science team tackled all these problems. This, to our knowledge, was one of the biggest production deployments of a Shiny App.
We will show an innovative approach to building a beautiful and flexible Shiny UI using shiny.semantic package (an alternative to standard Bootstrap). Furthermore, we will talk about the non-standard optimization tricks we implemented to gain performance. Then we will discuss challenges regarding complex reactivity and offer solutions. We will go through implementation and deployment process of the app using a load balancer. Finally, we will present the application and give details on how this benefited our client.
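As a flavour of the UI approach, here is a minimal sketch of a Shiny UI built on shiny.semantic rather than Bootstrap; it only assumes the package’s semanticPage() wrapper and standard Semantic UI CSS classes, and is not code from the production app.

```r
## Minimal shiny.semantic sketch: Semantic UI classes replace the default Bootstrap look.
library(shiny)
library(shiny.semantic)

ui <- semanticPage(
  title = "Demo",
  div(class = "ui raised segment",
      h2(class = "ui header", "Daily usage"),
      plotOutput("usage"))
)

server <- function(input, output, session) {
  output$usage <- renderPlot(plot(AirPassengers))
}

shinyApp(ui, server)
```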



Wednesday July 5, 2017 11:00am - 11:18am CEST
4.02 Wild Gallery
  Talk, Shiny I

11:00am CEST

Robets: Forecasting with Robust Exponential Smoothing with Trend and Seasonality
Keywords: Time Series, Forecasting, Robust Statistics, Exponential Smoothing
Webpages: https://CRAN.R-project.org/package=robets, https://rcrevits.wordpress.com/
Simple forecasting methods, such as exponential smoothing, are very popular in business analytics. This is not only due to their simplicity, but also because they perform very well, in particular for shorter time series. Incorporating trend and seasonality into an exponential smoothing method is standard. Many real time series show seasonal patterns that should be exploited for forecasting purposes. Whether to include a trend is less clear. For instance, weekly sales (in units) may show an increasing trend, but the sales will not grow to infinity. Here, the damped trend model offers a solution. Damped trend exponential smoothing gives excellent results in forecasting competitions.
In a highly cited paper, Hyndman and Khandakar (2008) developed an automatic forecasting method using exponential smoothing, available in the R package forecast. We propose the package robets, an outlier-robust alternative to the function ets in the forecast package. For each method in a class of exponential smoothing variants we provide a robust alternative. The class includes methods with a damped trend and/or seasonal components. The robust method is developed by robustifying every aspect of the original exponential smoothing variant: we provide robust forecasting equations, robust initial values, robust smoothing parameter estimation and a robust information criterion. The method is an extension of Gelper, Fried, and Croux (2010) and is described in more detail in Crevits and Croux (2016).
The code of the developed R package is based on the function ets of the forecast package. The usual functions for visualizing the models and forecasts also work for robets objects. Additionally there is a function plotOutliers which highlights outlying values in a time series.
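A minimal usage sketch is given below, assuming robets() mirrors the ets() interface and that the usual forecast and plotting methods carry over as described; this is not code from the talk.

```r
## Sketch: fit a robust ETS model and inspect outliers (interface assumed to mirror ets()).
library(robets)
library(forecast)

fit <- robets(AirPassengers)        # automatic robust model selection
summary(fit)
plot(forecast(fit, h = 24))         # the usual forecast methods also work for robets objects
plotOutliers(fit)                   # highlight outlying values in the series
```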
References
Crevits, Ruben, and Christophe Croux. 2016. “Forecasting with Robust Exponential Smoothing with Damped Trend and Seasonal Components.” Working Paper.

Gelper, S, R Fried, and C Croux. 2010. “Robust Forecasting with Exponential and Holt-Winters Smoothing.” Journal of Forecasting 29: 285–300.

Hyndman, R J, and Y Khandakar. 2008. “Automatic Time Series Forecasting: The Forecast Package for R.” Journal of Statistical Software 27 (3).




Speakers


Wednesday July 5, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery

11:00am CEST

Social contact data in endemic-epidemic models and probabilistic forecasting with surveillance
Keywords: age-structured contact matrix, areal count time series, infectious disease epidemiology, norovirus, spatio-temporal surveillance data
Webpages: https://CRAN.R-project.org/package=surveillance
Routine surveillance of notifiable infectious diseases gives rise to weekly counts of reported cases stratified by region and age group. A well-established approach to the statistical analysis of such surveillance data is endemic-epidemic time-series modelling (hhh4) as implemented in the R package surveillance (Meyer, Held, and Höhle 2017). Autoregressive model components reflect the temporal dependence inherent to communicable diseases. Spatial dynamics are largely driven by human travel and can be captured by movement network data or a parametric power law based on the adjacency matrix of the regions. Furthermore, the social phenomenon of “like seeks like” produces characteristic contact patterns between subgroups of a population, in particular with respect to age (Mossong et al. 2008). We thus incorporate an age-structured contact matrix in the hhh4 modelling framework to
  1. assess age-specific disease spread while accounting for its spatial pattern (Meyer and Held 2017)
  2. improve probabilistic forecasts of infectious disease spread (Held, Meyer, and Bracher 2017)
We analyze weekly surveillance counts on norovirus gastroenteritis from the 12 city districts of Berlin, in six age groups, from week 2011/27 to week 2015/26. The following year (2015/27 to 2016/26) is used to assess the quality of the predictions.
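For orientation, a stripped-down hhh4() call of the kind this analysis builds on is sketched below, using an example data set shipped with surveillance; the age-structured contact matrix extension itself is not shown.

```r
## Basic endemic-epidemic (hhh4) fit on an example sts object; the norovirus analysis
## described above additionally feeds a contact matrix into the epidemic components.
library(surveillance)
data("measlesWeserEms")

fit <- hhh4(measlesWeserEms, control = list(
  ar  = list(f = ~1),                                      # within-district epidemic component
  ne  = list(f = ~1, weights = W_powerlaw(maxlag = 5)),    # between-district spread (power law)
  end = list(f = addSeason2formula(~1, period = 52)),      # endemic component with seasonality
  family = "NegBin1"))
summary(fit)
```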
References
Held, Leonhard, Sebastian Meyer, and Johannes Bracher. 2017. “Probabilistic Forecasting in Infectious Disease Epidemiology: The Thirteenth Armitage Lecture.” bioRxiv. doi:10.1101/104000.

Meyer, Sebastian, and Leonhard Held. 2017. “Incorporating Social Contact Data in Spatio-Temporal Models for Infectious Disease Spread.” Biostatistics 18 (2): 338–51. doi:10.1093/biostatistics/kxw051.

Meyer, Sebastian, Leonhard Held, and Michael Höhle. 2017. “Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance.” Journal of Statistical Software. http://arxiv.org/abs/1411.0416.

Mossong, Joël, Niel Hens, Mark Jit, Philippe Beutels, Kari Auranen, Rafael Mikolajczyk, Marco Massari, et al. 2008. “Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases.” PLoS Medicine 5 (3): e74. doi:10.1371/journal.pmed.0050074.




Speakers

Sebastian Meyer

Friedrich-Alexander-Universität Erlangen-Nürnberg



Wednesday July 5, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery

11:00am CEST

Transformation Forests
Keywords: random forest, transformation model, quantile regression forest, conditional distribution, conditional quantiles
Webpages: https://R-forge.R-project.org/projects/ctm, https://arxiv.org/abs/1701.02110
Regression models for supervised learning problems with a continuous target are commonly understood as models for the conditional mean of the target given predictors. This notion is simple and therefore appealing for interpretation and visualisation. Information about the whole underlying conditional distribution is, however, not available from these models. A more general understanding of regression models as models for conditional distributions allows much broader inference from such models, for example the computation of prediction intervals. Several random forest-type algorithms aim at estimating conditional distributions, most prominently quantile regression forests. We propose a novel approach based on a parametric family of distributions characterised by their transformation function. A dedicated novel “transformation tree” algorithm able to detect distributional changes is developed. Based on these transformation trees, we introduce “transformation forests” as an adaptive local likelihood estimator of conditional distribution functions. The resulting models are fully parametric yet very general and allow broad inference procedures, such as the model-based bootstrap, to be applied in a straightforward way. The procedures are implemented in the “trtf” R add-on package currently available from R-Forge.

Speakers


Wednesday July 5, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

11:00am CEST

Updates to the Documentation System for R
Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT

Funding: This work is supported by funding from the R Consortium and The University of Utah Center for Clinical and Translational Science (NIH 5UL1TR001067-02).

Abstract: Over the last few years, as the open source statistical package R has come to prominence, it has gained important resources, such as multiple flexible class systems. However, methods for documentation have not kept pace with other advances in the language. I will present the work of the R Documentation Task Force, an R Consortium Working Group, in creating the next-generation documentation system for R.

The new documentation system is based on an S4 formal class system and exists independently of, but is complementary to, the packaging system in R. Documentation is stored as R objects and as such can be manipulated programmatically like any other R object.

This creates a “many in, many out” design, meaning that developers of software and documentation can create documentation in the format that is easiest for them, such as Rd or Roxygen, and users of the documentation can read or utilize it in whatever format is convenient. Since R also makes use of code from other languages such as C++, this provides facilities for including documentation without recreating it.

This work is based on input from the R Documentation Task Force, a working group supported by the R Consortium and the University of Utah Center for Clinical and Translational Science, consisting of R Core developers, representatives from R Consortium member companies, and community developers with a relevant interest in documentation.

Good documentation is critical for researchers to disseminate computational research methods, either internally or externally to their organization. This work will facilitate the creation of documentation by making documentation immediately accessible and promote documentation consumption through multiple outputs which can be implemented by developers.

Speakers


Wednesday July 5, 2017 11:00am - 11:18am CEST
3.02 Wild Gallery

11:18am CEST

A Benchmark of Open Source Tools for Machine Learning from R
Keywords: machine learning, predictive modeling, predictive accuracy, scalability, speed
Webpages: https://github.com/szilard/benchm-ml
Binary classification is one of the most widely used machine learning methods in business applications. If the number of features is not very large (sparse), algorithms such as random forests, gradient boosted trees or deep learning neural networks (and ensembles of those) are expected to perform best in terms of accuracy. There are countless off-the-shelf open source implementations of these algorithms (e.g. R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.), but which one should you use in practice? Surprisingly, there is huge variation between even the most commonly used implementations of the same algorithm in terms of scalability, speed and accuracy. In this talk we will see which open source tools work reasonably well on larger datasets commonly encountered in practice. Not surprisingly, all the best tools are available seamlessly from R.

Speakers


Wednesday July 5, 2017 11:18am - 11:36am CEST
2.02 Wild Gallery

11:18am CEST

A restricted composite likelihood approach to modelling Gaussian geostatistical data
Keywords: composite likelihood, effective sample size, REML, spatial dependence
Composite likelihood methods have become popular in spatial statistics. This is mainly because full maximum likelihood requires inverting large matrices, which becomes computationally expensive when a large number of regions is under consideration. We introduce restricted pairwise composite likelihood (RECL) methods for estimation of mean and covariance parameters in a spatial Gaussian random field, without resorting to the full likelihood. A simulation study was carried out to investigate how this method works in settings of increasing-domain as well as infill asymptotics, whilst varying the strength of correlation, with scenarios similar to Curriero and Lele (1999). Preliminary results showed that pairwise composite likelihoods tend to underestimate the variance parameters, especially when there is high correlation, while RECL corrects for the underestimation. Therefore, RECL is recommended if interest is in both the mean and the variance parameters. The methods are implemented in R and made available in the spatialRECL package. The methodology will be highlighted in the first part of the presentation, followed by an analysis of a real data example of TSH levels from Galicia, Spain.
References
Curriero, F, and S Lele. 1999. “A Composite Likelihood Approach to Semivariogram Estimation.” J Agric Biol Envir S 4 (1): 9–28.






Wednesday July 5, 2017 11:18am - 11:36am CEST
2.01 Wild Gallery

11:18am CEST

Adaptive Subgroup Selection in Sequential Trials

ASSISTant (https://cran.r-project.org/package=ASSISTant) is an R package for a novel group-sequential adaptive trial design. The design is motivated by a randomized controlled trial to compare an endovascular procedure with conventional medical treatment for stroke patients; see Lai, Lavori and Liao (2014). The endovascular procedure may be effective only in a subgroup of patients that is not known at the design stage but may be learned statistically from the data collected during the course of the trial. The group-sequential design implemented in ASSISTant incorporates adaptive choice of the patient subgroup among several possibilities, which include the entire patient population as a choice. Appropriate type I and type II errors of a test can be defined in this setting, and the design maintains a prescribed type I error by using the closed testing principle in multiple testing.

This adaptive design underlies the NIH DEFUSE 3 trial (https://www.nihstrokenet.org/clinical-trials/acute-interventional-trials/defuse-3), currently underway. The package is available on CRAN and GitHub.

Speakers

Balasubramanian Narasimhan

Stanford University



Wednesday July 5, 2017 11:18am - 11:36am CEST
3.01 Wild Gallery

11:18am CEST

bradio: Add data music widgets to your business intelligence dashboards
Keywords: music, sonification, Shiny
Webpages: http://src.thomaslevine.com/bradio/, https://thomaslevine.com/!/data-music/
Recent years have brought considerable advances in data sonification (Ligges et al. 2016; Sueur, Aubin, and Simonis 2008; Stone and Garisson 2012; Stone and Garrison 2013; Levine 2015), but data sonification is still a very involved process with many technical limitations. Developing data music in R has historically been a very tedious process because of R’s poor concurrency features and general weakness in audio rendering capabilities (Levine 2016). End-user data music tools can be more straightforward, but they usually constrain users to very particular and rudimentary aesthetic mappings (Siegert and Williams 2017; Levine 2014; Borum Consulting 2014). Finally, existing data music implementations have limited interactivity capabilities, and no integrated solutions are available for embedding in business intelligence dashboards.
I have addressed these various issues by implementing bradio, a Shiny widget for rendering data music. In bradio, a song is encoded as a Javascript function that can take data inputs from R, through Shiny. The Javascript component relies on the webaudio Javascript package (johnnyscript 2014) and is thus compatible with songs written for the webaudio Javascript package, the baudio Javascript package (substack 2014), and Javascript code-music-studio (substack 2015); this compatibility allows for existing songs to be adapted easily as data music. bradio merges the convenience of interactive Javascript music with the data analysis power of R, facilitating the prototyping and presentation of sophisticated interactive data music.
Borum Consulting. 2014. “Readme for Tonesintune Version 2.0.” http://tonesintune.com/Readme.php.

johnnyscript. 2014. webaudio. 2.0.0 ed. https://www.npmjs.com/package/webaudio.

Levine, Thomas. 2014. Sheetmusic. 0.0.4 ed. https://pypi.python.org/pypi/sheetmusic.

———. 2015. “Plotting Data as Music Videos in R.” In UseR! http://user2015.math.aau.dk/contributed_talks#61.

———. 2016. “Approaches to Live Music Synthesis for Multivariate Data Analysis in R.” In SatRday. http://budapest.satrdays.org/.

Ligges, Uwe, Sebastian Krey, Olaf Mersmann, and Sarah Schnackenberg. 2016. tuneR: Analysis of Music. http://r-forge.r-project.org/projects/tuner/.

Siegert, Stefan, and Robin Williams. 2017. Sonify: Data Sonification - Turning Data into Sound. https://CRAN.R-project.org/package=sonify.

Stone, Eric, and Jesse Garisson. 2012. “Give Your Data a Listen.” In UseR! http://biostat.mc.vanderbilt.edu/wiki/pub/Main/UseR-2012/81-Stone.pdf.

Stone, Eric, and Jesse Garrison. 2013. AudiolyzR: Give Your Data a Listen. https://CRAN.R-project.org/package=audiolyzR.

substack. 2014. baudio. 2.1.2 ed. https://www.npmjs.com/package/baudio.

———. 2015. code-music-studio. 1.5.2 ed. https://www.npmjs.com/package/code-music-studio.

———. n.d. “Make Music with Algorithms!” http://studio.substack.net/-/help.

Sueur, J., T. Aubin, and C. Simonis. 2008. “Seewave: A Free Modular Tool for Sound Analysis and Synthesis.” Bioacoustics 18: 213–26. http://isyeb.mnhn.fr/IMG/pdf/sueuretal_bioacoustics_2008.pdf.



Speakers

Wednesday July 5, 2017 11:18am - 11:36am CEST
4.02 Wild Gallery
  Talk, Shiny I

11:18am CEST

Ensemble packages with user friendly interface: an added value for the R community
Keywords: Ensemble package, user-friendly interface, gene expression analysis, biclustering
Webpages: https://r-forge.r-project.org/R/?group_id=589, https://github.com/ewouddt/RcmdrPlugin.BiclustGUI
The increasing number of R packages makes it difficult for any newcomer to orient himself/herself among the large number of options available for topics such as modeling, clustering, variable selection, optimization, sample size estimation etc. The quality of the packages, the associated help files, error reporting systems and continuity of support vary significantly, and methods may be duplicated across multiple packages if the packages focus on a specific application within a particular field only.
Ensemble packages can be seen as another type of contribution to the R community. Careful revision of packages that approach the same topic from different perspectives may be very useful for increasing the overall quality of the CRAN repository. The revision should not be limited to the technical part, but should also cover methodological aspects. A necessary condition for success of the ensemble package is of course that this revision happens in close collaboration with the authors of the original package.
An additional benefit of ensemble packages lies in leveraging the many graphical options of the traditional R framework. Starting from a simple Graphical User Interface, over R Commander plugins, to Shiny applications, R provides a wide range of visualization options. By combining visualization with the content of the original packages, an ensemble package can provide a different user experience. Such a property extends the added value of an ensemble package beyond a simple review library. Necessarily, the flexibility of the package is reduced by the transformation into a point-and-click interface, but users requiring a fully flexible environment can be referred to the original packages.
We present two case studies of such ensemble packages: IsoGeneGUI and BiclustGUI. IsoGeneGUI is implemented as a Graphical User Interface (GUI) and combines the original IsoGene package for dose-response analysis of high dimensional data with other packages such as orQA, ORIClust, goric and ORCME, which offer methods to analyze different perspectives of gene expression based data sets. IsoGeneGUI thus provides a wide range of methods (and the most complete data analysis tool for order-restricted analysis) in a user-friendly fashion. Hence, analyses can be performed by users with only limited knowledge of R programming. The RcmdrPlugin.BiclustGUI is a GUI plugin for R Commander that combines various biclustering packages, bringing multiple algorithms, visualizations and diagnostics tools into one unified framework. Additionally, the package allows for simple inclusion of potential future biclustering methods.
The collaboration with the authors of the original packages on implementation of their methods within an ensemble package was extremely important for both case studies. Indeed, in that way, the link with the original packages could be retained. The ensemble package allowed for careful evaluation of the methods, their overlap and differences, and for presenting them as a concise framework in a user friendly environment.

Speakers


Wednesday July 5, 2017 11:18am - 11:36am CEST
3.02 Wild Gallery
  Talk, Packages

11:18am CEST

When is an Outlier an Outlier? The O3 plot
Whether a case might be identified as an outlier depends on the other cases in the dataset and on the variables available. A case can stand out as unusual on one or two variables, while appearing middling on the others. If a case is identified as an outlier, it is useful to find out why. This paper introduces a new display, the O3 plot (Overview Of Outliers), for supporting outlier analyses, and describes its implementation in R.

Figure 1 shows an example of an O3 plot for four German demographic variables recorded for the 299 Bundestag constituencies. There is a row for each variable combination for which outliers were found and two blocks of columns. Each row of the block on the left shows which variable combination defines that row. There are 4 variables, so there are 4 columns, one for each variable, and a cell is coloured grey if that variable is part of the combination. The combinations (the rows) are sorted by numbers of outliers found within numbers of variables in the combination, and blue dotted lines separate the combinations with different numbers of variables. The columns in the left block are sorted by how often the variables occur. A boundary column separates this block from the block on the right that records the outliers found with whichever outlier identification algorithm was used (in this case Wilkinson’s HDoutliers with alpha=0.05). There is one column for each case that is found to be an outlier at least once and these columns are sorted by the numbers of times the cases are outliers.

Given n cases and p variables, there would be (p + 1 + n) columns if all cases were an outlier on some combination of variables, and if outliers were identified for all possible combinations there would be 2^p − 1 rows. An O3 plot has too many rows if there are lots of variables with many combinations having outliers, and it has too many columns if there are lots of cases identified as outliers on at least one variable combination. Combinations are only reported if outliers are found for them, and only those cases are reported which occur at least once as an outlier.

O3 plots show which cases are identified often as outliers, which are identified in single dimensions, and which are only identified in higher dimensions. They highlight which variables and combinations of variables may be affected by possible outliers.

Speakers


Wednesday July 5, 2017 11:18am - 11:36am CEST
PLENARY Wild Gallery

11:36am CEST

Collaborative Development in R: A Case Study with the sparsebn package
With the massive popularity of R in the statistics and data science communities along with the recent movement towards open development and reproducible research with CRAN and GitHub, R has become the de facto go-to for cutting edge statistical software. With this movement, a problem faced by many groups is how individual programmers can work on related codebases in an open, collaborative manner while emphasizing good software practices and reproducible research. The sparsebn package, recently released on CRAN, is an example of this dilemma: sparsebn is a family of packages for learning graphical models, with different algorithms tailored for different types of data. Although each algorithm shares many similarities, different researchers and programmers were in charge of implementing different algorithms. Instead of releasing disparate, unrelated packages, our group developed a shared family of packages in order to streamline the addition of new algorithms so as to minimize programming overhead (the dreaded “data munging” and “plumbing” work). In this talk, I will use sparsebn as a case study in collaborative research and development, illustrating both the development process and the fruits of our labour: A fast, modern package for learning graphical models that leverages cutting-edge trends in high-dimensional statistics and machine learning.

Speakers

Bryon Aragam

Carnegie Mellon University



Wednesday July 5, 2017 11:36am - 11:54am CEST
3.02 Wild Gallery

11:36am CEST

Curve Linear Regression with clr
Keywords: Dimension reduction, Correlation dimension, Singular value decomposition, Load forecasting
We present a new R package for curve linear regression: the clr package.
This package implements a new methodology for linear regression with both curve response and curve regressors, which is described in Cho et al. (2013) and Cho et al. (2015).
The key idea behind this methodology is dimension reduction based on a singular value decomposition in a Hilbert Space, which reduces the curve regression problem to several scalar linear regression problems.
We apply curve linear regression with clr to model and forecast daily electricity loads.
References
Bathia, N., Q. Yao, and F. Ziegelmann. 2010. “Identifying the Finite Dimensionality of Curve Time Series.” The Annals of Statistics 38: 3352–86.

Cho, H., Y. Goude, X. Brossat, and Q. Yao. 2013. “Modelling and Forecasting Daily Electricity Load Curves: A Hybrid Approach.” Journal of the American Statistical Association 108: 7–21.

———. 2015. “Modelling and Forecasting Daily Electricity Load via Curve Linear Regression.” In Modeling and Stochastic Learning for Forecasting in High Dimension, edited by Anestis Antoniadis and Xavier Brossat, 35–54. Springer.

Fan, J., and Q. Yao. 2003. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer.

Hall, P., and J. L. Horowitz. 2007. “Methodology and Convergence Rates for Functional Linear Regression.” The Annals of Statistics 35: 70–91.




Speakers


Wednesday July 5, 2017 11:36am - 11:54am CEST
2.01 Wild Gallery

11:36am CEST

Distributional Trees and Forests
Keywords: Distributional regression, recursive partitioning, decision trees, random forests
Webpages: https://R-Forge.R-project.org/projects/partykit/
In regression analysis one is interested in the relationship between a dependent variable and one or more explanatory variables. Various methods to fit statistical models to the data set have been developed, starting from ordinary linear models considering only the mean of the response variable and ranging to probabilistic models where all parameters of a distribution are fit to the given data set.
If there is a strong variation within the data it might be advantageous to split the data first into more homogeneous subgroups based on given covariates and then fit a local model in each subgroup rather than fitting one global model to the whole data set. This can be done by applying regression trees and forests.
Both of these concepts, parametric modeling and algorithmic trees, have been investigated and developed further, however mostly separately from each other. Therefore, our goal is to embed the progress made in the field of probabilistic modeling into algorithmic tree and forest models. In particular, more flexible models such as GAMLSS (Rigby and Stasinopoulos 2005) should be fitted in the nodes of a tree in order to capture location, scale, shape as well as censoring, tail behavior etc., while non-additive effects of the explanatory variables can be detected by the splitting algorithm used to build the tree.
The corresponding implementation is provided in the R package disttree, which is available on R-Forge and includes the two main functions disttree and distforest. In addition to the data set and a formula, the user only has to specify a distribution family and receives a tree/forest model with a set of distribution parameters for each final node. One possible way to specify a distribution family is to hand over a gamlss.dist family object (Stasinopoulos, Rigby, and others 2007). In disttree and distforest the fitting function distfit is applied within a tree-building algorithm chosen by the user: either the MOB algorithm for model-based recursive partitioning (Zeileis, Hothorn, and Hornik 2008) or the ctree algorithm (Hothorn, Hornik, and Zeileis 2006) can be used as a framework. These algorithms are both implemented in the partykit package (Hothorn et al. 2015).
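A short sketch of this interface is given below; the exact argument names (e.g. ntree) are assumptions based on the description above rather than documented facts.

```r
## Illustrative disttree/distforest calls with a gamlss.dist family object
## (argument names assumed from the description above).
library(disttree)
library(gamlss.dist)

set.seed(1)
d <- data.frame(x = runif(500))
d$y <- rnorm(500, mean = 2 + 3 * (d$x > 0.5), sd = 1 + (d$x > 0.5))

dt <- disttree(y ~ x, data = d, family = NO())     # normal family: mu and sigma per node
df <- distforest(y ~ x, data = d, family = NO(), ntree = 100)

coef(dt)                          # fitted distribution parameters in each final node
head(predict(df, newdata = d))    # case-wise parameter predictions
```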
References
Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis: 651–74.

Hothorn, Torsten, Kurt Hornik, Carolin Strobl, and Achim Zeileis. 2015. “Package ’Party’.” Package Reference Manual for Party Version 0.9–0.998 16: 37.

Rigby, Robert A, and D Mikis Stasinopoulos. 2005. “Generalized Additive Models for Location Scale and Shape (with Discussion).” Applied Statistics 54.3: 507–54.

Stasinopoulos, D Mikis, Robert A Rigby, and others. 2007. “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software 23 (7): 1–46.

Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. 2008. “Model-Based Recursive Partitioning.” Journal of Computational and Graphical Statistics 17 (2). Taylor & Francis: 492–514.




Speakers


Wednesday July 5, 2017 11:36am - 11:54am CEST
2.02 Wild Gallery

11:36am CEST

EpiModel: An R Package for Mathematical Modeling of Infectious Disease over Networks
Keywords: mathematical model, infectious disease, epidemiology, networks, R
Webpages: https://CRAN.R-project.org/package=EpiModel, http://epimodel.org/
The EpiModel package provides tools for building, simulating, and analyzing mathematical models for epidemics using R. Epidemic models are a formal representation of the complex systems that collectively determine the population dynamics of infectious disease transmission: contact between people, inter-host infection transmission, intra-host disease progression, and the underlying demographic processes. Simulating epidemic models serves as a computational laboratory to gain insight into the dynamics of these disease systems, test empirical hypotheses about the determinants of specific outbreak patterns, and forecast the impact of interventions like vaccines, clinical treatment, or public health education campaigns.
A range of different modeling frameworks has been developed in the field of mathematical epidemiology over the last century. Several of these are included in EpiModel, but the unique contribution of this software package is a general stochastic framework for modeling the spread of epidemics across dynamic contact networks. Network models represent repeated contacts with the same person or persons over time (e.g., sexual partnerships). These repeated contacts give rise to persistent network configurations – pairs, triples, and larger connected components – that in turn may establish the temporally ordered pathways for infectious disease transmission across a population. The timing and sequence of contacts, and the transmission acts within them, is most important when transmission requires intimate contact, that contact is relatively rare, and the probability of infection per contact is relatively low. This is the case for HIV and other sexually transmitted infections.
Both the estimation and simulation of the dynamic networks in EpiModel are implemented using Markov Chain Monte Carlo (MCMC) algorithm functions for exponential-random graph models (ERGMs) from the statnet suite of R packages. These MCMC algorithms exploit a key property of ERGMs: that the maximum likelihood estimates of the model parameters uniquely reproduce the model statistics in expectation. The mathematical simulation of the contact network over time is theoretically guaranteed to vary stochastically around the observed network statistics. Temporal ERGMs provide the only integrated, principled framework for both the estimation of dynamic network models from sampled empirical data and also the simulation of complex dynamic networks with theoretically justified methods for handling changes in population size and composition over time.
In this talk, I will provide an overview of both the modeling tools built into EpiModel, designed to facilitate learning for students new to modeling, and the package’s application programming interface (API) for extending EpiModel, designed to facilitate the exploration of novel research questions for advanced modelers. I will motivate these research-level extensions by discussing our recent applications of these network modeling statistical methods and software tools to investigate the transmission dynamics of HIV and sexually transmitted infections among men who have sex with men in the United States and heterosexual couples in Sub-Saharan Africa.
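To make the workflow concrete, a minimal network-based SI model in EpiModel might look like the sketch below; all parameter values are arbitrary illustrations, not results from the talk.

```r
## Minimal network-based SI model: estimate a dynamic contact network,
## then simulate transmission over it (parameter values are illustrative only).
library(EpiModel)

nw  <- network::network.initialize(n = 500, directed = FALSE)
est <- netest(nw, formation = ~edges, target.stats = 150,
              coef.diss = dissolution_coefs(dissolution = ~offset(edges), duration = 20))

param   <- param.net(inf.prob = 0.3, act.rate = 1)
init    <- init.net(i.num = 10)
control <- control.net(type = "SI", nsims = 5, nsteps = 250)

sim <- netsim(est, param, init, control)
plot(sim)
```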

Speakers

Samuel Jenness

Assistant Professor, Emory University
Epidemic modeling, network science, HIV/STI prevention



Wednesday July 5, 2017 11:36am - 11:54am CEST
3.01 Wild Gallery

11:36am CEST

Interacting with databases from Shiny
Online presentation: https://github.com/bborgesr/useR2017


Keywords: databases, shiny, DBI, dplyr, pool
Webpages: http://shiny.rstudio.com/articles/overview.html, https://github.com/rstudio/pool
Connecting to an external database from R can be challenging. This is made worse when you need to interact with a database from a live Shiny application. To demystify this process, I’ll do two things.
First, I’ll talk about best practices when connecting to a database from Shiny. There are three important packages that help you with this, and I’ll weave them into this part of the talk. The DBI package does a great job of standardizing how to establish a connection, execute safe queries using SQL (goodbye SQL injections!) and close the connection. The dplyr package builds on top of this to make it even easier to connect to databases and extract data, since it allows users to query the database using regular dplyr syntax in R (no SQL knowledge necessary). Yet a third package, pool, exists to help you when using databases in Shiny applications, by taking care of connection management, often resulting in better performance.
Second, I’ll demo these concepts in practice by showing how we can connect to a database from Shiny to create a CRUD application. I will show the application running and point out specific parts of the code (which will be publicly available).
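A condensed sketch of the pattern discussed here: one pooled connection object shared by the whole app, with parameterized queries inside the server (the database, table and column names are made up for illustration).

```r
## One pool for the whole app, parameterized queries per request
## (database, table and column names are placeholders).
library(shiny)
library(DBI)
library(pool)

pool <- dbPool(RSQLite::SQLite(), dbname = "app-data.sqlite")
onStop(function() poolClose(pool))

ui <- fluidPage(textInput("customer", "Customer"), tableOutput("orders"))

server <- function(input, output, session) {
  output$orders <- renderTable({
    dbGetQuery(pool, "SELECT * FROM orders WHERE customer = ?",
               params = list(input$customer))   # safe, injection-free query
  })
}

shinyApp(ui, server)
```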


Wednesday July 5, 2017 11:36am - 11:54am CEST
4.02 Wild Gallery
  Talk, Shiny I

11:36am CEST

Sports Betting and R: How R is changing the sports betting world
Keywords: Sports Betting, Sports Analytics, Vegas, Markets
Webpages: https://cran.r-project.org/web/packages/odds.converter/index.html, https://cran.r-project.org/web/packages/pinnacle.API/index.html, http://pinnacle.com/


Sports Betting markets are one of the purest prediction markets that exist and are yet vastly misunderstood by the public. Many assume that the center of the sports betting world is situated in Las Vegas.  However, in the modern era, sports bookmaking is a task that looks a lot like market making in finance with sophisticated algorithmic trading systems running and constantly adjusting prices in real-time as events occur.  But, unlike financial markets, sports are governed by a set of physical rules and can usually be measured and understood.

Since the late 90s, Pinnacle has been one of the largest sportsbooks in the world and one of the only sportsbooks that will take wagers from professional bettors (who win in the long term). Similar to card counters in blackjack, most other sportsbooks will ban these winners. At Pinnacle the focus is on modelling, automation and data science; R is a central piece of the business and a large number of customers use an API to interact with us.
 
In this talk, we dispel common misconceptions about the sports betting world, show how this is actually a very sexy problem in modelling and data science, and show how we are using R to try to beat Vegas and other sportsbooks every day in a form of data science warfare.
  
Since the rise of in-play betting markets, an operator must make a prediction in real time on the probability of outcomes for the remainder of an event within a very small margin of error. Customers can compete by building their own models or utilising information that might not be accounted for in the market and expressing their belief through wagering. 

Naturally, a customer will generally wager when they believe they have an edge, and then the operator must determine how to change its belief after each piece of new information (wagers, in-game events, etc). This essentially involves predicting how much information is encoded in a wager, which depends partially on the sharpness of each customer, and then determining how to act on that information to maximise profits.   

One way to look at this is that we are aggregating, in a smart way, the world’s models, opinions, and information when we come up with a price. This is a powerful concept and is why, for example, political prediction markets are much more accurate than polls or pundits.   

For this reason, we will very soon be releasing another package to CRAN: it contains all our odds for the entire MLB season and the 2016 US election, which can be combined with the very popular Lahman package to build predictive models and to compare the predictions against real market data to see how your model would have performed in a real market.

We believe this is a very exciting (and difficult) problem to use for educational purposes. This package can be used in conjunction with two of our existing packages that have already been on CRAN for a few years: odds.converter (to convert between betting market odds types and probabilities) and pinnacle.API (used to interact with Pinnacle’s real-time odds API from R).
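As a tiny illustration of the kind of helpers involved, converting bookmaker odds to implied probabilities might look like the sketch below; the function names are assumed from odds.converter’s odds.<from>2<to> naming scheme and should be checked against the package documentation.

```r
## Converting US (moneyline) odds to implied probabilities
## (function names assumed from the package's naming convention).
library(odds.converter)

us  <- c(-150, 130)
dec <- odds.us2dec(us)     # US odds -> decimal odds
odds.dec2prob(dec)         # decimal odds -> implied probabilities
```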

Even if you have no interest in sports or wagering, we believe this is a fascinating problem and our data and tools are perfect for the R community at large to work with, for academic reasons or for hobby.


Speakers

Marco Blume

Trading Director, Pinnacle



Wednesday July 5, 2017 11:36am - 11:54am CEST
PLENARY Wild Gallery

11:54am CEST

ICAOD: An R Package to Find Optimal Designs for Nonlinear Models with Imperialist Competitive Algorithm
Keywords: Optimal design, Nonlinear models, Optimization, Evolutionary algorithm
Webpages: https://cran.r-project.org/web/packages/ICAOD
The ICAOD package applies a novel multi-heuristic algorithm called the imperialist competitive algorithm (ICA) to find different types of optimal designs for nonlinear models (Masoudi et al., in press). The setup assumes a general parametric regression model and a design criterion formulated as a convex function of the Fisher information matrix. The package constructs locally D-optimal, minimax D-optimal, standardized maximin D-optimal and optimum-on-the-average designs for a class of nonlinear models, including multiple-objective optimal designs for the 4-parameter Hill model commonly used in dose-response studies and other applied fields. Several useful functions are also provided in the package, namely a function to check optimality of the generated design using an equivalence theorem, followed by a graphical plot of the sensitivity function for visual appreciation. Another function computes the efficiency lower bound of the generated design if the algorithm is terminated prematurely.
References
Masoudi E., Holling H., Wong W.K. (in press) Application of imperialist competitive algorithm to find minimax and standardized maximin optimal designs, Computational Statistics & Data Analysis.


Speakers


Wednesday July 5, 2017 11:54am - 12:12pm CEST
2.01 Wild Gallery

11:54am CEST

metawRite: Review, write and update meta-analysis results
Keywords: Living systematic review, meta-analysis, shiny, reproducible research
Webpages: https://github.com/natydasilva/metawRite
Systematic reviews are used to understand how effective treatments are and to design disease control policies; this approach is used by public health agencies such as the World Health Organization. Systematic reviews in the literature often include a meta-analysis that summarizes the findings of multiple studies. It is critical that such reviews are updated quickly as new scientific information becomes available, so the best evidence is used for treatment advice. However, the current peer-reviewed, journal-based approach to publishing systematic reviews means that reviews can rapidly become out of date, and updating is often delayed by the publication model. Living systematic reviews have been proposed as a new approach to dealing with this problem. The main concept of a living review is to enable rapid updating of systematic reviews as new research becomes available, while also ensuring a transparent and reproducible review process. Our approach to a living systematic review is implemented in an R package named metawRite. The goal is to combine writing and analysis of the review, allowing versioning and updating within an R package. metawRite allows an easy and effective way to maintain a living systematic review in a web-based display. Three main tasks are needed for an effective living systematic review: the ability to produce dynamic reports, online availability with an interface that enables end users to understand the data, and the ability to efficiently update the review (and any meta-analysis) with new research (Elliott et al. 2014). The metawRite package covers these three tasks, integrated in a friendly web-based environment for the end user. This is not a new meta-analysis package; instead, it is flexible enough to read output from the most used meta-analysis packages in R (metafor (Viechtbauer 2010) and meta (Schwarzer 2007), among others), organize the information and display the results in a user-driven interactive dashboard. The main function of this package displays a modern web-based application to update a living systematic review. The package combines the power of R, shiny (Chang et al. 2017) and knitr (Xie 2015) to produce dynamic reports and up-to-date meta-analysis results while remaining user friendly. The package has the potential to be used by a large number of groups that conduct and update systematic reviews, such as the What Works Clearinghouse (https://ies.ed.gov/ncee/WWC/), which reviews education interventions; the Campbell Collaboration (https://www.campbellcollaboration.org), which includes reviews on topics such as social and criminal justice issues and many other social science topics; the Collaboration for Environmental Evidence (http://www.environmentalevidence.org); and food production and security (http://www.syreaf.org), among others.
References
Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Elliott, Julian H, Tari Turner, Ornella Clavisi, James Thomas, Julian PT Higgins, Chris Mavergames, and Russell L Gruen. 2014. “Living Systematic Reviews: An Emerging Opportunity to Narrow the Evidence-Practice Gap.” PLoS Med 11 (2). Public Library of Science: e1001603.

Schwarzer, Guido. 2007. “Meta: An R Package for Meta-Analysis.” R News 7 (3): 40–45.

Viechtbauer, Wolfgang. 2010. “Conducting Meta-Analyses in R with the metafor Package.” Journal of Statistical Software 36 (3): 1–48. http://www.jstatsoft.org/v36/i03/.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.




Speakers

Natalia da silva

Just got my PhD in Statistics at ISU, Iowa State University
My interest are: supervised learning methods, prediction, exploratory data analysis, statistical graphics, reproducible research and meta-analysis. Co-founder of R-Ladies Ames and now Co-founder of R-LadiesMVD (Montevideo, UY). I'm a conference buddie.



Wednesday July 5, 2017 11:54am - 12:12pm CEST
3.01 Wild Gallery

11:54am CEST

mlrHyperopt: Effortless and collaborative hyperparameter optimization experiments
Keywords: machine learning, hyperparameter optimization, tuning, classification, networked science
Webpages: https://jakob-r.github.io/mlrHyperopt/
Most machine learning tasks demand hyperparameter tuning to achieve good performance. For example, Support Vector Machines with radial basis functions are very sensitive to the choice of both the kernel width and the soft margin penalty C. However, for a wide range of machine learning algorithms these “search spaces” are less well known, and even experts for the particular methods might have conflicting views. The popular package caret (Jed Wing et al. 2016) approaches this problem by providing two simple optimizers, grid search and random search, and individual search spaces for all implemented methods. To prevent training on misconfigured methods, a grid search is performed by default. Unfortunately, only which parameters will be tuned is documented; the exact bounds have to be obtained from the source code. As a counterpart, mlr (Bischl et al. 2016) offers more flexible parameter tuning methods, such as an interface to mlrMBO (Bischl et al. 2017) for conducting Bayesian optimization. Unfortunately, mlr lacks default search spaces and thus parameter tuning becomes difficult. Here mlrHyperopt steps in to make hyperparameter optimization as easy as in caret. As a matter of fact, it is impossible for the developer of a machine learning package to be an expert on all implemented methods and provide perfect search spaces. Hence mlrHyperopt aims at:
  • improving the search spaces of caret with simple tricks.
  • letting the users submit and download improved search spaces to a database.
  • providing advanced tuning methods interfacing mlr and mlrMBO.
A study on selected data sets and numerous popular machine learning methods compares the performance of the grid and random search implemented in caret to the performance of mlrHyperopt for different budgets.
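For orientation, the intended one-call usage might look roughly like the sketch below (the learner is just an example; check the project page for the exact signature).

```r
## Sketch of the one-call interface: tune a learner on an mlr task using the
## package's default search space for that learner.
library(mlr)
library(mlrHyperopt)

res <- hyperopt(iris.task, learner = "classif.svm")   # tunes e.g. kernel width and C
res
```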
References
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “Mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. https://CRAN.R-project.org/package=mlr.

Bischl, Bernd, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. 2017. “mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions.” arXiv:1703.03373 [Stat], March. http://arxiv.org/abs/1703.03373.

Jed Wing, Max Kuhn. Contributions from, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2016. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.




Speakers


Wednesday July 5, 2017 11:54am - 12:12pm CEST
2.02 Wild Gallery

11:54am CEST

RL10N: Translating Error Messages & Warnings
Keywords: package development, localization, translation, errors and warnings, R Consortium
Webpages: https://CRAN.R-project.org/package=msgtools, https://github.com/RL10N
R is becoming the global standard language for data analysis, but it requires its users to speak English. RL10N is an R Consortium funded project to make it easier to translate error messages and warnings into different languages. The talk covers how to automatically translate messages using Google Translate and Microsoft Translator, and how to easily integrate these message translations into your R packages using msgtools. Make your code more accessible to users around the world!

Speakers

Richard Cotton

Curriculum Lead, DataCamp


Wednesday July 5, 2017 11:54am - 12:12pm CEST
3.02 Wild Gallery
  Talk, Packages

11:54am CEST

ShinyProxy
Keywords: Shiny, enterprise computing, open source
Webpages: https://shinyproxy.io
Shiny is a nice technology for writing interactive R-based applications. It has been rapidly adopted and the R community has collaborated on many interesting extensions. Until recently, though, deployments in larger organizations and companies required proprietary solutions. ShinyProxy fills this gap and offers a fully open source alternative to run and manage Shiny applications at scale.
In this talk we detail the ShinyProxy architecture and demonstrate how it meets the needs of organizations. First of all, by design ShinyProxy scales to thousands of concurrent users. Secondly, it offers authentication and authorization functionality using standard technologies like LDAP, ActiveDirectory and OpenID Connect, as well as social authentication (Facebook, Twitter, Google, LinkedIn or GitHub). Thirdly, the management interface allows monitoring application usage in real time and provides infrastructure to collect usage statistics in event-logging databases (e.g. InfluxDB) or databases for scientific computing (e.g. MonetDB). Finally, the ShinyProxy developers took special care to develop a solution that can be easily embedded in broader applications and (responsive) web sites.
Besides these basic features, the use of Docker technology opens a new world of possibilities that go beyond the current proprietary platforms and in the final section of the talk we will show how academic institutions, governmental organizations and industry roll out Shiny apps with ShinyProxy and, last but not least, how you can do this too.

Speakers

Wednesday July 5, 2017 11:54am - 12:12pm CEST
4.02 Wild Gallery
  Talk, Shiny I

11:54am CEST

Urban green spaces and their biophonic soundscape component
Keywords: soundscape ecology, urbanization, green space, indicators, soundscape
Abstract
Sustainable urban environments with urban green spaces like city parks and urban gardens provide enduring benefits for individuals and society. By providing recreational spaces, they encourage physical activity, resulting in improved physical and mental health of citizens. As such, the density and the quality of these areas are of high importance in urban area planning.
In order to study urban green spaces as a landscape, their soundscape, the holistic experience of their sounds, has recently gained attention in soundscape ecological studies. Using R, the soundecology and seewave packages provide accessible processing tools appropriate to automate the calculation of soundecology indicators from long-run sound recordings made by permanent outdoor recorders. These indicators give information about the biophonic component of the soundscape and, as such, give a clear indication of the quality of the green space. Since bird vocalizations contribute strongly to the biophonic component, their spring singing activity is clearly reflected in the yearly pattern of these indicators.
A pilot study focussing on the annual variations of the soundscape of a typical urban green space has been conducted.
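As an indication of how such indicators are computed in practice, a batch run over a folder of recordings with soundecology might look like the sketch below; the directory and file names are placeholders, not the study’s data.

```r
## Batch-compute the normalized difference soundscape index (NDSI) for all
## recordings in a folder; paths are placeholders.
library(soundecology)

multiple_sounds(directory  = "recordings/park01",
                resultfile = "ndsi_park01.csv",
                soundindex = "ndsi",
                no_cores   = 2)
```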

Speakers


Wednesday July 5, 2017 11:54am - 12:12pm CEST
PLENARY Wild Gallery

12:12pm CEST

addhaz: Contribution of chronic diseases to the disability burden in R
The increase in life expectancy followed by the growing proportion of old individuals living with chronic diseases contributes to the burden of disability worldwide. The estimation of how much each chronic condition contributes to the disability prevalence can be useful to develop public health strategies to reduce the burden. In this presentation, we will introduce the R package addhaz, which is based on the attribution method (Nusselder and Looman 2004) to partition the total disability prevalence into the additive contributions of chronic diseases using cross-sectional data. The R package includes tools to fit the binomial and multinomial additive hazard models, the core of the attribution method. The models are fitted by maximizing the binomial and multinomial log-likelihood functions using constrained optimization (constrOptim). The 95% Wald and bootstrap percentile confidence intervals can be obtained for the parameter estimates. Also, the absolute and relative contribution of each chronic condition to the disability prevalence and their bootstrap confidence intervals can be estimated. An additional feature of addhaz is the possibility to use parallel computing to obtain the bootstrap confidence intervals, reducing computation time. In this presentation, we will illustrate the use of addhaz with examples for the binomial and multinomial models, using the data from the Brazilian National Health Survey, 2013.
Keywords: Disability, Binomial outcome, Multinomial outcome, Additive hazard model, Cross-sectional data
Webpage: https://cran.r-project.org/web/packages/addhaz/index.html
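A hypothetical call for the binomial model might look as follows; the function name BinAddHaz and the synthetic data are illustrative assumptions and should be checked against the package manual.

```r
## Hypothetical binomial additive hazard fit: partition disability prevalence
## into disease contributions (synthetic data, assumed function name).
library(addhaz)

set.seed(1)
d <- data.frame(disability = rbinom(500, 1, 0.3),
                diabetes   = rbinom(500, 1, 0.2),
                arthritis  = rbinom(500, 1, 0.4))

fit <- BinAddHaz(disability ~ diabetes + arthritis, data = d)
summary(fit)
```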
References
Nusselder, Wilma J, and Caspar WN Looman. 2004. “Decomposition of Differences in Health Expectancy by Cause.” Demography 41 (2). Springer: 315–34.




Speakers


Wednesday July 5, 2017 12:12pm - 12:30pm CEST
3.01 Wild Gallery

12:12pm CEST

Developing and deploying large scale Shiny applications for non-life insurance
Keywords: Shiny modules, HTMLWidgets, HTMLTemplates, openCPU, NoSQL, Docker
Webpages: https://www.friss.eu/en/
FRISS is a Dutch, fast-growing company with a 100% focus on fraud, risk and compliance for non-life insurance companies, and is the European market leader with more than 100 implementations in over 15 countries worldwide. The FRISS platform offers insurers fully automated access to a vast set of external data sources, which together facilitate many different types of screenings, based on knowledge rules, statistical models, clustering, text mining, image recognition and other machine learning techniques. The information produced by the FRISS platform is bundled into a risk score that provides a quantified risk assessment of a person or case, enabling insurers to make better and faster decisions.
At FRISS, all analytical applications and services are based on R. Interactive applications are based on Shiny, a popular web application platform for R designed by RSTUDIO, while openCPU, an interoperable HTTP API for R, is used to deploy advanced scoring engines at scale, that can be deeply integrated into other services.
In this talk, we show various architectures for creating high-performance, large-scale Shiny apps and scoring engines with a clean code base. Shiny apps are built around the module pattern, HTMLWidgets and HTMLTemplates. Shiny modules allow a developer to compose a complex app from a set of easy-to-understand modules, each with separate UI and server logic. In these architectures, each module has a set of reactive inputs and outputs and focuses on a single, dedicated task. Subsequently, the modules are combined in a main app that can perform a multitude of complex tasks, yet is still easy to understand and to reason about. In addition, we show how HTMLWidgets allow you to bring the best of JavaScript, the language of the web, into R, and how HTMLTemplates can be used to create R-based web applications with a fresh, modern and distinct look.
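To make the module pattern concrete, here is a minimal, generic Shiny module (an illustration only, not FRISS code): the module namespaces its own UI and server logic and is composed twice into a main app.

```r
# Generic illustration of the Shiny module pattern (not FRISS code).
library(shiny)

# Module UI: all input/output IDs are namespaced with NS(id)
histModuleUI <- function(id) {
  ns <- NS(id)
  tagList(
    sliderInput(ns("n"), "Number of observations", 10, 1000, 100),
    plotOutput(ns("hist"))
  )
}

# Module server: one dedicated task with its own reactive inputs/outputs
histModuleServer <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

# Main app composes the module; each instance is isolated via its id
ui <- fluidPage(
  histModuleUI("left"),
  histModuleUI("right")
)
server <- function(input, output, session) {
  callModule(histModuleServer, "left")
  callModule(histModuleServer, "right")
}

shinyApp(ui, server)
```

Because each instance is isolated by its id, the same module can be reused many times across an app without reactive collisions.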
Finally, we show various real-life examples of complex, large scale Shiny applications developed at FRISS. These applications are actively used by insurers worldwide for reporting, dashboarding, anomaly detection, interactive network exploration and fraud detection, and help insurers combat fraud and manage risk and compliance. In addition, we show how the aforementioned techniques can be combined with modern NoSQL databases like ElasticSearch, MongoDB and Neo4j to create high performance apps, and how Docker can be used for a smooth deployment process in on-premises scenarios that is both fast and secure.

Speakers

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
4.02 Wild Gallery
  Talk, Shiny I

12:12pm CEST

Maps are data, so why plot data on a map?
Keywords: data maps, OpenStreetMap, spatial, visualization
Webpages: https://CRAN.R-project.org/package=osmplotr, https://github.com/ropensci/osmplotr, https://github.com/osmdatar/osmdata
R, like any and every other system for analysing and visualising spatial data, has a host of ways to overlay data on maps (or the other way around). Maps nevertheless contain data—nay, maps are data—making this act tantamount to overlaying data upon data. That’s likely not going to end well, and so this talk will present two new packages that enable you to visualise your own data with actual map data such as building polygons or street lines, rather than merely overlaying (or underlaying) them. The osmdata package enables publicly accessible data from OpenStreetMap to be read into R, and osmplotr can then use these data as a visual basis for your own data. Both categorical and continuous data can be visualised through colours or through structural properties such as line thicknesses or types. We think this results in more visually striking and beautiful data maps than any alternative approach that necessitates separating your data from map data.
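As a brief, hedged sketch of the intended workflow (the bounding box and feature keys below are illustrative, not the talk's example):

```r
# Sketch of the osmdata/osmplotr workflow: OSM data as the visual basis.
library(osmdata)
library(osmplotr)

bbox <- get_bbox(c(-0.13, 51.51, -0.11, 51.52))   # small illustrative extent

dat_B <- extract_osm_objects(key = "building", bbox = bbox)  # via Overpass/osmdata
dat_H <- extract_osm_objects(key = "highway",  bbox = bbox)

map <- osm_basemap(bbox = bbox, bg = "gray20")
map <- add_osm_objects(map, dat_B, col = "gray40")   # building polygons
map <- add_osm_objects(map, dat_H, col = "gray70")   # street lines
print_osm_map(map)
```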

Speakers


Wednesday July 5, 2017 12:12pm - 12:30pm CEST
PLENARY Wild Gallery

12:12pm CEST

Maximum growth rate estimation with **growthrates**
Keywords: population growth, nonlinear models, differential equation
Webpages: https://CRAN.R-project.org/package=growthrates, https://github.com/tpetzoldt/growthrates
The population growth rate is a direct measure of fitness, common in many disciplines of theoretical and applied biology, e.g. physiology, ecology, eco-toxicology or pharmacology. The R package growthrates aims to streamline growth rate estimation from direct or indirect measures of population density (e.g. cell counts, optical density or fluorescence) of batch experiments or field observations. It is applicable to different species of bacteria, protists, and metazoa, e.g. E. coli, Cyanobacteria, Paramecium, green algae or Daphnia.
The package includes three types of methods:
  1. Fitting of linear models to the period of exponential growth using the “growth rates made easy”-method of Hall and Barlow (2013),
  2. Nonparametric growth rate estimation using smoothers. The current implementation uses the function smooth.spline, similar to the method of the grofit package (Kahm et al. 2010),
  3. Nonlinear fitting of parametric models like logistic, Gompertz, Baranyi or Huang (Huang 2011) is done with the package FME (Flexible Modelling Environment) of Soetaert and Petzoldt (2010). Growth models can be given either in closed form or as systems of differential equations, which are solved numerically with the packages deSolve (Soetaert, Petzoldt, and Setzer 2010) and cOde (Kaschek 2016).
The package contains methods to fit single data sets or complete experimental series. It uses S4 classes and contains functions for extracting results (e.g. coef, summary, residuals, …), and methods for convenient plotting. The fits and the growth models can be visualized with shiny apps.
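A minimal sketch of the interface for the three method types listed above, on simulated data (see the package vignettes for authoritative examples):

```r
library(growthrates)

set.seed(1)
time <- seq(0, 30, 2)
y    <- 0.5 + 10 / (1 + exp(-0.35 * (time - 12))) + rnorm(length(time), sd = 0.05)

## 1. "growth rates made easy": linear fit to the exponential phase
fit1 <- fit_easylinear(time, y)
coef(fit1)                      # includes mumax, the maximum growth rate

## 2. Nonparametric estimate with a smoothing spline
fit2 <- fit_spline(time, y)

## 3. Parametric fit of a logistic growth model
fit3 <- fit_growthmodel(FUN = grow_logistic,
                        p = c(y0 = 0.5, mumax = 0.3, K = 10),
                        time = time, y = y)
plot(fit3)
```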
References Hall, Acar, B. G., and M. Barlow. 2013. “Growth Rates Made Easy.” Mol. Biol. Evol. 31: 232–38. doi:10.1093/molbev/mst197.

Huang, Lihan. 2011. “A New Mechanistic Growth Model for Simultaneous Determination of Lag Phase Duration and Exponential Growth Rate and a New Belehdredek-Type Model for Evaluating the Effect of Temperature on Growth Rate.” Food Microbiology 28 (4): 770–76. doi:10.1016/j.fm.2010.05.019.

Kahm, Matthias, Guido Hasenbrink, Hella Lichtenberg-Frate, Jost Ludwig, and Maik Kschischo. 2010. “Grofit: Fitting Biological Growth Curves with R.” Journal of Statistical Software 33 (7): 1–21. doi:10.18637/jss.v033.i07.

Kaschek, Daniel. 2016. cOde: Automated C Code Generation for Use with the deSolve and bvpSolve Packages. https://CRAN.R-project.org/package=cOde.

Soetaert, Karline, and Thomas Petzoldt. 2010. “Inverse Modelling, Sensitivity and Monte Carlo Analysis in R Using Package FME.” Journal of Statistical Software 33 (3): 1–28. doi:10.18637/jss.v033.i03.

Soetaert, Karline, Thomas Petzoldt, and R. Woodrow Setzer. 2010. “Solving Differential Equations in R: Package deSolve.” Journal of Statistical Software 33 (9): 1–25. doi:10.18637/jss.v033.i09.




Speakers
avatar for Thomas Petzoldt

Thomas Petzoldt

Senior Scientist, TU Dresden (Dresden University of Technology)
dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation



Wednesday July 5, 2017 12:12pm - 12:30pm CEST
2.01 Wild Gallery

12:12pm CEST

The R6 Class System
Keywords: Classes, Object-oriented programming, R6, Reference classes
Webpages: https://CRAN.R-project.org/package=R6, https://github.com/wch/R6
R6 is an implementation of a classical object-oriented programming system for R. In classical OOP, objects have mutable state and they contain methods to modify and access internal state. This stands in contrast with the functional style of object-oriented programming provided by the S3 and S4 class systems, where the objects are (typically) not mutable, and the methods to modify and access their contents are external to the objects themselves.
R6 has some similarities with R’s built-in Reference Class system. Although the implementation of R6 is simpler and lighter weight than that of Reference Classes, it offers some additional features such as private members and robust cross-package inheritance.
In this talk I will discuss when it makes sense to use R6 as opposed to functional OOP, demonstrate how to use the package, and explore some of the internal design of R6.
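A minimal example of the classical OOP style that R6 provides (illustrative only):

```r
# Mutable state and methods live inside the object; a private field is
# hidden from users, and returning self invisibly enables method chaining.
library(R6)

Counter <- R6Class("Counter",
  public = list(
    add = function(x = 1) {
      private$count <- private$count + x
      invisible(self)            # return self to allow chaining
    },
    get = function() private$count
  ),
  private = list(count = 0)
)

cnt <- Counter$new()
cnt$add()$add(5)   # methods modify internal state in place
cnt$get()          # 6
```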

Speakers

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
3.02 Wild Gallery
  Talk, Packages

12:12pm CEST

The Revised Sequential Parameter Optimization Toolbox
Keywords: optimization, tuning, surrogate model, computer experiments
Webpages: https://CRAN.R-project.org/package=SPOT
Real-world optimization problems often have very high complexity, due to multi-modality, constraints, noise or other crucial problem features. For solving these optimization problems, a large collection of methods is available. Most of these methods require setting a number of parameters, which have a significant impact on the optimization performance. Hence, a lot of experience and knowledge about the problem is necessary to obtain the best possible results. This situation grows worse if the optimization algorithm faces the additional difficulty of strong restrictions on resources, especially time, money or the number of experiments.
Sequential parameter optimization (Bartz-Beielstein, Lasarczyk, and Preuss 2005) is a heuristic combining classical and modern statistical techniques for the purpose of efficient optimization. It can be applied in two manners:
  • to efficiently tune and select the parameters of other search algorithms, or
  • to optimize expensive-to-evaluate problems directly, via shifting the load of evaluations to a surrogate model.
SPO is especially useful in scenarios where
  1. no experience of how to choose the parameter setting of an algorithm is available,
  2. a comparison with other algorithms is needed,
  3. an optimization algorithm has to be applied effectively and efficiently to a complex real-world optimization problem, and
  4. the objective function is a black-box and expensive to evaluate.
The Sequential Parameter Optimization Toolbox SPOT provides enhanced statistical techniques such as design and analysis of computer experiments, and different methods for surrogate modeling and optimization, to effectively use sequential parameter optimization in the above-mentioned scenarios.
Version 2 of the SPOT package is a complete redesign and rewrite of the original R package. Most function interfaces were redesigned to give a more streamlined usage experience. At the same time, modular and transparent code structures allow for increased extensibility. In addition, some new developments were added to the SPOT package. A Kriging model implementation, based on earlier Matlab code by Forrester et al. (Forrester, Sobester, and Keane 2008), has been extended to allow for the usage of categorical inputs. Additionally, it is now possible to use stacking for the construction of ensemble learners (Bartz-Beielstein and Zaefferer 2017). This allows for the creation of models with a far higher predictive performance, by combining the strengths of different modeling approaches.
In this presentation we show how the new interface of SPOT can be used to efficiently optimize the geometry of an industrial dust filter (cyclone). Based on a simplified simulation of this real world industry problem, some of the core features of SPOT are demonstrated.
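A toy sketch of the redesigned interface, assuming the spot() entry point described for version 2; the cyclone simulation from the talk is replaced here by a simple test function evaluated row-wise on a matrix of candidates:

```r
# Toy illustration of the streamlined SPOT 2.x interface.
library(SPOT)

# Objective: each row of x is one candidate solution; return a column vector
sphere <- function(x) matrix(rowSums(x^2), ncol = 1)

res <- spot(fun = sphere,
            lower = c(-5, -5), upper = c(5, 5),
            control = list(funEvals = 30))   # tight budget of evaluations

res$xbest   # best parameter vector found
res$ybest   # corresponding objective value
```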
References Bartz-Beielstein, Thomas, and Martin Zaefferer. 2017. “Model-Based Methods for Continuous and Discrete Global Optimization.” Applied Soft Computing 55: 154–67. doi:10.1016/j.asoc.2017.01.039.

Bartz-Beielstein, Thomas, Christian Lasarczyk, and Mike Preuss. 2005. “Sequential Parameter Optimization.” In Proceedings Congress on Evolutionary Computation 2005 (Cec’05), 1553. Edinburgh, Scotland. http://www.spotseven.de/wp-content/papercite-data/pdf/blp05.pdf.

Forrester, Alexander, Andras Sobester, and Andy Keane. 2008. Engineering Design via Surrogate Modelling. Wiley.




Speakers


Wednesday July 5, 2017 12:12pm - 12:30pm CEST
2.02 Wild Gallery

1:30pm CEST

**shadow**: R Package for Geometric Shade Calculations in an Urban Environment
Keywords: shadow, sun position, geometry, solar radiation, building facades
Webpage: https://CRAN.R-project.org/package=shadow
Spatial analysis of the urban environment frequently requires estimating whether a given point is shaded or not, given a representation of spatial obstacles (e.g. buildings) and a time-stamp with its associated solar position. For example, we may be interested in -
  • Calculating the amount of time a given roof or facade is shaded, to determine the utility of installing Photo-Voltaic cells for electricity production.
  • Calculating shade footprint on vegetated areas, to determine the expected microclimatic influence of a new tall building.
These types of calculations are usually applied in either vector-based 3D (e.g. ESRI’s ArcScene) or raster-based 2.5D (i.e. Digital Elevation Model, DEM) settings. However, the former solutions are mostly restricted to proprietary software associated with specific 3D geometric model formats. The latter DEM-based solutions are more common, in both open-source (e.g. GRASS GIS) as well as proprietary (e.g. ArcGIS) software. The insol R package provides such capabilities in R. Though conceptually and technically simpler to work with, DEM-based approaches are less suitable for an urban environment, as opposed to natural terrain, for two reasons -
  • A continuous elevation surface at sufficiently high resolution for the urban context (e.g. LIDAR) may not be available and is expensive to produce.
  • DEMs cannot adequately represent individual vertical urban elements (e.g. building facades), thus limiting the interpretability of results.
The shadow package aims at addressing these limitations. Functions in this package operate on a vector layer of building outlines along with their heights (class SpatialPolygonsDataFrame from package sp), rather than a DEM. Such data are widely available, either from local municipalities or from global datasets such as OpenStreetMap. Currently functions to calculate shadow height, Sky View Factor (SVF) and shade footprint on ground are implemented. Since the inputs are vector-based, the resulting shadow estimates are easily associated with specific urban elements such as buildings, roofs or facades.
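A hedged sketch of such a calculation; the data set, function, argument and column names below follow my reading of the package documentation and should be treated as assumptions rather than as the talk's code:

```r
# Hedged sketch (data set 'build' and column 'BLDG_HT' are assumed from
# the package examples; the solar position values are purely illustrative).
library(shadow)
library(sp)

data(build)                              # example building outline layer

# Sun position as a matrix with azimuth and elevation in degrees
solar_pos <- matrix(c(220, 35), ncol = 2)

# Ground shadow footprint of the buildings for that sun position
fp <- shadowFootprint(obstacles = build,
                      obstacles_height_field = "BLDG_HT",
                      solar_pos = solar_pos)

plot(build)
plot(fp, add = TRUE, col = adjustcolor("grey", alpha.f = 0.5))
```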
We present a case study where package shadow was used to calculate shading on roofs and facades in a large neighborhood (Rishon-Le-Zion city, Israel), on an hourly temporal resolution and a 1-m spatial resolution. The results were combined with Typical Meteorological Year (TMY) direct solar radiation data to derive total annual insolation for each 1-m grid cell. Subsequently the locations where installation of photovoltaic (PV) cells is worthwhile, given a predefined threshold production, were mapped.
The approach is currently applicable to a flat terrain and does not treat obstacles (e.g. trees) other than the buildings. Our case study demonstrates that subject to these limitations package shadow can be used to calculate shade and insolation estimates in an urban environment using widely available polygonal building data. Future development of the package will be aimed at combining vector-associated shadow calculations with raster data representing non-flat terrain.

Speakers


Wednesday July 5, 2017 1:30pm - 1:48pm CEST
3.01 Wild Gallery
  Talk, GIS

1:30pm CEST

Integrated Cluster Analysis with R in Drug Discovery Experiments using Multi-Source Data
  1. Institute for Biostatistics and Statistical Bioinformatics, Hasselt University, Belgium
  2. Independent consultant
Keywords: High dimensional data, Clustering
Webpages: https://cran.r-project.org/web/packages/IntClust/index.html
Discovering the exact activities of a compound is of primary interest in drug development. A single drug can interact with multiple targets and unintended drug-target interactions could lead to severe side effects. Therefore, it is valuable in the early phases of drug discovery to not only demonstrate the desired on-target efficacy of compounds but also to outline its unwanted off-target effects. Further, the earlier unwanted behaviour is documented, the better. Otherwise, the drug could fail in a later stage which means that the invested time, effort and money are lost.
In the early stages of drug development, different types of information on the compounds are collected: the chemical structures of the molecules (fingerprints), the predicted targets (target predictions), activity on various bioassays, the toxicity, and more. An analysis of each data source separately could reveal interesting yet disjoint information. It only provides a limited point of view and does not give information on how everything is interconnected in the global picture (Shi, De Moor, and Moreau 2009). Therefore, a simultaneous analysis of multiple data sources can provide a more complete insight into the compounds' activity.
An analysis based on multiple data sources is a relatively new and growing area in drug discovery and drug development. Multi-source clustering procedures provide us with the opportunity to relate several data sources to each other to gain a better understanding of the mechanism of action of compounds. The use of multiple data sources was investigated in the QSTAR (quantitative structure transcriptional activity relationship) consortium (Ravindranath et al. 2015). The goal was to find associations between chemical, bioassay and transcriptomic data in the analysis of a set of compounds under development.
In the current study, we extend the clustering method presented in (Perualila-Tan et al. 2016) and review the performance of several clustering methods on a real drug discovery project in R. We illustrate how the new clustering approaches provide valuable insight for the integration of chemical, bioassay and transcriptomic data in the analysis of a specific set of compounds. The proposed methods are implemented and publicly available in the R package IntClust, which is a wrapper package for a multitude of ensemble clustering methods.
References Perualila-Tan, N., Z. Shkedy, W. Talloen, H. W. H. Goehlmann, QSTAR Consortium, M. Van Moerbeke, and A. Kasim. 2016. “Weighted-Similarity Based Clustering of Chemical Structure and Bioactivity Data in Early Drug Discovery.” Journal of Bioinformatics and Computational Biology.

Ravindranath, A. C., N. Perualila-Tan, A. Kasim, G. Drakakis, S. Liggi, S. C. Brewerton, D. Mason, et al. 2015. “Connecting Gene Expression Data from Connectivity Map and in Silico Target Predictions for Small Molecule Mechanism-of-Action Analysis.” Mol. BioSyst. 11 (1). The Royal Society of Chemistry: 86–96. doi:10.1039/C4MB00328D.

Shi, Y., B. De Moor, and Y. Moreau. 2009. “Clustering by Heterogeneous Data Fusion: Framework and Applications.” NIPS Workshop.





UseR pdf

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
2.01 Wild Gallery

1:30pm CEST

R-Ladies Global Community
**Keywords**: R-Ladies, Diversity, R community

**Webpages**: https://rladies.org/

The _R_ community suffers from an underrepresentation of women* in every role and area of participation: whether as leaders (no women on the _R_ core team, 5 of 37 female ordinary members of the _R_-Foundation), package developers [around 10% women amongst CRAN maintainers, @forwards; @CRANsurvey], conference speakers and participants [around 28% at useR! 2016, @forwards], educators, or users.

As a diversity initiative alongside the Forwards Task Force, _R_-Ladies’ mission is to achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the _R_ community.
_R_-Ladies’ primary focus is on supporting the _R_ enthusiasts who identify as an underrepresented gender minority to achieve their programming potential, by building a collaborative global network of _R_ leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.

Since _R_-Ladies Global was created a year ago, we have grown exponentially to more than 4000 _R_-Ladies in 15 countries and have established a great brand. We want to share the amazing work _R_-Ladies has achieved, future plans and how the _R_ community can support and champion _R_-Ladies around the world.


Wednesday July 5, 2017 1:30pm - 1:48pm CEST
PLENARY Wild Gallery
  Talk, Community

1:30pm CEST

Stream processing with R in AWS
Keywords: stream processing, big data, ETL, scale
Webpages: https://CRAN.R-project.org/package=AWR, https://CRAN.R-project.org/package=AWR.KMS, https://CRAN.R-project.org/package=AWR.Kinesis
R is rarely mentioned among the big data tools, although it scales fairly well for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL, starting multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards.
Besides the technical background and a quick introduction on how Kinesis works, this talk will feature some stream processing use-cases at CARD.com, and will also provide an overview and hands-on demos of the related data infrastructure built on top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs – besides the related open-source R packages, e.g. AWR, AWR.KMS and AWR.Kinesis, developed at CARD.
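A schematic consumer sketch; the callback-style kinesis_consumer() interface shown here is an assumption based on the AWR.Kinesis documentation, and the processing logic is purely illustrative:

```r
# Schematic sketch only: a Kinesis consumer defined via callbacks and run
# by the MultiLangDaemon (exact signature assumed, not verified).
library(AWR.Kinesis)

kinesis_consumer(
  initialize     = function() message("consumer starting"),
  processRecords = function(records) {
    # 'records' holds the batch of messages pulled from the shard
    message("received a batch of ", nrow(records), " records")
  },
  shutdown       = function() message("consumer stopping")
)
```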

Speakers


Wednesday July 5, 2017 1:30pm - 1:48pm CEST
3.02 Wild Gallery
  Talk, HPC

1:30pm CEST

Text mining, the tidy way
Keywords: text mining, natural language processing, tidy data, sentiment analysis
Webpages: https://CRAN.R-project.org/package=tidytext, http://tidytextmining.com/
Unstructured, text-heavy data sets are increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. We introduce the tidytext package for approaching text analysis from a tidy data perspective. We can manipulate, summarize, and visualize the characteristics of text using the R tidy tool ecosystem; these tools extend naturally to many text analyses and allow analysts to integrate natural language processing into effective workflows already in wide use. We explore how to implement approaches such as sentiment analysis of texts and measuring tf-idf to quantify what a document is about.
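A short example of the workflow described above, using the janeaustenr novels simply as readily available text (not part of the talk):

```r
# Tidy text workflow: tokenize, remove stop words, join a sentiment
# lexicon, and compute tf-idf.
library(dplyr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() |>
  unnest_tokens(word, text) |>           # one token per row
  anti_join(stop_words, by = "word")

# Sentiment analysis via an inner join with a lexicon
tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, sentiment)

# tf-idf to see which words characterise each novel
tidy_books |>
  count(book, word) |>
  bind_tf_idf(word, book, n) |>
  arrange(desc(tf_idf))
```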

Speakers
avatar for Julia Silge

Julia Silge

data scientist, Stack Overflow



Wednesday July 5, 2017 1:30pm - 1:48pm CEST
4.02 Wild Gallery

1:30pm CEST

The use of R in predictive maintenance: A use case with TRUMPF Laser GmbH
Keywords: data science, predictive maintenance, industry 4.0, business, industry, use case,
The buzz for industry 4.0 continues – digitalizing business processes is one of the main aims of companies in the 21st century. One topic gains particular importance: predictive maintenance. Enterprises use this method in order to cut production and maintenance costs and to increase reliability.
Being able to predict machine failures, performance drops or quality deterioration is a huge benefit for companies. With this knowledge, maintenance and failure costs can be reduced and optimized.
With the help of R and its massive community, analysts can apply the best algorithms and methods for predictive maintenance. Once a good analytic model for predictive maintenance has been found, companies are challenged to implement it in their own environments and workflows. Especially regarding the workflow across different departments, it is necessary to find a solution that supports interdisciplinary work as well.
My talk will show how this challenge was solved for TRUMPF Laser GmbH, a subsidiary of TRUMPF, a world-leading high-technology company which offers production solutions in the machine tool, laser and electronic sectors. I would like to share my experience with R and predictive maintenance in a real-world industry scenario and show the audience how to automate R code and visualize it in a front-end solution for all departments involved.

Speakers
avatar for Andreas Prawitt

Andreas Prawitt

Data Scientist, eoda GmbH
At the useR Conference I am interested to see how data scientists use R in larger companies. I am looking forward to showing you how TRUMPF Lasertechnik integrated R in their analytical workflows.


Wednesday July 5, 2017 1:30pm - 1:48pm CEST
2.02 Wild Gallery

1:48pm CEST

**RQGIS** - integrating *R* with QGIS for innovative geocomputing
Keywords: GIS interface, QGIS, Python interface
Webpages: https://cran.r-project.org/web/packages/RQGIS/index.html, https://github.com/jannes-m/RQGIS
RQGIS establishes an interface to QGIS - the most widely used open-source desktop geographical information system (GIS). Since QGIS itself provides access to other GIS (SAGA, GRASS, GDAL, etc.), RQGIS brings more than 1000 geoalgorithms to the R console. Furthermore, R users do not have to touch Python, even though RQGIS makes use of the QGIS Python API in the background. Several convenience functions also facilitate the usage of RQGIS. For instance, open_help provides instant access to the online help and get_args_man automatically collects all function arguments and respective default values of a specified geoalgorithm. The workhorse function run_qgis accepts spatial objects residing in R’s global environment as input, and loads QGIS output (such as shapefiles and rasters) directly back into R, if desired. Here, we will demonstrate the fruitful combination of R and QGIS by spatially predicting plant species richness of a Peruvian fog-oasis. For this, we will use RQGIS to extract terrain attributes (from a digital elevation model) which subsequently will serve as predictors in a non-linear Poisson regression. Apart from this, there are many more useful applications that combine R with GIS. For instance, GIS technologies include, among others, algorithms for the computation of stream networks, surface roughness, terrain classification, landform identification as well as routing and spatial neighbor operations. On the other hand, R provides access to advanced modeling techniques, kriging interpolation, and spatial autocorrelation and spatial cross-validation algorithms, to name but a few. Naturally, this paves the way for innovative and advanced statistical geocomputing. Compared to other R packages integrating GIS functionalities (rgrass7, RSAGA, RPyGeo), RQGIS accesses a wider range of GIS functions and is often easier to use. To conclude, anyone working with large spatio-temporal data in R may benefit from the R-QGIS integration.
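A hedged sketch built around the functions named above (algorithm choice and parameter layout are illustrative and may differ slightly between RQGIS versions; 'dem' is assumed to be a raster already loaded in R):

```r
# Hedged sketch: terrain attributes from a DEM via a GRASS geoalgorithm.
library(RQGIS)

set_env()                                  # locate the QGIS installation
find_algorithms("slope")                   # search the available geoalgorithms
get_usage("grass7:r.slope.aspect")         # inspect the algorithm interface

params <- get_args_man("grass7:r.slope.aspect")  # all arguments with defaults
params$elevation <- dem                    # spatial object from the R session
params$slope     <- "slope.tif"

out <- run_qgis("grass7:r.slope.aspect", params = params,
                load_output = TRUE)        # result is loaded back into R
```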

Speakers


Wednesday July 5, 2017 1:48pm - 2:06pm CEST
3.01 Wild Gallery
  Talk, GIS
  • Company 58

1:48pm CEST

Architectural Elements Enabling R for Big Data
Keywords: Big Data, Machine Learning, Scalability, High Performance Computing, Graph Analytics
Webpages: https://oracle.com/goto/R
Big Data garners much attention, but how can enterprises extract value from the data found in growing corporate data lakes or data reservoirs? Extracting value from big data requires high performance and scalable tools – both in hardware and software. Increasingly, enterprises take on massive machine learning and graph analytics projects, where the goal is to build models and analyze graphs involving multi-billion row tables or to partition analyses into thousands or even millions of components.
Data scientists need to address use cases that range from modeling individual customer behavior to understanding aggregate behavior, or exploring centrality of nodes within a graph to monitoring sensors from the Internet of Things for anomalous behavior. While R is cited as the most used statistical language, limitations of scalability and performance often restrict its use for big data. In this talk, we present architectural elements enabling high performance and scalability, highlighting scenarios both on Hadoop/Spark and database platforms using R. We illustrate how Oracle Advanced Analytics’ Oracle R Enterprise component and Oracle R Advanced Analytics for Hadoop enable using R on big data, achieving both scalability and performance.

Speakers
avatar for Mark Hornick

Mark Hornick

Senior Director, Oracle
Mark Hornick is the Senior Director of Product Management for the Oracle Machine Learning (OML) family of products. He leads the OML PM team and works closely with Product Development on product strategy, positioning, and evangelization. Mark has over 20 years of experience with integrating...


Wednesday July 5, 2017 1:48pm - 2:06pm CEST
3.02 Wild Gallery
  Talk, HPC

1:48pm CEST

Clustering transformed compositional data using *coseq*
Abstract: Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e., data made up of profiles, whose rows belong to the simplex), remains largely unexplored, particularly in cases where the observed value of an observation is equal or close to zero for one or more samples. This work is motivated by the analysis of two sets of compositional data, both focused on the categorization of profiles but arising from considerably different applications: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib’ bike sharing system in Paris, France. For both of these applications, we propose the use of appropriate data transformations in conjunction with either Gaussian mixture models or K-means algorithms and penalized model selection criteria. Using our Bioconductor package coseq, we illustrate the user-friendly implementation and visualization provided by our proposed approach, with a focus on the functional coherence of the gene co-expression clusters and the geographical coherence of the bike station groupings.
Keywords: Clustering, compositional data, K-means, mixture model, transformation, co-expression
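A minimal sketch, assuming the single coseq() entry point of the current Bioconductor release; the count matrix is hypothetical:

```r
# Minimal sketch: transformation plus K-means clustering of compositional
# profiles ('counts' is a hypothetical genes-by-samples count matrix).
library(coseq)

run <- coseq(counts, K = 2:10,
             transformation = "logclr",    # transformation for compositional profiles
             model = "kmeans")

summary(run)                       # selected number of clusters, sizes, ...
plot(run, graphs = "profiles")     # per-cluster expression profiles
head(clusters(run))                # cluster label for each gene
```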

Speakers


Wednesday July 5, 2017 1:48pm - 2:06pm CEST
2.01 Wild Gallery

1:48pm CEST

manifestoR - a tool for data journalists, a source for text miners and a prototype for reproducibility software
Keywords: political science, reproducibility, corpus, data journalism, text mining
Webpages: https://CRAN.R-project.org/package=manifestor, https://manifesto-project.wzb.eu/information/documents/manifestoR
The Manifesto Project is a long-term political science research project that has been collecting, archiving and analysing party programs from democratic elections since 1979, and is one of longest standing and most widely used data sources in political science. The project recently released manifestoR as its official R package for accessing and analysing the data collected by the project. The package is aimed at three groups: it is a valuable tool for data journalism and social sciences, a data source for text mining, and a prototype for software that promotes research reproducibility.
The manifestoR package provides access to the Manifesto Corpus (Merz, Regel & Lewandowski 2016) – the project’s text database – which contains more than 3000 digitalised election programmes from 573 parties, together running in elections between 1946 and 2015 in 50 countries, and includes documents in more than 35 different languages. More than 2000 of these documents are available as digitalised, cleaned, UTF-8 encoded full text – the rest as PDF files. As these texts are accessible from directly within R, manifestoR provides a comfortable and valuable data source for text miners interested in political and/or multilingual training data, as well as for data journalists.
The manifesto texts accessible through manifestoR are labelled statement by statement, according to a 56 category scheme which identifies policy issues and positions. On the basis of this labelling scheme, the political science community has developed many aggregate indices on different scales for parties’ ideological positions. Most of these algorithms have been collected and included in manifestoR in order to provide a centralised and easy to use starting point for scientific and journalistic analyses and inquiries.
Replicability and reproducibility of scientific analyses are core values of the R community, and are of growing importance in the social sciences. Hence, manifestoR was designed with the goal of reproducible research in mind and tries to set an example of how a political science research project can publish and maintain an open source package to promote reproducibility when using its data. The Manifesto Project’s text collection is constantly growing and being updated, but any version ever published can easily be used as the basis for scripts written with manifestoR. In addition, the package integrates seamlessly with the widely-used tm package (Feinerer 2008) for text mining in R, and provides a data_frame representation for every data object in order to connect to the tidyverse packages (Wickham 2014), including the text-specific tidytext (Silge & Robinson 2016). For standardising and open-sourcing the implementations of aggregate indices from the community in manifestoR, we sought collaboration with the original authors. Additionally, the package provides infrastructure to easily adapt such indices, or to create new ones. The talk will also discuss the lessons learned and the unmet challenges that have arisen in developing such a package specifically for the political science community.
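A small sketch of the typical manifestoR workflow (an API key from the Manifesto Project website is required; the country filter is illustrative):

```r
library(manifestoR)

mp_setapikey("manifesto_apikey.txt")   # read the personal API key

mpds <- mp_maindataset()               # coded dataset, one row per manifesto

# Full-text documents as a tm-compatible corpus, here for one country
corp <- mp_corpus(countryname == "Germany")

# One of the aggregate indices shipped with the package: the RILE
# left-right score of each manifesto
head(rile(mpds))
```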
References
  • Feinerer, Ingo (2008). A text mining framework in R and its applications. Doctoral thesis, WU Vienna University of Economics and Business.
  • Merz, N., Regel, S., & Lewandowski, J. (2016). The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Research &Amp; Politics, 3(2), 2053168016643346. doi: 10.1177/2053168016643346
  • Silge, J., & Robinson, D. (2016). Tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.
  • Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10




Wednesday July 5, 2017 1:48pm - 2:06pm CEST
4.02 Wild Gallery

1:48pm CEST

Too good for your own good: Shiny prototypes out of control
Keywords: Shiny, Project Management, Product Management, Best Practice, Stakeholder Management
Shiny development is exploding in the R world, especially for enabling analysts to share their results with business users interactively. At Mango Solutions, the number of Shiny apps we are being commissioned to build has increased dramatically with approximately 30% of current projects involving some aspect of Shiny development.
Typically, Shiny has been used as a prototyping tool to quickly show business the value of data driven projects with the aim to productionalise the app once buy-in from stakeholders is gained. Shiny is fantastically quick to get an app up and running and into the hands of users and additional features can be rapidly prototyped for stakeholders.
In this presentation I will share with you our experience from a client project where Shiny prototyping got out of control: the app was so successful for the business that the pilot phase quickly evolved into full deployment as more users were involved in “testing”, without production best practice implemented yet. I will then tell you how we faced this challenge, which involved client education and the planning and implementation of the required deployment rigour.
I will also share our thoughts on how to approach Shiny prototyping and development (taking on board our lessons learnt) depending on the app’s needs: you can still quickly implement features with Shiny, but with a few recommendations you can minimise the largest risks of your app getting away from you.

Speakers


Wednesday July 5, 2017 1:48pm - 2:06pm CEST
2.02 Wild Gallery

1:48pm CEST

We R a Community: making high quality courses for higher education accessible for all: The >eR-Biostat initiative
1. Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Center for Statistics, Hasselt University, 3590 Diepenbeek, Belgium
2. Department of Epidemiology and Biostatistics, Gonder University, Ethiopia
3. Human Sciences Research Council (HSRC), PRETORIA, South Africa
4. The University of South Africa (UNISA), PRETORIA, South Africa
5. Wolfson Research Institute for Health and Wellbeing, Durham University, Durham


Keywords: Developing countries, master programs, Biostatistics, E-learning using R

Webpage: https://github.com/eR-Biostat

One of the main problems in higher education at the master's level in developing countries is the lack of high quality course materials for courses in master's programs. The >eR-Biostat initiative focuses on master's programs in Biostatistics/Statistics and aims to develop a new E-learning system for courses at the master's level.

The E-learning system, developed as a part of the >eR-Biostat initiative, offers free online course materials for master's students in biostatistics/statistics in developing countries. For each course, the materials are publicly available and consist of several types of course materials: (1) notes for the course, (2) slides for the course, (3) R programs, ready to use, which contain all data and R code for all examples and illustrations discussed in the course and (4) homework assignments and exams.

The >eR-Biostat initiative introduces a new, R based, learning system, the multi-module learning system, in which students at local universities in developing countries will be able to follow courses in different learning formats, including e-courses taken online and a combination of e-courses and local lectures given by local staff members. R software and packages are used in all courses as the data analysis tool for all examples and illustrations. The >eR-Biostat initiative provides a free, accessible and ready to use tool for capacity building in biostatistics/statistics for local universities in developing countries with currently low or near zero capacity in these topics. By its nature, the R community already collaborates in this way (for example, through CRAN and Bioconductor, which offer access to the most up-to-date R packages for data analysis). The >eR-Biostat initiative aims to bring R community members together to develop higher-education courses in the same way they currently collaborate on software development.

Speakers
avatar for Ziv Shkedy

Ziv Shkedy

Professor, Hasselt University
I am a professor for biostatistics and bioinformatics in the center for statistics in Hasselt University, Belgium



Wednesday July 5, 2017 1:48pm - 2:06pm CEST
PLENARY Wild Gallery
  Talk, Community

2:06pm CEST

A Tidy Data Model for Natural Language Processing
This talk introduces the R package cleanNLP, which provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford’s CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish (Marneffe et al. 2016, De Marneffe et al. (2014)). Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction (Lee et al. 2011). The functionality provided by the package applies the tidy data philosophy (Wickham 2014) to the processing of raw textual data by offering three distinct contributions:

  1. a data schema representing the output of an NLP annotation pipeline as a collection of normalized tables;
  2. a set of native Java output functions converting a Stanford CoreNLP annotation object directly, without converting into an intermediate XML format, into this collection of normalized tables;
  3. tools for converting from the tidy model into (sparse) data matrices appropriate for exploratory and predictive modeling.
Together, these contributions simplify the process of doing exploratory data analysis over a corpus of text. The output works seamlessly with both tidy data tools as well as other programming and graphing systems. The talk will illustrate the basic usage of the cleanNLP package, explain the rationale behind the underlying data model, and show an example from a corpus of the text from every State of the Union address made by a United States President (Peters 2016).
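A minimal sketch of the tidy annotation workflow; the function names follow the current cleanNLP release (which postdates the talk), and the lightweight udpipe backend is used here for portability in place of the Stanford CoreNLP backend discussed above:

```r
library(cleanNLP)

cnlp_init_udpipe()                     # choose an annotation backend

txt <- c(doc1 = "R is a language for statistical computing.",
         doc2 = "The tidy data model stores annotations as tables.")

anno <- cnlp_annotate(txt)             # run the pipeline

anno$token                             # one row per token: lemma, POS tag, ...
```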

Speakers
avatar for Taylor Arnold

Taylor Arnold

Assistant Professor of Statistics, University of Richmond
Large scale text and image processing


Wednesday July 5, 2017 2:06pm - 2:24pm CEST
4.02 Wild Gallery

2:06pm CEST

Beyond Prototyping: Best practices for R in critical enterprise environments
Keywords: data science, business, industry, best practice, critical enterprise environments, R,
Over the last couple of years, R has become increasingly popular among business users. Today, it is the first choice of many data science departments when it comes to ad-hoc analysis and data visualization, research and prototyping.
But when it comes to critical production environments, IT departments are still reluctant to consider R as part of their software stack. And there are reasons for that: Dynamic typing, the reputation for being slow (still around!), the lack of experience regarding management and administration of R (and its 10,000 packages), to name some of them.
Nevertheless, with the help of some friends, it is feasible and reasonable to use R as a core application even in critical production environments. This talk will share lessons learned from practical experience and point out a best practice landscape of tools, approaches and methods to make that happen.

Speakers


Wednesday July 5, 2017 2:06pm - 2:24pm CEST
2.02 Wild Gallery

2:06pm CEST

Extracting Meaningful Noisy Biclusters from a Binary Big-Data Matrix using the BiBitR R Package
Keywords: R, package, biclustering, binary data
Webpages: https://cran.r-project.org/web/packages/BiBitR/index.html, https://github.com/ewouddt/BiBitR
Biclustering is a data analysis method that can be used to cluster the rows and columns in a (big) data matrix simultaneously in order to identify local submatrices of interest, i.e., local patterns in a big data matrix. For binary data matrices, the local submatrices that biclustering methods can identify consist of rectangles of 1’s. Several methods were developed for biclustering of binary data, such as the Bimax algorithm proposed by Prelić et al. (2006) and the BiBit algorithm by Rodriguez-Baena, Perez-Pulido, and Aguilar-Ruiz (2011). However, these methods are only capable of discovering perfect biclusters, which means that noise is not allowed (i.e., zeros are not included in the bicluster). We present an extension of the BiBit algorithm (E-BiBit) that allows for noisy biclusters. While this method works very fast, its downside is that it often produces a large number of biclusters (typically >10000), which makes it very difficult to recover any meaningful patterns and to interpret the results. Furthermore, many of these biclusters are highly overlapping.
We propose a data analysis workflow to extract meaningful noisy biclusters from binary data using an extended and ‘pattern-guided’ version of BiBit and combine it with traditional clustering/networking methods. The proposed algorithm and the data analysis workflow are illustrated using the BiBitR R package to extract and visualize these results.
The proposed method and data analysis flow are applied to high-dimensional real-life health data which contain information on the disease symptoms of hundreds of thousands of patients. The E-BiBit algorithm is used to identify homogeneous subsets of patients who share the same disease symptom profiles.
E-BiBit has also been included in the BiclustGUI R package (De Troyer and Otava (2016), De Troyer et al. (2016)), an ensemble GUI package in which multiple biclustering and visualisation methods are implemented.
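A heavily hedged sketch; the bibit2() interface with a noise argument is an assumption based on the package documentation, and the data are simulated:

```r
# Plant a noisy bicluster of 1's in simulated binary data and search for it.
library(BiBitR)

set.seed(1)
mat <- matrix(rbinom(100 * 50, 1, 0.15), nrow = 100, ncol = 50)
mat[1:20, 1:10] <- rbinom(20 * 10, 1, 0.95)   # noisy all-1 block

res <- bibit2(mat, minr = 5, minc = 5, noise = 0.1)  # allow some zeros
res
```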
References De Troyer, E., and M. Otava. 2016. Package ’Rcmdrplugin.BiclustGUI’: ’Rcmdr’ Plug-in Gui for Biclustering. https://ewouddt.github.io/RcmdrPlugin.BiclustGUI/aboutbiclustgui/.

De Troyer, E., M. Otava, J. D. Zhang, S. Pramana, T. Khamiakova, S. Kaiser, M. Sill, et al. 2016. “Applied Biclustering Methods for Big and High-Dimensional Data Using R.” In, edited by A. Kasim, Z. Shkedy, S. Kaiser, S. Hochreiter, and W. Talloen. CRC Press Taylor & Francis Group, Chapman & Hall/CRC Biostatistics Series.

Prelić, A., S. Bleuler, P. Zimmermann, Wille A., P. Bühlmann, W. Gruissem, L. Henning, L. Thiele, and E. Zitzler. 2006. “A Systematic Comparison and Evaluation of Biclustering Methods for Gene Expression Data.” Bioinformatics 22: 1122–9.

Rodriguez-Baena, Domingo S., Antonio J. Perez-Pulido, and Jesus S. Aguilar-Ruiz. 2011. “A Biclustering Algorithm for Extracting Bit-Patterns from Binary Datasets.” Bioinformatics 27 (19).




Speakers


Wednesday July 5, 2017 2:06pm - 2:24pm CEST
2.01 Wild Gallery

2:06pm CEST

How the R Consortium is Supporting the R Community
There is a lot happening at the R Consortium! We now have 15 members, including the Gordon and Betty Moore Foundation which joined this year as a Platinum member, 21 active projects and a fired-up grant process. This March the ISC awarded grants to 10 new projects totaling nearly $240,000. In this talk we will describe how the R Consortium is evolving to carry out its mission to provide support for the R language, the R Foundation and the R Community. We will summarize the active projects, why they are important and where we think they are going, and describe how individual R users can get involved with R Consortium projects.
Keywords: R Consortium
Webpages: https://www.r-consortium.org/

Speakers
avatar for David Smith

David Smith

Cloud Advocate, Microsoft
Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.



Wednesday July 5, 2017 2:06pm - 2:24pm CEST
PLENARY Wild Gallery
  Talk, Community

2:06pm CEST

Link2GI - Easy linking Open Source GIS with R
Keywords: Spatial analysis, Setup, GRASS, SAGA, OTB, QGIS
Despite the well-known capabilities of spatial analysis and data handling in the world of R, an enormous gap persists between R and the mature open-source Geographic Information System (GIS) and Remote Sensing (RS) software community. Prominent representatives like QGIS, GRASS GIS and SAGA GIS provide comprehensive and continually growing collections of highly sophisticated algorithms that are mostly fast, stable and usually well proven by the community.
Although a number of R wrappers aim to bridge this gap (e.g. rgrass7 for GRASS GIS 7.x, RSAGA for SAGA GIS) – among which RQGIS is the most recent outcome, realizing simple access to the powerful QGIS command line interface – most of these packages are not that easy to set up. Most of the wrappers try to find and/or set an appropriate environment; nevertheless it is in many cases at least cumbersome to get all necessary settings correct, especially if one has to work with restricted rights or parallel installations of the same GIS software.
In order to overcome known limitations, the package link2GI provides a small framework for easily linking R to major GIS software. Here, linking simply means providing all necessary environment settings as well as full access to the command line APIs of these software tools, whereby the strategy differs from software to software. As a result, an easy entry point is provided for linking current versions of GRASS 7.x GIS, SAGA GIS, QGIS as well as other command line tools like the Orfeo Toolbox (OTB) to R. The package focuses both on R users who are not very familiar with the conditions and pitfalls of their preferred operating system and on more experienced users who want some comfortable shortcuts for a seamless integration of e.g. GRASS. The simple call link2GI::linkGRASS7(x=anySpatialObject) will search for the OS-dependent installations of GRASS 7. Furthermore, it will set up the R session according to the provided spatial object. All steps can be influenced manually, which will significantly speed up the process. Especially if you work with already established GRASS databases it provides a convenient way to link mapsets and locations correctly.
The package also provides some basic tools beyond simple linking. Since Edzer Pebesma’s new sf package, it is for the first time possible to deal with big vector data sets (> 1,000,000 polygons or 25,000,000 vertices). Nevertheless, it is advantageous to run the more sophisticated spatial analyses with external GIS software. To improve this process, link2GI provides a first version of direct reading and writing of GRASS and SAGA vector data from and to R to speed up the conversion process. Finally, a first version of a common Orfeo Toolbox wrapper for simplifying OTB calls is introduced.
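A short sketch of the linking step using the call quoted above, followed by a GRASS command through rgrass7 (the meuse data merely stand in for any spatial object):

```r
library(link2GI)
library(rgrass7)
library(sp)

data(meuse)
coordinates(meuse) <- ~x + y
proj4string(meuse) <- CRS("+init=epsg:28992")

# Search the OS-dependent GRASS 7 installation and set up a temporary
# location matching the extent and projection of 'meuse'
link2GI::linkGRASS7(x = meuse)

# From here on, GRASS geoalgorithms are available through rgrass7
execGRASS("g.region", flags = "p")
```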


Wednesday July 5, 2017 2:06pm - 2:24pm CEST
3.01 Wild Gallery
  Talk, GIS

2:06pm CEST

Rc$^2$: an Environment for Running R and Spark in Docker Containers
Keywords: R, Spark, Docker containers, Kubernetes, Cloud computing
Rc$^2$ (R cloud computing) is a containerized environment for running R, Hadoop, and Spark with various persistent data stores including PostgreSQL, HDFS, HBase, Hive, etc. At this time, the server side of Rc$^2$ runs on Docker’s Community Edition, which can be: on the same machine as the client, on a server, or in the cloud. Currently, Rc$^2$ supports a macOS client, but iOS and web clients are in active development.
The clients are designed for small or large screens with a left editor panel and a right console/output panel. The editor panel supports R scripts, R Markdown, and Sweave, but bash, SQL, Python, and additional languages will be added. The right panel allows toggling among the console and graphical objects as well as among generated help, html, and pdf files. A slide-out panel allows toggling among session files, R environments, and R packages. Extensive search capabilities are available in all panels.
The base server configuration has containers for an app server, a database server, and a compute engine. The app server communicates with the client. The compute engine is available with or without Hadoop/Spark. Additional containers can be added or removed from within Rc$^2$ as it is running, or various prebuilt topologies can be launched from the Welcome window. Multiple sessions can be run concurrently in tabs. For example, a local session could be running along with another session connected to a Spark cluster.
Although the Rc$^2$ architecture supports physical servers and clusters, the direction of computing is in virtualization. The Docker containers in Rc$^2$ can be orchestrated by Kubernetes to build arbitrarily large virtual clusters for the compute engine (e.g., parallel R) and/or for Hadoop/Spark. The focus initially is on building a virtual cluster from Spark containers using Kubernetes, built on a persistent data store, e.g., HDFS. The ultimate goal is to build data science workflows, e.g., ingesting streaming data into Kafka, modulating it into a data store, and passing it to Spark Streaming.

Speakers
JH

Jim Harner

Professor Emeritus, West Virginia University


Rc2 pdf

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
3.02 Wild Gallery
  Talk, HPC

2:24pm CEST

Diversity of the R Community
Keywords: R foundation, R community, gender gap, diversity, useR! conferences
Webpages: https://forwards.github.io/
R Forwards is an R Foundation taskforce which aims at leading the R community forwards in widening the participation of women and other under-represented groups. We are organized in sub-teams that work on specific tasks, such as data collection and analysis, social media, gathering teaching materials, organizing targeted workshops, keeping track of scholarships and interesting diversity initiatives, etc. In this talk, I will present an overview of our activities and in particular the work of the survey team, who analyzed the questionnaire run at useR! 2016. We collected information on the participants' socio-demographic background, experiences and interest in R to get a better understanding of how to make the R community a more inclusive environment. We regularly post our results in blog posts. Based on this analysis, I will present some of our recommendations.

Speakers
avatar for julie josse

julie josse

Polytechnique, Polytechnique
Professor of statistics, my research focuses on handling missing values. Conference buddy, I would be glad to discuss with you



Wednesday July 5, 2017 2:24pm - 2:42pm CEST
PLENARY Wild Gallery
  Talk, Community

2:24pm CEST

Implementing Predictive Analytics projects in corporate environments
Keywords: R in production, business applications
INWT Statistics is a company specialised in services around Predictive Analytics. For our clients we develop customised algorithms and solutions. While R is the de facto standard within our company, we face many challenges in our day-to-day work when we implement these solutions for our clients. To overcome these challenges we use standardised approaches for integrating predictive models into the infrastructure of our clients.
In this talk I will give an overview of a typical project structure and the role of the R language within our projects. R is used as an analytics tool, for automatic reporting, building dashboards, and various programming tasks. While developing solutions, we always have to keep in mind how our clients plan to utilise the results. Here we have experience with the full delivery of the outcome in the form of R packages and workshops, as well as giving access to the results by using dashboards or automatically generated reports. Different companies need different models of implementation. Thus in each project we have to decide early on how R can be used to its full potential to meet our clients' requirements. In this regard, I give insights into various models of implementation and our experience with each of them.

Speakers


Wednesday July 5, 2017 2:24pm - 2:42pm CEST
2.02 Wild Gallery

2:24pm CEST

Interfacing Google's spherical geometry library (S2) for spatial data in R
Keywords: Spatial statistics, spherical geometry, geospatial index, GIS
Webpages: https://github.com/spatstat/s2, https://cran.r-project.org/package=s2
Google’s S2 geometry library is a somewhat hidden gem which hasn’t received the attention it deserves. It both facilitates geometric operations directly on the sphere, such as polygonal unions, intersections, differences etc., without the hassle of projecting data given in the common latitude and longitude format, and provides an efficient quadtree-type hierarchical geospatial index.
The original C++ source code is available in a Google Code archive and it has been partially ported to e.g. Java, Python, NodeJS, and Go, and it is used in MongoDB’s 2dsphere index.
The geospatial index in the S2 library allows for useful approximations of arbitrary regions on the sphere which can be efficiently manipulated.
We describe how the geospatial index is constructed and some of its properties, as well as how to perform some of the geometrical operations supported by the library. This is all done using Rcpp to interface the C++ code from R.

Speakers

talk html

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
3.01 Wild Gallery
  Talk, GIS

2:24pm CEST

kmlShape: clustering longitudinal data according to their shapes

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
2.01 Wild Gallery

2:24pm CEST

R goes Mobile: Efficient Scheduling for Parallel R Programs on Heterogeneous Embedded Systems
Keywords: Parallelization, Resource-Aware Scheduling, Hyperparameter Tuning, Embedded Systems
Webpages: http://sfb876.tu-dortmund.de/SPP/sfb876-a3.html
We present a resource-aware scheduling strategy for parallelizing R applications on heterogeneous architectures, like those commonly found in mobile devices. Such devices typically consist of different processors with different frequencies and memory sizes, and are characterized by tight resource and energy restrictions. Similar to the parallel package that is part of the R distribution, we target problems that can be decomposed into independent tasks that are then processed in parallel. However, as the parallel package is not resource-aware and does not support heterogeneous architectures, it is ill-suited for the kinds of systems we are considering.
The application we are focusing on is parameter tuning of machine learning algorithms. In this scenario, the execution time of an evaluation of a parameter configuration can vary heavily depending on the configuration and the underlying architecture. Key to our approach is a regression model that estimates the execution time of a task for each available processor type based on previous evaluations. In combination with a scheduler allowing to allocate tasks to specific processors, we thus enable efficient resource-aware parallel scheduling to optimize the overall execution time.
We demonstrate the effectiveness of our approach in a series of examples targeting the ARM big.LITTLE architecture, an architecture commonly found in mobile phones.
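The core idea, learning per-processor execution times from past runs and assigning each task to the processor with the lowest predicted runtime, can be illustrated with a plain-R toy sketch; this shows the principle only and is not the authors' scheduler:

```r
# Toy illustration: regression-based runtime prediction per processor type,
# followed by a greedy assignment of new tasks.
set.seed(42)
history <- data.frame(
  cpu  = sample(c("big", "LITTLE"), 200, replace = TRUE),
  size = runif(200, 1, 10)                      # a task/configuration feature
)
history$runtime <- with(history,
  ifelse(cpu == "big", 0.5, 1.4) * size + rnorm(200, sd = 0.2))

model <- lm(runtime ~ cpu * size, data = history)   # per-processor regression

new_tasks <- data.frame(size = c(2, 8, 5))
pred <- sapply(c("big", "LITTLE"),
               function(p) predict(model, cbind(new_tasks, cpu = p)))

cbind(new_tasks, assigned = colnames(pred)[apply(pred, 1, which.min)])
```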
References ARM. 2017. “big.LITTLE Technology.” https://www.arm.com/products/processors/technologies/biglittleprocessing.php.

Helena Kotthaus, Ingo Korb. 2017. “TraceR: Profiling Tool for the R Language.” Department of Computer Science 12, TU Dortmund University. https://github.com/allr/traceR-installer.

Kotthaus, Helena, Ingo Korb, and Peter Marwedel. 2015. “Performance Analysis for Parallel R Programs: Towards Efficient Resource Utilization.” 01/2015. Department of Computer Science 12, TU Dortmund University.

Richter, Jakob, Helena Kotthaus, Bernd Bischl, Peter Marwedel, Jörg Rahnenführer, and Michel Lang. 2016. “Faster Model-Based Optimization Through Resource-Aware Scheduling Strategies.” In LION10, 267–73. Springer International Publishing.




Speakers

Helena Kotthaus

Department of Computer Science 12, TU Dortmund University, Dortmund, Germany


Wednesday July 5, 2017 2:24pm - 2:42pm CEST
3.02 Wild Gallery
  Talk, HPC

2:24pm CEST

Text Analysis and Text Mining Using R
Keywords: text analysis, text mining, machine learning, social media
Summary A useR! Talk about text analysis and text mining using R. I would cover the broad set of tools for text analysis and natural language processing in R, with an emphasis on my R package quanteda but also covering other major tools in the R ecosystem for text analysis (e.g. stringi).
The talk is a tutorial covering how to perform common text analysis and natural language processing tasks using R. Contrary to a belief popular among some data scientists, when used properly, R is a fast and powerful tool for managing even very large text analysis tasks. My talk will present the many options available, demonstrate that these work on large data, and compare the features of R for these tasks with popular options in Python.
Specifically, I will demonstrate how to format and input source texts, how to structure their metadata, and how to prepare them for analysis. This includes common tasks such as tokenisation, including constructing ngrams and “skip-grams”, removing stopwords, stemming words, and other forms of feature selection. I will also show how to tag parts of speech and parse structural dependencies in texts. For statistical analysis, I will show how R can be used to get summary statistics from text, search for and analyse keywords and phrases, analyse text for lexical diversity and readability, detect collocations, apply dictionaries, and measure term and document associations using distance measures. Our analysis covers basic text-related data processing in the R base language, but mostly relies on the quanteda package (https://github.com/kbenoit/quanteda) for the quantitative analysis of textual data. We also cover how to pass the structured objects from quanteda into other text analytic packages for doing topic modelling, latent semantic analysis, regression models, and other forms of machine learning.
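A minimal sketch of the quanteda workflow covered in the talk, from raw text to a document-feature matrix (the calls follow the package's documented tokens/dfm interface):

```r
# From raw text to a document-feature matrix with quanteda
library(quanteda)

txt <- c(doc1 = "R is a fast and powerful tool for text analysis.",
         doc2 = "Text mining in R scales to very large corpora.")

corp <- corpus(txt)
toks <- tokens(corp, remove_punct = TRUE)          # tokenisation
toks <- tokens_remove(toks, stopwords("english"))  # stopword removal
dfmat <- dfm(toks)                                 # document-feature matrix

topfeatures(dfmat)                                 # most frequent features
```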

About me Kenneth Benoit is Professor of Quantitative Social Research Methods at the London School of Economics and Political Science. His current research focuses on automated, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. For the past 5 years, he has been developing a major R package for text analysis, quanteda, as part of European Research Council grant ERC-2011-StG 283794-QUANTESS.


Speakers

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
4.02 Wild Gallery

2:42pm CEST

**rTRNG**: Advanced Parallel Random Number Generation in R
Keywords: Random Number Generation, Monte Carlo, Parallel Execution, Reproducibility
Webpages: https://github.com/miraisolutions/rTRNG
Monte Carlo simulations provide a powerful computational approach to address a wide variety of problems in several domains, such as physical sciences, engineering, computational biology and finance. The independent-samples and large-scale nature of Monte Carlo simulations make the corresponding computation suited for parallel execution, at least in theory. In practice, pseudo-random number generators (RNGs) are intrinsically sequential. This often prevents having a parallel Monte Carlo algorithm that plays fair, meaning one whose results are independent of the architecture, parallelization techniques and number of parallel processes (Mertens 2009; Bauke 2016).
We will show that parallel-oriented RNGs and techniques in fact exist and can be used in R with the rTRNG package (Porreca, Schmid, and Bauke 2017). The package relies on TRNG (Bauke 2016), a state-of-the-art C++ pseudo-random number generator library for sequential and parallel Monte Carlo simulations.
TRNG provides parallel RNGs that can be manipulated by jumping ahead an arbitrary number of steps or splitting a sequence into any desired subsequence(s), thus supporting techniques such as block-splitting and leapfrogging suitable to parallel algorithms.
The rTRNG package provides access to the functionality of the underlying TRNG C++ library by embedding its sources and headers. Beyond this, it makes use of Rcpp and RcppParallel to offer several ways of creating and manipulating pseudo-random streams, and drawing random variates from them, which we will demonstrate:
  • Base-R-like usage for selecting and manipulating the current engine, as a simple and immediate way for R users to use rTRNG
  • Reference objects wrapping the underlying C++ TRNG random number engines can be created and manipulated in OOP-style, for greater flexibility in using parallel RNGs in R
  • TRNG C++ library and headers can be accessed directly from within R projects that use C++, either via standalone C++ code (using sourceCpp) or by creating an R package that depends on rTRNG (see the sketch below)
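A minimal sketch of the first, base-R-like usage mode; the function names follow the package README, but the exact arguments should be treated as assumptions:

```r
# Base-R-like rTRNG usage: pick a parallel-capable TRNG engine, then use
# jump operations so that sub-streams remain reproducible ("fair") in parallel.
library(rTRNG)

TRNGkind("yarn2")     # a parallel generator from the TRNG library
TRNGseed(12358)
x <- rnorm_trng(5)    # drop-in style replacement for rnorm()

TRNGseed(12358)
TRNGjump(2)           # block-splitting: skip the first two draws of the stream
y <- rnorm_trng(3)    # should continue the sequence from where x[3] started
```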
References Bauke, Heiko. 2016. Tina’s Random Number Generator Library. https://numbercrunch.de/trng/trng.pdf.

Mertens, Stephan. 2009. “Random Number Generators: A Survival Guide for Large Scale Simulations.” In Modern Computational Science 09. BIS-Verlag.

Porreca, Riccardo, Roland Schmid, and Heiko Bauke. 2017. rTRNG: R Package Providing Access and Examples to TRNG C++ Library. https://github.com/miraisolutions/rTRNG/.




Speakers

Riccardo Porreca

Mirai Solutions GmbH



Wednesday July 5, 2017 2:42pm - 3:00pm CEST
3.02 Wild Gallery
  Talk, HPC

2:42pm CEST

Neural Embeddings and NLP with R and Spark
Keywords: NLP, Spark, Deep Learning, Network Science
Webpages: https://github.com/akzaidi
Neural embeddings (Bengio et al. (2003), Olah (2014)) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable for modeling, visualization, and inference. In this talk, we describe how to use neural embeddings of natural and programming languages using R and Spark. In particular, we’ll see how the combination of a distributed computing paradigm in Spark with the interactive programming and visualization capabilities in R can make exploration and inference of natural language processing models easy and efficient.
Building upon the tidy data principles formalized and efficiently crafted in Wickham (2014), Silge and Robinson (2016) have provided the foundations for modeling and crafting natural language models with the tidytext package. In this talk, we’ll describe how we can build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and the RevoScaleR packages.
To describe the utility of this framework we’ll provide an example where we’ll train a sequence to sequence neural attention model for summarizing git commits, pull request and their associated messages (Zaidi (2017)), and then deploy them on Spark clusters where we will then be able to do efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.
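As a hedged sketch of this kind of pipeline, the snippet below prototypes a tidy tokenisation locally with tidytext and then pushes the same dplyr verbs to a local Spark session via sparklyr; the neural attention model itself is beyond a few lines.

```r
# Prototype locally with tidytext, then run the aggregation on Spark via sparklyr
library(dplyr)
library(tidytext)
library(sparklyr)

commits <- tibble(id = 1:2,
                  message = c("fix memory leak in parser",
                              "add unit tests for tokenizer"))

tokens <- commits %>% unnest_tokens(word, message)     # tidy tokenisation

sc <- spark_connect(master = "local")
tokens_tbl <- copy_to(sc, tokens, "commit_tokens", overwrite = TRUE)

tokens_tbl %>% count(word, sort = TRUE) %>% collect()  # executed in Spark

spark_disconnect(sc)
```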
References Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. 3 (March). JMLR.org: 1137–55. http://dl.acm.org/citation.cfm?id=944919.944966.

Olah, Christopher. 2014. “Deep Learning, NLP, and Representations.” https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. doi:10.18637/jss.v059.i10.

Zaidi, Ali. 2017. “Summarizing Git Commits and Github Pull Requests Using Sequence to Sequence Neural Attention Models.” CS224N: Final Project, Stanford University.




Speakers

Ali Zaidi

Data Scientist, Microsoft


Wednesday July 5, 2017 2:42pm - 3:00pm CEST
4.02 Wild Gallery

2:42pm CEST

Scalable, Spatiotemporal Tidy Arrays for R (stars)


Edzer Pebesma, Etienne Racine, Michael Sumner
Spatiotemporal data often come in the form of dense arrays, with space and time being array dimensions. Examples include socio-economic or demographic data, environmental variables monitored at fixed stations, time series of satellite images with multiple spectral bands, spatial simulations, and climate model results. Currently, R does not have infrastructure to handle and analyse such arrays easily. Package raster is probably still the most powerful package for handling this kind of data in memory and on disk, but it does not address non-raster time series, raster time series with multiple attributes, rasters with mixed-type attributes, or spatially distributed sets of satellite images. This project will not only deal with these cases, but also extend the “in memory or on disk” model to one where the data are held remotely in cloud storage, which is a more feasible option e.g. for satellite data collected today. We will implement pipe-based workflows that are developed and tested on samples before they are evaluated for complete datasets, and discuss the challenges of visualisation and storage in such workflows. This is work in progress, and the talk will discuss the design stage and hopefully show an early prototype.
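To make the gap concrete, the sketch below shows how a single-attribute raster time series is handled today with the raster package mentioned above; stars targets the cases this representation cannot express (multiple attributes, mixed types, cloud-held data).

```r
# Current practice: a dense space-time array as a RasterBrick (one attribute)
library(raster)

set.seed(1)
# Ten time steps of a 20 x 20 grid, e.g. a simulated environmental variable
b <- brick(array(rnorm(20 * 20 * 10), dim = c(20, 20, 10)))

mean_map <- calc(b, mean)                  # temporal mean per grid cell
cell_ts  <- b[cellFromRowCol(b, 10, 10)]   # time series for a single cell
```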





Speakers

Edzer Pebesma

University of Muenster
I lead the spatio-temporal modelling laboratory at the institute for geoinformatics, and am deputy head of institute. I hold a PhD in geosciences, and am interested in spatial statistics, environmental modelling, geoinformatics and GI Science, semantic technology for spatial analysis...


Wednesday July 5, 2017 2:42pm - 3:00pm CEST
3.01 Wild Gallery
  Talk, GIS

2:42pm CEST

We R What We Ask: The Landscape of R Users on Stack Overflow
Keywords: r, data science, web traffic, visualization
Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 175,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends in how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I’ll examine what ecosystems of R packages are used in combination, what other technologies are used alongside R, and what countries and cities have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I’ll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.

Speakers

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
PLENARY Wild Gallery
 
Thursday, July 6
 

11:00am CEST

**rags2ridges**: A One-Stop-Go for Network Modeling of Precision Matrices
Keywords: Data integration, Graphical modeling, High-dimensional precision matrix estimation, Networks
Webpages: https://CRAN.R-project.org/package=rags2ridges, https://github.com/CFWP/rags2ridges
Contact: cf.peeters@vumc.nl
A contemporary use for inverse covariance matrices (aka precision matrices) is found in the data-based reconstruction of networks through graphical modeling. Graphical models merge probability distributions of random vectors with graphs that express the conditional (in)dependencies between the constituent random variables. The rags2ridges package enables L2-penalized (i.e., ridge) estimation of the precision matrix in settings where the number of variables is large relative to the sample size. Hence, it is a package where high-dimensional (HD) data meets networks.
The talk will give an overview of the rags2ridges package. Specifically, it will show that the package is a one-stop-go as it provides functionality for the extraction, visualization, and analysis of networks from HD data. Moreover, it will show that the package provides a basis for the vertical (across data sets) and horizontal (across platforms) integration of HD data stemming from omics experiments. Last but not least, it will explain why many rap musicians are stating that one should ‘get ridge, or die trying’.
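A hedged sketch of the core workflow, with argument names taken from the package documentation as recalled (treat them as assumptions): estimate a ridge precision matrix from high-dimensional data, then extract a sparse network from it.

```r
# Ridge precision estimation and network extraction with rags2ridges;
# the sparsify() arguments are assumptions based on the package docs.
library(rags2ridges)

set.seed(1)
p <- 25; n <- 20                    # p > n: high-dimensional setting
X <- matrix(rnorm(n * p), n, p)
S <- cov(X)                         # sample covariance matrix

P  <- ridgeP(S, lambda = 1)                     # L2-penalized precision matrix
P0 <- sparsify(P, threshold = "top", top = 30)  # keep the strongest partial correlations
```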
References https://arxiv.org/abs/1509.07982
https://arxiv.org/abs/1608.04123
http://dx.doi.org/10.1016/j.csda.2016.05.012


Speakers

Carel Peeters

Assistant Professor, VU University medical center
Biostatistician specializing in multivariate and high-dimensional molecular biostatistics.



Thursday July 6, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery
  Talk, Methods I

11:00am CEST

Bayesian social network analysis with Bergm
Keywords: Bayesian analysis, Exponential random graph models, Monte Carlo methods
Webpages: https://CRAN.R-project.org/package=Bergm
Exponential random graph models (ERGMs) are a very important family of statistical models for analyzing network data. From a computational point of view, ERGMs are extremely difficult to handle since their normalising constant, which depends on the model parameters, is intractable. In this talk, we show how parameter inference can be carried out in a Bayesian framework using MCMC strategies which circumvent the need to calculate the normalising constants.
The new version of the Bergm package for R (Caimo and Friel 2014) provides a comprehensive framework for Bayesian analysis of ERGMs, using the approximate exchange algorithm (Caimo and Friel 2011) and calibration of the pseudo-posterior distribution (Bouranis, Friel, and Maire 2015) to sample from the ERGM parameter posterior distribution. The package also supplies graphical Bayesian goodness-of-fit procedures that address the issue of model adequacy.
This talk will have a strong focus on the main practical implementation features of the software that will be described by the analysis of real network data (with various applications in Neuroscience and Organisation Science).
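As a hint of the interface discussed, a minimal sketch on a classic small network (the florentine marriage data shipped with ergm); MCMC settings are left at their defaults.

```r
# Bayesian ERGM estimation with Bergm via the approximate exchange algorithm
library(Bergm)
data(florentine, package = "ergm")   # loads the flomarriage network

fit <- bergm(flomarriage ~ edges + kstar(2))   # posterior sampling for the ERGM
bgof(fit)                                      # Bayesian goodness-of-fit plots
```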
References Bouranis, L., N. Friel, and F. Maire. 2015. “Bayesian Inference for Misspecified Exponential Random Graph Models.” arXiv Preprint arXiv:1510.00934.

Caimo, A., and N. Friel. 2011. “Bayesian Inference for Exponential Random Graph Models.” Social Networks 33 (1): 41–55.

———. 2014. “Bergm: Bayesian Exponential Random Graphs in R.” Journal of Statistical Software 61 (2): 1–25.




Speakers


Thursday July 6, 2017 11:00am - 11:18am CEST
4.01 Wild Gallery

11:00am CEST

Hosting Data Packages via `drat`: A Case Study with Hurricane Exposure Data
R packages offer the chance to distribute large datasets while also providing functions for exploring and working with that data. However, data packages often exceed the suggested size of CRAN packages, which is a challenge for package maintainers who would like to share their code through this central and popular repository. In this talk, we outline an approach in which the maintainer creates a smaller code package with the code to interact with the data, which can be submitted to CRAN, and a separate data package, which can be hosted by the package maintainer through a personal repository. Although such personal repositories are not mainstream, and so cannot be listed as an “Imports” or “Depends” dependency for a package submitted to CRAN, we suggest a way of including the data package as a suggested package and incorporating conditional code in the executable code within vignettes, examples, and tests, as well as conditioning functions in the code package to check for the availability of the data package. We illustrate this approach with a pair of packages that allows users to explore exposure to hurricanes and tropical storms in the United States. This approach may prove useful for a number of R package maintainers, especially with the growing trend to the sharing and use of open data in many of the fields in which R is popular.
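The conditioning pattern can be sketched in a few lines; the package name exampledata, its dataset, and the drat repository username below are hypothetical placeholders.

```r
# In the CRAN code package: only Suggest the data package and guard every use.
# "exampledata", "storm_tracks" and the drat repo "username" are placeholders.
if (!requireNamespace("exampledata", quietly = TRUE)) {
  message("To run this example, install the data package with:\n",
          "  drat::addRepo(\"username\")\n",
          "  install.packages(\"exampledata\")")
} else {
  storms <- exampledata::storm_tracks   # conditional code in vignettes/examples/tests
}
```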

Speakers


Thursday July 6, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

11:00am CEST

moodler: A new R package to easily fetch data from Moodle
Keywords: Moodle, SQL, tidy data
Webpages: https://github.com/jchrom/moodler
Learning management systems (LMS) generate large amounts of data. The LMS Moodle is at the forefront of open source learning platforms, and thanks to its widespread adoption by schools and businesses, it represents a great target for educational data-analytic efforts. In order to facilitate analysis of Moodle data in R, we introduce a new R package: moodler. It is a collection of useful SQL queries and data-wrangling functions that fetch data from a Moodle database and turn it into tidy data frames. This makes it easy to feed data from Moodle to a large number of R packages that focus on specific types of analyses.
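Setting moodler's own interface aside, the underlying idea can be sketched with DBI: run a query against the Moodle tables and return a tidy data frame. The credentials and the SQL below are illustrative only, not moodler's API.

```r
# Generic sketch: fetch Moodle quiz grades into a data frame with DBI;
# connection details and query are illustrative, not part of moodler.
library(DBI)

con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = "moodle", user = "reader", password = "********")

grades <- dbGetQuery(con, "
  SELECT u.id AS user_id, g.quiz AS quiz_id, g.grade
  FROM mdl_quiz_grades g
  JOIN mdl_user u ON u.id = g.userid
")

dbDisconnect(con)
```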

Speakers


Thursday July 6, 2017 11:00am - 11:18am CEST
4.02 Wild Gallery
  Talk, Web

11:00am CEST

Show Me Your Model: tools for visualisation of statistical models
Keywords: Model visualisation, model exploration, structure visualisation, grammar of model visualisation
The ggplot2 (Wickham 2009) package changed the way we approach data visualisation. Instead of looking for a suitable plot type among dozens of predefined templates, we now express the relations among variables with a well-defined grammar based on the excellent book The Grammar of Graphics (Wilkinson 2006).
A similar revolution is happening with tools for visualisation of statistical models. In the CRAN repository, one may find a lot of great packages that graphically explain the structure or diagnostics of some family of statistical models. To mention just a few well-known and powerful packages: rms, forestmodel and regtools (regression models), survminer (survival models), ggRandomForests (random forest based models), factoextra (multivariate structure exploration), factorMerger (one-way ANOVA) and many, many others. They are great, but they do not share the same logic or structure.
New packages from the tidyverse, like broom (Robinson 2017), create an opportunity to build a unified interface for model exploration and visualisation for a large collection of statistical models. And a growing number of articles set out theoretical foundations for a unified grammar of model visualization (see for example Wickham, Cook, and Hofmann 2015).
In this talk I am going to present various approaches to model visualisation, give an overview of selected existing packages for visualisation of statistical models, and discuss a proposal for a unified grammar of model visualisation.
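A minimal example of the broom-plus-ggplot2 pattern such a grammar could build on: several models are tidied into one data frame and drawn with a single grammar of graphics.

```r
# Tidy coefficients from several models, visualised with one ggplot2 layer spec
library(broom)
library(ggplot2)

fits <- list(small = lm(mpg ~ wt, data = mtcars),
             large = lm(mpg ~ wt + hp + qsec, data = mtcars))

coefs <- dplyr::bind_rows(lapply(fits, tidy), .id = "model")

ggplot(coefs, aes(estimate, term, colour = model)) +
  geom_point() +
  geom_errorbarh(aes(xmin = estimate - 2 * std.error,
                     xmax = estimate + 2 * std.error), height = 0.2)
```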
References Robinson, David. 2017. Broom: Convert Statistical Analysis Objects into Tidy Data Frames. https://CRAN.R-project.org/package=broom.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, Dianne Cook, and Heike Hofmann. 2015. Visualizing Statistical Models: Removing the Blindfold. Statistical Analysis; Data Mining 8(4).

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.





Thursday July 6, 2017 11:00am - 11:18am CEST
PLENARY Wild Gallery

11:00am CEST

Using the alphabetr package to determine paired T cell receptor sequences
The immune system has the monumental challenge of being capable of responding to any pathogen or foreign substance invading the body while ignoring self and innocuous molecules. T cells—which play a central role in directing immune responses, regulating other immune cells, and remembering past infections—accomplish this feat by maintaining a diverse repertoire of T cell receptors (TCR). A typical T cell expresses one unique TCR, and the TCR is made up of two chains—the TCRα and TCRβ chains—that both determine the set of molecules that the T cell can respond to. Since T cells play such a central role in many immune responses, identifying the TCR pairs of T cells involved in infectious diseases, cancers, and autoimmune diseases can provide profound insights for designing vaccines and immunotherapies. I introduce a novel approach to obtaining paired TCR sequences with the alphabetr package, which implements algorithms that identify TCR pairs in an efficient, high-throughput fashion for antigen-specific T cell populations (Lee et al. 2017).

Speakers


Thursday July 6, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery

11:18am CEST

Can you keep a secret?
Andrie de Vries (Senior Programme Manager, Algorithms and Data Science, Microsoft) and Gábor Csárdi (independent consultant)
Keywords: Asymmetric encryption, Public key encryption
When you use R to connect to a database, cloud computing service or other API, you must supply passwords, for example database credentials, authentication keys, etc.
It is easy to inadvertently leak your passwords and other secrets, e.g. accidentally adding your secrets to version control or logs.
A new package, secret, solves this problem by allowing you to encrypt and decrypt secrets using public key encryption. The package is available on GitHub and will soon also be on CRAN.
If you attend this session, you will learn:
  • Patterns that inadvertently leak secrets
  • The essentials of public key cryptography: how to create an asymmetric key pair (public and private key)
  • How to create a vault with encrypted secrets using the secret package
  • How to share these secrets with your collaborators by encrypting the secret with their public key
  • How you can do all of this in 5 lines of R code
This session will appeal to all R users who must use passwords to connect to services.
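A hedged sketch of that workflow; the function names follow the secret package documentation, while the vault path, user name and key files are placeholders.

```r
# Create a vault, register a collaborator's public key, and share a secret.
# Paths, user names and key locations below are placeholders.
library(secret)

vault <- file.path(tempdir(), "vault")
create_vault(vault)

add_user("alice", public_key = "~/.ssh/id_rsa.pub", vault = vault)
add_secret("db_password", "s3cr3t!", users = "alice", vault = vault)

# Only the holder of the matching private key can decrypt the secret
get_secret("db_password", key = "~/.ssh/id_rsa", vault = vault)
```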


Speakers

Andrie de Vries

Senior Programme Manager, Microsoft
Andrie is a senior programme manager at Microsoft, responsible for community projects and evangelization of Microsoft's contribution in Europe to the open source R language. He is co-author of the very popular title "R for Dummies" and a top contributor to the Q&A website, StackOverflow...


Thursday July 6, 2017 11:18am - 11:36am CEST
4.02 Wild Gallery
  Talk, Web

11:18am CEST

Clouds, Containers and R, towards a global hub for reproducible and collaborative data science
RosettaHUB aims at establishing a global open data science and open education meta cloud centered on usability, reproducibility, auditability, and shareability. It enables a wide range of social interactions and real-time collaborations.
RosettaHUB leverages public and private clouds and makes them easy to use for everyone. RosettaHUB’s federation platform allows any higher education institution or research laboratory to create a virtual organization within the hub. The institution’s members (researchers, educators, students) receive automatically active AWS accounts which are consolidated under one paying account, supervised in terms of budget and cloud resources usage, protected with safeguarding microservices and monitored/managed centrally by the institution’s administrator. The cloud resources are generally paid for using the coupons provided by Amazon as part of the AWS Educate program. The Organization members’ active AWS accounts are put under the control of a collaboration portal which simplifies dramatically everything related to the interaction with AWS and its collaborative use by communities of researchers, educators and students. The portal allows similar capabilities for Google Compute Engine, Azure, OpenStack-based and OpenNebula-based clouds.
RosettaHUB leverages Docker and allows users to work with containers seamlessly. Those containers are portable. When coupled with RosettaHUB’s open APIs, they break the silos between clouds and avoid vendor lock-in. Simple web interfaces allow users to create those containers, connect them to data storages, snapshot them, share snapshots with collaborators and migrate them from one cloud to another.
The RosettaHUB perspectives make it possible to use the containers to serve securely noVNC, RStudio, Jupyter and to enable those tools for real-time collaboration. Zeppelin, Spark-notebook and Shiny Apps are also supported. The RosettaHUB real-time collaborative containerized workbench is a universal IDE for data scientists. It makes it possible to interact in a stateful manner with hybrid kernels gluing together in a single process R, Python, Scala, SQL clients, Java, Matlab, Mathematica, etc. and allowing those different environments to share their workspace and their variables in memory. The RosettaHUB kernels and objects model break the silos between data science environments and make it possible to use them simultaneously in a very effective and flexible manner.
A simplified reactive programming framework makes it possible to create reactive data science microservices and interactive web applications based on multi-language macros and visual widgets. A scientific web based spreadsheet makes it possible to interact with R/Python/Scala capabilities from within cells which includes variables import/export and variables mirroring to cells as well as the automatic mapping of any function in those environments to formulas invokable in cells. Spreadsheet cells can also contain code and code execution results making it become a flexible multi-language notebook.
Ubiquitous docker containers coupled with the RosettaHUB workbench checkpointing capability and the logging to embedded databases of all the interactions the users have with their environments make everything created within RosettaHUB reproducible and auditable.
The RosettaHUB APIs (700+ functions) cover the full spectrum of programmatic interaction between users and clouds, containers and R/Python/Scala kernels. Clients for the APIs are available as an R package, a Python module, a Java library, an Excel add-in and a Word add-in. Based on those APIs, RosettaHUB provides a CloudFormation-like service which makes it easy to create and manage, as templates, collections of related cloud resources, container images, R/Python/Scala scripts, macros and visual widgets, alongside optional cloud credentials. Those templates are cloud-agnostic and make it possible for anyone to easily create and distribute complex data science applications and services. The user with whom a template is shared can, with one click, trigger the reconstruction and wiring on the fly of all the artifacts and dependencies. The RosettaHUB templates constitute a powerful sharing mechanism for RosettaHUB's e-Science and e-learning environment snapshots as well as for Jupyter/Zeppelin notebooks, Shiny apps, etc. RosettaHUB's marketplace transforms those templates into products that can be shared or sold.
The presentation will be an overview of RosettaHUB and will discuss the results of the RosettaHUB/AWS Educate initiative which involved 30 higher education institutions and research labs counting over 3000 researchers, educators, and students.

Speakers

Thursday July 6, 2017 11:18am - 11:36am CEST
2.02 Wild Gallery

11:18am CEST

Differentiation of brain tumor tissue using hierarchical non-negative matrix factorization
Keywords: non-negative matrix factorization, magnetic resonance imaging, brain tumor

Treatment of brain tumors is complicated by their high degree of heterogeneity. Various stages of the disease can occur throughout the same lesion, and transitions between the pathological tissue regions (i.e. active tumor, necrosis and edema) are diffuse (Price et al. 2006). Clinical practice could benefit from an accurate and reproducible method to differentiate brain tumor tissue based on medical imaging data.

We present a hierarchical variant of non-negative matrix factorization (hNMF) for characterizing brain tumors using multi-parametric magnetic resonance imaging (MRI) data (Sauwen et al. 2015). Non-negative matrix factorization (NMF) decomposes a non-negative input matrix X into 2 factor matrices W and H, thereby providing a parts-based representation of the input data. In the current context, the columns of X correspond to the image voxels and the rows represent the different MRI parameters. The columns of W represent tissue-specific signatures and the rows of H contain the relative abundances per tissue type over the different voxels.

hNMF is available as an R package on CRAN and is compatible with the NMF package. Besides the standard NMF algorithms that come with the NMF package, an efficient NMF algorithm called hierarchical alternating least-squares NMF was implemented and used within the hNMF framework. hNMF can be used as a general matrix factorization technique, but in the context of this talk it will be shown that valid tissue signatures are obtained using hNMF. Tissue abundances can be mapped back to the imaging domain, providing tissue differentiation on a voxel-wise basis (see Figure 1).
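A minimal sketch of the hierarchical idea using only the NMF package that hNMF builds on (the hNMF() interface itself may differ): a first-level factorization splits the voxels into coarse clusters, and a second-level factorization refines one of them.

```r
# Two-level ("hierarchical") NMF sketched with the NMF package; hNMF's own
# interface may differ from this.
library(NMF)

set.seed(1)
X <- matrix(runif(10 * 200), nrow = 10)  # rows: MRI parameters, columns: voxels

fit1 <- nmf(X, rank = 2)     # first level: two coarse tissue clusters
W1 <- basis(fit1)            # tissue-specific signatures (10 x 2)
H1 <- coef(fit1)             # relative abundances per voxel (2 x 200)

# Second level: refine the voxels assigned to the first cluster
in_cluster1 <- which(apply(H1, 2, which.max) == 1)
fit2 <- nmf(X[, in_cluster1], rank = 2)
```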

Figure 1: hNMF abundance maps of the pathological tissue regions of a glioblastoma patient. Left to right: T1-weighted background image with region of interest (green frame); abundance map active tumor; abundance map necrosis; abundance map edema.


Speakers


Thursday July 6, 2017 11:18am - 11:36am CEST
2.01 Wild Gallery

11:18am CEST

difNLR: Detection of potential gender/minority bias with extensions of logistic regression
Adela Drabinova1,2 and Patricia Martinkova2

1. Faculty of Mathematics and Physics, Charles University, Prague
2. Institute of Computer Science, Czech Academy of Sciences, Prague


Keywords: detection of item bias, differential item functioning, psychometrics, R

Webpages: https://CRAN.R-project.org/package=difNLR, https://CRAN.R-project.org/package=ShinyItemAnalysis, https://shiny.cs.cas.cz/ShinyItemAnalysis/

The R package difNLR has been developed for the detection of potentially unfair items in educational and psychological testing, i.e. the analysis of so-called Differential Item Functioning (DIF), based on extensions of the logistic regression model. For dichotomous data, six models have been implemented to offer a wide range of proxies to Item Response Theory models. Parameters are obtained using non-linear least squares estimation, and the DIF detection procedure is performed by either an F test or a likelihood ratio test of a submodel. For unscored data, analysis of Differential Distractor Functioning (DDF) based on a multinomial regression model is offered to provide a closer look at individual item options (distractors). Features and options are demonstrated on three data sets. The package is designed to correspond to the difR package (one of the most used R libraries in DIF detection, see Magis, Béland, Tuerlinckx, & De Boeck (2010)) and is currently used by ShinyItemAnalysis (Martinková, Drabinová, Leder, & Houdek, 2017), which provides a graphical interface offering detailed analysis of educational and psychological tests.
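A hedged sketch of the intended usage; the GMAT example data, its column layout, and the difNLR() arguments are assumptions based on the package documentation.

```r
# DIF detection with difNLR on its bundled GMAT data; dataset structure and
# argument names are assumptions and should be checked against the docs.
library(difNLR)
data(GMAT)

items <- GMAT[, 1:20]       # dichotomously scored items
group <- GMAT[, "group"]    # 0 = reference group, 1 = focal group

fit <- difNLR(Data = items, group = group, focal.name = 1, model = "3PLcg")
fit                          # items flagged as functioning differentially
plot(fit, item = 1)          # characteristic curves for both groups
```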

References
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. https://doi.org/10.3758/BRM.42.3.847

Martinková, P., Drabinová, A., Leder, O., & Houdek, J. (2017). ShinyItemAnalysis: Test and item analysis via shiny. Retrieved from shiny.cs.cas.cz/ShinyItemAnalysis/; https://CRAN.R-project.org/package=ShinyItemAnalysis

Martinková, P., Drabinová, A., Liaw, Y.-L., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE-Life Sciences Education, 16(2). https://doi.org/10.1187/cbe.16-10-0307

McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinková, P., Cliff, W., Michael, J., … Wright, A. (2017). Development and validation of the homeostasis concept inventory. CBE-Life Sciences Education, 16(2). https://doi.org/10.1187/cbe.16-10-0305



Thursday July 6, 2017 11:18am - 11:36am CEST
4.01 Wild Gallery

11:18am CEST

Quantitative fisheries advice using R and FLR
Keywords: Quantitative Fisheries Science, Common Fisheries Policy, Management Strategy Evaluation, advice, simulation
Webpages: https://flr-project.org, https://github.com/flr
The management of the activities of fishing fleets aims at ensuring the sustainable exploitation of the ocean’s living resources, the provision of important food resources to humankind, and the profitability of an industry that is an important economic and social activity in many areas of Europe and elsewhere. These are the principles of the European Union Common Fisheries Policy (CFP), which has driven the management of Europe’s fisheries resources since 1983.
Quantitative scientific advice is at the heart of fisheries management regulations, providing estimates of the likely current and future status of fish stocks through statistical population models, termed stock assessments, but also probabilistic comparisons of the expected effects of alternative management procedures. Management Strategy Evaluation (MSE) uses stochastic simulation to incorporate both the inherent variability of natural systems, and our limited ability to model their dynamics, into analyses of the expected effects of a given management intervention on the sustainability of both fish stocks and fleets.
The Fishery Library in R (FLR) project has been for the last ten years building an extensible toolset of statistical and simulation methods for quantitative fisheries science (Kell et al. 2007), with the overarching objective of enabling fisheries scientists to carry out analyses of management procedures in a simplified and robust manner through the MSE approach.
FLR has become widely used in many of the scientific bodies providing fisheries management advice, both in Europe and elsewhere. The evaluation of the effects of some elements of the revised CFP, the analysis of the proposed fisheries management plans for the North Sea, or the comparison of management strategies for Atlantic tuna stocks, among others, have used the FLR tools to advice managers of the possible courses of action to favour the sustainable use of many marine fish stocks.
The FLR toolset is currently composed of 20 packages, covering the various steps in the fisheries advice and simulation workflow. They include a large number of S4 classes, and more recently Reference Classes, to model the data structures that represent each of the elements in the fisheries system. Class inheritance and method overloading are essential tools that have allowed the FLR packages to interact, complement and enrich each other, while still limiting the number of functions a user needs to be aware of. Methods also exist that make use of R’s parallelization facilities and of compiled code to deal with complex computations. Statistical models have also been implemented, making use of both R’s capabilities and external libraries for Automatic Differentiation.
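The S4 pattern underlying this design can be sketched in plain R; the class below is a toy stand-in, not an actual FLR class.

```r
# Toy illustration of the S4 class + method-overloading pattern FLR relies on;
# "Stock" is a made-up class, not part of the FLR toolset.
setClass("Stock", representation(name = "character", catch = "numeric"))

setMethod("summary", "Stock", function(object, ...) {
  cat(object@name, ": mean catch =", mean(object@catch), "\n")
})

herring <- new("Stock", name = "Herring", catch = c(120, 135, 110))
summary(herring)   # dispatches to the Stock-specific method
```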
We present the current status of FLR, the new developments taking place, and the challenges faced in the development of a collection of packages based on S4 classes and methods.
References Kell, L. T., I. Mosqueira, P. Grosjean, J.-M. Fromentin, D. Garcia, R. Hillary, E. Jardim, et al. 2007. “FLR: An Open-Source Framework for the Evaluation and Development of Management Strategies.” ICES Journal of Marine Science 64 (4). http://dx.doi.org/10.1093/icesjms/fsm012.




Speakers

Finlay Scott

Joint Research Centre, European Commission



Thursday July 6, 2017 11:18am - 11:36am CEST
PLENARY Wild Gallery

11:18am CEST

Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in *R*
Keywords: clustered data, clustered covariance matrix estimators, object-orientation, simulation, R
Webpages: http://R-forge.R-project.org/projects/sandwich/
Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, or other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variation in clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g. for zero-inflated, censored, or limited responses).
In R, the sandwich package (Zeileis 2004; Zeileis 2006) provides an object-oriented approach to “robust” covariance matrix estimation based on methods for two generic functions (estfun() and bread()). Using this infrastructure, sandwich covariances for cross-section or time series data have been available for models beyond lm() or glm(), e.g., for packages MASS, pscl, countreg, betareg, among many others. However, corresponding functions for clustered or panel data have been somewhat scattered or available only for certain modeling functions. This shortcoming has been corrected in the development version of sandwich on R-Forge. Here, we introduce this new object-oriented implementation of clustered and panel covariances and assess the methods’ performance in a simulation study.
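A minimal sketch of the new interface, using vcovCL() from the development version described above on a simulated clustered data set:

```r
# Clustered covariance for a glm via the object-oriented sandwich interface
library(sandwich)
library(lmtest)

set.seed(1)
d <- data.frame(y = rbinom(200, 1, 0.4),
                x = rnorm(200),
                firm = rep(1:20, each = 10))   # cluster identifier

m  <- glm(y ~ x, family = binomial, data = d)
vc <- vcovCL(m, cluster = d$firm)   # clustered covariance matrix

coeftest(m, vcov = vc)              # inference adjusted for clustering
```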
References Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10): 1–17. http://www.jstatsoft.org/v11/i10/.

———. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9): 1–16. http://www.jstatsoft.org/v16/i09/.




Speakers


Thursday July 6, 2017 11:18am - 11:36am CEST
3.01 Wild Gallery
  Talk, Methods I

11:36am CEST

**BradleyTerryScalable**: Ranking items scalably with the Bradley-Terry model
Keywords: Citation data, Directed network, Paired comparisons, Quasi-symmetry, Sparse matrices
Webpage: https://github.com/EllaKaye/BradleyTerryScalable
Motivated by the analysis of large-scale citation networks, we implement the familiar Bradley-Terry model (Zermelo 1929; Bradley and Terry 1952) in such a way that it can be applied, with relatively modest memory and execution-time requirements, to pair-comparison data from networks with large numbers of nodes. This provides a statistically principled method of ranking a large number of objects, based only on paired comparisons.
The BradleyTerryScalable package complements the existing CRAN package BradleyTerry2 (Firth and Turner 2012) by permitting a much larger number of objects to be compared. In contrast to BradleyTerry2, the new BradleyTerryScalable package implements only the simplest, ‘unstructured’ version of the Bradley-Terry model. The new package leverages functionality in the additional R packages igraph (Csardi and Nepusz 2006), Matrix (Bates and Maechler 2017) and Rcpp (Eddelbuettel 2013) to provide flexibility in model specification (whole-network versus disconnected cliques) as well as memory efficiency and speed. The Bayesian approach of Caron and Doucet (2012) is provided as an optional alternative to maximum likelihood, in order to allow whole-network ranking even when the network of paired comparisons is not fully connected.
The BradleyTerryScalable package can readily handle data from directed networks with many thousands of nodes. The use of the Bradley-Terry model to produce a ranking from citation data was originally advocated in Stigler (1994), and was studied in detail more recently in Varin, Cattelan, and Firth (2016); here we will illustrate its use with a large-scale network of inter-company patent citations.
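A hedged sketch of the intended workflow on a toy wins matrix; the btdata()/btfit() names follow the package README as recalled and should be treated as assumptions.

```r
# Fitting a Bradley-Terry model to a small wins matrix; function and argument
# names are assumptions based on the package README.
library(BradleyTerryScalable)

wins <- matrix(c(0, 3, 1,
                 2, 0, 4,
                 5, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("item", 1:3), paste0("item", 1:3)))

bt_dat <- btdata(wins)        # build the paired-comparison data object
fit <- btfit(bt_dat, a = 1)   # a = 1: maximum likelihood; a > 1: Bayesian MAP
summary(fit)
```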
References Bates, Douglas, and Martin Maechler. 2017. “Matrix: Sparse and Dense Matrix Classes and Methods.” R Package Version 1.2-8. http://cran.r-project.org/package=Matrix.

Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of Incomplete Block Designs: I. the Method of Paired Comparisons.” Biometrika 39: 324–45.

Caron, François, and Arnaud Doucet. 2012. “Efficient Bayesian Inference for Generalized Bradley–Terry Models.” Journal of Computational and Graphical Statistics 21: 174–96.

Csardi, Gabor, and Tamas Nepusz. 2006. “The igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. http://igraph.org.

Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. New York: Springer.

Firth, David, and Heather L Turner. 2012. “Bradley-Terry Models in R: The BradleyTerry2 Package.” Journal of Statistical Software 48 (9). http://www.jstatsoft.org/v48/i09.

Stigler, Stephen M. 1994. “Citation Patterns in the Journals of Statistics and Probability.” Statistical Science, 94–108.

Varin, Cristiano, Manuela Cattelan, and David Firth. 2016. “Statistical Modelling of Citation Exchange Between Statistics Journals.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 179: 1–63.

Zermelo, Ernst. 1929. “Die Berechnung Der Turnier-Ergebnisse Als Ein Maximumproblem Der Wahrscheinlichkeitsrechnung.” Mathematische Zeitschrift 29: 436–60.




Speakers

Ella Kaye

Ms, University of Warwick
Ella is a Research Software Engineer in the Department of Statistics at the University of Warwick, UK. She works to increase sustainability and EDI (Equality, Diversity and Inclusion) in the R Project. She also runs rainbowR, a community that supports, promotes and connects LGBTQ...



Thursday July 6, 2017 11:36am - 11:54am CEST
4.01 Wild Gallery

11:36am CEST

*jamovi*: a spreadsheet for R
Keywords: Spreadsheet, User-interface, Learning R
Webpages: https://www.jamovi.org, https://CRAN.R-project.org/package=jmv
In spite of the availability of the powerful and sophisticated R ecosystem, spreadsheets such as Microsoft Excel remain ubiquitous within the business community, and spreadsheet-like software, such as SPSS, continues to be popular in the sciences. This likely reflects that for many people the spreadsheet paradigm is familiar and easy to grasp.
The jamovi project aims to make R and its ecosystem of analyses accessible to this large body of users. jamovi provides a familiar, attractive, interactive spreadsheet with the usual spreadsheet features: data-editing, filtering, sorting, and real-time recomputation of results. Significantly, all analyses in jamovi are powered by R, and are available from CRAN. Additionally, jamovi can be placed in ‘syntax mode’, where the underlying R code for each analysis is produced, allowing for a seamless transition to an interactive R session.
We believe that jamovi represents a significant opportunity for the authors of R packages. With some small modifications, an R package can be augmented to run inside of jamovi, allowing R packages to be driven by an attractive user-interface (in addition to the normal R environment). This makes R packages accessible to a much larger audience, and at the same time provides a clear pathway for users to migrate from a spreadsheet to R scripting.
This talk introduces jamovi, introduces its user-interface and feature set, and demonstrates the ease with which R packages can be augmented to additionally support the interactive spreadsheet paradigm.
jamovi is available from www.jamovi.org
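The same analyses can be driven either from the jamovi interface or directly from R through the jmv package; a minimal sketch (argument names as recalled from the jmv documentation, so treat them as assumptions):

```r
# A jamovi analysis called from R via jmv; this is the call that jamovi's
# syntax mode would expose. Argument names are assumptions.
library(jmv)

descriptives(data = iris,
             vars = c("Sepal.Length", "Sepal.Width"),
             splitBy = "Species")
```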

Speakers


Thursday July 6, 2017 11:36am - 11:54am CEST
PLENARY Wild Gallery

11:36am CEST

Biosignature-Based Drug Design: from high dimensional data to business impact
Keywords: biosignatures, machine learning, drug design, data fusion, high-throughput screening
Webpages: https://www.openanalytics.eu/
For decades, high throughput screening of chemical compounds has played a central role in drug design. In general, such screens were only affordable if they had a narrow biological scope (e.g., compound activity on an isolated protein target). In recent years, screening techniques have become available that combine a high throughput with a high dimensional readout and a complex biological context (e.g., cell culture). Examples are high content imaging and L1000 transcriptomics. In addition, due to state-of-the-art machine learning methods (Unterthiner et al. 2014) and high performance computing (Harnie et al. 2016) it has become possible to benefit from such high dimensional biological data on an enterprise scale. Together, these advances enable Biosignature-Based Drug Design, a paradigm that will dramatically change pharmaceutical research.
A software pipeline, mainly built in R and C++, allows us to support Biosignature-Based Drug Design in an enterprise setting. It is worth noting that dealing with multiple data sets of this scale and complexity is non-trivial and challenging. With our pipeline, we tailor generic methods to the needs of specific projects in diverse therapeutic areas. This operational application goes hand in hand with an ongoing effort –together with academic partners– to improve and extend our workflow.
We will show use cases in which Biosignature-Based Drug Design has increased the effectiveness and cost-efficiency of high throughput screens by repurposing historic data (Simm et al. 2017). Moreover, integrating multiple data sources makes it possible to take into account a broader biological context, rather than a single mode of action. This will yield a better understanding of on- and off-target effects. Ultimately, this may reduce failure rates for drug candidates in clinical trials.
Acknowledgements This work was supported by research grants IWT130405 ExaScience Life Pharma and IWT150865 Exaptation from the Flanders Innovation and Entrepreneurship agency.

References Harnie, D., M. Saey, A. E. Vapirev, J.K. Wegner, A. Gedich, M.N. Steijaert, H. Ceulemans, R. Wuyts, and W. De Meuter. 2016. “Scaling Machine Learning for Target Prediction in Drug Discovery Using Apache Spark.” Future Generation Computer Systems.

Simm, J., G. Klambauer, A. Arany, M.N. Steijaert, J.K. Wegner, E. Gustin, V. Chupakhin, et al. 2017. “Repurposed High-Throughput Images Enable Biological Activity Prediction for Drug Discovery.” bioRxiv.

Unterthiner, T., A. Mayr, G. Klambauer, M.N. Steijaert, H. Ceulemans, J.K. Wegner, and S. Hochreiter. 2014. “Deep Learning as an Opportunity in Virtual Screening.” In Workshop on Deep Learning and Representation Learning (Nips 2014).




Speakers

Marvin Steijaert

Consultant, Open Analytics
Data science, Machine learning, Systems biology, Computational biology, Bioinformatics



Thursday July 6, 2017 11:36am - 11:54am CEST
2.01 Wild Gallery

11:36am CEST

codebookr: Codebooks in *R*
Keywords: code book, data dictionary, data cleaning, validation, automation
Webpages: https://github.com/petebaker/codebookr, https://github.com/ropensci/auunconf/issues/46
codebookr is an R package under development to automate cleaning, checking and formatting data using metadata from Codebooks or Data Dictionaries. It is primarily aimed at epidemiological research and medical studies but can be easily used in other research areas.
Researchers collecting primary, secondary or tertiary data from RCTs or government and hospital administrative systems often have different data documentation and data cleaning needs to those scraping data off the web or collecting in-house data for business analytics. However, all studies will benefit from using codebooks which comprehensively document all study variables including derived variables. Codebooks document data formats, variable names, variable labels, factor levels, valid ranges for continuous variables, details of measuring instruments and so on.
For statistical consultants, each new data set has a new codebook. While statisticians may get a photocopied codebook or pdf, my preference is a spreadsheet so that the metadata can be used directly. Many data analysts are happy to use this metadata to code syntax to read, clean and check data. I prefer to automate this process by reading the codebook into R and then using the metadata directly for data checking, cleaning, factor level definitions.
While there is considerable interest in data wrangling and cleaning (Jonge and Loo 2013; Wickham 2014; Fischetti 2017), there appear to be few tools available to read codebooks (see http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html) and even fewer to automatically apply the metadata to datasets.
We outline the fundamentals of codebookr and demonstrate its use with examples of research projects undertaken at the University of Queensland's School of Public Health.
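The general mechanism can be sketched without codebookr's final API, which is still under development: read the codebook spreadsheet, then apply its metadata to the raw data. The file name and codebook columns below are illustrative.

```r
# Generic sketch of applying codebook metadata to a data frame; codebookr's
# actual interface may differ. "codebook.xlsx" and its columns are illustrative.
library(readxl)

codebook <- read_excel("codebook.xlsx")
# assumed columns: variable, type, levels (semicolon-separated), min, max

apply_codebook <- function(data, codebook) {
  for (i in seq_len(nrow(codebook))) {
    v <- codebook$variable[i]
    if (codebook$type[i] == "factor") {
      data[[v]] <- factor(data[[v]],
                          levels = strsplit(codebook$levels[i], ";")[[1]])
    } else if (codebook$type[i] == "numeric") {
      bad <- data[[v]] < codebook$min[i] | data[[v]] > codebook$max[i]
      data[[v]][bad] <- NA   # flag out-of-range values as missing
    }
  }
  data
}
```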
References Fischetti, Tony. 2017. Assertr: Assertive Programming for R Analysis Pipelines. https://CRAN.R-project.org/package=assertr.

Jonge, Edwin de, and Mark van der Loo. 2013. “An Introduction to Data Cleaning with R.” Technical Report 201313. Statistics Netherlands. http://cran.vinastat.com/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.




Speakers


Thursday July 6, 2017 11:36am - 11:54am CEST
2.02 Wild Gallery

11:36am CEST

factorMerger: a set of tools to support results from post hoc testing

ANOVA-like statistical tests for differences among groups have been available for almost a hundred years. But for a large number of groups, the results from commonly used post-hoc tests are often hard to interpret. To deal with this problem, the factorMerger package constructs and plots the hierarchical relation among compared groups. Such a hierarchical structure is derived based on the Likelihood Ratio Test and is presented with the Merging Paths Plots created with the ggplot2 package. The current implementation handles one-dimensional and multi-dimensional Gaussian models as well as binomial and survival models. This article presents the theory and examples for single-factor use cases.
Package webpage: https://github.com/geneticsMiNIng/FactorMerger
Keywords: analysis of variance (ANOVA), hierarchical clustering, likelihood ratio test (LRT), post hoc testing
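A hedged sketch of the single-factor Gaussian case; mergeFactors() is the entry point named in the package documentation, and its arguments here are assumptions.

```r
# Merging-path analysis for a one-way Gaussian model with factorMerger;
# treat the exact mergeFactors() arguments as assumptions.
library(factorMerger)

set.seed(1)
response <- rnorm(120, mean = rep(c(0, 0.1, 1, 1.1), each = 30))
group <- factor(rep(paste0("g", 1:4), each = 30))

fm <- mergeFactors(response, group, family = "gaussian")
plot(fm)   # Merging Paths Plot built with ggplot2
```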

Speakers

Agnieszka Sitko

Data Scientist, Warsaw University of Technology



Thursday July 6, 2017 11:36am - 11:54am CEST
3.01 Wild Gallery

11:36am CEST

Scraping data with rvest and purrr
Keywords: rvest, purrr, webscraping, fantasy, sports
Webpages: http://www.maxhumber.com
Really interesting data never actually lives inside of a tidy csv. Unless, of course, you think Iris or mtcars is super interesting. Interesting data lives outside of comma separators. It’s unstructured, and messy, and all over the place. It lives around us and on poorly formatted websites, just waiting and begging to be played with.
Finding and fetching and cleaning your own data is a bit like cooking a meal from scratch—instead of microwaving a frozen TV dinner. Microwaving food is simple. It’s literally one step: put thing in microwave. There is, however, no singular step to making a proper meal from scratch. Every meal is different. The recipe for making coconut curry isn’t the same as the recipe for Brussels sprout tacos. But both require a knife and a frying pan!
In “Scraping data with rvest and purrr” I will talk through how to pair and combine rvest (the knife) and purrr (the frying pan) to scrape interesting data from a bunch of websites. This talk is inspired by a recent blog post that I authored for and was well received by the r-bloggers.com community.
rvest is a popular R package that makes it easy to scrape data from html web pages.
purrr is a relatively new package that makes it easy to write code for a single element of a list that can be quickly generalized to the rest of that same list.
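A minimal sketch of the pairing; the URLs and the CSS selector are placeholders for the fantasy-sports pages scraped in the talk.

```r
# Scrape the same table from several pages and row-bind the results;
# URLs and the CSS selector are placeholders.
library(rvest)
library(purrr)

urls <- c("https://example.com/players?page=1",
          "https://example.com/players?page=2")

scrape_page <- function(url) {
  read_html(url) %>%
    html_node("table") %>%   # the knife: carve out one element
    html_table()
}

players <- map_df(urls, scrape_page)   # the frying pan: generalise over the list
```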

Speakers


Thursday July 6, 2017 11:36am - 11:54am CEST
4.02 Wild Gallery
  Talk, Web

11:54am CEST

Estimating the Parameters of a Continuous-Time Markov Chain from Discrete-Time Data with ctmcd
Keywords: Embedding Problem, Generator Matrix, Continuous-Time Markov Chain, Discrete-Time Markov Chain
Webpages: https://CRAN.R-project.org/package=ctmcd
The estimation of the parameters of a continuous-time Markov chain from discrete-time data is an important statistical problem which occurs in a wide range of applications: e.g., with the analysis of gene sequence data, for causal inference in epidemiology, for describing the dynamics of open quantum systems in physics, or in rating based credit risk modeling to name only a few.
The parameters of a continuous-time Markov chain are called the generator matrix (also: transition rate matrix or intensity matrix) and the issue of estimating generator matrices from discrete-time data is also known as the embedding problem for Markov chains. For dealing with this missing data situation, a variety of estimation approaches have been developed. These comprise adjustments of matrix logarithm based candidate solutions of the aggregated discrete-time data, see (Israel, Rosenthal, and Wei 2001) or (Kreinin and Sidelnikova 2001). Moreover, likelihood inference can be conducted by an instance of the expectation-maximization (EM) algorithm and Bayesian inference by a Gibbs sampling procedure based on the conjugate gamma prior distribution (Bladt and Sørensen 2005).
The R package ctmcd (Pfeuffer 2016) is the first publicly available implementation of the approaches listed above. Besides point estimates of generator matrices, the package also contains methods to derive confidence and credibility intervals. The capabilities of the package are illustrated using Standard & Poor’s discrete-time credit rating transition data. Moreover, methodological issues of the described approaches are discussed, i.e., the derivation of the conditional expectations of the E-Step in the EM algorithm and the sampling of endpoint-conditioned continuous-time Markov chain trajectories for the Gibbs sampler.
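A hedged sketch of the gm() interface described in Pfeuffer (2016); the argument names are taken from that paper as recalled, so treat them as assumptions.

```r
# Estimating a generator matrix from a one-year discrete-time transition matrix;
# gm() arguments follow Pfeuffer (2016) and should be verified against the docs.
library(ctmcd)

# Relative (row-stochastic) one-year transition matrix between three states
tm_rel <- matrix(c(0.90, 0.08, 0.02,
                   0.05, 0.85, 0.10,
                   0.00, 0.04, 0.96), nrow = 3, byrow = TRUE)

fit_da <- gm(tm = tm_rel, te = 1, method = "DA")   # diagonal adjustment estimate
fit_da
```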
References Bladt, M., and M. Sørensen. 2005. “Statistical Inference for Discretely Observed Markov Jump Processes.” Journal of the Royal Statistical Society B.

Israel, R. B., J. S. Rosenthal, and J. Z. Wei. 2001. “Finding Generators for Markov Chains via Empirical Transition Matrices, with Applications to Credit Ratings.” Mathematical Finance.

Kreinin, A., and M. Sidelnikova. 2001. “Regularization Algorithms for Transition Matrices.” Algo Research Quarterly.

Pfeuffer, M. 2016. “ctmcd: An R Package for Estimating the Parameters of a Continuous-Time Markov Chain from Discrete-Time Data.” In Revision (the R Journal).




Speakers


Thursday July 6, 2017 11:54am - 12:12pm CEST
3.01 Wild Gallery
  Talk, Methods I

11:54am CEST

Interactive and Reproducible Research for RNA Sequencing Analysis
Keywords: Shiny, microbiome, sequencing, ecology, 16S rRNA
Webpages: https://acnc-shinyapps.shinyapps.io/DAME/, https://github.com/bdpiccolo/ACNC-DAME
A new renaissance in knowledge about the role of commensal microbiota in health and disease is well underway facilitated by culture-independent sequencing technologies; however, microbial sequencing data poses new challenges (e.g., taxonomic hierarchy, overdispersion) not generally seen in more traditional sequencing outputs. Additionally, complex study paradigms from clinical or basic research studies necessitate a multilayered analysis pipeline that can seamlessly integrate both primary bioinformatics and secondary statistical analysis combined with data visualization.
In order to address this need, we created a web-based Shiny app, titled DAME, which allows users not familiar with R programming to import, filter, and analyze microbial sequencing data from experimental studies. DAME only requires two files (a BIOM file with sequencing reads combined with taxonomy details, and a csv file containing experimental metadata), which upon upload will trigger the app to render a linear work-flow controlled by the user. Currently, DAME supports group comparisons of several ecological estimates of α-diversity (ANOVA) and β-diversity indices (ordinations and PERMANOVA). Additionally, pairwise differential comparisons of operational taxonomic units (OTUs) using Negative Binomial Regression at all taxonomic levels can be performed. All analyses are accompanied by dynamic graphics and tables for complete user interactivity. DAME leverages functions derived from the phyloseq, vegan, and DESeq2 packages for microbial data organization and analysis, and DT, highcharter*, and scatterD3 for table and plot visualizations. Downloadable options for α-diversity measurements and DESeq2 table outputs are also provided.
The current release (v0.1) is available online (https://acnc-shinyapps.shinyapps.io/DAME/) and in the Github repository (https://github.com/bdpiccolo/ACNC-DAME). *This app uses Highsoft software with non-commercial packages. Highsoft software product is not free for commercial use. Funding supported by United States Department of Agriculture-Agricultural Research Service Project: 6026-51000-010-05S.
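The snippet below is not taken from the DAME source; it is only a minimal sketch of the kind of workflow the app wraps, with hypothetical input file names and a hypothetical Treatment column in the metadata.

```r
## Not DAME code: a sketch of the underlying analysis steps, assuming the
## hypothetical files "otu_table.biom" and "metadata.csv" with a Treatment column.
library(phyloseq)
library(vegan)
library(DESeq2)

physeq <- import_biom("otu_table.biom")            # reads + taxonomy
meta   <- read.csv("metadata.csv", row.names = 1)  # experimental metadata
sample_data(physeq) <- sample_data(meta)

# alpha-diversity and a group comparison (ANOVA)
alpha <- estimate_richness(physeq, measures = c("Observed", "Shannon"))
summary(aov(alpha$Shannon ~ meta$Treatment))

# beta-diversity: Bray-Curtis ordination and PERMANOVA
ord <- ordinate(physeq, method = "PCoA", distance = "bray")
adonis(phyloseq::distance(physeq, "bray") ~ Treatment, data = meta)

# differential OTU abundance via negative binomial regression
dds <- phyloseq_to_deseq2(physeq, ~ Treatment)
dds <- DESeq(dds)
head(results(dds))
```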



Speakers

Thursday July 6, 2017 11:54am - 12:12pm CEST
2.01 Wild Gallery

11:54am CEST

IRT test equating with the R package equateIRT
Keywords: Equating, Item Response Theory, Multiple Forms, Scoring, Testing.
Webpages: https://CRAN.R-project.org/package=equateIRT
In many testing programs, security reasons require that test forms are composed of different items, making test scores not comparable across different administrations. The equating process aims to provide comparable test scores. This talk focuses on Item Response Theory (IRT) methods for dichotomous items. In IRT models, the probability of a correct response depends on the latent trait under investigation and on the item parameters. Due to identifiability issues, the latent variable is usually assumed to have zero mean and variance equal to one. Hence, when the model is fitted separately for different groups of examinees, the item parameter estimates are expressed on different measurement scales. The scale conversion can be achieved by applying a linear transformation of the item parameters, and the coefficients of this equation are called equating coefficients. This talk explains the functionalities of the R package equateIRT (Battauz 2015), which implements the estimation of the equating coefficients and the computation of the equated scores. Direct equating coefficients between pairs of forms that share some common items can be estimated using the mean-mean, mean-geometric mean, mean-sigma, Haebara and Stocking-Lord methods. However, the linkage plans are often quite complex, and not all forms can be linked directly. As proposed in Battauz (2013), the package also computes the indirect equating coefficients for a chain of forms and the average equating coefficients when two forms can be linked through more than one path. Using the equating coefficients so obtained, the item parameter estimates are converted to a common metric and it is possible to compute comparable scores. For this task, the package implements the true score equating and the observed score equating methods. Standard errors of the equating coefficients and the equated scores are also provided.
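A hedged sketch of this workflow is shown below; the dataset name, form handling and argument names are taken from Battauz (2015) and the package documentation as recalled, and may differ across package versions.

```r
## A hedged sketch following the workflow in Battauz (2015); dataset and
## argument names below are assumptions and may have changed in later versions.
library(equateIRT)
data(est2pl)   # 2PL item parameter estimates and covariance matrices for several forms

mods <- modIRT(coef = est2pl$coef, var = est2pl$var, display = FALSE)

# direct equating coefficients for all pairs of forms sharing common items
direclist <- alldirec(mods = mods, method = "mean-mean")

# indirect (chain) equating coefficients along paths of three forms,
# then equated scores (true score equating) on the common metric
chained <- chainec(r = 3, direclist = direclist)
score(chained, method = "TSE")
```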
References Battauz, Michela. 2013. “IRT Test Equating in Complex Linkage Plans.” Psychometrika 78 (3): 464–80. doi:10.1007/s11336-012-9316-y.

———. 2015. “EquateIRT: An R Package for Irt Test Equating.” Journal of Statistical Software 68 (1): 1–22. doi:10.18637/jss.v068.i07.




Speakers
avatar for Michela Battauz

Michela Battauz

Associate Professor, University of Udine



Thursday July 6, 2017 11:54am - 12:12pm CEST
4.01 Wild Gallery

11:54am CEST

jug: Building Web APIs for R
Keywords: REST, API, web, http
Webpages: https://CRAN.R-project.org/package=jug, https://github.com/Bart6114/jug
jug is a web framework for R. The framework helps to easily set up API endpoints. Its main goal is to make building and configuration of web APIs as easy as possible, while still allowing in-depth control over HTTP request processing when needed.
A jug instance allows one to expose solutions developed in R to the web and/or to applications communicating over HTTP. This way, other applications can gain access to, e.g., custom R plotting functions, or generate new predictions based on a trained machine learning model.
jug is built upon httpuv, which results in a stable and robust back-end. Recently, endeavors have been made to allow a jug instance to process requests in parallel. The GitHub repository includes a Dockerfile to ease productionisation and containerisation of a jug instance.
During this talk, a tangible reproducible example of creating an API based on a machine learning model will be presented and some of the challenges and experiences in exposing R based results through an API will be discussed.
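As a hedged sketch of such an endpoint (the model file and the input field name are hypothetical, and the wrapper functions are used as described in the package README):

```r
## A minimal sketch of a jug endpoint serving predictions from a pre-trained model.
library(jug)

model <- readRDS("model.rds")   # hypothetical: a previously fitted model object

jug() %>%
  post("/predict", decorate(function(x) {
    # 'x' arrives as a request parameter (character); coerce and predict
    as.character(predict(model, newdata = data.frame(x = as.numeric(x))))
  })) %>%
  simple_error_handler_json() %>%
  serve_it(host = "0.0.0.0", port = 8080)   # blocks and serves requests
```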

Speakers

Thursday July 6, 2017 11:54am - 12:12pm CEST
4.02 Wild Gallery
  Talk, Web

11:54am CEST

Show me the errors you didn't look for
Keywords: Data cleaning, Quality control, Reproducible research, Data validation
Webpages: https://CRAN.R-project.org/package=dataMaid, https://github.com/ekstroem/dataMaid
The inability to replicate scientific studies has washed over many scientific fields in the last couple of years with potentially grave consequences. We need to give this problem its due diligence: Extreme care is needed when considering the representativeness of the data, and when we convey reproducible research information. We should not just document the statistical analyses and the data but also the exact steps that were part of the data cleaning process, so we know which potential errors we are unlikely to identify in the data.
Data cleaning and validation are the first steps in any data analysis, since the validity of the conclusions from the statistical analysis hinges on the quality of the input data. Mistakes in the data arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. Ideally, a human investigator should go through each variable in the dataset and look for potential errors — both in input values and codings — but that process can be very time-consuming, expensive and error-prone in itself.
We present the R package dataMaid which implements an extensive and customizable suite of quality assessment tools to identify and document potential problems in the variables of a dataset. The results can be presented in an auto-generated, non-technical, stand-alone overview document intended to be perused by an investigator with an understanding of the variables in the dataset, but not necessarily knowledge of R. Thereby, dataMaid aids the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control. dataMaid also provides a suite of more typical R tools for interactive data quality assessment and cleaning.
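A minimal sketch of both modes of use is given below; makeDataReport() is the current name of the report generator (earlier releases exposed similar functionality under a different function name), and the toy data frame is made up for illustration.

```r
## A minimal sketch on a made-up data frame with typical problems.
library(dataMaid)

df <- data.frame(id    = 1:10,
                 group = c(rep("a", 5), rep("A", 4), NA),   # inconsistent coding + NA
                 value = c(rnorm(9), 999))                  # suspicious outlier

# auto-generate a stand-alone, non-technical data screening report
makeDataReport(df, replace = TRUE)

# interactive use: run the checks for a single variable
check(df$group)
```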

Speakers


Thursday July 6, 2017 11:54am - 12:12pm CEST
2.02 Wild Gallery

11:54am CEST

The growing popularity of R in data journalism
Online presentation: https://goo.gl/pF9bKU


In this talk, Timo Grossenbacher, data journalist at Swiss Public Broadcast and creator of Rddj.info, will show that R is becoming more and more popular among a new community: data journalists. He will showcase some innovative work that has been done with R in the field of data journalism, both by his own team and by other media outlets all over the world. At the same time, he will point out the strengths (reproducibility, for example) and hurdles (having to learn to code) of using R for a typical data journalism workflow – a workflow that is often centered around quick, exploratory data analysis rather than statistical modeling. During the talk, he will also point out and critically discuss packages that are of particular help for journalists, such as the tidyverse, readxl and googlesheets packages.

Speakers
avatar for Timo Grossenbacher

Timo Grossenbacher

Project Manager «Automated Journalism», Tamedia
Timo Grossenbacher (1987) has been responsible for projects in the area of «Automated Journalism» at Tamedia since summer 2020. Before that, he worked for more than five years as a data journalist at Swiss Radio and Television. He studied geography and computer science at the University of Zurich and... Read More →


Thursday July 6, 2017 11:54am - 12:12pm CEST
PLENARY Wild Gallery

12:12pm CEST

Automatically archiving reproducible studies with Docker
Keywords: Docker, Reproducible Research, Open Science
Webpage: https://github.com/o2r-project/containerit/
Reproducibility of computations is crucial in an era where data is born digital and analysed algorithmically. Most studies, however, only publish the results, often with figures as important interpreted outputs. But where do these figures come from? Scholarly articles must not only provide a description of the work but also be accompanied by data and software. R offers excellent tools for creating reproducible works, e.g. Sweave and RMarkdown. Several approaches to capturing the workspace environment in R have been developed, working around CRAN’s deliberate choice not to provide explicit versioning of packages and their dependencies. They preserve a collection of packages locally (packrat, pkgsnap, switchr/GRANBase) or remotely (MRAN timemachine/checkpoint), or install specific versions from CRAN or source (requireGitHub, devtools). Installers for old versions of R are archived on CRAN. A user can manually re-create a specific environment, but this is a cumbersome task.
We introduce a new possibility to preserve a runtime environment, including both packages and R itself, by adding an abstraction layer in the form of a container, which can execute a script or run an interactive session. The package containeRit automatically creates such containers based on Docker. Docker is a solution for packaging an application and its dependencies, and it has also proved useful in the context of reproducible research (Boettiger 2015). The package creates a container manifest, the Dockerfile, which is usually written by hand, from sessionInfo(), R scripts, or RMarkdown documents. The Dockerfiles use the Rocker community images as base images. Docker can build an executable image from a Dockerfile. The image is executable anywhere a Docker runtime is present. containeRit uses harbor for building images and running containers, and sysreqs for installing system dependencies of R packages. Before the planned CRAN release we want to share our work, discuss open challenges such as handling linked libraries (see the discussion on geospatial libraries in Rocker), and welcome community feedback.
containeRit is developed within the DFG-funded project Opening Reproducible Research to support the creation of Executable Research Compendia (ERC) (Nüst et al. 2017).
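A minimal sketch, assuming the dockerfile()/write() interface described in the project documentation (the script name is hypothetical):

```r
## A minimal sketch of generating a Dockerfile from the current session or a script.
library(containerit)

# capture the current session (attached packages, R version) ...
df_session <- dockerfile(from = sessionInfo())

# ... or containerise a complete analysis script (hypothetical file name)
df_script <- dockerfile(from = "analysis.R")

print(df_session)                       # inspect the generated Dockerfile
write(df_session, file = "Dockerfile")  # then: docker build -t my-analysis .
```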
References Boettiger, Carl. 2015. “An Introduction to Docker for Reproducible Research, with Examples from the R Environment.” ACM SIGOPS Operating Systems Review 49 (January): 71–79. doi:10.1145/2723872.2723882.

Nüst, Daniel, Markus Konkol, Edzer Pebesma, Christian Kray, Marc Schutzeichel, Holger Przibytzin, and Jörg Lorenz. 2017. “Opening the Publication Process with Executable Research Compendia.” D-Lib Magazine 23 (January). doi:10.1045/january2017-nuest.




Speakers
avatar for Daniel Nüst

Daniel Nüst

researcher, University of Münster
Reproducible Research, R, and Docker. Geo. Open Source.



Thursday July 6, 2017 12:12pm - 12:30pm CEST
2.02 Wild Gallery

12:12pm CEST

FFTrees: An R package to create, visualise and use fast and frugal decision trees
Online presentation: https://ndphillips.github.io/useR2017_pres/

Keywords: decision trees, decision making, package, visualization
Webpages: https://cran.r-project.org/web/packages/FFTrees/, https://rpubs.com/username/project
Many complex real-world problems call for fast and accurate classification decisions. An emergency room physician faced with a patient complaining of chest pain needs to quickly decide if the patient is having a heart attack or not. A lost hiker, upon discovering a patch of mushrooms, needs to decide whether they are safe to eat or are poisonous. A stock portfolio adviser, upon seeing that, at 3:14 am, an influential figure tweeted about a company he is heavily invested in, needs to decide whether to move his shares or sit tight. These decisions have important consequences and must be made under time-pressure with limited information. How can and should people make such decisions? One effective way is to use a fast and frugal decision tree (FFT). FFTs are simple heuristics that allow people to make fast, accurate decisions based on limited information (Gigerenzer and Goldstein 1996; Martignon, Katsikopoulos, and Woike 2008). In contrast to compensatory decision algorithms such as regression, or computationally intensive algorithms such as random forests, FFTs allow people to make fast decisions ‘in the head’ without requiring statistical training or a calculation device. Because they are so easy to implement, they are especially helpful in applied decision domains such as emergency rooms, where people need to be able to make decisions quickly and transparently (Gladwell 2007; Green and Mehr 1997).
While FFTs are easy to implement, actually constructing an effective FFT from data is less straightforward. Although several FFT construction algorithms have been proposed (Dhami and Ayton 2001; Martignon, Katsikopoulos, and Woike 2008; Martignon et al. 2003), none have been programmed and distributed in an easy-to-use and well-documented tool. The purpose of this paper is to fill this gap by introducing FFTrees (Phillips 2016), an R package (R Core Team 2016) that allows anyone to create, evaluate, and visualize FFTs from their own data. The package requires minimal coding, is documented by many examples, and provides quantitative performance measures and visual displays showing exactly how cases are classified at each level in the tree.
This presentation is structured in three sections: Section 1 provides a theoretical background on binary classification decision tasks and explains how FFTs solve them. Section 2 provides a 5-step tutorial on how to use the FFTrees package to construct and evaluate FFTs from data. Finally, Section 3 compares the prediction performance of FFTrees to alternative algorithms such as logistic regression and random forests. To preview our results, we find that trees created by FFTrees are both more efficient than, and as accurate as, the best of these algorithms across a wide variety of applied datasets. Moreover, they produce trees much simpler than those of standard decision tree algorithms such as rpart (Therneau, Atkinson, and Ripley 2015), while maintaining similar prediction performance.
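A minimal sketch using the heart disease data shipped with the package:

```r
## A minimal sketch: build an FFT on training data, evaluate it on test data.
library(FFTrees)

heart_fft <- FFTrees(formula   = diagnosis ~ .,
                     data      = heart.train,   # training data bundled with FFTrees
                     data.test = heart.test)    # hold-out data for prediction accuracy

print(heart_fft)                                # cue ranking and accuracy statistics
plot(heart_fft, data = "test",                  # visualise the tree and its performance
     main = "Heart disease FFT")
```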
References Dhami, Mandeep K, and Peter Ayton. 2001. “Bailing and Jailing the Fast and Frugal Way.” Journal of Behavioral Decision Making 14 (2). Wiley Online Library: 141–68.

Gigerenzer, Gerd, and Daniel G Goldstein. 1996. “Reasoning the Fast and Frugal Way: Models of Bounded Rationality.” Psychological Review 103 (4). American Psychological Association: 650.

Gladwell, Malcolm. 2007. Blink: The Power of Thinking Without Thinking. Back Bay Books.

Green, Lee, and David R Mehr. 1997. “What Alters Physicians’ Decisions to Admit to the Coronary Care Unit?” Journal of Family Practice 45 (3). [New York, Appleton-Century-Crofts]: 219–26.

Martignon, Laura, Konstantinos V Katsikopoulos, and Jan K Woike. 2008. “Categorization with Limited Resources: A Family of Simple Heuristics.” Journal of Mathematical Psychology 52 (6). Elsevier: 352–61.

Martignon, Laura, Oliver Vitouch, Masanori Takezawa, and Malcolm R Forster. 2003. “Naive and yet Enlightened: From Natural Frequencies to Fast and Frugal Decision Trees.” Thinking: Psychological Perspective on Reasoning, Judgment, and Decision Making, 189–211.

Phillips, Nathaniel. 2016. FFTrees: Generate, Visualise, and Compare Fast and Frugal Decision Trees.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Therneau, Terry, Beth Atkinson, and Brian Ripley. 2015. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.





Thursday July 6, 2017 12:12pm - 12:30pm CEST
PLENARY Wild Gallery

12:12pm CEST

MCMC Output Analysis Using R package mcmcse
Markov chain Monte Carlo (MCMC) is a method of producing a correlated sample in order to estimate expectations with respect to a target distribution. A fundamental question is when should sampling stop so that we have good estimates of the desired quantities? The key to answering these questions lies in assessing the Monte Carlo error through a multivariate Markov chain central limit theorem. This talk presents the R package mcmcse, which provides estimators for the asymptotic covariance matrix in the Markov chain CLT. In addition, the package calculates a multivariate effective sample size which can be rigorously used to terminate MCMC simulation. I will present the use of the R package mcmcse to conduct robust, valid, and theoretically just output analysis for Markov chain data.
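A minimal sketch on a toy chain, using only the exported estimators named above (the AR(1) "MCMC output" is simulated for illustration):

```r
## A minimal sketch of mcmcse output analysis on simulated correlated chains.
library(mcmcse)

set.seed(1)
n <- 1e4
# toy "MCMC output": two correlated AR(1) chains as columns
chain <- apply(matrix(rnorm(2 * n), ncol = 2), 2,
               function(z) as.numeric(stats::filter(z, 0.9, method = "recursive")))

mcse(chain[, 1])    # univariate Monte Carlo standard error of the mean
mcse.multi(chain)   # multivariate asymptotic covariance estimate (batch means)
multiESS(chain)     # multivariate effective sample size
minESS(p = 2)       # minimum ESS required -> rigorous termination rule
```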

Speakers
avatar for Dootika Vats

Dootika Vats

University of Warwick



Thursday July 6, 2017 12:12pm - 12:30pm CEST
3.01 Wild Gallery
  Talk, Methods I

12:12pm CEST

Stochastic Gradient Descent Log-Likelihood Estimation in the Cox Proportional Hazards Model with Applications to The Cancer Genome Atlas Data
Online presentation: http://r-addict.com/useR2017/#/

In the last decade, the volume of data has grown faster than the speed of processors. In this situation, statistical machine learning methods have become limited more by computation time than by the volume of the datasets. Compromise solutions for large-scale data must trade off the computational complexity of the optimization methods in a non-trivial way. One such solution is the family of optimization algorithms based on stochastic gradient descent (Bottou (2010), Bottou (2012), Widrow (1960)), which are highly efficient on large-scale data.
In my presentation I will describe a stochastic gradient descent algorithm applied to the log-likelihood estimation of the coefficients of the Cox proportional hazards model. This algorithm can be used successfully in time-to-event analyses in which the number of explanatory variables greatly exceeds the number of observations. The proposed estimation method based on stochastic gradient descent can be applied to survival analyses in areas such as molecular biology, bioinformatic screenings of gene expression, or analyses based on DNA microarrays, which are widely used in clinical diagnostics, treatment and research.
The resulting estimation workflow was a new approach (at the time I wrote my master's thesis), not previously described in the literature. It is robust to the problem of collinear variables and works well when coefficients are continuously updated on streaming data.
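The author's implementation is not shown in the abstract; the sketch below only illustrates a single-pass stochastic gradient ascent on the Cox log partial likelihood (no ties handling, fixed learning rate), on simulated data.

```r
## Not the author's code: an illustrative SGD sketch for the Cox partial likelihood.
set.seed(1)
n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, -1, rep(0, p - 2))
time   <- rexp(n, rate = exp(X %*% beta_true))
status <- rbinom(n, 1, 0.8)                      # 1 = event, 0 = censored

sgd_cox <- function(X, time, status, lr = 0.01, epochs = 5) {
  beta <- rep(0, ncol(X))
  events <- which(status == 1)
  for (e in seq_len(epochs)) {
    for (i in sample(events)) {                  # one stochastic step per event
      risk <- which(time >= time[i])             # risk set at the event time
      w    <- exp(X[risk, , drop = FALSE] %*% beta)
      xbar <- colSums(X[risk, , drop = FALSE] * as.numeric(w)) / sum(w)
      grad <- X[i, ] - xbar                      # gradient of the log partial likelihood
      beta <- beta + lr * grad                   # ascent step
    }
  }
  beta
}

round(sgd_cox(X, time, status), 2)
```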

Speakers

Thursday July 6, 2017 12:12pm - 12:30pm CEST
2.01 Wild Gallery

1:30pm CEST

An Efficient Algorithm for Solving Large Fixed Effects OLS Problems with Clustered Standard Error Estimation
Thomas Balmat and Jerome Reiter, Duke University
Keywords: large data least squares, fixed effects estimation, clustered standard error estimation, sparse matrix methods, high performance computing
Large fixed effects regression problems, involving order 10^7 observations and 10^3 effect levels, present special computational challenges but also a special performance opportunity because of the large proportion of entries in the expanded design matrix (fixed effect levels translated from single columns into dichotomous indicator columns, one for each level) that are zero. For many problems, the proportion of zero entries is above 0.99995, which would be considered sparse. In this presentation, we demonstrate an efficient method for solving large, sparse fixed effects OLS problems without creation of the expanded design matrix and avoiding computations involving zero-level effects. This leads to minimal memory usage and optimal execution time. A feature often desired in social science applications is to estimate parameter standard errors clustered about a key identifier, such as employee ID. For large problems, with ID counts in the millions, this presents a significant computational challenge. We present a sparse matrix indexing algorithm that produces clustered standard error estimates and that, for large fixed effects problems, is many times more efficient than standard “sandwich” matrix operations.
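The authors' algorithm is not reproduced here; the sketch below only illustrates the underlying idea of keeping the indicator-expanded design sparse and solving the normal equations with a sparse Cholesky factorisation, on simulated data.

```r
## Not the authors' implementation: a sparse fixed-effects OLS sketch with Matrix.
library(Matrix)

set.seed(1)
n  <- 1e5                                       # observations (small here; idea scales up)
id <- factor(sample(1e3, n, replace = TRUE))    # a fixed effect with 1e3 levels
x  <- rnorm(n)
y  <- 2 * x + rnorm(1e3)[id] + rnorm(n)

# sparse design: one numeric column plus ~1e3 indicator columns, almost all zeros
X   <- sparse.model.matrix(~ x + id)
XtX <- crossprod(X)                             # stays sparse and symmetric
Xty <- crossprod(X, y)
beta <- solve(Cholesky(XtX), Xty)               # sparse Cholesky solve of the normal equations
beta[1:3, ]                                     # intercept, slope, first id effect
```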

Speakers


Thursday July 6, 2017 1:30pm - 1:48pm CEST
3.01 Wild Gallery

1:30pm CEST

Community-based learning and knowledge sharing
Keywords: Teaching, Knowledge Sharing, Best Practices
Webpage: https://github.com/rOpenGov/edu
R is increasingly used to teach programming, quantitative analytics, and reproducible research practices. Based on our combined experience from universities, research institutes, and the public sector, we summarize key ingredients for teaching modern data science. Learning to program has already been greatly facilitated by initiatives such as Data Carpentry and Software Carpentry, and educational resources have been developed by the users themselves, including domain-specific tutorial collections and training materials (Kamvar et al. 2017; Lahti et al. 2017; Afanador-Llach et al. 2017). An essential pedagogical feature of R is that it enables a problem-centered, interactive learning approach. Even programming-naive learners can, in our experience, rapidly adopt practical skills by analyzing topical example data sets supported by ready-made R Markdown templates; these provide an immediate starting point to expose the learners to some of the key tools and best practices (Wilson et al. 2016). However, many aspects of learning R are still better appreciated by advanced users, such as harnessing the full potential of the open collaboration model through the joint development of custom R packages, report templates, Shiny modules, or database functions, which enables rapid development of solutions catering to specific practical needs. Indeed, at all levels of learning, getting things done fast appears to be an essential component of successful learning, as it provides instant rewards and helps to put the acquired skills into immediate use. The diverse needs of different application domains pose a great challenge for crafting common guidelines and materials, however. Leveraging the existing experience within the learning community can greatly support the learning process, as it helps to ensure the domain specificity and relevance of the teaching experience. This can be actively promoted by peer support and knowledge sharing; some ways to achieve this include code review, a show-and-tell culture, informal meetings, online channels (e.g. Slack, IRC, Facebook) and hackathons. Last but not least, having fun throughout the learning process is essential; gamification of assignments with real-time rankings, or custom functions performing non-statistical operations like emailing gif images, can raise awareness of how R as a full-fledged programming language differs from proprietary statistical packages. In order to meet these demands, we designed an open infrastructure to support learning in R. Our infrastructure gathers a set of modules to construct domain-specific assignments for various phases of data analysis. The assignments are coupled with automated evaluation and scoring routines that provide instant feedback during learning. In this talk, we introduce these R-based teaching tools and summarize our practical experiences on the various pedagogical aspects, opportunities, and challenges of community-based learning and knowledge sharing enabled by the R ecosystem.
References Afanador-Llach, Maria José, Antonio Rojas Castro, Adam Crymble, Víctor Gayol, Fred Gibbs, Caleb McDaniel, Ian Milligan, Amanda Visconti, and Jeri Wieringa. 2017. “The Programming Historian. Second Edition.” http://programminghistorian.org/.

Kamvar, Zhian N., Margarita M. López-Uribe, Simone Coughlan, Niklaus J. Grünwald, Hilmar Lapp, and Stéphanie Manel. 2017. “Developing Educational Resources for Population Genetics in R: An Open and Collaborative Approach.” Molecular Ecology Resources 17 (1): 120–28. doi:10.1111/1755-0998.12558.

Lahti, Leo, Sudarshan Shetty, Tineka Blake, and Jarkko Salojarvi. 2017. “Microbiome R Package.” http://microbiome.github.io/microbiome.

Wilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy Teal. 2016. “Good Enough Practices for Scientific Computing,” 1–30.




Speakers
avatar for Markus Kainu

Markus Kainu

Researcher, The Social Insurance Institution of Finland


slides pdf

Thursday July 6, 2017 1:30pm - 1:48pm CEST
PLENARY Wild Gallery

1:30pm CEST

Interactive graphs for blind and print disabled people
Keywords: accessibility, exploration, interactivity
Descriptions of graphs using long text strings are difficult for blind people and others with print disabilities to process; they lack the interactivity necessary to understand the content and presentation of even the simplest statistical graphs. Until very recently, R has been the only statistical software that has any capacity for offering the print disabled community any hope of support with respect to accessing graphs. We have leveraged the ability to create text descriptions of graphs and the ability to create interactive web content for chemical diagrams to offer a new user experience.
We will present the necessary tools that (1) produce the desired graph in the correct form of a scalable vector graphic (SVG) file, (2) create a supporting XML structure for exploration of the SVG, and (3) provide the JavaScript library needed to support mounting these files on the web.
Demonstration of how a blind user can explore the graph by “walking” a tree-like structure will be given. A key enhancement is the ability to explore the content at different levels of understanding; the user chooses to hear either the bare basic factual description or a more descriptive layer of feedback that can offer the user insight.
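The abstract does not name the packages involved; as a hedged illustration of step (1) only, an SVG that preserves the underlying grob structure (and can then be paired with a supporting XML layer) can be produced from grid-based graphics, for example with gridSVG.

```r
## An illustration of step (1) only, under the assumption that gridSVG-style
## export is the kind of tooling involved; it is not the authors' toolchain.
library(ggplot2)
library(gridSVG)

p <- ggplot(mtcars, aes(factor(cyl))) +
  geom_bar() +
  labs(title = "Cars by number of cylinders", x = "Cylinders", y = "Count")

print(p)                  # draw on the current grid device
grid.export("cars.svg")   # write the scene as SVG, keeping the named grob structure
```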



Thursday July 6, 2017 1:30pm - 1:48pm CEST
4.02 Wild Gallery
  Talk, Graphics

1:30pm CEST

R-based computing with big data on disk
Keywords: big data, reproducibility, data aggregation, bioinformatics, imaging
Webpages: https://github.com/kuwisdelu/matter, http://bioconductor.org/packages/release/bioc/html/matter.html
A common challenge in many areas of data science is the proliferation of large and heterogeneous datasets, stored in disjoint files and specialized formats, and exceeding the available memory of a computer. It is often important to work with these data on a single machine, e.g. to quickly explore the data, or to prototype alternative analysis approaches on limited hardware. Current solutions for working with such data on disk on a single machine in R involve wrapping existing file formats and structures (e.g., NetCDF, HDF5, database approaches, etc.) or converting them to very simple flat files (e.g., bigmemory, ff).
Here we argue that it is important to enable more direct interactions with such data in R. Direct interactions avoid the time and storage cost of creating converted files. They minimize the loss of information that can occur during the conversion, and therefore improve the accuracy and the reproducibility of the analytical results. They can best leverage the rich resources from over 10,000 packages already available in R.
We present matter, a novel paradigm and a package for direct interactions with complex, larger-than-memory data on disk in R. matter provides transparent access to datasets on disk, and allows us to build a single dataset from many smaller data fragments in custom formats, without reading them into memory. This is accomplished by means of a flexible data representation that allows the structure of the data in memory to be different from its structure on disk. For example, what matter presents as a single, contiguous vector in R may be composed of many smaller fragments from multiple files on disk. This allows matter to scale to large datasets, stored in large stand-alone files or in large collections of smaller files.
To illustrate the utility of matter, we will first compare its performance to bigmemory and ff using data in flat files, which can be easily accessed by all three approaches. In tests on simulated datasets greater than 1 GB and common analyses such as linear regression and principal components analysis, matter consumed the same or less memory and completed the analyses in a comparable time. It was therefore as efficient as, or more efficient than, the available solutions.
Next, we will illustrate the advantage of matter in a research area that works with complex formats. Mass spectrometry imaging (MSI) relies on imzML, a common open-source format for data representation and sharing across mass spectrometric vendors and workflows. Results of a single MSI experiment are typically stored in multiple files. An integration of matter with the R package Cardinal allowed us to perform statistical analyses of all the datasets in a public Gigascience repository of MSI datasets, ranging from <1 GB up to 42 GB in size. All of the analyses were performed on a single laptop computer. Due to the structure of imzML, these analyses would not have been possible with the existing alternative solutions for working with larger-than-memory datasets in R .
Finally, we will demonstrate the applications of matter to large datasets in other formats, in particular text data that arise in applications in genomics and natural language processing, and will discuss approaches to using matter when developing new statistical methods for such datasets.
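A minimal sketch, assuming the matter_mat() constructor; here the on-disk backing file is created from simulated data rather than assembled from existing file fragments.

```r
## A minimal sketch of on-disk data access with matter.
library(matter)

x <- matter_mat(data = rnorm(1e6), nrow = 1000, ncol = 1000)  # backed by a file, not RAM
dim(x)

x[1:5, 1:5]                       # subsetting reads only the required bytes from disk
colMeans(as.matrix(x[, 1:10]))    # pull a small slice into memory for ordinary R functions
```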

Speakers

Thursday July 6, 2017 1:30pm - 1:48pm CEST
2.01 Wild Gallery

1:30pm CEST

ReinforcementLearning: A package for replicating human behavior in R
Keywords: Reinforcement Learning, Human-Like Learning, Experience Replay, Q-Learning, Decision Analytics
Webpages: https://github.com/nproellochs/ReinforcementLearning
Reinforcement learning has recently gained a great deal of traction in studies that call for human-like learning. In settings where an explicit teacher is not available, this method teaches an agent via interaction with its environment without any supervision other than its own decision-making policy. In many cases, this approach appears quite natural by mimicking the fundamental way humans learn. However, implementing reinforcement learning is programmatically challenging, since it relies on continuous interactions between an agent and its environment. In fact, there is currently no package available that performs model-free reinforcement learning in R. As a remedy, we introduce the ReinforcementLearning R package, which allows an agent to learn optimal behavior based on sample experience consisting of states, actions and rewards. The result of the learning process is a highly interpretable reinforcement learning policy that defines the best possible action in each state. The package provides a remarkably flexible framework and is easily applied to a wide range of different problems. We demonstrate the added benefits of human-like learning using multiple real-world examples, e.g. by teaching the optimal movements of a robot in a grid map.
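A minimal sketch using the gridworld environment shipped with the package:

```r
## A minimal sketch: sample experience from the built-in gridworld, then learn a policy.
library(ReinforcementLearning)

env  <- gridworldEnvironment                       # 2x2 gridworld bundled with the package
data <- sampleExperience(N = 1000, env = env,
                         states  = c("s1", "s2", "s3", "s4"),
                         actions = c("up", "down", "left", "right"))

# learn from (state, action, reward, next state) tuples via experience replay
model <- ReinforcementLearning(data,
                               s = "State", a = "Action",
                               r = "Reward", s_new = "NextState",
                               control = list(alpha = 0.1, gamma = 0.5, epsilon = 0.1))
print(model)    # state-action values and the learned (interpretable) policy
```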


Slides pptx

Thursday July 6, 2017 1:30pm - 1:48pm CEST
3.02 Wild Gallery

1:30pm CEST

The **renjin** package: Painless Just-in-time Compilation for High Performance R
Keywords: performance, compilation, Renjin
Webpages: http://docs.renjin.org/en/latest/package/
R is a highly dynamic language that has developed, in some circles, a reputation for poor performance. New programmers are counseled to avoid for loops and experienced users are condemned to rewrite perfectly good R code in C++.
Renjin is an alternative implementation of the R language that includes a Just-in-Time compiler which uses information at runtime to dynamically specialize R code and generate highly-efficient machine code, allowing users to write “normal”, expressive R code and let the compiler worry about performance.
While Renjin aims to provide a complete alternative to the GNU R interpreter, it is not yet fully compatible with all R packages, and lacks a number of features, including graphics support. For this reason, we present renjin, a new package that embeds Renjin’s JIT compiler in the existing GNU R interpreter, enabling even novice programmers to achieve high performance without resorting to C++ or making the switch to a different interpreter.
This talk will introduce the techniques behind Renjin’s optimizing compiler, demonstrate how it can be simply applied to performance-critical sections of R code, and share some tips and tricks for getting the most out of renjin.
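As a hedged illustration, the usage pattern described in the online package documentation is to hand a performance-critical expression to the embedded JIT via a renjin() wrapper; the exact call below is an assumption based on that documentation.

```r
## A hedged sketch, assuming the renjin() wrapper described in the package docs.
library(renjin)

bigsum <- function(n) {            # deliberately "non-vectorised" R code
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

system.time(bigsum(1e8))           # evaluated by the GNU R interpreter
system.time(renjin(bigsum(1e8)))   # same code, evaluated by the embedded Renjin JIT
```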

Speakers
avatar for Alexander Bertram

Alexander Bertram

Technical Director, BeDataDriven
I work on two projects: Renjin (www.renjin.org), an interpreter and optimizing compiler for the R language; and ActivityInfo (www.activityinfo.org), a data collection, management, and analysis platform for the UN and NGOs working in crisis environments. Talk to me about compilers... Read More →


Thursday July 6, 2017 1:30pm - 1:48pm CEST
2.02 Wild Gallery

1:48pm CEST

An LLVM-based Compiler Toolkit for R
Online presentation: https://drive.google.com/file/d/0B7N7RhxAvGVgOU5tWmJMUDBGQUE/view



Keywords: compiler, code analysis, performance
Webpages: https://github.com/nick-ulle/rstatic, https://github.com/duncantl/Rllvm
R allows useRs to focus on the problems they want to solve by abstracting away the details of the hardware. This is a major contributor to R’s success as a data analysis language, but also makes R too slow and resource-hungry for certain tasks. Traditionally, useRs have worked around this limitation by rewriting bottlenecks in Fortran, C, or C++. These languages provide a substantial performance boost at the cost of abstraction, a trade-off that useRs should not have to face.
This talk introduces a collection of packages for analyzing, optimizing, and building compilers for R code, extending earlier work by Temple Lang (2014). By building on top of the LLVM Compiler Infrastructure (Lattner and Adve 2004), a mature open-source library for native code generation, these tools enable translation of R code to specialized machine code for a variety of hardware. Moreover, the tools are extensible and ease the creation of efficient domain-specific languages based on R, such as nimble and dplyr. Potential applications will be discussed and a simple compiler (inspired by Numba) for mathematical R code will be presented as a demonstration.
References Lattner, Chris, and Vikram Adve. 2004. “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.” In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, 75. CGO ’04. Washington, DC, USA: IEEE Computer Society.

Temple Lang, Duncan. 2014. “Enhancing R with Advanced Compilation Tools and Methods.” Statistical Science 29 (2). Institute of Mathematical Statistics: 181–200.




Speakers

Thursday July 6, 2017 1:48pm - 2:06pm CEST
2.02 Wild Gallery

1:48pm CEST

Daff: diff, patch and merge for data.frames
Keywords: Reproducible research, data versioning
Webpages: https://CRAN.R-project.org/package=daff, https://github.com/edwindj/daff
In data analysis, it can be necessary to compare two files containing tabular data. Unfortunately, existing tools have been customized for comparing source code or other text files, and are unsuitable for comparing tabular data.
The daff R package provides tools for comparing and tracking changes in tabular data stored in data.frames. daff wraps Paul Fitz’s multi-language daff package (https://github.com/paulfitz/daff), which generates data diffs that capture row and column modifications, reorderings, additions, and deletions. These data diffs follow a standard format (http://dataprotocols.org/tabular-diff-format/) which can be used to produce HTML-formatted diffs, summarize changes, and even patch (a new version of) the input data.
daff brings the utility of source-code change tracking tools to tabular data, enabling data versioning as a component of software development and reproducible research.
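A minimal sketch of the diff/patch workflow described above:

```r
## A minimal diff/patch sketch with two small data frames.
library(daff)

old <- data.frame(id = 1:3, value = c(10, 20, 30))
new <- data.frame(id = c(1, 2, 4), value = c(10, 25, 40))

d <- diff_data(old, new)        # tabular diff in the standard diff format
render_diff(d)                  # browsable HTML view of the changes

patched <- patch_data(old, d)   # re-apply the recorded changes to the old version
all.equal(patched, new)         # should reproduce the new version
```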

Speakers

daff pdf

Thursday July 6, 2017 1:48pm - 2:06pm CEST
2.01 Wild Gallery

1:48pm CEST

Data Carpentry: Teaching Reproducible Data Driven Discovery
Keywords: reproducible research, rmarkdown, open science, training, github
Webpages: https://datacarpentry.org
Data Carpentry is a non-profit organization and community. It develops and teaches workshops aimed at researchers with little to no programming experience. It teaches skills and good practices for data management and analysis, with a particular emphasis on reproducibility. Over a two-day workshop, participants are exposed to the full life cycle of data-driven research. Since its creation in 2014, Data Carpentry has taught over 125 workshops and trained 400+ certified instructors. Because the workshops are domain specific, participants can get familiar with the dataset used throughout the workshop quickly, and focus on learning the computing skills. We have developed detailed assessments to evaluate the effectiveness and level of satisfaction of the participants after attending a workshop as well as the impact on their research and careers 6 months or more after a workshop. Here, we will present an overview of the organization, the skills taught with a particular emphasis on using R, and the strategies used to make these workshops successful.



Thursday July 6, 2017 1:48pm - 2:06pm CEST
PLENARY Wild Gallery
  Talk, Education

1:48pm CEST

Deep Learning for Natural Language Processing in R
Keywords: Deep Learning, Natural Language Processing
Webpages: https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-convolutional-neural-networks-on-microsoft-azure/, https://github.com/dmlc/mxnet/tree/master/R-package, https://github.com/Azure/Cortana-Intelligence-Gallery-Content/tree/master/Tutorials/Deep-Learning-for-Text-Classification-in-Azure
The use of deep learning for NLP has attracted a lot of interest in the research community over recent years. This talk describes how deep learning techniques can be applied to natural language processing (NLP) tasks using R. We demonstrate how the MXNet deep learning framework can be used to implement, train and deploy deep neural networks that can solve text categorization and sentiment analysis problems.
We begin by briefly discussing the motivation and theory behind applying deep learning to NLP tasks. Deep learning has achieved a lot of success in the domain of image recognition. State-of-the-art image classification systems employ convolutional neural networks (CNNs) with a large number of layers. These networks perform well because they can learn hierarchical representations of the input with increasing levels of abstraction. In the context of NLP, neural networks have been shown to achieve good results. In particular, Recurrent Neural Networks such as Long Short Term Memory Networks (LSTMs) perform well for problems where the input is a sequence, such as speech recognition and text understanding. In this talk we explore an interesting approach which takes inspiration from the image recognition domain and applies CNNs to NLP problems. This is achieved by encoding segments of text in an image-like matrix, where each encoded word or character is equivalent to a pixel in the image.
CNNs have achieved excellent performance for text categorization and sentiment analysis. In this talk, we demonstrate how to implement a CNN for these tasks in R. As an example, we describe in detail the code to implement the Crepe model. To train this network, each input sentence is transformed into a matrix in which each column represents a one-hot encoding of each character. We describe the code needed to perform this transformation and how to specify the structure of the network and hyperparameters using the R bindings to MXNet provided in the mxnet package. We show how we implemented a custom C++ iterator class to efficiently manage the input and output of data. This allows us to process CSV files in chunks, taking batches of raw text and transforming them into matrices in memory, whilst distributing the computation over multiple GPUs. We describe how to set up a virtual machine with GPUs on Microsoft Azure to train the network, including installation of the necessary drivers and libraries. The network is trained on the Amazon categories dataset which consists of a training set of 2.38 million sentences, each of which maps to one of 7 categories including Books, Electronics and Home & Kitchen.
The talk concludes with a demo of how a trained network can be deployed to classify new sentences. We demonstrate how this model can be deployed as a web service which can be consumed from a simple web app. The user can query the web service with a sentence and the API will return a product category. Finally, we show how the Crepe model can be applied to the sentiment analysis task using exactly the same network structure and training methods.
Through this talk, we aim to give the audience insight into the motivation for employing CNNs to solve NLP problems. Attendees will also gain an understanding of how they can be implemented, efficiently trained and deployed in R.
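The following is not the Crepe implementation itself, but a much-simplified sketch of how such a character-level CNN can be declared with the mxnet R symbol API; layer sizes and the input shape convention are illustrative assumptions, and the training call is left commented out.

```r
## An illustrative, simplified character-level CNN sketch (not the Crepe model).
library(mxnet)

vocab  <- 69     # size of the character alphabet (one-hot rows), illustrative
maxlen <- 256    # characters per document (columns), illustrative

data  <- mx.symbol.Variable("data")
conv1 <- mx.symbol.Convolution(data = data, kernel = c(7, vocab), num_filter = 256)
act1  <- mx.symbol.Activation(data = conv1, act_type = "relu")
pool1 <- mx.symbol.Pooling(data = act1, pool_type = "max",
                           kernel = c(3, 1), stride = c(3, 1))
flat  <- mx.symbol.Flatten(data = pool1)
fc1   <- mx.symbol.FullyConnected(data = flat, num_hidden = 1024)
drop1 <- mx.symbol.Dropout(data = fc1, p = 0.5)
fc2   <- mx.symbol.FullyConnected(data = drop1, num_hidden = 7)  # 7 product categories
net   <- mx.symbol.SoftmaxOutput(data = fc2)

# train.x: array of one-hot encoded sentences; train.y: category labels (not shown)
# model <- mx.model.FeedForward.create(net, X = train.x, y = train.y,
#                                      ctx = mx.gpu(), num.round = 10,
#                                      array.batch.size = 128, learning.rate = 0.01)
```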

Speakers


Thursday July 6, 2017 1:48pm - 2:06pm CEST
3.02 Wild Gallery

1:48pm CEST

Package ggiraph: a ggplot2 Extension for Interactive Graphics
Keywords: visualization, interactive
Webpages: https://CRAN.R-project.org/package=ggiraph, https://davidgohel.github.io/ggiraph/
With the rise of data visualisation, tools such as ggplot2 and D3.js have become very popular in recent years. The former provides a high-level library for data visualisation, whereas the latter provides a low-level library for binding graphical elements in a web context.
The ggiraph package combines both tools. From a user’s point of view, it enables the production of interactive graphics from ggplot2 objects by using the ggplot2 extension mechanism. It provides useful interactive capabilities such as tooltips and zoom/pan. Last but not least, graphical elements can be selected when a ggiraph object is embedded in a Shiny app: the selection is available as a reactive value. The interface is simple and flexible, and requires little effort to integrate in R Markdown documents or Shiny applications.
In this talk I will introduce ggiraph and show examples of using it as a data visualisation tools in RStudio, Shiny applications and R Markdown documents.
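A minimal sketch of this usage is shown below; note that the wrapper call for rendering has changed across package versions, so the form used here is one of its documented variants rather than the definitive API.

```r
## A minimal sketch: interactive points with tooltips, rendered as an htmlwidget.
library(ggplot2)
library(ggiraph)

gg <- ggplot(mtcars,
             aes(wt, mpg,
                 tooltip = rownames(mtcars),    # shown on hover
                 data_id = rownames(mtcars))) + # used for selection in Shiny
  geom_point_interactive(size = 3)

girafe(ggobj = gg)   # interactive SVG + JavaScript rendering of the ggplot object
```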

Speakers


Thursday July 6, 2017 1:48pm - 2:06pm CEST
4.02 Wild Gallery
  Talk, Graphics

1:48pm CEST

R Package glmm: Likelihood-Based Inference for Generalized Linear Mixed Models
Speakers
avatar for Christina Knudson

Christina Knudson

Assistant Professor, University of St. Thomas



Thursday July 6, 2017 1:48pm - 2:06pm CEST
3.01 Wild Gallery

2:06pm CEST

**countreg**: Tools for count data regression
Keywords: Count data regression, model diagnostics, rootogram, visualization
Webpages: https://R-Forge.R-project.org/projects/countreg
The interest in regression models for count data has grown rather rapidly over the last 20 years, partly driven by methodological questions and partly by the availability of new data sets with complex features (see, e.g., Cameron and Trivedi 2013). The countreg package for R provides a number of fitting functions and new tools for model diagnostics. More specifically, it incorporates enhanced versions of fitting functions for hurdle and zero-inflation models that have been available via the pscl package for some 10 years (Zeileis, Kleiber, and Jackman 2008), now also permitting binomial responses. In addition, it provides zero-truncation models for data without zeros, along with mboost family generators that enable boosting of zero-truncated and untruncated count data regressions, thereby supplementing and extending family generators available with the mboost package. For visualizing model fits, countreg offers rootograms (Tukey 1972; Kleiber and Zeileis 2016) and probability integral transform (PIT) histograms. A (generic) function for computing (randomized) quantile residuals is also available. Furthermore, there are enhanced options for predict() methods. Several new data sets from a variety of fields (including dentistry, ethology, and finance) are included.
Development versions of countreg have been available from R-Forge for some time; a CRAN release is planned for summer 2017.
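As a hedged sketch of the tools described above, using the CrabSatellites data that, to our knowledge, ships with the R-Forge version of the package (names may differ in the eventual CRAN release):

```r
## A hedged sketch of count data model diagnostics with countreg.
## install.packages("countreg", repos = "http://R-Forge.R-project.org")
library(countreg)
data("CrabSatellites", package = "countreg")

m_pois <- glm(satellites ~ width + color, data = CrabSatellites, family = poisson)
rootogram(m_pois)          # hanging rootogram: reveals excess zeros / overdispersion

m_hnb <- hurdle(satellites ~ width + color, data = CrabSatellites, dist = "negbin")
rootogram(m_hnb)           # much improved fit in the zero and low counts
qresiduals(m_hnb)          # randomized quantile residuals
```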
References Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Cambridge: Cambridge University Press.

Kleiber, Christian, and Achim Zeileis. 2016. “Visualizing Count Data Regressions Using Rootograms.” The American Statistician 70 (3): 296–303.

Tukey, John W. 1972. “Some Graphic and Semigraphic Displays.” In Statistical Papers in Honor of George W. Snedecor, edited by T. A. Bancroft, 293–316. Ames, IA: Iowa State University Press.

Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software 27 (8): 1–25. http://www.jstatsoft.org/v27/i08/.






Thursday July 6, 2017 2:06pm - 2:24pm CEST
3.01 Wild Gallery

2:06pm CEST

odbc - A modern database interface
Keywords: ODBC, DBI, databases, dplyr, RStudio
Webpages: https://CRAN.R-project.org/package=odbc, https://github.com/rstats-db/odbc
Getting data into and out of databases is one of the most fundamental parts of data science. Much of the world’s data is stored in databases, including traditional databases such as SQL Server, MySQL, PostgreSQL, and Oracle, as well as non-traditional databases like Hive, BigQuery, Redshift and Spark.
The odbc package provides an R interface to Open Database Connectivity (ODBC) drivers and databases including all those listed previously. odbc provides consistent output, including support for timestamps and 64-bit integers, improved performance for reading and writing, and complete compatibility with the DBI package.
odbc connections can be used as dplyr backends, allowing one to perform expensive queries within the database and reduce the need to transfer and load large amounts of data in an R session. odbc is also integrated into the RStudio IDE, with dialogs to setup and establish connections, preview available tables and schemas and run ad-hoc SQL queries. The RStudio Professional Products are bundled with a suite of ODBC drivers, to make it easy for System Administrators to establish and support connections to a variety of database technologies.
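As a brief illustration of this workflow (the connection details below are hypothetical and depend on the local ODBC driver configuration):

```r
## A minimal sketch of an ODBC connection used via DBI and as a dplyr backend.
library(DBI)
library(dplyr)

con <- dbConnect(odbc::odbc(),
                 driver   = "PostgreSQL",          # hypothetical driver name
                 database = "warehouse",           # hypothetical database
                 server   = "db.example.com",
                 uid      = "analyst",
                 pwd      = Sys.getenv("DB_PWD"))

dbListTables(con)
dbGetQuery(con, "SELECT COUNT(*) FROM orders")     # ad-hoc SQL (hypothetical table)

# dplyr backend: the filter/summarise is translated to SQL and runs in the database
tbl(con, "orders") %>%
  filter(status == "shipped") %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```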

Speakers

Thursday July 6, 2017 2:06pm - 2:24pm CEST
2.01 Wild Gallery

2:06pm CEST

Performance Benchmarking of the R Programming Environment on Knight's Landing
Keywords: Multicore architectures, benchmarking, scalability, Xeon Phi
We present performance results obtained with a new performance benchmark of the R programming environment on the Xeon Phi Knights Landing and standard Xeon-based compute nodes. The benchmark package consists of microbenchmarks of matrix linear algebra kernels and machine learning functionality included in the R distribution that can be built from those kernels. Our microbenchmarking results show that the Knights Landing compute nodes exhibited similar or superior performance compared to the standard Xeon-based nodes for matrix dimensions of moderate to large size for most of the microbenchmarks, executing as much as five times faster than the standard Xeon-based nodes. For the clustering and neural network training microbenchmarks, the standard Xeon-based nodes performed up to four times faster than their Xeon Phi counterparts for many large data sets, indicating that commonly used R packages may need to be reengineered to take advantage of existing optimized, scalable kernels.
Over the past several years a trend of increased demand for high performance computing (HPC) in data analysis has emerged. This trend is driven by increasing data sizes and computational complexity (Fox et al. 2015; Kouzes et al. 2009). Many data analysts, researchers, and scientists are turning to HPC machines to help with algorithms and tools, such as machine learning, that are computationally demanding and require large amounts of memory (Raj et al. 2015). The characteristics of large scale machines (e.g. large amounts of RAM per node, high storage capacity, and advanced processing capabilities) appear very attractive to these researchers; however, challenges remain for algorithms to make optimal use of the hardware (Lee et al. 2014).
A way to assess the performance of software on a given computing platform and inter-compare performance across different platforms is through benchmarking. Benchmark results can also be used to prioritize software performance optimization efforts on emerging HPC systems. One such emerging architecture is the Intel Xeon Phi processor, codenamed Knights Landing (KNL). The latest Intel Xeon Phi processor is a system on a chip, many-core, vector processor with up to 68 cores and two 512-bit vector processing units per core, a sufficient deviation from the standard Xeon processors and Xeon Phi accelerators of the previous generation to necessitate a performance assessment of the R programming environment on KNL.
We developed an R performance benchmark to determine the single-node run time performance of compute-intensive linear algebra kernels that are common to many data analytics algorithms, and the run time performance of machine learning functionality commonly implemented with linear algebra operations. We then performed single-node strong scaling tests of the benchmark on both Xeon and Xeon Phi based systems to determine problem sizes and numbers of threads for which KNL-based nodes were comparable to or outperformed their standard Intel Xeon counterparts. It is our intention that these results be used to guide future performance optimization efforts of the R programming environment to increase the applicability of HPC machines for compute-intensive data analysis. The benchmark is also generally applicable to a variety of systems and architectures and can be easily run to determine the computational potential of a system when using R for many data analysis tasks.
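The benchmark package itself is not reproduced here; as a hedged illustration of the kind of dense linear algebra microbenchmark it describes, one might time BLAS-bound kernels at different thread counts (thread control via RhpcBLASctl is an assumption about the setup, not part of the authors' package).

```r
## An illustrative microbenchmark of dense linear algebra kernels at varying thread counts.
library(microbenchmark)
library(RhpcBLASctl)     # used here to vary the number of BLAS threads

n <- 4000
A <- matrix(rnorm(n * n), n, n)

for (threads in c(1, 16, 64)) {
  blas_set_num_threads(threads)
  print(microbenchmark(
    crossprod = crossprod(A),                       # t(A) %*% A
    cholesky  = chol(crossprod(A) + n * diag(n)),   # positive definite factorisation
    times = 3L
  ))
}
```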
References Fox, Geoffrey, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburugamuve. 2015. “Big Data, Simulations and Hpc Convergence.” In Workshop on Big Data Benchmarks, 3–17. Springer.

Kouzes, Richard T, Gordon A Anderson, Stephen T Elbert, Ian Gorton, and Deborah K Gracio. 2009. “The Changing Paradigm of Data-Intensive Computing.” Computer 42 (1). IEEE: 26–34.

Lee, Seunghak, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A Gibson, and Eric P Xing. 2014. “On Model Parallelization and Scheduling Strategies for Distributed Machine Learning.” In Advances in Neural Information Processing Systems, 27:2834–42.

Raj, Pethuru, Anupama Raman, Dhivya Nagaraj, and Siddhartha Duggirala. 2015. “High-Performance Big-Data Analytics.” Computing Systems and Approaches (Springer, 2015) 1. Springer.




Speakers


Thursday July 6, 2017 2:06pm - 2:24pm CEST
2.02 Wild Gallery

2:06pm CEST

R4ML: A Scalable R for Machine Learning
Keywords: R, Distributed/Scalable, Machine Learning, SparkR, SystemML
Webpages: https://github.com/SparkTC/R4ML
R is the de facto standard for statistics and analysis. In this talk, we introduce R4ML, a new open-source R package for scalable machine learning from IBM. R4ML provides a bridge between R, Apache SystemML and SparkR, allowing R scripts to invoke custom algorithms developed in SystemML’s R-like domain specific language. This capability also provides a bridge to the algorithm scripts that ship with Apache SystemML, effectively adding a new library of prebuilt scalable algorithms for R on Apache Spark. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same script. In addition, the R4ML package provides a number of useful new scalable R functions that simplify common data cleaning and statistical analysis tasks.
Our talk will begin with an overview of the R4ML package, its API, supported canned algorithms, and the integration with Spark and SystemML. We will walk through a small example of creating a custom algorithm and a demo of a canned algorithm. We will share our experiences using R4ML technology with IBM clients. The talk will conclude with pointers to how the audience can try out R4ML and discuss potential areas of community collaboration.

Speakers


Thursday July 6, 2017 2:06pm - 2:24pm CEST
3.02 Wild Gallery

2:06pm CEST

Statistics in Action with R: an educative platform
Keywords: Hypothesis testing, regression model, mixed effects model, mixture model, change point detection
Webpage: http://sia.webpopix.org/
We are developing at Inria and Ecole Polytechnique the web-based educative platform Statistics in Action with R.
The purpose of this online course is to show how statistics may be efficiently used in practice using R.
The course presents both statistical theory and practical analysis on real data sets. The R statistical software and several R packages are used for implementing methods presented in the course and analyzing real data. Many interactive Shiny apps are also available.
Topics covered in the current version of the course are:
  • hypothesis testing (single and multiple comparisons)
  • regression models (linear and nonlinear models)
  • mixed effects models (linear and nonlinear models)
  • mixture models
  • detection of change points
  • image restoration
We are aware that important aspects of statistics are not addressed, both in terms of models and methods. We plan to fill some of these gaps shortly.
Even if R is extensively used for this course, this is not an R programming course. On the one hand, our objective is not to propose the most efficient implementation of an algorithm, but rather to provide code that is easy to understand, to reuse and to extend.
On the other hand, the R functions used to illustrate a method are not used as “black boxes”. We show in detail how the results of a given function are obtained. Then, the course may be read at two different levels: we may be only interested in the statistical technique to use (and then the R function to use) for a given problem (see the first part of the course about polynomial regression), or we may want to go into details and understand how these results are computed (see the second part of this course about polynomial regression).
This course was first given at Ecole Polytechnique (France) in 2017.
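As a small base-R illustration of this "two levels of reading" idea (not taken from the course material): fit a polynomial regression with lm(), then reproduce the same estimates by hand from the normal equations.

```r
## Fit a polynomial regression, then recover the coefficients without the "black box".
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- 1 + 2 * x - 3 * x^2 + rnorm(50, sd = 0.2)

fit <- lm(y ~ poly(x, degree = 2, raw = TRUE))
coef(fit)

# the same least-squares estimates, computed directly from the normal equations
X <- cbind(1, x, x^2)
solve(crossprod(X), crossprod(X, y))
```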

Speakers

Thursday July 6, 2017 2:06pm - 2:24pm CEST
PLENARY Wild Gallery

2:06pm CEST

Visual funnel plot inference for meta-analysis
Keywords: meta-analysis, funnel plot, visual inference, publication bias, small study effects
Webpages: https://CRAN.R-project.org/package=metaviz, https://metaviz.shinyapps.io/funnelinf_app/
The funnel plot is widely used in meta-analysis to detect small study effects, especially publication bias. However, it has been repeatedly shown that the interpretation of funnel plots is highly subjective and often leads to false conclusions regarding the presence or absence of such small study effects (Terrin, Schmid, and Lau 2005). Visual inference (Buja et al. 2009) is the formal inferential framework to test if graphically displayed data do or do not support a hypothesis. The general idea is that if the data supports an alternative hypothesis, the graphical display showing the real data should be identifiable when simultaneously presented with displays of simulated data under the null hypothesis. When compared to conventional statistical tests, visual inference showed promising results in experiments, for example, for testing linear model coefficients using boxplots and scatterplots (Majumder, Hofmann, and Cook 2013). With the package nullabor (Wickham, Chowdhury, and Cook 2014) helpful general purpose functions for visual inference are available within R. Due to the often uncertain or even misleading nature of funnel plot based conclusions, we identified funnel plots as a prime candidate field for the application of visual inference. For this purpose, we developed the function funnelinf which is available within the R package metaviz. The function funnelinf is specifically tailored to visual inference of funnel plots, for instance, with options for displaying significance contours, Egger’s regression line, and for using different meta-analytic models for null plot simulation. In addition, the functionalities of funnelinf are made available as a shiny app for the convenient use by meta-analysts not familiar with R. Visual funnel plot inference and the capabilities of funnelinf are illustrated with real data from a meta-analysis on the mozart effect. Furthermore, results of an empirical experiment evaluating the power of visual funnel plot inference compared to traditional statistical funnel plot based tests are presented. Implications of these results are discussed and specific guidelines for the use of visual funnel plot inference are given.
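A hedged sketch of such a funnel plot lineup is shown below; funnelinf() is assumed to accept a data frame of effect sizes and standard errors, and the argument names are taken from the package documentation as recalled, so they may differ from the released version.

```r
## A hedged sketch of visual funnel plot inference on a toy meta-analysis.
library(metaviz)

set.seed(1)
dat <- data.frame(d  = rnorm(20, mean = 0.3, sd = 0.15),   # toy effect sizes
                  se = runif(20, min = 0.05, max = 0.3))   # toy standard errors

# lineup of funnel plots: one shows the real data, the others show data simulated
# under the null model; can the real plot be picked out?
funnelinf(x = dat, n = 9, null_model = "FEM", contours = TRUE)
```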
References Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F Swayne, and Hadley Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367: 4361–83.

Majumder, Mahbubul, Heike Hofmann, and Dianne Cook. 2013. “Validation of Visual Statistical Inference, Applied to Linear Models.” Journal of the American Statistical Association 108: 942–56.

Terrin, Norma, Christopher H Schmid, and Joseph Lau. 2005. “In an Empirical Evaluation of the Funnel Plot, Researchers Could Not Visually Identify Publication Bias.” Journal of Clinical Epidemiology 58: 894–901.

Wickham, Hadley, Niladri Roy Chowdhury, and Di Cook. 2014. Nullabor: Tools for Graphical Inference. https://CRAN.R-project.org/package=nullabor.






Thursday July 6, 2017 2:06pm - 2:24pm CEST
4.02 Wild Gallery
  Talk, Graphics

2:24pm CEST

*GNU R* on a Programmable Logic Controller (PLC) in an Embedded-Linux Environment
Keywords: GNU R, PLC, Embedded, Linux
Abstract: As one of the leading institutions in the field of applied energy research, the Bavarian Center for Applied Energy Research (ZAE Bayern) combines excellent research with excellent economic implementation of the results. Our main research goal is to increase the capacity of low-voltage grids for installed photovoltaics. To this end, the relevant influences were analysed by taking measurements at grid nodes and in households. In this context we tried several open-source programming languages and found GNU R to be the best fit, since it is capable of analysing, manipulating and plotting the measurement data as well as simulating and controlling our real test systems.
We have installed multiple storage units and modules in different households. To control these test sites, we use Wago 750-8202 PLCs, which are currently programmed in Codesys. As control strategies can get quite complex (due to individual forecasting, dynamic non-linear storage models, …), Codesys alone cannot handle them. By choosing R for the complex computations, we gain access to a wide range of libraries as well as to our self-developed strategies for analysis and simulation, which rely on measurement data from the grid and the weather.
To bring the strategy to the PLC, we divided the whole system into two parts. The first part runs on our central servers and prepares external data from our databases for each test site; for this we set up a control platform using node-red and the httpuv package in order to run R scripts on demand. The second part is computed on the Wago PLCs. With the board support package (BSP), Wago provides its customers with a tool-chain to build their own customised firmware. Our proposed approach is to run R and Python together with the basic Codesys runtime on the PLC. Python serves as an asynchronous local controller that can start calculations in R and provide control quantities to Codesys. We plan to apply the rzmq package for inter-process communication (IPC) and data exchange. For example, the information delivered to the Python controller is cyclically forwarded to the global environment of a continuously running R instance, which reduces the start-up and initialization effort for our models. On demand, R is instructed to calculate a given strategy, chosen for a specific day and situation by the central servers. R hands the result back to the Python controller, which forwards it to Codesys, where short-term closed control loops can be established. Our approach will be tested and verified on our Wago PLCs in our environment.
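A hedged sketch of the R side of this exchange (socket type, address and the strategy function are assumptions, not the authors' actual setup): a continuously running R instance answering requests from the Python controller over ZeroMQ via rzmq.
library(rzmq)
# Reply socket: the Python controller connects, sends measurements and
# receives the computed control quantity back.
context <- init.context()
socket  <- init.socket(context, "ZMQ_REP")
bind.socket(socket, "tcp://*:5555")
while (TRUE) {
  request <- receive.socket(socket)   # e.g. a list of current measurements
  result  <- run_strategy(request)    # run_strategy() is a placeholder for the chosen strategy
  send.socket(socket, result)
}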

Schematic procedure

Speakers


Thursday July 6, 2017 2:24pm - 2:42pm CEST
2.02 Wild Gallery

2:24pm CEST

A quasi-experiment for the influence of the user interface on the acceptance of R
Keywords: Teaching Statistics using R, User Interface, Technology Acceptance Model
Teaching computation with R within statistics education is affected by the acceptance of the technology (Baglin 2013). In a quasi-experiment we investigate whether different user interfaces to R, namely mosaic or Rcmdr, influence acceptance according to the Technology Acceptance Model (Venkatesh et al. 2003).
The focus is on the perceived usefulness and ease of use of R for people who study while working in different economics-related disciplines. At our private university of applied sciences for working professionals, with more than 30 study centres across Germany, the use of R is compulsory in all statistics courses across the different Master programmes and study centres. Due to a change in the curriculum we were able to teach the lecture twice in one term, once with Rcmdr and once with mosaic, enabling a quasi-experimental setup with two lectures each.
References Baglin, James. 2013. “Applying a Theoretical Model for Explaining the Development of Technological Skills in Statistics Education.” Technology Innovations in Statistics Education 7 (2). https://escholarship.org/uc/item/8w97p75s.

Venkatesh, Viswanath, Michael G. Morris, Gordon B. Davis, and Fred D. Davis. 2003. “User Acceptance of Information Technology: Toward a Unified View.” MIS Q. 27 (3): 425–78. http://dl.acm.org/citation.cfm?id=2017197.2017202.




Speakers
avatar for Matthias Gehrke

Matthias Gehrke

Professor, FOM University of Applied Sciences



Thursday July 6, 2017 2:24pm - 2:42pm CEST
PLENARY Wild Gallery
  Talk, Education

2:24pm CEST

Computer Vision and Image Recognition algorithms for R users
Keywords: Computer Vision, Image recognition, Object detection, Image feature engineering
Webpages: https://github.com/bnosac/image
R already has quite a few packages for image processing, namely magick (https://CRAN.R-project.org/package=magick), imager (https://CRAN.R-project.org/package=imager), EBImage (https://bioconductor.org/packages/EBImage) and OpenImageR (https://CRAN.R-project.org/package=OpenImageR).
The field of image processing is rapidly evolving, with new algorithms and techniques quickly popping up, from Learning and Detection to Denoising, Segmentation and Edges, Image Comparison and Deep Learning.
In order to complement these existing packages with new algorithms, we implemented a number of R packages. Some of these packages have been released on https://github.com/bnosac/image, namely:
  • image.CornerDetectionF9: FAST-9 corner detection for images (license: BSD-2).
  • image.LineSegmentDetector: Line Segment Detector (LSD) for images (license: AGPL-3).
  • image.ContourDetector: Unsupervised Smooth Contour Line Detection for images (license: AGPL-3).
  • image.CannyEdges: Canny Edge Detector for images (license: GPL-3).
  • image.dlib: Speeded Up Robust Features (SURF) and Histogram of Oriented Gradients (HOG) features (license: AGPL-3).
  • image.darknet: Image classification using darknet with the deep learning models AlexNet, Darknet, VGG-16, Extraction (GoogleNet) and Darknet19, as well as object detection using the state-of-the-art YOLO detection system (license: MIT).
  • dlib: Allow access to the ‘Dlib’ C++ library (license: BSL-1.0).
More packages and extensions will be released in due course.
In this talk, we provide an overview of these newly developed packages and the new computer vision algorithms they make accessible to R users.


Speakers
avatar for Jan Wijffels

Jan Wijffels

statistician, www.bnosac.be



Thursday July 6, 2017 2:24pm - 2:42pm CEST
3.02 Wild Gallery

2:24pm CEST

How to Use (R)Stan to Estimate Models in External R Packages
Keywords: Bayesian, developeRs
Webpages: rstan, rstanarm, rstantools, StanHeaders, bayesplot, shinyStan, loo, home page
The rstan package provides an interface from R to the Stan libraries, which makes it possible to access Stan’s advanced algorithms to draw from any posterior distribution whose density function is differentiable with respect to the unknown parameters. The rstan package is ranked in the 99th percentile overall on Depsy due to its number of downloads, citations, and use in other projects. This talk is a follow-up to the very successful Stan workshop at useR2016 and will be more focused on how maintainers of other R packages can easily use Stan’s algorithms to estimate the statistical models that their packages provide. These mechanisms were developed to support the rstanarm package for estimating regression models with Stan and have since been used by over twenty R packages, but they are perhaps not widely known and are difficult to set up manually. Fortunately, the rstan_package.skeleton function in the rstantools package can be used to automate most of the process, so the package maintainer only needs to write the log-posterior density (up to a constant) in the Stan language and provide an R wrapper to call the pre-compiled C++ representation of the model. Methods for the resulting R object can be defined that allow the user to analyze the results using post-estimation packages such as bayesplot, ShinyStan, and loo.
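A hedged sketch of what the resulting wrapper can look like inside such a package; the model name mymodel and the internal stanmodels list follow the usual rstantools scaffolding conventions and are assumptions here, not a quote of any particular package.
# R/fit_mymodel.R in a hypothetical package built with the skeleton generator:
# the Stan program lives in inst/stan/mymodel.stan and is compiled at install
# time; the wrapper only assembles the data and calls the pre-compiled model.
fit_mymodel <- function(y, ...) {
  standata <- list(N = length(y), y = y)
  rstan::sampling(stanmodels$mymodel, data = standata, ...)
}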
References Carpenter, Bob, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. doi:10.18637/jss.v076.i01.




Speakers


Thursday July 6, 2017 2:24pm - 2:42pm CEST
3.01 Wild Gallery

2:24pm CEST

Improving DBI
Online presentation: https://krlmlr.github.io/useR17/Joint-profiling.html

Keywords: Database, SQLite, specification, test suite
Webpages: https://CRAN.R-project.org/package=DBI, https://CRAN.R-project.org/package=DBItest, https://CRAN.R-project.org/package=RSQLite
Getting data in and out of R is a minor but very important part of a statistician’s or data scientist’s work. Sometimes, the data are packaged as an R package; however, in the majority of cases one has to deal with third-party data sources. Using a database for storage and retrieval is often the only feasible option with today’s ever-growing data.
DBI is R’s native DataBase Interface: a set of virtual functions declared in the DBI package. DBI backends connect R to a particular database system by implementing the methods defined in DBI and accessing DBMS-specific APIs to perform the actual query processing. A common interface is helpful for both users and backend implementers. Thanks to generous support from the R Consortium, the contract for DBI’s virtual functions is now specified in detail in their documentation, which is also linked to corresponding backend-agnostic tests in the DBItest package. This means that the compatibility of backends with the DBI specification can be verified automatically. The support from the R Consortium also made it possible to bring one existing DBI backend, RSQLite, on par with the specification; the odbc package, a DBI-compliant interface to ODBC, has been written from scratch against the specification defined by DBItest.
Among other topics, the presentation will introduce new and updated DBI methods, show the design and usage of the test suite, and describe the changes in the RSQLite implementation.
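As a reminder of what the common interface looks like in practice, a minimal sketch against an in-memory RSQLite database:
library(DBI)
# Connect, write a table, run a query, disconnect
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")
dbDisconnect(con)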

Speakers

Thursday July 6, 2017 2:24pm - 2:42pm CEST
2.01 Wild Gallery

2:24pm CEST

mapedit - interactive manipulation of spatial objects
Keywords: Spatial analysis, Interactive, Visualization
Webpages: https://github.com/r-spatial/mapedit, http://r-spatial.org/r/2017/01/30/mapedit_intro.html
The R ecosystem offers a powerful set of packages for geospatial analysis. For a comprehensive list see the CRAN Task View: Analysis of Spatial Data. Yet, many geospatial workflows require interactivity for smooth uninterrupted completion. This interactivity is currently restricted to viewing and visual inspection (e.g. packages leaflet and mapview) and, with very few exceptions, there is no way to manipulate spatial data in an interactive manner in R. One noteworthy exception is the function drawExtent in the raster package, which lets the user select a geographic sub-region of a given Raster* object on a static plot of the visualized layer and saves the resultant extent or subset in a new object (if desired). Such operations are standard spatial tasks and are part of all standard spatial toolboxes. With new tools, such as htmlwidgets, shiny, and crosstalk, we can now inject this useful interactivity without leaving the R environment.
Package mapedit aims to provide a set of tools for basic, yet useful manipulation of spatial objects within the R environment. More specifically, we will provide functionality to:
  1. draw, edit and delete a set of new features on a blank map canvas,
  2. edit and delete existing features,
  3. select and query from a set of existing features,
  4. edit attributes of existing features.
In this talk we will outline the conceptual and technical approach we take in mapedit to provide the above functionality and will provide a short live demonstration highlighting the use of the package.
The mapedit project is being realized with financial support from the R Consortium.
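A minimal sketch of the intended interactive workflow; the function name editMap() reflects the development version on GitHub at the time and may differ from the final API.
library(mapedit)
library(leaflet)
# Opens an interactive map; features drawn by the user are returned
# as spatial objects once the editing session is finished
drawn <- editMap(leaflet() %>% addTiles())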

Speakers

Thursday July 6, 2017 2:24pm - 2:42pm CEST
4.02 Wild Gallery
  Talk, Graphics

2:42pm CEST

implyr: A dplyr Backend for Apache Impala
Keywords: Tidyverse, dplyr, SQL, Apache Impala, Big Data
Webpages: https://CRAN.R-project.org/package=implyr
This talk introduces implyr, a new dplyr backend for Apache Impala (incubating). I compare the features and performance of implyr to that of dplyr backends for other distributed query engines including sparklyr for Apache Spark’s Spark SQL, bigrquery for Google BigQuery, and RPresto for Presto.
Impala is a massively parallel processing query engine that enables low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS), Apache HBase, Apache Kudu, and Amazon Simple Storage Service (S3). The distributed architecture of Impala enables fast interactive queries on petabyte-scale data, but it imposes limitations on the dplyr interface. For example, row ordering of a result set must be performed in the final phase of query processing. I describe the methods used to work around this and other limitations.
Finally, I discuss broader issues regarding the DBI-compatible interfaces that dplyr requires for underlying connectivity to database sources. implyr is designed to work with any DBI-compatible interface to Impala, such as the general packages odbc and RJDBC, whereas other dplyr database backends typically rely on one particular package or mode of connectivity.
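A hedged sketch of what this looks like in practice; the DSN and the table name are placeholders, and any DBI-compatible driver could take the place of odbc here.
library(implyr)
library(odbc)
library(dplyr)
# Connect to Impala through an ODBC driver (DSN name is a placeholder)
impala <- src_impala(odbc(), dsn = "Impala DSN")
# Ordinary dplyr verbs are translated to Impala SQL;
# note that row ordering happens in the final phase of the query
tbl(impala, "flights") %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay)) %>%
  collect()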

Speakers

Thursday July 6, 2017 2:42pm - 3:00pm CEST
2.01 Wild Gallery

2:42pm CEST

brms: Bayesian Multilevel Models using Stan
The brms package (Bürkner, in press) implements Bayesian multilevel models in R using the probabilistic programming language Stan (Carpenter, 2017). A wide range of distributions and link functions are supported, allowing users to fit linear, robust linear, binomial, Poisson, survival, response times, ordinal, quantile, zero-inflated, hurdle, and even non-linear models all in a multilevel context. Further modeling options include auto-correlation and smoothing terms, user defined dependence structures, censored data, meta-analytic standard errors, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs. In addition, model fit can easily be assessed and compared with posterior predictive checks and leave-one-out cross-validation.
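A minimal example of the formula interface, using the epilepsy data set that ships with brms: a multilevel Poisson model with a varying intercept per patient.
library(brms)
# Multilevel Poisson regression with a patient-level intercept
fit <- brm(count ~ zAge + zBase * Trt + (1 | patient),
           data = epilepsy, family = poisson())
summary(fit)
pp_check(fit)   # posterior predictive check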


Speakers

Thursday July 6, 2017 2:42pm - 3:00pm CEST
3.01 Wild Gallery

2:42pm CEST

Depth and depth-based classification with R package **ddalpha**
Keywords: Data depth, Supervised classification, DD-plot, Outsiders, Visualization
Webpages: https://cran.r-project.org/package=ddalpha
Following the seminal idea of John W. Tukey, data depth is a function that measures how close an arbitrary point of the space lies to an implicitly defined center of a data cloud. Having undergone theoretical and computational developments, it is now employed in numerous applications, with classification being the most popular one. The R package ddalpha is software directed at fusing the experience of the applied user with recent achievements in the area of data depth and depth-based classification.
ddalpha provides an implementation for exact and approximate computation of the most reasonable and widely applied notions of data depth. These can be further used in the depth-based multivariate and functional classifiers implemented in the package, where the DDα-procedure is the main focus. The package is expandable with user-defined custom depth methods and separators. The implemented functions for depth visualization and the built-in benchmark procedures may also serve to provide insights into the geometry of the data and the quality of pattern recognition.
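A small sketch of the basic depth computation (assuming the package's depth.halfspace() interface); each value measures how centrally an observation lies within the data cloud.
library(ddalpha)
x <- as.matrix(iris[iris$Species == "setosa", 1:4])
# Tukey (halfspace) depth of each observation with respect to the same cloud
d <- depth.halfspace(x, x)
head(d)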



Thursday July 6, 2017 2:42pm - 3:00pm CEST
3.02 Wild Gallery

2:42pm CEST

Exploring and presenting maps with **tmap**
Keywords: Visualisation, maps, interaction, exploration
Webpages: https://CRAN.R-project.org/package=tmap, https://github.com/mtennekes/tmap
A map tells more than a thousand coordinates. Generally, people tend to like maps, because they are appealing, recognizable, and often easy to understand. Maps are not only useful for navigation, but also to explore, analyse, and present spatial data.
The tmap package offers a powerful engine to visualize maps, both static and interactive. It is based on the Grammar of Graphics, with a syntax similar to ggplot2, but tailored to spatial data. Layers from different shapes can be stacked, map legends and attributes can be added, and small multiples can be created.
An example of a map is the following. This map consists of a choropleth of Happy Planet Index values per country and a dot map of large world cities on top. Alternatively, a choropleth can also be created with qtm(World, "HPI").
library(tmap)
data(World, metro)
tm_shape(World) +
  tm_polygons("HPI", id = "name") +
  tm_text("name", size = "AREA") +
  tm_shape(metro) +
  tm_dots(id = "name") +
  tm_style_natural()
Interaction with charts and maps is no longer considered a nice extra feature that makes users say “wow, this is interactive!”. On the contrary, users expect charts and maps to be interactive, especially when published online. Also in R, interaction has become common ground, especially since the introduction of shiny and htmlwidgets. However, the rise of interactive maps does not mean the end of static maps. Newspapers, journals, and posters still rely on printed maps. To design a static thematic map that is appealing, informative, and simple is a special craft.
There are two modes in which maps can be visualized: "plot" for static plotting and "view" for interactive viewing. Users are able to switch between these modes without effort. The choropleth above is reproduced in interactive mode as follows:
tmap_mode("view") last_map() For lazy users like me, the code ttm() toggles between the two modes. The created maps can be exported to static file formats, such as pdf and png, as well as interactive html files. Maps can also be embedded in rmarkdown documents and shiny apps.
save_tmap(filename = "map.png", width = 1920)
save_tmap(filename = "index.html")
Visualization of spatial data is important throughout the whole process from exploration to presentation. Exploration requires short and intuitive coding. Presentation requires full control over the map layout, including color scales and map attributes. The tmap package facilitates both exploration and presentation of spatial data.

Speakers
avatar for Martijn Tennekes

Martijn Tennekes

Data scientist, Statistics Netherlands (CBS)
Data visualization is my thing. Author of R packages treemap, tabplot, and tmap.



Thursday July 6, 2017 2:42pm - 3:00pm CEST
4.02 Wild Gallery
  Talk, Graphics

2:42pm CEST

The analysis of R learning styles with R
Keywords: R in education, Learning patterns, Learning styles
The world of education is changing more than ever. In the university of the 21st century, there is no room for one-way education with a summative evaluation at the end of the teaching period. Instead, there is a need for formative assessment, including frequent and individual feedback (Lindblom Ylanne and Lonka 1998). However, when the number of students is large, providing individual feedback requires a huge amount of effort and time. This effort is intensified when the subject matter taught allows for a certain flexibility to solve problems. Although data analysis can provide useful insights about learning styles and patterns of students both at the time of learning and afterwards, this abstract shows that it can also be leveraged for providing fast feedback.
This abstract incorporates both a procedure to provide large-scale (semi-)individual feedback by using systematic assignments, and insights into different learning styles obtained by combining information on the assignments and the final scores of the students. The case used is a course on explorative data analysis (EDA) taught to a group of circa 80 first-year business engineering students at Hasselt University, covering a diverse set of topics such as data manipulation, visualization, import and tidying. During the course, students have to complete assignments at regular intervals in order to fully master the new skills, each arranged around a specific topic. These assignments come in the form of Rmarkdown files in which the students have to complete R chunks appropriately. Each Rmarkdown file is then re-run by the education team, and the data generated for each student is used for evaluation.
Each problem which the students have to solve in these assignments is labelled by the education team with the principles it assesses. For example, in the case of visualization, it might have to do with using appropriate aesthetics, appropriate geoms, appropriate context (e.g. titles, labels), etc. By mapping these labels and the scores of the student, a precise learning profile for each student can be constructed which indicates their weaknesses and strengths (Vermunt and Vermetten 2004). Using this information, students can be clustered in different groups, which can then be addressed with tailored feedback on their progress and pointers to useful additional exercises in order to remedy those areas in which they perform less well.
In a second step, an ex post analysis can be done by combining the learning profiles created with the final grades and possibly other information such as educational background. This information can be employed to find which groups of students represent problem cases, i.e. those having a high probability of failing the course. These insights can prove useful in future editions of the course, as a mechanism for rapid identification of students who might have difficulties with certain concepts. Moreover, they can be used to adapt the course, such that certain concepts which prove to be problematic are highlighted in a different or more comprehensive manner throughout the course (Tait and Entwistle 1996).
References Lindblom Ylanne, Sari, and Kirsti Lonka. 1998. “Individual Ways of Interacting with the Learning Environment. Are They Related to Study Success?” Learning and Instruction 9 (1). Elsevier: 1–18.

Tait, Hilary, and Noel Entwistle. 1996. “Identifying Students at Risk Through Ineffective Study Strategies.” Higher Education 31 (1). Springer: 97–116.

Vermunt, Jan D, and Yvonne J Vermetten. 2004. “Patterns in Student Learning: Relationships Between Learning Strategies, Conceptions of Learning, and Learning Orientations.” Educational Psychology Review 16 (4). Springer: 359–84.






Thursday July 6, 2017 2:42pm - 3:00pm CEST
PLENARY Wild Gallery
  Talk, Education
 
Friday, July 7
 

11:00am CEST

Actuarial and statistical aspects of reinsurance in R
Keywords: extreme value theory, censoring, splicing, risk measures
Webpages: https://CRAN.R-project.org/package=ReIns, https://github.com/TReynkens/ReIns
Reinsurance is an insurance purchased by one party (usually an insurance company) to indemnify parts of its underwritten insurance risk. The company providing this protection is then the reinsurer. A typical example of a reinsurance is an excess-loss insurance where the reinsurer indemnifies all losses above a certain threshold that are incurred by the insurer. Albrecher, Beirlant, and Teugels (2017) give an overview of reinsurance forms, and its actuarial and statistical aspects: models for claim sizes, models for claim counts, aggregate loss calculations, pricing and risk measures, and choice of reinsurance. The ReIns package, which complements this book, contains estimators and plots that are used to model claim sizes. As reinsurance typically concerns large losses, extreme value theory (EVT) is crucial to model the claim sizes. ReIns provides implementations of classical EVT plots and estimators (see e.g. Beirlant et al. 2004) which are essential tools when modelling heavy-tailed data such as insurance losses.
Insurance claims can take a long time to be completely settled, i.e. there can be a long delay between the occurrence of the claim and the final payment. If the claim is notified to the (re)insurer but not completely settled before the evaluation time, not all information on the final claim amount is available, and hence censoring is present. Several EVT methods for censored data are included in ReIns.
A global fit for the distribution of losses is needed in reinsurance, for example. Modelling the whole range of the losses using a standard distribution is usually very hard and often impossible. A possible solution is to combine two distributions in a splicing model: a light-tailed distribution for the body, i.e. light and moderate losses, and a heavy-tailed distribution for the tail to capture large losses. Reynkens et al. (2016) propose a splicing model with a mixed Erlang (ME) distribution for the body and a Pareto distribution for the tail. This combines the flexibility of the ME distribution with the ability of the Pareto distribution to model extreme values. ReIns contains an implementation of the expectation-maximisation (EM) algorithm to fit the splicing model to censored data. Risk measures and excess-loss insurance premiums can be computed using the fitted splicing model.
In this talk, we apply the plots and estimators, available in ReIns, to model real life insurance data. Focus will be on the splicing modelling framework and other methods adapted for censored data.
References Albrecher, Hansjörg, Jan Beirlant, and Jef Teugels. 2017. Reinsurance: Actuarial and Statistical Aspects. Wiley, Chichester.

Beirlant, Jan, Yuri Goegebeur, Johan Segers, and Jef Teugels. 2004. Statistics of Extremes: Theory and Applications. Wiley, Chichester.

Reynkens, Tom, Roel Verbelen, Jan Beirlant, and Katrien Antonio. 2016. “Modelling Censored Losses Using Splicing: A Global Fit Strategy with Mixed Erlang and Extreme Value Distributions.” https://arxiv.org/abs/1608.01566.




Speakers


Friday July 7, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

11:00am CEST

Change Point Detection in Persistence Diagrams
Keywords: TDA, Persistence, Wavelets, Change-Point Detection
Webpages: https://github.com/speegled/cpbaywave
Topological data analysis (TDA) offers a multi-scale method to represent, visualize and interpret complex data by extracting topological features using persistent homology. We will focus on persistence diagrams, which are a way of representing the persistent homology of a point cloud. At their most basic level, persistence diagrams can give something similar to clustering information, but they also can give information about loops or other topological structures within a data set.
Wavelets are another multi-scale tool used to represent, visualize and interpret complex data. Wavelets offer a way of examining the local changes of a data set while also estimating the global trends.
We will present two algorithms that combine wavelets and persistence. First, we use a wavelet based density estimator to bootstrap confidence intervals in persistence diagrams. Wavelets seem well-suited for this, since if the underlying data lies on a manifold, then the density should have discontinuities that will need to be detected. Additionally, the wavelet based algorithm is fast enough to allow some cross-validation of the tuning parameters. Second, we present an algorithm for detecting the most likely change point of the persistent homology of a time series.
The majority of this talk will consist of presenting examples which illustrate persistence diagrams, the change point detection algorithm, and the types of changes in geometric and/or topological structure in data that can be detected via this algorithm.
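For illustration, persistence diagrams themselves can be computed with the general-purpose TDA package on CRAN (this is not the authors' cpbaywave code): a noisy circle yields one prominent 1-dimensional feature, the loop.
library(TDA)
# Sample 100 points from a noisy circle
set.seed(1)
theta <- runif(100, 0, 2 * pi)
X <- cbind(cos(theta), sin(theta)) + matrix(rnorm(200, sd = 0.05), ncol = 2)
# Persistence diagram of the Rips filtration up to scale 2
pd <- ripsDiag(X, maxdimension = 1, maxscale = 2)
plot(pd$diagram)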

Speakers

Friday July 7, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery

11:00am CEST

IntegratedJM - an R package to Jointly Model the Gene-Expression and Bioassay Data, Taking Care of the Fingerprint Feature effect
Keywords: Bioactivity, Biomarkers, Chemical Structure, Joint Model, Multi-source
Webpages: https://cran.r-project.org/web/packages/IntegratedJM/index.html
Nowadays, data from different sources need to be integrated in order to arrive at meaningful conclusions. In drug-discovery experiments, most of the different data sources related to a new set of compounds under development are high-dimensional. For example, in order to investigate the properties of a new set of compounds, pharmaceutical companies need to analyse the chemical structure (fingerprint features) of the compounds, phenotypic bioactivity (bioassay read-outs) data for targets of interest, and transcriptomic (gene expression) data. Perualila-Tan et al. (2016) proposed a joint model in which the three data sources are included to better detect the association between gene expression and biological activity. For a given set of compounds, the joint modeling approach accounts for a possible effect of the chemical structure of the compound on both variables. The joint model allows us to identify genes as potential biomarkers for a compound's efficacy. The joint modeling approach proposed by Perualila-Tan et al. (2016) is implemented in the IntegratedJM R package, which provides, in addition to model estimation and inference, a set of exploratory and visualization functions that can be used to clearly present the results. The joint model and the IntegratedJM R package are discussed in detail in Perualila et al. (2016) as well.
References Perualila, Nolen Joy, Ziv Shkedy, Rudradev Sengupta, Theophile Bigirumurame, Luc Bijnens, Willem Talloen, Bie Verbist, Hinrich W.H. Göohlmann, Adetayo Kasim, and QSTAR Consortium. 2016. “Applied Surrogate Endpoint Evaluation Methods with Sas and R.” In, edited by Ariel Alonso, Theophile Bigirumurame, Tomasz Burzykowski, Marc Buyse, Geert Molenberghs, Leacky Muchene, Nolen Joy Perualila, Ziv Shkedy, and Wim Van der Elst, 275–309. CRC Press.

Perualila-Tan, Nolen, Adetayo Kasim, Willem Talloen, Bie Verbist, Hinrich W.H. Göhlmann, QSTAR Consortium, and Ziv Shkedy. 2016. “A Joint Modeling Approach for Uncovering Associations Between Gene Expression, Bioactivity and Chemical Structure in Early Drug Discovery to Guide Lead Selection and Genomic Biomarker Development.” Statistical Applications in Genetics and Molecular Biology 15: 291–304. doi:10.1515/sagmb-2014-0086.




Speakers


Friday July 7, 2017 11:00am - 11:18am CEST
3.02 Wild Gallery

11:00am CEST

naniar: Data structures and functions for consistent exploration of missing data
1: Monash University, Department of Econometrics and Business Statistics, nicholas.tierney@gmail.com; 2: Monash University, Department of Econometrics and Business Statistics, dicook@monash.edu; 3: Queensland University of Technology, ARC Centre of Excellence for Statistical and Mathematical Frontiers, milesmcbain@gmail.com
Keywords: Missing Data, Exploratory Data Analysis, Imputation, Data Visualization, Data Mining, Statistical Graphics
Missing values are ubiquitous in data and need to be carefully explored and handled in the initial stages of analysis to avoid bias. However, exploring why and how values are missing is typically an inefficient process. For example, visualising data with missing values in ggplot2 results in omission of missing values with a warning, and base R silently omits missing values (Wickham 2009). Additionally, imputed missing data are not typically distinguished in visualisation and data summaries. Tidy data structures described in Wickham (2014) provide an efficient, easy and consistent approach to performing data manipulation and wrangling, where each row is an observation and each column is a variable. There are currently no guidelines for representing missing data structures in a tidy format, nor simple approaches to visualising missing values. This paper describes an R package, naniar, for exploring missing values in data with minimal deviation from the common workflows of ggplot and tidy data. naniar provides data structures and functions that ensure missing values are handled effectively for plotting and summarising data with missing values, and for examining the effects of imputation.
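A minimal sketch of this workflow; the function names miss_var_summary() and geom_miss_point() are assumed from the naniar API as currently documented.
library(naniar)
library(ggplot2)
# Summarise missingness per variable
miss_var_summary(airquality)
# Scatterplot in which missing values are shown in the plot margins
# instead of being silently dropped
ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point()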
References
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23.



Speakers

Friday July 7, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery

11:00am CEST

Programming with tidyverse grammars
Keywords: tidyeval, tidyverse, dplyr, quasiquotation, NSE
Webpages: https://CRAN.R-project.org/package=dplyr, https://github.com/hadley/rlang
Evaluating code in the context of a dataset is one of R’s most useful features. This idiom is used in base R functions like subset() and transform() and has been developed in tidyverse packages like dplyr and ggplot2 to design elegant grammars. The downside is that such interfaces are notoriously difficult to program with. It is not as easy as it should be to program with dplyr inside functions in order to reduce duplicated code involving dplyr pipelines. To solve these issues, RStudio has developed tidyeval, a set of new language features that make it straightforward to program with these grammars. We present tidyeval in this talk with a focus on solving concrete problems with popular tidyverse packages like dplyr.
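As an illustration of the kind of problem tidyeval addresses, a small sketch of a reusable dplyr wrapper using the quosure-based API that shipped with dplyr 0.7:
library(dplyr)
# enquo() captures the expressions supplied by the caller;
# !! (unquote) splices them back into the dplyr verbs.
mean_by <- function(data, group, var) {
  group <- enquo(group)
  var   <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(mean = mean(!!var, na.rm = TRUE))
}
mean_by(mtcars, cyl, mpg)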

Speakers


Friday July 7, 2017 11:00am - 11:18am CEST
4.02 Wild Gallery

11:00am CEST

Teaching psychometrics and analysing educational tests with **ShinyItemAnalysis**
Keywords: psychometrics, educational test, item response theory, shiny, R
Webpages: https://CRAN.R-project.org/package=ShinyItemAnalysis, https://shiny.cs.cas.cz/ShinyItemAnalysis/
This work introduces the ShinyItemAnalysis R package (Martinková, Drabinová, Leder, & Houdek, 2017) and an online shiny application for the psychometric analysis of educational tests and their items.
ShinyItemAnalysis covers a broad range of methods and offers data examples, model equations, parameter estimates, and interpretation of results, together with selected R code, and is thus suitable for teaching psychometric concepts with R. It is based on examples developed for a course on Item Response Theory models for graduate students at the University of Washington.
In addition, the application aspires to be a simple tool for the analysis of educational tests by allowing users to upload and analyze their own data and to automatically generate an analysis report in PDF or HTML. It has been used at workshops for educators developing admission tests and in the development of instruments for classroom testing such as concept inventories, see McFarland et al. (2017).
We argue that psychometric analysis should be a routine part of test development in order to gather evidence of the reliability and validity of the measurement. Using the example of a medical school admission test, we demonstrate how ShinyItemAnalysis may provide a simple and free tool to routinely analyze tests and to explain advanced psychometric models to students and to those who develop educational tests.
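A rough sketch of local use, assuming the launcher function documented for the package (startShinyItemAnalysis()) is available:
# Install from CRAN and launch the interactive application locally
install.packages("ShinyItemAnalysis")
library(ShinyItemAnalysis)
startShinyItemAnalysis()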
References Martinková, P., Drabinová, A., Leder, O., & Houdek, J. (2017). ShinyItemAnalysis: Test and item analysis via shiny. Retrieved from shiny.cs.cas.cz/ShinyItemAnalysis/; https://CRAN.R-project.org/package=ShinyItemAnalysis

McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinková, P., Cliff, W., Michael, J., & Modell, H. (2017). Development and validation of the Homeostasis Concept Inventory. CBE-Life Sciences Education.




Speakers
avatar for Patrícia Martinková

Patrícia Martinková

Researcher, Institute of Computer Science, Czech Academy of Sciences
Researcher in statistics and psychometrics from Prague. Uses R to boost active learning in classes. Fulbright alumna and 2013-2015 visiting research scholar with Center for Statistics and the Social Sciences and Department of Statistics, University of Washington.



Friday July 7, 2017 11:00am - 11:18am CEST
PLENARY Wild Gallery
  Talk, Shiny II

11:18am CEST

Detecting eQTLs from high-dimensional sequencing data using recount2
Keywords: eQTLs, RNA-seq, recount2, Batch Effect, gEUVADIS
Webpages: https://jhubiostatistics.shinyapps.io/recount/, https://www.bioconductor.org/packages/recount
recount2 is a recently launched multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets for 2,041 different studies with over 70,000 human RNA-seq samples from the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) projects (Collado-Torres et al. 2016). The raw sequencing reads were processed with Rail-RNA as described in Nellore et al. (2016). RangedSummarizedExperiment objects at the gene, exon or exon-exon junction level, the raw counts, the phenotype metadata used, and the URLs to the sample coverage bigWig files or the mean coverage bigWig file for a particular study can be accessed via the Bioconductor package recount or via a Shiny App.
We use this source of preprocessed RNA-seq expression data to present our recently developed analysis protocol for performing extensive eQTL analyses. The goal of an eQTL analysis is to detect patterns of gene expression related to specific genetic variants. We demonstrate how to integrate gene expression data from recount2 and genotype information to perform eQTL analyses and visualize the results with gene-SNP interaction plots. We explain in detail how expression and genotype data are filtered, transformed, and batch corrected. We also discuss possible pitfalls and artifacts that may occur when analyzing genomic data from different sources jointly. Our protocol is tested on a publicly available data set of the RNA-sequencing project from the GEUVADIS consortium and also applied to recently generated omics data from the GeneSTAR project at Johns Hopkins University.
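To make the data-access step concrete, a hedged sketch using the recount Bioconductor package; the accession "ERP001942" is assumed here to be the GEUVADIS study and is used purely for illustration.
library(recount)
# Download the gene-level RangedSummarizedExperiment for one study,
# scale the counts and extract the count matrix
download_study("ERP001942", type = "rse-gene")
load(file.path("ERP001942", "rse_gene.Rdata"))   # provides the object rse_gene
rse_gene <- scale_counts(rse_gene)
counts <- assays(rse_gene)$counts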
References Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey Leek. 2016. “Recount: A Large-Scale Resource of Analysis-Ready RNA-seq Expression Data.” bioRxiv. doi:10.1101/068478.

Nellore, Abhinav, Leonardo Collado-Torres, Andrew E Jaffe, Jose Alquicira-Hernandez, Christopher Wilks, Jacob Pritt, James Morton, Jeffrey T Leek, and Ben Langmead. 2016. “Rail-RNA: Scalable Analysis of RNA-seq Splicing and Coverage.” Bioinformatics. doi:10.1093/bioinformatics/btw575.




Speakers


Friday July 7, 2017 11:18am - 11:36am CEST
3.02 Wild Gallery

11:18am CEST

Easy imputation with the simputation package
Missing value imputation is a common technique for dealing with missing data. Accordingly, R and its many extension packages offer a wide range of techniques to impute missing data. Imputation can be done using specialized imputation functions or, with a bit of programming, one of the many predictive models available in R or its extension packages.

The current set of available imputation and modeling techniques is the result of decades of development by many different contributors. As a result, imputation and modeling functions may have very different interfaces across packages. Combining and comparing imputation methods can therefore be a cumbersome task.

The simputation package offers a uniform and robust interface to a number of popular imputation techniques. The package follows the ‘grammar of data manipulation’ (Wickham and Francois 2016), where the first argument to a function and its output are always rectangular datasets. This allows one to chain imputation methods with the not-a-pipe operator of the magrittr package (Bache and Wickham 2014). In simputation all imputation functions are of the following form.

impute_[model](data, formula, ...)
For example, functions impute_lm or impute_em impute missing values based on linear modeling or EM-estimation respectively. The formula object is interpreted so multiple variables can be imputed based on the same set of predictors. Also, a grouping operator (|) allows one to impute using the split-apply-combine strategy for any imputation method.

Currently supported methods include imputation based on standard linear models, M-estimation and elasticnet (ridge, lasso) regression; CART and randomForest models; multivariate methods including EM-estimation and iterative randomForest estimation; and donor imputation including random and sequential hotdeck, predictive mean matching and kNN imputation. A flexible interface for simple user-provided imputation expressions is provided as well.
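A short sketch of such a chain, using the iris data with a few values removed for illustration (the exact grouping syntax for the donor/median step is assumed from the package vignette):
library(simputation)
library(magrittr)
dat <- iris
dat[1:5, "Sepal.Length"] <- NA
# Model-based imputation first; anything still missing afterwards is
# imputed with the median per Species
dat %>%
  impute_lm(Sepal.Length ~ Sepal.Width + Petal.Width) %>%
  impute_median(Sepal.Length ~ Species)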

Speakers


Friday July 7, 2017 11:18am - 11:36am CEST
2.01 Wild Gallery

11:18am CEST

Interactive bullwhip effect exploration using SCperf and Shiny
Abstract: The bullwhip effect, an increase in demand variability along the supply chain, is regarded as a key driver of inefficiencies associated with the supply chain. In the presence of this phenomenon, participants involved in the manufacture of a product and its distribution to the final customer face unstable production schedules or excessive inventory.
Although there are several implementations illustrating the bullwhip effect, it still remains difficult for scholars and supply chain practitioners to understand and quantify its real effect on supply chain performance.
Using the SCperf package and Shiny app, we have developed an interactive bullwhip game which follows the standard setup of the classic “Beer Distribution Game” (MIT). Our web interface illustrates the distribution process of a multi-echelon supply chain, the goal of the game being to minimize costs along the chain while satisfying service level requirements.
Our open source application is user friendly, easily supports sophisticated forecasting techniques and inventory models and does not require any R experience.
In this talk, we describe the underlying design of the game and show by means of examples how changing the forecasting method or tuning the parameters of the replenishment policy (lead time, customer service level, etc) induces or reduces the bullwhip effect. This application may be adapted to become a learning tool in classroom and training programs.
Keywords: bullwhip effect, beergame, inventory, SCperf, shiny

Speakers
avatar for Marlene Silva-Marchena

Marlene Silva-Marchena

Independent consultant
I am a Business Intelligence consultant. I work with database creation and advanced analytics. I transform information into strategic knowledge to help companies to optimize business decisions. My interests include statistical modelling, programming, operational research and risk... Read More →



Friday July 7, 2017 11:18am - 11:36am CEST
2.02 Wild Gallery

11:18am CEST

Letting R sense the world around it with **shinysense**
Keywords: Shiny, JavaScript, Data Collection, User Experience
Webpages: https://github.com/nstrayer/shinysense, https://nickstrayer.shinyapps.io/shinysense_earr_demo/, https://nickstrayer.shinyapps.io/shinysense_swipr_demo/
shinysense is a package containing shiny modules all geared towards helping users make mobile-first apps for collecting data, or helping Shiny “sense” the outside world. Currently the package contains modules for gathering data on swiping (shinyswipr), audio (shinyearr), and from accelerometers (shinymovr). The goal of these functions is to take Shiny from a tool for demonstrating finished models or workflows into being a tool for data collection, enabling its use for training/testing models or building richer user experiences.
Several demo apps are contained in the package, including training and testing a speech recognition system using shinyearr and detecting spell casts performed by swinging your phone like a wand with shinymovr. In addition, the package is already being used in real-world products. Notable examples being the app papr which allows users to rapidly read and react to abstracts by swiping cards containing their content using shinyswipr, validating algorithm output in GenomeBot Tweet Generator, and contributr: an app that allows users to review github issues on various R packages.
A major goal in the construction of shinysense was mobile-friendly behavior. The massive proliferation of smartphones laden with sensors is a potential goldmine of data and use cases for statisticians and data scientists. This package attempts to help users harness this new flood of opportunities. A side effect of mobile-oriented design is increased usability in non-static environments (Dou and Sundar 2016; Wigdor, Fletcher, and Morrison 2009). Use cases range from an app running on a smartphone that allows physicians to input parameters into clinical models and instantly see the results (shinyswipr), to the generation and testing of fitness-tracking algorithms by carrying a phone in a pocket (shinymovr).
Much in keeping with the primary goal of Shiny, by bringing powerful software tools, such as inputs typically reserved for JavaScript or native apps, to a tool used by scientists such as R, we hope to lower the costs (monetary and otherwise) of bringing innovative applications to fruition.
References Dou, Xue, and S Shyam Sundar. 2016. “Power of the Swipe: Why Mobile Websites Should Add Horizontal Swiping to Tapping, Clicking, and Scrolling Interaction Techniques.” International Journal of Human-Computer Interaction 32 (4). Taylor & Francis: 352–62.

Wigdor, Daniel, Joe Fletcher, and Gerald Morrison. 2009. “Designing User Interfaces for Multi-Touch and Gesture Devices.” In CHI’09 Extended Abstracts on Human Factors in Computing Systems, 2755–8. ACM.




Speakers

Friday July 7, 2017 11:18am - 11:36am CEST
PLENARY Wild Gallery
  Talk, Shiny II

11:18am CEST

Modules in R
Keywords: programming, functional-programming
Webpages: https://CRAN.R-project.org/package=modules, https://github.com/wahani/modules
In this talk I present the concept of modules inside the R language. The key idea of the modules package is to provide a unit of source code which is self-contained, i.e. has its own scope. The main and most reliable infrastructure for such organisational units of source code is a package. Compared to a package, modules can be considered ad hoc, but still self-contained. That means they come with a mechanism to import dependencies and also to export member functions. However, modules do not act as replacements for packages. Instead they are a unit of abstraction in between functions and packages.
There are two use cases in which modules can be beneficial. First, when we write scripts and want to use sourced functions or, in general, need more control over the enclosing environment of a function. Here we may want to state the dependencies of a function close to its definition, and also without the typical side effects on the global environment of the current R session. Second, as an organisational unit inside packages. Here modules can act as entities similar to objects in object-oriented programming. However, other languages with similar concepts are mostly functional, and the design borrows from languages like Julia, Erlang and F#. As a result modules are not designed to contain data. Furthermore there is no formal mechanism for inheritance. Instead several possibilities for module composition are implemented.
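A minimal sketch of such a unit, assuming the module(), import() and export() functions of the CRAN modules package:
library(modules)
stats_tools <- module({
  import("stats", "median")   # explicit dependency, no side effects on the global environment
  export("robust_center")     # only this function is visible to users of the module
  robust_center <- function(x) median(x, na.rm = TRUE)
})
stats_tools$robust_center(c(1, 2, 100, NA))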



Friday July 7, 2017 11:18am - 11:36am CEST
4.02 Wild Gallery

11:18am CEST

Morphological Analysis with R
Keywords: Shiny, DataTables, analysis methods, problem-solving
Webpages: https://github.com/sgrubsmyon/morphr
Morphological analysis is a problem-structuring method developed by the astrophysicist Fritz Zwicky in the 1940s to 1960s [1–3]. It can be used to explore and constrain a multi-dimensional, possibly non-quantifiable, problem space. The problem is put into a morphological field, a tabular representation where each parameter corresponds to a column whose rows are filled with the parameter values. Each parameter value is mutually checked for consistency with all other parameter values. This makes it possible to systematically exclude inconsistent configurations and therefore greatly reduces the problem space.
Dedicated software is helpful to visualize and work with a morphological field. While full-fledged software solutions already exist [4,5], they are confined to the Windows desktop and cannot be run in a web browser. The R package morphr is a first step into the direction of a browser-based morphological analysis tool. By leveraging R technology, one can relatively easily bring morphological analysis into a modern, web-centric, cross-platform environment, embedded in an open source ecosystem. Morphological fields and their constraints can be visualized interactively in a web browser: a user can select parameter values via mouse click, causing the field to highlight the remaining configurations consistent with the selection. To provide the interactivity, the package is using R’s shiny package and is built on top of the excellent DT package, which is an R wrapper around the JavaScript library DataTables.
In this talk, morphological analysis in general is introduced and it is shown how a morphological analysis can be assisted with R using morphr.
References 1. Zwicky F. Morphology and nomenclature of jet engines. Aeronautical Engineering Review (1947) 6:49–50.

2. Zwicky F. Morphological astronomy. The Observatory (1948) 68:121–143. Available at: http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1948Obs....68..121Z&data_type=PDF_HIGH&whole_paper=YES&type=PRINTER&filetype=.pdf

3. Zwicky F, Wilson AG. New methods of thought and procedure - contributions to the symposium on methodologies. Berlin Heidelberg: Springer Science & Business Media (1967). doi:10.1007/978-3-642-87617-2

4. Swedish Morphological Society. General Morphological Analysis - A general method for non-quantified modeling. (2002 (Revised 2013)) Available at: http://www.swemorph.com/ma.html

5. Swedish Morphological Society. MA/Carma™ - Advanced Computer Support for General Morphological Analysis. (2005–2016) Available at: http://www.swemorph.com/macarma.html




Speakers


Friday July 7, 2017 11:18am - 11:36am CEST
3.01 Wild Gallery

11:36am CEST

Generating Missing Values for Simulation Purposes: A Multivariate Amputation Procedure
Keywords: R-function ampute, Multivariate Amputation, Missing Data Methodology, Simulation Studies
Webpages: https://github.com/RianneSchouten/mice/blob/ampute/vignettes/Vignette_Ampute.pdf, https://www.rdocumentation.org/packages/mice/versions/2.30/topics/ampute, https://cran.r-project.org/web/packages/mice/index.html
Abstract: Missing data are a ubiquitous problem in scientific research, especially since most statistical analyses require complete data. To evaluate the performance of methods dealing with missing data, researchers perform simulation studies. An important aspect of these studies is the generation of missing values in complete data (i.e. the amputation procedure) and this procedure will be our focus.
Since no amputation software was available, we developed and implemented an extensive amputation procedure in an R function: ampute (available in the multiple imputation package mice). We will show that the multivariate amputation approach generates legitimate missing data problems.
We will provide evidence that ampute overcomes the problems of stepwise univariate amputation. With ampute, we have an efficient amputation method to accurately evaluate missing data methodology.
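A minimal example of the function (argument names follow the mice documentation): amputing a complete, simulated data set with 30% missingness under a MAR mechanism.
library(mice)
set.seed(2017)
complete_data <- MASS::mvrnorm(n = 500, mu = rep(0, 3), Sigma = diag(3))
amp <- ampute(complete_data, prop = 0.3, mech = "MAR")
head(amp$amp)    # the incomplete (amputed) data
amp$patterns     # the missing data patterns that were applied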

Speakers


Friday July 7, 2017 11:36am - 11:54am CEST
2.01 Wild Gallery

11:36am CEST

Integrated analysis of digital PCR experiments in R
Keywords: digital PCR, multiple comparison, GUI, reproducible research
Webpages: https://CRAN.R-project.org/package=dpcR, http://michbur.github.io/dpcR_manual/, http://michbur.github.io/pcRuniveRsum/
Digital PCR (dPCR) is a variant of PCR, where the PCR amplification is conducted in multiple small volume reactions (termed partitions) instead of a bulk. The dichotomous status of each partition (positive or negative amplification) is used for absolute quantification of the template molecules by Poisson transformation of the proportion of positive partitions. The vast expansion of dPCR technology and its applications has been followed by the development of statistical data analysis methods. Yet, the software landscape is scattered, consisting of scripts in various programming languages, web servers with narrow scopes or closed source vendor software packages, that are usually tightly tied to their platform. This leads to unfavourable environments, as results from different platforms, or even from different laboratories using the same platform, cannot be easily compared with one another.
To address these challenges, we developed the dpcReport shiny server, which provides an open-source tool for the analysis of dPCR data. dpcReport provides a streamlined analysis framework to the dPCR community that is compatible with the data output (e.g., CSV, XLSX) from different dPCR platforms (e.g., Bio-Rad QX100/200, Biomark). This goes beyond the basic dPCR data analysis offered by vendor-supplied software, which is often limited to the computation of the mean template copy number per partition and its uncertainty. dpcReport gives users more control over their data analysis and they benefit from standardization and reproducible analysis.
Our web server analyses data regardless of the platform vendor or type (droplet or chamber dPCR). It is not limited to the commercially available platforms and can also be used with experimental systems by importing data through the universal REDF format, which follows the IETF RFC 4180 standard. dpcReport provides users with advanced tools for data quality control and incorporates statistical tests for comparing multiple reactions in an experiment (Burdukiewicz et al. 2016), currently absent in many dPCR-related software tools. The conducted analyses are fully integrated within extensive and customizable interactive HTML reports including figures, tables and calculations.
To improve reproducibility and transparency, a report may include R snippets enabling an exact reproduction of the analysis performed by dpcReport. We developed the dpcR package to collect all functionalities employed by the shiny server. Furthermore, the package provides additional functions facilitating analysis and quality control of dPCR data. Nevertheless, the core functionalities are available through the shiny server to minimize the entry barrier to using our software.
Both dpcReport and dpcR follow the standardized dPCR nomenclature of the dMIQE guidelines (Huggett et al. 2013). Since the vast functionality offered by our software may be overwhelming at first, it is extensively documented. The documentation is enriched by the analysis of sample data sets.
The dpcReport web server and the dpcR package belong to pcRuniveRsum, a collection of R tools for the analysis of DNA amplification experiments.

Speakers

Friday July 7, 2017 11:36am - 11:54am CEST
3.02 Wild Gallery

11:36am CEST

Markov-Switching GARCH Models in R: The MSGARCH Package
Keywords: GARCH, MSGARCH, Markov–switching, conditional volatility, risk management
Webpages: https://CRAN.R-project.org/package=MSGARCH, https://github.com/keblu/MSGARCH
Markov–switching GARCH models have become popular to model the structural break in the conditional variance dynamics of financial time series. In this paper, we describe the R package MSGARCH which implements Markov–switching GARCH–type models very efficiently by using C++ object–oriented programming techniques. It allows the user to perform simulations as well as Maximum Likelihood and Bayesian estimation of a very large class of Markov–switching GARCH–type models. Risk management tools such as Value–at–Risk and Expected–Shortfall calculations are available. An empirical illustration of the usefulness of the R package MSGARCH is presented.

Speakers


Friday July 7, 2017 11:36am - 11:54am CEST
2.02 Wild Gallery

11:36am CEST

ompr: an alternative way to model mixed-integer linear programs
Keywords: integer programming, linear programming, modelling, optimization
Webpages: https://github.com/dirkschumacher/ompr
Many real-world optimization problems, such as the popular traveling salesman problem, can be formulated as a mixed-integer linear program (MILP). The aim of a MILP is to optimize a linear objective function subject to a set of linear constraints. Over the past decades, specialized open-source and commercial solvers have been developed, such as the GNU Linear Programming Kit (GLPK), which can efficiently solve these kinds of problems.
In R, interfaces to these solvers are mostly matrix-oriented. When solving a MILP in R, you would thus first develop your actual model and then translate it into code that constructs a matrix and vectors before passing it to a solver. Especially for more complex models, the resulting R code can be rather hard to develop and to reason about without additional documentation.
ompr is a domain-specific language that lets you model MILPs declaratively using functions like set_objective, add_variable or add_constraint. Together with magrittr pipes, you can build a model incrementally, just like a dplyr statement, without worrying about how to build the underlying matrix and vectors. Furthermore, an ompr model is independent of any specific solver, and many popular solvers can easily be used through the ROI family of packages (Hornik et al. 2016).
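To give a flavour of the declarative style, a toy ompr model solved with GLPK through ROI might look like the following sketch (the companion packages ompr.roi and ROI.plugin.glpk are assumed to be installed; the variables, bounds and coefficients are purely illustrative):

```r
library(ompr)
library(ompr.roi)        # bridges ompr models to ROI solvers
library(ROI.plugin.glpk) # registers GLPK as an ROI solver
library(magrittr)

result <- MIPModel() %>%
  add_variable(x, type = "integer",    lb = 0) %>%
  add_variable(y, type = "continuous", lb = 0) %>%
  set_objective(x + 2 * y, "max") %>%
  add_constraint(x + y <= 11) %>%
  add_constraint(3 * x + 4 * y <= 36) %>%
  solve_model(with_ROI(solver = "glpk"))

objective_value(result)
get_solution(result, x)
get_solution(result, y)
```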
The idea of modelling mixed-integer programs algebraically is not new in general. Domain-specific languages such as GNU MathProg or the JuMP project (Dunning, Huchette, and Lubin 2015) in Julia implement a similar approach to ompr and inspired its development. As far as I know, there is one other related R package, roml (Vana, Schwendinger, and Hochreiter 2016), which is currently under development and follows a similar path.
The ompr package is developed and available on GitHub. In addition to the package itself, several vignettes and examples describe how to model and solve popular optimization problems, such as the traveling salesman problem, the warehouse location problem, or solving Sudokus interactively with shiny.
In this talk I will present the modelling features of ompr, how the package can be used to solve practical optimization problems and some ideas for future developments.
References
Dunning, Iain, Joey Huchette, and Miles Lubin. 2015. “JuMP: A Modeling Language for Mathematical Optimization.” arXiv:1508.01982 [Math.OC]. http://arxiv.org/abs/1508.01982.

GNU Linear Programming Kit. 2017. http://www.gnu.org/software/glpk/glpk.html.

Hornik, Kurt, David Meyer, Florian Schwendinger, and Stefan Theussl. 2016. ROI: R Optimization Infrastructure. https://CRAN.R-project.org/package=ROI.

Vana, Laura, Florian Schwendinger, and Ronald Hochreiter. 2016. R Optimization Modeling Language. https://r-forge.r-project.org/projects/roml/.




Speakers


Friday July 7, 2017 11:36am - 11:54am CEST
3.01 Wild Gallery

11:36am CEST

papr: Tinder for pre-prints, a Shiny Application for collecting gut-reactions to pre-prints from the scientific community
papr is an R Shiny web application and social network for evaluating bioRxiv pre-prints. The app serves multiple purposes, allowing the user to quickly swipe through pertinent abstracts as well as find a community of researchers with similar interests. It also serves as a portal for accessible “open science”, getting abstracts into the hands of users of all skill levels. Additionally, the data could help build a general understanding of what research the community finds exciting.

We allow the user to log in via Google to track multiple sessions and have implemented a recommender engine, allowing us to tailor which abstracts are shown based on each user’s previous abstract rankings. While using the app, users view an abstract pulled from bioRxiv and rate it as “exciting and correct”, “exciting and questionable”, “boring and correct”, or “boring and questionable” by swiping the abstract in a given direction. The app includes optional social network features, connecting users who provide their Twitter handle to users who enjoy similar papers.

This presentation will demonstrate how to incorporate tactile interfaces, such as swiping, into a Shiny application using shinysense, a package we created for this functionality; how to store real-time user data on Dropbox using rdrop2; how to add log-in capabilities using googleAuthR and googleID; how to implement a recommender engine using principal component analysis; and how we have handled issues of data safety and security through proactive planning and risk mitigation. Finally, we will report on the app activity, summarizing both the user traffic and what research users are finding exciting.


Friday July 7, 2017 11:36am - 11:54am CEST
PLENARY Wild Gallery

11:36am CEST

Taking Advantage of the Byte Code Compiler
Keywords: Compiler, Byte-Code, Interpreter, Performance
Webpages: http://www.stat.uiowa.edu/~luke/R/compiler/compiler.pdf
Since version 2.13, R includes a byte-code compiler and interpreter which complement the older abstract-syntax-tree (AST) interpreter. The AST interpreter directly executes R code represented as a tree of objects produced by the parser. The byte-code compiler compiles the AST into a sequence of byte-code instructions, which is then interpreted using a byte-code interpreter. The byte-code compiler and interpreter implement a number of performance optimizations which speed up scalar code and particularly loops operating on scalar variables. Performance of highly vectorized code is likely to be unaffected by the byte-code compiler/interpreter as in fact such code spends most of the time executing outside the R interpreters. The compiler can compile packages at installation time and individual R functions on request. It can also transparently compile loops and functions as they execute (just-in-time).
The talk will show examples of code that runs particularly fast with the compiler, code that is unaffected by it, and code that runs particularly slow. The slowdowns are almost always due to time spent in compilation; once compiled and loaded, the code should not run slower than in the unoptimized AST interpreter. When triggered just-in-time, the compiler includes heuristics that try to prevent compilation in case it is not likely to pay off, but sometimes they are wrong. The talk will show on concrete examples how these overheads can be avoided. The talk will be technical and will be aimed at package authors and R users who write performance critical R code.
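For orientation, the two explicit ways of invoking the compiler described above, compiling a function on request and just-in-time compilation, can be exercised roughly as follows (a minimal sketch; the observed speed-up depends on the R version and the machine, and recent R releases enable the JIT by default):

```r
library(compiler)

# Scalar loop code of the kind that benefits most from byte-code compilation
csum <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]
  s
}

enableJIT(0)            # switch off just-in-time compilation to isolate the effect
csum_c <- cmpfun(csum)  # compile an individual function on request

x <- runif(1e6)
system.time(csum(x))    # runs in the AST interpreter
system.time(csum_c(x))  # runs in the byte-code interpreter, usually noticeably faster

enableJIT(3)            # re-enable JIT compilation of functions and loops
```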

Speakers

Friday July 7, 2017 11:36am - 11:54am CEST
4.02 Wild Gallery

11:54am CEST

Dynamic Assessment of Microbial Ecology (DAME): A Shiny App for Analysis and Visualization of Microbial Sequencing Data


Webpages: http://bioconductor.org/packages/pcaExplorer/, https://github.com/federicomarini/ideal

Next generation sequencing technologies, such as RNA-Seq, generate tens of millions of reads to define the expression levels of the features of interest. A wide variety of software packages has been developed to accommodate the needs of the researcher, mostly in the R/Bioconductor framework. Many of them focus on the identification of differentially expressed (DE) genes (DESeq2, edgeR; Love et al. 2015) to discover quantitative changes between experimental groups, while others address alternative splicing, the discovery of novel transcripts, or RNA editing.

Moreover, Exploratory Data Analysis is a step common to all these workflows, and despite its importance for generating highly reliable results, it is often neglected, as many of the steps involved might require considerable programming proficiency from the user. Principal Components Analysis (PCA) is often used to obtain a dimension-reduced overview of the data (Jolliffe 2002).
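For comparison, the scripted form of such a PCA overview, which pcaExplorer wraps in an interactive interface, might look like this minimal DESeq2 sketch (simulated data; this is standard DESeq2 code, not pcaExplorer's own API):

```r
library(DESeq2)

# Simulated data standing in for a real RNA-seq experiment
dds <- makeExampleDESeqDataSet(n = 1000, m = 8)

# Variance-stabilizing transformation before ordination
rld <- rlog(dds)

# Dimension-reduced overview of the samples, coloured by experimental group
plotPCA(rld, intgroup = "condition")
```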

Our proposal addresses the two steps of Exploratory Data Analysis and Differential Expression analysis with two different packages, integrated and available in Bioconductor, namely pcaExplorer and ideal. These web applications, developed in the Shiny framework, also include support for reproducible analyses, thanks to an embedded text editor and a template document, seamlessly generating HTML reports as a result of the user's exploration.

This solution, which we also outlined in Marini (2016), serves as a concrete proof of principle of integrating the essential features of interactivity (as a proxy for accessibility) and reproducibility in the same tool, fitting the needs of both life scientists and experienced analysts, and thus making our packages good candidates to become companion tools for every RNA-Seq analysis.




Speakers

Federico Marini

IMBEI - Mainz
Just got my PhD in Biostatistics/Bioinformatics at the IMBEI @University Medical Center in Mainz, Germany



Friday July 7, 2017 11:54am - 12:12pm CEST
3.02 Wild Gallery

11:54am CEST

Dynamic modeling and parameter estimation with dMod
Keywords: Parameter Estimation, ODEs, Systems Biology, Maximum-Likelihood, Profile-Likelihood
Webpages: https://github.com/dkaschek/dMod, https://github.com/dkaschek/cOde
ODE models to describe and understand interactions in complex dynamical systems are widely used in the physical sciences and beyond. In many situations, the model equations depend on parameters. When parameters are not known from first principle, they need to be estimated from experimental data.
The dMod package for R provides a framework for formulating complex reaction networks and estimating the inherent reaction parameters from experimental data. By design, different experimental conditions as well as explicit or implicit equality constraints, e.g., steady-state constraints, are formulated via parameter transformations, which thereby take a central role in dMod. Since, in general, the observed reaction dynamics is a non-linear function of the reaction parameters, profile-likelihood methods are implemented to assess non-linear parameter dependencies and to estimate parameter and prediction confidence intervals.
Here, we present the abilities and particularities of our modeling framework. The methods are illustrated based on a minimal systems biology example.

Speakers

Friday July 7, 2017 11:54am - 12:12pm CEST
3.01 Wild Gallery

11:54am CEST

How to deal with Missing Data in Time Series and the imputeTS package
Keywords: Missing Data, Time Series, Imputation, Visualization, Data Pre-Processing
Webpages: https://CRAN.R-project.org/package=imputeTS, https://github.com/SteffenMoritz/imputeTS
In almost every domain, from industry and finance to biology, time series data is measured. One common problem that can come along with time series measurements is missing observations. During several projects with industry partners in the last years, we often experienced sensor malfunctions or transmission issues leading to missing sensor data. As subsequent processes or analysis methods may require missing values to be replaced with reasonable values up front, missing data handling can be crucial.
This talk gives a short overview of methods for handling missing data in time series in R and subsequently introduces the imputeTS package. The imputeTS package is specifically made for handling missing data in time series and offers several functions for visualization and replacement (imputation) of missing data. Usage examples show how imputeTS can be applied for time series imputation.
Most well-known and established packages for missing value imputation (e.g. mice, VIM, Amelia, missMDA) focus mostly on cross-sectional data, while methods for time series data are less familiar to users. Also, from an algorithmic perspective, these two imputation use cases are slightly different: imputation for cross-sectional data relies on inter-attribute correlations, while (univariate) time series imputation needs to exploit time dependencies. Overall, this talk aims to give users a first glance at time series imputation in R, with a special focus on the imputeTS package.
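To give a flavour of the package, a minimal session on the bundled tsAirgap series might look as follows (a sketch; the underscore function names follow the current imputeTS release, while the 2017-era release used dot-separated names such as na.kalman):

```r
library(imputeTS)

# tsAirgap ships with imputeTS: monthly airline passenger counts with missing values
statsNA(tsAirgap)                    # printed summary of the missing-data situation
ggplot_na_distribution(tsAirgap)     # visualize where the gaps occur

imp_interp <- na_interpolation(tsAirgap)   # linear interpolation
imp_kalman <- na_kalman(tsAirgap)          # Kalman smoothing on a state-space model
imp_seadec <- na_seadec(tsAirgap)          # remove seasonality, impute, re-add seasonality

# Compare the imputed values against the incomplete series
ggplot_na_imputations(tsAirgap, imp_kalman)
```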

Speakers


Friday July 7, 2017 11:54am - 12:12pm CEST
2.01 Wild Gallery

11:54am CEST

R and Haskell: Combining the best of two worlds
Keywords: HaskellR, Haskell, Interoperability
Surely there’s no need to explain to useR! attendees what’s so great about R! Haskell, on the other hand, is a great language too - statically typed, purely functional, lazy, fast, and with that, you know, cool and mathy touch … ;-)
Statistics, data science, and machine learning, however, are not that easy to do from Haskell, as it does not have all the specialized and comfortable-to-use libraries R has.
Fortunately, the guys at tweag.io developed HaskellR, providing Haskell with full R interoperability. With ihaskell-inline-R, R can even be used in IHaskell notebooks. In this session, we’ll show how to get started with HaskellR, and how you can get the best of both worlds.

Speakers

Friday July 7, 2017 11:54am - 12:12pm CEST
2.02 Wild Gallery

11:54am CEST

shiny.collections: Google Docs-like live collaboration in Shiny
Keywords: Shiny, data applications, UX, live collaboration, data persistence
Webpages: https://appsilon.github.io/shiny.collections/
What users expect from web applications today differs dramatically from what was available 5 years ago. They are used to interactivity, data persistence, and what’s more, the ability to share live collaboration experiences, like in Google Docs. If one user changes the data, other users want to see the changes immediately on their screens. They don’t care whether it is a data-exploration app from a data scientist or a solution built by a team of software engineers.
Shiny is perfect for building interactive data-driven applications suited for the modern user. In this presentation, we show how to create real-time collaboration experience in Shiny apps.
From the presentation, you will learn the concepts of reactive databases, how to use them in Shiny, and how to adapt existing components to provide live collaboration.
We will present a package we developed for that. shiny.collections adds persistent reactive collections that can be effortlessly integrated with components like Shiny inputs, DT data table or rhandsontable. The package makes it easy to build collaborative Shiny applications with persistent data.
The presentation will be very actionable. Our goal is for everyone in the audience to be able to add persistence and collaboration to their apps in less than 10 minutes.
References
“RethinkDB.” https://rethinkdb.com/.




Speakers

Marek Rogala

Appsilon Data Science
I'm a data scientist, software engineer and entrepreneur with experience from Google and Domino Data Lab. Passionate about data analysis, machine learning, software design and tackling hard algorithmic and analytical problems. CTO and co-founder at Appsilon Data Science - consulting...



Friday July 7, 2017 11:54am - 12:12pm CEST
PLENARY Wild Gallery
  Talk, Shiny II
 

