useR!2017 has ended

Monday, July 3

### 5:00pm CEST

Monday July 3, 2017 5:00pm - 8:00pm CEST
Brussels South Train Station Avenue Fonsny

### 6:00pm CEST

Monday July 3, 2017 6:00pm - 6:30pm CEST
Wild Gallery Getijstraat 11, 1190 Vorst

### 6:30pm CEST

This pre-conference session is aimed at newcomers to useR! as an introduction to the conference and the wider R community. The session will feature short talks, open discussion and informal networking. The planned talks are:

Julie Josse: Introduction to Forwards

David Smith: The R ecosystem

Kevin O’Brien: Navigating the R community

Laure Cougnaud: Making the most of useR!

Heather Turner: useR! abstract review: what the program committee look for

Maëlle Salmon: rOpenSci onboarding system and community

Charlotte Wickham: Collaborative coding

Julia Silge: Making a career from coding
---

Thanks,

Heather

Monday July 3, 2017 6:30pm - 9:00pm CEST
PLENARY Wild Gallery

Tuesday, July 4

### 8:00am CEST

Tuesday July 4, 2017 8:00am - 9:15am CEST
Wild Gallery Getijstraat 11, 1190 Vorst

Speakers
MC

CR

## Colin Rundel

Tuesday July 4, 2017 9:30am - 11:00am CEST
2.02 Wild Gallery
Tutorial
• Company 6

Speakers
SM

CR

## Christian Ritz

Tuesday July 4, 2017 9:30am - 11:00am CEST
4.02 Wild Gallery
Tutorial
• Company 11

### 9:30am CEST

Hi, and welcome to the 'Geospatial data visualization in R' tutorial. I have uploaded a PDF on the UseR! schedule site. You can access it at http://schd.ws/hosted_files/user2017/28/user2017.geodataviz-overview.pdf. Please go through it before the conf. It contains the outline of the tutorial and instructions to set up your laptop for the tutorial. Thanks and hope to see you all soon.

Speakers

Manager Cyber Security Analytics, Ernst & Young
InfoSec Data Scientist, with a solid passion in GIS/cartography.

Tuesday July 4, 2017 9:30am - 11:00am CEST
4.03 Wild Gallery
Tutorial
• Company 7

Speakers
MP

## Martyn Plummer

Tuesday July 4, 2017 9:30am - 11:00am CEST
2.01 Wild Gallery
Tutorial
• Company 5

### 9:30am CEST

Dear participant in the tutorial on “Introduction to parallel computing with R”,

Welcome to the tutorial! I’d like to share some practical information with you:

The tutorial will take place on Tuesday July 4th at 9:30am at the Wild Gallery (3.01).

Please read the description of the tutorial, so that you know what to expect and also what not to expect:
https://github.com/hanase/useR2017
(You might want to bookmark it as the material will be  provided from that page later.)

Please bring your laptop with you. Even though the ability to provide individual support will be limited, it should be straightforward to follow along with the material.

For RStudio users, please note that RStudio currently contains a bug that prevents one of the packages handled in the tutorial from working correctly. I therefore recommend that you use an R user interface other than RStudio. If you use RStudio, you will not be able to run about 1/4 of the material. (This bug in RStudio has been reported and will hopefully be fixed soon.)

There will be time during the tutorial to install all the required packages.  If you want to get ahead, here is the command to do this:
install.packages(c("foreach", "doParallel", "doRNG", "snowFT", "extraDistr", "ggplot2", "reshape2", "wpp2017"), dependencies = TRUE)
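
If you install ahead of time, a quick check along the following lines can confirm that everything is in place before the session. This is only a sketch: the package names are the ones from the command above, while the helper `missing_pkgs` is a made-up name for illustration and is not part of the tutorial material.

```r
# Packages named in the tutorial's install command
pkgs <- c("foreach", "doParallel", "doRNG", "snowFT",
          "extraDistr", "ggplot2", "reshape2", "wpp2017")

# Hypothetical helper: return the packages whose namespace cannot be loaded
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

still_missing <- missing_pkgs(pkgs)
if (length(still_missing) > 0) {
  message("Still to install: ", paste(still_missing, collapse = ", "))
} else {
  message("All tutorial packages are available.")
}
```

Anything reported as missing can be passed straight back to `install.packages()`.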

Finally, if you have any questions, don’t hesitate to ask me (hanas@uw.edu).

Looking forward to seeing you in Brussels.

Speakers
HS

## Hana Ševčíková

University of Washington

Tuesday July 4, 2017 9:30am - 11:00am CEST
3.01 Wild Gallery
Tutorial
• Company 9

Speakers
DE

## Dirk Eddelbuettel

Tuesday July 4, 2017 9:30am - 11:00am CEST
PLENARY Wild Gallery
Tutorial
• Company 4

### 9:30am CEST

I'm looking forward to meeting you on July 4th at "Solving Iteration Problems with purrr"!
You only need two things for the tutorial: your laptop with R and some packages installed (see below), and the tutorial slides (http://bit.ly/purrr-slides).
Required packages:

install.packages("tidyverse")
# install.packages("devtools")
devtools::install_github("jennybc/repurrrsive")

While not necessary for the tutorial, you may find it useful to see all the materials (including some code solutions) after the tutorial at https://github.com/cwickham/purrr-tutorial

Speakers

## Charlotte Wickham

Tuesday July 4, 2017 9:30am - 11:00am CEST
3.02 Wild Gallery
Tutorial
• Company 8

### 9:30am CEST

I am very pleased you will be participating in the Sports Analytics with R tutorial at the useR!2017 Conference. The tutorial is just a few days away.

In order that you get the most out of the experience, I’ve provided some useful information about resources for the workshop and some packages you will want to make sure to install by the day of the tutorial.

Before the Tutorial:

• Please make sure you will have a laptop with you and will be able to use R during the tutorial.

• You should ensure that you have version 3.2 or higher of R on the machine you will be using at the conference.

• Please install the deuce package from GitHub, which has datasets we will be using during the tutorial. Depending on your connection speed, this may take several minutes to install, so it is very important to do this before July 4th. You can install it using the devtools command:

devtools::install_github("skoval/deuce")

• If you want to be able to do all of the examples, please make sure you have installed the following packages and their dependencies: devtools, gtools, lubridate, stringr, dplyr, tidyr, rvest, RSelenium, ggplot2, ggthemes, plot3D, caret, blogdown, plotly

• You can install blogdown from GitHub using: devtools::install_github("rstudio/blogdown"). The others should be available on CRAN.

• Using RSelenium may require additional installation of Selenium Server, a Java program. You can find the information on installation under ‘How do I run the Selenium Server’ at https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html#how-do-i-run-the-selenium-server

I will also have PDF copies of the lecture slides. If you would like them after the tutorial, please contact me.

Contact:

If you have questions before or after the conference, please send an e-mail to me (Stephanie Kovalchik) at s.a.kovalchik@gmail.com.

Speakers
SK

## Stephanie Kovalchik

Tuesday July 4, 2017 9:30am - 11:00am CEST
4.01 Wild Gallery
Tutorial
• Company 10

### 11:00am CEST

Tuesday July 4, 2017 11:00am - 11:30am CEST
CATERING POINTS Wild Gallery
BREAK
• Company 12

Speakers
MC

CR

## Colin Rundel

Tuesday July 4, 2017 11:30am - 1:00pm CEST
2.02 Wild Gallery
Tutorial
• Company 15

Speakers
SM

CR

## Christian Ritz

Tuesday July 4, 2017 11:30am - 1:00pm CEST
4.02 Wild Gallery
Tutorial
• Company 20

### 11:30am CEST

Hi, and welcome to the 'Geospatial data visualization in R' tutorial. I have uploaded a PDF on the UseR! schedule site. You can access it at http://schd.ws/hosted_files/user2017/28/user2017.geodataviz-overview.pdf. Please go through it before the conf. It contains the outline of the tutorial and instructions to set up your laptop for the tutorial. Thanks and hope to see you all soon.

Speakers

Manager Cyber Security Analytics, Ernst & Young
InfoSec Data Scientist, with a solid passion in GIS/cartography.

Tuesday July 4, 2017 11:30am - 1:00pm CEST
4.03 Wild Gallery
Tutorial
• Company 16

Speakers
MP

## Martyn Plummer

Tuesday July 4, 2017 11:30am - 1:00pm CEST
2.01 Wild Gallery
Tutorial
• Company 14

### 11:30am CEST

Dear participant in the tutorial on “Introduction to parallel computing with R”,

Welcome to the tutorial! I’d like to share some practical information with you:

The tutorial will take place on Tuesday July 4th at 9:30am at the Wild Gallery (3.01).

Please read the description of the tutorial, so that you know what to expect and also what not to expect:
https://github.com/hanase/useR2017
(You might want to bookmark it as the material will be  provided from that page later.)

Please bring your laptop with you. Even though the ability to provide individual support will be limited, it should be straightforward to follow along with the material.

For RStudio users, please note that RStudio currently contains a bug that prevents one of the packages handled in the tutorial from working correctly. I therefore recommend that you use an R user interface other than RStudio. If you use RStudio, you will not be able to run about 1/4 of the material. (This bug in RStudio has been reported and will hopefully be fixed soon.)

There will be time during the tutorial to install all the required packages.  If you want to get ahead, here is the command to do this:
install.packages(c("foreach", "doParallel", "doRNG", "snowFT", "extraDistr", "ggplot2", "reshape2", "wpp2017"), dependencies = TRUE)

Finally, if you have any questions, don’t hesitate to ask me (hanas@uw.edu).

Looking forward to seeing you in Brussels.

Speakers
HS

## Hana Ševčíková

University of Washington

Tuesday July 4, 2017 11:30am - 1:00pm CEST
3.01 Wild Gallery
Tutorial
• Company 18

Speakers
DE

## Dirk Eddelbuettel

Tuesday July 4, 2017 11:30am - 1:00pm CEST
PLENARY Wild Gallery
Tutorial
• Company 13

### 11:30am CEST

I'm looking forward to meeting you on July 4th at "Solving Iteration Problems with purrr"!
You only need two things for the tutorial: your laptop with R and some packages installed (see below), and the tutorial slides (http://bit.ly/purrr-slides).
Required packages:

install.packages("tidyverse")
# install.packages("devtools")
devtools::install_github("jennybc/repurrrsive")

While not necessary for the tutorial, you may find it useful to see all the materials (including some code solutions) after the tutorial at https://github.com/cwickham/purrr-tutorial

Speakers

## Charlotte Wickham

Tuesday July 4, 2017 11:30am - 1:00pm CEST
3.02 Wild Gallery
Tutorial
• Company 17

### 11:30am CEST

I am very pleased you will be participating in the Sports Analytics with R tutorial at the useR!2017 Conference. The tutorial is just a few days away.

In order that you get the most out of the experience, I’ve provided some useful information about resources for the workshop and some packages you will want to make sure to install by the day of the tutorial.

Before the Tutorial:

• Please make sure you will have a laptop with you and will be able to use R during the tutorial.

• You should ensure that you have version 3.2 or higher of R on the machine you will be using at the conference.

• Please install the deuce package from GitHub, which has datasets we will be using during the tutorial. Depending on your connection speed, this may take several minutes to install, so it is very important to do this before July 4th. You can install it using the devtools command:

devtools::install_github("skoval/deuce")

• If you want to be able to do all of the examples, please make sure you have installed the following packages and their dependencies: devtools, gtools, lubridate, stringr, dplyr, tidyr, rvest, RSelenium, ggplot2, ggthemes, plot3D, caret, blogdown, plotly

• You can install blogdown from GitHub using: devtools::install_github("rstudio/blogdown"). The others should be available on CRAN.

• Using RSelenium may require additional installation of Selenium Server, a Java program. You can find the information on installation under ‘How do I run the Selenium Server’ at https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html#how-do-i-run-the-selenium-server

I will also have PDF copies of the lecture slides. If you would like them after the tutorial, please contact me.

Contact:

If you have questions before or after the conference, please send an e-mail to me (Stephanie Kovalchik) at s.a.kovalchik@gmail.com.

Speakers
SK

## Stephanie Kovalchik

Tuesday July 4, 2017 11:30am - 1:00pm CEST
4.01 Wild Gallery
Tutorial
• Company 19

### 1:00pm CEST

Tuesday July 4, 2017 1:00pm - 2:00pm CEST
CATERING POINTS Wild Gallery
BREAK
• Company 21

Speakers
AS

## Arun Srinivasan

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
3.01 Wild Gallery
Tutorial
• Company 26

### 2:00pm CEST

Please have the latest version of R and RStudio installed. Also, please install Rcpp. Windows users will need Rtools; Mac users will need Xcode.
To help tailor the workshop, it would be helpful if you would complete the questionnaire https://www.jumpingrivers.com/q/useR2017.html

Speakers

## Colin Gillespie

Consultant | Associate Professor, Jumping Rivers | Newcastle University

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
PLENARY Wild Gallery
Tutorial
• Company 22

Speakers

## Taylor Arnold

Assistant Professor of Statistics, University of Richmond
Large scale text and image processing
LT

## Lauren Tilton

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
3.02 Wild Gallery
Tutorial
• Company 27

### 2:00pm CEST

Dear participants, thanks for your interest in our tutorial about Optimal Changepoint Detection! To follow along, please bring a laptop with an internet connection. Links to the course materials for the tutorial are described on the README of https://github.com/tdhock/change-tutorial Thanks in advance and looking forward to seeing you at useR!

Speakers
TD

RK

## Rebecca Killick

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
4.01 Wild Gallery
Tutorial
• Company 29

### 2:00pm CEST

Modelling the environment in R: from small-scale to global applications

Last minute information

Dear useRs,

The useR! conference is approaching soon and we are currently updating our material for the tutorial. Here are a few short notes.

Recommended

• a laptop with Linux, Windows or MacOS

• packages deSolve, FME, OceanView and shiny (together with some dependencies that are automatically installed)

• WiFi

Optional
Execute the following command in R to install the packages above and some others that we use from time to time:

install.packages(c("deSolve", "marelac", "OceanView", "rootSolve", "bvpSolve", "deTestSet", "FME", "ReacTran", "cOde", "scatterplot3d", "shiny", "AquaEnv", "rmarkdown", "devtools", "rodeo"))

Install the development tools

• Windows: https://cran.r-project.org/bin/windows/Rtools/
• Linux:

– Debian/Ubuntu: sudo apt-get install r-base-dev

– Fedora: sudo yum install R
• Mac: see https://cran.r-project.org/bin/macosx/

What are you most interested in?

The tentative plan of the tutorial can be found at https://www.user2017.brussels/uploads/Karline_Rtutorial2017.html. It consists of two parts, an introductory part and an outlook. You will see that the outlook contains several options that cannot be covered in equal detail.

Please let us know what you are most interested in:

• how to make models more realistic by implementing forcing functions and events,
• how to implement complex models in an efficient way with package rodeo,
• how to speed up differential equation models (matrix formulation, code generators, parallel computing),
• how to create web-based model applications with deSolve and shiny.

Just send us an email: karline.soetaert@nioz.nl and thomas.petzoldt@tu-dresden.de
Note: Updates will appear at: http://desolve.r-forge.r-project.org/user2017
See you in Brussels!

Speakers

## Thomas Petzoldt

Senior Scientist, TU Dresden (Dresden University of Technology)
dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation
KS

## Karline Soetaert

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
2.01 Wild Gallery
Tutorial
• Company 23

Speakers
BB

HS

JV

## Joaquin Vanschoren

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
4.02 Wild Gallery
Tutorial
• Company 28

Speakers
GC

## Gábor Csárdi

RStudio

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
2.02 Wild Gallery
Tutorial
• Company 24

Speakers

## Edzer Pebesma

professor, University of Muenster
My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and the maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Tuesday July 4, 2017 2:00pm - 3:30pm CEST
4.03 Wild Gallery
Tutorial
• Company 25

### 3:30pm CEST

Tuesday July 4, 2017 3:30pm - 4:00pm CEST
CATERING POINTS Wild Gallery
BREAK
• Company 30

Speakers
AS

## Arun Srinivasan

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
3.01 Wild Gallery
Tutorial
• Company 35

### 4:00pm CEST

Please have the latest version of R and RStudio installed. Also, please install Rcpp. Windows users will need Rtools; Mac users will need Xcode.
To help tailor the workshop, it would be helpful if you would complete the questionnaire https://www.jumpingrivers.com/q/useR2017.html

Speakers

## Colin Gillespie

Consultant | Associate Professor, Jumping Rivers | Newcastle University

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
PLENARY Wild Gallery
Tutorial
• Company 31

Speakers

## Taylor Arnold

Assistant Professor of Statistics, University of Richmond
Large scale text and image processing
LT

## Lauren Tilton

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
3.02 Wild Gallery
Tutorial
• Company 36

### 4:00pm CEST

Dear participants, thanks for your interest in our tutorial about Optimal Changepoint Detection! To follow along, please bring a laptop with an internet connection. Links to the course materials for the tutorial are described on the README of https://github.com/tdhock/change-tutorial Thanks in advance and looking forward to seeing you at useR!

Speakers
TD

RK

## Rebecca Killick

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
4.01 Wild Gallery
Tutorial
• Company 38

### 4:00pm CEST

Modelling the environment in R: from small-scale to global applications

Last minute information

Dear useRs,

The useR! conference is approaching soon and we are currently updating our material for the tutorial. Here are a few short notes.

Recommended

• a laptop with Linux, Windows or MacOS

• packages deSolve, FME, OceanView and shiny (together with some dependencies that are automatically installed)

• WiFi

Optional
Execute the following command in R to install the packages above and some others that we use from time to time:

install.packages(c("deSolve", "marelac", "OceanView", "rootSolve", "bvpSolve", "deTestSet", "FME", "ReacTran", "cOde", "scatterplot3d", "shiny", "AquaEnv", "rmarkdown", "devtools", "rodeo"))

Install the development tools

• Windows: https://cran.r-project.org/bin/windows/Rtools/
• Linux:

– Debian/Ubuntu: sudo apt-get install r-base-dev

– Fedora: sudo yum install R
• Mac: see https://cran.r-project.org/bin/macosx/

What are you most interested in?

The tentative plan of the tutorial can be found at https://www.user2017.brussels/uploads/Karline_Rtutorial2017.html. It consists of two parts, an introductory part and an outlook. You will see that the outlook contains several options that cannot be covered in equal detail.

Please let us know what you are most interested in:

• how to make models more realistic by implementing forcing functions and events,
• how to implement complex models in an efficient way with package rodeo,
• how to speed up differential equation models (matrix formulation, code generators, parallel computing),
• how to create web-based model applications with deSolve and shiny.

Just send us an email: karline.soetaert@nioz.nl and thomas.petzoldt@tu-dresden.de
Note: Updates will appear at: http://desolve.r-forge.r-project.org/user2017
See you in Brussels!

Speakers

## Thomas Petzoldt

Senior Scientist, TU Dresden (Dresden University of Technology)
dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation
KS

## Karline Soetaert

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
2.01 Wild Gallery
Tutorial
• Company 32

Speakers
BB

HS

JV

## Joaquin Vanschoren

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
4.02 Wild Gallery
Tutorial
• Company 37

Speakers
GC

## Gábor Csárdi

RStudio

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
2.02 Wild Gallery
Tutorial
• Company 33

Speakers

## Edzer Pebesma

professor, University of Muenster
My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and the maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Tuesday July 4, 2017 4:00pm - 5:30pm CEST
4.03 Wild Gallery
Tutorial
• Company 34

### 5:30pm CEST

Tuesday July 4, 2017 5:30pm - 8:00pm CEST
CATERING POINTS Wild Gallery

Wednesday, July 5

### 8:00am CEST

Wednesday July 5, 2017 8:00am - 9:15am CEST
Wild Gallery Getijstraat 11, 1190 Vorst

### 9:00am CEST

Wednesday July 5, 2017 9:00am - 9:15am CEST
PLENARY Wild Gallery

### 9:15am CEST

Wednesday July 5, 2017 9:15am - 9:30am CEST
PLENARY Wild Gallery

### 9:30am CEST

In the social sciences, structural equation modeling (SEM) is often considered to be the mother of all statistical modeling. It includes univariate and multivariate regression models, generalized linear mixed models, factor analysis, path analysis, item response theory, latent class analysis, and much more. SEM can also handle missing data, non-normal data, categorical data, multilevel data, longitudinal data, (in)equality constraints, and on a good day, SEM makes you a fresh cup of tea.

For several decades, software for structural equation modeling was exclusively commercial and/or closed-source. Today, several free and open-source alternatives are available. In this presentation, I will tell the story of the R package 'lavaan'. How was it conceived? What were the original goals, and where do we stand today? And why is it not finished yet? As the story unfolds, I will highlight some aspects of software development that are often underexposed: the importance of software archaeology, the design of model syntax, the importance of numerical techniques, the curse of backwards compatibility, the temptation to use compiled code to speed things up, and the difficult choice between a monolithic versus a modular approach.

Finally, I will talk about my experiences with useRs, discussion groups, community support and the lavaan ecosystem.

Speakers
YR

## Yves Rosseel

Wednesday July 5, 2017 9:30am - 10:30am CEST
PLENARY Wild Gallery

### 10:30am CEST

Wednesday July 5, 2017 10:30am - 11:00am CEST
CATERING POINTS Wild Gallery
BREAK
• Company 43

### 11:00am CEST

Keywords: Analytics, Marketing, tidyverse, purrr, ggplot2, rgdal, sp and more
Webpages: https://creativecommons.tankerkoenig.de (sic), https://www.openstreetmap.org/
We present an R-based analysis to measure the impact of different market drivers on fuel prices in Germany. The analysis is based on the open dataset on German fuel prices, bringing in many additional open data sets along the way.
• Overview of the dataset
  1. History, legal framework and data collection
  2. Current uses in “price-finder apps”
  3. Structure of the dataset
  4. Preparation of the data
  5. A first graphical analysis
     • Price levels
     • Weekly and daily pricing patterns
• Overview of potential price drivers and corresponding data sources
  1. A purrr workflow for preparing regional data from Destatis
     • Number of registered cars
     • Number of fuel stations
     • Number of inhabitants
     • Mean income, etc.
  2. Determining geographical market drivers with OSM data using sp, rgdal, geosphere
     • Branded vs independent
     • Location: highway, close to highway exit (“Autohof”) etc.
     • Proximity to competitors, etc.
  3. Cost drivers
     • Market prices for crude oil
     • Distance of fuel station to fuel depot
     • Land lease and property prices
  4. Outlook:
     • Weather
     • Traffic density
Based on this data, we will present different modelling approaches to quantify the impact of the above drivers on average price levels. We will also give an outlook and first results on temporal pricing patterns and indicators for competitive or anti-competitive behaviour.
This talk is a condensed version of an online R-workshop that I am currently preparing and which I expect to be fully available at the time of UseR 2017.

Speakers
BV

## Boris Vaillant

Wednesday July 5, 2017 11:00am - 11:18am CEST
PLENARY Wild Gallery

### 11:00am CEST

Keywords: Shiny, shiny.semantic, UI, UX, application performance, analytics consulting
Shiny has proved itself a great tool for communicating data science teams’ results. However, developing a Shiny app for a large-scope project that will be used commercially by more than a few dozen users is not easy. The first challenge is the User Interface (UI): users expect the app to look and behave like a modern web page. Secondly, performance directly impacts user experience (UX), and it’s difficult to maintain efficiency with growing requirements and a growing user base.
In this talk, we will share our experience from a real-life case study of building an app used daily by 700 users where our data science team tackled all these problems. This, to our knowledge, was one of the biggest production deployments of a Shiny App.
We will show an innovative approach to building a beautiful and flexible Shiny UI using the shiny.semantic package (an alternative to standard Bootstrap). Furthermore, we will talk about the non-standard optimization tricks we implemented to gain performance. Then we will discuss challenges regarding complex reactivity and offer solutions. We will go through the implementation and deployment process of the app using a load balancer. Finally, we will present the application and give details on how this benefited our client.

Speakers
OM

## Olga Mierzwa-Sulima

Wednesday July 5, 2017 11:00am - 11:18am CEST
4.02 Wild Gallery
Talk, Shiny I

### 11:00am CEST

Keywords: Time Series, Forecasting, Robust Statistics, Exponential Smoothing
Webpages: https://CRAN.R-project.org/package=robets, https://rcrevits.wordpress.com/
Simple forecasting methods, such as exponential smoothing, are very popular in business analytics. This is not only due to their simplicity, but also because they perform very well, in particular for shorter time series. Incorporating trend and seasonality into an exponential smoothing method is standard. Many real time series show seasonal patterns that should be exploited for forecasting purposes. Whether to include a trend may be less clear. For instance, weekly sales (in units) may show an increasing trend, but the sales will not grow to infinity. Here, the damped trend model offers a solution. Damped trend exponential smoothing gives excellent results in forecasting competitions.
In a highly cited paper, Hyndman and Khandakar (2008) developed an automatic forecasting method using exponential smoothing, available as the R package forecast. We propose the package robets, an outlier-robust alternative to the function ets in the forecast package. For each method in a class of exponential smoothing variants we made a robust alternative. The class includes methods with a damped trend and/or seasonal components. The robust method is developed by robustifying every aspect of the original exponential smoothing variant. We provide robust forecasting equations, robust initial values, robust smoothing parameter estimation and a robust information criterion. The method is an extension of Gelper, Fried, and Croux (2010) and is described in more detail in Crevits and Croux (2016).
The code of the developed R package is based on the function ets of the forecast package. The usual functions for visualizing the models and forecasts also work for robets objects. Additionally there is a function plotOutliers which highlights outlying values in a time series.
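
To make the damped trend idea concrete, here is a minimal base R sketch of the classical (non-robust) additive damped-trend recursion. It is not the robets or forecast implementation: the function name `damped_holt`, the fixed smoothing parameters and the simple initial values are all assumptions chosen for illustration, whereas the packages estimate these quantities from the data.

```r
# Classical additive damped-trend exponential smoothing (illustrative only).
# alpha: level smoothing, beta: trend smoothing, phi: damping factor in (0, 1].
damped_holt <- function(y, alpha = 0.5, beta = 0.1, phi = 0.9, h = 5) {
  stopifnot(length(y) >= 2)
  level <- y[1]          # naive initial level (assumption)
  trend <- y[2] - y[1]   # naive initial trend (assumption)
  for (t in seq_along(y)[-1]) {
    prev_level <- level
    level <- alpha * y[t] + (1 - alpha) * (prev_level + phi * trend)
    trend <- beta * (level - prev_level) + (1 - beta) * phi * trend
  }
  # h-step forecasts: the trend contribution phi + phi^2 + ... + phi^h is bounded
  level + cumsum(phi^seq_len(h)) * trend
}

sales <- 1:10                                  # toy series with a steady increase
fc_linear <- damped_holt(sales, phi = 1, h = 3)   # no damping: trend continues
fc_damped <- damped_holt(sales, phi = 0.8, h = 5) # damping: growth levels off
```

With `phi = 1` the forecasts keep growing linearly; with `phi < 1` each further step adds less, so the forecast flattens out instead of growing without bound.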
References

Crevits, Ruben, and Christophe Croux. 2016. “Forecasting with Robust Exponential Smoothing with Damped Trend and Seasonal Components.” Working Paper.

Gelper, S, R Fried, and C Croux. 2010. “Robust Forecasting with Exponential and Holt-Winters Smoothing.” Journal of Forecasting 29: 285–300.

Hyndman, R J, and Y Khandakar. 2008. “Automatic Time Series Forecasting: The Forecast Package for R.” Journal of Statistical Software 27 (3).

Speakers
RC

## Ruben Crevits

Wednesday July 5, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery

### 11:00am CEST

Keywords: age-structured contact matrix, areal count time series, infectious disease epidemiology, norovirus, spatio-temporal surveillance data
Webpages: https://CRAN.R-project.org/package=surveillance
Routine surveillance of notifiable infectious diseases gives rise to weekly counts of reported cases stratified by region and age group. A well-established approach to the statistical analysis of such surveillance data is the class of endemic-epidemic time-series models (hhh4) implemented in the R package surveillance (Meyer, Held, and Höhle 2017). Autoregressive model components reflect the temporal dependence inherent to communicable diseases. Spatial dynamics are largely driven by human travel and can be captured by movement network data or a parametric power law based on the adjacency matrix of the regions. Furthermore, the social phenomenon of “like seeks like” produces characteristic contact patterns between subgroups of a population, in particular with respect to age (Mossong et al. 2008). We thus incorporate an age-structured contact matrix in the hhh4 modelling framework to
1. assess age-specific disease spread while accounting for its spatial pattern (Meyer and Held 2017)
2. improve probabilistic forecasts of infectious disease spread (Held, Meyer, and Bracher 2017)
We analyze weekly surveillance counts on norovirus gastroenteritis from the 12 city districts of Berlin, in six age groups, from week 2011/27 to week 2015/26. The following year (2015/27 to 2016/26) is used to assess the quality of the predictions.
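
The role of a contact matrix in an endemic-epidemic model can be sketched in a few lines of base R. This is a deliberately simplified illustration, not the hhh4 API and not the model fitted in the talk: the three age groups, all counts, the contact matrix `C` and the parameters `nu` and `phi` are hypothetical values, and the spatial dimension is left out.

```r
# Hypothetical case counts observed last week in three age groups
y_prev <- c(young = 4, adult = 10, old = 6)

# Hypothetical contact matrix: C[i, j] = contact intensity from group i to group j
C <- matrix(c(5, 2, 1,
              2, 4, 2,
              1, 2, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(names(y_prev), names(y_prev)))

# Row-normalise so that each source group distributes a total weight of 1
W <- C / rowSums(C)

nu  <- c(young = 0.5, adult = 1.0, old = 0.8)  # endemic (baseline) component
phi <- 0.6                                     # epidemic transmission weight

# Expected counts this week: baseline plus contact-weighted cases from last week
mu <- nu + phi * as.vector(t(W) %*% y_prev)
```

Because the weights are row-normalised, the epidemic component redistributes last week's cases across age groups without changing their total, so `sum(mu)` equals `sum(nu) + phi * sum(y_prev)`.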
References

Held, Leonhard, Sebastian Meyer, and Johannes Bracher. 2017. “Probabilistic Forecasting in Infectious Disease Epidemiology: The Thirteenth Armitage Lecture.” bioRxiv. doi:10.1101/104000.

Meyer, Sebastian, and Leonhard Held. 2017. “Incorporating Social Contact Data in Spatio-Temporal Models for Infectious Disease Spread.” Biostatistics 18 (2): 338–51. doi:10.1093/biostatistics/kxw051.

Meyer, Sebastian, Leonhard Held, and Michael Höhle. 2017. “Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance.” Journal of Statistical Software. http://arxiv.org/abs/1411.0416.

Mossong, Joël, Niel Hens, Mark Jit, Philippe Beutels, Kari Auranen, Rafael Mikolajczyk, Marco Massari, et al. 2008. “Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases.” PLoS Medicine 5 (3): e74. doi:10.1371/journal.pmed.0050074.

Speakers

## Sebastian Meyer

Friedrich-Alexander-Universität Erlangen-Nürnberg


Wednesday July 5, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery

### 11:00am CEST

Keywords: random forest, transformation model, quantile regression forest, conditional distribution, conditional quantiles
Webpages: https://R-forge.R-project.org/projects/ctm https://arxiv.org/1701.02110
Regression models for supervised learning problems with a continuous target are commonly understood as models for the conditional mean of the target given predictors. This notion is simple and therefore appealing for interpretation and visualisation. Information about the whole underlying conditional distribution is, however, not available from these models. A more general understanding of regression models as models for conditional distributions allows much broader inference from such models, for example the computation of prediction intervals. Several random forest-type algorithms aim at estimating conditional distributions, most prominently quantile regression forests. We propose a novel approach based on a parametric family of distributions characterised by their transformation function. A dedicated novel “transformation tree” algorithm able to detect distributional changes is developed. Based on these transformation trees, we introduce “transformation forests” as an adaptive local likelihood estimator of conditional distribution functions. The resulting models are fully parametric yet very general and allow broad inference procedures, such as the model-based bootstrap, to be applied in a straightforward way. The procedures are implemented in the “trtf” R add-on package currently available from R-forge.

Speakers

## Torsten Hothorn

Wednesday July 5, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

### 11:00am CEST

1. Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT

Funding: This work is supported by funding from the R Consortium and The University of Utah Center for Clinical and Translational Science (NIH 5UL1TR001067-02).

Abstract: Over the last few years, while the open source statistical package R has come to prominence, it has gained important resources, such as multiple flexible class systems. However, methods for documentation have not kept pace with other advances in the language. I will present the work of the R Documentation Task Force, an R Consortium Working Group, in creating the next generation of documentation system for R.

The new documentation system is based on an S4 formal class system and exists independently of, but is complementary to, the packaging system in R. Documentation is stored as objects and as such can be manipulated programmatically, like all R objects.

This creates a “many in, many out” approach, meaning that developers of software and documentation can create documentation in the format that is easiest for them, such as Rd or Roxygen, and users of the documentation can read or utilize it in a convenient format. Since R also makes use of code from other languages such as C++, this provides facilities for including existing documentation without recreating it.

This work is based on input from the R Documentation Task Force, which is a working group, supported by the R Consortium and the University of Utah Center for Clinical and Translational Science, consisting of R Core developers, representatives from the R Consortium member companies and community developers with relevant interest in documentation.

Good documentation is critical for researchers to disseminate computational research methods, either internally or externally to their organization. This work will facilitate the creation of documentation by making documentation immediately accessible and promote documentation consumption through multiple outputs which can be implemented by developers.

Speakers

## Andrew Redd

Wednesday July 5, 2017 11:00am - 11:18am CEST
3.02 Wild Gallery

### 11:00am CEST

ALL INFO: http://riotworkshop.github.io

Wednesday July 5, 2017 11:00am - 6:40pm CEST
4.03 Wild Gallery

### 11:18am CEST

Keywords: machine learning, predictive modeling, predictive accuracy, scalability, speed
Webpages: https://github.com/szilard/benchm-ml
Binary classification is one of the most widely used machine learning methods in business applications. If the number of features is not very large (sparse), algorithms such as random forests, gradient boosted trees or deep learning neural networks (and ensembles of those) are expected to perform the best in terms of accuracy. There are countless off-the-shelf open source implementations of these algorithms (e.g. R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.), but which one should you use in practice? Surprisingly, there is huge variation between even the most commonly used implementations of the same algorithm in terms of scalability, speed and accuracy. In this talk we will see which open source tools work reasonably well on larger datasets commonly encountered in practice. Not surprisingly, all the best tools are available seamlessly from R.

Speakers

## Szilard Pafka

Wednesday July 5, 2017 11:18am - 11:36am CEST
2.02 Wild Gallery

### 11:18am CEST

Keywords: composite likelihood, effective sample size, REML, spatial dependence
Composite likelihood methods have become popular in spatial statistics. This is mainly due to the fact that large matrices need to be inverted in full maximum likelihood and this becomes computationally expensive when you have a large number of regions under consideration. We introduce restricted pairwise composite likelihood (RECL) methods for estimation of mean and covariance parameters in a spatial Gaussian random field, without resorting back to the full likelihood. A simulation study was carried out to investigate how this method works in settings of increasing domain as well as infill asymptotics, whilst varying the strength of correlation, with similar scenarios as Curriero and Lele (1999). Preliminary results showed that pairwise composite likelihoods tend to underestimate the variance parameters, especially when there is high correlation, while RECL corrects for the underestimation. Therefore, RECL is recommended if interest is in both the mean and the variance parameters. The methods are made available in the spatialRECL package and implemented in R. The methodology will be highlighted in the first part of the presentation, and some analysis will be made on a real data example of TSH levels from Galicia, Spain.
References

Curriero, F, and S Lele. 1999. “A Composite Likelihood Approach to Semivariogram Estimation.” J Agric Biol Envir S 4 (1): 9–28.

Speakers

## Kathy Chenjerai Mutambanengwe

Wednesday July 5, 2017 11:18am - 11:36am CEST
2.01 Wild Gallery

### 11:18am CEST

[ASSISTant](https://cran.r-project.org/package=ASSISTant) is an R
package for a novel group-sequential adaptive trial. The design is
motivated by a randomized controlled trial to compare an endovascular
procedure with conventional medical treatment for stroke patients; see
Lai, Lavori and Liao (2014). The endovascular procedure may
be effective only in a subgroup of patients that is not known at the
design stage but may be learned statistically from the data collected
during the course of the trial. The group-sequential design implemented in
ASSISTant incorporates adaptive choice of the patient subgroup among
several possibilities, which include the entire patient population as
a choice. Appropriate type I and type II errors of a test can be
defined in this setting, and the design maintains a prescribed type I
error by using the closed testing principle in multiple testing.

[NIH DEFUSE-3](https://www.nihstrokenet.org/clinical-trials/acute-interventional-trials/defuse-3)
trial currently underway. The package is on CRAN and
GitHub.

Speakers

## Balasubramanian Narasimhan

Stanford University

Wednesday July 5, 2017 11:18am - 11:36am CEST
3.01 Wild Gallery

### 11:18am CEST

Keywords: music, sonification, Shiny
Recent years have brought considerable advances in data sonification (Ligges et al. 2016; Sueur, Aubin, and Simonis 2008; Stone and Garrison 2012; Stone and Garrison 2013; Levine 2015), but data sonification is still a very involved process with many technical limitations. Developing data music in R has historically been a very tedious process because of R’s poor concurrency features and general weakness in audio rendering capabilities (Levine 2016). End-user data music tools can be more straightforward, but they usually constrain users to very particular and rudimentary aesthetic mappings (Siegert and Williams 2017; Levine 2014; Borum Consulting 2014). Finally, existing data music implementations have limited interactivity, and no integrated solutions are available for embedding in business intelligence dashboards.
I have addressed these various issues by implementing bradio, a Shiny widget for rendering data music. In bradio, a song is encoded as a Javascript function that can take data inputs from R, through Shiny. The Javascript component relies on the webaudio Javascript package (johnnyscript 2014) and is thus compatible with songs written for the webaudio Javascript package, the baudio Javascript package (substack 2014), and Javascript code-music-studio (substack 2015); this compatibility allows for existing songs to be adapted easily as data music. bradio merges the convenience of interactive Javascript music with the data analysis power of R, facilitating the prototyping and presentation of sophisticated interactive data music.

johnnyscript. 2014. webaudio. 2.0.0 ed. https://www.npmjs.com/package/webaudio.

Levine, Thomas. 2014. Sheetmusic. 0.0.4 ed. https://pypi.python.org/pypi/sheetmusic.

———. 2015. “Plotting Data as Music Videos in R.” In UseR! http://user2015.math.aau.dk/contributed_talks#61.

———. 2016. “Approaches to Live Music Synthesis for Multivariate Data Analysis in R.” In SatRday. http://budapest.satrdays.org/.

Ligges, Uwe, Sebastian Krey, Olaf Mersmann, and Sarah Schnackenberg. 2016. tuneR: Analysis of Music. http://r-forge.r-project.org/projects/tuner/.

Siegert, Stefan, and Robin Williams. 2017. Sonify: Data Sonification - Turning Data into Sound. https://CRAN.R-project.org/package=sonify.

Stone, Eric, and Jesse Garrison. 2012. “Give Your Data a Listen.” In UseR! http://biostat.mc.vanderbilt.edu/wiki/pub/Main/UseR-2012/81-Stone.pdf.

Stone, Eric, and Jesse Garrison. 2013. AudiolyzR: Give Your Data a Listen. https://CRAN.R-project.org/package=audiolyzR.

substack. 2014. baudio. 2.1.2 ed. https://www.npmjs.com/package/baudio.

———. 2015. code-music-studio. 1.5.2 ed. https://www.npmjs.com/package/code-music-studio.

———. n.d. “Make Music with Algorithms!” http://studio.substack.net/-/help.

Sueur, J., T. Aubin, and C. Simonis. 2008. “Seewave: A Free Modular Tool for Sound Analysis and Synthesis.” Bioacoustics 18: 213–26. http://isyeb.mnhn.fr/IMG/pdf/sueuretal_bioacoustics_2008.pdf.

Speakers

## Thomas Levine

Wednesday July 5, 2017 11:18am - 11:36am CEST
4.02 Wild Gallery
Talk, Shiny I

### 11:18am CEST

Keywords: Ensemble package, user-friendly interface, gene expression analysis, biclustering
Webpages: https://r-forge.r-project.org/R/?group_id=589, https://github.com/ewouddt/RcmdrPlugin.BiclustGUI
The increasing number of R packages makes it difficult for any newcomer to find their way among the many options available for topics such as modeling, clustering, variable selection, optimization, sample size estimation, etc. The quality of the packages, the associated help files, the error reporting systems and the continuity of support vary significantly, and methods may be duplicated across multiple packages if the packages focus only on a specific application within a particular field.
Ensemble packages can be seen as another type of contribution to the R community. Careful revision of packages that approach the same topic from different perspectives may be very useful for increasing the overall quality of the CRAN repository. The revision should not be limited to the technical part, but should also cover methodological aspects. A necessary condition for success of the ensemble package is of course that this revision happens in close collaboration with the authors of the original package.
An additional benefit of ensemble packages lies in leveraging the many graphical options of the traditional R framework. From a simple graphical user interface, through R Commander plugins, to Shiny applications, R provides a wide range of visualization options. By combining visualization with the content of the original packages, the ensemble package can provide a different user experience. This extends the added value of the ensemble beyond a simple review library. Necessarily, the flexibility of the package is reduced by the transformation into a point-and-click interface, but users requiring a fully flexible environment can be referred to the original packages.
We present two case studies of such ensemble packages: IsoGeneGUI and BiclustGUI. IsoGeneGUI is implemented as a graphical user interface (GUI) and combines the original IsoGene package for dose-response analysis of high dimensional data with other packages such as orQA, ORIClust, goric and ORCME, which offer methods to analyze different aspects of gene expression data sets. IsoGeneGUI thus provides a wide range of methods (and the most complete data analysis tool for order-restricted analysis) in a user-friendly fashion. Hence analyses can be carried out by users with only limited knowledge of R programming. The RcmdrPlugin.BiclustGUI is a GUI plugin for R Commander that combines various biclustering packages, bringing multiple algorithms, visualizations and diagnostic tools into one unified framework. Additionally, the package allows for simple inclusion of potential future biclustering methods.
The collaboration with the authors of the original packages on implementation of their methods within an ensemble package was extremely important for both case studies. Indeed, in that way, the link with the original packages could be retained. The ensemble package allowed for careful evaluation of the methods, their overlap and differences, and for presenting them as a concise framework in a user friendly environment.

Speakers

## Martin Otava

Wednesday July 5, 2017 11:18am - 11:36am CEST
3.02 Wild Gallery
Talk, Packages

### 11:18am CEST

Whether a case might be identified as an outlier depends on the other cases in the dataset and on the variables available. A case can stand out as unusual on one or two variables, while appearing middling on the others. If a case is identified as an outlier, it is useful to find out why. This paper introduces a new display, the O3 plot (Overview Of Outliers), for supporting outlier analyses, and describes its implementation in R.

Figure 1 shows an example of an O3 plot for four German demographic variables recorded for the 299 Bundestag constituencies. There is a row for each variable combination for which outliers were found and two blocks of columns. Each row of the block on the left shows which variable combination defines that row. There are 4 variables, so there are 4 columns, one for each variable, and a cell is coloured grey if that variable is part of the combination. The combinations (the rows) are sorted by numbers of outliers found within numbers of variables in the combination, and blue dotted lines separate the combinations with different numbers of variables. The columns in the left block are sorted by how often the variables occur. A boundary column separates this block from the block on the right that records the outliers found with whichever outlier identification algorithm was used (in this case Wilkinson’s HDoutliers with alpha=0.05). There is one column for each case that is found to be an outlier at least once and these columns are sorted by the numbers of times the cases are outliers.

Given $$n$$ cases and $$p$$ variables there would be $$(p+1+n)$$ columns if all cases were an outlier on some combination of variables. And if outliers were identified for all possible combinations there would be $$2^p-1$$ rows. An O3 plot has too many rows if there are lots of variables with many combinations having outliers and it has too many columns if there are lots of cases identified as outliers on at least one variable combination. Combinations are only reported if outliers are found for them and cases are only reported which occur at least once as an outlier.
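The size bounds quoted above can be checked with a few lines of code. The following is a minimal sketch in Python (not the R implementation described in the talk); the variable names `v1` to `v4` and both helper functions are hypothetical and exist only for illustration.

```python
from itertools import combinations


def variable_combinations(variables):
    """Return every non-empty subset of the variables, smallest subsets first."""
    combos = []
    for size in range(1, len(variables) + 1):
        combos.extend(combinations(variables, size))
    return combos


def o3_max_dimensions(n_cases, n_variables):
    """Upper bounds on the plot size: (rows, columns)."""
    max_rows = 2 ** n_variables - 1        # one row per non-empty combination
    max_cols = n_variables + 1 + n_cases   # variable columns + boundary + case columns
    return max_rows, max_cols


variables = ["v1", "v2", "v3", "v4"]       # hypothetical variable names
combos = variable_combinations(variables)
max_rows, max_cols = o3_max_dimensions(n_cases=299, n_variables=len(variables))
print(len(combos), max_rows, max_cols)
```

With 4 variables and 299 cases, the enumeration yields as many combinations as the row bound allows; an actual O3 plot is smaller because only combinations and cases with at least one outlier are shown.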

O3 plots show which cases are identified often as outliers, which are identified in single dimensions, and which are only identified in higher dimensions. They highlight which variables and combinations of variables may be affected by possible outliers.

Speakers

## Antony Unwin

Wednesday July 5, 2017 11:18am - 11:36am CEST
PLENARY Wild Gallery

### 11:36am CEST

With the massive popularity of R in the statistics and data science communities along with the recent movement towards open development and reproducible research with CRAN and GitHub, R has become the de facto go-to for cutting edge statistical software. With this movement, a problem faced by many groups is how individual programmers can work on related codebases in an open, collaborative manner while emphasizing good software practices and reproducible research. The sparsebn package, recently released on CRAN, is an example of this dilemma: sparsebn is a family of packages for learning graphical models, with different algorithms tailored for different types of data. Although each algorithm shares many similarities, different researchers and programmers were in charge of implementing different algorithms. Instead of releasing disparate, unrelated packages, our group developed a shared family of packages in order to streamline the addition of new algorithms so as to minimize programming overhead (the dreaded “data munging” and “plumbing” work). In this talk, I will use sparsebn as a case study in collaborative research and development, illustrating both the development process and the fruits of our labour: A fast, modern package for learning graphical models that leverages cutting-edge trends in high-dimensional statistics and machine learning.

Speakers

## Bryon Aragam

Carnegie Mellon University

Wednesday July 5, 2017 11:36am - 11:54am CEST
3.02 Wild Gallery

### 11:36am CEST

Keywords: Dimension reduction, Correlation dimension, Singular value decomposition, Load forecasting
We present a new R package for curve linear regression: the clr package.
This package implements a new methodology for linear regression with both curve response and curve regressors, which is described in Cho et al. (2013) and Cho et al. (2015).
The key idea behind this methodology is dimension reduction based on a singular value decomposition in a Hilbert Space, which reduces the curve regression problem to several scalar linear regression problems.
We apply curve linear regression with clr to model and forecast daily electricity loads.
References

Bathia, N., Q. Yao, and F. Ziegelmann. 2010. “Identifying the Finite Dimensionality of Curve Time Series.” The Annals of Statistics 38: 3352–86.

Cho, H., Y. Goude, X. Brossat, and Q. Yao. 2013. “Modelling and Forecasting Daily Electricity Load Curves: A Hybrid Approach.” Journal of the American Statistical Association 108: 7–21.

———. 2015. “Modelling and Forecasting Daily Electricity Load via Curve Linear Regression.” In Modeling and Stochastic Learning for Forecasting in High Dimension, edited by Anestis Antoniadis and Xavier Brossat, 35–54. Springer.

Fan, J., and Q. Yao. 2003. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer.

Hall, P., and J. L. Horowitz. 2007. “Methodology and Convergence Rates for Functional Linear Regression.” The Annals of Statistics 35: 70–91.

Speakers

## Amandine Pierrot

Wednesday July 5, 2017 11:36am - 11:54am CEST
2.01 Wild Gallery

### 11:36am CEST

Keywords: Distributional regression, recursive partitioning, decision trees, random forests
Webpages: https://R-Forge.R-project.org/projects/partykit/
In regression analysis one is interested in the relationship between a dependent variable and one or more explanatory variables. Various methods to fit statistical models to the data set have been developed, starting from ordinary linear models considering only the mean of the response variable and ranging to probabilistic models where all parameters of a distribution are fit to the given data set.
If there is a strong variation within the data it might be advantageous to split the data first into more homogeneous subgroups based on given covariates and then fit a local model in each subgroup rather than fitting one global model to the whole data set. This can be done by applying regression trees and forests.
Both of these two concepts, parametric modeling and algorithmic trees, have been investigated and developed further, however, mostly separated from each other. Therefore, our goal is to embed the progress made in the field of probabilistic modeling in the idea of algorithmic tree and forest models. In particular, more flexible models such as GAMLSS (Rigby and Stasinopoulos 2005) should be fitted in the nodes of a tree in order to capture location, scale, shape as well as censoring, tail behavior etc. while non-additive effects of the explanatory variables can be detected by the splitting algorithm used to build the tree.
The corresponding implementation is provided in an R package disttree which is available on R-Forge and includes the two main functions disttree and distforest. Next to the data set and a formula the user only has to specify a distribution family and receives a tree/forest model with a set of distribution parameters for each final node. One possible way to specify a distribution family is to hand over a gamlss.dist family object (Stasinopoulos, Rigby, and others 2007). In disttree and distforest the fitting function distfit is applied within a tree building algorithm chosen by the user. Either the MOB algorithm, an algorithm for model-based recursive partitioning (Zeileis, Hothorn, and Hornik 2008), or the ctree algorithm (Hothorn, Hornik, and Zeileis 2006) can be used as a framework. These algorithms are both implemented in the partykit package (Hothorn et al. 2015).
References

Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis: 651–74.

Hothorn, Torsten, Kurt Hornik, Carolin Strobl, and Achim Zeileis. 2015. “Package ’Party’.” Package Reference Manual for Party Version 0.9–0.998 16: 37.

Rigby, Robert A, and D Mikis Stasinopoulos. 2005. “Generalized Additive Models for Location Scale and Shape (with Discussion).” Applied Statistics 54.3: 507–54.

Stasinopoulos, D Mikis, Robert A Rigby, and others. 2007. “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software 23 (7): 1–46.

Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. 2008. “Model-Based Recursive Partitioning.” Journal of Computational and Graphical Statistics 17 (2). Taylor & Francis: 492–514.

Speakers

## Lisa Schlosser

Wednesday July 5, 2017 11:36am - 11:54am CEST
2.02 Wild Gallery

### 11:36am CEST

Keywords: mathematical model, infectious disease, epidemiology, networks, R
Webpages: https://CRAN.R-project.org/package=EpiModel, http://epimodel.org/
The EpiModel package provides tools for building, simulating, and analyzing mathematical models for epidemics using R. Epidemic models are a formal representation of the complex systems that collectively determine the population dynamics of infectious disease transmission: contact between people, inter-host infection transmission, intra-host disease progression, and the underlying demographic processes. Simulating epidemic models serves as a computational laboratory to gain insight into the dynamics of these disease systems, test empirical hypotheses about the determinants of specific outbreak patterns, and forecast the impact of interventions like vaccines, clinical treatment, or public health education campaigns.
A range of different modeling frameworks has been developed in the field of mathematical epidemiology over the last century. Several of these are included in EpiModel, but the unique contribution of this software package is a general stochastic framework for modeling the spread of epidemics across dynamic contact networks. Network models represent repeated contacts with the same person or persons over time (e.g., sexual partnerships). These repeated contacts give rise to persistent network configurations – pairs, triples, and larger connected components – that in turn may establish the temporally ordered pathways for infectious disease transmission across a population. The timing and sequence of contacts, and of the transmission acts within them, are most important when transmission requires intimate contact, such contact is relatively rare, and the probability of infection per contact is relatively low. This is the case for HIV and other sexually transmitted infections.
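To make the idea of transmission along persistent network configurations concrete, here is a minimal sketch in Python. It is not EpiModel's algorithm or API: it assumes a static (not dynamic) contact network, a simple susceptible-infected process, and a hypothetical helper `simulate_si` with an illustrative per-contact probability.

```python
import random


def simulate_si(edges, initially_infected, per_contact_prob, n_steps, seed=1):
    """Discrete-time stochastic SI process on a fixed contact network.

    At every step, each edge joining an infected and a susceptible node
    transmits the infection with probability per_contact_prob.
    Returns the number infected after each step and the final infected set.
    """
    rng = random.Random(seed)
    infected = set(initially_infected)
    history = [len(infected)]
    for _ in range(n_steps):
        newly_infected = set()
        for a, b in edges:
            # Transmission is only possible across discordant pairs.
            if (a in infected) != (b in infected):
                if rng.random() < per_contact_prob:
                    newly_infected.add(b if a in infected else a)
        infected |= newly_infected
        history.append(len(infected))
    return history, infected


# Two connected components: a chain 0-1-2-3 and an isolated pair 4-5.
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
history, infected = simulate_si(edges, initially_infected=[0],
                                per_contact_prob=0.2, n_steps=50)
print(history[-1], sorted(infected))
```

Because nodes 4 and 5 have no path to the initial case, they can never be infected, whatever the random draws: the connected components of the network bound the reach of the epidemic.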
Both the estimation and simulation of the dynamic networks in EpiModel are implemented using Markov Chain Monte Carlo (MCMC) algorithm functions for exponential-random graph models (ERGMs) from the statnet suite of R packages. These MCMC algorithms exploit a key property of ERGMs: that the maximum likelihood estimates of the model parameters uniquely reproduce the model statistics in expectation. The mathematical simulation of the contact network over time is theoretically guaranteed to vary stochastically around the observed network statistics. Temporal ERGMs provide the only integrated, principled framework for both the estimation of dynamic network models from sampled empirical data and also the simulation of complex dynamic networks with theoretically justified methods for handling changes in population size and composition over time.
In this talk, I will provide an overview of both the modeling tools built into EpiModel, designed to facilitate learning for students new to modeling, and the package’s application programming interface (API) for extending EpiModel, designed to facilitate the exploration of novel research questions for advanced modelers. I will motivate these research-level extensions by discussing our recent applications of these network modeling statistical methods and software tools to investigate the transmission dynamics of HIV and sexually transmitted infections among men who have sex with men in the United States and heterosexual couples in Sub-Saharan Africa.

Speakers

## Samuel Jenness

Assistant Professor, Emory University
Epidemic modeling, network science, HIV/STI prevention

Wednesday July 5, 2017 11:36am - 11:54am CEST
3.01 Wild Gallery

### 11:36am CEST

Online presentation: https://github.com/bborgesr/useR2017

Keywords: databases, shiny, DBI, dplyr, pool
Webpages: http://shiny.rstudio.com/articles/overview.html, https://github.com/rstudio/pool
Connecting to an external database from R can be challenging. This is made worse when you need to interact with a database from a live Shiny application. To demystify this process, I’ll do two things.
First, I’ll talk about best practices when connecting to a database from Shiny. There are three important packages that help you with this and I’ll weave them into this part of the talk. The DBI package does a great job of standardizing how to establish a connection, execute safe queries using SQL (goodbye SQL injections!) and close the connection. The dplyr package builds on top of this to make it even easier to connect to databases and extract data, since it allows users to query the database using regular dplyr syntax in R (no SQL knowledge necessary). Yet a third package, pool, exists to help you when using databases in Shiny applications, by taking care of connection management, often resulting in better performance.
Second, I’ll demo these concepts in practice by showing how we can connect to a database from Shiny to create a CRUD application. I will show the application running and point out specific parts of the code (which will be publicly available).
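The point about safe queries can be illustrated outside R as well. The sketch below uses Python's standard `sqlite3` module as a stand-in for a DBI connection (an assumption made for illustration; the talk itself uses DBI, dplyr and pool). The `users` table and the `find_user` helper are hypothetical.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user(connection, name):
    """Look up a user with a parameterized query: the value is bound as data,
    so it can never change the structure of the SQL statement."""
    cursor = connection.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    )
    return cursor.fetchall()


malicious = "alice' OR '1'='1"

# Safe: the whole string is treated as a (non-existent) user name.
safe_rows = find_user(conn, malicious)

# Unsafe: pasting input into the SQL text lets it rewrite the WHERE clause.
unsafe_sql = "SELECT id, name FROM users WHERE name = '" + malicious + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

alice_rows = find_user(conn, "alice")
conn.close()

print(safe_rows, unsafe_rows, alice_rows)
```

The parameterized version returns no rows for the malicious input, while the string-pasting version leaks every row in the table.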

Speakers

## Barbara Borges Ribeiro

Wednesday July 5, 2017 11:36am - 11:54am CEST
4.02 Wild Gallery
Talk, Shiny I

### 11:36am CEST

Title: Sports Betting and R: How R is changing the sports betting world
Speaker: Marco Blume
Keywords: Sports Betting, Sports Analytics, Vegas, Markets
Webpages: https://cran.r-project.org/web/packages/odds.converter/index.html, https://cran.r-project.org/web/packages/pinnacle.API/index.html, http://pinnacle.com/

Sports betting markets are one of the purest prediction markets that exist, yet they are vastly misunderstood by the public. Many assume that the center of the sports betting world is situated in Las Vegas. However, in the modern era, sports bookmaking is a task that looks a lot like market making in finance, with sophisticated algorithmic trading systems running and constantly adjusting prices in real time as events occur. But, unlike financial markets, sports are governed by a set of physical rules and can usually be measured and understood.

Since the late 90s, Pinnacle has been one of the largest sportsbooks in the world and one of the only sportsbooks that will take wagers from professional bettors (who win in the long term). Just as casinos ban card counters in Blackjack, most other sportsbooks will ban these winners. At Pinnacle the focus is on modelling, automation and data science; R is a central piece of the business, and a large number of customers use an API to interact with us.

In this talk, we dispel common misconceptions about the sports betting world, show that this is actually a very sexy problem in modelling and data science, and show how we are using R to try to beat Vegas and other sportsbooks every day in a form of data science warfare.

Since the rise of in-play betting markets, an operator must make a prediction in real time on the probability of outcomes for the remainder of an event within a very small margin of error. Customers can compete by building their own models or utilising information that might not be accounted for in the market and expressing their belief through wagering.

Naturally, a customer will generally wager when they believe they have an edge, and then the operator must determine how to change its belief after each piece of new information (wagers, in-game events, etc). This essentially involves predicting how much information is encoded in a wager, which depends partially on the sharpness of each customer, and then determining how to act on that information to maximise profits.

One way to look at this is that we are aggregating, in a smart way, the world’s models, opinions, and information when we come up with a price. This is a powerful concept and is why, for example, political prediction markets are much more accurate than polls or pundits.

For this reason, we are releasing another package to CRAN very soon. It contains all our odds for the entire MLB season and the 2016 US election, and can be combined with the very popular Lahman package to build predictive models and compare their predictions against real market data, to see how your model would have performed in a real market.

We believe this is a very exciting (and difficult) problem to use for educational purposes. This package can be used in conjunction with two of our existing packages that have been on CRAN for a few years: odds.converter (to convert between betting market odds types and probabilities) and pinnacle.API (used to interact with Pinnacle’s real-time odds API in R).
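As an illustration of what converting between odds types and probabilities involves, here is a minimal sketch in Python using the standard definitions of decimal and US (moneyline) odds. The function names are hypothetical and are not the odds.converter API; the proportional rescaling used to remove the bookmaker's margin is one simple convention among several.

```python
def us_to_decimal(us_odds):
    """Convert US (moneyline) odds to decimal odds."""
    if us_odds > 0:
        return 1 + us_odds / 100          # +150 pays 150 on a stake of 100
    if us_odds < 0:
        return 1 + 100 / abs(us_odds)     # -200 needs a stake of 200 to win 100
    raise ValueError("US odds cannot be zero")


def decimal_to_probability(decimal_odds):
    """Implied probability of an outcome priced at the given decimal odds."""
    if decimal_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return 1 / decimal_odds


def normalised_probabilities(decimal_odds_list):
    """Rescale implied probabilities so they sum to 1.

    The raw implied probabilities of a market usually sum to more than 1
    because of the bookmaker's margin; this proportional rescaling is an
    illustrative assumption, not the only way to remove it.
    """
    raw = [decimal_to_probability(d) for d in decimal_odds_list]
    total = sum(raw)
    return [p / total for p in raw]


market = [us_to_decimal(-200), us_to_decimal(150)]   # a hypothetical two-way market
print(market, normalised_probabilities(market))
```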

Even if you have no interest in sports or wagering, we believe this is a fascinating problem and our data and tools are perfect for the R community at large to work with, for academic reasons or for hobby.

Speakers

## Marco Blume

Wednesday July 5, 2017 11:36am - 11:54am CEST
PLENARY Wild Gallery

### 11:54am CEST

Keywords: Optimal design, Nonlinear models, Optimization, Evolutionary algorithm
Webpages: https://cran.r-project.org/web/packages/ICAOD
The ICAOD package applies a novel multi-heuristic algorithm called imperialist competitive algorithm (ICA) to find different types of optimal designs for nonlinear models (Masoudi et al., in press). The setup assumes that we have a general parametric regression model and a design criterion formulated as a convex function of the Fisher information matrix. The package constructs locally D-optimal, minimax D-optimal, standardized maximin D-optimal and optimum-on-the-average designs for a class of nonlinear models, including multiple-objective optimal designs for the 4-parameter Hill model commonly used in dose response studies and other applied fields. Several useful functions are also provided in the package, namely a function to check optimality of the generated design using an equivalence theorem followed by a graphic plot of the sensitivity function for visual appreciation. Another function is to compute the efficiency lower bound of the generated design if the algorithm is terminated prematurely.
References

Masoudi E., Holling H., Wong W.K. (in press) Application of imperialist competitive algorithm to find minimax and standardized maximin optimal designs, Computational Statistics & Data Analysis.
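The optimality check mentioned in the abstract can be sketched on a much simpler problem than the nonlinear models the package targets. The Python example below assumes a straight-line model with regressors f(x) = (1, x) on the design interval [-1, 1] and checks D-optimality through the standard equivalence theorem: a design is D-optimal when the sensitivity function f(x)ᵀM⁻¹f(x) − p is at most zero everywhere, with p = 2 parameters. The efficiency lower bound p / (p + max sensitivity) is the classical bound for D-optimality. All function names are hypothetical and are not the ICAOD API.

```python
def information_matrix(points, weights):
    """Entries (m00, m01, m11) of M = sum_i w_i f(x_i) f(x_i)^T for f(x) = (1, x)."""
    m00 = sum(weights)
    m01 = sum(w * x for x, w in zip(points, weights))
    m11 = sum(w * x * x for x, w in zip(points, weights))
    return m00, m01, m11


def sensitivity(x, points, weights):
    """Sensitivity function f(x)^T M^{-1} f(x) - p with p = 2 parameters."""
    m00, m01, m11 = information_matrix(points, weights)
    det = m00 * m11 - m01 * m01
    # Closed-form inverse of the 2x2 matrix applied to f(x) = (1, x).
    variance = (m11 - 2 * m01 * x + m00 * x * x) / det
    return variance - 2


def efficiency_lower_bound(points, weights, grid):
    """D-efficiency lower bound p / (p + max sensitivity) over the grid."""
    worst = max(sensitivity(x, points, weights) for x in grid)
    return 2 / (2 + max(worst, 0.0))


grid = [i / 100 for i in range(-100, 101)]           # design interval [-1, 1]

optimal = ([-1.0, 1.0], [0.5, 0.5])                  # equal weight on the end points
candidate = ([-0.5, 0.5], [0.5, 0.5])                # a non-optimal design

max_optimal = max(sensitivity(x, *optimal) for x in grid)
elb_optimal = efficiency_lower_bound(*optimal, grid)
elb_candidate = efficiency_lower_bound(*candidate, grid)
print(max_optimal, elb_optimal, elb_candidate)
```

For the end-point design the sensitivity function equals x² − 1, which never exceeds zero on the interval, so the design passes the check. For the candidate design it reaches 3 at the end points, which gives a lower bound of 0.4 on its D-efficiency.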

Speakers

## Ehsan Masoudi

Wednesday July 5, 2017 11:54am - 12:12pm CEST
2.01 Wild Gallery

### 11:54am CEST

Keywords: Living systematic review, meta-analysis, shiny, reproducible research
Webpages: https://github.com/natydasilva/metawRite
Systematic reviews are used to understand how effective treatments are and to design disease control policies; this approach is used by public health agencies such as the World Health Organization. Systematic reviews in the literature often include a meta-analysis that summarizes the findings of multiple studies. It is critical that such reviews are updated quickly as new scientific information becomes available, so that the best evidence is used for treatment advice. However, the current approach to publishing systematic reviews, based on peer-reviewed journals, means that reviews can rapidly become out of date and that updating is often delayed by the publication model. Living systematic reviews have been proposed as a new approach to dealing with this problem. The main concept of a living review is to enable rapid updating of systematic reviews as new research becomes available, while also ensuring a transparent process and a reproducible review. Our approach to a living systematic review will be implemented in an R package named metawRite. The goal is to combine the writing and the analysis of the review, allowing versioning and updating within an R package. The metawRite package will provide an easy and effective way to make a living systematic review available in a web-based display. Three main capabilities are needed for an effective living systematic review: the ability to produce dynamic reports, online availability with an interface that enables end users to understand the data, and the ability to efficiently update the review (and any meta-analysis) with new research (Elliott et al. 2014). The metawRite package will cover these three tasks, integrated in a friendly web-based environment for the end user.
This package is not a new meta-analysis package; instead, it will be flexible enough to read the model output of the most widely used meta-analysis packages in R (metafor (Viechtbauer 2010) and meta (Schwarzer 2007), among others), organize the information and display the results in a user-driven interactive dashboard. The main function of this package will launch a modern web-based application for updating a living systematic review. The package combines the power of R, shiny (Chang et al. 2017) and knitr (Xie 2015) to produce dynamic reports and up-to-date meta-analysis results while remaining user friendly. The package has the potential to be used by a large number of groups that conduct and update systematic reviews, such as the What Works Clearinghouse (https://ies.ed.gov/ncee/WWC/), which reviews education interventions; the Campbell Collaboration (https://www.campbellcollaboration.org), which includes reviews on social and criminal justice issues and many other social science topics; the Collaboration for Environmental Evidence (http://www.environmentalevidence.org); and groups reviewing food production and security (http://www.syreaf.org), among others.
References Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Elliott, Julian H, Tari Turner, Ornella Clavisi, James Thomas, Julian PT Higgins, Chris Mavergames, and Russell L Gruen. 2014. “Living Systematic Reviews: An Emerging Opportunity to Narrow the Evidence-Practice Gap.” PLoS Med 11 (2). Public Library of Science: e1001603.

Schwarzer, Guido. 2007. “Meta: An R Package for Meta-Analysis.” R News 7 (3): 40–45.

Viechtbauer, Wolfgang. 2010. “Conducting Meta-Analyses in R with the metafor Package.” Journal of Statistical Software 36 (3): 1–48. http://www.jstatsoft.org/v36/i03/.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.

Speakers

## Natalia da Silva

Just got my PhD in Statistics at ISU, Iowa State University
My interests are supervised learning methods, prediction, exploratory data analysis, statistical graphics, reproducible research and meta-analysis. Co-founder of R-Ladies Ames and now co-founder of R-LadiesMVD (Montevideo, UY). I'm a conference buddy.

Wednesday July 5, 2017 11:54am - 12:12pm CEST
3.01 Wild Gallery

### 11:54am CEST

Keywords: machine learning, hyperparameter optimization, tuning, classification, networked science
Webpages: https://jakob-r.github.io/mlrHyperopt/
Most machine learning tasks demand hyperparameter tuning to achieve good performance. For example, Support Vector Machines with radial basis functions are very sensitive to the choice of both the kernel width and the soft margin penalty C. However, for a wide range of machine learning algorithms these “search spaces” are less well known. Even worse, experts on particular methods might have conflicting views. The popular package caret (Jed Wing et al. 2016) approaches this problem by providing two simple optimizers, grid search and random search, and individual search spaces for all implemented methods. To prevent training on misconfigured methods, a grid search is performed by default. Unfortunately, only the parameters to be tuned are documented; the exact bounds have to be obtained from the source code. As a counterpart, mlr (Bischl et al. 2016) offers more flexible parameter tuning methods, such as an interface to mlrMBO (Bischl et al. 2017) for conducting Bayesian optimization. Unfortunately, mlr lacks default search spaces, and thus parameter tuning becomes difficult. Here mlrHyperopt steps in to make hyperparameter optimization as easy as in caret. As a matter of fact, it is unquestionably impossible for the developer of a machine learning package to be an expert on all implemented methods and to provide perfect search spaces. Hence mlrHyperopt aims at:
• improving the search spaces of caret with simple tricks.
• letting the users submit and download improved search spaces to a database.
• providing advanced tuning methods interfacing mlr and mlrMBO.
A study on selected data sets and numerous popular machine learning methods compares the performance of the grid and random search implemented in caret to the performance of mlrHyperopt for different budgets.
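The difference between the two simple optimizers mentioned above can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not use the caret, mlr or mlrHyperopt APIs; the search space bounds, the `toy_error` objective and all function names are hypothetical stand-ins for a real cross-validated error.

```python
import itertools
import random

# Hypothetical search space: bounds on the log10 scale for an RBF-SVM.
SEARCH_SPACE = {"log10_cost": (-2.0, 3.0), "log10_gamma": (-4.0, 1.0)}


def toy_error(params):
    """Stand-in for a cross-validated error; smallest at cost = 10, gamma = 0.1."""
    return (params["log10_cost"] - 1.0) ** 2 + (params["log10_gamma"] + 1.0) ** 2


def grid_search(objective, space, budget):
    """Evaluate a regular grid that uses at most `budget` evaluations."""
    names = list(space)
    # Number of grid points per axis so that the full grid fits the budget.
    per_axis = max(1, int(budget ** (1.0 / len(names)) + 1e-9))
    axes = []
    for name in names:
        lo, hi = space[name]
        if per_axis == 1:
            axes.append([(lo + hi) / 2.0])
        else:
            step = (hi - lo) / (per_axis - 1)
            axes.append([lo + i * step for i in range(per_axis)])
    candidates = [dict(zip(names, point)) for point in itertools.product(*axes)]
    return min(candidates, key=objective)


def random_search(objective, space, budget, seed=0):
    """Evaluate `budget` points drawn uniformly within the bounds."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        for _ in range(budget)
    ]
    return min(candidates, key=objective)


if __name__ == "__main__":
    for budget in (9, 36, 100):
        grid_best = grid_search(toy_error, SEARCH_SPACE, budget)
        rand_best = random_search(toy_error, SEARCH_SPACE, budget, seed=1)
        print(budget, toy_error(grid_best), toy_error(rand_best))
```

Both optimizers depend entirely on the bounds in the search space, which is why well-chosen default search spaces matter as much as the optimizer itself.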
References Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “Mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. https://CRAN.R-project.org/package=mlr.

Bischl, Bernd, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. 2017. “mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions.” arXiv:1703.03373 [Stat], March. http://arxiv.org/abs/1703.03373.

Jed Wing, Max Kuhn. Contributions from, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2016. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Speakers

## Jakob Richter

Wednesday July 5, 2017 11:54am - 12:12pm CEST
2.02 Wild Gallery

### 11:54am CEST

Keywords: package development, localization, translation, errors and warnings, R Consortium
Webpages: https://CRAN.R-project.org/package=msgtools, https://github.com/RL10N
R is becoming the global standard language for data analysis, but it requires its users to speak English. RL10N is an R Consortium funded project to make it easier to translate error messages and warnings into different languages. The talk covers how to automatically translate messages using Google Translate and Microsoft Translator, and how to easily integrate these message translations into your R packages using msgtools. Make your code more accessible to users around the world!

Speakers

## Richard Cotton

Wednesday July 5, 2017 11:54am - 12:12pm CEST
3.02 Wild Gallery
Talk, Packages

### 11:54am CEST

Keywords: Shiny, enterprise computing, open source
Webpages: https://shinyproxy.io
Shiny is a nice technology for writing interactive R-based applications. It has been rapidly adopted and the R community has collaborated on many interesting extensions. Until recently, though, deployments in larger organizations and companies required proprietary solutions. ShinyProxy fills this gap and offers a fully open source alternative to run and manage Shiny applications at scale.
In this talk we detail the ShinyProxy architecture and demonstrate how it meets the needs of organizations. First of all, by design ShinyProxy scales to thousands of concurrent users. Secondly, it offers authentication and authorization functionality using standard technologies like LDAP, Active Directory and OpenID Connect, as well as social authentication (Facebook, Twitter, Google, LinkedIn or GitHub). Thirdly, the management interface allows monitoring application usage in real time and provides infrastructure to collect usage statistics in event logging databases (e.g. InfluxDB) or databases for scientific computing (e.g. MonetDB). Finally, the ShinyProxy developers took special care to develop a solution that can be easily embedded in broader applications and (responsive) web sites.
Besides these basic features, the use of Docker technology opens a new world of possibilities that go beyond the current proprietary platforms. In the final section of the talk we will show how academic institutions, governmental organizations and industry roll out Shiny apps with ShinyProxy and, last but not least, how you can do this too.

Speakers

## Tobias Verbeke

Wednesday July 5, 2017 11:54am - 12:12pm CEST
4.02 Wild Gallery
Talk, Shiny I

### 11:54am CEST

Keywords: soundscape ecology, urbanization, green space, indicators, soundscape
Abstract
Sustainable urban environments with urban green spaces like city parks and urban gardens provide enduring benefits for individuals and society. By providing recreational spaces, they encourage physical activity, resulting in improved physical and mental health of citizens. As such, the density and the quality of these areas are of high importance in urban area planning.
In order to study urban green spaces as a landscape, the study of their soundscape, the holistic experience of their sounds, has recently gained attention in soundscape ecology. In R, the soundecology and seewave packages provide accessible processing tools for automating the calculation of soundscape ecology indicators from long-term sound recordings made by permanent outdoor recorders. These indicators give information about the biophonic component of the soundscape, and as such give a clear indication of the quality of the green space. Since bird vocalizations contribute strongly to the biophonic component, their spring singing activity is clearly reflected in the yearly pattern of these indicators.
A pilot study focussing on the annual variations of the soundscape of a typical urban green space has been conducted.

Speakers

## Paul Devos

Wednesday July 5, 2017 11:54am - 12:12pm CEST
PLENARY Wild Gallery

### 12:12pm CEST

The increase in life expectancy followed by the growing proportion of old individuals living with chronic diseases contributes to the burden of disability worldwide. The estimation of how much each chronic condition contributes to the disability prevalence can be useful to develop public health strategies to reduce the burden. In this presentation, we will introduce the R package addhaz, which is based on the attribution method (Nusselder and Looman 2004) to partition the total disability prevalence into the additive contributions of chronic diseases using cross-sectional data. The R package includes tools to fit the binomial and multinomial additive hazard models, the core of the attribution method. The models are fitted by maximizing the binomial and multinomial log-likelihood functions using constrained optimization (constrOptim). The 95% Wald and bootstrap percentile confidence intervals can be obtained for the parameter estimates. Also, the absolute and relative contribution of each chronic condition to the disability prevalence and their bootstrap confidence intervals can be estimated. An additional feature of addhaz is the possibility to use parallel computing to obtain the bootstrap confidence intervals, reducing computation time. In this presentation, we will illustrate the use of addhaz with examples for the binomial and multinomial models, using the data from the Brazilian National Health Survey, 2013.
Keywords: Disability, Binomial outcome, Multinomial outcome, Additive hazard model, Cross-sectional data
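As an illustration of the attribution idea described above, the following minimal Python sketch assumes the binomial additive hazard form in which the disability probability is 1 - exp(-(alpha + sum of beta_d * x_d)) and disability is attributed to each cause in proportion to its share of the total hazard. This is our reading of the attribution method, not the addhaz implementation; the function names and the coefficient values in the usage example are hypothetical.

```python
import math


def disability_probability(alpha, betas, diseases):
    """P(disabled) = 1 - exp(-(alpha + sum_d beta_d * x_d)) for 0/1 indicators x_d."""
    hazard = alpha + sum(betas[name] * present for name, present in diseases.items())
    return 1.0 - math.exp(-hazard)


def attribute_disability(alpha, betas, diseases):
    """Split the disability probability into additive contributions per cause.

    Each cause receives a share proportional to its part of the total hazard;
    `alpha` is treated as the background (no reported disease) hazard.
    """
    parts = {"background": alpha}
    for name, present in diseases.items():
        parts[name] = betas[name] * present
    total_hazard = sum(parts.values())
    if total_hazard == 0:
        return {cause: 0.0 for cause in parts}
    probability = 1.0 - math.exp(-total_hazard)
    return {cause: probability * part / total_hazard for cause, part in parts.items()}


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    alpha = 0.05
    betas = {"arthritis": 0.20, "diabetes": 0.10}
    person = {"arthritis": 1, "diabetes": 0}
    print(disability_probability(alpha, betas, person))
    print(attribute_disability(alpha, betas, person))
```

By construction the contributions add up to the individual's disability probability, which is what makes the decomposition additive.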
References Nusselder, Wilma J, and Caspar WN Looman. 2004. “Decomposition of Differences in Health Expectancy by Cause.” Demography 41 (2). Springer: 315–34.

Speakers

## Renata Yokota

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
3.01 Wild Gallery

### 12:12pm CEST

Keywords: Shiny modules, HTMLWidgets, HTMLTemplates, openCPU, NoSQL, Docker
Webpages: https://www.friss.eu/en/
FRISS is a fast-growing Dutch company with a 100% focus on fraud, risk and compliance for non-life insurance companies, and is the European market leader with more than 100 implementations in more than 15 countries worldwide. The FRISS platform offers insurers fully automated access to a vast set of external data sources, which together facilitate many different types of screenings, based on knowledge rules, statistical models, clustering, text mining, image recognition and other machine learning techniques. The information produced by the FRISS platform is bundled into a risk score that provides a quantified risk assessment of a person or case, enabling insurers to make better and faster decisions.
At FRISS, all analytical applications and services are based on R. Interactive applications are based on Shiny, a popular web application platform for R designed by RStudio, while openCPU, an interoperable HTTP API for R, is used to deploy advanced scoring engines at scale that can be deeply integrated into other services.
In this talk, we show various architectures on how to create high performance, large scale Shiny apps and scoring engines, with a clean code base. Shiny apps are based around the module pattern, HTMLWidgets and HTMLTemplates. Shiny modules allow a developer to compose a complex app via a set of easy to understand modules, each with separate UI and server logic. In these architectures, each module has a set of reactive inputs and outputs and focuses on a single, dedicated task. Subsequently, the modules are combined in a main app that can perform a multitude of complex tasks, yet is still easy to understand and to reason about. In addition, we show how HTMLWidgets allow you to bring the best of JavaScript, the language of the web, into R and show how HTMLTemplates can be used to create R based web applications with a fresh, modern and distinct look.
Finally, in this talk, we show various real-life examples of complex, large scale Shiny applications developed at FRISS. These applications are actively used by insurers worldwide for reporting, dashboarding, anomaly detection, interactive network exploration and fraud detection and allow insurers to combat fraud, risk and compliance. In addition, we show how the aforementioned techniques can be combined with modern NoSQL databases like ElasticSearch, MongoDB and Neo4j, to create high performance apps and how Docker can be used for a smooth deployment process in on-premises scenarios, that is both fast and secure.

Speakers

## Herman Sontrop

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
4.02 Wild Gallery
Talk, Shiny I

### 12:12pm CEST

Keywords: data maps, OpenStreetMap, spatial, visualization
Webpages: https://CRAN.R-project.org/package=osmplotr, https://github.com/ropensci/osmplotr, https://github.com/osmdatar/osmdata
R, like any and every other system for analysing and visualising spatial data, has a host of ways to overlay data on maps (or the other way around). Maps nevertheless contain data (nay, maps are data), making this act tantamount to overlaying data upon data. That’s likely not going to end well, and so this talk will present two new packages that enable you to visualise your own data with actual map data such as building polygons or street lines, rather than merely overlaying (or underlaying) them. The osmdata package enables publicly accessible data from OpenStreetMap to be read into R, and osmplotr can then use these data as a visual basis for your own data. Both categorical and continuous data can be visualised through colours or through structural properties such as line thicknesses or types. We think this results in more visually striking and beautiful data maps than any alternative approach that necessitates separating your data from map data.

Speakers
MP

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
PLENARY Wild Gallery

### 12:12pm CEST

Keywords: population growth, nonlinear models, differential equation
Webpages: https://CRAN.R-project.org/package=growthrates, https://github.com/tpetzoldt/growthrates
The population growth rate is a direct measure of fitness, common in many disciplines of theoretical and applied biology, e.g. physiology, ecology, eco-toxicology or pharmacology. The R package growthrates aims to streamline growth rate estimation from direct or indirect measures of population density (e.g. cell counts, optical density or fluorescence) of batch experiments or field observations. It is applicable to different species of bacteria, protists, and metazoa, e.g. E. coli, Cyanobacteria, Paramecium, green algae or Daphnia.
The package includes three types of methods:
1. Fitting of linear models to the period of exponential growth using the “growth rates made easy”-method of Hall and Barlow (2013),
2. Nonparametric growth rate estimation using smoothers. The current implementation uses the function smooth.spline, similar to the method of package grofit (Kahm et al. 2010),
3. Nonlinear fitting of parametric models like logistic, Gompertz, Baranyi or Huang (Huang 2011), done with package FME (Flexible Modelling Environment) of Soetaert and Petzoldt (2010). Growth models can be given either in closed form or as systems of differential equations that are solved numerically with packages deSolve (Soetaert, Petzoldt, and Setzer 2010) and cOde (Kaschek 2016).
The package contains methods to fit single data sets or complete experimental series. It uses S4 classes and contains functions for extracting results (e.g. coef, summary, residuals, …), and methods for convenient plotting. The fits and the growth models can be visualized with shiny apps.
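The first method in the list above rests on the fact that exponential growth is linear on the log scale, so the slope of log(density) against time estimates the growth rate. The Python sketch below illustrates only that core idea by scanning fixed-size windows for the steepest log-linear slope. It is not the growthrates implementation or its API; the function names and the window size are assumptions made for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def max_growth_rate(times, densities, window=5):
    """Return (slope, start_index) of the steepest log-linear window.

    The slope of log(density) versus time over `window` consecutive
    observations is an estimate of the growth rate in that period.
    """
    if len(times) < window:
        raise ValueError("need at least `window` observations")
    log_y = [math.log(d) for d in densities]
    best = None
    for start in range(len(times) - window + 1):
        stop = start + window
        _, slope = fit_line(times[start:stop], log_y[start:stop])
        if best is None or slope > best[0]:
            best = (slope, start)
    return best


if __name__ == "__main__":
    # Simulated logistic growth with rate 0.5 per hour and capacity 1.
    hours = [float(t) for t in range(25)]
    density = [1.0 / (1.0 + 99.0 * math.exp(-0.5 * t)) for t in hours]
    print(max_growth_rate(hours, density))
```

On logistic data the estimate is slightly below the true rate, because growth already slows within the first window; shorter sampling intervals reduce this bias.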
References Hall, Acar, B. G., and M. Barlow. 2013. “Growth Rates Made Easy.” Mol. Biol. Evol. 31: 232–38. doi:10.1093/molbev/mst197.

Huang, Lihan. 2011. “A New Mechanistic Growth Model for Simultaneous Determination of Lag Phase Duration and Exponential Growth Rate and a New Belehdredek-Type Model for Evaluating the Effect of Temperature on Growth Rate.” Food Microbiology 28 (4): 770–76. doi:10.1016/j.fm.2010.05.019.

Kahm, Matthias, Guido Hasenbrink, Hella Lichtenberg-Frate, Jost Ludwig, and Maik Kschischo. 2010. “Grofit: Fitting Biological Growth Curves with R.” Journal of Statistical Software 33 (7): 1–21. doi:10.18637/jss.v033.i07.

Kaschek, Daniel. 2016. cOde: Automated C Code Generation for Use with the deSolve and bvpSolve Packages. https://CRAN.R-project.org/package=cOde.

Soetaert, Karline, and Thomas Petzoldt. 2010. “Inverse Modelling, Sensitivity and Monte Carlo Analysis in R Using Package FME.” Journal of Statistical Software 33 (3): 1–28. doi:10.18637/jss.v033.i03.

Soetaert, Karline, Thomas Petzoldt, and R. Woodrow Setzer. 2010. “Solving Differential Equations in R: Package deSolve.” Journal of Statistical Software 33 (9): 1–25. doi:10.18637/jss.v033.i09.

Speakers

## Thomas Petzoldt

Senior Scientist, TU Dresden (Dresden University of Technology)
dynamic modelling, ecology, environmental statistics, aquatic ecosystems, antibiotic resistances, R packages: simecol, deSolve, FME, marelac, growthrates, shiny apps for teaching, object orientation

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
2.01 Wild Gallery

### 12:12pm CEST

Keywords: Classes, Object-oriented programming, R6, Reference classes
Webpages: https://CRAN.R-project.org/package=R6, https://github.com/wch/R6
R6 is an implementation of a classical object-oriented programming system for R. In classical OOP, objects have mutable state and they contain methods to modify and access internal state. This stands in contrast with the functional style of object-oriented programming provided by the S3 and S4 class systems, where the objects are (typically) not mutable, and the methods to modify and access their contents are external to the objects themselves.
R6 has some similarities with R’s built-in Reference Class system. Although the implementation of R6 is simpler and lighter weight than that of Reference Classes, it offers some additional features such as private members and robust cross-package inheritance.
In this talk I will discuss when it makes sense to use R6 as opposed to functional OOP, demonstrate how to use the package, and explore some of the internal design of R6.

Speakers

## Winston Chang

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
3.02 Wild Gallery
Talk, Packages

### 12:12pm CEST

Keywords: optimization, tuning, surrogate model, computer experiments
Webpages: https://CRAN.R-project.org/package=SPOT
Real-world optimization problems often have very high complexity, due to multi-modality, constraints, noise or other crucial problem features. For solving these optimization problems a large collection of methods is available. Most of these methods require setting a number of parameters, which have a significant impact on the optimization performance. Hence, a lot of experience and knowledge about the problem is necessary to obtain the best possible results. This situation grows worse if the optimization algorithm faces the additional difficulty of strong restrictions on resources, especially time, money or number of experiments.
Sequential parameter optimization (Bartz-Beielstein, Lasarczyk, and Preuss 2005) is a heuristic combining classical and modern statistical techniques for the purpose of efficient optimization. It can be applied in two manners:
• to efficiently tune and select the parameters of other search algorithms, or
• to optimize expensive-to-evaluate problems directly, via shifting the load of evaluations to a surrogate model.
SPO is especially useful in scenarios where
1. no experience of how to choose the parameter setting of an algorithm is available,
2. a comparison with other algorithms is needed,
3. an optimization algorithm has to be applied effectively and efficiently to a complex real-world optimization problem, and
4. the objective function is a black-box and expensive to evaluate.
The Sequential Parameter Optimization Toolbox SPOT provides enhanced statistical techniques such as design and analysis of computer experiments, different methods for surrogate modeling and optimization to effectively use sequential parameter optimization in the above mentioned scenarios.
Version 2 of the SPOT package is a complete redesign and rewrite of the original R package. Most function interfaces were redesigned to give a more streamlined usage experience. At the same time, modular and transparent code structures allow for increased extensibility. In addition, some new developments were added to the SPOT package. A Kriging model implementation, based on earlier Matlab code by Forrester et al. (Forrester, Sobester, and Keane 2008), has been extended to allow for the usage of categorical inputs. Additionally, it is now possible to use stacking for the construction of ensemble learners (Bartz-Beielstein and Zaefferer 2017). This allows for the creation of models with a far higher predictive performance, by combining the strengths of different modeling approaches.
In this presentation we show how the new interface of SPOT can be used to efficiently optimize the geometry of an industrial dust filter (cyclone). Based on a simplified simulation of this real world industry problem, some of the core features of SPOT are demonstrated.
References Bartz-Beielstein, Thomas, and Martin Zaefferer. 2017. “Model-Based Methods for Continuous and Discrete Global Optimization.” Applied Soft Computing 55: 154–67. doi:10.1016/j.asoc.2017.01.039.

Bartz-Beielstein, Thomas, Christian Lasarczyk, and Mike Preuss. 2005. “Sequential Parameter Optimization.” In Proceedings Congress on Evolutionary Computation 2005 (Cec’05), 1553. Edinburgh, Scotland. http://www.spotseven.de/wp-content/papercite-data/pdf/blp05.pdf.

Forrester, Alexander, Andras Sobester, and Andy Keane. 2008. Engineering Design via Surrogate Modelling. Wiley.

Speakers

## Sebastian Krey

Wednesday July 5, 2017 12:12pm - 12:30pm CEST
2.02 Wild Gallery

### 12:30pm CEST

Wednesday July 5, 2017 12:30pm - 1:30pm CEST
CATERING POINTS Wild Gallery
BREAK
• Company 74

### 1:30pm CEST

Spatial analysis of the urban environment frequently requires estimating whether a given point is shaded or not, given a representation of spatial obstacles (e.g. buildings) and a time-stamp with its associated solar position. For example, we may be interested in -
• Calculating the amount of time a given roof or facade is shaded, to determine the utility of installing photovoltaic cells for electricity production.
• Calculating shade footprint on vegetated areas, to determine the expected microclimatic influence of a new tall building.
These types of calculations are usually applied in either vector-based 3D (e.g. ESRI’s ArcScene) or raster-based 2.5D (i.e. Digital Elevation Model, DEM) settings. However, the former solutions are mostly restricted to proprietary software associated with specific 3D geometric model formats. The latter DEM-based solutions are more common, in both open-source (e.g. GRASS GIS) as well as proprietary (e.g. ArcGIS) software. The insol R package provides such capabilities in R. Though conceptually and technically simpler to work with, DEM-based approaches are less suitable for an urban environment, as opposed to natural terrain, for two reasons -
• A continuous elevation surface at sufficiently high resolution for the urban context (e.g. LIDAR) may not be available and is expensive to produce.
• DEMs cannot adequately represent individual vertical urban elements (e.g. building facades), thus limiting the interpretability of results.
The shadow package aims at addressing these limitations. Functions in this package operate on a vector layer of building outlines along with their heights (class SpatialPolygonsDataFrame from package sp), rather than a DEM. Such data are widely available, either from local municipalities or from global datasets such as OpenStreetMap. Currently functions to calculate shadow height, Sky View Factor (SVF) and shade footprint on ground are implemented. Since the inputs are vector-based, the resulting shadow estimates are easily associated with specific urban elements such as buildings, roofs or facades.
We present a case study where package shadow was used to calculate shading on roofs and facades in a large neighborhood (Rishon-Le-Zion city, Israel), on an hourly temporal resolution and a 1-m spatial resolution. The results were combined with Typical Meteorological Year (TMY) direct solar radiation data to derive total annual insolation for each 1-m grid cell. Subsequently the locations where installation of photovoltaic (PV) cells is worthwhile, given a predefined threshold production, were mapped.
The approach is currently applicable to a flat terrain and does not treat obstacles (e.g. trees) other than the buildings. Our case study demonstrates that subject to these limitations package shadow can be used to calculate shade and insolation estimates in an urban environment using widely available polygonal building data. Future development of the package will be aimed at combining vector-associated shadow calculations with raster data representing non-flat terrain.
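The geometry behind shade estimates on flat terrain can be sketched in a few lines. The Python example below is an illustration under simplifying assumptions (a single obstacle, a query point lying directly behind it along the sun's direction, flat ground); it is not the shadow package's algorithm or API, and all function names are hypothetical.

```python
import math


def shadow_length(obstacle_height, sun_elevation_deg):
    """Length of the shadow cast on flat ground, in the units of the height."""
    if sun_elevation_deg <= 0:
        return math.inf  # sun at or below the horizon
    return obstacle_height / math.tan(math.radians(sun_elevation_deg))


def shadow_height_at(distance, obstacle_height, sun_elevation_deg):
    """Height up to which a vertical surface `distance` behind the obstacle is shaded."""
    if sun_elevation_deg <= 0:
        return math.inf
    drop = distance * math.tan(math.radians(sun_elevation_deg))
    return max(0.0, obstacle_height - drop)


def is_shaded(point_height, distance, obstacle_height, sun_elevation_deg):
    """True if a point at `point_height` lies below the shadow line."""
    return point_height < shadow_height_at(distance, obstacle_height, sun_elevation_deg)


if __name__ == "__main__":
    # A 10 m building with the sun 45 degrees above the horizon.
    print(shadow_length(10.0, 45.0))
    print(shadow_height_at(5.0, 10.0, 45.0))
    print(is_shaded(2.0, 5.0, 10.0, 45.0))
```

Repeating such a test for every grid cell and every hourly solar position gives the fraction of time each cell is shaded, which can then be combined with radiation data as in the case study.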

Speakers

## Michael Dorman

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
3.01 Wild Gallery
Talk, GIS

### 1:30pm CEST

1. Institute for Biostatistics and Statistical Bioinformatics, Hasselt University, Belgium
2. Independent consultant
Keywords: High dimensional data, Clustering
Webpages: https://cran.r-project.org/web/packages/IntClust/index.html
Discovering the exact activities of a compound is of primary interest in drug development. A single drug can interact with multiple targets, and unintended drug-target interactions could lead to severe side effects. Therefore, it is valuable in the early phases of drug discovery not only to demonstrate the desired on-target efficacy of compounds but also to outline their unwanted off-target effects. Further, the earlier unwanted behaviour is documented, the better. Otherwise, the drug could fail at a later stage, which means that the invested time, effort and money are lost.
In the early stages of drug development, different types of information on the compounds are collected: the chemical structures of the molecules (fingerprints), the predicted targets (target predictions), data on various bioassays, the toxicity and more. An analysis of each data source on its own could reveal interesting yet disjoint information. It only provides a limited point of view and does not give information on how everything is interconnected in the global picture (Shi, De Moor, and Moreau 2009). Therefore, a simultaneous analysis of multiple data sources can provide a more complete insight into the compounds' activity.
An analysis based on multiple data sources is a relatively new and growing area in drug discovery and drug development. Multi-source clustering procedures provide us with the opportunity to relate several data sources to each other to gain a better understanding of the mechanism of action of compounds. The use of multiple data sources was investigated in the QSTAR (quantitative structure transcriptional activity relationship) consortium (Ravindranath et al. 2015). The goal was to find associations between chemical, bioassay and transcriptomic data in the analysis of a set of compounds under development.
In the current study, we extend the clustering method presented in (Perualila-Tan et al. 2016) and review the performance of several clustering methods on a real drug discovery project in R. We illustrate how the new clustering approaches provide valuable insight for the integration of chemical, bioassay and transcriptomic data in the analysis of a specific set of compounds. The proposed methods are implemented and publicly available in the R package IntClust, which is a wrapper package for a multitude of ensemble clustering methods.
References Perualila-Tan, N., Z. Shkedy, W. Talloen, H. W. H. Goehlmann, QSTAR Consortium, M. Van Moerbeke, and A. Kasim. 2016. “Weighted-Similarity Based Clustering of Chemical Structure and Bioactivity Data in Early Drug Discovery.” Journal of Bioinformatics and Computational Biology.

Ravindranath, A. C., N. Perualila-Tan, A. Kasim, G. Drakakis, S. Liggi, S. C. Brewerton, D. Mason, et al. 2015. “Connecting Gene Expression Data from Connectivity Map and in Silico Target Predictions for Small Molecule Mechanism-of-Action Analysis.” Mol. BioSyst. 11 (1). The Royal Society of Chemistry: 86–96. doi:10.1039/C4MB00328D.

Shi, Y., B. De Moor, and Y. Moreau. 2009. “Clustering by Heterogeneous Data Fusion: Framework and Applications.” NIPS Workshop.

Speakers

## Marijke Van Moerbeke


Wednesday July 5, 2017 1:30pm - 1:48pm CEST
2.01 Wild Gallery

### 1:30pm CEST

The _R_ community suffers from an underrepresentation of women* in every role and area of participation: whether as leaders (no women on the _R_ core team, 5 of 37 female ordinary members of the _R_-Foundation), package developers [around 10% women amongst CRAN maintainers, @forwards; @CRANsurvey], conference speakers and participants [around 28% at useR! 2016, @forwards], educators, or users.

As a diversity initiative alongside the Forwards Task Force, _R_-Ladies’ mission is to achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the _R_ community.
_R_-Ladies’ primary focus is on supporting the _R_ enthusiasts who identify as an underrepresented gender minority to achieve their programming potential, by building a collaborative global network of _R_ leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.

Since _R_-Ladies Global was created a year ago, we have grown exponentially to more than 4000 _R_-Ladies in 15 countries and have established a great brand. We want to share the amazing work _R_-Ladies has achieved, our future plans, and how the _R_ community can support and champion _R_-Ladies around the world.

Speakers

## Hannah Frick

Data Scientist, Mango Solutions
Talk to me about psychometrics/IRT, running (data in R) and diversity / R-Ladies!

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
PLENARY Wild Gallery
Talk, Community

### 1:30pm CEST

Keywords: stream processing, big data, ETL, scale
Webpages: https://CRAN.R-project.org/package=AWR, https://CRAN.R-project.org/package=AWR.KMS, https://CRAN.R-project.org/package=AWR.Kinesis
R is rarely mentioned among the big data tools, although it scales fairly well for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL, starting multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards.
Besides the technical background and a quick introduction on how Kinesis works, this talk will feature some stream processing use-cases at CARD.com, and will also provide an overview and hands-on demos on the related data infrastructure built on top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs – besides the related open-source R packages, e.g. AWR, AWR.KMS and AWR.Kinesis, developed at CARD.
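As background on how records end up in shards: Kinesis hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The Python sketch below imitates this routing under the assumption of equally sized shard ranges; it does not call AWS or the AWR.Kinesis package, and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128-bit integers


def shard_ranges(n_shards):
    """Split the hash space into n contiguous, equally sized ranges (assumed layout)."""
    step = HASH_SPACE // n_shards
    return [(i * step, HASH_SPACE - 1 if i == n_shards - 1 else (i + 1) * step - 1)
            for i in range(n_shards)]


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose hash range contains the key's MD5 hash."""
    h = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return idx
    raise ValueError("hash outside all shard ranges")


ranges = shard_ranges(4)
assignment = {key: shard_for_key(key, ranges) for key in ["user-1", "user-2", "user-3"]}
print(assignment)
```

Because the same key always maps to the same shard, one worker session per shard sees all records for a given key in order.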

Speakers

## Gergely Daroczi

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
3.02 Wild Gallery
Talk, HPC

### 1:30pm CEST

Keywords: text mining, natural language processing, tidy data, sentiment analysis
Webpages: https://CRAN.R-project.org/package=tidytext, http://tidytextmining.com/
Unstructured, text-heavy data sets are increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. We introduce the tidytext package for approaching text analysis from a tidy data perspective. We can manipulate, summarize, and visualize the characteristics of text using the R tidy tool ecosystem; these tools extend naturally to many text analyses and allow analysts to integrate natural language processing into effective workflows already in wide use. We explore how to implement approaches such as sentiment analysis of texts and measuring tf-idf to quantify what a document is about.
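As a rough illustration of the tf-idf idea mentioned above, the following Python sketch computes one row per document-term pair, in the spirit of a tidy table. It is not the tidytext implementation; the whitespace tokenizer and the toy documents are assumptions, and the idf used here is the natural log of (number of documents / number of documents containing the term).

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of document id -> text. Returns rows (doc, term, n, tf, idf, tf_idf)."""
    counts = {doc: Counter(text.lower().split()) for doc, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    rows = []
    for doc, c in counts.items():
        total = sum(c.values())
        for term, n in c.items():
            tf = n / total
            idf = math.log(len(docs) / doc_freq[term])
            rows.append((doc, term, n, tf, idf, tf * idf))
    return rows


docs = {
    "a": "the cat sat on the mat",
    "b": "the dog chased the cat",
    "c": "the bird sang",
}
for row in sorted(tf_idf(docs), key=lambda r: -r[5])[:3]:
    print(row)
```

A word such as "the" that appears in every document gets an idf of zero, so only terms that distinguish a document score highly.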

Speakers

## Julia Silge

data scientist, Stack Overflow

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
4.02 Wild Gallery

### 1:30pm CEST

Keywords: data science, predictive maintenance, industry 4.0, business, industry, use case
The buzz for industry 4.0 continues – digitalizing business processes is one of the main aims of companies in the 21st century. One topic gains particular importance: predictive maintenance. Enterprises use this method in order to cut production and maintenance costs and to increase reliability.
Being able to predict machine failures, performance drops or quality deterioration is a huge benefit for companies. With this knowledge, maintenance and failure costs can be reduced and optimized.
With the help of R and its massive community, analysts can apply the best algorithms and methods for predictive maintenance. When a good analytic model for predictive maintenance has been found, companies face the challenge of implementing it in their own environments and workflows. Especially regarding the workflow across different departments, it is necessary to find an appropriate solution that also supports interdisciplinary work.
My talk will show how this challenge was solved for TRUMPF Laser GmbH, a subsidiary of TRUMPF, a world-leading high-technology company which offers production solutions in the machine tool, laser and electronic sectors. I would like to share my experience with R and predictive maintenance in a real-world industry scenario and show the audience how to automate R code and visualize it in a front-end solution for all departments involved.

Speakers

## Andreas Prawitt

Data Scientist, eoda GmbH
At the useR! conference I am interested to see how data scientists use R in larger companies. I am looking forward to showing you how TRUMPF Lasertechnik integrated R into their analytical workflows.

Wednesday July 5, 2017 1:30pm - 1:48pm CEST
2.02 Wild Gallery

### 1:48pm CEST

Keywords: GIS interface, QGIS, Python interface
Webpages: https://cran.r-project.org/web/packages/RQGIS/index.html, https://github.com/jannes-m/RQGIS

Speakers

## Jannes Muenchow

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
3.01 Wild Gallery
Talk, GIS
• Company 58

### 1:48pm CEST

Keywords: Big Data, Machine Learning, Scalability, High Performance Computing, Graph Analytics
Webpages: https://oracle.com/goto/R
Big Data garners much attention, but how can enterprises extract value from the data found in growing corporate data lakes or data reservoirs? Extracting value from big data requires high performance and scalable tools – both in hardware and software. Increasingly, enterprises take on massive machine learning and graph analytics projects, where the goal is to build models and analyze graphs involving multi-billion row tables or to partition analyses into thousands or even millions of components.
Data scientists need to address use cases that range from modeling individual customer behavior to understanding aggregate behavior, or exploring centrality of nodes within a graph to monitoring sensors from the Internet of Things for anomalous behavior. While R is cited as the most used statistical language, limitations of scalability and performance often restrict its use for big data. In this talk, we present architectural elements enabling high performance and scalability, highlighting scenarios both on Hadoop/Spark and database platforms using R. We illustrate how Oracle Advanced Analytics’ Oracle R Enterprise component and Oracle R Advanced Analytics for Hadoop enable using R on big data, achieving both scalability and performance.

Speakers

## Mark Hornick

Senior Director, Oracle
Mark Hornick is the Senior Director of Product Management for the Oracle Machine Learning (OML) family of products. He leads the OML PM team and works closely with Product Development on product strategy, positioning, and evangelization. Mark has over 20 years of experience with integrating...

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
3.02 Wild Gallery
Talk, HPC

### 1:48pm CEST

Abstract: Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e., data made up of profiles, whose rows belong to the simplex), remains largely unexplored, particularly in cases where the observed value of an observation is equal or close to zero for one or more samples. This work is motivated by the analysis of two sets of compositional data, both focused on the categorization of profiles but arising from considerably different applications: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib’ bike sharing system in Paris, France. For both of these applications, we propose the use of appropriate data transformations in conjunction with either Gaussian mixture models or K-means algorithms and penalized model selection criteria. Using our Bioconductor package coseq, we illustrate the user-friendly implementation and visualization provided by our proposed approach, with a focus on the functional coherence of the gene co-expression clusters and the geographical coherence of the bike station groupings.
Keywords: Clustering, compositional data, K-means, mixture model, transformation, co-expression
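The combination of a data transformation with K-means described in the abstract can be sketched in Python as follows. This is not the coseq interface: the arcsine square-root transform is chosen here only because it is defined for zero proportions, and the function names, starting centers and toy counts are all assumptions.

```python
import math


def to_profiles(counts):
    """Convert each row of non-negative counts into proportions that sum to 1."""
    return [[x / sum(row) for x in row] for row in counts]


def arcsine_sqrt(profiles):
    """Variance-stabilising transform that is defined at zero proportions."""
    return [[math.asin(math.sqrt(p)) for p in row] for row in profiles]


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd's algorithm starting from the given initial centers."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(len(centers)),
                      key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
                  for p in points]
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy counts: rows are genes (or stations), columns are conditions (or time slots).
counts = [[90, 10, 0], [80, 15, 5], [0, 20, 80], [5, 10, 85]]
data = arcsine_sqrt(to_profiles(counts))
labels = kmeans(data, centers=[list(data[0]), list(data[2])])
print(labels)
```

Note that the first and third rows contain exact zeros, which a log-ratio transform could not handle without an adjustment.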

Speakers

## Andrea Rau

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
2.01 Wild Gallery

### 1:48pm CEST

Keywords: political science, reproducibility, corpus, data journalism, text mining
Webpages: https://CRAN.R-project.org/package=manifestor, https://manifesto-project.wzb.eu/information/documents/manifestoR
The Manifesto Project is a long-term political science research project that has been collecting, archiving and analysing party programs from democratic elections since 1979, and is one of the longest-standing and most widely used data sources in political science. The project recently released manifestoR as its official R package for accessing and analysing the data collected by the project. The package is aimed at three groups: it is a valuable tool for data journalism and social sciences, a data source for text mining, and a prototype for software that promotes research reproducibility.
The manifestoR package provides access to the Manifesto Corpus (Merz, Regel & Lewandowski 2016) – the project’s text database – which contains more than 3000 digitalised election programmes from 573 parties, together running in elections between 1946 and 2015 in 50 countries, and includes documents in more than 35 different languages. More than 2000 of these documents are available as digitalised, cleaned, UTF-8 encoded full text – the rest as PDF files. As these texts are accessible from directly within R, manifestoR provides a comfortable and valuable data source for text miners interested in political and/or multilingual training data, as well as for data journalists.
The manifesto texts accessible through manifestoR are labelled statement by statement, according to a 56-category scheme which identifies policy issues and positions. On the basis of this labelling scheme, the political science community has developed many aggregate indices on different scales for parties’ ideological positions. Most of these algorithms have been collected and included in manifestoR in order to provide a centralised and easy-to-use starting point for scientific and journalistic analyses and inquiries.
Replicability and reproducibility of scientific analyses are core values of the R community, and are of growing importance in the social sciences. Hence, manifestoR was designed with the goal of reproducible research in mind and tries to set an example of how a political science research project can publish and maintain an open source package to promote reproducibility when using its data. The Manifesto Project’s text collection is constantly growing and being updated, but any version ever published can easily be used as the basis for scripts written with manifestoR. In addition, the package integrates seamlessly with the widely-used tm package (Feinerer 2008) for text mining in R, and provides a data_frame representation for every data object in order to connect to the tidyverse packages (Wickham 2014), including the text-specific tidytext (Silge & Robinson 2016). For standardising and open-sourcing the implementations of aggregate indices from the community in manifestoR, we sought collaboration with the original authors. Additionally, the package provides infrastructure to easily adapt such indices, or to create new ones. The talk will also discuss the lessons learned and the unmet challenges that have arisen in developing such a package specifically for the political science community.
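To illustrate what an aggregate index built on statement-level labels can look like, here is a minimal Python sketch of a left-right score computed as the percentage of right-coded statements minus the percentage of left-coded statements. It is not manifestoR code, and the category codes are placeholders rather than the project's actual coding scheme.

```python
from collections import Counter

# Placeholder code sets: NOT the project's actual category scheme.
LEFT_CODES = {"L1", "L2"}
RIGHT_CODES = {"R1", "R2"}


def left_right_score(statement_codes):
    """Percentage of right-coded minus percentage of left-coded statements."""
    counts = Counter(statement_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no coded statements")
    right = sum(counts[c] for c in RIGHT_CODES)
    left = sum(counts[c] for c in LEFT_CODES)
    return 100.0 * (right - left) / total


# One hypothetical manifesto, one code per statement ("N1" is a neutral placeholder).
manifesto = ["L1", "L1", "R2", "N1", "L2", "N1", "R1", "L1"]
print(left_right_score(manifesto))
```

Adapting such an index amounts to changing which codes count as left or right, which is the kind of variation the package's infrastructure is said to support.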
References
• Feinerer, Ingo (2008). A text mining framework in R and its applications. Doctoral thesis, WU Vienna University of Economics and Business.
• Merz, N., Regel, S., & Lewandowski, J. (2016). The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Research & Politics, 3(2), 2053168016643346. doi:10.1177/2053168016643346
• Silge, J., & Robinson, D. (2016). Tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037
• Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10

Speakers

## Jirka Lewandowski

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
4.02 Wild Gallery

### 1:48pm CEST

Keywords: Shiny, Project Management, Product Management, Best Practice, Stakeholder Management
Shiny development is exploding in the R world, especially for enabling analysts to share their results with business users interactively. At Mango Solutions, the number of Shiny apps we are being commissioned to build has increased dramatically with approximately 30% of current projects involving some aspect of Shiny development.
Typically, Shiny has been used as a prototyping tool to quickly show business the value of data driven projects with the aim to productionalise the app once buy-in from stakeholders is gained. Shiny is fantastically quick to get an app up and running and into the hands of users and additional features can be rapidly prototyped for stakeholders.
In this presentation I will share with you our experience from a client project where Shiny prototyping got out of control: the app was so successful for the business that the pilot phase quickly evolved into full deployment, as more users became involved in “testing” before production best practice had been implemented. I will then tell you how we faced this challenge, which involved client education and the planning and implementation of the required deployment rigour.
I will also share our thoughts on how to approach Shiny prototyping and development (taking on board our lessons learnt) depending on the app’s needs: you can still quickly implement features with Shiny, but with a few recommendations you can minimise the largest risks of your app getting away from you.

Speakers

## Grace Meyer

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
2.02 Wild Gallery

### 1:48pm CEST

1. Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Center for Statistics, Hasselt University, 3590 Diepenbeek, Belgium
2. Department of Epidemiology and Biostatistics, Gonder University, Ethiopia
3. Human Sciences Research Council (HSRC), PRETORIA, South Africa
4. The University of South Africa (UNISA), PRETORIA, South Africa
5. Wolfson Research Institute for Health and Wellbeing, Durham University, Durham

Keywords: Developing countries, master programs, Biostatistics, E-learning using R

Webpage: https://github.com/eR-Biostat

One of the main problems in higher education at a master level in developing countries is the lack of high-quality course materials for courses in master programs. The >eR-Biostat initiative is focused on master programs in Biostatistics/Statistics and aims to develop a new E-learning system for courses at a master level.

The E-learning system, developed as a part of the >eR-Biostat initiative, offers free online course materials for master students in biostatistics/statistics in developing countries. For each course, the materials are publicly available and consist of several types of course materials: (1) notes for the course, (2) slides for the course, (3) ready-to-use R programs, which contain all data and R code for all the examples and illustrations discussed in the course, and (4) homework assignments and exams.

The >eR-Biostat initiative introduces a new, R-based learning system, the multi-module learning system, in which students at local universities in developing countries will be able to follow courses in different learning formats, including e-courses taken online and a combination of e-courses and local lectures given by local staff members. R software and packages are used in all courses as the data analysis tool for all examples and illustrations. The >eR-Biostat initiative provides a free, accessible and ready-to-use tool for capacity building in biostatistics/statistics for local universities in developing countries that currently have low or near-zero capacity in these topics. By its nature, the R community is used to this type of collaboration (for example, CRAN and Bioconductor, which offer access to the most up-to-date R packages for data analysis). The >eR-Biostat initiative aims to bring R community members together to develop higher education courses in the same way as is currently done in software development.

Speakers

## Ziv Shkedy

Professor, Hasselt University
I am a professor for biostatistics and bioinformatics in the center for statistics in Hasselt University, Belgium

Wednesday July 5, 2017 1:48pm - 2:06pm CEST
PLENARY Wild Gallery
Talk, Community

### 2:06pm CEST

This talk introduces the R package cleanNLP, which provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford’s CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish (Marneffe et al. 2016; De Marneffe et al. 2014). Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction (Lee et al. 2011). The functionality provided by the package applies the tidy data philosophy (Wickham 2014) to the processing of raw textual data by offering three distinct contributions:

• a data schema representing the output of an NLP annotation pipeline as a collection of normalized tables;
• a set of native Java output functions converting a Stanford CoreNLP annotation object directly, without converting into an intermediate XML format, into this collection of normalized tables;
• tools for converting from the tidy model into (sparse) data matrices appropriate for exploratory and predictive modeling.
Together, these contributions simplify the process of doing exploratory data analysis over a corpus of text. The output works seamlessly with both tidy data tools as well as other programming and graphing systems. The talk will illustrate the basic usage of the cleanNLP package, explain the rationale behind the underlying data model, and show an example from a corpus of the text from every State of the Union address made by a United States President (Peters 2016).
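The step from a normalized token table to a sparse matrix can be pictured with a small Python sketch. It does not use cleanNLP or CoreNLP; the column names and the tiny hand-written token table are assumptions chosen to resemble a one-row-per-token schema.

```python
from collections import defaultdict

# One row per token, as in a normalized token table (column names are illustrative).
tokens = [
    {"doc_id": 1, "sid": 1, "tid": 1, "word": "We", "lemma": "we", "upos": "PRON"},
    {"doc_id": 1, "sid": 1, "tid": 2, "word": "hold", "lemma": "hold", "upos": "VERB"},
    {"doc_id": 1, "sid": 1, "tid": 3, "word": "truths", "lemma": "truth", "upos": "NOUN"},
    {"doc_id": 2, "sid": 1, "tid": 1, "word": "Truth", "lemma": "truth", "upos": "NOUN"},
    {"doc_id": 2, "sid": 1, "tid": 2, "word": "matters", "lemma": "matter", "upos": "VERB"},
]


def sparse_term_counts(token_rows, keep_upos=("NOUN", "VERB")):
    """Build a sparse {(doc_id, lemma): count} matrix from selected parts of speech."""
    matrix = defaultdict(int)
    for row in token_rows:
        if row["upos"] in keep_upos:
            matrix[(row["doc_id"], row["lemma"])] += 1
    return dict(matrix)


dtm = sparse_term_counts(tokens)
print(dtm)
```

Only non-zero cells are stored, so the structure stays small even when the vocabulary is large.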

Speakers

## Taylor Arnold

Assistant Professor of Statistics, University of Richmond
Large scale text and image processing

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
4.02 Wild Gallery

### 2:06pm CEST

Keywords: data science, business, industry, best practice, critical enterprise environments, R
Over the last couple of years, R has become increasingly popular among business users. Today, it is the first choice of many data science departments when it comes to ad-hoc analysis and data visualization, research and prototyping.
But when it comes to critical production environments, IT departments are still reluctant to consider R as part of their software stack. And there are reasons for that: Dynamic typing, the reputation for being slow (still around!), the lack of experience regarding management and administration of R (and its 10,000 packages), to name some of them.
Nevertheless, with the help of some friends, it is feasible and reasonable to use R as a core application even in critical production environments. This talk will share lessons learned from practical experience and point out a best practice landscape of tools, approaches and methods to make that happen.

Speakers

## Oliver Bracht

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
2.02 Wild Gallery

### 2:06pm CEST

Keywords: R, package, biclustering, binary data
Webpages: https://cran.r-project.org/web/packages/BiBitR/index.html, https://github.com/ewouddt/BiBitR
Biclustering is a data analysis method that can be used to cluster the rows and columns in a (big) data matrix simultaneously in order to identify local submatrices of interest, i.e., local patterns in a big data matrix. For binary data matrices, the local submatrices that biclustering methods can identify consist of rectangles of 1’s. Several methods were developed for biclustering of binary data, such as the Bimax algorithm proposed by Prelić et al. (2006) and the BiBit algorithm by Rodriguez-Baena, Perez-Pulido, and Aguilar-Ruiz (2011). However, these methods are capable of discovering only perfect biclusters, which means that noise is not allowed (i.e., zeros are not included in the bicluster). We present an extension for the BiBit algorithm (E-BiBit) that allows for noisy biclusters. While this method works very fast, its downside is that it often produces a large number of biclusters (typically >10000), which makes it very difficult to recover any meaningful patterns and to interpret the results. Furthermore, many of these biclusters are highly overlapping.
We propose a data analysis workflow to extract meaningful noisy biclusters from binary data using an extended and ‘pattern-guided’ version of BiBit and combine it with traditional clustering/networking methods. The proposed algorithm and the data analysis workflow are illustrated using the BiBitR R package to extract and visualize these results.
The proposed method/data analysis flow is applied to high-dimensional real-life health data which contains information on the disease symptoms of hundreds of thousands of patients. The E-BiBit algorithm is used to identify homogeneous subsets of patients who share the same disease symptom profiles.
The E-BiBit has also been included in the BiclustGUI R package (De Troyer and Otava (2016), De Troyer et al. (2016)), an ensemble GUI package in which multiple biclustering and visualisation methods are implemented.
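The core idea of a BiBit-style search with a noise allowance can be sketched in Python as follows: pairs of rows seed a column pattern (the columns where both rows contain a 1), and every row with at most `noise` zeros inside that pattern joins the bicluster. This is a simplified illustration under assumed parameter names, not the BiBitR implementation or the exact E-BiBit algorithm.

```python
from itertools import combinations


def bibit_noisy(matrix, min_rows=2, min_cols=2, noise=0):
    """Simplified BiBit-style search; rows may miss up to `noise` pattern columns."""
    found = {}
    for i, j in combinations(range(len(matrix)), 2):
        # The pattern is the set of columns where both seed rows contain a 1.
        cols = tuple(c for c, (a, b) in enumerate(zip(matrix[i], matrix[j])) if a and b)
        if len(cols) < min_cols or cols in found:
            continue
        # A row joins if it has at most `noise` zeros inside the pattern columns.
        rows = [r for r, row in enumerate(matrix)
                if sum(1 for c in cols if not row[c]) <= noise]
        if len(rows) >= min_rows:
            found[cols] = rows
    return [(rows, list(cols)) for cols, rows in found.items()]


data = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
perfect = bibit_noisy(data, noise=0)
noisy = bibit_noisy(data, noise=1)
print(perfect)
print(noisy)
```

With `noise=1` the third row joins the three-column bicluster despite its single zero, which also shows how allowing noise produces larger and more heavily overlapping biclusters.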
References

De Troyer, E., and M. Otava. 2016. Package ’Rcmdrplugin.BiclustGUI’: ’Rcmdr’ Plug-in Gui for Biclustering. https://ewouddt.github.io/RcmdrPlugin.BiclustGUI/aboutbiclustgui/.

De Troyer, E., M. Otava, J. D. Zhang, S. Pramana, T. Khamiakova, S. Kaiser, M. Sill, et al. 2016. “Applied Biclustering Methods for Big and High-Dimensional Data Using R.” In, edited by A. Kasim, Z. Shkedy, S. Kaiser, S. Hochreiter, and W. Talloen. CRC Press Taylor & Francis Group, Chapman & Hall/CRC Biostatistics Series.

Prelić, A., S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Henning, L. Thiele, and E. Zitzler. 2006. “A Systematic Comparison and Evaluation of Biclustering Methods for Gene Expression Data.” Bioinformatics 22: 1122–9.

Rodriguez-Baena, Domingo S., Antonio J. Perez-Pulido, and Jesus S. Aguilar-Ruiz. 2011. “A Biclustering Algorithm for Extracting Bit-Patterns from Binary Datasets.” Bioinformatics 27 (19).

Speakers

## Ewoud De Troyer

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
2.01 Wild Gallery

### 2:06pm CEST

There is a lot happening at R Consortium! We now have 15 members, including the Gordon and Betty Moore Foundation which joined this year as a Platinum member, 21 active projects and a fired-up grant process. This March the ISC awarded grants to 10 new projects totaling nearly \$240,000. In this talk we will describe how the R Consortium is evolving to carry out its mission to provide support for the R language, the R Foundation and the R Community. We will summarize the active projects, why they are important and where we think they are going, and describe how individual R users can get involved with R Consortium projects.
Keywords: R Consortium
Webpages: https://www.r-consortium.org/

Speakers

## David Smith

Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
PLENARY Wild Gallery
Talk, Community

### 2:06pm CEST

Keywords: Spatial analysis, Setup, GRASS, SAGA, OTB, QGIS
Despite the well known capabilities of spatial analysis and data handling in the world of R, an enormous gap persists between R and the mature open source Geographic Information System (GIS) and Remote Sensing (RS) software community. Prominent representatives like QGIS, GRASS GIS and SAGA GIS provide comprehensive and continually growing collections of highly sophisticated algorithms that are mostly fast, stable and usually well tested by the community.
Although a number of R wrappers aim to bridge this gap (e.g. rgrass7 for GRASS GIS 7.x, RSAGA for SAGA GIS) – among which RQGIS is the most recent, providing simple access to the powerful QGIS command line interface – most of these packages are not that easy to set up. Most of the wrappers try to find and/or set an appropriate environment; nevertheless, it is in many cases at least cumbersome to get all necessary settings correct, especially if one has to work with restricted rights or parallel installations of the same GIS software.
In order to overcome known limitations, the package link2GI provides a small framework for easy linking of R to major GIS software. Here, linking simply means to provide all necessary environment settings as well as full access to the command line APIs of these software tools, whereby the strategy differs from software to software. As a result, an easy entry point is provided for linking current versions of GRASS GIS 7.x, SAGA GIS, QGIS as well as other command line tools like the Orfeo Toolbox (OTB) to R. The package focuses both on R users who are not very familiar with the conditions and pitfalls of their preferred operating system and on more experienced users who want some comfortable shortcuts for a seamless integration of e.g. GRASS. The simplest call, link2GI::linkGRASS7(x=anySpatialObject), will search for the OS-dependent installations of GRASS 7. Furthermore, it will set up the R session according to the provided spatial object. All steps can be influenced manually, which will significantly speed up the process. Especially if you work with already established GRASS databases, it provides a convenient way to link mapsets and locations correctly.
The package also provides some basic tools beyond simple linking. With Edzer Pebesma’s new sf package, it is possible for the first time to deal with big vector data sets (> 1.000.000 polygons or 25.000.000 vertices). Nevertheless, it is advantageous to run the more sophisticated spatial analyses in external GIS software. To improve this process, link2GI provides a first version of direct reading and writing of GRASS and SAGA vector data from and to R to speed up the conversion process. Finally, a first version of a common Orfeo Toolbox wrapper for simplifying OTB calls is introduced.
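The notion of linking as "finding an installation and providing the environment settings" can be illustrated with a short Python sketch. It is not how link2GI is implemented; the function name, the candidate folders and the environment variable are hypothetical placeholders.

```python
import os
import shutil


def link_tool(executable, extra_dirs=(), env_var=None):
    """Locate a command line tool and optionally export its folder via env_var."""
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    found = shutil.which(executable, path=search_path)
    if found is None:
        return None
    if env_var:
        os.environ[env_var] = os.path.dirname(found)
    return found


# Hypothetical usage: "grass" and GIS_TOOL_HOME are placeholder names, not link2GI settings.
location = link_tool("grass", extra_dirs=["/usr/local/bin", "/opt/grass/bin"],
                     env_var="GIS_TOOL_HOME")
print(location or "no installation found on this machine")
```

Passing explicit folders in `extra_dirs` mirrors the point made above that manual hints shorten the search, and helps when several installations exist side by side.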

Speakers

## Christoph Reudenbach

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
3.01 Wild Gallery
Talk, GIS

### 2:06pm CEST

Keywords: R, Spark, Docker containers, Kubernetes, Cloud computing
Rc$$^2$$ (R cloud computing) is a containerized environment for running R, Hadoop, and Spark with various persistent data stores including PostgreSQL, HDFS, HBase, Hive, etc. At this time, the server side of Rc$$^2$$ runs on Docker’s Community Edition, which can be: on the same machine as the client, on a server, or in the cloud. Currently, Rc$$^2$$ supports a macOS client, but iOS and web clients are in active development.
The clients are designed for small or large screens with a left editor panel and a right console/output panel. The editor panel supports R scripts, R Markdown, and Sweave, but bash, SQL, Python, and additional languages will be added. The right panel allows toggling among the console and graphical objects as well as among generated help, html, and pdf files. A slide-out panel allows toggling among session files, R environments, and R packages. Extensive search capabilities are available in all panels.
The base server configuration has containers for an app server, a database server, and a compute engine. The app server communicates with the client. The compute engine is available with or without Hadoop/Spark. Additional containers can be added or removed from within Rc$$^2$$ as it is running, or various prebuilt topologies can be launched from the Welcome window. Multiple sessions can be run concurrently in tabs. For example, a local session could be running along with another session connected to a Spark cluster.
Although the Rc$$^2$$ architecture supports physical servers and clusters, the direction of computing is in virtualization. The Docker containers in Rc$$^2$$ can be orchestrated by Kubernetes to build arbitrarily large virtual clusters for the compute engine (e.g., parallel R) and/or for Hadoop/Spark. The focus initially is on building a virtual cluster from Spark containers using Kubernetes built on a persistent data store, e.g., HDFS. The ultimate goal is to build data science workflows, e.g., ingesting streaming data into Kafka, modulating it into a data store, and passing it to Spark Streaming.

Speakers

## Jim Harner

Professor Emeritus, West Virginia University

Rc2 pdf

Wednesday July 5, 2017 2:06pm - 2:24pm CEST
3.02 Wild Gallery
Talk, HPC

### 2:24pm CEST

Keywords: R foundation, R community, gender gap, diversity, useR! conferences
Webpages: https://forwards.github.io/
R Forwards is an R Foundation taskforce which aims at leading the R community forwards in widening the participation of women and other under-represented groups. We are organized in sub-teams that work on specific tasks, such as data collection and analysis, social media, gathering teaching materials, organizing targeted workshops, keeping track of scholarships and interesting diversity initiatives, etc. In this talk, I will present an overview of our activities and in particular the work of the survey team who analyzed the questionnaire run at useR! 2016. We collected information on the participants’ socio-demographic characteristics, experiences and interest in R to get a better understanding of how to make the R community a more inclusive environment. We regularly publish our results in blog posts. Based on this analysis, I will present some of our recommendations.

Speakers

## Julie Josse

Polytechnique, Polytechnique
Professor of statistics; my research focuses on handling missing values. I am a conference buddy and would be glad to talk with you.

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
PLENARY Wild Gallery
Talk, Community

### 2:24pm CEST

Keywords: R in production, business applications
INWT Statistics is a company specialising in services around Predictive Analytics. For our clients we develop customised algorithms and solutions. While R is the de facto standard within our company, we face many challenges in our day-to-day work when we implement these solutions for our clients. To overcome these challenges we use standardised approaches for integrating predictive models into the infrastructure of our clients.
In this talk I will give an overview of a typical project structure and the role of the R language within our projects. R is used as an Analytics tool, for automatic reporting, building dashboards, and various programming tasks. While developing solutions, we always have to keep in mind how our clients plan to utilise the results. Here we have experience with the full delivery of the outcome in the form of R packages and workshops, as well as giving access to the results by using dashboards or automatically generated reports. Different companies need different models of implementation. Thus in each project we have to decide early on how R can be used to its full potential to meet our clients’ requirements. In this regard, I will give insights into various models of implementation and our experience with each of them.

Speakers

## Verena Pflieger

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
2.02 Wild Gallery

### 2:24pm CEST

Keywords: Spatial statistics, spherical geometry, geospatial index, GIS
Webpages: https://github.com/spatstat/s2, https://cran.r-project.org/package=s2
Google’s S2 geometry library is a somewhat hidden gem which hasn’t received the attention it deserves. It facilitates geometric operations directly on the sphere, such as polygonal unions, intersections and differences, without the hassle of projecting data in the common latitude and longitude format, and it provides an efficient quadtree-type hierarchical geospatial index.
The original C++ source code is available in a Google Code archive and it has been partially ported to e.g. Java, Python, NodeJS, and Go, and it is used in MongoDB’s 2dsphere index.
The geospatial index in the S2 library allows for useful approximations of arbitrary regions on the sphere which can be efficiently manipulated.
We describe how the geospatial index is constructed and some of its properties, as well as how to perform some of the geometrical operations supported by the library. This is all done using Rcpp to interface the C++ code from R.

Speakers

## Ege Rubak

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
3.01 Wild Gallery
Talk, GIS

Speakers

## Christophe Genolini

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
2.01 Wild Gallery

### 2:24pm CEST

Keywords: Parallelization, Resource-Aware Scheduling, Hyperparameter Tuning, Embedded Systems
Webpages: http://sfb876.tu-dortmund.de/SPP/sfb876-a3.html
We present a resource-aware scheduling strategy for parallelizing R applications on heterogeneous architectures, like those commonly found in mobile devices. Such devices typically consist of different processors with different frequencies and memory sizes, and are characterized by tight resource and energy restrictions. Similar to the parallel package that is part of the R distribution, we target problems that can be decomposed into independent tasks that are then processed in parallel. However, as the parallel package is not resource-aware and does not support heterogeneous architectures, it is ill-suited for the kinds of systems we are considering.
The application we are focusing on is parameter tuning of machine learning algorithms. In this scenario, the execution time of an evaluation of a parameter configuration can vary heavily depending on the configuration and the underlying architecture. Key to our approach is a regression model that estimates the execution time of a task for each available processor type based on previous evaluations. In combination with a scheduler that can allocate tasks to specific processors, we thus enable efficient resource-aware parallel scheduling to optimize the overall execution time.
We demonstrate the effectiveness of our approach in a series of examples targeting the ARM big.LITTLE architecture, an architecture commonly found in mobile phones.
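The scheduling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the runtime table holds made-up numbers standing in for the predictions of the regression model, the task and processor names are hypothetical, and the greedy "earliest estimated finish" rule is an assumed, simplified scheduling policy.

```python
# Hypothetical runtime estimates in seconds, per task and processor type.
# In the approach described above these would come from a regression model
# fitted to previous evaluations; here they are hard-coded assumptions.
ESTIMATES = {
    "cfg_a": {"big": 4.0, "little": 10.0},
    "cfg_b": {"big": 3.0, "little": 7.0},
    "cfg_c": {"big": 1.0, "little": 2.0},
    "cfg_d": {"big": 6.0, "little": 15.0},
    "cfg_e": {"big": 2.0, "little": 3.0},
}

# Hypothetical heterogeneous machine: two fast and two slow cores.
PROCESSORS = {"big0": "big", "big1": "big", "little0": "little", "little1": "little"}


def estimate(task, proc_type):
    """Look up the estimated runtime of a task on a processor type."""
    return ESTIMATES[task][proc_type]


def schedule(tasks, processors, estimate):
    """Greedily place each task on the processor with the earliest estimated finish."""
    free_at = {name: 0.0 for name in processors}  # when each processor becomes idle
    types = set(processors.values())
    # Handle the longest tasks first (by their best-case estimate).
    order = sorted(tasks, key=lambda t: min(estimate(t, ty) for ty in types), reverse=True)
    plan = []
    for task in order:
        best = min(processors, key=lambda name: free_at[name] + estimate(task, processors[name]))
        start = free_at[best]
        free_at[best] = start + estimate(task, processors[best])
        plan.append((task, best, start, free_at[best]))
    return plan, max(free_at.values())


plan, makespan = schedule(list(ESTIMATES), PROCESSORS, estimate)
for task, proc, start, end in plan:
    print(f"{task} -> {proc}: {start:.1f}s to {end:.1f}s")
print(f"estimated makespan: {makespan:.1f}s")
```

Because the rule compares estimated finish times rather than queue lengths, a slow core is only used when it is expected to finish the task sooner than waiting for a fast one.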
References

ARM. 2017. “big.LITTLE Technology.” https://www.arm.com/products/processors/technologies/biglittleprocessing.php.

Kotthaus, Helena, and Ingo Korb. 2017. “TraceR: Profiling Tool for the R Language.” Department of Computer Science 12, TU Dortmund University. https://github.com/allr/traceR-installer.

Kotthaus, Helena, Ingo Korb, and Peter Marwedel. 2015. “Performance Analysis for Parallel R Programs: Towards Efficient Resource Utilization.” 01/2015. Department of Computer Science 12, TU Dortmund University.

Richter, Jakob, Helena Kotthaus, Bernd Bischl, Peter Marwedel, Jörg Rahnenführer, and Michel Lang. 2016. “Faster Model-Based Optimization Through Resource-Aware Scheduling Strategies.” In LION10, 267–73. Springer International Publishing.

Speakers

## Helena Kotthaus

Department of Computer Science 12, TU Dortmund University, Dortmund, Germany

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
3.02 Wild Gallery
Talk, HPC

### 2:24pm CEST

Keywords: text analysis, text mining, machine learning, social media
Summary: A useR! talk about text analysis and text mining using R. I will cover the broad set of tools for text analysis and natural language processing in R, with an emphasis on my R package quanteda but also covering other major tools in the R ecosystem for text analysis (e.g. stringi).
The talk is a tutorial that covers how to perform common text analysis and natural language processing tasks using R. Contrary to a belief popular among some data scientists, when used properly, R is a fast and powerful tool for managing even very large text analysis tasks. My talk will present the many options available, demonstrate that these work on large data, and compare the features of R for these tasks with popular options in Python.
Specifically, I will demonstrate how to format and input source texts, how to structure their metadata, and how to prepare them for analysis. This includes common tasks such as tokenisation, including constructing ngrams and “skip-grams”, removing stopwords, stemming words, and other forms of feature selection. I will also show how to tag parts of speech and parse structural dependencies in texts. For statistical analysis, I will show how R can be used to get summary statistics from text, search for and analyse keywords and phrases, analyse text for lexical diversity and readability, detect collocations, apply dictionaries, and measure term and document associations using distance measures. Our analysis covers basic text-related data processing in the R base language, but mostly relies on the quanteda package (https://github.com/kbenoit/quanteda) for the quantitative analysis of textual data. We also cover how to pass the structured objects from quanteda into other text analytic packages for topic modelling, latent semantic analysis, regression models, and other forms of machine learning.
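The preparation steps named above (tokenisation, stopword removal, ngrams and skip-grams) can be sketched in a few lines of plain Python. This is only an illustration of the concepts, not quanteda's implementation or API: the function names, the tiny stopword list and the definition of a skip-gram as a pair of tokens with at most `max_skip` tokens between them are assumptions made for the example.

```python
import re

# A tiny illustrative stopword list (an assumption, not a real lexicon).
STOPWORDS = {"the", "a", "of", "and", "is", "over"}


def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """All runs of n adjacent tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Ordered token pairs with at most max_skip tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


tokens = remove_stopwords(tokenise("The quick brown fox jumps over the lazy dog"))
print(tokens)
print(ngrams(tokens, 2))
print(skip_bigrams(tokens, 1))
```

With `max_skip = 0` the skip-grams reduce to ordinary bigrams, which is a quick way to check the definition.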

About me: Kenneth Benoit is Professor of Quantitative Social Research Methods at the London School of Economics and Political Science. His current research focuses on automated, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. For the past 5 years, he has been developing a major R package for text analysis, quanteda, as part of European Research Council grant ERC-2011-StG 283794-QUANTESS.

Speakers

## Kenneth Benoit

Wednesday July 5, 2017 2:24pm - 2:42pm CEST
4.02 Wild Gallery

### 2:42pm CEST

Keywords: Random Number Generation, Monte Carlo, Parallel Execution, Reproducibility
Webpages: https://github.com/miraisolutions/rTRNG
Monte Carlo simulations provide a powerful computational approach to address a wide variety of problems in several domains, such as physical sciences, engineering, computational biology and finance. The independent-samples and large-scale nature of Monte Carlo simulations make the corresponding computation suited for parallel execution, at least in theory. In practice, pseudo-random number generators (RNGs) are intrinsically sequential. This often prevents having a parallel Monte Carlo algorithm that is playing fair, meaning that results are independent of the architecture, parallelization techniques and number of parallel processes (Mertens 2009; Bauke 2016).
We will show that parallel-oriented RNGs and techniques in fact exist and can be used in R with the rTRNG package (Porreca, Schmid, and Bauke 2017). The package relies on TRNG (Bauke 2016), a state-of-the-art C++ pseudo-random number generator library for sequential and parallel Monte Carlo simulations.
TRNG provides parallel RNGs that can be manipulated by jumping ahead an arbitrary number of steps or splitting a sequence into any desired subsequence(s), thus supporting techniques such as block-splitting and leapfrogging that are suitable for parallel algorithms.
The rTRNG package provides access to the functionality of the underlying TRNG C++ library by embedding its sources and headers. Beyond this, it makes use of Rcpp and RcppParallel to offer several ways of creating and manipulating pseudo-random streams, and drawing random variates from them, which we will demonstrate:
• Base-R-like usage for selecting and manipulating the current engine, as a simple and immediate way for R users to use rTRNG
• Reference objects wrapping the underlying C++ TRNG random number engines can be created and manipulated in OOP-style, for greater flexibility in using parallel RNGs in R
• TRNG C++ library and headers can be accessed directly from within R projects that use C++, either via standalone C++ code (using sourceCpp) or by creating an R package that depends on rTRNG
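Block-splitting and leapfrogging, as mentioned above, can be illustrated with a small Python sketch. It uses a toy linear congruential generator with textbook parameters, which is an assumption made for the example and not one of the TRNG engines, and it does not use the rTRNG API. The "workers" are simulated one after another in a loop. The point is that both techniques reproduce the sequential stream exactly, whatever the number of workers.

```python
class LCG:
    """Toy linear congruential generator: x -> (a * x + c) mod m."""

    def __init__(self, seed, a=1103515245, c=12345, m=2**31):
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def next(self):
        self.state = (self.a * self.state + self.c) % self.m
        return self.state

    def jump(self, k):
        """Advance k steps in O(log k) by repeatedly squaring the affine map."""
        a_acc, c_acc = 1, 0        # accumulated map, starts as the identity
        a, c = self.a, self.c      # map for 1, 2, 4, ... steps
        while k > 0:
            if k & 1:
                a_acc, c_acc = (a * a_acc) % self.m, (a * c_acc + c) % self.m
            a, c = (a * a) % self.m, (a * c + c) % self.m
            k >>= 1
        self.state = (a_acc * self.state + c_acc) % self.m


def sequential(seed, n):
    """Reference stream drawn by a single process."""
    gen = LCG(seed)
    return [gen.next() for _ in range(n)]


def block_split(seed, n, workers):
    """Each worker jumps ahead to the start of its own contiguous block."""
    out = []
    for w in range(workers):
        start, end = w * n // workers, (w + 1) * n // workers
        gen = LCG(seed)
        gen.jump(start)
        out.extend(gen.next() for _ in range(end - start))
    return out


def leapfrog(seed, n, workers):
    """Worker w draws elements w, w + workers, w + 2 * workers, ... of the stream."""
    out = [None] * n
    for w in range(workers):
        gen = LCG(seed)
        gen.jump(w)
        for idx in range(w, n, workers):
            out[idx] = gen.next()
            gen.jump(workers - 1)  # skip the draws that belong to other workers
    return out


reference = sequential(42, 12)
for workers in (1, 2, 3, 4):
    print(workers, block_split(42, 12, workers) == reference,
          leapfrog(42, 12, workers) == reference)
```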
References

Bauke, Heiko. 2016. Tina’s Random Number Generator Library. https://numbercrunch.de/trng/trng.pdf.

Mertens, Stephan. 2009. “Random Number Generators: A Survival Guide for Large Scale Simulations.” In Modern Computational Science 09. BIS-Verlag.

Porreca, Riccardo, Roland Schmid, and Heiko Bauke. 2017. rTRNG: R Package Providing Access and Examples to TRNG C++ Library. https://github.com/miraisolutions/rTRNG/.

Speakers

## Riccardo Porreca

Mirai Solutions GmbH

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
3.02 Wild Gallery
Talk, HPC

### 2:42pm CEST

Keywords: NLP, Spark, Deep Learning, Network Science
Webpages: https://github.com/akzaidi
Neural embeddings (Bengio et al. (2003), Olah (2014)) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to use neural embeddings of natural and programming languages using R and Spark. In particular, we’ll see how the combination of a distributed computing paradigm in Spark with the interactive programming and visualization capabilities in R can make exploration and inference of natural language processing models easy and efficient.
Building upon the tidy data principles formalized and efficiently crafted in Wickham (2014), Silge and Robinson (2016) have provided the foundations for modeling and crafting natural language models with the tidytext package. In this talk, we’ll describe how we can build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and the RevoScaleR packages.
To illustrate the utility of this framework, we’ll provide an example in which we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests and their associated messages (Zaidi (2017)), and then deploy it on Spark clusters, where we can do efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.
References

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. 3 (March). JMLR.org: 1137–55. http://dl.acm.org/citation.cfm?id=944919.944966.

Olah, Christopher. 2014. “Deep Learning, NLP, and Representations.” https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.

Zaidi, Ali. 2017. “Summarizing Git Commits and Github Pull Requests Using Sequence to Sequence Neural Attention Models.” CS224N: Final Project, Stanford University.

Speakers

## Ali Zaidi

Data Scientist, Microsoft

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
4.02 Wild Gallery

### 2:42pm CEST

Scalable, Spatiotemporal Tidy Arrays for R (stars)

Edzer Pebesma, Etienne Racine, Michael Sumner
Spatiotemporal data often comes in the form of dense arrays, with space and time being array dimensions. Examples include socio-economic or demographic data, environmental variables monitored at fixed stations, time series of satellite images with multiple spectral bands, spatial simulations, and climate model results. Currently, R does not have infrastructure to handle and analyse such arrays easily. Package raster is probably still the most powerful package for handling this kind of data in memory and on disk, but it does not address non-raster time series, raster time series with multiple attributes, rasters with mixed-type attributes, or spatially distributed sets of satellite images. This project will not only deal with these cases, but also extend the “in memory or on disk” model to one where the data are held remotely in cloud storage, which is a more feasible option e.g. for satellite data collected today. We will implement pipe-based workflows that are developed and tested on samples before they are evaluated for complete datasets, and discuss the challenges of visualisation and storage in such workflows. This is work in progress, and the talk will discuss the design stage and hopefully show an early prototype.

Speakers

## Edzer Pebesma

professor, University of Muenster
My research interest is spatial, temporal, and spatiotemporal data in R. I am one of the authors, and maintainer, of sp and sf. You'll find my tutorial material at https://edzer.github.io/UseR2017/ - note that I will update it until shortly before the tutorial.

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
3.01 Wild Gallery
Talk, GIS

### 2:42pm CEST

Keywords: r, data science, web traffic, visualization
Since its founding in 2008, the question and answer website Stack Overflow has been a valuable resource for the R community, collecting more than 175,000 questions about R that are visited millions of times each month. This makes it a useful source of data for observing trends in how people use and learn the language. In this talk, I show what we can learn from Stack Overflow data about the global use of the R language over the last decade. I’ll examine what ecosystems of R packages are used in combination, what other technologies are used alongside R, and what countries and cities have the highest density of users. Together, the data paints a picture of a global and rapidly growing community. Aside from presenting these results, I’ll introduce interactive tools and visualizations that the company has published to explore this data, as well as a number of open datasets that analysts can use to examine trends in software development.

Speakers

## David Robinson

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
PLENARY Wild Gallery

### 3:00pm CEST

Wednesday July 5, 2017 3:00pm - 3:30pm CEST
CATERING POINTS Wild Gallery
BREAK

### 3:30pm CEST

Wednesday July 5, 2017 3:30pm - 3:45pm CEST
PLENARY Wild Gallery

### 3:45pm CEST

Abstract: How can we effectively and efficiently teach statistical thinking and computation to students with little to no background in either? How can we equip them with the skills and tools for reasoning with various types of data and leave them wanting to learn more? In this talk we describe an introductory data science course that is our (working) answer to these questions. The course focuses on data acquisition and wrangling, exploratory data analysis, data visualization, and effective communication, and approaches statistics from a model-based, instead of an inference-based, perspective. A heavy emphasis is placed on a consistent syntax (with tools from the tidyverse), reproducibility (with R Markdown) and version control and collaboration (with git/GitHub). We help ease the learning curve by avoiding local installation and supplementing out-of-class learning with interactive tools (like tutor and DataCamp). By the end of the semester, teams of students work on fully reproducible data analysis projects on data they acquired, answering questions they care about. This talk will discuss in detail course structure, logistics, and pedagogical considerations as well as give examples from the case studies used in the course. We will also share student feedback and assessment of the success of the course in recruiting students to the statistical science major.

Speakers

## Mine Cetinkaya-Rundel

Wednesday July 5, 2017 3:45pm - 4:45pm CEST
PLENARY Wild Gallery

### 4:45pm CEST

Wednesday July 5, 2017 4:45pm - 5:00pm CEST
CATERING POINTS Wild Gallery
BREAK

### 5:00pm CEST

There will be a special session Wed. July 5, from 17:00 - 18:30 in the main meeting room to provide for discussion and hopefully team-building for the development of tools to improve the ability of users to quickly find and efficiently exploit the richness of the R package collections. Following brief presentations, we hope to gather in smaller groups to address such approaches as wrappers to unify package calls, improved task views and similar aids to guide package selection, and better search methods. Other ideas are welcome. Advance materials are available in the wiki at https://github.com/nashjc/Rnavpkg/wiki. We welcome advance communications with interested useRs.

John Nash (nashjc at uottawa.ca), Julia Silge (julia.silge at gmail.com), and Spencer Graves (spencer.graves at prodsyse.com)

Wednesday July 5, 2017 5:00pm - 6:30pm CEST
PLENARY Wild Gallery

### 6:30pm CEST

Wednesday July 5, 2017 6:30pm - 10:00pm CEST
CATERING POINTS Wild Gallery

Thursday, July 6

### 8:00am CEST

Thursday July 6, 2017 8:00am - 9:15am CEST
Wild Gallery Getijstraat 11, 1190 Vorst

### 9:15am CEST

Thursday July 6, 2017 9:15am - 9:30am CEST
PLENARY Wild Gallery

### 9:30am CEST

Publications on dose-response analysis in recent years are fairly clearly divided into modelling (i.e. assuming dose as a quantitative covariate) and trend tests (i.e. assuming dose as a qualitative factor). Both approaches have advantages and disadvantages. What is missing is a joint approach. Three components are required:
i) a quasilinear regression approach, namely the maximum of arithmetic, ordinal and logarithmic dose metameter models according to Tukey et al. (1985)
ii) a contrast test for a maximum of Williams-type contrasts according to Bretz and Hothorn (2003)
iii) the multiple marginal models approach according to Pipper et al. (2011), which provides the distribution of the maximum over multiple GLMMs.

This new versatile trend test provides three advantages:
1) good power for almost any shape of the dose-response relationship (including sublinear and supralinear)
2) problem-related interpretability based on confidence limits of slopes and/or contrasts
3) wide applicability within the GLMM framework.

Using the R package tukeytrend (Schaarschmidt et al., 2017), case studies for multinomial vector comparisons, multiple binary endpoints, bivariate differently scaled endpoints and ANCOVA-adjusted dose-response data will be explained.

Speakers

## Ludwig Hothorn

Thursday July 6, 2017 9:30am - 10:30am CEST
PLENARY Wild Gallery

### 10:30am CEST

Thursday July 6, 2017 10:30am - 11:00am CEST
CATERING POINTS Wild Gallery
BREAK

### 11:00am CEST

Keywords: Data integration, Graphical modeling, High-dimensional precision matrix estimation, Networks
Webpages: https://CRAN.R-project.org/package=rags2ridges, https://github.com/CFWP/rags2ridges
Contact: cf.peeters@vumc.nl
A contemporary use for inverse covariance matrices (aka precision matrices) is found in the data-based reconstruction of networks through graphical modeling. Graphical models merge probability distributions of random vectors with graphs that express the conditional (in)dependencies between the constituent random variables. The rags2ridges package enables L2-penalized (i.e., ridge) estimation of the precision matrix in settings where the number of variables is large relative to the sample size. Hence, it is a package where high-dimensional (HD) data meets networks.
The talk will give an overview of the rags2ridges package. Specifically, it will show that the package is a one-stop shop, as it provides functionality for the extraction, visualization, and analysis of networks from HD data. Moreover, it will show that the package provides a basis for the vertical (across data sets) and horizontal (across platforms) integration of HD data stemming from omics experiments. Last but not least, it will explain why many rap musicians are stating that one should ‘get ridge, or die trying’.
References https://arxiv.org/abs/1509.07982
https://arxiv.org/abs/1608.04123
http://dx.doi.org/10.1016/j.csda.2016.05.012

Speakers

## Carel Peeters

Assistant Professor, VU University medical center
Biostatistician specializing in multivariate and high-dimensional molecular biostatistics.

Thursday July 6, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery
Talk, Methods I

### 11:00am CEST

Keywords: Bayesian analysis, Exponential random graph models, Monte Carlo methods
Webpages: https://CRAN.R-project.org/package=Bergm
Exponential random graph models (ERGMs) are a very important family of statistical models for analyzing network data. From a computational point of view, ERGMs are extremely difficult to handle since their normalising constant, which depends on model parameters, is intractable. In this talk, we show how parameter inference can be carried out in a Bayesian framework using MCMC strategies that circumvent the need to calculate the normalising constant.
The new version of the Bergm package for R (Caimo and Friel 2014) provides a comprehensive framework for Bayesian analysis of ERGMs using the approximate exchange algorithm (Caimo and Friel 2011) and calibration of the pseudo-posterior distribution (Bouranis, Friel, and Maire 2015) to sample from the ERGM parameter posterior distribution. The package also supplies graphical Bayesian goodness-of-fit procedures that address the issue of model adequacy.
This talk will have a strong focus on the main practical implementation features of the software that will be described by the analysis of real network data (with various applications in Neuroscience and Organisation Science).
References

Bouranis, L., N. Friel, and F. Maire. 2015. “Bayesian Inference for Misspecified Exponential Random Graph Models.” arXiv Preprint arXiv:1510.00934.

Caimo, A., and N. Friel. 2011. “Bayesian Inference for Exponential Random Graph Models.” Social Networks 33 (1): 41–55.

———. 2014. “Bergm: Bayesian Exponential Random Graphs in R.” Journal of Statistical Software 61 (2): 1–25.

Speakers

## Alberto Caimo

Thursday July 6, 2017 11:00am - 11:18am CEST
4.01 Wild Gallery

### 11:00am CEST

R packages offer the chance to distribute large datasets while also providing functions for exploring and working with that data. However, data packages often exceed the suggested size of CRAN packages, which is a challenge for package maintainers who would like to share their code through this central and popular repository. In this talk, we outline an approach in which the maintainer creates a smaller code package with the code to interact with the data, which can be submitted to CRAN, and a separate data package, which can be hosted by the package maintainer through a personal repository. Because such personal repositories are not mainstream, the data package cannot be listed as an “Imports” or “Depends” dependency of a package submitted to CRAN. Instead, we suggest including the data package as a suggested package, incorporating conditional code in the executable code within vignettes, examples, and tests, and conditioning functions in the code package to check for the availability of the data package. We illustrate this approach with a pair of packages that allows users to explore exposure to hurricanes and tropical storms in the United States. This approach may prove useful for a number of R package maintainers, especially with the growing trend towards the sharing and use of open data in many of the fields in which R is popular.

Speakers

## Brooke Anderson

Thursday July 6, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

### 11:00am CEST

Keywords: Moodle, SQL, tidy data
Webpages: https://github.com/jchrom/moodler
Learning management systems (LMS) generate large amounts of data. The LMS Moodle is at the forefront of open source learning platforms, and thanks to its widespread adoption by schools and businesses, it represents a great target for educational data-analytic efforts. In order to facilitate analysis of Moodle data in R, we introduce a new R package: moodler. It is a collection of useful SQL queries and data-wrangling functions that fetch data from the Moodle database and turn it into tidy data frames. This makes it easy to feed data from Moodle to a large number of R packages that focus on specific types of analyses.

Speakers

## Jakub Chromec

Thursday July 6, 2017 11:00am - 11:18am CEST
4.02 Wild Gallery
Talk, Web

### 11:00am CEST

Keywords: Model visualisation, model exploration, structure visualisation, grammar of model visualisation
The ggplot2 (Wickham 2009) package changed the way we approach data visualisation. Instead of looking for a suitable type of plot among dozens of predefined templates, we now express the relation among variables with a well-defined grammar based on the excellent book The Grammar of Graphics (Wilkinson 2006).
A similar revolution is happening with tools for the visualisation of statistical models. In the CRAN repository, one may find a lot of great packages that graphically explain the structure or diagnostics of some family of statistical models. To mention just a few well-known and powerful packages: rms, forestmodel and regtools (regression models), survminer (survival models), ggRandomForests (random forest based models), factoextra (multivariate structure exploration), factorMerger (one-way ANOVA) and many, many others. They are great, but they do not share the same logic or structure.
New packages from the tidyverse, like broom (Robinson 2017), create an opportunity to build a unified interface for model exploration and visualisation for a large collection of statistical models. And there are more and more articles that set theoretical foundations for a unified grammar of model visualisation (see for example Wickham, Cook, and Hofmann 2015).
In this talk I am going to present various approaches to model visualisation, give an overview of selected existing packages for the visualisation of statistical models and discuss a proposal for a unified grammar of model visualisation.
References

Robinson, David. 2017. Broom: Convert Statistical Analysis Objects into Tidy Data Frames. https://CRAN.R-project.org/package=broom.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, Dianne Cook, and Heike Hofmann. 2015. “Visualizing Statistical Models: Removing the Blindfold.” Statistical Analysis and Data Mining 8 (4).

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.

Speakers

## Przemyslaw Biecek

Thursday July 6, 2017 11:00am - 11:18am CEST
PLENARY Wild Gallery

### 11:00am CEST

The immune system has the monumental challenge of being capable of responding to any pathogen or foreign substance invading the body while ignoring self and innocuous molecules. T cells, which play a central role in directing immune responses, regulating other immune cells, and remembering past infections, accomplish this feat by maintaining a diverse repertoire of T cell receptors (TCR). A typical T cell expresses one unique TCR, and the TCR is made up of two chains (the TCRα and TCRβ chains) that both determine the set of molecules that the T cell can respond to. Since T cells play such a central role in many immune responses, identifying the TCR pairs of T cells involved in infectious diseases, cancers, and autoimmune diseases can yield profound insights for designing vaccines and immunotherapies. I introduce a novel approach to obtaining paired TCR sequences with the alphabetr package, which implements algorithms that identify TCR pairs in an efficient, high-throughput fashion for antigen-specific T cell populations (Lee et al. 2017).

Speakers

## Edward S. Lee

Thursday July 6, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery

### 11:18am CEST

Can you keep a secret?

Andrie de Vries (1) and Gábor Csárdi (2)
1. Senior Programme Manager, Algorithms and Data Science, Microsoft
2. Independent consultant
Keywords: Asymmetric encryption, Public key encryption
When you use R to connect to a database, cloud computing service or other API, you must supply passwords, for example database credentials, authentication keys, etc.
A new package, secret, solves the problem of storing and sharing these secrets safely by allowing you to encrypt and decrypt secrets using public key encryption. The package is available on GitHub and will soon also be on CRAN.
If you attend this session, you will learn:
• Patterns that inadvertently leak secrets
• The essentials of public key cryptography: how to create an asymmetric key pair (public and private key)
• How to create a vault with encrypted secrets using the secret package
• How to share these secrets with your collaborators by encrypting the secret with their public key
• How you can do all of this in 5 lines of R code
This session will appeal to all R users who must use passwords to connect to services.
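The idea behind the key pair and the shared vault can be sketched with textbook RSA on tiny numbers. This Python example is purely didactic: the primes, user names and the `vault` dictionary are hypothetical, there is no padding, and the scheme is not secure. It does not show the API or internals of the secret package.

```python
def make_key_pair(p, q, e=17):
    """Build a toy RSA key pair from two small primes (not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse, needs Python 3.8+
    return (n, e), (n, d)        # (public key, private key)


def encrypt(message, public_key):
    """Encrypt each byte of the message with the recipient's public key."""
    n, e = public_key
    return [pow(b, e, n) for b in message.encode("utf-8")]


def decrypt(cipher, private_key):
    """Recover the message with the matching private key."""
    n, d = private_key
    return bytes(pow(c, d, n) for c in cipher).decode("utf-8")


# Each collaborator publishes a public key and keeps the private key to themselves.
alice_public, alice_private = make_key_pair(61, 53)
bob_public, bob_private = make_key_pair(89, 97)

# The "vault" stores one encrypted copy of the secret per user.
secret = "db-password-123"
vault = {
    "alice": encrypt(secret, alice_public),
    "bob": encrypt(secret, bob_public),
}

print(decrypt(vault["alice"], alice_private))
print(decrypt(vault["bob"], bob_private))
```

Only public keys are needed to add a secret to the vault, so a collaborator can be given access without anyone ever exchanging a private key.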

Speakers

## Andrie de Vries

Senior Programme Manager, Microsoft
Andrie is a senior programme manager at Microsoft, responsible for community projects and evangelization of Microsoft's contribution in Europe to the open source R language. He is co-author of the very popular title "R for Dummies" and a top contributor to the Q&A website, StackOverflow...

Thursday July 6, 2017 11:18am - 11:36am CEST
4.02 Wild Gallery
Talk, Web

### 11:18am CEST

RosettaHUB aims at establishing a global open data science and open education meta cloud centered on usability, reproducibility, auditability, and shareability. It enables a wide range of social interactions and real-time collaborations.
RosettaHUB leverages public and private clouds and makes them easy to use for everyone. RosettaHUB’s federation platform allows any higher education institution or research laboratory to create a virtual organization within the hub. The institution’s members (researchers, educators, students) receive automatically active AWS accounts which are consolidated under one paying account, supervised in terms of budget and cloud resources usage, protected with safeguarding microservices and monitored/managed centrally by the institution’s administrator. The cloud resources are generally paid for using the coupons provided by Amazon as part of the AWS Educate program. The Organization members’ active AWS accounts are put under the control of a collaboration portal which simplifies dramatically everything related to the interaction with AWS and its collaborative use by communities of researchers, educators and students. The portal allows similar capabilities for Google Compute Engine, Azure, OpenStack-based and OpenNebula-based clouds.
RosettaHUB leverages Docker and allows users to work with containers seamlessly. Those containers are portable. When coupled with RosettaHUB’s open APIs, they break the silos between clouds and avoid vendor lock-in. Simple web interfaces allow users to create those containers, connect them to data storages, snapshot them, share snapshots with collaborators and migrate them from one cloud to another. The RosettaHUB perspectives make it possible to use the containers to serve securely noVNC, RStudio, Jupyter and to enable those tools for real-time collaboration. Zeppelin, Spark-notebook and Shiny Apps are also supported. The RosettaHUB real-time collaborative containerized workbench is a universal IDE for data scientists. It makes it possible to interact in a stateful manner with hybrid kernels gluing together in a single process R, Python, Scala, SQL clients, Java, Matlab, Mathematica, etc. and allowing those different environments to share their workspace and their variables in memory. The RosettaHUB kernels and objects model break the silos between data science environments and make it possible to use them simultaneously in a very effective and flexible manner. A simplified reactive programming framework makes it possible to create reactive data science microservices and interactive web applications based on multi-language macros and visual widgets. A scientific web based spreadsheet makes it possible to interact with R/Python/Scala capabilities from within cells which includes variables import/export and variables mirroring to cells as well as the automatic mapping of any function in those environments to formulas invokable in cells. Spreadsheet cells can also contain code and code execution results making it become a flexible multi-language notebook. 
Ubiquitous Docker containers, coupled with the RosettaHUB workbench checkpointing capability and the logging to embedded databases of all the interactions users have with their environments, make everything created within RosettaHUB reproducible and auditable.
RosettaHUB’s APIs (700+ functions) cover the full spectrum of programmatic interaction between users and clouds, containers and R/Python/Scala kernels. Clients for the APIs are available as an R package, a Python module, a Java library, an Excel add-in and a Word add-in. Based on those APIs, RosettaHUB provides a CloudFormation-like service which makes it easy to create and manage, as templates, collections of related cloud resources, container images, R/Python/Scala scripts, macros and visual widgets, along with optional cloud credentials. Those templates are cloud agnostic and they make it possible for anyone to easily create and distribute complex data science applications and services. The user with whom a template is shared can, with one click, trigger the reconstruction and wiring on the fly of all the artifacts and dependencies. The RosettaHUB templates constitute a powerful sharing mechanism for snapshots of RosettaHUB’s e-Science and e-learning environments as well as for Jupyter/Zeppelin notebooks, Shiny apps, etc. RosettaHUB’s marketplace transforms those templates into products that can be shared or sold.
The presentation will be an overview of RosettaHUB and will discuss the results of the RosettaHUB/AWS Educate initiative which involved 30 higher education institutions and research labs counting over 3000 researchers, educators, and students.

Speakers
KC

## Karim Chine

Thursday July 6, 2017 11:18am - 11:36am CEST
2.02 Wild Gallery

### 11:18am CEST

**Keywords**: non-negative matrix factorization, magnetic resonance imaging, brain tumor

Treatment of brain tumors is complicated by their high degree of heterogeneity. Various stages of the disease can occur throughout the same lesion, and transitions between the pathological tissue regions (i.e. active tumor, necrosis and edema) are diffuse [@price2006improved]. Clinical practice could benefit from an accurate and reproducible method to differentiate brain tumor tissue based on medical imaging data.

We present a hierarchical variant of non-negative matrix factorization (hNMF) for characterizing brain tumors using multi-parametric magnetic resonance imaging (MRI) data [@sauwen2015hierarchical]. Non-negative matrix factorization (NMF) decomposes a non-negative input matrix *X* into two non-negative factor matrices *W* and *H* such that *X* ≈ *WH*, thereby providing a parts-based representation of the input data. In the current context, the columns of *X* correspond to the image voxels and the rows represent the different MRI parameters. The columns of *W* represent tissue-specific signatures and the rows of *H* contain the relative abundances per tissue type over the different voxels.
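To make the roles of *X*, *W* and *H* concrete, here is a minimal Python sketch of plain NMF using the classical multiplicative update rules. It is only an illustration: it is neither the hierarchical hNMF scheme nor the hierarchical alternating least-squares algorithm of the package, and the toy matrix (3 hypothetical MRI parameters by 4 voxels) and all function names are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Frobenius norm of X - WH, used to monitor the fit."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0]))) ** 0.5


def nmf(x, rank, n_iter=500, seed=0, eps=1e-9):
    """Factorize non-negative X (parameters x voxels) as W (signatures) times H (abundances)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    # Strictly positive random start, so every entry stays non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps) for j in range(n_cols)]
             for r in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + eps) for r in range(rank)]
             for i in range(n_rows)]
    return w, h


# Toy data: rows are (hypothetical) MRI parameters, columns are voxels.
x = [[1.0, 0.0, 0.5, 0.2],
     [0.0, 1.0, 0.5, 0.8],
     [1.0, 1.0, 1.0, 1.0]]
w, h = nmf(x, rank=2)
print("reconstruction error:", frobenius_error(x, w, h))
```

Each column of `w` plays the role of a tissue signature and each row of `h` the corresponding abundance over voxels; reshaping a row of `h` back onto the image grid would give an abundance map of the kind shown in Figure 1.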

**hNMF** is available as an *R* package on CRAN and is compatible with the **NMF** package. Besides the standard NMF algorithms that come with the **NMF** package, an efficient NMF algorithm called hierarchical alternating least-squares NMF was implemented and used within the hNMF framework. hNMF can be used as a general matrix factorization technique, but in the context of this talk it will be shown that valid tissue signatures are obtained using hNMF. Tissue abundances can be mapped back to the imaging domain, providing tissue differentiation on a voxel-wise basis (see Figure 1).

![Figure 1: hNMF abundance maps of the pathological tissue regions of a glioblastoma patient. Left to right: T~1~-weighted background image with region of interest (green frame); abundance map active tumor; abundance map necrosis; abundance map edema.](./Assembly.png)

Speakers
NS

## Nicolas Sauwen

Thursday July 6, 2017 11:18am - 11:36am CEST
2.01 Wild Gallery

### 11:18am CEST

difNLR: Detection of potential gender/minority bias with extensions of logistic regression

1. Faculty of Mathematics and Physics, Charles University, Prague
2. Institute of Computer Science, Czech Academy of Sciences, Prague

Keywords: detection of item bias, differential item functioning, psychometrics, R

Webpages: https://CRAN.R-project.org/package=difNLR, https://CRAN.R-project.org/package=ShinyItemAnalysis, https://shiny.cs.cas.cz/ShinyItemAnalysis/

The R package difNLR has been developed for the detection of potentially unfair items in educational and psychological testing, that is, for the analysis of so-called Differential Item Functioning (DIF), based on extensions of the logistic regression model. For dichotomous data, six models have been implemented to offer a wide range of proxies to Item Response Theory models. Parameters are obtained using non-linear least squares estimation, and the DIF detection procedure is performed by either an F test or a likelihood ratio test of a submodel. For unscored data, analysis of Differential Distractor Functioning (DDF) based on a multinomial regression model is offered to provide a closer look at individual item options (distractors). Features and options are demonstrated on three data sets. The package is designed to correspond to the difR package (one of the most used R libraries in DIF detection, see Magis, Béland, Tuerlinckx, & De Boeck (2010)) and is currently used by ShinyItemAnalysis (Martinková, Drabinová, Leder, & Houdek, 2017), which provides a graphical interface offering detailed analysis of educational and psychological tests.

References
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. https://doi.org/10.3758/BRM.42.3.847

Martinková, P., Drabinová, A., Leder, O., & Houdek, J. (2017). ShinyItemAnalysis: Test and item analysis via shiny. Retrieved from shiny.cs.cas.cz/ShinyItemAnalysis/; https://CRAN.R-project.org/package=ShinyItemAnalysis

Martinková, P., Drabinová, A., Liaw, Y.-L., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE-Life Sciences Education, 16(2). https://doi.org/10.1187/cbe.16-10-0307

McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinková, P., Cliff, W., Michael, J., … Wright, A. (2017). Development and validation of the homeostasis concept inventory. CBE-Life Sciences Education, 16(2). https://doi.org/10.1187/cbe.16-10-0305

Speakers

Thursday July 6, 2017 11:18am - 11:36am CEST
4.01 Wild Gallery

### 11:18am CEST

Keywords: Quantitative Fisheries Science, Common Fisheries Policy, Management Strategy Evaluation, advice, simulation
Webpages: https://flr-project.org, https://github.com/flr
The management of the activities of fishing fleets aims at ensuring the sustainable exploitation of the ocean’s living resources, the provision of important food resources to humankind, and the profitability of an industry that is an important economic and social activity in many areas of Europe and elsewhere. These are the principles of the European Union Common Fisheries Policy (CFP), which has driven the management of Europe’s fisheries resources since 1983.
Quantitative scientific advice is at the heart of fisheries management regulations, providing estimates of the likely current and future status of fish stocks through statistical population models, termed stock assessments, but also probabilistic comparisons of the expected effects of alternative management procedures. Management Strategy Evaluation (MSE) uses stochastic simulation to incorporate both the inherent variability of natural systems, and our limited ability to model their dynamics, into analyses of the expected effects of a given management intervention on the sustainability of both fish stocks and fleets.
The Fishery Library in R (FLR) project has, for the last ten years, been building an extensible toolset of statistical and simulation methods for quantitative fisheries science (Kell et al. 2007), with the overarching objective of enabling fisheries scientists to carry out analyses of management procedures in a simplified and robust manner through the MSE approach.
FLR has become widely used in many of the scientific bodies providing fisheries management advice, both in Europe and elsewhere. The evaluation of the effects of some elements of the revised CFP, the analysis of the proposed fisheries management plans for the North Sea, and the comparison of management strategies for Atlantic tuna stocks, among others, have used the FLR tools to advise managers on the possible courses of action to favour the sustainable use of many marine fish stocks.
The FLR toolset is currently composed of 20 packages, covering the various steps in the fisheries advice and simulation workflow. They include a large number of S4 classes, and more recently Reference Classes, to model the data structures that represent each of the elements in the fisheries system. Class inheritance and method overloading are essential tools that have allowed the FLR packages to interact with, complement and enrich each other, while still limiting the number of functions a user needs to be aware of. Methods also exist that make use of R’s parallelization facilities and of compiled code to deal with complex computations. Statistical models have also been implemented, making use of both R’s capabilities and external libraries for Automatic Differentiation.
We present the current status of FLR, the new developments taking place, and the challenges faced in the development of a collection of packages based on S4 classes and methods.
References Kell, L. T., I. Mosqueira, P. Grosjean, J.-M. Fromentin, D. Garcia, R. Hillary, E. Jardim, et al. 2007. “FLR: An Open-Source Framework for the Evaluation and Development of Management Strategies.” ICES Journal of Marine Science 64 (4). http://dx.doi.org/10.1093/icesjms/fsm012.

Speakers

## Finlay Scott

Joint Research Centre, European Commission

Thursday July 6, 2017 11:18am - 11:36am CEST
PLENARY Wild Gallery

### 11:18am CEST

Keywords: clustered data, clustered covariance matrix estimators, object-orientation, simulation, R
Webpages: http://R-forge.R-project.org/projects/sandwich/
Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, or other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variation in clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g. for zero-inflated, censored, or limited responses).
In R, the sandwich package (Zeileis 2004; Zeileis 2006) provides an object-oriented approach to “robust” covariance matrix estimation based on methods for two generic functions (estfun() and bread()). Using this infrastructure, sandwich covariances for cross-section or time series data have been available for models beyond lm() or glm(), e.g., for packages MASS, pscl, countreg, betareg, among many others. However, corresponding functions for clustered or panel data have been somewhat scattered or available only for certain modeling functions. This shortcoming has been corrected in the development version of sandwich on R-Forge. Here, we introduce this new object-oriented implementation of clustered and panel covariances and assess the methods’ performance in a simulation study.
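The "bread" and "meat" structure behind these estimators can be sketched in a few lines. The Python example below is an illustration only, not the sandwich package's implementation: it assumes a least-squares regression through the origin with a single regressor, sums the estimating functions (scores) within each cluster before forming the meat, and applies none of the bias corrections mentioned above. All function and variable names are hypothetical.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Least-squares slope for the no-intercept model y = beta * x + error."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def clustered_variance(x, y, cluster):
    """Cluster-robust sandwich variance of the slope, without bias correction."""
    beta = ols_slope(x, y)
    bread = 1.0 / sum(a * a for a in x)  # (X'X)^{-1} for a single regressor
    score_sums = defaultdict(float)
    for xi, yi, g in zip(x, y, cluster):
        # Estimating function of observation i, accumulated per cluster.
        score_sums[g] += xi * (yi - beta * xi)
    meat = sum(s * s for s in score_sums.values())
    return bread * meat * bread


x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
print(clustered_variance(x, y, cluster=["a", "a", "b", "b"]))
```

When every observation forms its own cluster, the expression reduces to the familiar heteroscedasticity-consistent (HC0) variance, which is one way to see that clustered covariances generalize the cross-section case.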
References Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10): 1–17. http://www.jstatsoft.org/v11/i10/.

———. 2006. “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software 16 (9): 1–16. http://www.jstatsoft.org/v16/i09/.

Speakers
SB

## Susanne Berger

Thursday July 6, 2017 11:18am - 11:36am CEST
3.01 Wild Gallery
Talk, Methods I

### 11:36am CEST

Keywords: Citation data, Directed network, Paired comparisons, Quasi-symmetry, Sparse matrices
Motivated by the analysis of large-scale citation networks, we implement the familiar Bradley-Terry model (Zermelo 1929; Bradley and Terry 1952) in such a way that it can be applied, with relatively modest memory and execution-time requirements, to pair-comparison data from networks with large numbers of nodes. This provides a statistically principled method of ranking a large number of objects, based only on paired comparisons.
The BradleyTerryScalable package complements the existing CRAN package BradleyTerry2 (Firth and Turner 2012) by permitting a much larger number of objects to be compared. In contrast to BradleyTerry2, the new BradleyTerryScalable package implements only the simplest, ‘unstructured’ version of the Bradley-Terry model. The new package leverages functionality in the additional R packages igraph (Csardi and Nepusz 2006), Matrix (Bates and Maechler 2017) and Rcpp (Eddelbuettel 2013) to provide flexibility in model specification (whole-network versus disconnected cliques) as well as memory efficiency and speed. The Bayesian approach of Caron and Doucet (2012) is provided as an optional alternative to maximum likelihood, in order to allow whole-network ranking even when the network of paired comparisons is not fully connected.
The BradleyTerryScalable package can readily handle data from directed networks with many thousands of nodes. The use of the Bradley-Terry model to produce a ranking from citation data was originally advocated in Stigler (1994), and was studied in detail more recently in Varin, Cattelan, and Firth (2016); here we will illustrate its use with a large-scale network of inter-company patent citations.
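As a small illustration of how a ranking follows from paired comparisons, the Python sketch below fits the unstructured Bradley-Terry model by maximum likelihood with the classical fixed-point (minorization-maximization) iteration. It is not the BradleyTerryScalable implementation: it uses dense nested lists rather than sparse matrices, and the win counts are made up. `wins[i][j]` is assumed to hold the number of times object `i` beat object `j`, and the comparison network is assumed to be fully connected so that the estimate exists.

```python
def bradley_terry(wins, n_iter=200):
    """Return strengths p (summing to 1) with P(i beats j) = p[i] / (p[i] + p[j])."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, weighted by the current strengths.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        total = sum(new_p)
        p = [v / total for v in new_p]  # fix the scale: strengths sum to one
    return p


# Made-up data: each pair of the three objects was compared four times.
wins = [[0, 3, 4],
        [1, 0, 3],
        [0, 1, 0]]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(strengths, ranking)
```

If some object never wins, or the network splits into disconnected groups, this iteration has no finite maximizer to converge to, which is the situation the optional Bayesian approach mentioned above is meant to handle.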
References Bates, Douglas, and Martin Maechler. 2017. “Matrix: Sparse and Dense Matrix Classes and Methods.” R Package Version 1.2-8. http://cran.r-project.org/package=Matrix.

Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of Incomplete Block Designs: I. the Method of Paired Comparisons.” Biometrika 39: 324–45.

Caron, François, and Arnaud Doucet. 2012. “Efficient Bayesian Inference for Generalized Bradley–Terry Models.” Journal of Computational and Graphical Statistics 21: 174–96.

Csardi, Gabor, and Tamas Nepusz. 2006. “The igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. http://igraph.org.

Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. New York: Springer.

Firth, David, and Heather L Turner. 2012. “Bradley-Terry Models in R: The BradleyTerry2 Package.” Journal of Statistical Software 48 (9). http://www.jstatsoft.org/v48/i09.

Stigler, Stephen M. 1994. “Citation Patterns in the Journals of Statistics and Probability.” Statistical Science, 94–108.

Varin, Cristiano, Manuela Cattelan, and David Firth. 2016. “Statistical Modelling of Citation Exchange Between Statistics Journals.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 179: 1–63.

Zermelo, Ernst. 1929. “Die Berechnung Der Turnier-Ergebnisse Als Ein Maximumproblem Der Wahrscheinlichkeitsrechnung.” Mathematische Zeitschrift 29: 436–60.

Speakers

## Ella Kaye

Thursday July 6, 2017 11:36am - 11:54am CEST
4.01 Wild Gallery

### 11:36am CEST

Webpages: https://www.jamovi.org, https://CRAN.R-project.org/package=jmv
In spite of the availability of the powerful and sophisticated R ecosystem, spreadsheets such as Microsoft Excel remain ubiquitous within the business community, and spreadsheet-like software, such as SPSS, continues to be popular in the sciences. This likely reflects that for many people the spreadsheet paradigm is familiar and easy to grasp.
The jamovi project aims to make R and its ecosystem of analyses accessible to this large body of users. jamovi provides a familiar, attractive, interactive spreadsheet with the usual spreadsheet features: data-editing, filtering, sorting, and real-time recomputation of results. Significantly, all analyses in jamovi are powered by R, and are available from CRAN. Additionally, jamovi can be placed in ‘syntax mode’, where the underlying R code for each analysis is produced, allowing for a seamless transition to an interactive R session.
We believe that jamovi represents a significant opportunity for the authors of R packages. With some small modifications, an R package can be augmented to run inside of jamovi, allowing R packages to be driven by an attractive user-interface (in addition to the normal R environment). This makes R packages accessible to a much larger audience, and at the same time provides a clear pathway for users to migrate from a spreadsheet to R scripting.
This talk introduces jamovi, presents its user interface and feature set, and demonstrates the ease with which R packages can be augmented to additionally support the interactive spreadsheet paradigm.
jamovi is available from www.jamovi.org

Speakers

## Jonathon Love

Thursday July 6, 2017 11:36am - 11:54am CEST
PLENARY Wild Gallery

### 11:36am CEST

Keywords: biosignatures, machine learning, drug design, data fusion, high-throughput screening
Webpages: https://www.openanalytics.eu/
For decades, high throughput screening of chemical compounds has played a central role in drug design. In general, such screens were only affordable if they had a narrow biological scope (e.g., compound activity on an isolated protein target). In recent years, screening techniques have become available that combine a high throughput with a high dimensional readout and a complex biological context (e.g., cell culture). Examples are high content imaging and L1000 transcriptomics. In addition, due to state-of-the-art machine learning methods (Unterthiner et al. 2014) and high performance computing (Harnie et al. 2016) it has become possible to benefit from such high dimensional biological data on an enterprise scale. Together, these advances enable Biosignature-Based Drug Design, a paradigm that will dramatically change pharmaceutical research.
A software pipeline, mainly built in R and C++, allows us to support Biosignature-Based Drug Design in an enterprise setting. Dealing with multiple data sets of this scale and complexity is non-trivial. With our pipeline, we tailor generic methods to the needs of specific projects in diverse therapeutic areas. This operational application goes hand in hand with an ongoing effort, together with academic partners, to improve and extend our workflow.
We will show use cases in which Biosignature-Based Drug Design has increased the effectiveness and cost-efficiency of high throughput screens by repurposing historic data (Simm et al. 2017). Moreover, integrating multiple data sources makes it possible to take into account a broader biological context, rather than a single mode of action. This will yield a better understanding of on- and off-target effects. Ultimately, this may reduce failure rates for drug candidates in clinical trials.
Acknowledgements This work was supported by research grants IWT130405 ExaScience Life Pharma and IWT150865 Exaptation from the Flanders Innovation and Entrepreneurship agency.

References Harnie, D., M. Saey, A. E. Vapirev, J.K. Wegner, A. Gedich, M.N. Steijaert, H. Ceulemans, R. Wuyts, and W. De Meuter. 2016. “Scaling Machine Learning for Target Prediction in Drug Discovery Using Apache Spark.” Future Generation Computer Systems.

Simm, J., G. Klambauer, A. Arany, M.N. Steijaert, J.K. Wegner, E. Gustin, V. Chupakhin, et al. 2017. “Repurposed High-Throughput Images Enable Biological Activity Prediction for Drug Discovery.” bioRxiv.

Unterthiner, T., A. Mayr, G. Klambauer, M.N. Steijaert, H. Ceulemans, J.K. Wegner, and S. Hochreiter. 2014. “Deep Learning as an Opportunity in Virtual Screening.” In Workshop on Deep Learning and Representation Learning (NIPS 2014).

Speakers

## Marvin Steijaert

Consultant, Open Analytics
Data science, Machine learning, Systems biology, Computational biology, Bioinformatics

Thursday July 6, 2017 11:36am - 11:54am CEST
2.01 Wild Gallery

### 11:36am CEST

Keywords: code book, data dictionary, data cleaning, validation, automation
Webpages: https://github.com/petebaker/codebookr, https://github.com/ropensci/auunconf/issues/46
codebookr is an R package under development to automate cleaning, checking and formatting data using metadata from Codebooks or Data Dictionaries. It is primarily aimed at epidemiological research and medical studies but can be easily used in other research areas.
Researchers collecting primary, secondary or tertiary data from RCTs or government and hospital administrative systems often have different data documentation and data cleaning needs to those scraping data off the web or collecting in-house data for business analytics. However, all studies will benefit from using codebooks which comprehensively document all study variables including derived variables. Codebooks document data formats, variable names, variable labels, factor levels, valid ranges for continuous variables, details of measuring instruments and so on.
For statistical consultants, each new data set has a new codebook. While statisticians may get a photocopied codebook or a PDF, my preference is a spreadsheet so that the metadata can be used directly. Many data analysts are happy to use this metadata to write syntax that reads, cleans and checks data. I prefer to automate this process by reading the codebook into R and then using the metadata directly for data checking, cleaning and factor level definitions.
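The idea of driving checks and recoding directly from codebook metadata can be sketched as follows. This Python example is a language-agnostic illustration and does not show the codebookr interface: the codebook layout, the variable names and the `apply_codebook` helper are all assumptions made for the example.

```python
# Hypothetical codebook: one entry per study variable, as it might be read
# from a spreadsheet with one row per variable.
codebook = {
    "age": {"type": "numeric", "min": 0, "max": 120},
    "sex": {"type": "factor", "levels": {1: "male", 2: "female"}},
}


def apply_codebook(records, codebook):
    """Check each record against the codebook and recode factor levels.

    Returns the cleaned records and a list of (row, variable, value)
    problems; offending values are replaced by None.
    """
    cleaned, problems = [], []
    for row_number, record in enumerate(records, start=1):
        new_record = dict(record)
        for name, spec in codebook.items():
            value = record.get(name)
            if spec["type"] == "numeric":
                # Valid range check for continuous variables.
                if value is None or not (spec["min"] <= value <= spec["max"]):
                    problems.append((row_number, name, value))
                    new_record[name] = None
            elif spec["type"] == "factor":
                # Replace stored codes by their documented labels.
                if value in spec["levels"]:
                    new_record[name] = spec["levels"][value]
                else:
                    problems.append((row_number, name, value))
                    new_record[name] = None
        cleaned.append(new_record)
    return cleaned, problems


records = [{"age": 34, "sex": 1}, {"age": 250, "sex": 3}]
cleaned, problems = apply_codebook(records, codebook)
print(cleaned)
print(problems)
```

Because the rules live in the codebook rather than in the cleaning script, a new data set only requires a new codebook, not new checking code.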
While there is considerable interest in data wrangling and cleaning (Jonge and Loo 2013; Wickham 2014; Fischetti 2017), there appear to be few tools available to read codebooks (see http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html) and even fewer to automatically apply the metadata to datasets.
We outline the fundamentals of codebookr and demonstrate its use on examples of research projects undertaken at University of Queensland’s School of Public Health.
References Fischetti, Tony. 2017. Assertr: Assertive Programming for R Analysis Pipelines. https://CRAN.R-project.org/package=assertr.

Jonge, Edwin de, and Mark van der Loo. 2013. “An Introduction to Data Cleaning with R.” Technical Report 201313. Statistics Netherlands. http://cran.vinastat.com/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.

Speakers
PB

## Peter Baker

Thursday July 6, 2017 11:36am - 11:54am CEST
2.02 Wild Gallery

### 11:36am CEST

ANOVA-like statistical tests for differences among groups have been available for almost a hundred years. But for a large number of groups the results from commonly used post hoc tests are often hard to interpret. To deal with this problem, the factorMerger package constructs and plots the hierarchical relation among compared groups. This hierarchical structure is derived from the Likelihood Ratio Test and is presented with Merging Paths Plots created with the ggplot2 package. The current implementation handles one-dimensional and multi-dimensional Gaussian models as well as binomial and survival models. This article presents the theory and examples for single-factor use cases.
Package webpage: https://github.com/geneticsMiNIng/FactorMerger
Keywords: analysis of variance (ANOVA), hierarchical clustering, likelihood ratio test (LRT), post hoc testing

Speakers
AS

## Agnieszka Sitko

Data Scientist, Warsaw University of Technology

Thursday July 6, 2017 11:36am - 11:54am CEST
3.01 Wild Gallery

### 11:36am CEST

Keywords: rvest, purrr, webscraping, fantasy, sports
Webpages: http://www.maxhumber.com
Really interesting data never actually lives inside of a tidy csv. Unless, of course, you think Iris or mtcars is super interesting. Interesting data lives outside of comma separators. It’s unstructured, and messy, and all over the place. It lives around us and on poorly formatted websites, just waiting and begging to be played with.
Finding and fetching and cleaning your own data is a bit like cooking a meal from scratch—instead of microwaving a frozen TV dinner. Microwaving food is simple. It’s literally one step: put thing in microwave. There is, however, no singular step to making a proper meal from scratch. Every meal is different. The recipe for making coconut curry isn’t the same as the recipe for Brussels sprout tacos. But both require a knife and a frying pan!
In “Scraping data with rvest and purrr” I will talk through how to pair and combine rvest (the knife) and purrr (the frying pan) to scrape interesting data from a bunch of websites. This talk is inspired by a recent blog post that I wrote for r-bloggers.com and that was well received by its community.
rvest is a popular R package that makes it easy to scrape data from html web pages.
purrr is a relatively new package that makes it easy to write code for a single element of a list that can be quickly generalized to the rest of that same list.
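The pattern described here, writing the extraction for a single page and then mapping it over a list of pages, is not specific to R. The sketch below shows it with Python's standard library as an analogue only; it does not use or mirror the rvest or purrr APIs, and the two inline HTML snippets stand in for pages that would normally be downloaded.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def scrape_cells(html):
    """Step 1: solve the problem for a single page."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Step 2: generalize to many pages by mapping the same function over a list.
pages = [
    "<table><tr><td>Alice</td><td> 12 </td></tr></table>",
    "<table><tr><td>Bob</td><td>7</td></tr></table>",
]
results = list(map(scrape_cells, pages))
print(results)
```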

Speakers
MH

## Max Humber

Thursday July 6, 2017 11:36am - 11:54am CEST
4.02 Wild Gallery
Talk, Web

### 11:54am CEST

Keywords: Embedding Problem, Generator Matrix, Continuous-Time Markov Chain, Discrete-Time Markov Chain
Webpages: https://CRAN.R-project.org/package=ctmcd
The estimation of the parameters of a continuous-time Markov chain from discrete-time data is an important statistical problem which occurs in a wide range of applications: e.g., with the analysis of gene sequence data, for causal inference in epidemiology, for describing the dynamics of open quantum systems in physics, or in rating based credit risk modeling to name only a few.
The parameters of a continuous-time Markov chain are collected in its generator matrix (also called the transition rate matrix or intensity matrix), and the problem of estimating generator matrices from discrete-time data is also known as the embedding problem for Markov chains. For dealing with this missing data situation, a variety of estimation approaches have been developed. These comprise adjustments of matrix logarithm based candidate solutions of the aggregated discrete-time data, see Israel, Rosenthal, and Wei (2001) or Kreinin and Sidelnikova (2001). Moreover, likelihood inference can be conducted by an instance of the expectation-maximization (EM) algorithm and Bayesian inference by a Gibbs sampling procedure based on the conjugate gamma prior distribution (Bladt and Sørensen 2005).
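The link between a generator matrix Q and the discrete-time transition matrix observed over a period of length t is the matrix exponential, P(t) = exp(tQ); the estimation methods above try to invert this map. The Python sketch below shows only the forward direction for a made-up two-state generator, using a truncated Taylor series. It is an illustration and not part of ctmcd; the rates and function names are assumptions.

```python
import math


def mat_mul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(q, t=1.0, terms=30):
    """Approximate exp(t * Q) by its Taylor series (adequate for small t * Q)."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes (tQ)^k / k!
        term = mat_mul(term, [[t * v / k for v in row] for row in q])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Made-up generator: off-diagonal rates are non-negative, rows sum to zero.
a, b = 0.1, 0.3
generator = [[-a, a],
             [b, -b]]
p_one_period = mat_exp(generator, t=1.0)
print(p_one_period)

# Closed form for the two-state chain, for comparison.
print(a / (a + b) * (1.0 - math.exp(-(a + b))))
```

Going the other way, the matrix logarithm of an empirical transition matrix need not have non-negative off-diagonal entries or zero row sums, which is why the candidate solutions mentioned above have to be adjusted.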
The R package ctmcd (Pfeuffer 2016) is the first publicly available implementation of the approaches listed above. Besides point estimates of generator matrices, the package also contains methods to derive confidence and credibility intervals. The capabilities of the package are illustrated using Standard & Poor’s discrete-time credit rating transition data. Moreover, methodological issues of the described approaches are discussed, i.e., the derivation of the conditional expectations of the E-Step in the EM algorithm and the sampling of endpoint-conditioned continuous-time Markov chain trajectories for the Gibbs sampler.
References Bladt, M., and M. Sørensen. 2005. “Statistical Inference for Discretely Observed Markov Jump Processes.” Journal of the Royal Statistical Society B.

Israel, R. B., J. S. Rosenthal, and J. Z. Wei. 2001. “Finding Generators for Markov Chains via Empirical Transition Matrices, with Applications to Credit Ratings.” Mathematical Finance.

Kreinin, A., and M. Sidelnikova. 2001. “Regularization Algorithms for Transition Matrices.” Algo Research Quarterly.

Pfeuffer, M. 2016. “ctmcd: An R Package for Estimating the Parameters of a Continuous-Time Markov Chain from Discrete-Time Data.” In Revision (the R Journal).

Speakers
MP

## Marius Pfeuffer

Thursday July 6, 2017 11:54am - 12:12pm CEST
3.01 Wild Gallery
Talk, Methods I

### 11:54am CEST

Keywords: Shiny, microbiome, sequencing, ecology, 16S rRNA
Webpages: https://acnc-shinyapps.shinyapps.io/DAME/, https://github.com/bdpiccolo/ACNC-DAME
A new renaissance in knowledge about the role of commensal microbiota in health and disease is well underway facilitated by culture-independent sequencing technologies; however, microbial sequencing data poses new challenges (e.g., taxonomic hierarchy, overdispersion) not generally seen in more traditional sequencing outputs. Additionally, complex study paradigms from clinical or basic research studies necessitate a multilayered analysis pipeline that can seamlessly integrate both primary bioinformatics and secondary statistical analysis combined with data visualization.
In order to address this need, we created a web-based Shiny app, titled DAME, which allows users not familiar with R programming to import, filter, and analyze microbial sequencing data from experimental studies. DAME only requires two files (a BIOM file with sequencing reads combined with taxonomy details, and a csv file containing experimental metadata), which upon upload will trigger the app to render a linear workflow controlled by the user. Currently, DAME supports group comparisons of several ecological estimates of α-diversity (ANOVA) and β-diversity indices (ordinations and PERMANOVA). Additionally, pairwise differential comparisons of operational taxonomic units (OTUs) using Negative Binomial Regression at all taxonomic levels can be performed. All analyses are accompanied by dynamic graphics and tables for complete user interactivity. DAME leverages functions derived from the phyloseq, vegan, and DESeq2 packages for microbial data organization and analysis, and DT, highcharter* and scatterD3 for table and plot visualizations. Downloadable options for α-diversity measurements and DESeq2 table outputs are also provided.
The current release (v0.1) is available online (https://acnc-shinyapps.shinyapps.io/DAME/) and in the Github repository (https://github.com/bdpiccolo/ACNC-DAME). *This app uses Highsoft software with non-commercial packages. Highsoft software product is not free for commercial use. Funding supported by United States Department of Agriculture-Agricultural Research Service Project: 6026-51000-010-05S.

Speakers
BP

## Brian Piccolo

Thursday July 6, 2017 11:54am - 12:12pm CEST
2.01 Wild Gallery

### 11:54am CEST

Keywords: Equating, Item Response Theory, Multiple Forms, Scoring, Testing.
Webpages: https://CRAN.R-project.org/package=equateIRT
In many testing programs, security reasons require that test forms are composed of different items, making test scores not comparable across different administrations. The equating process aims to provide comparable test scores. This talk focuses on Item Response Theory (IRT) methods for dichotomous items. In IRT models, the probability of a correct response depends on the latent trait under investigation and on the item parameters. Due to identifiability issues, the latent variable is usually assumed to have zero mean and variance equal to one. Hence, when the model is fitted separately for different groups of examinees, the item parameter estimates are expressed on different measurement scales. The scale conversion can be achieved by applying a linear transformation of the item parameters, and the coefficients of this equation are called equating coefficients. This talk explains the functionalities of the R package equateIRT (Battauz 2015), which implements the estimation of the equating coefficients and the computation of the equated scores. Direct equating coefficients between pairs of forms that share some common items can be estimated using the mean-mean, mean-geometric mean, mean-sigma, Haebara and Stocking-Lord methods. However, the linkage plans are often quite complex, and not all forms can be linked directly. As proposed in Battauz (2013), the package also computes the indirect equating coefficients for a chain of forms and the average equating coefficients when two forms can be linked through more than one path. Using the equating coefficients so obtained, the item parameter estimates are converted to a common metric and it is possible to compute comparable scores. For this task, the package implements the true score equating and the observed score equating methods. Standard errors of the equating coefficients and the equated scores are also provided.
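To illustrate what the equating coefficients are, the Python sketch below computes direct coefficients A and B from common items with the mean-sigma and mean-mean methods, under the usual two-parameter convention that difficulties transform as b_Y = A * b_X + B and discriminations as a_Y = a_X / A. It is an illustration with made-up item parameters, not the equateIRT implementation, whose parameterization and function names may differ.

```python
from statistics import mean, pstdev


def mean_sigma(b_x, b_y):
    """Mean-sigma coefficients from the common items' difficulties."""
    A = pstdev(b_y) / pstdev(b_x)
    B = mean(b_y) - A * mean(b_x)
    return A, B


def mean_mean(a_x, a_y, b_x, b_y):
    """Mean-mean coefficients from discriminations (for A) and difficulties (for B)."""
    A = mean(a_x) / mean(a_y)
    B = mean(b_y) - A * mean(b_x)
    return A, B


def convert_items(a_x, b_x, A, B):
    """Express form X item parameters on the scale of form Y."""
    return [a / A for a in a_x], [A * b + B for b in b_x]


# Made-up common-item parameters on the scale of form X ...
a_x = [0.8, 1.0, 1.2, 1.5]
b_x = [-1.0, 0.0, 0.5, 1.5]
# ... and the same items on the scale of form Y, generated with A = 1.2, B = -0.5.
a_y = [a / 1.2 for a in a_x]
b_y = [1.2 * b - 0.5 for b in b_x]

print(mean_sigma(b_x, b_y))
print(mean_mean(a_x, a_y, b_x, b_y))
```

With error-free parameters both methods recover the same coefficients; with estimated parameters they generally differ, which is one reason several methods are offered. Under this convention, coefficients along a chain of forms compose as A = A1 * A2 and B = A2 * B1 + B2.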
References Battauz, Michela. 2013. “IRT Test Equating in Complex Linkage Plans.” Psychometrika 78 (3): 464–80. doi:10.1007/s11336-012-9316-y.

Battauz, Michela. 2015. “equateIRT: An R Package for IRT Test Equating.” Journal of Statistical Software 68 (1): 1–22. doi:10.18637/jss.v068.i07.

Speakers

## Michela Battauz

Associate Professor, University of Udine

Thursday July 6, 2017 11:54am - 12:12pm CEST
4.01 Wild Gallery

### 11:54am CEST

Keywords: REST, API, web, http
Webpages: https://CRAN.R-project.org/package=jug, https://github.com/Bart6114/jug
jug is a web framework for R. The framework helps to easily set up API endpoints. Its main goal is to make building and configuration of web APIs as easy as possible, while still allowing in-depth control over HTTP request processing when needed.
A jug instance allows one to expose solutions developed in R to the web and/or to applications communicating over HTTP. This way, other applications can gain access to, e.g., custom R plotting functions or generate new predictions based on a trained machine learning model.
jug is built upon httpuv. This results in a stable and robust back-end. Recently, endeavors have been made to allow a jug instance to process requests in parallel. The GitHub repository includes a Dockerfile to ease productionisation and containerisation of a jug instance.
During this talk, a tangible reproducible example of creating an API based on a machine learning model will be presented and some of the challenges and experiences in exposing R based results through an API will be discussed.

Speakers

## Bart Smeets

Thursday July 6, 2017 11:54am - 12:12pm CEST
4.02 Wild Gallery
Talk, Web

### 11:54am CEST

Keywords: Data cleaning, Quality control, Reproducible research, Data validation
Webpages: https://CRAN.R-project.org/package=dataMaid, https://github.com/ekstroem/dataMaid
The inability to replicate scientific studies has washed over many scientific fields in the last couple of years with potentially grave consequences. We need to give this problem its due diligence: Extreme care is needed when considering the representativeness of the data, and when we convey reproducible research information. We should not just document the statistical analyses and the data but also the exact steps that were part of the data cleaning process, so we know which potential errors we are unlikely to identify in the data.
Data cleaning and validation are the first steps in any data analysis since the validity of the conclusions from the statistical analysis hinges on the quality of the input data. Mistakes in the data arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. Ideally, a human investigator should go through each variable in the dataset and look for potential errors (both in input values and codings), but that process can be very time-consuming, expensive and error-prone in itself.
We present the R package dataMaid which implements an extensive and customizable suite of quality assessment tools to identify and document potential problems in the variables of a dataset. The results can be presented in an auto-generated, non-technical, stand-alone overview document intended to be perused by an investigator with an understanding of the variables in the dataset, but not necessarily knowledge of R. Thereby, dataMaid aids the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control. dataMaid also provides a suite of more typical R tools for interactive data quality assessment and cleaning.

Speakers

## Claus Ekstrøm

Thursday July 6, 2017 11:54am - 12:12pm CEST
2.02 Wild Gallery

### 11:54am CEST

Online presentation: https://goo.gl/pF9bKU

In this talk, Timo Grossenbacher, data journalist at Swiss Public Broadcast and creator of Rddj.info, will show that R is becoming more and more popular among a new community: data journalists. He will showcase some innovative work that has been done with R in the field of data journalism, both by his own team and by other media outlets all over the world. At the same time, he will point out the strengths (reproducibility, for example) and hurdles (having to learn to code) of using R for a typical data journalism workflow – a workflow that is often centered around quick, exploratory data analysis rather than statistical modeling. During the talk, he will also point out and critically discuss packages that are of great help for journalists especially, such as the tidyverse, readxl and googlesheets packages.

Speakers

## Timo Grossenbacher

Project Lead «Automated Journalism», Tamedia
Timo Grossenbacher (born 1987) has been responsible for projects in the area of «Automated Journalism» at Tamedia since summer 2020. Before that, he worked for more than five years as a data journalist at Swiss Radio and Television (Schweizer Radio und Fernsehen). He studied geography and computer science at the University of Zurich and...

Thursday July 6, 2017 11:54am - 12:12pm CEST
PLENARY Wild Gallery

### 12:12pm CEST

Keywords: Docker, Reproducible Research, Open Science
Webpage: https://github.com/o2r-project/containerit/
Reproducibility of computations is crucial in an era where data is born digital and analysed algorithmically. Most studies, however, only publish the results, often with figures as important interpreted outputs. But where do these figures come from? Scholarly articles must provide not only a description of the work but be accompanied by data and software. R offers excellent tools to create reproducible works, e.g. Sweave and RMarkdown. Several approaches to capture the workspace environment in R have been made, working around CRAN’s deliberate choice not to provide explicit versioning of packages and their dependencies. They preserve a collection of packages locally (packrat, pkgsnap, switchr/GRANBase) or remotely (MRAN timemachine/checkpoint), or install specific versions from CRAN or source (requireGitHub, devtools). Installers for old versions of R are archived on CRAN. A user can manually re-create a specific environment, but this is a cumbersome task.
We introduce a new possibility to preserve a runtime environment including both packages and R, by adding an abstraction layer in the form of a container, which can execute a script or run an interactive session. The package containeRit automatically creates such containers based on Docker. Docker is a solution for packaging an application and its dependencies, but has also proven useful in the context of reproducible research (Boettiger 2015). The package creates a container manifest, the Dockerfile, which is usually written by hand, from sessionInfo(), R scripts, or RMarkdown documents. The Dockerfiles use the Rocker community images as base images. Docker can build an executable image from a Dockerfile. The image is executable anywhere a Docker runtime is present. containeRit uses harbor for building images and running containers, and sysreqs for installing system dependencies of R packages. Before the planned CRAN release we want to share our work, discuss open challenges such as handling linked libraries (see discussion on geospatial libraries in Rocker), and welcome community feedback.
containeRit is developed within the DFG-funded project Opening Reproducible Research to support the creation of Executable Research Compendia (ERC) (Nüst et al. 2017).
References Boettiger, Carl. 2015. “An Introduction to Docker for Reproducible Research, with Examples from the R Environment.” ACM SIGOPS Operating Systems Review 49 (January): 71–79. doi:10.1145/2723872.2723882.

Nüst, Daniel, Markus Konkol, Edzer Pebesma, Christian Kray, Marc Schutzeichel, Holger Przibytzin, and Jörg Lorenz. 2017. “Opening the Publication Process with Executable Research Compendia.” D-Lib Magazine 23 (January). doi:10.1045/january2017-nuest.

Speakers

## Daniel Nüst

researcher, University of Münster
Reproducible Research, R, and Docker. Geo. Open Source.

Thursday July 6, 2017 12:12pm - 12:30pm CEST
2.02 Wild Gallery

### 12:12pm CEST

Online presentation: https://ndphillips.github.io/useR2017_pres/

Keywords: decision trees, decision making, package, visualization
Many complex real-world problems call for fast and accurate classification decisions. An emergency room physician faced with a patient complaining of chest pain needs to quickly decide if the patient is having a heart attack or not. A lost hiker, upon discovering a patch of mushrooms, needs to decide whether they are safe to eat or are poisonous. A stock portfolio adviser, upon seeing that, at 3:14 am, an influential figure tweeted about a company he is heavily invested in, needs to decide whether to move his shares or sit tight. These decisions have important consequences and must be made under time pressure with limited information. How can and should people make such decisions? One effective way is to use a fast and frugal decision tree (FFT). FFTs are simple heuristics that allow people to make fast, accurate decisions based on limited information (Gigerenzer and Goldstein 1996; Martignon, Katsikopoulos, and Woike 2008). In contrast to compensatory decision algorithms such as regression, or computationally intensive algorithms such as random forests, FFTs allow people to make fast decisions ‘in the head’ without requiring statistical training or a calculation device. Because they are so easy to implement, they are especially helpful in applied decision domains such as emergency rooms, where people need to be able to make decisions quickly and transparently (Gladwell 2007; Green and Mehr 1997).
While FFTs are easy to implement, actually constructing an effective FFT from data is less straightforward. While several FFT construction algorithms have been proposed (Dhami and Ayton 2001; Martignon, Katsikopoulos, and Woike 2008; Martignon et al. 2003), none have been programmed and distributed in an easy-to-use and well-documented tool. The purpose of this paper is to fill this gap by introducing FFTrees (Phillips 2016), an R package (R Core Team 2016) that allows anyone to create, evaluate, and visualize FFTs from their own data. The package requires minimal coding, is documented by many examples, and provides quantitative performance measures and visual displays showing exactly how cases are classified at each level in the tree.
This presentation is structured in three sections: Section 1 provides a theoretical background on binary classification decision tasks and explains how FFTs solve them. Section 2 provides a 5-step tutorial on how to use the FFTrees package to construct and evaluate FFTs from data. Finally, Section 3 compares the prediction performance of FFTrees to alternative algorithms such as logistic regression and random forests. To preview our results, we find that trees created by FFTrees are both more efficient than, and as accurate as, the best of these algorithms across a wide variety of applied datasets. Moreover, FFTrees produces trees much simpler than those of standard decision tree algorithms such as rpart (Therneau, Atkinson, and Ripley 2015), while maintaining similar prediction performance.
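As a rough illustration of how an FFT reaches a decision, the following Python sketch walks a case through a sequence of cues, each offering an immediate exit, with the final cue exiting in both directions. It does not use the FFTrees package; the cue names, their order and the exit decisions are hypothetical and only loosely inspired by the emergency room example.

```python
def classify(case, tree):
    """Classify one case with a fast and frugal tree.

    tree: list of nodes. Every node except the last is
    (cue, test, exit_decision): if test(case[cue]) is true the decision is
    made at once, otherwise the next cue is consulted. The last node is
    (cue, test, decision_if_true, decision_if_false).
    """
    for cue, test, exit_decision in tree[:-1]:
        if test(case[cue]):
            return exit_decision
    cue, test, if_true, if_false = tree[-1]
    return if_true if test(case[cue]) else if_false

# Hypothetical three-cue tree; True means "treat as high risk"
tree = [
    ("st_change", lambda v: v, True),          # ST segment change -> high risk
    ("chest_pain", lambda v: not v, False),    # no chest pain -> low risk
    ("other_factor", lambda v: v, True, False),
]

patient_a = {"st_change": True, "chest_pain": False, "other_factor": False}
patient_b = {"st_change": False, "chest_pain": False, "other_factor": True}
patient_c = {"st_change": False, "chest_pain": True, "other_factor": True}

decisions = [classify(p, tree) for p in (patient_a, patient_b, patient_c)]
```

Most cases leave the tree after one or two cues, which is what makes the heuristic frugal: later cues are never looked up for them.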
References Dhami, Mandeep K, and Peter Ayton. 2001. “Bailing and Jailing the Fast and Frugal Way.” Journal of Behavioral Decision Making 14 (2). Wiley Online Library: 141–68.

Gigerenzer, Gerd, and Daniel G Goldstein. 1996. “Reasoning the Fast and Frugal Way: Models of Bounded Rationality.” Psychological Review 103 (4). American Psychological Association: 650.

Gladwell, Malcolm. 2007. Blink: The Power of Thinking Without Thinking. Back Bay Books.

Green, Lee, and David R Mehr. 1997. “What Alters Physicians’ Decisions to Admit to the Coronary Care Unit?” Journal of Family Practice 45 (3). [New York, Appleton-Century-Crofts]: 219–26.

Martignon, Laura, Konstantinos V Katsikopoulos, and Jan K Woike. 2008. “Categorization with Limited Resources: A Family of Simple Heuristics.” Journal of Mathematical Psychology 52 (6). Elsevier: 352–61.

Martignon, Laura, Oliver Vitouch, Masanori Takezawa, and Malcolm R Forster. 2003. “Naive and yet Enlightened: From Natural Frequencies to Fast and Frugal Decision Trees.” Thinking: Psychological Perspective on Reasoning, Judgment, and Decision Making, 189–211.

Phillips, Nathaniel. 2016. FFTrees: Generate, Visualise, and Compare Fast and Frugal Decision Trees.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Therneau, Terry, Beth Atkinson, and Brian Ripley. 2015. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.

Speakers

## Nathaniel Phillips

Thursday July 6, 2017 12:12pm - 12:30pm CEST
PLENARY Wild Gallery

### 12:12pm CEST

Markov chain Monte Carlo (MCMC) is a method of producing a correlated sample in order to estimate expectations with respect to a target distribution. A fundamental question is: when should sampling stop so that we have good estimates of the desired quantities? The key to answering this question lies in assessing the Monte Carlo error through a multivariate Markov chain central limit theorem. This talk presents the R package mcmcse, which provides estimators for the asymptotic covariance matrix in the Markov chain CLT. In addition, the package calculates a multivariate effective sample size which can be rigorously used to terminate MCMC simulation. I will present the use of the R package mcmcse to conduct robust, valid, and theoretically justified output analysis for Markov chain data.
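The idea of a Monte Carlo standard error and an effective sample size can be sketched in the univariate case with a batch-means estimator. The Python below is a simplified illustration and not the mcmcse implementation: the square-root batch size, the AR(1) test chain and the function names are assumptions made for the example, and the multivariate version discussed in the talk replaces the variances with covariance matrices.

```python
import math
import random
from statistics import mean, variance

def batch_means_variance(chain):
    """Batch-means estimate of the asymptotic variance in the Markov chain CLT."""
    n = len(chain)
    b = int(math.sqrt(n))              # batch size (assumed default)
    a = n // b                         # number of batches
    overall = mean(chain[: a * b])
    batch_avgs = [mean(chain[k * b:(k + 1) * b]) for k in range(a)]
    return b * sum((m - overall) ** 2 for m in batch_avgs) / (a - 1)

def effective_sample_size(chain):
    """Univariate ESS: n * (sample variance) / (asymptotic variance)."""
    return len(chain) * variance(chain) / batch_means_variance(chain)

# Hypothetical correlated output: an AR(1) chain with autocorrelation 0.9
random.seed(42)
chain = [0.0]
for _ in range(9999):
    chain.append(0.9 * chain[-1] + random.gauss(0, 1))

ess = effective_sample_size(chain)
# Monte Carlo standard error of the sample mean
mcse = math.sqrt(batch_means_variance(chain) / len(chain))
```

A stopping rule in this spirit would keep sampling until the effective sample size exceeds a chosen threshold, rather than fixing the number of iterations in advance.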

Speakers

## Dootika Vats

University of Warwick

Thursday July 6, 2017 12:12pm - 12:30pm CEST
3.01 Wild Gallery
Talk, Methods I

### 12:12pm CEST

In the last decade, the volume of data has grown faster than the speed of processors. In this situation, statistical machine learning methods have become limited more by computation time than by the volume of datasets. Compromise solutions in the case of large-scale data are associated with the computational complexity of optimization methods, which must be handled in a non-trivial way. One such solution is the family of optimization algorithms based on stochastic gradient descent (Bottou (2010), Bottou (2012), Widrow (1960)), which are highly efficient when operating on large-scale data.
In my presentation I will describe the stochastic gradient descent algorithm applied to the log-likelihood-based estimation of the coefficients of the Cox proportional hazards model. This algorithm can be successfully used in time-to-event analyses in which the number of explanatory variables significantly exceeds the number of observations. The resulting method of estimating coefficients with stochastic gradient descent can be applied in survival analyses in areas such as molecular biology, bioinformatic screening of gene expression, or analyses based on DNA microarrays, which are widely used in clinical diagnostics, treatment and research.
The estimation workflow was a new approach (at the time I wrote my master's thesis), not known in the literature. It is robust to collinearity among variables and works well when coefficients are continuously updated on streaming data.
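A minimal sketch of the general idea, under strong simplifying assumptions, is given below in Python: the data arrive in mini-batches, and each batch contributes one gradient step on its own Cox log partial likelihood. This is not the speaker's implementation. It assumes a single covariate, no censoring and no tied event times, and the simulated data, batch size and step-size schedule are hypothetical choices.

```python
import math
import random

def batch_gradient(beta, batch):
    """Gradient of the Cox log partial likelihood for one mini-batch.

    batch: list of (time, event, x) records with a single covariate x.
    """
    ordered = sorted(batch, key=lambda r: r[0], reverse=True)  # longest time first
    s0 = s1 = 0.0          # running sums over the current risk set
    grad = 0.0
    for _, event, x in ordered:
        w = math.exp(beta * x)
        s0 += w
        s1 += w * x
        if event:
            grad += x - s1 / s0    # covariate minus its risk-set weighted mean
    return grad

# Simulated data: hazard proportional to exp(beta_true * x), all events observed
random.seed(1)
beta_true = 1.0
data = []
for _ in range(2000):
    x = random.gauss(0, 1)
    t = random.expovariate(math.exp(beta_true * x))
    data.append((t, 1, x))

beta, step = 0.0, 0
for epoch in range(5):
    for start in range(0, len(data), 50):        # batches arrive one by one
        batch = data[start:start + 50]
        lr = 0.5 / (1 + 0.05 * step)             # decaying step size
        beta += lr * batch_gradient(beta, batch) / len(batch)  # ascent step
        step += 1
```

Because each update only touches one batch, the coefficient can keep improving as new data stream in, without refitting on the full dataset.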

Speakers

## Marcin Kosiński

Thursday July 6, 2017 12:12pm - 12:30pm CEST
2.01 Wild Gallery

### 12:30pm CEST

Thursday July 6, 2017 12:30pm - 1:30pm CEST
CATERING POINTS Wild Gallery
BREAK

### 1:30pm CEST

An Efficient Algorithm for Solving Large Fixed Effects OLS Problems with Clustered Standard Error Estimation
Authors: Thomas Balmat and Jerome Reiter, Duke University
Keywords: large data least squares, fixed effects estimation, clustered standard error estimation, sparse matrix methods, high performance computing
Large fixed effects regression problems, involving on the order of 10^7 observations and 10^3 effect levels, present special computational challenges but also a special performance opportunity, because a large proportion of the entries in the expanded design matrix (fixed effect levels translated from single columns into dichotomous indicator columns, one for each level) are zero. For many problems, the proportion of zero entries is above 0.99995, which makes the matrix highly sparse. In this presentation, we demonstrate an efficient method for solving large, sparse fixed effects OLS problems without creation of the expanded design matrix and avoiding computations involving zero-level effects. This leads to minimal memory usage and optimal execution time. A feature often desired in social science applications is to estimate parameter standard errors clustered about a key identifier, such as employee ID. For large problems, with ID counts in the millions, this presents a significant computational challenge. We present a sparse matrix indexing algorithm that produces clustered standard error estimates and that, for large fixed effects problems, is many times more efficient than standard “sandwich” matrix operations.
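One standard way to avoid building the indicator columns, shown here only as an illustration and not as the authors' algorithm, is the within transformation: demean the outcome and the regressor inside each fixed-effect level, then run OLS on the demeaned values. The Python sketch below does this for a single regressor and accumulates the clustered "sandwich" variance by cluster index instead of with matrix products. The data, the function name and the omission of small-sample corrections are assumptions of the example.

```python
import random
from collections import defaultdict

def within_ols_clustered(rows):
    """rows: iterable of (group, cluster, x, y). Returns (beta, clustered_se)."""
    rows = list(rows)
    sums = defaultdict(lambda: [0.0, 0.0, 0])    # per-level sum of x, sum of y, count
    for g, _, x, y in rows:
        s = sums[g]
        s[0] += x
        s[1] += y
        s[2] += 1
    # Demean within each fixed-effect level (no indicator columns are created)
    xt = [x - sums[g][0] / sums[g][2] for g, _, x, _ in rows]
    yt = [y - sums[g][1] / sums[g][2] for g, _, _, y in rows]
    sxx = sum(a * a for a in xt)
    beta = sum(a * b for a, b in zip(xt, yt)) / sxx
    # Sandwich "meat": per-cluster sums of x * residual, accumulated by index
    score = defaultdict(float)
    for (_, c, _, _), a, b in zip(rows, xt, yt):
        score[c] += a * (b - beta * a)
    se = sum(v * v for v in score.values()) ** 0.5 / sxx
    return beta, se

# Hypothetical data: 50 fixed-effect levels, 300 clusters, true slope 2
random.seed(0)
effects = [random.gauss(0, 3) for _ in range(50)]
rows = []
for i in range(3000):
    g, c = i % 50, i % 300
    x = random.gauss(0, 1)
    rows.append((g, c, x, 2.0 * x + effects[g] + random.gauss(0, 1)))

beta, se = within_ols_clustered(rows)
```

Memory use grows with the number of levels and clusters rather than with the size of the expanded design matrix, which is the property the abstract emphasises.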

Speakers

## Thomas Balmat

Thursday July 6, 2017 1:30pm - 1:48pm CEST
3.01 Wild Gallery

### 1:30pm CEST

Keywords: Teaching, Knowledge Sharing, Best Practices
Webpage: https://github.com/rOpenGov/edu
R is increasingly used to teach programming, quantitative analytics, and reproducible research practices. Based on our combined experience from universities, research institutes, and the public sector, we summarize key ingredients for teaching modern data science. Learning to program has already been greatly facilitated by initiatives such as Data Carpentry and Software Carpentry, and educational resources have been developed by the users, including domain-specific tutorial collections and training materials (Kamvar et al. 2017; Lahti et al. 2017; Afanador-Llach et al. 2017). An essential pedagogical feature of R is that it enables a problem-centered, interactive learning approach. Even programming-naive learners can, in our experience, rapidly adopt practical skills by analyzing topical example data sets supported by ready-made Rmarkdown templates; these can provide an immediate starting point to expose the learners to some of the key tools and best practices (Wilson et al. 2016). However, many aspects of learning R are still better appreciated by advanced users, such as harnessing the full potential of the open collaboration model through joint development of custom R packages, report templates, shiny modules, or database functions that enable rapid development of solutions catering to specific practical needs. Indeed, at all levels of learning, getting things done fast appears to be an essential component of successful learning, as it provides instant rewards and helps to put the acquired skills into immediate use. The diverse needs of different application domains pose a great challenge for crafting common guidelines and materials, however. Leveraging the existing experience within the learning community can greatly support the learning process as it helps to ensure the domain specificity and relevance of the teaching experience.
This can be actively promoted by peer support and knowledge sharing; some ways to achieve this include code review, show-and-tell culture, informal meetings, online channels (e.g. Slack, IRC, Facebook) and hackathons. Last but not least, having fun throughout the learning process is essential; gamification of assignments with real-time rankings or custom functions performing non-statistical operations like emailing gif images can raise awareness of how R as a full-fledged programming language differs from proprietary statistical packages. In order to meet these demands, we designed specific open infrastructure to support learning in R. Our infrastructure gathers a set of modules to construct domain-specific assignments for various phases of data analysis. The assignments are coupled with automated evaluation and scoring routines that provide instant feedback during learning. In this talk, we introduce these R-based teaching tools and summarize our practical experiences on the various pedagogical aspects, opportunities, and challenges of community-based learning and knowledge sharing enabled by the R ecosystem.
References Afanador-Llach, Maria José, Antonio Rojas Castro, Adam Crymble, Víctor Gayol, Fred Gibbs, Caleb McDaniel, Ian Milligan, Amanda Visconti, and Jeri Wieringa. 2017. “The Programming Historian. Second Edition.” http://programminghistorian.org/.

Kamvar, Zhian N., Margarita M. López-Uribe, Simone Coughlan, Niklaus J. Grünwald, Hilmar Lapp, and Stéphanie Manel. 2017. “Developing Educational Resources for Population Genetics in R: An Open and Collaborative Approach.” Molecular Ecology Resources 17 (1): 120–28. doi:10.1111/1755-0998.12558.

Lahti, Leo, Sudarshan Shetty, Tineka Blake, and Jarkko Salojarvi. 2017. “Microbiome R Package.” http://microbiome.github.io/microbiome.

Wilson, Greg, Jenny Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy Teal. 2016. “Good Enough Practices for Scientific Computing,” 1–30.

Speakers

## Markus Kainu

Researcher, The Social Insurance Institution of Finland

slides pdf

Thursday July 6, 2017 1:30pm - 1:48pm CEST
PLENARY Wild Gallery

### 1:30pm CEST

Keywords: accessibility, exploration, interactivity
Descriptions of graphs using long text strings are difficult for blind people and others with print disabilities to process; they lack the interactivity necessary to understand the content and presentation of even the simplest statistical graphs. Until very recently, R has been the only statistical software that has any capacity for offering the print disabled community any hope of support with respect to accessing graphs. We have leveraged the ability to create text descriptions of graphs and the ability to create interactive web content for chemical diagrams to offer a new user experience.
We will present the necessary tools that (1) produce the desired graph in the correct form of a scalable vector graphic (SVG) file, (2) create a supporting XML structure for exploration of the SVG, and (3) provide the JavaScript library that supports mounting these files on the web.
A demonstration of how a blind user can explore the graph by “walking” a tree-like structure will be given. A key enhancement is the ability to explore the content at different levels of understanding; the user chooses to hear either the bare basic factual description or a more descriptive layer of feedback that can offer the user insight.

Speakers

## Jonathan R. Godfrey

Thursday July 6, 2017 1:30pm - 1:48pm CEST
4.02 Wild Gallery
Talk, Graphics

### 1:30pm CEST

Keywords: big data, reproducibility, data aggregation, bioinformatics, imaging
Webpages: https://github.com/kuwisdelu/matter, http://bioconductor.org/packages/release/bioc/html/matter.html
A common challenge in many areas of data science is the proliferation of large and heterogeneous datasets, stored in disjoint files and specialized formats, and exceeding the available memory of a computer. It is often important to work with these data on a single machine, e.g. to quickly explore the data, or to prototype alternative analysis approaches on limited hardware. Current solutions for working with such data on disk on a single machine in R involve wrapping existing file formats and structures (e.g., NetCDF, HDF5, database approaches, etc.) or converting them to very simple flat files (e.g., bigmemory, ff).
Here we argue that it is important to enable more direct interactions with such data in R. Direct interactions avoid the time and storage cost of creating converted files. They minimize the loss of information that can occur during the conversion, and therefore improve the accuracy and the reproducibility of the analytical results. They can best leverage the rich resources from over 10,000 packages already available in R.
We present matter, a novel paradigm and a package for direct interactions with complex, larger-than-memory data on disk in R. matter provides transparent access to datasets on disk, and allows us to build a single dataset from many smaller data fragments in custom formats, without reading them into memory. This is accomplished by means of a flexible data representation that allows the structure of the data in memory to be different from its structure on disk. For example, what matter presents as a single, contiguous vector in R may be composed of many smaller fragments from multiple files on disk. This allows matter to scale to large datasets, stored in large stand-alone files or in large collections of smaller files.
To illustrate the utility of matter, we will first compare its performance to bigmemory and ff using data in flat files, which can be easily accessed by all the three approaches. In tests on simulated datasets greater than 1 GB and common analyses such as linear regression and principal components analysis, matter consumed the same or less memory, and completed the analyses in a comparable time. It was therefore similar or more efficient than the available solutions.
Next, we will illustrate the advantage of matter in a research area that works with complex formats. Mass spectrometry imaging (MSI) relies on imzML, a common open-source format for data representation and sharing across mass spectrometric vendors and workflows. Results of a single MSI experiment are typically stored in multiple files. An integration of matter with the R package Cardinal allowed us to perform statistical analyses of all the datasets in a public Gigascience repository of MSI datasets, ranging from <1 GB up to 42 GB in size. All of the analyses were performed on a single laptop computer. Due to the structure of imzML, these analyses would not have been possible with the existing alternative solutions for working with larger-than-memory datasets in R.
Finally, we will demonstrate the applications of matter to large datasets in other formats, in particular text data that arise in applications in genomics and natural language processing, and will discuss approaches to using matter when developing new statistical methods for such datasets.

Speakers

## Kylie A. Bemis

Thursday July 6, 2017 1:30pm - 1:48pm CEST
2.01 Wild Gallery

### 1:30pm CEST

Keywords: Reinforcement Learning, Human-Like Learning, Experience Replay, Q-Learning, Decision Analytics
Webpages: https://github.com/nproellochs/ReinforcementLearning
Reinforcement learning has recently gained a great deal of traction in studies that call for human-like learning. In settings where an explicit teacher is not available, this method teaches an agent via interaction with its environment without any supervision other than its own decision-making policy. In many cases, this approach appears quite natural by mimicking the fundamental way humans learn. However, implementing reinforcement learning is programmatically challenging, since it relies on continuous interactions between an agent and its environment. In fact, there is currently no package available that performs model-free reinforcement learning in R. As a remedy, we introduce the ReinforcementLearning R package, which allows an agent to learn optimal behavior based on sample experience consisting of states, actions and rewards. The result of the learning process is a highly interpretable reinforcement learning policy that defines the best possible action in each state. The package provides a remarkably flexible framework and is easily applied to a wide range of different problems. We demonstrate the added benefits of human-like learning using multiple real-world examples, e.g. by teaching the optimal movements of a robot in a grid map.
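The learning scheme described above, in which an agent learns from recorded experience made of states, actions and rewards, can be sketched with tabular Q-learning and experience replay. The Python below is an illustration only and does not use the ReinforcementLearning package: the corridor environment, the parameter values and the function names are hypothetical.

```python
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3

def step(state, action):
    """Hypothetical corridor 0-1-2-3: reaching state 3 pays reward 1."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

# Sample experience: one (state, action, reward, next_state) tuple per move
experience = []
for s in range(GOAL):
    for a in ACTIONS:
        nxt, r = step(s, a)
        experience.append((s, a, r, nxt))

def learn(experience, alpha=0.5, gamma=0.9, sweeps=100):
    """Replay the stored experience repeatedly, applying the Q-learning update."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, nxt in experience:
            best_next = max(q[(nxt, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q

q = learn(experience)
# The learned policy picks the action with the highest Q-value in each state
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

The resulting policy is a plain lookup table from state to best action, which is what makes this kind of model-free result easy to interpret.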

Speakers

## Nicolas Pröllochs

Slides pptx

Thursday July 6, 2017 1:30pm - 1:48pm CEST
3.02 Wild Gallery

### 1:30pm CEST

Keywords: performance, compilation, Renjin
Webpages: http://docs.renjin.org/en/latest/package/
R is a highly dynamic language that has developed, in some circles, a reputation for poor performance. New programmers are counseled to avoid for loops, and experienced users are condemned to rewrite perfectly good R code in C++.
Renjin is an alternative implementation of the R language that includes a Just-in-Time compiler which uses information at runtime to dynamically specialize R code and generate highly efficient machine code, allowing users to write “normal”, expressive R code and let the compiler worry about performance.
While Renjin aims to provide a complete alternative to the GNU R interpreter, it is not yet fully compatible with all R packages, and lacks a number of features, including graphics support. For this reason, we present renjin, a new package that embeds Renjin’s JIT compiler in the existing GNU R interpreter, enabling even novice programmers to achieve high performance without resorting to C++ or making the switch to a different interpreter.
This talk will introduce the techniques behind Renjin’s optimizing compiler, demonstrate how it can be simply applied to performance-critical sections of R code, and share some tips and tricks for getting the most out of renjin.

Speakers

## Alexander Bertram

I work on two projects: Renjin (www.renjin.org), an interpreter and optimizing compiler for the R language; and ActivityInfo (www.activityinfo.org), a data collection, management, and analysis platform for the UN and NGOs working in crisis environments. Talk to me about compilers...

Thursday July 6, 2017 1:30pm - 1:48pm CEST
2.02 Wild Gallery

### 1:48pm CEST

Keywords: compiler, code analysis, performance
Webpages: https://github.com/nick-ulle/rstatic, https://github.com/duncantl/Rllvm
R allows useRs to focus on the problems they want to solve by abstracting away the details of the hardware. This is a major contributor to R’s success as a data analysis language, but also makes R too slow and resource-hungry for certain tasks. Traditionally, useRs have worked around this limitation by rewriting bottlenecks in Fortran, C, or C++. These languages provide a substantial performance boost at the cost of abstraction, a trade-off that useRs should not have to face.
This talk introduces a collection of packages for analyzing, optimizing, and building compilers for R code, extending earlier work by Temple Lang (2014). By building on top of the LLVM Compiler Infrastructure (Lattner and Adve 2004), a mature open-source library for native code generation, these tools enable translation of R code to specialized machine code for a variety of hardware. Moreover, the tools are extensible and ease the creation of efficient domain-specific languages based on R, such as nimble and dplyr. Potential applications will be discussed and a simple compiler (inspired by Numba) for mathematical R code will be presented as a demonstration.
References Lattner, Chris, and Vikram Adve. 2004. “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.” In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, 75. CGO ’04. Washington, DC, USA: IEEE Computer Society.

Temple Lang, Duncan. 2014. “Enhancing R with Advanced Compilation Tools and Methods.” Statistical Science 29 (2). Institute of Mathematical Statistics: 181–200.

Speakers

## Nick Ulle

Thursday July 6, 2017 1:48pm - 2:06pm CEST
2.02 Wild Gallery

### 1:48pm CEST

Keywords: Reproducible research, data versioning
Webpages: https://CRAN.R-project.org/package=daff, https://github.com/edwindj/daff
In data analysis, it can be necessary to compare two files containing tabular data. Unfortunately, existing tools have been customized for comparing source code or other text files, and are unsuitable for comparing tabular data.
The daff R package provides tools for comparing and tracking changes in tabular data stored in data.frames. daff wraps Paul Fitz’s multi-language daff package (https://github.com/paulfitz/daff), which generates data diffs that capture row and column modifications, reorders, additions, and deletions. These data diffs follow a standard format (http://dataprotocols.org/tabular-diff-format/) which can be used to render HTML-formatted diffs, summarize changes, and even patch (a new version of) input data.
daff brings the utility of source-code change tracking tools to tabular data, enabling data versioning as a component of software development and reproducible research.
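To make the idea of a tabular diff concrete, here is a minimal Python sketch. It is not daff's algorithm and does not produce the tabular diff format mentioned above; it assumes that every row carries a unique key column (the hypothetical `id`) and simply reports added, removed and modified rows together with added and removed columns.

```python
def table_diff(old, new, key="id"):
    """Compare two tables given as lists of dicts, matching rows on `key`."""
    old_rows = {row[key]: row for row in old}
    new_rows = {row[key]: row for row in new}

    # Column sets of each table (union over all rows).
    old_cols = set().union(*(row.keys() for row in old)) if old else set()
    new_cols = set().union(*(row.keys() for row in new)) if new else set()

    # For rows present in both tables, record (old, new) for each changed cell.
    modified = {}
    for k in old_rows.keys() & new_rows.keys():
        changes = {
            col: (old_rows[k].get(col), new_rows[k].get(col))
            for col in old_cols & new_cols
            if old_rows[k].get(col) != new_rows[k].get(col)
        }
        if changes:
            modified[k] = changes

    return {
        "added_columns": sorted(new_cols - old_cols),
        "removed_columns": sorted(old_cols - new_cols),
        "added_rows": sorted(new_rows.keys() - old_rows.keys()),
        "removed_rows": sorted(old_rows.keys() - new_rows.keys()),
        "modified": modified,
    }


# Toy data (made up for illustration).
old = [
    {"id": 1, "name": "ann", "score": 10},
    {"id": 2, "name": "bob", "score": 12},
    {"id": 3, "name": "cy", "score": 9},
]
new = [
    {"id": 1, "name": "ann", "score": 11, "group": "a"},
    {"id": 3, "name": "cy", "score": 9, "group": "b"},
    {"id": 4, "name": "dee", "score": 7, "group": "a"},
]
diff = table_diff(old, new)
```

A line-based text diff would flag every row of `new` as changed because of the extra column; a table-aware diff can report the new column once and the single changed cell separately.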

Speakers
ED

## Edwin de Jonge

daff pdf

Thursday July 6, 2017 1:48pm - 2:06pm CEST
2.01 Wild Gallery

### 1:48pm CEST

Keywords: reproducible research, rmarkdown, open science, training, github
Webpages: https://datacarpentry.org
Data Carpentry is a non-profit organization and community. It develops and teaches workshops aimed at researchers with little to no programming experience. It teaches skills and good practices for data management and analysis, with a particular emphasis on reproducibility. Over a two-day workshop, participants are exposed to the full life cycle of data-driven research. Since its creation in 2014, Data Carpentry has taught over 125 workshops and trained 400+ certified instructors. Because the workshops are domain specific, participants can quickly become familiar with the dataset used throughout the workshop and focus on learning the computing skills. We have developed detailed assessments to evaluate the effectiveness of a workshop and the level of satisfaction of the participants after attending it, as well as the impact on their research and careers 6 months or more after a workshop. Here, we will present an overview of the organization, the skills taught with a particular emphasis on using R, and the strategies used to make these workshops successful.

Speakers
MC

## Mine Cetinkaya - Rundel

Thursday July 6, 2017 1:48pm - 2:06pm CEST
PLENARY Wild Gallery
Talk, Education

### 1:48pm CEST

Keywords: Deep Learning, Natural Language Processing
Webpages: https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-convolutional-neural-networks-on-microsoft-azure/, https://github.com/dmlc/mxnet/tree/master/R-package, https://github.com/Azure/Cortana-Intelligence-Gallery-Content/tree/master/Tutorials/Deep-Learning-for-Text-Classification-in-Azure
The use of deep learning for NLP has attracted a lot of interest in the research community over recent years. This talk describes how deep learning techniques can be applied to natural language processing (NLP) tasks using R. We demonstrate how the MXNet deep learning framework can be used to implement, train and deploy deep neural networks that can solve text categorization and sentiment analysis problems.
We begin by briefly discussing the motivation and theory behind applying deep learning to NLP tasks. Deep learning has achieved a lot of success in the domain of image recognition. State-of-the-art image classification systems employ convolutional neural networks (CNNs) with a large number of layers. These networks perform well because they can learn hierarchical representations of the input with increasing levels of abstraction. In the context of NLP, neural networks have been shown to achieve good results. In particular, Recurrent Neural Networks such as Long Short Term Memory Networks (LSTMs) perform well for problems where the input is a sequence, such as speech recognition and text understanding. In this talk we explore an interesting approach which takes inspiration from the image recognition domain and applies CNNs to NLP problems. This is achieved by encoding segments of text in an image-like matrix, where each encoded word or character is equivalent to a pixel in the image.
CNNs have achieved excellent performance for text categorization and sentiment analysis. In this talk, we demonstrate how to implement a CNN for these tasks in R. As an example, we describe in detail the code to implement the Crepe model. To train this network, each input sentence is transformed into a matrix in which each column represents a one-hot encoding of one character. We describe the code needed to perform this transformation and how to specify the structure of the network and hyperparameters using the R bindings to MXNet provided in the mxnet package. We show how we implemented a custom C++ iterator class to efficiently manage the input and output of data. This allows us to process CSV files in chunks, taking batches of raw text and transforming them into matrices in memory, whilst distributing the computation over multiple GPUs. We describe how to set up a virtual machine with GPUs on Microsoft Azure to train the network, including installation of the necessary drivers and libraries. The network is trained on the Amazon categories dataset, which consists of a training set of 2.38 million sentences, each of which maps to one of 7 categories including Books, Electronics and Home & Kitchen.
The talk concludes with a demo of how a trained network can be deployed to classify new sentences. We demonstrate how this model can be deployed as a web service which can be consumed from a simple web app. The user can query the web service with a sentence and the API will return a product category. Finally, we show how the Crepe model can be applied to the sentiment analysis task using exactly the same network structure and training methods.
Through this talk, we aim to give the audience insight into the motivation for employing CNNs to solve NLP problems. Attendees will also gain an understanding of how they can be implemented, efficiently trained and deployed in R.
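The character-level encoding described in the abstract can be sketched in a few lines of Python. This is only an illustration of the idea, not the speakers' R or C++ code: the alphabet, the fixed length `max_len` and the function name are assumptions chosen for the example.

```python
# Illustrative alphabet (an assumption, not the one used in the talk).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "


def one_hot_encode(text, alphabet=ALPHABET, max_len=16):
    """Return a len(alphabet) x max_len matrix as a list of rows.

    Column j is the one-hot encoding of character j of `text`.
    Characters outside the alphabet and padding positions stay all-zero.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * max_len for _ in alphabet]
    for j, ch in enumerate(text.lower()[:max_len]):
        i = index.get(ch)
        if i is not None:
            matrix[i][j] = 1
    return matrix


encoded = one_hot_encode("Good book!", max_len=12)
```

Each sentence becomes a fixed-size binary matrix, which is what lets a convolutional network treat text much like a one-channel image.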

Speakers
AT

## Angus Taylor

Thursday July 6, 2017 1:48pm - 2:06pm CEST
3.02 Wild Gallery

### 1:48pm CEST

Keywords: visualization, interactive
Webpages: https://CRAN.R-project.org/package=ggiraph, https://davidgohel.github.io/ggiraph/
With the rise of data visualisation, ggplot2 and D3.js have become very popular in recent years. The former provides a high-level library for data visualisation, whereas the latter provides a low-level library for binding graphical elements in a web context.
The ggiraph package combines both tools. From a user’s point of view, it enables the production of interactive graphics from ggplot2 objects by using their extension mechanism. It provides useful interactive capabilities such as tooltips and zoom/pan. Last but not least, graphical elements can be selected when a ggiraph object is embedded in a Shiny app: the selection is available as a reactive value. The interface is simple and flexible, and requires no additional effort to integrate into R Markdown documents or Shiny applications.
In this talk I will introduce ggiraph and show examples of using it as a data visualisation tool in RStudio, Shiny applications and R Markdown documents.

Speakers
DG

## David Gohel

Thursday July 6, 2017 1:48pm - 2:06pm CEST
4.02 Wild Gallery
Talk, Graphics

Speakers

## Christina Knudson

Assistant Professor, University of St. Thomas

Thursday July 6, 2017 1:48pm - 2:06pm CEST
3.01 Wild Gallery

### 2:06pm CEST

Keywords: Count data regression, model diagnostics, rootogram, visualization
Webpages: https://R-Forge.R-project.org/projects/countreg
The interest in regression models for count data has grown rather rapidly over the last 20 years, partly driven by methodological questions and partly by the availability of new data sets with complex features (see, e.g., Cameron and Trivedi 2013). The countreg package for R provides a number of fitting functions and new tools for model diagnostics. More specifically, it incorporates enhanced versions of fitting functions for hurdle and zero-inflation models that have been available via the pscl package for some 10 years (Zeileis, Kleiber, and Jackman 2008), now also permitting binomial responses. In addition, it provides zero-truncation models for data without zeros, along with mboost family generators that enable boosting of zero-truncated and untruncated count data regressions, thereby supplementing and extending family generators available with the mboost package. For visualizing model fits, countreg offers rootograms (Tukey 1972; Kleiber and Zeileis 2016) and probability integral transform (PIT) histograms. A (generic) function for computing (randomized) quantile residuals is also available. Furthermore, there are enhanced options for predict() methods. Several new data sets from a variety of fields (including dentistry, ethology, and finance) are included.
Development versions of countreg have been available from R-Forge for some time; a CRAN release is planned for summer 2017.
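As a rough illustration of what a hanging rootogram displays, the following Python sketch computes the bar positions for a plain Poisson fit. It does not use or reproduce the countreg API; the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def poisson_pmf(k, mu):
    """Poisson probability of observing the count k given mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def hanging_rootogram(counts):
    """Return one (k, top, bottom) tuple per count value k = 0..max(counts).

    Each bar hangs from sqrt(expected frequency) and has height
    sqrt(observed frequency); a bottom far from zero signals misfit.
    """
    n = len(counts)
    mu = sum(counts) / n  # Poisson maximum likelihood estimate
    observed = Counter(counts)
    bars = []
    for k in range(max(counts) + 1):
        top = math.sqrt(n * poisson_pmf(k, mu))
        bottom = top - math.sqrt(observed.get(k, 0))
        bars.append((k, top, bottom))
    return bars


counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 6]  # toy data with excess zeros
bars = hanging_rootogram(counts)
```

For these toy data the bar at zero ends below the zero line, meaning the Poisson model predicts fewer zeros than were observed; this is the kind of pattern that motivates hurdle or zero-inflation models.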
References

Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Cambridge: Cambridge University Press.

Kleiber, Christian, and Achim Zeileis. 2016. “Visualizing Count Data Regressions Using Rootograms.” The American Statistician 70 (3): 296–303.

Tukey, John W. 1972. “Some Graphic and Semigraphic Displays.” In Statistical Papers in Honor of George W. Snedecor, edited by T. A. Bancroft, 293–316. Ames, IA: Iowa State University Press.

Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software 27 (8): 1–25. http://www.jstatsoft.org/v27/i08/.

Speakers
CK

## Christian Kleiber

Thursday July 6, 2017 2:06pm - 2:24pm CEST
3.01 Wild Gallery

### 2:06pm CEST

Keywords: ODBC, DBI, databases, dplyr, RStudio
Webpages: https://CRAN.R-project.org/package=odbc, https://github.com/rstats-db/odbc
Getting data into and out of databases is one of the most fundamental parts of data science. Much of the world’s data is stored in databases, including traditional databases such as SQL Server, MySQL, PostgreSQL, and Oracle, as well as non-traditional databases like Hive, BigQuery, Redshift and Spark.
The odbc package provides an R interface to Open Database Connectivity (ODBC) drivers and databases, including all those listed previously. odbc provides consistent output, including support for timestamps and 64-bit integers, improved performance for reading and writing, and complete compatibility with the DBI package.
odbc connections can be used as dplyr backends, allowing one to perform expensive queries within the database and reduce the need to transfer and load large amounts of data in an R session. odbc is also integrated into the RStudio IDE, with dialogs to set up and establish connections, preview available tables and schemas, and run ad-hoc SQL queries. The RStudio Professional Products are bundled with a suite of ODBC drivers to make it easy for system administrators to establish and support connections to a variety of database technologies.
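The benefit of running a query inside the database rather than in the client session can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module and an in-memory database as stand-ins (assumptions for illustration; this is neither the odbc package nor dplyr), and the `sales` table is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5), ("south", 2.5), ("west", 4.0)],
)

# Option 1: transfer every row, then aggregate on the client.
rows = con.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Option 2: aggregate inside the database; only one row per region is transferred.
db_totals = dict(
    con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)
con.close()
```

Both options give the same totals, but the second moves three rows instead of five across the connection; with millions of rows that difference is what makes a database backend attractive.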

Speakers
JH

## Jim Hester

Thursday July 6, 2017 2:06pm - 2:24pm CEST
2.01 Wild Gallery

### 2:06pm CEST

Keywords: Multicore architectures, benchmarking, scalability, Xeon Phi
We present performance results obtained with a new performance benchmark of the R programming environment on the Xeon Phi Knights Landing and standard Xeon-based compute nodes. The benchmark package consists of microbenchmarks of matrix linear algebra kernels and machine learning functionality included in the R distribution that can be built from those kernels. Our microbenchmarking results show that the Knights Landing compute nodes exhibited similar or superior performance compared to the standard Xeon-based nodes for matrix dimensions of moderate to large size for most of the microbenchmarks, executing as much as five times faster than the standard Xeon-based nodes. For the clustering and neural network training microbenchmarks, the standard Xeon-based nodes performed up to four times faster than their Xeon Phi counterparts for many large data sets, indicating that commonly used R packages may need to be reengineered to take advantage of existing optimized, scalable kernels.
Over the past several years a trend of increased demand for high performance computing (HPC) in data analysis has emerged. This trend is driven by increasing data sizes and computational complexity (Fox et al. 2015; Kouzes et al. 2009). Many data analysts, researchers, and scientists are turning to HPC machines to help with algorithms and tools, such as machine learning, that are computationally demanding and require large amounts of memory (Raj et al. 2015). The characteristics of large scale machines (e.g. large amounts of RAM per node, high storage capacity, and advanced processing capabilities) appear very attractive to these researchers; however, challenges remain for algorithms to make optimal use of the hardware (Lee et al. 2014). Depending on the nature of the analysis to be performed, analytics workflows may be carried out as many independent concurrent processes requiring little or no coordination between them, or as highly coordinated parallel processes in which the processes perform portions of the same computational task. Regardless of the implementation, it is important for data analysts to have software environments at their disposal which can exploit the performance advantages of modern HPC machines.
A way to assess the performance of software on a given computing platform and inter-compare performance across different platforms is through benchmarking. Benchmark results can also be used to prioritize software performance optimization efforts on emerging HPC systems. One such emerging architecture is the Intel Xeon Phi processor, codenamed Knights Landing (KNL). The latest Intel Xeon Phi processor is a system on a chip, many-core, vector processor with up to 68 cores and two 512-bit vector processing units per core, a sufficient deviation from the standard Xeon processors and Xeon Phi accelerators of the previous generation to necessitate a performance assessment of the R programming environment on KNL.
We developed an R performance benchmark to determine the single-node run time performance of compute-intensive linear algebra kernels that are common to many data analytics algorithms, and the run time performance of machine learning functionality commonly implemented with linear algebra operations. We then performed single-node strong scaling tests of the benchmark on both Xeon and Xeon Phi based systems to determine problem sizes and numbers of threads for which the KNL architecture was comparable to or outperformed its standard Intel Xeon counterparts. It is our intention that these results be used to guide future performance optimization efforts of the R programming environment to increase the applicability of HPC machines for compute-intensive data analysis. The benchmark is also generally applicable to a variety of systems and architectures and can be easily run to determine the computational potential of a system when using R for many data analysis tasks.
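The general shape of such a microbenchmark can be sketched as follows. This is a simplified Python illustration, not the authors' benchmark package: the naive `matmul` kernel, the helper names and the problem sizes are assumptions, and a real benchmark would call optimized linear algebra libraries and vary the number of threads.

```python
import statistics
import time


def matmul(a, b):
    """Naive dense matrix product of two matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def microbenchmark(kernel, make_input, sizes, repeats=3):
    """Return {size: median run time in seconds} for kernel(*make_input(size))."""
    results = {}
    for n in sizes:
        args = make_input(n)  # build inputs outside the timed region
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args)
            timings.append(time.perf_counter() - start)
        results[n] = statistics.median(timings)
    return results


def square_matrices(n):
    """Two identical n x n test matrices."""
    m = [[float(i + j) for j in range(n)] for i in range(n)]
    return m, m


timings = microbenchmark(matmul, square_matrices, sizes=[8, 16, 32])
```

Repeating each measurement and reporting the median reduces the influence of one-off disturbances; running the same script on two platforms gives a per-size comparison of the kind the abstract reports.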
References

Fox, Geoffrey, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburugamuve. 2015. “Big Data, Simulations and HPC Convergence.” In Workshop on Big Data Benchmarks, 3–17. Springer.

Kouzes, Richard T, Gordon A Anderson, Stephen T Elbert, Ian Gorton, and Deborah K Gracio. 2009. “The Changing Paradigm of Data-Intensive Computing.” Computer 42 (1). IEEE: 26–34.

Lee, Seunghak, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A Gibson, and Eric P Xing. 2014. “On Model Parallelization and Scheduling Strategies for Distributed Machine Learning.” In Advances in Neural Information Processing Systems, 27:2834–42.

Raj, Pethuru, Anupama Raman, Dhivya Nagaraj, and Siddhartha Duggirala. 2015. “High-Performance Big-Data Analytics.” Computing Systems and Approaches (Springer, 2015) 1. Springer.

Speakers
SM

## Scott Michael

Thursday July 6, 2017 2:06pm - 2:24pm CEST
2.02 Wild Gallery

### 2:06pm CEST

Keywords: R, Distributed/Scalable, Machine Learning, SparkR, SystemML
Webpages: https://github.com/SparkTC/R4ML
R is the de facto standard for statistics and analysis. In this talk, we introduce R4ML, a new open-source R package for scalable machine learning from IBM. R4ML provides a bridge between R, Apache SystemML and SparkR, allowing R scripts to invoke custom algorithms developed in SystemML’s R-like domain specific language. This capability also provides a bridge to the algorithm scripts that ship with Apache SystemML, effectively adding a new library of prebuilt scalable algorithms for R on Apache Spark. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same script. In addition, the R4ML package provides a number of useful new scalable R functions that simplify common data cleaning and statistical analysis tasks.
Our talk will begin with an overview of the R4ML package, its API, supported canned algorithms, and the integration with Spark and SystemML. We will walk through a small example of creating a custom algorithm and a demo of a canned algorithm. We will share our experiences using R4ML technology with IBM clients. The talk will conclude with pointers to how the audience can try out R4ML and a discussion of potential areas of community collaboration.

Speakers
AN

## Alok N Singh

Thursday July 6, 2017 2:06pm - 2:24pm CEST
3.02 Wild Gallery

### 2:06pm CEST

Keywords: Hypothesis testing, regression model, mixed effects model, mixture model, change point detection
Webpage: http://sia.webpopix.org/
At Inria and Ecole Polytechnique, we are developing the web-based educational platform Statistics in Action with R.
The purpose of this online course is to show how statistics may be efficiently used in practice using R.
The course presents both statistical theory and practical analysis on real data sets. The R statistical software and several R packages are used for implementing methods presented in the course and analyzing real data. Many interactive Shiny apps are also available.
Topics covered in the current version of the course are:
• hypothesis testing (single and multiple comparisons)
• regression models (linear and nonlinear models)
• mixed effects models (linear and nonlinear models)
• mixture models
• detection of change points
• image restoration
We are aware that important aspects of statistics are not addressed, both in terms of models and methods. We plan to fill some of these gaps shortly.
Even though R is used extensively in this course, it is not an R programming course. On the one hand, our objective is not to propose the most efficient implementation of an algorithm, but rather to provide code that is easy to understand, reuse and extend.
On the other hand, the R functions used to illustrate a method are not used as “black boxes”. We show in detail how the results of a given function are obtained. Thus, the course may be read at two different levels: we may be interested only in the statistical technique to use (and then the R function to use) for a given problem (see the first part of the course about polynomial regression), or we may want to go into details and understand how these results are computed (see the second part of this course about polynomial regression).
This course was first given at Ecole Polytechnique (France) in 2017.

Speakers
ML

## Marc Lavielle

Thursday July 6, 2017 2:06pm - 2:24pm CEST
PLENARY Wild Gallery

### 2:06pm CEST

Keywords: meta-analysis, funnel plot, visual inference, publication bias, small study effects
Webpages: https://CRAN.R-project.org/package=metaviz, https://metaviz.shinyapps.io/funnelinf_app/
The funnel plot is widely used in meta-analysis to detect small study effects, especially publication bias. However, it has been repeatedly shown that the interpretation of funnel plots is highly subjective and often leads to false conclusions regarding the presence or absence of such small study effects (Terrin, Schmid, and Lau 2005). Visual inference (Buja et al. 2009) is the formal inferential framework to test if graphically displayed data do or do not support a hypothesis. The general idea is that if the data support an alternative hypothesis, the graphical display showing the real data should be identifiable when simultaneously presented with displays of simulated data under the null hypothesis. When compared to conventional statistical tests, visual inference showed promising results in experiments, for example, for testing linear model coefficients using boxplots and scatterplots (Majumder, Hofmann, and Cook 2013). With the package nullabor (Wickham, Chowdhury, and Cook 2014), helpful general purpose functions for visual inference are available within R. Due to the often uncertain or even misleading nature of funnel plot based conclusions, we identified funnel plots as a prime candidate field for the application of visual inference. For this purpose, we developed the function funnelinf, which is available within the R package metaviz. The function funnelinf is specifically tailored to visual inference of funnel plots, for instance, with options for displaying significance contours and Egger’s regression line, and for using different meta-analytic models for null plot simulation. In addition, the functionalities of funnelinf are made available as a shiny app for convenient use by meta-analysts not familiar with R. Visual funnel plot inference and the capabilities of funnelinf are illustrated with real data from a meta-analysis on the Mozart effect.
Furthermore, results of an empirical experiment evaluating the power of visual funnel plot inference compared to traditional statistical funnel plot based tests are presented. Implications of these results are discussed and specific guidelines for the use of visual funnel plot inference are given.
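The lineup idea behind visual inference can be sketched in Python as follows. This is a simplified illustration and not the funnelinf implementation: it assumes a fixed-effect model for simulating null data (one of several possible meta-analytic models), and the function name and toy data are hypothetical.

```python
import random


def funnel_lineup(effects, std_errors, n_panels=20, seed=1):
    """Hide the real funnel-plot data among simulated null data sets.

    Returns (panels, true_position); each panel is a list of
    (effect, std_error) points that could be drawn as one funnel plot.
    """
    rng = random.Random(seed)
    # Inverse-variance weighted mean: the fixed-effect summary estimate.
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Null data: effects scatter around the pooled estimate with no small-study effect.
    panels = []
    for _ in range(n_panels - 1):
        simulated = [rng.gauss(pooled, se) for se in std_errors]
        panels.append(list(zip(simulated, std_errors)))

    # Insert the real data at a random position.
    true_position = rng.randrange(n_panels)
    panels.insert(true_position, list(zip(effects, std_errors)))
    return panels, true_position


# Toy data (made up): larger effects in less precise studies.
effects = [0.80, 0.55, 0.40, 0.30, 0.22, 0.15]
std_errors = [0.40, 0.30, 0.22, 0.15, 0.10, 0.05]
panels, true_position = funnel_lineup(effects, std_errors)

# If a single viewer picks the real panel, the p-value under the null is 1 / n_panels.
p_value_if_identified = 1 / len(panels)
```

With 20 panels, a viewer who singles out the real plot provides evidence against the null hypothesis at the 5% level, which is what turns a subjective impression into a formal test.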
References

Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F Swayne, and Hadley Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367: 4361–83.

Majumder, Mahbubul, Heike Hofmann, and Dianne Cook. 2013. “Validation of Visual Statistical Inference, Applied to Linear Models.” Journal of the American Statistical Association 108: 942–56.

Terrin, Norma, Christopher H Schmid, and Joseph Lau. 2005. “In an Empirical Evaluation of the Funnel Plot, Researchers Could Not Visually Identify Publication Bias.” Journal of Clinical Epidemiology 58: 894–901.

Wickham, Hadley, Niladri Roy Chowdhury, and Di Cook. 2014. Nullabor: Tools for Graphical Inference. https://CRAN.R-project.org/package=nullabor.

Speakers
MK

## Michael Kossmeier

Thursday July 6, 2017 2:06pm - 2:24pm CEST
4.02 Wild Gallery
Talk, Graphics

### 2:24pm CEST

Keywords: GNU R, PLC, Embedded, Linux
Being one of the leading institutions in the field of applied energy research, the Bavarian Center for Applied Energy Research (ZAE Bayern) combines excellent research with excellent economic implementation of the results. Our main research goal is to increase the capacity of low-voltage grids for installed photovoltaics. To this end, the influences were analysed by taking measurements in grid nodes and households. In this context we applied several open-source programming languages, but we found GNU R to be the best suited. This is because it is capable of analysing, manipulating and plotting the measurement data as well as simulating and controlling our real test systems. We have installed multiple storage units and modules in different households. In order to control these test sites, we use Wago PLC 750-8202 controllers, which are currently programmed in Codesys. As strategies can get quite complex (due to individual forecasting, dynamic non-linear storage models, …), we find that Codesys is not capable of this. By choosing R for complex computations, we have access to a wide range of libraries and to our self-developed strategies used for analysis and simulation, which rely on the measurement data from the grid and the weather. To bring the strategy to the PLC, we divided the whole system into two parts. One part runs on our central servers and prepares external data from our databases for each test site. For this purpose, we set up a control platform using node-red and the httpuv package in order to run R scripts on demand. The second part will be computed on the Wago PLCs. With the board support package (BSP), Wago provides a tool-chain that lets its customers build their own customised firmware. Our proposed idea is to get R and Python running on the PLC together with the basic Codesys. Python will serve as an asynchronous local controller that is able to start calculations in R and provide control quantities to Codesys.
We try to apply the rzmq package for inter-process communication (IPC) and data exchange. For example, the information delivered to the Python controller will be cyclically forwarded to the global environment of a continuously running R instance. This helps us to reduce the start-up and initialization effort for our models. On demand, R is instructed to calculate a given strategy, which is chosen for a specific day and situation by the central servers. R will hand the result back to the Python controller, which forwards it to Codesys, where short-term closed control loops can be established. Our approach will be tested and verified on our Wago PLCs in our environment.
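The controller pattern described above, a long-running compute process that receives cyclic data updates and evaluates a strategy on demand, can be sketched in Python. This is an in-process analogy only: it uses a thread and standard-library queues instead of ZeroMQ, a Python function instead of an R instance, and every name (`worker`, `charge_surplus`, `pv_kw`, `load_kw`) is hypothetical.

```python
import queue
import threading


def worker(requests, replies):
    """Long-running compute process that keeps its state between requests."""
    state = {}  # plays the role of the persistent global environment
    while True:
        kind, payload = requests.get()
        if kind == "update":      # cyclic data forwarded by the controller
            state.update(payload)
        elif kind == "compute":   # on-demand strategy evaluation
            strategy = payload
            replies.put(strategy(state))
        elif kind == "stop":
            break


def charge_surplus(state):
    """Hypothetical strategy: charge the storage with the PV surplus, if any."""
    return max(0.0, state["pv_kw"] - state["load_kw"])


requests, replies = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(requests, replies))
thread.start()

requests.put(("update", {"pv_kw": 3.5, "load_kw": 1.0}))
requests.put(("update", {"load_kw": 1.5}))   # a later measurement overrides the earlier one
requests.put(("compute", charge_surplus))
setpoint = replies.get(timeout=5)            # value that would be handed on to the PLC logic
requests.put(("stop", None))
thread.join()
```

Because the worker stays alive and keeps its state, each strategy evaluation avoids the start-up and initialization cost of launching a fresh process, which is the motivation given in the abstract.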

Schematic procedure

Speakers
OG

## Oliver Glaß

Thursday July 6, 2017 2:24pm - 2:42pm CEST
2.02 Wild Gallery

### 2:24pm CEST

Keywords: Teaching Statistics using R, User Interface, Technology Acceptance Model
Teaching computation with R within statistics education is affected by the acceptance of the technology (Baglin 2013). In a quasi-experiment we investigate whether different user interfaces to R, namely mosaic or Rcmdr, influence the acceptance according to the Technology Acceptance Model (Venkatesh et al. 2003).
The focus is on the perceived usefulness and ease of use of R software for people studying while working in different economics-related disciplines. At our private university of applied sciences for professionals studying while working, which has more than 30 study centres across Germany, the use of R is compulsory in all statistical courses, in all Master programs and in all study centres. Due to a change in the course of study we were able to teach the lecture twice in one term, once with Rcmdr and once with mosaic, enabling a quasi-experimental setup for two lectures each.
References

Baglin, James. 2013. “Applying a Theoretical Model for Explaining the Development of Technological Skills in Statistics Education.” Technology Innovations in Statistics Education 7 (2). https://escholarship.org/uc/item/8w97p75s.

Venkatesh, Viswanath, Michael G. Morris, Gordon B. Davis, and Fred D. Davis. 2003. “User Acceptance of Information Technology: Toward a Unified View.” MIS Q. 27 (3): 425–78. http://dl.acm.org/citation.cfm?id=2017197.2017202.

Speakers

## Matthias Gehrke

Professor, FOM University of Applied Sciences

Thursday July 6, 2017 2:24pm - 2:42pm CEST
PLENARY Wild Gallery
Talk, Education

### 2:24pm CEST

**Keywords**: Computer Vision, Image recognition, Object detection, Image feature engineering

**Webpages**: https://github.com/bnosac/image

R already has quite a few packages for image processing, namely [magick](https://CRAN.R-project.org/package=magick), [imager](https://CRAN.R-project.org/package=imager), [EBImage](https://bioconductor.org/packages/EBImage) and [OpenImageR](https://CRAN.R-project.org/package=OpenImageR).

The field of image processing is rapidly evolving with new algorithms and techniques quickly popping up from Learning and Detection, to Denoising, Segmentation and Edges, Image Comparison and Deep Learning.

In order to complement these existing packages with new algorithms, we implemented a number of *R* packages. Some of these packages have been released on https://github.com/bnosac/image, namely:

- **image.CornerDetectionF9**: FAST-9 corner detection for images (license: BSD-2).
- **image.LineSegmentDetector**: Line Segment Detector (LSD) for images (license: AGPL-3).
- **image.ContourDetector**: Unsupervised Smooth Contour Line Detection for images (license: AGPL-3).
- **image.CannyEdges**: Canny Edge Detector for Images (license: GPL-3).
- **image.dlib**: Speeded up robust features (SURF) and histogram of oriented gradients (HOG) features (license: AGPL-3).
- **image.darknet**: Image classification using darknet with deep learning models AlexNet, Darknet, VGG-16, Extraction (GoogleNet) and Darknet19, as well as object detection using the state-of-the-art YOLO detection system (license: MIT).

More packages and extensions will be released in due course.

In this talk, we provide an overview of these newly developed packages and the new computer vision algorithms made accessible for R users.

![](https://github.com/bnosac/image/raw/master/logo-image.png)

Speakers

## Jan Wijffels

statistician, www.bnosac.be

Thursday July 6, 2017 2:24pm - 2:42pm CEST
3.02 Wild Gallery

### 2:24pm CEST

Keywords: Bayesian, developeRs
The rstan package provides an interface from R to the Stan libraries, which makes it possible to access Stan’s advanced algorithms to draw from any posterior distribution whose density function is differentiable with respect to the unknown parameters. The rstan package is ranked in the 99th percentile overall on Depsy due to its number of downloads, citations, and use in other projects. This talk is a follow-up to the very successful Stan workshop at useR2016 and will be more focused on how maintainers of other R packages can easily use Stan’s algorithms to estimate the statistical models that their packages provide. These mechanisms were developed to support the rstanarm package for estimating regression models with Stan and have since been used by over twenty R packages, but they are perhaps not widely known and difficult to accomplish manually. Fortunately, the rstan_package.skeleton function in the rstantools package can be used to automate most of the process, so the package maintainer only needs to write the log-posterior density (up to a constant) in the Stan language and provide an R wrapper to call the pre-compiled C++ representation of the model. Methods for the resulting R object can be defined that allow the user to analyze the results using post-estimation packages such as bayesplot, ShinyStan, and loo.
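What “up to a constant” means for a log-posterior density can be shown with a toy example. The Python sketch below is not Stan code and does not use rstan; it assumes a normal model with unit variance and a normal prior on the mean (both chosen only for illustration) and checks that dropping the normalizing terms shifts the log density by the same amount for every parameter value.

```python
import math


def log_posterior_unnormalized(mu, data, prior_sd=10.0):
    """Log posterior of a normal mean (unit variance), dropping all constants."""
    log_lik = -0.5 * sum((y - mu) ** 2 for y in data)
    log_prior = -0.5 * (mu / prior_sd) ** 2
    return log_lik + log_prior


def log_joint_with_constants(mu, data, prior_sd=10.0):
    """Same quantity with the normalizing terms of both normal densities kept."""
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)
    log_prior = (
        -math.log(prior_sd)
        - 0.5 * math.log(2 * math.pi)
        - 0.5 * (mu / prior_sd) ** 2
    )
    return log_lik + log_prior


data = [1.2, 0.7, 1.9, 1.1]  # toy observations
offsets = [
    log_joint_with_constants(mu, data) - log_posterior_unnormalized(mu, data)
    for mu in (-1.0, 0.0, 2.5)
]
```

The offsets agree for all three values of `mu`, so differences and gradients of the log density are unaffected by the dropped terms; those are the quantities that gradient-based sampling relies on.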
References

Carpenter, Bob, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. doi:10.18637/jss.v076.i01.

Speakers
BG

## Ben Goodrich

Thursday July 6, 2017 2:24pm - 2:42pm CEST
3.01 Wild Gallery

### 2:24pm CEST

Online presentation: https://krlmlr.github.io/useR17/Joint-profiling.html

Keywords: Database, SQLite, specification, test suite
Webpages: https://CRAN.R-project.org/package=DBI, https://CRAN.R-project.org/package=DBItest, https://CRAN.R-project.org/package=RSQLite
Getting data in and out of R is a minor but very important part of a statistician’s or data scientist’s work. Sometimes, the data are packaged as an R package; however, in the majority of cases one has to deal with third-party data sources. Using a database for storage and retrieval is often the only feasible option with today’s ever-growing data.
DBI is R’s native DataBase Interface: a set of virtual functions declared in the DBI package. DBI backends connect R to a particular database system by implementing the methods defined in DBI and accessing DBMS-specific APIs to perform the actual query processing. A common interface is helpful for both users and backend implementers. Thanks to generous support from the R Consortium, the contract for DBI’s virtual functions is now specified in detail in their documentation, which is also linked to corresponding backend-agnostic tests in the DBItest package. This means that the compatibility of backends with the DBI specification can be verified automatically. The support from the R Consortium also made it possible to bring one existing DBI backend, RSQLite, on par with the specification; the odbc package, a DBI-compliant interface to ODBC, has been written from scratch against the specification defined by DBItest.
Among other topics, the presentation will introduce new and updated DBI methods, show the design and usage of the test suite, and describe the changes in the RSQLite implementation.
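The idea of backend-agnostic specification tests can be illustrated with a Python analogy. The sketch below is not DBItest: it uses Python's DB-API cursor interface in place of DBI, runs against the standard-library `sqlite3` module, and the check names are hypothetical. It also assumes the backend accepts `?` placeholders, which differs between Python database drivers.

```python
import sqlite3


def check_roundtrip(connect):
    """Rows written through the interface can be read back unchanged."""
    con = connect()
    try:
        cur = con.cursor()
        cur.execute("CREATE TABLE t (id INTEGER, label TEXT)")
        rows = [(1, "a"), (2, "b")]
        cur.executemany("INSERT INTO t VALUES (?, ?)", rows)
        cur.execute("SELECT id, label FROM t ORDER BY id")
        return [tuple(r) for r in cur.fetchall()] == rows
    finally:
        con.close()


def check_empty_result(connect):
    """A query on an empty table returns an empty result, not an error."""
    con = connect()
    try:
        cur = con.cursor()
        cur.execute("CREATE TABLE t (id INTEGER)")
        cur.execute("SELECT id FROM t")
        return list(cur.fetchall()) == []
    finally:
        con.close()


# The checks only see a `connect` callable, never a concrete backend.
SPEC_CHECKS = [check_roundtrip, check_empty_result]


def run_spec(connect):
    """Run every specification check against one backend."""
    return {check.__name__: check(connect) for check in SPEC_CHECKS}


report = run_spec(lambda: sqlite3.connect(":memory:"))
```

Because each check depends only on the shared interface, the same list can be run against any conforming backend by passing a different `connect` function, which is how a written specification becomes automatically verifiable.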

Speakers

## Kirill Müller

Thursday July 6, 2017 2:24pm - 2:42pm CEST
2.01 Wild Gallery

### 2:24pm CEST

Keywords: Spatial analysis, Interactive, Visualization
Webpages: https://github.com/r-spatial/mapedit, http://r-spatial.org/r/2017/01/30/mapedit_intro.html
The R ecosystem offers a powerful set of packages for geospatial analysis. For a comprehensive list see the CRAN Task View: Analysis of Spatial Data. Yet, many geospatial workflows require interactivity for smooth uninterrupted completion. This interactivity is currently restricted to viewing and visual inspection (e.g. packages leaflet and mapview) and, with very few exceptions, there is currently no way to manipulate spatial data in an interactive manner in R. One noteworthy exception is function drawExtent in the raster package which lets the user select a geographic sub-region of a given Raster* object on a static plot of the visualized layer and saves the resultant extent or subset in a new object (if desired). Such operations are standard spatial tasks and are part of all standard spatial toolboxes. With new tools, such as htmlwidgets, shiny, and crosstalk, we can now inject this useful interactivity without leaving the R environment.
Package mapedit aims to provide a set of tools for basic, yet useful manipulation of spatial objects within the R environment. More specifically, we will provide functionality to:
1. draw, edit and delete a set of new features on a blank map canvas,
2. edit and delete existing features,
3. select and query from a set of existing features,
4. edit attributes of existing features.
In this talk we will outline the conceptual and technical approach we take in mapedit to provide the above functionality and will provide a short live demonstration highlighting the use of the package.
The mapedit project is being realized with financial support from the R Consortium.

Speakers

## Tim Appelhans

Thursday July 6, 2017 2:24pm - 2:42pm CEST
4.02 Wild Gallery
Talk, Graphics

### 2:42pm CEST

Keywords: Tidyverse, dplyr, SQL, Apache Impala, Big Data
Webpages: https://CRAN.R-project.org/package=implyr
This talk introduces implyr, a new dplyr backend for Apache Impala (incubating). I compare the features and performance of implyr with those of dplyr backends for other distributed query engines, including sparklyr for Apache Spark’s Spark SQL, bigrquery for Google BigQuery, and RPresto for Presto.
Impala is a massively parallel processing query engine that enables low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS), Apache HBase, Apache Kudu, and Amazon Simple Storage Service (S3). The distributed architecture of Impala enables fast interactive queries on petabyte-scale data, but it imposes limitations on the dplyr interface. For example, row ordering of a result set must be performed in the final phase of query processing. I describe the methods used to work around this and other limitations.
Finally, I discuss broader issues regarding the DBI-compatible interfaces that dplyr requires for underlying connectivity to database sources. implyr is designed to work with any DBI-compatible interface to Impala, such as the general packages odbc and RJDBC, whereas other dplyr database backends typically rely on one particular package or mode of connectivity.

Speakers

## Ian Cook

Thursday July 6, 2017 2:42pm - 3:00pm CEST
2.01 Wild Gallery

### 2:42pm CEST

The brms package (Bürkner, in press) implements Bayesian multilevel models in R using the probabilistic programming language Stan (Carpenter, 2017). A wide range of distributions and link functions are supported, allowing users to fit linear, robust linear, binomial, Poisson, survival, response times, ordinal, quantile, zero-inflated, hurdle, and even non-linear models all in a multilevel context. Further modeling options include auto-correlation and smoothing terms, user defined dependence structures, censored data, meta-analytic standard errors, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs. In addition, model fit can easily be assessed and compared with posterior predictive checks and leave-one-out cross-validation.

Speakers

## Paul Bürkner

Thursday July 6, 2017 2:42pm - 3:00pm CEST
3.01 Wild Gallery

### 2:42pm CEST

Keywords: Data depth, Supervised classification, DD-plot, Outsiders, Visualization
Webpages: https://cran.r-project.org/package=ddalpha
Following the seminal idea of John W. Tukey, data depth is a function that measures how close an arbitrary point of the space is located to an implicitly defined center of a data cloud. Having undergone theoretical and computational developments, it is now employed in numerous applications, with classification being the most popular one. The R package ddalpha is software designed to fuse the experience of the practitioner with recent achievements in the area of data depth and depth-based classification.
ddalpha provides an implementation for exact and approximate computation of the most reasonable and widely applied notions of data depth. These can be further used in the depth-based multivariate and functional classifiers implemented in the package, where the $$DD\alpha$$-procedure is the main focus. The package is expandable with user-defined custom depth methods and separators. The implemented functions for depth visualization and the built-in benchmark procedures may also serve to provide insights into the geometry of the data and the quality of pattern recognition.
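To make the notion of data depth concrete, here is a minimal base R sketch of one simple depth notion, the Mahalanobis depth, defined as 1 / (1 + squared Mahalanobis distance to the sample mean). The helper `mahalanobis_depth()` and the simulated data are assumptions made for this illustration; the sketch does not use ddalpha's own functions and is not meant to reflect how the package computes depths.

```r
# Mahalanobis depth of each row of `x` with respect to the data cloud `data`:
# depth = 1 / (1 + squared Mahalanobis distance to the sample mean).
mahalanobis_depth <- function(x, data) {
  x <- as.matrix(x)
  data <- as.matrix(data)
  d2 <- mahalanobis(x, center = colMeans(data), cov = cov(data))
  1 / (1 + d2)
}

set.seed(1)
cloud <- matrix(rnorm(200), ncol = 2)            # simulated bivariate data cloud
centre <- matrix(colMeans(cloud), nrow = 1)      # the sample mean itself
far <- matrix(c(10, 10), nrow = 1)               # a point far from the cloud

depth_centre <- mahalanobis_depth(centre, cloud) # maximal depth at the mean
depth_far <- mahalanobis_depth(far, cloud)       # depth shrinks towards 0
```

Points deep inside the cloud get values near 1 and outlying points get values near 0, which is the ordering that depth-based classifiers build on.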

Speakers

## Pavlo Mozharovskyi

Thursday July 6, 2017 2:42pm - 3:00pm CEST
3.02 Wild Gallery

### 2:42pm CEST

Keywords: Visualisation, maps, interaction, exploration
Webpages: https://CRAN.R-project.org/package=tmap, https://github.com/mtennekes/tmap
A map tells more than a thousand coordinates. Generally, people tend to like maps, because they are appealing, recognizable, and often easy to understand. Maps are not only useful for navigation, but also to explore, analyse, and present spatial data.
The tmap package offers a powerful engine to visualize maps, both static and interactive. It is based on the Grammar of Graphics, with a syntax similar to ggplot2, but tailored to spatial data. Layers from different shapes can be stacked, map legends and attributes can be added, and small multiples can be created.
An example of a map is the following. This map consists of a choropleth of Happy Planet Index values per country and a dot map of large world cities on top. Alternatively, a choropleth can also be created with qtm(World, "HPI").

    library(tmap)
    data(World, metro)
    tm_shape(World) +
      tm_polygons("HPI", id = "name") +
      tm_text("name", size = "AREA") +
      tm_shape(metro) +
      tm_dots(id = "name") +
      tm_style_natural()

Interaction with charts and maps is no longer considered a nice extra feature of which users will say “wow, this is interactive!”. On the contrary, users expect charts and maps to be interactive, especially when published online. Also in R, interaction has become common ground, especially since the introduction of shiny and htmlwidgets. However, the increase of interactive maps does not mean the end of static maps. Newspapers, journals, and posters still rely on printed maps. To design a static thematic map that is appealing, informative, and simple is a special craft.
There are two modes in which maps can be visualized: "plot" for static plotting and "view" for interactive viewing. Users are able to switch between these modes without effort. The choropleth above is reproduced in interactive mode as follows:

    tmap_mode("view")
    last_map()

For lazy users like me, the code ttm() toggles between the two modes. The created maps can be exported to static file formats, such as pdf and png, as well as interactive html files. Maps can also be embedded in rmarkdown documents and shiny apps.

    save_tmap(filename = "map.png", width = 1920)
    save_tmap(filename = "index.html")

Visualization of spatial data is important throughout the whole process from exploration to presentation. Exploration requires short and intuitive coding. Presentation requires full control over the map layout, including color scales and map attributes. The tmap package facilitates both exploration and presentation of spatial data.

Speakers

## Martijn Tennekes

Data scientist, Statistics Netherlands (CBS)
Data visualization is my thing. Author of R packages treemap, tabplot, and tmap.

Thursday July 6, 2017 2:42pm - 3:00pm CEST
4.02 Wild Gallery
Talk, Graphics

### 2:42pm CEST

Keywords: R in education, Learning patterns, Learning styles
The world of education is changing more than ever. In the university of the 21st century, there is no room for one-way education with a summative evaluation at the end of the teaching period. Instead, there is a need for formative assessment, including frequent and individual feedback (Lindblom Ylanne and Lonka 1998). However, when the number of students is large, providing individual feedback requires a huge amount of effort and time. This effort is intensified when the subject matter taught allows for a certain flexibility to solve problems. Although data analysis can provide useful insights about learning styles and patterns of students both at the time of learning and afterwards, this abstract shows that it can also be leveraged for providing fast feedback.
This abstract presents both a procedure to provide large-scale (semi-)individual feedback by using systematic assignments, and insights into different learning styles obtained by combining information on the assignments and the final scores of the students. The case used is a course on explorative data analysis (EDA) taught to a group of about 80 first-year business engineering students at Hasselt University, covering a diverse set of topics such as data manipulation, visualization, import and tidying. During the course, students have to complete assignments at regular intervals in order to fully master the new skills, each arranged around a specific topic. These assignments come in the form of Rmarkdown files in which the students have to complete R-chunks appropriately. Each Rmarkdown file is then re-run by the education team, and the data generated for each student is used for evaluation.
Each problem which the students have to solve in these assignments is labelled by the education team with the principles it assesses. For example, in the case of visualization, it might have to do with using appropriate aesthetics, appropriate geoms, appropriate context (e.g. titles, labels), etc. By mapping these labels and the scores of the student, a precise learning profile for each student can be constructed which indicates their weaknesses and strengths (Vermunt and Vermetten 2004). By using this information, students can be clustered in different groups, which can then be addressed with tailored feedback on their progress and pointers to useful additional exercises in order to remedy those areas in which they perform less well.
In a second step, an ex post analysis can be done by combining the learning profiles created with the final grades and possibly other information such as educational background. This information can be employed to find which groups of students represent problem cases, i.e. have a high probability of failing the course. These insights can prove useful in future editions of the course, as a mechanism for rapid identification of students who might have difficulties with certain concepts. Moreover, they can be used to adapt the course, such that certain concepts which prove to be problematic are highlighted in a different or in a more comprehensive manner throughout the course (Tait and Entwistle 1996).
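The profile-and-cluster idea can be sketched in base R as follows. Everything in this block is hypothetical: the data are simulated, the column names and principle labels are invented for illustration, and hierarchical clustering with three groups is just one possible choice, not necessarily the method used in the course.

```r
# Hypothetical results: one row per student, principle and exercise.
set.seed(42)
results <- expand.grid(
  student = paste0("s", 1:12),
  principle = c("aesthetics", "geoms", "context"),
  exercise = 1:4,
  stringsAsFactors = FALSE
)
results$score <- runif(nrow(results))

# Learning profile: mean score per student (rows) and principle (columns).
profile <- tapply(results$score, list(results$student, results$principle), mean)

# Group students with similar profiles (three groups is an arbitrary choice).
groups <- cutree(hclust(dist(profile)), k = 3)

# Weakest principle per student, as a starting point for tailored feedback.
weakest <- colnames(profile)[apply(profile, 1, which.min)]
names(weakest) <- rownames(profile)
```

Combining `groups` with final grades afterwards would correspond to the ex post analysis mentioned in the abstract.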
References Lindblom Ylanne, Sari, and Kirsti Lonka. 1998. “Individual Ways of Interacting with the Learning Environment. Are They Related to Study Success?” Learning and Instruction 9 (1). Elsevier: 1–18.

Tait, Hilary, and Noel Entwistle. 1996. “Identifying Students at Risk Through Ineffective Study Strategies.” Higher Education 31 (1). Springer: 97–116.

Vermunt, Jan D, and Yvonne J Vermetten. 2004. “Patterns in Student Learning: Relationships Between Learning Strategies, Conceptions of Learning, and Learning Orientations.” Educational Psychology Review 16 (4). Springer: 359–84.

Speakers

## Gert Janssenswillen

Thursday July 6, 2017 2:42pm - 3:00pm CEST
PLENARY Wild Gallery
Talk, Education

### 3:00pm CEST

Thursday July 6, 2017 3:00pm - 3:30pm CEST
CATERING POINTS Wild Gallery
BREAK

Speakers

## Tareef Kawaf

RStudio

Thursday July 6, 2017 3:30pm - 3:45pm CEST
PLENARY Wild Gallery

### 3:45pm CEST

This era of Big Data has been a challenge for R users. First there were issues with address space restrictions in R itself. Though this problem has been (mostly) solved, major issues remain in terms of performance, generalizability, convenience and possibly more than a bit of "hype." This talk will address these questions and assess the current overall status of parallel and distributed computing in R, especially the latter. We will survey existing packages, including their strengths and weaknesses, and raise questions (if not full answers) of what should be done.

Speakers

## Norman Matloff

Slides pdf

Thursday July 6, 2017 3:45pm - 4:45pm CEST
PLENARY Wild Gallery

Speakers

## Oliver Bracht

Thursday July 6, 2017 4:45pm - 4:55pm CEST
PLENARY Wild Gallery

Speakers

## Mark Hornick

Senior Director, Oracle
Mark Hornick is the Senior Director of Product Management for the Oracle Machine Learning (OML) family of products. He leads the OML PM team and works closely with Product Development on product strategy, positioning, and evangelization. Mark has over 20 years of experience with integrating...

Thursday July 6, 2017 4:55pm - 5:00pm CEST
PLENARY Wild Gallery

Speakers

## Grace Meyer

Thursday July 6, 2017 5:00pm - 5:05pm CEST
PLENARY Wild Gallery

Speakers

## Dan Putler

Thursday July 6, 2017 5:05pm - 5:10pm CEST
PLENARY Wild Gallery

### 5:15pm CEST

Thursday July 6, 2017 5:15pm - 5:30pm CEST
CATERING POINTS Wild Gallery
BREAK

### 5:30pm CEST

Keywords: machine learning, bioinformatics, methylation data, brain tumor diagnosis
Webpages: https://www.molecularneuropathology.org/mnp
More than 100 brain tumor entities are listed in the World Health Organization (WHO) classification. Most of these are defined by morphological and histochemical criteria that may be ambiguous for some tumor entities or when the tissue material is of poor quality. This can make a histological diagnosis challenging, even for skilled neuropathologists. Molecular high-throughput technologies that can complement standard histological diagnostics have the potential to greatly enhance diagnostic accuracy. Profiling of genome-wide DNA methylation patterns, likely representing a ‘fingerprint’ of the cellular origin, is one such promising technology for tumor classification.
We have collected brain tumor DNA methylation profiles of almost 3,000 cases using the Illumina HumanMethylation450 (450k) array, covering over 90 brain tumor entities. Using this extensive dataset, we trained a Random Forest classifier which predicts brain tumor entities of diagnostic cases with high accuracy (Capper et al. 2017). 450k methylation data can also be used to generate genome-wide copy-number profiles and predict target gene methylation. We have developed an R package that includes a data analysis pipeline which takes data of the Illumina 450k array and the new EPIC array as input and automatically generates diagnostic reports containing quality control metrics, brain tumor class predictions with tumor class probability estimates, copy number profiles and target gene methylation status.
Besides sharing this R package with cooperating institutes worldwide, we also offer a web interface that allows researchers from other institutes to apply the pipeline to their own data. Practical experience from different cooperating institutes shows that application of our pipeline to Illumina methylation array data represents a cost-efficient method to greatly improve diagnostic accuracy and clinical decision making.
References Capper, David, David Jones, Martin Sill, Volker Hovestadt, Daniel Schrimpf, …, Andreas von Deimling, and Stefan Pfister. 2017. “DNA Methylation-Based Classification of Human Central Nervous System Tumors.” Submitted to Nature.

Speakers

## Martin Sill

Thursday July 6, 2017 5:30pm - 5:35pm CEST
4.01 Wild Gallery

### 5:30pm CEST

Introduction
R is a powerful statistical engine that comes with extensive libraries and implementations of innovative methodology. Tableau, on the other hand, is a popular business intelligence tool for producing graphs and charts in a fairly easy and straightforward manner. Tableau and R complement each other in areas where heavy data crunching is needed for visualization or when geospatial visualization and geodata manipulation are both needed. In this talk, two cases will be discussed to highlight the complementary merits of both tools: the dot map and the choropleth map.

Mapping
The geospatial community is familiar with a laundry list of libraries and packages to be attached before performing spatial statistics. However, the users’ contributions to the CRAN mirrors have made monumental advances in bridging academia and industry, shortening the gap between research and implementation. Both essential geographic data management and analysis can be performed readily in R. These advantages position R as a leading open-source programming language against expensive commercial software in the market.

Despite the extensive libraries and state-of-the-art algorithms available in R, visualization is not always praised as the language’s forte. For example, using the omnipotent plot command to display a kernel density estimate (KDE) in R requires analysts to be well acquainted with the location details to answer the key question: where is the hotspot? User interactivity is another hurdle to overcome. A reactive graph requires a substantial amount of code, and sometimes cumbersome JavaScript, to carry out simple customization or to overlay dots on the base map. Some may argue that communicating the results of a point pattern process with R is not an easy task: the output is simply not compelling or visually appealing enough.

Integrating R and Tableau to produce maps for spatial statistics
Even though Tableau is hardly mentioned when it comes to mapping and geospatial analysis, user-friendliness and interactivity are its selling points. The presenter has found the majority of his mapping requirements satisfied by Tableau’s functionality. The talk will detail two applications where R and Tableau can come together and greatly complement each other.

• Display a set of point objects on a map: in this example, a collection of coordinates will be projected to the standard WGS 1984 Coordinate Reference System and Tableau acts as a base map to overlay the points.

• Juxtapose and highlight regions using areal data: the task requires the analyst to map area objects, compare different regions and highlight members of the top and bottom groups according to a ranking parameter. Technically speaking, a choropleth map is needed.

Speakers

## Francis Nguyen Ngoc Hoang Long

Thursday July 6, 2017 5:30pm - 5:35pm CEST
2.02 Wild Gallery

### 5:30pm CEST

Keywords: R blogging, blogdown, GitHub, R Markdown
Webpages: https://www.rstudio.com/rviews/, https://github.com/rstudio/blogdown
Blogging about R presents its own technical challenges. The need to include sophisticated R code, Shiny applications, R Markdown documents, and interactive graphics severely taxes traditional blogging platforms such as WordPress and Typepad. We believe that the new blogdown package, which generates static websites using R Markdown and Hugo, represents the future of R blogging. In this talk, we will describe the basics of the blogdown package, and share our experiences editing and producing RStudio’s R Views blog using blogdown as the blog engine and GitHub as the platform for coordination.

Speakers

## Joseph Rickert

Thursday July 6, 2017 5:30pm - 5:35pm CEST
3.02 Wild Gallery

### 5:30pm CEST

Keywords: risk management, shiny, automation, risk modelling, reporting
The typical risk management in a small-sized bank is heavily dependent on manual processes and on using well-known spreadsheet applications even far beyond their original scope, for example as a data storage tool. This has several reasons. First, spreadsheet applications (i.e. Microsoft Excel) are well known and widely used throughout different countries and industries. Furthermore, risk management often receives data and information from all kinds of different departments and hence has to deal with diverse data systems and data structures. Nevertheless, most of those systems are able to export their information to an Excel-compatible format. Finally, the costs of investing in systems and tools that would adequately replace classical spreadsheet applications are usually too high for small-sized banks.
Nevertheless, there are some problems attached to this procedure. This talk will show those weaknesses while proposing a different solution which makes it possible to automate as much as possible and to integrate very important features into risk management. This new approach, which uses different functionalities of R, is focused on three major topics:
• Automation of reporting processes - increasing efficiency
• Present and distribute the results - Impress your stakeholders
The first part focuses on different practical examples that were done by the reporting department in Excel (e.g. Margin Call reports, exposure calculations, etc.). Those processes were done mostly manually and are therefore very error-prone. An elegant solution is to use the different functionalities in R to connect to different systems (Salesforce, Oracle, ...) and to run R scripts as a batch in the task scheduler, which frees up the analysts’ time to focus on the qualitative part of the topics. Furthermore, shiny makes it possible to standardize processes that need user inputs and helps to improve the user experience.
The second topic shows the usage of BI methods using similar methods as in the first part (data migration/connecting systems). This includes the development of a rating model for SME and Corporate clients and a Value-at-risk model for FX derivatives as an example. Both cases are fully executable R code which provides a full audit trail.
Part three focuses on the very important topic of presenting ‘results’ (i.e. reports, models, etc.) to different stakeholders. Here some newer developments in the R community like shiny and shinydashboard show their full potential. In addition to being used for rating tools, shiny can be used to show different types of reports without ‘frightening’ other stakeholders with the typical R environment.
To sum up, there is a wide field of applications for R in risk management which improve the overall performance. The talk will be accompanied by a live demo of the tools discussed.

Speakers

## Martin Hölbl

Thursday July 6, 2017 5:30pm - 5:35pm CEST
4.02 Wild Gallery

### 5:30pm CEST

Keywords: Weather, Data, Climate, Germany, FTP
Webpages: https://CRAN.R-project.org/package=rdwd, https://github.com/brry/rdwd#rdwd
Since 2014, the German weather service (Deutscher Wetterdienst, DWD) has released over 25,000 observational weather records from stations across Germany. Along with several derived and gridded datasets, they are available free of charge on the Climate Data Center FTP server.
The purpose of rdwd is to facilitate usage and analysis of weather data in Germany. The targeted user group contains many scientists who may not be very familiar with R. Therefore an emphasis was put on clear vignettes and an instructive GitHub README file. rdwd was first released in January 2017 and, with some additional features, it is now ready for broader advertisement.

Speakers

## Berry Boessenkool

Thursday July 6, 2017 5:30pm - 5:35pm CEST
2.01 Wild Gallery

### 5:30pm CEST

Keywords: cutpoints, tidy
Webpages: https://github.com/Thie1e/cutpointr
Clinicians often use cutpoints or decision-thresholds to decide e.g. whether or not a patient with a depression score of “20” needs treatment for her or his depression. The R package cutpointr allows for estimating such optimal cutpoints for binary decisions by maximizing a specified metric or by using kernel estimation or distribution-based methods (Fluss, Faraggi, and Reiser 2005; Leeflang et al. 2008). The package includes a parallelizable bootstrapping routine to provide estimates of the cutpoints’ variability and their out-of-bag performance. cutpointr follows current tidy programming practices to allow for efficient estimation and use in simulation studies as well as interplay with functions from the tidyverse. Furthermore, cutpointr accepts user defined functions and includes existing functions (López-Ratón et al. 2014) to calculate optimal cutpoints. We also discuss future plans for cutpointr, specifically a Shiny-interface to make the package more accessible.
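The idea of choosing a cutpoint by maximizing a metric can be sketched in a few lines of base R. The example below maximizes the Youden index (sensitivity + specificity - 1) over all observed scores. The helper `youden_cutpoint()` and the data are hypothetical and written only for this illustration; the sketch does not use cutpointr and leaves out its bootstrapping and kernel-based methods.

```r
# Choose the cutpoint that maximises the Youden index
# J = sensitivity + specificity - 1, classifying a case as positive
# when its score is greater than or equal to the cutpoint.
youden_cutpoint <- function(score, positive) {
  candidates <- sort(unique(score))
  j <- vapply(candidates, function(cut) {
    pred <- score >= cut
    sens <- mean(pred[positive])     # share of positives detected
    spec <- mean(!pred[!positive])   # share of negatives correctly rejected
    sens + spec - 1
  }, numeric(1))
  list(cutpoint = candidates[which.max(j)], youden = max(j))
}

# Hypothetical depression scores; TRUE marks patients who needed treatment.
score <- c(5, 8, 12, 14, 18, 19, 21, 24, 27, 30)
needs_treatment <- c(FALSE, FALSE, FALSE, FALSE, TRUE,
                     TRUE, FALSE, TRUE, TRUE, TRUE)

best <- youden_cutpoint(score, needs_treatment)
```

Because the cutpoint is selected on the same data it is evaluated on, the in-sample metric is optimistic, which is the motivation for the out-of-bag estimates mentioned in the abstract.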
References Fluss, Ronen, David Faraggi, and Benjamin Reiser. 2005. “Estimation of the Youden Index and Its Associated Cutoff Point.” Biometrical Journal 47 (4): 458–72. http://onlinelibrary.wiley.com/doi/10.1002/bimj.200410135/full.

Leeflang, Mariska MG, Karel GM Moons, Johannes B. Reitsma, and Aielko H. Zwinderman. 2008. “Bias in Sensitivity and Specificity Caused by Data-Driven Selection of Optimal Cutoff Values: Mechanisms, Magnitude, and Solutions.” Clinical Chemistry, no. 4: 729–38. http://go.galegroup.com/ps/i.do?id=GALE%7CA209106300&sid=googleScholar&v=2.1&it=r&linkaccess=fulltext&issn=00099147&p=AONE&sw=w.

López-Ratón, Mónica, María Xosé Rodríguez-Álvarez, Carmen Cadarso-Suárez, and Francisco Gude-Sampedro. 2014. “OptimalCutpoints: An R Package for Selecting Optimal Cutpoints in Diagnostic Tests.” Journal of Statistical Software 061 (i08). http://econpapers.repec.org/article/jssjstsof/v_3a061_3ai08.htm.

Speakers

## Christian Thiele

Thursday July 6, 2017 5:30pm - 5:35pm CEST
3.01 Wild Gallery

### 5:35pm CEST

Keywords: event study methodology
Webpages: https://CRAN.R-project.org/package=eventstudies, https://github.com/nipfpmf/eventstudies
This R package allows a dataset to be studied in an event-time frame and parametric/non-parametric analysis to be performed using several inference procedures. There are currently three adjustment functions and three inference strategies, including the classical event study using the market model and t-test (Brown and Warner 1985) along with an implementation of the Augmented Market Model (Patnaik and Shah 2010). The package contains a user-friendly, all-encompassing function eventstudy to conduct an event study in one line of R code, with other functions to provide more flexibility and control. It can also be used to develop novel research methods in event studies (Patnaik, Shah, and Singh 2013).
References Brown, Stephen J, and Jerold B Warner. 1985. “Using Daily Stock Returns: The Case of Event Studies.” Journal of Financial Economics 14 (1). Elsevier: 3–31.

Patnaik, Ila, and Ajay Shah. 2010. “Does the Currency Regime Shape Unhedged Currency Exposure?” Journal of International Money and Finance 29 (5): 760–69.

Patnaik, Ila, Ajay Shah, and Nirvikar Singh. 2013. “Foreign Investors Under Stress: Evidence from India.” International Finance 16 (2): 213–44.

Speakers

## Chirag Anand

Thursday July 6, 2017 5:35pm - 5:40pm CEST
4.02 Wild Gallery

### 5:35pm CEST

Keywords: Redmine, project management, API, automation, useR!
Redmine is a popular open-source project management platform with rich capabilities. Luckily it has a RESTful API with bindings for many common programming languages, alas not including R until now. We introduce redmineR, an R interface to the Redmine API, to close this gap.
As a use case we give a glimpse into useR! conference organizer’s life. The useR!2017 abstract revision process was automated using R and leveraging the capabilities of Redmine with redmineR. Abstracts submitted were extracted from a MySQL database used for the web site back-end and converted into Redmine “issues” with redmineR, automatically split into sub-projects by topics. Reviewers for each topic having access to the corresponding sub-project were automatically notified by Redmine. All communication for the review process also happened inside Redmine with reviewers communicating through comments and assignments. After the review process redmineR was used to poll all abstracts with “Accepted” status to send automatic email notifications to the lucky authors.
1. Open Analytics, Belgium

Speakers

## Maxim Nazarov

Statistical Consultant, Open Analytics

Thursday July 6, 2017 5:35pm - 5:40pm CEST
3.02 Wild Gallery

### 5:35pm CEST

Environmental seismology is the science of investigating the seismic signals that are emitted by Earth surface processes. This emerging field provides unique opportunities to identify, locate, track and inspect a wide range of the processes that shape our planet. Modern broadband seismometers are sensitive enough to detect signals from sources as weak as wind interacting with the ground and as powerful as collapsing mountains. This places the field of environmental seismology at the seams of many geoscientific disciplines and requires integration of a series of specialised analysis techniques.

R provides the perfect environment for this challenge. The package eseis uses the foundations laid by a series of existing packages and data types tailored to solve specialised problems (e.g., signal, sp, rgdal, Rcpp, matrixStats) and thus enables efficient handling of large streams of seismic data (> 300 million samples per station and day). It supports standard data formats (mseed, sac), preparation techniques (deconvolution, filtering, rotation), processing methods (spectra, spectrograms, event picking, migration for localisation) and data visualisation. Thus, eseis provides a seamless approach to the entire workflow of environmental seismology and passes the output to related analysis fields with temporal, spatial and modelling focus in R.

Speakers

## Michael Dietze

Thursday July 6, 2017 5:35pm - 5:40pm CEST
2.01 Wild Gallery

### 5:35pm CEST

Keywords: chemoinformatics, multivariate data analysis, time series data
Process analytical technology (PAT) is defined as a system for designing, analyzing, and controlling pharmaceutical manufacturing processes through timely measurements (i.e., during processing) of critical quality and performance attributes. PAT poses many data analysis challenges, such as appropriate techniques for data preprocessing, quantification and integration with external information (e.g. DoE factors). Currently available R packages in the public repositories allow for integrated analysis and implementation of analytic pipelines in the industrial setting. In this presentation we will focus on a specific application of using infrared (IR) spectroscopy technology for synthesis reaction monitoring and multivariate analysis of IR spectra by using matrix factorization techniques (e.g. principal component analysis, factor analysis for bicluster acquisition, non-negative matrix factorization, time series factor analysis and curve resolution). Unlike the supervised partial least squares technique, which is commonly used in chemometrics, this is a set of unsupervised techniques implemented in R which make it possible to extract and explore the most essential information in the IR spectra. The ability to extract this type of information reduces the task of monitoring about 600 highly correlated points per spectrum to monitoring a few independent factor scores only. For the proof of concept, the scores from several matrix factorization methods are compared to the known compound concentrations and the differences and commonalities of the different approaches are discussed.
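The reduction from about 600 correlated points per spectrum to a few scores can be illustrated with principal component analysis in base R. The spectra below are simulated under stated assumptions (two Gaussian-shaped hypothetical component bands, a linear conversion from reactant to product, small additive noise); they are not real IR data and the sketch covers only PCA, not the other factorization methods named in the abstract.

```r
# Simulated reaction monitoring: 40 spectra of 600 points, built from two
# hypothetical component spectra whose concentrations change over time.
set.seed(7)
wavenumber <- seq(1, 600)
reactant <- dnorm(wavenumber, mean = 200, sd = 30)
product <- dnorm(wavenumber, mean = 400, sd = 30)
conc_product <- seq(0, 1, length.out = 40)

spectra <- outer(1 - conc_product, reactant) +
  outer(conc_product, product) +
  matrix(rnorm(40 * 600, sd = 1e-5), nrow = 40)

# PCA on the mean-centred spectra; each spectrum (row) gets a few scores.
pca <- prcomp(spectra, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# The first score tracks the known product concentration (up to sign).
score_vs_conc <- abs(cor(pca$x[, 1], conc_product))
```

In this idealized setting a single score captures nearly all of the variation, so monitoring `pca$x[, 1]` over time replaces monitoring all 600 points.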

Speakers
TM

## Tor Maes

Thursday July 6, 2017 5:35pm - 5:40pm CEST
4.01 Wild Gallery

### 5:35pm CEST

Keywords: Data Preparation, Datetime Data, Tidyverse
padr solves two problems that you can be confronted with when preparing datetime data for analysis. First, data is often recorded at too fine a granularity. For instance, the timestamp registers the moment up to the second, while you want to do the analysis at an hourly level. The thicken function adds a column with the coarser interval to a data frame. In conjunction with dplyr, this allows for quick aggregation to the higher level. Second, when no events take place there are typically no data records generated. This is sensible from a storage perspective, but often unhelpful for analysing the data. In this context the pad function is used to insert the missing records. Besides demonstrating these two functions, I will elaborate on the concept of the interval, on which both functions heavily rely.
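The two steps can be imitated in base R to show what they do. This is only a sketch of the idea on invented data, not padr's implementation; with padr itself the corresponding calls would be thicken and pad.

```r
# Invented event log with second-level timestamps.
events <- data.frame(
  ts = as.POSIXct(c("2017-07-06 09:12:31", "2017-07-06 09:47:05",
                    "2017-07-06 11:03:59"), tz = "UTC"),
  amount = c(10, 5, 7)
)

# Step 1 ("thicken"): add a column holding the hour each record falls in,
# then aggregate to that coarser interval.
events$ts_hour <- as.POSIXct(trunc(events$ts, units = "hours"))
hourly <- aggregate(amount ~ ts_hour, data = events, FUN = sum)

# Step 2 ("pad"): insert a row for every hour without records.
all_hours <- data.frame(
  ts_hour = seq(min(hourly$ts_hour), max(hourly$ts_hour), by = "hour")
)
padded <- merge(all_hours, hourly, all.x = TRUE)
```

The hour without events (10:00) appears in `padded` with a missing `amount`, which makes the gap explicit for later analysis.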

Speakers

## Edwin Thoen

Data Scientist, Rabobank
I work at the Advanced Data Analytics department of a large bank that supports all departments of the bank with projects that make the organisation more data-driven. I am especially interested in rare-event prediction, Bayesian analysis, deep learning and a deeper understanding of...

Thursday July 6, 2017 5:35pm - 5:40pm CEST
3.01 Wild Gallery

Speakers
RN

## Robert Noble-Eddy

Thursday July 6, 2017 5:35pm - 5:40pm CEST
2.02 Wild Gallery

### 5:40pm CEST

GREEN-Rgrid is an advanced R version of the model GREEN (Grizzetti, Bouraoui, and Aloe 2012), originally developed in Fortran (Aguzzi, Gasparo, and Macconi 1987) for estimating nutrient loads from diffuse and point sources in Europe. Using R, we improved the original structure of the GREEN model and made the model fully open to scrutiny, since there is a considerable reliance on proprietary software for environmental modelling assessment in Europe (Carslaw and Ropkins 2012). The GREEN-Rgrid model works on a grid cell discretization that the user can change depending on the purpose of the modelling. The model input consists of the latest and best available global data. The GREEN-Rgrid code integrates a landscape routing model to simulate nutrient fluxes of nitrogen-nitrates, total nitrogen, total phosphorus and phosphates across discretized routing units. The grid-based approach was adopted to make use of readily available global raster data, which can be easily incorporated as model inputs and provide a more homogeneous nutrient assessment between different areas of the world. With respect to the original GREEN model, the diffuse source DS was calculated as a function of the gross nutrient balance from agricultural land, which is computed as the difference between the inputs (for N: fertilizer application, fixation, and atmospheric deposition; for P: fertilizer application, P release, P transported with sediment) and the output (crop nutrient uptake). A basin attenuation coefficient (aP) was applied to the diffuse sources, while two additional coefficients, aM and bM, are used in the calculation of the in-stream retention (Venohr et al. 2011). These parameters were calibrated in the model using a Latin Hypercube approach based on the package FME (Soetaert and Petzoldt 2010), and the best parameter set was selected by comparing the predicted and observed annual loads using the package hydroGOF (Zambrano-Bigiarini and Bellin 2010).
Other important packages used for specific purposes in the GREEN-Rgrid model include raster (Hijmans and van Etten 2012), data.table (Dowle et al. 2015) and sqldf (Grothendieck 2012). The structure of the model is organized in three folders: the inputs folder, where the input table and grid raster cell are stored; the Scripts folder, which includes the master R file and the functions; and the outputs folder, where all generated tables and figures are saved. We applied the GREEN-Rgrid model in the Mediterranean area (about 8.066 × 10³ km²) using a grid cell size of 5 arc-minute resolution (9.2 km at the equator) and three simulation years (2005, 2006 and 2007). We show the predicted fluxes and concentrations of nutrients in gauged and ungauged grid cells, using automatically generated plots and raster maps. Finally, some considerations are given for future developments.
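The calibration step described above (Latin hypercube sampling of parameters, then selection of the best set by comparing predicted and observed loads) can be sketched in base R. Note the assumptions: `toy_model`, its parameters `rate` and `scale`, and the "observed" loads are purely hypothetical stand-ins and are not the GREEN equations or the aP, aM and bM coefficients; the sampling and the Nash-Sutcliffe efficiency are written by hand here rather than taken from FME or hydroGOF.

```r
set.seed(2017)

# Latin hypercube sample: one draw per equal-width stratum for every parameter.
latin_hypercube <- function(n, lower, upper) {
  unit <- sapply(seq_along(lower), function(j) (sample(n) - runif(n)) / n)
  scaled <- sweep(sweep(unit, 2, upper - lower, `*`), 2, lower, `+`)
  colnames(scaled) <- names(lower)
  scaled
}

# Hypothetical stand-in for a load model (NOT the GREEN formulation).
toy_model <- function(par, distance) {
  par[["scale"]] * exp(-par[["rate"]] * distance)
}

# Nash-Sutcliffe efficiency: 1 is a perfect fit.
nse <- function(sim, obs) 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)

# Invented "observed" annual loads generated with rate = 0.3 and scale = 100.
distance <- 1:10
observed <- 100 * exp(-0.3 * distance)

lower <- c(rate = 0, scale = 50)
upper <- c(rate = 1, scale = 150)
candidates <- latin_hypercube(200, lower, upper)

# Score every sampled parameter set and keep the best one.
scores <- apply(candidates, 1, function(p) nse(toy_model(p, distance), observed))
best <- candidates[which.max(scores), ]
```

Because every stratum of each parameter range is sampled exactly once, 200 runs cover the parameter space more evenly than 200 independent random draws would.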

Speakers
AM

## Anna Malago

Thursday July 6, 2017 5:40pm - 5:45pm CEST
2.01 Wild Gallery

### 5:40pm CEST

Keywords: Machine Learning, Ensemble Learning, Automatic Machine Learning, Black-Box Optimization, Distributed Computing
Webpages: https://CRAN.R-project.org/package=h2o, https://github.com/h2oai/h2o-3
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. The first steps toward simplifying machine learning in R focused on developing simple, unified interfaces to a variety of machine learning algorithms. This effort also involved providing a robust toolkit of utility functions that perform common tasks in machine learning such as random data partitioning, cross-validation and model evaluation. Successful examples of this simplification effort include the caret, mlr and h2o R packages.
Although these tools have made it easier for non-experts to experiment with machine learning, there is still a fair bit of knowledge and background in data science that is required to produce high-performing, production-ready/research-grade machine learning models. Deep Neural Networks in particular (which have become wildly popular in the past five years) are notoriously difficult for a non-expert to tune properly. In order for machine learning software to truly be accessible to non-experts, such systems must be able to automatically perform proper data pre-processing steps and return a highly optimized machine learning model.
H2O.ai has developed a distributed Automatic Machine Learning system called H2O AutoML (to be officially released in the h2o R package (H2O.ai 2017) approx. May-June 2017; currently in pre-release), which will be the first open source Automatic Machine Learning system available in R. In this presentation, we will present our methodology for automating the machine learning workflow, which includes feature pre-processing and automatic training and tuning of many models within a user-specified time limit. The user can also specify which model performance metric they’d like to optimize and use a metric-based stopping criterion for the AutoML process rather than a specific time constraint. By default, stacked ensembles will be automatically trained on a subset of the individual models to produce a highly predictive ensemble model, although this can be turned off if the user prefers to return singleton models only.
The interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time-constraint. Below is an example of how to specify an AutoML run for the default run-time.
aml <- h2o.automl(training_frame = train, response_column = "class")
The AutoML object includes a history of all the data-processing and modeling steps that were taken, and will return a “leaderboard” of all the models that were trained in the process, ranked by a user’s model performance metric of choice.
References
H2O.ai. 2017. H2O R Package. https://github.com/h2oai/h2o-3/tree/master/h2o-r.

Speakers
EL

## Erin LeDell

Thursday July 6, 2017 5:40pm - 5:45pm CEST
4.02 Wild Gallery

### 5:40pm CEST

Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, a team at the rOpenSci unconference created the "miner" package. Kids can use this package to interact with the Minecraft world with simple R commands, and learn to use R with the help of a companion book.
Keywords: R, Minecraft, education

Speakers

## David Smith

Ask me about R at Microsoft, the R Consortium, or the Revolutions blog.

Thursday July 6, 2017 5:40pm - 5:45pm CEST
3.01 Wild Gallery

### 5:40pm CEST

The determination of microbial mutation rates in the laboratory is a routine yet computationally challenging task in biological research. The experimentalist conducts experiments in accord with the classic Luria-Delbrück protocol [1] (aka the fluctuation assay protocol). But the resulting fluctuation assay data can pose a formidable challenge, not only to bench biologists, but also to bioinformaticians unfamiliar with the biological and statistical subtleties inherent in the fluctuation assay protocol. Due to the increasing popularity of the fluctuation experiment in recent biological research, more and more bench scientists are eager to analyze their fluctuation assay data by themselves. Some understandably expect the analyses to be as simple as calculating the sample mean using a pocket calculator. The popular web tool FALCOR [2] almost fulfilled this dream. Unknown to most practitioners, this web tool has important limitations, which can lead an unwary user to faulty conclusions [3]. For example, the comparison of microbial mutation rates is beyond the capabilities of this web tool. The R package rSalvador (available at http://eeeeeric.github.io/rSalvador) makes accessible to bench scientists a myriad of latest computational methods that are necessary for proper analysis of fluctuation assay data. rSalvador allows the user to properly account for relative fitness and plating efficiency. In particular, rSalvador is the only tool that affords strictly likelihood-based methods for the comparison of microbial mutation rates. This presentation will discuss the role of rSalvador in biological education, in mutation research, and in the development of new algorithms for fluctuation assay data.

Speakers
QZ

## Qi Zheng

Thursday July 6, 2017 5:40pm - 5:45pm CEST
4.01 Wild Gallery

### 5:40pm CEST

Keywords: Naming Conventions, R
Coming from another programming language one quickly notes that there are many different naming conventions in use in the R community. Looking through packages published on CRAN one will find that functions and variables most often are either period.separated or underscore_separated, or written in lowerCamelCase or UpperCamelCase. In 2012 we did a survey of the popularity of different naming conventions used in all the packages on CRAN (Bååth, 2012), but a lot has happened since then! Since 2012 CRAN has more than doubled from 4000 packages to now over 10,000 packages, and we have also seen the rising popularity of the tidyverse packages that often follow the underscore_separated naming convention.
In this presentation we will show you the current state of naming conventions used in the R community, look at what has happened since 2012, and discuss what the current trend is.
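To make the conventions concrete, here is a minimal sketch of how a name could be assigned to one of them with regular expressions. The function `classify_name` and its patterns are assumptions written for illustration, not the method used in the survey; in particular, a name is assigned to the first pattern it matches, whereas a real survey may count a name under several conventions.

```r
# Hypothetical classifier: returns the first convention a name matches.
classify_name <- function(name) {
  if (grepl("^[a-z0-9]+$", name)) {
    "alllowercase"
  } else if (grepl("^[a-z0-9]+(\\.[a-z0-9]+)+$", name)) {
    "period.separated"
  } else if (grepl("^[a-z0-9]+(_[a-z0-9]+)+$", name)) {
    "underscore_separated"
  } else if (grepl("^[a-z]+([A-Z][a-z0-9]*)+$", name)) {
    "lowerCamelCase"
  } else if (grepl("^([A-Z][a-z0-9]*)+$", name)) {
    "UpperCamelCase"
  } else {
    "other"
  }
}

examples <- c("read.csv", "read_csv", "colMeans", "Reduce", "mean")
conventions <- vapply(examples, classify_name, character(1))
```

Applying the same function to all exported names of a set of packages, followed by `table()`, would give a rough popularity count per convention.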
References
Bååth, R. (2012). The state of naming conventions in R. The R Journal, 4(2), 74-75. https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf

Speakers
RB

## Rasmus Bååth

Thursday July 6, 2017 5:40pm - 5:45pm CEST
3.02 Wild Gallery

### 5:40pm CEST

Keywords: reporting, reproducibility, automation, Rmarkdown
Have you ever had the feeling that the creation of your data analysis report(s) resulted in quite some copy-paste from previous analyses? This copy-pasting is time-consuming and prone to errors.
If you frequently need to analyze quite similar data, e.g. from a standardized workflow or different experiments on a specific platform, automation of your analysis can be a useful and time-saving step.
An efficient solution might be the development of modular template documents integrated in an R ‘template’ package. This package contains the common analysis parts that are consistent throughout the different analyses, in different (potentially nested) child template documents (modules). These templates can be seen as the reporting equivalent of an R function integrated within an R package, with input parameters and potentially some default values (necessary/specific analysis parts).
A main ‘master’ template, specific for each analysis (e.g. experiment) can call the child document(s) contained in the package. It is advisable and user-friendly, for yourself and potential other users, to create a start template for this master document, where the required and optional input parameters necessary for the downstream analysis are specified.
For the developers, the use of the R package as a unit containing all functionalities and part of the workflow can facilitate code exchange and reduce errors during code writing. During package use and development you might encounter possible extensions - depending on specific requests of the users - which can be implemented and easily incorporated in previous as well as future reports. Although the development of such an R ‘template’ package might seem time-consuming at first, a lot of time is gained when using this package afterwards, making this effort worthwhile. For the users, reports are consistent across different analyses and appropriate package versioning - to keep track of changes, extensions and bug fixes - ensures the reproducibility of an entire analysis.
The knitr package can be used for the creation and successive integration of child templates.
A shiny app can be created to provide a user-friendly and easy way of creating reports without needing to be familiar with R.
An example of the implementation and use of such a workflow in rmarkdown format will be presented.
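The idea of a parameterised child template can be sketched with knitr::knit_expand, which fills `{{ }}` placeholders in a template text. The template contents, the parameter names `variable` and `dataset`, and the choice of mtcars are assumptions for illustration; this is not the speaker's package, and a real template package would ship such modules as files.

```r
library(knitr)

# Three backticks, built programmatically to keep this example easy to embed.
fence <- strrep("`", 3)

# Hypothetical child template (module) with two input parameters.
child_template <- c(
  "## Summary of {{variable}}",
  "",
  paste0(fence, "{r summary-{{variable}}}"),
  "summary({{dataset}}[['{{variable}}']])",
  fence
)

# The 'master' document fills the module once per variable of interest.
variables <- c("mpg", "hp")
sections <- vapply(
  variables,
  function(v) knit_expand(text = child_template, variable = v, dataset = "mtcars"),
  character(1)
)

# Combined R Markdown source, ready to be knitted as part of a report.
report_body <- paste(sections, collapse = "\n\n")
```

Inside a master R Markdown document, the expanded text could then be rendered in place, for example with `knitr::knit_child(text = report_body)` in a chunk that uses `results = "asis"`.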

Speakers
KV

## Kirsten Van Hoorde

Thursday July 6, 2017 5:40pm - 5:45pm CEST
2.02 Wild Gallery

### 5:45pm CEST

Keywords: Spaced Repetition Learning, r2anki, RMarkdown
Webpages: https://github.com/henningsway/r2anki
When you learn and use R you need to memorize important commands to solve programming tasks effectively. Unfortunately, some less frequently used function calls can be forgotten quite easily as you learn more about the language.
Spaced repetition learning offers a solution to this problem by exposing you only to learning content that you are about to forget. The open source software Anki offers a fantastic cross-platform implementation of this approach.
The lightning talk will briefly introduce the idea of spaced repetition learning and show how the r2anki package can be used to easily convert RMarkdown scripts into a set of Anki flashcards that can be shared among the community.

Speakers

## Henning Bumann

Junior Data Scientist, Iqast.de

Thursday July 6, 2017 5:45pm - 5:50pm CEST
3.02 Wild Gallery

### 5:45pm CEST

Keywords: Signal processing, signal segmentation, predictive modeling
Digital signal processing (DSP) is widely done using languages like MATLAB and C++. R is less well known for its support in this domain. However, R provides strong support for DSP work.
In this talk, we will demonstrate the solution to a standard signal processing and segmentation problem using R. In this problem we take an input stream of data, such as a time series, and analyse its power content using short-time Fourier transform methods. The power content in the frequency range of interest is then smoothed to attenuate short-term variations. The smoothed power can then be suitably thresholded to find the segment of the signal that corresponds to the region of interest.
The implementation will be done purely in R using packages such as signal, zoo and e1071. Visualisation of the processed data at intermediate stages is also useful, and this can be done very efficiently using packages such as ggplot2.
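The pipeline described above (short-time spectra, band power, smoothing, thresholding) can be sketched in base R alone. The signal, sampling rate, window length, frequency band and threshold are all assumptions chosen for illustration; the talk itself relies on packages such as signal and zoo, which this sketch does not use.

```r
set.seed(1)

# Simulated input: noise with a 50 Hz burst between 2 s and 3 s (invented data).
fs <- 500                                   # sampling rate in Hz
t  <- seq(0, 5 - 1 / fs, by = 1 / fs)
x  <- rnorm(length(t), sd = 0.2)
burst <- t >= 2 & t < 3
x[burst] <- x[burst] + sin(2 * pi * 50 * t[burst])

# Short-time Fourier analysis: Hann-windowed frames with 50% overlap.
win  <- 128
hop  <- 64
hann <- 0.5 - 0.5 * cos(2 * pi * (0:(win - 1)) / (win - 1))
starts <- seq(1, length(x) - win + 1, by = hop)
freqs  <- (0:(win / 2)) * fs / win
band   <- freqs >= 40 & freqs <= 60         # frequency range of interest

# Power in the band of interest for every frame.
band_power <- vapply(starts, function(s) {
  spec <- fft(x[s:(s + win - 1)] * hann)
  sum(Mod(spec[seq_len(win / 2 + 1)][band])^2)
}, numeric(1))

# Smooth with a 3-frame moving average; keep raw values at the two edges.
smoothed <- as.numeric(stats::filter(band_power, rep(1 / 3, 3), sides = 2))
smoothed[is.na(smoothed)] <- band_power[is.na(smoothed)]

# Threshold at half the maximum and report the segment boundaries in seconds.
active <- smoothed > 0.5 * max(smoothed)
frame_time <- (starts - 1 + win / 2) / fs
segment <- range(frame_time[active])
```

The time resolution of the detected boundaries is limited by the hop size (here 64 samples, about 0.13 s), which is the usual trade-off between time and frequency resolution in short-time analysis.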

Speakers
MI

## MUNSHI IMRAN HOSSAIN

Thursday July 6, 2017 5:45pm - 5:50pm CEST
3.01 Wild Gallery

### 5:45pm CEST

Keywords: ggplot2, shiny, graphics
Webpages: https://analytics.huma-num.fr/Lise.Vaudor/graphiT/
The package graphiT is based on both packages shiny and ggplot2 and provides a user-friendly interface that helps users produce statistical graphics.
It is also a pedagogical tool that helps with teaching the principles and use of the ggplot2 package. Indeed, besides the graphic itself, it also provides the R command lines that would generate it, based on user input. Hence, graphiT is not only intended for users who are R and/or ggplot2 newbies, but also for users who need a quick tool or reminder of the ggplot2 commands.
graphiT can be used online at the webpage listed above and is also available as a GitHub repository.

Speakers
LV

## Lise Vaudor

Thursday July 6, 2017 5:45pm - 5:50pm CEST
2.02 Wild Gallery

### 5:45pm CEST

Keywords: R, Plasmids, Visualization, WGS, AMR
Webpages: https://cran.r-project.org/web/packages/Plasmidprofiler/index.html, http://biorxiv.org/content/early/2017/03/28/121350.1
Summary: Comparative analysis of bacterial plasmids from short read whole genome sequencing (WGS) data is challenging. This is due to the difficulty in identifying contigs harbouring plasmid sequence data, and further difficulty in assembling such contigs into a full plasmid. As such, few software programs and bioinformatics pipelines exist to perform comprehensive comparative analyses of plasmids within and amongst sequenced isolates. To address this gap, we have developed Plasmidprofiler, a pipeline that uses Galaxy and R to perform comparative plasmid content analysis without the need for de novo assembly. The pipeline is designed to rapidly identify plasmid sequences by mapping reads to a plasmid reference sequence database. Predicted plasmid sequences are then annotated with their incompatibility group, if known. The pipeline allows users to query plasmids for genes or regions of interest and visualize results as an interactive heat map.
Availability and Implementation: Plasmid Profiler is freely available software released under the Apache 2.0 open source software license.
A stand-alone version of the entire Plasmid Profiler pipeline is available as a Docker container at
https://hub.docker.com/r/phacnml/plasmidprofiler_0_1_6/
The conda recipe for the Plasmidprofiler R package is available at
https://anaconda.org/bioconda/r-plasmidprofiler
The Plasmidprofiler R package is also available as a CRAN package at
https://cran.r-project.org/web/packages/Plasmidprofiler/index.html
Galaxy tools associated with the pipeline are available as a Galaxy tool suite at
https://toolshed.g2.bx.psu.edu/repository?repository_id=55e082200d16a504
The source code is available at
https://github.com/phac-nml/plasmidprofiler
The Galaxy implementation is available at
https://github.com/phac-nml/plasmidprofiler-galaxy.

Speakers
AZ

Thursday July 6, 2017 5:45pm - 5:50pm CEST
4.01 Wild Gallery

### 5:45pm CEST

Motivation: Initiated by recently graduated statisticians, a push for R is noticeable. Nice development tools and, even more so, shiny web-based reports attract interest from management.

Implementation: From a company perspective some technical foundation for R has to be provided beyond users’ desktop systems, for instance dedicated server(s) and maybe some High Performance Cluster, all of which have to interact nicely with each other.

Maintenance: The R ecosystem evolves at a good pace. Deciding when to centrally install which R version and which packages can raise a number of issues.

Infrastructure: Besides R and shiny servers, we found that web portals were generally appreciated by users. For R code development a central git repository server seems indispensable nowadays, ideally with facilities for continuous integration and a server for locally developed packages.

Community: In a company spread out across many time zones, electronic support for discussions about R helps a lot to keep the ball rolling and advance usage. This is no substitute for local meetings, which can easily span department boundaries.
Similarly training in R benefits from a two-pronged strategy:
• electronic, via self-paced online courses
• local classroom, for dedicated topics
• peer communication

Speakers
RK

## Reinhold Koch

Thursday July 6, 2017 5:45pm - 5:50pm CEST
4.02 Wild Gallery

### 5:45pm CEST

Keywords: reproducible research, reproducible reporting, groundwater hydrology, groundwater modelling
Webpages: https://rogiersbart.github.io/RMODFLOW/, https://rogiersbart.github.io/RMT3DMS/
Recently there have been different calls for reproducibility in computational hydrology (e.g. Hutton et al. 2016; Fienen and Bakker 2016; Skaggs, Young, and Vrugt 2015). The use of open-source languages like R and Python, and collaborative coding tools like Git, may offer a solution (Fienen and Bakker 2016), but only in combination with literate programming (Knuth 1984) can the full potential for reproducible research be reached. With tools like utils::Sweave (Leisch 2002) and knitr (Xie 2015), R has been at the forefront of reproducible research in the last few years, and provides a very interesting environment for reproducible research in computational hydrology.
The Environmetrics task view provides a list of different contributed packages relating to surface water hydrology and soil science, but the number of packages dealing with subsurface hydrology remains limited to date. There are packages for creating specific types of plots, like hydrogeo which provides Piper diagram (Piper 1944) plotting, or packages for very specific purposes like quarrint or kwb.hantush.
In order to bring the potential of R to computational subsurface hydrology, in the last few years I have been compiling the RMODFLOW and RMT3DMS packages. These provide interfaces with two of the most-widely used open source codes for groundwater flow and contaminant transport modelling: MODFLOW (Harbaugh 2005) and MT3DMS (Zheng and Wang 1999). Different model input and output file reading functions have been implemented, and different pre- and post-processing tools are available. For visualization of the model data, S3 methods making use of ggplot2 were implemented as well. The current capabilities of the packages will be demonstrated and examples of reproducible workflows will be provided.
References
Fienen, Michael N., and Mark Bakker. 2016. “HESS opinions: Repeatable research: What hydrologists can learn from the Duke cancer research scandal.” Hydrology and Earth System Sciences 20 (9): 3739–43. doi:10.5194/hess-20-3739-2016.

Harbaugh, Arlen W. 2005. MODFLOW-2005, the U.S. Geological Survey Modular Ground-Water Model: The Ground-Water Flow Process.

Hutton, C, T Wagener, J Freer, D Han, C Duffy, and B Arheimer. 2016. “Most computational hydrology is not reproducible, so is it really science?” Water Resources Research 52: 7548–55. doi:10.1002/2016WR019285.

Knuth, Donald E. 1984. “Literate Programming.” Comput. J. 27 (2). Oxford, UK: Oxford University Press: 97–111. doi:10.1093/comjnl/27.2.97.

Leisch, Friedrich. 2002. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis.” In Compstat: Proceedings in Computational Statistics, edited by Wolfgang Härdle and Bernd Rönz, 575–80. Heidelberg: Physica-Verlag HD. doi:10.1007/978-3-642-57489-4_89.

Piper, Arthur M. 1944. “A Graphic Procedure in the Geochemical Interpretation of Water-Analyses.” Eos, Transactions American Geophysical Union 25 (6): 914–28. doi:10.1029/TR025i006p00914.

Skaggs, T.H., M.H. Young, and J.A. Vrugt. 2015. “Reproducible Research in Vadose Zone Sciences.” Vadose Zone Journal 14 (10): 0. doi:10.2136/vzj2015.06.0088.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://yihui.name/knitr/.

Zheng, Chunmiao, and Patrick Wang. 1999. “MT3DMS, A modular three-dimensional multi-species transport model for simulation of advection, dispersion and chemical reactions of contaminants in groundwater systems; documentation and user’s guide.” U.S. Army Engineer Research and Development Center Contract Report SERDP-99-1, Vicksburg, MS, 202+.

Speakers

## Bart Rogiers

SCK•CEN ǀ Belgian Nuclear Research Centre

Thursday July 6, 2017 5:45pm - 5:50pm CEST
2.01 Wild Gallery

### 5:50pm CEST

Keywords: cluster heatmap, interactive visualization, ggplot2, plotly, shiny
Webpages: heatmaply, shinyHeatmaply
A cluster heatmap is a popular graphical method for visualizing high dimensional data, in which a table of numbers is encoded as a grid of colored cells (Wilkinson and Friendly 2009; Weinstein 2008). The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms and extra columns of categorical annotation. Heatmaps are used in many fields for visualizing observations, correlations, and missing value patterns. There are many R packages and functions for creating static heatmap figures (the most famous one is probably gplots::heatmap.2).
The heatmaply R package allows the creation of interactive cluster heatmaps, enabling tooltip hover text and zoom-in capabilities (from either the grid or the dendrograms), while supporting sidebar annotation. The package brings together many well known packages such as ggplot2 (Wickham 2016), plotly, viridis, seriation (Hahsler, Hornik, and Buchta 2008), dendextend (Galili 2015), and others. Also, it is now supported by the shinyHeatmaply shiny app.
You can play with a simple interactive example by running:
install.packages('heatmaply'); library('heatmaply')
heatmaply(percentize(mtcars), k_row = 4, k_col = 2, margins = c(40,120,40,20))
This talk will provide an overview of design principles for creating a useful, and beautiful, cluster heatmap. Attention will be given to data preprocessing, choosing a color palette, and careful dendrogram creation.
This work was made possible thanks to the essential contribution of Jonathan Sidi, Alan O’Callaghan, Carson Sievert, and Yoav Benjamini, as well as the joint work of Joe Cheng and myself on the d3heatmap package (which laid the foundation for heatmaply). The speaker is the creator of the R packages installr, dendextend, and heatmaply, and blogs at: www.r-statistics.com.
References
Galili, Tal. 2015. “Dendextend: An R Package for Visualizing, Adjusting and Comparing Trees of Hierarchical Clustering.” Bioinformatics. Oxford Univ Press, btv428.

Hahsler, Michael, Kurt Hornik, and Christian Buchta. 2008. “Getting Things in Order: An Introduction to the R Package Seriation.” Journal of Statistical Software 25 (3). American Statistical Association: 1–34.

Weinstein, John N. 2008. “A Postgenomic Visual Icon.” Science 319 (5871). American Association for the Advancement of Science: 1772–3.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.

Wilkinson, Leland, and Michael Friendly. 2009. “The History of the Cluster Heat Map.” The American Statistician 63 (2). Taylor & Francis: 179–84.

Speakers
TG

## Tal Galili

Thursday July 6, 2017 5:50pm - 5:55pm CEST
2.02 Wild Gallery

### 5:50pm CEST

Keywords: multiomics, drug screen, blood cancer, personalized medicine, shiny
Webpages: https://github.com/lujunyan1118/DrugScreenExplorer
Better tools for response prediction would improve the quality of cancer care. To gain further insight into the pathogenesis of blood cancers as well as to understand determinants of drug response, we measured the sensitivity of primary tumor samples from a large cohort of leukemia/lymphoma patients to marketed drugs and chemical probes. Alongside, genome, transcriptome, DNA methylome and metabolome data were obtained for the same set of patient samples, providing a valuable multidimensional resource for blood cancer study.
To facilitate the query and analysis of our dataset, we have created an R and Shiny based online platform, DrugScreenExplorer. This platform incorporates various tools for quality assessment, data visualization, exploratory data analysis and association testing. For example, the drug screening quality can be readily examined by interactive heatmap plots of the screening plates, and outlier samples and drugs can be detected by unsupervised methods such as principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Moreover, associations among different omics datasets can be analyzed and visualized within this platform, facilitating hypothesis generation and subsequent experimental validation.
These handy tools enable us to achieve seamless and efficient collaboration between dry lab and wet lab groups and to extract useful information from our multi-layer dataset in order to gain insight into the complexity of drug response and genotype-phenotype relationships in cancer. Currently, this Shiny platform is customized for our in-house data. But with further extensions, such as allowing users to upload their own data, it can be used as a general-purpose tool to streamline the pre-processing, quality control, data visualization and reporting for other drug screening projects as well.
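As a minimal sketch of PCA-based outlier detection for a drug screen, the base R code below flags samples whose first principal component score lies far from the bulk. The data are simulated, and the matrix layout, the planted outlier and the cut-off of 5 robust deviations are assumptions for illustration; this is not the DrugScreenExplorer implementation.

```r
set.seed(123)

# Simulated screen: 30 patient samples x 20 drugs (invented viability scores).
n_samples <- 30
n_drugs   <- 20
viability <- matrix(rnorm(n_samples * n_drugs), n_samples, n_drugs)

# Plant one aberrant sample, e.g. a plate that failed quality control.
viability[7, ] <- viability[7, ] + 6

# PCA on the centered data; the first component captures the dominant deviation.
pca <- prcomp(viability, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# Robust z-score: distance from the median in units of the MAD.
robust_z <- abs(pc1 - median(pc1)) / mad(pc1)
outliers <- which(robust_z > 5)
```

In an interactive setting, the same scores would typically be shown as a scatter plot of the first two components so that flagged samples can be inspected before they are excluded.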

Speakers

## Junyan Lu

Thursday July 6, 2017 5:50pm - 5:55pm CEST
4.01 Wild Gallery

### 5:50pm CEST

Keywords: Hierarchical likelihood, Hierarchical Generalized Linear Models, Statistical Modelling
Webpages: https://CRAN.R-project.org/package=hglm, https://CRAN.R-project.org/package=dhglm, https://CRAN.R-project.org/package=mdhglm
Since their introduction, hierarchical generalized linear models (HGLMs) have proven useful in various fields by allowing random effects in regression models. Interest in the topic has grown, and various practical analytical tools have been developed. We have summarized developments within the field and, using data examples, show how to analyse various kinds of data using R. The work is currently being published as a monograph. It provides a likelihood approach to advanced statistical modelling including generalized linear models with random effects, survival analysis and frailty models, multivariate HGLMs, factor and structural equation models, robust modelling of random effects, models including penalty and variable selection and hypothesis testing. This example-driven book is aimed primarily at researchers and graduate students, who wish to perform data modelling beyond the frequentist framework, and especially for those searching for a bridge between Bayesian and frequentist statistics.

Speakers

## Lars Rönnegård

Professor in Statistics, Dalarna University
I have recently coauthored the book "Data Analysis Using Hierarchical Generalized Linear Models with R".

Thursday July 6, 2017 5:50pm - 5:55pm CEST
3.01 Wild Gallery

### 5:50pm CEST

Keywords: Community Project, Infrastructure, Open Data, Open Science, R packages
Webpage: https://ropengov.github.io
Dedicated developer communities provide essential resources for the R ecosystem in the form of software packages, documentation, case studies, web applications, and other material. In 2010, we initiated the rOpenGov project (Lahti et al. 2013) to develop open source tools for open government data, computational social sciences, and digital humanities. We are a community of independent R package developers working on these topics in both the public and private sectors. Whereas the overall rOpenGov infrastructure is maintained by a core team, a number of independent authors have contributed projects and blog posts, supporting the overall objectives of this community-driven project. The main focus of our community is on knowledge sharing and peer support, and this has led to the release of several CRAN packages, including for instance the mature eurostat (Lahti et al. 2017), pxweb, and gisfin packages, and altogether over 20 projects at various stages of development and thousands of downloads per month. We welcome new contributions to the rOpenGov blog and the package collection, and participation on our online communication channels. In exchange, we aim to support the online community, provide feedback and support for package developers and initiate collaborations. Further details, a full list of contributors, and up-to-date contact information are provided at the project website.
References
Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2017. “Eurostat R Package.” R Journal. Accepted for Publication. http://ropengov.github.io/eurostat.

Lahti, Leo, Juuso Parkkinen, Joona Lehtomäki, and Markus Kainu. 2013. “rOpenGov: open source ecosystem for computational social sciences and digital humanities.” Presentation at Int’l Conf. on Machine Learning - Open Source Software workshop ICML/MLOSS). http://ropengov.github.io.

Speakers

## Leo Lahti

Computational science, molecular systems biology, and digital humanities

Thursday July 6, 2017 5:50pm - 5:55pm CEST
3.02 Wild Gallery

### 5:50pm CEST

Keywords: package, reproducibility, science, quality control, personal monitoring
Webpages: http://www.masalmon.eu/rtimicropem/
RTI MicroPEM is a small particulate matter personal exposure monitor, increasingly used in developed and developing countries. Each measurement session produces a csv file which includes a header with information on instrument settings and a table of thousands of observations of time-varying variables such as particulate matter concentration and relative humidity. Files need to be processed for 1) generating a format suitable for further analysis and 2) cleaning the data to deal with the instrument's shortcomings. Currently, this is not done in a harmonized and transparent way. Our package pre-processes the data and converts them into a format that allows integration with the rich set of data manipulation and visualization functionalities that the tidyverse provides.
We made our software open-source for better reproducibility, easier involvement of new contributors and free use, particularly in developing countries. We applied the package in a research project for a large number of measurements. The functionalities of our package are three-fold: allowing conversion of files, empowering easy data quality checks, and supporting reproducible data cleaning through documentation of current workflows.
For inspection of individual files, the package has an R6 class where each object represents one MicroPEM file, with summary and plot methods including interactivity thanks to rbokeh. The package also contains a Shiny app for exploration by non-experienced R users. The Shiny app includes a tab with tuneable alarms, e.g. “Nephelometer slope was not 3”, which empowered rapid checks after a day in the field. For later stages of a study, after a number of files have been collected, the package supports the creation of a measurements data.frame and a settings data.frame from all files in a directory. We exemplify data cleaning processes, in particular the framework used for the CHAI project, in a vignette, in a transparency effort.
The package is currently available on GitHub. Since air pollution sensors that output csvy (csv files with yaml frontmatter) instead of weird csv and produce ready-to-use data are currently unavailable, rtimicropem can be an example of how to use an R package as a central place for best practices, thus fostering reproducibility and harmonization of data cleaning across studies. We also hope it can trigger more use of R in the fields of epidemiology and exposure science.
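As a rough illustration of the conversion step, the sketch below splits a file made of a settings header followed by a table of observations into two data frames, using base R only. The file layout, the field names and the `parse_session()` helper are invented for this example; they are not the rtimicropem interface or the real MicroPEM format.

```r
# Hypothetical layout: "key,value" settings lines, a blank line,
# then an ordinary csv table. The real MicroPEM format differs.
raw <- c(
  "Serial,MP0001",
  "Nephelometer slope,3",
  "",
  "time,pm25,rh",
  "2017-07-06 10:00:00,12.5,40",
  "2017-07-06 10:00:10,13.1,41",
  "2017-07-06 10:00:20,NA,41"
)

parse_session <- function(lines) {
  blank <- which(lines == "")[1]            # first blank line separates the parts
  header <- lines[seq_len(blank - 1)]
  body <- lines[(blank + 1):length(lines)]

  # Settings: one row per "key,value" pair
  parts <- strsplit(header, ",", fixed = TRUE)
  settings <- data.frame(
    setting = vapply(parts, `[`, character(1), 1),
    value = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )

  # Measurements: ordinary csv, with the time column parsed
  measurements <- read.csv(text = paste(body, collapse = "\n"),
                           stringsAsFactors = FALSE)
  measurements$time <- as.POSIXct(measurements$time, tz = "UTC")

  list(settings = settings, measurements = measurements)
}

session <- parse_session(raw)

# A simple alarm in the spirit of the checks described above
slope <- session$settings$value[session$settings$setting == "Nephelometer slope"]
slope_ok <- identical(slope, "3")
```

Keeping settings and measurements in separate data frames means that many sessions can later be stacked row-wise, one table of each kind per study.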

Speakers

## Maëlle Salmon

Thursday July 6, 2017 5:50pm - 5:55pm CEST
2.01 Wild Gallery

Speakers
MS

## Mariana Simons

Thursday July 6, 2017 5:50pm - 5:55pm CEST
4.02 Wild Gallery

### 5:55pm CEST

Keywords: maps, package, sustainability, community
Webpages: https://CRAN.R-project.org/package=rnaturalearth, https://github.com/ropenscilabs/rnaturalearth
rnaturalearth is a new R package, accepted to CRAN in March this year. It makes Natural Earth map data, a free and open resource, more easily accessible to R users. It aims for a simple, reproducible and sustainable workflow from Natural Earth to rnaturalearth enabling updating as new versions become available.
rnaturalearth follows from rworldmap, a CRAN package for mapping world data, which I released more than 7 years ago. rworldmap was targeted particularly at relative newcomers to R, and has now been downloaded more than 100 thousand times. However, the code is ugly and I haven’t had the time to maintain it actively. I have been concerned for a while that making any changes will break it. More recently released options such as tmap and choroplethr are now better than rworldmap in most respects.
Where rworldmap tried to do everything, rnaturalearth aims to do fewer things, but to do them better. This approach may be familiar to people. Being more specialised also allows this package to be used in combination with other packages of the user's choice.
It is possible to use rnaturalearth to have more control over accessing map data, for example specifying exactly which areas are wanted when dealing with the trickiness of countries and dependencies. In this example I use sp::plot as a simple, quick way to plot map data, however the output can also be returned as sf objects for plotting using other packages.

    library(rnaturalearth)
    library(sp)

    # countries, UK undivided
    sp::plot(ne_countries(country = 'united kingdom', type = 'countries'))

    # map_units, UK divided into England, Scotland, Wales and Northern Ireland
    sp::plot(ne_countries(country = 'united kingdom', type = 'map_units'))

    # map_units, select by geounit to plot Scotland alone
    sp::plot(ne_countries(geounit = 'scotland', type = 'map_units'))

    # sovereignty, Falkland Islands included in UK
    sp::plot(ne_countries(country = 'united kingdom', type = 'sovereignty'), col = 'red')

The package contains pre-downloaded country and state boundaries at different resolutions and facilitates access to other vector and raster data for example of lakes, rivers and roads. Each Natural Earth dataset is characterised on the website according to scale, type and category. rnaturalearth will construct the url and download the corresponding file.

    lakes110 <- ne_download(scale = 110, type = 'lakes', category = 'physical')
    sp::plot(lakes110, col = 'blue')

I found the early stages of rworldmap development a somewhat lonely process. rnaturalearth has been through the rOpenSci community open review which improved the code considerably and my experience of developing it. I look forward to this package being more collaborative. I will comment on my experience of issues of community and sustainability within R package development.

Speakers
AS

## Andy South

Thursday July 6, 2017 5:55pm - 6:00pm CEST
2.01 Wild Gallery

### 5:55pm CEST

Keywords: ggplot2, visualization, shiny, colours, plotting
Webpages: https://github.com/daattali/colourpicker, https://cran.r-project.org/package=colourpicker
You’ve just made an amazing plot in R, and the only thing remaining is finding the right colours to use. Arghhh this part is never fun… You’re probably familiar with this loop: try some colour values -> plot -> try different colours -> plot -> repeat. Don’t you wish there was a better way?
Well, now there is :)
If you’ve ever had to spend a long time perfecting the colour scheme of a plot, you might find the new Plot Colour Helper handy. It’s an RStudio addin that lets you interactively choose combinations of colours for your plot, while updating your plot in real-time so you can see the colour changes immediately.

Speakers
DA

## Dean Attali

Thursday July 6, 2017 5:55pm - 6:00pm CEST
2.02 Wild Gallery

### 5:55pm CEST

Keywords: Government, Public Sector, Official Statistics
Webpages: https://rdotgov.wordpress.com/
Over the last decade, government organisations around the world have increasingly adopted R for their analytical needs, driven by the promise of more powerful and reproducible data analysis pipelines, Shiny - and lower costs. R maturity varies considerably across the public sector, with some organisations just starting to experiment with R, and others already using it as their primary workhorse for official statistics production and dissemination (Templ and Todorov 2016).
We will outline the main barriers to introducing R in government organisations, from IT culture to the career progression of statisticians, and how the Scottish Government is overcoming them.
We will also present R.gov, an informal group open to all public sector organisations which aims to enable and promote the use of R in government. The group already has members in ten countries, and provides a forum for sharing knowledge and fostering collaborations.
We hope this lightning talk will spark productive conversations and help create new connections, not just within government, but across the entire R community.
References
Templ, Matthias, and Valentin Todorov. 2016. “The Software Environment R for Official Statistics and Survey Methodology.” Austrian Journal of Statistics 45: 97–124. doi:10.17713/ajs.v45i1.100.

Speakers

## Jeremy Darot

Statistician, Scottish Government
I am the R lead in the Scottish Government, and the R.gov admin. If you are interested in joining our cross-government R network, please come and speak with me; everyone is welcome. I am giving a lightning talk on R in government on Thursday at 5:55pm.

Thursday July 6, 2017 5:55pm - 6:00pm CEST
3.02 Wild Gallery

### 5:55pm CEST

Keywords: simulation, multiple phenotypes, epistatic interactions
Webpages: https://CRAN.R-project.org/package=SimPhe
For complex traits, genome-wide association studies (GWAS) are the standard tool to detect variants contributing to the variance of the phenotype of interest. However, because they are limited to single-locus effects, they can explain only a small fraction of the heritability of complex traits. Epistasis, generally defined as the interaction between different genes, has been hypothesized as one of the factors contributing to missing heritability. This has been a hot topic in quantitative genetics for a long time, and the role of epistasis remains controversial because the majority of researchers concentrate only on additive effects, as most genetic variation is (approximately) additive. Even for epistasis analysis, many tools cannot take dominance effects into consideration properly. Recently, the detection of dominance, or of the interactions it is involved in, has been reported. Meanwhile, simulation tools have been developed for evaluating type I error rates for new statistical association tests or for power comparisons between the new tests and existing tests. However, few of them focus on the dominance effect and its interactions with other genetic effects. Here, we present an R package, SimPhe, to simulate single or multiple quantitative phenotypes based on genotypes with additive, dominance and epistatic effects using the Cockerham epistasis model. With optional parameters in different functions, users can easily specify the number of quantitative trait loci (QTLs), genetic effect size, the number of quantitative traits, and proportions of variance explained by the QTLs.
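The following is a minimal sketch of the kind of simulation described, written in base R rather than with SimPhe's own functions, whose interface is not shown in this abstract. It assumes two unlinked biallelic QTLs with F2-like genotype frequencies and the F2 coding of Cockerham's model as presented by Kao and Zeng (2002). The effect sizes and the heritability target are arbitrary values chosen for illustration.

```r
set.seed(1)
n <- 500

# Genotypes as allele counts (0, 1, 2) with F2 frequencies 1/4, 1/2, 1/4
g1 <- sample(0:2, n, replace = TRUE, prob = c(0.25, 0.5, 0.25))
g2 <- sample(0:2, n, replace = TRUE, prob = c(0.25, 0.5, 0.25))

# Cockerham-style coding: additive -1/0/1, dominance 1/2 for heterozygotes
additive  <- function(g) g - 1
dominance <- function(g) ifelse(g == 1, 0.5, -0.5)

x1 <- additive(g1); z1 <- dominance(g1)
x2 <- additive(g2); z2 <- dominance(g2)

# Main effects plus the four pairwise epistatic terms
design <- cbind(a1 = x1, d1 = z1, a2 = x2, d2 = z2,
                aa = x1 * x2, ad = x1 * z2, da = z1 * x2, dd = z1 * z2)

# Illustrative effect sizes (assumed, not taken from the talk)
effects <- c(a1 = 0.5, d1 = 0.3, a2 = 0.4, d2 = 0.2,
             aa = 0.3, ad = 0.1, da = 0.1, dd = 0.2)

genetic <- drop(design %*% effects)

# Scale the residual variance so the QTLs explain roughly h2 of the variance
h2 <- 0.4
var_e <- var(genetic) * (1 - h2) / h2
phenotype <- genetic + rnorm(n, sd = sqrt(var_e))
```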
References
Cockerham, C. Clark, and Bruce Spencer Weir. 1977. “Quadratic Analyses of Reciprocal Crosses.” Biometrics 33 (1). JSTOR: 187–203. doi:10.2307/2529312.

Hemani, Gibran, Konstantin Shakhbazov, Harm-Jan Westra, Tonu Esko, Anjali K. Henders, Allan F. McRae, Jian Yang, et al. 2014. “Detection and Replication of Epistasis Influencing Transcription in Humans.” Nature 508 (April). Nature Publishing Group: 249–53. doi:10.1038/nature13005.

Kao, Chen-Hung, and Zhao-Bang Zeng. 2002. “Modeling Epistasis of Quantitative Trait Loci Using Cockerham’s Model.” Genetics 160 (3). Genetics Society of America: 1243–61. doi:10.1534/genetics.104.035857.

Speakers
BJ

## Beibei Jiang

Thursday July 6, 2017 5:55pm - 6:00pm CEST
4.01 Wild Gallery

### 5:55pm CEST

We will introduce you to a framework we developed to achieve effective collaboration around data analysis in our enterprise environment at Vestas. In this talk we will describe our implementation in R, why we chose R, which challenges we faced and what we learned during the process.
Setting the scene
We had the task of creating statistical models to be used by the sales teams. Sales already had an Excel based tool, and the requirement was that we should continue with this front end. The statistical work would require models developed by a team of people as well as involvement of subject matter experts, hence the framework needed to support collaboration.

On stage
• Sales (end users, 50-100 people around the globe)
• Data analysts and subject matter experts (project team, 10 people in DK + IN)
• In front: Existing Excel front end
• In the background: R, GIT, rmarkdown, SQL

Scenography
Being in an enterprise world we had to fulfill requirements for maintainability, documentation and reproducibility. At the same time we wanted to achieve i) a code base approach, ii) easier validation methods, iii) automated model deployment and iv) a strong collaborative platform.

Orchestration
On the technical side the main new feature is a self-made package called harvester. The harvester’s main functions allow us to run markdown-files and fetch selected objects, typically our statistical models.
These fetched models are then wrapped into another internal package called models together with interface functions. This is the package used by Sales. The models are made available to Excel through a self-developed .NET-wrapper. In this way the end users will be able to get the most recent models through their normal Excel tool.
The collaboration is done through GIT where all team members store their analysis R-markdowns, shared and validated by subject matter experts. The harvester is designed to run the markdowns in GIT and fetch the selected output models.
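The harvester package is internal, so its interface is not public. The sketch below only illustrates the general idea of running analysis files and fetching selected objects. It uses plain R scripts as a stand-in for R Markdown documents, and the `harvest()` function, file name and object names are all hypothetical.

```r
# Hypothetical stand-in for the "harvester" idea: evaluate an analysis file
# in its own environment and return only the requested objects.
harvest <- function(file, objects) {
  env <- new.env(parent = globalenv())
  sys.source(file, envir = env)
  not_found <- setdiff(objects, ls(env))
  if (length(not_found) > 0) {
    stop("objects not created by ", basename(file), ": ",
         paste(not_found, collapse = ", "))
  }
  mget(objects, envir = env)
}

# A toy analysis script, written to a temporary file for the example
analysis <- tempfile(fileext = ".R")
writeLines(c(
  "dat <- data.frame(x = 1:10, y = 2 * (1:10) + 1)",
  "model_final <- lm(y ~ x, data = dat)",
  "scratch <- summary(dat)"
), analysis)

# Only the selected model is returned; intermediate objects stay behind
models <- harvest(analysis, "model_final")
coef(models$model_final)
```

For real R Markdown files, one possible route is `rmarkdown::render()`, which accepts an `envir` argument from which objects could be fetched in the same way after rendering.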

Review
Get inspired on how to integrate validated statistical models into the decision making in the business front line: It is a five star movie.

Speakers
AL

## Anne Lund Christophersen

Senior Specialist, Vestas Wind Systems A/S

Thursday July 6, 2017 5:55pm - 6:00pm CEST
4.02 Wild Gallery

### 5:55pm CEST

Keywords: Statistics, Big Data, Memory-mapping, Parallelism
Webpages: https://github.com/privefl/bigstatsr
The R package bigstatsr provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory. The package bigstatsr is based on the format big.matrix provided by the R package bigmemory (Kane, Emerson, and Weston 2013).
The package bigstatsr enables users with a laptop to perform statistical analysis of several dozen gigabytes of data. The package is fast and efficient for four reasons. First, bigstatsr is memory-efficient because it uses only small chunks of data at a time. Second, special care has been taken to implement effective algorithms. Third, big.matrix objects use memory-mapping, which provides efficient access to matrices. Finally, as matrices are stored on-disk, many processes can easily access them in parallel.
The main features currently available in bigstatsr are:
• singular value decomposition (SVD) and randomized partial SVD (Lehoucq and Sorensen 1996),
• sparse linear and logistic regressions (Zeng and Breheny 2017),
• sparse linear Support Vector Machines,
• column-wise linear and logistic regression tests,
• matrix operations,
• parallelization / apply.
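The idea of working on small chunks of data at a time can be illustrated without bigstatsr. The sketch below computes column means and standard deviations block by block. An ordinary in-memory matrix stands in for an on-disk big.matrix. The helper `chunked_colstats()` and the block size are assumptions made for this example, not functions of the package.

```r
# Column statistics computed one block of columns at a time
chunked_colstats <- function(X, block_size = 100) {
  p <- ncol(X)
  means <- numeric(p)
  sds <- numeric(p)
  for (start in seq(1, p, by = block_size)) {
    idx <- start:min(start + block_size - 1, p)
    block <- X[, idx, drop = FALSE]   # only this block is materialised
    means[idx] <- colMeans(block)
    sds[idx] <- apply(block, 2, sd)
  }
  list(mean = means, sd = sds)
}

set.seed(42)
X <- matrix(rnorm(200 * 350), nrow = 200)
stats <- chunked_colstats(X, block_size = 100)
```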
References
Kane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19. doi:10.18637/jss.v055.i14.

Lehoucq, Rich Bruno, and D. C. Sorensen. 1996. “Deflation Techniques for an Implicitly Restarted Arnoldi Iteration.” SIAM Journal on Matrix Analysis and Applications 17 (4). Society for Industrial and Applied Mathematics: 789–821. doi:10.1137/S0895479895281484.

Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.

Speakers

## Florian Privé

Thursday July 6, 2017 5:55pm - 6:00pm CEST
3.01 Wild Gallery

### 6:00pm CEST

Working through a complex textbook can be cumbersome and frustrating. Despite a strong motivation to understand the content, one needs a good memory to bear all the details in mind, discipline to stay focused on the content, as well as patience to finish.
Solving exercises throughout the textbook can help you to practise what you have learned, get more involved and gain deeper insights. Furthermore, it can help to validate your newly acquired skills and understanding.
In this lightning talk, we will briefly discuss our experiences, while working through Hadley Wickham’s Advanced R book (Wickham 2014), which provides exercises after most of its chapters. In particular we will describe the approach to document and monitor our progress via Yihui Xie’s bookdown package (Xie 2017) to address the issues mentioned above.

Xie, Yihui. 2017. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC. https://github.com/rstudio/bookdown.

Speakers
MG

## Malte Grosser

Thursday July 6, 2017 6:00pm - 6:05pm CEST
3.01 Wild Gallery

Speakers
ML

## Mélissa LEPAGE

Thursday July 6, 2017 6:00pm - 6:05pm CEST
4.02 Wild Gallery

### 6:00pm CEST

Keywords: shiny, rmarkdown, HTML, Twitter Bootstrap
Webpages: https://cran.r-project.org/package=bsplus, http://ijlyttle.github.io/bsplus/
With the advent of shiny (Chang et al. 2017) modules, you can create and support apps with more components and more complexity. One of the limiting factors is that we have but one “dimension” of interfaces using a tabsetPanel in the UI. This was the motivation to develop a second, independent “dimension” of interfaces in an “accordion-sidebar” framework. This is one of the function families provided in the bsplus (Lyttle 2017) package.
As well, the bsplus package lets you compose HTML using pipes. Its functions are designed to help you access Twitter Bootstrap (Bootstrap Core Team 2017) components independent of the server side of shiny. It also includes collapsible panels, accordions, carousels, tooltips, popovers and modals. You can use carousels to contain and display images (plots), whereas tooltips, popovers and modals can be useful for providing help and documentation for your apps.
References
Bootstrap Core Team. 2017. Bootstrap: The World’s Most Popular Mobile-First and Responsive Front-End Framework. https://getbootstrap.com.

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Lyttle, Ian. 2017. Bsplus: Adds Functionality to the R Markdown + Shiny Bootstrap Framework. https://CRAN.R-project.org/package=bsplus.

Speakers

## Ian Lyttle

Sr. Staff Engineer, Schneider Electric

Thursday July 6, 2017 6:00pm - 6:05pm CEST
2.02 Wild Gallery

### 6:00pm CEST

Keywords: Nomograms, Shiny, Dynamic Nomograms, Translational Statistics
Webpages: https://cran.r-project.org/web/packages/DynNom/index.html
Nomograms are useful computational tools for model visualisation as they allow the calculation of a point estimate of a response variable for a set of values of the corresponding explanatory variables. The nomogram function in Frank Harrell’s rms package is a popular way of creating static nomograms for regression models. Our DynNom package, built using shiny, allows the creation of dynamic nomograms, using a simple wrapper function from a variety of model objects including lm, glm and coxph and models generated using the Ols, Glm, lrm and cph functions in the rms package. In this presentation examples will be given where dynamic nomograms will be generated for a variety of models and the potential of this approach to be a useful translational tool explored.

Speakers
JN

## John Newell

Thursday July 6, 2017 6:00pm - 6:05pm CEST
4.01 Wild Gallery

### 6:00pm CEST

Keywords: API, CURL, Official Statistics
Webpages: http://rdata.work/slides/nsoapi/
National Statistical Offices have started setting up web services to provide published information through data APIs. Even though international standards exist, e.g. SDMX, the majority of NSOs create their individual API and few use existing community standards.
nsoAPI is an attempt to create a single package with functions for each provider that convert a custom data format into an R standard time series format ready for analysis or further transformation.
(“Opendata Tables” 2015) lists tables that can be retrieved from SDMX (International Organizations, ABS: Australia, INEGI: Mexico, INSEE: France and ISTAT: Italy, NBB: Belgium), the pxweb package (PXNET2: Finland, SCB: Sweden) and the nsoAPI package (BEA: USA, CBS: the Netherlands, GENESIS: Germany, ONS: UK, SSB: Norway, STATAT: Austria, STATBANK: Denmark).
With the exception of France, large countries tend to set their own standards. The BEA (USA) and ONS (UK) require the user to create an ID that needs to be submitted with each request. GENESIS (Germany) requires the user to pay 500 Euros per year (250 Euros for academic users) to access the API.
References
“NsoApi Vignette.” 2015. https://github.com/bowerth/nsoApi/blob/master/vignettes/nsoApi.md.

Speakers
BW

## Bo Werth

Thursday July 6, 2017 6:00pm - 6:05pm CEST
3.02 Wild Gallery

### 6:00pm CEST

The availability of spatial data is increasing every year, with growing access to free satellite imagery and cheap drone cameras, and the omnipresence of GPS on all sorts of devices, from our phones to farm tractors. Many applications in agriculture and environmental sciences, which often deal with the organization of space and territories, can profit from these new sources of data, such as precision farming, land-use planning and environmental monitoring.

This information can be extracted, analysed and turned into models to support better decisions: apply the right amount of fertilizer at the right place in the field, predict the future extension of an urban zone, draw a map of flooding or fire risks. But to do that, people need skills and tools to access, extract, explore and analyse such data.

That's why we started a project to design a lifelong learning course on spatial data analysis at the European level, with three partners (University of Liege, University of Lisboa and Montpellier SupAgro), based on free and open tools like *QGIS* and *R* so that everybody can install and use them, because access to knowledge shouldn't be restricted to organizations which can afford costly licenses.

We will present the construction of the program and the pedagogical approach we'll use. Blended learning will be tested for a European master level course for adult learning.

Speakers
YB

## Yves Brostaux

Thursday July 6, 2017 6:00pm - 6:05pm CEST
2.01 Wild Gallery

### 6:05pm CEST

Keywords: input validation, type safety, defensive programming, functional programming
To cope with the everyday hazards of invalid function inputs, R provides the functions stop() and stopifnot(), which can express input requirements as show-stopping assertions. While this way of validating inputs is both straightforward and effective, its rigidity as a fixture of a function, and its tendency to clutter code, add inertia to the process of interacting and programming with data.
In this talk, we demonstrate a more nimble take on input validation using the valaddin package, which addresses these shortcomings by viewing input validation as a functional transformation. We explore concrete use cases to illustrate the flexibility and benefits of this alternative approach.
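To make the contrast concrete, the sketch below shows input validation as a function transformation in base R. The helper `with_checks()` is hypothetical and written for this example only; it is not valaddin's interface.

```r
# Conventional approach: assertions are a fixture of the function body
bc_inline <- function(x, y) {
  stopifnot(is.numeric(x), length(x) == 1, is.numeric(y), length(y) == 1)
  c(x, y, 1 - x - y)
}

# Functional approach: the core function stays free of checks...
bc <- function(x, y) c(x, y, 1 - x - y)

# ...and a (hypothetical) transformer adds them from the outside.
# `checks` is a named list of predicates applied to every argument;
# the name of a failing predicate is used as the error message.
with_checks <- function(f, checks) {
  force(f)
  function(...) {
    args <- list(...)
    for (msg in names(checks)) {
      ok <- vapply(args, function(a) isTRUE(checks[[msg]](a)), logical(1))
      if (!all(ok)) stop(msg, call. = FALSE)
    }
    f(...)
  }
}

bc_safe <- with_checks(bc, list(
  "arguments must be numeric" = is.numeric,
  "arguments must have length 1" = function(a) length(a) == 1
))

bc_safe(0.2, 0.3)      # same result as bc(0.2, 0.3)
# bc_safe("a", 0.3)    # would stop with "arguments must be numeric"
```

Because the checks live outside `bc()`, the same function can be used unchecked in trusted code and checked at the boundary where user input arrives.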

Speakers
EH

## Eugene Ha

Thursday July 6, 2017 6:05pm - 6:10pm CEST
3.01 Wild Gallery

### 6:05pm CEST

Keywords: R, ggplot2, visualization, research, observational study
Webpages: https://CRAN.R-project.org/package=ggplot2
In late 2013, Ebola Virus Disease began in Guinea; it later spread to Liberia and subsequently to Sierra Leone in early 2014. Although the three countries were declared Ebola-Free almost a year ago, Ebola survivors are still struggling with lingering issues. In order to better understand post-Ebola sequelae, a Liberian-US partnership called the Partnership for Research on Ebola Virus in Liberia (PREVAIL) is in the midst of a large observational study that plans to follow Ebola survivors and their close contacts for up to 5 years. In this lightning talk, we will use R’s visualization capabilities to guide you through the struggle of Ebola survivors as told by the data from the PREVAIL study. No crowded tables with a long list of symptoms and p-values, but rather beautiful visualizations created using the ggplot2 package. At the end of this talk, it is hoped that you will be able to answer questions such as: Where in Liberia do survivors come from? What were their professions before the epidemic? How sick were they? Have they reached complete physical, social, and mental well-being? In the process, we hope you will be compelled to use ggplot2 to increase the overall quality of data visualizations in your reports.

Speakers

## Bionca Davis

University of Minnesota

Thursday July 6, 2017 6:05pm - 6:10pm CEST
4.01 Wild Gallery

### 6:05pm CEST

Keywords: Jurimetrics, Judicial decisions, Webscraping, Text mining, Predictive modeling
Webpages: https://josejesus.info
Abstract: The increasing availability of online access to judicial decisions, coupled with modern R packages that perform webscraping, text mining, topic modeling and predictive modeling, allows for the application of quantitative methods in the simultaneous analysis of thousands of court judgements. Extraction, manipulation, and analysis of judicial decisions require a variety of techniques and the use of multiple R packages, as well as building new functions to obtain and analyze relevant content. As an example, unsupervised learning, such as topic modeling, has revealed hitherto unknown aspects of how courts handle traffic accident cases. Supervised learning, such as classification, is a very important tool to identify determinants of judicial decisions, which are influenced by the interpretation of facts and, surprisingly, by judges’ ideology. It is now possible to predict with a high level of accuracy how courts will decide in criminal cases. The talk will address a set of techniques that have been developed to analyze court rulings.

Speakers

## José de Jesus Filho

member, Brazilian Association of Jurimetry
I am a criminal lawyer with an interest in the application of statistics and machine learning to law.

Thursday July 6, 2017 6:05pm - 6:10pm CEST
3.02 Wild Gallery

### 6:05pm CEST

Keywords: Ecology, Hydrology, Framework, Time Series
Webpages: https://github.com/mundl/smires, http://www.smires.eu/
Many hydrological and ecological metrics are constructed in a similar way. A common family of metrics is calculated from a univariate time series (e.g. daily streamflow observations) aggregated for given periods of time. More complex ones involve the detection of events (e.g. no-flow periods or flood events) or several levels of aggregation (e.g. mean annual minimum flow).
Although some R packages (hydrostats, IHA, hydroTSM, …) providing hydrological metrics exist, they usually strictly require daily time series and do not allow for a free choice of the aggregation period. By contrast, the package smires tries to generalize the calculation and visualization of hydrological metrics for univariate time series, providing a generic framework which is developed around dplyr's (Wickham and Francois 2016) split-apply-combine strategy. It takes into account the peculiarities of hydrological data, e.g. the strong seasonal component or the handling of missing data.
The general approach comprises four steps. (1) First the time series can be preprocessed, e.g. by interpolating missing values or by applying a moving average. If necessary, an optional step (2) involves the identification of distinct events such as low flow periods. For each event a set of new variables (e.g. event duration or event onset) is derived. In a third step (3) summary statistics are calculated for arbitrary periods (e.g. months, seasons, calendar years, hydrological years, decades). This step can be repeated until the original time series is aggregated to a single value.
The user keeps full control over the frequency of the time series (daily, weekly, monthly), the choice of preprocessing functions, the aggregation periods, the aggregation functions as well as the handling of events spanning multiple periods. Thus, smires enables the user to obtain a wide range of metrics whilst minimizing programming effort and error-proneness.
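The steps described above can be sketched in base R for one common metric, the mean annual minimum flow. The data are synthetic, the threshold is arbitrary, and none of the code below uses smires functions; it only mirrors the preprocess, detect events, aggregate sequence.

```r
set.seed(7)
dates <- seq(as.Date("2010-01-01"), as.Date("2012-12-31"), by = "day")

# Synthetic daily discharge with a seasonal component (arbitrary units)
doy <- as.integer(format(dates, "%j"))
flow <- pmax(0, 5 + 4 * cos(2 * pi * doy / 365) + rnorm(length(dates), sd = 0.5))
flow[200:204] <- NA                       # a short gap in the record

# (1) Preprocess: fill the gap by linear interpolation
flow_filled <- approx(seq_along(flow), flow, xout = seq_along(flow))$y

# (2) Detect events: runs of consecutive days below a threshold
threshold <- 2
runs <- rle(flow_filled < threshold)
ends <- cumsum(runs$lengths)
events <- data.frame(
  onset = dates[ends - runs$lengths + 1],
  duration = runs$lengths
)[runs$values, ]

# (3) Aggregate: annual minima, then their mean over all years
year <- format(dates, "%Y")
annual_min <- tapply(flow_filled, year, min)
mean_annual_min <- mean(annual_min)
```

Changing `year` to another grouping (months, seasons, hydrological years) changes the aggregation period without touching the other steps.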
References
Bond, Nick. 2016. Hydrostats: Hydrologic Indices for Daily Time Series Data. https://CRAN.R-project.org/package=hydrostats.

The Nature Conservancy. 2009. Indicators of Hydrologic Alteration. https://www.conservationgateway.org/ConservationPractices/Freshwater/EnvironmentalFlows/MethodsandTools/IndicatorsofHydrologicAlteration/Pages/indicators-hydrologic-alt.aspx.

Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Zambrano-Bigiarini, Mauricio. 2014. HydroTSM: Time Series Management, Analysis and Interpolation for Hydrological Modelling. https://CRAN.R-project.org/package=hydroTSM.

Speakers
TG

## Tobias Gauster

PhD Student, BOKU

Thursday July 6, 2017 6:05pm - 6:10pm CEST
2.01 Wild Gallery

### 6:05pm CEST

Keywords: Excel spreadsheet, htmlwidget, data import

Speakers
PB

## Paulo Bargo

Thursday July 6, 2017 6:05pm - 6:10pm CEST
2.02 Wild Gallery

### 6:05pm CEST

Keywords: Brewing, beer, yeast, visualisation, summary statistics
Webpages: https://dynamobrew-stats.shinyapps.io/WhiteLabsBrewAppHb/
How is R helping brewers to choose the best yeast for their beer? How does yeast choice influence predicted quantities like bitterness versus measured bitterness?
This is meant as a short, fun presentation, touching on Belgian brewing heritage.
I will present a walk-through of an interactive shiny app I created for a client in the brewing industry.
The initial task was primarily to ingest the data and produce an interactive environment where the client’s employees could explore their data. I will not spend too much time on this, but will mention briefly the database technologies used to access the data. I will mention some of my experiences with productionising a full data flow (ingestion, transformations, outlier handling, visualisation).
The main goal of the presentation will be to visually demonstrate differences between beer styles with regard to their recipes, and to demonstrate the importance of matching beer style with an appropriate yeast.
Before the Conference, I plan on implementing a clustering method based on Self-Organising Maps. This should be a very nice way to explore the natural clustering of recipes – and it should map very neatly to beer styles.
Pending approval from RateBeer, I will also join some high level beer styles data scraped from the https://www.ratebeer.com website over the brewing yeast data.

Speakers
BH

## Benjamin Høyer

Thursday July 6, 2017 6:05pm - 6:10pm CEST
4.02 Wild Gallery

### 6:10pm CEST

Keywords: Tolerance Intervals, Method Comparison Studies, Agreement, Errors-in-Variables Regression, Bivariate Least Square
Webpages: https://CRAN.R-project.org/package=BivRegBLS
The need of laboratories to quickly assess the quality of samples leads to the development of new measurement methods. These methods should lead to results comparable with those obtained by a standard method.
Two main methodologies are presented in the literature. The first one is the Bland-Altman approach with its agreement intervals (AIs) in a (M=(X+Y)/2,D=Y-X) space, where two methods (X and Y) are interchangeable if their differences are not clinically significant. The second approach is based on errors-in-variables regression in a classical (X,Y) plot, whereby two methods are considered equivalent when providing similar measures notwithstanding the random measurement errors. These methodologies can be used in many other domains than clinical.
During this talk, novel tolerance intervals (TIs) (based on unreplicated or replicated designs) will be shown to be better than AIs as TIs are easier to calculate, easier to interpret, and are robust to outliers. Furthermore, it has been shown recently that the errors are correlated in the Bland-Altman plot. The coverage probabilities collapse drastically and the biases soar when this correlation is ignored. A novel consistent regression, CBLS (Correlated Bivariate Least Square), is then introduced. Novel predictive intervals in the (X,Y) plot and in the (M,D) plot are also presented with excellent coverage probabilities.
Guidelines for practitioners will be discussed and illustrated with the new and promising R package BivRegBLS. It will be explained how to model and plot the data in the (X,Y) space with the BLS regression (Bivariate Least Square) or in the (M,D) space with the CBLS regression by using BivRegBLS. The main functions will be explored with an emphasis on the output and how to plot the results.
References BG Francq, B Govaerts (2016). How to regress and predict in a Bland-Altman plot? Review and contribution based on tolerance intervals and correlated-errors-in-variables models. Statistics in Medicine, 35:2328-2358.

Speakers
BF

## Bernard Francq

Thursday July 6, 2017 6:10pm - 6:15pm CEST
4.01 Wild Gallery

### 6:10pm CEST

Keywords: big data, packages, shiny, reproducible research, business analytics
Abstract: Over the last 5 years the mobile gaming industry has experienced massive growth. Hundreds of millions of players play King games every month. We will briefly talk about how King data science teams coped with this amount of information and how R has allowed them to collaborate and grow a culture of reproducible research. Large data science teams are typically made up of people coming from very different backgrounds: hard sciences, engineering, economics, computer science, business, psychology, etc. That poses the challenge of developing a common technology stack that allows for fluid and agile collaboration. R is the perfect language for that.
Namely, we will see how R has been used to:
• Build data access packages.
• Quickly assemble dashboards and reporting tools by leveraging the shiny package.
• Implement a reproducible research mindset with GitHub, R Markdown and notebooks.

Speakers
XG

## Xavier Guardiola

Thursday July 6, 2017 6:10pm - 6:15pm CEST
4.02 Wild Gallery

### 6:10pm CEST

Abstract: In this talk, we explore and discuss the possibility of reducing overall cost and effort in a scientific experiment. According to the objectives and assumptions of the study, an experimenter can adopt a suitable experimental design using the numerous available tools, software and R packages. But such designs do not consider the sequence in which the experimental runs are applied to the experimental units. This might increase cost and effort: for example, if a factor in the experiment is temperature, the experimenter might have to change the temperature level from high to low many times in successive runs, and each time he/she has to wait and adjust the instrument. We have addressed this issue and proposed a theoretical framework for minimizing the changes in factor levels in an experimental design. To apply our findings we developed two R packages: minimalRSD and FMC.
Package minimalRSD can be used to generate response surface designs, namely central composite designs (CCD) with full factorial or fractional factorial points and Box-Behnken designs (BBD). Factorial designs with symmetrical as well as asymmetrical factor level combinations can be constructed using the package FMC. The output gives the respective design, the number of changes in each factor and the overall number of level changes. We intend to extend our theoretical findings to the scientific community using the power of R.
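The idea of counting factor level changes can be illustrated with a short Python sketch. It counts the changes per factor for a given run order and compares the standard order of a 2^3 full factorial with a reflected Gray-code order, in which exactly one factor changes between successive runs. This is an illustration only: the Gray-code ordering is a stand-in chosen for this example, not the construction used by minimalRSD or FMC, and all function names are hypothetical.

```python
from itertools import product

def level_changes(runs):
    """Count the level changes per factor, and in total, for a run order."""
    n_factors = len(runs[0])
    per_factor = [0] * n_factors
    for prev, cur in zip(runs, runs[1:]):
        for j in range(n_factors):
            if prev[j] != cur[j]:
                per_factor[j] += 1
    return per_factor, sum(per_factor)

def gray_code_order(n_factors):
    """Order the runs of a two-level full factorial so one factor changes per step."""
    runs = [()]
    for _ in range(n_factors):
        # Prefix the low level to the runs so far, then the high level to their mirror image.
        runs = [(-1,) + r for r in runs] + [(1,) + r for r in reversed(runs)]
    return runs

standard = list(product((-1, 1), repeat=3))   # standard order: last factor varies fastest
gray = gray_code_order(3)

standard_changes = level_changes(standard)    # ([1, 3, 7], 11)
gray_changes = level_changes(gray)            # ([1, 2, 4], 7)
```

Both orders contain the same eight runs, but the second needs 7 level changes instead of 11, which is the kind of saving the abstract is concerned with.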

Speakers

## SHWETANK LALL

PhD Student, ICAR - IASRI
I am doing a PhD in statistics at ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India. I use R for statistical data analysis. I mainly work in the field of design of experiments and statistical modelling. I have some experience with analysis of gene expression data...

Thursday July 6, 2017 6:10pm - 6:15pm CEST
2.01 Wild Gallery

### 6:10pm CEST

Abstract: A frequently expressed barrier to the transition from SAS to R at our institution is the challenge in generating “quick and dirty” output that combines text and graphical summaries of data for offline viewing or sharing with investigators. Depending on a person’s prior training and programming style, a full markdown approach to produce this integrated summary often requires significant reprogramming, particularly when the project involves multiple programmers or complex data manipulation. The philosophy behind an approach entitled “object-oriented markdown” will be presented and illustrated using a series of research projects utilizing the RJafroc package. The presentation will illustrate how data management and analysis standards can provide a framework that enables collaboration amongst statisticians on the project and ease of integration of final statistical results into a markdown document. By utilizing a markdown file only as a means to print stored R objects, one is able to rapidly summarize and interpret statistical output while maintaining efficient programming styles.
Keywords: Reproducible research, statistical output, data analysis pipeline

Speakers
RC

## Rickey Carter

Thursday July 6, 2017 6:10pm - 6:15pm CEST
2.02 Wild Gallery

### 6:10pm CEST

Keywords: optimization, mathematical programming
Webpages: https://cran.r-project.org/web/packages/ROI/index.html, https://r-forge.r-project.org/projects/roi/
Optimization plays an increasingly important role in statistical computing. Typical applications include, among others, various types of regression, classification and low rank matrix approximations. Due to its wide application there exist many resources concerned with optimization. These resources involve software for modeling, solving and randomly generating optimization problems, as well as optimization problem collections used to benchmark optimization solvers. The R Optimization Infrastructure package ROI bundles many of the available resources used in optimization into a unified framework. It constitutes a unified way to formulate and store optimization problems by utilizing the rich language features R has to offer, rather than creating a new language. In ROI an optimization problem is stored as a single object, which ensures that it can easily be saved and exchanged. Furthermore, the streamlined construction of optimization problems combined with a sophisticated plugin structure allows package authors and users to exploit different solver options by just changing the solver name. Currently, the ROI plugins include solvers for general purpose nonlinear optimization as well as for linear, quadratic and conic programming. Additionally, plugins for reading and writing optimization problems in various formats (e.g. MPS, LP) and plugins for problem collections (e.g. netlib, miplib) transformed into the ROI format are available.

Speakers
FS

## Florian Schwendinger

Thursday July 6, 2017 6:10pm - 6:15pm CEST
3.01 Wild Gallery

### 6:10pm CEST

Keywords: shiny apps, student exercises
Webpages: http://prof.beuth-hochschule.de/mmueller/shiny-apps/
The talk presents some ideas for generating exercises for Maths and Stats courses using shiny apps. The R software environment in the background is quite useful here: it allows parameters and data in such exercises to be chosen randomly and provides calculation results and graphical illustrations. In addition, the MathJax capability of shiny allows formulas to be used as in classical textbooks.
Since shiny apps generate web documents, they can be easily linked into websites or online platforms (e.g. into Moodle courses). Self-written shiny apps allow for an implementation that precisely fits the needs of a specific mathematics or statistics course.
The Maths and Stats apps presented here are intended as a complementary offer, i.e. in addition to usual classroom exercises. The students should use them independently to train and self-test their mathematics or statistics skills.

Speakers

## Marlene Müller

Thursday July 6, 2017 6:10pm - 6:15pm CEST
3.02 Wild Gallery

### 6:15pm CEST

Keywords: Gamification, Text Classification, R Shiny, Interactive Machine Learning.
Webpages: https://gmdn.shinyapps.io/Classification/
Supervised machine learning algorithms require a set of labelled examples to be trained; however, the labelling process is a costly, time consuming task. In the last years, mixed approaches that use crowd-sourcing and interactive machine learning (Amershi et al. 2014) have shown that it is possible to create annotated datasets at affordable costs (Morschheuser, Hamari, and Koivisto 2016). One major challenge in motivating people to participate in these labelling tasks is to design a system that promotes and enables the formation of positive motivations towards work as well as fits the type of the activity.
In this context, an approach named ‘gamification’ has become popular. Gamification is defined as ‘the use of game design elements in non-game contexts’ (Deterding et al. 2011), i.e. typical game elements, like rankings, leaderboards, points, badges, etc., are used for purposes different from their normal expected employment. Nowadays, gamification spreads through a wide range of disciplines and its applications are implemented in different areas. For instance, an increasingly common feature of online communities and social media sites is a mechanism for rewarding user achievements based on a system of badges and points. They have been employed in many domains, including educational sites like Khan Academy, and tourist review sites like Tripadvisor. At the most basic level, these game elements serve as a summary of a user’s key accomplishments; however, experience with these sites also shows that users will put in non-trivial amounts of work to achieve particular badges, and as such, badges can act as powerful incentives (Anderson et al. 2013).
In this work, we present recent studies of gamification in text classification and the development of a Web application written in Shiny (Di Nunzio, Maistro, and Zilio 2016). This application, initially designed to understand probabilistic models, has been redesigned as a game to gather labelled data from lay people, especially kids from primary and secondary schools, during the European Researchers’ Night in September 2016 at the University of Padua. We have tested this application with a two-fold goal in mind: i) to understand, through the gamification of a classification problem, the ‘price’ of labelling a small number of objects for building a reasonably accurate classifier, and ii) to analyze the classification performance given small sample sizes and little training. We will describe three different interfaces and the analysis of the results: a pilot experiment with PhD and post-doc students, a second experiment with primary and secondary school students, and a third experiment with a computer installed in a bank in the city center.
References Amershi, Saleema, Maya Cakmak, W. Bradley Knox, and Todd Kulesza. 2014. “Power to the People: The Role of Humans in Interactive Machine Learning.” AI Magazine 35 (4): 105–20. http://www.aaai.org/ojs/index.php/aimagazine/article/view/2513.

Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. “Steering User Behavior with Badges.” In Proceedings of the 22nd International Conference on World Wide Web, 95–106. WWW ’13. New York, NY, USA: ACM. doi:10.1145/2488388.2488398.

Deterding, Sebastian, Dan Dixon, Rilla Khaled, and Lennart Nacke. 2011. “From Game Design Elements to Gamefulness: Defining ‘Gamification’.” In Proc. of the 15th International Academic Mindtrek Conference: Envisioning Future Media Environments, 9–15. MindTrek ’11. New York, NY, USA: ACM. doi:10.1145/2181037.2181040.

Di Nunzio, Giorgio Maria, Maria Maistro, and Daniel Zilio. 2016. “Gamification for Machine Learning: The Classification Game.” In Proceedings of the Third International Workshop on Gamification for Information Retrieval Co-Located with 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), Pisa, Italy, July 21, 2016., 45–52. http://ceur-ws.org/Vol-1642/paper7.pdf.

Morschheuser, B., J. Hamari, and J. Koivisto. 2016. “Gamification in Crowdsourcing: A Review.” In 2016 49th Hawaii International Conference on System Sciences (HICSS), 4375–84. doi:10.1109/HICSS.2016.543.

Speakers
GM

## Giorgio Maria Di Nunzio

Thursday July 6, 2017 6:15pm - 6:20pm CEST
4.02 Wild Gallery

### 6:15pm CEST

Keywords: Principal component analysis, Independent component analysis, Non-Gaussian component analysis, Sliced inverse regression
Webpages: https://CRAN.R-project.org/package=ICtest
Choosing the number of components to retain is a crucial step in every dimension reduction method. The package ICtest introduces various tools for estimating the number of interesting components, or the true reduced dimension, in three classical situations: principal component analysis, independent component analysis and reducing the number of covariates in prediction. The estimation methods are provided in the form of hypothesis tests, and in each of the three cases tests based both on asymptotic distributions and on bootstrapping are provided. The talk will briefly introduce the methodology used and showcase the package in action.
References Nordhausen, Klaus, Hannu Oja, and David E. Tyler. 2016. “Asymptotic and Bootstrap Tests for Subspace Dimension.” arXiv:1611.04908.

Nordhausen, Klaus, Hannu Oja, David E. Tyler, and Joni Virta. 2017. “Asymptotic and Bootstrap Tests for the Dimension of the Non-Gaussian Subspace.” arXiv:1701.06836.

Virta, Joni, Klaus Nordhausen, and Hannu Oja. 2016. “Projection Pursuit for Non-Gaussian Independent Components.” arXiv:1612.05445.

Speakers
JV

## Joni Virta

Thursday July 6, 2017 6:15pm - 6:20pm CEST
2.01 Wild Gallery

### 6:15pm CEST

Keywords: Discrete-Event Simulation
Webpages: https://CRAN.R-project.org/package=simmer, http://r-simmer.org/
Discrete-Event Simulation (DES) is a powerful modelling technique that breaks down complex systems into ordered sequences of well-defined events. Its applications are broad (from process design, planning and optimisation to decision making) in a wide range of fields, such as manufacturing, logistics, healthcare and networking.
This talk presents simmer, a package that brings DES to R. It is designed as a generic yet powerful process-oriented framework. The architecture encloses a robust and fast simulation core written in C++ with integrated monitoring capabilities, allowing for easy access to time series data on processes and resources. It provides a rich and flexible R API that revolves around the concept of a trajectory, a common path in the simulation model for entities of the same type. A trajectory can be defined as a recipe-like set of activities that correspond to common functional DES blocks. These activities are exposed as intuitive verbs (e.g., seize, release and timeout) and chained using the popular pipeline notation %>%, which makes for clear and transparent DES modelling.
Over time, the simmer package has seen significant improvements and has been at the forefront of DES for R.
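For readers new to DES, the following Python sketch shows the core mechanism in miniature: a time-ordered event queue drives entities through a seize, timeout and release cycle on a single resource. It is a toy written for this illustration, assuming one server, a first-come-first-served queue and a fixed service time. It does not use or mirror the simmer API, and every name in it is hypothetical.

```python
import heapq
from collections import deque

def simulate(arrivals, service_time):
    """Simulate a single-server queue; return arrive/start/end times per entity."""
    # Each event is (time, sequence number, kind, entity); the unique sequence
    # number breaks ties so that simultaneous events are handled in creation order.
    events = [(t, i, "arrive", i) for i, t in enumerate(arrivals)]
    heapq.heapify(events)
    seq = len(events)
    waiting = deque()
    busy = False
    log = {}
    while events:
        now, _, kind, who = heapq.heappop(events)
        if kind == "arrive":
            log[who] = {"arrive": now}
            waiting.append(who)
        else:  # "release": the entity leaves and frees the resource
            log[who]["end"] = now
            busy = False
        if not busy and waiting:
            nxt = waiting.popleft()      # seize the resource
            log[nxt]["start"] = now
            busy = True
            # timeout: schedule the release once the service time has elapsed
            heapq.heappush(events, (now + service_time, seq, "release", nxt))
            seq += 1
    return log

log = simulate(arrivals=[0, 1, 2], service_time=2)
waits = [log[i]["start"] - log[i]["arrive"] for i in sorted(log)]
```

With three arrivals one time unit apart and a service time of two units, the entities wait 0, 1 and 2 time units respectively, which is the kind of monitoring output a DES framework collects automatically.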

Speakers
BS

## Bart Smeets

Thursday July 6, 2017 6:15pm - 6:20pm CEST
3.01 Wild Gallery

Speakers
SL

## Sean Lopp

Thursday July 6, 2017 6:15pm - 6:20pm CEST
2.02 Wild Gallery

### 6:15pm CEST

At the time of writing the CRAN Task View MetaAnalysis contains 70+ packages. Despite this large resource there are many overlaps and some important gaps. After a very brief overview of meta-analysis we shall discuss the relationships between the existing packages and the coverage they offer. The main techniques for preparing summary statistics and analysing them in univariate models are well covered with considerable overlaps. Graphical displays are another area of strength but diagnostics have received less coverage. Other topics with adequate coverage include meta-analysis of diagnostic tests, multivariate meta-analysis and network meta-analysis. We shall then outline the current gaps: meta-analysis where some studies have individual participant data available but others only have summary statistics (discussed in the literature but not coded), the method of Hunter and Schmidt (primarily covered in books with closed source software), and trial sequential analysis (discussed in the literature and available in closed source software).

Speakers
MD

## Michael Dewey

Thursday July 6, 2017 6:15pm - 6:20pm CEST
4.01 Wild Gallery

### 6:15pm CEST

Keywords: miniCRAN, Agricultural Sciences, Statistics
Webpages: https://CRAN.R-project.org/package=miniCRAN, http://rstats4ag.org/
R has evolved over time and currently consists of more than 10,000 packages. Virtually every aspect of statistical methodology, e.g., in the agricultural sciences, is readily available. Packages can be accessed anywhere, including in countries with few resources for purchasing commercial programmes (e.g., SAS, SPSS, Matlab). However, in regions where a seamlessly running internet connection is the exception rather than the rule, we have problems. Introducing R and relevant packages to new students should not begin with a struggle to download the packages. In training and instruction situations, the package miniCRAN can help us maintain a private mirror of a subset of packages that are relevant to distribute among students, irrespective of whether the internet is working. In addition, miniCRAN makes it possible to build a dependency tree for a given set of packages. An important facility is the capability to download older package versions from the CRAN archives. However, dependencies for old package versions cannot be determined automatically, and the end user must specify them.
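The dependency-tree idea can be sketched in a few lines of Python: starting from the packages a course needs, follow the dependency links until nothing new turns up, and mirror that whole set. The sketch assumes the dependency information is already available as a plain mapping. The package names and the mapping are invented for illustration, and the function is not part of miniCRAN.

```python
def dependency_closure(packages, depends_on):
    """Return the requested packages plus everything they depend on, recursively."""
    needed = set()
    stack = list(packages)
    while stack:
        pkg = stack.pop()
        if pkg in needed:
            continue                      # already visited, also guards against cycles
        needed.add(pkg)
        stack.extend(depends_on.get(pkg, ()))
    return needed

# Hypothetical dependency map: each key lists the packages it requires directly.
depends_on = {
    "coursepkg": ["plotpkg", "modelpkg"],
    "plotpkg": ["colourpkg"],
    "modelpkg": ["matrixpkg", "colourpkg"],
    "unrelated": ["matrixpkg"],
}

to_mirror = dependency_closure(["coursepkg"], depends_on)
```

Only the five packages reachable from `coursepkg` end up in the set to mirror; `unrelated` is left out, which keeps the offline repository small.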

Speakers
JC

## Jens Carl Streibig

Thursday July 6, 2017 6:15pm - 6:20pm CEST
3.02 Wild Gallery

### 6:20pm CEST

Keywords: Coverage, Comparing R packages, Uncertainty, Bootstrap
Webpages: http://www.math.su.se/~hoehle
Inspired by the work of Höhle and Höhle (2009) concerned with the assessment of accuracy for digital elevation models in photogrammetry, we discuss the computation of confidence intervals for the median or any other quantile in R. In particular we are interested in the interpolated order statistic approach suggested by Hettmansperger and Sheather (1986) and generalized in Nyblom (1992). In order to make the methods available to a greater audience we provide an implementation of these methods in the R package quantileCI and conduct a small simulation study to show that these intervals indeed have a very good coverage. The study also shows that these intervals perform better than the currently available approaches in R. We therefore propose that these intervals should be used more in the future!
Details on the work can be found in the presenter’s blog entitled Theory meets practice available at http://www.math.su.se/~hoehle/blog.
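As background for the interval construction discussed above, here is a Python sketch of the plain, non-interpolated order-statistic confidence interval for the median. It assumes a continuous distribution and picks the narrowest symmetric pair of order statistics whose exact binomial coverage is at least the nominal level. The interpolated intervals of Hettmansperger and Sheather refine this idea and are not implemented here; the function name is hypothetical and the code is unrelated to the quantileCI package.

```python
from math import comb

def median_ci(sample, conf=0.95):
    """Distribution-free CI for the median based on a pair of order statistics."""
    x = sorted(sample)
    n = len(x)
    best = None
    cdf = 0.0
    for d in range(1, n // 2 + 1):
        cdf += comb(n, d - 1) / 2 ** n    # P(B <= d - 1) for B ~ Binomial(n, 1/2)
        coverage = 1 - 2 * cdf            # coverage of [x_(d), x_(n - d + 1)]
        if coverage >= conf:
            best = (d, coverage)          # a narrower interval still meets the level
        else:
            break
    if best is None:
        raise ValueError("sample too small for the requested confidence level")
    d, coverage = best
    return x[d - 1], x[n - d], coverage

lower, upper, coverage = median_ci(range(1, 11))   # sample 1, 2, ..., 10
```

For ten observations the interval runs from the 2nd to the 9th order statistic and has an exact coverage of about 97.9%, noticeably above the nominal 95%. That conservativeness is what interpolating between neighbouring order statistics is meant to reduce.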
References Hettmansperger, T. P., and S. J Sheather. 1986. “Confidence Intervals Based on Interpolated Order Statistics.” Statistics and Probability Letters 4: 75–79. doi:10.1016/0167-7152(86)90021-0.

Höhle, J., and M. Höhle. 2009. “Accuracy Assessment of Digital Elevation Models by Means of Robust Statistical Methods.” ISPRS Journal of Photogrammetry and Remote Sensing 64 (4): 398–406. doi:10.1016/j.isprsjprs.2009.02.003.

Nyblom, J. 1992. “Note on Interpolated Order Statistics.” Statistics and Probability Letters 14: 129–31. doi:10.1016/0167-7152(92)90076-H.

Speakers

## Michael Höhle

Stockholm University

Thursday July 6, 2017 6:20pm - 6:25pm CEST
2.01 Wild Gallery

### 6:20pm CEST

Keywords: Error Datacleaning
Webpages: https://CRAN.R-project.org/package=errorlocate, https://github.com/data-cleaning
An important but rarely mentioned activity needed for statistical analysis is data cleaning. No measurement is perfect, so data often contain errors. Obvious errors, e.g. a negative age, are easily detected, but errors in observations whose variables are logically related, e.g. marital status and age, are trickier. The R package errorlocate allows for pinpointing errors in observations using the Fellegi-Holt algorithm and validation rules from the R package validate. The errors can automatically be removed using a pipeline syntax.
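The principle behind error localization can be shown with a brute-force Python sketch: find the smallest set of fields that, if changed, would let the record satisfy every rule. The sketch assumes small finite domains and unweighted fields and simply tries subsets in order of size. It is not how errorlocate is implemented, and the record, domains and rules are made up for this example.

```python
from itertools import combinations, product

def locate_errors(record, domains, rules):
    """Return a smallest set of fields whose values must change to satisfy all rules."""
    names = list(record)
    for k in range(len(names) + 1):               # try the smallest subsets first
        for subset in combinations(names, k):
            for values in product(*(domains[v] for v in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if all(rule(candidate) for rule in rules):
                    return set(subset)
    return None                                   # the rules cannot be satisfied at all

# Hypothetical record: a married, employed three-year-old.
record = {"age": 3, "status": "married", "job": "employed"}
domains = {
    "age": range(0, 121),
    "status": ["single", "married", "widowed"],
    "job": ["none", "employed"],
}
rules = [
    lambda r: r["age"] >= 0,
    lambda r: r["status"] == "single" or r["age"] >= 16,
    lambda r: r["job"] == "none" or r["age"] >= 15,
]

suspect = locate_errors(record, domains, rules)
```

Changing the status alone still leaves the employment rule violated, and changing the job alone still leaves the marriage rule violated, so age is the single field that explains both failures.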

Speakers
ED

## Edwin de Jonge

Thursday July 6, 2017 6:20pm - 6:25pm CEST
3.01 Wild Gallery

### 6:20pm CEST

Keywords: kdb+, big data mining, R-KDB+ Interface, business/industry, high-performance computing
Webpages: http://code.kx.com/wiki/Cookbook/IntegratingWithR
Commercial applications of ultra-low latency techniques for data mining and machine learning have been ubiquitous in financial trading and related disciplines for many years. Since as early as 2005, algorithmic trading desks at hedge funds and large investment banks have relied on in-memory, columnar databases and map-reduce techniques for analysing millions of data points in milliseconds, long before such tools were used in other verticals. In particular, technologies such as kdb+ and Q, a vector-based programming platform developed as a successor to APL (A Programming Language, developed in the 1950s/60s at Harvard by Ken Iverson), provide an unchallenged ability to perform both simple and complex data manipulations at scale, with speeds that are orders of magnitude faster than contemporary platforms used for Big Data. A lesser-used but formidable capability, used by R enthusiasts who are also kdb+ experts, is the R-KDB+ interface, which uses interprocess communication to share data between R and KDB+ processes, all from within the user’s R console or Q console. In my nearly 12 years of using R, I, like many of my colleagues who have worked in financial trading environments, have found such capabilities indispensable, especially when working with large, oftentimes terabyte-scale datasets. The proposed talk features the basics of using the R-KDB+ interface as a faster and superior method to extract aggregated data from TB-scale data warehouses prior to statistical analysis in R.

Speakers
ND

## Nataraj Dasgupta

Thursday July 6, 2017 6:20pm - 6:25pm CEST
4.02 Wild Gallery

### 6:30pm CEST

Thursday July 6, 2017 6:30pm - 7:00pm CEST
Wild Gallery Getijstraat 11, 1190 Vorst

### 7:00pm CEST

Thursday July 6, 2017 7:00pm - 10:00pm CEST
BRUSSELS EXPO Wild Gallery, Getijstraat 11 - 1190 Vorst

Friday, July 7

### 8:00am CEST

Friday July 7, 2017 8:00am - 9:15am CEST
Wild Gallery Getijstraat 11, 1190 Vorst

### 9:15am CEST

Friday July 7, 2017 9:15am - 9:30am CEST
PLENARY Wild Gallery

### 9:30am CEST

One of the important goals of modern statistics is to provide comprehensive and integrated inferential frameworks for data analysis (from exploratory analysis to prediction and visualisation). R is a very flexible software for the implementation of these frameworks and for this reason, it represents an excellent research and learning tool for end-users in both academia and industry.

Speakers
IG

## Isabella Gollini

Friday July 7, 2017 9:30am - 10:30am CEST
PLENARY Wild Gallery

### 10:30am CEST

Friday July 7, 2017 10:30am - 11:00am CEST
CATERING POINTS Wild Gallery
BREAK

### 11:00am CEST

Keywords: extreme value theory, censoring, splicing, risk measures
Webpages: https://CRAN.R-project.org/package=ReIns, https://github.com/TReynkens/ReIns
Reinsurance is an insurance purchased by one party (usually an insurance company) to indemnify parts of its underwritten insurance risk. The company providing this protection is then the reinsurer. A typical example of a reinsurance is an excess-loss insurance where the reinsurer indemnifies all losses above a certain threshold that are incurred by the insurer. Albrecher, Beirlant, and Teugels (2017) give an overview of reinsurance forms, and its actuarial and statistical aspects: models for claim sizes, models for claim counts, aggregate loss calculations, pricing and risk measures, and choice of reinsurance. The ReIns package, which complements this book, contains estimators and plots that are used to model claim sizes. As reinsurance typically concerns large losses, extreme value theory (EVT) is crucial to model the claim sizes. ReIns provides implementations of classical EVT plots and estimators (see e.g. Beirlant et al. 2004) which are essential tools when modelling heavy-tailed data such as insurance losses.
Insurance claims can take a long time to be completely settled, i.e. there is a long time between the occurrence of the claim and the final payment. If the claim is notified to the (re)insurer but not completely settled before the evaluation time, not all information on the final claim amount is available, and hence censoring is present. Several EVT methods for censored data are included in ReIns.
A global fit for the distribution of losses is e.g. needed in reinsurance. Modelling the whole range of the losses using a standard distribution is usually very hard and often impossible. A possible solution is to combine two distributions in a splicing model: a light-tailed distribution for the body, i.e. light and moderate losses, and a heavy-tailed distribution for the tail to capture large losses. Reynkens et al. (2016) propose a splicing model with a mixed Erlang (ME) distribution for the body and a Pareto distribution for the tail. This combines the flexibility of the ME distribution with the ability of the Pareto distribution to model extreme values. ReIns contains the implementation of the expectation maximisation (EM) algorithm to fit the splicing model to censored data. Risk measures and excess-loss insurance premiums can be computed using the fitted splicing model.
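The splicing construction can be written down compactly. The Python sketch below builds the CDF of a spliced distribution with an upper-truncated exponential body (an exponential is the simplest special case of a mixed Erlang, with a single component of shape 1) and a Pareto tail above the splicing point. The parameter names and values are assumptions made for this illustration; nothing here is fitted by EM or taken from the ReIns package.

```python
import math

def spliced_cdf(x, t, pi, rate, alpha):
    """CDF of a body/tail splice: truncated exponential below t, Pareto above t.

    t     -- splicing point separating body and tail
    pi    -- probability mass assigned to the body (losses <= t)
    rate  -- rate of the exponential body
    alpha -- Pareto tail index (a smaller alpha means a heavier tail)
    """
    if x <= 0:
        return 0.0
    if x <= t:
        # Body: exponential CDF rescaled so that it reaches 1 at the splicing point.
        body = (1 - math.exp(-rate * x)) / (1 - math.exp(-rate * t))
        return pi * body
    # Tail: Pareto CDF with scale t, carrying the remaining mass 1 - pi.
    tail = 1 - (x / t) ** (-alpha)
    return pi + (1 - pi) * tail

# Hypothetical parameters: 80% of losses fall below the splicing point t = 100.
params = dict(t=100.0, pi=0.8, rate=0.02, alpha=2.0)
at_splice = spliced_cdf(100.0, **params)     # equals pi, so the CDF is continuous at t
far_tail = spliced_cdf(200.0, **params)      # 0.8 + 0.2 * (1 - 2 ** -2) = 0.95
```

Because the body is rescaled to end exactly at `pi` and the tail starts from `pi`, the two pieces join without a jump, and quantities such as exceedance probabilities above a retention level follow directly from `1 - spliced_cdf(...)`.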
In this talk, we apply the plots and estimators, available in ReIns, to model real life insurance data. Focus will be on the splicing modelling framework and other methods adapted for censored data.
References Albrecher, Hansjörg, Jan Beirlant, and Jef Teugels. 2017. Reinsurance: Actuarial and Statistical Aspects. Wiley, Chichester.

Beirlant, Jan, Yuri Goegebeur, Johan Segers, and Jef Teugels. 2004. Statistics of Extremes: Theory and Applications. Wiley, Chichester.

Reynkens, Tom, Roel Verbelen, Jan Beirlant, and Katrien Antonio. 2016. “Modelling Censored Losses Using Splicing: A Global Fit Strategy with Mixed Erlang and Extreme Value Distributions.” https://arxiv.org/abs/1608.01566.

Speakers
TR

## Tom Reynkens

Friday July 7, 2017 11:00am - 11:18am CEST
2.02 Wild Gallery

### 11:00am CEST

Keywords: TDA, Persistence, Wavelets, Change-Point Detection
Webpages: https://github.com/speegled/cpbaywave
Topological data analysis (TDA) offers a multi-scale method to represent, visualize and interpret complex data by extracting topological features using persistent homology. We will focus on persistence diagrams, which are a way of representing the persistent homology of a point cloud. At their most basic level, persistence diagrams can give something similar to clustering information, but they also can give information about loops or other topological structures within a data set.
Wavelets are another multi-scale tool used to represent, visualize and interpret complex data. Wavelets offer a way of examining the local changes of a data set while also estimating the global trends.
We will present two algorithms that combine wavelets and persistence. First, we use a wavelet based density estimator to bootstrap confidence intervals in persistence diagrams. Wavelets seem well-suited for this, since if the underlying data lies on a manifold, then the density should have discontinuities that will need to be detected. Additionally, the wavelet based algorithm is fast enough to allow some cross-validation of the tuning parameters. Second, we present an algorithm for detecting the most likely change point of the persistent homology of a time series.
The majority of this talk will consist of presenting examples which will illustrate persistence diagrams, the change point detection algorithm, and the types of changes in geometric and/or topological structure in data that can be detected via this algorithm.

Speakers
DS

## Darrin Speegle

Friday July 7, 2017 11:00am - 11:18am CEST
3.01 Wild Gallery

### 11:00am CEST

Keywords: Bioactivity, Biomarkers, Chemical Structure, Joint Model, Multi-source
Webpages: https://cran.r-project.org/web/packages/IntegratedJM/index.html
Nowadays, data from different sources need to be integrated in order to arrive at meaningful conclusions. In drug-discovery experiments, most of the different data sources related to a new set of compounds under development are high-dimensional. For example, in order to investigate the properties of a new set of compounds, pharmaceutical companies need to analyse the chemical structure (fingerprint features) of the compounds, phenotypic bioactivity (bioassay read-outs) data for targets of interest and transcriptomic (gene expression) data. Perualila-Tan et al. (2016) proposed a joint model in which the three data sources are included to better detect the association between gene expression and biological activity. For a given set of compounds, the joint modeling approach accounts for a possible effect of the chemical structure of the compound on both variables. The joint model allows us to identify genes as potential biomarkers for a compound’s efficacy. The joint modeling approach proposed by Perualila-Tan et al. (2016) is implemented in the IntegratedJM R package, which provides, in addition to model estimation and inference, a set of exploratory and visualization functions that can be used to clearly present the results. The joint model and the IntegratedJM R package are also discussed in detail in Perualila et al. (2016).
References Perualila, Nolen Joy, Ziv Shkedy, Rudradev Sengupta, Theophile Bigirumurame, Luc Bijnens, Willem Talloen, Bie Verbist, Hinrich W.H. Göhlmann, Adetayo Kasim, and QSTAR Consortium. 2016. “Applied Surrogate Endpoint Evaluation Methods with SAS and R.” In, edited by Ariel Alonso, Theophile Bigirumurame, Tomasz Burzykowski, Marc Buyse, Geert Molenberghs, Leacky Muchene, Nolen Joy Perualila, Ziv Shkedy, and Wim Van der Elst, 275–309. CRC Press.

Perualila-Tan, Nolen, Adetayo Kasim, Willem Talloen, Bie Verbist, Hinrich W.H. Göhlmann, QSTAR Consortium, and Ziv Shkedy. 2016. “A Joint Modeling Approach for Uncovering Associations Between Gene Expression, Bioactivity and Chemical Structure in Early Drug Discovery to Guide Lead Selection and Genomic Biomarker Development.” Statistical Applications in Genetics and Molecular Biology 15: 291–304. doi:10.1515/sagmb-2014-0086.

Speakers
RS

Friday July 7, 2017 11:00am - 11:18am CEST
3.02 Wild Gallery

### 11:00am CEST

1: Monash University, Department of Econometrics and Business Statistics, nicholas.tierney@gmail.com
2: Monash University, Department of Econometrics and Business Statistics, dicook@monash.edu
3: Queensland University of Technology, ARC Centre of Excellence for Statistical and Mathematical Frontiers, milesmcbain@gmail.com
Keywords
• Missing Data
• Exploratory Data analysis
• Imputation
• Data Visualization
• Data Mining
• Statistical Graphics
Missing values are ubiquitous in data and need to be carefully explored and handled in the initial stages of analysis to avoid bias. However, exploring why and how values are missing is typically an inefficient process. For example, visualising data with missing values in ggplot2 (Wickham 2009) results in omission of missing values with a warning, and base R silently omits missing values. Additionally, imputed missing data are not typically distinguished in visualisation and data summaries. Tidy data structures described in Wickham (2014) provide an efficient, easy and consistent approach to performing data manipulation and wrangling, where each row is an observation and each column is a variable. There are currently no guidelines for representing missing data structures in a tidy format, nor simple approaches to visualising missing values. This paper describes an R package, naniar, for exploring missing values in data with minimal deviation from the common workflows of ggplot2 and tidy data. Naniar builds data structures and functions that ensure missing values are handled effectively for plotting and summarising data with missing values, and examining the effects of imputation.
References
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23.

Speakers
NT

## Nicholas Tierney

Friday July 7, 2017 11:00am - 11:18am CEST
2.01 Wild Gallery