Keywords: Tidyverse, dplyr, SQL, Apache Impala, Big Data
Webpages: https://CRAN.R-project.org/package=implyr
This talk introduces implyr, a new dplyr backend for Apache Impala (incubating). I compare the features and performance of implyr with those of dplyr backends for other distributed query engines, including sparklyr for Apache Spark’s Spark SQL, bigrquery for Google BigQuery, and RPresto for Presto.
Impala is a massively parallel processing query engine that enables low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS), Apache HBase, Apache Kudu, and Amazon Simple Storage Service (S3). The distributed architecture of Impala enables fast interactive queries on petabyte-scale data, but it imposes limitations on the dplyr interface. For example, row ordering of a result set must be performed in the final phase of query processing. I describe the methods used to work around this and other limitations.
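To make the ordering constraint concrete, the following is a minimal sketch that places arrange() as the last verb in a pipeline, so that the generated SQL has its ORDER BY clause in the outermost query. It does not use implyr itself: it assumes the dbplyr package is installed and relies on its simulated Impala connection, so no cluster is needed. The table contents and column names (carrier, dep_delay) are hypothetical.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by dbplyr's simulated Impala connection.
# The columns here are illustrative assumptions, not a real dataset.
flights <- lazy_frame(
  carrier = "AA",
  dep_delay = 5,
  con = simulate_impala()
)

# Aggregate first, then order: arrange() is the final step,
# so ORDER BY ends up in the last phase of the generated query.
query <- flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))

# Render the SQL that would be sent to the engine.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

Moving arrange() earlier in the pipeline would put the ordering inside a subquery, where a distributed engine is free to discard it.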
Finally, I discuss broader issues regarding the DBI-compatible interfaces that dplyr requires for underlying connectivity to database sources. implyr is designed to work with any DBI-compatible interface to Impala, such as the general packages odbc and RJDBC, whereas other dplyr database backends typically rely on one particular package or mode of connectivity.
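As an illustration of this driver-agnostic design, the sketch below wraps the same implyr entry point, src_impala(), around two different DBI drivers. It assumes the implyr, odbc, and RJDBC packages are installed. The helper names connect_odbc and connect_jdbc, the host, port, driver name, driver class, and classpath are all hypothetical placeholders that would need to match a real installation. The functions are only defined here, not called, because calling them requires a reachable Impala cluster.

```r
# Hypothetical helper: connect through the odbc package.
# The ODBC driver name is an assumption; use the name registered on your system.
connect_odbc <- function(host, port = 21050, database = "default") {
  implyr::src_impala(
    drv = odbc::odbc(),
    driver = "Cloudera ODBC Driver for Impala",
    host = host,
    port = port,
    database = database
  )
}

# Hypothetical helper: connect through the RJDBC package.
# The driver class and the classpath argument are assumptions that depend on
# which Impala JDBC driver JAR files are installed locally.
connect_jdbc <- function(url, user, password, classpath) {
  drv <- RJDBC::JDBC(
    driverClass = "com.cloudera.impala.jdbc41.Driver",
    classPath = classpath,
    identifier.quote = "`"
  )
  implyr::src_impala(drv, url, user, password)
}
```

Either helper would return an object that can be passed to dplyr's tbl(), for example tbl(connect_odbc("impala-host"), "flights"), where the host and table names are again placeholders.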