useR!2017 has ended
Wednesday, July 5 • 2:42pm - 3:00pm
Neural Embeddings and NLP with R and Spark


Keywords: NLP, Spark, Deep Learning, Network Science
Webpages: https://github.com/akzaidi
Neural embeddings (Bengio et al. 2003; Olah 2014) aim to map words, tokens, and general compositions of text to vector spaces, which makes them amenable to modeling, visualization, and inference. In this talk, we describe how to use neural embeddings of natural and programming languages in R and Spark. In particular, we’ll see how combining the distributed computing paradigm of Spark with the interactive programming and visualization capabilities of R can make exploration and inference of natural language processing models easy and efficient.
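
As a rough illustration of why vector representations are convenient for inference, the sketch below compares words by cosine similarity. It is written in Python for brevity and is not taken from the talk. The three-dimensional vectors and the helper names `cosine_similarity` and `nearest` are invented for this example; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings are learned from a corpus, not written by hand.
EMBEDDINGS = {
    "commit": [0.9, 0.1, 0.0],
    "push": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    query = embeddings[word]
    candidates = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    print(nearest("commit", EMBEDDINGS))
```

With these toy vectors, "commit" sits much closer to "push" than to "banana", which is the kind of geometric relationship that modeling and visualization can then exploit.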
Building upon the tidy data principles formalized in Wickham (2014), Silge and Robinson (2016) have provided the foundations for building natural language models with the tidytext package. In this talk, we’ll describe how to build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and RevoScaleR packages.
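
The tidy text idea of one token per row can be sketched in a few lines. The Python snippet below is only an approximation of that layout and not the tidytext API: the function names `tidy_tokens` and `count_tokens`, the regular expression used for splitting, and the sample commit messages are all assumptions made for this example.

```python
import re
from collections import Counter


def tidy_tokens(documents):
    """Turn (doc_id, text) pairs into one (doc_id, token) row per token.

    Text is lowercased and split on anything that is not a letter,
    digit, or apostrophe. This is a simplification chosen for the sketch.
    """
    rows = []
    for doc_id, text in documents:
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            rows.append((doc_id, token))
    return rows


def count_tokens(rows):
    """Count how often each token appears across all rows."""
    return Counter(token for _, token in rows)


if __name__ == "__main__":
    # Hypothetical commit messages, used only as sample input.
    messages = [(1, "Fix bug"), (2, "fix typo")]
    rows = tidy_tokens(messages)
    print(rows)
    print(count_tokens(rows).most_common(1))
```

Because every row has the same shape, counting, filtering, and joining tokens reduce to ordinary table operations, which is what lets such a pipeline move from a local prototype to a distributed backend.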
To illustrate the utility of this framework, we’ll provide an example in which we train a sequence-to-sequence neural attention model for summarizing git commits, pull requests, and their associated messages (Zaidi 2017), and then deploy it on Spark clusters, where we can do efficient network analysis on the neural embeddings with a sparklyr extension to GraphFrames.
References

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. 3 (March). JMLR.org: 1137–55. http://dl.acm.org/citation.cfm?id=944919.944966.

Olah, Christopher. 2014. “Deep Learning, NLP, and Representations.” https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. doi:10.18637/jss.v059.i10.

Zaidi, Ali. 2017. “Summarizing Git Commits and Github Pull Requests Using Sequence to Sequence Neural Attention Models.” CS224N: Final Project, Stanford University.

Ali Zaidi

Data Scientist, Microsoft

Wednesday July 5, 2017 2:42pm - 3:00pm CEST
4.02 Wild Gallery