useR!2017 has ended
Wednesday, July 5 • 2:24pm - 2:42pm
Text Analysis and Text Mining Using R

Keywords: text analysis, text mining, machine learning, social media
Summary: A useR! talk about text analysis and text mining using R. I will cover the broad set of tools for text analysis and natural language processing in R, with an emphasis on my R package quanteda but also covering other major tools in the R ecosystem for text analysis (e.g. stringi).
The talk is a tutorial covering how to perform common text analysis and natural language processing tasks using R. Contrary to a belief popular among some data scientists, when used properly, R is a fast and powerful tool for managing even very large text analysis tasks. My talk will present the many options available, demonstrate that these work on large data, and compare the features of R for these tasks with popular options in Python.
Specifically, I will demonstrate how to format and input source texts, how to structure their metadata, and how to prepare them for analysis. This includes common tasks such as tokenisation, including constructing ngrams and “skip-grams”, removing stopwords, stemming words, and other forms of feature selection. I will also show how to tag parts of speech and parse structural dependencies in texts. For statistical analysis, I will show how R can be used to get summary statistics from text, search for and analyse keywords and phrases, analyse text for lexical diversity and readability, detect collocations, apply dictionaries, and measure term and document associations using distance measures. The analysis covers basic text-related data processing in the base R language, but mostly relies on the quanteda package (https://github.com/kbenoit/quanteda) for the quantitative analysis of textual data. I will also cover how to pass the structured objects from quanteda into other text analytic packages for topic modelling, latent semantic analysis, regression models, and other forms of machine learning.
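To make the preparation steps above concrete, here is a minimal sketch in base R only. It is not taken from the talk and is not how quanteda implements these steps; quanteda offers dedicated functions such as tokens() and tokens_ngrams() for the same tasks. The two example texts, the tiny stopword list, and every helper name (tokenise, remove_stopwords, ngrams, skip_bigrams) are assumptions made up for illustration.

```r
# Toy documents, invented for this illustration
texts <- c(d1 = "The quick brown fox jumps over the lazy dog.",
           d2 = "A quick brown dog barks at the fox!")

# Tokenise one text: lowercase, replace punctuation with spaces,
# then split on whitespace
tokenise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  words[nzchar(words)]
}

# A deliberately tiny, hypothetical stopword list
stopwords_toy <- c("the", "a", "at", "over")

# Drop every token that appears in the stopword list
remove_stopwords <- function(tokens, stops) tokens[!tokens %in% stops]

# n-grams: runs of n adjacent tokens joined by "_"
ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Skip-grams (bigram case): pairs of tokens with 0..skip tokens between them
skip_bigrams <- function(tokens, skip = 1) {
  out <- character(0)
  len <- length(tokens)
  for (i in seq_len(len)) {
    for (gap in 0:skip) {
      j <- i + gap + 1
      if (j <= len) out <- c(out, paste(tokens[i], tokens[j], sep = "_"))
    }
  }
  out
}

toks    <- lapply(texts, tokenise)
toks    <- lapply(toks, remove_stopwords, stops = stopwords_toy)
bigrams <- lapply(toks, ngrams, n = 2)
skips   <- lapply(toks, skip_bigrams, skip = 1)

# Document-feature count matrix: one row per document, one column per word
vocab   <- sort(unique(unlist(toks)))
dfm_toy <- t(vapply(toks,
                    function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                    integer(length(vocab))))
colnames(dfm_toy) <- vocab

print(bigrams$d1)
print(dfm_toy)
```

Stemming, part-of-speech tagging and dependency parsing are left out because they need language resources that base R does not ship with. The count matrix at the end is the kind of structured object that downstream steps such as distance measures or topic models typically take as input.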

About me: Kenneth Benoit is Professor of Quantitative Social Research Methods at the London School of Economics and Political Science. His current research focuses on automated, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. For the past 5 years, he has been developing a major R package for text analysis, quanteda, as part of European Research Council grant ERC-2011-StG 283794-QUANTESS.



Wednesday July 5, 2017 2:24pm - 2:42pm CEST
4.02 Wild Gallery