useR!2017 has ended
Wednesday, July 5 • 1:48pm - 2:06pm
manifestoR - a tool for data journalists, a source for text miners and a prototype for reproducibility software

Keywords: political science, reproducibility, corpus, data journalism, text mining
Webpages: https://CRAN.R-project.org/package=manifestor, https://manifesto-project.wzb.eu/information/documents/manifestoR
The Manifesto Project is a long-term political science research project that has been collecting, archiving and analysing party programmes from democratic elections since 1979, and it is one of the longest-standing and most widely used data sources in political science. The project recently released manifestoR as its official R package for accessing and analysing the data it collects. The package is aimed at three groups: it is a valuable tool for data journalism and the social sciences, a data source for text mining, and a prototype for software that promotes research reproducibility.
The manifestoR package provides access to the Manifesto Corpus (Merz, Regel & Lewandowski 2016), the project’s text database. The corpus contains more than 3000 digitised election programmes from 573 parties that ran in elections between 1946 and 2015 in 50 countries, and it includes documents in more than 35 languages. More than 2000 of these documents are available as digitised, cleaned, UTF-8 encoded full text; the rest are available as PDF files. Because these texts are accessible directly from within R, manifestoR is a convenient and valuable data source for text miners interested in political and/or multilingual training data, as well as for data journalists.
The manifesto texts accessible through manifestoR are labelled statement by statement according to a 56-category scheme that identifies policy issues and positions. On the basis of this labelling scheme, the political science community has developed many aggregate indices that place parties’ ideological positions on different scales. Most of these algorithms have been collected and included in manifestoR to provide a centralised, easy-to-use starting point for scientific and journalistic analyses.
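To illustrate how statement-level labels can be aggregated into a position score, here is a minimal, language-neutral sketch in Python. It is not manifestoR's API and not any published index: the category labels and the left/right grouping are hypothetical placeholders, and the score is simply the share of "right" statements minus the share of "left" statements, in percentage points.

```python
from collections import Counter

# Hypothetical category labels, for illustration only.
# The real coding scheme has 56 categories with their own codes.
RIGHT = {"free_market", "law_and_order"}
LEFT = {"welfare_expansion", "market_regulation"}


def position_score(labels):
    """Return (% right-labelled statements) - (% left-labelled statements).

    `labels` holds one category label per coded statement of a document.
    Statements in categories outside both groups count only towards the total.
    """
    if not labels:
        raise ValueError("document has no labelled statements")
    counts = Counter(labels)
    right = sum(counts[c] for c in RIGHT)
    left = sum(counts[c] for c in LEFT)
    return 100.0 * (right - left) / len(labels)


# A toy "manifesto" with four coded statements.
toy_document = ["free_market", "welfare_expansion", "welfare_expansion", "other"]
score = position_score(toy_document)
```

Negative values indicate more left-labelled than right-labelled statements under this toy grouping. Published indices differ in which categories they group and in how they scale the result.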
Replicability and reproducibility of scientific analyses are core values of the R community and are of growing importance in the social sciences. Hence, manifestoR was designed with reproducible research in mind and tries to set an example of how a political science research project can publish and maintain an open source package that promotes reproducibility when its data are used. The Manifesto Project’s text collection is constantly growing and being updated, but any version ever published can easily be used as the basis for scripts written with manifestoR. In addition, the package integrates seamlessly with the widely used tm package (Feinerer 2008) for text mining in R, and provides a data_frame representation for every data object in order to connect to the tidyverse packages (Wickham 2014), including the text-specific tidytext (Silge & Robinson 2016). To standardise and open-source the implementations of the community's aggregate indices in manifestoR, we sought collaboration with the original authors. Additionally, the package provides infrastructure for adapting such indices or creating new ones. The talk will also discuss the lessons learned and the unmet challenges that have arisen in developing such a package specifically for the political science community.
References
  • Feinerer, Ingo (2008). A text mining framework in R and its applications. Doctoral thesis, WU Vienna University of Economics and Business.
  • Merz, N., Regel, S., & Lewandowski, J. (2016). The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Research & Politics, 3(2), 2053168016643346. doi: 10.1177/2053168016643346
  • Silge, J., & Robinson, D. (2016). tidytext: Text Mining and Analysis Using Tidy Data Principles in R. Journal of Open Source Software, 1(3). doi: 10.21105/joss.00037
  • Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi: 10.18637/jss.v059.i10




Wednesday July 5, 2017 1:48pm - 2:06pm CEST
4.02 Wild Gallery