Internet as Data Source in the Istat Survey on ICT in Enterprises
DOI: https://doi.org/10.17713/ajs.v44i2.53
Abstract
The Istat sampling survey on ICT in enterprises aims at producing information on the use of ICT, and in particular on the use of the Internet, by Italian enterprises for various purposes (e-commerce, e-recruitment, advertisement, e-tendering, e-procurement, e-government). To this end, data are collected by means of the traditional instrument of the questionnaire. Istat has begun to explore the possibility of using web scraping techniques, combined in the estimation phase with text and data mining algorithms, with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach. The 8,600 websites indicated by the 19,000 enterprises responding to the 2013 ICT survey have been scraped, and the acquired texts have been processed in order to try to reproduce the same information collected via questionnaire. Preliminary results are encouraging, showing in some cases a satisfactory predictive capability of the fitted models (mainly those obtained by using the Naive Bayes algorithm). The method known as Content Analysis has also been applied, and its results compared to those obtained with classical learners. In order to improve the overall performance, an advanced system for scraping and mining is being adopted, based on the open source Apache suite Nutch-Solr-Lucene. On the basis of the final results of this test, an integrated system harnessing both survey data and data collected from the Internet to produce the required estimates will be implemented, based on systematic scraping of the nearly 100,000 websites related to the whole population of Italian enterprises with 10 or more persons employed, operating in industry and services. This new approach, based on the Internet as Data source (IaD), is characterized by advantages and drawbacks that need to be carefully analysed.
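As an illustration of the kind of text classification the abstract describes, the following is a minimal Python sketch of a multinomial Naive Bayes learner that predicts a yes/no questionnaire item (here, whether a site offers e-commerce) from scraped website text. It is not the procedure used in the study: the class name `NaiveBayesText`, the `tokenize` helper, and the toy training texts and labels are all invented for this example, and a real application would train on the scraped corpus with the survey answers as labels.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesText:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing (hypothetical helper)."""

    def fit(self, texts, labels):
        # Number of training documents per class, used for the class priors.
        self.class_counts = Counter(labels)
        # Word frequencies per class, used for the word likelihoods.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / n_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: website text paired with an assumed e-commerce answer.
train_texts = [
    "add to cart and checkout with credit card",
    "online shop free shipping order now",
    "buy online secure payment cart",
    "company history and contact information",
    "our mission team and office address",
    "contact us for a quote by phone",
]
train_labels = ["yes", "yes", "yes", "no", "no", "no"]

model = NaiveBayesText().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on the text of a new website returns the more probable label. Working in log space avoids numerical underflow on long texts, and add-one smoothing keeps a word unseen in one class from driving that class's probability to zero.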
References
Hoekstra R, ten Bosch O, Harteveld F (2012). "Automated data collection from web sources for official statistics: First experiences." Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, 28(3-4), 99-111.
Hopkins D, King G (2010). "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science, 54(1), 229-247.
James G, Witten D, Hastie T, Tibshirani R (2013). An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics.
Jurka T, Collingwood L, Boydstun A, Grossman E, van Atteveldt W (2014). RTextTools: Automatic Text Classification via Supervised Learning. R package version 1.4.2, URL http://CRAN.R-project.org/package=RTextTools.
Lantz B (2013). Machine Learning with R. Packt Publishing Ltd.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2014). e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.6-3, URL http://CRAN.R-project.org/package=e1071.
ten Bosch O, Windmeijer D (2014). "On the Use of Internet Robots for Official Statistics." In MSIS-2014.
Williams G (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R!, Springer.
License
The Austrian Journal of Statistics publishes open access articles under the terms of the Creative Commons Attribution (CC BY) License.
The Creative Commons Attribution License (CC BY) allows users to copy, distribute and transmit an article, adapt the article and make commercial use of the article. The CC BY license permits commercial and non-commercial re-use of an open access article, as long as the author is properly attributed.
Copyright on any research article published by the Austrian Journal of Statistics is retained by the author(s). Authors grant the Austrian Journal of Statistics a license to publish the article and identify itself as the original publisher. Authors also grant any third party the right to use the article freely as long as its original authors, citation details and publisher are identified.
Manuscripts should be unpublished and not be under consideration for publication elsewhere. By submitting an article, the author(s) certify that the article is their original work, that they have the right to submit the article for publication, and that they can grant the above license.