Discovering Data Sets in Unstructured Corpora: Discovering Use and Identifying New Opportunities

Title: Discovering Data Sets in Unstructured Corpora: Discovering Use and Identifying New Opportunities
Publication Type: Journal Article
Year of Publication: 2024
Authors: Pallotta, N, Locklear, JM, Ren, X, Robila, V, Alaeddini, A
Journal: Harvard Data Science Review
Abstract

Federal statistical agencies are keenly aware that scientific research is not the only way in which their data assets are used for evidence building. Entities around the world have been building services on top of curated corpora of scientific research to help provide insight into the importance of the works in each collection. These services also provide a much-needed framework for statistical agencies and other data set creators to search for and find the uses and impacts of their data sets. While the other articles in this special issue largely apply machine learning models to find data sets in an extensively curated corpus, this article starts with a much less structured framework and examines the potential to discover how data sets are used in writing that is targeted at a broader base of users than scientific researchers. It describes the challenges and lessons learned from the exercise and highlights an ever-growing value proposition for organizing collections of work.

DOI: 10.1162/99608f92.77bfa1c9