Early Detection of Research Trends

This post is a hub for my doctoral project. It gathers pointers to my papers and related activities.

Abstract

Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this dissertation, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the ‘ancestors’ of the new topic. Based on this understanding, we developed Augur, a novel approach to effectively detecting the emergence of new research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe and outperformed four alternative approaches in terms of both precision and recall.

I submitted my Ph.D. dissertation on 28 Nov 2018 (post) and defended it on 31 May 2019 (post). On 9 Oct 2019, I was officially awarded the degree of Doctor of Philosophy. Below you can find a more detailed timeline.

 

Download Dissertation from arXiv

 

 

Timeline

Here is an interactive timeline of my Ph.D. journey.

 

Introduction

The research environment changes and evolves rapidly: new research areas constantly emerge, whereas some others fade out. The ability to promptly recognise the emergence of new research topics is an important asset for anybody involved in the research environment, including journal editors, academic publishers, researchers, institutional funding bodies and other relevant stakeholders. Nowadays, as we are experiencing an exponential growth of research publications [1], keeping up with new trends is becoming progressively more challenging.

In the last two decades, very large repositories of scholarly data and other relevant sources have become available, opening the way to new data-intensive approaches capable of detecting novel topics and their trends [2–4]. However, the timely detection of research topics is still an open challenge.

In this work, we address this challenge by creating a system able to anticipate the emergence of new research topics. Specifically, we define a research topic as a subject of study or issue that is of interest to the research community and is addressed in at least a few research papers. Subjects of study can encompass different research areas and application-dependent concepts. For instance, a paper introducing a new technology to classify data is the event that triggers the emergence of a new topic. Papers discussing the implementation of such technology, its application, the release of a new efficient implementation, its evaluation and so on are all part of the same research topic.

 

Problem Statement

Looking closely at the evolution of research topics, we can observe that they go through a sequence of life stages. Specifically, we can distinguish three main stages: i) embryonic, ii) early and iii) recognised, as shown in the figure below. A research topic in its embryonic stage is still an idea or concept and has not yet emerged. In this stage, a topic has not yet been explicitly labelled and recognised by a research community, but it is already taking shape, as evidenced by the fact that researchers from a variety of fields are forming new collaborations and producing new work, starting to define the challenges and the paradigms associated with the emerging new area. A research topic in its early stage, instead, has recently emerged, and a relatively small group of researchers agree on certain theories which will allow the topic to thrive. As a result, there is a new label for it and it is associated with a limited number of papers. Afterwards, when a research topic enters its recognised stage, it becomes mature and many researchers are actively producing and disseminating their results. The topic is then associated with a substantial number of papers. As an example, for the Semantic Web, we can observe that before its emergence in 2001, in its embryonic stage, it was still a concept around which communities of researchers from Artificial Intelligence, World Wide Web, Knowledge Representation, and Knowledge-Based Systems were joining forces. After 2001, the topic emerged and entered its early stage, earning its own identity, and an increasing number of practitioners started to work on it. Around 2005, the Semantic Web reached maturity (recognised stage), with over 1500 papers published each year.

Representation of a topic lifecycle in terms of published research papers.

Current approaches for detecting research topics [2–4] focus on highlighting topics that are already associated with a number of publications and consistently referred to by a community of researchers. This limitation means that these solutions can only identify topics that already have an initial degree of consolidation, i.e., after a certain latency from their emergence. As a result, they can only provide limited value to stakeholders who wish to anticipate and promptly react to new developments in the research landscape.

A strategy to address this problem would be to anticipate the emergence of new research topics by detecting them at their embryonic stage. This stage was initially observed by Thomas Kuhn, who argued in his book [5] that topics might exist in embryonic form. However, it has never been studied quantitatively, probably due to the complexity of the issue and the difficulty of formally defining the notion of a research topic.

Nevertheless, in this dissertation we address this problem by introducing a novel framework, Augur, for identifying the appearance of new topics at an embryonic stage. Augur analyses networks of research topics, detects areas exhibiting a significant increase in the pace of collaboration, and produces clusters of topics correlated with the future emergence of new research areas. For instance, had it been available before 2001, Augur would have allowed us to observe that previously less connected topics (e.g., Artificial Intelligence, World Wide Web and Knowledge-Based Systems) were increasingly collaborating with each other*. These dynamics indeed led to the emergence of a new research area, later labelled as Semantic Web by Berners-Lee.

 

* In this dissertation, we use the expression “collaboration between research areas” as a shorthand for “collaboration between research communities associated with specific research areas”. The community of a research area is given by the authors who publish in the area in question.

 

Motivation

As already mentioned, understanding and promptly reacting to new developments in the research landscape is critical for a variety of stakeholders. For instance, researchers need to stay up-to-date with new trends related to their topics and potentially interesting new research areas. Thanks to these insights, they can evolve their research agenda, ensuring they focus on ideas and concepts at the leading edge of the current landscape.

Institutional funding bodies and companies also need to be aware of the latest research developments and promising trends. For instance, this knowledge can inform business decisions and suggest the technologies in which to invest.

Similarly, it is crucial for academic publishers and editors to know about emerging topics in advance, so that they can offer the most up-to-date and interesting content. For instance, a publisher can gain a competitive advantage by being the first to recognise the importance of a new trend and publish a special issue or a journal about it. Indeed, financial support for this doctoral project comes from Springer Nature, which is a global publishing company.

The undeniable potential of an approach for detecting novel research trends has recently attracted increasing research interest in this area. However, current state-of-the-art solutions suffer from significant limitations when applied to the detection of trends at their early stage.

 

Research questions

The main research question investigated in this dissertation is:

Is it possible to detect a new research topic at the embryonic stage before it is consistently recognised by a research community (e.g., there is an established label for it)?

We focused specifically on the creation of a novel approach that is able to detect the emergence of new research topics at their embryonic stage. Given the breadth of this problem, we broke the main question down into a set of related questions:

  1. Is it possible to precisely define the notion of an established topic?
  2. How early in the topic lifecycle is it possible to identify an emerging topic?
  3. What are the indicators that can be exploited to predict the emergence of new topics?
  4. Is it possible to develop an effective computational method that can support this prediction task?
  5. Are there commonalities between our approach to predicting the emergence of new topics and epistemological theories of research dynamics?
  6. What evaluation mechanisms are appropriate for this task?

 

Research hypotheses

This research project centred around three main hypotheses:

  1. Before being labelled and recognised by research communities, new topics go through an embryonic stage, in which researchers from different areas start to work on them.
  2. The emergence of a new research topic is anticipated by an increased rate of interaction between the pre-existing topics involved in developing this new area, which is still in its embryonic stage.
  3. It is possible to create an automatic approach for detecting new emerging topics in their embryonic stage by analysing the dynamics of existing topics (i.e., observing their patterns of collaboration).

 

The AUGUR framework

Augur is a novel framework that detects the emergence of new research areas by analysing topic networks and identifying clusters associated with an overall increase of the pace of collaboration.

Augur operates in three steps. Firstly, it creates an evolutionary network describing the collaboration of research topics over time. The evolutionary network is a fully weighted graph in which nodes represent topics and arcs represent the collaboration between pairs of topics. Node and arc weights represent how much the topics and their collaborations are growing over time.
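
To make this first step more concrete, here is a minimal sketch of how such a fully weighted graph could be assembled. It is not the actual Augur implementation: it assumes that collaboration between two topics can be approximated by their co-occurrence in the same papers, and that "growth in time" can be measured as the least-squares slope of the yearly counts. The function names and the input layout (a dictionary mapping each year to a list of topic sets, one per paper) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def slope(values):
    """Least-squares slope of `values` plotted against 0, 1, ..., n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def build_evolutionary_network(papers_by_year):
    """Return (node_weights, arc_weights) for a topic network.

    papers_by_year: dict mapping a year to a list of sets of topics,
    one set per paper (hypothetical input layout).
    Weights are growth slopes, an assumption made for this sketch.
    """
    years = sorted(papers_by_year)
    node_counts = defaultdict(lambda: [0] * len(years))
    arc_counts = defaultdict(lambda: [0] * len(years))
    for i, year in enumerate(years):
        for topics in papers_by_year[year]:
            for topic in topics:
                node_counts[topic][i] += 1
            # Each pair of topics sharing a paper counts as one collaboration.
            for a, b in combinations(sorted(topics), 2):
                arc_counts[(a, b)][i] += 1
    node_weights = {t: slope(c) for t, c in node_counts.items()}
    arc_weights = {pair: slope(c) for pair, c in arc_counts.items()}
    return node_weights, arc_weights
```

With this layout, an arc whose yearly count goes from one to two to three shared papers gets a weight of 1.0, while a stable collaboration gets a weight of 0.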

As a second step, Augur uses a novel clustering algorithm, the Advanced Clique Percolation Method (ACPM), for locating areas of the network that exhibit a significant increase in the pace of collaboration. ACPM crawls the evolutionary network and groups together nodes (i.e., topics) into sets of nodes (i.e., clusters), such that each cluster is densely connected internally (i.e., high pace of collaboration).
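
ACPM itself is described in the dissertation and is not reproduced here. As an illustration of the family of algorithms it belongs to, the sketch below implements the standard clique percolation idea in plain Python: keep only the arcs whose growth weight exceeds a threshold, list all k-cliques, and merge cliques that share k-1 nodes. The threshold, the brute-force clique enumeration (only suitable for small graphs) and the function name are assumptions made for this example.

```python
from itertools import combinations


def clique_percolation(arc_weights, k=3, threshold=0.0):
    """Standard clique percolation on arcs with weight > threshold.

    arc_weights: dict mapping (topic_a, topic_b) to a growth weight.
    Returns a list of sets of topics. This is NOT ACPM, only a
    baseline sketch of the clique percolation idea.
    """
    # Build an undirected adjacency map from the arcs that are growing.
    adj = {}
    for (a, b), weight in arc_weights.items():
        if weight > threshold:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)

    # Enumerate all k-cliques by brute force (fine for small graphs only).
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(adj), k)
        if all(v in adj[u] for u, v in combinations(nodes, 2))
    ]

    # Union-find: two k-cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # Each community is the union of the cliques in one component.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)
```

Because communities are unions of adjacent cliques, a topic can belong to more than one community, which is why a later merging step is useful.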

In the third and final step, Augur post-processes the results, by merging and filtering the resulting clusters. As some communities can overlap, Augur aggregates the most similar ones and for each cluster it returns the topics that have the most intense collaboration.
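
The merging part of this step can be sketched as follows. This is only an illustration under stated assumptions: similarity between clusters is measured with the Jaccard index, and two clusters are merged when their similarity reaches a hypothetical threshold of 0.5. The similarity measure and threshold actually used by Augur may differ.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_clusters(clusters, threshold=0.5):
    """Greedily merge clusters whose Jaccard similarity is >= threshold.

    clusters: iterable of sets of topics.
    The threshold value is an assumption made for this sketch.
    """
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    # Fold cluster j into cluster i and restart the scan,
                    # since the enlarged cluster may now match others.
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

Restarting the scan after every merge keeps the logic simple at the cost of some speed, which is acceptable for the small number of clusters in an example like this.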

The output of the process is a collection of sets of existing topics, each nurturing a new research area that is expected to emerge shortly, together with the relevant authors and publications.

Augur was evaluated on the task of forecasting the emergence of 1,408 research topics in the 2000-2011 period. It outperformed four alternative approaches in terms of both precision and recall, successfully predicting many new research topics as early as two years before they became explicitly recognised in the research community.
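
For readers unfamiliar with the two measures, the sketch below shows one simple way precision and recall could be computed for this kind of task. It is an assumption-laden illustration, not the evaluation protocol of the dissertation: a predicted cluster counts as correct if its Jaccard similarity with at least one gold-standard cluster reaches a hypothetical threshold, and a gold-standard cluster counts as found under the same condition.

```python
def precision_recall(predicted, gold, threshold=0.5):
    """Precision and recall of predicted clusters against gold clusters.

    predicted, gold: lists of sets of topics.
    The Jaccard-based matching rule and the threshold are assumptions
    made for this sketch.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Predicted clusters that match at least one gold cluster.
    matched_predicted = sum(
        any(jaccard(p, g) >= threshold for g in gold) for p in predicted
    )
    # Gold clusters that are matched by at least one predicted cluster.
    matched_gold = sum(
        any(jaccard(p, g) >= threshold for p in predicted) for g in gold
    )
    precision = matched_predicted / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall
```

Precision answers "how many of the predicted clusters corresponded to a topic that really emerged", while recall answers "how many of the topics that emerged were anticipated by some predicted cluster".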

 

Presentation

Here are the slides I prepared for a 10-minute presentation about this project for my thesis defence.


 

Relevant Papers

Here is the list of my papers in reverse chronological order:

On the 8th of February 2017, I gave a seminar to my department in which I described my doctoral work, including hypotheses, research questions, main assumptions, approaches, and some preliminary results. To replay the seminar follow this link: PLAY

 

Extended Abstract

 

Download Dissertation

You can download an electronic version of my dissertation from our Institutional Repository (ORO): http://oro.open.ac.uk/67224

or from arXiv: https://arxiv.org/abs/1912.08928

Bibliography

  1. Larsen, P.O., von Ins, M.: The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics. 84, 575–603 (2010).
  2. Wu, Y., Venkatramanan, S., Chiu, D.M.: Research collaboration and topic trends in Computer Science based on top active authors. PeerJ Comput. Sci. 2, e41 (2016).
  3. He, Q., Chen, B., Giles, C.L.: Detecting topic evolution in scientific literature: how can citations help? CIKM. 957–966 (2009).
  4. Bolelli, L., Ertekin, Ş., Giles, C.L.: Topic and trend detection in text collections using latent dirichlet allocation. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 5478 LNCS, 776–780 (2009).
  5. Kuhn, T.S.: The structure of scientific revolutions.