Information overload

Bhaskar, Suri

Almost every scientist is trying to contribute to the fight against the pandemic. This has led to everyone, including non-scientists, writing on almost every aspect of the crisis. How does one make sense of it all?

COVID-19 has bewildered the world as much as it has challenged the global scientific community, which is making every possible effort to learn more about the virus and to find effective diagnostic assays, drugs and vaccines. Learning about the structure, behaviour and prevalence of the virus to help design public strategies and medical interventions is the need of the hour. Almost every scientist is trying to contribute to the fight against the pandemic. This has led to everyone, including non-scientists, writing and publishing on almost every aspect of the spread. The research efforts of the scientific community are visible in the mammoth quantities of data generated through publications, which are growing by thousands every day. More than 4,000 new scientific papers pertaining to the disease and the virus were added in just one week.

COVID-19 papers have been downloaded more than 150 million times since publishers brought down paywalls on research related to the pandemic. Since January, several major publishers have made around 50,000 COVID-19-related papers freely available. This is the biggest explosion the scientific literature has ever seen. It is becoming increasingly difficult for scientists to keep pace with the growing volume of information and to sift through all of it to find what is relevant to their areas of research. The world is struggling to find ways to manage and effectively use the scientific information being generated. Thankfully, data scientists and software developers across the world have geared up, with the help of journal publishers, to create new search tools. These take the form of datasets produced through data-cleaning efforts and curated sets that group publications into collections of similar studies, while also highlighting the strong papers in those areas of research.

Efforts are also being made to cut out unnecessary noise through automated search tools powered by artificial intelligence (AI), so that a researcher lands directly on the information being sought, saving a lot of time and effort. Several datasets and databases have surfaced to help ease the COVID-19 information overload for researchers.

WHO COVID-19 database: The World Health Organisation (WHO) maintains a COVID-19 database that gathers the latest multilingual scientific findings and knowledge. The dataset is updated daily through searches of bibliographic databases, hand searches and expert-referred scientific articles from the global literature. Efforts are under way to build a more comprehensive database through collaboration with key partners to enrich citations. The WHO database has over 18,000 publications, searchable in many languages by title, abstract or subject. Its global research page provides quick updates. There is also an international clinical trials registry platform that provides updates on the WHO Solidarity Trial for accelerating the development of a safe and effective vaccine.

The Lancet COVID-19 Resource Centre: It brings together new content from across The Lancet journals as it is published, making the content free to access in order to assist health workers and researchers. Similarly, there are other COVID-19 resources from Cambridge University Press, the Centers for Disease Control and Prevention, the Chinese Medical Association, Cochrane, Elsevier, the European Centre for Disease Prevention and Control (ECDC), JAMA Network, LitCovid (US National Library of Medicine), the New England Journal of Medicine and Oxford University Press.

The CORD-19 dataset: The COVID-19 Open Research Dataset Challenge, an initiative of the White House Office of Science and Technology Policy, has brought together the Semantic Scholar team at the Allen Institute for AI with the likes of Google, the Chan Zuckerberg Initiative and the National Institutes of Health (NIH) to create a free resource of open tools and datasets covering over 63,000 scholarly articles.

This is the largest structured dataset catering to the ongoing needs of the global research community. The corpus is updated regularly with current research from peer-reviewed publications in sources such as PubMed's PMC and the corpus maintained by the WHO, and from preprint servers such as bioRxiv and medRxiv, based on searches for COVID-19 and coronavirus research. In addition to these major databases, there are numerous other datasets, literature repositories, specific information resources, re-purposing databases and technological advancements that are helping the world make meaning out of the mayhem.

Ethical concerns due to rapid publication rate: There is ample reason to be concerned about the quality of data, as well as the regulatory and ethical issues surrounding data generated at such a quick pace. Questions have been raised about gaps in the information generated, its quality and its thoroughness. A recent retraction in The Lancet and the scandal associated with it are proof enough of this cause for concern. Social media platforms have been instrumental in quickly releasing information about significant research findings and in providing instant feedback through online comments or suggestions for the study, as well as linking the information with similar studies going on elsewhere in the world. However, not all significant research findings garner the same attention on social media platforms for researchers to pick up and relate to. In fact, some do not surface on such platforms at all, or drown in endless tweets or Facebook posts when there is too much to report on a subject, and it is hard to catch up unless one spends too much time on these platforms.

A general search for research publications, based on keywords like Coronavirus and COVID-19, has shown that while there are quite a few promising studies and publications of high quality that can be pursued further, most other publications are analyses, commentaries or incomplete studies that have been reported either to hasten publication or to gain visibility. A thorough peer review is missing in most cases and hence the authenticity and quality of the data raise grave concern. In many cases, the research findings reported do not support the conclusions, which have been overstated to ensure publication in prestigious journals. While most of the literature on COVID-19 is freely available, around 20 per cent of the research publications are still behind paywalls, and this share is expected to grow in the near future to almost half of the total. That makes a comprehensive analysis quite difficult.

The role of biological resource centres: Rapid data sharing is important to help identify the causative agent; investigate and predict the extent of disease spread; define diagnostic protocols; and evaluate treatments and methods to contain further spread. The types of information that can be collated and shared may include surveillance data, trial data, pathogen genomic and proteomic data, case study reports and summaries of observations from these data sources.

Users of such data may include data scientists, bioentrepreneurs, clinicians, public health workers, researchers, governments, NGOs, disaster management experts, regulatory bodies and so on. However, there are multiple barriers to rapid data sharing, including concerns over data protection and confidentiality, and differing data protection legislation across countries. Other major barriers may be poor curation tools and poor data quality. There is little doubt that breakthroughs in various facets of biotechnology will impact our societies and lives almost as profoundly as information technologies have done in the past. Data sharing is necessary to enable the global community to prepare for and respond to pandemics and similar global health crises, and to speed up the development of diagnostic and therapeutic regimens.

With major sequencing efforts across the world resulting in a massive accumulation of biological data, storing, managing, annotating and archiving it has become quite a scientific challenge. Using this growing body of information to dig out solutions to the challenges is the need of the hour. Although there are many biological data centres across the US and Europe, access to biological data resources remains restricted due to differing data protection policies. As a result, researchers from many developing economies cannot access these resources.

While the world awaits a biological discovery, our trust in science to handle the global crisis and to inform scientific, societal, political and economic decisions only grows with time. Biological resource centres can also take on a separate dimension: the storage, accession and archival of published information on infectious organisms to aid researchers.

(Bhaskar is Registrar and Suri is CEO, Office of Connectivity, Regional Center for Biotechnology)