Researchers generate hypotheses in different ways. The dominant approaches in biology and medicine rely on first-hand observation and laboratory test results, mining of electronic medical records, and engagement with experimental and gene sequence data. Generating hypotheses from the literature, by contrast, is viewed as a serendipitous process with great uncertainty. With the digitization of the international medical corpus and the increased production of born-digital publications, the vast body of published knowledge contains a diversity of insight to which domain experts are rarely exposed and about which they cannot casually reason. This situation is exacerbated in transdisciplinary domains. The flipside of this challenge reveals the promise: generating hypotheses from the literature of different but related disciplines can reveal potential connections never before realized, because experts from distant domains have not mastered each other’s knowledge. Mining literature to generate hypotheses need not be confined to biology or medicine. It should be extended to all areas of science, scholarship, engineering, and the arts. Given the enormous resource constraints facing contemporary research, accelerating discovery is indispensable to scientific and societal advance.

New ways of conducting science are in high demand. The current volume of published knowledge is beyond what a single scientist can consume, and knowledge transfer is restrained by the limits of human cognition and communication. Donald Swanson’s classic work on undiscovered public knowledge had a wide impact on association discovery and demonstrated that new knowledge can be discovered by linking disjoint sets of scientific articles (Swanson, 1986). Swanson’s vision of the hidden value within the scientific literature in biomedical digital databases was innovative for information scientists, biologists, and physicians, and still points a pathway forward (Bekhuis, 2006). Literature-related discovery, which mines knowledge in two disparate literatures, has identified several non-drug approaches now used to halt or reverse the symptoms of multiple sclerosis, cataracts, and other chronic diseases (Kostoff, 2012). Moreover, by combining PubMed literature and public datasets, the IBM Watson supercomputer can process millions of articles, patents, Wikipedia pages, and datasets to facilitate research and diagnostic decision making in lung cancer treatment (Upbin, 2013). Even in the humanities, the distant reading approach of Franco Moretti and Matthew Jockers tackles literary problems by applying computational methods to aggregate and analyze massive data in order to generate hypotheses and possible interpretations. Moretti argues that distant reading is needed because no one can read the 60,000 novels published in 19th-century England to understand the structure of Victorian fiction (Schulz, 2011). Generating hypotheses by mining existing literature, open datasets, and informal text from the Web can provide a new way to advance science and generate outsized societal impact.

The KDD Conference gathers top-notch experts in data mining and data-driven analytics to advance science and technology. This topic is important to KDD because it addresses a fundamental question: how to conduct science under the growing challenge of information overload. Recent breakthroughs in artificial intelligence and machine learning have demonstrated the potential of machine intelligence, and especially of the combination of machine and human intelligence, which can lead to a paradigm shift in how humans and machines conduct science in the near future. This workshop aims to explore this timely topic with the KDD audience because the Web has become the essential infrastructure for acquiring, disseminating, and creating data, information, and knowledge. It has also become a unique source of many informal intuitions (e.g., patient experiences), which can productively constrain existing models. KDD has a broad audience from both the technical and the social side of science, and understanding this topic requires exploring a wide range of perspectives. KDD is therefore an ideal forum for this workshop.

Kell, D.B. (2006). Metabolomics, modeling and machine learning in systems biology: Towards an understanding of the languages of cells. FEBS Journal, 273(5), 873–894.
Kostoff, R.N. (2012). Literature-related discovery and innovation—update. Technological Forecasting & Social Change, 79 (4), 789–800.
Schulz, K. (2011). What is distant reading? New York Times, Jan 24, 2011.
Swanson, D.R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1), 7–18.
Upbin, B. (2013). IBM’s Watson gets its first piece of business in healthcare.