The traditional mode of conducting science involves reading related articles, selectively evaluating datasets and “thinking through” testable hypotheses. Data was the bottleneck in this traditional scientific method, reserved only for final testing. With the rise of high-throughput experiments and sensors, the associated production of enormous volumes of data, and the publication of more papers on many topics than any individual or team can peruse, the same artisanal approach to hypothesis generation both slows and narrows the scope of science relative to its potential. For a single common disease, diabetes, more than 500,000 articles have been published to date. A scientist reading 20 papers per day would need roughly 68 years to get through them (500,000 / 20 = 25,000 days), by which time millions more would have been published. We need computational approaches to read, reason, and design hypotheses that transcend the capacity of individual teams. We need to deploy scientific creativity not only to craft individual questions, but also to build the models and algorithms that will generate the most promising collections of questions. In short, we need computation to generate Big Questions equal to Big Data.
Spangler et al. (2014) recently proposed an approach to accelerate scientific progress by combining the mining, visualization, and analysis of related publications on a subject to propose hypotheses that are new, testable, and likely to ring true for domain experts. They found that even relatively simple approaches can help domain experts generate useful hypotheses that lead to significant discoveries in a complex domain. Other promising approaches have leveraged network theory (Liu et al. 2009), genetic algorithms (Schmidt and Lipson 2009), and alternative modeling approaches that link literature, medical records (Blair et al. 2013) and high-throughput data.
Here we call for contributions from the best global teams working on computational hypothesis generation to share their insights and move this field forward to generate scientific, technological and societal impact. Ultimately, we hope that this event will visibly bring data and computation up the value chain of science from answers and certainty to questions and creativity.
Blair, D. R., Lyttle, C. S., Mortensen, J. M., Bearden, C. F., Jensen, A. B., Khiabanian, H., ... & Jensen, L. J. (2013). A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk. Cell, 155(1), 70-80.
Liu, J., Ghanim, M., Xue, L., Brown, C. D., Iossifov, I., Angeletti, C., ... & Al-Ahmadie, H. A. (2009). Analysis of Drosophila segmentation network identifies a JNK pathway factor overexpressed in kidney cancer. Science, 323(5918), 1218-1222.
Schmidt, M., & Lipson, H. (2009). Distilling free-form natural laws from experimental data. Science, 324(5923), 81-85.
Spangler, S., Wilkins, A. D., Bachman, B. J., Nagarajan, M., Dayaram, T., Haas, P., ... & Stanoi, I. (2014, August). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1877-1886). ACM.
Indiana University Bloomington 2017