To: Google Scholar’s Dad — Data-driven science hypotheses

A week ago I sent an email to Anurag Acharya, the man behind “Google Scholar”. Scholar is a search engine for scientific papers, a specialized version of Google. You can use it for free (although accessing the papers themselves is often not free).


Scholar is an extraordinary tool. It does something that nobody else can right now, which is why I think its team can solve a weird issue of science: the way researchers come up with hypotheses is anything but scientific. It relies on the same methods that your grandma’s grandma used to cure a cold: tradition and gut feeling.

I believe that the generation of scientific hypotheses must be data-driven, just as science itself is. Here is what I wrote in my proposal (original PDF here). There was no answer, unsurprisingly: I can’t imagine what the schedule of someone like Anurag Acharya looks like. But I am putting it here in the hope that someone finds it worth debating.



By definition, science follows the scientific process. Hypotheses are adopted or discarded based on objective analysis of data. But surprisingly, the process of generating hypotheses itself is hardly scientific: it relies on hunches and intuition.

It often goes like this: a researcher gets an idea from reading a colleague’s paper or listening to a talk. Literature from the field is reviewed, which allows for refinement of the original idea. Then it is time for designing experiments, analysing data and writing a paper. If the researcher is actually a student, things can be more complicated. But in all cases, the original hypothesis relies tremendously on the researcher’s own subjective collection and assessment of information, which must be selected from the gigantic number of existing scientific papers.

Clearly, the fact that we now have access to all this scientific information is a giant leap from the situation of a few decades ago; and it has been made possible single-handedly by Google Scholar. But it is also a fact that researchers everywhere have more and more data to look at, and that “to look at” too often becomes “to subjectively pick from”.

Hypothesis generation is the basis of science, arguably the most crucial and exciting part of actually doing science. Yet it is not based on anything scientific. This document summarises 3 proposals to make hypothesis generation a data-driven process. I believe this would not restrict the creativity of scientists, but enhance it; that it can make science more efficient and limit the waste of time and resources caused by irrelevant, biased, or outdated hypotheses, especially for graduate students. Not only does this respect the philosophy of Google and more specifically Google Scholar, but Google Scholar is currently the only organisation that has the resources to make it happen. Here are my 3 proposals, from the easiest to implement to the most hypothetical.

1. Paper Networks

Going through several dozen references at the end of a paper is far from optimal: the reason why a paper is cited and the paper itself are not physically close; the authors tend to unconsciously cite papers that support their view; the place of the papers in the field and their relationship to each other are virtually inaccessible.

Numerous services suggest papers that are supposedly close to the one you have just read, but this is not enough. We need to know, at a glance, which papers support each other’s views and which support conflicting opinions, and we need to know how many there are. A visual map, a graph of networks of papers or of clusters of papers, could be the ideal tool to reach this goal. The benefits would go beyond simple graphical structuring of the information:

• Reducing confirmation bias. When we look for papers simply by typing keywords into Google Scholar, the keyword choice itself tends to be biased. A Paper Network would make supporting and opposing papers equally accessible.

• Promoting interdisciplinarity. It’s easy to say that interdisciplinary approaches are good. It’s better to actually have the tools to make it happen. A Paper Network would make it clear which approaches are related in different fields.

• Sparking inspiration. Standard search methods tell us what is there. But science is about bringing forth what is not yet here. A Paper Network would show existing papers in different fields, helping us to avoid re-doing what has already been done. More importantly, it would make it visually clear where the gaps are, where some zones are still blank, and what may be needed to fill them.
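
To make the idea a little more concrete, here is a minimal Python sketch of the data structure behind such a network. It assumes something that does not exist today: that every citation comes with a label saying whether the citing paper supports or disputes the cited one. The paper names, the `CITATIONS` list and the two helper functions are all made up for illustration; extracting the labels from real papers would be the hard part.

```python
from collections import defaultdict

# Hypothetical labelled citations: (citing paper, cited paper, stance).
# Assumption: stance labels are available; nothing here extracts them.
CITATIONS = [
    ("B", "A", "supports"),
    ("C", "A", "disputes"),
    ("D", "A", "supports"),
    ("D", "C", "disputes"),
    ("E", "C", "supports"),
]

def build_network(citations):
    """Map each cited paper to the sets of papers supporting and disputing it."""
    network = defaultdict(lambda: {"supports": set(), "disputes": set()})
    for citing, cited, stance in citations:
        network[cited][stance].add(citing)
    return network

def stance_summary(network, paper):
    """Return (number of supporting papers, number of disputing papers)."""
    entry = network.get(paper, {"supports": set(), "disputes": set()})
    return len(entry["supports"]), len(entry["disputes"])

network = build_network(CITATIONS)
print(stance_summary(network, "A"))  # (2, 1)
```

With this toy data, paper A has two supporters and one opponent. A real tool would draw the graph rather than print counts, but the "how many support, how many oppose" question reduces to this kind of lookup.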

2. Burst Detection

Artificial Intelligence, my field, has known several “winters” and “summers”: periods when it seemed like everything had already been done and the field fell into hibernation, and periods when suddenly everyone seemed to do AI (now is such a period). I suspect that other fields know these abrupt oscillations as well: several teams announcing the same big discovery in parallel, or a rapid succession of findings that leads to a revival of the field, or even spawns new specialised fields.

These bursts are most likely not completely random. If we could predict, even very roughly, when a given field will boom, we could prepare for it, invest in it and maybe even make it happen faster. What are the factors influencing winters and summers? How many steps in advance can we predict? How many more Moore’s Laws are waiting to be discovered? Being able to predict winters would also be an asset, because we could look for the deep causes that force science to slow down and try to prevent them. Is it the lack of funds? Relying too much on major paradigms? Only analysing data from the past can transform hunches into successful policies for the advance of science.
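
As a rough illustration of what "analysing data from the past" could look like at its simplest, the Python sketch below flags years in which the number of papers in a field jumps well above its recent average. This is a plain trailing-window z-score, not a prediction method and not any established burst-detection algorithm; the function name, the `window` and `threshold` parameters and the yearly counts are all assumptions made up for the example.

```python
from statistics import mean, stdev

def detect_bursts(counts, window=5, threshold=2.0):
    """Return indices whose count exceeds the trailing-window mean
    by more than `threshold` sample standard deviations."""
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat histories to avoid dividing by zero.
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts

# Made-up yearly paper counts for one field (not real data).
yearly_papers = [100, 104, 98, 102, 101, 99, 180, 260, 270, 265]
print(detect_bursts(yearly_papers))  # [6, 7]
```

The two flagged years are the start of the jump; once the high counts enter the trailing window, they become the new normal and stop being flagged. Detecting a burst after the fact is the easy half; predicting one in advance would need the influencing factors asked about above.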

3. Half Life of Facts

The destiny of scientific facts is to be overturned: it is the proof that science works. Better tools, better theories: these are obvious first-level parameters influencing the shelf life of scientific papers. But we need to go deeper and look for meta-parameters: properties that allow us to predict this shelf life, and identify which papers and which parts of a theory are statistically more likely to be overturned.

As anyone who has attended a heated scientific debate can testify, right now the leading reason for accepting a non-trivial theory or choosing to challenge it is the researcher’s own “common sense”; yet science is all about rejecting common sense as an explanation for anything and looking for facts in hard data. In these conditions, how can we continue to rely on gut feeling to justify our opinions? We need sounder foundations for our beliefs, even if in the absence of experimental verification they are just that: beliefs.

If a specific part of a theory looks perfectly sound but is statistically close to death, we must start looking at its opponents, or even better, think about what a good opponent theory would look like and choose research topics accordingly.
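
The phrase "half life" suggests a simple baseline model, sketched below in Python under a strong assumption: that the claims in a field are overturned at a constant rate, so the fraction still accepted after t years is 0.5 ** (t / half_life). Whether real facts decay this way is exactly what the data would have to show; the function names and the numbers are hypothetical.

```python
import math

def estimate_half_life(years_elapsed, fraction_still_accepted):
    """Half-life implied by the assumed model f = 0.5 ** (t / half_life)."""
    if not 0 < fraction_still_accepted < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return years_elapsed * math.log(2) / -math.log(fraction_still_accepted)

def surviving_fraction(years_elapsed, half_life):
    """Fraction of claims expected to still stand after `years_elapsed`."""
    return 0.5 ** (years_elapsed / half_life)

# Made-up observation: 25% of a cohort of claims still stand after 20 years.
half_life = estimate_half_life(20, 0.25)
print(round(half_life, 2))  # 10.0
```

The meta-parameters mentioned above would enter as a refinement: instead of one half-life per field, one could estimate it separately for groups of papers sharing some property and see which properties shorten it.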

4. Conclusion

These proposals could change the way we, researchers, do science. They also come with a flurry of ethical issues: new tools would change the way resources (financial and human) are distributed, with desirable and undesirable outcomes. Just as prenatal genetic screening leads to difficult ethical questions, building tools that allow research projects to be ranked should be a very careful enterprise.

But here is the catch: unlike genetic screening, new research tools have an objective component to them. These 3 proposals are about bringing more science to science: allowing the generation of science seeds to be data-driven. Science changes the world, every day. Any tiny improvement to the scientific process is worth striving for – and these 3 changes would, I believe, bring major improvements.

