This article was originally posted on December 7th, 2021 on Markus’ website.
TL;DR: I worked on biomedical literature search, discovery and recommender web applications for many months and concluded that extracting, structuring or synthesizing “insights” from academic publications (papers) or building knowledge bases from a domain corpus of literature has negligible value in industry.
Close to nothing of what makes science actually work is published as text on the web

Here’s the outline:
- The Business of Extracting Knowledge from Academic Publications
  - Psychoanalysis of a Troubled Industry
  - My Quixotic Escapades in Building Literature Search and Discovery Tools
- Fundamental Issues with Structuring Academic Literature as a Business
  - Just a Paper, an Idea, an Insight Does Not Get You Innovation
  - Contextual, Tacit Knowledge is not Digital, not Encoded or just not Machine-Interpretable yet
  - Experts have well defined, internalized maps of their field
  - Scientific Publishing comes with Signaling, Status Games, Fraud and often Very Little Information
  - Non-technical Life Science Labor Is Cheap
  - The Literature is Implicitly Reified in Public Structured Knowledge Bases
  - Advanced Interactives and Visualizations are Surprisingly Unsatisfying to Consume
  - Unlike other Business Software, Domain Knowledge Bases of Research Companies are Maximally Idiosyncratic
  - Divergent Tasks are Hard to Evaluate and Reason About
- Public Penance: My Mistakes, Biases and Self-Deceptions
- Onwards
Atop the published biomedical literature sits an evolved industry around extracting, semantically structuring and synthesizing research papers into search, discovery and knowledge graph software applications (table of example companies). The usual sales pitch goes something like this:
Generate and Validate Drug Discovery Hypotheses Faster Using our Knowledge Graph. Keep up with the scientific literature, search for concepts and not papers, and make informed discovery decisions [tellic]
Find the relevant knowledge for your next breakthrough in the X million documents and Y million extracted causal interactions between genes, chemicals, drugs, cells… [biorelate – Galaxy]
Our insight prediction pipeline and dynamic knowledge map is like a digital scientist [OccamzRazor]
Try our sentence-level, context-aware, and linguistically informed extractive search system [AllenAI’s Spike]
Or from a grant application of yours truly:
a high-level semantic search engine and user interface for answering complex, quantitative research questions in biomedicine
You get the idea. All the projects above spring from similar premises:
1. The right piece of information is “out there”
2. If only research outputs were more machine-interpretable, searchable and discoverable, then “research” could be incrementally automated and progress would go through the roof
3. No mainstream public platform (e.g. arXiv, PubMed, Google Scholar) provides features that leverage the last decade’s advances in natural language processing (NLP), nor does any of them improve on the tasks mentioned in 2.

Which leads them to conclude: let’s semantically annotate every piece of literature and incorporate it into a structured knowledge base that enables complex research queries, revealing undiscovered public knowledge by connecting causal chains (Swanson linking) or generating new hypotheses altogether.
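The Swanson linking step is worth unpacking with a toy example. The sketch below is my illustration, not any company’s actual pipeline; the relation tuples are loosely modeled on Swanson’s classic fish oil and Raynaud’s syndrome chain:

```python
# Toy illustration of Swanson linking (undiscovered public knowledge):
# chain relations extracted from *different* papers to propose links
# that no single paper states. Tuples are simplified, made-up examples.
from collections import defaultdict

# (subject, relation, object, source) as an extraction pipeline might emit
extractions = [
    ("fish_oil",        "reduces",   "blood_viscosity", "paper_A"),
    ("blood_viscosity", "modulates", "raynauds",        "paper_B"),
    ("magnesium",       "inhibits",  "vasospasm",       "paper_C"),
]

graph = defaultdict(list)
for subj, rel, obj, src in extractions:
    graph[subj].append((rel, obj, src))

def swanson_candidates(start):
    """Yield two-hop chains start -> B -> C where the two edges come
    from different papers (the classic A-B-C model)."""
    for rel1, b, src1 in graph[start]:
        for rel2, c, src2 in graph[b]:
            if src1 != src2:
                yield (start, rel1, b, rel2, c, (src1, src2))

for chain in swanson_candidates("fish_oil"):
    print(chain)
# -> ('fish_oil', 'reduces', 'blood_viscosity', 'modulates', 'raynauds',
#     ('paper_A', 'paper_B'))
```

Scaled to millions of extracted relations, that join is essentially what the knowledge graph pitches above promise.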
It really does sound exciting, and VCs and governments like to fund the promise of it. Companies get started, often land a few pilot customers, end up as software consultancies, or silently fail. In rare cases they get acquired for cheap or survive as a modest “request a demo” B2B SaaS with 70% of the company in sales, and with unit economics that lock them into selling only to enterprises, never to individuals.
Meanwhile, big players with all the means to provide a category-leading platform have either stopped developing such an effort altogether (Google Scholar), shut it down with barely anyone noticing (i.e. Microsoft Academic Search), or are unable to commercialize any of it (AllenAI with Semantic Scholar, Spike etc.).
Let’s take AllenAI as an example. AllenAI is a world-renowned AI research organization with a focus on NLP and automated reasoning. They sponsor a startup incubator to spin off and commercialize their technologies. The AllenNLP platform and Semantic Scholar are used by millions of engineers and researchers. And unlike most big AI research groups, they also build domain-focused apps for biomedicine (like supp.ai and Spike).
And yet, despite all of that momentum, they can’t commercialize any of it:
We have been looking at opportunities to commercialize the technologies we developed in Semantic Scholar, Spike, bioNLP, etc. The challenge for us so far is in finding significant market demand. I’m curious if you have found an angle on this ? […]
This was an email reply I received from a technical director at the AllenAI institute. At that point, I had already spent months building and trying to sell biomedical NLP (bioNLP) applications to biotechs, which had made me cynical about the entire space. Sure, I might just be bad at sales, but if AllenAI, with 10/10 engineering, distribution and industry credibility, can’t sell its offerings, then literature review, search and knowledge discovery are just not that useful…
When I then looked at the SUM of all combined funding, valuations and exits of _NLP_ companies in the semantic search / biomedical literature search / biomedical relation extraction space that I could find, it was less than the valuation of any single bigger _bioinformatic_ drug discovery platform (e.g. Atomwise, Insitro, nFerence). Assuming the market is efficient and not lagging, that shows how small a problem biomedical literature search and discovery actually is.
Just to clarify: this post is about the issues with semantic intelligence platforms that predominantly leverage the published academic literature. Bioinformatic knowledge apps that integrate biological omics data or clinical datasets are actually very valuable (e.g. Data4cure), as is information extraction from Electronic Health Records (EHRs) (like nFerence, scienceIO) and from patent databases.
Back in March 2020 when the first covid lockdowns started I evacuated San Francisco and moved back to Austria. I decided to use the time to get a deeper understanding of molecular biology and began reading textbooks.
Before biology, I had worked on text mining, natural language processing and knowledge graphs, and it got me thinking: could you build something that can reason at the level of a mediocre biology graduate?
After a few weeks of reading papers in ontology learning, information extraction and reasoning systems and experimenting with toy programs, I had some idea of what’s technologically possible. I figured that a domain-focused search engine would be much simpler to build than a reasoner/chatbot and that the basic building blocks are similar in any case.
I wrote up what I had in mind, applied to Emergent Ventures and was given a grant to continue working on the idea.
At that time I also moved to London as an Entrepreneur First (EF) fellow. I made good friends in the program and teamed up with Rico. We worked together for the entire three-month program. During that time we prototyped:
- a search engine with entity, relation and quantitative options: a user could pose detailed, expressive queries like
  < studies that used gene engineering [excluding selective evolution] in yeast [species A and B only] and have achieved at least [250%] in yield increase >
  < studies with women over age 40, with comorbidity_A that had an upward change in biomarker_B >
  They’d tell us the question in English and we’d translate it into our query syntax (a sketch of what such a structured query could look like follows right after this list). This collapses a lot of searching and filtering into one query in a way that isn’t possible with the known public search engines, but we found out that it’s not often in a biotech’s lifecycle that questions like this need researching. To get a notion of the ratios: one search like this could return enough ideas for weeks of lab work
- a query builder (by demonstration): a user would upvote study abstracts or sentences that fit their requirements and it would iteratively build a query configuration. In a sense it was a no-code way for biologists to write interpretable labeling functions while also training a classifier, a constrained case of program synthesis (second sketch below). We built this because almost no biologist could encode or tell us exactly which research they wanted to see, but they “know it when they see it”
- a paper recommender system (and later a claim/sentence/statement-level one) that, in addition to the academic literature, included tweets, company websites and other non-scholarly sources (lots of scientific discourse is happening on Twitter these days)
- a claim explorer that takes a sentence or paragraph and lets you browse through similar claims
- a claim verifier that showed sentences that confirm or contradict an entered claim sentence. It worked ok-ish, but the tech is not there to make fact-checking work reliably, even in constrained domains (third sketch below)
- and a few more Wizard of Oz experiments to test different variants and combinations of the features above
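To make the first prototype concrete, here is a minimal sketch of what translating an English question into a structured query could look like. Everything in it is hypothetical for illustration: the field names, the toy records and the naive matcher are mine, not our actual schema.

```python
# Hypothetical sketch of a structured biomedical query plus a naive matcher.
# Field names, values and the toy records are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class StudyQuery:
    intervention: str                                  # e.g. "gene engineering"
    exclude_methods: list[str] = field(default_factory=list)
    species: list[str] = field(default_factory=list)
    outcome: str = ""                                  # e.g. "yield increase"
    min_effect: float = 0.0                            # e.g. 2.5 == +250%

# "studies that used gene engineering [excluding selective evolution]
#  in yeast [species A and B only] with at least 250% yield increase"
query = StudyQuery(
    intervention="gene engineering",
    exclude_methods=["selective evolution"],
    species=["yeast_species_A", "yeast_species_B"],
    outcome="yield increase",
    min_effect=2.5,
)

# Records as an extraction pipeline might emit them from abstracts
studies = [
    {"id": "pmid:1", "intervention": "gene engineering", "methods": ["CRISPR"],
     "species": "yeast_species_A", "outcome": "yield increase", "effect": 3.1},
    {"id": "pmid:2", "intervention": "gene engineering",
     "methods": ["selective evolution"], "species": "yeast_species_B",
     "outcome": "yield increase", "effect": 4.0},
]

def matches(study: dict, q: StudyQuery) -> bool:
    return (study["intervention"] == q.intervention
            and not set(study["methods"]) & set(q.exclude_methods)
            and study["species"] in q.species
            and study["outcome"] == q.outcome
            and study["effect"] >= q.min_effect)

print([s["id"] for s in studies if matches(s, query)])  # -> ['pmid:1']
```

The hard part was never evaluating such a query; it was filling the fields reliably from free text and extracted entities.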
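For the query builder, a hedged sketch of the labeling-function idea, roughly in the spirit of Snorkel-style weak supervision. The functions, keywords and the majority vote below are invented examples of what upvotes could compile into:

```python
# Sketch of "query building by demonstration": upvoted/downvoted example
# sentences get compiled into weak labeling functions. All rules invented.
RELEVANT, IRRELEVANT, ABSTAIN = 1, 0, -1

def lf_mentions_yeast(sentence: str) -> int:
    """Derived from upvoted examples that all mentioned yeast strains."""
    return RELEVANT if "yeast" in sentence.lower() else ABSTAIN

def lf_is_review(sentence: str) -> int:
    """Downvoted examples were mostly from review articles."""
    return IRRELEVANT if "we review" in sentence.lower() else ABSTAIN

labeling_functions = [lf_mentions_yeast, lf_is_review]

def weak_label(sentence: str) -> int:
    """Majority vote over the non-abstaining labeling functions."""
    votes = [v for lf in labeling_functions
             if (v := lf(sentence)) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("Engineered yeast strains doubled ethanol yield."))    # 1
print(weak_label("Here we review advances in metabolic engineering."))  # 0
```

The interpretability was the point: a biologist could read the compiled rules and veto the wrong ones, unlike an opaque classifier.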
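And for the claim verifier, one plausible minimal implementation is natural language inference (NLI) over (sentence, claim) pairs. This assumes the sentence-transformers package and a public NLI cross-encoder; the model choice, claim and candidate sentences are stand-ins, not our exact stack:

```python
# Sketch of claim verification via NLI: score candidate sentences as
# entailing or contradicting a claim. Claim and candidates are invented.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
labels = ["contradiction", "entailment", "neutral"]  # this model's label order

claim = "Metformin reduces fasting blood glucose in type 2 diabetics."
candidates = [
    "Fasting glucose dropped significantly in the metformin arm.",
    "Metformin showed no effect on fasting blood glucose levels.",
    "The study enrolled 120 participants across three sites.",
]

# Each (premise, hypothesis) pair yields logits over the three NLI labels
scores = model.predict([(sentence, claim) for sentence in candidates])
for sentence, logits in zip(candidates, scores):
    print(labels[logits.argmax()], "|", sentence)
# Expected, roughly: entailment | contradiction | neutral
```

This works on clean example pairs; it degrades fast on hedged, heavily qualified scientific prose, which is what “the tech is not there” meant in practice.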
At that point, lots of postdocs had told us that some of the apps would have saved them months during their PhDs, but actually viable customers were only moderately excited. It became clear that this would have to turn into another specialized B2B SaaS tool with a lot of software consulting and ongoing effort from a sales team…
We definitely did not want to go down that route. We wanted to make a consumer product for startups, academics or independent researchers, and we had tested good-enough proxies for most ideas we thought of as useful w.r.t. the biomedical literature. We also knew that pivoting to Electronic Health Records (EHRs) or patents instead of research text was potentially a great business, but neither of us was excited to spend years working on that, even if it succeeded.
So we were stuck. Rico, who didn’t have exposure to biotech before we teamed up, understandably wanted to branch out and so we decided to build a discovery tool that we could use and evaluate the merits of ourselves.
And so we loosened up and spent three weeks playing around with knowledge search and discovery outside of biomed. We built:
- a browser extension that finds paragraphs (in articles and essays from your browser history) similar to the text you’re currently highlighting when reading on the web. About a third of the time the suggestions were insightful, but the download-retrain-upload loop for the embedding vectors every few days was tedious and we didn’t like the tool enough to spend the time automating it
- similar: an app that pops up serendipitous connections between a corpus (previous writings, saved articles, bookmarks…) and the active writing session or paragraph. The corpus, preferably your own, could come from folders, text files, a blog archive, a Roam Research graph or a Notion/Evernote database. This had a surprisingly high signal-to-noise ratio from the get-go and I still use it sometimes (a sketch of the shared embedding loop follows after this list)
- a semantic podcast search (one version, for the Artificial Intelligence Podcast, is still live)
- an extension that after a few days of development was turning into Ampie and was shelved
- an extension that after a few days of development was turning into Twemex and was shelved
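Both the history extension and the serendipity app reduce to the same loop: embed a personal corpus once, then rank it against whatever text is currently active. A minimal sketch with off-the-shelf sentence embeddings; the model and corpus below are illustrative assumptions, not our exact setup:

```python
# Minimal sketch of "find paragraphs similar to what I'm highlighting":
# embed a corpus once, then rank it against the active text.
# Assumes the sentence-transformers package; model and corpus are examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for paragraphs harvested from browser history or personal notes
corpus = [
    "Knowledge graphs link biomedical entities extracted from papers.",
    "Sourdough needs a mature starter and a long, cold fermentation.",
    "Tacit knowledge in labs rarely makes it into the methods section.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

highlighted = "Most of what makes experiments work is never written down."
query_embedding = model.encode(highlighted, convert_to_tensor=True)

# Cosine similarity against every stored paragraph, best hits first
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.2f}", corpus[hit["corpus_id"]])
```

The tedious part we never automated was refreshing `corpus_embeddings` as the corpus grew, the download-retrain-upload loop mentioned above.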
Some of the above had promise for journalists, VCs, essayists or people who read, cite and tweet all day, but that group is too heterogeneous and fragmented, and we couldn’t trust our intuitions while building for them.
To us, these experiments felt mostly like gimmicks that, with more love and industry focus, could become useful but probably not essential.
Now it was almost Christmas and the EF program was over. Rico and I decided to split up as a team because we didn’t have any clear next step in mind. I flew to the south of Portugal for a few weeks to escape London’s food, bad weather and the upcoming covid surge.
There, I returned to biomed and tinkered with interactive, extractive search interfaces and no-code data