Sridhar Ramaswamy didn’t leave Google to build another search engine. At least not at first. At the close of his 15-year tenure at Google, Ramaswamy was running the company’s entire advertising division, overseeing more than 10,000 people — he knew better than most exactly how much work it took to do search well.
You almost can’t overstate how dominant Google is in search. Most studies put Google at about 90 percent of the global search market, and that number has been steadily climbing for 20 years. Google is the default search engine in almost every browser, on almost every device. We don’t search the internet; we Google it. Bing and Yahoo are the second and third largest players, and when was the last time you Binged or Yahooed anything? Google has spent its enormous political, engineering, and financial capital to keep it that way.
But what Ramaswamy also knew better than most were all the things Google couldn’t or wouldn’t do to its search engine. With billions of users and hundreds of billions of dollars to protect, Google was unlikely to ever explore huge changes to its results page, new business models, or any kind of products that might make users search less. (Ramaswamy had actually tested a feature called Google Contributor that let people pay for an ad-free experience on some sites. It didn’t work.) There was an opportunity here to make something that Google simply couldn’t or wouldn’t build. So when he left the company in 2018, Ramaswamy and Vivek Raghunathan — a longtime Google and YouTube executive — co-founded a company called Neeva to build the search engine of the future.
This year, The Verge is exploring how Google Search has reshaped the web into a place for robots — and how the emergence of AI threatens Google itself.
The road was rocky, but the team at Neeva ended up building a search engine they were proud of, a search engine that came close to beating Google both by Neeva’s internal metrics and in user studies. People who tried it liked it, and Neeva had a long road map filled with ideas on how to make search even better. A little more time, and they might very well have built the future of search. But only four years in, Neeva shut down.
In a way, the brief flicker of Neeva’s existence tells you everything you need to know about the last 20 years of search-engine supremacy. Building a search engine is hard. Building one better than Google is even harder. But if you want to beat Google, a better search engine is only the very beginning. And it only gets harder from there.
A search engine is both an enormously complex thing and a fairly simple idea.
All a search engine is doing, really, is compiling a database of webpages — known as the “search index” — then looking through that database every time you issue a query and serving the best and most relevant set of those pages. That’s the whole job.
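To make the “simple idea” concrete, here is a toy sketch of that compile-then-look-up loop: an inverted index mapping each term to the pages that contain it, with a query answered by intersecting those sets. The page URLs and text are hypothetical, and a real index involves ranking, stemming, and far more than whitespace tokenization — this only illustrates the basic shape.

```python
from collections import defaultdict

# A toy corpus standing in for a crawled "search index".
# These pages and URLs are made up for illustration.
pages = {
    "example.com/pasta": "easy pasta recipe with garlic and olive oil",
    "example.com/pizza": "homemade pizza dough recipe",
    "example.com/news":  "breaking news about olive harvests",
}

def build_index(pages):
    """Map each term to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return pages containing every query term (a simple AND query)."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

index = build_index(pages)
results = search(index, "olive recipe")  # pages matching both terms
```

The whole job, in miniature: build the database once, then answer each query by looking things up in it rather than rereading the web.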
At every tiny step of that journey, though, there are huge complications that require critical and complex tradeoffs. Most of them boil down to two things: time and money.
Even if you could hypothetically build a constantly updating database of all of the untold billions of pages on the internet, the storage and bandwidth costs alone would bankrupt practically any company on the planet. And that’s not even counting the cost of searching that database millions or billions of times a day. Add in the fact that every millisecond matters — Google still advertises how long every query took at the top of your results — and you don’t have time to look over the whole database, anyway.
Building your own search engine thus starts with a surprisingly philosophical question: what makes a webpage good? You have to decide what counts as reasonable disagreement and what’s just misinformation. You have to figure out how many ads are too many ads. Sites clearly written by AI and rife with SEO garbage: bad. Recipe blogs written by a person and rife with SEO garbage: mostly fine. Porn? Sometimes okay, sometimes not.
Once you’ve had all these discussions and set your boundaries, you might identify, say, a few thousand domains that you definitely want in your search engine. You’ll include news sites from CNN to Breitbart, popular discussion boards like Reddit and Stack Overflow and Twitter, useful services like Wikipedia and Craigslist, sprawling platforms like YouTube and Amazon, and all the best recipe / sports / shopping / everything else sites on the web. Sometimes, you can partner with those sites to get that data in a structured way without having to look at each page individually; lots of big platforms make this easy and occasionally even free.
Then it’s time to turn the spiders loose. These are bots that grab the content on a given webpage, then find and follow every link on the page, index all those pages, find and follow every link, index, find, follow. (They’re called spiders because they crawl the web — get it?) Every time the spider lands on a page, it evaluates it against the criteria you set for a good page. Anything that passes gets downloaded onto servers somewhere, and your search index begins to grow.
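That grab-evaluate-follow loop is, at its core, a breadth-first traversal of the link graph. Below is a minimal sketch of it, using a made-up in-memory “web” in place of real HTTP fetches; the URLs, the `FAKE_WEB` structure, and the quality check are all hypothetical, and a real spider would also handle fetch failures, politeness delays, and robots.txt rules.

```python
from collections import deque

# A stand-in for the web: each URL maps to (page text, outbound links).
# Entirely fictional pages, for illustration only.
FAKE_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("about page", ["a.example"]),
    "c.example": ("blog", ["d.example"]),
    "d.example": ("post", []),
}

def crawl(seed, fetch, is_good_page=lambda text: True):
    """Breadth-first crawl: fetch a page, store it if it passes the
    quality criteria, then enqueue every link not yet seen."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        result = fetch(url)
        if result is None:          # dead link or blocked page
            continue
        text, links = result
        if is_good_page(text):
            index[url] = text       # "downloaded onto servers somewhere"
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("a.example", FAKE_WEB.get)
```

The `seen` set is what keeps the spider from chasing its own tail: the web is full of cycles, and without it the index-find-follow loop would never terminate.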
Spiders aren’t welcome everywhere, though. Every time a crawler opens a webpage, the provider incurs a bandwidth cost; now imagine a search engine that is trying to load and save every single page on your website, once a second, just to make sure they’re up to date. The bill adds up.
So most sites have a file called robots.txt that defines which bots can and cannot access their content and which URLs they’re allowed to crawl. Search engines don’t technically have to respect the wishes of robots.txt, but doing so is part of the fabric and culture of the web. Nearly all sites allow Google and Bing because discoverability outweighs the bandwidth costs. Many will block specific providers, such as shopping sites that do