Bootstrapping Domain-Specific Content Discovery on the Web

by Kien Pham, et al.

The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites. DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills them, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, as well as backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-of-the-art methods.
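The iterative discover, distill, and rank loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy in-memory "Web", the function names, the Jaccard-similarity score, and the threshold are all assumptions made for illustration, and the abstract does not specify how DISCO actually ranks candidates. The keyword operator here is a simple token-overlap stand-in for queries issued to a real search engine.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy "Web": site -> (page text, outlinks). Illustrative data only.
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed.org": ("wildlife trafficking ivory trade report", ["a.org", "news.com"]),
    "a.org": ("ivory trade wildlife enforcement", ["b.org"]),
    "b.org": ("wildlife trafficking seizure data", []),
    "news.com": ("sports weather celebrity gossip", ["shop.com"]),
    "shop.com": ("discount shoes sale", []),
    "c.org": ("ivory seizure wildlife report", ["seed.org"]),
}


def tokens(site: str) -> Set[str]:
    return set(TOY_WEB[site][0].split())


# Discovery operators (assumed stand-ins for the strategies named in the abstract).
def forward_crawl(site: str) -> List[str]:
    """Follow outlinks from a site."""
    return list(TOY_WEB[site][1])


def backward_crawl(site: str) -> List[str]:
    """Find sites that link to the given site."""
    return [s for s, (_, out) in TOY_WEB.items() if site in out]


def keyword_search(site: str) -> List[str]:
    """Toy 'search engine': sites sharing at least two tokens with the query site."""
    q = tokens(site)
    return [s for s in TOY_WEB if s != site and len(q & tokens(s)) >= 2]


def score(site: str, seeds: List[str]) -> float:
    """Mean Jaccard similarity to the seed sites (an assumed ranking function)."""
    t = tokens(site)
    sims = [len(t & tokens(s)) / len(t | tokens(s)) for s in seeds]
    return sum(sims) / len(sims)


def discover(
    seeds: List[str],
    operators: List[Callable[[str], List[str]]],
    iterations: int = 3,
    threshold: float = 0.2,
) -> List[str]:
    relevant = list(seeds)
    seen = set(seeds)
    for _ in range(iterations):
        # Discover: apply every operator to every site currently deemed relevant.
        candidates: Set[str] = set()
        for site in relevant:
            for op in operators:
                candidates.update(op(site))
        # Distill: drop anything already examined.
        candidates -= seen
        if not candidates:
            break
        seen |= candidates
        # Rank: order by score (ties broken by name) and keep those above threshold.
        ranked = sorted(candidates, key=lambda s: (-score(s, seeds), s))
        relevant.extend(s for s in ranked if score(s, seeds) >= threshold)
    return relevant


found = discover(["seed.org"], [forward_crawl, backward_crawl, keyword_search])
print(found)
```

In this toy setting, sites scoring below the threshold are never expanded, so pages reachable only through an off-topic site are not visited. Scoring against the original seeds rather than the growing relevant set is one simple way to limit topic drift; other choices are equally plausible.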


From 10 Blue Links Pages to Feature-Full Search Engine Results Pages – Analysis of the Temporal Evolution of SERP Features

Web Search Engine Results Pages (SERP) are one of the most well-known an...

Where the Earth is flat and 9/11 is an inside job: A comparative algorithm audit of conspiratorial information in web search results

Web search engines are important online information intermediaries that ...

Variational Quantum PageRank

The PageRank algorithm is used to rank web pages by their importance. Si...

Prediction of new outlinks for focused Web crawling

Discovering new hyperlinks enables Web crawlers to find new pages that h...

Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

Over the last few years, the complexity of web applications has increase...

The Dawn of Today's Popular Domains: A Study of the Archived German Web over 18 Years

The Web has been around and maturing for 25 years. The popular websites ...

Information Extraction in Illicit Domains

Extracting useful entities and attribute values from illicit domains suc...
