Semi-Supervised Method using Gaussian Random Fields for Boilerplate Removal in Web Browsers

by Joy Bose, et al.

Boilerplate removal is the problem of removing noisy content, such as ads, from a webpage and extracting the relevant content for use by other services. It supports several web browser features, including ad blocking, accessibility tools such as read-aloud, translation, and summarization. Training a model for boilerplate detection and removal requires a labeled dataset, but labeling webpage data manually is tedious and time-consuming. A semi-supervised model, in which some webpage elements are labeled manually and the labels of the remaining elements are inferred, is therefore useful. In this paper we present a solution for extracting relevant content from a webpage that relies on semi-supervised learning using Gaussian Random Fields. We first represent the webpage as a graph, with text elements as nodes and edge weights representing the similarity between nodes. We then label a few nodes in the graph using heuristics and label the remaining nodes by a weighted measure of their similarity to the already labeled nodes. We describe the system architecture and a few preliminary results on a dataset of webpages.
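The graph-based labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the node features nor the similarity function, so the two features used here (normalized text length and link density), the Gaussian kernel, the `sigma` value, and all function names are assumptions made for illustration. The sketch uses the standard harmonic-function form of Gaussian Random Fields, in which each unlabeled node repeatedly takes the similarity-weighted average of its neighbours' scores while the seed labels stay fixed.

```python
import math


def gaussian_weight(x, y, sigma):
    """Similarity of two feature vectors under a Gaussian (RBF) kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def build_weights(features, sigma=0.3):
    """Dense edge-weight matrix for the page graph; no self-loops."""
    n = len(features)
    return [
        [0.0 if i == j else gaussian_weight(features[i], features[j], sigma)
         for j in range(n)]
        for i in range(n)
    ]


def propagate_labels(weights, seed_labels, max_iterations=500, tolerance=1e-9):
    """Infer scores in [0, 1] for unlabeled nodes from a few seed labels.

    seed_labels maps node index -> 1.0 (content) or 0.0 (boilerplate).
    Seed scores are clamped; every other node is updated to the weighted
    average of its neighbours until the scores stop changing.
    """
    n = len(weights)
    scores = [seed_labels.get(i, 0.5) for i in range(n)]
    unlabeled = [i for i in range(n) if i not in seed_labels]
    for _ in range(max_iterations):
        largest_change = 0.0
        for i in unlabeled:
            total = sum(weights[i])
            if total == 0.0:
                continue  # isolated node: keep its neutral score
            updated = sum(w * s for w, s in zip(weights[i], scores)) / total
            largest_change = max(largest_change, abs(updated - scores[i]))
            scores[i] = updated
        if largest_change < tolerance:
            break
    return scores


# Hypothetical text nodes: (normalized text length, link density).
features = [
    (0.90, 0.05),  # node 0: long paragraph, few links
    (0.10, 0.90),  # node 1: short, link-heavy navigation block
    (0.80, 0.10),  # node 2: unlabeled, resembles node 0
    (0.15, 0.80),  # node 3: unlabeled, resembles node 1
]
# Seed labels that a heuristic might assign (assumed, not from the paper).
seed_labels = {0: 1.0, 1: 0.0}

weights = build_weights(features)
scores = propagate_labels(weights, seed_labels)
is_content = [score >= 0.5 for score in scores]
```

In this toy graph, node 2 ends up with a score close to that of the content seed and node 3 close to that of the boilerplate seed, because the Gaussian weights make each one's nearest labeled neighbour dominate the average. Thresholding the scores at 0.5 is one simple way to turn them into content or boilerplate decisions.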




