Navigation Objects Extraction for Better Content Structure Understanding

by Kui Zhao, et al.

Existing work on extracting navigation objects from webpages focuses on navigation menus in order to reveal the information architecture of a site. However, Web 2.0 sites such as social networks and e-commerce portals make the content structure of a website increasingly difficult to understand. Dynamic and personalized elements in a webpage, such as top stories and recommended lists, are vital to understanding the dynamic nature of Web 2.0 sites. To better understand the content structure of Web 2.0 sites, we propose a new method for extracting navigation objects from a webpage. Our method extracts not only static navigation menus but also dynamic and personalized page-specific navigation lists. Since the navigation objects in a webpage naturally come in blocks, we first cluster hyperlinks into blocks by exploiting the spatial locations of hyperlinks, the hierarchical structure of the DOM tree, and the hyperlink density. We then identify navigation objects among those blocks using an SVM classifier with novel features such as anchor text length. Experiments on real-world data sets with webpages from various domains and styles verify the effectiveness of our method.
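To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. It uses only the Python standard library and simplifies heavily: hyperlinks are grouped by their nearest enclosing container element (a stand-in for the DOM-tree cue; spatial locations are ignored because they require a rendered page), and a hand-written threshold rule stands in for the SVM classifier. The names `CONTAINER_TAGS`, `LinkBlockParser`, `block_features`, and `looks_like_navigation`, as well as every threshold value, are assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean

# Assumption: these tags count as block containers; the paper does not list any.
CONTAINER_TAGS = {"body", "nav", "ul", "ol", "div", "table", "section",
                  "aside", "header", "footer"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkBlockParser(HTMLParser):
    """Group hyperlinks by their nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # (tag, element_id) per open element
        self.next_id = 0
        self.anchors = defaultdict(list)      # container id -> anchor texts
        self.text_chars = defaultdict(int)    # container id -> all text characters
        self.anchor_chars = defaultdict(int)  # container id -> characters inside <a>
        self.current_anchor = None            # text pieces of the <a> being read

    def _container(self):
        for tag, element_id in reversed(self.stack):
            if tag in CONTAINER_TAGS:
                return element_id
        return -1  # text or links outside any known container

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        self.stack.append((tag, self.next_id))
        self.next_id += 1
        if tag == "a" and dict(attrs).get("href"):
            self.current_anchor = []

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        else:
            return
        if tag == "a" and self.current_anchor is not None:
            text = " ".join("".join(self.current_anchor).split())
            self.anchors[self._container()].append(text)
            self.current_anchor = None

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        block = self._container()
        self.text_chars[block] += n
        if self.current_anchor is not None:
            self.anchor_chars[block] += n
            self.current_anchor.append(data)


def block_features(html):
    """Return one feature dict per block that contains at least one hyperlink."""
    parser = LinkBlockParser()
    parser.feed(html)
    parser.close()
    features = []
    for block, texts in parser.anchors.items():
        total = parser.text_chars[block]
        features.append({
            "links": texts,
            "num_links": len(texts),
            "mean_anchor_len": mean(len(t) for t in texts),
            "link_density": parser.anchor_chars[block] / total if total else 0.0,
        })
    return features


def looks_like_navigation(f, min_links=3, max_anchor_len=25, min_density=0.8):
    """Threshold rule standing in for the SVM; all thresholds are assumptions."""
    return (f["num_links"] >= min_links
            and f["mean_anchor_len"] <= max_anchor_len
            and f["link_density"] >= min_density)


SAMPLE = """
<body>
<ul id="menu">
  <li><a href="/">Home</a></li>
  <li><a href="/news">News</a></li>
  <li><a href="/about">About</a></li>
</ul>
<div id="article">
  <p>This paragraph is long running text that mentions
  <a href="/x">one link</a> in passing, so most of its characters
  sit outside any anchor.</p>
</div>
</body>
"""

if __name__ == "__main__":
    for f in block_features(SAMPLE):
        print(f["links"], looks_like_navigation(f))
```

On the sample page, the menu block has many short anchors and almost no surrounding text, while the article block has a single link buried in prose. In a fuller version, the feature dictionaries would be turned into vectors and passed to a trained classifier instead of the threshold rule.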




