Web Crawler: Design And Implementation For Extracting Article-Like Contents

by Ngo Le Huy Hien, et al.

The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the web in response to users' requests, search engines are built to access web pages. Because search engines play a significant role in cybernetics, telecommunication, and physics, much effort has gone into enhancing their capacity. However, most of the data on the web is unmanaged, so current search engine mechanisms cannot access the entire network at once. The web crawler is therefore a critical part of a search engine: it navigates the web and downloads the full text of web pages. Web crawlers may also be applied to detect missing links and to perform community detection in complex networks and cybernetic systems. Template-based crawling techniques, however, cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed, implemented, and used to attempt the extraction of article-like contents from 495 websites. It uses a machine learning approach with visual cues, trivial HTML features, and text-based features to filter out clutter. The outcomes are promising for extracting article-like contents from websites, contributing to the development of search engine systems and to future research geared toward higher-performance systems.
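To make the idea of text-based and trivial HTML features more concrete, the following is a minimal sketch and not the authors' implementation. It assumes two illustrative features per block, character count and link density, and replaces the paper's machine learning classifier with a hand-written threshold rule. The names `BlockFeatureParser`, `extract_features`, `looks_like_article`, the `BLOCK_TAGS` set, and the threshold values are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section"}


def _normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


class BlockFeatureParser(HTMLParser):
    """Collects per-block text length and link density from raw HTML."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one feature dict per non-empty block
        self._text = []        # text pieces of the block being read
        self._link_chars = 0   # characters that sit inside <a> tags
        self._in_link = 0      # nesting depth of open <a> tags

    def flush(self):
        """Close the current block and record its features."""
        text = _normalize("".join(self._text))
        if text:
            self.blocks.append({
                "text": text,
                "n_chars": len(text),
                "link_density": min(1.0, self._link_chars / len(text)),
            })
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(_normalize(data))


def extract_features(html):
    """Return a list of feature dicts, one per text block in the page."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # record any trailing text outside a block tag
    return parser.blocks


def looks_like_article(block, min_chars=80, max_link_density=0.3):
    """Threshold stand-in for a trained classifier (hypothetical values)."""
    return (block["n_chars"] >= min_chars
            and block["link_density"] <= max_link_density)
```

Under these assumptions, a navigation bar made mostly of links yields a high link density and is rejected, while a long paragraph with few links is kept. A trained model would consume the same feature dictionaries in place of the threshold rule, alongside visual cues that this sketch does not attempt to compute.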




Intelligent Search Optimization using Artificial Fuzzy Logics

Information on the web is prodigious; searching relevant information is ...

Dark Web Activity Classification Using Deep Learning

In contemporary times, people rely heavily on the internet and search en...

Unveiling the I2P web structure: a connectivity analysis

Web is a primary and essential service to share information among users ...

Decentralized Search on Decentralized Web

Decentralized Web, or DWeb, is envisioned as a promising future of the W...

A systematic framework to discover pattern for web spam classification

Web spam is a big problem for search engine users in World Wide Web. The...

Information Retrieval in Intelligent Systems: Current Scenario & Issues

Web space is the huge repository of data. Everyday lots of new informati...

Variational Quantum PageRank

The PageRank algorithm is used to rank web pages by their importance. Si...
