Scuttling Web Opportunities By Application Cramming

Vijaya Sree Dhulipalla, Hanumat Prasad Alahari

Abstract


The web holds a vast amount of data spread across innumerable websites, which are monitored by a tool or program known as a crawler. This paper focuses on web forum crawling techniques. It discusses the various techniques used by web forum crawlers and the challenges of crawling, and it gives an overview of web crawling and web forums.

The Internet is growing exponentially, and retrieving relevant information from it has become increasingly difficult. This rapid growth poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper, we present a novel Forum Crawler under Supervision (FoCUS) method, a supervised, internet-scale forum crawler. The intention of FoCUS is to crawl relevant forum information from the internet with minimal overhead: rather than collecting and indexing all accessible web documents in order to answer every possible ad hoc query, the crawler selectively seeks out pages that are pertinent to a predefined set of topics. FoCUS crawls the internet continuously, detecting pages that have been newly added and pages that have been removed. Because of the growing and vibrant activity of the internet, it has become more challenging to navigate and handle all the URLs found in web documents. The crawler takes one seed URL and a keyword as input, and it fetches the pages in which that keyword is found.
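The seed-URL-plus-keyword procedure described above can be illustrated with a minimal sketch. This is not the FoCUS implementation, which the abstract does not specify; it is a plain breadth-first crawl under stated assumptions. The names `keyword_crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical, and the keyword test is a simple case-insensitive substring match on the raw HTML rather than any learned page classification.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def keyword_crawl(seed_url, keyword, fetch, max_pages=50):
    """Breadth-first crawl from seed_url; return URLs of pages containing keyword.

    `fetch` is a hypothetical callable mapping a URL to its HTML text,
    or to None when the page cannot be retrieved. Passing it in keeps
    the sketch independent of any particular HTTP library.
    """
    keyword = keyword.lower()
    queue = deque([seed_url])
    seen = {seed_url}          # URLs already queued, to avoid revisits
    matches = []
    visited = 0

    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        if html is None:
            continue

        # Assumption: a raw substring match stands in for relevance.
        # It also matches text inside tags and attributes.
        if keyword in html.lower():
            matches.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplication.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return matches
```

In practice `fetch` would wrap an HTTP client and respect each site's crawling policy; here it is left abstract so the traversal logic can be exercised against an in-memory set of pages.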

Keywords


Web Crawling, Web Forums, FoCUS, EIT path, forum crawling, ITF regex, page classification, page type, URL pattern learning, URL type




Copyright © 2013, All rights reserved. | ijseat.com

Creative Commons License
International Journal of Science Engineering and Advance Technology is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at IJSEat. Permissions beyond the scope of this license may be available at http://creativecommons.org/licenses/by/3.0/deed.en_GB.
