Scuttling Web Opportunities By Application Cramming

Vijaya Sree Dhulipalla; Hanumat Prasad Alahari

Abstract

The web holds a vast amount of data across innumerable websites, which are traversed by a tool or program known as a crawler. This paper focuses on web forum crawling techniques: it discusses the various techniques used by web forum crawlers and the challenges of crawling, and gives an overview of web crawling and web forums.

The internet is growing exponentially, and retrieving relevant information from it has become increasingly difficult. This rapid growth poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper, we present a novel method, Forum Crawler under Supervision (FoCUS), which is a supervised internet-scale forum crawler. The intention of FoCUS is to crawl relevant forum information from the internet with minimal overhead. The crawler selectively seeks out pages that are pertinent to a predefined set of topics, rather than collecting and indexing all accessible web documents in order to answer all possible ad hoc queries. FoCUS crawls the internet continuously and detects pages that have been newly added as well as pages that have been removed. Because of the growing and vibrant activity of the internet, navigating all the URLs in web documents and handling them has become more challenging. The crawler takes one seed URL and a keyword as input, and it fetches the pages in which that keyword is found.
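The seed-URL-plus-keyword crawl described at the end of the abstract can be illustrated with a minimal sketch. It is not the FoCUS implementation: it only shows a breadth-first crawl that starts from one seed URL and reports pages whose text contains a keyword. The names `crawl_for_keyword`, `LinkAndTextParser`, `fetch`, and `max_pages`, as well as the `forum.example` URLs, are assumptions made for illustration, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl_for_keyword(seed_url, keyword, fetch, max_pages=100):
    """Breadth-first crawl from seed_url; return URLs whose text has keyword.

    fetch is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page is unavailable.
    """
    keyword = keyword.lower()
    queue = deque([seed_url])
    seen = {seed_url}          # URLs already queued, to avoid revisits
    matches = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        if keyword in " ".join(parser.text_parts).lower():
            matches.append(url)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return matches


# Toy in-memory "forum" standing in for real HTTP fetches (made-up URLs).
PAGES = {
    "http://forum.example/": '<a href="/board/1">Board</a> Welcome',
    "http://forum.example/board/1": '<a href="/thread/7">T</a> <a href="/">Home</a> Topics',
    "http://forum.example/thread/7": "How to tune a web crawler politely",
}

print(crawl_for_keyword("http://forum.example/", "crawler", PAGES.get))
```

Passing `fetch` as a parameter keeps the sketch testable without network access. A real crawler would replace it with an HTTP client and would also need politeness controls, which are outside the scope of this illustration.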

Keywords

Web Crawling, Web Forums, FoCUS, EIT path, forum crawling, ITF regex, page classification, page type, URL pattern learning, URL type

International Journal of Science Engineering and Advance Technology is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at IJSEat. Permissions beyond the scope of this license may be available at http://creativecommons.org/licenses/by/3.0/deed.en_GB.

Copyright © 2013, All rights reserved. | ijseat.com