International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 5, October 2013. DOI: 10.5121/ijcsit.2013.5514

Data Mining Model for the Data Retrieval from Central Server Configuration

Srivatsan Sridharan (1), Kausal Malladi and Yamini Muralitharan (2)
(1) Department of Computer Science, IIIT - Bangalore, India.
(2) Department of Software Engineering, IIIT - Bangalore, India.

ABSTRACT

A server that must keep track of heavy document traffic is unable to filter out the documents that are most relevant and most recently updated for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve the recently updated documents that are most relevant to a query by applying a sliding-window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also eliminates duplicate documents from the results using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval.

KEYWORDS

Continuous Text Queries, MapReduce Technique, Sliding Window.

1. INTRODUCTION

Data-intensive applications such as electronic mail, news feeds, telecommunication management, and automated business reporting raise the need for a continuous text search and monitoring model. In this model the documents arrive at the monitoring server in the form of a stream. Each query Q continuously retrieves, from a sliding window of the most recent documents, the k documents that are most similar to a fixed set of search terms.

Sliding window. This window reflects the interest of the users in the newest available documents. It can be defined in two alternative ways: (a) a count-based window contains the N most recent documents for some constant number N; (b) a time-based window contains only the documents that arrived within the last N time units (a short code sketch of both definitions is given at the end of this introduction). Thus, even a document that is relevant to a query is ignored if it does not satisfy the time or count constraint of the user.

Incremental threshold method. The quintessence of the algorithm is to employ threshold-based techniques to derive the initial result for a query, and then to keep updating the threshold to reflect document arrivals and expirations. At its core lies a memory-based index similar to the conventional inverted file, complemented with fast update techniques.

MapReduce technique. MapReduce is a powerful platform for large-scale data processing. It involves two steps: (a) Map step: the master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. (b) Reduce step: the master node collects the answers to all the sub-problems and combines them to form the output, the answer to the problem it was originally trying to solve.

Unsupervised duplicate detection [3]. The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Here this method is used to identify documents that are alike and to prevent them from appearing in the result set. Our paper also focuses on ranking the documents based on user feedback. The user is allowed to give feedback for each document that has been retrieved. This feedback is used to rank the document and hence increase the probability of the document appearing in the sliding window.
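As a concrete illustration of the count-based and time-based windows defined above, the following minimal Python sketch (our own illustration; the class name, field names, and the use of a deque are assumptions, not details from the paper) maintains the window contents D as documents arrive:

    from collections import deque
    import time

    class SlidingWindow:
        """Window contents D under a count-based or time-based constraint."""

        def __init__(self, mode="count", N=1000):
            self.mode = mode     # "count": keep the N newest documents; "time": keep documents newer than N seconds
            self.N = N
            self.docs = deque()  # entries are (arrival_time, doc_id, composition_list), oldest first

        def add(self, doc_id, composition_list, now=None):
            now = time.time() if now is None else now
            self.docs.append((now, doc_id, composition_list))
            self.expire(now)

        def expire(self, now):
            if self.mode == "count":
                while len(self.docs) > self.N:                        # beyond the N most recent
                    self.docs.popleft()
            else:
                while self.docs and now - self.docs[0][0] > self.N:   # older than N time units
                    self.docs.popleft()

Because documents expire strictly in arrival order, arrivals are appended at the tail and expirations are removed from the head, which matches the first-in-first-out maintenance of D described later in Section 3.3.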
Visual Web Spider is a fully automated, multi-threaded web crawler that allows us to index and collect specific web pages on the Internet. Once installed, it enables us to browse the Web in an automated manner, indexing pages that contain specific keywords and phrases and exporting the indexed data to a database on our local computer in the format of our choice. We want to collect website links to build our own specialized web directory. Visual Web Spider can be configured through its wizard-driven interface, which lets us customize the search step by step: after opening the wizard, enter the starting web page URL or let the program generate URL links based on specific keywords or phrases; then set the crawling rules and depth according to the search strategy; finally, specify the data to be indexed and the project filename. Clicking on 'Start' sets the crawler to work. Crawling is fast, thanks to multi-threading that allows up to 50 simultaneous threads. Visual Web Spider can also index the content of any HTML tag, such as the page title (TITLE tag), page text (BODY tag), HTML code (HTML tag), header text (H1-H6 tags), bold text (B tags), anchor text (A tags), alt text (IMG tag, ALT attribute), and keywords and description (META tags), and it can list each page's size and last-modified date. Once the web pages have been indexed, Visual Web Spider can export the indexed data to any of the following formats: Microsoft Access, Excel (CSV), TXT, HTML, and MySQL script.

1.1. Key Features

- A personal, customizable web crawler
- Crawling rules
- Multi-threaded technology (up to 50 threads)
- Support for the robots exclusion protocol/standard (robots.txt file and Robots META tags)
- Indexing of the contents of any HTML tag
- Indexing rules
- Export of the indexed data to a Microsoft Access database, TEXT file, Excel (CSV) file, HTML file, or MySQL script file
- Crawling from a list of URLs specified by the user
- Crawling using keywords and phrases
- Storage of web pages and media files on the local disk
- Auto-resolution of the URLs of redirected links
- Detection of broken links
- Filtering of the indexed data

2. EXISTING SYSTEM

The drawbacks of existing servers that handle heavy document traffic are the following:

- They cannot efficiently monitor a data stream with highly dynamic document traffic.
- The server alone does all the processing, which consumes a large amount of time.
- For continuous text search queries and extraction, the entire document set has to be scanned every time in order to find the relevant documents.
- There is no guarantee that duplicate documents are not retrieved for a given query.
- A large number of documents cannot be stored in main memory, as this involves a large CPU cost.

Naïve solution. The most straightforward approach to evaluating the continuous queries defined above is to scan the entire window contents D after every update, or at fixed time intervals, compute all the document scores, and report the top-k documents. This method incurs a high processing cost due to the need for frequent recomputation from scratch, as illustrated by the sketch below.
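To make the cost of the naïve approach concrete, the following Python sketch (our own illustration, not code from the paper) re-evaluates a single query over the whole window; every document in D is rescored on every re-evaluation:

    def naive_topk(window_docs, query_weights, k):
        """Scan the entire window contents D and return the k highest-scoring documents.

        window_docs   : iterable of (doc_id, {term: wdt}) pairs, the current window contents D
        query_weights : {term: wQt}, the weights of the query terms
        """
        scored = []
        for doc_id, composition in window_docs:
            # S(d/Q): accumulate wQt * wdt over the terms shared by the query and the document
            score = sum(w_qt * composition.get(t, 0.0) for t, w_qt in query_weights.items())
            scored.append((score, doc_id))
        scored.sort(reverse=True)   # re-sorting everything after every update is the expensive part
        return scored[:k]

Re-running this for every registered query after every arrival or expiration is exactly the repeated from-scratch computation that the proposed system avoids.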
3. PROPOSED SYSTEM

3.1. Problem Formulation

In our model, a stream of documents flows into a central server. Users register text queries at the server, which is then responsible for continuously monitoring/reporting their results. As in most stream processing systems, we store all the data in main memory in order to cope with frequent updates, and we design our methods with the primary goal of minimizing the CPU cost. Moreover, it is necessary to reduce the workload of the monitoring server.

3.2. Proposed Solution

In our solution we use the MapReduce technique to reduce the workload of the central server: the server acts as the master node, which splits up the processing task among several worker nodes. The number of worker nodes assigned to the task depends on the nature of the query posed by the user. The master node, upon receiving a query from the user, assigns the workers to find the relevant result set for the query and return it to the master node. The master node, after receiving the partial solutions from the workers, integrates the results to produce the final result set for the given query; this is shown schematically in Fig. 1, and a simplified code sketch is given after Fig. 2. Each worker/slave node uses the incremental threshold algorithm to compute its result set of the k most relevant and recent documents for the given query. The overall system architecture is shown in Fig. 2.

Figure 2. System architecture for the proposed data retrieval model (a data retrieval system using MapReduce).
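The master/worker split of Section 3.2 can be sketched as follows in Python (a simplified illustration under our own assumptions: the window is statically partitioned into shards, each worker rescans its shard instead of running the incremental threshold algorithm, and the calls are sequential rather than distributed):

    import heapq

    def map_worker(doc_shard, query_weights, k):
        """Map step: a worker scores its shard of the window and returns a partial top-k list."""
        partial = []
        for doc_id, composition in doc_shard:
            score = sum(w_qt * composition.get(t, 0.0) for t, w_qt in query_weights.items())
            partial.append((score, doc_id))
        return heapq.nlargest(k, partial)

    def reduce_master(partial_results, k):
        """Reduce step: the master merges the workers' partial top-k lists into the final result."""
        return heapq.nlargest(k, [entry for partial in partial_results for entry in partial])

    def answer_query(doc_shards, query_weights, k):
        """Master node: distribute the query to the workers, then combine their partial answers."""
        partials = [map_worker(shard, query_weights, k) for shard in doc_shards]  # conceptually parallel
        return reduce_master(partials, k)

In the paper each worker instead maintains its own in-memory inverted index and answers with the incremental threshold algorithm of Section 3.3; only the combining step at the master is the same.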
Each element of the input stream comprises a document d, a unique document identifier, the document arrival time, and a composition list. The composition list contains one (t, wdt) pair for each term t belonging to T that appears in the document, where wdt is the frequency of the term t in the document d. The notations used in this model are listed in Fig. 3.

Figure 3. A detailed list of the notations.

Each worker node maintains an inverted index over the terms t of its documents. With the inverted index, a query Q is processed as follows: the inverted lists for the terms t belonging to Q are scanned, and the partial wdt scores of each encountered document d are accumulated to produce S(d/Q). The documents with the highest scores at the end are returned as the result.

3.3. Incremental Threshold Algorithm

Fig. 3 represents the data structures used in this system. The valid documents D are stored in a single list, shown at the bottom of the figure. Each element of the list holds the stream information of a document (identifier, text content, composition list, arrival time). D contains the most recent documents for both count-based and time-based windows. Since documents expire in a first-in-first-out manner, D is maintained efficiently by inserting arriving documents at the end of the list and deleting expiring ones from its head.

On top of the list of valid documents we build an inverted index. The structure at the top of the figure is the dictionary of search terms. It is an array that contains an entry for each term t belonging to T. The dictionary entry for t stores a pointer to the corresponding inverted list Lt. Lt holds an impact entry for each document d that contains t, together with a pointer to d's full information in the document list. When a document d arrives, an impact entry (d, wdt), derived from d's composition list, is inserted into the inverted list of each term t that appears in d. Likewise, the impact entries of an expiring document are removed from the respective inverted lists. The inverted lists are kept sorted on wdt in a structure that supports fast (logarithmic) insertions and deletions.

Initial top-k search. When a query is first submitted to the system, its top-k result is computed using the initial search module. The process is an adaptation of the threshold algorithm. Here, the inverted lists Lt of the query terms play the role of the sorted attribute lists. Unlike the original threshold algorithm, however, we do not probe the lists in a round-robin fashion. Since the similarity function associates a different weight wQt with each query term, inspired by [4] we probe the list Lt with the highest ct = wQt · wdnxt,t value, where dnxt is the next document in Lt. The global threshold gt, a notion used identically to the original algorithm, is the sum of the ct values over all terms in Q.

Consider query Q1 with search string "red rose" and k = 2, and let t20 = "red" and t11 = "rose". First the server identifies the inverted lists L11 and L20 (using the dictionary hash table) and computes the values c11 = wQ1,t11 · wd7,t11 and c20 = wQ1,t20 · wd6,t20. In iteration 1, since c20 is larger, the first entry of L20 is popped; the similarity score of the corresponding document, d6, is computed by accessing its composition list in D, and d6 is inserted into the tentative result R. c20 is then updated to the value of the next impact entry in L20. A document whose score exceeds the local threshold but has not yet been fully verified is still included in R as an unverified entry.
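The core of the initial top-k search can be sketched in Python as follows (our own simplification: the data-structure layout and names are assumptions, and the per-query local thresholds and unverified entries of the full method are omitted). The sketch always probes the inverted list with the highest ct and stops once the current k-th best score reaches the global threshold:

    import heapq

    def initial_topk(inverted_lists, query_weights, documents, k):
        """Threshold-algorithm style initial top-k search (simplified).

        inverted_lists : {term: [(wdt, doc_id), ...]} sorted by wdt in descending order
        query_weights  : {term: wQt} for the query terms
        documents      : {doc_id: {term: wdt}}, the composition lists of the window contents D
        """
        cursors = {t: 0 for t in query_weights}            # next unread position in each inverted list
        seen, result = set(), []                           # result: min-heap of (score, doc_id), size <= k

        def c(t):
            """c_t = wQt * wdt of the next unread entry of L_t (0 once the list is exhausted)."""
            lst, pos = inverted_lists.get(t, []), cursors[t]
            return query_weights[t] * lst[pos][0] if pos < len(lst) else 0.0

        while True:
            threshold = sum(c(t) for t in query_weights)   # global threshold: sum of the c_t values
            if len(result) == k and result[0][0] >= threshold:
                break                                      # the k-th best score already meets the threshold
            t = max(query_weights, key=c)                  # probe the list with the highest c_t
            if c(t) == 0.0:
                break                                      # all inverted lists are exhausted
            _, doc_id = inverted_lists[t][cursors[t]]
            cursors[t] += 1
            if doc_id not in seen:                         # compute the full score only once per document
                seen.add(doc_id)
                comp = documents[doc_id]                   # full composition list from D
                score = sum(w * comp.get(term, 0.0) for term, w in query_weights.items())
                heapq.heappush(result, (score, doc_id))
                if len(result) > k:
                    heapq.heappop(result)                  # keep only the k best tentative results
        return sorted(result, reverse=True)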
The incremental maintenance algorithm, covering both document arrival and expiration, is as follows.

Algorithm: Incremental Threshold with Duplicate Detection (arriving d_ins, expiring d_del)

 1: Insert document d_ins into D (the system document list)
 2: for all terms t in the composition list of d_ins do
 3:     for all documents in L_t do
 4:         for all terms t in d_ins do
 5:             Compute unique(d_ins)
 6:             w(d_ins, t) != w(d_nxt, t)
 7:     Insert the impact entry of d_ins into L_t
 8:     Probe the threshold tree of L_t
 9:     for all queries Q where w(d_ins, t) >= localThreshold do
10:         if Q has not been considered for d_ins in another L_t then
11:             Compute S(d_ins / Q)
12:             Insert d_ins into R
13:             if S(d_ins / Q) >= old S_k then
14:                 Update S_k (since d_ins enters the top-k result)
15:                 Keep rolling up local thresholds while r <= S_k
16:                 Set new τ as the influence threshold for Q
17:                 Update the local thresholds of Q
18: Delete document d_del from D (the system document list)
19: for all terms t in the composition list of d_del do
20:     Delete the impact entry of d_del from L_t
21:     Probe the threshold tree of L_t
22:     for all queries Q where w(d_del, t) >= localThreshold do
23:         if Q has not been considered for d_del in another L_t then
24:             Delete d_del from R
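A rough Python rendering of the arrival branch (steps 1-14) is given below. It is a sketch under strong simplifying assumptions, not the paper's implementation: the threshold tree of each Lt is replaced by a flat list of (localThreshold, query) entries, the duplicate check of steps 3-6 becomes an exact text comparison, and the rolling-up of local thresholds in steps 15-17 is omitted:

    import heapq

    def handle_arrival(d_ins, D, inverted_lists, query_thresholds, results, k=10):
        """Simplified arrival branch of the incremental threshold algorithm.

        d_ins            : {'id': ..., 'text': ..., 'composition': {term: wdt}}
        D                : list of valid documents (the window contents), oldest first
        inverted_lists   : {term: [(wdt, doc_id), ...]} kept sorted in descending order
        query_thresholds : {term: [(local_threshold, query_id, {term: wQt}), ...]}
        results          : {query_id: min-heap of (score, doc_id) holding at most k entries}
        """
        D.append(d_ins)                                                  # step 1: insert into D

        # Steps 3-6 (duplicate detection), simplified to an exact-text comparison.
        if any(doc['text'] == d_ins['text'] for doc in D[:-1]):
            return

        considered = set()
        for t, wdt in d_ins['composition'].items():
            lst = inverted_lists.setdefault(t, [])
            lst.append((wdt, d_ins['id']))                               # step 7: impact entry into L_t
            lst.sort(reverse=True)                                       # keep L_t sorted on wdt

            for local_threshold, qid, weights in query_thresholds.get(t, []):   # steps 8-9
                if wdt >= local_threshold and qid not in considered:            # step 10
                    considered.add(qid)
                    score = sum(w * d_ins['composition'].get(term, 0.0)
                                for term, w in weights.items())                  # step 11: S(d_ins/Q)
                    heap = results.setdefault(qid, [])
                    heapq.heappush(heap, (score, d_ins['id']))                   # step 12: insert into R
                    if len(heap) > k:
                        heapq.heappop(heap)                                      # steps 13-14: maintain S_k

The expiration branch (steps 18-24) is symmetric: the expiring document's impact entries are removed from each Lt, and the document is deleted from any result set R in which it still appears.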