Comparison of Ontology Learning Techniques for Qur'anic Text

Conference Paper · July 2011 · 2011 International Conference on Future Computer Sciences and Application
DOI: 10.1109/ICFCSA.2011.50 · Source: IEEE Xplore
https://www.researchgate.net/publication/224250736

Ching Yee Yong, Rubita Sudirman
Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor, Malaysia
chiyong@fkegraduate.utm.my

Kim Mey Chew, Naomie Salim
Faculty of Computer Science, Universiti Teknologi Malaysia, Johor, Malaysia
kimmey_chew@yahoo.com

Abstract— Ontology currently plays an important role in semantic web technology. Ontology learning approaches are distinguished by the type of input they start from, such as texts, dictionaries, knowledge bases, relational schemata and semi-structured schemata. Ontology learning can be described as an information extraction subtask whose objective is to mine the relevant concepts and relationships from a corpus or another kind of data set.
In this project, ontology learning by text extraction from Qur'anic text was assessed using a newly developed support system. The algorithms used to extract terms from the Qur'anic text are Alfonseca and Manandhar's approach and Gupta and colleagues' approach. The support system evaluates these two algorithms and compares them against a manual text extraction (the Gold Standard) in order to identify a method suitable for extracting ontologies from Qur'anic text, which can help more people understand the true meaning of the Qur'an's teaching.

Keywords— ontology learning; text extraction; natural language; recognition; classification.

I. INTRODUCTION

Ontology learning is a subtask of information extraction. Its goal is to extract relevant concepts and relations semi-automatically from a given corpus or other kind of data set to form an ontology. Ontologies play a key role in semantic web technologies by defining concepts and the relations among them. Maedche and Staab distinguish ontology learning approaches according to the type of input used [1]: ontology learning from texts, from dictionaries, from knowledge bases, from semi-structured schemata and from relational schemata. Many significant studies have been conducted on ontology learning and these techniques.

This study is divided into six sections. The first section introduces the study and gives a general overview of the text extraction and recognition analysis system. The second section states the objectives of the study. The third section discusses the background, literature review and basic concepts. Section 4 covers the study implementation.
A specification of the computing environment and a thorough discussion of the development tools and of the assessment and analysis of text extraction are given in Section 5. Finally, the last section contains the conclusions, future developments and possible enhancements of this study.

II. PROBLEM FORMULATION

The hypothesized function of this support system is to provide users with information about the process of ontology extraction from a Qur'anic text (Yusof Ali's English translation) as input. Two algorithms, Alfonseca's and Gupta's, were compared on the term, synonym and concept layers. The objectives of the study are as follows:

• To implement the algorithms of Alfonseca & Manandhar and Gupta & colleagues for ontology extraction analysis in a newly developed support system.
• To compare the main functions and integration processes of the newly developed support system.
• To enhance the support system by combining the algorithms, using Qur'anic text as input.

In this study, an effective support system for Qur'anic text extraction is presented. Several aspects, such as the Qur'anic English translation version and the performance measurement, are considered during the assessment. The support system does not only serve extraction as assessment; it is also important for managing the data effectively and for providing scientific information about extraction from Qur'anic text.

III. LITERATURE REVIEW

This section discusses a few basic concepts regarding ontology learning, the extraction methods and the theories used in the proposed support system.

A. Ontology Learning

The term ontology learning was originally introduced by Maedche and can be described as the acquisition of a domain model from data [2].
Benz mentions two fundamental aspects of ontology learning [3]. The first is the availability of prior knowledge: whether the learning process is performed from scratch or some prior knowledge is available. Such prior knowledge is used to construct a first version of the ontology, so a source of prior knowledge should demand little effort to be transformed into that first version. This version is then extended automatically through learning procedures and manually by a knowledge engineer [4]. The second fundamental aspect is the type of input used by the learning process. Benz defines three kinds of input: structured data, semi-structured data and unstructured data.

B. Alfonseca and Manandhar's Algorithm

This method automatically acquires contextual properties of the words that co-occur with each concept in a given set. It can then be used either to cluster concepts inside an ontology or to refine the ontology by adding new concepts. The principle is based on the hypothesis of distributional semantics, which holds that the meaning of a word is highly correlated with the contexts in which it appears. The hypothesis can be generalized to cover complex phrases such as whole noun phrases instead of single words. The contexts can be encoded as vectors of context words, as in the case of topic signatures [5]. Using topic signatures, each concept is represented by the set of words that co-occur with it, together with the frequencies with which they appear. Several similarity metrics, such as TF-IDF or chi-square, can then be used to measure the distance between concepts. Alfonseca and Manandhar describe a top-down classification algorithm for extending existing ontologies such as WordNet with new concepts [6].
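The topic-signature idea above (representing a concept by the words it co-occurs with and their frequencies, then comparing concepts with a similarity metric) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation; the window size and the cosine measure, used here in place of TF-IDF or chi-square, are assumptions.

```python
from collections import Counter
from math import sqrt

def topic_signature(sentences, concept, window=3):
    """Build a topic signature for `concept`: the words co-occurring
    with it (within +/- `window` tokens) and their frequencies."""
    sig = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == concept:
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                sig.update(context)
    return sig

def cosine(sig_a, sig_b):
    """Distributional similarity between two topic signatures."""
    shared = set(sig_a) & set(sig_b)
    dot = sum(sig_a[w] * sig_b[w] for w in shared)
    norm = sqrt(sum(v * v for v in sig_a.values())) * sqrt(sum(v * v for v in sig_b.values()))
    return dot / norm if norm else 0.0
```

Two concepts whose signatures share many frequent context words then score close to 1, which is the property the clustering and refinement steps rely on.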
As it is based on contextual information, this method requires that several occurrences of the concepts to be classified are available, so that there is enough contextual information to generate the topic signatures. The method has been used to automatically classify high-frequency concepts from historical texts for generating e-learning web sites, such as Darwin's The Voyage of the Beagle, Osler's The Evolution of Modern Medicine and Hegel's Lectures on the History of Philosophy.

C. Gupta and Colleagues' Algorithm

This approach aims to enable rapid development of SubWordNets for Natural Language Processing (NLP) applications, and proposes an iterative three-step lexicon engineering cycle for developing them. According to Gupta's study, the first step is to discover concept elements [7]: words, generated multi-word phrases, and potential relationships among these elements that occur in the input sub-language documents. For example, "Marine Mountain Warfare Training" and "Maritime Interception Operation Training" would be discovered as multi-word phrases in the navy lessons domain. An unnamed relation between them could be discovered and suggested to the user, who can then identify the relation as of meronym or holonym type. This step typically combines shallow language and text processing with learning, discovery and extraction techniques. The second step is to identify concepts: new concepts and relations are identified from the phrases and relations discovered in the previous step. According to Gupta, concept identification is supported by grouping phrases into concept nodes and establishing concordance with synsets in WordNet.
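The phrase-discovery step (step 1) can be approximated with a simple finder for recurring capitalised word sequences, in the spirit of the "Marine Mountain Warfare Training" example above. This is a rough illustrative sketch under assumed heuristics (capitalisation and a recurrence threshold), not the shallow-processing pipeline the approach actually prescribes.

```python
from collections import Counter

def discover_phrases(documents, min_count=2):
    """Discover candidate multi-word phrases: capitalised word
    sequences of length >= 2 that recur across the input documents."""
    candidates = Counter()
    for doc in documents:
        run = []
        for tok in doc.split():
            if tok[:1].isupper():
                run.append(tok)
            else:
                if len(run) >= 2:
                    candidates[" ".join(run)] += 1
                run = []
        if len(run) >= 2:  # phrase ending at the document boundary
            candidates[" ".join(run)] += 1
    return {p: n for p, n in candidates.items() if n >= min_count}
```

In the full approach the surviving candidates, plus suggested relations between them, would then be presented to the user for confirmation.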
The new concept nodes and relationships then update the SubWordNet. Maintaining concepts is the third step of the approach; it allows controlled insertion, deletion and updating of the concepts and relations derived in the previous step in a SubWordNet while preserving its integrity. Users can iterate through these steps with as many sub-language documents as needed to develop SubWordNets and maintain them on an ongoing basis.

IV. METHODOLOGIES

This section describes the procedure and the methods used for the text extraction assessment. Microsoft Visual C++, Stanford Parser Online and Genia Tagger are used for all the programming in this support system. The overall flow of the language support system is shown in Figure 1. A few scopes ensure that the study is conducted within the boundary set and heads in the right direction to achieve the intended objectives.

Figure 1. System flow.

A. Data Acquisition and Management (62 Ayah of Prayer)

A set of data (the 62 Ayah of Prayer Gold Standard document) is extracted from Yusof Ali's English translation of the Qur'an. The 62 Ayah of Prayer are then parsed using Stanford Parser Online and Genia Tagger Online to distinguish singular nouns (NN), singular proper nouns (NNP), plural nouns (NNS), noun phrases (NP), verb phrases (VP), adverb phrases (ADVP), prepositional phrases (PP), etc. All the prayer input data are saved in the database as text files, as shown in Figure 2.

Figure 2. Data input interface.

B. Text Extraction Assessment

1) Alfonseca and Manandhar's Algorithm (Term and Synonym Concepts)

General Named Entity Recognition (GNER) was selected as the extraction technique in this project. GNER is the task of identifying an unknown concept or instance in a text. Through this technique, the unknown terms/concepts contained in the 62 Ayah of Prayer can be recognized. In contrast, Named Entity Recognition (NER) is the task of finding and classifying objects in a text.
NER is more restricted: it is a task where the hierarchy is flat and contains only a few concepts, such as person, organisation and location. With the GNER technique, the terms/concepts in an Ayah that fall in the NN, NNP, NNS and NP categories are easily recognized. The terms/concepts found after extraction are compared with the comparison data in the Gold Standard document, which was produced by a domain expert, to determine whether they are relevant terms/concepts belonging to prayer and to see the percentage of relevant terms. The comparison determines the suitability of this method for extracting terms from the 62 Ayah of Prayer. If the method achieves this goal, it can be used to extract terms from other categories of Ayah in the Qur'an and to form a collection, which can then be used to determine the type of an Ayah of prayer.

Figure 3. Method 1 extraction process.

Figure 3 shows the flow of the GNER technique. The input data are taken from "62 Ayah of Prayer (Stanford Parser).txt" and "62 Ayah of Prayer (Genia Tagger).txt". The developed application processes the input data and selects the terms belonging to the NN, NNS, NNP and NP categories. The results from the two parsers are saved as "Input Data (Stanford Parser).txt" and "Input Data (Genia Tagger).txt". The application then merges both files and saves the result as "Method 1 Input Data.txt".
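The term-selection step, keeping only tokens in the NN, NNS and NNP categories from the tagger output, can be sketched as below. The word/TAG input format and the helper itself are illustrative assumptions; the project's actual implementation is in Visual C++ over the Stanford Parser and Genia Tagger output files.

```python
# Keep only noun-category terms (NN, NNS, NNP) from tagged text,
# mimicking the GNER term-selection step on tagger output.
NOUN_TAGS = {"NN", "NNS", "NNP"}

def select_terms(tagged_sentence):
    """tagged_sentence: 'word/TAG word/TAG ...' pairs as produced by
    a POS tagger; returns the noun terms in order, without duplicates."""
    terms = []
    for pair in tagged_sentence.split():
        word, _, tag = pair.rpartition("/")
        if tag in NOUN_TAGS and word.lower() not in (t.lower() for t in terms):
            terms.append(word)
    return terms
```

The terms collected this way would then be matched against the Gold Standard list to count relevant versus irrelevant extractions.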
The study proceeds in six phases, summarised in the system flow of Figure 1:

Phase 1: Comparison data preparation (Gold Standard document)
Phase 2: Input data preparation (English translated Qur'anic text)
Phase 3: Method 1: Alfonseca & Manandhar's method (NER and WSD techniques)
Phase 4: Method 2: Gupta and colleagues' approach, based on Hearst patterns
Phase 5: Comparison and analysis of the results (precision and recall, compared between the two methods and with existing research)
Phase 6: Discussion and report preparation

2) Gupta and Colleagues' Algorithm (Relation Concept)

There are three steps in Gupta and colleagues' approach: discovering concept elements, identifying concepts and maintaining the concepts. Since this algorithm concentrates only on noun phrases (NP), the Genia Tagger Online tool is used to parse out the regions of interest in the 62 Ayah of Prayer, which are then analysed with Hearst patterns [8].

Figure 4. Method 2 extraction process.

In Figure 4, the eight buttons H1, H2, H3, H4, D1, D2, Apposition and Copula represent the eight patterns as improved by Cimiano [9]. After keying in the input data parsed by Genia Tagger, the data are tested against the relevant pattern by selecting one of the eight buttons. The output appears in the list boxes beside the buttons: the first list box, named "Result", shows the data extracted from the keyed-in Ayah of Prayer, and the second, named "Concept Relations 'isa'", shows the final result for the extracted data. The user can check the output before saving; the information is stored under a file name given by the user. Finally, the saved result files are merged, and the merged data are saved in the "Method 2 Input Data.txt" text file. The data in this file are then compared with the comparison data in "Gold Standard (Concept Relation) Hearst Pattern.txt".
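Hearst-pattern matching of the kind used in Method 2 can be sketched with regular expressions. Only two of the eight patterns are shown, the regexes are simplified, and matching raw words rather than the tagger's NP chunks is an assumption made for brevity; sentences are assumed to carry no trailing punctuation.

```python
import re

# Two simplified Hearst-style patterns for "isa" relations.
# Real use would match NP chunks from the tagger, not raw words.
SUCH_AS = re.compile(r"(\w+) such as ([\w, ]+?)$")       # "NPs such as NP, NP and NP"
AND_OTHER = re.compile(r"([\w, ]+?),? and other ([\w ]+)$")  # "NP, NP and other NPs"

def extract_isa(sentence):
    """Return (hyponym, hypernym) 'isa' pairs found in `sentence`."""
    relations = []
    m = SUCH_AS.search(sentence)
    if m:
        for hypo in re.split(r",? and |,? or |, ", m.group(2)):
            if hypo:
                relations.append((hypo.strip(), m.group(1)))
    m = AND_OTHER.search(sentence)
    if m:
        for hypo in re.split(r", ", m.group(1)):
            if hypo:
                relations.append((hypo.strip(), m.group(2)))
    return relations
```

Each returned pair corresponds to one line in the "Concept Relations 'isa'" list box of the application.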
C. Generation of Assessment Report

After the assessment, the results of the text extraction assessment for the two algorithms are generated as in Table I.

TABLE I. RESULTS COMPARING THE TWO METHODS

                                                Method 1   Method 2
Total data (Gold Standard document)                237         32
Total input data (Genia Tagger) after process      781         10
Total matched terms                                 96          0
Total missed terms                                 141         32
Precision                                         0.12       0.00
Recall                                            0.41       0.00

D. Support System Performance Evaluation

Table II shows the confusion matrices of text extraction for the two algorithms.

TABLE II. AVERAGE CONFUSION MATRIX

(Gold Standard document as actual class)
Method 1 (GNER):            tp = 96   fp = 685   fn = 141   tn = 0
Method 2 (Hearst Pattern):  tp = 0    fp = 10    fn = 32    tn = 0

Assessment of text extraction on the term concept using the Alfonseca algorithm produced 96 true positives, while the Gupta method scored 0 true positives out of 32 relation concepts. Given the high proportion of missed terms and relation concepts, the support system must be enhanced in order to have a positive impact on text extraction.

E. Testing with Other Input Data (English Text)

To evaluate the performance of the support system, common English text was also selected for an assessment test. These English texts were cited from a journal [10] and from other internet sources. The cited English text was parsed with the Genia Tagger Online tool; the parsed English sentences were then put into the application and tested against the eight Hearst patterns. The recorded results show that the Hearst patterns are suitable for extracting the relationship between two concepts in English text.

V. DISCUSSION

After the support system was developed, some testing was conducted. First, the GUI application for text extraction assessment and training was tested. Then, user acceptability testing was carried out to obtain users' feedback on the extraction system.
Last but not least, analyses of semi-automatic screening and training for recognition ability, as well as testing of graphical generation and user database management, were carried out in order to make the system reliable and trustworthy for its users.

The Gold Standard is defined as the actual class and the Hearst Pattern output as the predicted class. Based on the Hearst Pattern section of Table II, 10 samples of data were treated as false positives (FP), meaning that the technique extracted 10 incorrect samples. The table also shows 32 terms treated as false negatives (FN), i.e. relation concepts in the Gold Standard that the technique failed to extract.
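The precision and recall values reported in Tables I and II follow directly from the true positive, false positive and false negative counts; a minimal sketch to reproduce them:

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn).
    Both are taken as 0 when the denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Method 1 (GNER):   tp=96, fp=685, fn=141 -> precision ~0.12, recall ~0.41
# Method 2 (Hearst): tp=0,  fp=10,  fn=32  -> precision 0.00,  recall 0.00
```

These match the figures in Table I, confirming that the reported precision and recall were computed over the 781 extracted terms and the 237 Gold Standard terms respectively.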