Monday, June 23, 2014

Noun Phrase Chunking

Noun phrase chunking is the process of extracting the noun phrases from a sentence. It is performed after POS tagging has resolved the ambiguity in the tags. This system uses the noun phrase chunker for the Tamil language developed by Sobha, L.
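
As a rough illustration of what a chunker does with POS-tagged input, the sketch below groups runs of determiners, adjectives and nouns into chunks. It is not the Tamil chunker mentioned above: the tag names (`DET`, `ADJ`, `NOUN`), the English example and the function name `chunk_noun_phrases` are all assumptions chosen only to keep the example small.

```python
# Tags that may appear inside a noun phrase (hypothetical tagset).
NP_TAGS = {"DET", "ADJ", "NOUN"}


def chunk_noun_phrases(tagged_words):
    """Return noun-phrase strings from a list of (word, tag) pairs."""
    chunks = []
    run = []
    # A sentinel with tag None closes the final run.
    for word, tag in list(tagged_words) + [("", None)]:
        if tag in NP_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing determiners/adjectives so each chunk ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            chunks.append(" ".join(w for w, _ in run))
        run = []
    return chunks


sentence = [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
            ("climbed", "VERB"), ("the", "DET"), ("stairs", "NOUN")]
print(chunk_noun_phrases(sentence))
```

A real chunker for Tamil would need rules or a model built on a Tamil tagset; this only shows the input and output shape of the task.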

Information extraction is the process of extracting relevant information from text and putting it into a predefined template. In functional terms, this converts the Web into a database that end-users can search or organize into taxonomies. The precision and efficiency of information access improve when digital content is organized into tables within a database.

Crawler Algorithm

The proposed system uses the following algorithm for extracting the news contents from online Tamil newspapers. The algorithm can be described informally as follows:
 1. Read the web site address into a variable.
 2. Parse the contents of the home page to find all the links.
 3. Repeat the following steps for each link:
    3.1  Check whether the link is an anchor and contains the edition link id.
    3.2  Store each edition link in an array for further processing.
 4. Repeat the following steps for each edition link:
    4.1  For each link in the edition, check whether it is an anchor and contains the news link id.
    4.2  Collect the entire content of that link.
    4.3  Remove all content except the news content, which is marked by the paragraph tag and the font tag with the specific Tamil font script.
    4.4  Read the content of the paragraph.
    4.5  Remove the control characters and store the text in a file for further processing.
 5. Stop.
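
The steps above can be sketched in Python using only the standard library. This is a minimal sketch under several assumptions, not the system's actual code: the function names, the idea that the "edition link id" and "news link id" are substrings of the link URL, and the `fetch_page` parameter are all hypothetical. It also keeps the text of every paragraph and font tag rather than checking for a specific Tamil font, and it returns the cleaned text instead of writing files.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class ParagraphText(HTMLParser):
    """Collects text found inside <p> and <font> tags (steps 4.3 and 4.4)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "font"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("p", "font") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def links_containing(html, base_url, marker):
    """Return absolute URLs of anchors whose href contains `marker`."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, h) for h in parser.hrefs if marker in h]


def extract_news_text(html):
    """Keep paragraph text and strip ASCII control characters (step 4.5)."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Tab and newline are kept; other ASCII control characters are removed.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)


def fetch(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(home_url, edition_id, news_id, fetch_page=fetch):
    """Follow home page -> edition links -> news links; return {url: text}."""
    articles = {}
    home_html = fetch_page(home_url)                              # steps 1-2
    for edition_url in links_containing(home_html, home_url, edition_id):
        edition_html = fetch_page(edition_url)                    # step 4
        for news_url in links_containing(edition_html, edition_url, news_id):
            articles[news_url] = extract_news_text(fetch_page(news_url))
    return articles
```

Passing `fetch_page` as a parameter makes it possible to try the logic on saved pages before pointing it at a live newspaper site. Pages that omit closing paragraph tags would need a more tolerant parser than this sketch.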

Sunday, May 11, 2014

Structured database

This is an exciting vision for reordering how end-users retrieve and organize digital information. Once information is encoded in a database, it can be organized into a taxonomy or searched by textual attribute or feature. This is a vast improvement over the usual search protocol of indexing content and querying full-text documents by keyword. Information extraction (IE) is an attempt to convert information from various text documents into database entries, and it plays a key role in improving online knowledge discovery.
The two main methods of information extraction technology are:
- Natural Language Processing
- Wrapper Induction

INFORMATION EXTRACTION

Information Extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents and put it into a predefined template. In functional terms, this converts the Web into a database that end-users can search or organize into taxonomies. The precision and efficiency of information access improve when digital content is organized into tables within a database. To improve accuracy and ease development, IE is usually domain or topic specific. Because the parameters that define a particular topic are determined a priori, IE systems are fully customizable. In addition to synonymy and homonymy, IE must also contend with co-reference, where different expressions refer to the same thing. For IE to work correctly, entities such as places and person names must be identified within a block of text. Information extraction therefore involves discourse analysis: co-reference recognition links a mention to an entity introduced earlier in the discourse.
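
To make the idea of a predefined template concrete, here is a minimal sketch that fills a small record from free text by looking words up in fixed lists. Everything in it is an assumption made for illustration: the `NewsTemplate` fields, the tiny `PERSONS` and `PLACES` lists and the English example sentence. A real IE system would use proper entity recognition and co-reference handling rather than word lookup.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical lookup lists standing in for real entity recognition.
PERSONS = {"Kumar", "Meena"}
PLACES = {"Chennai", "Madurai"}


@dataclass
class NewsTemplate:
    """Predefined template: the slots the extractor is expected to fill."""
    persons: List[str] = field(default_factory=list)
    places: List[str] = field(default_factory=list)


def fill_template(text):
    """Scan the text word by word and fill the template slots."""
    template = NewsTemplate()
    for token in re.findall(r"\w+", text):
        if token in PERSONS and token not in template.persons:
            template.persons.append(token)
        elif token in PLACES and token not in template.places:
            template.places.append(token)
    return template


record = fill_template("Kumar visited Chennai. Kumar met Meena in Madurai.")
print(record)
```

Each filled template corresponds to one row of a database table, which is what makes the extracted content searchable by attribute instead of by keyword.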

POS tagging

POS tagging is the process of assigning a contextual tag to each morpheme. It resolves the ambiguity that remains after morphological analysis. Arulmozhi, P. et al. (2004) [1] developed a tool that resolves this ambiguity using a rule-based approach. The input to a tagging algorithm is a string of words of a natural language sentence and a specified tagset. The output is the single best POS tag for each word. As Tamil is a morphologically rich language, the morphological analyzer itself can identify the part of speech in most cases. However, it fails to resolve some lexical ambiguities, for which a POS tagger is needed. Typically, stochastic models that use information about neighboring words are used to assign the appropriate tags. For example, in the sentence,

  ‘naan pati erineen’ (I climbed the stairs)