Tuesday, April 22, 2014


NLP techniques for preprocessing and NER, and rule-based methodology for IE

End users generally need web content to be available in a database so that specific information can be processed and retrieved. The Tamil Web Database system creates a database automatically from unstructured documents: it extracts information from online Tamil newspapers and presents domain-specific information to the user in a predefined template. This template can be used for query processing and summarized report generation.
Input:
Domain-specific query and online Tamil newspaper websites
Output:
Populated database
To design a system for automatic database creation from unstructured documents, i.e., a Tamil Web to Database system, that provides domain-specific information to the end user by using language engineering techniques such as Information Extraction (IE) and Information Retrieval (IR).

Co-reference resolution

The text type is the kind of text being processed, for example Wall Street Journal articles, email messages, or HTML documents from the WWW. The domain specifies the broad subject matter of those texts, for example financial news, tourist information, or technical information. The scenario represents the particular event types that the IE user is interested in extracting.
A new Information Extraction system for the crime domain, working on online Tamil newspapers, is proposed: given a set of web documents from a specific domain, the IE system automatically populates a predefined database by extracting relevant fragments from the documents.
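
As a rough sketch of what "populating a predefined database" could look like, the example below stores one filled template using Python's built-in sqlite3 module. The table name, the template fields (event_type, location, event_date, source_url), and the sample values are assumptions made for illustration; they are not the schema of the proposed system.

```python
import sqlite3

# Hypothetical template for one crime event; the field names are assumptions.
TEMPLATE_FIELDS = ("event_type", "location", "event_date", "source_url")


def create_database(path=":memory:"):
    """Create the predefined table that the IE system populates."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crime_event ("
        "id INTEGER PRIMARY KEY, event_type TEXT, location TEXT, "
        "event_date TEXT, source_url TEXT)"
    )
    return conn


def store_event(conn, extracted):
    """Insert one filled template; slots that were not extracted stay NULL."""
    values = [extracted.get(field) for field in TEMPLATE_FIELDS]
    conn.execute(
        "INSERT INTO crime_event (event_type, location, event_date, source_url) "
        "VALUES (?, ?, ?, ?)",
        values,
    )
    conn.commit()


def events_at(conn, location):
    """Answer a simple domain query from the populated table."""
    rows = conn.execute(
        "SELECT event_type, event_date FROM crime_event WHERE location = ?",
        (location,),
    )
    return rows.fetchall()
```

Once the table is populated, a domain-specific query such as events_at(conn, "Chennai") reads directly from the structured rows instead of re-reading the newspaper text.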

Information Extraction

It is possible to compare the output of a given system to an ideal database in order to determine how well that system is doing. This distinguishes IE systems from other NLP systems, where evaluation is well known to be highly problematic.
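
One common way to make this comparison concrete is to compute precision and recall against the ideal database. The post does not name a specific measure, so treat the choice of precision and recall, and the function name below, as assumptions for illustration.

```python
def precision_recall(extracted, ideal):
    """Compare extracted records with the records of the ideal database.

    precision = correct extracted records / all extracted records
    recall    = correct extracted records / all ideal records
    """
    extracted, ideal = set(extracted), set(ideal)
    correct = len(extracted & ideal)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall
```

Records are compared as whole tuples here, so a record with one wrong slot counts as incorrect; a slot-by-slot comparison would be a finer-grained alternative.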
The typical subtasks of IE are
1.    Named Entity Recognition (NER)
          NER identifies and classifies the atomic elements in text into predefined categories such as the names of persons, organizations, and locations, and expressions of time and quantity. There are two types of NER:
§  Grammar-based techniques
§  Statistical models
The grammar-based approach identifies all the names of people, places, and organizations, as well as dates and amounts of money, but it requires experienced linguists and months of costly work. Statistical NER systems require much training data, but can be ported to other languages much more rapidly and require less work overall.
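
To give a feel for the grammar-based (rule-based) side, here is a minimal sketch that tags dates and amounts of money with hand-written patterns. The two patterns, the "Rs." currency prefix, and the English date format are assumptions chosen for illustration; a real system for Tamil text would need rules written for that language.

```python
import re

# Hand-written rules (assumptions for illustration): one pattern per category.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
RULES = [
    ("DATE", re.compile(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS)),
    ("MONEY", re.compile(r"\bRs\.? ?\d[\d,]*(?:\.\d+)?")),
]


def tag_entities(text):
    """Return (category, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order by position in the text
    return [(label, value) for _, label, value in found]
```

Every new category or surface form needs another hand-written rule, which is where the linguist effort mentioned above goes.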

NLP problem

§  Processing written text, using lexical, syntactic, and semantic knowledge as well as the required real-world information.

§  Processing spoken language, using all the information needed plus additional knowledge about phonology.

NLP includes both understanding and generation. Understanding is the process of mapping from an input form into a more immediately useful form.
The system uses the following steps in the process of NL understanding:
1.      Morphological Analysis
              It is the process of breaking words down into their morphemes.
2.      POS Tagging
             It is the process of assigning a contextual tag to each morpheme, resolving ambiguity.

3.      NP Chunking
              It deals with extracting the noun phrases from a sentence.
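
The last step can be sketched as follows. The example assumes the earlier steps have already produced (word, tag) pairs, and it uses Penn Treebank style tags (DT, JJ, NN...) and the pattern "optional determiner, any adjectives, one or more nouns" purely as illustrative assumptions; the tag set and chunk grammar for Tamil would differ.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases matching DT? JJ* NN+."""
    phrases = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if j < n and tagged[j][1] == "DT":
            j += 1
        # Any number of adjectives.
        while j < n and tagged[j][1] == "JJ":
            j += 1
        # One or more nouns (NN, NNS, NNP, ...).
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1  # no noun phrase starts here; move on
    return phrases
```

Because the chunker only looks at tags, its quality depends directly on the POS tagging step before it.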

Monday, April 21, 2014

Unstructured data

Unstructured data poses some problems:
§  Web pages are highly informative, but they are highly unstructured and lack a predefined schema, type or pattern.
§  It is difficult for computers to understand the semantic meaning of diverse web pages and structure them in an organized way for systematic information retrieval and information extraction.
An artificial neural network can be applied to the processing of unstructured text, but it fails because a homogeneous processing medium is not well suited to the analysis of linguistically structured information.

The vast majority of information, such as news articles and judgement reports, is stored or passed on in natural language (i.e., in textual format or in dialogues). Natural language processing (NLP) studies the problems inherent in the processing and manipulation of natural language, while natural language understanding is devoted to making computers “understand” statements written in human language. Moreover, NLP aims at defining methodologies, models, and systems able to cope with natural language, to either understand or generate it.

Web Content Mining


The proposed system uses Web content mining. A great deal of the information on the World Wide Web is in text-based documents. Available technology utilizes only a limited amount of knowledge, such as text-based keywords extracted from web pages.
Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input such as keywords or example documents. Information extraction is the process of extracting relevant data from semi-structured or unstructured documents and transforming it into structured representations. For example, if “President of India” is the keyword given to an information retrieval system, it returns the President of India’s homepage along with pages about presidents of clubs, colleges, and unions; that is, it returns both related and unrelated information. An information extraction system, by contrast, retrieves only the exact answer to the query.
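
The difference can be illustrated with a toy sketch. The three documents below are invented, the person names are placeholders, and both functions (retrieve, extract_office_holder) are hypothetical helpers written for this example rather than parts of the proposed system.

```python
import re

# Invented toy documents; the names are placeholders, not real facts.
DOCUMENTS = {
    "doc1": "Person A is the President of India.",
    "doc2": "Person B was elected President of the college union.",
    "doc3": "The club president of the India chapter is Person C.",
}


def retrieve(keywords, documents):
    """IR: return the ids of documents that contain every keyword."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if all(word.lower() in lowered for word in keywords):
            hits.append(doc_id)
    return hits


def extract_office_holder(office, documents):
    """IE: return the name found by the pattern '<name> is the <office>'."""
    pattern = re.compile(r"^(.+?) is the " + re.escape(office) + r"\b")
    for text in documents.values():
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

Here retrieve(["President", "India"], DOCUMENTS) also returns the club document, because keyword matching cannot tell the offices apart, while extract_office_holder("President of India", DOCUMENTS) returns a single name.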

Web mining

Data mining on the Internet, commonly called Web mining, takes advantage of the contents of documents and the relationships among the resources. Web mining, the intersection of data mining and the WWW, is a growing area of research encompassing more traditional technologies in the areas of NLP, machine learning, information retrieval, and artificial intelligence. Web mining is the extraction of interesting and useful patterns and implicit information related to the WWW.
The three knowledge discovery domains related to web mining are
§  Web Content Mining
                              Web content mining is the mechanism of extracting knowledge from the content of documents or their descriptions.
§  Web Structure Mining
                               Web structure mining is the process of inferring knowledge from web page organization and links between documents.
§  Web Usage Mining
                              Web usage mining is the process of extracting interesting patterns from Web access logs.
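
As a small illustration of Web usage mining, the sketch below counts successful page requests in access log lines. It assumes the logs are in the Common Log Format used by many web servers; the sample paths in the usage note are invented, and "most requested pages" is just one simple example of a pattern.

```python
import re
from collections import Counter

# Matches the request and status parts of a Common Log Format line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def page_hit_counts(log_lines):
    """Count successful (status 200) GET requests per requested path."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match.group("method") == "GET" and match.group("status") == "200":
            counts[match.group("path")] += 1
    return counts
```

Calling page_hit_counts(lines).most_common(10) then gives the ten most requested pages, a starting point for finding more interesting access patterns.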