1. Objective: Named Entity Recognition

In this project, you will use scikit-learn and Python 3 to engineer an effective classifier for an important Information Extraction task called Named Entity Recognition.

2. Getting Started

Named Entity Recognition. The goal of Named Entity Recognition (NER) is to locate segments of text in an input document and classify them into one of a set of pre-defined categories (e.g., PERSON, LOCATION, and ORGANIZATION). In this project, you only need to perform NER for a single category, TITLE. We define a TITLE as an appellation associated with a person by virtue of occupation, office, birth, or as an honorific. For example, in the following sentence, both Prime Minister and MP are TITLEs.

Prime Minister Malcolm Turnbull MP visited UNSW yesterday.

Formulating NER as a Classification Problem. We can cast the TITLE NER problem as a binary classification problem as follows: for each token w in the input document, we construct a feature vector x and classify it into one of two classes, TITLE or O (meaning other), using the logistic regression classifier.

The goal of the project is to achieve the best classification accuracy by configuring your classifier and performing feature engineering.
In this project, you must use the LogisticRegression classifier in scikit-learn version 0.17.1 in your implementation.

3. Your Tasks

Training and Testing Dataset. We will publish a training dataset on the project page to help you build the classifier. The training dataset contains a set of sentences, where each sentence is already tokenized into tokens. We also provide the part-of-speech tag (e.g., NNP) and the correct named entity class (i.e., "TITLE" or "O") for each token. The data structure in Python for the above sentence is:

[[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'), ('MP', 'NNP', 'TITLE'),
('visited', 'VBD', 'O'), ('UNSW', 'NNP', 'O'),
('yesterday', 'NN', 'O'), ('.', '.', 'O')]]

Therefore, the training dataset is formed as a list of sentences. Each sentence is a list of 3-tuples, each containing the token itself, its POS tag, and its named entity class.

In Python, you can use the following code to load the training dataset (where training_data is the path to the provided training dataset file):

import pickle

with open(training_data, 'rb') as f:
    training_set = pickle.load(f)
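Once loaded, the nested structure can be flattened into one training example per token, following the binary framing from Section 2. The sketch below is illustrative only: the helper names token_features and to_examples are not part of the specification, and the stub features here are replaced by stronger ones later in this section.

def token_features(sentence, i):
    # Placeholder features; stronger choices are discussed below.
    token, pos = sentence[i][0], sentence[i][1]
    return {'token': token, 'pos': pos}

def to_examples(training_set):
    # Flatten a list of sentences into per-token (feature dict, label) pairs.
    X_dicts, y = [], []
    for sentence in training_set:
        for i, (token, pos, label) in enumerate(sentence):
            X_dicts.append(token_features(sentence, i))
            y.append(1 if label == 'TITLE' else 0)  # binary: TITLE vs O
    return X_dicts, y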

The final test dataset (which will not be provided to you) will be formatted similarly to the training dataset, except that each sentence will be a list of 2-tuples (only the token itself and its POS tag). For example, if the test dataset contains only the sentence "Prime Minister Malcolm Turnbull MP visited UNSW yesterday.", then it will be formatted as:

[[('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP'),
('Turnbull', 'NNP'), ('MP', 'NNP'), ('visited', 'VBD'), ('UNSW', 'NNP'),
('yesterday', 'NN'), ('.', '.')]]

Feature Engineering. In order to build the classifier, you first need to extract features for each token. In this project, using the token itself as a feature usually achieves high accuracy on the training dataset; however, it can result in low accuracy on the test dataset due to overfitting. For example, it is not uncommon for the test dataset to contain titles that do not appear in the training dataset. Therefore, we encourage you to find and engineer meaningful, strong features and build a more robust classifier.
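For illustration only, a feature function along the following lines favours generalising signals (word shape, affixes, local context) over the raw token identity. Every feature name below is an assumption, not a requirement; this version is a richer replacement for the stub token_features sketched above.

def token_features(sentence, i):
    # Works for both the 3-tuple training format and the 2-tuple test
    # format, since it only reads the first two fields of each tuple.
    token, pos = sentence[i][0], sentence[i][1]
    features = {
        'pos': pos,
        'is_capitalized': token[:1].isupper(),
        'is_all_caps': token.isupper(),
        'prefix3': token[:3].lower(),
        'suffix3': token[-3:].lower(),
        'has_digit': any(c.isdigit() for c in token),
    }
    # Local context: neighbouring tokens and POS tags.
    if i > 0:
        features['prev_token'] = sentence[i - 1][0].lower()
        features['prev_pos'] = sentence[i - 1][1]
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features['next_token'] = sentence[i + 1][0].lower()
        features['next_pos'] = sentence[i + 1][1]
    else:
        features['EOS'] = True  # end of sentence
    return features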

You will need to describe all the features that you have used in this project and justify your choices in your report.

Build the logistic regression classifier. In this project, you need to use the logistic regression classifier (i.e., sklearn.linear_model.LogisticRegression) from the scikit-learn package. For more details, you can refer to the scikit-learn documentation and the relevant course materials.
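A minimal training sketch under the assumptions above: the feature dicts from to_examples are vectorized with scikit-learn's DictVectorizer (which one-hot encodes string-valued features) and fed to LogisticRegression.

import pickle
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

with open(training_data, 'rb') as f:
    training_set = pickle.load(f)

X_dicts, y = to_examples(training_set)   # per-token examples, as sketched above
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dicts)    # sparse feature matrix

classifier = LogisticRegression()
classifier.fit(X, y)

Note that the fitted DictVectorizer is needed again at test time to rebuild identical feature vectors; one design choice, used in the sketches below, is to pickle it together with the classifier.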

You also need to dump the trained classifier using the following code (where classifier is the trained classifier and classifier_path is the path of the output file):

with open(classifier_path, 'wb') as f:
    pickle.dump(classifier, f)

You are also required to submit the dumped classifier (which must be named classifier.dat).

Suggested Steps. You may want to implement a basic, initial NER classifier first. Then you can further improve its performance in multiple ways: for example, you can find the best settings of the hyper-parameters of your model and features, and you can design and test different sets of features. It is recommended that you use cross-validation and construct your own test datasets; a sketch follows below.
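For instance, the regularisation strength C of LogisticRegression can be tuned with cross-validated F1, reusing X and y from the training sketch above. The module paths below are those of scikit-learn 0.17, as pinned above (in 0.18+ the same utilities live in sklearn.model_selection).

from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validated F1 for a single configuration.
scores = cross_val_score(LogisticRegression(C=1.0), X, y, cv=5, scoring='f1')
print(scores.mean())

# Or search a small grid of regularisation strengths.
grid = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring='f1')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)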

You need to describe how you have improved the performance of your classifier in your report.

Trainer. You need to submit a Python file named trainer.py. It receives two command line arguments, which specify the path to the training dataset and the path to the file where your trained classifier will be dumped. Your program will be executed as below:
python trainer.py <path_to_training_data> <path_to_classifier>
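Putting the pieces together, a trainer.py skeleton might look as follows. This is a sketch, not a reference solution; token_features and to_examples are the illustrative helpers from above and are assumed to be defined in (or imported by) this file.

# trainer.py -- sketch only
import sys
import pickle
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def main():
    training_data, classifier_path = sys.argv[1], sys.argv[2]
    with open(training_data, 'rb') as f:
        training_set = pickle.load(f)

    X_dicts, y = to_examples(training_set)
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(X_dicts)

    classifier = LogisticRegression()
    classifier.fit(X, y)

    with open(classifier_path, 'wb') as f:
        # Dump the fitted vectorizer together with the model so tester.py
        # can rebuild identical feature vectors (one design choice).
        pickle.dump((vectorizer, classifier), f)

if __name__ == '__main__':
    main()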

Tester. You need to submit a Python file named tester.py. It receives three command line arguments, which specify the path to the test dataset, the path to the dumped classifier, and the path to the file where your output results will be dumped. Your program will be executed as below:

python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>
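A matching tester.py skeleton, assuming trainer.py dumped the (vectorizer, classifier) pair as in the sketch above:

# tester.py -- sketch only
import sys
import pickle

def main():
    testing_data, classifier_path, path_to_results = sys.argv[1:4]
    with open(testing_data, 'rb') as f:
        testing_set = pickle.load(f)
    with open(classifier_path, 'rb') as f:
        vectorizer, classifier = pickle.load(f)

    result = []
    for sentence in testing_set:  # sentence is a list of (token, pos) tuples
        X_dicts = [token_features(sentence, i) for i in range(len(sentence))]
        labels = classifier.predict(vectorizer.transform(X_dicts))
        result.append([(token, 'TITLE' if label == 1 else 'O')
                       for (token, _), label in zip(sentence, labels)])

    with open(path_to_results, 'wb') as f:
        pickle.dump(result, f)

if __name__ == '__main__':
    main()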

For each token in the test dataset, you should output its named entity class (i.e., TITLE or O). Your output, internally in Python, is a list of sentences, where each sentence is a list of (TOKEN, CLASS) tuples. For example, a possible output for the example in Section 3.1 is:
[[('Prime', 'TITLE'), ('Minister', 'TITLE'), ('Malcolm', 'O'),
('Turnbull', 'O'), ('MP', 'O'), ('visited', 'O'), ('UNSW', 'O'),
('yesterday', 'O'), ('.', 'O')]]

Then you should dump it to a file using the following code (where result is your result list and path_to_results is the path of the dumped file):
with open(path_to_results, 'wb') as f:
    pickle.dump(result, f)

Report. You need to submit a report (named report.pdf) which answers the following two questions:

- What features do you use in your classifier? Why are they important, and what information do you expect them to capture?
- How did you experiment with and improve your classifier?

4. Execution
Your program will be tested automatically on a CSE Linux machine as follows:

- the following command will be used to build the classifier:
python trainer.py <path_to_training_data> <path_to_classifier>

where

- <path_to_training_data> indicates the path to the training dataset
- <path_to_classifier> indicates the path to the dumped classifier
- the following command will be used to test the classifier:
python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>

where

- <path_to_testing_data> indicates the path to the testing dataset
- <path_to_classifier> indicates the path to the dumped classifier
- <path_to_results> indicates the path to the dumped result

Your program will be executed using Python 3 with only the following packages available:
- nltk 3.2.1
- numpy 1.11.1
- scipy 0.18.0
- scikit-learn 0.17.1
- pandas 0.18.1

5. Evaluation

We will use the F1-measure to evaluate the performance of your classifier (a sketch for computing it locally follows the list below). We will test your classifier using two test datasets:
- the first one is sampled from the same domain as the given training dataset, and
- the second one is sampled from a domain different from that of the given training dataset.
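To estimate the F1-measure locally on your own held-out data, a minimal sketch (gold_sentences, in the 3-tuple training format, and predicted, in the output format of Section 3, are assumed names):

from sklearn.metrics import f1_score

gold = [label for sent in gold_sentences for (_, _, label) in sent]
pred = [label for sent in predicted for (_, label) in sent]
print(f1_score(gold, pred, pos_label='TITLE'))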

Each test will contribute 40 points to your final score. Your report contributes the remaining 20 points.

In order to minimize the effect of randomness, we will execute trainer.py three times, and use the best performance achieved.

For each test dataset, your best-performing classifier will be compared with a reference classifier C. While the detailed scheme will be published later on the project website, the marking scheme will essentially reward you with more points if your classifier performs no worse than the reference classifier.

Neither the reference classifier C nor the two test datasets will be given to you. However, you can create your own test dataset, submit it, and get the performance of the reference classifier C on it. For more details, please refer to Section 6.

6. Customized Datasets

You are encouraged to construct your own test datasets to test your classifier and help improve its performance. You can submit your datasets to get the F1-measure of the pre-trained classifier C on them. Furthermore, all submitted datasets are accessible to every student in the class. You should also benefit from this experience, because it will give you many initial ideas about meaningful features to use in your own classifier. You can upload your test cases through the data submission website (URL will be sent via email). You will need to log in before uploading your datasets or downloading other datasets. The login ID is your student number (e.g., z1234567), and your password will be sent to you by email.

Once you have logged into the system, you can:
- submit your dataset
- view the performance of all the datasets
- download a chosen dataset

Your submitted test datasets should be in the same format as the training dataset. You can submit at most 10 datasets to the system. We will release more details about how to use the system, and some tools/tips to help you visualize and tag TITLE occurrences, later on the project website.
