Comp 7/ Bio 40 Project: Reliability of Metagenomics Reads

Background

In this class, we are studying 16S rRNA metagenomic sequencing, in which the sequence reads come from one of the hypervariable regions of the 16S rRNA genes, which are otherwise generally well conserved across prokaryotes. Because of the variability in the region we are sequencing, it is possible to use computational means to try to identify the taxonomic classification of each sequence read. Of course, the sequence data can be noisy; noise may be represented by nucleotides reported as 'N' in the sequence data files. In addition, not every bacterial species has previously been sequenced, and some species may have sufficiently similar sequences even in these hypervariable regions, making it difficult to classify them exactly.

The MiSeq metagenomics pipeline uses the method of Wang et al. (Assignment of rRNA Sequences into the New Bacterial Taxonomy. Q. Wang, G. M. Garrity, J. M. Tiedje, J. R. Cole. Appl. Environ. Microbiol. 73(16):5261, 2007) to taxonomically classify sequences. This is a probabilistic method, so the classification at each level of the taxonomic hierarchy is associated with a confidence score, which corresponds roughly to an estimate of the probability that the classification is correct.

The software uses a cutoff of 80% confidence to report a result. If the classification software is less than 80% confident in its classification at any given taxonomic level, it reports that the sequence is "Unclassified" at that level. Some samples have many unclassified reads, while others have relatively few.
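The cutoff rule can be illustrated with a minimal sketch. The function name `report_label` and the example labels and confidence scores are hypothetical and are not taken from the MiSeq software or its output files; only the 80% threshold comes from the description above.

```python
CUTOFF = 0.8  # the 80% confidence threshold described above


def report_label(label, confidence, cutoff=CUTOFF):
    """Return the label if its confidence meets the cutoff, else 'Unclassified'."""
    return label if confidence >= cutoff else "Unclassified"


# Hypothetical (level, label, confidence) triples for a single read.
example_read = [
    ("phylum", "Firmicutes", 0.99),
    ("family", "Lactobacillaceae", 0.91),
    ("genus", "Lactobacillus", 0.62),
]

for level, label, confidence in example_read:
    print(level, report_label(label, confidence))
```

In this made-up example the read is reported down to the family level but is "Unclassified" at the genus level, because 0.62 is below the cutoff.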

Hypothesis

We are going to focus on the most specific level of taxonomic classification reported by our software, the genus level. Our hypothesis is that the reads that could not be classified with at least 80% confidence at the genus level have a higher percentage of undetermined nucleotides (Ns) than the reads that could. In this assignment, you will analyze next-generation sequencing data to confirm or refute this hypothesis.
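The quantity being compared is the percentage of Ns in a read. A minimal sketch of that calculation is shown here; the function name `percent_n` is a hypothetical helper, not part of the skeleton code, and treating lowercase 'n' as undetermined is an assumption.

```python
def percent_n(sequence):
    """Return the percentage (0-100) of bases in `sequence` that are 'N'."""
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty read
    # Assumption: count lowercase 'n' as an undetermined base too.
    return 100.0 * sequence.upper().count("N") / len(sequence)


print(percent_n("ACGTNNACGT"))  # 2 of the 10 bases are N, so this prints 20.0
```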

Overview

You will write a Python program to address this question by combining data from two data files produced by the MiSeq software. We will help you design the program and will even give you an outline of the code to start with, but you will fill in each of the pieces!

In order to test your hypothesis, you will need to extract information from a data file to determine which reads are "unclassified" at the genus level, and which are not. You will also need to get the actual nucleotide sequences reported for each of those reads. These are stored in separate data files, so you will have to match the data up between two files using the read identifiers.

Matching data between two files is a very common problem in bioinformatics, and it is one that is easily solved using the dictionary data structure that you have recently learned. First we will describe the two file formats you will encounter, and then we will describe the program you will need to complete.
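As a rough illustration of the matching idea, the sketch below builds one dictionary from FASTQ-formatted text and joins it to a second dictionary on the read ID. It assumes the common four-line FASTQ record layout (an '@' header, the sequence on a single line, a '+' separator, and a quality string) and that the read ID is the header text before the first space. The function name `read_fastq`, the read IDs, and the classification strings are all hypothetical stand-ins, not the real MiSeq data.

```python
import io


def read_fastq(handle):
    """Map read ID -> nucleotide sequence from four-line FASTQ records."""
    reads = {}
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of input (or a blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality string, ignored
        # Assumption: the ID is the header without '@', up to the first space.
        read_id = header[1:].split()[0]
        reads[read_id] = sequence
    return reads


# Hypothetical stand-ins for the contents of the two files.
fastq_text = "@read1\nACGTN\n+\nIIIII\n@read2\nNNGTA\n+\nIIIII\n"
classes = {
    "read1": "hypothetical classification line for read1",
    "read3": "hypothetical classification line for read3",
}

reads = read_fastq(io.StringIO(fastq_text))

# Join the two dictionaries on the read IDs they have in common.
matched = {rid: (classes[rid], reads[rid]) for rid in classes if rid in reads}
print(matched)
```

Because dictionary lookups by key are fast, this approach avoids rescanning one file for every line of the other. Reads that appear in only one of the two sources (read2 and read3 here) are simply left out of the joined result.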

Skeleton Code Overview

At the top of the skeleton code, there are two constants: CLASS_FILE and READS_FILE. Add the names of the classification file and the FASTQ file in quotation marks to each of these constants, respectively. Make sure that these files are located in the same directory as the code.

The classification file is called metagen.txt and the FASTQ file is called metagen.fastq. Both are available for download from the projects page. There should also be smaller sample files for use when you are testing your code. The code is initially set to use just the smallest of these (rand_10.txt and rand_10.fasta). Replace these file names with the names of the larger test files and, eventually, the full-sized files only after you get your code to work on the smaller data sets.

Now take a look at the main function in the skeleton code. It performs the following steps:

1. It uses the readInClasses function to create a dictionary classes that maps read IDs to their corresponding classification strings (a string containing the line from the classification file that contains the taxonomic classifications and their confidence scores for that read).

2. Using classes, it creates a dictionary classified that maps each read ID to a boolean value. The value is True if the confidence in the classification of that read at the genus level is greater than or equal to the constant CUTOFF (set to 0.8), and False otherwise.

3. It creates one more dictionary, reads, that maps each read ID to a string containing the nucleotide sequence for that read from the FASTQ file.

4. This step is the heart of your calculation. Using the two dictionaries, classified and reads, the findAvgNCount() function should go through each read ID that is a key of classified and calculate the percentage of Ns in the corresponding read. Define two lists, one for the classified IDs and the other for the unclassified IDs. Add the float representing the percentage of Ns to one of the two lists, depending on whether the value for the read ID in classified is True or False (i.e., whether the read is classified at the genus level or not).

Now use the helper function avgList() to average the values in each list, and return a list that contains just the two averages.

5. Finally, the report function nicely prints the values returned by the previous step.
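Steps 2 and 4 above could be sketched roughly as follows. This is only one possible approach under stated assumptions, not the skeleton code itself: `avg_list` and `find_avg_n_count` are hypothetical snake_case stand-ins for avgList() and findAvgNCount(), whose real signatures are not shown here, and `genus_confidence` is an invented dictionary standing in for whatever is parsed out of the classification strings.

```python
CUTOFF = 0.8


def avg_list(values):
    """Average a list of numbers; return 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def find_avg_n_count(classified, reads):
    """Return [average %N of classified reads, average %N of unclassified reads]."""
    classified_pcts = []
    unclassified_pcts = []
    for read_id, is_classified in classified.items():
        sequence = reads.get(read_id)
        if not sequence:
            continue  # skip IDs with no (or an empty) sequence
        pct = 100.0 * sequence.upper().count("N") / len(sequence)
        if is_classified:
            classified_pcts.append(pct)
        else:
            unclassified_pcts.append(pct)
    return [avg_list(classified_pcts), avg_list(unclassified_pcts)]


# Hypothetical genus-level confidence scores and sequences for three reads.
genus_confidence = {"r1": 0.95, "r2": 0.40, "r3": 0.80}
classified = {rid: conf >= CUTOFF for rid, conf in genus_confidence.items()}
reads = {"r1": "ACGTACGTAC", "r2": "ACNNACGTNN", "r3": "ACGTNCGTAC"}

result = find_avg_n_count(classified, reads)
print(result)  # [5.0, 40.0] for this toy data
```

In this toy data the classified reads (r1 and r3) average 5% Ns and the single unclassified read (r2) has 40% Ns. Returning 0.0 for an empty list is a design choice made here to avoid a division by zero when one group has no reads.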

The classification file, metagen.txt, can also be downloaded using this link:

https://www.dropbox.com/s/gxpmhtrwfuz9l24/metagen.txt?dl=0

Attachment:- Assignment.rar
