Homework #1

Due September 21st, 11:59pm

Each homework submission must include:

• An archive (.zip or .gz) file of the source code containing:

o The makefile used to compile the code on Monsoon (5pts)

o All .cpp and .h files (5pts)

o A readme.txt file outlining all modules (if any) needed to execute the code and the exact command lines needed to answer the homework’s questions (5pts)

• A full write-up (.pdf or .doc) file containing answers to the homework’s questions (5pts); screenshots of code output are OK.

The source code must follow these guidelines:

• No external libraries that implement data structures discussed in class are allowed, unless specifically stated as part of the problem definition. Standard input/output and utility libraries (e.g. math.h) are OK.

• All external data sources (e.g. input data) must be passed in as command line arguments (no hardcoded paths within the source code).

• Solutions to sub-problems must be executable separately from each other, for example via a special flag passed as a command line argument (5pts)
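As an illustration of the last two guidelines, here is a minimal sketch of argument handling. The command layout (`./hw1 <fasta_path> <A..E>`), the `Options` struct, and the `parseOptions` name are all hypothetical; the handout does not prescribe any particular flag format.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical command layout: ./hw1 <fasta_path> <sub-problem letter A..E>
struct Options {
    const char* fastaPath;  // dataset path taken from argv, never hardcoded
    char problem;           // which sub-problem to run: 'A'..'E'
    bool valid;             // false if the arguments could not be understood
};

Options parseOptions(int argc, const char* const argv[]) {
    assert(argv != nullptr);
    Options opt = {nullptr, '\0', false};
    if (argc != 3) return opt;                  // expect program, path, letter
    if (std::strlen(argv[2]) != 1) return opt;  // flag must be a single letter
    char p = argv[2][0];
    if (p < 'A' || p > 'E') return opt;
    opt.fastaPath = argv[1];
    opt.problem = p;
    opt.valid = true;
    return opt;
}
```

In `main`, one would call `parseOptions(argc, argv)`, print a usage message when `valid` is false, and otherwise dispatch on `problem`.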

For this homework, you will need to use the High Throughput Sequence reads dataset located on Monsoon: /common/contrib/classroom/inf503/hw_dataset.fa

• The dataset contains approximately 36 million ‘reads’ (genomic sequence fragments of equal length)

• Each read is exactly 50 nucleotides (characters) long

• The read set is in FASTA format (see insert)

o The headers are unique and consist of the read ID number (e.g. R1) and a series of ‘copy number’ values giving the number of times this read is present in sample 1, 2, … (separated by underscores “_”)

o The genomic sequences consist of the following alphabet: {A, C, G, T, N}
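The header and alphabet rules above can be sketched in code. The example assumes a header line shaped like `>R1_3_0_7` (a `>`, the read ID, then one integer per sample, each preceded by an underscore); that exact shape is an assumption, since the handout's insert is not reproduced here. The function names `parseCopyNumbers` and `isValidRead` are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Assumed header shape: ">R1_3_0_7".
// Writes up to maxCounts copy numbers into counts[]; returns how many were found.
int parseCopyNumbers(const char* header, int counts[], int maxCounts) {
    assert(header != nullptr && counts != nullptr);
    int found = 0;
    const char* p = header;
    while (*p != '\0' && *p != '_') ++p;        // skip '>' and the read ID
    while (*p == '_' && found < maxCounts) {
        char* end = nullptr;
        long value = std::strtol(p + 1, &end, 10);
        if (end == p + 1) break;                // no digits after the underscore
        counts[found++] = static_cast<int>(value);
        p = end;                                // continue at the next '_' or the end
    }
    return found;
}

// True if seq is exactly len characters drawn from {A, C, G, T, N}.
bool isValidRead(const char* seq, std::size_t len) {
    assert(seq != nullptr);
    if (std::strlen(seq) != len) return false;
    return std::strspn(seq, "ACGTN") == len;
}
```

For this dataset one would call `isValidRead(seq, 50)` on each sequence line and `parseCopyNumbers(header, counts, 14)` on each header line.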

Problem #1 (of 1): Arrays and Classes

Create a class called FASTAreadset. The purpose of the class is to contain a FASTA read set, all of the statistics associated with it, and all of the functions needed to operate on this set. Use an array data structure to store the genomic sequences of the given read dataset. Use character arrays (char[ ]) to store the sequences, rather than ‘string’ objects (you should have an array of arrays, 36 million by 50). Use another set of arrays to store the copy number counters from the header string. At minimum, the class must contain:

• A default constructor

• At least one custom constructor (e.g. one taking a file path or ifstream as input)

• A function to read the FASTA file

• A single function to compute all statistics for the read set (see below)

• A destructor
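One possible shape for such a class is sketched below; it is not the required design. The member names, the `maxReads` parameter, the line-buffer sizes, and the header shape `>R1_3_0_7` are assumptions. The statistics function is omitted to keep the sketch short, and each row holds 51 characters so that every read can be NUL-terminated.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fstream>

class FASTAreadset {
public:
    enum { READ_LEN = 50, NUM_SETS = 14 };  // 50 nt per read; 14 data sets (part C)

    FASTAreadset() : seqs_(nullptr), counts_(nullptr), size_(0) {}

    // Custom constructor: load at most maxReads reads from the file at path.
    FASTAreadset(const char* path, long maxReads)
        : seqs_(nullptr), counts_(nullptr), size_(0) {
        readFile(path, maxReads);
    }

    FASTAreadset(const FASTAreadset&) = delete;             // raw owning pointers,
    FASTAreadset& operator=(const FASTAreadset&) = delete;  // so copying is disabled

    ~FASTAreadset() { release(); }

    // Reads header/sequence line pairs; returns the number of reads stored.
    long readFile(const char* path, long maxReads) {
        assert(path != nullptr && maxReads > 0);
        release();
        std::ifstream in(path);
        if (!in) return 0;
        seqs_ = new char[maxReads][READ_LEN + 1];  // array of arrays: maxReads x 51
        counts_ = new int[maxReads][NUM_SETS];
        char header[512], line[256];               // assumed upper bounds on line length
        while (size_ < maxReads && in.getline(header, sizeof header)
                                && in.getline(line, sizeof line)) {
            std::strncpy(seqs_[size_], line, READ_LEN);
            seqs_[size_][READ_LEN] = '\0';
            parseCounts(header, counts_[size_]);
            ++size_;
        }
        return size_;
    }

    long size() const { return size_; }
    const char* sequence(long i) const { return seqs_[i]; }
    int copyNumber(long i, int set) const { return counts_[i][set]; }

private:
    void release() {
        delete[] seqs_;
        delete[] counts_;
        seqs_ = nullptr;
        counts_ = nullptr;
        size_ = 0;
    }

    // Assumed header shape: ">R1_3_0_7"; values missing from the header stay 0.
    static void parseCounts(const char* header, int out[]) {
        for (int s = 0; s < NUM_SETS; ++s) out[s] = 0;
        const char* p = std::strchr(header, '_');
        for (int s = 0; p != nullptr && s < NUM_SETS; ++s) {
            out[s] = std::atoi(p + 1);
            p = std::strchr(p + 1, '_');
        }
    }

    char (*seqs_)[READ_LEN + 1];  // one NUL-terminated row per read
    int (*counts_)[NUM_SETS];     // one row of copy numbers per read
    long size_;
};
```

Passing a small `maxReads` (for example 1,000,000) gives the "first X reads" behaviour that part A asks for without a second constructor.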

A. Using the first 1 million reads, estimate and report the total CPU time and RAM it will take to initialize (fill up) the array data structure with the entire 36 million reads. Note that this may mean creating a custom constructor that reads the first X reads rather than reading to the End-Of-File.

B. Test your prediction using the entire 36 million read set and report the actual RAM and CPU time used. Refer to the Monsoon workshop notes for help in estimating the actual runtime and RAM usage of your run. Were you accurate? If not, explain what you think caused the discrepancy.
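The estimate in parts A and B comes down to simple arithmetic, sketched below under two assumptions: each read is stored as a 51-byte row (50 characters plus a terminator) alongside 14 `int` counters, and load time grows linearly with the number of reads. The function names are hypothetical, and allocator or operating-system overhead is ignored.

```cpp
#include <cassert>
#include <ctime>

// Bytes for `reads` rows of char[readLen + 1] plus numSets int counters per read.
// Ignores allocator and OS overhead, so the real footprint will be somewhat higher.
double estimateBytes(double reads, int readLen, int numSets) {
    assert(reads >= 0 && readLen > 0 && numSets >= 0);
    return reads * ((readLen + 1) * sizeof(char) + numSets * sizeof(int));
}

// Linear extrapolation: assumes load time is proportional to the read count.
double extrapolateSeconds(double sampleSeconds, double sampleReads, double totalReads) {
    assert(sampleReads > 0);
    return sampleSeconds * (totalReads / sampleReads);
}

// CPU seconds elapsed since `start`; std::clock reports processor time.
double cpuSecondsSince(std::clock_t start) {
    return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```

Where `sizeof(int)` is 4 (an assumption worth checking on Monsoon), `estimateBytes(36e6, 50, 14)` gives 36,000,000 × 107 = 3,852,000,000 bytes, roughly 3.85 GB. If loading 1 million reads took 2 CPU seconds, `extrapolateSeconds(2.0, 1e6, 36e6)` would predict 72 seconds.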

C. Compute the following statistics for your read set:

• Total number of unique sequence fragments (here, it is safe to assume this is the total number of sequence fragments in the file).

• Total number of reads for each ‘data set’ separately (recall there are 14 data sets in our example here)

• Number of A, C, G, and T characters in the dataset.
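A minimal sketch of these tallies follows. It reads "total number of reads for each data set" as the column sum of the copy numbers, which is one plausible interpretation rather than something the handout states. The names `BaseCounts`, `countBases`, and `totalReadsPerSet` are hypothetical.

```cpp
#include <cassert>

const int READ_LEN = 50;  // nucleotides per read (from the handout)
const int NUM_SETS = 14;  // data sets named in part C

struct BaseCounts { long a, c, g, t, other; };

// Tallies A/C/G/T over numReads NUL-terminated rows; N and anything else go to `other`.
BaseCounts countBases(const char (*seqs)[READ_LEN + 1], long numReads) {
    assert(numReads >= 0);
    BaseCounts n = {0, 0, 0, 0, 0};
    for (long i = 0; i < numReads; ++i) {
        for (const char* p = seqs[i]; *p != '\0'; ++p) {
            switch (*p) {
                case 'A': ++n.a; break;
                case 'C': ++n.c; break;
                case 'G': ++n.g; break;
                case 'T': ++n.t; break;
                default:  ++n.other; break;
            }
        }
    }
    return n;
}

// totals[s] = sum over all reads of counts[i][s], for each of the NUM_SETS data sets.
void totalReadsPerSet(const int (*counts)[NUM_SETS], long numReads, long long totals[]) {
    assert(numReads >= 0 && totals != nullptr);
    for (int s = 0; s < NUM_SETS; ++s) totals[s] = 0;
    for (long i = 0; i < numReads; ++i)
        for (int s = 0; s < NUM_SETS; ++s)
            totals[s] += counts[i][s];
}
```

Both functions make a single pass over the arrays, so their cost grows linearly with the number of reads.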

D. Implement a destructor for your class to delete / deallocate your array data structure. How long did it take? Does this make sense to you?

E. Implement a function that sorts the genomic sequences in your array in alphabetical order (sort whole fragments, not the characters within a fragment). What is the ‘big O’ notation of your approach (linear / quadratic / cubic / etc.)? Please note that depending on the efficiency of your algorithm, you may not be able to alphabetically sort the entire 36 million reads in a reasonable amount of time (24-36 CPU hours). If this happens, reduce the problem size (by using a smaller subset of the reads) and estimate the final run time.
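For part E, one approach among several is a merge sort over an array of pointers to the reads, sketched below. The choice of merge sort and the function names are assumptions for illustration; the handout does not mandate an algorithm. It performs O(n log n) string comparisons, each touching at most 50 characters, and needs one scratch array of n pointers.

```cpp
#include <cassert>
#include <cstring>

// Merge the sorted halves [lo, mid) and [mid, hi) of ptrs, using tmp as scratch space.
static void mergeHalves(const char** ptrs, const char** tmp, long lo, long mid, long hi) {
    long i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        // Take from the right half only when it is strictly smaller (keeps the sort stable).
        if (std::strcmp(ptrs[j], ptrs[i]) < 0) tmp[k++] = ptrs[j++];
        else                                   tmp[k++] = ptrs[i++];
    }
    while (i < mid) tmp[k++] = ptrs[i++];
    while (j < hi)  tmp[k++] = ptrs[j++];
    for (k = lo; k < hi; ++k) ptrs[k] = tmp[k];
}

static void mergeSortRange(const char** ptrs, const char** tmp, long lo, long hi) {
    if (hi - lo < 2) return;            // zero or one element is already sorted
    long mid = lo + (hi - lo) / 2;
    mergeSortRange(ptrs, tmp, lo, mid);
    mergeSortRange(ptrs, tmp, mid, hi);
    mergeHalves(ptrs, tmp, lo, mid, hi);
}

// Sorts n pointers to NUL-terminated reads into alphabetical (strcmp) order.
void sortReads(const char** ptrs, long n) {
    assert(n >= 0 && (ptrs != nullptr || n == 0));
    if (n < 2) return;
    const char** tmp = new const char*[n];
    mergeSortRange(ptrs, tmp, 0, n);
    delete[] tmp;
}
```

Sorting pointers rather than moving the 51-byte rows keeps each swap cheap, at the cost of two extra pointer arrays. With 8-byte pointers (an assumption about the platform), each array of 36 million pointers takes about 288 MB. A quadratic method such as bubble sort would need on the order of 10^15 comparisons at this size, which is why the problem statement allows for reducing the subset.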

