BWT encode C/C++ program 辅导、讲解C/C++ program 解析Linux 语言编程

- 首页 >> C/C++编程


COMP9319 2018s2 Assignment 2: BWT

This assignment contains two tasks: BWT encode and backward search.

Requirements

BWT encode

Your first task in this assignment is to create a special BWT encoder that will encode a file with delimiterseparated

text values (similar to a CSV file) into two files: a BWT encoded file; and an auxiliary positional

information file. Your BWT encoded file may carry some positional information (i.e., first record, second

record, and so on; whilst the size of the BWT encoded file is the same as the original file, and the

frequencies of each character in these two files are the same). The BWT encoded file must be the same

size as the original text file, and the auxiliary must not be larger than the original text file. (Note that

different students may have different, working schemes to capture this positional information).

Your C/C++ program, called bwtencode, accepts commandline input arguments:

1. the delimiter - it can be any visible ASCII alphabet (with ASCII value from 32 to 126) or a newline.

If it is a newline, it will be specified as '\n'

2. the path to an index or temporary folder (see below for details);

3. the path to a file containing text values separated by the specified delimiter; and

4. the path to the encoded file to be generated.

Note that you may decide a unique file extension for your auxiliary positional information file, e.g., by

appending .aux to the encoded filename, as far as you can determine its name from a given path to the

encoded file. This file is used to store extra positional information, and shall not be used to store text such

as the original text file or any encoded form of the orignal text file (e.g., LZW of the original text file).

Backward search

Your second task is to implement a simple BWT backward search, which can efficiently search a BWT

encoded file produced by bwtencode.

Your C/C++ program, called bwtsearch, accepts commandline input arguments:

1. the delimiter;

2. the path to a BWT encoded file;

3. the path to an index folder;

4. either -m for the number of matching substrings (count duplicates), -n for the number of unique

matching records, or -a for listing the identifiers of all the matching records (no duplicates and in

ascending order), -i for displaying the content of the records with ids provided in the search term;

and

5. a quoted query string (i.e., the search term).

The search term can be up to 512 characters. Here record identifiers are the positions of the records in the

original text file. For example, the first record has id 1; the second record has id 2; and the third one has 3

and so on. To make the assignment easier, we assume that the search is case sensitive.

If -a is specified, using the given query string, bwtsearch will perform backward search on the given

BWT encoded file, and output the sorted and unique record identifiers (no duplicates) of all the records

that contain the input query string to the standard output, with one line (ending with a '\n') for each match.

If -m is specified, given a query string, bwtsearch will output the total number of matching substrings

(count duplicates) to the standard output. The output is the total number, with an ending newline

character. Similarly, bwtsearch will output the total number of unique matching records (do not count

duplicates) if -n is specified.

Finally, when -i is specified, bwtsearch will perform backward search on the given BWT encoded file

and search for the records with their identifiers beginning with i and ending with j, as specified in the

search term as "i j". It will output the exact content of each matching record (excluding the delimiter) to

the standard output, plus an ending newline character. All these matching records will be output according

to their record identifiers in ascending order. You may assume that j will never be smaller than i during

testing, and all ids specified in the range of "i j" are valid. E.g., we will not test with "2 10" when there

are only 5 records in the file.

The text file

Text files may include any visible ASCII alphabets (i.e., any character with ASCII value from 32 to 126),

tab (ASCII 9) and newline. For example, a text file may look like

first$second$third$forth$

with $ being the delimiter. Similarly, a newline can be used as the delimiter, then the file will look like:

first

second

third

forth

For simplicity, we assume that there is a delimiter at the end of the last record. We further assume that a

file will have at least one record.

The index folder

For encoding, your solution is allowed to write out multiple external index files or temporary files to assist

your encoding, providing that they are in total no larger than ten times of the size of the given text file.

Please remove these temporary files before your program exits. Similarly, for search, your solution is

allowed to write out external index files that are in total no larger than the size of the original text file.

These index files may help to speed up the subsequent backward search and your program does not need

to remove them when exits.

All index files or temporary files must be written inside the specified index folder (and please do not

create subfolders inside the specified folder). You may safely assume that the specified folder exists.

If the total size of the files inside the index folder is larger than the corresponding size limit specified

above, you will receive zero points for the tests that generating/using these files. You may assume that the

index files will not be deleted during all the search tests for a given BWT file, and all the test BWT files

are uniquely named. You may further assume that the index folder name that associated with a given

BWT file will not be changed throughout the auto test. Therefore, to save the search time, you only need

to generate the index files when they do not exist yet.

Example

Suppose the original file (let's call it dummy.txt) is:

Computers in industry|Data compression|Integration|Big data indexing|

Some examples:

%wagner> bwtencode '|' ~MyAccount/tmpFolder ~/a2/dummy.txt ~MyAccount/XYZ/dummy.bwt

%wagner> bwtsearch '|' -m ~MyAccount/XYZ/dummy.bwt dummyIndex -m "in"

4

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -n "in"

2

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -a "in"

1

4

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -m "in "

1

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -n "in "

1

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -a "In"

3

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -i "2 2"

Data compression

%wagner>

%wagner> bwtsearch '|' ~MyAccount/XYZ/dummy.bwt dummyIndex -i "1 3"

Computers in industry

Data compression

Integration

%wagner>

For encoding, you will need to investigate how to (slightly) extend the original BWT encoding to

represent record positions (i.e., the record id). A small hint is given in the following example. Suppose

that the original file is:

first$second$third$forth$

If you perform a normal BWT encoding on this file, the resultant BWT encoded file will be:

hdtderns$$tthfocfiio$rsr$

Alternatively, you may extend BWT encoding and have the following encoded file instead (you will need

to figure out how this transformation is obtained):

htddenrs$$tthfocfiio$rsr$

Of course, some additional information about $'s need to be stored in the auxiliary file.

Compilation and the index folder

We will use the make command below to compile your solution. Please provide a makefile and ensure that

the code you submit can be compiled on a CSE Linux machine, e.g., wagner. Solutions that have

compilation errors will receive zero points for the entire assignment.

make

After the make command above (without any arguments), it should compile and generate two executables,

namely bwtencode and bwtsearch.

Your solution should not write out any external files other than those index files in the specified index

folder. Any solution that writes out external files outside the index folder will receive zero points for the

entire assignment.

Performance

Your solution will be marked based on space and runtime performance. Your soluton will not be tested

against any files that are larger than 50MB.

Runtime memory is assumed to be always less than 15MB for search, and 250MB for encoding. Runtime

memory consumption will be measured by valgrind massif with the option --pages-as-heap=yes, i.e.,

all the memory used by your program will be measured.

For encoding, any solution that runs for more than 300 seconds on a machine with similar specification as

wagner will be killed.

For search, any solution that runs for more than 60 seconds on a machine with similar specification as

wagner for the first query on a given BWT file will be killed, and will receive zero points for the queries

for that BWT file. After that any solution that runs for more than 10 seconds for any one of the

subsequent queries on that BWT file will be killed, and will receive zero points for that query test. We will

use the time command and count both the user and system time as runtime measurement.

Any solution that violates the memory requirement or timing requirement for search will receive zero

points for that query test. If it violates the memory requirement or timing requirement for encoding, it will

receive zero points for that test as well as those query tests depending on the encoded file generated from

this test.

Documentation

You will be marked on your descriptions in README.txt of how your solution works and your index file

structures. Your source code will be also inspected and marked based on readability and ease of

understanding.

Assumptions/clarifications/hints

1. None of the testcases will result in outputting more than 5,000 records / matches.

2. The input filename is a path to the given text file. Please open the file as read-only in case you do

not have the write permission.

3. Marks will be deducted for output of any extra text, other than the required, correct answers (in the

right order). This extra information includes (but not limited to) debugging messages, line numbers

and so on.

4. You can assume that the input query string will not be an empty string (i.e., "").

5. When counting the number of substring matches (i.e., with -m option), to make it easier for

backward search matching, all combinations of matches should be counted. E.g., There are 2

matches of "aa" on text "aaa"; 2 matches of "ana" on "banana".

6. You are allowed to generate external index files (in the specified index folder) to enhance the

performance of your solution. However, if you believe that your solution is fast enough without

using index files, you do not have to generate any files. Even in such case, your solution should still

accept a path to index folder as one of the input argument as specified (and should not generate an

error message such as "unknown argument".

7. A record will not be unreasonably long, e.g., you will not see a record that is 5,000+ chars long.

8. Empty records may exist in the original files (before BWT). However, these records will never be

matched during searching because the empty string will not be used as a search term when testing

your program.

9. You may assume that the number of delimiters in a given file is significantly fewer than the total

number of characters in that file (i.e., If the file has N characters, the number of delimiters are fewer

than N/4).

10. At least half of the tests will be based on files that are smaller than 10MB. So to start, you should

attempt this assignment without worrying about large files.

11. For performance consideration, you may assume that each encoded large file (i.e., larger than

10MB), there will be at least 9 subsequent tests following the test that index files are generated (so

your index files can be reused).

Marking

This assignment is worth 100 points. Below is an indicative marking scheme:

Component Points

Auto marking 95

Documentation 5

Bonus

Bonus marks (up to 10 points) will be awarded for the solution that achieves 100 points for the whole

assignment (i.e., both encoding and backward search); and runs the fastest for encoding (i.e., the shortest

total time to finish all encoding tests). Similarly, bonus marks (up to 10 points) will also be awarded for

the solution that achieves 100 points for the whole assignment (i.e., both encoding and backward search);

and runs the fastest for search (i.e., the shortest total time to finish all search tests). The same submission

may be the fastest for both encoding and search, and both bonuses will be awarded to the same solution

(i.e., up to 20 points). Note: regardless of the bonus marks you receive in this assignment, the maximum

final mark for the subject is capped at 100.

Submission

Deadline: Friday 5th October 23:59. Late submissions will have marks deducted from the maximum

achievable mark at the rate of roughly 1% of the total mark per hour that they are late (i.e., 24% per day),

and no submissions will be accepted after 3 days late. Use the give command below to submit the

assignment:

give cs9319 a2 makefile *.h *.c *.cpp README.txt

Please use "classrun" to check your submission to make sure that you have submitted all the necessary

files.

Plagiarism

The work you submit must be your own work. Submission of work partially or completely derived from

any other person or jointly written with any other person is not permitted. The penalties for such an

offence may include negative marks, automatic failure of the course and possibly other academic

discipline. Assignment submissions will be examined both automatically and manually for such

submissions.

Relevant scholarship authorities will be informed if students holding scholarships are involved in an

incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person - apart from the teaching staff of this

subject. If you knowingly provide or show your assignment work to another person for any reason, and

work derived from it is submitted you may be penalized, even if the work was submitted without your

knowledge or consent. This may apply even if your work is submitted by a third party unknown to you.


站长地图