data编程代写、代做java语言程序
- 首页 >> C/C++编程 Task Description: Implementing a HyperLogLog Sketch and bSktHLL Data Structure
Objective
The goal is to develop two classes to estimate the cardinality (flow spread) of network flow
data using the HyperLogLog algorithm and a specialized data structure comprising an array
of HyperLogLog sketches. The input data (o1.txt) consists of source and destination IP pairs.
The output should be a CSV file containing the estimated cardinality for each unique
destination IP, along with the true count of occurrences.
Requirements
1. HLL Class: Implement a class named HLL to represent a HyperLogLog sketch. This
class should include:
• A constructor that initializes the number of registers (m, default 128) and
register bit size (p, default 5).
• A method add for adding elements to the sketch, which involves hashing the
element, updating the appropriate register based on the hash value, and
tracking the maximum number of leading zeros.
• A method estimate to calculate and return the estimated cardinality based
on the register values.
2. bSktHLL Class: Implement a class named bSktHLL that contains an array of l HLL
sketches (l=3 as each sketch produce a candidate estimate). Each sketch should
have m=128 HLL registers with each register being 5 bits. This class should include:
• A constructor that initializes the array of HLL sketches.
• A method record to hash each flow into two different HLL sketches and
update the sketches accordingly.
• A method query to return the median estimate of the cardinality for a given
flow from the sketches it hashes into.
3. Data Processing Script: Write a script to:
• Read network flow data from a text file, where each line contains a source
and destination IP pair.
• Process each line to record the destination IP in the bSktHLL data structure.
• Generate a CSV file with the following columns: flow_label,
candidate_estimate_1, candidate_estimate_2, candidate_estimate_3,
true_spread. The candidate_estimate columns should contain the
cardinality estimates, and the true_spread should contain the actual count
of occurrences for each destination IP.
Notes: The error in the candidate estimation should not exceed 5% of the actual spread.
References: Assistance functions for recording and querying flows can be sourced from
HyperLogLog.java, while references for hash functions should be utilized from
GeneralUtil.java.
Objective
The goal is to develop two classes to estimate the cardinality (flow spread) of network flow
data using the HyperLogLog algorithm and a specialized data structure comprising an array
of HyperLogLog sketches. The input data (o1.txt) consists of source and destination IP pairs.
The output should be a CSV file containing the estimated cardinality for each unique
destination IP, along with the true count of occurrences.
Requirements
1. HLL Class: Implement a class named HLL to represent a HyperLogLog sketch. This
class should include:
• A constructor that initializes the number of registers (m, default 128) and
register bit size (p, default 5).
• A method add for adding elements to the sketch, which involves hashing the
element, updating the appropriate register based on the hash value, and
tracking the maximum number of leading zeros.
• A method estimate to calculate and return the estimated cardinality based
on the register values.
2. bSktHLL Class: Implement a class named bSktHLL that contains an array of l HLL
sketches (l=3 as each sketch produce a candidate estimate). Each sketch should
have m=128 HLL registers with each register being 5 bits. This class should include:
• A constructor that initializes the array of HLL sketches.
• A method record to hash each flow into two different HLL sketches and
update the sketches accordingly.
• A method query to return the median estimate of the cardinality for a given
flow from the sketches it hashes into.
3. Data Processing Script: Write a script to:
• Read network flow data from a text file, where each line contains a source
and destination IP pair.
• Process each line to record the destination IP in the bSktHLL data structure.
• Generate a CSV file with the following columns: flow_label,
candidate_estimate_1, candidate_estimate_2, candidate_estimate_3,
true_spread. The candidate_estimate columns should contain the
cardinality estimates, and the true_spread should contain the actual count
of occurrences for each destination IP.
Notes: The error in the candidate estimation should not exceed 5% of the actual spread.
References: Assistance functions for recording and querying flows can be sourced from
HyperLogLog.java, while references for hash functions should be utilized from
GeneralUtil.java.