Update README.md
shivajah authored Dec 10, 2020
1 parent 361b84e commit 267dab5
Showing 1 changed file with 39 additions and 0 deletions.
39 changes: 39 additions & 0 deletions README.md
Expand Up @@ -150,3 +150,42 @@ java -jar $JSONWISCDATAGEN_HOME/target/wisconsin-datagen.jar writer=file workloa
7. Default configurations are provided in $JSONWISCDATAGEN_HOME/src/main/resources/wisconsin_datagen.properties and can be overridden as in the example above (step 6).
- filesize: In MB. This is a second terminator for the program: it stops generating records when either the requested cardinality or the requested file size is reached, whichever happens first.
- writer: asterixdb or file. The "file" writer writes the output to a file in the target folder with the name provided in "fileoutput". The "asterixdb" writer loads the records directly into AsterixDB using AsterixDBLoadPort. To use the asterixdb writer, the data generator must run on one of the NC nodes.

### Workloads and Advanced Features:
Sample workloads are provided in the Workloads folder. Default.json shows the schema of the original Wisconsin benchmark relation. Advanced.json has more fields and additional features such as variable-length strings, different length distributions, different value distributions (value skew), real-word strings, and nullable and optional (missing) fields.
In this section we explain these advanced features with some examples.

## Strings:
In the original Wisconsin Data Generator, strings are generated in either a random or a cyclic format. In the random case, the string representation of the unique1 attribute (which is itself unique) is used as the prefix, and enough 'x' characters are appended to reach the desired length for that attribute. In the cyclic case, string values are drawn from a domain of four prefixes in a cyclic fashion. In the JSON Wisconsin Data Generator we only support the random format for generating strings.
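The random format described above can be sketched as follows. This is a minimal illustration based only on the description in this section, not the generator's actual code; the function name `wisconsin_random_string` and the error handling for too-short lengths are assumptions.

```python
def wisconsin_random_string(unique1: int, length: int) -> str:
    """Use the string form of unique1 as a prefix and pad with 'x' to `length`."""
    prefix = str(unique1)
    if len(prefix) > length:
        # Assumption: the sketch rejects lengths that cannot hold the prefix.
        raise ValueError("length is too short to hold the unique1 prefix")
    return prefix + "x" * (length - len(prefix))


print(wisconsin_random_string(42, 10))  # 42xxxxxxxx
```

Because unique1 is unique, the padded strings are unique as well, but they consist mostly of repeated 'x' characters.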
### 1. Real-Word and HEX Strings:
In addition to the algorithm described above, we support strings that are generated by concatenating words from a list of 10,000 real words. This approach reduces the impact of data compression because the strings contain fewer repeated characters. If a distribution is specified for the length of the strings, HEX strings are generated.
To generate strings made of real words, the user can specify ```"word":true``` in the field definition. If word is set to true and the string is set to be variable length ```("variableLength":true)```, the user provides an average number of words to include in the string, and the algorithm concatenates the requested number of words from the word list.
### 2. Variable Length Strings:
A string attribute can have values of different lengths by setting ```("variableLength":true)```. Several distributions are available for generating these lengths.

#### 2.1 Word:
If ```"word":true``` is set for a variable length string, a Normal Distribution is used to calculate the actual number of words in the string. The ```length``` property of the string attribute is used as the average number of words, and the ```standardDeviation``` property is used as the standard deviation of the normal distribution.
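A rough sketch of this word-count sampling is shown below. It is an illustration under stated assumptions, not the generator's implementation: the short `WORDS` list stands in for the 10,000-word list, and the space separator, the rounding, and the minimum of one word are all assumptions.

```python
import random

# Hypothetical stand-in for the generator's list of 10,000 real words.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo"]


def variable_length_word_string(length, standard_deviation, rng, words=WORDS):
    """Concatenate a normally distributed number of randomly chosen words.

    `length` is the mean number of words and `standard_deviation` is the
    standard deviation of the normal distribution, as described above.
    """
    word_count = round(rng.gauss(length, standard_deviation))
    word_count = max(1, word_count)  # assumption: never emit an empty string
    # Assumption: words are joined with a single space.
    return " ".join(rng.choice(words) for _ in range(word_count))


rng = random.Random(2020)
print(variable_length_word_string(5, 1.5, rng))
```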
#### 2.2 Hex String & Different Distributions on Length of the String:
- If a string attribute is defined as variable length and ```"normalDistribution": true```, then a HEX string is generated whose length follows a normal distribution with the ```length``` property as the mean and the ```standardDeviation``` property as the standard deviation.

- If a string attribute is defined as variable length and ```"bernouliDistribution": true```, then the user should also provide values for five other properties: ```"probLargeRecord","minSizeSmall","maxSizeSmall","minSizeLarge","maxSizeLarge"```. The first property specifies what percentage of the strings should be large. The remaining properties give the minimum and maximum lengths for the small and large strings. First, a Bernoulli distribution is used for each record to determine whether its string is large or small. Then a uniform distribution chooses the actual string length from the range specified for that size class. The generated string is a HEX string.

- If ```"zipfDistribution": true```, the user should specify three other properties: ```"zipfMinSize","zipfMaxSize","zipfSkew"```. The first two properties specify the range of the length of the string, and the last property is the skewness of Zipf distribution. The rank is calculated based on the number of elements between the ```"zipfMinSize","zipfMaxSize"``` values. The generated string is a HEX string.

- If none of the above distributions is selected, a Gamma distribution is used to determine the length of the string. The generated string is a HEX string.
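The Bernoulli and Zipf options above can be sketched as follows. This is a minimal illustration of the described steps, not the generator's code. The function names are hypothetical, `prob_large` is assumed to be a probability in [0, 1], rank 1 is assumed to map to the smallest length, and the length is assumed to count HEX characters.

```python
import random

HEX_DIGITS = "0123456789abcdef"


def bernoulli_length(prob_large, min_small, max_small, min_large, max_large, rng):
    """Pick the size class with a Bernoulli trial, then a uniform length in its range."""
    if rng.random() < prob_large:
        return rng.randint(min_large, max_large)
    return rng.randint(min_small, max_small)


def zipf_length(zipf_min_size, zipf_max_size, zipf_skew, rng):
    """Pick a length in [zipf_min_size, zipf_max_size] with Zipf-weighted ranks."""
    ranks = list(range(1, zipf_max_size - zipf_min_size + 2))
    weights = [1.0 / (rank ** zipf_skew) for rank in ranks]
    rank = rng.choices(ranks, weights=weights, k=1)[0]
    # Assumption: rank 1 (the most frequent) corresponds to the minimum size.
    return zipf_min_size + rank - 1


def hex_string(length, rng):
    """Build a random string of `length` HEX characters."""
    return "".join(rng.choice(HEX_DIGITS) for _ in range(length))


rng = random.Random(7)
print(hex_string(bernoulli_length(0.1, 10, 20, 1000, 2000, rng), rng)[:20])
print(hex_string(zipf_length(5, 50, 1.2, rng), rng))
```

With a small `prob_large`, most records get short strings and an occasional record gets a much larger one, which is the mix of small and large records that the Bernoulli option is described as producing.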

### 3. Fixed Length Strings:
We offer different ways of generating fixed length strings. A fixed length string can either be made of words (```"word":true```) or be based on the original Wisconsin Benchmark string generation technique. If ```"word":true```, the word to use can be selected with either a Zipf or a Uniform distribution (a distribution on the value, not on the length of the string). Extra 'X' characters are appended to the string if the requested (fixed) length is larger than the length of the selected word.
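The word-based fixed length case might look like the sketch below. It only illustrates the description in this section. The `WORDS` list is a hypothetical stand-in for the real word list, the `zipf_skew` parameter name is an assumption, and truncating a word that is longer than the requested length is an assumption that the text above does not cover.

```python
import random

# Hypothetical stand-in for the generator's list of 10,000 real words.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo"]


def fixed_length_word_string(length, rng, zipf_skew=None, words=WORDS):
    """Select one word (uniformly or Zipf-weighted) and pad it with 'X' to `length`."""
    if zipf_skew is None:
        word = rng.choice(words)  # uniform distribution over the words
    else:
        # Zipf weighting: the word at rank r is chosen with weight 1 / r^skew.
        weights = [1.0 / (rank ** zipf_skew) for rank in range(1, len(words) + 1)]
        word = rng.choices(words, weights=weights, k=1)[0]
    # Assumption: words longer than `length` are truncated.
    return word[:length].ljust(length, "X")


rng = random.Random(3)
print(fixed_length_word_string(12, rng))
print(fixed_length_word_string(12, rng, zipf_skew=1.5))
```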

## Skewness in Integer Attribute Values:
Integer attributes can be generated in three ways. The first follows the original Wisconsin Benchmark, which chooses the integer values either sequentially or randomly. The other two are based on Gamma and Normal Distributions, which we have defined in order to introduce skew in the values of an integer attribute.

## Nullable and Optional (missing) fields:
While the original Wisconsin Benchmark does not provide a way to introduce null values, in our JSON Wisconsin Data Generator we have added support for nullable and optional fields. Optional fields are mostly used in semi-structured data, where the attribute itself may not appear in some of the records. To specify a nullable attribute, the user should set ```"nullable":true``` and also set ```nulls``` to a value between 0 and 1. The latter gives the probability that the attribute has a null value; a binomial distribution is used to decide this.
To specify an optional attribute, the user should set ```"optional":true``` and also set ```missings``` to a value between 0 and 1. The latter gives the probability that the attribute is missing; a binomial distribution is used to decide this.
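The effect of ```nulls``` and ```missings``` on a single record can be sketched as below. This is an illustration of the behaviour described above, not the generator's code. The function name `emit_field` is hypothetical, and checking for "missing" before "null" is an assumption.

```python
import json
import random


def emit_field(record, name, value, rng, nulls=0.0, missings=0.0):
    """Add `name` to `record`, honouring the null and missing probabilities."""
    if rng.random() < missings:
        return  # optional field: the attribute does not appear at all
    if rng.random() < nulls:
        record[name] = None  # nullable field: present, but with a null value
    else:
        record[name] = value


rng = random.Random(11)
for unique1 in range(5):
    record = {"unique1": unique1}
    emit_field(record, "two", unique1 % 2, rng, nulls=0.3, missings=0.3)
    print(json.dumps(record))
```

Each record is an independent trial, so over many records the number of null (or missing) values follows a binomial distribution.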

## Binary Data type:
In addition to the integer and string data types, we have also introduced a Binary data type, which generates HEX values.


