tsv-uniq: Remove unnecessary memory allocation #234

Merged 3 commits on Oct 14, 2019
7 changes: 4 additions & 3 deletions common/src/tsv_utils/common/utils.d
@@ -1030,9 +1030,6 @@ unittest
joinAppend performs a join operation on an input range, appending the results to
an output range.

- Note: The main uses of joinAppend have been replaced by BufferedOutputRange, which has
- its own joinAppend method.
-
joinAppend was written as a performance enhancement over using std.algorithm.joiner
or std.array.join with writeln. Using joiner with writeln is quite slow, 3-4x slower
than std.array.join with writeln. The joiner performance may be due to interaction
@@ -1046,6 +1043,10 @@ illustrates. It is a modification of the InputFieldReordering example. The role
Appender plus joinAppend are playing is to buffer the output. BufferedOutputRange
uses a similar technique to buffer multiple lines.

+ Note: The original uses of joinAppend have been replaced by BufferedOutputRange, which has
+ its own joinAppend method. However, joinAppend remains useful when constructing internal
+ buffers where BufferedOutputRange is not appropriate.
+
---
int main(string[] args)
{
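The doc comment above describes joinAppend as a join that writes straight into an output range rather than building an intermediate string. A minimal sketch of that idea follows. `joinAppendSketch` is a hypothetical stand-in written for illustration, not the function from tsv_utils.common.utils, whose exact signature is not shown in this diff.

```d
import std.array : appender;
import std.range.primitives : put;

/* Hypothetical stand-in for joinAppend: writes each element of inputRange to
 * outputRange, separated by delimiter, without building a joined string first.
 */
void joinAppendSketch(InputRange, OutputRange, E)
    (InputRange inputRange, ref OutputRange outputRange, E delimiter)
{
    bool first = true;
    foreach (elem; inputRange)
    {
        if (!first) put(outputRange, delimiter);
        put(outputRange, elem);
        first = false;
    }
}

void main()
{
    auto buffer = appender!(char[]);

    ["a", "bc", "def"].joinAppendSketch(buffer, '\t');
    assert(buffer.data == "a\tbc\tdef");

    /* clear() resets the length but keeps the allocated capacity, so the
     * second join can reuse the same memory. */
    buffer.clear();
    ["x", "y"].joinAppendSketch(buffer, ',');
    assert(buffer.data == "x,y");
}
```

The Appender plays the buffering role described in the comment: repeated joins reuse its storage instead of allocating a new array per call.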
34 changes: 27 additions & 7 deletions tsv-uniq/src/tsv_utils/tsv-uniq.d
@@ -220,7 +220,7 @@ struct TsvUniqOptions
if (max != 0 || (!equivMode && !numberMode)) max = atLeast;
}

- if (!keyIsFullLine) fields.each!((ref x) => --x); // Convert to 1-based indexing.
+ if (!keyIsFullLine) fields.each!((ref x) => --x); // Convert to 0-based indexing.

}
catch (Exception exc)
@@ -266,12 +266,13 @@ int main(string[] cmdArgs)
*/
void tsvUniq(in TsvUniqOptions cmdopt, in string[] inputFiles)
{
- import tsv_utils.common.utils : InputFieldReordering, bufferedByLine, BufferedOutputRange;
+ import tsv_utils.common.utils : InputFieldReordering, bufferedByLine, BufferedOutputRange, joinAppend;
import std.algorithm : splitter;
- import std.array : join;
+ import std.array : appender;
import std.conv : to;
import std.range;
- import std.uni : toLower;
+ import std.uni : asLowerCase;
+ import std.utf : byChar;

/* InputFieldReordering maps the key fields from an input line to a separate buffer. */
auto keyFieldsReordering = cmdopt.keyIsFullLine ? null : new InputFieldReordering!char(cmdopt.fields);
@@ -285,7 +286,11 @@ void tsvUniq(in TsvUniqOptions cmdopt, in string[] inputFiles)
struct EquivEntry { size_t equivID; size_t count; }
EquivEntry[string] equivHash;

- size_t numFields = cmdopt.fields.length;
+ /* Reusable buffers for multi-field keys and case-insensitive keys. */
+ auto multiFieldKeyBuffer = appender!(char[]);
+ auto lowerKeyBuffer = appender!(char[]);
+
+ const size_t numKeyFields = cmdopt.fields.length;
long nextEquivID = cmdopt.equivStartID;
bool headerWritten = false;
foreach (filename; (inputFiles.length > 0) ? inputFiles : ["-"])
@@ -343,10 +348,25 @@ void tsvUniq(in TsvUniqOptions cmdopt, in string[] inputFiles)
(filename == "-") ? "Standard Input" : filename, lineNum));
}

- key = keyFieldsReordering.outputFields.join(cmdopt.delim);
+ if (numKeyFields == 1)
+ {
+ key = keyFieldsReordering.outputFields[0];
+ }
+ else
+ {
+ multiFieldKeyBuffer.clear();
+ keyFieldsReordering.outputFields.joinAppend(multiFieldKeyBuffer, cmdopt.delim);
+ key = multiFieldKeyBuffer.data;
+ }
}

- if (cmdopt.ignoreCase) key = key.toLower;
+ if (cmdopt.ignoreCase)
+ {
+ /* Equivalent to key = key.toLower, but without memory allocation. */
+ lowerKeyBuffer.clear();
+ lowerKeyBuffer.put(key.asLowerCase.byChar);
+ key = lowerKeyBuffer.data;
+ }

bool isOutput = false;
EquivEntry currEntry;
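The key-building change above swaps a per-line `join` allocation for a buffer that is cleared and refilled, with a shortcut when there is only one key field. A rough standalone sketch of that pattern follows. `buildKey` and the sample rows are hypothetical names made up for illustration. The sketch also assumes that a key sliced from a reused buffer has to be copied before it is stored in the associative array; this diff does not show how tsv-uniq stores its keys.

```d
import std.array : Appender, appender;

/* Hypothetical helper: returns the single field directly, or joins several
 * fields into the reusable buffer. The result may alias the buffer and is only
 * valid until the next call.
 */
const(char)[] buildKey(string[] fields, ref Appender!(char[]) buffer, char delim)
{
    if (fields.length == 1) return fields[0];

    buffer.clear();                       // keep capacity, drop old contents
    foreach (i, field; fields)
    {
        if (i > 0) buffer.put(delim);
        buffer.put(field);
    }
    return buffer.data;
}

void main()
{
    size_t[string] counts;
    auto keyBuffer = appender!(char[]);   // allocated once, reused per row

    string[][] rows = [["green", "grün"], ["white", "weiß"], ["green", "grün"]];

    foreach (row; rows)
    {
        auto key = buildKey(row, keyBuffer, '\t');

        /* Lookup can use the temporary slice; only a new key is copied. */
        if (auto entry = key in counts) ++(*entry);
        else counts[key.idup] = 1;
    }

    assert(counts.length == 2);
    assert(counts["green\tgrün"] == 2);
    assert(counts["white\tweiß"] == 1);
}
```

With this shape, memory is allocated for a key only the first time that key is seen, so repeated keys cost a lookup and no copy.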
128 changes: 128 additions & 0 deletions tsv-uniq/tests/gold/basic_tests_1.txt
@@ -410,6 +410,134 @@ f1 f2 f3 f4 f5 id
9 ÀBC 1367 1331 18
17 0 Z 5734 602 23

====Mixed tests===

====[tsv-uniq input3.tsv]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
1 GREEN GRÜN 緑 VERDE
2 White Weiß 白い Blanca
3 TEAL BLAUGRÜN ティール AZULADO
4 soccer fútbol サッカー fútbol
5 BASEBALL BASEBALL 野球 BÉISBOL
1 green grün 緑 verde
2 white weiß 白い blanca
3 Teal Blaugrün ティール azulado
4 SOCCER FÚTBOL サッカー FÚTBOL
5 baseball baseball 野球 béisbol
1 green Grün 緑 verde
2 white WEISS 白い Blanca
4 SOCCER FÚTBOL サッカー fútbol
5 baseball BASEBALL 野球 béisbol

====[tsv-uniq input3.tsv -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
2 white WEISS 白い Blanca

====[tsv-uniq input3.tsv -H -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
2 white WEISS 白い Blanca

====[tsv-uniq input3.tsv -H -f 1]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol

====[tsv-uniq input3.tsv -H -f 1 -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol

====[tsv-uniq input3.tsv -H -f 2,3]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
1 GREEN GRÜN 緑 VERDE
2 White Weiß 白い Blanca
3 TEAL BLAUGRÜN ティール AZULADO
4 soccer fútbol サッカー fútbol
5 BASEBALL BASEBALL 野球 BÉISBOL
1 green grün 緑 verde
2 white weiß 白い blanca
3 Teal Blaugrün ティール azulado
4 SOCCER FÚTBOL サッカー FÚTBOL
5 baseball baseball 野球 béisbol
1 green Grün 緑 verde
2 white WEISS 白い Blanca
5 baseball BASEBALL 野球 béisbol

====[tsv-uniq input3.tsv -H -f 2,3 -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
2 white WEISS 白い Blanca

====[tsv-uniq input3.tsv -H -f 2,3 -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
2 white WEISS 白い Blanca

====[tsv-uniq input3.tsv -H -f 2,3,5]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
1 GREEN GRÜN 緑 VERDE
2 White Weiß 白い Blanca
3 TEAL BLAUGRÜN ティール AZULADO
4 soccer fútbol サッカー fútbol
5 BASEBALL BASEBALL 野球 BÉISBOL
1 green grün 緑 verde
2 white weiß 白い blanca
3 Teal Blaugrün ティール azulado
4 SOCCER FÚTBOL サッカー FÚTBOL
5 baseball baseball 野球 béisbol
1 green Grün 緑 verde
2 white WEISS 白い Blanca
4 SOCCER FÚTBOL サッカー fútbol
5 baseball BASEBALL 野球 béisbol

====[tsv-uniq input3.tsv -H -f 2,3,5 -i]====
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
2 white WEISS 白い Blanca

====Max count tests===

====[tsv-uniq -H --max 0 input1.tsv]====
21 changes: 21 additions & 0 deletions tsv-uniq/tests/input3.tsv
@@ -0,0 +1,21 @@
f1 f2 f3 f4 f5
1 Green Grün 緑 Verde
2 WHITE WEIẞ 白い BLANCA
3 teal blaugrün ティール azulado
4 Soccer Fútbol サッカー Fútbol
5 Baseball Baseball 野球 Béisbol
1 GREEN GRÜN 緑 VERDE
2 White Weiß 白い Blanca
3 TEAL BLAUGRÜN ティール AZULADO
4 soccer fútbol サッカー fútbol
5 BASEBALL BASEBALL 野球 BÉISBOL
1 green grün 緑 verde
2 white weiß 白い blanca
3 Teal Blaugrün ティール azulado
4 SOCCER FÚTBOL サッカー FÚTBOL
5 baseball baseball 野球 béisbol
1 green Grün 緑 verde
2 white WEISS 白い Blanca
3 Teal Blaugrün ティール azulado
4 SOCCER FÚTBOL サッカー fútbol
5 baseball BASEBALL 野球 béisbol
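The input3.tsv rows exercise case folding on non-ASCII text. In the gold output for `-i`, the WEIẞ, Weiß, and weiß rows collapse into one while the WEISS row survives. The sketch below reproduces that behavior with the same `asLowerCase.byChar` call the diff introduces. `lowerKey` is a hypothetical helper name, and the assertions assume std.uni lowercases ẞ to ß and not to "ss", which is what the gold file implies.

```d
import std.array : Appender, appender;
import std.uni : asLowerCase;
import std.utf : byChar;

/* Hypothetical helper: lowercases key into the reusable buffer. The returned
 * slice aliases the buffer and is overwritten by the next call.
 */
const(char)[] lowerKey(const(char)[] key, ref Appender!(char[]) buffer)
{
    buffer.clear();
    buffer.put(key.asLowerCase.byChar);   // re-encode the lowercased text as UTF-8
    return buffer.data;
}

void main()
{
    auto lowerKeyBuffer = appender!(char[]);

    assert(lowerKey("GRÜN", lowerKeyBuffer) == "grün");
    assert(lowerKey("Weiß", lowerKeyBuffer) == "weiß");
    assert(lowerKey("WEIẞ", lowerKeyBuffer) == "weiß");

    /* WEISS lowercases to "weiss", a different key from "weiß". */
    assert(lowerKey("WEISS", lowerKeyBuffer) == "weiss");
    assert(lowerKey("WEISS", lowerKeyBuffer) != "weiß");

    /* Text without case distinctions passes through unchanged. */
    assert(lowerKey("野球", lowerKeyBuffer) == "野球");
}
```

Because every result aliases the one buffer, two lowercased keys cannot be compared against each other directly; each has to be compared or copied before the next call.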
13 changes: 13 additions & 0 deletions tsv-uniq/tests/tests.sh
@@ -50,6 +50,19 @@ runtest ${prog} "input1.tsv -H -f 3,4 --equiv --ignore-case" ${basic_tests_1}
runtest ${prog} "input1.tsv --header -f 3,4 --equiv --equiv-start 10 --ignore-case" ${basic_tests_1}
runtest ${prog} "input1.tsv --header -f 3,4 --equiv --equiv-start 10 --equiv-header id --ignore-case" ${basic_tests_1}

# Additional tests on keys and case sensitivity
echo "" >> ${basic_tests_1}; echo "====Mixed tests===" >> ${basic_tests_1}
runtest ${prog} "input3.tsv" ${basic_tests_1}
runtest ${prog} "input3.tsv -i" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -i" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 1" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 1 -i" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 2,3" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 2,3 -i" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 2,3 -i" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 2,3,5" ${basic_tests_1}
runtest ${prog} "input3.tsv -H -f 2,3,5 -i" ${basic_tests_1}

# Max unique values
echo "" >> ${basic_tests_1}; echo "====Max count tests===" >> ${basic_tests_1}
runtest ${prog} "-H --max 0 input1.tsv" ${basic_tests_1}