% This is part of the stat-toolkit documentation
% Copyright (C) 2012,2013 Krzysztof Stachowiak
% See the file FDL for copying conditions.
\section{\texttt{groupby}}
\subsection{All options}
\begin{itemize}
\item \texttt{-a} \textit{constr-string} -- defines an aggregator with a
so-called construction string.
\item \texttt{-d} \textit{delim-char} -- defines a custom delimiter.
The default value is the tab character.
\item \texttt{-g} \textit{group-index} -- defines a grouping criterion.
\end{itemize}
\subsection{Summary}
The program performs SQL-like grouping and aggregation of tabular text data
given as a stream. It is based upon the \texttt{groupby.h} library.
A stream of data rows separated by line breaks is expected. The default
field separator is the tab character, but it may be changed with the
\texttt{-d \textit{delimiter}} option. The output is defined by two sets
of data processing elements: the grouping criteria and the aggregators.
At least one grouping criterion and at least one aggregator must be defined
on the command line. Note that they appear in the output in the same order
in which they are given on the command line.
\subsubsection{Grouping criteria}
The groupers are equivalent to SQL's ``group by'' clauses.
Assume that we have selected a set of groupers for fields f1, f2, etc.
All the input data rows that have the same values in these fields are
considered grouped; from now on we will say that they belong to a single group.
The fields are given by indices; to define a grouping criterion, use
the \texttt{-g \textit{column-index}} option.
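
The grouping rule described above can be illustrated with a minimal Python
sketch. It is not the toolkit's implementation: the helper name
\texttt{group\_rows} and its parameters are hypothetical, and keeping the groups
in order of first appearance is an assumption made only for this illustration.
\begin{verbatim}
def group_rows(lines, group_indices, delim="\t"):
    """Partition rows by the values found in the grouping fields.

    Hypothetical helper, for illustration only.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        # Rows with equal values in all grouping fields share one key.
        key = tuple(fields[i] for i in group_indices)
        groups.setdefault(key, []).append(fields)
    return groups

rows = ["50\t4\t1.0\t2.0", "50\t6\t1.0\t2.0", "100\t4\t3.0\t4.0"]
groups = group_rows(rows, [1])  # corresponds to the option -g1
for key, members in groups.items():
    print(key, len(members))
\end{verbatim}
With a single grouper on field 1, the first and the third row end up in the
same group, because both hold the value 4 in that field.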
\subsubsection{Aggregators}
The aggregators define how the values of specific fields are combined within
each group. An aggregator is defined by providing a
\texttt{-a "\textit{field-index} \textit{aggr-constr}"} option, where
\textit{field-index} indicates the field to be aggregated and
\textit{aggr-constr} is the string that constructs the aggregator.
The field index is a non-negative, zero-based index
of a particular column, and the constructor string is the name of the
aggregator followed by optional, aggregator-specific arguments. For details see
the description of aggregator construction in the manual for the
\texttt{aggr.h} library.
For example, to define an aggregator for field 3 that computes
the Gaussian confidence interval at the confidence level of 0.95, the command
line should be:
\texttt{... | ./groupby ... -a "3 ci\_gauss 0.95" ...}
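
The structure of the \texttt{-a} option value can be illustrated with a short
Python sketch that splits it into its parts. The function name
\texttt{parse\_aggregator} is hypothetical, and splitting on whitespace is an
assumption; the actual parsing rules are those of the \texttt{aggr.h} library.
\begin{verbatim}
def parse_aggregator(option):
    """Split an '-a' option value into (field index, name, arguments).

    Hypothetical helper; whitespace-separated tokens are assumed.
    """
    parts = option.split()
    if len(parts) < 2:
        raise ValueError("expected '<field-index> <aggregator> [args...]'")
    field_index = int(parts[0])
    if field_index < 0:
        raise ValueError("the field index must be non-negative")
    # The remaining tokens are aggregator-specific arguments.
    return field_index, parts[1], parts[2:]

print(parse_aggregator("3 ci_gauss 0.95"))
\end{verbatim}
For the example above this yields the field index 3, the aggregator name
\texttt{ci\_gauss} and a single argument, 0.95, kept as a string.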
\subsubsection{Output format}
Let's assume that fields \texttt{f1, f2, ...} have been chosen as the
groupers and aggregators \texttt{a1, a2, ...} have been selected.
The results will take the following form:
\begin{verbatim}
f1 f2 ... a1 a2 ...
i1 i2 ... v1 v2 ...
i3 i4 ... v3 v4 ...
\end{verbatim}
where \texttt{i1, i2, ...} (the ``indicators'') are the labels captured for
the given fields, and \texttt{v1, v2, ...} are
the computed aggregate values.
\subsection{Examples}
Let's consider a simple dataset:
\begin{verbatim}
$ cat data
50 4 1.0 2.0
50 4 3.0 4.0
50 6 1.0 2.0
50 6 3.0 4.0
100 4 1.0 2.0
100 4 3.0 4.0
100 6 1.0 2.0
100 6 3.0 4.0
\end{verbatim}
Let's now take a look at different results depending on the input options.
\begin{verbatim}
$ cat data | ./groupby -g1 -a "2 sum"
1 "2 sum"
4 8
6 8
\end{verbatim}
\begin{verbatim}
$ cat data | ./groupby -g0 -g1 -a "2 sum" -a "3 mean"
0 1 "2 sum" "3 mean"
50 4 4 3
50 6 4 3
100 4 4 3
100 6 4 3
\end{verbatim}
\begin{verbatim}
$ cat data | ./groupby -g1 -g0 -a "2 sum" -a "3 mean"
1 0 "2 sum" "3 mean"
4 50 4 3
6 50 4 3
4 100 4 3
6 100 4 3
\end{verbatim}
\begin{verbatim}
$ cat data | ./groupby -g1 -g0 -a "3 mean" -a "2 sum"
1 0 "3 mean" "2 sum"
4 50 3 4
6 50 3 4
4 100 3 4
6 100 3 4
\end{verbatim}
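
To make the semantics of the examples above easier to follow, here is a minimal
Python sketch that mimics them for the \texttt{sum} and \texttt{mean}
aggregators only. It is an illustration under stated assumptions, not the
program's implementation: the function names are hypothetical, groups are
reported in order of first appearance, the output reuses the input delimiter,
and numbers are printed with the \texttt{\%g} format.
\begin{verbatim}
def aggregate(name, values):
    """Apply one of the two aggregators supported by this sketch."""
    if name == "sum":
        return sum(values)
    if name == "mean":
        return sum(values) / len(values)
    raise ValueError("unsupported aggregator: " + name)

def groupby(lines, group_indices, aggr_options, delim="\t"):
    """Return the header and result rows as delimiter-joined strings."""
    specs = []
    for option in aggr_options:
        index, name = option.split()[:2]
        specs.append((int(index), name))
    groups = {}  # insertion-ordered: first appearance of each key
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        key = tuple(fields[i] for i in group_indices)
        groups.setdefault(key, []).append(fields)
    # Header: grouper indices first, then the quoted aggregator strings.
    header = [str(i) for i in group_indices]
    header += ['"%s"' % option for option in aggr_options]
    out = [delim.join(header)]
    for key, rows in groups.items():
        values = [aggregate(name, [float(r[i]) for r in rows])
                  for i, name in specs]
        out.append(delim.join(list(key) + ["%g" % v for v in values]))
    return out

data = [
    "50\t4\t1.0\t2.0", "50\t4\t3.0\t4.0",
    "50\t6\t1.0\t2.0", "50\t6\t3.0\t4.0",
    "100\t4\t1.0\t2.0", "100\t4\t3.0\t4.0",
    "100\t6\t1.0\t2.0", "100\t6\t3.0\t4.0",
]
# Counterpart of: ./groupby -g0 -g1 -a "2 sum" -a "3 mean"
for row in groupby(data, [0, 1], ["2 sum", "3 mean"]):
    print(row)
\end{verbatim}
Under these assumptions the sketch prints the same header and rows as the
second example, with the fields separated by tab characters.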