-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgroupby_lib.tex
84 lines (71 loc) · 3.43 KB
/
groupby_lib.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
% This is part of the stat-toolkit documentation
% Copyright (C) 2012,2013 Krzysztof Stachowiak
% See the file FDL for copying conditions.
\section{\texttt{groupby.h}}
\subsection{Summary}
The \texttt{groupby.h} library provides functionality silimar to the SQL
groupping. In order to perform the aggregation of a set of data, two
things need to be defined. The set of the group defining columns
(the \texttt{group by} part in the SQL query), and the set of the aggregators
(the \texttt{SUM(?)}, \texttt{MEAN(?)}, etc. SQL functions). The values from
the input rows are groupped by the criteria given by the first set and then
within each group the values are aggregated with te functions provided by
the aggregators.
All the classes are placed in the \texttt{groupby} namespace. There are three
of them:
\begin{itemize}
\item \texttt{group} -- Internal representation of a data group.
This one should not be used outside of the library.
\item \texttt{group\_result} -- This class is meant as a mean of
safely transporting the groupping result. Instead of
polymorphic aggregator pointers it already stores the
aggregated values and is therefore easily copyable and
movable.
\item \texttt{groupper} -- The main facility of the library. It
is defined by the aforementioned concepts of groupping
columns and aggregators and works by consuming plain data
rows and exposing the groupping results upon request.
\end{itemize}
Two of the classes will be described here as the ones meant to be used by
the library clients.
\subsection{groupper}
The main functional class being constructed with two arguments:
\begin{itemize}
\item \texttt{groupbys : vector<uint32\_t>} -- The list of the
indices defining the groupping columns.
\item \texttt{aggr\_strs : vector<string>} -- The strings for
the construction of the aggregators. They are given by
the pairs of the form: \textit{column-index constructor},
where the \textit{column-index} indicates the column from
which the values are to be aggregated and the
\textit{constructor} is the aggregator constructor string.
For the details on the aggregator construction from strings
see the \texttt{aggr.h} library section of the manual.
\end{itemize}
The runtime interface of the \texttt{groupper} class consists the following
functions:
\begin{itemize}
\item \texttt{consume\_row(row : vector<string>) : void}\\
Accepts a row of data and asigns the according values to
their groups.
\item \texttt{for\_each\_group(f : function<void(group)>) : void}\\
Visits all the groups that have been determined so far
calling the provided function for each of them.
\item \texttt{copy\_result() : vector<group\_result>}\\
Performs all the aggregations of the values stored for the
internal list of groups and returns a static copy of
the groupping result that only contains the aggregated
values.
\end{itemize}
\subsection{group\_result}
This class serves the purpose of transporting the
information about the groupping result in a safe, copyable and movable
way. It consists of two lists:
\begin{itemize}
\item \texttt{definition : vector<pair<uint32\_t, string>>}\\
The mapping of the column indices and the values that
can be found in the columns for the given group.
\item \texttt{aggregators : vector<pair<uint32\_t, double>>}\\
The mapping of the column indices and the aggregations
of the values coming from the columns.
\end{itemize}