map-reduce.md


Map and Reduce

The MapReducer is the central object of every OSHDB query. It is returned by the initial OSHDB view and allows one to filter out defined subsets of the OSM history dataset. One can then transform (map) and aggregate (reduce) the respective OSM data into a final result.

For example, a map function can calculate the length of every OSM highway, and a reduce function can sum up all of these length values.
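This map-then-reduce pattern can be sketched with plain Java streams, whose terminology the OSHDB's map and reduce steps follow (see the remark in the reduce section). The `Road` class and its sample length values below are made up for illustration; they are not OSHDB objects or API calls:

```java
import java.util.List;

public class RoadLengthSum {
    // A made-up stand-in for an OSM way; not an OSHDB class.
    static class Road {
        final double lengthInMeters;
        Road(double lengthInMeters) { this.lengthInMeters = lengthInMeters; }
    }

    static double totalLength(List<Road> roads) {
        return roads.stream()
            .mapToDouble(r -> r.lengthInMeters) // "map": road -> length value
            .sum();                             // "reduce": sum of all lengths
    }

    public static void main(String[] args) {
        List<Road> roads = List.of(new Road(120.5), new Road(79.5));
        System.out.println(totalLength(roads)); // prints 200.0
    }
}
```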

For many of the most frequently used reduce operations, such as summing up many values or counting elements, there exist specialized reducers.

map

A transformation function can be set by calling the map method of any MapReducer. An OSHDB query may have no map step at all, or several map steps that are executed one after another. A map function can also change the data type of the MapReducer it operates on.

For example, when calculating the length (a floating point number) of an entity snapshot, the underlying MapReducer changes from a MapReducer<OSMEntitySnapshot> to a MapReducer<Double>.
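The same kind of type change can be seen in plain Java streams, which the OSHDB's terminology follows (see the remark in the reduce section). In this illustrative sketch, mapping a String to its length turns a Stream<String> into a Stream<Integer>:

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapTypeChange {
    // The map step changes the element type of the pipeline: a
    // Stream<String> becomes a Stream<Integer>, analogous to a
    // MapReducer<OSMEntitySnapshot> becoming a MapReducer<Double>.
    static List<Integer> nameLengths(List<String> names) {
        return names.stream()
            .map(String::length) // Stream<String> -> Stream<Integer>
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(nameLengths(List.of("highway", "path"))); // prints [7, 4]
    }
}
```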

flatMap

A flatMap operation allows one to map any input value to an arbitrary amount of output values. Each of the output values can be transformed in further map steps individually.
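As an illustrative plain-Java sketch (not OSHDB API), a flatMap step can turn each input sentence into one output value per word; later map steps could then transform each word individually:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapWords {
    // flatMap maps each input value to zero or more output values:
    // here, one sentence yields one output value per word.
    static List<String> words(List<String> sentences) {
        return sentences.stream()
            .flatMap(s -> Arrays.stream(s.split(" ")))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words(List.of("highway primary", "residential")));
        // prints [highway, primary, residential]
    }
}
```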

filter

It is possible to define filters that sort out values after they have already been transformed in a map step.

Note that these filters are different from the OSM data filters described in the “Filtering of OSM data” section of this manual: those filters are always applied directly to the full OSM history data at the beginning of each query, while the filters described here are executed during the transformation of the data. Normally, it is best to use the less flexible but more performant OSM data filters wherever possible, because they reduce the amount of data to be iterated over right from the start of the query.
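The difference can be sketched with plain Java streams (an illustration, not OSHDB API): here the filter runs on values that a map step has already computed, whereas an OSM data filter would discard entities before any map step runs at all.

```java
import java.util.List;

public class PostMapFilter {
    // Filter applied after the map step: length values are computed
    // first, then values below a threshold are discarded.
    static double sumOfLongSegments(List<Double> lengthsInMeters, double minLength) {
        return lengthsInMeters.stream()
            .mapToDouble(Double::doubleValue) // map step: extract the numeric value
            .filter(l -> l >= minLength)      // filter on the already-mapped values
            .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfLongSegments(List.of(10.0, 200.0, 50.5), 50.0)); // prints 250.5
    }
}
```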

reduce

The reduce operation produces the final result of an OSHDB query. It takes the result of the previous map steps and combines (reduces) these values into a final result. This can be something as simple as summing up all of the values, but also something more complicated, for example estimating statistical properties such as the median of the calculated values. Many queries use common reduce operations, for which the OSHDB provides shorthand methods (see below).

Every OSHDB query must have exactly one terminal reduce operation (or use the stream method explained below).

Remark: If you are already familiar with Hadoop, note that for defining a reduce operation we use the terminology of the Java stream API, which is slightly different from the terminology used in Hadoop. In particular, the Java stream API and Hadoop both use the term 'combiner', but for different things.
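In that Java stream terminology, a generic reduce operation is defined by an identity value, an accumulator, and a combiner (the combiner merges two partial results, e.g. from parallel execution). A minimal plain-Java sketch of this shape (not OSHDB API):

```java
import java.util.List;

public class SumOfSquares {
    static double sumOfSquares(List<Double> values) {
        return values.stream().reduce(
            0.0,                     // identity: the neutral start value
            (acc, v) -> acc + v * v, // accumulator: folds one input into the result
            Double::sum              // combiner: merges two partial results
        );
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1.0, 2.0, 3.0))); // prints 14.0
    }
}
```

Note that the accumulator and the combiner are different functions here: one incorporates a new input value, the other merges already-reduced partial sums.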

specialized reducers

The OSHDB provides the following default reduce operations, which are often used for querying OSM history data. Their names and usage are mostly self-explanatory.

Some of the listed specialized reducers also have overloaded versions that accept a mapping function directly. This allows some queries to be written more concisely, but also allows for improved type inference: for example, when summing integer values, the overloaded sum reducer knows that the result must also be of type Integer and doesn't have to resort to returning the more generic Number type.
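A plain-Java analogy of such an overloaded, typed sum (an illustration, not the OSHDB sum reducer itself): handing the mapping function directly to a primitive-typed sum keeps the result an int rather than a boxed, generic Number.

```java
import java.util.List;

public class SpecializedSum {
    // The mapping function (List::size) is passed directly to the
    // reducer, so the result type (int) is known, with no boxing
    // to a generic Number.
    static int totalPoints(List<List<Integer>> ways) {
        return ways.stream()
            .mapToInt(List::size)
            .sum();
    }

    public static void main(String[] args) {
        System.out.println(totalPoints(List.of(List.of(1, 2, 3), List.of(4, 5)))); // prints 5
    }
}
```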

stream

Instead of using a regular reduce operation at the end of an OSHDB query, one can also call stream, which doesn't aggregate the values into a final result but rather returns a (potentially long) stream of values. If possible, using a reduce operation instead of streaming all values and post-processing them results in better query performance, because there is less data to be transferred. The stream operation is, however, preferable to collect if the result set is expected to be large, because it doesn't require all the data to be buffered into a result collection.
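A plain-Java sketch of this trade-off (not the OSHDB stream method itself): consuming a stream of result values one element at a time avoids buffering the whole result set into a collection first.

```java
import java.util.stream.Stream;

public class StreamConsumer {
    // Consume result values lazily: each element is processed and then
    // discarded, so the full result set is never held in memory at once.
    static double maxOf(Stream<Double> results) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : (Iterable<Double>) results::iterator) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxOf(Stream.of(1.0, 9.5, 3.0))); // prints 9.5
    }
}
```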