Notes and homework related to Week 5: Aggregation Framework.
- Aggregation reference (MongoDB Manual)
- Simple example:
db.products.aggregate([ { $group: { _id: "$manufacturer", num_products: { $sum: 1}}}])
returns a document with fields_id
andnum_products
, containing a product count per manufacturer - Compound grouping:
db.products.aggregate([ { $group: { _id: { "manufacturer": "$manufacturer", "category": "$category" }, num_products: { $sum: 1}}}])
is similar toSELECT manufacturer, category, COUNT(*) FROM products GROUP BY manufacturer, category
- By the way, documents can be used for
_id
in general:db.foo.insert({ _id: { name: "andrew", class: "m101"}, hometown: "NY"}
is perfectly valid
- Documents of a collection are piped thru an aggregation pipeline, which is a queue in FIFO-order of aggregation operations (or aggregation stages), that is going to transform the collection
- … expressed by
db.products.aggregate( ‹pipeline›)
in the Mongo shell, where ‹pipeline› is an ordered array of aggregation operations - … each operation can appear more than once in the pipeline, at different stages
- Sample pipeline: collection →
$project
→$match
→$group
→$sort
→ result $project
operation -- select, reshape (1:1)$match
operation -- filters (N:1 / reducing effect)$group
operation -- aggregation (N:1 / reducing effect)$project
operation -- sorting (1:1)$skip
operation -- skip documents (N:1 / reducing effect)$limit
operation -- limit number of documents (N:1 / reducing effect)$unwind
operation -- unjoins data of arrays or subdocuments (1:N / augmenting effect), normalizing the data to ease aggregation; for instance, when unwinding its fieldtags: ['sports', 'outdoors', 'summer']
, a blog post would be exploded into three documents
- Aggregation expressions in the
$group
stage:$sum
(sums up the key value with{$sum: "$price"}
, or sums up the given value to count'$sum': 1
),$avg
(average, for instance{$avg: "$price"}
),$min
and$max
(find the minimum and maximum values of the key),$push
and$addToSet
(build arrays, the last adding uniquely each key),$first
and$last
(both requiring to sort, otherwise the returned keys are arbitrary) - Sum operation
$sum
:db.zips.aggregate([{ "$group": { _id: "$state", population: { $sum: "$pop" }}}])
- Average operation
$avg
:db.products.aggregate([{ $group: { _id: { category: "$category"}, avg_price: { $avg: "$price" }}}])
- Building arrays with
$addToSet
:db.products.aggregate([{ $group: { _id: { maker: "$manufacturer"}, categories: { $addToSet: "$category" }}}])
will build a new set of documents grouped by manufacturer, looking thru the array to see if the category is already there, adding it if it was not; quiz answer:db.zips.aggregate([{ $group: { _id: "$city", postal_codes: { $addToSet: "$_id" }}}, { $limit: 5}])
; note that the"$_id"
expression refers to the_id
in the source documents of the pipeline stage - Building arrays with
$push
: similar to$addToSet
, but does not guarantee that it adds each item only once; for instancedb.products.aggregate([{ $group: { _id: { maker: "$manufacturer"}, categories: { $push: "$category" }}}])
would again build a set of new documents grouped by manufacturer, where the categories might appear more than once; quiz:$push
is obviously identical to$addToSet
if its expression contains unique values, although the order of documents might be different - Picking the maximum and minimum value with the
$max
and$min
operations:db.products.aggregate([{ $group: { _id: { maker: "$manufacturer"}, max_price: { $max: "$price"}}}])
will find the maximum price of the products of a manufacturer - However, getting the product matching that maximum price can't be done within this grouping stage alone, additional
$sort
and$limit
stages would be needed - Quiz answer:
db.zips.aggregate([{ $group: { "_id": "$state", pop: { $max: "$pop"}}}])
- To figure out the average class grade in each class, we need to average all the students' grades within the class, and after that first stage, average those grades (aka scores) in a second stage:
db.grades.aggregate([{ "$group":{ _id: { class_id: "$class_id", student_id: "$student_id"}, average:{ "$avg": "$score"}}}, { "$group":{ _id: "$_id.class_id", average:{ "$avg": "$average"}}}])
- Quiz answer:
db.fun.aggregate([{ $group:{ _id:{ a: "$a", b: "$b"}, c:{ $max: "$c"}}}, { $group:{ _id: "$_id.a", c:{ $min: "$c"}}}])
returns documents{ _id: 0, c: 52}
and{ _id: 1, c: 22}
### Projection operation (1:1)
- Allows to reshape the documents coming thru: remove keys, add new keys, reshape keys, use simple functions on keys, such as
$toUpper
,$toLower
,$add
or$multiply
- For instance
db.zips.aggregate([{ $project:{ _id: 0, city:{ $toLower: "$city"}, pop:1, state:1, zip: "$_id"}}])
will suppress the_id
field (0
means drop it) and replace it withzip
, including the fieldspop
andstate
as they are (1
means just include it), pluscity
lowercased (more complex expressions are evaluated) - Quiz answer: see the example above
- Matching operation
$match
: performs a filter (on the documents as they pass thru the pipe) and has a reducing effect (N:1, or 1:1 if all documents are matched) - For instance
db.zips.aggregate([{ $match: { state: "NY"}}])
- Matching and grouping:
db.zips.aggregate([{ $match:{ state:"NY"} }, { $group:{ _id: "$city", population:{ $sum:"$pop"}, zip_codes:{ $addToSet: "$_id" }}}])
- Matching, grouping and projecting:
db.zips.aggregate([{ $match:{ state:"NY"}}, { $group:{ _id: "$city", population:{ $sum:"$pop"}, zip_codes:{ $addToSet: "$_id"}}}, { $project:{ _id: 0, city: "$_id", population: 1, zip_codes: 1 }}])
- Quiz answer:
db.zips.aggregate([{ $match: { pop: { $gt: 100000}}}])
- Sorting operation
$sort
: before or after grouping stage; beware memory usage! the aggreation framework handles everything in memory - If the sort is after the grouping stage, the operation won't be able to use an index
- Sort can be used multiple times
- Sorting before grouping is often very useful when aggregating
- For instance
db.zips.aggregate([ …, { $sort: { population: -1 }}])
- … within a complete pipeline:
db.zips.aggregate([{ $match:{ state:"NY" }}, { $group:{ _id: "$city", population:{ $sum:"$pop" }}}, { $project:{ _id: 0, city: "$_id", population: 1 }}, { $sort:{ population: -1 }}])
- Quiz answer:
db.zips.aggregate([{ $sort: { state: 1, city: 1}}])
- Skip forward in the document set with
$skip
and limiting the resultset with$limit
- The order they are specified does matter (to the contrary of
find
); generally you first skip, then limit - Skipping doesn't make a lot of sense unless you sort, so you first sort, then skip and/or limit
- For instance
db.zips.aggregate([ … { $sort: { population: -1 }}, { $skip: 10}, { $limit: 5}])
- … within a complete example:
db.zips.aggregate([{ $match: { state: "NY"}}, { $group:{ _id: "$city", population:{ $sum: "$pop"}}}, { $project:{ _id: 0, city: "$_id", population: 1}}, { $sort:{ population: -1 }}, { $skip: 10}, { $limit: 5 }])
- Quiz answer: there will be zero documents in the result set; if you first limit it to 5 documents, and then skip 10 documents, there will be none left
-
$first
and$last
are group operators, which allow to get the first or last value in each group as the aggregation pipeline processes documents -
After a sort stage, these operators can find the first or last document in sorted order in each group
-
If the document set #{ {A:0, B:23}, {A:0, B:45}, {A:1, B:17}, {A:1, B:68}} gets sorted by (A, B), then grouped on A,
$first
would returnB:23
andB:17
, and$last
would returnB:45
andB:68
-
For instance
db.zips.aggregate([{ $sort: { "_id.state": 1, "population": -1}}, { $group: { _id: "$_id.state", city:{ $first: "$_id.city"}, population:{ $first: "$population" }}}])
-
Complete example: we want to find the largest city in every state
db.zips.aggregate([ /* Get the population of every city in every state */ { $group: { _id:{ state: "$state", city: "$city"}, population:{ $sum: "$pop"}}}, /* Sort by state, population (note we don't need the dollar operator with $sort) */ { $sort: { "_id.state": 1, "population": -1}}, /* Group by state, get the first item in each group */ { $group: { _id: "$_id.state", city:{ $first: "$_id.city"}, population:{ $first: "$population"}}}, /* Now sort by state again */ { $sort: { "_id": 1}} ])
-
Allows to unjoin (or explode) the data of an array and rejoin it in a way that let us do grouping operations; for instance
db.items.aggregate([{ $unwind: "$attributes"}])
-
Complete example: we want to count the number times that each tag appears in a post (or how many posts were tagged with each tag); therefore, we're going to unwind the tags, and then group by the tag, and count:
db.posts.aggregate([ /* Unwind by tags */ { "$unwind": "$tags"}, /* Now group by tags, counting each tag */ { "$group": { _id: "$tags", count:{ $sum: 1}}}, /* Sort by popularity */ { "$sort": { count: -1}}, /* Show me the top 10 */ { "$limit": 10}, /* Change the name of _id to be tag */ {"$project": { _id: 0, tag: "$_id", count: 1}} ])
-
Answer to quiz:
$push
allows to reverse the effect of an$unwind
(grouping by the rest of the document); another option would be$addToSet
, depending if the array contained only unique values in the original document -
If there is more than one array in a document, and you want to create a cartesian product of the arrays, you have to double unwind
-
Complete example:
db.inventory.aggregate([ { $unwind: "$sizes"}, { $unwind: "$colors"}, /* Create the color array */ { $group: { "_id": { name: "$name", size: "$sizes"}, "colors":{ $push: "$colors"}}}, /* Create the size array */ { $group: { "_id": { "name": "$_id.name", "colors": "$colors"}, "sizes":{ $push: "$_id.size"}}}, /* Reshape for beauty */ { $project: { _id:0, "name":"$_id.name", "sizes":1, "colors": "$_id.colors"}} ])
-
Answer to quiz: can you reverse the effect of a double unwind (two unwinds in a row) with the
$push
operator? Yes, with two pushes in a row; or even in a single stage using twice$addToSet
, if the values were unique
- Mapping from SQL:
WHERE
→$match
,GROUP BY
→$group
,HAVING
→$match
,SELECT
→$project
,ORDER BY
→$sort
,LIMIT
→$limit
,SUM()
→$sum
,COUNT()
→$sum
,join
→ no direct corresponding operator;$unwind
allows for similar functionality, but with fields embedded within the document - Some common examples: see SQL to Aggregation Mapping Chart
- Resultset limited to 16 MB of memory
- Can use max. 10% of the memory on a machine; use
$match
to do early filtering, if all data is not needed in following stages; similarily, use$project
to filter out some of the fields - Sharding, that is collections split among multiple Mongo instances (which can be replica sets in turn), with a
mongos
router in front of the shards: aggregation does work on a sharded environment, but after the first$group
or$sort
stage, the aggregation has to be brought back to themongos
, so that it can gather the results before sending them for the next stage of the pipeline (which needs to see the finished result); so when having a long pipeline, all calculations will wind up to themongos
server -- often colocated on the application server, therefore pushing the server resources - mapreduce is an alternative to the Aggregation framework
- Hadoop is also an alternative to Aggregation framework; there is a connector for Hadoop in MongoDB, and Hadoop is an implementation of mapreduce