forked from wwasilak/northwind_test
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path25DaysAnswers.malloynb
373 lines (313 loc) · 12.9 KB
/
25DaysAnswers.malloynb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
>>>markdown
### SCOPE
In this project i would like to answer all the questions from Curbal's '25 Days of DAX Fridays! Challenge – Ed1: NorthWind Company' (https://curbal.com/25-days-of-dax-fridays-challenge-ed1-northwind-company) using Malloy. Now in this challenge all the required answers are in form of single numeric value or string - i won't be returning tables here and i won't be able to show some of Malloy's key concepts (like reusing queries, nesting and many more). So this notebook is begginer's level.
>>>markdown
I've decided to use csv files from this source:
https://github.com/graphql-compose/graphql-compose-examples/tree/master/examples/northwind/data/csv .
When using csv or parquet files DuckDb is used as underlying engine (currently Malloy can be used also with BigQuery or Postgres). Csv files were transformed to parquet files (this can be visible in **DataTransformation.malloysql** notebook).
Dataset consists of 11 tables (2 tables are missing from original schema: customer_customer_demo and customer_demographics).
This is how original schema looks like: (from https://github.com/yugabyte/yugabyte-db/wiki/Northwind-Sample-Database)
![Alt text](schema.png)
### THE MODEL
To start using Malloy, first i need to create model. This includes defining data sources, specifying joins, creating dimensions and measures. Malloy is designed in a way to work swiftly with transactional data and (from what i understand) there is no need to create dimensional model. This is a different approach than the one in Power Bi ecosystem, where the best practice is to create star schema, a design that guarantees good performance and simplified DAX code.
I can work on the model in two ways:
Option 1: As i have database schema i can specify all the joins from start. To specify a join inside source, the joined source needs to exist already, so I would have to start from outermost tables in the schema and move to the center.
Option 2: First i would define base sources from tables together with measures and dimensions (we can think about dimensions as calculated columns in PBI world). This join-less model would allow me to calculate most of the interesting numbers from those tables. For cross-table metrics, i would create new sources by extending those base sources with joins.
(There is also third option: define join on the fly (so it would work only for specific query) using 'run: source_name extend {}'.)
I will stick with option 1:
>>>malloy
import '25Days.malloy'
>>>markdown
Model is more or less ready. In some of the sources we can spot dimensions and measures - most of these were added later, while answering questions.
>>>markdown
### QUESTIONS
Question 1: How many current products cost less then 20$?
>>>malloy
run: ProductsBase -> {
aggregate: product_count
where: discontinued = 0 and unitPrice < 20
}
>>>markdown
Question 2: Which product is the most expensive? Success at third try. Currently correlated subqueries are not available in Malloy so i added a column with ALL (like calculated column in PBI) and then filtered by the value in this column:
>>>malloy
run: ProductsBase -> {
order_by: unitPrice desc
project: productName, unitPrice
limit: 1
}
>>>markdown
Question 3: What is the average unit price for our products?
>>>malloy
run: ProductsBase -> {
aggregate: avg_price
}
>>>markdown
Question 4: How many products are above the average unit price?
>>>malloy
run: ProductsBase -> {
group_by:
productName
aggregate:
avg_price
all_avg_price is all(avg_price)
} -> {
where: avg_price > all_avg_price
aggregate: result is count()
}
>>>markdown
Question 5: How many products cost between $15 and $25? (inclusive)
>>>malloy
run: ProductsBase -> {
where: unitPrice >= 15 and unitPrice <= 25
aggregate: product_count
}
>>>markdown
Question 6: What is the average number of products (not qty) per order? - this question was later clarified: we need to calculate average order size:
>>>markdown
Also calculated using pipeline:
>>>malloy
run: Orders -> {
group_by: orderID
aggregate: OrderDetails.item_count
} -> {
aggregate: average_order_size is avg(item_count)
}
>>>markdown
Question 7: What is the order value in $ of open orders? (not shipped)
`order_is_open` is something we might want to ask often so it was added to the OrdersBase source.
>>>malloy
run: Orders -> {
where: order_is_open
aggregate: orders_value is OrderDetails.total_revenue
}
>>>markdown
Question 8: How many orders are single item (only one product ordered)?
>>>malloy
run: Orders -> {
group_by: orderID
aggregate: OrderDetails.item_count
} -> {
where: item_count = 1
aggregate: result is count()
}
>>>markdown
Question 9: Avg sales per transaction for "Romero y Tobillo"
A nice question as this time i need 3 sources to answer (OrderDetails - Orders - Customers). As i've already defined my model it should be easy - i will need Orders and OrderDetails but also Customers as name 'Romero y Tabillo' can be found only there. There is a typo in the question so to find correct companyName i've made a helper query:
>>>malloy
run: CustomersBase -> {
project: companyName
where: companyName ~ '%Romero%'
}
>>>malloy
run: Orders -> {
where: Customers.companyName = 'Romero y tomillo'
group_by: orderID
aggregate: revenue_per_order is OrderDetails.revenue.sum()
} -> {
aggregate: result is avg(revenue_per_order)
}
>>>markdown
Question 10: How many days since 'North/South' last purchase?
I need to use Orders and Customers sources and also a built in time constant 'now' to get current date time. The result will be different depending on time when the query is evaluated.
To make everything work i am changing time stamps to dates using '::date'.
>>>malloy
run: Orders -> {
where: Customers.companyName = 'North/South'
project: days_since_order is days(order_date to now::date)
limit: 1
order_by: days_since_order
}
>>>markdown
Question 11: How many customers have ordered only once?
>>>malloy
run: Orders -> {
group_by: customerID
aggregate: order_count
} -> {
where: order_count = 1
aggregate: result is count()
}
>>>markdown
Question 12: How many new customers in 1998? - (slightly modified, 1998 instead of 2022 as my version of data has different time range)
I've added new dimension to Orders source: order_year is year(orderDate).
Also i am using all with grouping dimension on this order_year dimension.
>>>malloy
run: Orders -> {
group_by: customerID
aggregate: first_year is min(order_year)
}
-> {
where: first_year = 1998
aggregate: result is count()
}
>>>markdown
Question 13: How many lost customers in 1998 (no purchases)? - so how many customers bought something in 1997 but nothing in 1998
(slightly modified, 1998 instead of 2022 as my version of data has different time range, picked 1998)
>>>malloy
run: Orders -> {
group_by: customerID
aggregate: order_1997 is count() {where: order_year = 1997}
aggregate: order_1998 is count() {where: order_year = 1998}
having: order_1997 > 0 and order_1998 = 0
}
-> {
aggregate: customer_count is count()
}
>>>markdown
Question 14: How many customers has never purchased Queso Cabrales?
Came up with 2 options here:
1. Here i am first defining a query that returns a table with unique customers that purchased Queso Cabrales.
Then i am joining this query to new source made from Customers source. Next filtering using 'where' and aggregating the result.
>>>malloy
--distinct customers that purchased Queso Cabrales
query: queso_table is Orders -> {
group_by:
Customers.customerID
where: OrderDetails.Products.productName = 'Queso Cabrales'
}
-- new source from Customers with join to query defined above
source: Customers_extended is Customers
extend {
join_one: queso_table is from(->queso_table) on customerID = queso_table.customerID
}
-- query with filter and aggregation
run: Customers_extended -> {
where: queso_table.customerID = null
aggregate: result is count()
}
>>>markdown
2. This time i am not using another query. Instead i am defining new source from Customers where i join Orders (to get access to Products) and add dimension called 'queso_flag' that returns 1 when customer bought Queso Cabrales and 0 if he didn't (this duplicates records in this extended source). Next i am running the pipeline query where first i add aggregate that counts unique queso_flags for each customer (this requires to use ALL) and then filter and count the result.
>>>malloy
source: Customers_to_Orders is Customers extend {
join_many: Orders on customerID = Orders.customerID
dimension:
queso_flag is pick 1 when Orders.OrderDetails.Products.productName = 'Queso Cabrales' else 0
}
-- adding number of unique queso_flags per customerID:
-- if 1 then Customer did buy or didn't buy Queso Cabrales
-- if 2 then Customer did buy QC (along with other products)
run: Customers_to_Orders -> {
group_by: customerID, queso_flag
aggregate: flag_count is all(count(distinct queso_flag), customerID)
}
-> {
where:
flag_count = 1 and --a customer did buy or didn't buy Queso Cabrales
queso_flag = 0 --a customer never bought Queso Cabrales
aggregate: result is count()
}
>>>markdown
Question 15: How many customers have purchased only Queso Cabrales?
>>>malloy
run: Customers_to_Orders -> {
group_by: customerID, queso_flag
aggregate: flag_count is all(count(distinct queso_flag), customerID)
}
-> {
where: flag_count = 1 and queso_flag = 1
aggregate: result is count()
}
>>>markdown
At some point question 15 was changed:
How many customers have only purchased one thing? I am not sure if logic below is correct:
>>>malloy
run: Orders -> {
group_by: customerID
aggregate: products_in_order is count(OrderDetails.productID)
having: products_in_order = 1
}
-> {
aggregate: result is count()
}
>>>markdown
Question 16: How many products are out of stock?
>>>malloy
--result should be 5:
run: Products -> {
where: unitsInStock = 0
aggregate: result is count()
}
-- if we would like 'discontinued' flag to be included:
run: Products -> {
where: discontinued != 1 and unitsInStock = 0
aggregate: result is count()
}
>>>markdown
Question 17: How many products need to be restocked (based on restock levels)?
>>>malloy
run: Products -> {
where: unitsInStock < reorderLevel
aggregate: result is count()
}
>>>markdown
Question 18: How many products on order we need to restock?
>>>malloy
-- we should use unitsOnOrder to calculate this:
run: Products -> {
where: unitsInStock < unitsOnOrder
aggregate: result is count()
}
>>>markdown
Question 19: What is the stocked value of the discontinued products?
>>>malloy
run: Products -> {
where: discontinued = 1
group_by: value is unitPrice * unitsInStock --this could be added as a dimension of course
} -> {
aggregate: result is sum(value)
}
>>>markdown
Question 20: Which vendor has the highest stock value?
As i want to return only vendor name (without value) i am using pipeline:
>>>malloy
--to return only name i will use pipeline:
run: Products extend {
dimension: value is unitPrice * unitsInStock
measure: max_value is all(max(value))
} -> {
group_by: Suppliers.companyName, value
aggregate: max_value
} -> {
where: value = max_value
project: companyName
}
>>>markdown
Question 21: How many employees (%) are female?
>>>malloy
-- seems like i am missing 'gender' column in my table so i will use 'titleOfCourtesy' to solve this:
run: Employees -> {
# percent
aggregate: female_count is count() {where: titleOfCourtesy = 'Ms.' or titleOfCourtesy = 'Mrs.'} / count()
}
>>>markdown
Question 22: How many employees are 60 years old or over?
Cannot solve this one as i am missing 'age' column in my source.
>>>markdown
Question 23: Which employee had the highest sales in 1997? - (slightly modified, originally it was 2022 but my version of data has different time range)
>>>malloy
run: Orders -> {
where: order_year = 1997
group_by: full_name is concat(Employees.firstName, ' ', Employees.lastName)
aggregate: OrderDetails.revenue.sum()
limit: 1
} -> {
project: full_name
}
>>>markdown
Question 24: How many employees sold over 100K$ in 1997? (slightly modified, originally it was 2022 but my version of data has different time range)
>>>malloy
run: Orders -> {
where: order_year = 1997
group_by: employeeID
aggregate: sales is OrderDetails.revenue.sum()
having: sales > 100000
} -> {
aggregate: result is count()
}
>>>markdown
Question 25: How many employees got hired in 1994?
>>>malloy
run: Employees -> {
where : year(hireDate) = 1994
aggregate: result is count()
}