25DaysAnswers.malloynb

>>>markdown
### SCOPE

In this project i would like to answer all the questions from Curbal's '25 Days of DAX Fridays! Challenge – Ed1: NorthWind Company' (https://curbal.com/25-days-of-dax-fridays-challenge-ed1-northwind-company) using Malloy. Now in this challenge all the required answers are in form of single numeric value or string - i won't be returning tables here and i won't be able to show some of Malloy's key concepts (like reusing queries, nesting and many more). So this notebook is begginer's level.
>>>markdown


I've decided to use csv files from this source:
https://github.com/graphql-compose/graphql-compose-examples/tree/master/examples/northwind/data/csv .
When using csv or parquet files DuckDb is used as underlying engine (currently Malloy can be used also with BigQuery or Postgres). Csv files were transformed to parquet files (this can be visible in **DataTransformation.malloysql** notebook).

Dataset consists of 11 tables (2 tables are missing from original schema: customer_customer_demo and customer_demographics).

This is how original schema looks like: (from https://github.com/yugabyte/yugabyte-db/wiki/Northwind-Sample-Database)

![Alt text](schema.png)


### THE MODEL

To start using Malloy, first i need to create model. This includes defining data sources, specifying joins, creating dimensions and measures. Malloy is designed in a way to work swiftly with transactional data and (from what i understand) there is no need to create dimensional model. This is a different approach than the one in Power Bi ecosystem, where the best practice is to create star schema, a design that guarantees good performance and simplified DAX code.

I can work on the model in two ways:
Option 1: As i have database schema i can specify all the joins from start. To specify a  join inside source, the joined source needs to exist already, so I would have to start from outermost tables in the schema and move to the center.

Option 2: First i would define base sources from tables together with measures and dimensions (we can think about dimensions as calculated columns in PBI world). This join-less model would allow me to calculate most of the interesting numbers from those tables. For cross-table metrics, i would create new sources by extending those base sources with joins.

(There is also third option: define join on the fly (so it would work only for specific query) using 'run: source_name extend {}'.)

I will stick with option 1:
>>>malloy
import '25Days.malloy'
>>>markdown
Model is more or less ready. In some of the sources we can spot dimensions and measures - most of these were added later, while answering questions.
>>>markdown
### QUESTIONS


Question 1: How many current products cost less then 20$?
>>>malloy
run: ProductsBase -> {
    aggregate: product_count 
    where: discontinued = 0 and unitPrice < 20
}
>>>markdown

Question 2: Which product is the most expensive? Success at third try. Currently correlated subqueries are not available in Malloy so i added a column with ALL (like calculated column in PBI) and then filtered by the value in this column:
>>>malloy
run: ProductsBase -> {
    order_by: unitPrice desc
    project: productName, unitPrice
    limit: 1
}
>>>markdown

Question 3: What is the average unit price for our products?
>>>malloy
run: ProductsBase -> {
  aggregate: avg_price 
}
>>>markdown
Question 4: How many products are above the average unit price? 
>>>malloy
run:  ProductsBase -> {
   group_by: 
    productName   
   aggregate: 
    avg_price
    all_avg_price is all(avg_price)
    
} -> {
    where: avg_price > all_avg_price
    aggregate: result is count()
}


>>>markdown

Question 5: How many products cost between $15 and $25? (inclusive)
>>>malloy

run: ProductsBase -> {
    where: unitPrice >= 15 and unitPrice <= 25
    aggregate: product_count
}
>>>markdown

Question 6: What is the average number of products (not qty) per order? - this question was later clarified: we need to calculate average order size:
>>>markdown

Also calculated using pipeline:
>>>malloy
run: Orders -> {
  group_by: orderID
  aggregate: OrderDetails.item_count
       
} -> {
  aggregate: average_order_size is avg(item_count)
}
>>>markdown

Question 7: What is the order value in $ of open orders? (not shipped)
`order_is_open` is something we might want to ask often so it was added to the OrdersBase source.
>>>malloy
run: Orders -> {
  where: order_is_open 
  aggregate: orders_value is OrderDetails.total_revenue        
}
>>>markdown
Question 8: How many orders are single item (only one product ordered)?
>>>malloy
run: Orders -> {
  group_by: orderID 
  aggregate: OrderDetails.item_count
} -> {
  where: item_count = 1
  aggregate: result is count()
}
>>>markdown
Question 9: Avg sales per transaction for "Romero y Tobillo"

A nice question as this time i need 3 sources to answer (OrderDetails - Orders - Customers). As i've already defined my model it should be easy - i will need Orders and OrderDetails but also Customers as name 'Romero y Tabillo' can be found only there. There is a typo in the question so to find correct companyName i've made a helper query:
>>>malloy
run: CustomersBase -> {
  project: companyName
  where: companyName ~ '%Romero%'
}
>>>malloy
run: Orders -> {
  where: Customers.companyName = 'Romero y tomillo'
  group_by: orderID
  aggregate: revenue_per_order is OrderDetails.revenue.sum()
   
} -> {
  aggregate: result is avg(revenue_per_order)
}
>>>markdown
Question 10: How many days since 'North/South' last purchase?

I need to use Orders and Customers sources and also a built in time constant 'now' to get current date time. The result will be different depending on time when the query is evaluated.

To make everything work i am changing time stamps to dates using '::date'.
>>>malloy
run: Orders -> {
  where: Customers.companyName = 'North/South'
  project: days_since_order is days(order_date to now::date)
  limit: 1
  order_by: days_since_order
}
>>>markdown
Question 11: How many customers have ordered only once?
>>>malloy
run: Orders -> {
  group_by: customerID 
  aggregate: order_count 
} -> {
  where: order_count = 1
  aggregate: result is count()
}
>>>markdown
Question 12: How many new customers in 1998? - (slightly modified, 1998 instead of 2022 as my version of data has different time range)

I've added new dimension to Orders source: order_year is year(orderDate).
Also i am using all with grouping dimension on this order_year dimension.
>>>malloy
run: Orders -> {
    group_by:  customerID
    aggregate: first_year is min(order_year)
} 
-> {
    where: first_year = 1998 
    aggregate: result is count()
}
>>>markdown
Question 13: How many lost customers in 1998 (no purchases)? - so how many customers bought something in 1997 but nothing in 1998
(slightly modified, 1998 instead of 2022 as my version of data has different time range, picked 1998)
>>>malloy
run: Orders -> {
    group_by: customerID
    aggregate: order_1997 is count() {where: order_year = 1997}
    aggregate: order_1998 is count() {where: order_year = 1998}
    having: order_1997 > 0 and order_1998 = 0
}
-> {
    aggregate: customer_count is count()
}
>>>markdown
Question 14: How many customers has never purchased Queso Cabrales?

Came up with 2 options here:

1. Here i am first defining a query that returns a table with unique customers that purchased Queso Cabrales.
Then i am joining this query to new source made from Customers source. Next filtering using 'where' and aggregating the result.
>>>malloy

--distinct customers that purchased Queso Cabrales
query: queso_table is Orders -> {
    group_by: 
        Customers.customerID
        where: OrderDetails.Products.productName = 'Queso Cabrales' 
} 

-- new source from Customers with join to query defined above
    source: Customers_extended is Customers
    extend {
        join_one: queso_table is from(->queso_table) on customerID = queso_table.customerID
    } 
-- query with filter and aggregation
run: Customers_extended -> {
    where: queso_table.customerID = null
    aggregate: result is count()
    } 
        
>>>markdown
2. This time i am not using another query. Instead i am defining new source from Customers where i join Orders (to get access to Products) and add dimension called 'queso_flag' that returns 1 when customer bought Queso Cabrales and 0 if he didn't (this duplicates records in this extended source). Next i am running the pipeline query where first i add aggregate that counts unique queso_flags for each customer (this requires to use ALL) and then filter and count the result.
>>>malloy

source: Customers_to_Orders is Customers extend {
    join_many: Orders on customerID = Orders.customerID

    dimension: 
    queso_flag is pick 1 when Orders.OrderDetails.Products.productName = 'Queso Cabrales' else 0
}

 -- adding number of unique queso_flags per customerID:
 -- if 1 then Customer did buy or didn't buy Queso Cabrales
 -- if 2 then Customer did buy QC (along with other products)

    run: Customers_to_Orders -> {
        group_by: customerID, queso_flag 
        aggregate: flag_count is all(count(distinct queso_flag), customerID)
        
    } 
    -> {
        where: 
        flag_count = 1    and   --a customer did buy or didn't buy Queso Cabrales
        queso_flag = 0          --a customer never bought Queso Cabrales
        
        
        aggregate: result is count()
    }
    
>>>markdown
Question 15: How many customers have purchased only Queso Cabrales?
>>>malloy
 run: Customers_to_Orders -> {
        
        group_by: customerID, queso_flag 
        aggregate: flag_count is all(count(distinct queso_flag), customerID)

    } 
    -> {
        where: flag_count = 1 and queso_flag = 1
        aggregate: result is count()
    }
>>>markdown
At some point question 15 was changed:
How many customers have only purchased one thing? I am not sure if logic below is correct:
>>>malloy
run: Orders -> {
    group_by: customerID
    aggregate: products_in_order is count(OrderDetails.productID)
    having: products_in_order = 1
} 

-> {
    aggregate: result is count()
}
>>>markdown
Question 16: How many products are out of stock?
>>>malloy
--result should be 5:
run: Products -> {
    where:  unitsInStock = 0
    aggregate: result is count()

}

-- if we would like 'discontinued' flag to be included:
run: Products -> {
    where: discontinued != 1 and unitsInStock = 0
    aggregate: result is count()

}
>>>markdown
Question 17: How many products need to be restocked (based on restock levels)?
>>>malloy

run: Products -> {
    where: unitsInStock < reorderLevel
    aggregate: result is count()

}
>>>markdown
Question 18: How many products on order we need to restock?
>>>malloy
-- we should use unitsOnOrder to calculate this:
run: Products -> {
    where: unitsInStock < unitsOnOrder
    aggregate: result is count()

}
>>>markdown
Question 19: What is the stocked value of the discontinued products?
>>>malloy
run: Products -> {
    where: discontinued = 1
    group_by: value is unitPrice * unitsInStock --this could be added as a dimension of course
} -> {
    aggregate: result is sum(value)
}
>>>markdown
Question 20: Which vendor has the highest stock value?

As i want to return only vendor name (without value) i am using pipeline:
>>>malloy
--to return only name i will use pipeline:
run: Products extend {
    dimension: value is unitPrice * unitsInStock
    measure: max_value is all(max(value))
} -> {
    group_by: Suppliers.companyName, value
    aggregate: max_value
} -> {
    where: value = max_value
    project: companyName
}
>>>markdown
Question 21: How many employees (%) are female?
>>>malloy
-- seems like i am missing 'gender' column in my table so i will use 'titleOfCourtesy' to solve this:

run: Employees -> {
    # percent
    aggregate: female_count is count() {where: titleOfCourtesy = 'Ms.' or titleOfCourtesy = 'Mrs.'} / count()
    
}
>>>markdown
Question 22: How many employees are 60 years old or over?

Cannot solve this one as i am missing 'age' column in my source.
>>>markdown
Question 23: Which employee had the highest sales in 1997? - (slightly modified, originally it was 2022 but my version of data has different time range)
>>>malloy
run: Orders -> {
    
        where: order_year = 1997 
        group_by: full_name is concat(Employees.firstName, ' ', Employees.lastName)
        aggregate: OrderDetails.revenue.sum()
        limit: 1
} -> {
    project: full_name
}
>>>markdown
Question 24: How many employees sold over 100K$ in 1997? (slightly modified, originally it was 2022 but my version of data has different time range)
>>>malloy
run: Orders -> {
    
        where: order_year = 1997 
        group_by: employeeID
        aggregate: sales is OrderDetails.revenue.sum()
        having: sales > 100000
} -> {
    aggregate: result is count()
}
>>>markdown
Question 25: How many employees got hired in 1994?
>>>malloy
run: Employees -> {
        where : year(hireDate) = 1994
        aggregate: result is count()
        }