Modeling Joins in Morphir #77

AttilaMihaly · 2022-02-09T19:48:50Z

AttilaMihaly
Feb 9, 2022
Maintainer

Problem Statement

The business logic we are modeling with Morphir often involves doing lookups to get more data to do further calculations. Traditionally this is done using joins in an ETL pipeline. Joins don't map directly to functional programming so we usually use dictionary lookups to get more data instead.

For example we might want to calculate the total value of each item in an order by joining price information:

select item_name, item_count, item_count * product.price as total_value
from items
join product on product.id = items.product_id

You can do the same in Elm with a dictionary lookup:

calc items products =
    items
        |> List.map
            (\item ->
                { item_name = item.item_name
                , item_count = item.item_count
                , total_value = item.item_count * (products |> Dict.get item.product_id |> Maybe.map .price |> Maybe.withDefault 0) 
                }
            )

It looks ok except for that bit with the Maybe. Even with all that extra code though it doesn't do exactly what the SQL does. It's essentially doing an outer join with some defaulting. If we want to get the same semantics we have to do something even more complicated:

calc items products =
    items
        |> List.filterMap
            (\item ->
                products 
                    |> Dict.get item.product_id 
                    |> Maybe.map
                        (\product ->
                            { item_name = item.item_name
                            , item_count = item.item_count
                            , total_value = item.item_count * product.price 
                            }
                        )
            )

Now we have the same behavior but is this really what we want? Do we want to make it look like a product was never ordered if it's missing from our pricing database for some reason? Probably not. We want things to not fail when this happens but we also want to get notified. How do we achieve that?

Proposed Solution

We recently added a module to the Morphir SDK which has a required function that serves as a marker telling the tooling that a value is required but might not be available at runtime. See the docs for an example: https://package.elm-lang.org/packages/finos/morphir-elm/latest/Morphir-SDK-Validate

It turns out that we can use that same function to simplify our first example:

calc items products =
    items
        |> List.map
            (\item ->
                let
                    product = products |> Dict.get item.product_id |> required
                in
                { item_name = item.item_name
                , item_count = item.item_count
                , total_value = item.item_count * product.price 
                }
            )

If you just ran this Elm code it would fail at runtime when the product is missing. Morphir backends, on the other hand, can use the extra information that the required marker provides. Depending on how you decide to handle it, a backend could generate any of the following:

A SQL query with an inner join to make sure it doesn't fail at runtime.
A SQL query that returns all the items without a corresponding product.
A reference constraint in the database to prevent this from happening.
A Spark job that does an outer join and returns "error" rows when the product is missing (maybe using https://github.com/awslabs/deequ).

All the above is possible because we retain the modeler's original intent, which is not to filter out or default missing data but to avoid it.
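To picture the "error rows" option in plain Elm, here is a minimal sketch that keeps the valid rows and collects the product ids that could not be found, instead of filtering or defaulting. The names calcWithErrors, Item, Product and Row are made up for this illustration and are not part of the Morphir SDK; a real backend would decide for itself how to report the missing ids.

```elm
module RequiredSketch exposing (Item, Product, Row, calcWithErrors)

import Dict exposing (Dict)


-- Hypothetical record shapes, assumed only for this sketch.
type alias Item =
    { item_name : String, item_count : Int, product_id : String }


type alias Product =
    { id : String, price : Int }


type alias Row =
    { item_name : String, item_count : Int, total_value : Int }


{-| Returns the valid rows together with the product ids that were missing.
Nothing is silently dropped or defaulted.
-}
calcWithErrors : List Item -> Dict String Product -> ( List Row, List String )
calcWithErrors items products =
    items
        |> List.foldr
            (\item ( rows, missing ) ->
                case Dict.get item.product_id products of
                    Just product ->
                        let
                            row =
                                { item_name = item.item_name
                                , item_count = item.item_count
                                , total_value = item.item_count * product.price
                                }
                        in
                        ( row :: rows, missing )

                    Nothing ->
                        -- Record the failed lookup instead of hiding it.
                        ( rows, item.product_id :: missing )
            )
            ( [], [] )
```

The first element of the tuple corresponds to what the inner join would return, and the second to the query that lists items without a corresponding product.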

deusaquilus · 2022-02-09T20:59:49Z

deusaquilus
Feb 9, 2022

Hi @AttilaMihaly. @stephengoldbaum and @DamianReeves mentioned that perhaps I should comment.
I think the limitation of this approach is that you need to have a one-to-one mapping between items and products.
The way Quill and Slick do this kind of thing is via an implicit inner join that uses flatMap + filter. I think in Elm it would be something like this:

-- Elm has no List.flatMap; List.concatMap is the equivalent.
-- Here products is a List, not a Dict.
calc items products =
    items
        |> List.concatMap
            (\item ->
                products
                    |> List.filter
                        (\product -> product.id == item.product_id)
                    |> List.map
                        (\product ->
                            { item_name = item.item_name
                            , item_count = item.item_count
                            , total_value = item.item_count * product.price
                            }
                        )
            )

This is basically the way Quill does it:

query[Person].flatMap(p =>
  query[Address].filter(a =>
    p.id == a.person_fk
  ).map(a =>
    PersonAddress(p.name, a.zip)
  )
)

1 reply

AttilaMihaly Feb 10, 2022
Maintainer Author

Thanks for sharing this. This approach is much more in line with the semantics of SQL.

I should have clarified that the main goal here is not to be in line with SQL though. I focused on looking up one value in my example because that's the business intent. A SQL join could return duplicate rows here but that would be a bad thing from a business perspective.

So I should have given the title "How to capture the original intent of a join using Morphir".

deusaquilus · 2022-02-09T21:23:14Z

deusaquilus
Feb 9, 2022

This might be too much for what you are trying to do but, if so inclined, you can technically go even further and replace the List.filter with a join DSL that uses the exact same semantics.

It would look something like this:

-- Joins is a hypothetical join DSL module; List.concatMap is Elm's flatMap.
calc items products =
    items
        |> List.concatMap
            (\item ->
                products
                    |> Joins.right
                        (\product -> product.id == item.product_id)
                    |> List.map
                        (\product ->
                            { item_name = item.item_name
                            , item_count = item.item_count
                            , total_value = item.item_count * product.price
                            }
                        )
            )

Then for a left join the only difference is that product would be a Maybe (I barely know any Elm so pardon any mistakes).

-- Joins is a hypothetical join DSL module; List.concatMap is Elm's flatMap.
calc items products =
    items
        |> List.concatMap
            (\item ->
                products
                    |> Joins.left
                        (\product -> product.id == item.product_id)
                    -- With a left join each element is a Maybe Product.
                    |> List.map
                        (Maybe.map
                            (\product ->
                                { item_name = item.item_name
                                , item_count = item.item_count
                                , total_value = item.item_count * product.price
                                }
                            )
                        )
            )

This is exactly how Quill does it.

// Inner Join
val pa: Query[PersonAddress] =
  query[Person].flatMap(p =>
    query[Address].join(a => 
      p.id == a.person_fk
    ).map(a => // a: Address
        PersonAddress(p.name, a.zip)
    )
  )

// Outer Join
val pa: Query[Option[PersonAddress]] =
  query[Person].flatMap(p =>
    query[Address].leftJoin(a => 
      p.id == a.person_fk
    ).map(a => // a: Option[Address]
      a.map(av => // av: Address
        PersonAddress(p.name, av.zip)
      )
    )
  )

Hope this helps.

1 reply

AttilaMihaly Feb 10, 2022
Maintainer Author

Thanks for sharing this as well. The syntax is really nice. If our goal was to be aligned with SQL I would definitely go with this approach.

stephengoldbaum · 2022-02-10T13:58:06Z

stephengoldbaum
Feb 10, 2022
Maintainer

Great insights. There are good business cases for both styles. It seems like they could be nicely combined.

0 replies

AttilaMihaly · 2022-02-10T14:40:29Z

AttilaMihaly
Feb 10, 2022
Maintainer Author

I realized that there is an even more direct way to express the fact that we expect the lookup to return exactly one value. Instead of a Dict k v we can use a total function: k -> v. In practice this would mean that we change the signature from this:

calc : List Item -> Dict String Product -> List Result
calc items products =

To this:

calc : List Item -> (String -> Product) -> List Result
calc items products =

If we do this the query becomes even cleaner:

calc items products =
    items
        |> List.map
            (\item ->
                { item_name = item.item_name
                , item_count = item.item_count
                , total_value = item.item_count * (products item.product_id).price 
                }
            )

0 replies

AttilaMihaly · 2022-08-18T12:05:46Z

AttilaMihaly
Aug 18, 2022
Maintainer Author

Recently, while working on translating real-world relational data processing pipelines into Morphir, I made a few interesting observations about mapping relational concepts to FP.

The fundamental issue with joins is that they can do various things, ranging from decreasing the number of rows (filter), through not changing the number of rows (map), to even increasing the number of rows (flatMap). What's worse is that the behavior depends on the data and is not known until runtime. This is very different from FP, where you explicitly specify which one you are doing and the behavior is known much earlier, at compile time.
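A small sketch of that data dependence, written as a join over plain Elm lists. The join function and the sample data are invented for this illustration; the point is only that the same code acts like filter, map or flatMap depending on what the product list happens to contain.

```elm
module JoinCardinality exposing (join, noMatch, oneMatch, twoMatches)

-- Hypothetical record shapes, assumed only for this sketch.


type alias Item =
    { item_name : String, product_id : String }


type alias Product =
    { id : String, price : Int }


{-| An inner join expressed as concatMap + filter.
-}
join : List Item -> List Product -> List ( String, Int )
join items products =
    items
        |> List.concatMap
            (\item ->
                products
                    |> List.filter (\product -> product.id == item.product_id)
                    |> List.map (\product -> ( item.item_name, product.price ))
            )


order : List Item
order =
    [ { item_name = "widget", product_id = "p1" } ]


-- No matching product: the item disappears (filter-like).
-- Evaluates to []
noMatch =
    join order []


-- Exactly one match: one row in, one row out (map-like).
-- Evaluates to [ ( "widget", 3 ) ]
oneMatch =
    join order [ { id = "p1", price = 3 } ]


-- Duplicate product ids: one row in, two rows out (flatMap-like).
-- Evaluates to [ ( "widget", 3 ), ( "widget", 4 ) ]
twoMatches =
    join order [ { id = "p1", price = 3 }, { id = "p1", price = 4 } ]
```

Nothing in the type of join tells you which of the three cases you will get; that is decided by the rows at runtime.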

I also found that in most cases this is not something data modelers consider to be a benefit of the relational model. It causes a lot of unexpected behavior such as unintentional duplication of output rows or missing rows which leads to downstream issues. This happens when the intent of the modeler is to look up some extra information for the main entity without changing the number of rows but there are no join operators that would specifically support that. In essence, they want to map but they get filter or flatMap behavior instead.

Doing an outer join can protect you from missing rows but it doesn't protect against duplication. It also makes it look like the intent of the modeler was to assume that a piece of data is optional for a certain use case, while the real expectation might be a mandatory lookup, where a better behavior would be to report errors for missing data. But again, joins don't support that: a join cannot fail, it just doesn't return any rows.

So it looks like the FP model can actually provide benefit to data modelers by making it possible to express their intent more clearly and letting the environment deal with aligning that with reality. The most direct functional way of expressing that you want to look up some required or optional values here is simply using a total or partial function:

lookupRequiredValue : a -> b

lookupOptionalValue : a -> Maybe b

But, since most of our data is stored in relational databases all we get is a List b from the environment. To turn that into a function that returns the data we expect we need to take that List as an input and apply the ON clause to get the data we need. With some small utility functions we can turn the calculation example mentioned above into the following:

items : List Item

products : List Product

-- This represents a join with an on clause. Unlike a join though it is reusable across multiple queries.
product_of : Item -> Product
product_of item =
    Lookup.exactlyOne products
        (\product ->
            product.id == item.product_id
        )

calc : List Result
calc =
    items
        |> List.map
            (\item ->
                { item_name = item.item_name
                , item_count = item.item_count
                , total_value = item.item_count * (product_of item).price 
                }
            )

The Lookup.exactlyOne function would be in the SDK and the Morphir tools would know that if the join returns zero or multiple rows an error would need to be reported. The exact mechanism of reporting the error would be specific to the execution environment. One possible implementation would be to use an inner join to get the valid results back and run another query in parallel to report missing values. All the information is available in the logic above to be able to infer that automatically.
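For readers who want to see the three outcomes spelled out, here is a rough plain-Elm sketch of what such a lookup has to distinguish. This is not the SDK implementation: the signature above returns the value directly and leaves error reporting to the backend, whereas this sketch returns a Result only so that the zero-row and multiple-row cases are visible. The LookupError type and its constructors are assumptions made for the example.

```elm
module LookupSketch exposing (LookupError(..), exactlyOne)

{-| Hypothetical error type: why a lookup did not yield exactly one row.
-}


type LookupError
    = NoMatch
    | MultipleMatches Int


{-| Apply the ON-clause predicate and succeed only when exactly one row matches.
-}
exactlyOne : List a -> (a -> Bool) -> Result LookupError a
exactlyOne rows predicate =
    case List.filter predicate rows of
        [ single ] ->
            Ok single

        [] ->
            -- An inner join would silently drop the row here.
            Err NoMatch

        matches ->
            -- A join would silently duplicate the row here.
            Err (MultipleMatches (List.length matches))
```

With this shape, exactlyOne products (\product -> product.id == item.product_id) evaluates to Ok product, Err NoMatch or Err (MultipleMatches n), which are exactly the cases the tooling would need to tell apart.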

0 replies

deusaquilus · 2022-08-28T06:05:02Z

deusaquilus
Aug 28, 2022

@AttilaMihaly I think I'm looking at the problem from the other end. Say that instead of starting from a greenfield ability to model business data you have a pre-existing ETL codebase of large queries with deeply nested left/right joins etc... in a large repository of multi-page-long queries. Trying to re-model that kind of codebase has huge risks so it remains "locked up" in SQL for years or even decades. On the other hand, if such a codebase could be hand-transcribed into an SQL-like DSL that contains the same kinds of constructs e.g. left/right joins etc... then suddenly it would become fully portable across technologies with a minimal amount of risk. I think there is tremendous value in that.

2 replies

AttilaMihaly Aug 29, 2022
Maintainer Author

Thank you for sharing this. We are actually thinking along the same lines, I just failed to share the full context in this discussion. While working on similar use-cases to what you describe above we created APIs to model joins directly so that large queries can be directly translated. Once it's in Morphir the execution becomes more flexible.

What I described above is the next step that would improve the quality of the solution by aligning the model to the original intent. This is an important step because even though we can run a join in any environment, running it on the JVM with collections will be much less efficient than it could be if we didn't have to replicate join semantics. And it's not just efficiency. Even our insight/transparency tooling would benefit hugely from knowing when the intent is to look up a single value rather than to join two data sets.

AttilaMihaly Sep 1, 2022
Maintainer Author

I just realized that the API we created happens to exactly match applicative-joins in Quill.
