Data Engineer Test: Sizes & Variants

Background

One of our core responsibilities at Trouva is to provide an easy-to-use inventory management solution to our boutiques. Being able to provide stock information on sizes and other variants is fundamental to keeping stock accuracy high. A couple of years ago we opened up third-party integrations (commonly known as feeds) as a temporary solution for boutiques that have an existing system in place, until we're able to provide a full in-store POS solution.

Every few hours an automated system runs and syncs the feed with the inventory information we hold. It creates new entities (products, sizes, etc.) if they don't exist, or updates the stock levels of existing inventory. A direct consequence of the feed sync is that we extend our universe of sizes with whatever information comes through the feeds. If, for example, a feed has a size "s" and we only have "small" in our database, the system will generate a new size "s". Because there are no checks at all, even "Small", "smAll" and "smaLL" will be different sizes. To add insult to injury, we don't have control over what boutique owners put in this field, so we do not even know whether a value is a size; it could be a colour, a material or something else.
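To make the duplication problem concrete, here is a minimal sketch of the difference between the exact matching described above and matching on a normalised key. The `normalise_size` helper and the sample values are illustrative assumptions, not part of the existing sync system, and normalisation on its own would still treat "s" and "small" as different sizes.

```python
def normalise_size(raw: str) -> str:
    """Hypothetical helper: trim, collapse whitespace and ignore case."""
    return " ".join(raw.split()).casefold()


# Illustrative data only: what we hold versus what a feed sends.
existing = {"small"}
incoming = ["Small", "smAll", "smaLL", "s"]

# Exact matching: every spelling that is not byte-for-byte identical
# to an existing size is treated as a new size.
exact_new = [size for size in incoming if size not in existing]

# Matching on a normalised key: case variants collapse onto "small",
# so only genuinely unseen values remain.
normalised_existing = {normalise_size(size) for size in existing}
normalised_new = sorted(
    {normalise_size(size) for size in incoming} - normalised_existing
)

print(exact_new)       # ['Small', 'smAll', 'smaLL', 's']
print(normalised_new)  # ['s']
```

Mapping abbreviations such as "s" onto "small" would need an explicit lookup table on top of this, which is a separate design decision.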

Objective

Help us sanitise the current data as a first step towards full standardisation.

Current challenges

As you can see in ./mongo-seed/sizes.json, the current process has introduced a wide-ranging set of issues, for example ambiguity (is it a size, a material, a colour?) and duplication ("s", "small", "SMALL"). This results in challenges across the business:

Our size filters on Trouva.com often don't make much sense, which degrades the customer experience
It's quite hard for us to streamline communication with the boutiques, for example to promote re-stocking behaviour
On the business intelligence side, it's hard to provide the merchandising team with insights on what sizes sell well
We aren't able to distinguish between sizes, materials, colours etc., which reduces our capabilities on the consumer side

We won't be able to solve all of this in one go, but as a data engineer you'll be looped in to help figure out how we can standardise our size universe and streamline our process in small, incremental steps.

Potential action plan

Analyse ./mongo-seed/sizes.json
Identify core issues
Outline the strategy you would propose to cleanse the data
Perform a data cleansing operation
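As one possible starting point for the analysis step, the sketch below groups raw values by a normalised key and reports the groups that contain more than one spelling. It assumes each record is a dictionary with a `name` field; that field name, the `duplicate_groups` helper and the sample records are all hypothetical, so check the real structure of ./mongo-seed/sizes.json before relying on them.

```python
from collections import defaultdict


def normalise(raw) -> str:
    """Hypothetical key: trim, collapse whitespace and ignore case."""
    return " ".join(str(raw).split()).casefold()


def duplicate_groups(records, field="name"):
    """Return {normalised key: sorted raw spellings} for keys with 2+ spellings.

    `field` is an assumed attribute name, not confirmed by the seed file.
    """
    groups = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value is None:
            continue  # skip records without the assumed field
        groups[normalise(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Made-up sample standing in for the contents of sizes.json.
sample = [
    {"name": "s"},
    {"name": "small"},
    {"name": "SMALL"},
    {"name": "Small "},
    {"name": "Cotton"},
]

print(duplicate_groups(sample))
# {'small': ['SMALL', 'Small ', 'small']}
```

A report like this only surfaces case and whitespace duplicates. Separating sizes from colours or materials needs additional rules or reference lists.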

Key results

A document, deck or handwritten napkin outlining the core issues you've identified & the strategy you'd follow to improve
A script or program performing the data cleansing operation you'd run first
[BONUS] The second most important data cleansing operation you'd run

Notes

We assume that you're going to use Python for your technical solution; however, it's up to you to choose the right tool for the job
You don't need to wrap your solution into our docker-compose structure (unless you want to)
Keep your code in the ./size-sanitizer folder; you can copy ./mongo-seed/sizes.json over if you'd like

Submitting the challenge

Please clone/fork this repo to your personal computer and create a branch to hold your work, committing to Git as you go.

Once you are ready to submit, we would like you to create a zip file of your branch and email it over to us.

To create a Zip archive of your branch use: git archive -o ./trouva-test.zip <your_branch_name>
Then please email it to emma.karydi@trouva.com
