Identity Resolution Coding Exercise

Welcome to the take-home coding exercise for the FullContact Identity Resolution team hiring pipeline. This is intended to be a short, straightforward exercise to help us understand your skill level with a few of the technologies and tasks we most frequently work with.

Expectations:

Duration: About 1 to 4 hours
Subjects: Scala, Apache Spark
Resources: Yourself, existing online documentation

Assignment Summary

- Read Records.txt and Queries.txt.
- Use Spark to process them and produce:
  - Output1: IDs from Queries.txt with the records they appear in, separated by a colon
  - Output2: IDs from Queries.txt with the deduplicated union of the records they appear in, separated by a colon
- Include documentation on how to run your solution from a terminal at the root of the project.
- Include a test suite.

The Scaffolding

This repository includes a basic Gradle buildscript, with dependencies on Scala and Spark already declared.

The buildscript also includes a ScalaTest-based test scaffold, invokable with ./gradlew test. Whether or not you use the provided scaffold, you should include a test suite invokable with ./gradlew test.

The application plugin has been applied to the project so that ./gradlew run will invoke the main method of com.fullcontact.interview.RecordFinder, where you can write your solution.

You may modify the buildscript and codebase as you see fit (e.g., add dependencies, add classes, move/modify/duplicate the main method, etc.). However, we ask that you document how to run your submission from a terminal session at the root of the project, whether that is by ./gradlew run or otherwise.
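For orientation, the sketch below shows one way an entry point could start Spark in local mode and load the two input files. It is only a skeleton under stated assumptions, not the expected solution: the object name RecordFinderSketch and the helper parseRecord are hypothetical, it assumes a Spark version that provides SparkSession (2.0 or later), and it assumes Records.txt and Queries.txt are read from the working directory.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical skeleton; the scaffold's real entry point is
// com.fullcontact.interview.RecordFinder.
object RecordFinderSketch {

  // Split one line of Records.txt into its space-separated IDs.
  def parseRecord(line: String): List[String] =
    line.trim.split("\\s+").filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val spark = SparkSession.builder()
      .appName("RecordFinderSketch")
      .master("local[*]")
      .getOrCreate()

    try {
      val sc = spark.sparkContext

      // Assumption: both input files sit in the working directory.
      val records = sc.textFile("Records.txt").map(parseRecord)
      val queries = sc.textFile("Queries.txt").map(_.trim).filter(_.nonEmpty)

      println(s"records=${records.count()}, queries=${queries.count()}")

      // The matching logic is deliberately left out. Saving an RDD that has a
      // single partition with saveAsTextFile("Output1.txt") creates a
      // directory named Output1.txt containing a part-00000 file.
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the line parsing in a small pure function such as parseRecord makes it easy to cover in the test suite without starting Spark.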

In practice, we have seen that running the project on Windows can be more challenging than on macOS or a Linux distribution. Notably, PATH length limitations on Windows can cause JVM invocations with large classpaths to fail. Workarounds exist, though we recommend using an alternative system or a VM to run the project, if possible.

The Scenario

In the course of our work, we receive multiple datasets of observed, but incomplete, contact records. For example, we might have an input record of (bart@fullcontact.com, linkedin:lorangb, +2025551234), indicating that those three pieces of contact information represent the same person. If we later receive a record of (bart@fullcontact.com, +2025557890) in addition to the first, we want to retrieve both records when searching for bart@fullcontact.com.

Your task is to replicate this behavior with a small dataset and simplified matching logic.

You are provided two datasets of randomized IDs:

Records.txt (~50k lines)
`ZVREMGG HLLCCNX`
`BROOFJY UBXZQKD ATCZUTP`
`WONIWXW NPXNWKZ RHUJKEY QULQKGC LXZFLFQ`
`CFTGDGD QTBSKQW`
`MEKQMTV LOATEQG HEVOKKP YQLZEAY NPRMWRX`
...

Queries.txt (~100k lines)
`JKXBHJJ`
`TSZRMKP`
`HLLCCNX`
`MUXCODO`
`QIAXOON`
...

The random identifiers stand in for pieces of contact information. Identifiers are 7-character strings using the uppercase ASCII alphabet.

Records.txt is a file where each line is a space-separated list of IDs indicating that those IDs were observed together in some dataset.

Queries.txt is a file where each line is an ID for which we want to find all associated records.

Given these two inputs, create a submission which uses Spark in local mode to produce two deliverable outputs:

Output1.txt/part-00000
`HLLCCNX:ZVREMGG HLLCCNX`
`LMLHENN:LMLHENN ZETNBFX`
`OOFDUJC:HAWCESV OOFDUJC MNVQEIN TJWEWHT`
`FSNMWAF:DJBEEPL FSNMWAF`
`FSNMWAF:FSNMWAF MLEAGKE`
...

Output2.txt/part-00000
`HLLCCNX:ZVREMGG HLLCCNX`
`LMLHENN:LMLHENN ZETNBFX`
`OOFDUJC:HAWCESV OOFDUJC MNVQEIN TJWEWHT`
`FSNMWAF:DJBEEPL FSNMWAF MLEAGKE`
...

Output1.txt is a text file where each row is an ID from Queries.txt, followed by the : character, followed by a row from Records.txt where the query ID appeared. Query IDs may appear multiple times on the left side of this output.

Output2.txt is a text file where each row is an ID from Queries.txt, followed by the : character, followed by the deduplicated set of the union of all IDs from rows from Records.txt where the query ID appeared. Query IDs must not appear multiple times on the left side of this output.

ID FSNMWAF demonstrates the difference between the two outputs in the samples above.
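To make the difference concrete, the following sketch reproduces both output formats for a few of the sample rows using plain Scala collections. It only illustrates the expected semantics and is not a Spark solution. The object and function names are hypothetical, and three behaviors are assumptions not stated in the exercise: a query ID that is repeated in Queries.txt is processed once, a query ID with no matching record produces no row, and IDs in Output2 are listed in first-seen order.

```scala
// Hypothetical illustration of the two output formats on in-memory data.
object OutputSemanticsSketch {

  // Each record is one line of Records.txt split into IDs (sample rows).
  val records: Seq[Seq[String]] = Seq(
    Seq("ZVREMGG", "HLLCCNX"),
    Seq("DJBEEPL", "FSNMWAF"),
    Seq("FSNMWAF", "MLEAGKE")
  )

  // JKXBHJJ appears in no sample record above.
  val queries: Seq[String] = Seq("HLLCCNX", "FSNMWAF", "JKXBHJJ")

  // Output1: one row per (query ID, matching record) pair.
  def output1(queries: Seq[String], records: Seq[Seq[String]]): Seq[String] =
    for {
      q <- queries.distinct          // assumption: repeated queries count once
      r <- records if r.contains(q)
    } yield s"$q:${r.mkString(" ")}"

  // Output2: one row per query ID, with the deduplicated union of all IDs
  // from every record that contains the query ID.
  def output2(queries: Seq[String], records: Seq[Seq[String]]): Seq[String] =
    queries.distinct
      .map(q => q -> records.filter(_.contains(q)).flatten.distinct)
      .collect { case (q, ids) if ids.nonEmpty => s"$q:${ids.mkString(" ")}" }

  def main(args: Array[String]): Unit = {
    output1(queries, records).foreach(println)
    println("---")
    output2(queries, records).foreach(println)
  }
}
```

With this data, FSNMWAF yields two rows in the first output and a single merged row, FSNMWAF:DJBEEPL FSNMWAF MLEAGKE, in the second.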
