Welcome to the take-home coding exercise for the FullContact Identity Resolution team hiring pipeline. This is intended to be a short, straightforward exercise to help us understand your skill level with a few of the technologies and tasks we most frequently work with.
- Duration: About 1 to 4 hours
- Subjects: Scala, Apache Spark
- Resources: Yourself, existing online documentation
- Read Records.txt and Queries.txt
- Utilize spark to process them to produce
- Output1: IDs from Queries.txt with records they appear in, separated by a colon
- Output2: IDs from Queries.txt with the deduplicated union of records they appear in, separated by a colon
- Include documentation on how to run your solution from a terminal at the root of the project
- Include a test suite
This repository includes a basic gradle buildscript, with dependencies on scala and spark already declared.
The buildscript also includes a scalatest-based test scaffold, invokable with ./gradlew test
.
Whether you use the provided scaffold or not,
you should include a test suite invokable with ./gradlew test
.
The application
plugin has been applied to the project so that
./gradlew run
will invoke the main
method of com.fullcontact.interview.RecordFinder
,
where you can write your solution.
You may modify the buildscript and codebase as you see fit,
(e.g., add dependencies, add classes, move/modify/duplicate main
method, etc.).
However, we ask that you document how to run your submission
from a terminal session at the root of the project,
whether that is by ./gradlew run
or otherwise.
We have seen, in practice, that running the project in a Windows environment can be more challenging than on OSX or a Linux distro. Notably, PATH length limitations on Windows make the large classpaths that build up on JVM invocations not function properly. Workarounds exist, though we recommend using an alternative system or a VM to run the project, if possible.
In the course of our work,
we receive multiple datasets of observed, but incomplete, contact records.
e.g., we might have an input record of
(bart@fullcontact.com, linkedin:lorangb, +2025551234)
,
indicating that those three pieces of contact information represent the same person.
If we later have a record of
(bart@fullcontact.com, +2025557890)
in addition to the first, we want to retrieve both records
when searching for bart@fullcontact.com
.
Your task is to replicate this behavior with a small dataset and simplified matching logic.
You are provided two datasets of randomized IDs:
Records.txt (~50k lines) |
---|
ZVREMGG HLLCCNX |
BROOFJY UBXZQKD ATCZUTP |
WONIWXW NPXNWKZ RHUJKEY QULQKGC LXZFLFQ |
CFTGDGD QTBSKQW |
MEKQMTV LOATEQG HEVOKKP YQLZEAY NPRMWRX |
... |
Queries.txt (~100k lines) |
---|
JKXBHJJ |
TSZRMKP |
HLLCCNX |
MUXCODO |
QIAXOON |
... |
The random identifiers stand in for pieces of contact information. Identifiers are 7-character strings using the uppercase ASCII alphabet.
Records.txt is a file where each line is a space-separated list of IDs indicating that those IDs were observed together in some dataset.
Queries.txt is a file where each line is an ID which we want to find all associated records for.
Given these two inputs, create a submission which utilizes spark in local mode to produce two deliverable outputs:
Output1.txt/part-00000 |
---|
HLLCCNX:ZVREMGG HLLCCNX |
LMLHENN:LMLHENN ZETNBFX |
OOFDUJC:HAWCESV OOFDUJC MNVQEIN TJWEWHT |
FSNMWAF:DJBEEPL FSNMWAF |
FSNMWAF:FSNMWAF MLEAGKE |
... |
Output2.txt/part-00000 |
---|
HLLCCNX:ZVREMGG HLLCCNX |
LMLHENN:LMLHENN ZETNBFX |
OOFDUJC:HAWCESV OOFDUJC MNVQEIN TJWEWHT |
FSNMWAF:DJBEEPL FSNMWAF MLEAGKE |
... |
Output1.txt is a text file where each row is
an ID from Queries.txt,
followed by the :
character,
followed by a row from Records.txt where the query ID appeared.
Query IDs may appear multiple times on the left side of this output.
Output2.txt is a text file where each row is
an ID from Queries.txt,
followed by the :
character,
followed by the deduplicated set of the union of
all IDs from rows from Records.txt where the query ID appeared.
Query IDs must not appear multiple times on the left side of this output.
ID FSNMWAF
demonstrates the difference between the two outputs in the samples above.