Snorkel Drybell Example

This example is based on the Snorkel Drybell project, a collaboration between the Snorkel team and Google to implement weak supervision at industrial scale. You can read more in the blog post and research paper (SIGMOD Industry, 2019). The paper used a running example of classifying documents as containing a celebrity mention or not, which is what we use here as well. The data is a very small set of six faux newspaper articles and titles, stored as a Parquet file:

Title                                       Body
-----                                       ----
Sports team wins the game!                  It was an exciting game. The team won at the end.
Jennifer Smith donates entire fortune.      She has a lot of money. Now she has less, because...
...

Of course, with such a small (and very fake) dataset, we don't expect to produce high-quality models. The goal here is to demonstrate how Snorkel can be used in a large-scale production setting. We present two scripts, one using Snorkel's Dask interface and one using Snorkel's Spark interface, that represent how Snorkel can be deployed as part of a pipeline. We also demonstrate Snorkel's NLPLabelingFunction interface, similar to the NLPLabelingFunction template presented in the Drybell paper.
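To make the idea of weak supervision concrete, here is a minimal plain-Python sketch of what labeling functions do on this kind of data. It deliberately does not use Snorkel's API: the function names, the `KNOWN_CELEBRITIES` set, and the majority-vote step are all assumptions made for illustration, and majority vote is only a simple stand-in for the label model Snorkel would fit. The two rows are copied from the sample table above, with the truncated body left as-is.

```python
from collections import Counter

# Label conventions assumed for this sketch
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Two rows copied from the sample table above
ARTICLES = [
    {"title": "Sports team wins the game!",
     "body": "It was an exciting game. The team won at the end."},
    {"title": "Jennifer Smith donates entire fortune.",
     "body": "She has a lot of money. Now she has less, because..."},
]

# Hypothetical lookup table, not part of the original example
KNOWN_CELEBRITIES = {"Jennifer Smith"}


def lf_title_mentions_known_celebrity(article):
    """Vote POSITIVE if the title contains a name from the lookup table."""
    hit = any(name in article["title"] for name in KNOWN_CELEBRITIES)
    return POSITIVE if hit else ABSTAIN


def lf_mentions_money(article):
    """Vote POSITIVE on a crude wealth heuristic, otherwise abstain."""
    hit = "fortune" in article["title"].lower() or "money" in article["body"].lower()
    return POSITIVE if hit else ABSTAIN


def lf_sports_title(article):
    """Vote NEGATIVE if the title looks like a sports story."""
    title = article["title"].lower()
    return NEGATIVE if ("team" in title or "game" in title) else ABSTAIN


def apply_lfs(articles, lfs):
    """Build a label matrix: one row per article, one column per function."""
    return [[lf(article) for lf in lfs] for article in articles]


def majority_vote(row):
    """Combine one row of votes, ignoring abstains; ties resolve to ABSTAIN."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


LFS = [lf_title_mentions_known_celebrity, lf_mentions_money, lf_sports_title]
label_matrix = apply_lfs(ARTICLES, LFS)
labels = [majority_vote(row) for row in label_matrix]

if __name__ == "__main__":
    for article, row, label in zip(ARTICLES, label_matrix, labels):
        print(article["title"], row, label)
```

Each function votes on an article or abstains, so the label matrix can be computed row by row. That independence between rows is what makes the labeling step straightforward to distribute with an engine such as Dask or Spark.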

If you plan to execute these scripts, do so from the snorkel-tutorials directory:

python3 drybell/drybell_dask.py

# or

python3 drybell/drybell_spark.py