NeuScraper

Source code for our ACL'24 paper:
Cleaner Pretraining Corpus Curation with Neural Web Scraping

If you find this work useful, please cite our paper and star the repository.

Quick Start

1️⃣ Clone from git

git clone https://github.com/OpenMatch/NeuScraper
cd NeuScraper

2️⃣ Data

ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies.

The ClueWeb22 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained by signing a data license agreement with Carnegie Mellon University. For details on how to obtain it, see:

https://www.lemurproject.org/clueweb22/obtain.php

3️⃣ Environment

Install PyTorch first:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Install the other packages:

pip install -r requirements.txt
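
After installation, a quick check can confirm that the pinned PyTorch build is importable and can see a GPU. The snippet below is a minimal sketch: the helper name `describe_torch` is hypothetical and not part of this repository, and it only reports what it finds without enforcing a version.

```python
import importlib.util


def describe_torch():
    """Report whether torch is importable, its version, and CUDA visibility.

    Hypothetical helper for illustration; not part of NeuScraper.
    """
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "version": None, "cuda_available": False}

    import torch  # imported lazily, only when the package is present

    return {
        "installed": True,
        "version": torch.__version__,  # the install command above pins 1.9.1+cu111
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch())
```

If `cuda_available` is reported as false on a GPU server, the CUDA build of PyTorch probably does not match the installed driver.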

Deploy NeuScraper on Your GPU Server

1️⃣ Open the deployment directory

cd NeuScraper/app

2️⃣ Fill in the model path in app

args.model_path = "/path/to/your/model"

3️⃣ Deploy NeuScraper

uvicorn app:app --reload --host 0.0.0.0 --port 1688
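
Loading the model can take a while, so a client script may want to wait until the server accepts connections before sending requests. The following is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and it only checks that the TCP port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds within `timeout` seconds.
    Hypothetical helper for illustration; not part of NeuScraper.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection; success means the port is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the command above, `wait_for_port("127.0.0.1", 1688)` would be the matching call when the client runs on the same machine as the server.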

4️⃣ Call the service from Python:

import requests

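# Endpoint of the server started in step 3. If your client cannot connect to
# 0.0.0.0, replace it with the server's address (e.g. 127.0.0.1 on the same machine).
endpoint = 'http://0.0.0.0:1688/predict/'
data = {
    'url': 'https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/'
}

response = requests.post(endpoint, json=data)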

if response.status_code == 200:
    print('Success!')
    print(response.json())
else:
    print('Failed to call API')
    print('Status code:', response.status_code)
    print('Response:', response.text)
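
For scripts that call the service repeatedly, it can help to wrap the request in a function with an explicit timeout. The sketch below uses only the standard library, so it also runs where `requests` is not installed. The endpoint path and the `{"url": ...}` payload are taken from the example above; the function name `extract_page`, the default address, and the timeout value are assumptions, and the response is returned as-is because its schema is not documented here.

```python
import json
import urllib.request


def extract_page(page_url, endpoint="http://127.0.0.1:1688/predict/", timeout=60.0):
    """POST {"url": page_url} to the endpoint and return the decoded JSON reply.

    Raises urllib.error.HTTPError on 4xx/5xx responses and
    urllib.error.URLError if the server cannot be reached.
    Hypothetical helper for illustration; not part of NeuScraper.
    """
    body = json.dumps({"url": page_url}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would typically loop over a list of page URLs and catch `urllib.error.URLError` per page, so that one failing page does not abort the whole batch.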

Reproduction

1️⃣ Download checkpoint for NeuScraper

git lfs install
git clone https://huggingface.co/OpenMatch/neuscraper-v1-clueweb
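
If Git LFS is not set up correctly, the clone can succeed while the large files are left as small text pointers instead of real weights. The sketch below scans a cloned directory for such pointer files by checking for the first line that every Git LFS pointer starts with. `find_lfs_pointers` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(checkpoint_dir):
    """Return files under checkpoint_dir that are still Git LFS pointers, sorted by path.

    Hypothetical helper for illustration; not part of NeuScraper.
    """
    root = Path(checkpoint_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("neuscraper-v1-clueweb")` after the clone should return an empty list when the weights were downloaded. A non-empty list suggests running `git lfs pull` inside the cloned directory.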

2️⃣ Preprocess the test data. We use en0001-01 as our test set.