feat(core): add `crawler.exportData()` helper #2166

B4nan · 2023-11-06T12:57:24Z

Retrieves all the data from the default crawler `Dataset` and exports it in the specified format. The supported formats are currently `json` and `csv`; the format is inferred from the `path` automatically.

const crawler = new BasicCrawler({ ... });
await crawler.pushData({ ... });

await crawler.exportData('./data.csv');
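
To illustrate what "inferred from the path" could mean in practice, here is a minimal sketch. It assumes the format is taken from the file extension; `inferExportFormat` and `ExportFormat` are hypothetical names used only for this example and are not part of the Crawlee API or necessarily how the PR implements it.

```typescript
import { extname } from 'node:path';

// Hypothetical type for illustration: the two formats named in the description.
type ExportFormat = 'json' | 'csv';

// Hypothetical helper: derives the export format from the file extension of `path`.
function inferExportFormat(path: string): ExportFormat {
    // extname('./data.csv') returns '.csv'; drop the dot and normalize the case.
    const ext = extname(path).toLowerCase().slice(1);
    if (ext === 'json' || ext === 'csv') return ext;
    throw new Error(`Cannot infer export format from "${path}", expected a .json or .csv file`);
}
```

With such a helper, `./data.csv` maps to `csv`, while a path without a recognized extension is rejected early instead of producing a file in an unexpected format.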

B4nan · 2023-11-06T12:59:38Z

Once released, this will be used in the homepage example to simplify it:

import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // ...
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

-// Export the whole dataset to a single file in `./storage/key_value_stores/result.csv`.
-const dataset = await crawler.getDataset();
-await dataset.exportToCSV('result');
+// Export the whole dataset to a single file in `./result.csv`.
+await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);

B4nan · 2023-11-06T13:23:45Z

Tests on Node 16 were broken after the Yarn v4 upgrade, so I fixed that in this PR too.


B4nan · 2023-11-06T14:27:47Z

@vladfrangu any idea why that test keeps failing in the CI? It works just fine for me locally. I thought it was something with the relative paths and even tried a 500ms wait, but it still fails the same way.

vladfrangu · 2023-11-06T14:29:50Z

Huh, it works locally? I'll take a look


janbuchar

LGTM, this seems pretty helpful. Are there any review guidelines I should follow?

B4nan · 2023-11-08T08:57:55Z

> LGTM, this seems pretty helpful. Are there any review guidelines I should follow?

Not really, did you use some previously? Happy to hear suggestions. Some time ago we were also discussing with @vdusek that we could have a PR template with a checklist (or a link to a guidelines page).

Some things I usually look for:

- meaningful and consistent naming (it's always hard, we all know it)
- location of the code changes (sometimes things are in the wrong place, e.g. the wrong package or the wrong level of abstraction)
- hard-to-understand code (both at runtime and at the type level); things can usually be broken down into more methods to make them easier to read and understand
- undeclared dependencies (a common source of bugs in monorepos that shows up when users use a package manager other than npm, which somehow hides the problem)
- missing tests (things might be hard to test; in such a case the PR author should say how they tested things if automated tests are missing)
- missing docs (we generate the API docs from the JSDoc comments, so having those is usually enough, but for bigger features we should have examples or guides too)

janbuchar · 2023-11-08T13:10:27Z

Thanks, that seems reasonable. Every place I worked at did something pretty close to this 🙂

vdusek · 2023-11-08T16:34:43Z

> Not really, did you use some previously? Happy for suggestions, we were also discussing some time ago with @vdusek that we could have a PR template with some checklist (or a link to some guidelines page).

Yeah, I was thinking of just a simple `pull_request_template.md` with a few sections that outline the purpose of the pull request, the solution it provides, a reference to the associated issue, the testing process, the steps for release, and guidance on understanding the changes and how to review them.

Given that brief, an LLM suggests the following 🙂:

## Description

[Provide a brief description of the problem or feature this pull request addresses.]

## Solution

[Explain how this pull request solves the problem or implements the feature.]

## Issue

[Link to the related issue (e.g., `Closes #123` or `Fixes #456`).]

## Testing

[Describe how the changes have been tested. Include any relevant test cases or steps.]

## Release Steps

[Outline the steps required to release these changes. Include any deployment or configuration steps if applicable.]

## Review Guidance

[Offer tips and guidance for reviewers on how to approach and assess the changes. Explain any specific areas or considerations to focus on.]

## Checklist

- [ ] Code has been reviewed
- [ ] Tests pass successfully
- [ ] Documentation has been updated (if necessary)
- [ ] Issue is closed (if applicable)
- [ ] All discussions are resolved

## Additional Information

[Any additional information, context, or screenshots that might be helpful in reviewing this pull request.]

Of course, only where the sections make sense in the context of the PR. I'm not saying every PR has to contain all of them.

B4nan added the adhoc label Nov 6, 2023

github-actions bot assigned B4nan Nov 6, 2023

github-actions bot added this to the 76th sprint - Tooling team milestone Nov 6, 2023

github-actions bot added the t-tooling and tested labels Nov 6, 2023

B4nan requested review from janbuchar and vladfrangu November 6, 2023 12:58

B4nan force-pushed the export-data branch 4 times, most recently from 69c2931 to 2d843a1 November 6, 2023 13:21

B4nan force-pushed the export-data branch from 2d843a1 to fa69d7e November 6, 2023 13:34

vladfrangu requested changes Nov 6, 2023

packages/basic-crawler/src/internals/basic-crawler.ts (outdated, resolved)

packages/basic-crawler/src/internals/basic-crawler.ts (outdated, resolved)

B4nan added 2 commits November 6, 2023 14:49

try to fix tests by using absolute paths

7523a49

fix linter and add validation to crawler.getDataset()

bbdb790

vladfrangu and others added 4 commits November 6, 2023 16:34

chore: try to fix test

b9bb35d

chore: make tmp folder

3797054

chore: remove .only

f043f68

create missing folders automatically

4429306

vladfrangu requested changes Nov 6, 2023

test/core/crawlers/basic_crawler.test.ts (resolved)

revert back to the async crawler.getDataset()

8724020

The sync version was introduced for easier chaining, but with `crawler.exportData()` we don't really need it.

janbuchar reviewed Nov 6, 2023

.github/workflows/test-ci.yml (outdated, resolved)

packages/basic-crawler/src/internals/basic-crawler.ts (outdated, resolved)

improve CI setup

49973f7

B4nan force-pushed the export-data branch from 3be760e to 49973f7 November 7, 2023 16:40

B4nan requested a review from vladfrangu November 7, 2023 17:42

B4nan requested a review from janbuchar November 7, 2023 17:42

vladfrangu approved these changes Nov 7, 2023


janbuchar approved these changes Nov 8, 2023


B4nan merged commit c8c09a5 into master Nov 8, 2023
8 checks passed

B4nan deleted the export-data branch November 8, 2023 09:24
