Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): add crawler.exportData() helper #2166

Merged
merged 9 commits into from
Nov 8, 2023
Merged

feat(core): add crawler.exportData() helper #2166

merged 9 commits into from
Nov 8, 2023

Conversation

B4nan
Copy link
Member

@B4nan B4nan commented Nov 6, 2023

Retrieves all the data from the default crawler Dataset and exports them to the specified format. Supported formats are currently 'json' and 'csv', and will be inferred from the path automatically.

const crawler = new BasicCrawler({ ... });
crawler.pushData({ ... });

await crawler.exportData('./data.csv');

@B4nan B4nan added the adhoc Ad-hoc unplanned task added during the sprint. label Nov 6, 2023
@github-actions github-actions bot added this to the 76th sprint - Tooling team milestone Nov 6, 2023
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Nov 6, 2023
@B4nan B4nan requested review from janbuchar and vladfrangu November 6, 2023 12:58
@B4nan
Copy link
Member Author

B4nan commented Nov 6, 2023

Once released, this will be used in the homepage example to simplify it:

import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // ...
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

-// Export the whole dataset to a single file in `./storage/key_value_stores/result.csv`.
-const dataset = await crawler.getDataset();
-await dataset.exportToCSV('result');
+// Export the whole dataset to a single file in `./result.csv`.
+await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);

@B4nan B4nan force-pushed the export-data branch 4 times, most recently from 69c2931 to 2d843a1 Compare November 6, 2023 13:21
@B4nan
Copy link
Member Author

B4nan commented Nov 6, 2023

Tests on node 16 were broken after yarn v4 upgrade, so fixed that too in the PR.

Retrieves all the data from the default crawler `Dataset` and exports them to the specified format.
Supported formats are currently 'json' and 'csv', and will be inferred from the `path` automatically.

```ts
const crawler = new BasicCrawler({ ... });
crawler.pushData({ ... });

await crawler.exportData('./data.csv');
```
@B4nan
Copy link
Member Author

B4nan commented Nov 6, 2023

@vladfrangu any idea why that test keeps failing in the CI? it works locally just fine for me, I thought its something with the relative paths, even tried a 500ms wait, but it still fails the same

@vladfrangu
Copy link
Member

Huh, it works locally? I'll take a look

the sync version was introduced for easier chaining, but with the `crawler.exportData()` we dont really need it
.github/workflows/test-ci.yml Outdated Show resolved Hide resolved
packages/basic-crawler/src/internals/basic-crawler.ts Outdated Show resolved Hide resolved
@B4nan B4nan requested a review from janbuchar November 7, 2023 17:42
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this seems pretty helpful. Are there any review guidelines I should follow?

@B4nan
Copy link
Member Author

B4nan commented Nov 8, 2023

LGTM, this seems pretty helpful. Are there any review guidelines I should follow?

Not really, did you use some previously? Happy for suggestions, we were also discussing some time ago with @vdusek that we could have a PR template with some checklist (or a link to some guidelines page).

Some things I usually look for:

  • meaningful and consistent naming (its always hard, we all know it)
  • location of the code changes (sometimes things are on a bad place, e.g. wrong package, or wrong level of abstraction)
  • hard to understand code (both runtime and type level), things can be usually broken down to more methods to make things easier to read/understand
  • undeclared dependencies (a common source of bugs with monorepos, if users use other package managers than NPM which quirks this somehow)
  • missing tests (things might be hard to test, in such case the PR author should say how they tested things if automated tests are missing)
  • missing docs (we generate API docs from the JSDoc comments, so having those is usually enough, but for bigger features we should have examples or guides too)

@B4nan B4nan merged commit c8c09a5 into master Nov 8, 2023
8 checks passed
@B4nan B4nan deleted the export-data branch November 8, 2023 09:24
@janbuchar
Copy link
Contributor

LGTM, this seems pretty helpful. Are there any review guidelines I should follow?

Not really, did you use some previously? Happy for suggestions, we were also discussing some time ago with @vdusek that we could have a PR template with some checklist (or a link to some guidelines page).

Some things I usually look for:

  • meaningful and consistent naming (its always hard, we all know it)
  • location of the code changes (sometimes things are on a bad place, e.g. wrong package, or wrong level of abstraction)
  • hard to understand code (both runtime and type level), things can be usually broken down to more methods to make things easier to read/understand
  • undeclared dependencies (a common source of bugs with monorepos, if users use other package managers than NPM which quirks this somehow)
  • missing tests (things might be hard to test, in such case the PR author should say how they tested things if automated tests are missing)
  • missing docs (we generate API docs from the JSDoc comments, so having those is usually enough, but for bigger features we should have examples or guides too)

Thanks, that seems reasonable. Every place I worked at did something pretty close to this 🙂

@vdusek
Copy link
Contributor

vdusek commented Nov 8, 2023

Not really, did you use some previously? Happy for suggestions, we were also discussing some time ago with @vdusek that we could have a PR template with some checklist (or a link to some guidelines page).

Yeah, I was thinking of just a simple pull_request_template.md with a few sections, that outline the purpose of the pull request, the solution it provides, a reference to the associated issue, the testing process, the steps for release, and helpful guidance on understanding the changes and how to review them.

Based on this assignment, LLM provides the following 🙂:

## Description

[Provide a brief description of the problem or feature this pull request addresses.]

## Solution

[Explain how this pull request solves the problem or implements the feature.]

## Issue

[Link to the related issue (e.g., `Closes #123` or `Fixes #456`).]

## Testing

[Describe how the changes have been tested. Include any relevant test cases or steps.]

## Release Steps

[Outline the steps required to release these changes. Include any deployment or configuration steps if applicable.]

## Review Guidance

[Offer tips and guidance for reviewers on how to approach and assess the changes. Explain any specific areas or considerations to focus on.]

## Checklist

- [ ] Code has been reviewed
- [ ] Tests pass successfully
- [ ] Documentation has been updated (if necessary)
- [ ] Issue is closed (if applicable)
- [ ] All discussions are resolved

## Additional Information

[Any additional information, context, or screenshots that might be helpful in reviewing this pull request.]

Of course, only if the sections make sense in the context of PR, not saying all of the PRs have to contain them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants