style: let the reader decide if something is simple or easy (upstream branch) #1201

Merged 8 commits on Sep 10, 2024
2 changes: 1 addition & 1 deletion sources/academy/glossary/concepts/css_selectors.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ CSS selectors are important for web scraping because they allow you to target sp

For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project.

Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it easily. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need.
Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need.

## Resources

4 changes: 2 additions & 2 deletions sources/academy/glossary/concepts/http_headers.md
Expand Up @@ -23,7 +23,7 @@ For some websites, you won't need to worry about modifying headers at all, as th

Some websites will require certain default browser headers to work properly, such as **User-Agent** (though this header is becoming less relevant, as there are more sophisticated ways to detect and block a suspicious user).

Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which simply would not know which data to return without knowing which exact website is requesting it.
Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it.
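
To make the **Referer** case concrete, here is a minimal sketch in Python that builds (but does not send) a request to a shared platform and names the website on whose behalf the data is requested. Both URLs and the `build_platform_request` helper are hypothetical and exist only for illustration; a real platform may expect additional headers.

```python
from urllib.request import Request

# Hypothetical URLs, used only for illustration.
PLATFORM_API = "https://platform.example.com/api/products"
SHOP_PAGE = "https://shop.example.com/catalog"


def build_platform_request(api_url: str, referring_page: str) -> Request:
    """Build (without sending) a request that tells the shared platform
    which website the data is being requested for."""
    return Request(
        api_url,
        headers={
            # Without this header, the platform could not tell the shops apart.
            "Referer": referring_page,
            "Accept": "application/json",
        },
    )


request = build_platform_request(PLATFORM_API, SHOP_PAGE)
print(request.get_header("Referer"))
```

Passing the same object to `urllib.request.urlopen` would send it with the header attached.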

## Custom headers required {#needs-custom-headers}

Expand All @@ -44,7 +44,7 @@ You could use Chrome DevTools to inspect request headers, and [Insomnia](../tool
HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of:

1. HTTP/2 headers do not include status messages. They only contain status codes.
2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will simply ignore them, Safari and other WebKit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem.
2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other WebKit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem.
3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**).
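
The second and third differences can be sketched as a small conversion step. This is an illustrative sketch, not part of any library: the `to_http2_headers` helper is a hypothetical name, and the set of dropped headers follows the connection-specific fields named in the HTTP/2 specification (RFC 9113). It does not handle extra header names listed inside a **Connection** header value.

```python
# Connection-specific headers that HTTP/2 prohibits (per RFC 9113).
CONNECTION_SPECIFIC = {
    "connection",
    "keep-alive",
    "proxy-connection",
    "transfer-encoding",
    "upgrade",
}


def to_http2_headers(headers: dict[str, str]) -> dict[str, str]:
    """Lower-case header names and drop connection-specific headers."""
    result = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in CONNECTION_SPECIFIC:
            continue  # not allowed in HTTP/2, so leave it out
        result[lowered] = value
    return result


http1_headers = {
    "Accept-Encoding": "gzip",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Keep-Alive": "timeout=5",
    "User-Agent": "example-scraper",  # placeholder value
}
http2_headers = to_http2_headers(http1_headers)
print(http2_headers)
```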

> To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out [this](https://httptoolkit.tech/blog/translating-http-2-into-http-1/) article.
6 changes: 3 additions & 3 deletions sources/academy/glossary/tools/edit_this_cookie.md
Expand Up @@ -11,7 +11,7 @@ slug: /tools/edit-this-cookie

---

**EditThisCookie** is a simple Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain.
**EditThisCookie** is a Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain.

![EditThisCookie popup](./images/edit-this-cookie-popup.png)

Expand All @@ -21,11 +21,11 @@ At the top of the popup, there is a row of buttons. From left to right, here is

### Delete all cookies

Clicking this button will simply remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again.
Clicking this button will remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again.

### Reset

Basically just a refresh button.
A refresh button.

### Add a new cookie

6 changes: 3 additions & 3 deletions sources/academy/glossary/tools/insomnia.md
@@ -1,13 +1,13 @@
---
title: Insomnia
description: Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.
description: Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.
sidebar_position: 9.2
slug: /tools/insomnia
---

# What is Insomnia {#what-is-insomnia}

**Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.**
**Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.**

---

Expand Down Expand Up @@ -66,4 +66,4 @@ This will bring up the **Manage cookies** window, where all cached cookies can b

## Postman or Insomnia {#postman-or-insomnia}

The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, just choose the one that has the most intuitive interface for you.
The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, choose the one that has the most intuitive interface for you.
4 changes: 2 additions & 2 deletions sources/academy/glossary/tools/modheader.md
Expand Up @@ -19,9 +19,9 @@ If you read about [Postman](./postman.md), you might remember that you can use i

After you install the ModHeader extension, you should see it pinned in Chrome's task bar. When you click it, you'll see an interface like this pop up:

![Modheader's simple interface](./images/modheader.jpg)
![Modheader's interface](./images/modheader.jpg)

Here, you can add headers, remove headers, and even save multiple collections of headers that you can easily toggle between (which are called **Profiles** within the extension itself).
Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself).

## Use cases {#use-cases}

12 changes: 6 additions & 6 deletions sources/academy/glossary/tools/postman.md
@@ -1,19 +1,19 @@
---
title: Postman
description: Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.
description: Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.
sidebar_position: 9.3
slug: /tools/postman
---

# What is Postman? {#what-is-postman}

**Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.**
**Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.**

---

[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to easily test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper.
[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper.

The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a simple signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/).
The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/).

## Understanding the interface {#understanding-the-interface}

Expand Down Expand Up @@ -43,7 +43,7 @@ In order to use a proxy, the proxy's server and configuration must be provided i

![Proxy configuration in Postman settings](./images/postman-proxy.png)

After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings just needs to be un-ticked to disable it.
After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings needs to be un-ticked to disable it.

## Managing the cookies cache {#managing-cookies}

Expand All @@ -55,7 +55,7 @@ In order to check whether there are any cookies associated with a certain reques

![Button to view the cached cookies](./images/postman-cookies-button.png)

Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **https://github.com/apify**, within this window we would be able to easily find cached cookies associated with github.com. Cookies can also be easily edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here.
Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **https://github.com/apify**, within this window we would be able to find cached cookies associated with github.com. Cookies can also be edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here.

![Managing cookies in Postman with the "MANAGE COOKIES" window](./images/postman-manage-cookies.png)

6 changes: 3 additions & 3 deletions sources/academy/glossary/tools/quick_javascript_switcher.md
@@ -1,17 +1,17 @@
---
title: Quick JavaScript Switcher
description: Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.
description: Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.
sidebar_position: 9.9
slug: /tools/quick-javascript-switcher
---

# Quick JavaScript Switcher

**Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.**
**Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.**

---

**Quick JavaScript Switcher** is a very simple Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed.
**Quick JavaScript Switcher** is a Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed.

If JavaScript is enabled, clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website works without JavaScript, and thus whether it could be scraped with a plain HTTP request instead of a browser.
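
The same check can be approximated in code: if the text you need is already present in the raw HTML, outside of any script, a plain HTTP request is probably enough. The sketch below is only an illustration in Python. The `VisibleTextParser` class, the `text_in_static_html` helper, and both sample pages are hypothetical, and real pages may need a more careful comparison.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` is in the page text without running JavaScript."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return needle in " ".join(parser.chunks)


# Hypothetical responses: one server-rendered page, one SPA shell.
STATIC_PAGE = "<html><body><h1>Blog</h1><p>First post</p></body></html>"
SPA_PAGE = (
    '<html><body><div id="root"></div>'
    '<script>render("First post")</script></body></html>'
)

print(text_in_static_html(STATIC_PAGE, "First post"))
print(text_in_static_html(SPA_PAGE, "First post"))
```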

6 changes: 3 additions & 3 deletions sources/academy/glossary/tools/user_agent_switcher.md
@@ -1,17 +1,17 @@
---
title: User-Agent Switcher
description: Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.
description: Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.
sidebar_position: 9.8
slug: /tools/user-agent-switcher
---

# User-Agent Switcher

**Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.**
**Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.**

---

**User-Agent Switcher** is a simple Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups.
**User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups.

![User-Agent Switcher groups](./images/user-agent-switcher-groups.png)

4 changes: 1 addition & 3 deletions sources/academy/platform/deploying_your_code/deploying.md
Expand Up @@ -21,12 +21,10 @@ Before we deploy our project onto the Apify platform, let's ensure that we've pu

### Creating the Actor

Before anything can be integrated, we've gotta create a new Actor. Luckily, this is super easy to do. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **Develop new** button, then select the **Empty** template.
Before anything can be integrated, we've gotta create a new Actor. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **Develop new** button, then select the **Empty** template.

![Create new button](../getting_started/images/develop-new-actor.png)

Easy peasy!

### Changing source code location {#change-source-code}

In the **Source** tab on the new Actor's page, we'll click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**.
6 changes: 3 additions & 3 deletions sources/academy/platform/deploying_your_code/docker_file.md
Expand Up @@ -16,15 +16,15 @@ import TabItem from '@theme/TabItem';

The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in. Every Actor must have a Dockerfile, as Actors run in Docker containers.

> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to just run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, Go, etc).
> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, Go, etc).

## Base images {#base-images}

If your project doesn’t already contain a Dockerfile, don’t worry! Apify offers [many base images](/sdk/js/docs/guides/docker-images) that are optimized for building and running Actors on the platform, which can be found [here](https://hub.docker.com/u/apify). When using a language for which Apify doesn't provide a base image, [Docker Hub](https://hub.docker.com/) provides a ton of free Docker images for most use-cases, upon which you can create your own images.

> Tip: You can see all of Apify's Docker images [on DockerHub](https://hub.docker.com/r/apify/).

At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or just install them yourself during the build step.
At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or install them yourself during the build step.

Once you find the base image you need, you can add it as the initial `FROM` statement:

Expand Down Expand Up @@ -111,7 +111,7 @@ CMD python3 main.py

## Examples {#examples}

The examples we just showed were for Node.js and Python, however, to drive home the fact that Actors can be written in any language, here are some examples of some Dockerfiles for Actors written in different programming languages:
The examples above show how to deploy Actors written in Node.js or Python, but you can use any language. For inspiration, here are a few examples for other languages: Go, Rust, Julia.

<Tabs groupId="main">
<TabItem value="GO Actor Dockerfile" label="GO Actor Dockerfile">