Add param requests_kwargs for WebBaseLoader (#5485)
# Add param `requests_kwargs` for WebBaseLoader

Fixes #5483

## Who can review?

@eyurtsev
sevendark authored May 31, 2023
1 parent 359fb8f commit bd9e0f3
Showing 2 changed files with 25 additions and 3 deletions.
21 changes: 20 additions & 1 deletion docs/modules/indexes/document_loaders/examples/sitemap.ipynb
@@ -8,7 +8,7 @@
"\n",
"Extends from the `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, and then scrape and load all pages in the sitemap, returning each page as a Document.\n",
"\n",
"The scraping is done concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the scrapped server, or don't care about load, you can change the `requests_per_second` parameter to increase the max concurrent requests. Note, while this will speed up the scraping process, but it may cause the server to block you. Be careful!"
"The scraping is done concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!"
]
},
{
@@ -63,6 +63,25 @@
"docs = sitemap_loader.load()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can change the `requests_per_second` parameter to increase the maximum number of concurrent requests, and use `requests_kwargs` to pass keyword arguments through to the underlying requests when they are sent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sitemap_loader.requests_per_second = 2\n",
"# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue\n",
"sitemap_loader.requests_kwargs = {\"verify\": False}"
]
},
{
"cell_type": "code",
"execution_count": 4,
7 changes: 5 additions & 2 deletions langchain/document_loaders/web_base.py
@@ -2,7 +2,7 @@
import asyncio
import logging
import warnings
from typing import Any, List, Optional, Union
from typing import Any, Dict, List, Optional, Union

import aiohttp
import requests
@@ -47,6 +47,9 @@ class WebBaseLoader(BaseLoader):
default_parser: str = "html.parser"
"""Default parser to use for BeautifulSoup."""

requests_kwargs: Dict[str, Any] = {}
"""Keyword arguments forwarded to requests.Session.get (e.g. {"verify": False})."""

def __init__(
self, web_path: Union[str, List[str]], header_template: Optional[dict] = None
):
@@ -170,7 +173,7 @@ def _scrape(self, url: str, parser: Union[str, None] = None) -> Any:

self._check_parser(parser)

html_doc = self.session.get(url)
html_doc = self.session.get(url, **self.requests_kwargs)
html_doc.encoding = html_doc.apparent_encoding
return BeautifulSoup(html_doc.text, parser)

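The patched `_scrape` simply splats the mutable `requests_kwargs` dict into `session.get`. The forwarding pattern can be sketched in isolation with a stub session; `StubSession` and `MiniLoader` below are illustrative stand-ins for this sketch, not part of langchain:

```python
from typing import Any, Dict


class StubSession:
    """Stand-in for requests.Session that records the kwargs it receives."""

    def get(self, url: str, **kwargs: Any) -> Dict[str, Any]:
        # A real Session would perform an HTTP GET; here we just echo the
        # call so the kwarg forwarding is visible.
        return {"url": url, "kwargs": kwargs}


class MiniLoader:
    """Minimal sketch of WebBaseLoader's requests_kwargs forwarding."""

    requests_kwargs: Dict[str, Any] = {}

    def __init__(self) -> None:
        self.session = StubSession()

    def _scrape(self, url: str) -> Dict[str, Any]:
        # Mirrors the patched line: session.get(url, **self.requests_kwargs)
        return self.session.get(url, **self.requests_kwargs)


loader = MiniLoader()
loader.requests_kwargs = {"verify": False}
result = loader._scrape("https://example.com")
print(result["kwargs"])  # {'verify': False}
```

Because the dict is unpacked at call time, anything `requests.Session.get` accepts (`verify`, `timeout`, `proxies`, ...) can be set on the loader after construction, exactly as the notebook cell does with `sitemap_loader.requests_kwargs = {"verify": False}`.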
