aiohttp_chromium

aiohttp-like interface to chromium

based on selenium_driverless to bypass cloudflare

status

working prototype

usage

aiohttp_chromium aims to be a drop-in replacement for the aiohttp client

import asyncio

#import aiohttp
import aiohttp_chromium as aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())
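
because only the import line changes, code written against the aiohttp client interface can take the client module as a parameter. the following is a minimal sketch, not part of aiohttp_chromium: fetch_text uses only the calls shown above (ClientSession, session.get, resp.status, resp.text), and fake_client is a hypothetical stand-in so the example runs without a browser or network

```python
import asyncio
from types import SimpleNamespace


async def fetch_text(client, url):
    """Return (status, body) using only the aiohttp-style calls shown above.

    `client` is any module or object exposing an aiohttp-like ClientSession,
    for example aiohttp or aiohttp_chromium.
    """
    async with client.ClientSession() as session:
        async with session.get(url) as resp:
            return resp.status, await resp.text()


# Hypothetical stand-in client, so the sketch runs without a browser or network.
class _FakeResponse:
    status = 200

    def __init__(self, url):
        self._url = url

    async def text(self):
        return f"fake body for {self._url}"

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


class _FakeSession:
    def get(self, url):
        return _FakeResponse(url)

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


fake_client = SimpleNamespace(ClientSession=_FakeSession)

if __name__ == "__main__":
    # Swap fake_client for `aiohttp` or `aiohttp_chromium` to hit a real server.
    print(asyncio.run(fetch_text(fake_client, "http://httpbin.org/get")))
```

passing the module in makes it easy to compare both clients against the same url, or to fall back to plain aiohttp where no browser is needed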

why

handling file downloads with plain selenium is too verbose, and a simpler download interface would be too complex to integrate into selenium itself, so this is a wrapper around selenium

i wanted a "stupid http client", so aiohttp_chromium has the same interface as aiohttp.client. handling web pages has lower priority, so the selenium interface is hidden in response._driver

known issues

chromium window is stealing focus

when creating new tabs, or when switching between tabs, the chromium window is grabbing focus

this is an issue with the window manager

workaround for the KDE plasma desktop: move the chromium window to a different desktop, and focus some window

chromium seems to have no command line switch to disable this focus-grabbing

possible solutions

run chromium in a LD_PRELOAD wrapper
binary patching of the chromium executable
configure the window manager
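
for the LD_PRELOAD idea, the launching side could look like the sketch below. this is only an assumption about how such a wrapper would be wired up: the shim library that would suppress the focus-grabbing does not exist yet, and the path in the usage note is a placeholder. preload_env is a hypothetical helper that only prepares the environment for the chromium process

```python
import os


def preload_env(shim_path, base_env=None):
    """Return a copy of the environment with shim_path prepended to LD_PRELOAD.

    shim_path is a hypothetical shared library. The original mapping is
    left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list, so keep any existing entries
    env["LD_PRELOAD"] = f"{shim_path}:{existing}" if existing else shim_path
    return env
```

usage would be something like subprocess.Popen(["chromium"], env=preload_env("/path/to/shim.so")). how to pass a custom environment to the chromium process started by aiohttp_chromium is not covered here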

todo

remove tempfiles on session close and on error
add support for streams: request streams, response streams
- currently, session.get only works for "short and small" requests and responses, but not for infinite streams
- implementing this is non-trivial, because chromium does not expose streams over the Chrome DevTools Protocol (CDP)
- kaliiiiiiiiii/Selenium-Driverless#123
  - i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client, which is pretty much what we are trying to do here...
- wkeeling/selenium-wire#656 (comment)
  - sounds like we need either a patched version of chromium, or a dynamic analysis tool like frida to insert hooks into the chromium binary, in order to pipe all requests and responses through a local http proxy, for passive tracing and active intercepting of https traffic
  - tracing https traffic with frida
- https://groups.google.com/g/chrome-debugging-protocol/c/w65z0cMqgvc - Fetch.fulfillRequest and (very) long body
  - there's no streaming support for Fetch network interception
  - there is Fetch.takeResponseBodyAsStream and IO.read, but not Fetch.giveResponseBodyAsStream and IO.write
  - there is Network.takeResponseBodyForInterceptionAsStream and IO.read, but not Network.giveResponseBodyForInterceptionAsStream and IO.write
  - google has hidden the discussion: "You don't have permission to access this content. For access, try contacting the group's owners and managers"
    - see snapshot from archive.org 2024-06-23
    - hey google? thanks for reminding us that google is a bunch of fascists, engaging in sabotage and censorship
- https://issues.chromium.org/issues/332570739 - Streaming body for Fetch.fulfillRequest() CDP API
  - from the request: Fetch.fulfillRequest() only provides an option to set the 'body' response as a base64-encoded string. Of course, this does not work well for larger response bodies. Similar to the streaming takeResponseBodyAsStream(), it would be great if there was a fulfillRequest() option with a stream, fulfillRequestWithStream()
    - Perhaps this could be done by expanding the IO APIs to have an IO.write() option that allows sending streaming data to the browser. I realize this is probably fairly low-priority, but it would make Fetch request interception more efficient, especially when dealing with larger responses, chunked responses of unknown size, etc.
  - from the reply: The feature request makes sense but currently it is a low priority for us.
  - see snapshot
graphical interface where the user can solve challenges: captchas, unexpected responses, ...
integration with captcha solving services
remove unfree dependencies
- selenium_driverless - cc by-nc-sa license
  - selenium_driverless is a high-level wrapper for the Chrome DevTools Protocol (CDP)
  - NOT based on chromedriver binary, because chromedriver is detected by cloudflare
- see also Awesome Chrome DevTools # Libraries for driving the protocol (or a layer above)
  - https://github.com/pyppeteer/pyppeteer - 3K stars
  - https://github.com/fake-name/ChromeController - 200 stars
  - https://github.com/chazkii/chromewhip - 120 stars
grep -r -w FIXME src/
grep -r -w TODO src/
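
the read direction mentioned in the streams item above (Fetch.takeResponseBodyAsStream plus IO.read) can be sketched as a loop over IO.read until eof. this is a minimal sketch and not the code used by aiohttp_chromium: send(method, params) is an assumed async callable that runs one CDP command and returns its result as a dict, and make_fake_send is a hypothetical in-memory stand-in so the loop runs without a browser. in real use, the handle would be the stream field returned by Fetch.takeResponseBodyAsStream for a paused requestId

```python
import asyncio
import base64


async def read_cdp_stream(send, handle, chunk_size=65536):
    """Yield the bytes of a CDP stream handle chunk by chunk, then close it.

    `send(method, params)` is an assumed async callable that executes one
    CDP command and returns its result as a dict.
    """
    try:
        while True:
            result = await send("IO.read", {"handle": handle, "size": chunk_size})
            data = result.get("data", "")
            # IO.read returns text, or base64 when base64Encoded is true
            if result.get("base64Encoded"):
                chunk = base64.b64decode(data)
            else:
                chunk = data.encode("utf-8")
            if chunk:
                yield chunk
            if result.get("eof"):
                break
    finally:
        # release the handle on the browser side
        await send("IO.close", {"handle": handle})


def make_fake_send(payload, log):
    """Hypothetical in-memory CDP endpoint serving `payload` as one stream."""
    offset = 0

    async def send(method, params):
        nonlocal offset
        log.append(method)
        if method == "IO.read":
            piece = payload[offset:offset + params["size"]]
            offset += len(piece)
            return {
                "base64Encoded": True,
                "data": base64.b64encode(piece).decode("ascii"),
                "eof": offset >= len(payload),
            }
        return {}

    return send


async def collect(send, handle, chunk_size):
    """Drain the stream into a list of chunks."""
    return [chunk async for chunk in read_cdp_stream(send, handle, chunk_size)]
```

this only covers reading a response body out of chromium. the write direction (feeding a long body into Fetch.fulfillRequest) has no counterpart, which is the gap described in the todo item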

keywords

web scraper
chromium
aiohttp
web scraping
asyncio
bypass cloudflare
headful scraper
headful web scraper
headful chromium
gui scripting
headful webscraper
selenium driverless

similar projects

botasaurus
