Proxy host is not a valid hostname #416

Closed
g-arcas opened this issue Aug 10, 2021 · 7 comments

@g-arcas

g-arcas commented Aug 10, 2021

It looks like it is not possible to configure Hyphe to use a proxy identified by an IP address.

Context

  • Dockerized Hyphe running on a VM
  • Proxy installed and running on the VM, bound to 127.0.0.1:3128

When setting the Proxy option to this IP, Hyphe complains:

  • Could not save new settings: Proxy host is not a valid hostname

Even when using the hostname (from the /etc/hosts file), Hyphe still does not accept the settings.

Any idea what I might be doing wrong?

Regards.

@boogheta
Member

Hello,

Indeed, it looks like the test that is run on the host when setting the proxy does not accept IP addresses.

We will fix it in an upcoming release, but in the meantime, if you want to hotfix it on your server, you can do the following:
In hyphe_backend/core.tac, line 164, change:
if '/' in options['proxy']['host'] or not is_url(options['proxy']['host'], tld_aware=True, require_protocol=False):
into:
if '/' in options['proxy']['host'] or not (is_url(options['proxy']['host'], tld_aware=True, require_protocol=False) or urllru.special_hosts.match(options['proxy']['host'])):
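
The idea behind this hotfix, accepting a host that is either a valid hostname or an IP address, can be sketched in isolation. This is a minimal standalone sketch and not Hyphe's code: `is_valid_proxy_host` and `is_ip_address` are hypothetical helpers, the check relies on Python 3's standard `ipaddress` module rather than on Hyphe's `is_url` and `urllru.special_hosts`, and the hostname pattern is a simplified assumption.

```python
import ipaddress
import re

# Simplified hostname pattern (assumption): dot-separated labels made of
# letters, digits and hyphens, 1-63 characters each, with no leading or
# trailing hyphen in a label.
_LABEL = r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
_HOSTNAME_RE = re.compile(r"%s(?:\.%s)*" % (_LABEL, _LABEL))


def is_ip_address(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_valid_proxy_host(host):
    """Accept a bare hostname or IP address; reject anything containing '/'."""
    if not host or "/" in host:
        return False
    return is_ip_address(host) or _HOSTNAME_RE.fullmatch(host) is not None
```

With a check of this shape, `127.0.0.1` is accepted in the same way as a regular hostname, while a value such as `http://127.0.0.1` is still rejected because of the slash.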

@boogheta boogheta added the bug label Aug 23, 2021
@boogheta boogheta self-assigned this Aug 23, 2021
@g-arcas
Author

g-arcas commented Aug 24, 2021

Hello Benjamin.

Thank you for your answer!

The motivation for my question is to "plug" Hyphe into a proxy in order to archive all HTTP traffic in WARC format. The next step will be to set a Tor SOCKS proxy as the upstream proxy, in order to scrape .onion websites as easily as "clean Internet" ones.
Another question: I'd like to create a whitelist of domains that Hyphe will automatically tag as "out". Is it possible to make such a list global, that is, not specific to a single Hyphe corpus?

Best regards.

@g-arcas
Author

g-arcas commented Aug 24, 2021 via email

@boogheta
Member

Hi Guillaume,
I must warn you that we have never tried, or intended, to plug Hyphe into Tor. I cannot guarantee that .onion URLs will run properly through all of Hyphe's routines, and I won't have much time to help you fix it if they don't.
As for a global list of entities to set as OUT: no, such functionality does not exist in Hyphe, but it could easily be scripted using the API if you know how to code a little.

@g-arcas
Author

g-arcas commented Aug 24, 2021 via email

@boogheta
Member

For the API part, you can take inspiration from the script hyphe_backend/test_client.py, which is a helper for calling the API directly from the shell. The documentation of all API routes is available here: https://github.com/medialab/hyphe/blob/master/doc/api.md
For your needs, the most relevant routes should be store.get_webentity_for_url and store.set_webentities_status.
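
As a rough illustration of how those two routes might be scripted, here is a minimal sketch that only uses the Python 3 standard library. Several things in it are assumptions to verify against doc/api.md: the JSON-RPC body shape (`method` plus positional `params`), the parameter order of each route, and the API URL passed to `send`. The helper names `build_rpc`, `lookup_requests`, `status_request` and `send` are hypothetical and not part of Hyphe.

```python
import json
from urllib import request


def build_rpc(method, params):
    """Build a JSON-RPC style request body: a method name plus positional params."""
    return {"method": method, "params": list(params)}


def lookup_requests(urls, corpus):
    """One store.get_webentity_for_url request per URL of the global list."""
    # Assumed parameter order: (url, corpus); check doc/api.md.
    return [build_rpc("store.get_webentity_for_url", [url, corpus]) for url in urls]


def status_request(webentity_ids, corpus, status="OUT"):
    """A single store.set_webentities_status request for all resolved ids."""
    # Assumed parameter order: (webentity_ids, status, corpus); check doc/api.md.
    return build_rpc("store.set_webentities_status", [list(webentity_ids), status, corpus])


def send(api_url, payload):
    """POST one request body to the API and return the decoded JSON reply."""
    req = request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To apply one list to several corpora, the same two steps could be repeated for each corpus name. Extracting the web entity ids from the lookup replies is left out here because the reply format is not shown in this thread.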

And regarding your last question, the manual installation documentation hasn't been updated for a while because Docker is so much easier, but I recently did full manual installs myself on some servers, and you should be able to do so as well by adjusting a few things. Mainly, I'd recommend:

Let me know precisely if you run into more errors and I can try to help.

@g-arcas
Author

g-arcas commented Aug 24, 2021 via email
