Use heuristic_parse for untrusted URLs. #976

Merged
merged 1 commit into from
Feb 4, 2023

benubois
Contributor

Hello,

Thanks for this gem!

Addressable::URI.parse raises an exception for any URL it considers invalid. The problem is that Addressable and Twitter do not agree on what qualifies as a valid URL, so a tweet can contain URL entities that Addressable rejects. Addressable::URI.heuristic_parse is Addressable's more lenient parser.

This change parses any tainted URL with heuristic_parse, which reduces the chance of encountering an Addressable::URI::InvalidURIError exception in the wild.

For example, the tweet below contains the URL http://suspicio\\.us/URL, which Twitter recognizes as a URL, so it shows up as an entity.

Calling tweet.urls.first.expanded_url on this tweet raises an exception:

require "addressable"
Addressable::URI.parse("http://suspicio\\.us/URL")
Traceback (most recent call last):
        1: from (irb):7
Addressable::URI::InvalidURIError (Invalid character in host: 'suspicio\.us')

vs

require "addressable"
Addressable::URI.heuristic_parse("http://suspicio\\.us/URL")
=> #<Addressable::URI:0x3fc6cfc76c78 URI:http://suspicio/.us/URL>

Incidentally, it looks like Addressable was first used to help with this same type of issue: #487.

This should also fix #742 and #891.

{:created_at=>"Fri Aug 07 16:06:51 +0000 2020",
 :id=>1291767772754726914,
 :id_str=>"1291767772754726914",
 :full_text=>
  "curl -s -D- https://t.co/j30q2zQoYS |grep -iE \"^Location: |URL=|window.location|document.location\" # Try to check where a URL may redirect you.",
 :truncated=>false,
 :display_text_range=>[0, 143],
 :entities=>
  {:hashtags=>[],
   :symbols=>[],
   :user_mentions=>[],
   :urls=>
    [{:url=>"https://t.co/j30q2zQoYS",
      :expanded_url=>"http://suspicio\\.us/URL",
      :display_url=>"suspicio\\.us/URL",
      :indices=>[12, 35]}]},
 :source=>
  "<a href=\"http://suso.suso.org/xulu/Command_Line_Magic\" rel=\"nofollow\">CLI Magic poster</a>",
 :in_reply_to_status_id=>nil,
 :in_reply_to_status_id_str=>nil,
 :in_reply_to_user_id=>nil,
 :in_reply_to_user_id_str=>nil,
 :in_reply_to_screen_name=>nil,
 :user=>
  {:id=>91333167,
   :id_str=>"91333167",
   :name=>"Command Line Magic",
   :screen_name=>"climagic",
   :location=>"BASHLAND",
   :description=>
    "Cool Unix/Linux Command Line tricks you can use in $TWITTER_CHAR_LIMIT characters or less. Here mostly to inspire all to try more. Read docs first, run later.\\~",
   :url=>"https://t.co/eKoQFEZTLs",
   :entities=>
    {:url=>
      {:urls=>
        [{:url=>"https://t.co/eKoQFEZTLs",
          :expanded_url=>"http://www.climagic.org/",
          :display_url=>"climagic.org",
          :indices=>[0, 23]}]},
     :description=>{:urls=>[]}},
   :protected=>false,
   :followers_count=>198274,
   :friends_count=>12330,
   :listed_count=>3962,
   :created_at=>"Fri Nov 20 12:49:35 +0000 2009",
   :favourites_count=>1748,
   :utc_offset=>nil,
   :time_zone=>nil,
   :geo_enabled=>true,
   :verified=>false,
   :statuses_count=>13134,
   :lang=>nil,
   :contributors_enabled=>false,
   :is_translator=>false,
   :is_translation_enabled=>false,
   :profile_background_color=>"C0DEED",
   :profile_background_image_url=>
    "http://abs.twimg.com/images/themes/theme1/bg.png",
   :profile_background_image_url_https=>
    "https://abs.twimg.com/images/themes/theme1/bg.png",
   :profile_background_tile=>true,
   :profile_image_url=>
    "http://pbs.twimg.com/profile_images/535876218/climagic-icon_normal.png",
   :profile_image_url_https=>
    "https://pbs.twimg.com/profile_images/535876218/climagic-icon_normal.png",
   :profile_link_color=>"0084B4",
   :profile_sidebar_border_color=>"C0DEED",
   :profile_sidebar_fill_color=>"DDEEF6",
   :profile_text_color=>"333333",
   :profile_use_background_image=>true,
   :has_extended_profile=>false,
   :default_profile=>false,
   :default_profile_image=>false,
   :following=>false,
   :follow_request_sent=>false,
   :notifications=>false,
   :translator_type=>"none"},
 :geo=>nil,
 :coordinates=>nil,
 :place=>nil,
 :contributors=>nil,
 :is_quote_status=>false,
 :retweet_count=>35,
 :favorite_count=>128,
 :favorited=>false,
 :retweeted=>false,
 :possibly_sensitive=>false,
 :lang=>"en",
 :text=>
  "curl -s -D- https://t.co/j30q2zQoYS |grep -iE \"^Location: |URL=|window.location|document.location\" # Try to check where a URL may redirect you."}

@sferik sferik force-pushed the master branch 4 times, most recently from c5e4814 to ccba161 Compare February 4, 2023 02:37
@sferik sferik merged commit 5e09093 into sferik:master Feb 4, 2023

dentarg commented Feb 4, 2023

sferik force-pushed the master branch 4 times, most recently from c5e4814 to ccba161 yesterday

@sferik Was this reverted from master due to the force pushes? 5e09093 says "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."

Development

Successfully merging this pull request may close these issues.

Rescue errors from Addressable?
3 participants