CHANGELOG.txt
1.0
1.0 will be a complete API overhaul
0.12.1
* typing
* added `METADATA_PARSER_FUTURE` environment variable
`export METADATA_PARSER_FUTURE=1` to enable
* `is_parsed_valid_url` can now accept a `ParseResultBytes` object
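  A hedged usage sketch (illustrative only; assumes `is_parsed_valid_url` is exposed at the package top level):
      from urllib.parse import urlparse
      import metadata_parser
      parsed_bytes = urlparse(b"https://example.com/path")  # bytes input yields a ParseResultBytes
      print(metadata_parser.is_parsed_valid_url(parsed_bytes))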
0.12.0
* drop python 2.7
* initial typing support
0.11.0 | UNRELEASED
* BREAKING CHANGES
Due to the following breaking changes, the version was bumped to 0.11.0
* `MetadataParser.fetch_url` now returns a third item.
* COMPATIBLE CHANGES
The following changes are backwards compatible to the 0.10.x releases
* a test-suite for an application leveraging `metadata_parser` experienced
some issues due to changes in the Responses package used to mock tests.
to better guard against that, a change was made (see the sketch at the end of this entry):
MetadataParser now has 2 subclassable attributes for items that should
or should not be parsed:
+ _content_types_parse = ("text/html",)
+ _content_types_noparse = ("application/json",)
Previously, these values were hardcoded into the logic.
* some error log messages were reformatted for clarity
* some error log messages were incorrectly reformatted by black
* added logging for NotParseable situations involving redirects
* added a `.response` attribute to NotParsable errors to help debug
redirects
* added a new ResponseHistory class to track redirects
* it is computed and returned during `MetadataParser.fetch_url`
* `MetadataParser.parse()` optionally accepts it, and will stash
it into ParsedResult
* `ParsedResult`
* ResponseHistory is not stashed in the metadata stash, but in a new namespace
* `.response_history` will either be `ResponseHistory` or None
* improving docstrings
* added `decode_html` helper
* extended MetadataParser to allow registration of a default_encoder for results
* style cleanup
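A minimal sketch of the subclassable content-type attributes noted above (values are illustrative):
    import metadata_parser
    class MyParser(metadata_parser.MetadataParser):
        _content_types_parse = ("text/html", "application/xhtml+xml")
        _content_types_noparse = ("application/json",)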
0.10.5
packaging fixes
migrated 'types.txt' out of distribution; it remains in github source
updated some log lines with the url
introduced some new log lines
added `METADATA_PARSER__DISABLE_TLDEXTRACT` env
merged, but then reverted, PR#34, which addresses Issue#32
0.10.4
* black via pre-commit
* upgraded black; 20.8b1
* integrated with pre-commit
* github actions and tox
* several test files were not in git!
0.10.3
updated docs on bad data
black formatting
added pyproject.toml
moved BeautifulSoup generation into its own method, so anyone can subclass to customize
:fixes: https://github.com/jvanasco/metadata_parser/issues/25
some internal variable changes thanks to flake8
0.10.2
added some docs on encoding
0.10.1
clarifying some inline docs
BREAKING CHANGE: `fetch_url` now returns a tuple of `(html, encoding)` (see the sketch at the end of this entry)
now tracking in ParsedResult: encoding
ParsedResult.metadata['_internal']['encoding'] = resp.encoding.lower() if resp.encoding else None
`.parse` now accepts `html_encoding`
refactored url fetching to use context managers
refactored url fetching to only insert our hooks when needed
adjusted test harness to close socket connections
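A rough sketch of the new return shape (hedged; constructor details are assumed, and 0.11.0 later adds a third returned item):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    html, encoding = page.fetch_url()           # 0.10.x: a two-item tuple
    page.parse(html, html_encoding=encoding)    # `.parse` now accepts `html_encoding`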
0.10.0
better Python3 support by using the six library
0.9.23
added tests for url entities
better grabbing of the charset
better grabbing of some edge cases
0.9.22
removed internal calls to the deprecated `get_metadata`, replacing them with `get_metadatas`.
this will avoid emitting a deprecation warning, allowing users to migrate more easily
0.9.21
* requests_toolbelt is now required
** this is to solve PR#16 / Issue#21
** the toolbelt and built-in versions of get_encodings_from_content required different workarounds
* the output of urlparse is now cached onto the parser instance.
** perhaps this will be global cache in the future
* MetadataParser now accepts `cached_urlparser`
** default: True
options: True: use an instance of UrlParserCacheable(maxitems=30)
: INT: use an instance of UrlParserCacheable(maxitems=cached_urlparser)
: None/False/0 - use native urlparse
: other truthy values - use as a custom urlparse
* addressing issue #17 (https://github.com/jvanasco/metadata_parser/issues/17) where `get_link_` logic does not handle schemeless urls.
** `MetadataParser.get_metadata_link` will now try to upgrade schemeless links (e.g. urls that start with "//")
** `MetadataParser.get_metadata_link` will now check values against `FIELDS_REQUIRE_HTTPS` in certain situations to see if the value is valid for http
** `MetadataParser.schemeless_fields_upgradeable` is a tuple of the fields which can be upgraded. this defaults to a package definition, but can be changed on a per-parser basis.
The defaults are:
'image',
'og:image', 'og:image:url', 'og:audio', 'og:video',
'og:image:secure_url', 'og:audio:secure_url', 'og:video:secure_url',
** `MetadataParser.schemeless_fields_disallow` is a tuple of the fields which cannot be upgraded. this defaults to a package definition, but can be changed on a per-parser basis.
The defaults are:
'canonical',
'og:url',
** `MetadataParser.get_url_scheme()` is a new method to expose the scheme of the active url
** `MetadataParser.upgrade_schemeless_url()` is a new method to upgrade schemeless links
it accepts two arguments: `url` and `field` (optional)
if present, the field is checked against the package tuple FIELDS_REQUIRE_HTTPS to see if the value is valid for http. the FIELDS_REQUIRE_HTTPS defaults are:
'og:image:secure_url',
'og:audio:secure_url',
'og:video:secure_url',
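Hedged sketches of the `cached_urlparser` option and the schemeless-url helpers above (values are illustrative):
    import metadata_parser
    # True -> UrlParserCacheable(maxitems=30); an int -> maxitems=<int>; None/False/0 -> native urlparse
    page = metadata_parser.MetadataParser(url="https://example.com", cached_urlparser=100)
    print(page.get_url_scheme())                                           # scheme of the active url
    print(page.upgrade_schemeless_url("//example.com/a.jpg", "og:image"))  # try to upgrade a schemeless value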
0.9.20
* support for deprecated `twitter:label` and `twitter:data` metatags, which use "value" instead of "content".
* new param to `__init__` and `parse`: `support_malformed` (default `None`).
if true, will support malformed parsing (such as consulting "value" instead of "content").
functionality extended from PR #13 (https://github.com/jvanasco/metadata_parser/pull/13) from https://github.com/amensouissi
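A hedged usage sketch for `support_malformed` (illustrative only):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com", support_malformed=True)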
0.9.19
* addressing https://github.com/jvanasco/metadata_parser/issues/12
on pages with duplicate metadata keys, additional elements were ignored;
when parsing the document, duplicate data was not kept.
* `MetadataParser.get_metadata` will always return a single string (or none)
* `MetadataParser.get_metadatas` has been introduced. this will always return a list (see the sketch at the end of this entry).
* the internal parsed_metadata store will now store data in a mix of arrays and strings, keeping it backwards compatible
* This new version benchmarks slightly slower because of the mixed format, but keeps a smaller memory footprint.
* the parsed result now contains a version record for tracking the format `_v`.
* standardized single/double quoting
* cleaned up some lines
* the library will try to coerce strategy= arguments into the right format
* when getting Dublin Core data, the result could be either a string or a dict. there's no good way to handle this.
* added tests for encoders
* greatly expanded tests
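A hedged sketch of the single-value vs. list accessors above (field name is illustrative):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    title = page.get_metadata("title")     # always a single string (or None)
    titles = page.get_metadatas("title")   # always a list, preserving duplicates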
0.9.18
* removed a stray debug line
0.9.17
* added `retry_dropped_without_headers` option
0.9.16
* added `fix_unicode_url()`
* Added `allow_unicode_url` (default True) to the following calls:
`MetadataParser.get_url_canonical`
`MetadataParser.get_url_opengraph`
`MetadataParser.get_discrete_url`
This functionality will try to recode canonical urls with unicode data into percent-encoded streams
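A hedged usage sketch (illustrative only):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    canonical = page.get_url_canonical(allow_unicode_url=False)  # skip the unicode-to-percent-encoding recode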
0.9.15
* Python3 support returned
0.9.14
* added some more tests to ensure encoding detected correctly
* stash the soup sooner when parsing, to aid in debugging
0.9.13
* doing some work to guess encoding...
* internal: now using `resp` instead of `r`, it is easier for pdb debugging
* the peername check was changed to be a hook, so it can be processed more immediately
* the custom session redirect test was altered
* changed the DummyResponse encoding fallback to `ENCODING_FALLBACK`, which is latin-1 (not utf-8)
this is somewhat backwards incompatible with this library, but maintains compatibility with the underlying `requests` library
0.9.12
* added more attributes to DummyResponse:
** `content`
** `headers`
0.9.11
* some changes to how we handle upgrading bad canonicals
upgrades will no longer happen IF they specify a bad domain.
upgrades from localhost will still transfer over
0.9.10
* slight reorder internally of TLD extract support
0.9.9
* inspecting `requests` errors for a response and using it if possible
* this will now try to validate urls if the `tldextract` library is present.
this feature can be disabled with a global toggle
import metadata_parser
metadata_parser.USE_TLDEXTRACT = False
0.9.7
* changed some internal variable names to better clarify the difference between a hostname and a netloc
0.9.7
updated the following functions to test for RFC valid characters in the url string
some websites, even BIG PROFESSIONAL ONES, will put html in here.
idiots? amateurs? lazy? doesn't matter, they're now our problem. well, not anymore.
* get_url_canonical
* get_url_opengraph
* get_metadata_link
0.9.6
this is being held for an update to the `requests` library
* made the following arguments to `MetadataParser.fetch_url()` default to None, which will then defer to the class setting. they are all passed through to `requests.get`
** `ssl_verify`
** `allow_redirects`
** `requests_timeout`
* removed `force_parse` kwarg from `MetadataParser.parser`
* added 'metadata_parser.RedirectDetected' class. if allow_redirects is False, a detected redirect will raise this.
* added 'metadata_parser.NotParsableRedirect' class. if allow_redirects is False, a detected redirect will raise this if missing a Location.
* added `requests_session` argument to `MetadataParser`
* starting to use httpbin for some tests
* detecting JSON documents
* extended NotParseable exceptions with the MetadataParser instance as `metadataParser`
* added `only_parse_http_ok` which defaults to True (legacy). submitting False will allow non-http200 responses to be parsed.
* shuffled `fetch_url` logic around. it will now process more data before a potential error.
* working on support for custom request sessions that can better handle redirects (requires patch or future version of requests)
* caching the peername onto the response object as `_mp_peername` [ _m(etadata)p(arser)_peername ]. this will allow it to be calculated in a redirect session hook. (see tests/sessions.py)
* added `defer_fetch` argument to `MetadataParser.__init__`, default ``False``. If ``True``, this will overwrite the instance's `deferred_fetch` method to actually fetch the url. this strategy allows for the `page` to be defined and response history caught. Under this situation, a 301 redirecting to a 500 can be observed; in the previous versions only the 500 would be caught.
* starting to encapsulate everything into a "parsed result" class
* fixed opengraph minimum check
* added `MetadataParser.is_redirect_unique`
* added `DummyResponse.history`
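A hedged sketch combining several of the 0.9.6 options above (names from this log; exact behavior assumed):
    import requests
    import metadata_parser
    sess = requests.Session()
    page = metadata_parser.MetadataParser(
        url="https://example.com",
        requests_session=sess,       # reuse an existing session
        only_parse_http_ok=False,    # allow non-200 responses to be parsed
        defer_fetch=True,            # fetch later, via the instance's `deferred_fetch` method
    )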
0.9.5
* if a document fails to load into BeautifulSoup, the BS error is now caught and NotParsable is raised
0.9.4
* created `MetadataParser.get_url_canonical`
* created `MetadataParser.get_url_opengraph`
* `MetadataParser.get_discrete_url` now calls `get_url_canonical` and `get_url_opengraph`
0.9.3
* fixed packaging error. removed debug "print" statements
0.9.2
* upgrade nested local canonical rels correctly
0.9.1
* added a new `_internal` storage namespace to the `MetadataParser.metadata` payload.
this simply stashes the `MetadataParser.url` and `MetadataParser.url_actual` attributes to make objects easier to encode for debugging
* the twitter parsing was incorrectly looking for 'value' rather than 'content', which the current spec uses
* tracking the shortlink on a page
0.9.0
- This has a default behavior change regarding `get_discrete_url()`.
- `is_parsed_valid_url()` did not correctly handle `require_public_netloc=True`, and would allow for `localhost` values to pass
- new kwarg `allow_localhosts` added to
* is_parsed_valid_url
* is_url_valid
* url_to_absolute_url
* MetadataParser.__init__
* MetadataParser.absolute_url
* MetadataParser.get_discrete_url
* MetadataParser.get_metadata_link
- new method `get_fallback_url`
- `url_to_absolute_url` will return `None` if not supplied with a fallback and test url. Previously an error in parsing would occur
- `url_to_absolute_url` tries to do a better job at determining the intended url when given a malformed url.
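A hedged sketch of the new `allow_localhosts` kwarg (call signature assumed):
    import metadata_parser
    ok = metadata_parser.is_url_valid("http://localhost:8080/page", allow_localhosts=True)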
0.8.3
- packaging fixes
0.8.2
- incorporated fix in https://github.com/jvanasco/metadata_parser/pull/10 to handle windows support of socket objects
- cleaned up some tests
- added `encode_ascii` helper
- added git-ignored `tests/private` directory for non-public tests
- added an `encoder` argument to `get_metadata` for encoding values
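A hedged sketch of the new `encoder` argument (assumes `encode_ascii` is exposed at the package level):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    title = page.get_metadata("title", encoder=metadata_parser.encode_ascii)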
0.8.1
added 2 new properties to a computed MetadataParser object:
is_redirect = None
is_redirect_same_host = None
in the case of redirects, we only have the peername available for the final URL (not the source)
if a response is a redirect, it may not be for the same host -- and the peername would correspond to the destination URL -- not the origin
0.8.0
this bump introduces 2 new arguments and some changed behavior:
- `search_head_only=None`. previously the meta/og/etc data was only searched in the document head (where expected as per HTML specs).
after indexing millions of pages, many appeared to implement this incorrectly or have html that is so off specification that
parsing libraries can't correctly read it (for example, Twitter.com).
This currently defaults from None to True, but is marked for a future default of `search_head_only=False`.
- `raise_on_invalid`. default False. If True, this will raise a new exception: InvalidDocument if the response
does not look like a proper html document
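A hedged usage sketch of the two new arguments (illustrative only):
    import metadata_parser
    page = metadata_parser.MetadataParser(
        url="https://example.com",
        search_head_only=False,   # search the whole document, not just <head>
        raise_on_invalid=True,    # raise InvalidDocument for responses that don't look like html
    )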
0.7.4
- more aggressive attempts to get the peername.
0.7.3
- this will now try to cache the `peername` of the request (ie, the remote server) onto the peername attribute
0.7.2
- applying a `strip()` to the "title". bad authors/cms often have whitespace.
0.7.1
- added kwargs to docstrings
- `get_metadata_link` behavior has been changed as follows:
* if an encoded uri is present (starts with `data:image/`)
** this will return None by default
** if a kwarg of `allow_encoded_uri=True` is submitted, will return the encoded url (without a url prefix)
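A hedged usage sketch (field name is illustrative):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com")
    image = page.get_metadata_link("image", allow_encoded_uri=True)  # return the encoded uri instead of None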
0.7.0
- merged https://github.com/jvanasco/metadata_parser/pull/9 from xethorn
- nested all commands to `log` under `__debug__` to avoid calls on production when PYTHONOPTIMIZE is set
0.6.18
- migrated version string into __init__.py
0.6.17
- added a new `DummyResponse` class to mimic popular attributes of a `requests.response` object when parsing from HTML files
0.6.16
- incorporated pull8 (https://github.com/jvanasco/metadata_parser/pull/8) which fixes issue5 (https://github.com/jvanasco/metadata_parser/issues/5) with comments
0.6.15
- fixed README, which used the old API in the example
0.6.14
- there was a typo and another bug that let some BeautifulSoup parsing tests pass incorrectly. they have been fixed. TODO: migrate tests to the public repo
0.6.13
- trying to integrate a "safe read"
0.6.12
- now passing "stream=True" to requests.get. this will fetch the headers first, before looping through the response. we can avoid many issues with this approach
0.6.11
- now correctly validating urls with ports. had to restructure a lot of the url validation
0.6.10
- changed how some nodes are inspected. this should lead to fewer errors
0.6.9
- added a new method `get_metadata_link()`, which applies link transformations to a metadata value in an attempt to ensure a valid link
0.6.8
- added a kwarg `requests_timeout` to proxy a timeout value to `requests.get()`
0.6.7
- added a lockdown to `is_parsed_valid_url` titled `http_only` -- requires http/https for the scheme
0.6.6
- protecting against bad doctypes, like nasa.gov
-- added `force_doctype` to __init__. this will change the doctype to get around bs4/lxml issues
-- this is defaulted to False.
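A hedged usage sketch (illustrative only):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="https://example.com", force_doctype=True)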
0.6.5
- keeping the parsed BS4 document; a user may wish to perform further operations on it.
-- `MetadataParser.soup` attribute holds BS4 document
0.6.4
- flake8 fixes. purely cosmetic.
0.6.3
- no changes. `sdist upload` was picking up a reference file that wasn't in github; that file killed the distribution install
0.6.2
- formatting fixes via flake8
0.6.1
- Lightweight, but functional, url validation
-- new 'init' argument (defaults to True): `require_public_netloc`
-- this will ensure a url's hostname/netloc is either an IPV4 or "public DNS" name
-- if the url is entirely numeric, requires it to be IPV4
-- if the url is alphanumeric, requires a TLD + Domain (exception is "localhost")
-- this is NOT RFC compliant, but designed for "Real Life" use cases.
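A hedged usage sketch (hostname is illustrative):
    import metadata_parser
    page = metadata_parser.MetadataParser(url="http://intranet-host/page", require_public_netloc=False)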
0.6.0
- Several fixes to improve support of canonical and absolute urls
-- replaced REGEX parsing of urls with `urlparse` parsing and inspection; too many edge cases crept in
-- refactored `MetadataParser.absolute_url`, now proxies a call to the new function `url_to_absolute_url`
-- refactored `MetadataParser.get_discrete_url`, now cleaner and leaner.
-- refactored how some tests run, so there is cleaner output
0.5.8
- trying to fix some issues with distribution
0.5.7
- trying to parse unparsable pages was creating an error
-- `MetadataParser.init` now accepts `only_parse_file_extensions` -- list of the only file extensions to parse
-- `MetadataParser.init` now accepts `force_parse_invalid_content_type` -- forces to parse invalid content
-- `MetadataParser.fetch_url` will only parse "text/html" content by default
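A hedged usage sketch (extension list is illustrative):
    import metadata_parser
    page = metadata_parser.MetadataParser(
        url="https://example.com/page.html",
        only_parse_file_extensions=[".html", ".htm"],
    )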
0.5.6
- trying to ensure we return a valid url in get_discrete_url()
- adding in some proper unit tests; migrating from the private demos slowly (the private demos hit a lot of internal files and public urls; it wouldn't be proper to make these public)
- setting `self.url_actual = url` on __init__. this will get overridden on a `fetch`, but allows for a fallback on html docs passed through
0.5.5
- Dropped BS3 support
- test Python3 support ( support added by Paul Bonser [ https://github.com/pib ] )
0.5.4
- Pull Request - https://github.com/jvanasco/metadata_parser/pull/1
Credit to Paul Bonser [ https://github.com/pib ]
0.5.3
- added a few `.strip()` calls to clean up metadata values
0.5.2
- fixed an issue on html title parsing. the old method incorrectly regexed on a BS4 tag, not tag contents, creating character encoding issues.
0.5.1
- missed the ssl_verify command
0.5.0
- migrated to the requests library
0.4.13
- trapping all errors in httplib and urllib2; raising them as NotParsable and sticking the original error into the `raised` attribute.
this will allow for cleaner error handling
- we *really* need to move to requests.py
0.4.12
- created a workaround for sharethis hashbang urls, which urllib2 doesn't like
- we need to move to requests.py
0.4.11
- added more relaxed controls for parsing safe files
0.4.10
- fixed force_parse arg on init
- added support for more filetypes
0.4.9
- support for gzip documents that pad with extra data (the spec allows it, the python module doesn't)
- ensure proper document format
0.4.8
- added support for twitter's own og style markup
- cleaned up the beautifulsoup finds for og data
- moved 'try' from encapsulating 'for' blocks to encapsulating the inner loop. this will pull more data out if an error occurs.
0.4.7
- cleaned up some code
0.4.6
- realized that some servers return gzip content, despite this client not advertising that it accepts that content; fixed by using some ideas from Mark Pilgrim's feedparser. metadata_parser now advertises gzip and zlib, and processes them as needed
0.4.5
- fixed a bug that prevented toplevel directories from being parsed
0.4.4
- made redirect/masked/shortened links have better dereferenced url support
0.4.2
- Wrapped title tag traversal with an AttributeError try block
- Wrapped canonical tag lookup with a KeyError try block, defaulting to 'href' then 'content'
- Added support for `url_actual` and `url_info` , which persist the data from the urllib2.urlopen object's `geturl()` and `info()`
- `get_discrete_url` and `absolute_url` use the underlying url_actual data
- added support for passing data and headers into urllib2 requests
0.4.1
Initial Release