Option to not-read certain media types #247

stichiboi · 2022-02-23T08:07:12Z

Hello,
I'm trying to read data from EPUBs I downloaded from the web.
I'm only interested in the text; I don't care about images or styles.
Would it be possible to add a media_type_filter option and only load the specified types from the manifest?

I imagine something along these lines in epub.EpubReader._load_manifest, inside the loop over the manifest items:

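media_type = r.get('media-type')
# Skip this manifest item if a filter is set and its media type is not listed.
# (continue rather than return, so the remaining items are still loaded)
if self.media_type_filter and media_type not in self.media_type_filter:
    continue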

The media_type_filter would just be a list I pass in through the options.
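To make the idea concrete, here is a minimal standalone sketch of the same allow-list check. It uses only the standard library rather than ebooklib itself, and the function name filter_manifest_items and the sample OPF document are made up for illustration; they are not part of ebooklib.

```python
import xml.etree.ElementTree as ET

# Namespace used by the OPF package document inside an EPUB.
OPF_NS = "http://www.idpf.org/2007/opf"


def filter_manifest_items(opf_xml, media_type_filter=None):
    """Hypothetical helper: return (id, href, media-type) tuples for the
    manifest items whose media type passes the allow-list.

    An empty or missing filter keeps every item.
    """
    root = ET.fromstring(opf_xml)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    if manifest is None:
        return []

    kept = []
    for item in manifest.findall(f"{{{OPF_NS}}}item"):
        media_type = item.get("media-type")
        # Skip items that are not in the allow-list, but keep iterating.
        if media_type_filter and media_type not in media_type_filter:
            continue
        kept.append((item.get("id"), item.get("href"), media_type))
    return kept


# Made-up manifest for demonstration purposes only.
SAMPLE_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="styles/main.css" media-type="text/css"/>
    <item id="font" href="styles/3.ttf" media-type="application/x-font-ttf"/>
  </manifest>
</package>"""

text_only = filter_manifest_items(SAMPLE_OPF, ["application/xhtml+xml"])
print(text_only)
```

With the filter set to the XHTML media type only, the stylesheet and the font entry are never touched, so a missing font file would not get a chance to raise.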


stichiboi · 2022-02-23T08:12:35Z

Just to be transparent: this idea comes from an error I keep getting when reading some EPUBs:

KeyError: "There is no item named 'styles/3.ttf' in the archive"

The error is caused by the EPUB rather than by ebooklib: opening the file with Atom shows that there is indeed no styles/3.ttf (there is a fonts/3.ttf instead).

I don't want to throw away the whole EPUB just because its styles cannot be read, so ideally I could skip reading them.

Skipping the items I don't need should also make loading quicker.

But I'm no expert in EPUB, so maybe this is not a good idea 😓

aerkalov · 2022-06-18T21:23:09Z

Good point. Right now everything fails if the EPUB claims to contain something that is actually missing from the archive. One option would be a setting on the EpubReader, something like "fail silently". The other would be what you suggested: a list of things to ignore/allow.
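As a rough sketch of the "fail silently" variant, the snippet below reads a list of names from a ZIP archive and records the missing ones instead of raising. It is standard-library only; read_items and ignore_missing are hypothetical names chosen for illustration, not existing ebooklib options. It relies on zipfile.ZipFile.read raising KeyError for a name that is not in the archive, which is the error quoted above.

```python
import io
import zipfile


def read_items(archive, names, ignore_missing=False):
    """Hypothetical helper: read each name from the archive.

    With ignore_missing=True, names absent from the archive are collected
    in a list instead of aborting the whole read.
    """
    contents = {}
    missing = []
    for name in names:
        try:
            contents[name] = archive.read(name)
        except KeyError:
            # zipfile raises KeyError when the entry does not exist.
            if not ignore_missing:
                raise
            missing.append(name)
    return contents, missing


# Build a small in-memory archive that mimics the broken EPUB:
# the manifest would point at styles/3.ttf, but only fonts/3.ttf exists.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("text/ch1.xhtml", "<html/>")
    zf.writestr("fonts/3.ttf", b"\x00")

with zipfile.ZipFile(buffer) as zf:
    contents, missing = read_items(
        zf, ["text/ch1.xhtml", "styles/3.ttf"], ignore_missing=True
    )

print(sorted(contents))
print(missing)
```

Returning the list of missing names keeps the failure visible to the caller, which may be preferable to dropping it without a trace.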

aerkalov added the enhancement label Jun 18, 2022

fake-name mentioned this issue May 6, 2023

Library fails to handle epub where some items in content.xml are missing #281

Open
