The latest counts show that we filter out a large part of the content, either because it has a "wrong" MIME type (we should not even download those, see #23) or because the parser finds something that resembles base64. We have to go through a few of those pages to see whether the parser works correctly or whether there is a bug.
So, we found the issue: we also filter data URLs (URLs that contain their data inline). Such URLs often seem to hold tiny image elements. We therefore suggest switching to removing only the image data elements. This way, we would hold the images in volatile memory for a very short time, without giving anybody access to them. Such behaviour is similar to a node routing traffic on the internet: it has to store the content briefly, but it cannot be held responsible for any illegal content it may forward.
Furthermore, we can whitelist things such as SVG images.
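A minimal sketch of what "removing only the image data elements" could look like, assuming the page is available as an HTML string. The function name `strip_image_data_urls`, the `allowed_subtypes` parameter, and the regex are illustrative assumptions, not the project's actual parser code. The idea is to blank out `data:image/...` URLs in place instead of discarding the whole page, with an optional whitelist for subtypes such as SVG.

```python
import re

# Matches a data URL with an image media type, e.g. data:image/png;base64,iVBOR...
# The payload is assumed to end at a quote, whitespace, ')' or an angle bracket.
DATA_IMAGE_URL = re.compile(
    r"data:image/(?P<subtype>[a-z0-9.+-]+)(?:;[^,\"'\s)<>]*)?,[^\"'\s)<>]*",
    re.IGNORECASE,
)


def strip_image_data_urls(html, allowed_subtypes=frozenset()):
    """Return html with image data URLs replaced by an empty string.

    allowed_subtypes is a hypothetical whitelist of image subtypes
    (for example {"svg+xml"}) whose data URLs are kept untouched.
    """
    def replace(match):
        subtype = match.group("subtype").lower()
        if subtype in allowed_subtypes:
            return match.group(0)  # whitelisted: keep the data URL as-is
        return ""                  # otherwise drop the embedded image data

    return DATA_IMAGE_URL.sub(replace, html)


if __name__ == "__main__":
    page = (
        '<img src="data:image/png;base64,iVBORw0KGgo=">'
        '<img src="data:image/svg+xml;base64,PHN2Zz4=">'
    )
    print(strip_image_data_urls(page))
    print(strip_image_data_urls(page, allowed_subtypes={"svg+xml"}))
```

With this approach the rest of the page (text, ordinary links) stays intact, and the embedded image bytes only exist in memory until the substitution has run. A regex is a simplification here; a real HTML or CSS parser would be more robust against unusual quoting.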
On Mon, Jun 4, 2018, 07:26 Roman Brunner ***@***.***> wrote:
So, we found the issue: we also filter data URLs (URLs that contain their
data inline). Such URLs often seem to hold tiny image elements. We therefore
suggest switching to removing only the image data elements. This way, we
would hold the images in volatile memory for a very short time, without
giving anybody access to them. Such behaviour is similar to a node routing
traffic on the internet: it has to store the content briefly, but it cannot
be held responsible for any illegal content it may forward.