Images are classified only by URL? #3572

Open
foolip opened this issue Feb 12, 2024 · 12 comments

@foolip

foolip commented Feb 12, 2024

I have been poking at https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2022/media/bytes_and_dimensions_by_format.sql to get an updated view of quality distributions in the wild.

I happened to look for 'heif' images and was surprised how many I found. It turns out that for example https://gaijincph.dk/ serves https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic

That gets classified by pithyType() as 'heif' because the URL ends with '.heic'. However, it's actually a JPEG.

In fact, it seems that only the URL is used, because of this call:

resourceFormat: pithyType({ contentType: d.mimeType, url: d.url })

There is no mimeType in the data, at least not in the httparchive.pages.2024_01_01_desktop data. Here's what I unpacked from payload and a few nested JSON objects for https://gaijincph.dk/:

[
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": false,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/logo.png?v58288",
    "totalCandidates": 1,
    "altAttribute": "GAIJIN logo",
    "clientWidth": 150,
    "clientHeight": 134,
    "naturalWidth": 2097,
    "naturalHeight": 1598,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 2097,
    "approximateResourceHeight": 1598,
    "byteSize": 125672,
    "bitsPerPixel": 0.3000221426043403,
    "computedSizingStyles": {
      "width": "auto",
      "height": "auto",
      "maxWidth": "150px",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "both",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/e6680b7e-a494-49d3-b8dd-027338d28566_m.jpg",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of The Full Gaijin Experience",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 826,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 826,
    "byteSize": 283244,
    "bitsPerPixel": 3.8101156846919557,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Tasting menu",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 960,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 960,
    "byteSize": 214586,
    "bitsPerPixel": 2.4836342592592593,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/73c94994-f80a-4f1c-b524-44d3e80e28ee_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of A la carte",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 900,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 900,
    "byteSize": 248070,
    "bitsPerPixel": 3.0625925925925928,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/29fd156a-2b46-4cb6-a5a3-0e481f66aaba_m.png",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Private Dining",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 709,
    "naturalHeight": 540,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 709,
    "approximateResourceHeight": 540,
    "byteSize": 22605,
    "bitsPerPixel": 0.47233975865851746,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/coverimage_l.jpeg",
    "totalCandidates": 1,
    "widthAttribute": "100%",
    "altAttribute": "",
    "clientWidth": 918,
    "clientHeight": 918,
    "naturalWidth": 1200,
    "naturalHeight": 1200,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 1200,
    "approximateResourceHeight": 1200,
    "byteSize": 61206,
    "bitsPerPixel": 0.34003333333333335,
    "computedSizingStyles": {
      "width": "100%",
      "height": "auto",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  }
]

Since the number of bytes and the decoded width and height are known, the decoder that was actually used should in principle be knowable.
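
As a side note on the numbers above: the bitsPerPixel values in this data are consistent with byteSize × 8 / (naturalWidth × naturalHeight). A minimal sketch of that arithmetic follows; the function name is hypothetical and this is not claimed to be the code that produced the field.

```javascript
// Hypothetical helper: encoded bits per decoded pixel.
// Assumes byteSize is the transfer size in bytes and width/height are the
// decoded (natural) dimensions, as in the JSON above.
function bitsPerPixel(byteSize, width, height) {
  const pixels = width * height;
  if (!(pixels > 0)) return NaN; // guard against zero or missing dimensions
  return (byteSize * 8) / pixels;
}

// Values copied from the first ".heic" entry above.
const heicEntryBpp = bitsPerPixel(214586, 720, 960);
console.log(heicEntryBpp.toFixed(2)); // 2.48
```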

@rviscomi
Member

cc @eeeps

@eeeps
Contributor

eeeps commented Feb 13, 2024

Those URLs are returned with the following HTTP header:

Content-Type: application/octet-stream

That fails the Regex test here, so we fall back to looking at the file extension, at the place you identified.
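
To make the fallback concrete, here is a rough sketch of that two-step logic. It is not the actual pithyType() source: the function name, the regex, the extension table, and the output labels other than 'heif' are all assumptions for illustration.

```javascript
// Hypothetical sketch of "Content-Type first, file extension second".
// The extension table and labels are illustrative assumptions.
const EXTENSION_TO_FORMAT = new Map([
  ["jpg", "jpg"], ["jpeg", "jpg"], ["png", "png"], ["gif", "gif"],
  ["webp", "webp"], ["avif", "avif"], ["heic", "heif"], ["heif", "heif"],
]);

function classifyFormat({ contentType, url }) {
  // Step 1: trust the Content-Type only if it names an image/* type.
  const mime = /^image\/([a-z0-9.+-]+)/i.exec(contentType || "");
  if (mime) {
    const subtype = mime[1].toLowerCase();
    return subtype === "jpeg" ? "jpg" : subtype;
  }
  // Step 2: otherwise fall back to the extension of the URL path
  // (the query string is ignored because only pathname is inspected).
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch {
    return "unknown";
  }
  const ext = /\.([a-z0-9]+)$/i.exec(pathname);
  if (!ext) return "unknown";
  return EXTENSION_TO_FORMAT.get(ext[1].toLowerCase()) || "unknown";
}

// The case from this issue: a generic Content-Type, so the extension wins.
console.log(classifyFormat({
  contentType: "application/octet-stream",
  url: "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic",
})); // heif
```

With no contentType at all (as in the `<img>` data above), the same function goes straight to step 2, which is why the result depends only on the URL.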

I agree that the crawler knows more than we can learn by looking at URLs and HTTP headers, and it would be nice to have the actual decoded type exposed to catch cases like this (or, failing that, at least to get a sense of how common such cases are). It might already be exposed, thanks to the work Pat Meenan did in 2022 to run the actual image resources through ImageMagick and report a number of things about them (see the note in the README: https://github.com/HTTPArchive/almanac.httparchive.org/tree/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media#notes-for-2023). I'll try to dig in later today to see why we didn't use that here.

@eeeps
Contributor

eeeps commented Feb 13, 2024

That was fast! We didn't get to use any of the ImageMagick data here because this query is working from <img>s found in the markup, rather than from HTTP requests. See my note in the readme about my failure to join requests up to loaded <img> resources, and how that was my number one TODO going forward.
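
For illustration only, the join in question could be sketched as below. This is a naive version that assumes the `<img>` URL and the request URL match exactly, which may not hold in the real data; the function name and the `detectedType` field are hypothetical.

```javascript
// Hypothetical sketch: attach a per-request detected type to <img> records.
// Assumes exact URL equality between markup and request data (an assumption).
function attachDetectedTypes(images, requests) {
  const typeByUrl = new Map();
  for (const request of requests) {
    if (request.detectedType) typeByUrl.set(request.url, request.detectedType);
  }
  // Images without a matching request get detectedType: null.
  return images.map((img) => ({
    ...img,
    detectedType: typeByUrl.get(img.url) ?? null,
  }));
}

const joined = attachDetectedTypes(
  [{ url: "https://example.test/a.heic" }, { url: "https://example.test/b.png" }],
  [{ url: "https://example.test/a.heic", detectedType: "jpeg" }]
);
console.log(joined.map((img) => img.detectedType)); // [ 'jpeg', null ]
```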

@foolip
Author

foolip commented Feb 13, 2024

@eeeps is any of the code using ImageMagick running in the current crawls? I've been thinking about exactly that these past few days: whether we could run identify -format "%Q\n" on JPEG files in particular, to understand quality in a different way. I assumed that none of the resources are on disk, so this would be a big lift, but it sounds like some of the work has already been done?

@foolip
Author

foolip commented Feb 13, 2024

Is the $._image_details data being written to anything in BigQuery yet? If not, is there a sample of that from the raw crawl data that I could look at? I'm interested to know what kind of stuff is in there and if it would help.

@rviscomi
Member

rviscomi commented Feb 13, 2024

Yeah here's a way to cheaply (355.91 MB) query a sample of the $._image_details object:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') AS image_details
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
  date = '2024-01-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image'
LIMIT
  10

Sample result
{
    "detected_type": "jpeg",
    "metadata": {
        "ExifTool": {
            "ExifToolVersion": 12.52
        },
        "File": {
            "FileSize": "137 kB",
            "FileType": "JPEG",
            "FileTypeExtension": "jpg",
            "MIMEType": "image/jpeg",
            "Comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
            "ImageWidth": 800,
            "ImageHeight": 800,
            "EncodingProcess": "Baseline DCT, Huffman coding",
            "BitsPerSample": 8,
            "ColorComponents": 3,
            "YCbCrSubSampling": "YCbCr4:2:0 (2 2)"
        },
        "JFIF": {
            "JFIFVersion": 1.01,
            "ResolutionUnit": "inches",
            "XResolution": 96,
            "YResolution": 96
        },
        "Composite": {
            "ImageSize": "800x800",
            "Megapixels": 0.64
        }
    },
    "magick": {
        "baseName": "10710.94",
        "format": "JPEG",
        "formatDescription": "JPEG",
        "mimeType": "image/jpeg",
        "class": "DirectClass",
        "geometry": {
            "width": 800,
            "height": 800,
            "x": 0,
            "y": 0
        },
        "resolution": {
            "x": 96,
            "y": 96
        },
        "printSize": {
            "x": 8.33333,
            "y": 8.33333
        },
        "units": "PixelsPerInch",
        "type": "TrueColor",
        "baseType": "Undefined",
        "endianness": "Undefined",
        "colorspace": "sRGB",
        "depth": 8,
        "baseDepth": 8,
        "channelDepth": {
            "red": 8,
            "green": 8,
            "blue": 1
        },
        "pixels": 1920000,
        "imageStatistics": {
            "Overall": {
                "min": 0,
                "max": 255,
                "mean": 65.7495,
                "median": 35.6667,
                "standardDeviation": 79.6716,
                "kurtosis": 0.0952899,
                "skewness": 1.18423,
                "entropy": 0.835339
            }
        },
        "channelStatistics": {
            "red": {
                "min": 0,
                "max": 255,
                "mean": 58.5843,
                "median": 15,
                "standardDeviation": 82.5814,
                "kurtosis": 0.438709,
                "skewness": 1.4063,
                "entropy": 0.805027
            },
            "green": {
                "min": 0,
                "max": 255,
                "mean": 60.4429,
                "median": 30,
                "standardDeviation": 76.1642,
                "kurtosis": 0.756071,
                "skewness": 1.40876,
                "entropy": 0.838816
            },
            "blue": {
                "min": 0,
                "max": 255,
                "mean": 78.2214,
                "median": 62,
                "standardDeviation": 80.2692,
                "kurtosis": -0.494016,
                "skewness": 0.80254,
                "entropy": 0.862173
            }
        },
        "renderingIntent": "Perceptual",
        "gamma": 0.454545,
        "chromaticity": {
            "redPrimary": {
                "x": 0.64,
                "y": 0.33
            },
            "greenPrimary": {
                "x": 0.3,
                "y": 0.6
            },
            "bluePrimary": {
                "x": 0.15,
                "y": 0.06
            },
            "whitePrimary": {
                "x": 0.3127,
                "y": 0.329
            }
        },
        "matteColor": "#BDBDBD",
        "backgroundColor": "#FFFFFF",
        "borderColor": "#DFDFDF",
        "transparentColor": "#00000000",
        "interlace": "None",
        "intensity": "Undefined",
        "compose": "Over",
        "pageGeometry": {
            "width": 800,
            "height": 800,
            "x": 0,
            "y": 0
        },
        "dispose": "Undefined",
        "iterations": 0,
        "compression": "JPEG",
        "quality": 80,
        "orientation": "Undefined",
        "properties": {
            "comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
            "date:create": "2024-01-13T04:33:02+00:00",
            "date:modify": "2024-01-13T04:33:02+00:00",
            "date:timestamp": "2024-01-13T04:34:18+00:00",
            "jpeg:colorspace": "2",
            "jpeg:sampling-factor": "2x2,1x1,1x1",
            "signature": "0d0e8995e2aae98c15e1e2bc69c8f988423e022cf4055d72e9752a457a53a440"
        },
        "tainted": false,
        "filesize": "136972B",
        "numberPixels": "640000",
        "pixelsPerSecond": "40.1272MB",
        "userTime": "0.020u",
        "elapsedTime": "0:01.015"
    }
}
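
A small sketch of pulling the fields of interest out of one such object, for anyone post-processing query results outside BigQuery. The function name and the output shape are hypothetical; only the input field names (detected_type, magick.format, magick.quality) come from the sample above.

```javascript
// Hypothetical helper: summarize one $._image_details value.
// Accepts either the JSON string returned by the query or a parsed object.
function summarizeImageDetails(details) {
  const obj = typeof details === "string" ? JSON.parse(details) : details;
  if (!obj || typeof obj !== "object") return null; // _image_details missing
  const magick = obj.magick || {};
  const isJpeg = magick.format === "JPEG";
  return {
    detectedType: obj.detected_type ?? null,
    magickFormat: magick.format ?? null,
    // Only report quality for JPEGs, where the sample shows it populated.
    jpegQuality: isJpeg && Number.isFinite(magick.quality) ? magick.quality : null,
  };
}

console.log(summarizeImageDetails({
  detected_type: "jpeg",
  magick: { format: "JPEG", quality: 80 },
})); // { detectedType: 'jpeg', magickFormat: 'JPEG', jpegQuality: 80 }
```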

@eeeps
Contributor

eeeps commented Feb 13, 2024

As per usual, Rick beat me to it. Different (older?) flavor:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') as image_details
FROM `httparchive.requests.2023_12_01_mobile` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE JSON_QUERY(payload, '$._image_details') IS NOT NULL

results

@foolip
Author

foolip commented Feb 14, 2024

Thank you @rviscomi and @eeeps, my joy is boundless! I will play around with this.

@foolip
Author

foolip commented Feb 14, 2024

After some terrible queries and intermediate tables I have a first result:

[Chart: JPEG quality]

Is this the right repo to ask questions such as why _image_details is sometimes missing, and other things I'll need to figure out to refine this?

@rviscomi
Member

Yeah I think here is fine

cc @pmeenan

@eeeps
Contributor

eeeps commented Feb 14, 2024

@foolip not sure about venue (if I have a discussion that might require a chattier exploration, I generally start it in the HTTP Archive Slack), but @pmeenan is the person to ask about missing _image_details.

Interesting chart! I do worry though... the "quality" reported by ImageMagick's identify for JPEGs, like most 0-100 quality scales used by encoders, is arbitrary and IM- and JPEG-specific. It's based on the quantization tables IM finds in the file, which will mostly correlate with what people think "quality" means (a subjective evaluation of "how good" the output looks when compared with the input), but far from exactly. Worse, this value doesn't line up with other formats or other tools. People expect "quality 80" to mean the same thing everywhere. It does not, even for tools that only deal with JPEGs, and once you're talking about other formats, you're in another universe.

That said... the number of quality: 100 JPEGs here -- wow. Antipattern!

@pmeenan
Member

pmeenan commented Feb 14, 2024

If you have examples of where it is missing I can take a look. It could happen if for some reason the image response body isn't available, or if the code that detects the image type by looking at the header bytes doesn't recognize it.

heif is definitely not detected but the others should be reasonably up to date.
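
For readers unfamiliar with header-byte detection, a minimal sketch of the general technique follows. It is not the crawler's code: the function name is hypothetical and the set of formats is an arbitrary choice. The signatures themselves (JPEG FF D8 FF, the 8-byte PNG signature, GIF87a/GIF89a, RIFF....WEBP) are the standard ones.

```javascript
// Hypothetical sketch of sniffing an image type from its first bytes.
// `bytes` is a Uint8Array (or Buffer) holding the start of the response body.
function sniffImageType(bytes) {
  const startsWith = (sig, offset = 0) =>
    sig.every((b, i) => bytes[offset + i] === b);
  const ascii = (s) => Array.from(s, (c) => c.charCodeAt(0));

  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith(ascii("GIF87a")) || startsWith(ascii("GIF89a"))) return "gif";
  if (startsWith(ascii("RIFF")) && startsWith(ascii("WEBP"), 8)) return "webp";
  return null; // unrecognized header: no type reported
}

// A body that starts with the JPEG marker is reported as JPEG,
// whatever extension its URL happens to carry.
console.log(sniffImageType(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0]))); // jpeg
```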

@foolip foolip mentioned this issue Mar 3, 2024
10 tasks