kig/metadata
Thanks
------

Konrad Meyer for his patient testing and bug reports.

Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)


Description
-----------

This package `Metadata' comes with a library called `metadata' and a small
program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions and camera
make, mp3 artist, pdf text and word count) and returns the metadata as a Hash.
All strings in the metadata are converted to UTF-8.

The `mdh' program can print out file metadata as YAML and package the metadata
with the file.

The metadata hash follows the shared file metadata spec naming, with some
additional fields; see the list at the end of this file (Appendix A.)

For details on the MDH file format, see the end of this file (Appendix B.)


Usage
-----

  # print out metadata for myfile.jpg
  mdh myfile.jpg

  # create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg

  # print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh

  # strip out the metadata header from an MDH file and write the actual file
  # to myfile.jpg
  mdh -e myfile.jpg.mdh

  # include file path, filename, md5sum and sha1sum in the metadata header
  mdh --path --name -m -s myfile.jpg

  # guess title for document (first line that starts with a capital letter)
  mdh --guess-title foo.ps

  # guess title, abstract and metadata for document
  mdh --guess-metadata foo.ps

  # don't include document text (File.Content) in the metadata
  mdh --no-text foo.ps

  # query CiteSeer with the document title, add possible results to metadata
  mdh --citeseer foo.ps

  # query DBLP with the document title, add possible results to metadata
  mdh --dblp foo.ps

  # If you have an unknown CS document, this might help identify it:
  mdh --guess-metadata --dblp --citeseer foo.ps

  # print out the list of options
  mdh -h

  irb> require 'metadata'
  irb> Metadata.extract('myfile.jpg')
  irb> Metadata.extract_text('myfile.pdf')
  irb> Pathname.new("myfile.jpg").metadata


List of supported formats
-------------------------

  Audio:
    Whatever you manage to make mplayer play.
    Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.
    Successfully tested with: mp3, flac, ogg, wav, ra, m4a, wma
    Should also work: wv, mpc, ape

  Video:
    Whatever you manage to make mplayer play.
    Successfully tested with: wmv, mov, divx, xvid, flv, ogg, mpg, mkv

  Images:
    Should handle pretty much anything, i.e. anything handled by ExifTool,
    ImageMagick, Imlib2 or dcraw.
    Successfully tested with:
      Web formats: jpeg, png, gif, svg
      Camera raws: nef, dng, crw, pef, orf, arw, raf, cr2
      Image editor state dumps: psd, xcf
      The rest: tga, tif, bmp, xpm, ppm, pcx

  Documents:
    Successfully tested with:
      Web formats: html, txt
      Print formats: pdf, ps, ps.gz
      OO formats: sxi, odp
      MS formats: doc, ppt, xls
    - I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
      dimensions extraction, so those bits of data are missing. MSOffice docs
      are missing dimensions for the same reason. Here's a way to get them:
      ( first, get Thumbnailer: http://github.com/kig/thumbnailer/tree/master )

        $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
        $ mdh foo.odp
        $ rm foo.odp-temp.pdf /tmp/foo.jpg

  Others:
    - BitTorrent .torrent files
    - Archive contents: tar.gz, zip
    - Whatever `extract' outputs and I am handling

  Formats that yield very little metadata: ai
  Formats that don't yield usable metadata: chm, sis, rb, rar, ttf
  Formats that fail mimetype guessing: exr


Requirements
------------

  * Ruby 1.8

  * Tons of metadata extraction programs and libs. This package has many
    dependencies since there is no single universal metadata header format
    that all files use. Blame resource forks, filename extensions, bags of
    bytes and mimetypes.

    List of gems:
      flacinfo-rb wmainfo-rb MP4Info id3lib-ruby apetag text hpricot
      ruby-mp3info

    List of Debian packages:
      dcraw libimlib2-ruby extract libimage-exiftool-perl poppler-utils
      mplayer html2text imagemagick unhtml pstotext antiword catdoc
      shared-mime-info

  * You do want to install the latest versions of dcraw and shared-mime-info
    to be able to handle camera raw images.
      http://cybercom.net/~dcoffin/dcraw/
      http://freedesktop.org/wiki/Software/shared-mime-info

  * Python + chardet library
      http://chardet.feedparser.org/


Install
-------

De-compress the archive and enter its top directory.
Then type:

  ($ su)
   # ruby setup.rb

This simple step installs the program under the default location of Ruby
libraries. You can also install files into your favorite directory by
supplying setup.rb some options. Try "ruby setup.rb --help".


Appendix A: Metadata fields
--------------------------------------

This list contains the metadata fields output by Metadata and mdh.
The list follows the shared file metadata spec for the most part.
http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec

field name                | field type
----------------------------------------------------------------------
Archive.Contents            array of pathnames

Audio.Band                  string
Audio.Composer              string
Audio.Conductor             string
Audio.Copyright             string (copyright message)
Audio.Grouping              string
Audio.Image                 base64-encoded binary string (embedded image data)
Audio.InterpretedBy         string
Audio.Lyricist              string
Audio.Publisher             string
Audio.RemixedBy             string
Audio.Subtitle              string
Audio.Tempo                 integer
Audio.VariableBitrate       boolean
Audio.Writer                string
Audio.Publicationright      string
Audio.File                  string
Audio.EAN/UPC               string
Audio.ISBN                  string
Audio.Catalog               string
Audio.LC                    string
Audio.Media                 string
Audio.Index                 string
Audio.Related               string
Audio.ISRC                  string
Audio.Abstract              string
Audio.Language              string
Audio.Bibliography          string
Audio.Introplay             string
Audio.Dummy                 string
Audio.DebutAlbum            string
Audio.RecordDate            string
Audio.RecordLocation        string
v-- ORIGINAL FIELDS USED --v
Audio.Title                 string
Audio.Artist                string
Audio.Album                 string
Audio.AlbumArtist           string
Audio.AlbumTrackCount       integer
Audio.TrackNo               integer
Audio.DiscNo                integer
Audio.Performer             string
Audio.Duration              float
Audio.ReleaseDate           datetime
Audio.Comment               string
Audio.Genre                 string
Audio.Codec                 string
Audio.Samplerate            integer
Audio.Bitrate               float
Audio.Channels              integer
Audio.Lyrics                string

Doc.Album                   string
Doc.Artist                  string
Doc.Charset                 string
Doc.Description             string
Doc.Genre                   string
Doc.Language                string
Doc.ModifyDate              date
Doc.PageSizeName            string (A4, A5, letter, ...)
Doc.RevisionHistory         array of strings
Doc.ParagraphCount          integer
Doc.LineCount               integer
Doc.CharacterCount          integer
Doc.LastSavedBy             string
Doc.Keywords                array of strings
Doc.Template                string
Doc.Publisher               string
Doc.PublicationName         string
Doc.PublicationPages        string
Doc.Citations               array of {href=>a, title=>b, rest=>c} hashes
Doc.Contributor             string
Doc.CiteSeerIdentifier      string
Doc.CiteSeerURL             string
Doc.Published               datetime
Doc.Source                  string
Doc.DBLPIdentifier          string
Doc.CrossRef                string (BibTex crossref)
Doc.BibSource               string (BibTex source)
Doc.BibTexType              string (BibTex type: article, inbook, ...)
Doc.ACMCategories           array of strings
v-- ORIGINAL FIELDS USED --v
Doc.Title                   string
Doc.Subject                 string
Doc.Author                  string
Doc.PageCount               integer
Doc.WordCount               integer
Doc.Created                 datetime

File.Software               string (software used to create the file)
File.MD5Sum                 string (md5sum of file's contents)
File.SHA1Sum                string (sha1sum of file's contents)
v-- ORIGINAL FIELDS USED --v
File.Name                   string (basename of the file)
File.Path                   string (dirname of the file)
File.Format                 string (mime type, inode/directory for dirs)
File.Size                   integer
File.Content                string
File.Modified               string

Image.DateCreated           date
Image.DateTimeCreated       date
Image.DateTimeOriginal      date
Image.DimensionUnit         string (px, mm, pt, ...)
Image.Editor                string
Image.EXIF                  string (exiftool output)
Image.FrameCount            integer
Image.LayerCount            integer
Image.Modified              date
Image.OriginatingProgram    string
Image.ComponentCount        integer
Image.ColorMode             string (e.g. RGB)
Image.ColorSpace            string (e.g. sRGB)
v-- ORIGINAL FIELDS USED --v
Image.Height                float
Image.Width                 float
Image.Title                 string
Image.Date                  datetime
Image.Creator               string
Image.Description           string
Image.Software              string
Image.CameraMake            string
Image.CameraModel           string
Image.ExposureProgram       string
Image.ExposureTime          float
Image.Fnumber               float
Image.Flash                 boolean
Image.FocalLength           float
Image.ISOSpeed              float
Image.MeteringMode          string
Image.WhiteBalance          string
Image.Copyright             string

Location.Latitude           float
Location.Longitude          float

Video.Album                 string
Video.Artist                string
Video.Bitrate               integer
Video.Codec                 string
Video.Comment               string
Video.Duration              float
Video.Framerate             float (frames per second)
Video.Genre                 string
Video.ReleaseDate           date
Video.Title                 string
Video.TrackNo               integer
Video.Demuxer               string

BitTorrent.Name             string
BitTorrent.Files            array of { 'path' => string,
                                       'length' => integer,
                                       'md5sum' => string }
BitTorrent.Length           integer (size of single-file torrents)
BitTorrent.MD5Sum           string (md5sum for single-file torrents)
BitTorrent.PieceCount       integer
BitTorrent.PieceLength      integer (length of a single piece)
BitTorrent.Comment          string
BitTorrent.Announce         string (announce url)
BitTorrent.AnnounceList     array of arrays of strings
BitTorrent.Nodes            array of [hostname, port] arrays


Appendix B: The MDH file format
-------------------------------

MDH files are built as follows:

bytes | content
---------------
    3 | "MDH" - MDH file format identifier
    1 | "\x01" - MDH file format version number
    4 | Long, network byte order - the size of the metadata struct in bytes
  var | YAML - The MDH metadata struct
  var | The actual file contents

All string fields in the metadata are UTF-8.


License
-------

Ruby's


Ilmari Heikkinen <ilmari.heikkinen gmail com>
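
Sketch: reading and writing the Appendix B layout
-------------------------------------------------

The byte layout in Appendix B can be illustrated with a few lines of Ruby.
This is a minimal sketch written only from that table, not the code mdh
itself uses. The helper names `mdh_pack' and `mdh_unpack' are hypothetical,
and it assumes the metadata struct is a plain Hash of strings and numbers
(mdh may serialize richer values, such as dates, differently).

```ruby
require 'yaml'

# Hypothetical helpers for illustration; not part of the metadata library.
MDH_MAGIC   = "MDH".b   # 3-byte format identifier
MDH_VERSION = "\x01".b  # 1-byte format version

# Build an MDH blob: identifier, version, 4-byte big-endian size of the
# YAML header, the YAML header itself, then the untouched file contents.
def mdh_pack(metadata, payload)
  header = YAML.dump(metadata).encode("UTF-8").b
  MDH_MAGIC + MDH_VERSION + [header.bytesize].pack("N") + header + payload.b
end

# Split an MDH blob back into [metadata_hash, file_contents].
def mdh_unpack(blob)
  blob = blob.b
  raise ArgumentError, "not an MDH file" unless blob[0, 3] == MDH_MAGIC
  raise ArgumentError, "unsupported MDH version" unless blob[3, 1] == MDH_VERSION
  size = blob[4, 4].unpack1("N")  # "N" = 32-bit unsigned, network byte order
  raise ArgumentError, "truncated MDH header" if size.nil? || blob.bytesize < 8 + size
  metadata = YAML.safe_load(blob[8, size].force_encoding("UTF-8"))
  [metadata, blob[8 + size..-1]]
end
```

With these helpers, `mdh_unpack(mdh_pack(meta, data))' should give back the
same hash and the same bytes. The size field counts bytes, not characters,
which matters because the YAML header is UTF-8 and may contain multi-byte
characters.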