Requiring mime/types accounts for 25% of all application RAM #94
Comments
👍 on lazy load from JSON Hash. I don't know too much about which are the most popular users of the mime-types gem but I really think that eagerly loading the top 10% of mime-types will probably cover a majority of use cases and dramatically reduce the memory footprint. |
👎 on lazy load of the types from JSON file. That’s not an option for the slowness. Memory use is of paramount importance; see #83 by @jeremyevans (also mikel/mail#829) for a similar report. I have no information to indicate what the most commonly used types are, so I don’t want to predictively load anything.

I have a slightly different option (mostly because working from the JSON hash would require a fairly substantial change to

What I’m leaning toward, and have been a little too busy to try to investigate on, is essentially “mime-types-lite”. Consider the canonical representation in YAML:

- !ruby/object:MIME::Type
  content-type: application/atom+xml
  friendly:
    en: Atom Syndication Format
  encoding: 8bit
  extensions:
  - atom
  references:
  - IANA
  - RFC4287
  - RFC5023
  - ! '{application/atom+xml=http://www.iana.org/assignments/media-types/application/atom+xml}'
  xrefs: !ruby/hash:MIME::Types::Container
    rfc:
    - rfc4287
    - rfc5023
    template:
    - application/atom+xml
  registered: true

What most users care about, based on some scanning of codebases on GitHub, is the content type, the extensions, and maybe the encoding. All of the other data is useful, but not for all applications. There’s a bunch of things that I’ll be removing in mime-types 3.0 because they’ve been deprecated for a while, and I have at least two issues (#45 and #67) asking for more information or at least a different organization of the same data for different purposes. I’m also increasingly convinced that the simplified type (and the sort based on that, per rest-client/rest-client#248) is probably a mistake.

In the short term, #64 looks like it may offer a substantial reduction in duplicated text (e.g.,
Thanks for all the feedback. The file store seems to be out of the question. I've played around more with some refactorings and I want to share my experiments, their results, and a suggestion.

Json Parser

We can save memory by switching to Yajl for the JSON parser. Parsing the file with Yajl uses about 3 MB, compared to about 6 MB with JSON.

require 'json'
require 'get_process_mem'
require 'yajl'
file_name = "/Users/richardschneeman/Documents/projects/mime-types/data/mime-types.json"
GC.start(full_mark: true, immediate_sweep: true)
before = GetProcessMem.new.mb
# array = JSON.parse(File.open(file_name, 'r:UTF-8:-') { |f| f.read })
array = Yajl::Parser.new.parse(File.new(file_name, 'r'))
GC.start(full_mark: true, immediate_sweep: true)
after = GetProcessMem.new.mb
puts "MEM Difference: #{after - before}"

Unfortunately without doing anything else, we'll see no savings, this is because the act of creating

Smaller Object Footprint

I experimented using a minimum viable object to store data in. The idea is that once we need it we could either promote it or do something with it.

# Minimal holder: copies each field out of the parsed JSON hash and nothing else.
class MicroMime
  attr_accessor :content_type, :encoding, :references,
                :xrefs, :registered, :extensions, :obsolete,
                :use_instead, :friendly, :signature,
                :system, :docs

  def initialize(hash)
    @content_type = hash["content-type".freeze]
    @encoding     = hash["encoding".freeze]
    @references   = hash["references".freeze]
    @xrefs        = hash["xrefs".freeze]
    @registered   = hash["registered".freeze]
    @extensions   = hash["extensions".freeze]
    @obsolete     = hash["obsolete".freeze]
    @use_instead  = hash["use-instead".freeze]
    @friendly     = hash["friendly".freeze]
    @signature    = hash["signature".freeze]
    @system       = hash["system".freeze]
    @docs         = hash["docs".freeze]
  end
end

Compared to the current

ENV['RUBY_MIME_TYPES_LAZY_LOAD'] = "true"
require 'mime/types'
# array = Yajl::Parser.new.parse(json).map {|element| MicroMime.new(element) }
# array = Yajl::Parser.new.parse(json).map {|element| MIME::Type.new(element) }

The results are pretty stark

Maybe we could investigate why the

Truly lazy load

I've got a branch of code that stores the raw json from the file (manipulated slightly to make searching by content-type easier). We can re-implement

While I think this will eventually work, it feels a bit to me like this:

It's slow on the first call and fast on all the others. Here's the memory difference on startup:

array = Yajl::Parser.new.parse(json)
lazy = MIME::Types::Lazy.new(array)
# => memory use: 5.203125 mb

Here we're actually using more memory than saving and retaining the

Suggestion

Use Yajl, it's faster and has a smaller memory footprint, without it even if we get savings somewhere else, we might lose it on the json parser. I also think we should switch to lazy creation of

Right now I feel like the logic for

There may be other spots to save memory on, but going forwards this looks like the most sane plan. Let me know what you think or if you have any questions.
Thanks for pursuing this further. You’ve confirmed my suspicion that moving to a lighter-weight load of data is going to have a substantial impact, so I’m going to start designing in that direction for mime-types 3, which is where any change of this magnitude has to go.

Unfortunately, Yajl is almost certainly a non-starter, because mime-types has to work on all Rubies conforming to 1.9.2 or higher, and Yajl is a C binding (excluding at least JRuby). This means I could either support MultiJSON (nope), support multiple JSON libraries in a similar way that MultiJSON does (probably not), or provide some sort of configuration mechanism to pass in a JSON parser that works the way that I use it. (Purely academically, it would be interesting to compare Yajl against Oj.) I am leaning toward the third option, but I have to think about it. It also doesn’t feel right because mime-types is currently a no-dependency gem for users. (I do not mind adding dependencies for people who are developing mime-types code.)

There’s one other idea that was suggested back at RubyConf that’s a bit radical: what if the default mime-types registry was generated as Ruby files? It’d be a pretty substantial change…but not hard to test. The entire memory use would be at load time, and while the total number of objects may not be lower, using a few of the tricks that you’ve pointed out for an earlier PR would make it fairly easy to keep that lower in memory. If, in fact, a MIME::Type::Lite (or whatever it gets called) is what is written out and the JSON parsing is only used when more data is desired (that is not, as far as I can tell, the average use case, but there are users who do want more data)…this could be interesting.
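As a rough illustration of the third option mentioned above (a configuration hook for passing in a JSON parser), here is a minimal sketch. The `TypeData` module and its `json_parser` accessor are hypothetical names invented for this example; nothing like them is confirmed to exist in mime-types. The only assumption about the parser is that it responds to `call` with a JSON string.

```ruby
require 'json'

# Hypothetical loader; these names are illustrative and not part of mime-types.
module TypeData
  class << self
    # Any object that responds to #call(String) and returns parsed data.
    attr_writer :json_parser

    def json_parser
      # Default to the JSON library that ships with Ruby, so no extra
      # dependency is needed unless the application opts in.
      @json_parser ||= ->(text) { JSON.parse(text) }
    end

    def load(text)
      json_parser.call(text)
    end
  end
end

data = TypeData.load('[{"content-type":"text/html","extensions":["html","htm"]}]')
puts data.first["content-type"]

# An application that already depends on another parser could swap it in, e.g.:
#   TypeData.json_parser = ->(text) { Oj.load(text) }
```

The point of this shape is that the library itself keeps zero runtime dependencies, while applications that already bundle a faster parser can hand it over explicitly.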
@halostatue Just curious, why is MultiJSON not an option? |
Main reason? Because mime-types is currently a no-dependencies-required gem, and I want to keep it that way. There’s nothing about mime-types that should require it use anything beyond that which is included with Ruby by default (at least for a sane Ruby like MRI or JRuby). Enabling advanced usage is one thing; requiring it by default for a library like mime-types is the wrong thing to do. Beyond that? I don’t like MultiJSON as a user. The only reason it will ever show up in a project that I’m involved with is if a library that I use has chosen not to make a choice on a library; it will never be primary in the |
@halostatue Sorry to bother, but why don't you like MultiJSON as a user? I know you're probably busy so if you're not interested in explaining, feel free to leave the question unanswered.
I’m planning on being at RubyConf this fall; if you’re there, we can have that discussion. I’d like to keep this discussion focussed on the performance improvements suggested by @schneems—and how to keep them within the goals and constraints I have for mime-types as a library. |
I was wrong about the suggestion of generating Ruby to include this. This was suggested by @postmodern in #85, and is the only reason that ticket is still open. It does also suggest that adding unstated dependencies (because apparently Rubinius does not include JSON in its standard library but every other Ruby does) is potentially problematic. |
@halostatue I tried extracting the storage of the mime types into pure ruby files, and weirdness happened. I pulled them into a runnable benchmark: https://github.com/schneems/require_memory_size_benchmarks. Looks like to minimize memory we would need to split out the requires into different files. Also it ends up not saving us any memory versus what we're currently doing. Note the "json" method in that example uses yajl which, even if we can't use it here, has so far been my "best case" and I think we should shoot for near that memory footprint.

While I was at RailsConf @jeremyevans gave a recommendation to check out SDBM, which is in the stdlib, as a way of storing and loading data. I played around with it; below are some benchmarks and thoughts, please take a look.

Option store info in a SDBM database and create objects as needed

The SDBM library I would propose storing two SDBM databases, one that has a reference to simplified content types, another that has a reference to extensions. I used a script like this for preparing the two databases. I've got pros and cons of using such an approach listed below.

require 'sdbm'
require 'json'
file_name = File.expand_path("../data/mime-types.json", __FILE__)
file = File.open(file_name, 'r:UTF-8:-') { |f| f.read }
array = JSON.parse(file)
require 'mime/types'

# Database 1: simplified content type => JSON array of full type hashes.
SDBM.open 'data/content-types' do |db|
  db.clear
  array.each do |hash|
    simplified = MIME::Type.simplified(hash["content-type"])
    db[simplified] ||= "[]"
    previous = JSON.parse(db[simplified])
    previous << hash
    db[simplified] = previous.to_json
  end
end

# Database 2: extension => JSON array of simplified content types.
SDBM.open 'data/extensions' do |db|
  db.clear
  array.each do |hash|
    next unless hash["extensions"]
    hash["extensions"].each do |extension|
      db[extension] ||= "[]"
      previous = JSON.parse(db[extension])
      previous << MIME::Type.simplified(hash["content-type"])
      db[extension] = previous.to_json
    end
  end
end

Pros
Here's an example of how I was doing lookups

@content_type_database = SDBM.open("data/content-types")
@extensions_database = SDBM.open("data/extensions")

def find_by_content_type(type)
  array_string = @content_type_database[type] || "[]"
  JSON.parse(array_string).map { |hash| MIME::Type.new(hash) }
end

def find_extension(ext)
  if types = @extensions_database[ext]
    JSON.parse(types).flat_map do |type|
      find_by_content_type(type)
    end
  end
end

The benchmarks are promising

require 'benchmark/ips' # Benchmark.ips comes from the benchmark-ips gem
Benchmark.ips do |bm|
  bm.report("find extension") { find_extension("html".freeze) }
  bm.report("find by type") { find_by_content_type("application/applefile".freeze) }
end
# Calculating -------------------------------------
# find extension 4.195k i/100ms
# find by type 8.332k i/100ms
# -------------------------------------------------
# find extension 43.075k (±14.6%) i/s - 213.945k
# find by type 86.857k (±13.1%) i/s - 433.264k
Cons
To work around, we could copy the database to a tmp location on the first

require 'tmpdir'        # also loads fileutils
require 'benchmark/ips' # Benchmark.ips comes from the benchmark-ips gem
Benchmark.ips do |bm|
  bm.report("copy") do
    tmpdir = Dir.mktmpdir
    FileUtils.cp(%W{ data/extensions.dir data/extensions.pag data/content-types.dir data/content-types.pag }, tmpdir)
  end
end
# Calculating -------------------------------------
# copy 1.000 i/100ms
#-------------------------------------------------
# copy 10.254 (± 9.8%) i/s - 51.000

The copy technique is slow, i.e. it takes between 0.09 and 0.125 seconds. We'll make a little of that back by not having to parse JSON and create objects, but not nearly all of it. Maybe there's a better or more efficient way to do the copying; at least it would only have to be done once no matter how many times you call
Fin

Anywhoo, that's what I've been playing around with, any thoughts? Also thanks to Jeremy for the recommendation, I didn't know about SDBM.
@schneems Did you have a chance to benchmark how much this will slow down something like To speed that up, it may be advantageous to have a file with a simplified mime type per line. Then we can scan the file to get all simplified types matching the regexp, map over that to look up each matching simplified type in the database. That should significantly decrease IO for non-stupid regexps (e.g. I guess I never checked whether SDBM was ruby code or a C extension, and figured ruby. I'm surprised Rubinius and JRuby don't implement it because they implement most of the rest of the stdlib, but maybe they just never got a request for it, since so few people use or know about sdbm. However, since mime-types should work in Rubinius and JRuby, we'd have to create an sdbm-compatible reader in pure ruby and ship it with mime-types, which while possible, would probably be a lot of work. @halostatue The other idea that @schneems and I discussed at RailsConf was using a columnar storage approach. Basically, have the main mime-types file load |
@schneems & @jeremyevans: This leaves me with a lot to chew on. After I release mime-types 2.5, I’m going to start a 3.x branch where I can start trying to figure out how the data can be sorted out between minimal, current, and expanded. The first step is to get rid of the deprecated methods and data. That will probably cut the memory usage by up to a third (we are duplicating some data between My first guess on the data required is content type/subtype, extensions, and two binary flags (un/registered and text/binary). The extensions are required for anyone interacting with MIME::Types through CarrierWave or another uploader binary (including Restify). I think that 3.x is going to be a breaking change with some (lots of?) incompatibility in methods (to the point where I may even work on this in a I’m uncomfortable with the SDBM approach because while it will do better for MRI, a pure-Ruby reader for SDBM will probably drag the performance of JRuby and Rubinius down substantially). I don’t know enough about columnar stores to be able to implement them myself…and there are ~2k MIME types involved. I’m also going to shift my approach to mime-types development as soon as I release 2.5. |
(And, BTW, I am jealous of the two of you having been at RailsConf. Hope to see you both at RubyConf in the fall—and I’m trying to figure out something to possibly propose a talk about this. Maybe this. Maybe one of several things I am doing at work.) |
I'll see if I can work on a proof of concept of my columnar store idea and see if that will work (pass all tests) and provide enough memory savings to make the approach worthwhile. I'm not sure I'll be able to make RubyConf this year, but I'll try. |
Sounds great. Just as a warning, I’m releasing 2.5 tonight (if at all possible), and then I’m going to move mime-types development to mime-types/ruby-mime-types and keep this as a fork so people don’t completely wonder where this has gone. I’m updating the documentation in the gem right now. |
@jeremyevans It takes about

@content_types_database.each.select { |k, _| k =~ regex }.map {|v| MIME::Type.new(v) }

Or about 55 iterations per second

@halostatue I look forward to seeing you in San Antonio. If you're interested in getting a set of eyes on a title/abstract, or you want someone to bounce ideas off, I've got some experience and would be happy to help; you can shoot me an email at richard@heroku.com. I think SDBM might be a non-starter unless easy and performant JRuby and Rubinius options come out of the woodwork. Thanks to both of you for your time and responses!
Here's my work in progress diff: http://pastie.org/pastes/10112762/text Basic approach: Use a plain text file to store the content type and extensions for each mime type in the json file, with one line per mime type. Have supplementary data files for each separate mime-type attribute (currently implemented: If an attribute getter is called on a mime type instance, and the attribute value has not been loaded yet, just load the attribute value for all mime types, then return the getter value. We iterate over the file lines using each_line to avoid loading the entire file into memory. The parse_mime_json script included with the diff parses the json file into the plain text data files. The check_mime_json script checks for behavior. There is currently one mime-type where the behavior is different, the Memory difference: Ruby by itself: 11752KB RSS So about a 10x improvement on base memory use. Additionally, the time to load mime-types has been reduced also by about 3.8x, from 0.38 seconds to 0.1 seconds on my machine. I haven't actually started running the tests with this yet, since I need to handle the rest of the attributes before doing so. But I'd like your thoughts before I continue further down this path. I haven't added mutex locking around loading the supplementary data files to ensure thread safety, but I definitely plan to do that if this looks like a good approach. @halostatue @schneems Your thoughts on this approach? |
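The columnar idea described above can be sketched in a few lines. This is not the code from the linked diff: the class name, the `.column` file names, and the whitespace-separated line format are all assumptions made for illustration. It shows the two behaviours the comment describes, streaming the main file with `each_line` and loading a supplementary attribute for every type the first time any getter needs it, plus the mutex that the comment says is still to be added.

```ruby
require 'tmpdir'

# Hypothetical columnar layout (file names and line format are assumptions):
#   types.column     one "content-type ext1 ext2 ..." line per type
#   encoding.column  one encoding per line, in the same order
class ColumnarRegistry
  Entry = Struct.new(:content_type, :extensions, :encoding)

  def initialize(dir)
    @dir = dir
    @loaded = {}
    @lock = Mutex.new
    # each_line streams the file rather than reading it in one piece.
    @entries = File.open(File.join(dir, "types.column")) do |f|
      f.each_line.map do |line|
        content_type, *extensions = line.split
        Entry.new(content_type, extensions, nil)
      end
    end
  end

  def find(content_type)
    @entries.find { |e| e.content_type == content_type }
  end

  # The first call loads the encoding column for every entry at once.
  def encoding_of(content_type)
    load_column(:encoding)
    entry = find(content_type)
    entry && entry.encoding
  end

  def column_loaded?(name)
    @loaded.key?(name)
  end

  private

  def load_column(name)
    @lock.synchronize do
      return if @loaded[name]
      File.open(File.join(@dir, "#{name}.column")) do |f|
        f.each_line.with_index { |line, i| @entries[i][name] = line.chomp }
      end
      @loaded[name] = true
    end
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "types.column"), "text/html html htm\napplication/atom+xml atom\n")
  File.write(File.join(dir, "encoding.column"), "8bit\n8bit\n")

  registry = ColumnarRegistry.new(dir)
  puts registry.find("text/html").extensions.inspect
  puts registry.column_loaded?(:encoding)
  puts registry.encoding_of("application/atom+xml")
end
```

Applications that only ever ask for content types and extensions never pay for the other columns, which is where the memory saving in this approach would come from.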
If you rebase what you’re doing against master (mime-types/ruby-mime-types, now; update your remotes!) you should see this improved (I improved my parser to dump that crap into a new xrefs:notes field), as I just released 2.5. Do note that the JSON file is not the source file; that's the YAML files in
Unfortunately, I'm not seeing a significant difference after rebasing to the current master branch, as my initial checkout was only 4-5 hours ago. Since you feel pretty good about the columnar approach, I'll try to work tomorrow on filling out the remaining attributes, changing the data file creation to use the yaml files instead of the json ones, adding rake tasks for creating the data files, and fixing the remaining issues I discussed, as well as making sure all of the tests pass. |
I've added pull request #96 which implements the columnar storage idea. I believe it should be backwards compatible, and it offers significant memory savings (10x best case, 2x worst case). |
== 2.6.1 / 2015-05-25

* Bugs:
  * Make columnar store handle all supported extensions, not just the first.
  * Avoid circular require when using the columnar store.

== 2.6 / 2015-05-25

* New Feature:
  * Columnar data storage for the MIME::Types registry, contributed by Jeremy Evans (@jeremyevans). Reduces default memory use substantially (the mail gem drops from 19 Mib to about 3 Mib). Resolves {#96}[mime-types/ruby-mime-types#96], {#94}[mime-types/ruby-mime-types#94], {#83}[mime-types/ruby-mime-types#83]. Partially addresses {#64}[mime-types/ruby-mime-types#64] and {#62}[mime-types/ruby-mime-types#62].
* Development:
  * Removed caching of deprecation messages in preparation for mime-types 3.0. Now, deprecated methods will always warn their deprecation instead of only warning once.
  * Added a logger for deprecation messages.
  * Renamed <tt>lib/mime.rb</tt> to <tt>lib/mime/deprecations.rb</tt> to not conflict with the {mime}[https://rubygems.org/gems/mime] gem on behalf of the maintainers of the {Praxis Framework}[http://praxis-framework.io/]. Provided by Josep M. Blanquer (@blanquer), {#100}[mime-types/ruby-mime-types#100].
  * Added the columnar data conversion tool, also provided by Jeremy Evans.
* Documentation:
  * Improved documentation and ensured that all deprecated methods are marked as such in the documentation.
* Development:
  * Added more Ruby variants to Travis CI.
  * Silenced deprecation messages for internal tools. Noisy deprecations are noisy, but that's the point.

== 2.5 / 2015-04-25

* Bugs:
  * David Genord (@albus522) fixed a bug in loading MIME::types cache where a container loaded from cache did not have the expected +default_proc+, {#86}[mime-types/ruby-mime-types#86].
  * Richard Schneeman (@schneems) provided a patch that substantially reduces unnecessary allocations.
* Documentation:
  * Tibor Szolár (@flexik) fixed a typo in the README, {#82}[mime-types/ruby-mime-types#82]
  * Fixed {#80}[mime-types/ruby-mime-types#80], clarifying the relationship of MIME::Type#content_type and MIME::Type#simplified, with Ken Ip (@kenips).
* Development:
  * Juanito Fatas (@JuanitoFatas) enabled container mode on Travis CI, {#87}[mime-types/ruby-mime-types#87].
  * Moved development to a mime-types organization under {mime-types/ruby-mime-types}[https://github.com/mime-types/ruby-mime-types].
FWIW all that the mail gem needs and all that rest client needs is a very minimal subset of mime types ... so I am making this https://github.com/discourse/mini_mime/blob/master/lib/mini_mime.rb and will do PRs for mail gem and rest client to swap it out. 100% lazy loaded with binary search |
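To make "lazy loaded with binary search" concrete, here is a small sketch of one way such a lookup could work. It is not taken from mini_mime; the fixed-width row format, the `ROW_WIDTH` value, and the helper names are assumptions for illustration. Each probe seeks to one row and reads only that row, so the table never has to be loaded as a whole. A `StringIO` stands in for the data file here.

```ruby
require 'stringio'

# Hypothetical fixed-width table: each row is "extension content-type",
# padded with spaces to ROW_WIDTH bytes (newline included), sorted by extension.
# Rows are assumed to be ASCII and shorter than ROW_WIDTH.
ROW_WIDTH = 40

def build_table(pairs)
  pairs.sort_by(&:first)
       .map { |ext, type| "#{ext} #{type}".ljust(ROW_WIDTH - 1) + "\n" }
       .join
end

# Binary search that reads a single row per probe.
def lookup(io, ext)
  low = 0
  high = io.size / ROW_WIDTH - 1
  while low <= high
    mid = (low + high) / 2
    io.seek(mid * ROW_WIDTH)
    row_ext, type = io.read(ROW_WIDTH).split
    case ext <=> row_ext
    when 0 then return type
    when -1 then high = mid - 1
    else low = mid + 1
    end
  end
  nil
end

table = StringIO.new(build_table([
  ["html", "text/html"],
  ["atom", "application/atom+xml"],
  ["png", "image/png"],
  ["json", "application/json"]
]))

puts lookup(table, "png")
puts lookup(table, "exe").inspect
```

With roughly 2k rows, a lookup needs about 11 probes, and nothing is retained in memory between calls unless the caller chooses to cache results.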
Using a rails application www.codetriage.com it uses mime/types and when it boots uses:
Without loading mime/types it uses:
That's
52.168 - 38.9 # => 13.268
MB of savings, or 25% of all RAM usage. This is a non-trivial amount of memory to use. I've got some causes and some ideas, but I want some more eyes and some feedback before moving forward.
Memory Causes
As far as I can tell there are two main culprits that are causing memory use.
1) Loading large JSON blob. Loading a 547 KB file into JSON and converting to a hash takes a bunch of memory. This is done in the loader. The entire json blob and resultant hash cannot fit into memory so Ruby must malloc more. Unfortunately Ruby never free-s memory after it's been allocated. If you are using a large Rails app, this isn't a concern since those empty ruby object slots will eventually be used, however if you're running a really small service, this is a non-trivial operation. While we could optimize this, it likely won't have an impact for most applications.
2) Lots of large objects retained. The mime/types gem proactively generates and retains 1800+ objects that each have quite a bit of data in them. I'm pretty sure this is where the bulk of the memory problems come from. Since we never release a reference to unused mime types, we never get this memory back. The list of types is only going to get longer, however the effective number of types used on a system is dramatically lower than the default set.
While it's currently possible to export and use a custom cache this process isn't easy and most developers don't know that this capability exists.
Potential Solution to #2
We don't necessarily have to make any tradeoffs default, instead we could offer them as a flag, however it would be ideal if we could find a middle-ground that was fast enough with less than 5~10% (random number I came up with) total RAM impact.
Any options to lazily create or evaluate MIME::Types could be enhanced by encouraging other libraries to explicitly declare common types they expect to use.

Option Lazy load from JSON Hash) Don't coerce default values into MIME::Type objects. These objects expand the data stored in the default cache quite a bit and are very heavy. Instead we could store the resultant JSON hash in memory and scan it to lazily generate MIME::Type objects so we only create what we need. The first time a mime type is needed it is coerced and retained so that we never have to seek for it again.

Viability: Speed impact: minimal, decreased RAM impact: medium. Depending on common access patterns, and how we store and search the data this could be fast. We are still storing a bunch of data we will never use in memory but it's cheaper than what we're currently doing. We would end up with duplicate info retained in two places, but we could either delete from the source data or it may be inconsequential to keep both around.
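A minimal sketch of this option, under stated assumptions: `LiteType` is a hypothetical stand-in for MIME::Type, and `LazyRegistry` is an invented name. The raw parsed hashes are kept in memory, objects are built only on first access, and the raw entry is deleted once promoted so the data is not retained twice.

```ruby
require 'json'

# Hypothetical stand-in for MIME::Type, holding only a few fields.
LiteType = Struct.new(:content_type, :extensions, :encoding)

class LazyRegistry
  def initialize(raw_entries)
    # Index the raw hashes by content type; no LiteType objects exist yet.
    @raw = raw_entries.group_by { |h| h["content-type"] }
    @promoted = {}
  end

  # Build objects on first access, then reuse them on later calls.
  # Unknown types are memoized as an empty array.
  def [](content_type)
    @promoted[content_type] ||= begin
      entries = @raw.delete(content_type) || []
      entries.map do |h|
        LiteType.new(h["content-type"], h["extensions"] || [], h["encoding"])
      end
    end
  end

  def promoted_count
    @promoted.size
  end
end

json = <<~JSON
  [{"content-type":"text/html","extensions":["html","htm"],"encoding":"8bit"},
   {"content-type":"application/atom+xml","extensions":["atom"],"encoding":"8bit"}]
JSON

registry = LazyRegistry.new(JSON.parse(json))
puts registry.promoted_count
puts registry["text/html"].first.extensions.inspect
puts registry.promoted_count
```

Lookups by extension or by regexp would still need a scan of the raw hashes (or a second index), which is the access-pattern question raised in the viability note.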
Option Distributed File Store) We could get really fancy and try to create a ton of small files, each named after how it is accessed, so we could simply check whether that file exists and read its contents when it is accessed. If there are multiple common access patterns, we could have different directories with different file names that would redirect or refer to another file containing the full info.
Viability: Speed impact: depends, decreased RAM impact: large (good). We would literally only store the objects in memory we need, so RAM use would be as close to minimal as possible. Reading from disk is really really slow so speed would probably be negatively impacted for most cases except those that only need one or two mime types. In this case data scans would be prohibitively expensive and we might need to keep a stash of JSON data around on disk lest we access and read from 1800+ files.
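For a sense of what the file-store option might look like, here is a rough sketch. The one-JSON-file-per-extension layout and both helper names are assumptions made up for this example. A lookup touches exactly one small file, or none if the extension is unknown.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical layout: one small JSON file per extension, e.g. <dir>/html.json.
def write_store(dir, entries)
  by_ext = Hash.new { |h, k| h[k] = [] }
  entries.each do |entry|
    (entry["extensions"] || []).each { |ext| by_ext[ext] << entry }
  end
  by_ext.each do |ext, list|
    File.write(File.join(dir, "#{ext}.json"), JSON.generate(list))
  end
end

# Reads at most one file per lookup.
def find_by_extension(dir, ext)
  # File.basename keeps a path separator in ext from escaping the store dir.
  path = File.join(dir, "#{File.basename(ext)}.json")
  File.exist?(path) ? JSON.parse(File.read(path)) : []
end

entries = [
  { "content-type" => "text/html", "extensions" => %w[html htm] },
  { "content-type" => "application/atom+xml", "extensions" => %w[atom] }
]

Dir.mktmpdir do |dir|
  write_store(dir, entries)
  puts find_by_extension(dir, "htm").map { |e| e["content-type"] }.inspect
  puts find_by_extension(dir, "nope").inspect
end
```

As the viability note says, anything that needs to scan all types (a regexp match, for instance) would have to open every file, so a layout like this only pays off for point lookups.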
Option Lazy JSON File) We lazily create each and every MIME::Type by loading in the JSON file and searching for the entry we want when it isn't in memory already.

Viability: Speed impact: large (bad), decreased RAM impact: large (good). Might not be so bad for some cases; for others, this would be a world of hurt. It would help RAM more than the first option of storing the JSON in memory, but it would provide us with no random access capabilities: any lookup would require a scan.
Option X) Hopefully there are some other options I've not yet considered. Maybe we could mark the MIME::Types that are being used and provide some kind of a MIME::Type.clean to undefine or remove references to types not being used? We would have to do it in conjunction with lazy loading lest we be forced to load the whole set into memory again if a new mime type gets referenced. Maybe we can use some binary data blob or store. It would be sweet to have a sqlite3 table we could query against, but that would add undue complexity and dependencies to the project. There's no clear winner yet, go crazy and recommend something.

Next Steps

I'm interested in concerns that you as library maintainer have with any or all of these plans. You know how people commonly use this gem and maybe you could help provide a top 5 (or whatever number) of use cases. I'm also interested in alternative solutions. If we can figure out one that makes sense to try, I'll be happy to work on a spike implementation so we can benchmark speed and memory use. Hopefully you're interested. Let me know what you think 😄