Brooke M. Fujita edited this page Mar 28, 2015 · 2 revisions

Why Even Bother?

MeCab is distributed as a tarball of source code that compiles on many different platforms, including Linux, Solaris, Mac OS X, and *BSD. It is also available as a Windows executable. This portability makes it a very convenient and powerful tool when working with Japanese text.

Furthermore, there are bindings available in source-code form for Perl, Python, Java and even Ruby.

So why even bother creating this gem?

Building a Rubygem That Just Works

Well, those bindings mentioned above are SWIG-based. The Simplified Wrapper and Interface Generator (SWIG) lets one define, compile and then use bindings to libraries written in C or C++ from other high-level programming languages. While SWIG is a very useful bridge, the bindings it generates are C extensions, so one still has to compile the wrapper and interface. This might not be an option for some MeCab users, due to lack of control over their deployment environment or perhaps lack of a compiler in their target environment.

Even if you do have access to a compiler, SWIG can be quite difficult to work with because of all the dependencies that must be satisfied for compiling and linking. I actually first tried SWIG to bridge from Ruby to MeCab a while back, but ended up abandoning that idea in the face of dependency hell.

A Foreign Function Interface (FFI) elegantly overcomes the shortcomings of SWIG. It allows code in one language to call into a library written in another, usually C or C++. FFI does not require any mucking around with generated header files or other fragile bindings. Instead, the higher-level language calls functions of the underlying shared object or library at runtime.
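To make the idea of a runtime binding concrete, here is a minimal sketch. It is not how natto is implemented: it uses Fiddle from Ruby's standard library as a stand-in for the ffi gem, so that it runs without installing anything, and it binds to the C library's strlen rather than to MeCab. It also assumes a Unix-like platform where the symbols of the running process include libc. The constant name STRLEN is made up for this illustration.

```ruby
require 'fiddle'

# Open the symbols already loaded into the running process.
# Assumption: on Linux and macOS this includes the C standard library.
libc = Fiddle.dlopen(nil)

# Describe the C signature `size_t strlen(const char *)` at runtime.
# No wrapper is generated and nothing is compiled.
STRLEN = Fiddle::Function.new(
  libc['strlen'],        # address of the C function
  [Fiddle::TYPE_VOIDP],  # one pointer argument
  Fiddle::TYPE_SIZE_T    # size_t return value
)

# A Ruby String is passed as a pointer to its bytes.
puts STRLEN.call('mecab') # prints 5
```

The point is only that the binding is declared and resolved while the program runs, which is what removes the need for a compiler in the target environment.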

I wanted to keep to the platform-neutral spirit of MeCab, and offer a Ruby binding that would simply work out of the box on as wide a range of Ruby platforms as possible. The ffi Rubygem let me do just that, giving me a way to bind Ruby to MeCab at runtime even in JRuby, which at the time of this writing only offers very limited support for C extensions. The natto Rubygem thus provides a simple and easy-to-use runtime binding from the Ruby programming language to the MeCab shared object / dynamic library / dynamic-link library.

What's the Object Here?

While leveraging FFI does make binding to the MeCab library at runtime very simple, the responsibility of object destruction and freeing resources still remains.

I have seen other examples out on the internet of Ruby-MeCab bridging using FFI, but they make two implementation errors.

The first is any use of ObjectSpace._id2ref. JRuby is a major Ruby platform, but because of difficulties in enumerating objects managed by the Java memory model, its support for ObjectSpace is turned off by default. ObjectSpace support can be enabled explicitly in JRuby, but it incurs a performance overhead that I did not want to force on any of my users on the JRuby platform.

But there is a greater implementation error I have seen in some of these examples that has no place in production-level code.

Failure to Self Destruct

Here is a short snippet to illustrate...

#!ruby

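class BadExample

  # ...

  def initialize(options = {})
    # ...
    # Anti-pattern: the finalizer Proc is created inside an instance method.
    ObjectSpace.define_finalizer(@pointer, self.class.method(:finalize).to_proc)
  end

  # Anti-pattern: relies on ObjectSpace._id2ref to recover the object.
  def self.finalize(oid)
    self.mecab_destroy(ObjectSpace._id2ref(oid))
  end
end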

Aside from the fact that this bad example uses ObjectSpace._id2ref, do you see how the instance destructor is being created as a Proc in the initialize instance method? The intent is for this code to create a Proc at the end of the initialize method and register it with a finalizer hook that is called after the instance has been garbage collected.

It looks as if it should work, right?

The fatal flaw here is that the Proc for cleaning up resources for the garbage-collected instance is being created within the initialize method. Keep in mind that a Proc is a closure, and that it is bound to the context in which it is created. If that context is an object's instance method, then it includes self, a reference to the object instance. Should the destructor Proc hold a reference to the object instance, then that object will never be marked for garbage collection and memory will leak.
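One common way to avoid this is to build the finalizer Proc in a class method, so that the closure captures only the resource to be freed and never the instance. The following is a minimal sketch of that pattern, not natto's actual code. FakeHandle, GoodExample, finalizer_for and leaky_finalizer are all hypothetical names; FakeHandle stands in for a native pointer so the example runs without MeCab.

```ruby
# Hypothetical stand-in for a native resource such as a MeCab pointer.
class FakeHandle
  def initialize
    @released = false
  end

  def release
    @released = true
  end

  def released?
    @released
  end
end

class GoodExample
  attr_reader :handle

  def initialize
    @handle = FakeHandle.new
    # The Proc comes from a class method, so it never sees this instance.
    ObjectSpace.define_finalizer(self, self.class.finalizer_for(@handle))
  end

  # Inside a class method, self is the class. The closure captures
  # only `handle`, which holds no reference back to the instance.
  def self.finalizer_for(handle)
    proc { |_object_id| handle.release }
  end

  # For comparison only, never registered as a finalizer: a block
  # created in an instance method captures the instance as self.
  def leaky_finalizer
    proc { |_object_id| @handle.release }
  end
end

good  = GoodExample.new
safe  = GoodExample.finalizer_for(good.handle)
leaky = good.leaky_finalizer

puts safe.binding.receiver.equal?(GoodExample) # prints true
puts leaky.binding.receiver.equal?(good)       # prints true
```

Inspecting binding.receiver shows what each closure holds on to: the safe Proc is bound to the class, while the leaky one is bound to the very object it would be asked to clean up.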

The Bottom Line: A Better API

But the bottom line is that I wanted to create a Ruby API for MeCab that hides the internal implementation and simply works. You pass in the parsing options you wish to use, and you are given an object reference that can return parse results as simple strings or as MeCab nodes that contain more detailed information. The complexities of object creation/destruction and memory management are all handled behind the scenes. Easy as pie!

