Work on Bloom Filter

I wrote a lot of code over the last few days and not much prose.

Right now I'm running this command:

haruhi run job -clusterId tinyAwsCluster -jarId telepath inBloomFilter -input s3n://wikimedia-pagecounts/2008/2008-01/pagecounts-20080101-000000.gz -bloomFilter s3n://wikimedia-summary/test/firstBloom/part-r-00000 -k 7 -output s3n://wikimedia-summary/test/firstBloomOut

I've got a painful ten-minute development cycle because I'm not practicing good TDD, but the questions I'm answering right now are about what I can get away with in terms of I/O, and until I know those answers I can't write the unit tests.

I am thinking of patching Haruhi so I can keep an AWS cluster running between flows. This gets us from "8 cents per test" to "8 cents per working hour", which is a cost savings, and it could shorten the development cycle to 1-2 minutes.

The current exception is:

java.lang.IllegalArgumentException: Can not create a Path from a null string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
	at org.apache.hadoop.fs.Path.<init>(Path.java:90)
	at com.ontology2.telepath.bloom.InBloomFilterMapper.setup(InBloomFilterMapper.java:30)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
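
The null string is the argument to the Path constructor called from InBloomFilterMapper.setup(), which suggests a configuration lookup came back null, most likely because the Bloom filter path never made it into the job configuration. A minimal defensive sketch of that lookup, assuming a hypothetical property name telepath.bloomFilter:

```java
// Sketch only: "telepath.bloomFilter" is a hypothetical property name, not
// necessarily the key the real Tool and Mapper agree on.
Configuration conf = context.getConfiguration();
String bloomFilterPath = conf.get("telepath.bloomFilter");
if (bloomFilterPath == null) {
    throw new IllegalStateException("Bloom filter path was not set on the job configuration");
}
Path bloomPath = new Path(bloomFilterPath);
```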

The right thing to do is to write tests, and we have a precedent for that: the SingleJobTool provides a test point, which is used in the following code:

https://github.com/paulhoule/telepath/blob/master/telepath/src/test/java/com/ontology2/telepath/bloom/CreateBloomFilterToolTest.java

Since the only interesting thing most Tools do is create a Job object, we can write a unit test that checks that the Tool created the correct Job.
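
Here is a hedged sketch of what such a test might look like; the class name InBloomFilterTool and the createJob(...) test point are assumptions on my part, so see CreateBloomFilterToolTest above for how the real test point is used:

```java
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.mapreduce.Job;
import org.junit.Test;

public class InBloomFilterToolTest {

    @Test
    public void toolCreatesTheExpectedJob() throws Exception {
        // Hypothetical test point: a method on the Tool that builds the Job
        // without submitting it, mirroring what CreateBloomFilterToolTest exercises.
        InBloomFilterTool tool = new InBloomFilterTool();
        Job job = tool.createJob(new String[] {
                "-input", "s3n://example/input",
                "-bloomFilter", "s3n://example/filter",
                "-k", "7",
                "-output", "s3n://example/output"
        });

        // Inspect the Job the Tool configured instead of actually running it.
        assertEquals(InBloomFilterMapper.class, job.getMapperClass());
    }
}
```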

Now I get this error:

2014-01-15 17:51:59,000 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.lang.IllegalArgumentException: This file system object (hdfs://10.194.249.65:9000) does not support access to the request path 's3n://wikimedia-summary/test/firstBloom' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
	at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:798)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1538)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1533)
	at com.ontology2.telepath.bloom.InBloomFilterMapper.setup(InBloomFilterMapper.java:35)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

Amazingly, it tells me exactly what I did wrong!
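
The fix it points to is resolving the FileSystem from the path's own URI instead of from the job's default file system. A sketch of the corrected lookup inside setup():

```java
// Wrong: FileSystem.get(conf) returns the cluster's default file system
// (hdfs://...), which then refuses to open an s3n:// path.
// FileSystem fs = FileSystem.get(conf);

// Right: derive the file system from the path itself.
FileSystem fs = FileSystem.get(bloomPath.toUri(), conf);
// equivalently: FileSystem fs = bloomPath.getFileSystem(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, bloomPath, conf);
```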

So the next hang-up is the following:

2014-01-15 18:10:34,731 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.io.IOException: 's3n://wikimedia-summary/test/firstBloom' is a directory
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.open(NativeS3FileSystem.java:1056)
	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1558)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1545)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1538)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1533)
	at com.ontology2.telepath.bloom.InBloomFilterMapper.setup(InBloomFilterMapper.java:35)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

The issue here is that I'm reading with a method that reads just one file, not a whole directory of files. (The last step didn't produce just one Bloom filter; it produced one per reducer, but all of those filters can be OR-ed together to produce the master filter.) It's not particularly efficient to do this combining in the Mapper, but it is convenient in that all the work happens in one place. A better implementation might do the merging in a previous job, or do the merging when we set up the job and pass the result out through the DistributedCache.
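
For reference, a sketch of the Mapper-side directory merge, assuming each part-r-NNNNN file is a SequenceFile whose values are org.apache.hadoop.util.bloom.BloomFilter instances (the key type and exact layout are assumptions about how the previous job wrote them):

```java
// Sketch: OR together every per-reducer filter found under the output directory.
FileSystem fs = FileSystem.get(bloomDir.toUri(), conf);
BloomFilter merged = null;
for (FileStatus part : fs.listStatus(bloomDir)) {
    if (!part.getPath().getName().startsWith("part-")) {
        continue;  // skip _SUCCESS, _logs, etc.
    }
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
    try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        BloomFilter piece = new BloomFilter();  // populated by readFields() on each next()
        while (reader.next(key, piece)) {
            if (merged == null) {
                merged = piece;                 // keep the first filter we see...
                piece = new BloomFilter();      // ...and stop reusing that object
            } else {
                merged.or(piece);               // filters built with identical parameters OR cleanly
            }
        }
    } finally {
        reader.close();
    }
}
```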

Speaking of which, I've now done some integration testing and I think I have everything working right in the 'single Bloom filter' case, which means I now need to do consolidation. I'm inclined to write an M/R job that eats the separate Bloom filters and spits out the OR-ed Bloom filter. (At least that is scalable.)
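
A rough sketch of the reduce side of that consolidation job, assuming the filters travel as org.apache.hadoop.util.bloom.BloomFilter values under a single constant key and the job runs with one reducer (job.setNumReduceTasks(1)) so a single task sees them all; the class and key choices here are illustrative, not the actual telepath code:

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.bloom.BloomFilter;

public class OrBloomFiltersReducer
        extends Reducer<NullWritable, BloomFilter, NullWritable, BloomFilter> {

    @Override
    protected void reduce(NullWritable key, Iterable<BloomFilter> filters, Context context)
            throws IOException, InterruptedException {
        BloomFilter merged = null;
        for (BloomFilter f : filters) {
            if (merged == null) {
                // The framework reuses the value object, so clone the first filter
                // rather than keeping a reference to it.
                merged = WritableUtils.clone(f, context.getConfiguration());
            } else {
                merged.or(f);
            }
        }
        if (merged != null) {
            context.write(NullWritable.get(), merged);
        }
    }
}
```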