Work on Bloom Filter

Paul Houle edited this page Jan 15, 2014 · 6 revisions

I wrote a lot of code over the last few days and not much prose.

Right now I'm running this command.

haruhi run job -clusterId tinyAwsCluster -jarId telepath inBloomFilter -input s3n://wikimedia-summary/monthlyAll/2008-01/part-r-00000.gz -bloomFilter s3n://wikimedia-summary/test/firstBloom -k 7 -output s3n://wikimedia-summary/test/firstBloomOut

I've got a painful 10-minute development cycle because I'm not practicing good TDD. The questions I'm answering, though, are about what I can get away with in terms of I/O, and until I know those answers I can't write the unit tests.

I am thinking of patching Haruhi so I can keep an AWS cluster running between flows. This takes us from "8 cents per test" to "8 cents per working hour", which saves money, and it could shorten the development cycle to 1-2 minutes.

The current exception is:

java.lang.IllegalArgumentException: Can not create a Path from a null string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
	at org.apache.hadoop.fs.Path.<init>(Path.java:90)
	at com.ontology2.telepath.bloom.InBloomFilterMapper.setup(InBloomFilterMapper.java:30)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
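
The trace shows a null string reaching the `Path` constructor from `InBloomFilterMapper.setup`. One plausible cause is that the value read there was never placed in the job configuration. Below is a minimal sketch of a guard that fails with the name of the missing setting instead. A plain `Map` stands in for Hadoop's `Configuration` so the example runs on its own, and the key name `telepath.bloomFilter` and the `require` helper are hypothetical, not taken from the telepath source.

```java
import java.util.HashMap;
import java.util.Map;

class RequiredSetting {
    // Hypothetical key name, used only for illustration.
    static final String BLOOM_FILTER_KEY = "telepath.bloomFilter";

    // Returns the value stored under key, or throws an error that names the
    // missing key rather than letting a null travel on into a Path constructor.
    static String require(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing job setting: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(BLOOM_FILTER_KEY, "s3n://wikimedia-summary/test/firstBloom");
        System.out.println(require(conf, BLOOM_FILTER_KEY));
    }
}
```

With a guard like this, a forgotten setting would surface as "Missing job setting: telepath.bloomFilter" rather than as a complaint about a null path.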

The right thing to do is write tests, and we have a precedent for that: the SingleJobTool provides a test point, which is used in the following code:

https://github.com/paulhoule/telepath/blob/master/telepath/src/test/java/com/ontology2/telepath/bloom/CreateBloomFilterToolTest.java

Since the only interesting thing most Tools do is create a Job object, we can write a unit test that checks that the Tool created the correct job.
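
The idea can be sketched without Hadoop on the classpath. Here a hypothetical `parseOptions` helper stands in for the part of a tool that turns command-line flags into job settings, and the checks in `main` play the role of the unit test. None of these names come from telepath or from SingleJobTool, and a real test would inspect a Hadoop `Job` rather than a `Map`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JobOptions {
    // Turns "-name value" pairs into a map keyed by option name (without the dash).
    static Map<String, String> parseOptions(String... args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Expected -name value pairs");
        }
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected an option name, got: " + args[i]);
            }
            options.put(args[i].substring(1), args[i + 1]);
        }
        return options;
    }

    // Fails loudly when a setting does not have the expected value.
    static void check(Map<String, String> options, String name, String expected) {
        if (!expected.equals(options.get(name))) {
            throw new IllegalStateException(name + " was " + options.get(name));
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = parseOptions(
            "-input", "s3n://wikimedia-summary/monthlyAll/2008-01/part-r-00000.gz",
            "-bloomFilter", "s3n://wikimedia-summary/test/firstBloom",
            "-k", "7",
            "-output", "s3n://wikimedia-summary/test/firstBloomOut");
        check(options, "bloomFilter", "s3n://wikimedia-summary/test/firstBloom");
        check(options, "k", "7");
        System.out.println("settings match the command line");
    }
}
```

A check of this kind runs in well under a second, so a setting that never makes it into the job would show up locally instead of ten minutes into a cluster run.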

Now I have this error:

2014-01-15 17:51:59,000 WARN org.apache.hadoop.mapred.Child (main): Error running child
java.lang.IllegalArgumentException: This file system object (hdfs://10.194.249.65:9000) does not support access to the request path 's3n://wikimedia-summary/test/firstBloom' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
	at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:798)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1538)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1533)
	at com.ontology2.telepath.bloom.InBloomFilterMapper.setup(InBloomFilterMapper.java:35)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

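Amazingly, it tells me exactly what I did wrong! The trace shows the SequenceFile.Reader in InBloomFilterMapper.setup asking a DistributedFileSystem (the cluster's HDFS) for a path that lives on s3n://, so the file system object needs to come from the path's URI rather than from the configuration's default.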
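
The message recommends `FileSystem.get(uri, conf)` over `FileSystem.get(conf)`. To show why the default file system rejects the path, here is a plain-Java model of the scheme comparison. It is a simplification written for this page, not Hadoop's actual `checkPath` code, and `canServe` is a hypothetical name.

```java
import java.net.URI;

class SchemeCheck {
    // True when a file system rooted at fsUri could serve the given path:
    // the schemes must match and, if the path names an authority, so must that.
    static boolean canServe(URI fsUri, URI path) {
        if (path.getScheme() == null) {
            return true; // no scheme: the path is relative to this file system
        }
        if (!path.getScheme().equalsIgnoreCase(fsUri.getScheme())) {
            return false;
        }
        String authority = path.getAuthority();
        return authority == null || authority.equalsIgnoreCase(fsUri.getAuthority());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://10.194.249.65:9000");
        URI bloomFilter = URI.create("s3n://wikimedia-summary/test/firstBloom");

        // The cluster default is HDFS, so it cannot serve an s3n:// path.
        System.out.println(canServe(defaultFs, bloomFilter)); // prints false

        // A file system selected from the path's own URI can.
        System.out.println(canServe(URI.create("s3n://wikimedia-summary"), bloomFilter)); // prints true
    }
}
```

In the mapper, the equivalent move is to obtain the file system from the bloom filter path's URI before opening the reader, as the exception text suggests.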