mahout-sequence-file-groups

Takes a set of sequence files in the Mahout classification format (keys of the form groupId/unique-id) and creates a new set of sequence files, each containing items from only one group id.

This code will primarily be used by the classification utilities in Mahout. After running seq2sparse, we get the vectors in sequence file format. However, the keys in these sequence files are in no particular arrangement, i.e. the special meaning that Mahout assigns to the keys (category-id/unique-item-id) is not taken into account.

We would like to use the --testSplitPct option of the Mahout split utility to split the data into training and test data.

The problem is that, to use this option, the input directory needs to have one sequence file per category.

So this code simply passes through all the sequence files in the input directory and creates output sequence files that each contain only one group.
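The grouping step described above can be sketched in plain Java. This is only an illustration of the idea, not the code in this repository: the class and method names (`GroupSplitSketch`, `groupOf`, `splitByGroup`) are hypothetical, keys are plain strings held in memory rather than records read from Hadoop sequence files, and tolerating a single leading `/` on the key is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bucket record keys by group id, in memory.
final class GroupSplitSketch {

    // Extracts the group id from a key of the form "groupId/unique-id".
    // A single leading '/' is tolerated (an assumption, not confirmed by the README).
    static String groupOf(String key) {
        String k = key.startsWith("/") ? key.substring(1) : key;
        int slash = k.indexOf('/');
        if (slash <= 0) {
            throw new IllegalArgumentException(
                "Key is not in groupId/unique-id form: " + key);
        }
        return k.substring(0, slash);
    }

    // Buckets keys by group id, preserving the order in which they were seen.
    // Each bucket stands in for one output sequence file.
    static Map<String, List<String>> splitByGroup(List<String> keys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(groupOf(key), g -> new ArrayList<>()).add(key);
        }
        return groups;
    }
}
```

In a real job, each bucket would presumably be written to its own sequence file as records are read, instead of being collected in memory first.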


abbas-gadhia/mahout-sequence-file-groups
