FastText4j is an implementation of fastText in Kotlin and Java. fastText is a library for text representation (word vectors) and text classification developed by facebookresearch.
The code has moved to the Mynlp project: https://github.com/mayabot/mynlp/tree/master/fasttext
Features:
- Implemented 100% in Kotlin and Java
- Well-designed API
- Compatible with the original C++ model files and the official pre-trained models, including quantized (compressed) models
- Provides all of the APIs, including train and test (almost the same performance as the original)
- Supports its own model storage format, which can be read with mmap to load large model files quickly and with less memory
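The mmap feature can be illustrated with plain `java.nio`. This is only a minimal sketch of the general technique, not FastText4j's loader or its file format: the class `MmapSketch`, its method names, and the flat little-endian float layout are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: a row-major float matrix in a flat file.
// This is not the FastText4j model format.
class MmapSketch {

    // Writes all rows as little-endian floats, one row after another.
    static void writeMatrix(Path file, float[][] rows) {
        ByteBuffer buf = ByteBuffer.allocate(rows.length * rows[0].length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float[] row : rows) {
            for (float v : row) {
                buf.putFloat(v);
            }
        }
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the file read-only. The OS pages data in on demand, so the
    // matrix is not copied onto the Java heap up front.
    static FloatBuffer mapMatrix(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .asFloatBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copies a single row out of the mapped matrix.
    static float[] readRow(FloatBuffer matrix, int row, int dim) {
        float[] out = new float[dim];
        for (int i = 0; i < dim; i++) {
            out[i] = matrix.get(row * dim + i);
        }
        return out;
    }

    // Round trip through a temporary file; returns the second row.
    static float[] demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            writeMatrix(file, new float[][] {{1f, 2f}, {3f, 4f}});
            return readRow(mapMatrix(file), 1, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        float[] row = demo();
        System.out.println(row[0] + " " + row[1]);
    }
}
```

Because only the rows that are actually read get paged in, looking up a few vectors in a large matrix touches a small part of the file.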
Gradle:
compile 'com.mayabot.mynlp:fastText4j:3.1.2'
Maven:
<dependency>
<groupId>com.mayabot.mynlp</groupId>
<artifactId>fastText4j</artifactId>
<version>3.1.2</version>
</dependency>
Train a text classification model:
File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);
inputArgs.setEpoch(20);
FastText model = FastText.trainSupervised(trainFile, inputArgs);
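Main parameters:
- loss: the loss function, one of:
  - hs: hierarchical softmax, an approximation of the full softmax loss that is much faster to compute and allows efficient training with a large number of classes. This loss is intended for unbalanced label classes, that is, when some labels appear in the samples more often than others. If your dataset has a balanced number of examples per label, it is worth trying the negative sampling loss (-loss ns -neg 100).
  - ns: negative sampling (NegativeSamplingLoss)
  - softmax: the default for supervised models
  - ova: one-vs-all ("OneVsAll"), a loss function for multi-label classification that corresponds to the sum of binary cross-entropy computed independently for each label
- lr: learning rate
- dim: vector dimension
- epoch: number of training epochs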
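To make the difference between the softmax and ova losses concrete, here is a minimal sketch that computes both from raw label scores. It is not code from FastText4j: the class `LossSketch` and its methods are hypothetical, and the scores stand in for whatever the model outputs for each label.

```java
// Hypothetical illustration of two of the loss functions listed above.
class LossSketch {

    // Softmax loss: negative log-probability of the single true label.
    static double softmaxLoss(double[] scores, int target) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += Math.exp(s - max); // subtract max for numerical stability
        }
        return -(scores[target] - max - Math.log(sum));
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // One-vs-all loss: sum of binary cross-entropies, one per label,
    // each computed independently of the other labels.
    static double ovaLoss(double[] scores, boolean[] isTarget) {
        double loss = 0.0;
        for (int i = 0; i < scores.length; i++) {
            double p = sigmoid(scores[i]);
            loss += isTarget[i] ? -Math.log(p) : -Math.log(1.0 - p);
        }
        return loss;
    }

    public static void main(String[] args) {
        double[] scores = {2.0, 0.5, -1.0};
        System.out.println(softmaxLoss(scores, 0));
        System.out.println(ovaLoss(scores, new boolean[] {true, true, false}));
    }
}
```

Softmax makes the labels compete for a single probability mass, which suits single-label data. The one-vs-all form scores each label on its own, so several labels can be correct for the same line.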
Training data format:
The training data is a plain-text file with one training sentence per line, with words separated by spaces. Each line must contain at least one label. By default, labels are words prefixed with the string __label__. In the original fastText command-line tool, training outputs two files, model.bin and model.vec, and a trained model can be evaluated by computing the precision and recall at k (P@k and R@k) on a test set.
__label__tag1 saints rally to beat 49ers the new orleans saints survived it all hurricane ivan
__label__积极 这个 商品 很 好 用 。
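The line format above can be checked with a small parser before training. This is a minimal sketch rather than FastText4j's own reader: the class `LabeledLineSketch` and its fields are hypothetical, and it assumes the default `__label__` prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits one training line into labels and words.
class LabeledLineSketch {
    static final String LABEL_PREFIX = "__label__";

    final List<String> labels = new ArrayList<>();
    final List<String> words = new ArrayList<>();

    // Tokens are separated by whitespace; tokens that start with the
    // prefix are labels, everything else is a word.
    static LabeledLineSketch parse(String line) {
        LabeledLineSketch example = new LabeledLineSketch();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.startsWith(LABEL_PREFIX)) {
                example.labels.add(token);
            } else {
                example.words.add(token);
            }
        }
        if (example.labels.isEmpty()) {
            throw new IllegalArgumentException("line has no label: " + line);
        }
        return example;
    }

    public static void main(String[] args) {
        LabeledLineSketch ex = parse("__label__tag1 saints rally to beat 49ers");
        System.out.println(ex.labels + " " + ex.words);
    }
}
```

Running such a check over the training file catches unlabeled lines early, since every line must carry at least one label.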
Word vector training supports two models, CBOW and Skipgram:
FastText.trainCow(file,inputArgs)
//Or
FastText.trainSkipgram(file,inputArgs)
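The two models differ in what is predicted from what. The sketch below only enumerates the prediction tasks for a fixed window and is not how FastText4j trains: the class `WindowSketch` is hypothetical, and real training also involves details such as subword features and a randomly sampled window size, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of CBOW versus Skipgram prediction tasks.
class WindowSketch {

    // CBOW: the surrounding words together predict the center word.
    static List<String> cbowExamples(String[] words, int window) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> context = new ArrayList<>();
            int from = Math.max(0, i - window);
            int to = Math.min(words.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    context.add(words[j]);
                }
            }
            if (!context.isEmpty()) {
                out.add(context + " -> " + words[i]);
            }
        }
        return out;
    }

    // Skipgram: the center word predicts each surrounding word separately.
    static List<String> skipgramExamples(String[] words, int window) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(words.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    out.add(words[i] + " -> " + words[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"saints", "rally", "to", "beat"};
        System.out.println(cbowExamples(words, 1));
        System.out.println(skipgramExamples(words, 1));
    }
}
```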
Evaluate a model on a test set:
File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);
FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.test(new File("data/agnews/ag.test"),1,0,true);
output:
F1-Score : 0.968954 Precision : 0.960683 Recall : 0.977368 __label__2
F1-Score : 0.882043 Precision : 0.882508 Recall : 0.881579 __label__3
F1-Score : 0.890173 Precision : 0.888772 Recall : 0.891579 __label__4
F1-Score : 0.917353 Precision : 0.926463 Recall : 0.908421 __label__1
N 7600
P@1 0.915
R@1 0.915
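The per-label columns in this output are related by the usual definitions, which can be sketched as follows. This is not FastText4j's evaluation code: the class `MetricsSketch` and its method names are hypothetical, and only the precision and recall values passed in `main` come from the output above.

```java
// Hypothetical illustration of the metrics printed by the test run.
class MetricsSketch {

    // Precision: fraction of predicted labels that are correct.
    static double precision(long correct, long predicted) {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    // Recall: fraction of gold labels that were predicted.
    static double recall(long correct, long gold) {
        return gold == 0 ? 0.0 : (double) correct / gold;
    }

    // F1: harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Precision and recall for __label__2, copied from the output above.
        System.out.printf("F1 = %.6f%n", f1(0.960683, 0.977368));
    }
}
```

Plugging in the precision and recall reported for `__label__2` reproduces its F1-Score of 0.968954 to six decimals. When every test line has exactly one label and k is 1, the number of predictions equals the number of gold labels, so P@1 and R@1 coincide, as they do here.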
Save a model:
FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModel(new File("path/data.model"));