Implementing Facebook's FastText with Java

mayabot/fastText4j


FastText4j implements FastText in Kotlin & Java. FastText is a library for text representation and classification from Facebook Research.

Note: the code has moved to the Mynlp project: https://github.com/mayabot/mynlp/tree/master/fasttext

Features:

  • Implemented 100% in Java and Kotlin
  • Well-designed API
  • Compatible with the original C++ model files, including quantized (compressed) models
  • Provides train, test, and other APIs, with almost the same performance as the original
  • Supports its own model file format that can be read with mmap, so large model files load with less memory

Installing

Gradle

compile 'com.mayabot.mynlp:fastText4j:3.1.2'

Maven

<dependency>
  <groupId>com.mayabot.mynlp</groupId>
  <artifactId>fastText4j</artifactId>
  <version>3.1.2</version>
</dependency>

API

Train model

1. Train a text classification model

File trainFile = new File("data/agnews/ag.train");

InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax); // loss function (default for supervised models)
inputArgs.setLr(0.1);                // learning rate
inputArgs.setDim(100);               // vector dimension
inputArgs.setEpoch(20);              // number of training epochs

FastText model = FastText.trainSupervised(trainFile, inputArgs);
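
Once trained, the model can be used for prediction right away. A minimal sketch, assuming a predict(tokens, k, threshold) method that returns the top-k labels (the exact name and signature are an assumption, not shown in this README):

import java.util.Arrays;
import java.util.List;

// Assumption: predict(tokens, k, threshold) returns the top-k labels with scores;
// verify the exact signature against the fastText4j javadoc.
List<String> tokens = Arrays.asList("saints", "rally", "to", "beat", "49ers");
System.out.println(model.predict(tokens, 1, 0f));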

Key parameters:

  • loss — the loss function (see the sketch after this list)
    • hs — hierarchical softmax, slightly slower than the full softmax. It is an approximation of the full softmax loss that allows training on a large number of classes efficiently. It is also considered appropriate for imbalanced label sets, i.e. when some labels occur much more often than others. If your dataset has a balanced number of examples per label, it is worth trying the negative sampling loss (-loss ns -neg 100).
    • ns — negative sampling (NegativeSamplingLoss)
    • softmax — the default for supervised models
    • ova — one-vs-all, usable for multi-label classification; this loss corresponds to the sum of binary cross-entropy losses computed independently for each label.
  • lr — learning rate
  • dim — vector dimension
  • epoch — number of training epochs
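
For instance, to try the negative sampling loss suggested above, the configuration might look like the following sketch (LossName.ns and setNeg are assumptions mirroring the C++ -loss ns and -neg options; verify them against the actual InputArgs class):

InputArgs nsArgs = new InputArgs();
nsArgs.setLoss(LossName.ns); // assumption: enum constant mirroring -loss ns
nsArgs.setNeg(100);          // assumption: setter mirroring -neg 100
nsArgs.setLr(0.1);
nsArgs.setDim(100);
nsArgs.setEpoch(20);

FastText nsModel = FastText.trainSupervised(new File("data/agnews/ag.train"), nsArgs);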

Training data format:

The training data is a plain text file with one example per line and tokens separated by spaces. Each line must contain at least one label; by default, a label is a word prefixed with the string __label__. Once the model is trained, it can be evaluated on a test set by computing precision and recall at k (P@k and R@k); see Test model below. Example lines:

__label__tag1 saints rally to beat 49ers the new orleans saints survived it all hurricane ivan

__label__积极 这个 商品 很 好 用 。
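
A file in this format is easy to produce with plain Java I/O. A minimal sketch (the path and example lines are arbitrary):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

File trainFile = new File("data/demo.train");
trainFile.getParentFile().mkdirs();
// One example per line: the __label__-prefixed label first, then space-separated tokens.
Files.write(trainFile.toPath(), Arrays.asList(
        "__label__tag1 saints rally to beat 49ers",
        "__label__tag2 storm delays flights across the region"
), StandardCharsets.UTF_8);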

2. Word representation learning

Both CBOW and skipgram models are supported:

FastText.trainCow(file,inputArgs)
//Or
FastText.trainSkipgram(file,inputArgs)
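
A fuller sketch of word vector training (the corpus path is arbitrary; the commented-out lookup is an assumption, since this README does not show the query API):

import java.io.File;

File corpus = new File("data/corpus.txt"); // plain text, space-separated tokens

InputArgs args = new InputArgs();
args.setDim(100); // vector dimension
args.setLr(0.05); // learning rate
args.setEpoch(5); // training epochs

FastText wordModel = FastText.trainSkipgram(corpus, args);
// Assumption: a lookup such as wordModel.getWordVector("king") exists;
// verify against the fastText4j javadoc.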

Test model

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

// Evaluate on the test set; the output below reports P@1 and R@1.
model.test(new File("data/agnews/ag.test"), 1, 0, true);

Output:

F1-Score : 0.968954 Precision : 0.960683 Recall : 0.977368  __label__2
F1-Score : 0.882043 Precision : 0.882508 Recall : 0.881579  __label__3
F1-Score : 0.890173 Precision : 0.888772 Recall : 0.891579  __label__4
F1-Score : 0.917353 Precision : 0.926463 Recall : 0.908421  __label__1
N	7600
P@1	0.915
R@1	0.915
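
Here N is the number of test examples, and P@1/R@1 are the precision and recall when only the single best label is predicted for each example.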

Save model

FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModel(new File("path/data.model"));

Load model