mayabot/fastText4j: Facebook's fastText implemented in Java

FastText4j is an implementation of fastText in Kotlin and Java. fastText is a library for text classification and word representation learning developed by Facebook Research.

The code has moved to the Mynlp project: https://github.com/mayabot/mynlp/tree/master/fasttext

Features:

Implemented entirely in Kotlin and Java
Well-designed API
Compatible with the original C++ model files, including the official pretrained models and quantized (compressed) models
Provides the full API, including train and test, with almost the same performance as the original
Supports its own Java model storage format, which can be loaded with mmap so that large model files are read quickly and with less memory

Installing

Gradle

implementation 'com.mayabot.mynlp:fastText4j:3.1.2'

Maven

<dependency>
  <groupId>com.mayabot.mynlp</groupId>
  <artifactId>fastText4j</artifactId>
  <version>3.1.2</version>
</dependency>

API

Train model

1. Train a text classification model

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);
inputArgs.setEpoch(20);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

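Main parameters:

loss: the loss function
- hs: hierarchical softmax, an approximation of the full softmax loss that makes training efficient when there are many classes, usually at the cost of slightly lower accuracy than the full softmax. This loss is intended for unbalanced label classes, that is, datasets where some labels occur much more often than others. If your dataset has a balanced number of examples per label, it is worth trying the negative sampling loss (-loss ns -neg 100).
- ns: negative sampling (NegativeSamplingLoss)
- softmax: full softmax, the default for supervised models
- ova: one-vs-all, for multi-label classification. It corresponds to the sum of the binary cross-entropy computed independently for each label.
lr: learning rate
dim: dimension of the vectors
epoch: number of training epochs

Training data format: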

The training data is a plain-text file with one training example per line. Words are separated by spaces, and every line must contain at least one label. By default, a label is a word prefixed with the string __label__. (In the original fastText command-line tool, training writes two files, model.bin and model.vec.) Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set, as shown under "Test model".

__label__tag1 saints rally to beat 49ers the new orleans saints survived it all hurricane ivan

__label__积极 这个 商品 很 好用 。
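If your labeled examples live in memory rather than in a file, the line format above can be produced with a small helper. The following is a minimal sketch using only the JDK; the class name LabelFormat and the method toTrainingLine are hypothetical names chosen for illustration and are not part of the fastText4j API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of fastText4j): builds one line of
// training data in the "__label__<name> word word ..." format.
public class LabelFormat {

    // Default label prefix described in this README.
    static final String LABEL_PREFIX = "__label__";

    // Joins the prefixed labels and the already tokenized words with single spaces.
    static String toTrainingLine(List<String> labels, List<String> tokens) {
        if (labels.isEmpty()) {
            throw new IllegalArgumentException("each line needs at least one label");
        }
        List<String> parts = new ArrayList<>();
        for (String label : labels) {
            parts.add(LABEL_PREFIX + label);
        }
        parts.addAll(tokens);
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String line = toTrainingLine(
                Arrays.asList("tag1"),
                Arrays.asList("saints", "rally", "to", "beat", "49ers"));
        System.out.println(line);
    }
}
```

Writing one such line per example to a text file gives a file in the shape expected by the training call shown earlier. Text in languages without spaces between words, such as Chinese, has to be segmented into tokens first.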

2. Word representation learning

Both the CBOW (trainCow) and skip-gram (trainSkipgram) models are supported.

FastText.trainCow(file,inputArgs)
//Or
FastText.trainSkipgram(file,inputArgs)

Test model

File trainFile = new File("data/agnews/ag.train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(0.1);
inputArgs.setDim(100);

FastText model = FastText.trainSupervised(trainFile, inputArgs);

model.test(new File("data/agnews/ag.test"),1,0,true);

output:

F1-Score : 0.968954 Precision : 0.960683 Recall : 0.977368  __label__2
F1-Score : 0.882043 Precision : 0.882508 Recall : 0.881579  __label__3
F1-Score : 0.890173 Precision : 0.888772 Recall : 0.891579  __label__4
F1-Score : 0.917353 Precision : 0.926463 Recall : 0.908421  __label__1
N	7600
P@1	0.915
R@1	0.915
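The per-label F1-Score in this output is the harmonic mean of that label's precision and recall. As a sanity check, the sketch below recomputes it from the precision and recall printed for __label__2. The class name F1Check is a hypothetical name used for illustration only and is not part of fastText4j.

```java
// Hypothetical check (not part of fastText4j): recomputes a per-label
// F1 score from the precision and recall printed by the test run.
public class F1Check {

    // F1 is the harmonic mean of precision and recall: 2PR / (P + R).
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid division by zero when both values are 0
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall reported above for __label__2.
        double value = f1(0.960683, 0.977368);
        System.out.printf("F1 for __label__2: %.6f%n", value);
    }
}
```

The recomputed value agrees with the reported 0.968954 up to rounding of the printed inputs.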

Save model

FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModel(new File("path/data.model"));
