Welcome to the FS Crawler for Elasticsearch
This crawler helps to index documents from your local file system and over SSH. It crawls your file system, indexes new files, updates existing ones and removes old ones.
You need to install a version matching your Elasticsearch version:
Elasticsearch | FS Crawler | Released | Docs |
---|---|---|---|
2.x, 5.x, 6.x | 2.4-SNAPSHOT | See below | |
2.x, 5.x, 6.x | 2.3 | 2017-07-10 | 2.3 |
1.x, 2.x, 5.x | 2.2 | 2017-02-03 | 2.2 |
1.x, 2.x, 5.x | 2.1 | 2016-07-26 | 2.1 |
es-2.0 | 2.0.0 | 2015-10-30 | 2.0.0 |
FS Crawler binary is available on Maven Central. Just download the latest release (or any other specific version you want to try).
The filename ends with `.zip`.
For example, if you wish to download fscrawler-2.3:
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.3/fscrawler-2.3.zip
unzip fscrawler-2.3.zip
The distribution contains:
$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│ ├── fscrawler
│ └── fscrawler.bat
└── lib
├── ... All needed jars
Note that you can also download a SNAPSHOT version from Sonatype without having to build it yourself.
After a version upgrade, you may need to update a mapping before starting FS Crawler again. Read the following update instructions carefully.
To update FS Crawler, just download the new version, unzip it in another directory and launch it as usual. It will still pick up settings from the configuration directory. Of course, you first need to stop any running instances.
- fscrawler comes with new default mappings for files. They have better defaults as they consume less disk space and CPU at index time. You should remove existing files in `~/.fscrawler/_default/_mappings` before starting the new version so that the default mappings are updated. If you modified the mapping files manually, re-apply your modifications to the new sample files.
- `excludes` is now set by default for new jobs to `["~*"]`. In previous versions, any file or directory containing a `~` was excluded. This means that if your jobs define any exclusion rule, you need to add `*~*` if you want to get back the exact previous behavior (see the sketch after this list).
- If you were indexing `json` or `xml` documents with the `filename_as_id` option set, the suffix of the file name was previously removed, so indexing `1.json` produced a document with the `_id` `1`. With this new version, the suffix is no longer removed, so the `_id` for your document will now be `1.json`.
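For example, a job that relied on the old implicit `~` exclusion could restore it explicitly (a minimal sketch; merge it with your own exclusion rules):

```json
{
  "name" : "job_name",
  "fs" : {
    "excludes" : [ "*~*" ]
  }
}
```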
- fscrawler comes with a new mapping for folders. The change is really tiny so you can skip this step if you wish. We basically removed the `name` field from the folder mapping as it was unused.
- The way FSCrawler computes `path.virtual` for docs has changed. It now includes the filename. Instead of `/path/to` you will now get `/path/to/file.txt`.
- The way FSCrawler computes `virtual` for folders is now consistent with what you see for documents.
- `path.encoded` in documents and `encoded` in folders have been removed, as FSCrawler does not need them after all.
- OCR is now properly activated for PDF documents. This can be time, CPU and memory consuming though. You can explicitly disable it by setting `fs.pdf_ocr` to `false`.
- All dates are now indexed in elasticsearch in UTC instead of without any time zone. For example, we previously indexed a date as `2017-05-19T13:24:47.000`, which produced bad results when you were located in a time zone other than UTC. It is now indexed as `2017-05-19T13:24:47.000+0000`.
- In order to be compatible with the coming elasticsearch 6.0, where only one type per index is supported, we need to get rid of types. This means that we now create indices named `job_name` and `job_name_folder` instead of one index `job_name` with two types `doc` and `folder`. If you are upgrading from FSCrawler 2.2, you need to reindex your existing data, either by deleting the old index and running FSCrawler again, or by using the reindex API as follows:
# Create folder index job_name_folder based on existing folder data
POST _reindex
{
"source": {
"index": "job_name",
"type": "folder"
},
"dest": {
"index": "job_name_folder"
}
}
# Remove old folder data from job_name index
POST job_name/folder/_delete_by_query
{
"query": {
"match_all": {}
}
}
Note that you first need to create the right settings and mappings so you can then run the reindex job. You can do that by launching `bin/fscrawler job_name --loop 0`.
Better, you can run `bin/fscrawler job_name --upgrade` and let FSCrawler do all that for you. Note that this can take a loooong time.
Also please be aware that some APIs used by the upgrade action are only available from elasticsearch 2.3 (reindex) or elasticsearch 5.0 (delete by query). If you are running a version older than 5.0, you first need to upgrade elasticsearch.
This procedure only applies if you did not previously set the `elasticsearch.type` setting (its default value was `doc`). If you did, then you also need to reindex the existing documents to the default `doc` type, as per elasticsearch 6.0:
# Copy old type doc to the default doc type
POST _reindex
{
"source": {
"index": "job_name",
"type": "your_type_here"
},
"dest": {
"index": "job_name",
"type": "doc"
}
}
# Remove old type data from job_name index
POST job_name/your_type_here/_delete_by_query
{
"query": {
"match_all": {}
}
}
But note that this last step can take a very loooong time and will generate a lot of IO on your disk. It might be easier in such a case to restart FSCrawler from scratch.
- As seen in the previous point, we now have two indices instead of a single one. This means that the `elasticsearch.index` setting has been split into `elasticsearch.index` and `elasticsearch.index_folder`. By default, they are set to the crawler name and the crawler name plus `_folder`. Note that the `upgrade` feature performs that change for you.
- fscrawler has now removed the mapping files `doc.json` and `folder.json`. The mapping for docs is merged within the `_settings.json` file and the folder mapping is now part of `_settings_folder.json`. This means you can remove the old files to avoid confusion. You can simply remove the existing files in `~/.fscrawler/_default` before starting the new version so that the default files are created again.
You need to have at least Java 1.8 and to have properly configured `JAVA_HOME` to point to your Java installation directory. For example, on MacOS you can define it in your `~/.bash_profile` file:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
Start FS crawler with:
bin/fscrawler job_name
FS crawler will read a local file (defaults to `~/.fscrawler/{job_name}/_settings.json`).
If the file does not exist, FS crawler will propose to create your first job.
$ bin/fscrawler job_name
18:28:58,174 WARN [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch
Create a directory named `/tmp/es` or `c:\tmp\es`, add some files you want to index into it and start again:
$ bin/fscrawler --config_dir ./test job_name
18:30:34,330 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
If you did not create the directory, FS crawler will complain until you fix it:
18:30:34,683 WARN [f.p.e.c.f.FsCrawlerImpl] Error while indexing content from /tmp/es: /tmp/es doesn't exists.
You can also run FS crawler without arguments. It will give you the list of existing jobs and will allow you to choose one:
$ bin/fscrawler
18:33:00,624 INFO [f.p.e.c.f.FsCrawler] No job specified. Here is the list of existing jobs:
18:33:00,629 INFO [f.p.e.c.f.FsCrawler] [1] - job_name
18:33:00,629 INFO [f.p.e.c.f.FsCrawler] Choose your job [1-1]...
1
18:33:06,151 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
This is a common use case in elasticsearch: we want to search for something! ;-)
GET docs/doc/_search
{
"query" : {
"match" : {
"_all" : "I am searching for something !"
}
}
}
By default, FS crawler will read your files from `/tmp/es` every 15 minutes. You can change those settings by modifying the `~/.fscrawler/{job_name}/_settings.json` file, where `{job_name}` is the name of the job you just created.
{
"name" : "job_name",
"fs" : {
"url" : "/path/to/data/dir",
"update_rate" : "15m"
}
}
You can also change `update_rate` to watch for changes more or less frequently.
If you just want FS crawler to run once and exit, run it with the `--loop` option:
$ bin/fscrawler job_name --loop 1
18:47:37,487 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:47:37,854 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
...
18:47:37,855 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
18:47:37,959 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [job_name] stopped
If you have already run FS crawler and want to restart (which means reindexing existing documents), use the `--restart` option:
$ bin/fscrawler job_name --loop 1 --restart
You will find more information about settings in the following sections:
FS crawler can be a nice gateway to elasticsearch if you want to upload binary documents and index them into elasticsearch without writing all the code yourself to extract data and communicate with elasticsearch.
To start FS crawler with the REST service, use the `--rest` option. A good idea is also to combine it with `--loop 0` so you won't index local files but only listen to incoming REST requests:
$ bin/fscrawler job_name --loop 0 --rest
18:55:37,851 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:55:39,237 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service started on [http://127.0.0.1:8080/fscrawler]
Check the service is working with:
curl http://127.0.0.1:8080/fscrawler/
It will give you back a JSON document.
Then you can start uploading your binary files:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"
It will index the file into elasticsearch and will give you back the elasticsearch URL for the created document, like:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}
You will find more information about REST settings in the following section:
FS crawler supports all formats Tika version 1.15 supports, like:
- HTML
- Microsoft Office
- Open Office
- Images
- MP3
- ...
Once the crawler is running, it will write status information and statistics in:
~/.fscrawler/{job_name}/_settings.json
~/.fscrawler/{job_name}/_status.json
It means that if you stop the job at some point, FS crawler will restart from where it stopped.
FS crawler will also store default mappings and index settings for elasticsearch in `~/.fscrawler/_default/_mappings`:
- `1/_settings_doc.json`: for elasticsearch 1.x series document index settings
- `1/_settings_folder.json`: for elasticsearch 1.x series folder index settings
- `2/_settings_doc.json`: for elasticsearch 2.x series document index settings
- `2/_settings_folder.json`: for elasticsearch 2.x series folder index settings
- `5/_settings_doc.json`: for elasticsearch 5.x series document index settings
- `5/_settings_folder.json`: for elasticsearch 5.x series folder index settings
- `6/_settings_doc.json`: for elasticsearch 6.x series document index settings
- `6/_settings_folder.json`: for elasticsearch 6.x series folder index settings
Read Mapping for more information.
- `--help` displays help.
- `--silent` runs in silent mode. No output is generated.
- `--debug` runs in debug mode.
- `--trace` runs in trace mode (more verbose than debug).
- `--config_dir` defines the directory where jobs are stored instead of the default `~/.fscrawler`.
- `--username` defines the username to use when using a secured elasticsearch cluster. Read Using Credentials. (From 2.2)
- `--upgrade_mapping` tries to upgrade existing mappings for documents and folders. Read Upgrading an existing mapping. (From 2.2)
- `--loop x` defines the number of runs before exiting (From 2.2): a negative value like `-1` (the default) means infinite, `0` means that no crawling job is run (useful when combined with the REST service), and a positive value is the number of runs before the crawler stops. If you want to scan your hard drive only once, run with `--loop 1`.
- `--restart` restarts a job from scratch (From 2.2). See below.
- `--rest` starts the REST service (From 2.2). If you want to run the REST service without scanning your hard drive, launch with `bin/fscrawler --rest --loop 0`.
You can tell FS crawler that it must restart from the beginning by using the `--restart` option:
bin/fscrawler job_name --restart
In that case, the `{job_name}/_status.json` file will be removed.
If you want to provide JVM settings, like defining the memory allocated to FS Crawler, you can set the `FS_JAVA_OPTS` environment variable:
FS_JAVA_OPTS="-Xmx521m -Xms521m" bin/fscrawler
The job file must comply with the following `json` specification:
{
"name" : "job_name",
"fs" : {
"url" : "/path/to/data/dir",
"update_rate" : "15m",
"includes": [
"*.*"
],
"excludes": [
"*.json"
],
"json_support" : false,
"xml_support" : false,
"ignore_folders" : false,
"attributes_support" : false,
"raw_metadata" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"store_source" : false,
"lang_detect" : false,
"continue_on_error" : false,
"pdf_ocr" : true,
"indexed_chars" : "10000"
},
"server" : {
"hostname" : null,
"port" : 22,
"username" : null,
"password" : null,
"protocol" : "local",
"pem_path" : null
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"index" : "docs",
"index_folder" : "folders",
"bulk_size" : 100,
"flush_interval" : "5s",
"username" : "username",
"password" : "password",
"pipeline" : "pipeline-id-if-any"
},
"rest" : {
"enabled" : false,
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"port" : "fscrawler"
}
}
Here is a list of existing top level settings:
Name | Documentation |
---|---|
`name` | the job name (mandatory field) |
`fs` | Local FS settings |
`elasticsearch` | Elasticsearch settings |
`server` | SSH settings |
`rest` | REST settings |
You can define the most simple crawler job by writing a `~/.fscrawler/test/_settings.json` file as follows:
{
"name" : "test"
}
This will scan all documents available in the `/tmp/es` directory every 15 minutes and index them into the `test_doc` index. It will connect to an elasticsearch cluster running on `127.0.0.1`, port `9200`.
Note: `name` is a mandatory field.
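Once the job has run at least once, you can quickly check that documents landed in the index (a sanity-check sketch, assuming the defaults above and a local cluster):

```bash
curl 'http://127.0.0.1:9200/test_doc/_search?pretty'
```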
Here is a list of Local FS settings (under the `fs.` prefix):
Name | Default value | Documentation |
---|---|---|
`fs.url` | `"/tmp/es"` | Root directory |
`fs.update_rate` | `"15m"` | Update Rate |
`fs.includes` | `null` | Includes and Excludes |
`fs.excludes` | `["~*"]` | Includes and Excludes |
`fs.json_support` | `false` | Indexing JSon docs |
`fs.xml_support` | `false` | Indexing XML docs (from 2.2) |
`fs.add_as_inner_object` | `false` | Add as Inner Object |
`fs.ignore_folders` | `false` | Ignore folders (from 2.2) |
`fs.attributes_support` | `false` | Adding file attributes |
`fs.raw_metadata` | `true` | Disabling raw metadata |
`fs.filename_as_id` | `false` | Using Filename as _id |
`fs.add_filesize` | `true` | Disabling file size field |
`fs.remove_deleted` | `true` | Ignore deleted files |
`fs.store_source` | `false` | Storing binary source document |
`fs.index_content` | `true` | Ignore content |
`fs.lang_detect` | `false` | Language detection (from 2.2) |
`fs.continue_on_error` | `false` | Continue on File Permission Error (from 2.3) |
`fs.pdf_ocr` | `true` | Run OCR on PDF documents (from 2.3) |
`fs.indexed_chars` | `100000.0` | Extracted characters |
`fs.checksum` | `null` | File Checksum |
Define the `fs.url` property in your `~/.fscrawler/test/_settings.json` file:
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir"
}
}
For Windows users, use a form like `c:/tmp` or `c:\\tmp`.
By default, `update_rate` is set to `15m`. You can modify this value using any compatible time unit.
For example, here is a 15 minutes update rate:
{
"name": "test",
"fs": {
"update_rate": "15m"
}
}
Or a 3 hours update rate:
{
"name": "test",
"fs": {
"update_rate": "3h"
}
}
`update_rate` is the pause duration between the last time we read the file system and the next run. This means that if you set it to `15m`, the next scan will happen 15 minutes after the end of the current scan, whatever its duration.
Let's say you want to index only docs matching `*.doc` and `*.pdf` but exclude anything matching `resume*`, so that `resume_david.pdf` won't be indexed.
Define the `fs.includes` and `fs.excludes` properties in your `~/.fscrawler/test/_settings.json` file:
{
"name" : "test",
"fs": {
"includes": [
"*.doc",
"*.pdf"
],
"excludes": [
"resume*"
]
}
}
Exclusion rules also apply to directory names. So if you want to ignore the `.ignore` directory, just add `.ignore` as an excluded name. Note that `includes` does not apply to directory names but only to filenames.
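For example (a minimal excerpt; combine it with your other settings):

```json
{
  "name" : "test",
  "fs" : {
    "excludes" : [ ".ignore" ]
  }
}
```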
By default, FS crawler will exclude files starting with `~`.
If you want to index JSon files directly without parsing them with Tika, you can set `json_support` to `true`.
JSon contents will be stored directly under `_source`. If you need to keep JSon documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the JSon contents under the `object` field.
{
"name" : "test",
"fs" : {
"json_support" : true
}
}
Of course, if you did not define a mapping before launching the crawler, Elasticsearch will auto guess the mapping.
If you want to index XML files and convert them to JSON, you can set `xml_support` to `true`.
The content of XML files will be added directly under `_source`. If you need to keep XML documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the XML contents under the `object` field.
{
"name" : "test",
"fs" : {
"xml_support" : true
}
}
Of course, if you did not define a mapping before launching the crawler, Elasticsearch will auto guess the mapping.
The default settings store the contents of json and xml documents directly in the `_source` element of elasticsearch documents. As a result, there is no metadata about file and path, which is necessary to determine whether a document has been deleted or updated; new files will however be added to the index (determined by the file timestamp).
If you need to keep json or xml documents synchronized to elasticsearch, you should set this option.
{
"name" : "test",
"fs" : {
"add_as_inner_object" : true
}
}
By default, FS Crawler will index folder names in the index using a specific `folder` type.
If you don't want to index those folders, you can set `ignore_folders` to `true`.
Note that in that case, FS Crawler won't be able to detect removed folders, so documents that have been indexed in elasticsearch won't be removed when you remove or move the folder.
{
"name" : "test",
"fs" : {
"ignore_folders" : true
}
}
If you have more than one type, create as many crawlers as you have types:
`~/.fscrawler/test_type1/_settings.json`:
{
"name": "test_type1",
"fs": {
"url": "/tmp/type1",
"json_support" : true
},
"elasticsearch": {
"index": "mydocs1",
"index_folder": "myfolders1"
}
}
`~/.fscrawler/test_type2/_settings.json`:
{
"name": "test_type2",
"fs": {
"url": "/tmp/type2",
"json_support" : true
},
"elasticsearch": {
"index": "mydocs2",
"index_folder": "myfolders2"
}
}
`~/.fscrawler/test_type3/_settings.json`:
{
"name": "test_type3",
"fs": {
"url": "/tmp/type3",
"xml_support" : true
},
"elasticsearch": {
"index": "mydocs3",
"index_folder": "myfolders3"
}
}
You can also index many types from one single dir by running several crawlers that scan the same dir with different `includes` parameters:
`~/.fscrawler/test_type1.json`:
{
"name": "test_type1",
"fs": {
"url": "/tmp",
"includes": [ "type1*.json" ],
"json_support" : true
},
"elasticsearch": {
"index": "mydocs1",
"index_folder": "myfolders1"
}
}
`~/.fscrawler/test_type2.json`:
{
"name": "test_type2",
"fs": {
"url": "/tmp",
"includes": [ "type2*.json" ],
"json_support" : true
},
"elasticsearch": {
"index": "mydocs2",
"index_folder": "myfolders2"
}
}
`~/.fscrawler/test_type3.json`:
{
"name": "test_type3",
"fs": {
"url": "/tmp",
"includes": [ "*.xml" ],
"xml_support" : true
},
"elasticsearch": {
"index": "mydocs3",
"index_folder": "myfolders3"
}
}
Please note that the document `_id` is always generated (as a hash value) from the filename to avoid issues with special characters in the filename.
You can force the `_id` to be the filename by using the `filename_as_id` attribute:
{
"name" : "test",
"fs" : {
"filename_as_id" : true
}
}
If you want to add file attributes such as `attributes.owner` and `attributes.group`, you can set `attributes_support` to `true`.
{
"name" : "test",
"fs" : {
"attributes_support" : true
}
}
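Once indexed, these fields can be used in queries. For example (a sketch, assuming your documents are indexed into a `docs` index):

```
GET docs/_search
{
  "query": {
    "term": {
      "attributes.owner": "david"
    }
  }
}
```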
By default, FS Crawler will extract all found metadata within the `meta.raw` object.
If you want to disable this feature, you can set `raw_metadata` to `false`.
{
"name" : "test",
"fs" : {
"raw_metadata" : false
}
}
Generated raw metadata depends on the file format itself.
For example, a PDF document could generate:
"date" : "2016-07-07T08:37:42Z"
"pdf:PDFVersion" : "1.5"
"xmp:CreatorTool" : "Microsoft Word"
"Keywords" : "keyword1, keyword2"
"access_permission:modify_annotations" : "true"
"access_permission:can_print_degraded" : "true"
"subject" : "Test Tika Object"
"dc:creator" : "David Pilato"
"dcterms:created" : "2016-07-07T08:37:42Z"
"Last-Modified" : "2016-07-07T08:37:42Z"
"dcterms:modified" : "2016-07-07T08:37:42Z"
"dc:format" : "application/pdf; version=1.5"
"title" : "Test Tika title"
"Last-Save-Date" : "2016-07-07T08:37:42Z"
"access_permission:fill_in_form" : "true"
"meta:save-date" : "2016-07-07T08:37:42Z"
"pdf:encrypted" : "false"
"dc:title" : "Test Tika title"
"modified" : "2016-07-07T08:37:42Z"
"cp:subject" : "Test Tika Object"
"Content-Type" : "application/pdf"
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser"
"creator" : "David Pilato"
"meta:author" : "David Pilato"
"dc:subject" : "keyword1, keyword2"
"meta:creation-date" : "2016-07-07T08:37:42Z"
"created" : "Thu Jul 07 10:37:42 CEST 2016"
"access_permission:extract_for_accessibility" : "true"
"access_permission:assemble_document" : "true"
"xmpTPg:NPages" : "2"
"Creation-Date" : "2016-07-07T08:37:42Z"
"access_permission:extract_content" : "true"
"access_permission:can_print" : "true"
"meta:keyword" : "keyword1, keyword2"
"Author" : "David Pilato"
"access_permission:can_modify" : "true"
Whereas an MP3 file would generate:
"xmpDM:genre" : "Vocal"
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser"
"creator" : "David Pilato"
"xmpDM:album" : "FS Crawler"
"xmpDM:trackNumber" : "1"
"xmpDM:releaseDate" : "2016"
"meta:author" : "David Pilato"
"xmpDM:artist" : "David Pilato"
"dc:creator" : "David Pilato"
"xmpDM:audioCompressor" : "MP3"
"title" : "Test Tika"
"xmpDM:audioChannelType" : "Stereo"
"version" : "MPEG 3 Layer III Version 1"
"xmpDM:logComment" : "Hello but reverted"
"xmpDM:audioSampleRate" : "44100"
"channels" : "2"
"dc:title" : "Test Tika"
"Author" : "David Pilato"
"xmpDM:duration" : "1018.775146484375"
"Content-Type" : "audio/mpeg"
"samplerate" : "44100"
As elasticsearch by default automatically guesses field types, you could end up with conflicts between raw metadata fields: a field which is first detected as a date could, for another document, receive a value like "in the seventies". In such a case, you could consider forcing the mapping or defining an index mapping template.
Note that dots in metadata names will be replaced by a `:`. For example, `PTEX.Fullbanner` will be indexed as `PTEX:Fullbanner`.
By default, FS crawler will create a field to store the original file size in octets. You can disable it using the `add_filesize` option:
{
"name" : "test",
"fs" : {
"add_filesize" : false
}
}
If you don't want to remove indexed documents when you remove a file or a directory, you can set `remove_deleted` to `false` (default is `true`):
{
"name" : "test",
"fs" : {
"remove_deleted" : false
}
}
If you don't want to extract file content but only index filesystem metadata such as filename, date, size and path, you can set `index_content` to `false` (default is `true`):
{
"name" : "test",
"fs" : {
"index_content" : false
}
}
By default, FS Crawler will immediately stop indexing if it hits a Permission denied exception. If you want to just skip this file and continue with the rest of the directory tree, you can set `continue_on_error` to `true` (default is `false`):
{
"name" : "test",
"fs" : {
"continue_on_error" : true
}
}
From FS crawler 2.2, you can ask for language detection using the `lang_detect` option:
{
"name" : "test",
"fs" : {
"lang_detect" : true
}
}
In that case, a new field named `meta.language` is added to the generated JSon document.
If you are using elasticsearch 5.0 or above, you can use this value to send your document to a specific index using a Node Ingest pipeline.
For example, you can define a pipeline named `langdetect` with:
PUT _ingest/pipeline/langdetect
{
"description" : "langdetect pipeline",
"processors" : [
{
"set": {
"field": "_index",
"value": "myindex-{{meta.language}}"
}
}
]
}
In the FS crawler settings, set both the `fs.lang_detect` and `elasticsearch.pipeline` options:
{
"name" : "test",
"fs" : {
"lang_detect" : true
},
"elasticsearch" : {
"pipeline" : "langdetect"
}
}
And then, a document containing French text will be sent to `myindex-fr`, while a document containing English text will be sent to `myindex-en`.
You can also imagine changing the field name from `content` to `content-fr` or `content-en`. That will help you define the correct analyzer to use.
Language detection might detect more than one language in a given text, but only the most accurate one will be set. This means that if you have a document containing 80% French and 20% English, the document will be marked as `fr`.
Note that language detection is CPU and time consuming.
You can store the binary document in elasticsearch itself using the `store_source` option:
{
"name" : "test",
"fs" : {
"store_source" : true
}
}
In that case, a new field named `attachment` is added to the generated JSon document. This field is not indexed.
The default mapping for the `attachment` field is:
{
"doc" : {
"properties" : {
"attachment" : {
"type" : "binary",
"doc_values" : false
}
// ... Other properties here
}
}
}
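Since the field is not indexed, you retrieve it by fetching the document itself and decoding it. A minimal sketch (assuming a hypothetical `docs` index and document id `YOUR_DOC_ID`, with `jq` and `base64` available locally):

```bash
# Fetch the document, extract the BASE64-encoded source and decode it back to a file
curl -s 'http://127.0.0.1:9200/docs/doc/YOUR_DOC_ID' \
  | jq -r '._source.attachment' | base64 --decode > restored_file
```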
By default, FS crawler will extract only the first 100,000 characters. But you can set `indexed_chars` to `5000` in the FS crawler settings in order to override this default.
{
"name": "test",
"fs": {
"indexed_chars": "5000"
}
}
This number can be either a fixed number of characters or a percentage using the `%` sign. A percentage value is applied to the file size to determine the number of characters the crawler needs to extract. If you want to index only `80%` of the file size, set `indexed_chars` to `"80%"`. Of course, if you want to index the full document, you can set this property to `"100%"`. Double values are also supported, so `"0.01%"` is also a correct value.
Compressed files: if your file is compressed, you might need to increase `indexed_chars` to more than `"100%"`, for example `"150%"`.
If you want to extract the full content, set `indexed_chars` to `"-1"`.
Note: Tika needs to allocate an in-memory data structure to extract text. Setting `indexed_chars` to a high number will require more memory!
If you want FS crawler to generate a checksum for each file, set `checksum` to the algorithm you wish to use to compute the checksum, such as `MD5` or `SHA-1`.
{
"name": "test",
"fs": {
"checksum": "MD5"
}
}
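This can be handy for finding duplicate files. For example (a sketch, assuming a hypothetical `docs` index):

```
GET docs/_search
{
  "query": {
    "term": {
      "file.checksum": "c32eafae2587bef4b3b32f73743c3c61"
    }
  }
}
```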
You can index files remotely using SSH.
Here is a list of SSH settings (under the `server.` prefix):
Name | Default value | Documentation |
---|---|---|
`server.hostname` | `null` | Hostname |
`server.port` | `22` | Port |
`server.username` | `null` | Username |
`server.password` | `null` | Password |
`server.protocol` | `"local"` | Set it to `ssh` |
`server.pem_path` | `null` | Using Username / PEM file |
Let's say you want to index from a remote server using SSH:
- FS URL: `/path/to/data/dir/on/server`
- Server: `mynode.mydomain.com`
- Username: `username`
- Password: `password`
- Protocol: `ssh` (default is `local`)
- Port: `22` (default is `22`)
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir/on/server"
},
"server" : {
"hostname" : "mynode.mydomain.com",
"port" : 22,
"username" : "username",
"password" : "password",
"protocol" : "ssh"
}
}
Let's say you want to index from a remote server using SSH:
- FS URL: `/path/to/data/dir/on/server`
- Server: `mynode.mydomain.com`
- Username: `username`
- PEM File: `/path/to/private_key.pem`
- Protocol: `ssh` (default is `local`)
- Port: `22` (default is `22`)
{
"name" : "test",
"fs" : {
"url" : "/path/to/data/dir/on/server"
},
"server" : {
"hostname" : "mynode.mydomain.com",
"port" : 22,
"username" : "username",
"protocol" : "ssh",
"pem_path": "/path/to/private_key.pem"
}
}
Here is a list of Elasticsearch settings (under the `elasticsearch.` prefix):
Name | Default value | Documentation |
---|---|---|
`elasticsearch.index` | job name | Index name for docs. See Index settings |
`elasticsearch.index_folder` | job name + `_folder` | Index name for folders. See Index settings |
`elasticsearch.bulk_size` | `100` | Bulk settings |
`elasticsearch.flush_interval` | `"5s"` | Bulk settings |
`elasticsearch.pipeline` | `null` | Using Ingest Node Pipeline (from 2.2) |
`elasticsearch.nodes` | `http://127.0.0.1:9200` | Node settings |
`elasticsearch.username` | `null` | Username. See Using credentials (X-Pack) (from 2.2) |
`elasticsearch.password` | `null` | Password. See Using credentials (X-Pack) (from 2.2) |
By default, FS crawler will index your data in an index whose name is the crawler name (the `name` property) plus the `_doc` suffix, like `test_doc`. You can change it by setting the `index` field:
{
"name" : "test",
"elasticsearch" : {
"index" : "docs"
}
}
When FS crawler needs to create the doc index, it applies some default settings and mappings which are read from `~/.fscrawler/_default/5/_settings_doc.json`.
You can read its content from the source.
The settings define an analyzer named `fscrawler_path` which uses a path hierarchy tokenizer.
FS crawler will also index folders in an index whose name is the crawler name (the `name` property) plus the `_folder` suffix, like `test_folder`. You can change it by setting the `index_folder` field:
{
"name" : "test",
"elasticsearch" : {
"index_folder" : "folders"
}
}
FS crawler also automatically applies a mapping which is read from `~/.fscrawler/_default/5/_settings_folder.json`.
Source here.
You can also display the index mapping being used with Kibana:
GET docs/_mapping
Or fall back to the command line:
curl 'http://localhost:9200/docs/_mapping?pretty'
If you want to define your own index settings and mapping, to set analyzers for example, you can either create the index and push the mapping yourself, or define a `~/.fscrawler/_default/5/_settings_doc.json` document which contains the index settings and mappings you wish, before starting FS crawler.
The following example uses a `french` analyzer to index the `content` field.
{
"settings": {
"analysis": {
"analyzer": {
"fscrawler_path": {
"tokenizer": "fscrawler_path"
}
},
"tokenizer": {
"fscrawler_path": {
"type": "path_hierarchy"
}
}
}
},
"mappings": {
"doc": {
"properties" : {
"attachment" : {
"type" : "binary",
"doc_values" : false
},
"attributes" : {
"properties" : {
"group" : {
"type" : "keyword"
},
"owner" : {
"type" : "keyword"
}
}
},
"content" : {
"type" : "text",
"analyzer" : "french"
},
"file" : {
"properties" : {
"content_type" : {
"type" : "keyword"
},
"filename" : {
"type" : "keyword"
},
"extension" : {
"type" : "keyword"
},
"filesize" : {
"type" : "long"
},
"indexed_chars" : {
"type" : "long"
},
"indexing_date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"last_modified" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"checksum": {
"type": "keyword"
},
"url" : {
"type" : "keyword",
"index" : false
}
}
},
"meta" : {
"properties" : {
"author" : {
"type" : "text"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "text"
},
"title" : {
"type" : "text"
},
"language" : {
"type" : "keyword"
}
}
},
"path" : {
"properties" : {
"real" : {
"type" : "keyword",
"fields": {
"tree": {
"type" : "text",
"analyzer": "fscrawler_path",
"fielddata": true
}
}
},
"root" : {
"type" : "keyword"
},
"virtual" : {
"type" : "keyword",
"fields": {
"tree": {
"type" : "text",
"analyzer": "fscrawler_path",
"fielddata": true
}
}
}
}
}
}
}
}
}
Note that if you want to push the mapping to elasticsearch manually, you can use the classic REST calls:
# Create index (don't forget to add the fscrawler_path analyzer)
PUT docs
{
// Same index settings as previously seen
}
Let's say you created a job named `job_name` and you are sending documents to an elasticsearch cluster running version `5.x`.
If you create the following files, they will be picked up at job start time instead of the default ones:
- `~/.fscrawler/{job_name}/_mappings/5/_settings_doc.json`
- `~/.fscrawler/{job_name}/_mappings/5/_settings_folder.json`
You can do the same for other elasticsearch versions with:
- `~/.fscrawler/{job_name}/_mappings/1/_settings_doc.json` for the 1.x series
- `~/.fscrawler/{job_name}/_mappings/1/_settings_folder.json` for the 1.x series
- `~/.fscrawler/{job_name}/_mappings/2/_settings_doc.json` for the 2.x series
- `~/.fscrawler/{job_name}/_mappings/2/_settings_folder.json` for the 2.x series
- `~/.fscrawler/{job_name}/_mappings/6/_settings_doc.json` for the 6.x series
- `~/.fscrawler/{job_name}/_mappings/6/_settings_folder.json` for the 6.x series
Unfortunately, you cannot change the mapping on existing data. Therefore, you first need to remove the existing index, which means removing all existing data, and then restart FS crawler with the new mapping.
You might want to try the elasticsearch Reindex API though.
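A minimal reindex sketch (assuming hypothetical index names `docs_old` and `docs_new`, with `docs_new` created beforehand using the new settings and mappings):

```
POST _reindex
{
  "source": {
    "index": "docs_old"
  },
  "dest": {
    "index": "docs_new"
  }
}
```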
FS crawler uses bulk requests to send data to elasticsearch. By default, the bulk is executed every 100 operations or every 5 seconds. You can change the default settings using `bulk_size` and `flush_interval`:
{
"name" : "test",
"elasticsearch" : {
"bulk_size" : 1000,
"flush_interval" : "2s"
}
}
Note that elasticsearch has a default limit of `100mb` per HTTP request. This means that if you are indexing a massive bulk of documents, you might hit that limit and FSCrawler will throw an error like `entity content is too long [xxx] for the configured buffer limit [104857600]`.
You can either raise this limit on the elasticsearch side by setting `http.max_content_length` to a higher value (but please be aware that this will consume much more memory on the elasticsearch side), or you can decrease the `bulk_size` setting to a smaller value.
If you are using an elasticsearch cluster running version 5.0 or above, you can use an Ingest Node pipeline to transform documents sent by FS crawler before they are actually indexed. Please note that folder objects are not sent through the pipeline as they are more internal objects.
For example, if you have the following pipeline:
PUT _ingest/pipeline/fscrawler
{
"description" : "fscrawler pipeline",
"processors" : [
{
"set" : {
"field": "foo",
"value": "bar"
}
}
]
}
In the FS crawler settings, set the `elasticsearch.pipeline` option:
{
"name" : "test",
"elasticsearch" : {
"pipeline" : "fscrawler"
}
}
Note that this option is available from FS crawler 2.2.
FS crawler uses the elasticsearch REST layer to send data to your running cluster. By default, it connects to `127.0.0.1` on port `9200`, which are the default settings when running a local node on your machine.
Of course, in production, you would probably change this and connect to a production cluster:
{
"name" : "test",
"elasticsearch" : {
"nodes" : [
{ "host" : "mynode1.mycompany.com", "port" : 9200, "scheme" : "HTTP" }
]
}
}
You can define multiple nodes:
{
"name" : "test",
"elasticsearch" : {
"nodes" : [
{ "host" : "mynode1.mycompany.com", "port" : 9200, "scheme" : "HTTP" },
{ "host" : "mynode2.mycompany.com", "port" : 9200, "scheme" : "HTTP" },
{ "host" : "mynode3.mycompany.com", "port" : 9200, "scheme" : "HTTP" }
]
}
}
You can use HTTPS instead of default HTTP (from 2.2):
{
"name" : "test",
"elasticsearch" : {
"nodes" : [
{ "host" : "CLUSTERID.eu-west-1.aws.found.io", "port" : 9243, "scheme" : "HTTPS" }
]
}
}
If you secured your elasticsearch cluster with X-Pack, you can provide the `username` and `password` to FS crawler:
{
"name" : "test",
"elasticsearch" : {
"username" : "elastic",
"password" : "changeme"
}
}
WARNING: note that, in the current version, the elasticsearch password is stored in plain text in your job setting file.
A better practice is to only set the username, or to pass it with the `--username elastic` option when starting FS Crawler.
If the password is not defined, you will be prompted when starting the job:
22:46:42,528 INFO [f.p.e.c.f.FsCrawler] Password for elastic:
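For example, a start command passing only the username (`elastic` is just the common X-Pack superuser name; use your own):

```bash
bin/fscrawler job_name --username elastic
```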
FS crawler creates the following fields:
Field | Description | Example |
---|---|---|
`content` | Extracted content | `"This is my text!"` |
`attachment` | BASE64 encoded binary file | BASE64 Encoded document |
`meta.author` | Author if any in document metadata | `"David Pilato"` |
`meta.title` | Title if any in document metadata | `"My document title"` |
`meta.date` | Document date if any in document metadata | `"2013-04-04T15:21:35"` |
`meta.keywords` | Keywords if any in document metadata | `["river","fs","elasticsearch"]` |
`meta.language` | Detected language if any | `"fr"` |
`meta.raw` | An object with all raw metadata | `"meta.raw.channels" : "2"` |
`file.content_type` | Content type | `"application/vnd.oasis.opendocument.text"` |
`file.last_modified` | Last modification date | `1386855978000` |
`file.indexing_date` | Indexing date | `"2013-12-12T13:50:58.758Z"` |
`file.filesize` | File size in bytes | `1256362` |
`file.indexed_chars` | Extracted chars if `fs.indexed_chars` > 0 | `100000` |
`file.filename` | Original file name | `"mydocument.pdf"` |
`file.extension` | Original file name extension (from 2.2) | `"pdf"` |
`file.url` | Original file url | `"file://tmp/mydir/otherdir/mydocument.pdf"` |
`file.checksum` | Checksum if `fs.checksum` set | `"c32eafae2587bef4b3b32f73743c3c61"` |
`path.virtual` | Relative path from root path | `"/mydir/otherdir/mydocument.pdf"` |
`path.root` | MD5 encoded parent path (for internal use) | `"112aed83738239dbfe4485f024cd4ce1"` |
`path.real` | Actual real path name | `"/tmp/mydir/otherdir/mydocument.pdf"` |
`attributes.owner` | Owner name | `"david"` |
`attributes.group` | Group name | `"staff"` |
Here is a typical JSON document generated by the crawler:
{
"file":{
"filename":"test.odt",
"extension":"odt",
"last_modified":1386855978000,
"indexing_date":"2013-12-12T13:50:58.758Z",
"content_type":"application/vnd.oasis.opendocument.text",
"url":"file:///tmp/testfs_metadata/test.odt",
"indexed_chars":100000,
"filesize":8355,
"checksum":"c32eafae2587bef4b3b32f73743c3c61"
},
"path":{
"root":"bceb3913f6d793e915beb70a4735592",
"virtual":"/test.odt",
"real":"/tmp/testfs_metadata/test.odt"
},
"attributes": {
"owner": "david",
"group": "staff"
},
"meta":{
"author":"David Pilato",
"title":"Mon titre",
"date":"2013-04-04T15:21:35",
"keywords":[
"fs",
"elasticsearch",
"crawler"
],
"language":"fr"
},
"content":"Bonjour David\n\n\n"
}
You can use the `content` field to perform a full-text search:
GET docs/_search
{
"query" : {
"match" : {
"content" : "the quick brown fox"
}
}
}
You can also search on meta fields:
GET docs/_search
{
"query" : {
"term" : {
"file.filename" : "mydocument.pdf"
}
}
}
Or run some aggregations on top of them, like:
GET docs/_search
{
"size": 0,
"aggs": {
"by_extension": {
"terms": {
"field": "file.extension"
}
}
}
}
From 2.2, FS crawler comes with a REST service available by default at `http://127.0.0.1:8080/fscrawler`. To activate it, launch FS Crawler with the `--rest` option.
To get an overview of the running service, you can call the `GET /` endpoint:
curl http://127.0.0.1:8080/fscrawler/
It will give you a response similar to:
{
"ok" : true,
"version" : "2.2",
"elasticsearch" : "5.1.1",
"settings" : {
"name" : "fscrawler-rest-tests",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"index" : "fscrawler-rest-tests_doc",
"index_folder" : "fscrawler-rest-tests_folder",
"bulk_size" : 100,
"flush_interval" : "5s",
"username" : "elastic"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
}
To upload a binary, you can call the `POST /_upload` endpoint:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"
It will give you a response similar to:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}
The `url` field is the elasticsearch address of the indexed document.
If you call:
curl http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty
You will get back your document as it has been stored by elasticsearch:
{
"_index" : "fscrawler-rest-tests_doc",
"_type" : "doc",
"_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
"_version" : 1,
"found" : true,
"_source" : {
"content" : "This file contains some words.\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T21:01:08.043",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
If you started FS crawler in debug mode with `--debug`, or if you pass the `debug=true` query parameter, then the response will be much more complete:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"
This will give:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534",
"doc" : {
"content" : "This file contains some words.\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T14:05:10.325",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
If you want to get back the extracted content and its metadata without indexing into elasticsearch, you can use the `simulate=true` query parameter:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true"
By default, FS crawler encodes the filename to generate an id. This means that if you send two files with the same filename, `test.txt`, the second one will overwrite the first one because they will both share the same ID.
You can force any id you wish by adding `id=YOUR_ID` to the form data:
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/fscrawler/_upload"
There is a specific id named `_auto_`, for which the ID will be autogenerated by elasticsearch. It means that sending the same file twice will result in two different indexed documents.
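For example (a sketch reusing the upload call shown above):

```bash
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=_auto_" "http://127.0.0.1:8080/fscrawler/_upload"
```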
Here is a list of REST service settings (under the `rest.` prefix):
Name | Default value | Documentation |
---|---|---|
`rest.scheme` | `http` | Scheme. Can be either `http` or `https` |
`rest.host` | `127.0.0.1` | Bound host |
`rest.port` | `8080` | Bound port |
`rest.endpoint` | `fscrawler` | Endpoint |
The REST service runs at `http://127.0.0.1:8080/fscrawler` by default. You can change this using the `rest` settings:
{
"name" : "test",
"rest" : {
"scheme" : "HTTP",
"host" : "192.168.0.1",
"port" : 8180,
"endpoint" : "my_fscrawler"
}
}
It also means that if you are running more than one instance of FS crawler locally, you can (must) change the `port`.
There is no specific support for HDFS in FS crawler. But you can mount your HDFS on your machine and run FS crawler on this mount point. You can also read details about HDFS NFS Gateway.
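A hypothetical sketch of that setup (mount options follow the HDFS NFS Gateway documentation; adapt the namenode host and mount point to your environment):

```bash
# Mount HDFS locally through the NFS Gateway, then point the crawler at the mount point
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync namenode.example.com:/ /mnt/hdfs
# In your job settings, set "url" : "/mnt/hdfs"
```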
To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your FS crawler root directory. After the next index update, the text will be indexed and placed in `_source.content`.
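For example, Tesseract can typically be installed with your package manager (package names may vary by platform):

```bash
brew install tesseract                # macOS (Homebrew)
sudo apt-get install tesseract-ocr    # Debian/Ubuntu
```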
By default, FS crawler will also try to extract images from your PDF documents and run OCR on them. This can be a CPU intensive operation. If you don't mean to run OCR on PDFs but only on images, you can set `fs.pdf_ocr` to `false`:
{
"name" : "test",
"fs" : {
"pdf_ocr" : false
}
}
To use FS crawler with docker, check docker-fscrawler recipe.
This software is licensed under the Apache 2 license, quoted below.
Copyright 2011-2017 David Pilato
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.