Collects some metadata about FASTQ files and stores them in elasticsearch.

fejesa/fastq-elasticsearch

fastq-elasticsearch

Collects some metadata about FASTQ files and stores them in elasticsearch

Build

mvn clean install

This command generates the bundle that contains all dependencies.

Start

The fastq-elastic.sh script can be used to start the app from a console.

Configuration

The metadata about the sample files is stored as JSON documents. Before starting the fastq-elastic tool, prepare the index mapping in Elasticsearch. The mapping configuration can be found in the git repo (src/main/resources/sampledb-index.json).
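As a rough sketch of the mapping step, the index could be created by sending the mapping file to Elasticsearch with an HTTP PUT. The index name sampledb is taken from the queries further down, and host and port match the defaults in sample.conf. It is an assumption here that sampledb-index.json is a complete create-index body; the helper name build_create_index_request is hypothetical.

```python
import json
import urllib.request


def build_create_index_request(mapping_path, host="localhost", port=9200, index="sampledb"):
    """Build (but do not send) a PUT request that creates the index from a mapping file."""
    with open(mapping_path, "rb") as fh:
        body = fh.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        url=f"http://{host}:{port}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Usage (needs a running Elasticsearch, so it is left commented out):
# request = build_create_index_request("src/main/resources/sampledb-index.json")
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

The same request can be issued from the Kibana console or with any HTTP client; the script form is only convenient when the setup needs to be repeated.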

The next step is to configure the fastq-elastic tool: set your custom values in the sample.conf file.

{
    elastic.host = localhost
    elastic.port = 9200

    # Supported file types
    file.extensions = [fastq.gz]

    # List of folders that should be parsed
    folders.root = [
        /sample/folder1,
        /sample/folder2
    ]

    # List of ignored folders
    folders.exclusive = []
}
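To make the meaning of these settings concrete, the sketch below shows one way the three lists could drive file discovery: walk every folder in folders.root, skip anything listed in folders.exclusive, and keep only files ending in one of file.extensions. This is an illustration under those assumptions, not the tool's actual implementation; find_sample_files is a hypothetical name.

```python
import os


def find_sample_files(roots, extensions, excluded=()):
    """Yield files under the root folders that match an extension and are not excluded."""
    excluded = {os.path.normpath(p) for p in excluded}
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    for root in roots:
        for current, dirs, files in os.walk(root):
            # Prune ignored folders in place so os.walk does not descend into them.
            dirs[:] = [
                d for d in dirs
                if os.path.normpath(os.path.join(current, d)) not in excluded
            ]
            for name in sorted(files):
                if name.endswith(suffixes):
                    yield os.path.join(current, name)


# Example call with the values from sample.conf:
# for path in find_sample_files(["/sample/folder1", "/sample/folder2"], ["fastq.gz"]):
#     print(path)
```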

Cheat sheet

The most interesting part of the fastq-elastic service is which collected data we can retrieve from Elasticsearch, and how. The following section shows some queries that can be run from the Kibana console.

A more general cheat sheet is available at http://elasticsearch-cheatsheet.jolicode.com/.

Count the number of samples

GET sampledb/_doc/_count
{
  "query": {
    "wildcard": {
      "sample.samplePath": "*"
    }
  }
}
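The console queries in this section can also be run from a script. The following minimal sketch, using only the Python standard library, turns a console-style path and body into an HTTP request; the helper names are hypothetical and the host and port are the sample.conf defaults. Elasticsearch accepts POST for _count and _search, which avoids sending a body with GET. Note that on Elasticsearch 7 and later the _doc segment in these paths is deprecated or removed, so the path may need to be sampledb/_count there.

```python
import json
import urllib.request


def build_query_request(path, body, host="localhost", port=9200):
    """Build a POST request for a console-style path such as 'sampledb/_doc/_count'."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_query(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


count_body = {"query": {"wildcard": {"sample.samplePath": "*"}}}
count_request = build_query_request("sampledb/_doc/_count", count_body)
# result = run_query(count_request)  # needs a running Elasticsearch
# print(result["count"])             # the _count API reports the total under "count"
```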

Get sample files whose names start with 'XXX-KM-34_S34'

GET sampledb/_doc/_search
{
  "query": {
    "wildcard": {
      "sample.sampleName.exact": "XXX-KM-34_S34*"
    }
  }
}

Get all sample files whose names contain 'XXX5S' and whose file length is at least 30 MB (30000000 bytes)

GET sampledb/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "sample.sampleName.exact": "*XXX5S*"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "sample.fileLength": {
              "gte": 30000000
            }
          }
        }
      ]
    }
  }
}
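The query above combines a scored must clause with a filter clause; filter clauses only include or exclude documents and do not affect the relevance score. A small sketch of a builder for this query shape is given below, with the name fragment and the minimum size in bytes as parameters. The function name is hypothetical; the field names are the ones used in the query.

```python
def name_and_size_query(name_fragment, min_bytes):
    """Build a bool query: name contains the fragment AND fileLength >= min_bytes."""
    return {
        "query": {
            "bool": {
                # Wildcard match on the exact (not analyzed) sample name.
                "must": [
                    {"wildcard": {"sample.sampleName.exact": f"*{name_fragment}*"}}
                ],
                # Size restriction as a non-scoring filter; the value is in bytes.
                "filter": [
                    {"range": {"sample.fileLength": {"gte": min_bytes}}}
                ],
            }
        }
    }


query = name_and_size_query("XXX5S", 30_000_000)
```

Passing the threshold as a number of bytes keeps the unit explicit: 30000000 is 30 MB in decimal units, while the aggregation examples in this section divide by 1048576.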

Get the 20 most frequently repeated sample file names

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "distinct_sample": {
      "terms": {
        "field": "sample.sampleName.exact",
        "size": 20
      }
    }
  }
}
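A terms aggregation returns the most frequent names, which can include names that occur only once when there are fewer than 20 real duplicates. As a hedged sketch, the terms aggregation's min_doc_count option can restrict the buckets to names seen at least twice, and the buckets can be read from the response as shown. The function names and the example response are hypothetical and for illustration only.

```python
def duplicate_names_query(top_n=20, min_count=2):
    """Build a terms aggregation that only returns names seen at least min_count times."""
    return {
        "size": 0,
        "aggs": {
            "distinct_sample": {
                "terms": {
                    "field": "sample.sampleName.exact",
                    "size": top_n,
                    "min_doc_count": min_count,
                }
            }
        },
    }


def duplicated_names(response, agg_name="distinct_sample"):
    """Return {name: count} for buckets that occur more than once."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets if b["doc_count"] > 1}


# Hypothetical response fragment, not real data.
example_response = {
    "aggregations": {
        "distinct_sample": {
            "buckets": [
                {"key": "sample_A", "doc_count": 3},
                {"key": "sample_B", "doc_count": 1},
            ]
        }
    }
}
duplicates = duplicated_names(example_response)
```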

Find the largest sample file in MB using aggregation (in 2 steps)

POST sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "largest_sample": {
      "max": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        }
      }
    }
  }
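}

Then look up the document whose file length equals that maximum. The first query reports the value in MB, so convert it back to bytes before using it here:

GET sampledb/_doc/_search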
{
  "query": {
    "match": {
      "sample.fileLength": 58362878472
    }
  }
}
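The first step returns the maximum in MB (the script divides by 1048576), while the second step matches on the raw byte count. A minimal sketch of the conversion between the two steps follows; the function names are hypothetical, and the byte value is the one used in the query above.

```python
BYTES_PER_MB = 1048576  # same divisor as the aggregation script


def mb_to_bytes(size_in_mb):
    """Convert the aggregation result (MB) back to a whole number of bytes."""
    return round(size_in_mb * BYTES_PER_MB)


def match_by_length_query(max_in_mb):
    """Build the second-step query from the first step's result."""
    return {"query": {"match": {"sample.fileLength": mb_to_bytes(max_in_mb)}}}


# The first step reports the value under aggregations.largest_sample.value.
# Here that value is simulated from the byte count used in the query above.
value_in_mb = 58362878472 / BYTES_PER_MB
second_step = match_by_length_query(value_in_mb)
```

Dropping the script from the max aggregation would return the maximum directly in bytes and make the conversion unnecessary.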

Find the 3 largest sample files using a query with sorting (in 1 step)

GET sampledb/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "doc['sample.fileLength'].value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        },
        "order": "desc"
      }
    }
  ],
  "size": 3
}
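Dividing every file length by the same positive constant does not change the ordering, so the same three documents should come back when sorting on the raw field. The script mainly affects the sort value reported with each hit, which is then in MB. The sketch below builds the plain field sort and converts bytes to MB on the client side; it assumes sample.fileLength is mapped as a numeric field, and the function names are hypothetical.

```python
BYTES_PER_MB = 1048576


def largest_files_query(top_n=3):
    """Build a search that sorts by raw file length, largest first."""
    return {
        "query": {"match_all": {}},
        "sort": [{"sample.fileLength": {"order": "desc"}}],
        "size": top_n,
    }


def bytes_to_mb(length_in_bytes):
    """Convert a byte count to MB for display."""
    return length_in_bytes / BYTES_PER_MB


query = largest_files_query()
```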

Get the total size of the sample files in GB

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "total_size": {
      "sum": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_gb",
          "params": {
            "in_gb": 1073741824
          }
        }
      }
    }
  }
}
