Merge pull request #315 from usc-isi-i2/development
merging development into master to tag a release and push to pypi
saggu committed Jul 12, 2018
2 parents 84fd866 + 4243e9f commit 5f48a9d
Showing 410 changed files with 84,762 additions and 47,649 deletions.
9 changes: 9 additions & 0 deletions .dockerignore
@@ -0,0 +1,9 @@
*.out
*.err
.idea
.DS_Store
*.test
*.log
examples/*
notebooks/*
etk/unit_tests/*
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
__pycache__/
*.py[cod]
*$py.class
*.pyc

# C extensions
*.so
20 changes: 15 additions & 5 deletions .travis.yml
@@ -1,11 +1,14 @@
sudo: required
language: python
python:
- 2.7.11
- 3.6.2
services:
- docker
install:
- sudo apt-get update
# We do this conditionally because it saves us some downloading if the
# version is the same.
- if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh; else wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh; fi
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- hash -r
@@ -14,10 +17,17 @@ install:
# Useful for debugging any issues with conda
- conda info -a
- conda-env create .
- source activate etk_env
- source activate etk2_env
script:
- python -m spacy download en
- python -m unittest discover
- python -m spacy download en_core_web_sm
- python -W ignore -m unittest discover
notifications:
slack:
secure: O7Cj7NvZHMu7vRkWwmnyAesXvIGfG7NBkgJWdN+y69HSHJwn1szoZQjiNUrVBCePCzEHsmSa40aT34/dV18Kj4cpqWnaLK2iof/YwxJk2YgsKaUnS2FcmTfzchQ+TbtVmnI2DFi47VnVR48u+6nKM2xqoczHICgl1MY2HPm16ldJHyuj2TTt/syju2t4cjq4sSgwPIEnG3te+435+Y0TNWuO+7NsLU2wuC2e2ExCNeqoUzq9qDOtX99/E269OKrLz5pJElQabsJZts68g3trwLt5qMqQd7YNXdUQoqNzznueXe6O39nMSiS8JXWMj2jjyC77Oho1KY0GhvMtEdacwM4x7hLzLMJ2pXR5QYBaDAF/vI7tCG83R1Y9YRhAbXxGWFzC6PANG2Q6SAObTnr9ezrezXP3CvWnsSFMsHsim4Xevf3g7VIe4jo/UE/50zsI/l9+DAITMYE29p4kWE4KFRazTk7HJcYneeRh7MZ5VTQR4sDhPJJwu+ftmEBVJu8nBtvElYBN0r23helSmvPM22EQ3rKMbNFJTd0gETrVTyqc4j9iar+kFVnbClpB1SmE3eGxfeNbIuHk+7345aZ0Ywqvxs0dqjJgjE+4Laa6vUNC8jjgtyZshMGYs7LoTzD5HMcZ0MrVspimK1UjODQg70daQZsph1NozQ+TawnqStE=

deploy:
- provider: script
  script: bash docker_deploy.sh
  on:
    tags: true
    branch: development
44 changes: 44 additions & 0 deletions Dockerfile
@@ -0,0 +1,44 @@
# ETK base for all
FROM ubuntu:16.04

# all packages and environments are in /app
WORKDIR /app

## install required command utils
RUN apt-get update && apt-get install -y \
locales \
build-essential \
python \
python-dev \
git \
wget \
curl \
vim
RUN locale-gen en_US.UTF-8

# install pip
RUN wget https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py

# install conda
RUN wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x Miniconda3-latest-Linux-x86_64.sh && \
./Miniconda3-latest-Linux-x86_64.sh -p /app/miniconda -b && \
rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/app/miniconda/bin:${PATH}
RUN conda update -y conda

# install etk dependencies (install them here for cache of image building)
RUN mkdir /app/etk
ADD environment.yml /app/etk

# create and config conda-env for etk
RUN cd /app/etk && conda-env create .
# set etk2_env as default env
ENV PATH /app/miniconda/envs/etk2_env/bin:$PATH
RUN /bin/bash -c "python -m spacy download en_core_web_sm"

# add etk
ADD . /app/etk

CMD /bin/bash
21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

81 changes: 61 additions & 20 deletions README.md
@@ -1,23 +1,64 @@
# etk
![travis ci](https://travis-ci.org/usc-isi-i2/etk.svg?branch=master)
This repository will contain our toolkit for extracting information from web pages.
It will be built in stages to contain the following capabilities:

* Several structure extractors to identify the main content of a page and tables
* A host of data extractors for common entities, including people, places, phone, email, dates, etc.
* A trainable algorithm to rank extractions
* Automated experimentation to measure precision and recall of extractions
## Setup
`conda-env create .`
`source activate etk_env`
`python -m spacy download en`

## Run Tests
# ETK: Information Extraction Toolkit

ETK is a Python library for high-precision information extraction from many document formats.
It provides a flexible framework of **composable extractors** that lets you combine the **predefined extractors** provided in ETK with custom extractors that you develop for your application.
It supports extraction from HTML pages, text documents, CSV and Excel files, and JSON documents.
ETK is open-source software, released under the MIT license.
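
The idea of composable extractors can be illustrated with a minimal plain-Python sketch. This is not ETK's API: the names `email_extractor`, `hashtag_extractor`, and `compose` are hypothetical, and the regular expressions are deliberately simple stand-ins for real predefined and custom extractors.

```python
import re
from typing import Callable, Dict, List

# Hypothetical type for illustration: an extractor maps text to extracted strings.
Extractor = Callable[[str], List[str]]


def email_extractor(text: str) -> List[str]:
    """Toy stand-in for a predefined extractor (simplified email regex)."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)


def hashtag_extractor(text: str) -> List[str]:
    """Toy stand-in for a custom, application-specific extractor."""
    return re.findall(r"#\w+", text)


def compose(*extractors: Extractor) -> Callable[[str], Dict[str, List[str]]]:
    """Combine extractors: run each on the same text, keyed by function name."""
    def run(text: str) -> Dict[str, List[str]]:
        return {e.__name__: e(text) for e in extractors}
    return run


pipeline = compose(email_extractor, hashtag_extractor)
result = pipeline("Contact info@example.org about #etk")
print(result)
```

Each extractor stays independent, so a custom one can be added to the pipeline without touching the others.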



![MIT License](https://img.shields.io/badge/license-MIT-blue.svg) ![travis ci](https://travis-ci.org/usc-isi-i2/etk.svg?branch=etk2)

## Documentation


## Features

* Extraction from HTML, text, CSV, Excel, JSON
* High-precision predefined extractors for common entities (dates, phones, email, cities, ...)
* Extraction of microdata, schema.org and RDFa markup
* Integration with [spaCy](https://github.com/explosion/spaCy) for text processing
* Automatic identification and extraction of HTML tables containing data
* Automatic identification and extraction of time series
* Semi-automatic generation of Web wrappers
* Scalable execution and management of extraction pipelines
* Automatic provenance recording

## Releases

- [Source code](https://github.com/usc-isi-i2/etk/releases)
- [Docker images](https://hub.docker.com/r/uscisii2/etk/tags/)

## Installation

<table>
<tr><td><b>Operating system:</b></td><td>macOS / OS X, Linux, Windows</td></tr>
<tr><td><b>Python version:</b></td><td>Python 3.6+</td></tr>
</table>

Clone or fork this repository, open a terminal window and in the directory where you downloaded ETK type the following commands:
```
conda-env create .
source activate etk2_env
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
```
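
Before creating the environment, it may help to confirm that the interpreter meets the Python 3.6+ requirement listed above. The following is a minimal sketch; the helper name `meets_requirement` is an assumption for illustration and is not part of ETK.

```python
import sys

# Minimum version taken from the installation table (Python 3.6+).
MIN_PYTHON = (3, 6)


def meets_requirement(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if meets_requirement():
        print("Python version is supported")
    else:
        print("Python 3.6 or newer is required")
```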

## Run Tests

`python -m unittest discover`

## Launch Jupyter Notebook
`jupyter notebook etk_examples.ipynb`
or
`jupyter notebook etk_extraction_using_config.ipynb`
## Docker

Build image

`docker build -t etk:dev .`

Run container

`docker run -it etk:dev /bin/bash`

Mount local volume for test

`docker run -it -v $(pwd):/app/etk etk:dev /bin/bash`

> Before running the code in the notebook, change the kernel to `Python [conda env:etk_env]`
144 changes: 0 additions & 144 deletions add_images_ads.py

This file was deleted.

35 changes: 35 additions & 0 deletions bin/ontodocgen
@@ -0,0 +1,35 @@
#!/usr/bin/env python
import argparse
import sys, os
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
from etk.ontology_api import Ontology
from etk.ontology_report_generator import OntologyReportGenerator

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Generate HTML report for the input ontology files')
    parser.add_argument('files', nargs='+', help='Input turtle files.')
    parser.add_argument('--no-validation', action='store_false', dest='validation', default=True,
                        help='Don\'t perform domain and range validation.')
    parser.add_argument('-o', '--output', dest='out', default='ontology-doc.html',
                        help='Location of generated HTML report.')
    parser.add_argument('-i', '--include-undefined-classes', action='store_true',
                        dest='include_class', default=False, help='Include classes that are undefined '
                                                                  'but referenced by others.')
    parser.add_argument('-t', '--include-turtle', action='store_true', dest='include_turtle',
                        default=False, help='Include turtle related to this entity. NOTE: this may '
                                            'take longer.')
    parser.add_argument('-q', '--quiet', action='store_true', dest='quiet', default=False,
                        help='Suppress warning.')
    parser.add_argument('--exclude-warning', action='store_true', dest='exclude_warning',
                        default=False, help='Exclude warning messages in HTML report')
    args = parser.parse_args()

    contents = [open(f).read() for f in args.files]
    ontology = Ontology(contents, validation=args.validation, include_undefined_class=args.include_class,
                        quiet=args.quiet)
    doc_content = OntologyReportGenerator(ontology).generate_html_report(include_turtle=args.include_turtle,
                                                                         exclude_warning=args.exclude_warning)

    with open(args.out, "w") as f:
        f.write(doc_content)
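
The `--no-validation` flag in this script uses `store_false` with `dest='validation'`, so validation is on by default and the flag turns it off. The standalone sketch below reproduces a subset of the parser to show how the options are parsed. It does not import ETK; the file names `a.ttl` and `b.ttl` are hypothetical, and `build_parser` is a helper introduced only for this illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Subset of the ontodocgen options, rebuilt here for illustration only."""
    parser = argparse.ArgumentParser(description='Generate HTML report for the input ontology files')
    parser.add_argument('files', nargs='+', help='Input turtle files.')
    # store_false: the attribute defaults to True and the flag sets it to False.
    parser.add_argument('--no-validation', action='store_false', dest='validation', default=True)
    parser.add_argument('-o', '--output', dest='out', default='ontology-doc.html')
    parser.add_argument('-q', '--quiet', action='store_true', dest='quiet', default=False)
    return parser


# Hypothetical input files; no file is opened in this sketch.
defaults = build_parser().parse_args(['a.ttl'])
args = build_parser().parse_args(['a.ttl', 'b.ttl', '--no-validation', '-q'])

print(defaults.validation, defaults.out)
print(args.files, args.validation, args.quiet)
```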