Adds WARC upload backend, combining indexing, sorting, and replacement for replay for #436
machawk1 committed Aug 10, 2018
1 parent 8d90351 commit 797a091
Showing 4 changed files with 122 additions and 12 deletions.
13 changes: 1 addition & 12 deletions ipwb/indexer.py
@@ -31,6 +31,7 @@
from six.moves import input

from util import IPFSAPI_HOST, IPFSAPI_PORT
from util import generateCDXJMetadata

# from warcio.archiveiterator import ArchiveIterator

@@ -278,18 +279,6 @@ def getCDXJLinesFromFile(warcPath, **encCompOpts):
    return cdxjLines


def generateCDXJMetadata(cdxjLines=None):

ibnesayeed (Member) commented Aug 10, 2018:

Do we need to move this function into the utils file? We are already importing indexer in the replay, so we might be able to use it directly from there.

machawk1 (Author, Member) commented Aug 10, 2018:

The issue was that the function needed to be called from util, and util does not import replay or indexer by design. This was done to prevent the explicit coupling of util to ipwb scripts.

ibnesayeed (Member) commented Aug 10, 2018:

I didn't mean just this function, but the whole logic of indexing should remain in the indexer, including the block of code that needed to call this function.

machawk1 (Author, Member) commented Aug 10, 2018:

The join function is in util because replay calls it and I did not want to introduce coupling via a call from replay to indexer.

ibnesayeed (Member) commented Aug 10, 2018:

There is no harm in calling some indexer functions from the replay. It is more like offloading the task to the indexer. Moving them into utils is not really elegant if the functions are not generic enough. Even now we are calling indexer.indexFileAt() from the replay, so the decoupling you are talking about is not there in that sense.

    metadata = ['!context ["http://tools.ietf.org/html/rfc7089"]']
    metaVals = {
        'generator': "InterPlanetary Wayback v.{0}".format(ipwbVersion),
        'created_at': '{0}'.format(datetime.datetime.now().isoformat())
    }
    metaVals = '!meta {0}'.format(json.dumps(metaVals))
    metadata.append(metaVals)

    return metadata


def askUserForEncryptionKey():
    if DEBUG: # Allows testing instead of requiring a user prompt
        return 'ipwb'
68 changes: 68 additions & 0 deletions ipwb/replay.py
@@ -19,6 +19,8 @@
import surt
import re
import signal
import random
import string

from pywb.utils.canonicalize import unsurt

@@ -40,6 +42,8 @@
from util import IPFSAPI_HOST, IPFSAPI_PORT, IPWBREPLAY_HOST, IPWBREPLAY_PORT
from util import INDEX_FILE

import indexer

from base64 import b64decode
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad
@@ -50,7 +54,15 @@
from __init__ import __version__ as ipwbVersion


from flask import flash, url_for
from werkzeug.utils import secure_filename
from flask import send_from_directory

UPLOAD_FOLDER = '/tmp'
ALLOWED_EXTENSIONS = set(['warc', 'warc.gz'])

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.debug = False

IPFS_API = ipfsapi.Client(IPFSAPI_HOST, IPFSAPI_PORT)
@@ -62,6 +74,62 @@ def setServerHeader(response):
    return response


def allowed_file(filename):
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

ibnesayeed (Member) commented Aug 10, 2018:

filename.rsplit('.', 1)[1] will return warc, gz, or something else, but never warc.gz.

>>> "foo.warc".rsplit('.', 1)[1]
'warc'
>>> "foo.warc.gz".rsplit('.', 1)[1]
'gz'
>>> "foo.bar.warc".rsplit('.', 1)[1]
'warc'
>>> "foo.bar.warc.gz".rsplit('.', 1)[1]
'gz'

machawk1 (Author, Member) commented Aug 10, 2018:

@ibnesayeed I updated this in 10c662a. Please have a look.

ibnesayeed (Member) commented Aug 10, 2018:

And I posted two related comments there that need to be read together.

machawk1 (Author, Member) commented Aug 10, 2018:

Those comments have been addressed in subsequent commits.
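
The extension check discussed in this thread can be illustrated with a suffix match instead of a split on the last dot. This is a minimal sketch of one possible fix, not necessarily what 10c662a or the later commits do; the `ALLOWED_SUFFIXES` name is an assumption introduced here for illustration.

```python
# Hypothetical replacement for ALLOWED_EXTENSIONS: full suffixes,
# including the leading dot, so compound extensions can be matched.
ALLOWED_SUFFIXES = ('.warc', '.warc.gz')


def allowed_file(filename):
    """Return True if filename ends with one of the allowed suffixes."""
    # str.endswith accepts a tuple, so '.warc.gz' is compared as a whole
    # rather than being reduced to 'gz' by rsplit('.', 1).
    return filename.lower().endswith(ALLOWED_SUFFIXES)
```

With this approach `foo.warc.gz` is accepted while a plain `foo.gz` is still rejected, which the single-split version cannot express.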



@app.route('/upload', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        # check if the post request has the file part
        if 'file' not in request.files:
            flash('No file part')
            return redirect(request.url)
        file = request.files['file']
        # if user does not select file, browser also
        # submit an empty part without filename
        if file.filename == '':
            flash('No selected file')
            return redirect(request.url)
        if file and allowed_file(file.filename):
            filename = secure_filename(file.filename)
            warcPath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(warcPath)

            cdxjPath = '/tmp/' + ''.join(random.sample(

ibnesayeed (Member) commented Aug 10, 2018:

Please seed the random number generator (using current time or something like that) somewhere and create a longer string to avoid collisions.

machawk1 (Author, Member) commented Aug 10, 2018:

ACK. I am aware that this is needed but did not experience collisions when testing.

machawk1 (Author, Member) commented Aug 10, 2018:

Done in 633edb5 and 0cd1c34.
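
As a point of comparison for the collision concern raised above, the standard library can hand out unique file names without any manual seeding. Note also that in the expression `string.ascii_uppercase + string.digits * 6`, `*` binds tighter than `+`, so only the digits are repeated six times. The sketch below is an assumption about one way to address the comment, not a description of 633edb5 or 0cd1c34; the helper name `new_cdxj_path` is hypothetical, and it assumes the indexer is happy to write to a path that already exists as an empty file.

```python
import os
import tempfile


def new_cdxj_path(directory=None):
    """Create an empty, uniquely named .cdxj file and return its path."""
    # mkstemp creates the file atomically, so two concurrent uploads
    # cannot be handed the same name.
    fd, path = tempfile.mkstemp(suffix='.cdxj', dir=directory)
    os.close(fd)  # the caller reopens the path when writing the index
    return path
```

Passing `directory=None` lets `tempfile` pick the platform's temporary directory instead of a hard-coded `/tmp`.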

                string.ascii_uppercase + string.digits * 6, 6)) + '.cdxj'
            combinedcdxjPath = '/tmp/' + ''.join(random.sample(
                string.ascii_uppercase + string.digits * 6, 6)) + '.cdxj'

            # Check if semaphore lock exists
            # Index file, produce new.cdxj
            print('Indexing file from uploaded WARC at {0} to {1}'.format(
                warcPath, cdxjPath))
            indexer.indexFileAt(warcPath, outfile=cdxjPath)
            print('index created at {0}'.format(cdxjPath))

            # Create semaphore
            # Join current.cdxj w/ new.cdxj, write to combined.cdxj
            print('* Prior index file: ' + app.cdxjFilePath)
            print('* Index file of new WARC: ' + cdxjPath)
            print('* Combined index file (to-write): ' + combinedcdxjPath)
            ipwbUtils.joinCDXJFiles(
                app.cdxjFilePath, cdxjPath, combinedcdxjPath)
            print('Setting ipwb replay index variables')

            ipwbUtils.setIPWBReplayIndexPath(combinedcdxjPath)
            app.cdxjFilePath = combinedcdxjPath
            app.cdxjFileContents = getIndexFileContents(combinedcdxjPath)

            # Set replay.index to path of combined.cdxj
            # Release lock
            # Restart replay system?

            return redirect('/')
    return 'Upload failed, send POST'
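
The placeholder comments in `upload_file()` ("Check if semaphore lock exists", "Create semaphore", "Release lock") describe locking that this commit does not yet implement. A minimal sketch of what such a guard might look like follows, assuming a single-process, multi-threaded server; the `ReplayIndex` class and its method names are hypothetical and do not exist in ipwb.

```python
import threading


class ReplayIndex(object):
    """Hypothetical holder for the index path and contents used by replay."""

    def __init__(self, path, contents):
        self._lock = threading.Lock()
        self.path = path
        self.contents = contents

    def swap(self, new_path, new_contents):
        """Replace path and contents together while holding the lock."""
        with self._lock:
            self.path = new_path
            self.contents = new_contents

    def snapshot(self):
        """Return a consistent (path, contents) pair."""
        with self._lock:
            return self.path, self.contents
```

A `threading.Lock` only protects threads within one process; if replay were run with several worker processes, a file-based lock would be needed instead.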


@app.route('/webui/<path:path>')
def showWebUI(path):
    """ Handle requests for the IPWB replay Web interface and requests
43 changes: 43 additions & 0 deletions ipwb/util.py
@@ -18,6 +18,7 @@
import datetime
import logging
import platform
import shutil

import urllib2
import json
@@ -70,6 +71,48 @@ def isValidCDXJ(stringIn): # TODO: Check specific strict syntax
    return True


def generateCDXJMetadata(cdxjLines=None):
    metadata = ['!context ["http://tools.ietf.org/html/rfc7089"]']
    metaVals = {
        'generator': "InterPlanetary Wayback v.{0}".format(ipwbVersion),
        'created_at': '{0}'.format(datetime.datetime.now().isoformat())
    }
    metaVals = '!meta {0}'.format(json.dumps(metaVals))
    metadata.append(metaVals)

    return metadata


def joinCDXJFiles(cdxjPath1, cdxjPath2, outputFilePath):

ibnesayeed (Member) commented Aug 10, 2018:

Some of this index-merging logic is already in the indexer. Just refactor the reusable section into a separate function, then use it directly from there rather than bringing it into the utils.

    # CDXJ2 takes precedence in surt uri and datetimes identity

    # Join two files quickly
    with open(outputFilePath, 'wb') as wfd:
        for f in [cdxjPath1, cdxjPath2]:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd, 1024 * 1024 * 10)

    cdxjLines = ''
    with open(outputFilePath, 'r') as wfd:
        cdxjLines = wfd.read().split('\n')

    # De-dupe and sort, needed for CDXJ adherence (pulled from indexer.py)
    cdxjLines = list(set(cdxjLines))
    cdxjLines.sort()

    cdxjLines[:] = [line for line in cdxjLines
                    if len(line) > 0 and line[0] != '!']

    # Prepend metadata
    cdxjMetadataLines = generateCDXJMetadata(cdxjLines)
    cdxjLines = cdxjMetadataLines + cdxjLines

    cdxjLines = '\n'.join(cdxjLines)

    with open(outputFilePath, 'w') as wfd:
        wfd.write(cdxjLines)
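
The merge step in `joinCDXJFiles` (concatenate, de-duplicate, sort, drop metadata lines) can also be expressed on lists of lines, without writing the concatenation to disk first. This is a sketch under that assumption; `merge_cdxj_lines` is a hypothetical helper, not a function in ipwb. Like the code in this commit, it de-duplicates only exact line matches, so it does not implement the "CDXJ2 takes precedence" rule mentioned in the comment.

```python
def merge_cdxj_lines(lines1, lines2):
    """Merge two lists of CDXJ lines into a sorted, de-duplicated list.

    Metadata lines (those starting with '!') and empty lines are dropped;
    the caller is expected to prepend fresh metadata afterwards.
    """
    records = set()
    for line in list(lines1) + list(lines2):
        line = line.rstrip('\r\n')  # tolerate lines read with newlines
        if line and not line.startswith('!'):
            records.add(line)
    return sorted(records)
```

A pure function of this shape could live in the indexer and be called from replay, which is the arrangement suggested in the review comment above.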


def isValidCDXJLine(cdxjLine):
    try:
        (surtURI, datetime, jsonData) = cdxjLine.split(' ', 2)
10 changes: 10 additions & 0 deletions ipwb/webui/index.html
@@ -17,6 +17,7 @@
</script>
</head>
<body>

<div id="wrapper">
<div>
<h1><img src="./webui/logo.png" alt="ipwb" /></h1>
@@ -35,6 +36,15 @@ <h1><img src="./webui/logo.png" alt="ipwb" /></h1>
<p class="centered topSpace"><a id="webui" target="_blank">IPFS WebUI</a></p>
<p class="centered"><a href="https://github.com/oduwsdl/ipwb/" target="_blank">IPWB Help</a></p>
</details>
<div style="margin: auto; background-color: #eee; border: 1px solid #999; width: 400px;">
<label style="width: 100%; background-color: #999; display: block;">Upload WARC</label>
<form method="post" action="/upload" enctype="multipart/form-data">
<input type="file" name="file" style="display: inline-block;">
<input type="submit" value="Upload" style="display: inline-block;">
</form>
</div>


</footer>
<div id="uris" class="hidden">
<h3 id="urisHeader"><abbr title="Uniform Resource Identifiers">URIs</abbr> locally available</h3>

1 comment on commit 797a091

machawk1 (Author, Member) commented on 797a091 Aug 10, 2018:

@ibnesayeed Please have a look at these additions relative to #436. I am going to wait until I have fresh eyes/brain before I submit a PR.

Also, disregard styling, layout, and such. This effort was mainly to get the backend in place.
