Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use frictionless framework v5 #69

Merged
merged 62 commits into from
Oct 27, 2022
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
bda364a
Update validation library to Frictionless version 5
aivuk Sep 14, 2022
a8d8e09
remove http proxy from validate because it doesn't exist in frictionl…
aivuk Sep 14, 2022
5021f95
put frictionless dependency to the last beta version 5.0.0b9
aivuk Sep 14, 2022
259016b
Install jinja2 version 2.10.1 that works with CKAN 2.9
aivuk Sep 14, 2022
6e5a68a
pin down markupsafe version
aivuk Sep 15, 2022
d57a3cf
add tableschema dependency
aivuk Sep 15, 2022
358bd0d
Fix logic for new report format from frictionless v5
aivuk Sep 20, 2022
7c63f1a
Fix some tests in test_logic
aivuk Sep 20, 2022
57030c6
Revert back test.ini
aivuk Sep 20, 2022
ac7bd2b
Comment out test for goodtables
aivuk Sep 20, 2022
972d094
Comment out another test for goodtables
aivuk Sep 20, 2022
cfa307d
Fix tests in test_logic and check for different report formats at job…
aivuk Sep 20, 2022
92af068
Merge branch 'dev-v2' into use-frictionless-framework-v5
aivuk Sep 20, 2022
e602d62
correct test.ini file
aivuk Sep 20, 2022
cc3ff3d
Do not use .get in frictionless Report object
aivuk Sep 20, 2022
5bbe208
Check first for key 'warnings' and 'tables' in Report before accessin…
aivuk Sep 20, 2022
f75f84f
Remove unnecessary json parsing and fix test at test_form TestResourc…
aivuk Sep 21, 2022
d7df909
Change how key check is done in Report object
aivuk Sep 21, 2022
cfd4305
Uses frictionless.system to run validate in a trusted environment, ot…
aivuk Sep 21, 2022
1a027f9
Correctly check resource source in Report and correct test in test_form
aivuk Sep 21, 2022
876ff6b
Update README removing references to Goodtables and adding to Frictio…
aivuk Sep 22, 2022
28608e9
Use http proxy with frictionless v5
aivuk Sep 23, 2022
ed56be1
Correctly convert schema to str as expected by frictionless
aivuk Sep 26, 2022
8571b00
correctly deals with null schema before running validate
aivuk Sep 26, 2022
91ba8f5
correct schema comparing test at TestValidationJob.test_job_run_schema
aivuk Sep 26, 2022
754fa14
Update README.md
aivuk Sep 28, 2022
694d829
Update ckanext/validation/jobs.py
aivuk Sep 28, 2022
abd199b
Load schema to Frictionless Schema object before passing it to validate
aivuk Sep 29, 2022
c9a5d13
Merge branch 'use-frictionless-framework-v5' of github.com:frictionle…
aivuk Sep 29, 2022
24f2f5e
Merge branch 'dev-v2' into use-frictionless-framework-v5
amercader Oct 3, 2022
9728099
Merge branch 'dev-v2' into use-frictionless-framework-v5
amercader Oct 3, 2022
8e84b50
Update ckanext/validation/jobs.py
aivuk Oct 17, 2022
70e7386
Fix test with new error message from validation
aivuk Oct 17, 2022
ae20876
Update ckanext/validation/tests/test_jobs.py
aivuk Oct 17, 2022
829722f
Make test for resource_form_create_invalid more specific regarding wh…
aivuk Oct 20, 2022
8b4bf81
Merge branch 'use-frictionless-framework-v5' of github.com:frictionle…
aivuk Oct 20, 2022
df38480
uses new frictionless report component to display validation report
aivuk Oct 20, 2022
6fd1bc2
Mention plugin ordering in README
amercader Oct 21, 2022
90dcdae
Display correct validation duration in seconds in report page
aivuk Oct 21, 2022
2152a7f
Merge branch 'use-frictionless-report-component' of github.com:fricti…
aivuk Oct 21, 2022
ee4beb2
Correct resource path in report to not display local paths
aivuk Oct 24, 2022
2e4a9b9
Fix tests
aivuk Oct 24, 2022
9df3b3e
Parse Dialect and Checks options in validation options and update README
aivuk Oct 24, 2022
255b946
Correct test
aivuk Oct 24, 2022
a067c5c
Update README.md
aivuk Oct 25, 2022
de8793a
Update README.md
aivuk Oct 25, 2022
372f40e
Update link to the current custom check of frictionless v5
aivuk Oct 25, 2022
2cb684f
Merge branch 'use-frictionless-framework-v5' of github.com:frictionle…
aivuk Oct 25, 2022
bf2a50f
Update link to validate documentation from frictionless framework v5
aivuk Oct 25, 2022
7a58f7c
Update ckanext/validation/jobs.py
aivuk Oct 25, 2022
17c33ee
Add checks examples
aivuk Oct 25, 2022
ca59320
Merge branch 'dev-v2' into use-frictionless-framework-v5
aivuk Oct 25, 2022
4c63d6e
Merge branch 'use-frictionless-framework-v5' of github.com:frictionle…
aivuk Oct 25, 2022
26f0203
Fix report dialog when plugin is running with sync=True
aivuk Oct 26, 2022
e9575a1
Fix tests for local paths in report and with validation options follo…
aivuk Oct 26, 2022
cf90c6b
add missing closing quote
aivuk Oct 26, 2022
e774828
Put is_url variable back on template
aivuk Oct 26, 2022
b33dd11
Make dialog of report dialog bigger and correct dialog css
aivuk Oct 26, 2022
08af965
Merge branch 'use-frictionless-framework-v5' into use-frictionless-re…
aivuk Oct 26, 2022
3da4388
remove console for debug in js
aivuk Oct 27, 2022
c3b6807
Merge branch 'use-frictionless-report-component' of github.com:fricti…
aivuk Oct 27, 2022
fff8dc5
Merge pull request #71 from frictionlessdata/use-frictionless-report-…
amercader Oct 27, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ jobs:
run: |
pip install -r dev-requirements.txt
pip install -r requirements.txt
pip install --no-warn-conflicts jinja2==2.10.1
Copy link
Contributor

@ThrawnCA ThrawnCA Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is concerning. Why can't conflicts be resolved? There should at least be a comment to explain why this is being done and what would be needed to address it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThrawnCA this will be solved with upcoming requirement changes in frictionless-py. We just added this to not block the tests on this branch.

@aivuk would be good to sort this out on frictionless-py. IIRC it's just loosening up the requirements

pip install --no-warn-conflicts markupsafe==2.0.1
pip install -e .
# Replace default path to CKAN core config file with the one on the container
sed -i -e 's/use = config:.*/use = config:\/srv\/app\/src\/ckan\/test-core.ini/' test.ini
Expand Down
12 changes: 4 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ Data description and validation for CKAN with [Frictionless Data](https://fricti

## Overview

This extension brings data validation powered by the [Goodtables](https://github.com/frictionlessdata/goodtables-py) library to CKAN. It provides out of the box features to validate tabular data and integrate validation reports to the CKAN interface.
This extension brings data validation powered by the [Frictionless Framework](https://github.com/frictionlessdata/framework) library to CKAN. It provides out of the box features to validate tabular data and integrate validation reports to the CKAN interface.

Data validation can be performed automatically on the background or during dataset creation, and the results are stored against each resource.

Expand All @@ -51,7 +51,7 @@ If you are eager to get started, jump to the [Installation](#installation) and [

## Versions supported and requirements

This extension is currently tested in CKAN 2.8 and CKAN 2.9.
This extension is currently tested in CKAN 2.9.
aivuk marked this conversation as resolved.
Show resolved Hide resolved

It is strongly recommended to use it alongside [ckanext-scheming](https://github.com/ckan/ckanext-scheming) to define the necessary extra fields in the default CKAN schema.

Expand All @@ -76,7 +76,6 @@ Once installed, add the `validation` plugin to the `ckan.plugins` configuration

ckan.plugins = ... validation

*Note:* if using CKAN 2.6 or lower and [asynchronous validation](#asynchronous-validation), also add the `rq` plugin ([see Versions supported and requirements](#versions-supported-and-requirements)) to `ckan.plugins`.

### Adding schema fields to the Resource metadata

Expand Down Expand Up @@ -108,7 +107,7 @@ You can also provide [validation options](#validation-options) that will be used

ckanext.validation.default_validation_options={
"skip_checks": ["blank-rows", "duplicate-headers"],
"headers": 3}
}

Make sure to use indentation if the value spans multiple lines otherwise it won't be parsed.

Expand Down Expand Up @@ -238,9 +237,6 @@ The following validation options would make validation pass:

```json
{
"headers": 3,
"delimiter": ";",
"skip_rows": ["#"],
"skip_checks": ["blank-rows"]
}
aivuk marked this conversation as resolved.
Show resolved Hide resolved

Expand Down Expand Up @@ -327,7 +323,7 @@ The extension requires changes in the default CKAN resource metadata schema to a
Here's more detail on the fields added:

* `schema`: This can be a [Table Schema](http://frictionlessdata.io/specs/table-schema/) JSON object or an URL pointing to one. In the UI form you can upload a JSON file, link to one providing a URL or enter it directly. If uploaded, the file contents will be read and stored in the `schema` field. In all three cases the contents will be validated against the Table Schema specification.
* `validation_options`: A JSON object with validation options that will be passed to [Goodtables](https://github.com/frictionlessdata/goodtables-py#validatesource-options).
* `validation_options`: A JSON object with validation options that will be passed to Frictionless Framework [validate](https://framework.frictionlessdata.io/docs/guides/validating-data.html) function.

![Form fields](https://i.imgur.com/ixKOCij.png)

Expand Down
8 changes: 6 additions & 2 deletions ckanext/validation/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,8 +51,12 @@ def validation_extract_report_from_errors(errors):
if error == 'validation':
report = errors[error][0]
# Remove full path from table source
source = report['tables'][0]['source']
report['tables'][0]['source'] = source.split('/')[-1]
if 'tasks' in report:
source = report['tasks'][0]['place']
report['tasks'][0]['place'] = source.split('/')[-1]
elif 'tables' in report:
source = report['tables'][0]['source']
report['tables'][0]['source'] = source.split('/')[-1]
msg = _('''
There are validation issues with this file, please see the
<a {params}>report</a> for details. Once you have resolved the issues,
Expand Down
86 changes: 49 additions & 37 deletions ckanext/validation/jobs.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@

import requests
from sqlalchemy.orm.exc import NoResultFound
from goodtables import validate
# from goodtables import validate
from frictionless import validate, system, Report
aivuk marked this conversation as resolved.
Show resolved Hide resolved

from ckan.model import Session
import ckan.lib.uploader as uploader
Expand All @@ -22,7 +23,7 @@

def run_validation_job(resource):

log.debug(u'Validating resource %s', resource['id'])
log.debug('Validating resource %s', resource['id'])

try:
validation = Session.query(Validation).filter(
Expand All @@ -33,18 +34,18 @@ def run_validation_job(resource):
if not validation:
validation = Validation(resource_id=resource['id'])

validation.status = u'running'
validation.status = 'running'
Session.add(validation)
Session.commit()

options = t.config.get(
u'ckanext.validation.default_validation_options')
'ckanext.validation.default_validation_options')
if options:
options = json.loads(options)
else:
options = {}

resource_options = resource.get(u'validation_options')
resource_options = resource.get('validation_options')
if resource_options and isinstance(resource_options, str):
resource_options = json.loads(resource_options)
if resource_options:
Expand All @@ -54,54 +55,62 @@ def run_validation_job(resource):
{'ignore_auth': True}, {'id': resource['package_id']})

source = None
if resource.get(u'url_type') == u'upload':
if resource.get('url_type') == 'upload':
upload = uploader.get_resource_uploader(resource)
if isinstance(upload, uploader.ResourceUpload):
source = upload.get_path(resource[u'id'])
source = upload.get_path(resource['id'])
else:
# Upload is not the default implementation (ie it's a cloud storage
# implementation)
pass_auth_header = t.asbool(
t.config.get(u'ckanext.validation.pass_auth_header', True))
if dataset[u'private'] and pass_auth_header:
t.config.get('ckanext.validation.pass_auth_header', True))
if dataset['private'] and pass_auth_header:
s = requests.Session()
s.headers.update({
u'Authorization': t.config.get(
u'ckanext.validation.pass_auth_header_value',
'Authorization': t.config.get(
'ckanext.validation.pass_auth_header_value',
_get_site_user_api_key())
})

options[u'http_session'] = s
options['http_session'] = s

if not source:
source = resource[u'url']

schema = resource.get(u'schema')
if schema and isinstance(schema, str):
if schema.startswith('http'):
r = requests.get(schema)
schema = r.json()
source = resource['url']

schema = resource.get('schema')
if schema:
if isinstance(schema, str):
if schema.startswith('http'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if schema is a string that doesn't start with "http" then it's preserved unchanged?

r = requests.get(schema)
schema = r.json()
else:
schema = json.loads(schema)
schema = json.dumps(schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is wrong, schema can be:

  • A python dict, passed directly to the action (eg tests)
  • A string, in that case it can either be:
    • A URL (we need to get the contents and load the JSON)
    • Something else: we assume is a JSON string, so we need to load it (should be valid because it has passed the validators)
      We should check all these cases and end up with a schema dict. I assume validate() expects a dict for the schema, so I'm not sure why you are using dumps() here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like this code is handling all of those cases, except that it will not consistently result in a dict.

  • Passing in a dict will follow the else branch and go through json.dumps, producing a string.
  • Passing in a URL will follow both if branches, producing a JSON object.
  • Passing a JSON string will follow the first if, but not the second, retaining the original string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that the type o the schema parameter could be a str, but noticed now that if it's a str frictionless deals with it as a path to a schema file. You can't pass directly a dict to the validate function, but you do need to load it with Schema.from_descriptor. I fixed it in my last commit.


_format = resource[u'format'].lower()
_format = resource['format'].lower()

report = _validate_table(source, _format=_format, schema=schema, **options)

# Hide uploaded files
for table in report.get('tables', []):
if table['source'].startswith('/'):
table['source'] = resource['url']
for index, warning in enumerate(report.get('warnings', [])):
report['warnings'][index] = re.sub(r'Table ".*"', 'Table', warning)

if report['table-count'] > 0:
validation.status = u'success' if report[u'valid'] else u'failure'
validation.report = report
if type(report) == Report:
report = report.to_dict()
if 'tables' in report:
for table in report['tables']:
if table['source'].startswith('/'):
table['source'] = resource['url']
if 'warnings' in report:
for index, warning in enumerate(report['warnings']):
report['warnings'][index] = re.sub(r'Table ".*"', 'Table', warning)
if 'valid' in report and report['valid']:
validation.status = 'success' if report['valid'] else 'failure'
validation.report = json.dumps(report)
else:
validation.status = u'error'
validation.error = {
'message': '\n'.join(report['warnings']) or u'No tables found'}
validation.status = 'error'
validation.report = json.dumps(report)
if 'tables' in report:
validation.error = {
'message': [str(err) for err in report['tables'][0]['errors']] if len(report['tables'][0]['errors']) > 0 else 'No tables found'}
else:
validation.error = {'message': []}
aivuk marked this conversation as resolved.
Show resolved Hide resolved
validation.finished = datetime.datetime.utcnow()

Session.add(validation)
Expand All @@ -117,18 +126,21 @@ def run_validation_job(resource):
'validation_timestamp': validation.finished.isoformat()})


def _validate_table(source, _format=u'csv', schema=None, **options):
def _validate_table(source, _format='csv', schema=None, **options):

frictionless_context = { 'trusted': True }
aivuk marked this conversation as resolved.
Show resolved Hide resolved
http_session = options.pop('http_session', None) or requests.Session()

use_proxy = 'ckan.download_proxy' in t.config
if use_proxy:
proxy = t.config.get('ckan.download_proxy')
log.debug(u'Download resource for validation via proxy: %s', proxy)
log.debug('Download resource for validation via proxy: %s', proxy)
http_session.proxies.update({'http': proxy, 'https': proxy})
report = validate(source, format=_format, schema=schema, http_session=http_session, **options)
frictionless_context['http_session'] = http_session
aivuk marked this conversation as resolved.
Show resolved Hide resolved

log.debug(u'Validating source: %s', source)
with system.use_context(**frictionless_context):
report = validate(source, format=_format, schema=schema, **options)
log.debug('Validating source: %s', source)

return report

Expand Down
39 changes: 22 additions & 17 deletions ckanext/validation/logic.py
Original file line number Diff line number Diff line change
Expand Up @@ -656,26 +656,31 @@ def _run_sync_validation(resource_id, local_upload=False, new_resource=True):
{u'ignore_auth': True},
{u'resource_id': resource_id})

report = validation['report']
if validation['report']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an instance when validation["report"] is not set? This is our own ckanext-validation object so even if frictionless doesn't return an object for some reason we could store a custom one like {"valid": False, "errors": [...]} or something similar. IIUC in the goodtables-based code we always had a report key

report = json.loads(validation['report'])

if not report['valid']:
if not report['valid']:

# Delete validation object
t.get_action(u'resource_validation_delete')(
{u'ignore_auth': True},
{u'resource_id': resource_id}
)
# Delete validation object
t.get_action(u'resource_validation_delete')(
{u'ignore_auth': True},
{u'resource_id': resource_id}
)

# Delete uploaded file
if local_upload:
delete_local_uploaded_file(resource_id)
# Delete uploaded file
if local_upload:
delete_local_uploaded_file(resource_id)

if new_resource:
# Delete resource
t.get_action(u'resource_delete')(
{u'ignore_auth': True, 'user': None},
{u'id': resource_id}
)
if new_resource:
# Delete resource
t.get_action(u'resource_delete')(
{u'ignore_auth': True, 'user': None},
{u'id': resource_id}
)

raise t.ValidationError({
u'validation': [report]})
else:
raise t.ValidationError({
u'validation': [report]})
'validation': []
})
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
{% import 'macros/form.html' as form %}

{% set value = data[field.field_name] %}
{% set is_url = value and value[4:]|lower == 'http' %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still used below right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, I think this line should stay

{% set is_json = not is_url and value %}

<div class="image-upload"
Expand Down
24 changes: 12 additions & 12 deletions ckanext/validation/tests/test_form.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ def test_resource_form_create(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value

def test_resource_form_create_json(self, app):
dataset = Dataset()
Expand All @@ -105,7 +105,7 @@ def test_resource_form_create_json(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value

def test_resource_form_create_upload(self, app):
dataset = Dataset()
Expand All @@ -131,7 +131,7 @@ def test_resource_form_create_upload(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value

def test_resource_form_create_url(self, app):
dataset = Dataset()
Expand Down Expand Up @@ -187,7 +187,7 @@ def test_resource_form_update(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value

def test_resource_form_update_json(self, app):
value = {"fields": [{"name": "code"}, {"name": "department"}]}
Expand Down Expand Up @@ -216,7 +216,7 @@ def test_resource_form_update_json(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value

def test_resource_form_update_url(self, app):
value = {"fields": [{"name": "code"}, {"name": "department"}]}
Expand Down Expand Up @@ -274,7 +274,7 @@ def test_resource_form_update_upload(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["schema"]) == value
assert dataset["resources"][0]["schema"] == value


@pytest.mark.usefixtures("clean_db", "validation_setup")
Expand Down Expand Up @@ -312,7 +312,7 @@ def test_resource_form_create(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["validation_options"]) == value
assert dataset["resources"][0]["validation_options"] == value

def test_resource_form_update(self, app):
value = {
Expand Down Expand Up @@ -353,7 +353,7 @@ def test_resource_form_update(self, app):

dataset = call_action("package_show", id=dataset["id"])

assert json.loads(dataset["resources"][0]["validation_options"]) == value
assert dataset["resources"][0]["validation_options"] == value


@pytest.mark.usefixtures("clean_db", "validation_setup", "mock_uploads")
Expand Down Expand Up @@ -405,8 +405,8 @@ def test_resource_form_create_invalid(self, app):
)

assert "validation" in response.body
assert "missing-value" in response.body
assert "Row 2 has a missing value in column 4" in response.body
assert "missing-cell" in response.body
assert "This row has less values compared to the header row" in response.body
aivuk marked this conversation as resolved.
Show resolved Hide resolved


@pytest.mark.usefixtures("clean_db", "validation_setup", "mock_uploads")
Expand Down Expand Up @@ -458,8 +458,8 @@ def test_resource_form_update_invalid(self, app):
)

assert "validation" in response.body
assert "missing-value" in response.body
assert "Row 2 has a missing value in column 4" in response.body
assert "missing-cell" in response.body
assert "This row has less values compared to the header row" in response.body


@pytest.mark.usefixtures("clean_db", "validation_setup")
Expand Down
Loading