Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCO should represent a database record at the record-cell level #415

Closed
13 tasks done
ajnelson-nist opened this issue Jul 26, 2022 · 19 comments · Fixed by #475
Closed
13 tasks done

UCO should represent a database record at the record-cell level #415

ajnelson-nist opened this issue Jul 26, 2022 · 19 comments · Fixed by #475

Comments

@ajnelson-nist
Copy link
Contributor

ajnelson-nist commented Jul 26, 2022

Background

This Issue stemmed from a question:

On the CASE Crossover Scenario, a tool's output identified an incorrect association with an ICCID number for a phone. The ICCID number is known to be incorrect because the person who seeded the phone data knows what the ICCID was.

https://caseontology.org/examples/crossover_wmd/

The ICCCID should be 8931088918010550289, whereas the tool claimed a value 89390100002217635543. This discrepancy comes from having selected an incorrect record in a SQLite file.

How should UCO designate a certain record within SQLite? SQLite has the advantage of being able to identify its allocated rows with row_number(). (I had thought there was a ROW_NO, but haven't had a chance to go back through the SQLite docs to confirm.)

https://www.sqlite.org/windowfunctions.html

Additional nuance for this example: This example relies on allocated records. Do we need a mechanism to identify unallocated records as well initially (especially for carving work)?

Requirements

Requirement 1

UCO must be able to represent an individual cell within a database table.

Requirement 2

UCO must be able to relate a cell within a database to its containing column within the row. Likewise for containing row within the table, table within the database, or table within the schema and schema within the database. Each of these relating properties must tie to the cell.

Requirement 3

At least in the initial representation, it must not be necessary to represent each layer between the whole database as an object, down to schema, down to table, down to row, down to column, down to cell. The cell object itself must be able to carry its locating characteristics.

Requirement 4

The cell contents must be able to represent strings, binary content, floating point numbers (xsd:decimal), integers, OR that the cell is NULL.

Risk / Benefit analysis

Benefits

  • One or more database row-column points can be the source of significant analytic conclusions. Being able to refer to one cell unambiguously is essential.

Risks

  • Databases, as compact representations of relational data, have an entire field of research devoted to optimal encodings and serializations. UCO's representation, as part of an RDF-based graph, will always feel suboptimal. Proactive characterization of all database cells within a UCO-based characterization operation (such as a CASE investigation) is likely to lead to an unwieldy graph size.

Competencies demonstrated

Competency 1

The Crossover "WMD" scenario has an issue where a tool drew an assignment for ICCID from either a system .dat file, or from a certain selection within a table that stored ICCID histories. (See the "SIM CARDS" section on that page.) The ground truth is known in this instance, and the value reported by that tool is incorrect.

Competency Question 1.1

What database cell provided the ICCID values 89390100002217635543 (known incorrect) and 8931088918010550289
(known in ground truth to be correct)?

Result 1.1

SELECT ?nRecord ?lTableName ?lRowID ?lFieldName ?lFieldValue
WHERE {
  ?nRecord
    drafting:databaseTable ?lTableName ;
    drafting:databaseRowID ?lRowID ;
    drafting:databaseFieldName ?lFieldName ;
    drafting:databaseFieldValue ?lFieldValue
    .

  FILTER ( ?lFieldValue IN ("89390100002217635543", "8931088918010550289") )
}

Solution suggestion

The solution is drafted in CASE-Examples' drafting.ttl here, except a property drafting:databaseFieldIsNull, Boolean-ranged, should be added as part of an sh:xone (exactly one) constraint incorporating constraints on drafting:databaseFieldValue .

Coordination

  • Tracking in Jira ticket OC-251
  • Administrative review completed, proposal announced to Ontology Committees (OCs) on 2022-08-03
  • Requirements to be discussed in OC meeting, 2022-08-09
  • Requirements Review vote occurred, passing, on 2022-08-09
  • Requirements development phase completed.
  • Solution announced to OCs on 2022-11-07
  • Solutions Approval to be discussed in OC meeting, 2022-11-07
  • Decision on recordRowId range including strings or not is recorded
  • Decision recorded on whether to change recordRowId definition to: The unique ID that identifies a database record, supplied by the originating database engine..
  • Solutions Approval vote occurred, passing, on 2022-11-17
  • Solutions development phase completed.
  • Implementation merged into develop
  • Milestone linked
  • Documentation logged in pending release page
@kchason
Copy link
Contributor

kchason commented Aug 2, 2022

@ajnelson-nist
Copy link
Contributor Author

ajnelson-nist commented Aug 3, 2022

CASE-Example development has been merged, and the example is now here. @kchason , could you please copy that drafting.ttl over to the website and add a snippet illustrating how it would be used to represent the two records?

The crossover page presents that row 3 (last row) in telephony.db is the correct ICCID assignment. Row 1 has a coinciding ICCID from the erroneous report drawn from SimCard.dat. Ground truth ( per @eoghanscasey , phone data populater) is that row 3 has the right ICCID number, from his knowledge of having put the SIM card in himself.

@kchason , can you please add rows 1 and 3 as a new code snippet with the drafting concepts, with appropriate descriptive narrative (could reword what I put above)?

I'll revise this Issue into a proposal now.

@ajnelson-nist ajnelson-nist changed the title How should UCO represent a selection of a particular database record? UCO should represent a database record at the record-cell level Aug 3, 2022
@sbarnum
Copy link
Contributor

sbarnum commented Aug 9, 2022

While the referenced examples contain anecdotal use of the new DatabaseRecordFacet and some properties, I believe the Solution Suggestion section of this CP should explicitly outline details of all proposed new classes and properties including property type and cardinality.

@gwebb-case
Copy link

I voted to accept the above, as I do agree with the majority of what is being proposed, but have the following main reservation in the definition drafting.ttl; and in the supporting example database_records.json :

  1. The current class definitions for DatabaseRecord and DatabaseRecordFacet appears to describe a field not a record.

note: A Record contains a [ordered/unordered] collection of Fields, but the current definitions only describes a single field.

@gwebb-case
Copy link

A further reservation is that Databases do not directly have have records; or fields, assigned to them, they only have tables (of types Table, View, and Index), so semantically I think the current naming of a number of the classes and properties in the definition is likely to cause confusion for the following reasons:

  1. A Database consists of 1 or more Tables, and it is not uncommon for a dB to have in excess of 300 tables, even in SQLite dBs,
  2. Each table will have a single schema, which describes its original field types.
  3. Tables have Records (Rows), which in turn contain a [ordered/unordered] collection of Fields.

Personally, I would much prefer that some time in the future the following simple semantic changes be made to more closely match what seems to be being described by the current draft ontology and its associated example:

Current -> Rename
drafting:databaseTable -> drafting:tableName
drafting:databaseSchema -> drafting:tableSchema

drafting:DatabaseRecord -> drafting:TableField
drafting:DatabaseRecordFacet -> drafting:TableFieldFacet

drafting:databaseRowID -> drafting:recordRowID
drafting:databaseFieldName -> drafting:recordFieldName or drafting:fieldName
drafting:databaseFieldValue -> drafting:recordFieldValue or drafting:fieldValue

and add a new class called drafting:Table to hold the current properties drafting:databaseTable and drafting:databaseSchema.

These changes would, in my opinion, also make it relatively simple to describe non-relational databases; and other table types found forensics, using the same classes and properties.

@ajnelson-nist
Copy link
Contributor Author

@gwebb-case , I agree with you on "Record" more typically corresponding with a database row. I'd vaguely recalled relational algebra using "record" when referencing a tuple.

As for your field names, we had discussed leaving the possibility open in the future for a higher-level abstraction of "Table" that could also accommodate a table in a HTML, Word, PDF, or other document. So, these two changes I'd NACK, because "tableSchema" would not mesh well with that.

  • drafting:databaseTable -> drafting:tableName
  • drafting:databaseSchema -> drafting:tableSchema

I had thought that TableCell would work better than TableField, but the non-relational database matter wouldn't work as well with TableCell. @kchason suggested looking at MongoDB as a sanity-check for how well this rename would apply. So, I agree with this rename:

  • drafting:DatabaseRecord -> drafting:TableField
  • drafting:DatabaseRecordFacet -> drafting:TableFieldFacet

I'm on the fence on whether we need to keep the word "database" in these, as a matter of verbosity vs. concept scope confusion. We might want close to the same property names for doing (in the future) HTML forms. (Skipping a few steps in modeling needs, I could see some desire to certain HTML web form based attacks.) I think for now we would be reasonably future-safe for that other modeling need if we used recordFieldName (/Value) for databases.

  • drafting:databaseRowID -> drafting:recordRowID
  • drafting:databaseFieldName -> drafting:recordFieldName
  • drafting:databaseFieldValue -> drafting:recordFieldValue

@ajnelson-nist
Copy link
Contributor Author

Implementation has not been completed to the point of exercising the database NULL representation. We could vote on this today if we think it is sufficiently specified, but there is some risk form not testing yet.

@ajnelson-nist ajnelson-nist modified the milestones: UCO 1.0.0, UCO 1.x.0 Aug 25, 2022
@ajnelson-nist ajnelson-nist linked a pull request Aug 29, 2022 that will close this issue
13 tasks
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Oct 3, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Oct 3, 2022
References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Oct 4, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Oct 4, 2022
References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist
Copy link
Contributor Author

Looking over the state of the PR, I think there's only one thing I'd like to see changed. If there's anyone else in favor, please note so, and we can make the extension.

observable:recordRowID is currently proposed as a property with range xsd:string. This is not compatible with SQLite rowid tables, where no explicit primary key or INTEGER PRIMARY KEY make the unique row identifier an integer. It's possible for UCO to just cast the integer to a string, but this then burdens UCO consumers with testing for number-ness of each string value.

Should recordRowID have a range of the union of xsd:integer and xsd:string? (For the sake of SQLite supporting negative integers as its rowid (source: see that same page), I won't suggest a xsd:nonNegativeInteger restriction.)

ajnelson-nist added a commit to kchason/UCO that referenced this issue Nov 8, 2022
These tests were initially drafted in CASE-Examples, and merged in that
repository's Pull Request 89.

A follow-on patch will regenerate Make-managed files.

References:
* casework/CASE-Examples#89
* ucoProject#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to kchason/UCO that referenced this issue Nov 8, 2022
References:
* ucoProject#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to kchason/UCO that referenced this issue Nov 8, 2022
The sample is the SHA-1 of the 0-length string.

No effects were observed on Make-managed files.

References:
* ucoProject#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Archive that referenced this issue Nov 8, 2022
No effects were observed on Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue Nov 8, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/casework.github.io that referenced this issue Nov 8, 2022
References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Nov 8, 2022
No effects were observed in Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Nov 8, 2022
A follow-on patch will regenerate Make-managed files.

References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to casework/CASE-Examples that referenced this issue Nov 8, 2022
References:
* ucoProject/UCO#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist ajnelson-nist modified the milestones: UCO 1.x.0, UCO 1.1.0 Nov 8, 2022
@plbt5
Copy link
Contributor

plbt5 commented Nov 8, 2022 via email

@gjwebb-case
Copy link

Hi Alex,

I agree with your view, xsd:integer would be by far the better data type to use.

In my opinion xsd:integer would still be best irrespective of whether a record is modelled directly on the Database or is based on what it representing, it would also allow it to be modelled on its location based on a disk offset (e.g a record in unallocated disk space).

@packet-rat
Copy link

packet-rat commented Nov 9, 2022 via email

@ajnelson-nist
Copy link
Contributor Author

Re: @gjwebb-case

Hi Alex,

I agree with your view, xsd:integer would be by far the better data type to use.

In my opinion xsd:integer would still be best irrespective of whether a record is modelled directly on the Database or is based on what it representing, it would also allow it to be modelled on its location based on a disk offset (e.g a record in unallocated disk space).

The point I raised about integers was scoped to observable:recordRowID, which currently has this definition:

The unique ID that identifies a database record.

I need to warn everybody about usage of the word "Unique" in general. "Unique" implies a universe of scope and a defining authority. @gjwebb-case , what you suggested about using the location based on disk offset would be incongruous with the an identifier supplied by the database engine. If you need to recover such a record from unallocated space, and you don't have an inlined identifier within that record, it would be inappropriate in my opinion to assign your own unique identifier. In that scenario, you should instead use observable:DataRangeFacet, and leave observable:recordRowID unassigned.

If others agree with avoiding concept conflation, I suggest observable:recordRowID be altered to read:

The unique ID that identifies a database record, supplied by the originating database engine.

Meanwhile, while I prefer xsd:integer in general, I did suggest that we allow observable:recordRowID to be the union of xsd:integer and xsd:float, because I'm aware that some database engines permit primary key values of strings. I've been in an only-numbers-for-primary-keys habit for years now, so I've forgotten if there are other datatypes permitted in single-column primary keys. I defer to people with more database-internals experience on whether we should use integers or a union of integers and strings (and other types?) for observable:recordRowID.

Re: @packet-rat :

The only challenge is that, as I’m certain has been discussed, different methods are frequently used to characterize file and disk sizes (i.e. 40GB). People and/or tools populating these values are likely to use non-precise terms.

Perhaps a modifier?

I suggest not introducing support for "rounding" modifiers like "kiB". The original issue inspiring this proposal needed to select between two records in a single SQLite table that had, IIRC, two records total. Permitting rounding defeats the use case where we need to point to this offset within a disk sector or database page, rather than that offset.

@ajnelson-nist
Copy link
Contributor Author

Re: @plbt5

Alex, I know this is a long time since, and I should have reacted sooner. Let me just ask the question and leave it with that. Why are we modelling the design of a relational database schema, as opposed to what it represents?

I'm not quite sure what level this question is at, but I'll reply per my best guess.

We are defining a model that can, as a user needs, represent a record with an individual object and a field per layer, rather than requiring representing each layer of the database model (database engine class, database engine instance, backing-store file or disk partition, schema instance if applicable, table instance, row instance, column instance) and relationships between each layer.

We do this in part to support provenience (independent of provenance) - so we can relate a single field within a record geometrically with its containing objects (such as the backing-store storage object).

Does that answer your question?

@kchason
Copy link
Contributor

kchason commented Nov 13, 2022

Looking over the state of the PR, I think there's only one thing I'd like to see changed. If there's anyone else in favor, please note so, and we can make the extension.

observable:recordRowID is currently proposed as a property with range xsd:string. This is not compatible with SQLite rowid tables, where no explicit primary key or INTEGER PRIMARY KEY make the unique row identifier an integer. It's possible for UCO to just cast the integer to a string, but this then burdens UCO consumers with testing for number-ness of each string value.

Should recordRowID have a range of the union of xsd:integer and xsd:string? (For the sake of SQLite supporting negative integers as its rowid (source: see that same page), I won't suggest a xsd:nonNegativeInteger restriction.)

I think this generally makes sense.

I have a concern with just switching it to xsd:integer (or other numeric option) as some databases suggest GUIDs as primary keys. Do we have precedent for having unions of types like this (beyond open vocabularies of xsd:string and a dictionary)?

While not really the concern of the ontology community itself, this would cause issues for some of the strongly-typed bindings that have been developed to help generate CASE graphs and would otherwise cause additional logic needed for consumers of the graphs to account for both types.

@ajnelson-nist
Copy link
Contributor Author

I have a concern with just switching it to xsd:integer (or other numeric option) as some databases suggest GUIDs as primary keys. Do we have precedent for having unions of types like this (beyond open vocabularies of xsd:string and a dictionary)?

On review, not really.

observable:priority allows a union of integer, string, and ... TaskPriorityVocab. The definition of that property is a bit incongruous with its multiple uses (email messages and Windows processes). I had thought priority was meant to support integers or strings for one of those two uses, but our SHACL shapes only permit xsd:string, without mention of TaskPriorityVocab. So, priority needs a bugfix.

So, I don't think we have precedent, but we have the knowledge of how to encode and test for a union. We can make a union work.

While not really the concern of the ontology community itself, this would cause issues for some of the strongly-typed bindings that have been developed to help generate CASE graphs and would otherwise cause additional logic needed for consumers of the graphs to account for both types.

This is an adoption-level concern that is relevant to the ontology committee. Do we know what the total union of datatypes permitted for primary key usage is? E.g. is a DATETIME permitted in some engine? Should we have the range of observable:recordRowId be the union of string, integer, and xsd:hexBinary (shorter binary data, rather than xsd:base64Binary for longer), with hexBinary being our catch-all for other datatypes that are special to the database engine?

@ajnelson-nist
Copy link
Contributor Author

I'm becoming hesitant to include observable:recordRowId in this proposal. It's not clear to me how it would be used in these scenarios:

  1. A record from a SQLite table with no PRIMARY KEY defined, everything in an "Allocated" state (not requiring carving). In SQLite, this implicitly uses a field that is an integer, per the docs I cited above.
  2. A record from a SQLite table with a DATETIME field designated its PRIMARY KEY, everything in an "Allocated" state.
  3. A record from a SQLite table with two or more fields defined as a compound PRIMARY KEY, everything in an "Allocated" state.
  4. A record recovered from a SQLite WAL, in any of situations 1--3.
  5. A record from a Microsoft SQL Server table where a GUID is the PRIMARY KEY.

@kchason
Copy link
Contributor

kchason commented Nov 17, 2022

This could instead refer to the internal record ID contained within the database, but it may not be as readily available from standard database queries or record exports into flat files, so I would still prefer this be defined as the defined primary key of the record.

Some databases recommend integers, some suggest strings, and other recommend GUIDs for the primary key field of a database. As @ajnelson-nist pointed out on a call this morning, this also doesn't account for clustered primary keys consisting of two or more fields. I'd still suggest xsd:string since it allows the most flexibility for the different types of keys that might exist.

@ajnelson-nist
Copy link
Contributor Author

The OCs decided this morning that recordRowId should be retained in this proposal. It will be a union of string and integer. For the compound PRIMARY KEY case, we will consider that a potential addition to the union. Expanding the union will be a backwards-COMPATIBLE change.

@ajnelson-nist
Copy link
Contributor Author

The definition of recordRowId will be revised to: The unique ID that identifies a database record, supplied by the originating database engine.

ajnelson-nist added a commit to kchason/UCO that referenced this issue Nov 17, 2022
References:
* ucoProject#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
ajnelson-nist added a commit to kchason/UCO that referenced this issue Nov 17, 2022
No effects were observed on Make-managed files.

References:
* ucoProject#415

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants