
Clarification about handling of changed MH records in import API #25

Open
emiliom opened this issue Feb 5, 2021 · 2 comments

Comments

@emiliom
Member

emiliom commented Feb 5, 2021

@BobTorgerson @brucecrevensten I'm trying to clarify a behavior in the import API. The situation and examples arose in the context of MountainHub data, so I'll use MH to illustrate, but the question applies to all 3 providers.

We need to know what happens when the import API encounters a recent record that was previously ingested, except that it has changed at the source since ingestion; specifically, the snowpack depth has been modified. I know the import API only queries for records with timestamps < 1 week from run time, so the edited record does meet that criterion.

The SQL INSERT statement has a condition that will skip the insertion if the record conflicts with one that already exists:

INSERT INTO observations(location, id, author, depth, timestamp, source, elevation)
VALUES ${observations}
ON CONFLICT DO NOTHING

So, my interpretation is that since depth will be different, the record will be treated as new and inserted. Can you confirm?

In addition, the id is generated as a random hash via the generateId function:

const crypto = require('crypto');

// Generates a short id: the first 8 base64 characters
// of the SHA-1 digest of `data`
const generateId = (data) =>
  crypto.createHash('sha1').update(data).digest('base64').slice(0, 8);

This function is called in each provider script; for MH, it's here. I don't know enough JavaScript to tell whether crypto.createHash would create a new "random" id even if data is the same as in a previous record. Can you clarify?

Thanks!
cc @gjwolken

@BobTorgerson
Contributor

Hey @emiliom,

You raise an excellent point regarding the insert statement into the CSO database: if someone changes the depth of their previous record, it does appear from the code that this would be treated as a new observation by the user, at least in terms of the import to the database. I'm not sure there is a good way around this in the current implementation. The risk of invalid data entering the database is inherent in trusting citizen science (i.e. the data could be spoofed), and determining that a record has been modified rather than newly created is a determination we are not currently making when importing from the three services of RegObs, SnowPilot, and MH.

As for the crypto function: if the data changes (i.e. the snow depth is modified), the SHA-1 hash will also change, meaning the ID generated for that record in the database will change as well. If the same record with the same, unchanged data were passed to that createHash function, it would always return the same hashed value; that is how a hash indicates the underlying data hasn't changed between runs of the hash function. The same is true of downloaded software: you can often get the expected hash of a binary from the vendor's website and, after downloading, run the same hash function to determine whether the data was corrupted or changed in flight over the network.

If you would like, we could discuss possible fixes for this, but there are probably not so many user-modified records that they would wildly skew the database's results.

@emiliom
Member Author

emiliom commented Feb 6, 2021

Thanks @BobTorgerson. At this stage we're just trying to understand exactly how things currently work. The team is discussing what changes we might need in order to accommodate subsequent edits; I've reported that the import API doesn't handle that at this time. Gabe will probably contact you to schedule a team call over the next few weeks to brainstorm about this.

Regarding the SHA-1 hash, that's what I figured. But I'll add the detail that the call to that function passes an id value as its argument; e.g., in the MH code, it's id: generateId(record.observation._id). So it looks like the hash is constructed from the id value alone and not from the entire record?
