Improve wording describing what happens when you switch from Row to Record view in OpenRefine #329

Open
ostephens opened this issue Nov 22, 2023 · 7 comments

@ostephens
Contributor

How could the content be improved?

The wording under the images illustrating the difference between row and record layout doesn't currently make complete sense (as I read it). It says:

Note in the images above the difference between: Rows with the same Title appear below each shared title, interrupted the numbered sequence in the third column from the left. Shared titles have the same shading, which may be very difficult to distinguish visually, so look for each star and flag in the leftmost columns, which indicates a new row, that is an item with a different author.

I think this needs rewriting, as I can't currently understand what it means. It needs to be more clearly linked to the description of what a Row is vs what a Record is in OpenRefine, so that this is much clearer overall.

Which part of the content does your suggestion apply to?

https://librarycarpentry.org/lc-open-refine/03-working-with-data.html#rows-and-records

@jas58
Contributor

jas58 commented Nov 22, 2023

Sure, I'm a bit confused, too. The opening paragraph reads:
"OpenRefine has two modes of viewing data: ‘Rows’ and ‘Records’. At the moment we are in Rows mode, where each row represents a single record in the data set - in this case, an article. In Records mode, OpenRefine can link together multiple rows as belonging to the same Record. Rows will be assigned to Records based on the values in the first column. "

The second sentence tells me a row is a record. (!)

Then, a record is a record to which many rows may be assigned.

Perhaps pull from the documentation: "A row is a series of cells, related horizontally."
- [ ] Then, "When a cell has many values (multiple authors, eg.) we need to split them. This creates a record, that is: multiple related rows."
- [ ] or Then "In OpenRefine, we can switch to "Records" view to split those overstuffed cells' values into separate cells of unique information, while the shared information remains constant."

Sharing the example of Show : Actor : Roles might be wise, unless a FRBR version would suit better (Work : Version : Manifestation, like Othello : film/play/TV series : directors or date).

The key seems to be that Records view allows some additional subgrouping (filtering?), but only a little, so splitting the cells to get Tidy Data would be a better practice. Also there's a risk in Records mode of deleting data when removing an empty cell's whole row (that is, a row in both visual and OR terms?).

Does part of this go under Transformation (later)?

@ostephens
Contributor Author

ostephens commented Nov 23, 2023

The second sentence tells me a row is a record. (!)

I definitely see what you mean! But ... essentially this is true. To be clear: if we have a data set formatted like:

| Article | Author 1 | Author 2 |
| --- | --- | --- |
| The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini | Angelo Plastino |
| Aflatoxin Contamination of the Milk Supply | Naveed Aslam | Peter C. Wynn |

Then each row represents a single article metadata record - these are two separate articles being described, one row each. The downside is we have the author information split across multiple columns, and if we encounter an article with 3 (or 4 or 5 etc.) authors, we'll need to add a new column for every additional author.

However, if we lay out the same data like this:

| Article | Authors |
| --- | --- |
| The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini |
| | Angelo Plastino |
| Aflatoxin Contamination of the Milk Supply | Naveed Aslam |
| | Peter C. Wynn |

Now each article metadata record takes up multiple rows. It's two rows each here, but if we had an article with more authors we'd just add the extra rows for that particular article's metadata, and all the author data stays in a single column.
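To make the difference between the two layouts concrete, here is a rough Python sketch that turns the first layout into the second. This is only an illustration and not how OpenRefine does it: the function name `wide_to_multirow` and the dictionary-based rows are hypothetical, made up for this example.

```python
# Hypothetical in-memory copy of the first table:
# one row per article, one column per author.
wide_rows = [
    {"Article": "The Fisher Thermodynamics of Quasi-Probabilities",
     "Author 1": "Flavia Pennini", "Author 2": "Angelo Plastino"},
    {"Article": "Aflatoxin Contamination of the Milk Supply",
     "Author 1": "Naveed Aslam", "Author 2": "Peter C. Wynn"},
]


def wide_to_multirow(rows, key="Article", prefix="Author "):
    """Return one output row per author.

    The article title is kept on the first row of each article only;
    the following rows for that article get a blank title cell.
    """
    out = []
    for row in rows:
        # Collect every non-empty "Author N" value, in column order.
        authors = [v for k, v in row.items() if k.startswith(prefix) and v]
        if not authors:
            # Keep articles that have no author at all.
            out.append({key: row[key], "Authors": ""})
            continue
        for i, author in enumerate(authors):
            out.append({key: row[key] if i == 0 else "", "Authors": author})
    return out


multirow = wide_to_multirow(wide_rows)
for r in multirow:
    print(r)
```

An article with five authors would simply produce five rows here, with no new columns needed.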

However, in a spreadsheet (and in OpenRefine Rows mode), while the second format may make sense to our eyes, the software has no idea that the two (or more) rows are connected. An operation like a sort on the author column would therefore reorder the rows with no regard for keeping together each group of rows that represents a single article metadata record.

This is where OpenRefine Records mode comes in. When you switch to Records mode, OpenRefine interprets these multiple rows as parts of a single record, and so keeps them together at all times. This way you get the advantages of the simpler layout, with all the author data in a single column, without losing the ability to keep all the data for a single article together.

NB it's not just sort that's affected: we can manipulate the Record in a variety of ways, but sort is a simple example of why it's important that OpenRefine treats the group of rows as a single record.
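The sort example can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not OpenRefine's actual implementation: `group_into_records` is a hypothetical helper that starts a new record whenever the first column is non-blank, and the record sort here uses the title column so that the reordering is visible.

```python
# The multi-row layout as (article, author) tuples; a blank article cell
# means "this row belongs to the record above".
rows = [
    ("The Fisher Thermodynamics of Quasi-Probabilities", "Flavia Pennini"),
    ("", "Angelo Plastino"),
    ("Aflatoxin Contamination of the Milk Supply", "Naveed Aslam"),
    ("", "Peter C. Wynn"),
]

# Rows-style sort on the author column: each row moves on its own,
# so rows with a blank title get separated from their article.
row_sorted = sorted(rows, key=lambda r: r[1])


def group_into_records(rows):
    """Group consecutive rows into records.

    A non-blank first column starts a new record; rows with a blank
    first column are attached to the most recent record.
    """
    records = []
    for row in rows:
        if row[0] or not records:
            records.append([row])
        else:
            records[-1].append(row)
    return records


records = group_into_records(rows)

# Records-style sort (here by title): whole groups move together.
record_sorted = sorted(records, key=lambda rec: rec[0][0])

print(row_sorted[0])     # a blank-title row, detached from its article
print(record_sorted[0])  # both rows of one article, still together
```

In the row-wise result, "Angelo Plastino" ends up first with no title attached, whereas in the record-wise result each article keeps both of its author rows.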

Does that make sense?

@jas58
Contributor

jas58 commented Nov 23, 2023 via email

@ostephens ostephens self-assigned this Nov 24, 2023
@ostephens
Contributor Author

@jas58 I'll make a PR based on what I've written above - I think I can probably still make some improvements! Once I've got a PR ready then I'll ask you to review and you can check it both makes sense and is helpful!

@jas58
Contributor

jas58 commented Dec 15, 2023

With your closed PR, does that mean this issue is also closed? @ostephens If not, which element should I edit into the final checkbox? This seems related to issue 264 about rows expanding?

@ostephens
Contributor Author

Discussed in call on 9th Feb.
Use a Tidy Data-inspired example to show how, e.g., a single title with multiple authors would work as tidy data (repeating the title where necessary), and then how OpenRefine can use empty cells in the title column to group the rows as a record, essentially replicating the tidy data approach (somewhat) without repeating the title.

Potentially an exercise using blank down/fill down could be added, but there is concern that this would overload the learners early in the session.
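For reference, the relationship between the tidy layout (title repeated on every row) and the record layout (title blank on follow-on rows) can be sketched in Python. The functions `fill_down` and `blank_down` below are hypothetical stand-ins named after the OpenRefine operations; they show the idea on a single column and are not OpenRefine's own code.

```python
def fill_down(values):
    """Copy the last non-blank value into each blank cell below it."""
    filled, last = [], ""
    for v in values:
        if v:
            last = v
        filled.append(last)
    return filled


def blank_down(values):
    """Blank out a cell when it repeats the value of the cell above it."""
    blanked, previous = [], None
    for v in values:
        blanked.append("" if v == previous else v)
        previous = v
    return blanked


# Title column in the record-style layout (blank = same article as above).
titles = [
    "The Fisher Thermodynamics of Quasi-Probabilities",
    "",
    "Aflatoxin Contamination of the Milk Supply",
    "",
]

tidy_titles = fill_down(titles)            # title repeated on every row
record_titles = blank_down(tidy_titles)    # back to the blank-cell layout

print(tidy_titles)
print(record_titles == titles)
```

Filling down gives the tidy-data form, and blanking down recovers the layout that Records mode can group on.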

@jas58
Contributor

jas58 commented Feb 9, 2024

Starting notes to open later:
In rows mode, each row is computed independently. So when we sort by column each row is sorted independently. Think of MARC or bib records

Whereas in Records mode, sometimes it does not look different.

Need to create library specific tidy data demo graphic (well sorted and jumbled tidy data of MARC record (multi author or subj))
Ti, Au, but subj or no?
have empty fields
Make 2 or 3 tables
label each table view Records vs Rows modes

how to say record, row in data vs OpenRefine " the word line?" "horizontal group?"

And the nice part is, you haven't ruined the original

instructor note: if you catch yourself saying row when you mean record, please stop and restart the whole explanation, because a quick switch is gobsmackingly confusing to the new learner

How to format a table in markdown: https://carpentries.github.io/sandpaper-docs/episodes.html#tables
