Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work on the simplification of multi-relation edges in networks #255

Merged
merged 14 commits into from
Mar 22, 2024

Conversation

maxloeffler
Copy link
Contributor

@maxloeffler maxloeffler commented Mar 7, 2024

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • I have checked whether I need to adjust the showcase file showcase.R with respect to my changes.
  • The pull request is opened against the branch dev.

Description

Adjust edge attribute handling slightly to make it compatible with the simplify.multiple.relations configuration option introduced in PR #250. In addition, provide more tests for network simplification.

Changelog

Added:

  • Network simplification tests using both approaches

Fixed:

  • Fix an issue in get.data.sources.from.relations when confronted with networks that include edges that stem from multiple relations (e.g., have been simplified using simplify.multiple.relations)

Changed:

  • Adjust the default handling of the relation edge-attribute when simplifying through igraph

PR se-sic#250 introduces the 'simplify.multiple.relations' configuration option,
which can alter the edge-simplification algorithm for multi-networks.
Correspondingly, the edge-attribute handling does need adjustment to
work with this configuration option.

Disambiguate and sort the 'relation' edge-attribute when simplifying
multi-networks with the 'simplify.multiple.relations' option set.

This works towards se-sic#251.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
As the 'relation' edge-attribute potentially holds a list of relation
sources, it needs to be unlisted first before it can be disambiguated.

This works towards se-sic#251.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
As the 'relation' edge-attribute can only be singular values if
'simplify.multiple.relations' is not set, we can set the previously
introduced lambda as default.

This works towards se-sic#251.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @MaLoefUDS. I have not performed a detailed review yet, but I had a quick look at the tests (without checking them in detail, I just skimmed them).

I have some questions targeting the second test ("Simplify author-network with relation = c('cochange', 'mail') using both algorithms"):

Is there a reason why you call the simplify.network function manually and pass the simplify.multiple.relations parameter manually to the simplify.network function? Isn't it possible to configure this via the recently introduced network-conf parameter simplify.multiple.relations (see PR #250) in combination with using simplify = TRUE in the network conf?
Maybe I overlook something or this does not belong here, but I would like to see the interplay of network conf and automatic simplification in the network builder also tested properly. If I am not mistaken, this should be easy to adjust in the test. As the automatic simplification calls the simplify.network function internally, it should not be necessary to test this function separately, as there is already the basic test above. So, just calling get.author.network on the network builder should be enough to derive the actual network that is compared to the expected network.

And a second idea: simplify.networks (the plural version) also uses the the new simplify.multiple.relations parameter. Do we already have basic tests for the plural variant? If so, would it be easily adjustable to also test the new parameter in the plural version, to make sure that the parameter is correctly passed to the single version inside?

(In addition, I spotted already two minor issues, see the comments below.)

tests/test-networks.R Outdated Show resolved Hide resolved
tests/test-networks.R Outdated Show resolved Hide resolved
@maxloeffler
Copy link
Contributor Author

Yeah, the only reason I did not use the network conf to set the simplification parameters were a lack of knowledge on my side 😅.

I just tried it out and I discovered something interesting. Creating a network, then simplifying it does not yield the same result as creating a network with the simplify parameter set. In particular, everything that is NA when post-simplifying (i.e., attributes of edges that only exist for edges of a specific relation) is NULL when simplifying via the config parameter. I could not observe any other changes in output.

I am not sure if that is an actually intended behavior but I guess its not and it needs further investigation. Also, we should decide on which is favored, NA or NULL.

@bockthom
Copy link
Collaborator

I just tried it out and I discovered something interesting. Creating a network, then simplifying it does not yield the same result as creating a network with the simplify parameter set. In particular, everything that is NA when post-simplifying (i.e., attributes of edges that only exist for edges of a specific relation) is NULL when simplifying via the config parameter. I could not observe any other changes in output.

This is really interesting and I'd guess that this behavior is not intended. In general, I've assumed that both ways of simplifying the network should end up in the same network, since the network-construction function of the NetworkBuilder actually calls the simplify function. So, maybe there is something additional going on in the NetworkBuilder after the simplification function is called? That's the only potential difference that comes to my mind.

I am not sure if that is an actually intended behavior but I guess its not and it needs further investigation. Also, we should decide on which is favored, NA or NULL.

Indeed, this needs further investigation. Could you please provide concrete examples of which edge attributes are affected and how their usual values look like? Also checking if there are some additional statements in the NetworkBuilder that could make a difference could be helpful.

Actually, I don't know yet whether NA or NULL is the expected value. Do we have similar situations in which these attributes are either NA or NULL (when not using the new functionality) such that we can choose the value that would be most consistent with similar attribute values?

@maxloeffler
Copy link
Contributor Author

maxloeffler commented Mar 11, 2024

I have investigated the issue and I think I found a bug in plyr::rbind.fill 💀, but lets start a bit earlier.

To get everyone on board, the pipeline for creating networks with multiple relations is as follows:

  1. Create individual networks for each relation
  2. Simplify individual networks
  3. Merge networks together into one

So when we create a network with simplify = TRUE once we reach step 3, the edges of the individual networks will be simplified, i.e., some attributes will now be of type list instead of character(0) or something else. For example, two cochange edges between two vertices will be simplified and the hash attribute of the simplified hash will be a concatination of the hashes of the individual edges. And it is here were plyr::rbind.fill in merge.network.data "fails".

According to its documentation plyr::rbind.fill is supposed to fill "... missing columns with NA" [doc]. However, it does fill the entries with NULL if and only if column entries contain values of type list. When we do not specify simplify = TRUE and therefore all columns only contain singular elements, missing values will be indeed filled with NA.

This is the end of my bug report and now I will use this comment to answer some of your other questions I have not answered yet:

And a second idea: simplify.networks (the plural version) also uses the the new simplify.multiple.relations parameter. Do we already have basic tests for the plural variant? If so, would it be easily adjustable to also test the new parameter in the plural version, to make sure that the parameter is correctly passed to the single version inside?

We have exactly 0 tests that test this function, however, adding tests similar to the ones I already added should not be too difficult.

Indeed, this needs further investigation. Could you please provide concrete examples of which edge attributes are affected and how their usual values look like?

Every mutually exclusive edge attribute between two relations is affected. For example, in mail edges, we have the message.id or the thread edge attribute, while in cochange networks we have hash. When merging mail and cochange edges into one, the attributes values for edges of relation-type one that do not exist in relation-type two will be filled with NA or NULL respectively.

Actually, I don't know yet whether NA or NULL is the expected value. Do we have similar situations in which these attributes are either NA or NULL (when not using the new functionality) such that we can choose the value that would be most consistent with similar attribute values?

After revisiting the R documentations of plyr::rbind.fill, I think having NA is the more reasonable, however, im open for discussions.

@maxloeffler
Copy link
Contributor Author

Another problem just came to my mind, but I have not verified it yet. According to the network creation pipeline in my last comment, I do not think simplify.multiple.relations as a parameter in the network configuration can even do what is expected from it, as simplification only happens on the individual networks in step 2).

@bockthom
Copy link
Collaborator

Another problem just came to my mind, but I have not verified it yet. According to the network creation pipeline in my last comment, I do not think simplify.multiple.relations as a parameter in the network configuration can even do what is expected from it, as simplification only happens on the individual networks in step 2).

Indeed, thanks for spotting this! Unfortunately, I just changed the call to a simplification function that is called inside the network creation but before the different networks are merged, my bad.

I guess the problem could be fixed when adding the following statements at two specific places:

if (network.conf$get.value("simplify")) {
        net = simplify.network(net,
                               simplify.multiple.relations = network.conf$get.value("simplify.multiple.relations"))
}

(1) After merge.networks in get.author.network line 758:

net = merge.networks(networks)

(2) After merge.networks in get.artifact.network line 820:

net = merge.networks(networks)

This way, the networks should be simplified after the different networks are combined.
Adding the simplification there should not change anything in case of single relation networks. Maybe we should extend the condition in the above if statement such that we only perform the additional simplification step in case of multiple relations, to only run the additional simplification in cases where it really is necessary, for performance reasons.

My previous change in line 1312, should not affect anything, so I assume we can keep this as I have implemented it? It should not make any difference, if I am not mistaken.

@bockthom
Copy link
Collaborator

bockthom commented Mar 12, 2024

I have investigated the issue and I think I found a bug in plyr::rbind.fill 💀

Puh, that's really interesting. Thanks for the thorough investigations. In general, I agree with you. However, we need to put some more efforts into analyzing how these NULL values emerge. Maybe the handling of lists only allows NULL values there. Thus, we need to find out if there is something we can change to avoid the NULL values and get the NA values instead. Otherwise, we need to think about whether we can replace the NULL values by NA values somewhere afterwards.
And finally, we need to find out whether this is really a bug in plyr::rbind.fill or whether this is just an artifact of some special situation. Actually, I don't think that this is a bug in plyr::rbind.fill, as I would assume that someone would already have reported this and this has already been fixed if this really is a bug there.

So, further investigations are necessary before we can decide on how to deal with this problem on our side and whether there really is a bug that we should report to plyr. Let's discuss this in our next meeting in more detail.

@maxloeffler
Copy link
Contributor Author

Let me concretize my findings: plyr::rbind.fill does not fill every column with NULLs once there is one column entry that is a list, just all entries belonging to that column where the list object is in. Here is an example:

Consider an arbitrary data.frame A and one data.frame B that contains columns b1 and b2 that are not included in A. After calling plyr::rbind.fill on A and B, we obtain a merged data.frame AB. AB$b1 will be filled with NA if not a single value in B$b1 is a list and it will be filled with NULL once there is just one list in B$b1. However, if B$b1 contains a list and B$b2 does not, AB$b2 will still be regularly filled with NA.

I have also checked the R documentation[doc] regarding NULL vs. NA because I did not really understand the difference. My conclusion is that while both represent a somehow missing or undefined value, NA in addition is a logical element, the third, TRUE, FALSE, and NA, while NULL is just the total abstinence of a value.

@bockthom
Copy link
Collaborator

Consider an arbitrary data.frame A and one data.frame B that contains columns b1 and b2 that are not included in A. After calling plyr::rbind.fill on A and B, we obtain a merged data.frame AB. AB$b1 will be filled with NA if not a single value in B$b1 is a list and it will be filled with NULL once there is just one list in B$b1. However, if B$b1 contains a list and B$b2 does not, AB$b2 will still be regularly filled with NA.

To be honest, I don't get the last sentence here. What's the relation between B$b1 and B$b2? The behavior regarding AB$b2 should be independent of B$b1 anyway. So, I don't get your point.

I have also checked the R documentation[doc] regarding NULL vs. NA because I did not really understand the difference. My conclusion is that while both represent a somehow missing or undefined value, NA in addition is a logical element, the third, TRUE, FALSE, and NA, while NULL is just the total abstinence of a value.

True. Or, more colloquially speaking: NA represents an element that is present but for which we don't know its actual value (either because we don't know the value or because we know that its value is different than the remaining pre-existing values). Although NA is most often used as a logical, it can also be used in numeric or character contexts to state that a value is not known or cannot be determined. NULL, instead, means that an element does not exist at all and is not present. For example, if you set a column of a data frame to NA, all the values of these column are set to NA in every row. However, if you set column of a data frame to NULL, you remove the column and it is not existent anymore afterwards.

@maxloeffler
Copy link
Contributor Author

Regarding the misunderstanding: In my first comment about the "bug" I made it sound like that if one entry of any column is a list then any missing values will be NULL instead of NA and I only wanted to concretize, that only the column where the list object sits in is affected. In my last sentence I say just that, that if one column, namely B$b1, contains a list object then column AB$b1 is affected but not AB$b2.

In this case, I actually believe NULL is the more fitting value, however, it appears that the developers of plyr did not think so, or else why would they set NA as a default.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments on your last two commits, see below.
For the last commit: Could you please adjust the commit message - the commit only cleans up the tests, but the commit message suggests that the actual implementation (not the tests) was changed.

tests/test-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
* Enable 'simplify.multiple.relations' config parameter to work properly.
Previously, networks would be simplified before merging datasources, i.e.,
before edges could even have multiple relations
* Circumvent undocumented behavior in 'plyr::rbind.fill' that occurs when
merging already simplified networks (see PR#255)

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
@bockthom
Copy link
Collaborator

To get an overview of the status of this PR: The only part that is still missing in this PR are tests for the "plural version" simplify.networks, right? Or was there anything else we did talk about for including it in this PR?

@maxloeffler
Copy link
Contributor Author

Yes, the tests are on my bucketlist for today. I don't remember anything that we did talk about but not yet implemented it in this PR ^^.

@maxloeffler
Copy link
Contributor Author

maxloeffler commented Mar 19, 2024

I was just wondering, since simplify.networks (plural) is just a wrapper for simplify.network, that we can keep tests simple for the plural version and rather focus on the main function. I think it would be more senseful to test the other parameters of simplify.network (e.g., remove.multiple) instead of having elaborate simplify.networks tests.

Edit: Nevermind I just realized that these parameters directly map to igraph parameters.

@bockthom
Copy link
Collaborator

I agree that there is no need for "elaborate" simplify.networks tests, since it is just a wrapper function, as you already mentioned. But we should make sure that the parameters are correctly passed internally and the output networks are simplified as expected, this is what the new tests should account for.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments below–I only performed a "syntactical" review, not yet a "semantic" one, that is, I did not yet think about the test scenarios yet.

tests/test-networks.R Show resolved Hide resolved
tests/test-networks.R Outdated Show resolved Hide resolved
tests/test-networks.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm done with my review now. Looks good to me.
I've a few suggestions on how to improve the tests with regard to other attributes etc. If you think my suggestions are beneficial improvements to the tests and easy to realize, I would appreciate it if you could extend the tests a little bit.
If you have good arguments why my suggestions are unnecessary, I am also open for them 😄

tests/test-networks.R Show resolved Hide resolved
tests/test-networks.R Show resolved Hide resolved
util-networks.R Show resolved Hide resolved
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
@maxloeffler
Copy link
Contributor Author

Im unsure whether the showcase.R file really needs an update regarding the new simplification option. Upon inspection, simplification is currently not an explicitly showcased feature. We obtain a sample network, simplify it, then plot it and continue. We also only do that one single simplification, we do not play with parameters later on or something. Therefore, I don't think we need an update but please let me know if you think differently 😄.

@bockthom
Copy link
Collaborator

bockthom commented Mar 20, 2024

Im unsure whether the showcase.R file really needs an update regarding the new simplification option. Upon inspection, simplification is currently not an explicitly showcased feature.

just because it's not an explicitly showcased feature now doesn't mean it has to stay that way 😉

We obtain a sample network, simplify it, then plot it and continue. We also only do that one single simplification, we do not play with parameters later on or something. Therefore, I don't think we need an update but please let me know if you think differently 😄.

We don't really need an update, but I still think it can be beneficial to demonstrate the new feature there.
Maybe the plotting section is not the right place, but maybe you can add a new section directly above the plotting section in which you demonstrate the two different simplification methods on a single multi-relation network? Just call "plot.network(...)" on each of them and it should be clearly visible to everyone what the difference between the two simplification options is.
If you want to avoid code duplication, you can also reuse the created network in the plotting section afterwards. [Edit: maybe not, as the current network for plotting only uses one relation.]

Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two small comments regarding the NEWS.md:

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@maxloeffler
Copy link
Contributor Author

maxloeffler commented Mar 22, 2024

Regarding the showcase.R: plot.network cannot handle lists in edge attributes as produced by our new simplification. I will need to see why ...

Update (1/2):

When plotting, we decide the color of an edge based on the relation of said edge, as seen here. For multi-relational edges we need to think about what we can do. We can probably take the first or the last element in the list to obtain a color that is already in use for other edge relations. I am new to ggplot, so I don't really know if we can just concat all elements in the relations list and get a new color from that.

Update (2/2):

I tried concating the relations and it seems to work now, I will finish up a commit so you can have a look.

util-plot.R Outdated Show resolved Hide resolved
showcase.R Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@bockthom
Copy link
Collaborator

bockthom commented Mar 22, 2024

Regarding the showcase.R: plot.network cannot handle lists in edge attributes as produced by our new simplification. I will need to see why ...

Great! First of all, thanks for spotting this bug! Nobody would ever have thought of fixing the plotting functionality to deal with edges that represent multiple relations. I've said it many times already: Playing around with the showcase.R really is beneficial for identifying potential problems with new features. That proved to be true again here 😄

Update (2/2):

I tried concating the relations and it seems to work now, I will finish up a commit so you can have a look.

Thanks for solving the problem. I really like your fix. I've tried plotting networks with multiple different combinations of relations on single edges, and all the different combinations get different edge colors, as intended.
When just looking at the code, I was wondering what your fix really does – but it produces the correct solution. Just R things... Good work @MaLoefUDS!

Concat all relations of an edge to a single string to determine the edges
color by, since ggplot2 does not accept lists as an identifier.

Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
Signed-off-by: Maximilian Löffler <s8maloef@stud.uni-saarland.de>
@bockthom bockthom changed the title Work on network simplification Work on the simplification of multi-relation edges in networks Mar 22, 2024
@bockthom bockthom merged commit 1d3d1a3 into se-sic:dev Mar 22, 2024
6 checks passed
@bockthom bockthom mentioned this pull request Apr 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants