Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]Uniform layer-wise sampler #362

Closed
wants to merge 11 commits into from
Closed

Conversation

GaiYu0
Copy link
Collaborator

@GaiYu0 GaiYu0 commented Jan 21, 2019

Description

The implementation of uniform layer-wise sampler, tested with SSE on the PubMed dataset.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR

Changes

@GaiYu0
Copy link
Collaborator Author

GaiYu0 commented Jan 21, 2019

I also need to test my implementation with multi-layer GCN.

@GaiYu0 GaiYu0 closed this Jan 21, 2019
@GaiYu0 GaiYu0 reopened this Jan 21, 2019
@jermainewang
Copy link
Member

Could we refactor the sampler codes to another source file? Also, I suggest you break down this PR into two. One is only the layer sampler codes in C++, and another is the model-related codes in Python.

subgraph.map_to_subgraph_nid(dst).asnumpy())
for src, dst in parent_uv_edges_per_hop]
#print(subg_uv_edges_per_hop)
return subgraph, subg_uv_edges_per_hop
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to call LayerSampler like this. Previously, neighbor sampling can't handle resampled nodes correctly. That's why @ZiyueHuang implement neighbor sampling like this. If your layer sampler can handle resampled nodes, you can directly use layer sampler.

BTW, this implementation generates only a single batch in an epoch. This isn't desired. We need to enable mini-batch training.


print(args)

main(args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file should eventually merge with @ZiyueHuang's neighbor sampling. Most of the code should be the same.

* \return a subgraph
*/
/* virtual SampledSubgraph LayerUniformSample(IdArray seeds, const std::string &neigh_type,
int n_layers, size_t layer_size) const = 0; */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comment this?

if not prefetch:
return loader
else:
return _PrefetchingLoader(loader, num_prefetch=num_workers*2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between layer sampler and neighbor sampler, in terms of API? if they are the same, should we merge them?

@@ -976,6 +977,138 @@ SampledSubgraph ImmutableGraph::SampleSubgraph(IdArray seed_arr,
return subg;
}

SampledSubgraph ImmutableGraph::LayerSample(IdArray seed_array,
const float* probability,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to pass probability?

size_t n_seeds = seed_array->shape[0];
const dgl_id_t* seed_data = static_cast<dgl_id_t*>(seed_array->data);
candidate_set.insert(seed_data, seed_data + n_seeds);
std::copy(candidate_set.begin(), candidate_set.end(), nodes.begin());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to allocate memory for nodes first?

const dgl_id_t* seed_data = static_cast<dgl_id_t*>(seed_array->data);
candidate_set.insert(seed_data, seed_data + n_seeds);
std::copy(candidate_set.begin(), candidate_set.end(), nodes.begin());
layer_ids.insert(layer_ids.end(), nodes.size(), 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use push_back?

std::vector<size_t> positions {0, nodes.size()};
for (int i = 0; i != n_layers; i++) {
candidate_set.clear();
for (auto j = positions.end()[-2]; j != positions.back(); ++j) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

positions.end()[-2] looks weird. it seems you want to use positions.front()

for (auto const &pair : n_times) {
nodes.push_back(pair.first);
layer_ids.push_back(i + 1);
probabilities.push_back(pair.second / n_nodes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this sampling probability? also, it's over the number of all nodes in the original graph?

@jermainewang jermainewang mentioned this pull request Jan 21, 2019
6 tasks
subg.layer_ids[i] = node_info[i].first;
subg.sample_prob[i] = node_info[i].second;
}
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these?

fn.sum(msg='m', out='h'))
h = subg.ndata.pop('h')
# same as TODO above
h = h / math.sqrt(3.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normalization doesn't seem right. Could you try normalizing with vertex degree?

Copy link
Member

@ZiyueHuang ZiyueHuang Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try random walk normalized Laplacian, like in https://github.com/ZiyueHuang/dgl/blob/4aa22e63d42fba3a27709a02c55aba680e30dd33/examples/mxnet/gcn/gcn_ns.py for easier implementation (for uniform sampling, mean is OK, no need for the norm),

def gcn_reduce(node):
    accum = mx.nd.mean(node.mailbox['m'], 1)
    return {'h': accum}

instead of symmetric normalized Laplacian.

@jermainewang jermainewang mentioned this pull request Feb 18, 2019
26 tasks
@GaiYu0 GaiYu0 closed this Feb 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants