Count number of connected components more efficiently than length(connected_components(g)) #407

Open
wants to merge 10 commits into
base: master
2 changes: 2 additions & 0 deletions src/Graphs.jl
@@ -210,6 +210,8 @@ export

     # connectivity
     connected_components,
+    connected_components!,
+    count_connected_components,
     strongly_connected_components,
     strongly_connected_components_kosaraju,
     strongly_connected_components_tarjan,
92 changes: 82 additions & 10 deletions src/connectivity.jl
@@ -1,26 +1,33 @@
 # Parts of this code were taken / derived from Graphs.jl. See LICENSE for
 # licensing details.
 """
-    connected_components!(label, g)
+    connected_components!(label, g, [search_queue])
Member

I am all for performance improvements, but I am a bit skeptical whether it is worth making the interface more complicated.

Almost all graph algorithms need some kind of work buffer, so we could have something like this in all algorithms, but in the end it should be the job of Julia's allocator to check whether a suitable piece of memory is lying around. We can help it by using `sizehint!` with a suitable heuristic.
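
As a rough illustration of the `sizehint!` suggestion, here is a minimal, self-contained sketch. It works on a plain adjacency list rather than a Graphs.jl graph, and `label_components` is a hypothetical name, not an existing function. The capacity hint of `n` is only one possible heuristic.

```julia
# Hypothetical helper, not part of Graphs.jl: labels the connected components of
# an undirected graph given as an adjacency list, using breadth-first search.
function label_components(adj::Vector{Vector{Int}})
    n = length(adj)
    label = zeros(Int, n)
    queue = Int[]
    # Heuristic capacity hint: every vertex is pushed at most once.
    sizehint!(queue, n)
    for u in 1:n
        label[u] != 0 && continue      # already assigned to a component
        label[u] = u
        push!(queue, u)
        while !isempty(queue)
            src = popfirst!(queue)
            for v in adj[src]
                if label[v] == 0
                    label[v] = u       # component id is its smallest vertex
                    push!(queue, v)
                end
            end
        end
    end
    return label
end

adj = [[2], [1, 3], [2], [5], [4]]     # components {1, 2, 3} and {4, 5}
labels = label_components(adj)
```

The queue is still allocated on every call here; the hint only reduces how often it has to grow.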

Contributor Author

I agree that this will usually not be relevant; in my case it is, though, and it is the main reason I made the changes. I also agree that there is a trade-off between performance improvements and complicating the API. On the other hand, I think passing such work buffers as optional arguments is a good way to handle that trade-off: most users can safely ignore the extra argument, and it shouldn't complicate their lives much.

As you say, there are potentially many algorithms in Graphs.jl that could take a work buffer. In light of that, maybe this would be more palatable if we settled on a unified name for these kinds of optional buffers, so that standardizing across methods keeps the added complexity low.
Maybe just `work_buffer` (and, if there are multiple, `work_buffer1`, `work_buffer2`, etc.)?

Member
@gdalle Nov 21, 2024

If we do this then all functions should take exactly one work_buffer (possibly a tuple) and have an appropriate function to initialize the buffer. I think it is a major change which should be discussed separately.
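
To make the "exactly one work buffer plus an initializer" idea concrete, here is a hedged sketch. Every name in it (`ComponentsBuffer`, `init_work_buffer`, `count_components!`) is hypothetical and not Graphs.jl API, and it uses a plain adjacency list so that it runs without any packages. A tuple or named tuple could stand in for the struct.

```julia
# Hypothetical names throughout; none of this is existing Graphs.jl API.
struct ComponentsBuffer
    label::Vector{Int}   # component id per vertex, 0 means "unvisited"
    queue::Vector{Int}   # BFS queue, reused across calls
end

# Initializer sized for a graph with `n` vertices.
init_work_buffer(n::Integer) = ComponentsBuffer(zeros(Int, n), sizehint!(Int[], n))

function count_components!(buf::ComponentsBuffer, adj::Vector{Vector{Int}})
    label, queue = buf.label, buf.queue
    fill!(label, 0)      # reset state left over from a previous call
    empty!(queue)
    ncomp = 0
    for u in eachindex(adj)
        label[u] != 0 && continue
        ncomp += 1
        label[u] = u
        push!(queue, u)
        while !isempty(queue)
            src = popfirst!(queue)
            for v in adj[src]
                if label[v] == 0
                    label[v] = u
                    push!(queue, v)
                end
            end
        end
    end
    return ncomp
end

# Convenience method for callers who do not care about the buffer.
count_components(adj) = count_components!(init_work_buffer(length(adj)), adj)

buf = init_work_buffer(5)
n1 = count_components!(buf, [[2], [1, 3], [2], [5], [4]])   # 2
n2 = count_components!(buf, [[2], [1], [4], [3], Int[]])    # 3
```

With this shape, the buffer-free method keeps the simple interface, and only callers in a hot loop need to know that the buffer exists.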

Member

So I think if this is really important for your use case you can either

  • Create a version that uses a buffer in the Experimental submodule. Currently we don't guarantee semantic versioning there, which allows us to remove things in the future without breaking the API.
  • Or as this code is very simple you might just copy it to your own repository.

But just to clarify - your problem is not that you are building graphs by adding edges until they are connected? Because if that is the issue, there is a much better algorithm.
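
The comment above does not name the algorithm. A common choice when edges are added one at a time is a disjoint-set (union-find) structure, which is assumed in the sketch below. The type and function names are illustrative and not taken from Graphs.jl; DataStructures.jl provides a comparable `IntDisjointSets` type.

```julia
# Minimal disjoint-set forest with path halving and union by size.
# All names here are illustrative and not taken from Graphs.jl.
mutable struct DisjointSets
    parent::Vector{Int}
    size::Vector{Int}
    ncomponents::Int
end

# Start with `n` singleton components.
DisjointSets(n::Integer) = DisjointSets(collect(1:n), ones(Int, n), n)

function find_root!(ds::DisjointSets, x::Int)
    while ds.parent[x] != x
        ds.parent[x] = ds.parent[ds.parent[x]]   # path halving
        x = ds.parent[x]
    end
    return x
end

# Record the edge (u, v); returns true if it merged two components.
function add_edge_union!(ds::DisjointSets, u::Int, v::Int)
    ru, rv = find_root!(ds, u), find_root!(ds, v)
    ru == rv && return false
    if ds.size[ru] < ds.size[rv]
        ru, rv = rv, ru                          # attach the smaller tree
    end
    ds.parent[rv] = ru
    ds.size[ru] += ds.size[rv]
    ds.ncomponents -= 1
    return true
end

ds = DisjointSets(4)
for (u, v) in [(1, 2), (3, 4), (2, 1), (2, 3)]
    add_edge_union!(ds, u, v)
    println("after edge ($u, $v): ", ds.ncomponents, " component(s)")
end
```

Each added edge costs close to constant time, so the component count stays up to date without re-running a search over the whole graph after every insertion.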


 Fill `label` with the `id` of the connected component in the undirected graph
 `g` to which it belongs. Return a vector representing the component assigned
 to each vertex. The component value is the smallest vertex ID in the component.

-### Performance
+## Optional arguments
+- `search_queue`, an empty `Vector{eltype(edgetype(g))}`, can be provided to avoid
+reallocating this work array repeatedly on repeated calls of `connected_components!`.
+If not provided, it is automatically instantiated.
+
+## Performance
 This algorithm is linear in the number of edges of the graph.
 """
-function connected_components!(label::AbstractVector, g::AbstractGraph{T}) where {T}
+function connected_components!(
+    label::AbstractVector{T}, g::AbstractGraph{T}, search_queue::Vector{T}=Vector{T}()
+) where {T}
+    empty!(search_queue)
     for u in vertices(g)
         label[u] != zero(T) && continue
         label[u] = u
-        Q = Vector{T}()
-        push!(Q, u)
-        while !isempty(Q)
-            src = popfirst!(Q)
+        push!(search_queue, u)
+        while !isempty(search_queue)
+            src = popfirst!(search_queue)
             for vertex in all_neighbors(g, src)
                 if label[vertex] == zero(T)
-                    push!(Q, vertex)
+                    push!(search_queue, vertex)
                     label[vertex] = u
                 end
             end
@@ -129,9 +136,74 @@ julia> is_connected(g)
 true
 ```
 """
-function is_connected(g::AbstractGraph)
+function is_connected(g::AbstractGraph{T}) where {T}
     mult = is_directed(g) ? 2 : 1
-    return mult * ne(g) + 1 >= nv(g) && length(connected_components(g)) == 1
+    if mult * ne(g) + 1 >= nv(g)
+        label = zeros(T, nv(g))
+        connected_components!(label, g)
+        return allequal(label)
+    else
+        return false
+    end
 end

+"""
+    count_connected_components(g, [label, search_queue]; reset_label::Bool=false)
+
+Return the number of connected components in `g`.
+
+Equivalent to `length(connected_components(g))` but uses fewer allocations by not
+materializing the component vectors explicitly.
+
+## Optional arguments
+Mutated work arrays, `label` and `search_queue`, can be provided to avoid allocating these
+arrays repeatedly on repeated calls of `count_connected_components`.
+For `g :: AbstractGraph{T}`, `label` must be a zero-initialized `Vector{T}` of length
+`nv(g)` and `search_queue` a `Vector{T}`. See also [`connected_components!`](@ref).
+
+## Keyword arguments
+- `reset_label :: Bool` (default, `false`): if `true`, `label` is reset to a zero-vector
+before returning.
+
+## Example
+```
+julia> using Graphs
+
+julia> g = Graph(Edge.([1=>2, 2=>3, 3=>1, 4=>5, 5=>6, 6=>4, 7=>8]));
+
+julia> connected_components(g)
+3-element Vector{Vector{Int64}}:
+ [1, 2, 3]
+ [4, 5, 6]
+ [7, 8]
+
+julia> count_connected_components(g)
+3
+```
+"""
+function count_connected_components(
Member

I am a bit undecided if we should call this count_connected_components or num_connected_components. Currently we have both conventions, namely num_self_loops and Graphs.Experimental.count_isomorph.

Ideally we use the same word everywhere. @gdalle Do you have an opinion on that?

Contributor Author

There's also nv(g) for the number of vertices. Maybe just nconnected_components?

Member

If I had to pick, I'd rather use `count` than `num` or `n`, because it is a complete word.

Member

Definitely no to `nconnected_components`. `nv` and `ne` might be exceptions, as they are used all the time, but we might rename them one day.

I don't mind an abbreviation from time to time, but let's go with `count_connected_components` then. After all, we also have a `count` function in Julia Base.

+    g::AbstractGraph{T},
+    label::AbstractVector{T}=zeros(T, nv(g)),
+    search_queue::Vector{T}=Vector{T}();
+    reset_label::Bool=false,
+) where {T}
+    connected_components!(label, g, search_queue)
+    c = count_unique(label)
+    reset_label && fill!(label, zero(eltype(label)))
+    return c
+end
+
+function count_unique(label::Vector{T}) where {T}
+    # effectively does `length(Set(label))` but faster, since `Set(label)` sizehints
+    # aggressively and assumes that most elements of `label` will be unique, which very
+    # rarely will be the case for caller `count_connected_components`
+    seen = Set{T}()
+    for l in label
+        # faster than direct `push!(seen, l)` when `label` has few unique elements relative
+        # to `length(label)`
+        l ∉ seen && push!(seen, l)
+    end
+    return length(seen)
+end
+
 """
1 change: 1 addition & 0 deletions test/operators.jl
@@ -268,6 +268,7 @@
     for i in 3:4
         @testset "Tensor Product: $g" for g in testgraphs(path_graph(i))
             @test length(connected_components(tensor_product(g, g))) == 2
+            @test count_connected_components(tensor_product(g, g)) == 2
         end
     end

24 changes: 18 additions & 6 deletions test/spanningtrees/boruvka.jl
@@ -21,14 +21,18 @@
         g1t = GenericGraph(SimpleGraph(edges1))
         @test res1.weight == cost_mst
         # acyclic graphs have n - c edges
-        @test nv(g1t) - length(connected_components(g1t)) == ne(g1t)
+        @test nv(g1t) - ne(g1t) ==
+            length(connected_components(g1t)) ==
+            count_connected_components(g1t)
         @test nv(g1t) == nv(g)

         res2 = boruvka_mst(g, distmx; minimize=false)
         edges2 = [Edge(src(e), dst(e)) for e in res2.mst]
         g2t = GenericGraph(SimpleGraph(edges2))
         @test res2.weight == cost_max_vec_mst
-        @test nv(g2t) - length(connected_components(g2t)) == ne(g2t)
+        @test nv(g2t) - ne(g2t) ==
+            length(connected_components(g2t)) ==
+            count_connected_components(g2t)
         @test nv(g2t) == nv(g)
     end
     # second test
@@ -60,14 +64,18 @@
         edges3 = [Edge(src(e), dst(e)) for e in res3.mst]
         g3t = GenericGraph(SimpleGraph(edges3))
         @test res3.weight == weight_vec2
-        @test nv(g3t) - length(connected_components(g3t)) == ne(g3t)
+        @test nv(g3t) - ne(g3t) ==
+            length(connected_components(g3t)) ==
+            count_connected_components(g3t)
         @test nv(g3t) == nv(gx)

         res4 = boruvka_mst(g, distmx_sec; minimize=false)
         edges4 = [Edge(src(e), dst(e)) for e in res4.mst]
         g4t = GenericGraph(SimpleGraph(edges4))
         @test res4.weight == weight_max_vec2
-        @test nv(g4t) - length(connected_components(g4t)) == ne(g4t)
+        @test nv(g4t) - ne(g4t) ==
+            length(connected_components(g4t)) ==
+            count_connected_components(g4t)
         @test nv(g4t) == nv(gx)
     end

@@ -123,14 +131,18 @@
         edges5 = [Edge(src(e), dst(e)) for e in res5.mst]
         g5t = GenericGraph(SimpleGraph(edges5))
         @test res5.weight == weight_vec3
-        @test nv(g5t) - length(connected_components(g5t)) == ne(g5t)
+        @test nv(g5t) - ne(g5t) ==
+            length(connected_components(g5t)) ==
+            count_connected_components(g5t)
         @test nv(g5t) == nv(gd)

         res6 = boruvka_mst(g, distmx_third; minimize=false)
         edges6 = [Edge(src(e), dst(e)) for e in res6.mst]
         g6t = GenericGraph(SimpleGraph(edges6))
         @test res6.weight == weight_max_vec3
-        @test nv(g6t) - length(connected_components(g6t)) == ne(g6t)
+        @test nv(g6t) - ne(g6t) ==
+            length(connected_components(g6t)) ==
+            count_connected_components(g6t)
         @test nv(g6t) == nv(gd)
     end
 end