Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH][DOC] local Geary statistics #145

Merged
merged 14 commits into from
Jan 18, 2021
Merged

[ENH][DOC] local Geary statistics #145

merged 14 commits into from
Jan 18, 2021

Conversation

jeffcsauer
Copy link
Collaborator

Purpose of PR

This PR addresses bullet point 7 and 8 of #61. Specifically, this PR adds two new functions to estimate:

  • Local Geary statistics - univariate (function name: Local_Geary)
  • Local Geary statistics - multivariate (function name: Local_Geary_MV)

Each function is written in the form of a scikit-learn style estimator. PEP8 formatting has been applied to all of the functions, although a handful of lines are left long due to readability.

Documentation

One notebook is included in the PR. The notebook reviews the core math of the statistic and explains the basic use of the statistic following examples provided by Anselin (2017).

All functions include docstrings and doctests that use built-in PySAL example datasets.

Tests

I have added in two .py test files that execute successfully on a branch specifically for the Geary estimators. These new tests follow the lead of the existing tests, specifically test_moran.py.

Notes for PR consideration

  • The univariate local Geary is numba-ized, whereas the multivariate local Geary uses a modified form of the old _crand() engine.

@codecov-commenter
Copy link

codecov-commenter commented Aug 22, 2020

Codecov Report

Merging #145 into master will increase coverage by 2.53%.
The diff coverage is 94.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #145      +/-   ##
==========================================
+ Coverage   77.61%   80.14%   +2.53%     
==========================================
  Files          25       37      +12     
  Lines        3274     3758     +484     
==========================================
+ Hits         2541     3012     +471     
- Misses        733      746      +13     
Impacted Files Coverage Δ
esda/local_geary.py 87.50% <87.50%> (ø)
esda/losh.py 89.74% <89.74%> (ø)
esda/tests/test_ljc.py 91.30% <91.30%> (ø)
esda/tests/test_ljc_mv.py 92.00% <92.00%> (ø)
esda/tests/test_local_geary.py 92.00% <92.00%> (ø)
esda/tests/test_losh.py 92.00% <92.00%> (ø)
esda/tests/test_ljc_bv.py 92.30% <92.30%> (ø)
esda/tests/test_local_geary_mv.py 92.59% <92.59%> (ø)
esda/local_join_count_bv.py 97.33% <97.33%> (ø)
esda/local_geary_mv.py 100.00% <100.00%> (ø)
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e660983...ffc1b30. Read the comment docs.

@ljwolf ljwolf requested review from ljwolf and sjsrey August 28, 2020 16:07
@jeffcsauer
Copy link
Collaborator Author

A few slight changes in 449e767, specifically:

  • updating docstrings with some formatting changes
  • ensuring scikit learn formatting for univariate statistics
  • fixed permutations such that it is part of init(...) and removed constant from top of file

@hh2110
Copy link

hh2110 commented Sep 2, 2020

Hi Jeff, I'm using this branch to do carry out some spatial autocorrelation. @ljwolf may have noticed this already but setting n_jobs > 1, causes the issue described in #146 - the quick fix described in #146, does not work here since the argument for keep is set to True in the call to _crand_plus (L#128 in local_geary.py).

@jeffcsauer
Copy link
Collaborator Author

@hh2110 thanks for catching that! I will take a look at it today and get back to you ASAP.

@jeffcsauer
Copy link
Collaborator Author

@hh2110 thanks again for the catch - I appreciate your code review and noting of the issue! The keep_simulations variable was not yet set to be modifiable as it should have been. I ran some basic tests on my side (see chunk below), but do let me know if it's not working!

import libpysal as lp
import geopandas as gpd
guerry = lp.examples.load_example('Guerry')
guerry_ds = gpd.read_file(guerry.get_path('Guerry.shp'))
w = libpysal.weights.Queen.from_dataframe(guerry_ds)
y = guerry_ds['Donatns']
lG = Local_Geary(connectivity=w, n_jobs=2, keep_simulations=False).fit(y)
lG.p_sim[0:5]
array([0.191, 0.05 , 0.058, 0.146, 0.48]

@hh2110
Copy link

hh2110 commented Sep 2, 2020

Thanks @jeffcsauer. One more thing please: I am analysing a 400x400 matrix using the Local_Geary.fit function. For this large array, I noticed that there was a time consuming step in Line#169 in local_geary.py: I think you can replace this line

gs = sum(list(w.weights.values()), []) * (zi-zj)**2

with

gs = adj_list.weight.values * (zi-zj)**2

The sum(list(w.weights.values()), []) seems to take a long time for large arrays. The only thing I am unsure of is whether or not the order of values in adj_list.weight.values matches that of sum(list(w.weights.values()), [])?

@jeffcsauer
Copy link
Collaborator Author

jeffcsauer commented Sep 3, 2020

@hh2110 taking a look at your suggestion this morning - will report back results! I think you may be onto something as a similar type of improvement has been implemented in the LOSH function.

@jeffcsauer
Copy link
Collaborator Author

@hh2110 another great catch! 🚀 I first checked to make sure that sum(list(w.weights.values()) and adj_list.weight.values were equivalent. On my local tests these appear to be the same. I also checked to ensure there was no difference in the results between row-standardized and unstandardized weights - all good there as well. I think in the original implementation I was fixated on a single approach to getting those adjacency values - didn't take the step back to see your more elegant solution!

I've pushed your suggested change in both Local_Geary and Local_Geary_MV. Could you let me know if its (1) producing same/similar results to the slow version of the function, and (2) if it's much faster?

@ljwolf
Copy link
Member

ljwolf commented Jan 18, 2021

I've fixed up imports and renamed things to conform to the other esda style (Local_Geary becomes Geary_Local, so that tab completion can work on the statistic, then the variant)

This is some very high-quality work, @jeffcsauer and I'm sorry we couldn't get it merged sooner!

@ljwolf
Copy link
Member

ljwolf commented Jan 18, 2021

This is ready to ship. It has notebooks, documentation, and conforms to style.

@sjsrey sjsrey merged commit 5e7e38d into pysal:master Jan 18, 2021
@jeffcsauer
Copy link
Collaborator Author

No worries at all - thanks for the kind comments and mentorship! 😃

@jeffcsauer jeffcsauer deleted the gearybranch branch January 18, 2021 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants