Refactor entity matching name cleaner to be more efficient (#3953)
* refactor name cleaner

* fix up

* fix legal terms dict variable

* fix read in of legal term dictionary json

* update release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* fix doc strings on name cleaner

* fix name cleaner rule

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
katie-lamb and pre-commit-ci[bot] authored Dec 18, 2024
1 parent 99ee4e5 commit 0dd0530
Showing 3 changed files with 208 additions and 135 deletions.
8 changes: 7 additions & 1 deletion docs/release_notes.rst
@@ -70,11 +70,17 @@ EPA CEMS
 ~~~~~~~~
 * Added 2024 Q3 of CEMS data. See :issue:`3943` and :pr:`3948`.

-FERC to EIA Record Linkage
+Record Linkage
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 * Updated the ``splink`` FERC to EIA development notebook to be compatible with
   the latest version of ``splink``. This notebook is not run in production but
   is helpful for visualizing model weights and what is happening under the hood.
+* Updated ``pudl.analysis.record_linkage.name_cleaner`` company name cleaning
+  module to be more efficient by removing all ``.apply`` and instead use
+  ``pd.Series.replace`` to make regex replacement rules vectorized. Also removed
+  some of the allowed replacement rules to make the cleaner simpler and more
+  effective. This module runs approximately 3x faster now when cleaning a
+  string Series.

 .. _release-v2024.10.0:

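The release note above describes swapping per-element ``.apply`` calls for ``pd.Series.replace`` with regex rules. The following is a minimal sketch of that idea, not the actual ``name_cleaner`` code: the ``RULES`` list and both function names are hypothetical and chosen only for illustration. It shows that a loop of ``Series.replace(..., regex=True)`` calls yields the same strings as a row-wise ``re.sub`` loop.

```python
import re

import pandas as pd

# Hypothetical rules for illustration only; NOT the actual PUDL rule set.
RULES = [
    (r"&", " and "),    # ampersand -> the word "and"
    (r"[-_]", " "),     # hyphens and underscores -> space
    (r"[^\w\s]", ""),   # drop remaining punctuation
    (r"\s+", " "),      # collapse runs of whitespace
]


def clean_with_apply(names: pd.Series) -> pd.Series:
    """Row-wise version: one Python function call per element."""

    def clean_one(name: str) -> str:
        for pattern, repl in RULES:
            name = re.sub(pattern, repl, name)
        return name.strip()

    return names.str.lower().apply(clean_one)


def clean_with_replace(names: pd.Series) -> pd.Series:
    """Series-level version: one ``Series.replace`` call per rule."""
    out = names.str.lower()
    for pattern, repl in RULES:
        out = out.replace(to_replace=pattern, value=repl, regex=True)
    return out.str.strip()


names = pd.Series(["Duke-Energy & Co.", "Pacific_Gas  and Electric"])
cleaned = clean_with_replace(names)
```

Both functions apply the rules in the same order, so rule ordering matters equally in either form. Any speedup depends on the data and the rule set; the 3x figure in the release note refers to the real module, not to this sketch.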
4 changes: 1 addition & 3 deletions src/pudl/analysis/record_linkage/eia_ferc1_record_linkage.py
@@ -78,11 +78,9 @@
 cleaning_rules_list=[
     "remove_word_the_from_the_end",
     "remove_word_the_from_the_beginning",
-    "replace_amperstand_between_space_by_AND",
+    "replace_ampersand_by_AND",
     "replace_hyphen_by_space",
-    "replace_hyphen_between_spaces_by_single_space",
     "replace_underscore_by_space",
-    "replace_underscore_between_spaces_by_single_space",
     "remove_all_punctuation",
     "remove_numbers",
     "remove_math_symbols",
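To illustrate how an ordered list of rule names like the one in this hunk might drive the cleaning, here is a small sketch. The rule names are taken from the diff, but the regex patterns, the ``RULE_PATTERNS`` mapping, and the ``apply_rules`` helper are all assumptions made for this example; the real module's definitions are not shown in this commit view.

```python
import pandas as pd

# Rule names come from the diff; the (pattern, replacement) pairs are guesses.
RULE_PATTERNS = {
    "remove_word_the_from_the_end": (r"\s+the$", ""),
    "remove_word_the_from_the_beginning": (r"^the\s+", ""),
    "replace_ampersand_by_AND": (r"&", " and "),
    "replace_hyphen_by_space": (r"-", " "),
    "replace_underscore_by_space": (r"_", " "),
    "remove_numbers": (r"\d+", ""),
}


def apply_rules(names: pd.Series, rule_names: list[str]) -> pd.Series:
    """Apply the named rules in list order, then tidy whitespace."""
    out = names
    for rule in rule_names:
        pattern, repl = RULE_PATTERNS[rule]
        out = out.replace(to_replace=pattern, value=repl, regex=True)
    # Collapse whitespace left behind by the substitutions above.
    return out.replace(to_replace=r"\s+", value=" ", regex=True).str.strip()


rules = [
    "remove_word_the_from_the_end",
    "remove_word_the_from_the_beginning",
    "replace_ampersand_by_AND",
    "replace_hyphen_by_space",
    "replace_underscore_by_space",
    "remove_numbers",
]
result = apply_rules(pd.Series(["the ohio-power co 2", "a&b_utilities the"]), rules)
```

Because a final whitespace-collapsing pass runs after all substitutions, separate "between spaces" variants of the hyphen and underscore rules are redundant in this sketch, which may be one reason such variants could be dropped.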