Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed May 2, 2024
1 parent 91a3271 commit dbc3ac1
Showing 21 changed files with 286 additions and 286 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
70cac868
6b37c638
2 changes: 1 addition & 1 deletion docs/reference/bloom_filters.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/cloud.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/config.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/embedder.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/encryption.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/features.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/index.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/local.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/perform.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

2 changes: 1 addition & 1 deletion docs/reference/utils.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

26 changes: 13 additions & 13 deletions docs/tutorials/example-febrl.html
@@ -2,7 +2,7 @@
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"><head>

<meta charset="utf-8">
<meta name="generator" content="quarto-1.4.553">
<meta name="generator" content="quarto-1.4.554">

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

@@ -343,7 +343,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>


<p>This tutorial shows how the package can be used locally to match the <a href="http://users.cecs.anu.edu.au/~Peter.Christen/publications/hdkm2008slides.pdf">FEBRL</a> datasets, included as example datasets in the <a href="https://recordlinkage.readthedocs.io/en/latest/"><code>recordlinkage</code></a> package.</p>
<div id="47b2048b" class="cell" data-execution_count="1">
<div id="5794dce7" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> os</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> time</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> functools <span class="im">import</span> partial</span>
@@ -359,7 +359,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>
<h2 class="anchored" data-anchor-id="load-the-data">Load the data</h2>
<p>The datasets we are using are 5000 records across two datasets with no duplicates, and each of the records has a valid match in the other dataset.</p>
<p>After loading the data, we can parse the true matched ID number from the indices.</p>
<div id="dff34971" class="cell" data-execution_count="2">
<div id="6a435ba6" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>feb4a, feb4b <span class="op">=</span> load_febrl4()</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>feb4a[<span class="st">"true_id"</span>] <span class="op">=</span> (</span>
@@ -382,7 +382,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
<li>Pass a dictionary of dictionaries of keyword arguments as an optional <code>ff_args</code> parameter (e.g.&nbsp;<code>ff_args = {"dob": {"dayfirst": False, "yearfirst": True}})</code>)</li>
<li>Use <code>functools.partial()</code>, as we have below.</li>
</ol>
<div id="291895d4" class="cell" data-execution_count="3">
<div id="09bdb1ce" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>feature_factory <span class="op">=</span> <span class="bu">dict</span>(</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a> name<span class="op">=</span>feat.gen_name_features,</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> dob<span class="op">=</span>partial(feat.gen_dateofbirth_features, dayfirst<span class="op">=</span><span class="va">False</span>, yearfirst<span class="op">=</span><span class="va">True</span>),</span>
@@ -396,7 +396,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
<section id="initialise-the-embedder-instance" class="level2">
<h2 class="anchored" data-anchor-id="initialise-the-embedder-instance">Initialise the embedder instance</h2>
<p>This instance embeds each feature twice into a Bloom filter of length 1024.</p>
<div id="e4753523" class="cell" data-execution_count="4">
<div id="5288a45e" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>embedder <span class="op">=</span> Embedder(feature_factory, bf_size<span class="op">=</span><span class="dv">1024</span>, num_hashes<span class="op">=</span><span class="dv">2</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
@@ -418,7 +418,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<p>For example, to ensure suburb doesn’t collide with state (if they happened to be the same), <code>gen_misc_features()</code> would encode each of their tokens as <code>suburb&lt;token&gt;</code> and <code>state&lt;token&gt;</code>, respectively. If you want to map different columns into the same feature, such as <code>address</code> below, you can set the label explicitly when passing the function to the embedder.</p>
</div>
</div>
<div id="18d2807a" class="cell" data-execution_count="5">
<div id="c7722047" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>colspec <span class="op">=</span> <span class="bu">dict</span>(</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> given_name<span class="op">=</span><span class="st">"name"</span>,</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> surname<span class="op">=</span><span class="st">"name"</span>,</span>
@@ -436,7 +436,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a>edf2 <span class="op">=</span> embedder.embed(feb4b, colspec<span class="op">=</span>colspec)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Store the embedded datasets and their embedder to file.</p>
<div id="5db9dd9b" class="cell" data-execution_count="6">
<div id="8e1bac82" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>edf1.to_json(<span class="st">"party1_data.json"</span>)</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>edf2.to_json(<span class="st">"party2_data.json"</span>)</span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>embedder.to_pickle(<span class="st">"embedder.pkl"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -445,30 +445,30 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<section id="calculate-similarity" class="level2">
<h2 class="anchored" data-anchor-id="calculate-similarity">Calculate similarity</h2>
<p>Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.</p>
<div id="8ff4c8fc" class="cell" data-execution_count="7">
<div id="098d27e1" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>start <span class="op">=</span> time.time()</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>edf1.update_thresholds()</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>edf2.update_thresholds()</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>end <span class="op">=</span> time.time()</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Updating thresholds took </span><span class="sc">{</span>end <span class="op">-</span> start<span class="sc">:.2f}</span><span class="ss"> seconds"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Updating thresholds took 8.37 seconds</code></pre>
<pre><code>Updating thresholds took 8.35 seconds</code></pre>
</div>
</div>
<p>Compute the matrix of similarity scores.</p>
<div id="d1d9845b" class="cell" data-execution_count="8">
<div id="715a0b10" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>similarity_scores <span class="op">=</span> embedder.compare(edf1,edf2)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
<section id="compute-a-match" class="level2">
<h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
<p>Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.</p>
<div id="0050d0a2" class="cell" data-execution_count="9">
<div id="d5a9e12a" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">True</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Using the true IDs, evaluate the precision and recall of the match.</p>
<div id="43d424db" class="cell" data-execution_count="10">
<div id="8e3888bc" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> get_results(edf1, edf2, matching):</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> <span class="co">"""Get the results for a given matching."""</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -492,7 +492,7 @@ <h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
</div>
</div>
<p>Then, we compute the match without using the row thresholds, calculating the same performance metrics:</p>
<div id="aa5a9828" class="cell" data-execution_count="11">
<div id="010d6dc9" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>_ <span class="op">=</span> get_results(edf1, edf2, matching)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">

0 comments on commit dbc3ac1
