More talk
daoudclarke committed Oct 9, 2024
1 parent 81caa10 commit c06f4c9
Showing 8 changed files with 35 additions and 91 deletions.
Binary file added img/alpha-mwmbl.png
Binary file added img/contributions.png
Binary file added img/crawler-script.png
Binary file added img/finances.png
Binary file added img/firefox-plugin.png
Binary file added img/inverted-index.png
Binary file added img/stats0.png
126 changes: 35 additions & 91 deletions index.html
@@ -28,12 +28,42 @@ <h3>Daoud Clarke</h3>
OSSym24 - 10th October 2024
</section>

<section data-background-size="contain" data-background-image="img/alpha-mwmbl.png" data-background-color="white">
<a href="https://alpha.mwmbl.org">
<h3><br><br><br><br><br>Demo</h3>
</a>

</section>

<section>
<img class="stretch" src="img/finances.png">
</section>
<section>
<h3>Part 1: technology</h3>
</section>

<section>
<ul>
<li> 500 million pages indexed
<li> 4TB index size
<li> Over 4,000 registered users
<li> Over 33,000 user curations
</ul>
</section>

<section>
<h3>Crawling</h3>
</section>

<section data-background-size="contain" data-background-image="img/crawler-script.png" data-background-color="white"></section>

<section>
<img class="stretch" src="img/firefox-plugin.png">
</section>

<section>
<h3>Indexing and searching</h3>
</section>

<section>
<img class="stretch" src="img/latency-1.png">
@@ -60,107 +90,25 @@ <h3>We need a new architecture</h3>
</section>

<section data-background-size="contain" data-background-image="img/tiny-storage.svg" data-background-color="white"></section>

<section>
<h3>Part 2: community</h3>
</section>

<section data-background-image="img/stats0.png" data-background-size="contain">
</section>

<section>
<img class="stretch" src="img/contributions.png">
</section>


<section>
<pre><code data-trim data-noescape class="stretch">
def get_key_page_index(self, key) -> int:
    key_hash = mmh3.hash(key, signed=False)
    return key_hash % self.num_pages
</code></pre>
<span>"apple" => 1680</span>
</section>
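The key-to-page mapping on the slide above can be tried without the `mmh3` package. The following is a minimal sketch, assuming a stand-in hash from Python's standard library (`hashlib.blake2b`) in place of MurmurHash3, so the page numbers it produces will not match the slide's `"apple" => 1680`. `NUM_PAGES` is a hypothetical value chosen for illustration.

```python
import hashlib

NUM_PAGES = 1_000_000  # hypothetical index size, in pages


def get_key_page_index(key: str, num_pages: int) -> int:
    """Map a search key to a page number by hashing it.

    Assumption: blake2b stands in for mmh3.hash(key, signed=False);
    any stable, well-distributed 32-bit hash fits the same pattern.
    """
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=4).digest()
    key_hash = int.from_bytes(digest, "big")  # unsigned 32-bit integer
    return key_hash % num_pages


index = get_key_page_index("apple", NUM_PAGES)
print(f'"apple" => {index}')
```

Because the page number depends only on the key and the page count, a lookup needs no separate directory structure: the same key always lands on the same page.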

<section>
<pre><code data-trim data-noescape class="stretch">
def _get_page_tuples(self, i):
    page_data = self.mmap[
        i * self.page_size +
        METADATA_SIZE:(i + 1) * self.page_size
        + METADATA_SIZE]
    try:
        decompressed_data = self.decompressor.decompress(
            page_data)
    except ZstdError:
        logger.exception(f"Error decompressing: {page_data}")
        return []
    return json.loads(decompressed_data.decode('utf8'))
</code></pre>
</section>
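The page read on the slide above can be exercised end to end with a small in-memory model. This is a sketch under stated assumptions, not the project's implementation: `zlib` replaces Zstandard so that only the standard library is needed, a `bytearray` replaces the memory-mapped file, and `METADATA_SIZE`, `NUM_PAGES`, and `write_page` are hypothetical.

```python
import json
import zlib

PAGE_SIZE = 4096     # bytes per page, as on the later slide
METADATA_SIZE = 16   # hypothetical header size in bytes
NUM_PAGES = 8        # hypothetical, kept tiny for the example


def write_page(buffer: bytearray, i: int, items: list) -> None:
    """Compress a JSON list and store it, zero-padded, in page i."""
    data = zlib.compress(json.dumps(items).encode("utf8"))
    if len(data) > PAGE_SIZE:
        raise ValueError("page contents do not fit in one page")
    start = METADATA_SIZE + i * PAGE_SIZE
    buffer[start:start + PAGE_SIZE] = data.ljust(PAGE_SIZE, b"\0")


def get_page_tuples(buffer: bytearray, i: int) -> list:
    """Read page i; return [] if the page cannot be decompressed."""
    start = METADATA_SIZE + i * PAGE_SIZE
    page_data = bytes(buffer[start:start + PAGE_SIZE])
    try:
        # decompressobj stops at the end of the stream and ignores padding
        decompressed = zlib.decompressobj().decompress(page_data)
    except zlib.error:
        return []
    return json.loads(decompressed.decode("utf8"))


buffer = bytearray(METADATA_SIZE + NUM_PAGES * PAGE_SIZE)
results = [["Apple", "https://example.com/apple", "An example extract"]]
write_page(buffer, 3, results)
print(get_page_tuples(buffer, 3))
```

Fixed-size pages make the byte offset of page `i` a simple multiplication, which is what allows a single slice of the mapped file to serve a lookup.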

<section data-background-size="contain" data-background-image="img/page.png" data-background-color="#300a24"></section>
<section data-background-size="contain" data-background-image="img/page-2.png" data-background-color="#300a24"></section>

<section>
<ul>
<li> Query: "open broadcaster software"
<li> Look up:
<ul>
<li> open
<li> broadcaster
<li> software
<li> open broadcaster
<li> broadcaster software
</ul>
</ul>
</section>
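The lookup keys listed on the slide above are the single words of the query plus each pair of adjacent words. A minimal sketch of that expansion follows; the function name `query_lookup_keys` and the lower-casing step are assumptions made for illustration.

```python
def query_lookup_keys(query: str) -> list[str]:
    """Return the single words of a query followed by its adjacent word pairs."""
    words = query.lower().split()
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams


keys = query_lookup_keys("open broadcaster software")
for key in keys:
    print(key)
```

With this scheme a query of n words produces n + (n - 1) = 2n - 1 keys. If each key maps to one page, as the hashing slide suggests, that count is also an upper bound on page reads per query.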

<section>
<h3>What is the maximum number of reads?</h3>
</section>

<section>
<h3>Will it work?</h3>
</section>

<section>
<ul>
<li> We fit around 23 results in one page of 4096 bytes
<li> Google indexes around 100 billion pages per locale
<li> We would need an index of around 16 terabytes
<li> We would need an index of around 16TB
</ul>
</section>
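The size estimate on the slide above can be reproduced with a few lines of arithmetic. This sketch assumes each indexed page occupies exactly one of the 23 result slots in a 4096-byte page, which is a simplification; the constant and function names are hypothetical.

```python
PAGE_SIZE_BYTES = 4096
RESULTS_PER_PAGE = 23
TARGET_DOCUMENTS = 100_000_000_000  # around 100 billion pages


def index_size_bytes(documents: int,
                     results_per_page: int = RESULTS_PER_PAGE,
                     page_size: int = PAGE_SIZE_BYTES) -> int:
    """Bytes needed if every document takes one result slot (assumption)."""
    pages_needed = -(-documents // results_per_page)  # ceiling division
    return pages_needed * page_size


size = index_size_bytes(TARGET_DOCUMENTS)
size_tib = size / 2**40   # binary terabytes
size_tb = size / 10**12   # decimal terabytes
print(f"{size_tib:.1f} TiB, {size_tb:.1f} TB")
```

The result is a little over 16 when measured in binary terabytes (TiB) and close to 18 in decimal terabytes, so the slide's figure of around 16 terabytes corresponds to the binary unit.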

<section>
<h3>How can we reduce crawling costs?</h3>
<h3>Part 2: community</h3>
</section>

<section>
<pre><code data-trim data-noescape class="stretch">
@router.post('/batches/new')
def request_new_batch(batch_request: NewBatchRequest) \
        -> list[str]:
    ...

@router.post('/batches/')
def post_batch(batch: Batch):
    ...
</code></pre>
</section>

<section>
<h3>Will it work?</h3>
<section data-background-image="img/stats0.png" data-background-size="contain">
</section>

<section>
<ul>
<li> We are currently crawling around 1 million pages a day
<li> We currently have about 26 active volunteers
<li> We want our index of 100 billion pages to be refreshed at least once a month
<li> We need to crawl around 3 billion pages a day
</ul>
</section>
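The crawl-rate figure on the slide above follows from dividing the target index size by the refresh period. The sketch below works through it, taking "once a month" as 30 days. The volunteer projection at the end is an assumption added for illustration: it supposes each volunteer keeps contributing at today's average rate.

```python
TARGET_PAGES = 100_000_000_000     # index of 100 billion pages
REFRESH_DAYS = 30                  # "once a month", taken as 30 days
CURRENT_PAGES_PER_DAY = 1_000_000  # current crawl rate
ACTIVE_VOLUNTEERS = 26             # current active volunteers

# Pages that must be crawled per day to refresh the whole index monthly
required_per_day = TARGET_PAGES / REFRESH_DAYS

# How many times faster than today's rate that is
scale_up = required_per_day / CURRENT_PAGES_PER_DAY

# Assumption: throughput per volunteer stays at today's average
per_volunteer = CURRENT_PAGES_PER_DAY / ACTIVE_VOLUNTEERS
volunteers_needed = required_per_day / per_volunteer

print(f"required: {required_per_day / 1e9:.1f} billion pages/day")
print(f"scale-up: {scale_up:,.0f}x the current rate")
print(f"volunteers at the current average rate: {volunteers_needed:,.0f}")
```

The required rate comes to about 3.3 billion pages a day, in line with the slide's "around 3 billion", and is several thousand times the current rate.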
<img class="stretch" src="img/contributions.png">
</section>

<section>
<h3>But will it work?</h3>
@@ -175,10 +123,6 @@ <h3>How to fly (or build a search engine)</h3>
- Douglas Adams
</section>

<section data-background-image="img/larry-page.png">
<h3>Why should the gateway to the world's knowledge be in the hands of a corporation?</h3>
</section>

</div>
</div>
