<!DOCTYPE html>
<html lang="en">
<head>
<meta name="robots" content="noml">
<meta name="description" content="sign the open letter proposing noml, a specification for those who want content searchable on search engines, but not used for machine learning.">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="noml.info">
<meta name="twitter:title" content="noml open letter">
<meta name="twitter:description" content="sign the open letter proposing noml, a specification for those who want content searchable on search engines, but not used for machine learning.">
<meta name="twitter:url" content="https://noml.info">
<meta name="twitter:image" content="/images/social/header.png">
<meta name="twitter:creator" content="@mojeek">
<meta property="og:title" content="noml open letter">
<meta property="og:description" content="sign the open letter proposing noml, a specification for those who want content searchable on search engines, but not used for machine learning.">
<meta property="og:url" content="https://noml.info">
<meta property="og:site_name" content="noml.info">
<meta property="og:image" content="/images/social/header.png">
<meta charset="utf-8">
<link href="style.css" rel="stylesheet" type="text/css" media="all">
<title>NoML Open Letter</title>
<link rel="icon" type="image/x-icon" href="./images/svg/search.svg">
<script src="./signatures/signatures.js" charset="utf-8"></script>
<script>
// Fisher-Yates shuffle: reorders the array in place, so the signatures
// appear in a random order on each page load.
const shuffle = (array) => {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [array[i], array[j]] = [array[j], array[i]];
  }
};
window.addEventListener("load", () => {
  // `individuals` is expected to be defined by ./signatures/signatures.js.
  const olIndividuals = document.getElementById("individuals");
  // Clear the "Turn on JavaScript" placeholder before adding signatures.
  olIndividuals.innerHTML = "";
  shuffle(individuals);
  individuals.forEach((individual) => {
    const li = document.createElement("li");
    const a = document.createElement("a");
    const i = document.createElement("i");
    const br = document.createElement("br");
    // Only signatories who supplied a URL get a link target; the others
    // keep an anchor without an href, which renders as plain text.
    if (individual.url) {
      a.href = individual.url;
    }
    a.innerText = individual.name;
    // Avoid printing "undefined" when no affiliation was given.
    i.innerText = individual.affiliation ?? "";
    li.appendChild(a);
    li.appendChild(br);
    li.appendChild(i);
    olIndividuals.appendChild(li);
  });
});
</script>
</head>
<body>
<div class="container main-container">
<section id="header">
<br>
<h1>NoML Open Letter</h1>
<p>A specification for those who want content searchable on search engines, but not used for machine learning.</p>
<p>Publishers need better ways to indicate how they want their content to be used in search and in machine learning. robots.txt alone does not cover all use cases, so we propose a complementary approach: one that can be applied to individual webpages as desired, and that is preserved alongside the page in datasets of web content.</p>
<p><a href="https://github.com/Mojeek/noml-open-letter/issues/new?assignees=PrivacyDingus&labels=&projects=&template=sign.md&title=SIGN%3A+NAME">Sign the open letter via GitHub</a>, <strong>or if you'd prefer</strong>, you can <a href="mailto:josh@mojeek.com?subject=Sign the open letter&body=Please provide your name, a URL if you would like your name to be hyperlinked somewhere, and an affiliation (company, organisation etc.) if relevant. If you are signing the letter as a company or organisation, consider attaching a logo">sign the letter via email</a>.</p>
<h2>NoML Proposal</h2>
<h3>Four cases</h3>
<p>Blocking a user agent in robots.txt (such as the search spider BingBot) to keep content out of ML/AI products may also remove that content from the associated search results. So if you don’t want your content to be used in products which leverage machine learning, you may also lose the benefits of being listed in search results. This is an unacceptable situation, in which content creators and publishers are forced to choose between “all or nothing”. We therefore need an approach which addresses four basic use cases:</p>
<table width="100%" align="center" style="padding: 20px; border-spacing: 20px; text-align: center; font-size: 14px;">
<tr>
<td width="50%" style="border: 2px solid black; line-height: 20px;"><strong>1.</strong> Do use content for search <br> Do use content for machine learning</td>
<td width="50%" style="border: 2px solid black; line-height: 20px;"><strong>2.</strong> Do use content for search <br> <strong>Do not </strong>use content for machine learning</td>
</tr>
<tr>
<td width="50%" style="border: 2px solid black; line-height: 20px;"><strong>3. Do not</strong> use content for search <br> Do use content for machine learning</td>
<td width="50%" style="border: 2px solid black; line-height: 20px;"><strong>4. Do not</strong> use content for search <br> <strong>Do not</strong> use content for machine learning</td>
</tr>
</table>
<p>Several suggestions for how to address this issue have been put forward. Here we propose something different, simpler and, we think, fairer: a small adaptation of how the robots meta tag and the X-Robots-Tag header are already used, as we explain in detail below.</p>
<h3>Explanation and Examples</h3>
<p>Many companies scrape or crawl webpages to collect data, often without identifying themselves, so a method is needed that reaches beyond search engine crawlers and robots.txt. The proposal here is to add a new ‘noml’ value to the already-existing <a href="https://www.semrush.com/blog/robots-meta/">robots meta tag and X-Robots-Tag header</a>. Robots meta tags are already used by search engine crawlers, so companies like Google that crawl for their search engine already follow requests made in this way. For example, the <code>noindex</code> and <code>nofollow</code> values instruct search engines on how to handle a webpage.</p>
<p>Metadata is also stored in Common Crawl, whose crawl data is a very common, and often the biggest, source of data used in machine learning and to train AI models. The <code>noindex</code> value tells search engines not to include a page in their search results, even though they can crawl it; it is used when a webpage is not intended to be indexed. The <code>nofollow</code> value tells search engines not to follow the links on a webpage; it is used when the links on a page should not be considered endorsements or recommendations. Similarly, the <code>noml</code> value could be used to instruct any service (e.g. a search engine or AI builder) that no data from that page should be used in machine learning.</p>
<p>This can be simply expressed for HTML pages using:</p>
<p><code>&lt;meta name="robots" content="noml"&gt;</code></p>
<p>and for non-HTML using:</p>
<p><code>X-Robots-Tag: noml</code></p>
<p>In the meta tag, the name “robots” addresses all user-agent tokens, but you can also address an individual user-agent by name. For example, you can currently request that Google does not include a page in its search results:</p>
<p><code>&lt;meta name="googlebot" content="noindex"&gt;</code></p>
<p>and similarly you could request that Google <strong>does include a page in their search</strong> index, but <strong>does not use the data for machine learning</strong> with:</p>
<p><code>&lt;meta name="googlebot" content="noml"&gt;</code></p>
<p>and request that Microsoft <strong>does not include</strong> a page in their Bing search index and <strong>does not use</strong> the page for training, with:</p>
<p><code>&lt;meta name="bingbot" content="noindex, noml"&gt;</code></p>
<p>You might also request that OpenAI, for example, does not use the page for machine learning with:</p>
<p><code>&lt;meta name="gptbot" content="noml"&gt;</code></p>
<p>However, in this case, since OpenAI is not operating a search engine, you could (also) block them from crawling in the <a href="https://en.wikipedia.org/wiki/Robots.txt">robots.txt</a> file using:</p>
<p><code>User-agent: gptbot</code><br>
<code>Disallow: /</code></p>
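<p>As a rough illustration of how a crawler might combine the general “robots” tag with one addressed to it by name, the JavaScript sketch below collects the directives that apply to a given user-agent token and checks for <code>noml</code>. The function names and the shape of the <code>metaTags</code> array are hypothetical, and merging both tags so that the most restrictive request wins is an assumption made for illustration, not something this proposal specifies.</p>

```javascript
// Hypothetical helper: gathers every directive that applies to `agent`,
// taking both the generic "robots" tag and the agent-specific tag.
// Assumption: directives from both tags are merged (most restrictive wins).
function directivesFor(metaTags, agent) {
  const wanted = new Set(["robots", agent.toLowerCase()]);
  const directives = new Set();
  for (const tag of metaTags) {
    if (!wanted.has(String(tag.name).toLowerCase())) continue;
    for (const part of String(tag.content).split(",")) {
      const value = part.trim().toLowerCase();
      if (value) directives.add(value);
    }
  }
  return directives;
}

// Hypothetical helper: true unless a "noml" directive applies to `agent`.
function mayUseForMl(metaTags, agent) {
  return !directivesFor(metaTags, agent).has("noml");
}

// Example input: tags as they might be extracted from a page's <head>.
const tags = [
  { name: "robots", content: "nofollow" },
  { name: "gptbot", content: "noml" },
];
console.log(mayUseForMl(tags, "gptbot"));    // false
console.log(mayUseForMl(tags, "googlebot")); // true
```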
<p>Further examples are shown below, for HTML and non-HTML:</p>
<table width="100%" align="center" style="padding: 10px; text-align: center; border: 1px solid black; font-size: 14px; border-collapse: collapse;">
<tr style="border: 1px solid black; ">
<td width="70%" style="border: 1px solid black;"></td>
<td style="border: 1px solid black;">Include page in search</td>
<td style="border: 1px solid black;">Follow links</td>
<td style="border: 1px solid black;">Use in machine learning</td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots"&gt;</code><hr><code>X-Robots-Tag:</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots" content="noindex"&gt;</code><hr><code>X-Robots-Tag: noindex</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots" content="nofollow"&gt;</code><hr><code>X-Robots-Tag: nofollow</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots" content="noml"&gt;</code><hr><code>X-Robots-Tag: noml</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots" content="noindex, nofollow"&gt;</code><hr><code>X-Robots-Tag: noindex, nofollow</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✓</span></td>
</tr>
<tr style="border: 1px solid black;">
<td style="text-align: left; border: 1px solid black;"><code>&lt;meta name="robots" content="noindex, nofollow, noml"&gt;</code><hr><code>X-Robots-Tag: noindex, nofollow, noml</code></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
<td style="border: 1px solid black; font-size: 22px;"><span>✕</span></td>
</tr>
</table>
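<p>The rows of the table above can be reproduced with a small parser. The JavaScript sketch below is only an illustration: the function name is hypothetical, it accepts either the <code>content</code> value of a robots meta tag or the value of an <code>X-Robots-Tag</code> header, and it deliberately ignores every directive other than the three discussed here.</p>

```javascript
// Hypothetical helper: maps a robots directive string to the three
// permissions shown in the table. "noml" is the value proposed in this
// letter; other existing directives are ignored in this sketch.
function parseRobotsDirectives(content) {
  // Directives are treated as comma-separated and case-insensitive.
  const directives = new Set(
    (content || "")
      .split(",")
      .map((d) => d.trim().toLowerCase())
      .filter((d) => d.length > 0)
  );
  return {
    index: !directives.has("noindex"),   // include page in search
    follow: !directives.has("nofollow"), // follow links
    ml: !directives.has("noml"),         // use in machine learning
  };
}

// index: false, follow: true, ml: false
console.log(parseRobotsDirectives("noindex, noml"));
```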
<h3>Search Engine API Usage</h3>
<p>Crawler-based search engines such as Mojeek, Bing and Google also provide their results to other search engines and services via APIs. We propose that these search engines include the ‘noml’ directive within their API responses, as Mojeek would do. Search API providers should make it part of their API terms of service that results labelled in this way are not used for machine learning purposes by API end users.</p>
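<p>How an API consumer might honour such a label can be sketched in JavaScript as follows. The response shape is entirely hypothetical: the <code>robots</code> field, the function name and the example URLs are assumptions made for illustration, since no search API is known to us to expose the directive in this form.</p>

```javascript
// Hypothetical response shape: each result carries the robots directives
// found on the page in a "robots" array. This field is an assumption made
// for illustration, not part of any existing search API.
function splitByMlPermission(results) {
  const allowed = [];
  const excluded = [];
  for (const result of results) {
    const directives = (result.robots || []).map((d) => d.trim().toLowerCase());
    if (directives.includes("noml")) {
      excluded.push(result); // must not be used for machine learning
    } else {
      allowed.push(result);
    }
  }
  return { allowed, excluded };
}

// Example: only results without "noml" may enter an ML pipeline.
const response = [
  { url: "https://example.com/a", robots: ["noml"] },
  { url: "https://example.com/b", robots: [] },
  { url: "https://example.com/c" },
];
const { allowed, excluded } = splitByMlPermission(response);
console.log(allowed.map((r) => r.url));
console.log(excluded.map((r) => r.url));
```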
<h3>Conclusion</h3>
<p>This proposed solution achieves the desired goals by simply adding one value to already-existing methods, without necessarily requiring more user-agents to be used.</p>
<p>With this proposal creators and publishers can indicate separately whether they want content to be findable in search engines and/or used for machine learning.</p>
<h3 class="socials-header">Share the open letter</h3>
<table id="buttons" class="socials">
<tr>
<td><a href="https://twitter.com/intent/tweet?text=https%3A//noml.info"><img src="./images/svg/nav-soc-t.svg" width="100%" alt="an icon representing twitter"></a></td>
<td><a href="https://www.reddit.com/submit?url=https%3A%2F%2Fnoml.info&title=NoML "><img src="./images/svg/nav-soc-r.svg" width="100%" alt="an icon representing reddit"></a></td>
<td><a href="https://mastodonshare.com/?url=https://noml.info"><img src="./images/svg/nav-soc-m.svg" width="100%" alt="an icon representing mastodon"></a></td>
<td><a href="https://www.facebook.com/sharer/sharer.php?u=https://noml.info"><img src="./images/svg/nav-soc-f.svg" width="100%" alt="an icon representing facebook"></a></td>
<td><a href="https://www.linkedin.com/shareArticle?mini=true&url=https%3A//noml.info"><img src="./images/svg/nav-soc-l.svg" width="100%" alt="an icon representing linkedin"></a></td>
</tr>
</table>
<h2>Dear AI companies, Crawlers, Search Engines, ML Projects, Scrapers etc.</h2>
<p>We, the undersigned, support the adoption of this proposal, which enables creators, publishers, and all other content contributors on the web to indicate whether their content can be utilized for machine learning training and applications.</p>
<h3>Signatures from organisations and companies</h3>
<ul id="orgco">
<li id="orgcofirst"><a href="https://www.mojeek.com/" target="_blank"><img src="./images/logos/mojeek.svg" style="height:50px" alt="Mojeek logo"></a></li>
<li id="orgco"><a href="https://worldethicaldata.org/" target="_blank"><img src="./images/logos/WEDF.png" style="height:50px" alt="WEDF logo"></a></li>
<li id="orgco"><a href="https://openwebsearch.eu/" target="_blank"><img src="./images/logos/OpenwebsearchEU.png" style="height:60px" alt="OpenWebSearch.EU logo"></a></li>
<li id="orgco"><a href="https://www.metager.org/" target="_blank"><img src="./images/logos/metaGER.svg" style="height:40px" alt="MetaGer logo"></a></li>
<li id="orgco"><a href="https://andisearch.com/" target="_blank"><img src="./images/logos/andi.png" style="height:50px" alt="Andi logo"></a></li>
<li id="orgco"><a href="https://search.jojoyou.org/" target="_blank"><img src="./images/logos/PriEco.png" style="height:50px" alt="PriEco logo"></a></li>
<li id="orgco"><a href="http://x-industries.co.uk/" target="_blank"><img src="./images/logos/x-industries.png" style="height:50px" alt="X-Industries logo"></a></li>
</ul>
<h3>Signatures from individuals</h3>
<ul id="individuals">
<li>Turn on JavaScript to view signatures.</li>
</ul>
</section>
<nav class="navbar">
<div class="inner">
<!-- NAVIGATION MENUS -->
<div class="menu">
<a href="https://github.com/Mojeek/noml-open-letter/issues/new?assignees=PrivacyDingus&labels=&projects=&template=sign.md&title=SIGN%3A+NAME" class="nav-item btn-sign" target="_blank">Sign on GitHub</a>
<a href="mailto:josh@mojeek.com?subject=Sign the open letter&body=Please provide your name, a URL if you would like your name to be hyperlinked somewhere, and an affiliation (company, organisation etc.) if relevant. If you are signing the letter as a company or organisation, consider attaching a logo" class="nav-item btn-sign-email" target="_blank">Sign via Email</a>
<a href="https://blog.mojeek.com/2023/10/noml-proposal-and-open-letter.html" class="nav-item">Blog Post</a>
<a href="https://www.mojeek.com/about/press/" class="nav-item">Press</a>
</div>
<table id="buttons" class="socials">
<tr>
<td><a href="https://twitter.com/intent/tweet?text=https%3A//noml.info"><img src="./images/svg/nav-soc-t.svg" width="100%" alt="an icon representing twitter"></a></td>
<td><a href="https://www.reddit.com/submit?url=https%3A%2F%2Fnoml.info&title=NoML "><img src="./images/svg/nav-soc-r.svg" width="100%" alt="an icon representing reddit"></a></td>
<td><a href="https://mastodonshare.com/?url=https://noml.info"><img src="./images/svg/nav-soc-m.svg" width="100%" alt="an icon representing mastodon"></a></td>
<td><a href="https://www.facebook.com/sharer/sharer.php?u=https://noml.info"><img src="./images/svg/nav-soc-f.svg" width="100%" alt="an icon representing facebook"></a></td>
<td><a href="https://www.linkedin.com/shareArticle?mini=true&url=https%3A//noml.info"><img src="./images/svg/nav-soc-l.svg" width="100%" alt="an icon representing linkedin"></a></td>
</tr>
</table>
</div>
</nav>
</div>
</body>
</html>