Using a UUID or GUID as a primary key? Be careful! #1804

Merged Jun 30, 2017 (5 commits)

zaraguo (Contributor) commented Jun 24, 2017

@sqrthree Translation complete.

zaraguo (Contributor Author) commented Jun 27, 2017

@Glowin Inviting you to proofread.

canonxu commented Jun 27, 2017

Claiming a proofreading slot @sqrthree

linhe0x0 (Member)

@canonxu Sure thing 🍺

@zaraguo zaraguo closed this Jun 27, 2017
@zaraguo zaraguo reopened this Jun 27, 2017
yifili09 (Contributor)

@sqrthree Requesting a proofreading slot.

linhe0x0 (Member)

@yifili09 OK!

yifili09 (Contributor) left a comment

This also deepened my own understanding of UUIDs. Keep it up!

1. At scale, when you have multiple databases containing a segment (shard) of your data, for example a set of customers, using a UUID means that one ID is unique across *all* databases, not just the one you’re in now. This makes moving data across databases safe. Or in my case where all of our database shards are merged onto our Hadoop cluster as one, no key conflicts.
2. You can know your PK before insertion, which avoids a round trip DB hit, and simplifies transactional logic in which you need to know the PK before inserting child records using that key as its foreign key (FK)
3. UUIDs do not reveal information about your data, so would be safer to use in a URL, for example. If I am customer 12345678, it’s easy to guess that there are customers 12345677 and 12345679, and this makes for an attack vector. (But see below for a better alternative).
1. 在扩展数据库的时候,当你有多个数据库包含同一段(片)数据时,比如一个顾客集,使用 UUID 意味着该 ID 可以跨所有数据库唯一,而不是仅仅本数据库唯一。这保障了跨数据库迁移数据的安全。又比如,我曾在项目中把多个数据库分片合并到一个 Hadoop 集群中,也没有产生键的冲突。
Contributor

使用 UUID 意味着该 ID 可以跨所有数据库唯一

Suggested rewording: 使用 UUID 意味着这个 ID 在所有的数据库中是唯一标识的。 ("Using a UUID means this ID is a unique identifier across all databases.")

2. 你可以在插入之前知道你的主键值,这避免了一轮的数据查找,简化了在插入将主键值作为外键的子记录前需要知道该主键值这一场景的逻辑。
Contributor

你可以在插入之前知道你的主键值

Suggested rewording: 在插入数据之前,你就能知道这个主键的值, ("Before inserting the data, you already know the value of this primary key,")

Contributor

简化了在插入将主键值作为外键的子记录前需要知道该主键值这一场景的逻辑。

Suggested rewording: 并且简化了交易事务的逻辑,即在你插入子记录之前,因为需要使用这个主键作为一个外键,你必须要知道这个主键的值。 ("and it simplifies transactional logic: before you insert child records, you must know this primary key's value, because it is used as their foreign key.")
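
For reference while wording point 2, here is a minimal sketch of what "knowing the PK before insertion" looks like in practice. It uses only Python's standard `uuid` and `sqlite3` modules. The table and column names (`customer`, `orders`, `customer_id`) and the helper function are hypothetical, and the keys are stored as 16-byte BLOBs as an assumption, in line with the article's advice against text storage.

```python
import sqlite3
import uuid


def insert_customer_with_order(conn: sqlite3.Connection, name: str, item: str) -> uuid.UUID:
    """Insert a parent row and a child row without reading back a generated key."""
    customer_id = uuid.uuid4()  # the key exists before any INSERT runs
    with conn:  # both rows are written in one transaction
        conn.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)",
            (customer_id.bytes, name),
        )
        conn.execute(
            "INSERT INTO orders (id, customer_id, item) VALUES (?, ?, ?)",
            (uuid.uuid4().bytes, customer_id.bytes, item),
        )
    return customer_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id BLOB PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id BLOB PRIMARY KEY, customer_id BLOB REFERENCES customer(id), item TEXT)"
)
cid = insert_customer_with_order(conn, "Alice", "book")
```

With an auto-increment key, the child insert would first need the value the database assigned to the parent row. Here the foreign key is already in hand.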

3. UUIDs 不会透露数据的信息,因此被用在 URL 中也比自增整数更安全。比如,我是编号 12345678 号顾客,那么人们就会猜测编号为 12345677 和 12345679 的顾客的存在,这就提供了一种攻击向量。(但是后面我们会看到一个更好的替代品)
Contributor

攻击向量

Suggested rewording: 攻击途径 ("attack route")

Contributor Author

攻击向量 ("attack vector") is the established technical term.


A naive use of a UUID, which might look like `70E2E8DE-500E-4630-B3CB-166131D35C21`, would be to treat as a string, e.g. `varchar(36)` — don’t do that!!
一个基础的 UUID 大概是这个样子的: `70E2E8DE-500E-4630-B3CB-166131D35C21`,它将会被视为字符串对待,比如 `varchar(36)` - 千万不要这么做!
Contributor

一个基础的 UUID 大概是这个样子的: 70E2E8DE-500E-4630-B3CB-166131D35C21,它将会被视为字符串对待

Suggested rewording: 把一个 UUID(它看上去可能是 70E2E8DE-500E-4630-B3CB-166131D35C21 这样的)以字符串形式对待是缺乏经验的表现。 ("Treating a UUID, which might look like 70E2E8DE-500E-4630-B3CB-166131D35C21, as a string is a sign of inexperience.")


Think twice — in two cases of very large databases I have inherited at relatively large companies, this was *exactly* the implementation. Aside from the 9x cost in size (36 vs. 4 bytes for an int), strings don’t sort as fast as numbers because they rely on collation rules.
我想了想 - 就我所接触的就有两个来自我先前公司的大型数据库是这么设计的。除了 9 倍的多余开销外(比起 36 字节,整数类型只占了 4 字节),字符串在排序上也没有数字快,因为它们依赖校对规则。
Contributor

我想了想 -

Suggested rewording: 我再三考虑了下, ("I thought it over carefully,")

Contributor

就我所接触的就有两个来自我先前公司的大型数据库是这么设计的

Suggested rewording: 就我所接手的两个大型企业级数据库来看,他们确实是那么实施的。 ("Judging from the two large enterprise databases I inherited, that is exactly how they were implemented.")

"collation rules" means 排序规则 ("sorting rules"); what is 校对规则 ("proofreading rules") supposed to be?
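
The "9x" figure in this hunk can be checked with a short sketch using Python's standard `uuid` and `struct` modules. It assumes a single-byte character encoding for the text form, so `varchar(36)` costs 36 bytes. The variable names are illustrative only.

```python
import struct
import uuid

u = uuid.UUID("70E2E8DE-500E-4630-B3CB-166131D35C21")

text_size = len(str(u))              # canonical hyphenated text form
binary_size = len(u.bytes)           # the raw 128-bit value
int_size = struct.calcsize("<i")     # 32-bit signed integer
bigint_size = struct.calcsize("<q")  # 64-bit signed integer

print(text_size, binary_size, int_size, bigint_size)  # 36 16 4 8
print(text_size // int_size)                          # 9
```

Stored as text the key is nine times the size of a 4-byte int, while the same UUID stored as raw bytes is four times the size.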


Not just on disk but during joins and sorts these keys need to live in memory. Memory is getting cheaper, but whether disk or RAM, it’s limited. And neither is free.
不单单在磁盘上,在进行 join 和 sort 时这些 key 还需要载入到内存中。内存的确越来越便宜了,但是无论磁盘还是内存它们都是有限的,并且也都不是免费的。
Contributor

memory

Suggested translation: 存储器 ("storage device")

Contributor

RAM

Suggested translation: 闪存 ("flash memory")??

Contributor Author

Translating RAM as 内存 ("main memory") should be fine too, right? @canonxu what do you think?

Adjust the word order: 首先我们应该要意识到... ("first, we should recognize...")

超出 ("exceed") -> 溢出 ("overflow")

Is 20亿大小 ("2 billion in size") a row count, or 2 billion MB, 2 billion KB...? It is easy to misread; suggest 20亿条数据记录 ("2 billion data records").

In "PostgreSQL 和 PostgreSQL这类" ("the likes of PostgreSQL and PostgreSQL"), which category is meant? ORDBMS? Do all ORDBMSs have a native 16-byte type? Suggest changing 这类 ("this kind of") to 这些 ("these"), or simply dropping 这类.

"进行统计" ("do the statistics"): statistics of what? Suggest 评估开销 ("assess the cost").

Contributor Author

From the context, 20 亿 here refers to the number 2 billion itself, because counts above roughly 2 billion exceed the range a 4-byte int can represent.
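
To make the "around 2 billion" limit concrete, here is a small sketch using Python's standard `struct` module. The helper name `fits_in_int32` is made up for illustration.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value of a signed 4-byte integer
INT64_MAX = 2**63 - 1  # largest value of a signed 8-byte integer


def fits_in_int32(value: int) -> bool:
    """Return True if value can be packed into a signed 4-byte integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


print(INT32_MAX)                     # 2147483647
print(fits_in_int32(2_000_000_000))  # True
print(fits_in_int32(3_000_000_000))  # False
```

So the limit is a bound on the key value (about 2.1 billion), which is why the source says a plain `int` is not enough for tables with more rows than that.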


#### It’s really hard to sort random numbers
#### 随机数排序十分困难
Contributor

Suggested rewording: 对随机排列的数字进行排序是十分困难的 ("Sorting randomly arranged numbers is very hard")

For "the extra size of foreign keys adds up fast", reconsider the meaning in context. My guess is that the passage means: if UUIDs are also used as foreign keys, the space overhead grows even faster.


Another problem is fragmentation — because UUIDs are random, they have no natural ordering so cannot be used for clustering. This is why SQL Server has implemented a `newsequentialid()` function that is suitable for use in clustered indexes, and is [probably the right implementation](https://msdn.microsoft.com/en-us/library/ms189786.aspx) for all UUID PKs. It is probable that there are similar solutions for other databases, certainly PostgreSQL, MySQL and likely the rest.
另外一个问题就是分裂 - 因为 UUIDs 是随机的,他们没有天然的生成顺序因此不能够被用于集群。这就是为什么 SQL Server 实现了一个 `newsequentialid()` 方法用于集群化索引的使用,这可能就是将 UUIDs 作为主键使用的[正确实践](https://msdn.microsoft.com/en-us/library/ms189786.aspx)了。其他的数据库可能也有类似的解决方案,PostgreSQL,MySQL 肯定是有的,其他的可能有。
Contributor

另外一个问题就是分裂 fragmentation

Suggested rewording: 另外一个问题是碎片化 ("Another problem is fragmentation")

Suggest [正确实践] ("correct practice") -> [正确打开方式] ("the right way to open it", a playful idiom), hehe

Contributor Author

@canonxu Haha, that was my first thought too.
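
Since the wording of the fragmentation paragraph came up, here is a toy model of why random keys hurt an ordered (clustered) index. It is a sketch only: a sorted Python list stands in for the index, and the "sequential" keys are a hypothetical counter-prefixed layout, not how `newsequentialid()` actually builds its values.

```python
import bisect
import uuid


def fraction_appended(keys):
    """Fraction of inserts that land at the end of a sorted key list."""
    ordered = []
    appended = 0
    for key in keys:
        pos = bisect.bisect_right(ordered, key)
        if pos == len(ordered):
            appended += 1  # cheap case: the new key extends the tail
        ordered.insert(pos, key)  # otherwise existing entries must shift
    return appended / len(keys)


random_keys = [uuid.uuid4().bytes for _ in range(2000)]
# Hypothetical ordered keys: an 8-byte big-endian counter plus 8 random bytes.
sequential_keys = [i.to_bytes(8, "big") + uuid.uuid4().bytes[:8] for i in range(2000)]

print(fraction_appended(sequential_keys))  # 1.0
print(fraction_appended(random_keys))      # close to 0
```

Ordered keys always extend the tail of the index. Random keys almost always land somewhere in the middle, which in a real B-tree means page splits and fragmentation.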


I would argue that *using a PK in any public context is a bad idea.*
下面我将阐明 *在公开环境中暴露主键是十分不好的* 这一观点。
Contributor

阐明 ("make clear")

Perhaps 讨论 ("discuss")?


But there’s a far more compelling reason not to use any kind of PK in a public context: if you *ever* need to change keys, all your external references are broken. Think “404 Page Not Found”.
不在公开环境使用主键还有一个无法反驳的原因:如果你 *一旦* 需要改变这个键值,那么所有外在的引用就不可用了。想象一下 “404 页面无法找到”的情形。
Contributor

无法反驳的原因 ("irrefutable reason")

Suggested rewording: 更有说服力的原因 ("more compelling reason")

如果 ("if") and 一旦 ("once") together are semantically redundant.

yifili09 (Contributor)

@sqrthree Boss, I have finished reviewing. Wrapping up and waiting for my boxed lunch!

canonxu left a comment

Very well translated, and with real care. Nice work!

Globally unique IDs are a hard problem in database and table sharding. This article explains UUIDs in an accessible way, and I learned a lot too!


I just read a post on ways to scale your database that hit home with me — the author suggests the use of UUIDs (similar to GUIDs) as the primary key (PK) of database tables.
在阅读时,一篇谈论如何扩展数据库的文章引起了我的关注 - 作者在文中建议大家使用 UUIDs(类似 GUIDs)作为数据库表的主键。
For "just", add 最近 ("recently").



Things got really bad in one company where they had originally decided to use Latin-1 character set. When we converted to UTF-8 several of the compound-key indexes were not big enough to contain the larger strings. Doh!
在一家公司还曾发生过十分糟糕的事情,一开始他们使用 Latin-1 字符集。当我们打算转为 UTF-8 时,好几个联合索引因为太大而存不下。哦!
Suggest either 组合索引 ("composite index") or 联合索引 ("joint index"); pick one.



Indeed, my current company’s context is a perfect example of why UUIDs are needed, and why they are costly, and why exposing primary keys is an issue.
事实上,我现在公司的环境就是为什么需要 UUIDs 的最好例子,以及为什么 UUIDs 开销巨大,为什么在公开环境中暴露主键是一个问题。
环境 ("environment") -> 场景 ("scenario")


One solution used in several different contexts that has worked for me is, in short, to use both. (Please note: not a good solution — see note about response to original post below).
有一个解决方法在多个不同的场景中都起到了作用,简单来说就是,两者都用。(请注意:这不是一个好方法 - 请看下面我记录的 Chris 对原始博文回复)
Suggested rewording: 有一个在多个不同场景下都有效的解决办法,。。 ("There is a solution that has worked across several different scenarios, ...")


Then *add a column* populated with a UUID (perhaps as a trigger on insert). Within the scope of the database itself, relationships can be managed using the usual PKs and FKs.
然后 *增加一列* 用于存放 UUID(可以将其设计进插入的预处理操作里)。在一个数据库自身的范围内,可以使用普通的主键和外键来管理关系。
"perhaps as a trigger on insert": 伴随着insert操作一起插入 ("inserted together with the insert operation")

Contributor Author

My understanding is that inserting the UUID is set up as a hook on insert; translating it simply as "inserted together" feels like it loses something.
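
To ground the discussion of this sentence, here is a minimal sketch of the "use both" layout: an integer PK for internal joins plus a separate UUID column for outside references. It makes two assumptions beyond the source. SQLite has no built-in UUID generator, so the UUID is populated by application code at insert time, where the article suggests a trigger. The names `account`, `public_id`, `create_account` and `find_account` are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "          # compact internal key for PK/FK joins
    "name TEXT NOT NULL, "
    "public_id BLOB NOT NULL UNIQUE)"   # 16-byte UUID, the only externally visible key
)


def create_account(name: str) -> uuid.UUID:
    """Insert a row and return the UUID that outside systems may reference."""
    public_id = uuid.uuid4()
    with conn:
        conn.execute(
            "INSERT INTO account (name, public_id) VALUES (?, ?)",
            (name, public_id.bytes),
        )
    return public_id


def find_account(public_id: uuid.UUID):
    """Resolve an external reference to (internal id, name), or None."""
    return conn.execute(
        "SELECT id, name FROM account WHERE public_id = ?",
        (public_id.bytes,),
    ).fetchone()


alice = create_account("Alice")
print(find_account(alice))  # (1, 'Alice')
```

Internal tables keep referencing the small integer `id`, and callers outside the database only ever see `public_id`.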


In another case, we would generate a “slug” of text (e.g. in blog posts like this one) that would make the URL a little more human friendly. If we had a duplicate, we would just append a hashed value.
另一种情况,我会生成了一“段”文本(例如在像本篇一样的博文)用于 URL 使其更加对用户友好的。如果有冲突,那么只需追加一段哈希值。
With "(e.g. in blog posts like this one)", the author presumably means something like the URL of this very post:
https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439

Contributor Author

Yes, that was a slip on my part: there is an extra 在. It should read 我会生成了一“段”文本(例如像本篇一样的博文)用于 URL ("I would generate a 'slug' of text (e.g. for a blog post like this one) for use in the URL"). Do you mean I should add the text portion of the URL to the article?
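
For anyone unsure what the "slug" in this sentence refers to, here is a rough sketch of the idea: derive a URL-friendly string from the title, and append a short hash only when the plain slug is already taken. The function names, the `token` parameter (some per-record unique value, such as the record's UUID) and the 12-character suffix length are assumptions for illustration, not the author's implementation.

```python
import hashlib
import re


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def unique_slug(title: str, taken: set, token: str) -> str:
    """Return the plain slug, or the slug plus a hashed suffix on a duplicate."""
    slug = slugify(title)
    if slug in taken:
        # Hash a per-record token so repeated duplicates get different suffixes.
        suffix = hashlib.sha1(token.encode("utf-8")).hexdigest()[:12]
        slug = f"{slug}-{suffix}"
    taken.add(slug)
    return slug


taken = set()
first = unique_slug("UUID or GUID as Primary Keys? Be Careful!", taken, "record-1")
second = unique_slug("UUID or GUID as Primary Keys? Be Careful!", taken, "record-2")
print(first)  # uuid-or-guid-as-primary-keys-be-careful
```

The second call returns the same slug followed by a hyphen and 12 hexadecimal characters, which is the same shape as the article URL quoted above.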


Use integers because they are efficient. Use the database implementation of UUIDs in addition for any external reference to obfuscate.
使用整型是因为它们是高效的。另外也可将数据库实现的 UUIDs 用于混淆外部引用。 :TOBECONFIRMED
用于混淆外部引用 ("used to obfuscate external references"): what does that mean? The author presumably means making external references unpredictable, to prevent brute-force guessing? How should "obfuscate" be translated: 迷惑 ("confuse")? 模糊化 ("blur")? Needs more thought...

linhe0x0 (Member)

@zaraguo Both proofreaders have finished. You can now revise according to their comments ┏ (゜ω゜)=☞

zaraguo (Contributor Author) commented Jun 29, 2017

@yifili09 @canonxu @sqrthree Revised according to the comments. Please take another look for any remaining issues.

canonxu commented Jun 29, 2017

OK, looks good, no problems @zaraguo @sqrthree

linhe0x0 (Member) left a comment

A few tiny issues remain; please adjust them, thanks.

@@ -1,133 +1,133 @@
> * 原文地址:[UUID or GUID as Primary Keys? Be Careful!](https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439)
> * 原文作者:[Tom Harrison Jr](https://tomharrisonjr.com/@tomharrisonjr)
> * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner)
> * 译者:
> * 译者:[zaraguo](https://github.com/zaraguo)
> * 校对者:
Member

Please add the proofreader information as well.


If our goal is to scale, and I mean *really scale* let’s first acknowledge that an `int` is not big enough in many cases, maxing out at around 2 billion, which needs 4 bytes. We have way more than 2 billion transactions in each of several databases.
如果我们的目标是扩展,我是说 *真正的扩展*。那么首先让我们意识到 `int` 类型在很多情况下是不够大的。在大约 20 亿(需要 4 字节)的时候就溢出了。然而每个数据库中我们都有远超 20 亿大小的数据存在。
Member

Italics issue: please change these to bold.


Our database has plenty of intermediate tables that are mainly containers for the foreign keys of others, especially in 1-to-many relations. Accounts have multiple card numbers, addresses, phone numbers, usernames, and all that. For each of these columns in a set of table with billions of accounts, the extra size of foreign keys adds up fast.
我们的数据库用大量的关系表来存储外键,尤其是在一对多的关系中。账户表内含有多个卡号,地址,电话号码,用户名等等。对于拥有数十亿账户的一组表中的任意一列,外键的空间开销的增长都是十分快速的。
Member

『多个卡号,地址,电话号码,用户名』=>『多个卡号、地址、电话号码、用户名』 (use the enumeration comma 、 between list items)


Another problem is fragmentation — because UUIDs are random, they have no natural ordering so cannot be used for clustering. This is why SQL Server has implemented a `newsequentialid()` function that is suitable for use in clustered indexes, and is [probably the right implementation](https://msdn.microsoft.com/en-us/library/ms189786.aspx) for all UUID PKs. It is probable that there are similar solutions for other databases, certainly PostgreSQL, MySQL and likely the rest.
另外一个问题就是碎片化 - 因为 UUIDs 是随机的,他们没有天然的生成顺序因此不能够被用于集群。这就是为什么 SQL Server 实现了一个 `newsequentialid()` 方法用于集群化索引的使用,这可能就是将 UUIDs 作为主键使用的[正确打开方式](https://msdn.microsoft.com/en-us/library/ms189786.aspx)了。其他的数据库可能也有类似的解决方案,PostgreSQL,MySQL 肯定是有的,其他的可能有。
Member

『PostgreSQL,MySQL』=>『PostgreSQL、MySQL』 (use the enumeration comma 、 here as well)


I would argue that *using a PK in any public context is a bad idea.*
下面我将阐明 *在公开环境中暴露主键是十分不好的* 这一观点。
Member

Italics issue here as well.


But there’s a far more compelling reason not to use any kind of PK in a public context: if you *ever* need to change keys, all your external references are broken. Think “404 Page Not Found”.
不在公开环境使用主键还有一个无法反驳的原因:你 *一旦* 需要改变这个键值,那么所有外在的引用就不可用了。想象一下 “404 页面无法找到”的情形。
Member

Italics here too.


Then *add a column* populated with a UUID (perhaps as a trigger on insert). Within the scope of the database itself, relationships can be managed using the usual PKs and FKs.
然后 *增加一列* 用于存放 UUID(可以将其设计进插入的预处理操作里)。在一个数据库自身的范围内,可以使用普通的主键和外键来管理关系。
Member

More italics here.


But when a reference to the data needs to be exposed to the outside world, *even when “outside” means another internal system,* they must rely only on the UUID.
当需要暴露一个数据的引用到外部时,*即使这里的“外部”是另一个内部系统,*它们则必须依赖 UUID
Member

[image]

zaraguo (Contributor Author) commented Jun 30, 2017

@sqrthree done

@linhe0x0 linhe0x0 merged commit e364d79 into xitu:master Jun 30, 2017
linhe0x0 (Member)

@zaraguo Merged! Please publish it to the Juejin column soon and send me the link, so I can add your points promptly.

zaraguo (Contributor Author) commented Jun 30, 2017

@sqrthree Published to Juejin.
@canonxu @yifili09 Thanks for proofreading.

@zaraguo zaraguo deleted the translate branch June 30, 2017 09:57