i and b vs em and strong #652

Open
snan opened this issue Jun 6, 2020 · 41 comments

Comments

@snan

snan commented Jun 6, 2020

Em and strong aren't always the right thing. Is there a way to get i or b?

I know we love semantics over rendering, but em and strong are sometimes hypercorrect. Looking at the HTML source of someone who wrote in Markdown, I keep seeing em and strong where they clearly meant i or b, or sometimes cite.

For example when introducing a new name or term in another language.

@Crissov
Contributor

Crissov commented Jun 6, 2020

Implementers (or plugin authors) could decide to use asterisk * for em / strong and underscore _ for i / b or vice versa, but I think it is too late to make this a mandatory change – also it is a distinction rather specific to HTML output.

PS: I prefer the underscore for presentational elements, because it fits well with introducing underlined u when four of those characters are used.
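
To make the suggestion above concrete, here is a minimal sketch in Python of an implementation-level choice where asterisks map to em / strong and underscores map to i / b. It is a toy, not a CommonMark parser: the `TAGS` table and the `render_inline` function are hypothetical names, the regex ignores the spec's flanking rules (so `snake_case_name` would be mangled), and nothing is HTML-escaped.

```python
import re

# Hypothetical mapping from delimiter run to HTML tag:
# asterisks -> semantic tags, underscores -> presentational tags.
TAGS = {
    "**": "strong",
    "__": "b",
    "*": "em",
    "_": "i",
}

# Toy pattern: a delimiter run, some text, then the same run again.
# Double delimiters are tried first so they win over single ones.
_PATTERN = re.compile(r"(\*\*|__|\*|_)(.+?)\1")


def render_inline(text: str) -> str:
    """Render a small subset of inline emphasis to HTML (illustration only)."""

    def repl(match: re.Match) -> str:
        tag = TAGS[match.group(1)]
        inner = render_inline(match.group(2))  # handle nested emphasis
        return f"<{tag}>{inner}</{tag}>"

    return _PATTERN.sub(repl, text)


print(render_inline("_Hamlet_ is a *really* great play."))
```

Swapping the values in `TAGS` gives the "vice versa" variant; the point is only that the choice can live in one table in the renderer.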

@alerque

alerque commented Jun 6, 2020

Inkwells have been drained and spools of paper emptied rehashing the arguments for whether <em>/<strong> or <i>/<b> should be the default HTML output for the markdown elements. The lack of reference to prior discussions doesn't really endear your argument to anybody; most of the arguments for and against either tag pair cut both ways.

The short version is that this is an HTML distinction that doesn't exist in Markdown, and use cases vary, so each application may make different choices on how to map the elements to another format. Markdown only has one pair of semantic elements (in spite of having two syntax options) while HTML has two. You can really only map to one of them at a time. Many rendering engines give you the option, or a way to filter tags and write them how you please. Doing both is not really an option at this point for legacy reasons.

@snan
Author

snan commented Jun 6, 2020

Personal attacks and assumptions about what I've researched are not appropriate. I'm not new to the topic of what elements Markdown should use. I'm just new to CommonMark specifically.

HTML has two

With one being a not-always-valid subset of the other.

@alerque

alerque commented Jun 6, 2020

I wasn't making a personal attack. I did assume since you didn't even hint at knowing any of the background that you might not be aware of it. In any event if you are aware of some of the background then surely you know jumping in with a dogmatic assertion that one set of tags is better isn't going to resolve things.

With one being a not-always-valid subset of the other.

No, one is not a subset of the other. It's more like two competing standards, one focusing more on structural semantics and the other on presentation and legacy. If one were a subset of the other, the superset would always be interchangeable with some loss of meaning. Such is not the case: one could intend a kind of emphasis that is not supposed to be styled with italics, and, just as well, italics may be used for things other than emphasis.

@snan
Author

snan commented Jun 6, 2020

I didn't mean to step right into a fight here. I love markdown and I know there is a lot of hard work behind it (and projects adjacent to it) over the years. I wouldn't care about this issue if I wasn't invested in the language and seeing it as the future of markup.

And, I understand that I'm probably a decade too late (or I don't know when the CommonMark project started). It seems pretty locked in. But this language might still be in use fifty years from now. I hope Markdown's future is longer than its past.

The W3C puts it like this in the HTML standard:

The i element represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose in a manner indicating a different quality of text, such as a taxonomic designation, a technical term, an idiomatic phrase from another language, transliteration, a thought, or a ship name in Western texts.

Emphasis is a subset of that.

But yeah, their description of b isn't a superset of strong emphasis on the other hand, although historically b has been used that way.

I tried to be careful when phrasing the original post in this issue thread. I wrote "Is there a way to get i" rather than "em is always wrong".
Seeing

<em>Hamlet</em> is a great play. It has a certain <em>je ne sais quoi</em> about it.

in source code is not correct. Seeing

"Wow, this <i>really</i> is a warm day!"

has legacy precedent and is not wrong per se, even though I'd rather use em there.

Markdown by its very nature as a plaintext-adjacent, "email tradition based" language does have its roots in presentation and legacy rather than structural semantics, which is why I feel that a way to easily make i or b is appropriate. Just as easily as I would write

_Hamlet_ is a great play, it has a certain *je ne sais quoi* about it.

in an email.

But it's for the sake of structural semantics that this matters. A web-scrapin' robot going through web pages and trying to scan for what is emphatic text might go "Wow, people back in the early 21st century really got stoked about their French phrases!"

Yes, i or b is sometimes a little blunt, sometimes a little non-specific, sometimes not optimal. But when I overuse em or strong that's sometimes flat out a falsehood, sometimes flat out what I don't mean.

I've seen some semantical horror stories over the years, like people using h1 for centering images (because that particular forum had h1 headers centered) or h6 when semantically they should've used h1 "because they think the smaller letters look cuter". There's no way to stop all such misuse, but we don't have to build it into the language, either.

@alerque

alerque commented Jun 6, 2020

You're talking to somebody who uses custom Pandoc filters to overload *...* and _..._ to render to different markup, and *_..._* to render a third kind of highlighting that is neither those two nested nor <strong>/<b>. I realize there is a use case for more precise semantics than Markdown provides for; I just don't think there is any way to codify them as "common". This project isn't about inventing new markup (if anything it's a downgrade from the Pandoc + extras flavor I use) but about standardizing the exact usage of ambiguous bits.

The way you get <i> instead of <em> is by requesting it from your tooling. The spec and reference implementation default to one pair of tags. I believe the reason for that is that it's more flexible semantic markup with less direct locking to a specific presentation style, but you'd have to look up the actual discussions to see if that was the deciding factor. Since Markdown has no way to differentiate between them, it is outside the scope of CommonMark to specify when to use which. You can either use tooling that lets you pick or post-process the output.
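
Post-processing the output, the last option mentioned above, could look roughly like the following Python sketch. The function name `presentational` and the `_SWAP` table are hypothetical. It rewrites tags with a regular expression, which is only reasonable for the simple, well-formed HTML a Markdown renderer emits; it would also rewrite `<em>` tags that an author typed deliberately as raw HTML, so it is a blunt instrument.

```python
import re

# Hypothetical tag substitution applied to already-rendered HTML.
_SWAP = {"em": "i", "strong": "b"}

# Matches <em>, </em>, <strong>, </strong>, optionally with attributes.
# The required whitespace or ">" after the name keeps <embed> untouched.
_TAG = re.compile(r"<(/?)(em|strong)(\s[^>]*)?>", re.IGNORECASE)


def presentational(html: str) -> str:
    """Replace em/strong with i/b in rendered HTML (illustration only)."""

    def repl(match: re.Match) -> str:
        slash = match.group(1)
        name = match.group(2).lower()
        attrs = match.group(3) or ""
        return f"<{slash}{_SWAP[name]}{attrs}>"

    return _TAG.sub(repl, html)


print(presentational("<p><em>Hamlet</em> is a <strong>great</strong> play.</p>"))
```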

@wooorm
Contributor

wooorm commented Jun 7, 2020

To go back to the first question:

Em and strong aren't always the right thing. Is there a way to get i or b?

I do think they often are the right thing, but indeed, they aren’t always. To get <i> or <b>, you can type those in literally, because both original Markdown and CommonMark have a “breakout” feature where it accepts HTML. As others have said, tools could allow options (I particularly like @Crissov’s idea: #652 (comment))

I do think it’s unfortunate that CommonMark does not say anything about semantics. And that its definition (“6.4 Emphasis and strong emphasis”) is not aligned with HTML (In HTML, nested emphasis is used for “strong” emphasis, whereas the strong element means importance, seriousness, or urgency).

@Crissov
Contributor

Crissov commented Jun 7, 2020

What the CM reference implementation should do, though, is to retain information in its AST about which character has been used in the source.

PS: https://talk.commonmark.org/t/em-strong-vs-i-b-or-cite-dfn-etc/1242 https://talk.commonmark.org/t/revisting-underline-healthcare-documents/3078/3
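
As a rough illustration of what "retaining the delimiter in the AST" might mean, here is a sketch in Python. The node classes (`Text`, `Emphasis`), the `delimiter` field, and the `underscore_is_presentational` option are all hypothetical; this is not how any actual CommonMark implementation is structured. The idea is only that once the parser records which character was typed, the choice between em and i can be deferred to the renderer.

```python
import html
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Emphasis:
    children: List["Node"]
    delimiter: str        # "*" or "_", exactly as typed in the source
    strong: bool = False  # True for ** or __


Node = Union[Text, Emphasis]


def to_html(node: Node, underscore_is_presentational: bool = False) -> str:
    """Render a node; the delimiter only matters if the option is switched on."""
    if isinstance(node, Text):
        return html.escape(node.value, quote=False)
    inner = "".join(to_html(c, underscore_is_presentational) for c in node.children)
    if underscore_is_presentational and node.delimiter == "_":
        tag = "b" if node.strong else "i"
    else:
        tag = "strong" if node.strong else "em"
    return f"<{tag}>{inner}</{tag}>"


title = Emphasis([Text("Hamlet")], delimiter="_")
print(to_html(title))        # default behaviour: both delimiters give <em>
print(to_html(title, True))  # opt-in behaviour: underscore gives <i>
```

With the option off, output is identical to today's behaviour, so retaining the information costs existing users nothing.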

@vassudanagunta

@snan I agree with most everything if not everything you said. I agree with the following in spirit

It seems pretty locked in. But this language might still be in use fifty years from now. I hope Markdown's future is longer than its past.

but I don't think Markdown will last even another ten unless it evolves*. It's still mostly used by technical types, myself included, who are comfortable writing for machines -- that is to say, quite used to and adept at thinking in terms of, What do I need to do to get the machine to do what I want?. Markdown was definitely a step in the right direction away from HTML for authoring. But we need to make more steps.


*I'm not sure it can. I think/hope something Markdown-like will replace it. A bit of a reboot is necessary.

@snan
Author

snan commented Jun 9, 2020

I do think they often are the right thing, but indeed, they aren’t always. To get <i> or <b>, you can type those in literally, because both original Markdown and CommonMark have a “breakout” feature where it accepts HTML.

I was under the (mistaken?) impression that that was for HTML output only; i.o.w. it's more of a "passthrough" feature than a "breakout" feature. In pandoc 2.5, which I have at hand, when compiling the text to LaTeX, it just drops those <i> tags.

somebody who uses custom Pandoc filters

I do that too, on my own system, (lua ftw) but the reason I just found out about CM is that Stack Exchange announced that they are going to adopt it and I was like "OK, so it's no longer Gruber that I have to go bug about this".

@snan
Author

snan commented Jun 9, 2020

I think/hope something Markdown-like will replace it. A bit of a reboot is necessary.

I'm seeing a lot of sites switching over to wysiwyg or wysiwym but I'm not wholly on board with that. I love markup languages.

@snan
Author

snan commented Jun 9, 2020

In HTML, nested emphasis is used for “strong” emphasis, whereas the strong element means importance, seriousness, or urgency

Wow, so it is a subset of b after all!

@wooorm
Contributor

wooorm commented Jun 9, 2020

I was under the […] impression that that was for HTML output only […] In pandoc 2.5, which I have at hand, when compiling the text to LaTeX, it just drops those <i> tags.

That impression is correct: though when going to LaTeX, it doesn’t really matter whether <strong>, <b> or something else is used, no? These semantics matter when going to HTML, in which case, HTML tags are fine?

Wow, so it is a subset of b after all!

The HTML spec also says on <b>:

The b element represents a span of text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood, such as key words in a document abstract, product names in a review, actionable words in interactive text-driven software, or an article lede. […] The b element should be used as a last resort when no other element is more appropriate.

I don’t think I agree that it’s good to see <strong> as “inheriting” from / subset of <b>. The last quoted sentence especially makes it sound to me as if defaulting to <b> is a worse approach.

@snan
Author

snan commented Jun 9, 2020

That impression is correct: though when going to LaTeX, it doesn’t really matter whether <strong>, <b> or something else is used, no? These semantics matter when going to HTML, in which case, HTML tags are fine?

It just drops the tags. So if you want to publish to both TeX and HTML you're sol if you use <i> tags.

The b element should be used as a last resort when no other element is more appropriate.

This language to me also implies fallback, catchall, default. When you can be more specific, you should. With a visual/presentation based markup like the email-derived **asterisks** you can't be that specific, and you can't easily select an appropriate element. Not that the W3 spec's specific wording is the be-all-end-all of my argument here, that'd be taking "appeal to authority" a bit far. History, legacy, intent and spirit of the HTML language is also relevant.

@snan
Author

snan commented Jun 9, 2020

As alluded to upthread, we know that through the life-changing magic of CSS, em and strong aren't strict subsets of i and b respectively. You can style it to use underlining or small caps to emphasize. So I don't mean a strict subset, I mean… kinda a subset. It's correct to say that one is semantics and the other is presentation.

It's just that

  1. the semantic elements cover fewer use cases. And very many of those use cases are a subset of the much larger set of use cases that the presentation-based elements cover.
  2. markdown's historical precedent, "plaintext email formatting", is a presentation-based language.

I'm not disputing that we want semantics. I just don't want wrong semantics.♥

@snan
Author

snan commented Jun 9, 2020

I'm also definitely not saying that the solution is that markdown's output for em and strong should instead always be i and b. I've tried avoiding taking that position in this thread. It's what I would do, but I realize that that's a compromise with some serious downsides, and I'm open to other solutions.

@wooorm
Contributor

wooorm commented Jun 9, 2020

It just drops the tags.

That to me sounds like a Pandoc problem, which I was under the impression could turn HTML into TeX.

I'm also definitely not saying that the solution is that markdown's output for em and strong should instead always be i and b.

I do see that you never proposed that in posts; but to me the title of this issue, “i and b vs em and strong”, pits them against each other.

I do think em and strong are better defaults than i and b, but I recognize they aren’t always. I would say that CommonMark talking about semantics is an acceptable solution, ushering users to care about semantics instead of presentation. And that i and b created according to @Crissov’s suggestion would be a welcome addition in userland.

@jgm
Member

jgm commented Jun 9, 2020

That to me sounds like a Pandoc problem, which I was under the impression could turn HTML into TeX.

Pandoc can indeed convert HTML to LaTeX. However, here the input format is Markdown, and pandoc drops raw HTML when rendering to non-HTML formats. (This behavior is at least sometimes what you want.)

However, you can always use a lua filter that converts these raw HTML nodes to something that makes sense in your target format.
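
To give a feel for the kind of conversion such a filter performs, here is a standalone Python sketch that maps a few raw inline HTML tags to LaTeX commands. It does not use pandoc's filter API at all: the `LATEX` table and the `html_inline_to_latex` function are hypothetical, it works on plain strings rather than on pandoc's raw HTML nodes, it does no LaTeX escaping, and it does not handle a tag nested inside the same tag.

```python
import re

# Hypothetical mapping from inline HTML tags to LaTeX commands.
LATEX = {
    "i": r"\textit",
    "b": r"\textbf",
    "cite": r"\textit",
}

# Matches <i>...</i>, <b>...</b>, <cite>...</cite> without attributes.
_INLINE = re.compile(r"<(i|b|cite)>(.*?)</\1>", re.DOTALL)


def html_inline_to_latex(text: str) -> str:
    """Rewrite a few raw inline HTML tags as LaTeX commands (illustration only)."""

    def repl(match: re.Match) -> str:
        command = LATEX[match.group(1)]
        inner = html_inline_to_latex(match.group(2))  # different tags may nest
        return f"{command}{{{inner}}}"

    return _INLINE.sub(repl, text)


print(html_inline_to_latex("<i>Gomphocarpus physocarpus</i> is a balloon plant."))
```

In a real setup the same mapping would live inside the filter and run only when the target format is LaTeX, so the HTML output keeps its tags unchanged.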

@snan
Author

snan commented Jun 9, 2020

I love lua♥
We also discussed this for pandoc specifically over on pandoc's issue tracker.

@snan
Author

snan commented Jun 19, 2020

Here's just one idea (just green hat brainstorming for a solution here):

What if </i> and </b> and </cite>, and their respective opening tags (attributes could be dropped) could be elevated to be part of the language instead of seen as passing HTML through?

@snan
Author

snan commented Jun 19, 2020

I prefer the underscore for presentational elements, because it fits well with introducing underlined u when four of those characters are used.

Intuitively I feel the same way; I usually do think "emphasis" when I use the asterisks, and do usually think presentationally cursive when I use the underscore (sometimes that part of my brain is sloppy and thinks presentationally cursive when it should be thinking emphasis ← wow, I just did it in this sentence involuntarily, those were underlines just then).

However, in some implementations asterisks work inside of words like this and underscores don't, like t_hi_s. Are people more likely to use presentational cursive in words or emphasis? I guess emphasis so this paragraph isn't much of a "however" and instead should be an "additionally" since I come down on the same divide as you do, Crissov.

As for it being too late to change, I wouldn't know if that's true. Crystal ball is on the fritz over here.

@Crissov
Contributor

Crissov commented Jun 19, 2020

All proper implementations of Commonmark support asterisks inside words (but not underscores), while only some implementations of Markdown do.

@snan
Author

snan commented Jun 19, 2020

Which only strengthens what I was trying to say about that, rather than contradicting it.♥

@snan
Author

snan commented Jul 9, 2020

What if </i> and </b> and </cite>, and their respective opening tags (attributes could be dropped) could be elevated to be part of the language instead of seen as passing HTML through?

Oh, I just saw that this is getting downvotes. I'd rather find a perfect solution than a compromise that no one is truly happy with, but it's frustrating that we aren't getting any nearer to a solution here. For those who use pandoc for HTML only, it's not a big problem because they can write italics manually, but it's difficult when we want to use the same source documents for standards-compliant HTML and for ConTeXt or LaTeX.

@snan
Author

snan commented Aug 22, 2020

A month has passed and I find I'm OK with writing <i>, <b> and <cite> manually. It's a strength of Markdown that the non-specific syntax is easy to remember and that there is redundancy. Textile and YAML are stressful for me to write, languages where I need to get everything just so, while Markdown is chill.

However, there are many times where the i, b or cite is getting lost. On Reddit, on Stack Exchange, and sometimes in Pandoc. That's why I wanted i, b and cite to become "part of the language" or at least some sort of recommendation that implementations don't throw away this information.

@snan
Author

snan commented Mar 27, 2023

Implementers (or plugin authors) could decide to use asterisk * for em / strong and underscore _ for i / b or vice versa

It would be good if this was made explicit.
I misunderstood what was said here. I thought it said implementers can decide to use */_ for i and **/__ for b. That's what I want.

Here is a thread where that has been an issue.

You're talking to somebody who uses custom Pandoc filters to overload *...* and _..._ to render to different markup

That doesn't help someone who is posting on Reddit or Stack Exchange or the hundreds of other sites where these render into em. CommonMark implementations fill the web with markup like <em>Ginkgo biloba</em> is a tree never mentioned in Romeo and Julia.

Example 393 in CommonMark's own spec is evidence of this. The call is coming from inside the house! 😱

snan added a commit to snan/commonmark-spec that referenced this issue Mar 27, 2023
@wooorm
Contributor

wooorm commented Mar 27, 2023

That's why I wanted i, b and cite to become "part of the language" or at least some sort of recommendation that implementations don't throw away this information.

If by “part of the language” you mean a new syntax, I’d probably be against it. I worry that the grammar will become too crowded. Depending on what design you come up with, it’s either easy to type, which likely means it would break lots of existing markdown, or it’s complex to type, in which case I’d prefer something like generic directives.

For a recommendation, I dunno.
But I recognize that lots of people are looking for recommendations on what to do, but the spec currently doesn’t want to get involved in those decisions.
So maybe an appendix for such things might be useful. For example talking about semantics (ref: #652 (comment))

I misunderstood what was said here. I thought it said implementers can decide to use */_ for i and **/__ for b. That's what I want.

I’m not quite sure what you’re saying. To phrase it differently: I am in favor of implementations adding options to use i instead of em, and b instead of strong: #652 (comment).

I don’t think we need to add that everywhere in the spec though. I don’t think we need to describe that implementations are free to use div instead of p, or h2 instead of h1. Etc.

Example 393

Can you clarify what you don’t like about that example?

@snan
Author

snan commented Mar 27, 2023

Example 393

Can you clarify what you don’t like about that example?

Sure, thanks for the question, that's illustrative of the issue so it's good to dig in a li'l deeper:

The example is:

<p><strong>Gomphocarpus (<em>Gomphocarpus physocarpus</em>, syn.
<em>Asclepias physocarpa</em>)</strong></p>

That is not correct HTML, which should be:

<p><strong>Gomphocarpus (<i>Gomphocarpus physocarpus</i>, syn.
<i>Asclepias physocarpa</i>)</strong></p>

or, depending on the context, maybe even:

<p><b>Gomphocarpus (<i>Gomphocarpus physocarpus</i>, syn.
<i>Asclepias physocarpa</i>)</b></p>

Linnaean names, like that Latin name for balloon plant used in that example, are always marked italics, cursive, oblique, or otherwise text-decorated, but not emphasized; <em> is semantically wrong for them. I can not correctly write that plant's Latin name here on GitHub since there is currently no way, that I know of, to emit <i>.

<i> and <b> are good fallbacks. They only indicate style. <em>, <cite>, <strong> are for when you specifically wanna indicate the semantics of emphasis, citation, or strong emphasis.

Referring to a poodle as "a dog" is slightly weird but not that bad and it's technically correct (and that's what we're doing when we're using <i> when we mean <em>).

Referring to a collie as "a poodle" is, on the other hand, quite wack (but that's what we're doing when we're using <em> for a Linnaean name or for a citation or for a foreign-language phrase).

And before someone asks: "But we should express semantics, not style. I heard someone back in the nineties say that a lot of the time people are wrongly using <i> when we should be more precise and use <em>". Yes, that's true. Em is more specific than i and is better to use, but only when we know for sure that we mean emphasis.

Yes, it's true that <em> is the most common one. 90% of the time it's what you mean. But just because you are in a town where 90% of the dogs are poodles it doesn't turn a collie into a poodle. A collie is still a non-poodle dog, just like a citation or a foreign phrase is still a non-emphasis use of italics.

I backed off from this argument a few years ago because of this argument: "We support raw HTML so people can type out <i> or <cite> or <b> when they mean <i> or <cite> or <b>, and they can use the shorthand * or _ for the most common one, which is <em>, and ** and __ for the second-most common one, which is <strong>."

But two things are becoming clear to me.

1A. People are using CommonMark-derived converters in places where raw HTML is (and should be) turned off, like on public forums and comment sections.
1B. Implementors of those public forums are referring to this spec saying "I'm just doing what CommonMark says".

2. Not everyone is, wants to be, or needs to be a linguistics nerd. People shouldn't have to learn the specific minutiae of when to use em, cite, or i. They just want the text to look slanted, so they jam stars or underscores around it. Making * and _ be <i> matches their expectations.

That's why my recommendation is this:

Sites where raw HTML is turned off (as it should be, for public text inputs) should emit <i> for * and for _, and <b> for ** and for __.

Installations where markdown is used as a tool for writers, where it's a shortcut for HTML as opposed to a replacement for it, and raw HTML is allowed, may optionally continue to emit <em> and <strong> or have a flag for that behavior.

That's what I would use for my own blog where I can type out <i>, <cite>, or <b> manually, as needed, and most of the time I would get the default, <em>. I just checked, and I use <em> 70% of the time, <cite> 20% of the time, and <i> 10% of the time, so it's appropriate for me to have * and _ emit em since I know how to get the others when I need them (I even have a shortcut that I bolted on to Emacs markdown-mode to get them as raw HTML). But even then, that's not necessarily the best for all installations, depending on how nerdy the users of that tool are expected to be.

Not everyone should have to learn this stuff but that doesn't mean it's OK that the web is littered with wrong semantics like <em>Gomphocarpus physocarpus</em>.

That's more wrong than I'm <i>really</i> tired.

If by “part of the language” you mean a new syntax, I’d probably be against it.

Yeah. It became clear upthread that that particular idea (elevating <i> and <cite> and <b> from being seen as raw HTML to being seen as first-class markdown language constructs) was not popular even with those who otherwise agree with me, and I've accepted that that idea is not gonna fly.

I’m not quite sure what you‘re saying

I want to be able to use italics and bold on Reddit, StackExchange, here on GitHub, and dozens of other sites that pass the buck by saying "We're only doing what CommonMark says".

I don’t think we need to add that everywhere in the spec though. I don’t think we need to describe that implementations are free to use div instead of p, or h2 instead of h1. Etc.

It's becoming clearer and clearer to me that we do need to be explicit about that.
C.f. this Comrak pull request.

Summary:

It should be i and b instead of em and strong (at least on most of the websites out there like GitHub, Reddit, StackExchange, wikis etc).

I like that * and _ both mean the same thing, that * can be used intraword and _ can't, etc. That's all good. I just don't want to call collies "poodles".

@wooorm
Contributor

wooorm commented Mar 27, 2023

Latin name[s] […] are always marked italics

OK, so there is a typographic convention of how things should look.
I want to stress that <i> in HTML, does not mean italics, or in any way how things look. This is perhaps pedantic, but if you want to express italics, use <span style="font-style:italic">, or for oblique use <span style="font-style:oblique">.
<i> is about “offset[ing] from the normal prose”. Latin names aren’t normal English, that’s why they can be marked as <i>: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-i-element.

but not emphasized

I don’t see any reason to conclude that Latin names must never be marked as stress emphasis as described by the HTML spec: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element.

I can understand that it might be better to remove things with certain typographic conventions here. Perhaps:

My **cats (*Cheddar* and *Whiskers*)**.
<p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p>

…might be an improvement. I am sure there’s an example we can think of that you accept that embeds <em> inside <strong>, while not using Latin names.

I can not correctly write that plant's Latin name here on GitHub since there is currently no way, that I know of, to emit <i>.

You can type <i>Gomphocarpus physocarpus</i> here: Gomphocarpus physocarpus.

They only indicate style.

You answer this yourself: “But we should express semantics, not style.” HTML is about semantics, not about presentation.


1A
1B

Sure

2

Using your own terms around poodles and collies: just because 90% of people don’t give a hoot about semantics, doesn’t mean we need to remove all semantics and go with HTML 2 again.

That's why my recommendation is this:

I have supported this: #652 (comment)

but that doesn't mean it's OK that the web is littered with wrong semantics like <em>Gomphocarpus physocarpus</em>. That's more wrong than I'm <i>really</i> tired.

I am not sure why you deem one more or less wrong than the other. Both can be right. Both can be wrong.

It became clear upthread that that particular idea ([…]) was not popular even with those who otherwise agree with me, and I've accepted that that idea is not gonna fly.

If you’re interested in a markdown-like language that does make separate tags a part of the language, you might enjoy https://mdxjs.com.

I want to be able to use italics and bold on Reddit, StackExchange, here on GitHub, and dozens of other sites that pass the buck […]

It's becoming clearer and clearer to me that we do need to be explicit about that. C.f. this Comrak pull request.

That’s not what your PR there does. You break CommonMark there by changing everything for everyone.
Your feature request is to do this optionally, which is acceptable to the maintainer there: kivikakk/comrak#285 (comment)

@snan
Author

snan commented Mar 27, 2023

<i> is about “offset[ing] from the normal prose”.

Yes, that's a really good way to phrase what the semantics of <i> is about! Thank you, that wording makes my case (that we should emit <i> by default) a lot stronger.

This is perhaps pedantic, but if you want to express italics, use <span style="font-style:italic">, or for oblique use <span style="font-style:oblique">.

Right, offsetting from the normal prose is what we want, as opposed to a specific visual representation of that offset.

I don’t see any reason to conclude that Latin names must never be marked as stress emphasis as described by the HTML spec: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element.

I mean, they can be part of stress, like you could say “Is that a <em><i>Tyrannosaurus Rex</i></em>?” the same way you could say “Is that a <em>spider</em>?”, but marking them as Latin names is done with <i>, not <em>. It is not correct to use stress to offset them.

I've told this story before, but I remember an old social media site (now defunct) in the early 00s where I saw someone had managed to center an image on their profile, something that the dinky markup of the time didn't allow. But when I looked under the hood, I saw to my shock & horror that they had marked the image as an h1, which had caused the site CSS to center it. That's not what h1 means. And that's as bad as using stress to mark Latin names. Using italics to mark them, sure. Because the point is to offset them from prose. (I wrote as much in my previous post, saying "italics, cursive, oblique, or otherwise text-decorated". I've seen them underlined in old type-written manuscripts and that's fine too, for example. And <i>+CSS can do that.)

I can understand that it might be better to remove things with certain typographic conventions here. Perhaps:

My **cats (*Cheddar* and *Whiskers*)**.
<p>My <strong>cats (<em>Cheddar</em> and <em>Whiskers</em>)</strong>.</p>

…might be an improvement.

That is not better. That'd be super weird, semantically, to stress their names that way.

I am sure there’s an example we can think of that you accept that embeds <em> inside <strong>, while not using Latin names.

Yeah, I can think of a few. The other examples use text like foo and bar and that'd be fine here. Using names is not good.

You can type <i>Gomphocarpus physocarpus</i> here: Gomphocarpus physocarpus.

If that's true, GitHub is not an applicable example for this problem. But there are many other sites out there where that's not possible because they have turned off raw HTML, and, there are also users who don't understand (and shouldn't have to understand) when to use which.

Using your own terms around poodles and collies: just because 90% of people don’t give a hoot about semantics, doesn’t mean we need to remove all semantics and go with HTML 2 again.

Cite and em were added to HTML at the same time as i and b were, with HTML 2 (as you know, since you linked to the HTML 2 RFC which does mention em and strong).

That's why my recommendation is this:

I have supported this: #652 (comment)

but that doesn't mean it's OK that the web is littered with wrong semantics like \<em\>Gomphocarpus physocarpus\</em\>. That's more wrong than I'm <i>really</i> tired.

I am not sure why you deem one more or less wrong than the other. Both can be right. Both can be wrong.

Being overly broad is less wrong than being specific-but-wrong. Calling a poodle a "dog" is less wrong than calling a collie a "poodle".

If you’re interested in a markdown-like language that does make separate tags a part of the language, you might enjoy https://mdxjs.com.

The problem isn't my own websites where I have control over what Markdown implementations to use. I personally already have a setup that lets me write my choice of em, cite, i, strong, or b.

The problem is sites like Reddit, StackExchange and many, many others where A: users have no way to type i or cite as distinct from em, and B: they shouldn't have to, they shouldn't have to learn to do that nor to learn to understand hyper nitty-gritty semantics perfectly. And there's no way to automatically detect when they mean cite or em or i so I propose we use i. Offsetting from prose is what they want, even though they might mean to do that offsetting for emphasis purposes 70% of the time.

In hindsight it was a bad idea for HTML to create em and cite and strong tags because they presupposed every single formatted text online needs to go through an editor with enough linguistics chops to distinguish between which to properly use when. That's fine for institutions but an unreasonable requirement for a discussion site or other public-writable spaces.

I'm a linguist, so I can nerd out enough to know when to use em and when to use cite and when neither is applicable and I need to use the superset, i. And even then I make mistakes every now and then. But I'm not a biologist so, to reuse the poodle/collie example: if every website like Reddit or StackExchange required me to use one syntax when talking about poodles, one syntax when talking about collies, and another when talking about non-poodle non-collie dogs, I'd be in trouble.

Letting * and _ be <i> lets people keep using * and _ in the way they think they are already using them: to offset text from normal prose.

That’s not what your PR there does. You break CommonMark there by changing everything for everyone. Your feature request is to do this optionally, which is acceptable to the maintainer there: kivikakk/comrak#285 (comment)

The maintainer hadn't written that response yet when I posted here.

I think the default should be i and b, with em and strong being tucked away as an option (only to be turned on by people who know exactly what they are doing and who can emit cite and i and b by other means, such as raw HTML).

Changing everything for everyone is the point. There's a lot of collies marked "poodle" out there on the web. If they can be turned into "dogs" that'd be a win for semantics.

These sites look to CommonMark as an authority on this. They're like "we're emitting em and strong because that's what CommonMark tells us to do". That wasn't necessarily CommonMark's intent (which was more to clarify the specifics of nesting and overlapping and so on), but that's what has happened, and that gives CommonMark a responsibility to clear this up.

@alerque

alerque commented Mar 27, 2023

Whatever tool you give by default people are bound to conflate them. But if you give them a tool that is inherently presentational rather than semantic what you will end up with is a bunch of presentational markup and no structure. If you give them a semantic tool and they use it wrong they will produce the opposite problem.

As a dude producing books the reason I force authors and editors to use Markdown and then convert the results is specifically to take away the formatting tools and make them concentrate on the content. I don't want them fiddling with what font size to use for a heading, I want them to think about whether this heading is a section or a subsection. Typesetting will take care of the style.

Markdown is especially useful in this context because it gives almost exclusively structural markup options. For the few cases where more is needed, divs or spans with classes can be used. For example I have authors use *word* as normal for emphasis, but [conra]{lang="la"} for Latin words. The typeset output often makes them look the same presentationally, but some digital formats make a useful distinction (such as ePubs and screen readers).

I would suggest passing out the semantic markup tooling by default and making everybody else adapt is better than giving people presentational markup in Markdown. Either way you will get mistakes and stuff marked up wrong, but one fits the pattern of other provided tooling in outputting semantic markup.

@snan
Author

snan commented Mar 27, 2023

Again, <i> indicates "offset from prose". It is also a semantic. It's just more broad than something like <cite>.

If Markdown had been written by a level-headed never-emphatic often-book-citing literature nerd, * and _ might've been <cite> instead of <em> and we'd still have the exact same problem and the exact same argument, except with one difference—<em> has a cargo-cult, half-understood myth around it of being "the correct and semantic way to write <i>", which is not correct.

As a dude producing books the reason I force authors and editors to use Markdown and then convert the results is specifically to take away the formatting tools and make them concentrate on the content.

Yes! (I also produce books, for what that's worth.)

That's what I want too! Correct semantics.

And these authors use (many are forced to use, even) <em> incorrectly.

But if you give them a tool that is inherently presentational rather than semantic what you will end up with is a bunch of presentational markup and no structure.

The solution is to give them very few things.

We can't fully get away from this. I've seen Markdown end users use ## as their main header and #### as their secondary header because "it looks nicer lol".

But one thing exists that can mean emphasis, citation, or other offset prose and that thing is <i>.

If you write <em>Gomphocarpus physocarpus</em> you are part of the problem. Part of the cargo cult of semantics who go "semantics are good, semantics are what we're supposed to use, and <em> is semantic so let's use it lol". Semantics are not a toy.

If you give them a semantic tool and they use it wrong they will produce the opposite problem.

Right. And 99% of people writing text in Markdown online don't understand semantics. They do understand that if they slam asterisks around a phrase, it looks like this, so that's what they do.

Markdown is especially useful in this context because it gives almost exclusively structural markup options. For the few cases where more is needed, divs or spans with classes can be used.

That is great for real Markdown, and CommonMark also specifies how raw HTML is handled. But the problem is all those sites that turn off all that fancy stuff but still say "We emit <em> because that's what our holy CommonMark does".

(The secondary problem is that even if that stuff was left on, 99% of people writing Markdown into forum sites, question sites, and chats like Slack or Discord or Matrix wouldn't know how to use it correctly.)

For example I have authors use *word* as normal for emphasis, but [conra]{lang="la"} for Latin words.

So you have authors who are brilliant enough to be able to use spans (which isn't necessarily correct, <i lang="la"> is) and em and be able to tell which is which, and you have a platform that supports those features (as do I, on my own platform).

That's great for you and your publishing pipeline. That doesn't help me who has to wade through waist-high levels of misapplied <em> online, most of it generated by platform implementers who venerate this CommonMark spec and who have a surface-level, cargo cult understanding of semantics.

I don't want to take away your toolset. You have editors and typesetters polishing the author's texts.
Most rando forum apps and chat apps don't have that. It's those sites that should switch to <i> as default.

All those sites who use "markdown-derived" markup in a wikilike-way open to the public. Those who don't have an editor team and a typesetting team.

The typeset output often makes them look the same presentationally, but some digital formats make a useful distinction (such as ePubs and screen readers).

Yes. That's exactly the problem. You've hit the nail on the head why this is an issue. Epubs and screenreaders getting forcefed <em>Gomphocarpus physocarpus</em> when they would've been able to handle <i>Gomphocarpus physocarpus</i> just <i>fine</i>.

In a printed book, or PDF, all semantics are replaced and rendered into style anyway.

Either way you will get mistakes and stuff marked up wrong,

Yes, but UX and conveyance and good defaults can make a big difference, and over-broad (marking poodles as "dogs") is less wrong than wrong-specific (marking collies as "poodles").

but one fits the pattern of other provided tooling in outputting semantic markup.

A kind of irrelevant response to the specific request at hand here, which is to have the libraries these sites and apps use emit i and b instead of em and strong. Tooling can still handle i and b.
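For sites that can't swap out their Markdown library, the same effect could in principle be had by filtering the renderer's output. Below is a minimal sketch of such a filter. It assumes the input is HTML produced by a renderer with raw HTML turned off, so tags are well formed and attribute values contain no `>`. The function name `broaden_emphasis` is made up for this illustration and is not part of any library mentioned in this thread.

```python
import re

# Map each emphasis tag onto its broader counterpart.
BROADER = {"em": "i", "strong": "b"}

# Opening or closing <em>/<strong> tags, with optional attributes.
# The \b keeps unrelated tags such as <embed> from matching.
TAG = re.compile(r"<(/?)(em|strong)\b([^>]*)>", re.IGNORECASE)


def broaden_emphasis(html_fragment: str) -> str:
    """Rewrite <em>/<strong> tags as <i>/<b>, keeping any attributes."""
    return TAG.sub(
        lambda m: f"<{m.group(1)}{BROADER[m.group(2).lower()]}{m.group(3)}>",
        html_fragment,
    )


print(broaden_emphasis("<p>My <strong>cats (<em>Cheddar</em>)</strong>.</p>"))
```

A regex is only reasonable here because of the stated assumption about the input. For arbitrary HTML a real parser would be the safer choice.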

@wooorm
Contributor

wooorm commented Mar 27, 2023

I’m kinda lost what you are proposing.

You’ve before said that you don’t always want <i>: #652 (comment). I don’t see consensus here for switching the defaults either.

Many ideas have been thrown around over the years.

Perhaps it’s better to open new issues for ideas that do have consensus?

@snan
Author

snan commented Mar 28, 2023

I’m kinda lost what you are proposing.

OK. I'll repeat it here. Thanks, it's good to be super clear.

\1. On websites/​chat apps/​forums/​question sites etc for the general public, * and _ should emit i, and ** and __ should emit b, and the CommonMark spec should explicitly say that (or at least encourage or at the very least allow that).

The CommonMark spec is technically already ambiguous about this. It's mostly about how specifically nesting & parsing works. You can then use it to generate TeX or HTML or SVG or RTF or anything you want.

It's already not a violation of CommonMark to emit b and i.

However a lot of implementers out there who incorrectly emit em and strong say "We are only following what CommonMark says".

\2. On installations used only by technical editors who know what the heck they are doing, it's OK that the shorthands * and _ emit em, since they can use raw HTML (or raw TeX or similar) to emit cite when needed (and the same goes for ** and __ with strong and b). It should probably not be the default, but that argument is less important than point 1 above.
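To make the two points concrete, here is a rough sketch of a renderer where the tag pair is a setting rather than hard-coded. Everything in it is an assumption for illustration: the profile names `public` and `editorial`, the function `render_inline`, and the drastically simplified regex matching, which handles only non-nested runs and is nowhere near the CommonMark delimiter algorithm.

```python
import html
import re

# Hypothetical profiles: "public" for forums and chats, "editorial" for
# installations where authors can reach for raw HTML when they need cite.
PROFILES = {
    "public": {"single": "i", "double": "b"},
    "editorial": {"single": "em", "double": "strong"},
}

# Simplified matching: non-nested *x*, _x_, **x** and __x__ only.
DOUBLE = re.compile(r"(\*\*|__)(?=\S)(.+?)(?<=\S)\1")
SINGLE = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1")


def render_inline(text: str, profile: str = "public") -> str:
    """Escape the text, then wrap delimiter runs in the profile's tags."""
    tags = PROFILES[profile]
    out = html.escape(text, quote=False)
    out = DOUBLE.sub(
        lambda m: f"<{tags['double']}>{m.group(2)}</{tags['double']}>", out
    )
    out = SINGLE.sub(
        lambda m: f"<{tags['single']}>{m.group(2)}</{tags['single']}>", out
    )
    return out


print(render_inline("*Gomphocarpus physocarpus*"))
print(render_inline("I'm **really** tired", profile="editorial"))
```

The sketch defaults to `public`, which is the default argued for in point 1. A library that wanted to stay conservative could ship the same switch with `editorial` as the default instead.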

This pull request is a very mild first step in the right direction. I would like to go much further.

You’ve before said that you don’t always want <i>: #652 (comment).

Yes, but I then wrote three years later:

I backed off from this argument a few years ago because of this argument: "We support raw HTML so people can type out <i> or <cite> or <b> when they mean <i> or <cite> or <b>, and they can use the shorthand * or _ for the most common one, which is \<em\>, and ** and __ for the second-most common one, which is \<strong\>."

But two things are becoming clear to me.

1A. People are using CommonMark-derived converters in places where raw HTML is (and should be) turned off, like on public forums and comment sections. 1B. Implementors of those public forums are referring to this spec saying "I'm just doing what CommonMark says".

\2. Not everyone is, wants to be, or needs to be a linguistics nerd. People shouldn't have to learn the specific minutiae of when to use em, cite, or i. They just want the text to look slanted so they jam stars or underscores around. Making * and _ be <i> matches their expectations.

The solution has kept being really bad and wrong these past years; the argument that convinced me back then has not borne out in practice. It's been a mess of wrong ems out there.

I don’t see consensus here for switching the defaults either.

Some of the people protesting have been misunderstanding what I ask, or at least have expressed their objections in a way that came across as if they hadn't understood what I was asking for and why.

Perhaps it’s better to open new issues for ideas that do have consensus?

That doesn't fix the problem that a lot of websites out there fill the world with wrong em and strong in the name of CommonMark.

@wooorm
Contributor

wooorm commented Mar 28, 2023

Re \1.: I think you’re free to tell people that the output you prefer is semantically better for user generated content.

“We are only following what CommonMark says”

Do you have links other than your comrak PR?
I am personally open to adding an appendix with suggestions on how to handle semantics and user generated content. I don’t think one new sentence somewhere deep in the spec, per your PR, is clear enough.


I’ll try to summarize my personal opinions again:

  • I don’t think there’s a practical difference between i and em (https://www.tpgi.com/screen-readers-support-for-text-level-html-semantics/)
  • Changing defaults breaks a lot, so there would need to be broad consensus, I’d particularly weigh heavy on experts (such as browser makers, accessibility experts) and folks affected by this (such as those using screen readers)
  • Most arguments for why i is a better default than em, could also be used for why <span class="italic"> is better than i
  • This problem exists in any markup language, asciidoc, restructuredtext, HTML itself
  • I think an actual fix for the fact that user-generated content will contain an h4 because it looks nicer and an em even though they should’ve used a cite, etcetera, would be to introduce a new <user-generated-content> element, or user-generated-content attribute, in HTML. Something like the rel="ugc" attribute supported by Google on <a>s

@snan
Author

snan commented Mar 28, 2023

I am personally open to adding an appendix with suggestions on how to handle semantics and user generated content.

Aw yes! Such an appendix would be awesome, to the extent that it reflects my position that i and b are a better default for many installations (admittedly not all installations).

It seems like we are on the same page.

I don’t think one new sentence somewhere deep in the spec, per your PR, is clear enough.

You're right. The PR doesn't solve it and if we find a solution that solves it better, the PR isn't necessary. So the PR is not good. I guess I was desperate and over-compromising in a way that didn't bring the issue all the way over to "solved".

I don't know about that, but practical differences aren't the entirety of the issue. It's also that what's right is right.

  • Most arguments for why i is a better default than em, could also be used for why <span class="italic"> is better than i

I don't agree with that, since i semantically indicates "offset from prose", which is the intent rather than a specific styling. (Yeah, that's a shift in my position compared to what I've been saying in some posts upthread, a shift that happened here. I think the new position is even stronger.)

  • This problem exists in any markup language, asciidoc, restructuredtext,

It doesn't, to the extent that they emit i and b, like all wiki formats and bbcode-style formats I have ever seen up until Markdown broke the trend (as far as I know) and changed it to em and strong. These other formats can semantically express "offset from prose" and "strongly offset from prose" in a way that covers emphasis, citation, and other uses such as Linnaean names.

This isn't a slag on upstream Markdown since it, for its particular use case at the time, was just a tool for HTML writers, a shorthand, a complement that made it easier to write common things like paragraphs and emphasis and links and blockquotes, but still let you drop down to HTML for anything special and unusual, like tables and citations and other non-emphatic use of i.

[the problem also exists in] HTML itself

Half of the problem exists.
The first half ("users are forced to wrongly use em") doesn't exist there since HTML also has i and cite.
The second half ("the language has semantics more fine-grained than 99% of text authors can be expected to master") is a problem with upstream HTML.

  • I think an actual fix for the fact that user-generated content will contain an h4 because it looks nicer and an em even though they should’ve used a cite, etcetera, would be to introduce a new <user-generated-content> element, or user-generated-content attribute, in HTML. Something like the rel="ugc" attribute supported by Google on <a>s

Maybe. Header level issues and other issues with ugc are beyond the scope of this specific issue #652, which is that i and b, being "supersets" in a way of em/cite and strong respectively, do solve a lot of the "wrong semantics" out there. In an overly broad way: a lot of poodles will be relabeled from “poodles” to “dogs”, but with the upshot that a lot of collies will also be (correctly) relabeled from “poodles” to “dogs”.

@jgm
Member

jgm commented Mar 28, 2023

Our use of <em> and <strong> just followed Markdown's...it goes way back. reStructuredText does the same.

I agree that <em> is not always semantically correct for all the cases where people use * in commonmark. I'm not sure whether this is causing any actual problems, though. Who is impacted by this?

If I could be convinced that <i> is in fact a superset, including the uses covered by <em>, then I think there'd be a pretty good argument for changing. But it's not entirely clear to me that this definition does cover the use assigned to <em>:

The i element represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose in a manner indicating a different quality of text, such as a taxonomic designation, a technical term, an idiomatic phrase from another language, transliteration, a thought, or a ship name in Western texts.

There's also the "conservativeness" argument noted by @wooorm: if we made this change, it could cause substantial breakage and inconvenience. (E.g. style sheets would no longer work as expected.) So we'd want to be sure that there's an equally substantial benefit.

In short, I'm not yet convinced we should make this change, but I am sympathetic to the idea. Perhaps others can tell us new things that might bear on the issue.

@snan
Author

snan commented Mar 28, 2023

Our use of <em> and <strong> just followed Markdown's...it goes way back.

Right, and that was appropriate for Markdown which was made for technical bloggers who had access to fallback i, cite, and b when needed.

In short, I'm not yet convinced we should make this change, but I am sympathetic to the idea. Perhaps others can tell us new things that might bear on the issue.

That makes sense to me. Thank you.

@Crissov
Contributor

Crissov commented Mar 28, 2023

I still think optional separation makes the most sense.

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-------|---|---|---|---|---|---|---|
| MD/CM | _x_ | __x__ | ___x___ | ____x____ | _____x_____ | ______x______ | _______x_______ |
| HTML  | <i>x</i> | <b>x</b> | <b><i>x</i></b> | <u>x</u> | <u><i>x</i></u> | <u><b>x</b></u> | <u><b><i>x</i></b></u> |
| MD/CM | *x* | **x** | ***x*** | ****x**** | *****x***** | ******x****** | *******x******* |
| HTML  | <em>x</em> | <strong>x</strong> | <strong><em>x</em></strong> | <?>x</?> | <?><em>x</em></?> | <?><strong>x</strong></?> | <?><strong><em>x</em></strong></?> |

<?> could be <mark>, <dfn>, <label> or some other element.
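Read as binary, the run length in the table picks the tags: 1 selects the innermost tag, 2 the middle one, 4 the outermost, and the other lengths are sums of those. A small sketch of that mapping follows. The function name `wrap` is invented for this illustration, and `mark` stands in for the undecided `<?>` purely as a placeholder. This shows the proposal in the table only. As far as I know, it is not how the CommonMark spec treats long delimiter runs today.

```python
# Tags per delimiter, outermost first. "mark" is only a placeholder for
# the undecided <?> element in the asterisk rows of the table.
TAGS = {
    "_": ("u", "b", "i"),
    "*": ("mark", "strong", "em"),
}


def wrap(text: str, delimiter: str, run_length: int) -> str:
    """Wrap text in the tags that the table assigns to a delimiter run."""
    if delimiter not in TAGS or not 1 <= run_length <= 7:
        raise ValueError("expected '_' or '*' and a run length from 1 to 7")
    outer, middle, inner = TAGS[delimiter]
    # Bit 4 selects the outer tag, bit 2 the middle one, bit 1 the inner one.
    chosen = [
        tag
        for bit, tag in ((4, outer), (2, middle), (1, inner))
        if run_length & bit
    ]
    opening = "".join(f"<{tag}>" for tag in chosen)
    closing = "".join(f"</{tag}>" for tag in reversed(chosen))
    return f"{opening}{text}{closing}"


for n in range(1, 8):
    print(n, wrap("x", "_", n))
```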

@johnfactotum

As far as the HTML standard goes, <i> is not a superset of <em>. But typographically/linguistically, it most certainly is. Thus even though using <i> for emphasis is wrong, it is less wrong than using <em> for italics, as the semantics of <i> can be more broadly interpreted to include stress emphasis but not the other way around.

I'd also like to add that "<i> does not mean italics", though technically correct in a sense, is misleading. The italic style itself is semantically meaningful. The semantic meaning <i> has now was completely derived from italics. Any phrasing element set in italics almost certainly has some semantic meaning rather than being purely decorative (which isn't really surprising, for why would anyone change the font style in the middle of the paragraph for no reason?). Cases where one should use <span class="italic"> over <i>/<em>/<cite> are non-existent.

There is however one reason why you might want to keep <em> as the default: in everyday, non-technical writing, stress emphasis is used more often than other kinds of italics. It boils down to whether you want to be exactly right most of the time, but sometimes extremely wrong, or somewhat wrong most of the time, but never extremely wrong.

As a counter example, imagine rendering all italics with <cite>. I think most would agree that it would be completely unacceptable. But the same applies to <em>. Marking taxonomic designations with <em>, for example, would be just as wrong as marking them with <cite>.

@DominoPivot

It certainly doesn't help that the quick reference and crash course on commonmark.org itself states that *this* is for italics and **this** is for bold. I think my heart skipped a beat when I noticed it. I just opened an issue. commonmark/commonmark-web#69


8 participants