Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement x86 AVX intrinsics #3192

Merged
merged 1 commit into from
Feb 16, 2024
Merged

Conversation

eduardosm
Copy link
Contributor

@eduardosm eduardosm commented Nov 26, 2023

Blocked on #3214

@RalfJung
Copy link
Member

RalfJung commented Dec 3, 2023

Thanks for the PR! Unfortunately I am exceptionally busy right now and honestly I am not sure if I'll be able to review a PR of this size before the Christmas break. Sorry for that.

@workingjubilee
Copy link
Member

Perhaps if @eduardosm splits out the "mere code movement" parts with rounding and sqrt into a separate PR so that the maskload, etc. stuff can be reviewed separately?

@RalfJung
Copy link
Member

RalfJung commented Dec 7, 2023

Splitting up is always helpful, yeah. Reviewing has a strictly superlinear time complexity, I suspect somewhere between O(n * log n) and O(n^2), so two smaller PRs are less work to review than a single PR with twice the size.

@workingjubilee
Copy link
Member

Yeah. I was also thinking "hm, I could help examine the tricky parts and doublecheck stuff, but Ralf knows how stuff is/should be organized better than me (obviously), so I have no comment on 'where should enum FloatUnaryOp live, and/or does this abstraction need to be changed now that it's seeing more use?"

@eduardosm
Copy link
Contributor Author

Done, created #3214 with the "just moving things" part

bors added a commit that referenced this pull request Dec 8, 2023
Move some x86 intrinsics code to helper functions in `shims::x86`

To make them reusable for intrinsics of other x86 features.

Splitted from #3192
@bors
Copy link
Collaborator

bors commented Dec 8, 2023

☔ The latest upstream changes (presumably #3214) made this pull request unmergeable. Please resolve the merge conflicts.

@eduardosm eduardosm marked this pull request as ready for review December 9, 2023 12:02
RalfJung pushed a commit to RalfJung/rust that referenced this pull request Dec 17, 2023
Move some x86 intrinsics code to helper functions in `shims::x86`

To make them reusable for intrinsics of other x86 features.

Splitted from rust-lang/miri#3192
Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Mostly looks good, I have some nits and questions though.

I didn't look at the tests (yet); I assume they cover all the new intrinsics and in particular the corner cases.

src/shims/x86/avx.rs Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
@RalfJung
Copy link
Member

@rustbot author

@rustbot rustbot added the S-waiting-on-author Status: Waiting for the PR author to address review comments label Dec 22, 2023
@eduardosm
Copy link
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Waiting for a review to complete and removed S-waiting-on-author Status: Waiting for the PR author to address review comments labels Jan 12, 2024
Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, finally went through all the ops and comments. Sorry for the long wait!

src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Outdated Show resolved Hide resolved
///
/// Each 128-bit chunk is treated independently (i.e., the value for
/// the is i-th 128-bit chunk of `dest` is calculated with the i-th
/// 128-bit blocks of `left` and `right`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't that mean the output should be shorter than the input, since each 128bit chunk produces one result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output has the same size as each input, since each 128-bit chunk of the input produces a 128-bit chunk for the output.

Copy link
Member

@RalfJung RalfJung Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand what is meant by performing an operation "horizontally".

Given two vectors, I can see how can one do an operation "pointwise" (that's what it would be called in mathematics, anyway): the i-th output element is computed as op(left[i], right[i]). Naturally, then the output is the same length as either input. Is that what you mean by "horizontally"? If not, what do you mean? I thought it may be a sort of fold, like summing all elements, but then you get one output element per input chunk, so that doesn't match the comment. I was trying to reverse engineer the code but failed.^^

Copy link
Member

@RalfJung RalfJung Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I stared at the code some more, and I think it's composing neighboring elements?

  • concatenate left ++ right to form input
  • then compute output[i] as op(input[2*i], input[2*i + 1])

Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct

src/shims/x86/mod.rs Show resolved Hide resolved
src/shims/x86/mod.rs Outdated Show resolved Hide resolved
src/shims/x86/avx.rs Show resolved Hide resolved
src/shims/x86/avx.rs Show resolved Hide resolved
src/shims/x86/avx.rs Show resolved Hide resolved
@RalfJung RalfJung added S-waiting-on-author Status: Waiting for the PR author to address review comments and removed S-waiting-on-review Status: Waiting for a review to complete labels Jan 25, 2024
src/shims/x86/mod.rs Outdated Show resolved Hide resolved
@RalfJung
Copy link
Member

(FYI, I will assume the ball is in your court until you do rustbot ready)

@eduardosm
Copy link
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Waiting for a review to complete and removed S-waiting-on-author Status: Waiting for the PR author to address review comments labels Feb 14, 2024
let control = this.project_index(&control, i)?;

// Each 128-bit lane is shuffled independently. Since each lane contains
// two 64-bit elements, only the second bit from `right` is used (yes, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right does not exist any more, should probably be control now?

// second instead of the first, ask Intel). To read the value from the current
// lane, add the destination index truncated to a multiple of 2.
let src_i = ((this.read_scalar(&control)?.to_u64()? >> 1) & 1)
.checked_add(i & !1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, i & !1 is the chunk_base, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, since each element is 64 bits, and each chunk is 128 bits, a chunk has two elements, so we chop off the lowest bit to get the chunk base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then let-bind it like above please. :)

for i in 0..dest_len {
let control = this.project_index(&control, i)?;

// Each 128-bit lane is shuffled independently. Since each lane contains
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these lane be chunk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I missed those

///
/// Each 128-bit chunk is treated independently (i.e., the value for
/// the is i-th 128-bit chunk of `dest` is calculated with the i-th
/// 128-bit blocks of `left` and `right`).
Copy link
Member

@RalfJung RalfJung Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand what is meant by performing an operation "horizontally".

Given two vectors, I can see how can one do an operation "pointwise" (that's what it would be called in mathematics, anyway): the i-th output element is computed as op(left[i], right[i]). Naturally, then the output is the same length as either input. Is that what you mean by "horizontally"? If not, what do you mean? I thought it may be a sort of fold, like summing all elements, but then you get one output element per input chunk, so that doesn't match the comment. I was trying to reverse engineer the code but failed.^^

Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: Waiting for the PR author to address review comments and removed S-waiting-on-review Status: Waiting for a review to complete labels Feb 14, 2024
src/shims/x86/mod.rs Outdated Show resolved Hide resolved
Comment on lines 184 to 187
this.copy_op(
&this.project_index(&data, src_i)?,
&this.project_index(&dest, i)?,
/*allow_transmute*/ false,
)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature of copy_op changed, this needs a rebase.

@eduardosm
Copy link
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Waiting for a review to complete and removed S-waiting-on-author Status: Waiting for the PR author to address review comments labels Feb 16, 2024
@RalfJung
Copy link
Member

@bors r+

@bors
Copy link
Collaborator

bors commented Feb 16, 2024

📌 Commit 3cd30a5 has been approved by RalfJung

It is now in the queue for this repository.

@bors
Copy link
Collaborator

bors commented Feb 16, 2024

⌛ Testing commit 3cd30a5 with merge 454f054...

@bors
Copy link
Collaborator

bors commented Feb 16, 2024

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing 454f054 to master...

@bors bors merged commit 454f054 into rust-lang:master Feb 16, 2024
8 checks passed
@eduardosm eduardosm deleted the x86-avx-intrinsics branch February 18, 2024 00:20
lnicola pushed a commit to lnicola/rust-analyzer that referenced this pull request Apr 7, 2024
Move some x86 intrinsics code to helper functions in `shims::x86`

To make them reusable for intrinsics of other x86 features.

Splitted from rust-lang/miri#3192
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024
Move some x86 intrinsics code to helper functions in `shims::x86`

To make them reusable for intrinsics of other x86 features.

Splitted from rust-lang/miri#3192
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Waiting for a review to complete
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants