Commit

linking patchscopes
asmadotgh committed Sep 5, 2024
1 parent 8c15dad commit f576a0e
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions personas/index.html
@@ -7,7 +7,7 @@
<meta name="description" content="Who's asking?">
<meta property="og:title" content="Who's asking?"/>
<meta property="og:description" content="Who's asking? User personas and the mechanics of latent misalignment"/>
-<meta property="og:url" content="https://pair-code.github.io/interpretability/patchscopes/"/>
+<meta property="og:url" content="https://pair-code.github.io/interpretability/personas/"/>
<!-- Path to banner image, should be in the path listed below. Optimal dimenssions are 1200X630-->
<meta property="og:image" content="static/image/method.png" />
<meta property="og:image:width" content="1200"/>
@@ -181,7 +181,7 @@ <h2 class="subtitle is-size-3-tablet has-text-weight-bold has-text-centered has-
<h3 class="subtitle is-size-4-tablet has-text-left pr-4 pl-4 pt-3 pb-3">
<p>
From a mechanistic perspective, we find that safeguards are layer-specific, and that decoding directly from earlier layers may bypass safeguards and recover misaligned content that would otherwise not have been generated. <br>
-We then use Patchscopes to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.
+We then use <a href="https://pair-code.github.io/interpretability/patchscopes/" target="_blank">Patchscopes</a> to analyze why certain user personas disable safeguards and find that they enable the model to form more charitable interpretations of otherwise dangerous queries.
</p>
<p style="text-align:center;">
<br>