-
Notifications
You must be signed in to change notification settings - Fork 0
/
report.tex
936 lines (785 loc) · 40.8 KB
/
report.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
\documentclass{article}
%\usepackage{xunicode}
\usepackage[T1]{fontenc}
%\usepackage{fontspec}
\usepackage{geometry}
\usepackage[english]{babel}
%\usepackage{polyglossia}
%\setmainlanguage{english}
\usepackage{amsmath}
\usepackage{amssymb}
%\usepackage{unicode-math}
\usepackage[hidelinks]{hyperref}
\usepackage{enumitem}
\setlist{nosep,leftmargin=*}
\frenchspacing
\newcommand\setR{\mathbb{R}}
\newcommand\setZ{\mathbb{Z}}
\newcommand\setN{\mathbb{N}}
\title{Improving array expansion in Jasmin compiler}
\author{Quentin \textsc{Corradi}}
\setlength{\parindent}{0pt}
\begin{document}
\maketitle
\section{Introduction}
Jasmin\cite{10.1145/3133956.3134078, 10.1145/3319535.3363211, 9152665} is a
compiler and a language designed for writing high-assurance and
high-speed cryptography primitives. The semantics is formally defined to allow
rigorous reasoning about program behaviors.
The Jasmin language is close to assembly, the main differences lie in:
\begin{itemize}
\item Jasmin abstracts numerical operations common in cryptography
\item Jasmin provides variables and performs variable allocation in registers
\item Jasmin supports arrays
\item Jasmin provides common control flow statements
\item Jasmin provides functions with support for C calling convention if needed
\item Jasmin presents assembly instruction with a functional style to avoid
hidden side effects
\end{itemize}
In Jasmin there are two kind of variables, the stack allocated ones and the
register allocated ones. The user has to tell the compiler the kind of every
variables. This also holds true for arrays. However on most processors on which
people run cryptographic primitives and software, there is no hardware support
for indexing registers with a value, all registers need to be known at
compilation. In Jasmin a compilation pass does it.
\medskip
In most programming languages, arrays are an abstraction for handling memory
using integers. It abstracts contiguous memory into a partial map from integers
to memory cells in order to hide arithmetic on pointers. This abstraction could
also be thought as a partial map from integers to variables, that is any array
\(t\) of size \(N\) could be replaced by \(N\) variables
\(t_0, \dots, t_{N - 1}\) and a pair of functions
\((\mathit{get}, \mathit{set})\) such that \(\mathit{get} : k \mapsto t_k\) and
\(\mathit{set}\;k\;x\) sets \(t_k\) to \(x\).
This way of thinking about arrays allows us to eliminate Jasmin register arrays
by replacing them by variables. This compilation pass is called array expansion
and is not a standard compilation pass. In the following array expansion will
also refer to the action of eliminating arrays by replacing them with the
appropriate amount of variables.
\medskip
In this article we try to extend the array expansion pass the Jasmin compiler
already performs. In particular we implemeted array expansion in function
signature and function calls, and we studied how more array expansion could be
performed in particular for slice assignments informed by how the expansion of
whole arrays in function call works.
\medskip
This article is organised as follows:
In Section \ref{sec:syntax} the relevant parts of Jasmin syntax trees are
presented then the syntactic side of array expansion is explained more in-depth
for our intended use in Jasmin.
In Section \ref{sec:semantics} the relevant parts of Jasmin semantics are
presented then the semantics side of array expansion is explored.
In Section \ref{sec:perch} we describe and justify a change we made to the core
of Jasmin semantics and the consequent changes we had to make in order to
accomodate it and fix the correctness theorems.
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
\begin{minipage}{0.45\textwidth}
type wsize = \(\{ 2^i \mid i \in \setN \}\)
type ty =
| Word of wsize
| Arr of wsize \(\times\;\setN\)
type v\_kind =
| Stack
| Reg
type var = \{
~ v\_id : uid;
~ v\_kind : v\_kind;
~ v\_ty : ty;
\}
type expr =
| Pconst of wsize \(\times\;\setZ\)
| Parr\_init of wsize \(\times\;\setN\)
| Pvar of var
| Pget of var \(\times\) expr
| Psub of \(\setN\;\times\) var \(\times\) expr
\end{minipage}\hfill\vline\hfill\begin{minipage}{0.5\textwidth}
type lval =
| Lnone
| Lvar of var
| Laset of var \(\times\) expr
| Lasub of \(\setN\;\times\) var \(\times\) expr
type instr =
| Cassgn of lval \(\times\) expr
| Ccall of lval list \(\times\) uid \(\times\) expr list
type call\_conv =
| Export
| Subroutine
type func = \{
~ f\_name : uid;
~ f\_args : var list;
~ f\_body : instr list;
~ f\_ret : var list;
~ f\_cc : call\_conv;
\}
type prog = func list
\end{minipage}\normalfont%
\caption{Stripped and simplified version of Jasmin programs' AST type.}\label{fig:types}
\end{figure}
\section{Array expansion compiler pass}\label{sec:syntax}
\subsection{Introduction to relevant parts of Jasmin syntax}
In this section, the relevant parts of Jasmin programs' AST type is described.
The irrelevant parts, if mentioned, will not be expanded upon. Most of the types
are described in Figure \ref{fig:types} using a mix of a mathematical and
OCaml-like syntax.
A Jasmin progam is a list of functions. A function is made of its name, a unique
identifier; its parameters, a list of variables; its body, a list of
instructions; its returned variables, a list of variables; and its kind. The
kind of a function can be ``exported to the outside'', in which case the
function must follow some calling convention, or internal (subroutine).
\smallskip
An instruction can be an assignment or a function call. Other instructions
include assembly instructions and control flow instructions, however they are
not relevant to our work.
An assignment is made of an lvalue and an expression. A function call is
made of the called function identifier, a list of lvalues matching the
function returned variables list on length, and a list of expressions matching
the function parameters list on length.
Function arguments and return variables, lvalues and expressions refer to
variables. A variable is made of an identifier, a type and a kind. In the AST a
variable doesn't have to be declared in order to be used. Across a function
body, parameters and returned variables, all variables featuring the same
identifier and type are considered to be the same. A variable kind is a
constraint on where the compiler can store the value, it is either on the stack
or in a register.
\smallskip
An lvalue indicates where to store a value, it is either a variable
(\texttt{Lvar} constructor), a cell of an array variable (\texttt{Laset}), a
slice of an array variable (\texttt{Lasub}) or nowhere (\texttt{Lnone})
indicating the value will not be used and can be discarded as soon as possible.
An expression can be a constant word (\texttt{Pconst}), an empty array with no
initialised cell (\texttt{Parr\_init}), a variable (\texttt{Pvar}), a cell of an
array variable (\texttt{Pget}) or a slice of an array variable (\texttt{Psub}).
Operations like the addition of two expressions are also expressions.
Single array cells are indexed by expressions. Slice are contiguous sections of
arrays. The section is described using its length and offset from the begining
of the array. The length is a non-negative number and the offset an expression.
\smallskip
Arrays and words are two Jasmin types. A word is two to a non-negative power,
representing the size of the word in bytes. An array is a partial map from
non-negative numbers to words of some size.
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
let rec expand\_e m = function
~ | Pvar x \(\mapsto\) if expand\_var m x is Some vx
~ then Ok (inr [:: Pvar v | v \(\gets\) vx ]) else Ok (inl (Pvar x))
~ | Pget x e \(\mapsto\) if expand\_var m x is Some vx then
~ if e is Pconst \_ z then if get vx z is Some v then Ok (inl (Pvar v))
~ else Error else Error
~ else Let inl e = expand\_e m e in Ok (inl (Pget x e))
~ | Psub len x e \(\mapsto\) if expand\_var m x is Some vx then
~ if e is Pconst \_ z then Ok (inr [:: Pvar v | v \(\gets\) take len (drop z vx) ])
~ else Error
~ else Let inl e = expand\_e m e in Ok (inl (Psub ws len x e))
~ | e \(\mapsto\) Ok (inl e)
let expand\_lv m = function
~ | Lvar x \(\mapsto\) if expand\_var m x is Some vx
~ then Ok (inr [:: Lvar v | v \(\gets\) vx ]) else Ok (inl (Lvar x))
~ | Laset x e \(\mapsto\) if expand\_var m x is Some vx then
~ if e is Pconst \_ z then if get vx z is Some v then Ok (inl (Lvar v))
~ else Error else Error
~ else Let inl e = expand\_e m e in Ok (inl (Laset x e))
~ | Lasub len x e \(\mapsto\) if expand\_var m x is Some vx then
~ if e is Pconst \_ z then Ok (inr [:: Lvar v | v \(\gets\) take len (drop z vx) ])
~ else Error
~ else Let inl e = expand\_e m e in Ok (inl (Lasub len x e))
~ | x \(\mapsto\) Ok (inl x)
\normalfont%
\caption{Expansion of expressions and lvalues.}\label{fig:fullexp}
\end{figure}
\subsection{Array expansion in Jasmin compiler}
A variable of type array and kind register means an array of registers, but in
most common processors on which cryptographic primitives are run there is no
hardware support for indexing registers with a register. Therefore we need to
eliminate them in some compilation pass; this is what the array expansion pass
does in the Jasmin compiler. Other kind of arrays are not expanded --even though
we could try to expand them-- because there is hardware support for indexing
memory with a register and so we consider users know what they want.
In order to expand an array we have to replace all store and load operations
performed on this array and modify function signature it appears in. The load
and store operations (resp. in expressions and lvalues) come in three kinds:
single cell indexing, array slicing and whole array access.
\smallskip
Assume we have a map associating each array to the list of variables it is
expanded to (called \texttt{m} in the following figures; as all array length are
known in Jasmin it is just a matter of generating enough fresh names).
\medskip
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
let expand\_e\_n m n = function
~ | Pvar x \(\mapsto\) if v\_ty x is Arr \_ \_
~ then Ok [:: Pget v (Pconst U64 i) | 0 \(\leq\) i \(<\) n ] else Error
~ | Psub len x e \(\mapsto\) Ok [:: Pget v (Padd e (Pconst U64 i)) | 0 \(\leq\) i \(<\) n ]
~ | \_ \(\mapsto\) Error
let expand\_lv\_n m n = function
~ | Lnone \(\mapsto\) Ok [:: Lnone | 0 \(\leq\) i \(<\) n ]
~ | Lvar x \(\mapsto\) if v\_ty x is Arr \_ \_
~ then Ok [:: Laset x (Pconst U64 i) | 0 \(\leq\) i \(<\) n ] else Error
~ | Lasub len x e \(\mapsto\) Ok [:: Laset x (Padd e (Pconst U64 i)) | 0 \(\leq\) i \(<\) n ]
~ | \_ \(\mapsto\) Error
\normalfont%
\caption{Expansion for unrolling expressions and lvalues.}\label{fig:unrexp}
\end{figure}
The array expansion for expressions and lvalues are depicted in Figure
\ref{fig:fullexp} using a pseudo OCaml-like syntax for convenience.
In this figure and all the following, \texttt{Let} \textit{pat} \texttt{=}
\textit{e1} \texttt{in} \texttt{e2} is a shorthand for matching the result of
\textit{e1} against \texttt{Ok} \textit{pat} and \texttt{Error}; in the former
case the relevant variables of \textit{pat} are bound and \textit{e2} is
evaluated, in the latter case the expression evaluates to \texttt{Error}.
The figure doesn't describe actual implemented code because expansion of
variables and subarrays for assignments is performed in a previous compilation
pass\footnote{Prior to our work, single cell access for lvalues and expression
in assignments was implmented in the array expansion pass and variable expansion
was implemented in a previous compilation pass}, but they could all be
implemented in one compiler pass as the code required to expand expressions and
lvalues for function calls and assignment only differs slightly as will be shown
further in this section.
\smallskip
Single cell indexing (\texttt{Pget} and \texttt{Laget}) can be expanded to a
single variable (\texttt{Pvar} and \texttt{Lvar}) when the index is a constant.
Otherwise it is not supported. When the index is a for loop variable in the
original program, the loop unrolling and constant propagation passes which come
before the array expansion pass have already transformed the indexes into
constants, so it also falls under the constant case.
Array slices (\texttt{Psub} and \texttt{Lasub}) are expanded by acceding to
the relevant slice of variables replacing the array provided the offset
parameter is a constant. The operation performed on this slice will then require
an unrolling.
Finally whole array accesses (\texttt{Pvar} and \texttt{Lvar}) are equivalent to
slices of arrays with no offset and same length as the base array on which they
are performed. They are replaced by all the variables the array is to be
replaced with.
\smallskip
In order to perform the unrolling we need to know whether the result of the
expansion of an expression or lvalue changed its type (\texttt{inr} cases) or
not (\texttt{inl} cases). For instace expanding a single cell access doesn't
change type (\texttt{Pget} or \texttt{Laget} are typed as word, and the variable
they are expanded to is also a word of the same size) even though an array
expansion is performed, whereas expanding a whole array does (array to several
words). We need this information because any instruction featuring expressions
or lvalues usually features the other and an array may have been expanded in one
but not in the other (because it is a register array in one side and not the
other), and only looking at the length of the expanded expressions and lvalues
isn't sufficient as arrays can be of length one.
For instance when expanding an assignment between single cell arrays, the
expanded expression and lvalue will be of length 1, which doesn't indicate
whether any array was expanded into a single word variable.
The expansion for unrolling expressions and lvalues (no array expansion
performed) is illustrated in Figure \ref{fig:unrexp} with the functions
\texttt{expand\_e\_n} and \texttt{expand\_lv\_n} to expand expressions and
lvalues when the other side expansion is of length \texttt{n}.
The expansions performed for unrolling array variables and slices consist in
accessing the relevant cells. The only other case where an unrolling makes sense
is the nowhere (\texttt{Lnone}) lvalue, which is the only one requiring a length
to know how many times it should be duplicated.
\medskip
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
let expand\_fvar m v =
~ if expand\_var m v is Some vx then (Some (size vx), vx) else (None, [:: v ])
let expand\_fsig m fd =
~ let \{| f\_cc; f\_name; f\_args; f\_body; f\_ret |\} = fd in
~ if f\_cc = Export then (fd, None) else
~ let (args, expargs) = split [:: expand\_fvar m v | v \(\gets\) f\_args ] in
~ let (ret, expret) = split [:: expand\_fvar m v | v \(\gets\) f\_ret ] in
~ (\{| f\_cc; f\_name; f\_args := flatten args; f\_body; f\_ret := flatten ret |\},
~ Some (expargs, expret))
let expand\_arg m ex e =
~ match ex, expand\_e m e with
~ | Some \_, Ok (inr es) \(\mapsto\) Ok es
~ | Some n, Ok (inl e) \(\mapsto\) expand\_e\_n m n e
~ | None , Ok (inl e) \(\mapsto\) Ok [:: e ]
~ | \_ \(\mapsto\) Error
let expand\_ret m ex x =
~ match ex, expand\_lv m x with
~ | Some \_, Ok (inr xs) \(\mapsto\) Ok xs
~ | Some n, Ok (inl x) \(\mapsto\) expand\_lv\_n m n x
~ | None , Ok (inl x) \(\mapsto\) Ok [:: x ]
~ | \_ \(\mapsto\) Error
\normalfont%
\caption{Expansion of signatures, argument expressions and returned values lvalues.}\label{fig:sigexp}
\end{figure}
Array expansion for function signatures and argument expressions and
returned values lvalues in function calls are depicted in Figure
\ref{fig:sigexp}.
The expansion of an array in function signature is simply a matter of replacing
it with the variables it is expanded to. A slight subtelty however is that like
expansion for expressions and lvalues we need to know which variables were
expanded in order to expand (sometimes without actual array expansion as
illustrated in \texttt{expand\_arg} and \texttt{expand\_ret}) arguments and
return assignments accordingly.
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
let rec expand\_i fexpd m = function
~ | Cassgn x e \(\mapsto\)
~ match expand\_lv m x, expand\_e m e with
~ | Ok (inl x) , Ok (inl e) \(\mapsto\) Ok [:: Cassgn x e ]
~ | Ok (inr xs) , Ok (inr es) \(\mapsto\) [:: Cassgn x e | x \(\gets\) xs, e \(\gets\) es ]
~ | Ok (inl x) , Ok (inr es) \(\mapsto\)
~ Let xs = expand\_lv\_n m x (size es) in [:: Cassgn x e | x \(\gets\) xs, e \(\gets\) es ]
~ | Ok (inl xs) , Ok (inr e) \(\mapsto\)
~ Let es = expand\_e\_n m x (size xs) in [:: Cassgn x e | x \(\gets\) xs, e \(\gets\) es ]
~ | \_ \(\mapsto\) Error
~ | Ccall xs fn es \(\mapsto\)
~ if get fexpd fn is Some (expargs, expret) then
~ Let xs = [:: expand\_ret m xp x | xp \(\gets\) expargs, x \(\gets\) xs ] in
~ Let es = [:: expand\_arg m xp e | xp \(\gets\) expret , e \(\gets\) es ] in
~ Ok [:: Ccall (flatten xs) fn (flatten es) ]
~ else
~ Let xs = [:: Let inl x' = expand\_lv x in x' | x \(\gets\) xs ] in
~ Let es = [:: Let inl e' = expand\_e e in e' | e \(\gets\) es ] in
~ Ok [:: Ccall xs fn es ]
\normalfont%
\caption{Expansion of instructions.}\label{fig:insexp}
\end{figure}
\smallskip
The expansion in function signature is only wanted when the function is not
exported to other programs because we do not want to make such changes unbeknown
to the user. Two solution satisfy this constraint\footnote{If register array
were allowed in exported functions by the typechecker}: warning the user or
failing to compile. As the global correction theorem only mentions exported
functions it would have to be complexified in order to reflect signature
expansion if we chose the former solution so we chose the latter.
\medskip
Expansion of instructions is illustrated in Figure \ref{fig:insexp}. As
described previously assignment are expanded by expanding each side of the
assignment and unrolling if necessary and function calls are expanded using
informating gathered while expanding the signature, which means signature
expansion needs to be performed first on every function before expanding the
body of any function.
\section{Semantics aspect of array expansion}\label{sec:semantics}
\subsection{Relevant parts of Jasmin semantics}\label{ssec:jassem}
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
type value =
| Varr s n of array s n
| Vword s of word s
type sem\_t = function
| Arr s n \(\mapsto\) array s n
| Word s \(\mapsto\) word s
let type\_of\_val = function
~ | Vword s \_ \(\mapsto\) Word s
~ | Varr s n \_ \(\mapsto\) Arr s n
let to\_arr n v = if v is Varr s n t then Ok t else Error
let to\_word s v = if v is Vword s w then Ok w else Error
let to\_val : sem\_t t \(\to\) value = function
~ | Arr s n \(\mapsto\) Varr s n
~ | Word s \(\mapsto\) Vword s
\normalfont%
\caption{Simplified value types and related utilitary functions.}\label{fig:oldval}
\end{figure}
In this section, the relevant parts of Jasmin semantics are described. None of
it is part of our contribution but it is necessary in order to understand
the main properties we have to prove for the compiler correctness theorem.
Jasmin values can be words and partially defined arrays of words. Variable
values are stored in contexts using the dependent type \texttt{sem\_t} as
defined in Figure \ref{fig:oldval} in order to keep the invariant
\texttt{m.[x] = Ok v\(\implies\)type\_of\_val v = v\_ty x}.
\medskip
\begin{figure}[p] % TODO LATER: compare buty against sem rules
\obeylines\obeyspaces\ttfamily%
\begin{minipage}{0.52\textwidth}
let rec sem\_pexpr vm = function
~ | Pconst s z \(\mapsto\) Ok (Vword s (wrepr s z))
~ | Parr\_init s n \(\mapsto\)
~ Ok (Varr s n (array\_empty s n))
~ | Pvar v \(\mapsto\) vm.[v]
~ | Pget x e \(\mapsto\)
~ Let Varr s n t = vm.[x] in
~ Let vi = sem\_pexpr vm e in
~ Let i = to\_int vi in
~ Let w = array\_get t i in
~ Ok (Vword s w)
~ | Psub len x e \(\mapsto\)
~ Let Varr s n t = vm.[x] in
~ Let vi = sem\_pexpr vm e in
~ Let i = to\_int vi in
~ Let t' = array\_get\_sub len t i in
~ Ok (Varr s len t')
\end{minipage}\hfill\vline\hfill\begin{minipage}{0.45\textwidth}
let write\_lval l v vm =
~ match l with
~ | Lnone \(\mapsto\) Ok s
~ | Lvar x \(\mapsto\) vm.[x \(\gets\) v]
~ | Laset x e \(\mapsto\)
~ Let Varr s n t = vm.[x] in
~ Let vi = sem\_pexpr vm e in
~ Let i = to\_int vi in
~ Let w = to\_word s v in
~ Let t = set t i w in
~ vm.[x \(\gets\) to\_val (Arr s n) t]
~ | Lasub len x e \(\mapsto\)
~ Let Varr s n t = vm.[x] in
~ Let vi = sem\_pexpr vm e in
~ Let i = to\_int vi in
~ Let t' = to\_arr s len v in
~ Let t = set\_sub n len t i t' in
~ vm.[x \(\gets\) to\_val (Arr s len) t]
\end{minipage}\normalfont%
\caption{Left: function evaluating expression e in context vm;}\vspace{-0.8\baselineskip}\center
Right: function modifying context s with value v according to lvalue l.\label{fig:semelv}
\end{figure}
The semantics of expressions and lvalues are illustrated in Figure
\ref{fig:semelv}.
Semantics of expressions is a partial function (\texttt{sem\_pexpr}) taking a
context and an expression and returning a value. The semantics of a constant is
the constant word with the same value. The semantics of an empty array is the
array value where no indices are defined. The semantics of a variable is the
value associated to that variable in the context. The semantics of indexing an
array variable (\texttt{Pget x e}) is the word at the index given by the
semantics of the index expression (\texttt{e}) of the array associated to the
variable (\texttt{x}) in the context. The semantics of an array slice
\texttt{Psub len x e} is the array value obtained by slicing the array
associated to the variable \texttt{x} in the context, slice of length
\texttt{len} and starting at the offset given by the semantics of the offset
expression \texttt{e}.
Semantics of lvalues is a partial function (\texttt{write\_lval}) taking an
lvalue, a value and a context, and returning a context.
The semantics of the nowhere (\texttt{Lnone}) is the unchanged context. The
semantics of a variable (\texttt{Lvar x}) is the context where the variable
(\texttt{x}) is associated to the given value. The semantics of indexing an
array variable (\texttt{Laset x e}) is the context where the array associated
to the variable (\texttt{x}) has its cell given by the semantics of the index
expression (\texttt{e}) associated to the given value. The semantics of slicing
an array variable (\texttt{Lasub len x e}) is the context where the array
variable (\texttt{x}) is associated to an array which \texttt{len} cells
starting from the index given by the semantics of the offset expression
(\texttt{e}) are associated to the \texttt{len} cells from the array value
given.
\medskip
\begin{figure}[p]
\ttfamily
sem\_call P :
\begin{center}
\begin{tabular}{c}
get\_func P fn = Some f \\
fold2 (\(\lambda\)x v, vm.[x \(\gets\) v]) (f\_args f) vargs empty = Ok vm1 \\
sem P vm1 (f\_body f) vm2 \\
{}[:: vm2.[x] | x \(\gets\) f\_ret f ] = Ok vres
\\\hline
sem\_call P fn vargs vres
\end{tabular}
\end{center}
sem P :
\begin{center}
\begin{tabular}{c}
\\\hline
sem P vm [::] vm
\end{tabular} \textrm{\textsc{Eskip}}\quad
\begin{tabular}{c}
sem\_i P vm1 i vm2\quad
sem P vm2 l vm3
\\\hline
sem P vm1 (i::l) vm3
\end{tabular} \textrm{\textsc{Eseq}}
\end{center}
sem\_i P :
\begin{center}
\begin{tabular}{c}
sem\_pexpr vm1 e = Ok v\quad
write\_lval x v vm1 = Ok vm2
\\\hline
sem\_i P vm1 (Cassgn x e) vm2
\end{tabular} \textrm{\textsc{Eassgn}}
~
\begin{tabular}{c}
[:: sem\_pexpr vm1 e | e \(\gets\) es ] = Ok vargs \\
sem\_call P f vargs vres \\
fold2 write\_lval xs vres vm1 = Ok vm2
\\\hline
sem\_i P vm1 (Ccall xs f es) vm2
\end{tabular} \raisebox{-\baselineskip}{\textrm{\textsc{Ecall}}}
\end{center}
\normalfont%
\caption{Simplified inductive big step semantics rules.}\label{fig:semall}
\end{figure}
The semantics of Jasmin functions and instructions in a program \texttt{P} are
defined as the big step inductive predicates \texttt{sem\_call} and
\texttt{sem\_i} as depicted in Figure \ref{fig:semall}. The former relates a
function and its arguments to the returned values, the latter relates a context
and an instruction to the context after executing the instruction. The semantics
for instructions is naturally extanded to a sequence of instructions in
predicate \texttt{sem} by chaining the contexts (\textsc{Eseq}) with the empty
sequence being the identity on contexts (\textsc{Eskip}).
\smallskip
The semantics of a function call is made of only one rule with four premisses:
the function identifier is bound to a function (\texttt{f}) of the program,
binding this function arguments to the parameters in an empty context gives a
valid context (\texttt{vm1}), the semantics of sequences of instructions
(\texttt{sem}) relates this valid context and the function body to another
context (\texttt{vm2}) and in this latter context the values associated to the
return variables of the function correspond to the return values of the jugement.
\smallskip
The semantics of an instruction is made of two rules. The rule for assignment
(\textsc{Eassgn}) gives the semantics of assignment instructions
(\texttt{Cassgn x e}), it has two premisses: the semantics of the assignment
expression (\texttt{e}) in the initial context (\texttt{vm1}) is a value
(\texttt{v}) and the semantics of the assignment lvalue (\texttt{x}) on the
initial context (\texttt{vm1}) and that value (\texttt{v}) correspond to the
final context (\texttt{vm2}) of the jugement. The rule for function calls
(\textsc{Ecall}) gives the semantics of the call instruction
(\texttt{Ccall xs f es}), it has three premisses: the semantics of the argument
expressions (\texttt{es}) in the initial context (\texttt{vm1}) are values
(\texttt{vargs}), the semantics of function call relates the function
(\texttt{f}) and these values (\texttt{vargs}) to other values (\texttt{vres})
and the semantics of the lvalues (\texttt{xs}) chained on contexts, starting
from the initial context (\texttt{vm1}), on the latter values (\texttt{vres})
correspond to the final context (\texttt{vm2}) of the jugement.
\subsection{Semantic implications of array expansion}
In this section the semantics changes of all the cases for array expansion are
described and their implications are discussed. Semantic preservation lemmas are
illustrated in Figure \ref{fig:semlem} using a mix of Coq\footnote{Most of
developpement presented previously is in Coq, but for proofs there is no simpler
equivalent}-like and mathematical syntax for lemmas.
\smallskip
\begin{figure}[p]
\obeylines\obeyspaces\ttfamily%
let eval\_array s t i = Let w = array\_get t i in Vword s w
let expand\_v e v = if e then
~ if v is Arr s n t then [:: eval\_array s t i | 0 \(\leq\) i \(<\) n ] else Error
~ else Ok [:: v ]
Definition eq\_alloc m vm1 vm2 =
~ \(\forall\)x, match get m x with
~ | Some vx \(\mapsto\) \(\forall\)i vi, get vx i = Some vi \(\Longrightarrow\)
~ (Let Varr s n t = vm1.[x] in eval\_array t i) = vm2.[vi]
~ | None \(\mapsto\) vm1.[x] = vm2.[x]
~ end.
Lemma expand\_ePl m vm1 vm2 :
~ eq\_alloc m vm1 vm2 \(\Longrightarrow\)
~ \(\forall\)e1 e2, expand\_e m e1 = Ok (inl e2) \(\Longrightarrow\)
~ \(\forall\)v, sem\_pexpr vm1 e1 = Ok v \(\Longrightarrow\) sem\_pexpr vm2 e2 = Ok v.
Lemma expand\_ePr m vm1 vm2 :
~ eq\_alloc m vm1 vm2 \(\Longrightarrow\)
~ \(\forall\)e1 es2, expand\_e m e1 = Ok (inr es2) \(\Longrightarrow\)
~ \(\forall\)v, sem\_pexpr vm1 e1 = Ok v \(\Longrightarrow\)
~ \(\exists\)vs, expand\_v true v = Ok vs \(\wedge\) [:: sem\_pexpr vm2 e | e \(\gets\) es2 ] = Ok vs.
Lemma expand\_lvPl m vm1 vm2 :
~ eq\_alloc m vm1 vm2 \(\Longrightarrow\)
~ \(\forall\)x1 x2, expand\_lv m x1 = Ok (inl x2) \(\Longrightarrow\)
~ \(\forall\)v vm1', write\_lval x1 v vm1 = Ok vm1' \(\Longrightarrow\)
~ \(\exists\)vm2', write\_lval x2 v vm2 = Ok vm2' \(\wedge\) eq\_alloc m vm1' vm2'.
Lemma expand\_lvPr m vm1 vm2 :
~ eq\_alloc m vm1 vm2 \(\Longrightarrow\)
~ \(\forall\)x1 xs2, expand\_lv m x1 = Ok (inr xs2) \(\Longrightarrow\)
~ \(\forall\)v vm1', write\_lval x1 v vm1 = Ok vm1' \(\Longrightarrow\)
~ \(\exists\)vs, expand\_v true v = Ok vs \(\wedge\)
~ \(\exists\)vm2', fold2 write\_lval xs2 vs vm2 = Ok vm2' \(\wedge\) eq\_alloc m vm1' vm2'.
Lemma expand\_funcP P1 P2 fexpd fn vargs vret =
~ sem\_call P1 fn vargs vret \(\Longrightarrow\)
~ \(\forall\)xarg xret, get fexpd fn = Some (xarg, xret) \(\Longrightarrow\)
~ \(\forall\)vargs', [:: expand\_v (is\_some x) v | x \(\gets\) xarg, v \(\gets\) vargs ] = Ok vargs' \(\Longrightarrow\)
~ \(\exists\)vret', [:: expand\_v (is\_some x) v | x \(\gets\) xret, v \(\gets\) vret ] = Ok vret' \(\wedge\)
~ sem\_call P2 fn (flatten vargs') (flatten vres').
Lemma expand\_iP P1 P2 fexpd m vm1 vm2 =
~ eq\_alloc m vm1 vm2 \(\Longrightarrow\)
~ \(\forall\)i1 vm1', sem\_i P1 vm1 i1 vm1' \(\Longrightarrow\)
~ \(\forall\)is2, expand\_i fexpd m i1 = Ok is2 \(\Longrightarrow\)
~ \(\exists\)vm2', sem P2 vm2 is2 vm2' \(\wedge\) eq\_alloc m vm1' vm2'.
\normalfont%
\caption{Lemmas relating the semantics before and after array expansion.}\label{fig:semlem}
\end{figure}
As values and contexts are at the core of semantics, the semantics preservation
theorem has to express relations between them and their expanded version. The
relation between a value and its expanded version --a list of values-- is
\texttt{expand\_v}, it takes a boolean argument to express whether the value
should be expanded or not. If the value is expanded then it is an array and its
expansion is the list of its cells. The relation between a context and its
expanded version is \texttt{eq\_alloc}, it takes the same information for
expanding variables as the AST expansion. The relation holds if all expanded
arrays have their cell equal to the variable to which it is expanded in the
expanded context and all the other variables are equal in both contexts.
\medskip
Assume we have a program \texttt{P1}, its expansion \texttt{P2} and the
expansion information \texttt{fexpd} gathered when expanding the function
signatures (\texttt{expand\_fsig} in Figure \ref{fig:sigexp}).
\smallskip
The semantics of a function call after array expansion takes an expanded list
of values as arguments, executes the expaded body on these variables and returns
the values of the expanded returned variables. Hence the semantics preservation
lemma for expanded functions (\texttt{expand\_funcP}) assumes the original
semantics and expansion of arguments and requires the semantics of the expanded
functions on expanded arguments to correspond to the expanded list of return
values.
The semantics of an instruction after array expansion takes an expanded context,
executes instruction on it and returns an expanded context. Hence the semantics
preservation lemma for expanded instructions (\texttt{expand\_iP}) assumes the
original semantics relates a context \texttt{vm1} to a context \texttt{vm1'},
the first context \texttt{vm1} is related with the expansion relation to an
initial context \texttt{vm2} for executing the expanded instruction and requires
the context \texttt{vm1'} to be related with the expansion relation to a context
\texttt{vm2'} and the semantics of the expanded instructions relates the expanded
initial context \texttt{vm1'} to the expanded final context \texttt{vm2'}.
\smallskip
After array expansion, an expression is either a single expression (\texttt{inl}
case) or it becomes a list of expressions (\texttt{inr} case). Their semantics
in the expanded context is a list of values. Hence the semantics preservation
lemmas for expanded expressions (\texttt{expand\_ePl} and \texttt{expand\_ePr})
assume the original semantics in a context \texttt{vm1} is a value \texttt{v},'
the orgiginal context \texttt{vm1} is related with the expansion relation to a
context \texttt{vm2} and requires to prove that the expanded expressions
semantics in the expanded context \texttt{vm2} corresponds to the expansion of
value \texttt{v}, or when the expansion is a single expression that this
expression semantics is the same as the original expression semantics.
After array expansion, a lvalue is either a single lvalue (\texttt{inl} case) or
it becomes a list of lvalues (\texttt{inr} case). Their semantics starting from
an expanded context on an expanded value is an expanded context, and in the case
the lvalue is expanded to a list of lvalues their semantics is chained on the
contexts. Hence the semantics preservation lemmas for expanded lvalues
(\texttt{expand\_lvPl} and \texttt{expand\_lvPr}) assume the semantics of the
original lvalue on a value \texttt{v} relates a context \texttt{vm1} to a
context \texttt{vm1'}, the initial context \texttt{vm1} is related using the
expansion relation to a context \texttt{vm2} and requires to prove the final
context \texttt{vm1'} is in relation to an expanded context \texttt{vm2'}, using
expansion relation, corresponding to the final context related to the context
\texttt{vm2} by the semantics of the expanded lvalue on the expansion of value
\texttt{v}.
\medskip
With these lemmas proved, proving the semantics preservation theorem is just a
matter of proving the exported functions are unmodified and using the semantic
preservation lemma for function. However this lemma is not provable: Calling a
function requires to build a context where each parameter receives a value, but
the array semantics is partial arrays so some cells are not defined and
accessing their value fails, so programs where some functions were called on
partial arrays wouldn't have a semantics after the transformation. It is not an
issue when comparing contexts as the objects compared in \texttt{eq\_alloc} are
either \texttt{Ok} of some value or \texttt{Error} which account for the case
where the access to an array cell or a variable fails.
\section{Pervasive changes to Jasmin semantics}\label{sec:perch}
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
type value =
| Varr s n of array s n
| Vword s of word s
| Vundef
let type\_of\_val = function
~ | Vundef \(\mapsto\) Word 0
~ | Vword s \_ \(\mapsto\) Word s
~ | Varr s n \_ \(\mapsto\) Arr s n
\normalfont%
\caption{New value type.}\label{fig:newval}
\end{figure}
In this section the changes made to be able to prove the semantics preservation
theorem are described and motivated. This part takes the most time of all as the
changes made are at the core of the semantics, thus affecting every compilation
pass, requiring the proof of every compilation pass to be fixed. Because of the
change affecting all compilation passes, less simplifications than in the
previous sections will be made, exposing some details previously hidden.
\smallskip
The issue raised by array expansion preventing us from proving semantics
preservation is caused by the semantics of arrays which are partial arrays, so
the fix requires to modify the semantics\footnote{or give up some parts of array
expansion}. It can be solved either by eliminating parially initialised arrays,
requiring arrays to be completely initialised, tracking the initialised cells
(which would require some kind of reduction of expressiveness as otherwise it
isn’t computable), store optional words instead of words in the context or allow
values to be undefined. We chose the last solution as undefined values were
already in the value type for booleans (which were hidden in previous sections
because of irrelevance).
The new value type described in Figure \ref{fig:newval} features the undefined
word value (arrays are always defined even though some cells may not be) and
context have been modified to store values of the new \texttt{value} type
instead of the \texttt{sem\_t} dependent type.
\smallskip
There are two main motivations for replacing the dependent type in the contexts.
The first is to extend undefined values to the word type, which could be done
several ways without removing dependent types, for instance by using an option
word type instead of the word type
(\(\texttt{sem\_t} : \texttt{Word s} \mapsto \texttt{option (word s)}\)).
The second motivation is that eliminating dependent types makes invariants
explicit and tend to simplify the proofs.
Proofs on dependent type tend to be tiresome because every rewriting need to
ensure types are not violated, which in the case of dependent type means
ensuring the parameters given to the dependent type are rewritten uniformly in
the goal and assumptions. In particular in proofs using variables we sometimes
weren't able to rewrite the \texttt{v\_ty} field without extra work. This need
is eliminated by using a non-dependant value type since all values are of the
\texttt{value} type instead of being of one of \texttt{word}, \texttt{array} or
the other types.
Removing dependent types where unnecessary and favoring decidable alternatives
is also a design choice made by SSReflect, a widespread Coq proof library used
in Jasmin. The drawbacks of not using dependent types are that invariants need
to be stated and proved manually, which is actually not that much of a drawback
since the invariants become explicit and the proof are usually easy to do.
\medskip
\begin{figure}
\obeylines\obeyspaces\ttfamily%
type pword s = \{
~ pw\_size : wsize;
~ pw\_word : word pw\_size;
~ pw\_proof : pw\_size \(\leq\) s;
\}
type psem\_t = function
| Word s \(\mapsto\) pword s
| t \(\mapsto\) sem\_t t
\normalfont%
\caption{\texttt{psem\_t} dependent type definition.}\label{fig:psemt}
\end{figure}
The old contexts used a dependent type to store the values, this dependent type
was either \texttt{sem\_t} or \texttt{psem\_t}. The latter described in Figure
\ref{fig:psemt} is used to store smaller words into seemingly bigger words for
a compilation pass where all registers and words become 64 bits as is available
in a x86 processor. The old invariant the context maintained using the dependent
types was that the type of the variable corresponded to the type of the value
stored, since the type of the value stored was the dependent type applied to the
type of the variable, which in turn gave the invariant presented in Section
\ref{ssec:jassem}
\texttt{vm.[x] = Ok v \(\Longrightarrow\) type\_of\_val v = v\_ty x} for free.
\begin{figure}[t]
\obeylines\obeyspaces\ttfamily%
let subtype t1 t2 =
~ match t1, t2 with
~ | Word s1, Word s2 \(\mapsto\) s1 \(\leq\) s2
~ | \_, \_ \(\mapsto\) t1 = t2
let compat\_type t1 t2 =
~ match t1, t2 with
~ | Word \_, Word \_ \(\mapsto\) true
~ | \_, \_ \(\mapsto\) t1 = t2
let is\_word t = if t is Word \_ then true else false
let truncate\_val ty = function
~ | Varr s n \_ \(\mapsto\) if ty = Arr s n then Ok v else Error
~ | Vword s w \(\mapsto\) if ty is Word s'
~ then Ok (if truncate\_word s' w is Ok w' then Vword s' w' else v) else Error
~ | Vundef \(\mapsto\) if ty is Arr \_ then Ok Vundef else Error
\normalfont%
\normalfont%
\caption{Subtyping relation and type compatibility relation.}\label{fig:typerel}
\end{figure}
The new context store their value in the new \texttt{value} type. To account for
the two cases where we have words of the right size or of a smaller size
allowed, the contexts take a type relation as parameter, this type relation is
either equality or subtyping (\texttt{subtype}) only on words as defined in
Figure \ref{fig:typerel}. We need to manually ensure the previous context
invariant introduced in Section \ref{ssec:jassem} is still maintained.
Previously the access to a variable could fail if the variable is undefined and
a word for instance, now access to undefined word variable succeeds with the
undefined value so there is no fail case. Therefore the invariant becomes
\texttt{if vm.[x] = Vundef then is\_word (type\_of\_val vm.[x]) else (rel vm) (type\_of\_val vm.[x]) (v\_ty x)}
where \texttt{rel vm} is the type relation parameter of the context \texttt{vm}.
\smallskip
Another place where the storage of smaller words into bigger words has an impact
is the semantics of instructions and function call. The one presented in the
previous section was simplified. In particular because of the compiler pass
where all words are stored on 64 bits, they are truncated to the size they are
supposed to be before any operation and after being retrieved from the context.
The truncation function (\texttt{truncate\_val}) had to be extended to undefined
values.
With the modifications we did, we are confident the correctness proof is
provable. At the writing of this article the correctness proof for the register
allocation pass succeeded and the proof for the register array expansion is in
progress.
\section{Conclusion}\label{sec:ccl}
The implementation of array expansion in the Jasmin compiler allows the users to
manipulate registers by manipulating arrays. These arrays are allocated to the
registers by allocating each cell, unrolling the operations and modifying
function signatures. This last transformation necessited a deep change in the
semantics in order to prove the compiler correctness. The influence of this
change is global and currently only a few compiler pass correction proofs are
fixed.
Allocating an array to registers is a pass that can be implemented in other
compiler similarly to what we did, only the proof work is specific to Jasmin
value semantics.
\section{Bibliography}
\bibliographystyle{plain}
\bibliography{biblio}
\end{document}