Tool architecture:
The user runs the 'gopar' executable with the same arguments they would pass
to 'go'. The tool examines the source code to be built and creates a matching
directory structure to store the generated kernel files, then performs the
source generation. Finally, it invokes 'go' with the same arguments, but
modifies the GOPATH env variable to include the auto-generated files as well
as the gopar directory that contains the parallel runtime library.
Steps:
- $ gopar [run|build|install] <main package>
- LLVM-inspired structure
- Analysis passes
- Module passes
- Gather identifiers
- Function passes
- Calculate function complexity
- Loop passes
- Calculate branching factor
- Calculate loop complexity
- Block passes
- Dependency analysis
- Block use-modify list
- Transform passes (declare dependencies on Analysis)
- Loop splitting
- Any variables set during the last loop iteration need to be visible
- Build function calls starting from main()
- Determine call graph, and identify possible parallelism points
- Tag non-parallelizable library calls
- Record the work factor increase for loops to later calculate the number
of threads to launch.
- Deal with recursion
- Pick the cutoff loop for parallelizing along each branch in the call graph
- Analyze each loop as being independent, or recognize reductions and shared
variables.
- For each loop picked to be parallelized, copy all of the files in that
package to the temporary directory
- For each loop to be parallelized:
- Modify the surrounding AST to correctly marshal data to/from OpenCL
- Generate the OpenCL kernel
- Include the call to the OpenCL kernel
- Launch the underlying 'go' program with a modified GOPATH env (see the
sketch after this list)
- <path to src/rtlib>:<path to generated src/>:<original path>
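
A minimal sketch of that final step, assuming the wrapper simply re-executes
'go' with a prepended GOPATH (rtlibPath and genPath are hypothetical
placeholders, not the tool's actual layout):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    rtlibPath := "/path/to/gopar/rtlib" // hypothetical runtime library dir
    genPath := "/tmp/gopar-gen"         // hypothetical generated-source dir
    gopath := fmt.Sprintf("%s:%s:%s", rtlibPath, genPath, os.Getenv("GOPATH"))
    os.Setenv("GOPATH", gopath) // the child process inherits this env

    cmd := exec.Command("go", os.Args[1:]...) // forward the user's arguments
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    if err := cmd.Run(); err != nil {
        os.Exit(1)
    }
}
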
Competitors:
CUDA, OpenCL, DirectCompute
OpenACC, C++ AMP, Thrust, Bolt
X10, Chapel, NESL, Delite, Par4All
Loop dependencies: "PLDI'12 Logical Inference Techniques for Loop Parallelization"
- "array abstraction"
- "uniform set representation"
- summarize variables as read-only, write-first, read-write
- or WAR, WAW, RAW?
- build DAGs
- privatization + loop splitting (record the last iteration of the loop; see
the sketch below)
- recording loop bounds in an algebraic equation
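
A minimal sketch of privatization plus the last-iteration rule, assuming a
loop body where tmp is written before it is read in every iteration (names
are illustrative, not from the paper):

package privatize

import "sync"

// scale doubles a into b in parallel. tmp is write-first in every
// iteration, so each goroutine keeps a private copy; sequential semantics
// are restored by recomputing the last iteration's tmp after the loop
// (assumes len(a) > 0 and len(b) >= len(a)).
func scale(a, b []int) int {
    var wg sync.WaitGroup
    for i := range a {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            tmp := a[i] * 2 // privatized: one copy per iteration
            b[i] = tmp
        }(i)
    }
    wg.Wait()
    return a[len(a)-1] * 2 // the value tmp holds after the final iteration
}
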
Compiler Passes:
- Analysis
- ExternalFunction: determine if a function contains any references to
external package functions/variables. TODO: resolve them if possible.
- DataDependence: the most complicated pass - analyze the data
dependencies between "basic blocks" (for loops + functions). Record the
variables created, and classify each variable as ReadOnly, WriteFirst, or
ReadWrite (see the sketch after this list).
- CalculateLoopIterations: record an expression for the number of iterations
a loop makes, or note that it cannot be determined (while-style loop,
writes to the iteration variable, etc.) [DataDependence]
- PickParallelLoops: use previous analysis to identify loops that can and
should be parallelized. Record the expressions used for indexing, and the
thread space. Also record the variables that need to be transferred, and
any variables that can be marked as private. [CalculateLoopIterations,
DataDependence]
- //CalculateLoopCutoff: generate an O(<vars>) < const condition for choosing
at runtime whether to run the parallel function or not
- Transform
- //UnrollLoops: use the CalculateLoopCost analysis to decide how far to
unroll
- RuntimeLibImport: add the 'import "rtlib"' line
- GenerateKernel: generate the kernel source and supporting functions, and
store it in a text constant injected in another file (kernels.go)
- RewriteLoops: rewrite the loop site with data transfer calls and kernel
invocation (rtlib calls)
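
A minimal sketch of the DataDependence classification referenced above,
assuming each block's accesses are flattened into program order (the Access
type and Classify function are assumptions, not the tool's actual API):

package deps

type Access struct {
    Name    string
    IsWrite bool
}

type Class int

const (
    ReadOnly Class = iota
    WriteFirst
    ReadWrite
)

// Classify walks the accesses of one block in program order and assigns
// each variable ReadOnly, WriteFirst, or ReadWrite.
func Classify(accesses []Access) map[string]Class {
    classes := make(map[string]Class)
    for _, a := range accesses {
        c, seen := classes[a.Name]
        switch {
        case !seen && a.IsWrite:
            classes[a.Name] = WriteFirst // first access is a write
        case !seen:
            classes[a.Name] = ReadOnly // first access is a read
        case c == ReadOnly && a.IsWrite:
            classes[a.Name] = ReadWrite // read seen before this write
        }
        // WriteFirst and ReadWrite are stable under later accesses.
    }
    return classes
}
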
Call graph generation:
func main() {
    a := make([]int, 100) // allocated so the indexing below is valid
    b := make([]int, len(a))
    for i, val := range a {
        b[i] = val * 100
    }
    sum := 0
    for i, val := range b {
        if i != 0 {
            b[i] += b[i-1] // prefix sum
        }
        sum = sum + val // reduction
    }
    _ = sum // the reduction result; Go rejects unused variables
}
Go-to-C translation:
- Simple support, no interfaces/typecasting/etc
- Support for ordered channel writes within kernels (prefix sum prerun for
idx; see the sketch below)
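
A sketch of that prefix-sum prerun: if each iteration first reports how many
values it will send, an exclusive prefix sum over the counts gives every
iteration a stable slot in a flat output buffer, preserving channel write
order (names are illustrative):

// prefixOffsets returns, for each iteration, the output slot where its
// writes begin; iteration i may then write counts[i] values in order.
func prefixOffsets(counts []int) []int {
    offsets := make([]int, len(counts))
    total := 0
    for i, c := range counts {
        offsets[i] = total // exclusive prefix sum
        total += c
    }
    return offsets
}
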
Parallel kernel detection:
- Only range/simple for loops
go func() {
    for i := 0; i < 7*6*5*4*3*2; i++ {
        c <- f(i)
    }
    done <- 1
}()
go func() {
    for i := range workchan {
        c <- f(i)
    }
    done <- 1
}()
- The loop is parallelized; the surrounding go code runs as normal.
Affine array accesses:
var N, M = 10, 20 // M was implied by the original note; values illustrative
a := make([]int, N*M)
for x := 0; x < N; x++ {
    for y := 0; y < M; y++ {
        a[y*N+x] += 1
    }
}
Store the index expression as an array of sums, or store nil if the
expression is not affine (see the sketch below). Later passes will test
whether the value of N is fixed for all loop iterations. If the expression is
not affine, we must assume the access introduces a dependency across all
iterations; that is fine if we only read the value, but such writes must not
be parallelized.
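
One possible representation of that "array of sums", assuming each index is
kept as a constant plus per-loop-variable coefficients (the AffineIndex type
is an assumption, not the tool's actual structure):

type AffineIndex struct {
    Const  int
    Coeffs map[string]int // loop variable -> coefficient
}

// a[y*N+x] with N fixed at 10 would be recorded as
// &AffineIndex{Const: 0, Coeffs: map[string]int{"x": 1, "y": 10}};
// a non-affine index such as a[i*i] is recorded as nil instead.
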
Runtime behavior: /gopar/rtlib
- find a device to use on startup
- copy slices to/from GPU memory
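
A hedged, interface-only sketch of what that runtime surface might look like
on the Go side (these names are assumptions, not the actual rtlib API):

package rtlib

// Device is one OpenCL device, selected once at startup.
type Device interface {
    // CopyIn transfers a host slice into device memory.
    CopyIn(data []int) (Buffer, error)
    // CopyOut copies a device buffer back into a host slice.
    CopyOut(buf Buffer, out []int) error
}

// Buffer is an opaque handle to device memory.
type Buffer interface{}
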
Interface support?
- Interfaces are a pair of [type, val], where val is the value itself if it
fits in a uintptr, or a pointer to the struct if it is larger
- Need to copy all structs into contiguous memory
- Structs can have differing types -> different sizes
- How to calculate each struct's offset
- Reduce warp branching by sorting different interface types together
- Don't "sort", but group into buckets while analyzing (see the sketch below)
- Need to copy the output back in the original order
- Offset array to the beginning of each bucket
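
A minimal sketch of that bucketing, assuming interface values are grouped by
dynamic type in one pass while recording original positions (types and names
are illustrative):

package buckets

import "reflect"

type entry struct {
    origIdx int
    val     interface{}
}

// Bucketize groups values by dynamic type in a single pass (no full sort),
// remembering each value's original index so results can be copied back in
// the original order. offsets marks where each type's bucket begins.
func Bucketize(vals []interface{}) (ordered []entry, offsets map[reflect.Type]int) {
    groups := make(map[reflect.Type][]entry)
    var order []reflect.Type // buckets in order of first appearance
    for i, v := range vals {
        t := reflect.TypeOf(v)
        if _, ok := groups[t]; !ok {
            order = append(order, t)
        }
        groups[t] = append(groups[t], entry{i, v})
    }
    offsets = make(map[reflect.Type]int)
    for _, t := range order {
        offsets[t] = len(ordered)
        ordered = append(ordered, groups[t]...)
    }
    return ordered, offsets
}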