- Big ideas: Memory, CPUs, and Networking
- Let's draw some pictures!
https://tinyurl.com/rsmas-python2
- Somewhere between 1 and 16 cores
- Modern processors support hyperthreading / virtual cores
Source: Jeff Dean, Peter Norvig, etc.
Event | Nanoseconds | Microseconds | Milliseconds | Comparison |
---|---|---|---|---|
L1 cache reference | 0.5 | - | - | - |
L2 cache reference | 7.0 | - | - | 14x L1 cache |
Main memory reference | 100.0 | - | - | 20x L2 cache, 200x L1 cache |
Compress 1K bytes with Zippy | 3,000.0 | 3 | - | - |
Send 1K bytes over 1 Gbps network | 10,000.0 | 10 | - | - |
Read 4K randomly from SSD | 150,000.0 | 150 | - | ~1GB/sec SSD |
Read 1 MB sequentially from memory | 250,000.0 | 250 | - | - |
Round trip within same datacenter | 500,000.0 | 500 | - | - |
Read 1 MB sequentially from SSD | 1,000,000.0 | 1,000 | 1 | ~1GB/sec SSD, 4X memory |
Disk seek | 10,000,000.0 | 10,000 | 10 | 20x datacenter roundtrip |
Read 1 MB sequentially from disk | 20,000,000.0 | 20,000 | 20 | 80x memory, 20X SSD |
Send packet CA → Netherlands → CA | 150,000,000.0 | 150,000 | 150 | - |
- 1 ns = 10^-9 seconds
- 1 us = 10^-6 seconds = 1,000 ns
- 1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Let's multiply all of these durations by a billion:
Event | Scaled duration | Human comparison |
---|---|---|
L1 cache reference | 0.5 s | One heart beat (0.5 s) |
L2 cache reference | 7 s | Long yawn |
Main memory reference | 100 s | Brushing your teeth |
Compress 1K bytes with Zippy | 50 min | One episode of a TV show (including ad breaks) |
Send 2K bytes over 1 Gbps network | 5.5 hr | From lunch to end of work day |
SSD random read | 1.7 days | A normal weekend |
Read 1 MB sequentially from memory | 2.9 days | A long weekend |
Round trip within same datacenter | 5.8 days | A medium vacation |
Read 1 MB sequentially from SSD | 11.6 days | Waiting for almost 2 weeks for a delivery |
Disk seek | 16.5 weeks | A semester in university |
Read 1 MB sequentially from disk | 7.8 months | Almost producing a new human being |
The above 2 together | 1 year | - |
Send packet CA->Netherlands->CA | 4.8 years | Average time it takes to complete a bachelor's degree |
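As a quick sanity check of the scaling (a minimal snippet; the row chosen is arbitrary): multiplying by a billion turns nanoseconds into seconds, so the last row can be verified directly.
seconds_in_a_year = 60 * 60 * 24 * 365            # 31,536,000
packet_round_trip_ns = 150_000_000                # from the latency table above
# after multiplying by a billion, 150,000,000 ns reads as 150,000,000 s
print(packet_round_trip_ns / seconds_in_a_year)   # ~4.76 years, matching "4.8 years"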
- CPUs are really fast
- The goal: Have the CPU always be working
- Memory is fast-ish, hard drives are slow
- You want to make your CPU happy by always giving it data that is close at hand
- Networks are reeeeaaally slow
- Since there's so much downtime when we need to get data from disk or over a network, we can use the CPU for another task
- Switching between tasks can keep our CPU working
- 3,000-30,000 ns for a context switch, or about 50 minutes to 8 hours in human time
- To allow this, we want multiple different programs that are able to run at the same time
- These are called processes
- A process has several parts
- The program code
- The Call Stack (what is actually happening in the program)
- Variables, files, memory, etc
- Within a process, we can also have multiple mini-programs running
- These are called threads
- A child thread and its parent share the same memory (usually), as the sketch below shows
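A minimal sketch using Python's standard threading module, showing that a child thread sees the same objects as its parent (the list name and message are just illustrative):
import threading

shared = []                      # lives in the parent's memory

def worker():
    # the child thread appends to the very same list object
    shared.append('hello from the child thread')

child = threading.Thread(target=worker)
child.start()
child.join()
print(shared)                    # ['hello from the child thread'] -- memory is shared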
- Running a script from the command line
import subprocess
subprocess.call(['python3', 'run_me.py'])   # or: subprocess.call('python3 run_me.py', shell=True)
- Getting the stdout and stderr from a process
# shell=True lets the shell expand *.py; PIPE captures both output streams
output, errors = subprocess.Popen('ls -la *.py', shell=True,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE).communicate()
- Piping processes
p1 = subprocess.Popen(["cat", "file.log"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tail", "-1"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output, err = p2.communicate()
- Lots of models for parallel computing
- Hard to do super efficiently
- However, there are some easy ways to get gains
- Using many cores on your own machine is different from using cores spread across multiple machines
- Pegasus, as far as I can tell, makes it look as if you are using a lot of cores on one machine
- Vectorizing functions is the easiest step to writing fast and parallel code (see the sketch below)
- Many NumPy operations are already parallelized across cores
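As a rough illustration (the array size and operation are arbitrary), the vectorized form below hands the whole loop to NumPy's compiled internals instead of running one interpreted Python operation per element:
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Loop version: one interpreted Python operation per element
squares_loop = np.empty_like(a)
for i in range(len(a)):
    squares_loop[i] = a[i] * a[i]

# Vectorized version: a single NumPy call, the loop runs in optimized native code
squares_vec = a * a

assert np.allclose(squares_loop, squares_vec)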
- Split
- Think about how to break up the data so that you can compute the same thing on smaller parts of the larger dataset
- Apply
- Solve the problem a bunch of times on smaller versions of your large data
- Combine
- Then combine those mini solutions into one big solution
- In image analysis, convolutions split up well because each output depends only on a small subset of the data
- If you need all of the data to do each computation, it's less efficient (a sketch of the split/apply/combine pattern follows)
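A minimal sketch of the split/apply/combine pattern, assuming equally sized chunks and a simple per-chunk mean (the chunk count, pool size, and function are made up for illustration):
import numpy as np
from multiprocessing import Pool

def chunk_mean(chunk):
    # "Apply": solve the problem on one small piece of the data
    return chunk.mean()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.float64)
    chunks = np.array_split(data, 8)                  # "Split" into equal pieces
    with Pool(4) as pool:
        partial_means = pool.map(chunk_mean, chunks)  # apply in parallel
    result = np.mean(partial_means)                   # "Combine" the mini solutions
    print(result)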
# Parallel version is equivalent to the following
def map_parallel(func, input_data):
    intermediate_data = []
    for i in input_data:
        intermediate_data.append(func(i))
    return intermediate_data

def reduce(func, intermediate_data):
    starting_value = None
    for value in intermediate_data:
        if starting_value is None:
            starting_value = value
        else:
            starting_value = func(starting_value, value)
    return starting_value
- Prepare data to be split between processors
- Use multiprocessing to apply the map step
- Combine the results of the map as needed
- Easy when your data is small
- Good for simple functions
- Some good examples
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
###################################
# Sequential
results = []
for item in my_array:
    results.append(my_function(item))
# Parallel
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
- For more complicated parallel tasks
- Compiles functions to fast machine code (via LLVM)
- Can parallelize code easily (if you read the documentation :D )
import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)
def logistic_regression(Y, X, w, iterations):
    for i in range(iterations):
        w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
    return w
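A usage sketch for the jitted function above; the array shapes, random data, and iteration count are made up for illustration:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))             # 1000 samples, 10 features
Y = rng.choice([-1.0, 1.0], size=1000)      # labels in {-1, +1}
w = np.zeros(10)                            # initial weights

w = logistic_regression(Y, X, w, iterations=100)   # first call triggers compilation
print(w)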
- Virtual environments: https://conda.io/docs/user-guide/tasks/manage-environments.html
- Figuring out your dependencies: pip freeze
- pylint <filename> for linting
- autopep8 <filename> for correcting easy linting errors
- Two types
- Unit Test
- Just tests individual functions
- Define some input and some desired output
- Tests whether it works
- Integration Test
- Combining functions
- Combining modules
- Very simple example cases
- Unit Test
- Tests are usually written in a separate folder
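A minimal sketch of a unit test using Python's built-in unittest module; the add function and file name are hypothetical, and in practice the function under test would live in its own module:
# test_math_utils.py (hypothetical file, normally kept in a tests/ folder)
import unittest

def add(a, b):
    # the function under test; normally imported from the code being tested
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_integers(self):
        # define some input and the desired output, then check they match
        self.assertEqual(add(2, 3), 5)

    def test_add_handles_negatives(self):
        self.assertEqual(add(-1, 1), 0)

if __name__ == '__main__':
    unittest.main()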