Using Linux Lab
For this project you will create a shell pipeline that truncates a file via random shuffling, then verifies the correct number of lines. Large files are often too big for traditional data science tools like pandas or Jupyter to load into memory. One approach to this problem is to sample the file: shuffle its lines and keep only a random subset.
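As a preview of the pipeline you will build, here is a minimal sketch; the file names `big_data.csv` and `sample.csv` are hypothetical, and `shuf` is the GNU coreutils implementation:

```bash
# Sample a large CSV by shuffling and truncating in one step:
# shuf reads the file and -n 1000 emits 1,000 randomly chosen lines,
# which is equivalent to shuffling the whole file and keeping the top.
shuf -n 1000 big_data.csv > sample.csv

# Verify that the sample has the expected number of lines.
wc -l sample.csv
```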
- Run `wc -l nba_2017.csv`. How many lines are in the file?
- Run `head nba_2017.csv` and inspect the first few rows of the file.
- Shuffle and truncate the file: `shuf -n 100 nba_2017.csv > small_nba_2017.csv`
- Count the number of lines in `small_nba_2017.csv`. How many are there?
- Inspect the first few lines with `head small_nba_2017.csv`. What do you see?
- What happens when you run `tail -n +2 nba_2017.csv | head`?
- How could you use this approach to remove the column headers before shuffling?
- Why would you want to do this, and how could you append the headers back on after you shuffle? See the sketch below for one possibility.
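For the last two questions, here is one possible pipeline, a sketch rather than the definitive answer; the intermediate files `header.csv` and `rows.csv` are hypothetical names chosen for illustration, and GNU coreutils is assumed:

```bash
# Save the header row so it is not shuffled into the sample.
head -n 1 nba_2017.csv > header.csv

# tail -n +2 prints everything from line 2 onward, dropping the header;
# shuf -n 100 then keeps 100 randomly chosen data rows.
tail -n +2 nba_2017.csv | shuf -n 100 > rows.csv

# Reattach the header so the sample is still a valid CSV.
cat header.csv rows.csv > small_nba_2017.csv

# Expect 101 lines: 1 header row plus 100 sampled rows.
wc -l small_nba_2017.csv
```

Shuffling with the header left in would either drop the column names from the sample or bury them at a random position, which breaks tools like pandas that expect them on the first line; stripping the header first and concatenating it back afterward keeps the sample a well-formed CSV.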