# Text Parsing with awk, sed and paste
- Authors: Kathi Unglert
- Research field: Volcano Geophysics
- Lesson Topic: `awk`, `sed` and `paste` in the bash shell.
Check whether you have `awk` installed by using the `which` command:
which awk
You should get something like this:
/usr/bin/awk
If you don't have `awk` installed you can get it here.
`volcano_observations.dat` contains the following lines (try viewing it with `cat volcano_observations.dat`):
month level observation
april 1 steam
may 2 ash
june 4 plinian
july 1 steam
august 2 earthquakes
september 3 subplinian
october 2 earthquakes
november 1 steam
We can achieve the same effect as `cat` with:
awk '{print $0}' volcano_observations.dat
- Challenge Question: How can we write the output to a file instead of the screen?
- Solution:
awk '{print $0}' volcano_observations.dat > observations2.dat
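The redirection in the solution can be tried safely in a scratch directory. The sketch below is only an illustration: the `workdir` and `line_count` names and the two-line sample file are assumptions made here. It also shows `>>`, which appends to a file where `>` overwrites it.

```shell
# Scratch directory so no lesson files are overwritten (illustrative name).
workdir=$(mktemp -d)
printf 'month level observation\napril 1 steam\n' > "$workdir/volcano_observations.dat"

# > creates the target file, or overwrites it if it already exists.
awk '{print $0}' "$workdir/volcano_observations.dat" > "$workdir/observations2.dat"

# >> appends to the existing file instead of replacing it.
awk '{print $0}' "$workdir/volcano_observations.dat" >> "$workdir/observations2.dat"

# The copy now holds the two input lines twice.
line_count=$(wc -l < "$workdir/observations2.dat" | tr -d ' ')
echo "$line_count"
```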
We can select individual columns to print like this:
awk '{print $3}' volcano_observations.dat
But maybe we don't want the header entry `observation` in our output. `NR` holds the number of the current line, so we can skip the first line with:
awk '{if (NR > 1) print $3}' volcano_observations.dat
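The same filter can also be written in awk's pattern-action form, where the condition sits in front of the braces instead of inside an `if`. A minimal sketch, assuming a scratch copy of the first few rows of the data file (the `workdir` and `no_header` names are chosen here for illustration):

```shell
# Scratch directory with a short copy of the data (illustrative names).
workdir=$(mktemp -d)
cat > "$workdir/volcano_observations.dat" <<'EOF'
month level observation
april 1 steam
may 2 ash
june 4 plinian
EOF

# NR > 1 acts as a pattern: the action in braces runs only for lines after the first.
no_header=$(awk 'NR > 1 {print $3}' "$workdir/volcano_observations.dat")
echo "$no_header"
```

Both forms select the same lines; which one to use is mostly a matter of readability.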
We can pick all entries that contain "earth" in their observation column:
awk '{if (index($3, "earth") != 0) print $3}' volcano_observations.dat
Or, we can combine conditions with `&&` to pick all entries that have an "e" in their observation column, not including the header:
awk '{if (NR > 1 && index($3, "e") != 0) print $3}' volcano_observations.dat
Or, the same thing, but print out the whole entry:
awk '{if (NR > 1 && index($3, "e") != 0) print $0}' volcano_observations.dat
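A regular-expression match with `~` is a common alternative to `index()` for this kind of test. The sketch below assumes a scratch copy of a few rows of the data; `workdir`, `with_index` and `with_regex` are names made up for this example.

```shell
# Scratch directory with a short copy of the data (illustrative names).
workdir=$(mktemp -d)
cat > "$workdir/volcano_observations.dat" <<'EOF'
month level observation
april 1 steam
may 2 ash
august 2 earthquakes
october 2 earthquakes
EOF

# index() returns the position of "earth" in column 3, or 0 if it is absent.
with_index=$(awk '{if (NR > 1 && index($3, "earth") != 0) print $0}' "$workdir/volcano_observations.dat")

# $3 ~ /earth/ is true when column 3 matches the regular expression "earth".
with_regex=$(awk 'NR > 1 && $3 ~ /earth/ {print $0}' "$workdir/volcano_observations.dat")
echo "$with_regex"
```

For a plain word the two approaches give the same lines; the regular expression becomes useful once the pattern needs wildcards or anchors.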
- Challenge Question: How to print all months with alert level of 2 or higher?
- Solution:
awk '{if (NR > 1 && $2 >= 2) print $1}' volcano_observations.dat
`earthquakes.dat` contains the following (view with `cat`):
date time magnitude lat lon depth
2011/03/01 20:21:11.11 3.1 49.12 -123.10 45
2011/03/01 21:45:51.04 3.8 49.21 -123.08 42
2011/03/01 21:53:42.33 2.5 49.01 -122.89 5
2011/03/01 21:58:32.17 3.4 48.99 -122.89 2
2011/03/01 22:03:56.10 4.3 49.03 -123.12 35
2011/03/01 23:22:45.89 3.1 49.01 -122.91 1
2011/03/02 04:17:03.77 4.3 49.02 -123.01 0.5
2011/03/02 12:01:34.32 3.7 49.17 -123.20 43
2011/03/02 15:34:56.51 2.8 49.14 -123.00 46
2011/03/03 05:21:23.09 3.4 49.09 -123.11 41
2011/03/03 08:54:56.67 3.3 49.09 -123.10 43
2011/03/03 16:32:45.52 4.1 49.10 -123.12 42
In this data, shallow earthquakes may indicate volcanic activity, while deep earthquakes may originate from a subduction zone.
- Challenge Question: How to print all earthquake entries that may be volcanic (depth of 10 or less)?
- Solution:
awk '{ if (NR > 1 && $6 <= 10) print $0}' earthquakes.dat
Or, print out only the magnitudes of those:
awk '{ if (NR > 1 && $6 <= 10) print $3}' earthquakes.dat
Or, the magnitudes and depths:
awk '{ if (NR > 1 && $6 <= 10) print $3,$6}' earthquakes.dat
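The comma in `print $3,$6` matters: it inserts awk's output separator (a space by default) between the values, while leaving it out joins them together. A small sketch using one row of the data (the variable names are chosen here for illustration):

```shell
# One row from earthquakes.dat, used as sample input.
line='2011/03/01 21:53:42.33 2.5 49.01 -122.89 5'

# With a comma, the two fields are separated by a space.
with_comma=$(echo "$line" | awk '{print $3,$6}')

# Without a comma, the two fields are concatenated.
without_comma=$(echo "$line" | awk '{print $3 $6}')

echo "$with_comma"      # 2.5 5
echo "$without_comma"   # 2.55
```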
We can also add some text to what we return:
awk '{ if (NR > 1 && $6 <= 10) print "magnitude",$3,"depth",$6}' earthquakes.dat
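For more control over the layout of labelled output, awk also offers `printf`. The following is a sketch under the assumption of a scratch copy of three rows of the data; `workdir`, `labelled` and the `name=value` layout are choices made for this example only.

```shell
# Scratch directory with a short copy of the data (illustrative names).
workdir=$(mktemp -d)
cat > "$workdir/earthquakes.dat" <<'EOF'
date time magnitude lat lon depth
2011/03/01 20:21:11.11 3.1 49.12 -123.10 45
2011/03/01 21:53:42.33 2.5 49.01 -122.89 5
2011/03/02 04:17:03.77 4.3 49.02 -123.01 0.5
EOF

# printf takes a format string: %s inserts a value as text, \n ends the line.
labelled=$(awk 'NR > 1 && $6 <= 10 {printf "magnitude=%s depth=%s\n", $3, $6}' "$workdir/earthquakes.dat")
echo "$labelled"
```

Unlike `print`, `printf` adds no separators or newline on its own, so everything that appears in the output is spelled out in the format string.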
We'd like to be able to extract which days had volcanic earthquakes. But the year, month and day are separated by `/` instead of spaces, so they appear as one column. We can define a new 'field separator' with the `-F` flag; we can get the month like this:
awk -F/ '{ if (NR > 1) print $2}' earthquakes.dat
But if we ask for the next column in order to get the days, we get the whole rest of the line, too, because `awk` is no longer breaking on spaces:
awk -F/ '{ if (NR > 1) print $3}' earthquakes.dat
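The built-in variable `NF`, which holds the number of fields on the current line, makes this visible. A short sketch on one row of the data (the variable names are made up for this example):

```shell
# One row from earthquakes.dat, used as sample input.
line='2011/03/01 20:21:11.11 3.1 49.12 -123.10 45'

# With -F/ only the two slashes split the line, so awk sees three fields.
field_count=$(echo "$line" | awk -F/ '{print NF}')

# The third field therefore runs from the day to the end of the line.
third_field=$(echo "$line" | awk -F/ '{print $3}')

echo "$field_count"
echo "$third_field"
```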
We can define several field separators at once by giving `-F` a regular expression; here any run of `/`, spaces or `:` separates fields:
awk '-F[/ :]+' '{if (NR > 1) print $1}' earthquakes.dat
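To see which field number each value ends up with under these separators, one option is to loop over the fields of a single row and print each with its index. This is only a sketch; the `numbered` variable is a name chosen here.

```shell
# One row from earthquakes.dat, used as sample input.
line='2011/03/01 20:21:11.11 3.1 49.12 -123.10 45'

# Print every field together with its number; NF is the number of fields.
numbered=$(echo "$line" | awk -F'[/ :]+' '{for (i = 1; i <= NF; i++) print i, $i}')
echo "$numbered"
```

The row now splits into ten fields: field 3 is the day and field 10 is the depth.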
This lets us find the days with volcanic earthquakes (the depth is now field 10):
awk '-F[/ :]+' '{if (NR > 1 && $10 <= 10) print $3}' earthquakes.dat
We can then remove duplicates from the list using `awk`:
awk '-F[/ :]+' '{if (NR > 1 && $10 <= 10) print $3}' earthquakes.dat | awk '!_[$0]++'
Compare this to `uniq`, which only removes duplicate lines that are adjacent:
awk '-F[/ :]+' '{if (NR > 1 && $10 <= 10) print $3}' earthquakes.dat | uniq
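On this data the two commands give the same result, because the repeated days sit next to each other. The difference shows up when duplicates are not adjacent. The sketch below uses a made-up list of days (not taken from `earthquakes.dat`), and `seen` is simply a more descriptive array name than `_`.

```shell
# Made-up list in which the duplicate "01" entries are not adjacent.
days=$(printf '01\n02\n01\n')

# uniq only collapses repeats that directly follow each other.
with_uniq=$(echo "$days" | uniq)

# The awk idiom counts every line it has seen and prints only first occurrences.
with_awk=$(echo "$days" | awk '!seen[$0]++')

echo "$with_uniq"   # still three lines: 01 02 01
echo "$with_awk"    # two lines: 01 02
```

Sorting first (`sort | uniq`, or `sort -u`) also removes all duplicates, at the cost of changing the order of the lines.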
`sed` is a powerful text manipulation tool, most often used for find-and-replace operations. We can replace all instances of `2011` with `2010` in `earthquakes.dat` (the trailing `g` means globally, that is, every match on a line rather than just the first):
sed 's/2011/2010/g' earthquakes.dat > earthquakes2.dat
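The effect of the `g` flag is easiest to see on a line that contains the search text more than once. The line below is made up for this illustration, as are the variable names.

```shell
# Made-up line in which 2011 occurs twice.
line='2011/03/01 2011'

# Without g, only the first match on each line is replaced.
first_only=$(echo "$line" | sed 's/2011/2010/')

# With g, every match on the line is replaced.
all_matches=$(echo "$line" | sed 's/2011/2010/g')

echo "$first_only"    # 2010/03/01 2011
echo "$all_matches"   # 2010/03/01 2010
```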
If we want to substitute the month, we have to be a bit more careful; replacing every instance of `03` might lead to some unintended replacements. We can be sure that we're looking at a month if the text follows the pattern `/03/`, but now `sed` will get confused as to which slashes are part of the command, and which slashes are characters to be replaced. We have to 'escape' the slashes we want to count as text, using the character `\`:
sed 's/\/03\//\/04\//g' earthquakes.dat > earthquakes2.dat
Since this is a bit hard to read, many implementations of `sed` accept other delimiting characters; for example, try:
sed 's|/03/|/04/|g' earthquakes.dat > earthquakes2.dat
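To see why the surrounding slashes matter, the two substitutions can be compared on a single row of the data. A small sketch (the variable names are chosen here for illustration):

```shell
# One row from earthquakes.dat in which 03 also appears in the time and latitude.
line='2011/03/01 22:03:56.10 4.3 49.03 -123.12 35'

# Replacing every 03 also changes the minutes and the latitude.
naive=$(echo "$line" | sed 's/03/04/g')

# Requiring the surrounding slashes restricts the change to the month.
careful=$(echo "$line" | sed 's|/03/|/04/|g')

echo "$naive"     # 2011/04/01 22:04:56.10 4.3 49.04 -123.12 35
echo "$careful"   # 2011/04/01 22:03:56.10 4.3 49.03 -123.12 35
```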
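We can also substitute inside the original file with the `-i` flag:
sed -i 's/2011/2010/g' earthquakes2.dat
or on Mac, where `-i` must be followed by a backup suffix (here the original is kept as `earthquakes2.dat.tmp`):
sed -i.tmp 's/2011/2010/g' earthquakes2.dat
or, with an empty suffix so that no backup is kept:
sed -i '' 's/2011/2010/g' earthquakes2.dat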
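Because in-place editing overwrites the file, it can be worth trying it on a scratch copy first. The sketch below uses the form with the suffix attached directly to `-i`, which both GNU and BSD `sed` accept; the `workdir` name and the `.bak` suffix are choices made for this example.

```shell
# Scratch directory with a one-line copy of the data (illustrative names).
workdir=$(mktemp -d)
printf '2011/03/01 20:21:11.11 3.1 49.12 -123.10 45\n' > "$workdir/earthquakes2.dat"

# Edit the file in place and keep the original as earthquakes2.dat.bak.
sed -i.bak 's/2011/2010/g' "$workdir/earthquakes2.dat"

edited=$(cat "$workdir/earthquakes2.dat")
backup=$(cat "$workdir/earthquakes2.dat.bak")
echo "$edited"
echo "$backup"
```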
We can also delete lines:
sed '/2011\/03\/03/d' earthquakes.dat > earthquakes2.dat
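As with substitutions, the escaped slashes in a delete pattern can be avoided. In an address, a different delimiter is introduced with a leading backslash, for example `\|...|`. A minimal sketch on a scratch copy of three lines of the data (the `workdir`, `escaped` and `custom` names are made up here):

```shell
# Scratch directory with a short copy of the data (illustrative names).
workdir=$(mktemp -d)
cat > "$workdir/earthquakes.dat" <<'EOF'
date time magnitude lat lon depth
2011/03/02 15:34:56.51 2.8 49.14 -123.00 46
2011/03/03 05:21:23.09 3.4 49.09 -123.11 41
EOF

# Default / delimiter: every slash in the date has to be escaped.
escaped=$(sed '/2011\/03\/03/d' "$workdir/earthquakes.dat")

# \| switches the address delimiter to |, so the date can be written as-is.
custom=$(sed '\|2011/03/03|d' "$workdir/earthquakes.dat")

echo "$custom"
```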
We can combine the columns of two files with the `paste` command:
paste -d" " earthquakes.dat earthquakes2.dat > alleqs.dat
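What `paste` does is easier to see on two tiny files. The sketch below builds them from the first two months and alert levels of the volcano data; the file names `months.txt` and `levels.txt` and the variable names are hypothetical.

```shell
# Scratch directory with two single-column files (hypothetical names).
workdir=$(mktemp -d)
printf 'april\nmay\n' > "$workdir/months.txt"
printf '1\n2\n' > "$workdir/levels.txt"

# Line 1 of the first file is joined with line 1 of the second, and so on.
# -d" " puts a space between the columns.
spaced=$(paste -d" " "$workdir/months.txt" "$workdir/levels.txt")

# Without -d, the columns are separated by a tab.
tabbed=$(paste "$workdir/months.txt" "$workdir/levels.txt")

echo "$spaced"
```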