AWK Basic
Awk is a programming language for processing and manipulating text data. Suppose we have the following data file, datafile.txt.
orange 22.5 0
banana 19.5 12
apple 23.0 0
mango 25.0 30
papaya 22.5 25
Print only a certain field, for instance the first field.
awk '{print $1}' datafile.txt
Print the lines (the whole line, $0) where the value of the third field is greater than 0.
awk '$3 > 0 {print $0}' datafile.txt
Do some math (multiply the second field by the third field) and print the result.
awk '$3 > 0 {print $1, $2 * $3}' datafile.txt
Print only the lines where the value of the third field is equal to 0.
awk '$3 == 0 {print $0}' datafile.txt
Print the lines that match a given pattern.
awk '/apple/ {print $0}' datafile.txt
Print the lines where a certain field matches a given pattern, first as an exact match and then as a regex match.
awk '$1 == "apple" {print $0}' datafile.txt
awk '$1 ~ /apple/ {print $0}' datafile.txt
Pass a shell variable into awk, match against it, and print.
fruit="apple"
awk -v f="$fruit" '$0 ~ f {print $0}' datafile.txt

fruit="apple"
awk -v f="$fruit" '$1 ~ f {print $0}' datafile.txt
Duplicate lines can be removed with the sort and uniq commands, but that requires sorting first. If sorting is undesirable, awk can do it instead. Suppose datafile.txt has the following content.
apple 23.0 0
orange 22.5 0
orange 22.5 0
banana 19.5 12
apple 23.0 0
papaya 22.5 25
mango 25.0 30
orange 22.5 0
papaya 22.5 25
Print unique lines (removing duplicate lines) with sort and uniq:
sort -u datafile.txt
or
sort datafile.txt | uniq
To do the same without sorting, use awk.
awk '!_[$0]++' datafile.txt
The expression !_[$0]++ is the pattern part of the awk rule. $0 represents the entire line, and _[$0] refers to an element in an associative array named _, using the current line’s content ($0) as the index key.
The expression ++ is a post-increment operator. It returns the current value of _[$0] before it is incremented.
The first time a line is encountered, _[$0] is uninitialized, so its value is treated as 0 (false). After this value is retrieved, _[$0] is incremented to 1 and stored in memory, marking the line as seen.
The ! is the logical NOT operator. It negates the value returned by the post-increment operation.
- For the first occurrence of a line, the value returned by _[$0]++ is 0 (false). The ! operator makes this !0, which evaluates to true.
- For any subsequent occurrence of the same line, _[$0] already has a value of 1 or greater. The value returned by _[$0]++ is that nonzero value (true). The ! operator negates it, which evaluates to false.
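The evaluation described above can be traced line by line. The following is a minimal sketch, assuming the nine-line sample with duplicates is saved as datafile.txt. The scratch directory and the names before and trace are illustrative only and not part of the original command.

```shell
# Recreate the sample data in a scratch directory (assumed file name: datafile.txt).
cd "$(mktemp -d)"
cat > datafile.txt <<'EOF'
apple 23.0 0
orange 22.5 0
orange 22.5 0
banana 19.5 12
apple 23.0 0
papaya 22.5 25
mango 25.0 30
orange 22.5 0
papaya 22.5 25
EOF

# For every line, capture the value returned by _[$0]++ (the count before
# incrementing), then print that value, its negation, and the line itself.
trace=$(awk '{ before = _[$0]++; print before, !before, $0 }' datafile.txt)
printf '%s\n' "$trace"
```

The first column is the value returned by the post-increment and the second is the result of !. Only the lines whose second column is 1 would be printed by awk '!_[$0]++'. The third orange line shows 2 in the first column, which is why any value of 1 or greater counts as "already seen".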
Because the expression !_[$0]++ is used as the pattern and no explicit action is provided, awk uses its default action of printing the current line (print $0) only when the pattern evaluates to true.
This results in only the first (unique) occurrence of each line being printed to the output, while subsequent duplicates are ignored.
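To make the implicit steps visible, the one-liner can be spelled out with an explicit test and an explicit print. This is a sketch of one equivalent form, not the only way to write it. It assumes datafile.txt holds the nine-line sample with duplicates, and the array name seen and the shell variables short and long are chosen here for illustration.

```shell
# Recreate the sample data in a scratch directory (assumed file name: datafile.txt).
cd "$(mktemp -d)"
cat > datafile.txt <<'EOF'
apple 23.0 0
orange 22.5 0
orange 22.5 0
banana 19.5 12
apple 23.0 0
papaya 22.5 25
mango 25.0 30
orange 22.5 0
papaya 22.5 25
EOF

# Idiomatic form: pattern only, so awk applies its default action (print $0).
short=$(awk '!_[$0]++' datafile.txt)

# Spelled-out form: print the line only if its counter is still 0, then count it.
long=$(awk '{ if (seen[$0] == 0) print $0; seen[$0]++ }' datafile.txt)

printf '%s\n' "$long"
```

Both forms print the five distinct lines in the order they first appear in the file (apple, orange, banana, papaya, mango), whereas sort -u would return them in sorted order.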
Note: The !_[$0]++ idiom is a little beyond basic level. _ is just the name of the associative array; the command still works if it is replaced with any other valid name, such as !a[$0]++ or even !line[$0]++.