Installation:
To install from Hackage, run:
cabal install smap
To install from source, you can use that or download this repo and run
stack install smap
You will need cabal or stack if you don't already have one of them.
Tutorial:
The setup:
cat > patients << EOF
Bob Smith
Jane Doe
John Smith
Carol Carell
EOF
cat > has_cold << EOF
Jane Doe
John Smith
EOF
cat > has_mumps << EOF
Jane Doe
Carol Carell
EOF
Simple usage (sets)
cat - Set Union (and Deduplication)
Sick patients:
$ smap cat has_cold has_mumps
Jane Doe
John Smith
Carol Carell
You can also use -
instead of a filename to represent stdin/stdout. (This works for any command.)
$ cat has_cold | smap cat - has_mumps
Jane Doe
John Smith
Carol Carell
If you don't provide any arguments, cat
will assume you mean stdin.
$ cat has_cold has_mumps | smap cat
Jane Doe
John Smith
Carol Carell
sub - Set subtraction
Healthy patients:
$ smap sub patients has_cold has_mumps
Bob Smith
int - Set intersection
Patients with both a cold and mumps:
$ smap int has_cold has_mumps
Jane Doe
It's worth noting that both int and sub treat their first argument as a stream, not a set. This means that they won't deduplicate values from their first argument. In practice you will find that this is the most useful arrangement. You can always use smap cat
to turn a stream into a set.
To put this all together, let's find patients who only have a cold or mumps, but not both:
$ smap sub <(smap cat has_cold has_mumps) <(smap int has_cold has_mumps)
Carol Carell
John Smith
If you haven't seen the <(command)
syntax before, it's a very useful shell tool called process substitution.
Advanced usage (maps)
When using smap
with sets, the behavior is pretty straightforward. It gets a bit more complicated when
dealing with maps.
If you provide smap
with a filepath, it will construct a map where the keys equal the values. (This
is equivalent to a set). If you pass in +file1,file2
as an argument, smap
will construct a map using lines from file1 as keys and lines from file2 as values.
We can get a list of patient last names using cut -f 2 -d ' ' <patient file>
Pick one patient from each family:
$ smap cat +<(cut -f 2 -d ' ' patients),patients
Bob Smith
Jane Doe
Carol Carell
To understand the above:
<(cut -f 2 -d ' ' patients)
gets a list of all the patients' last names and creates a pipe containing this list.
+<(cut -f 2 -d ' ' patients),patients
constructs a stream where the keys are the last names and the values are the whole names.
cat
deduplicates by key, so if we see a second (or third, or fourth, etc.) person from a given family we don't print them out.
Patients who have family members with a cold:
$ smap int +<(cut -f 2 -d ' ' patients),patients <(cut -f 2 -d ' ' has_cold)
Bob Smith
Jane Doe
John Smith
To understand the above:
<(cut -f 2 -d ' ' patients)
gets a list of all the patients' last names.
+<(cut -f 2 -d ' ' patients),patients
constructs a stream where the keys are the last names and the values are the whole names.
<(cut -f 2 -d ' ' has_cold)
gets a list of family names of everyone who has a cold.
So int
is filtering the first argument (treated as a key,value
stream) by the keys present in the second argument.
Approximate mode
If you're processing lots of lines and running up against memory limits,
you can use the --approximate
or -a
option to keep track of a 64-bit hash
of each line instead of the entire line. You can also use
--approx-with-key
or -k
if you want to specify the SipHash key.