An intro to comm and diff commands

It's often useful to compare versions of text files. Let's take a look at comm and diff

The comm command

This command compares two text files and displays the lines that are unique to each one and the lines they have in common.

Let's say we have these two files:

When we run comm file1.txt file2.txt we get

In my opinion, the comm output is somewhat hard to look at, but it's three columns. Excuse my terrible lines:

The first column contains lines unique to the first file argument, the second column contains the lines unique to the second file argument, and the third column contains the lines shared by both files.

We can choose to suppress a specific column by using the option -n where n is either 1, 2, or 3. Say we wanted to output only the lines shared by both files, we can use comm -12 file1.txt file2.txt

The diff command

diff is a much more complex tool. It supports many output formats and has the ability to process large collections of text files at once. diff is often used to create diff files (patches) that are used by programs such as path to convert one version of a file or files to another version. Let's run diff on our same two files from before diff file1.txt file2.txt

This is the default style output, in this format, each group of changes is preceded by a change command in the form of range operation range to describe the positions and types of changes required to convert the first file to the second file.

First we see

1d0
< a

This is telling us that we have to delete the first row in the file1, which is the line with a.

Next we have

4a4
> e

which is telling us that we have to add a line to the first file, in the fourth line position, then it tells us which line to add > e

I know this is confusing, to be fair, the default style isn't used as much as the context format and unified format are, let's look at those an explain more.

We can use the context format by adding the -c option

diff -c file1.txt file2.txt

At the top we see the names of the two files and their timestamps, the first file is marked with asterisks, and the second file is marked with dashes. diff will use either asterisks or dashes to let us know which file it's talking about throughout the remainder of the listing.

Next we see a line of asterisks which is just formatting.

Then we've got groups of changes, in the first group we see

*** 1,4 ****

which means lines 1 through 4 in the first file

and then we see

- a
  b
  c
  d

Which is the contents of the file, except there's a - before the a, that means we have to remove it.

Indicator Meaning
blank No change needs to be made
(-) Line needs to be deleted
(+) Line needs to be added
! Line needs to be changed

In our first group, we can see that the line with - a needs to be removed from our first file. Our second group of changes is

--- 1,4 ----
  b
  c
  d
+ e

the ---1,4---- is the range of the second file, the + e means we need to add this line to the first file, remember the goal is to make the first file match the second file.

We can also use the unified format it's similar, but more concise, it eliminates the duplicated lines of context. diff -u file1.txt file2.txt