Thursday, May 22, 2008

PERL ONE LINERS

Level: Introductory


This article, as regular readers may have guessed, is the sequel to "One-liners 101," which appeared in a previous installment of "Cultured Perl". The earlier article is an absolute requirement for understanding the material here, so please take a look at it before you continue.

The goal of this article, as with its predecessor, is to show legible and reusable code, not necessarily the shortest or most efficient version of a program. With that in mind, let's get to the code!


Awk is commonly used for basic tasks such as breaking text into fields; Perl, by design, excels at that kind of text manipulation. Thus we come to our first one-liner, intended to add the first and the penultimate columns of the text input to the script.

Listing 1: Like awk?


# add first and penultimate columns
# NOTE the equivalent awk script:
# awk '{i = NF - 1; print $1 + $i}'
perl -lane 'print $F[0] + $F[-2]'


So what does it do? The magic is in the switches. The -n switch wraps the code in a loop that reads the input line by line; the -a switch autosplits each line on whitespace into the @F array; the -l switch handles line endings; and the -e switch supplies the statement to run inside the loop. The code of interest actually produced is roughly:

Listing 2: The full Monty


while (<>)
{
    @F = split(' ');
    print $F[0] + $F[-2]; # offset -2 means "2nd to last element of the array"
}
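
To try it out without creating a file (the sample numbers below are arbitrary):

# quick check on two throwaway lines of whitespace-separated numbers
printf '1 2 3\n10 20 30\n' | perl -lane 'print $F[0] + $F[-2]'
# prints 3 (1 + 2) and then 30 (10 + 20)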


Another common task is to print the contents of a file between two markers or between two line numbers.

Listing 3: Printing a range of lines


# 1. just lines 15 to 17
perl -ne 'print if 15 .. 17'

# 2. just lines NOT between line 10 and 20
perl -ne 'print unless 10 .. 20'

# 3. lines between START and END
perl -ne 'print if /^START$/ .. /^END$/'

# 4. lines NOT between START and END
perl -ne 'print unless /^START$/ .. /^END$/'


A problem with the first one-liner in Listing 3 is that it will go through the whole file, even after the necessary range has been covered. For the third one-liner, going through the whole file is not wasted work: it has to, because it prints all the lines between START and END markers wherever they occur. If there are eight sets of START/END markers, the third one-liner will print the lines inside all eight sets.
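
You can see this multiple-range behavior with a quick throwaway test (input faked with printf):

# two START/END blocks; the flip-flop prints the lines of both
printf 'a\nSTART\nb\nEND\nc\nSTART\nd\nEND\n' | perl -ne 'print if /^START$/ .. /^END$/'
# prints START, b, END, START, d, END -- one per line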

Preventing the inefficiency of the first one-liner is easy: just use the $. variable, which holds the current line number. Start printing once $. reaches 15, and exit once it has reached 17.

Listing 4: Printing a numeric range of lines more efficiently


# just lines 15 to 17, efficiently
perl -ne 'print if $. >= 15; exit if $. >= 17;'


Enough printing, let's do some editing. Needless to say, if you are experimenting with one-liners, especially ones intended to modify data, you should keep backups. You wouldn't be the first programmer to think a minor modification couldn't possibly make a difference to a one-liner program; just don't make that assumption while editing the Sendmail configuration or your mailbox.

Listing 5: In-place editing


# 1. in-place edit of *.c files changing all foo to bar
perl -p -i.bak -e 's/\bfoo\b/bar/g' *.c

# 2. delete first 10 lines
perl -i.old -ne 'print unless 1 .. 10' foo.txt

# 3. change all the isolated oldvar occurrences to newvar
perl -i.old -pe 's{\boldvar\b}{newvar}g' *.[chy]

# 4. increment all numbers found in these files
perl -i.tiny -pe 's/(\d+)/ 1 + $1 /ge' file1 file2 ....

# 5. delete all but lines between START and END
perl -i.old -ne 'print unless /^START$/ .. /^END$/' foo.txt

# 6. binary edit (careful!)
perl -i.bak -pe 's/Mozilla/Slopoke/g' /usr/local/bin/netscape


Why does 1 .. 10 specify line numbers 1 through 10? Read the "perldoc perlop" manual page. Basically, the .. operator iterates through a range, and a constant numeric operand is compared against the current line number ($.). Thus, the script does not count 10 lines; it counts 10 iterations of the loop generated by the -n switch (see "perldoc perlrun" and Listing 2 for an example of that loop).
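
In other words, for a single input file, one-liner 2 from Listing 5 could also be written with the line-number test spelled out (a slightly more verbose equivalent):

# delete the first 10 lines, testing $. explicitly
perl -i.old -ne 'print unless $. <= 10' foo.txt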

The magic of the -i switch is that it replaces each file in @ARGV with the version produced by the script's output on that file. Thus, the -i switch makes Perl into an editing text filter. Do not forget to use the backup option to the -i switch. Following the i with an extension will make a backup of the edited file using that extension.
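
For example, on a hypothetical config.txt (the file name and the old/new strings are just placeholders):

# rewrite config.txt in place, keeping the original as config.txt.bak
perl -i.bak -pe 's/old/new/g' config.txt

# the same edit with no backup at all -- riskier
perl -i -pe 's/old/new/g' config.txt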

Note how the -p and -n switches are used. The -n switch is used when you want to print data explicitly yourself. The -p switch implicitly inserts a print $_ statement in the loop produced by the -n switch. Thus, the -p switch is better for full processing of a file, while the -n switch is better for selective processing, where only specific data needs to be printed.
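
Roughly speaking (simplified from "perldoc perlrun"), the two switches generate these loops around your -e code:

# -n: the loop only; nothing is printed unless you print it yourself
while (<>)
{
    # your -e code goes here
}

# -p: the same loop, plus an implicit print of $_ after each pass
while (<>)
{
    # your -e code goes here
}
continue
{
    print;
}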

Examples of in-place editing can also be found in the "One-liners 101" article.

Reversing the contents of a file is not a common task, but the following one-liners show that the -n and -p switches are not always the best choice when processing an entire file.

Listing 6: Reversal of files' fortunes


# 1. command-line that reverses the whole input by lines
# (printing the lines in reverse order: last line first)
perl -e 'print reverse <>' file1 file2 file3 ....

# 2. command-line that shows each line with its characters backwards
perl -nle 'print scalar reverse $_' file1 file2 file3 ....

# 3. find palindromes in the /usr/dict/words dictionary file
perl -lne '$_ = lc $_; print if $_ eq reverse' /usr/dict/words

# 4. command-line that reverses all the bytes in a file
perl -0777e 'print scalar reverse <>' f1 f2 f3 ...

# 5. command-line that reverses each paragraph in the file but prints
# them in order
perl -00 -e 'print reverse <>' file1 file2 file3 ....


The -0 (zero) flag is very useful if you want to read a full paragraph or a full file into a single string. (It takes the octal value of any character, so you can use a special character as the record separator.) Be careful when reading a full file in one go (-0777), because a large file will use up all your memory. If you need to read the contents of a file backwards (for instance, to analyze a log in reverse order), use the CPAN module File::ReadBackwards. Also see "One-liners 101," which shows an example of log analysis with File::ReadBackwards.
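
For reference, here is a minimal File::ReadBackwards sketch (it assumes the module is installed; "error.log" is just a placeholder file name):

# print the lines of error.log from last to first, without
# reading the whole file into memory
perl -MFile::ReadBackwards -e '
    my $bw = File::ReadBackwards->new("error.log") or die "cannot open: $!";
    print while defined( $_ = $bw->readline );
'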

Note the similarity between the first and second scripts in Listing 6; their behavior, however, is completely different. The difference lies in using <> in scalar context (as -n does in the second script, reading one line at a time into $_) or in list context (as the first script does, where reverse slurps all of the lines at once).
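
A two-line illustration of the difference (the strings are arbitrary):

# list context: reverse reorders the ELEMENTS
perl -e 'print reverse("one\n", "two\n")'        # prints "two", then "one"

# scalar context: reverse joins the elements and reverses the CHARACTERS
perl -e 'print scalar reverse("one\n", "two\n")' # prints "\nowt\neno"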

The third script, the palindrome detector, did not originally have the $_ = lc $_; segment. I added that to catch palindromes like "Bob" that are not the same backwards unless case is ignored.

My addition could also be written as $_ = lc;, but explicitly stating the argument of the lc() function makes the one-liner more legible, in my opinion.
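
A quick check that the lowercasing does its job:

# "Bob" is a palindrome only once case is ignored
echo Bob | perl -lne '$_ = lc $_; print if $_ eq reverse'
# prints: bob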


Listing 7: Rewrite with a random number


# replace string XYZ with a random number less than 611 in these files
perl -i.bak -pe "s/XYZ/int rand(611)/e" f1 f2 f3


This is a filter that replaces XYZ with a random number less than 611 (that number is arbitrarily chosen). Remember that rand() returns a fractional number greater than or equal to 0 and less than its argument, so int rand(611) yields an integer between 0 and 610.

Note that XYZ will be replaced by a different random number every time, because the /e modifier causes the replacement, "int rand(611)", to be evaluated anew for each substitution.
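
To see the effect without rewriting any files, run the same substitution as a read-only filter (throwaway input faked with printf):

# no -i, so nothing on disk is touched; each line gets its own random number
printf 'XYZ\nXYZ\n' | perl -pe 's/XYZ/int rand(611)/e'
# prints two (almost certainly different) integers between 0 and 610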

Listing 8: Revealing the files' base nature


# 1. Run basename on contents of file
perl -pe "s@.*/@@gio" INDEX

# 2. Run dirname on contents of file
perl -pe 's@^(.*/)[^/]+@$1\n@' INDEX

# 3. Run basename on contents of file
perl -MFile::Basename -ne 'print basename $_' INDEX

# 4. Run dirname on contents of file
perl -MFile::Basename -lne 'print dirname $_' INDEX


One-liners 1 and 2 came from Paul, while 3 and 4 were my rewrites of them with the File::Basename module. Their purpose is simple, but any system administrator will find these one-liners useful.

Listing 9: Moving or renaming, it's all the same in UNIX


# 1. write command to mv dirs XYZ_asd to Asd
# (you may have to preface each '!' with a '\' depending on your shell)
ls | perl -pe 's!([^_]+)_(.)(.*)!mv $1_$2$3 \u$2\E$3!gio'

# 2. Write a shell script to move input from xyz to Xyz
ls | perl -ne 'chop; printf "mv $_ %s\n", ucfirst $_;'


For regular users or system administrators, renaming files based on a pattern is a very common task. The scripts above do two kinds of job: the first generates mv commands that strip everything up to and including the _ character and capitalize what follows, while the second generates mv commands that uppercase the first letter of each filename using Perl's ucfirst() function.
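
Note that both one-liners only print the mv commands. To actually perform the renames, inspect the output first and then pipe it to a shell, for example:

# feed the generated commands to the shell (assumes file names without
# spaces or shell metacharacters)
ls | perl -ne 'chop; printf "mv $_ %s\n", ucfirst $_;' | sh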

There is a UNIX utility called "mmv" by Vladimir Lanin that may also be of interest. It allows you to rename files based on simple patterns, and it's surprisingly powerful. See the Resources section for a link to this utility.




Some of mine

The following is not a one-liner, but it's a pretty useful script that started as a one-liner. It is similar to Listing 7 in that it replaces a fixed string, but the trick is that, because the script rewrites itself along with the other files, the replacement string becomes the fixed string it will search for the next time it is run.

The idea came from a newsgroup posting a long time ago, but I haven't been able to find the original version. The script is useful in case you need to replace one IP address with another in all your system files -- for instance, if your default router has changed. The script includes $0 (in UNIX, usually the name of the script itself) in the list of files to rewrite.

As a one-liner it ultimately proved too complex, and printing the command that is about to be executed is a necessity when system files are going to be modified.

Listing 10: Replace one IP address with another one


#!/usr/bin/perl -w

use Regexp::Common qw/net/; # provides the regular expressions for IP matching

my $replacement = shift @ARGV; # get the new IP address

die "You must provide $0 with a replacement string for the IP 111.111.111.111"
    unless $replacement;

# we require that $replacement be JUST a valid IP address
die "Invalid IP address provided: [$replacement]"
    unless $replacement =~ m/^$RE{net}{IPv4}$/;

# replace the string in each file
foreach my $file ($0, qw[/etc/hosts /etc/defaultrouter /etc/ethers], @ARGV)
{
    # note that we know $replacement is a valid IP address, so this is
    # not a dangerous invocation
    my $command = "perl -p -i.bak -e 's/111.111.111.111/$replacement/g' $file";

    print "Executing [$command]\n";
    system($command);
}
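
Usage is simply (assuming the script has been saved as, say, newip.pl and made executable; the address and extra file here are placeholders):

# rewrite the new router address into the system files, the script itself,
# and any extra files named on the command line
./newip.pl 10.0.0.1 /etc/resolv.conf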


Note the use of the Regexp::Common module, an indispensable resource for any Perl programmer today. Without Regexp::Common, you will be wasting a lot of time trying to match a number or other common patterns manually, and you're likely to get it wrong.
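
As a small illustration of the same pattern family used in Listing 10 (a sketch; /etc/hosts is just a convenient file to scan):

# print only the lines that contain an IPv4 address
perl -MRegexp::Common=net -ne 'print if /$RE{net}{IPv4}/' /etc/hosts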
