A common thing I do is scrape a web page, run it through some Perl magic and marvel at the result. A frequent source of trouble in this process is getting å, ä and ö handled correctly by Perl and various terminals. Here’s a write-up of a simple example.

The web page is UTF-8 encoded, and I save it to disk using “Save as…” in my browser. The resulting file on disk is also UTF-8 encoded.

In this example the file is reasonably small, so I use File::Slurp to read the whole file into a scalar:


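use File::Slurp;                  # Provides read_file
my $text = read_file($filename);  # Slurp the file; $filename holds the path to the saved page
utf8::decode($text);              # Decode the bytes from UTF-8 into Perl characters (in place)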
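
To see what the decode step actually changes, here is a minimal sketch. The sample string is my own, written with byte escapes so it does not depend on how the script file is saved; it stands in for what read_file would return.

```perl
use strict;
use warnings;

# "för" as raw UTF-8 bytes (hypothetical sample); "ö" is the two bytes 0xC3 0xB6.
my $text   = "f\xC3\xB6r";
my $before = length $text;    # length counts bytes here

# utf8::decode converts in place and returns true if the bytes were valid UTF-8.
my $ok    = utf8::decode($text);
my $after = length $text;     # length now counts characters

printf "%d bytes -> %d characters\n", $before, $after;   # 4 bytes -> 3 characters
```

After decoding, "ö" is a single character (U+00F6), which is why a pattern containing one "ö" can match it.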

I can now match å, ä and ö in my Perl code like this:


my ($address) = ($text =~ m{title="Visa alla bilder för ([^"]+)"}sm);
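
If you would rather not depend on how the script file itself is saved, the same match can be written with an escape instead of a literal "ö". This is only a sketch; the sample input and the address in it are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input: the attribute as raw UTF-8 bytes, as read from disk.
my $text = qq{<a title="Visa alla bilder f\xC3\xB6r Storgatan 1">};
utf8::decode($text);    # bytes -> characters

# \xF6 is "ö" (U+00F6) written as an escape, so the source file's encoding does not matter.
my ($address) = ($text =~ m{title="Visa alla bilder f\xF6r ([^"]+)"}sm);

print "$address\n" if defined $address;   # Storgatan 1
```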

Later, when I have finished my text processing and want to print the result in my terminal (Cygwin in this case), I do:


my $output = "";

$output .= "Adress: " . $house->{address} . "\n" if defined($house->{address});
$output .= "Område: " . $house->{area} . "\n" if defined($house->{area});

...

utf8::encode($output); # Encode the text as UTF-8 which is correctly displayed by Cygwin
print $output;
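
An alternative to encoding the string by hand is to put an encoding layer on STDOUT and let Perl encode everything that is printed. This is a different technique from the one I use above, shown here only as a sketch; the $house record is made-up sample data, and the non-ASCII letters are written as escapes.

```perl
use strict;
use warnings;

# Declare once that STDOUT expects UTF-8; Perl then encodes each print itself.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample record; \xF6 is "ö" and \xE5 is "å".
my $house = { area => "S\xF6dermalm" };

my $output = "";
$output .= "Omr\xE5de: " . $house->{area} . "\n" if defined($house->{area});

print $output;    # no utf8::encode here, the layer does the encoding
```

Use one approach or the other. Calling utf8::encode on a string and then printing it through the layer would encode it twice and garble the output.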

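Note: You should not “use utf8;” in this Perl script. “use utf8;” only tells Perl that the script’s own source code is written in UTF-8, so it should be used only if that is the case! It has no effect on how input or output is decoded or encoded.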
