I often scrape a Web page, run it through some Perl magic, and marvel at the result. A frequent source of trouble in this process is getting å, ä, and ö handled correctly by Perl and various terminals. Here’s a write-up of a simple example.

The web page is UTF-8 encoded. I save it to disk using “Save as…” in my browser, and the resulting file on disk is also UTF-8 encoded.

In this example the file is reasonably small, so I use File::Slurp to read the whole file into a scalar:

 use File::Slurp;
 my $text = read_file($filename);  # Slurp the file as raw bytes; $filename is the path to the saved page
 utf8::decode($text);              # Decode the bytes from UTF-8 into characters
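To see what the decode step changes, here is a minimal sketch that uses a hard-coded byte string in place of the slurped file. The sample string and the variable names are my own assumptions, not taken from the real page.

```perl
use strict;
use warnings;

# Hypothetical sample: the UTF-8 bytes for "för" ("ö" is 0xC3 0xB6),
# standing in for what read_file() returns.
my $text = "f\xC3\xB6r";

my $bytes_before = length($text);        # length counts bytes here: 4
my $ok           = utf8::decode($text);  # in place; true when the input is valid UTF-8
my $chars_after  = length($text);        # length now counts characters: 3

print "before: $bytes_before, after: $chars_after\n";
```

Before decoding, Perl sees the two bytes of “ö” as two separate characters. After decoding, it is a single character, which is what the regular expressions need.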

I can now match å, ä, and ö in my Perl code like this:

 my ($address) = ($text =~ m{title="Visa alla bilder för ([^"]+)"}sm); 
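The following sketch shows why the decode step matters for this match. It assumes a made-up snippet of the page (the address “Storgatan 1” is hypothetical) and writes “ö” as the escape `\x{F6}` so the example does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped page: raw UTF-8 bytes, "ö" = 0xC3 0xB6.
my $raw = qq{<a title="Visa alla bilder f\xC3\xB6r Storgatan 1">};

# Same pattern as above, with "ö" spelled as the character U+00F6.
my $re = qr/title="Visa alla bilder f\x{F6}r ([^"]+)"/;

my ($before) = ($raw =~ $re);    # no match: bytes C3 B6 are not the character F6

my $text = $raw;
utf8::decode($text);             # "ö" becomes the single character U+00F6
my ($after) = ($text =~ $re);    # captures the address

print defined $before ? "raw matched\n" : "raw did not match\n";
print "decoded matched: $after\n";
```

The same pattern fails against the raw bytes and succeeds against the decoded text, because only the decoded string contains “ö” as one character.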

Later, when I have finished my text processing and want to print the result in my terminal (Cygwin in this case), I do:

 my $output = "";
 $output .= "Adress: " . $house->{address} . "\n" if defined($house->{address});
 $output .= "Område: " . $house->{area} . "\n" if defined($house->{area});
 ...
 utf8::encode($output);  # Encode the text as UTF-8, which Cygwin displays correctly
 print $output;
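As a self-contained sketch of the output step, the example below builds a small record by hand and encodes it before printing. The `$house` data is hypothetical, and I use the ASCII label “Area:” and `\x{..}` escapes so the snippet behaves the same regardless of the script file’s encoding.

```perl
use strict;
use warnings;

# Hypothetical record; "\x{E4}" is "ä" and "\x{E5}" is "å".
my $house = { address => "Storgatan 1", area => "V\x{E4}ster\x{E5}s" };

my $output = "";
$output .= "Adress: $house->{address}\n" if defined $house->{address};
$output .= "Area: $house->{area}\n"      if defined $house->{area};

my $chars = length($output);   # number of characters
utf8::encode($output);         # in place: characters become UTF-8 bytes
my $bytes = length($output);   # two more than $chars: "ä" and "å" take 2 bytes each

print $output;
```

An alternative I could have used is `binmode(STDOUT, ':encoding(UTF-8)')`, which makes Perl encode everything printed to STDOUT, so no explicit `utf8::encode` call is needed.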

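Note: Do not “use utf8;” in this Perl script. “use utf8;” tells Perl that the script’s own source code is UTF-8 encoded, so it should only be used if the script file itself is saved as UTF-8! If the script is saved in a single-byte encoding such as Latin-1, as is the case here, the literal å, ä, and ö in the code are already read as single characters and match the decoded text.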
