Wednesday, July 18, 2012

Date Conversion With Perl

Recently I've started to keep a personal log as I'm coding.  This has already started to be a big help in terms of not-resolving problems while I'm working on a project.  At the same time, a number of my friends have started learning to code.

Since my logs are already pretty close to blog-post sized, I'll start posting a few of the more interesting ones, in hopes that they'll help people who are working on similar projects.

Recently, I've been working on a project where I want to use SQLite to work on some data files.  The data were initially stored in CSV files.  In the last couple days, I've gotten them converted to tab-delimited files, which will work better for the specific application.

The latest challenge came about when I was trying to use some date and time information in my data files.  It turns out that I'd downloaded them with the dates in a slightly wrong format.

In my files, the dates look like this:

29-Jun-04 13:04:40

SQLite wants dates that look like this:


2004-06-29 13:04:40



Normally, I'd just go and re-download the files with the dates in the correct format.  It's an option somewhere in Sqldeveloper.  The catch is that my data set takes a couple days to re-download.  And, my logins to the server with the original data aren't working right now.  So, I'd be looking at a week or so delay before I had the information I needed.

My next thought was to use a pre-built tool to do a date conversion.  Perl is pretty good about having code for specific applications like this.  But I couldn't find anything to do exactly what I wanted.  So, I had to code something myself.

I'd recently learned a neat command for perl.  It looks like this:

perl -p -i.orig -e 'FILTER COMMAND' file1

Going in order, that says, "Run Perl".  The "-p" is "read through file1, line by line, and run the stuff following '-e' for each line.  Normally, this will just print the output to my screen.  So, the "-i.orig" is "and replace the lines in file1.  But make a backup called file1.orig, in case anything goes wrong"

So, 'all' I need to do is to replace the FILTER COMMAND with some code that will turn dates like 29-Jun-04 13:04:40 into dates like 2004-06-29 13:04:40.

Perl is actually pretty great at this kind of thing, and this is clearly a job for a 'regular expression'.  Instead of looking for the number '29', for instance, I want to find two digits.  I don't necessarily want just 'Jun', either, some dates could be 'Jul' or 'Aug' or 'Dec'.

So, the first part of the regular expression is to decide what kinds of text I'm looking for.  For me, the answer was:

  1. two digits
  2. a hyphen
  3. a three-letter month name
  4. another hyphen
  5. two more digits
This is pretty easy to write in regular-expression code.  A digit is, "\d".  To get two digits, I could write "\d\d" or "\d{2}".  A hyphen is just "-".  The only tricky part are the three digit months.  I could have written "any three letters" (roughly \w\w\w or \w{3}), but I wanted to be sure that I was getting a month.  So, all I could think of doing was writing out the alternatives and saying "any of these options."  To do that, in regexes, it's just (Thing1|Thing2).  Or, in my case:

(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Ugly, right?

But, it's enough for me to get my search string working.  In regex-speak, I'm looking for this:

\d{2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2}

It's helpful (for reasons I'll get to in a sec) to have 'pieces' of the regex wrapped in smooth parenthesis.   So, the string becomes:

(\d{2})-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(\d{2})
Since that's really long, I'll just shorten it to 'SEARCH'.

I want to do a search and replace.  In perl, the code for this is, "s/SEARCH/REPLACE/"  That'll find 'SEARCH' every time it appears, and stick in REPLACE.

I need a couple options on my search-replace command.  In perl, these are single letters that get tacked after the last slash.  I want a "g" which says "do multiple replacements on a line" and "i" which says "ignore case".  I also realized after a bit that I'm really going to need an "e".  That "e" tells Perl that REPLACE should be run like it's code, and that the replacement is the result of that code.

So, now my command is:

perl -p -i.orig -e 's/SEARCH/REPLACE/gie' file1

I have the piece I want for SEARCH and need to write REPLACE.

The next thing to know is how those parentheses work.  When Perl matches something inside the first set, it saves that match as the variable "$1".  The match inside the next parentheses is "$2" and so on.  So, after I've run my SEARCH and found that example string, I've got the following variables set:
  • $1 = "29"
  • $2 = "Jun"
  • $3 =  "04"
Then, if I set REPLACE as "$3-$2-$1", my code would turn 29-Jun-04 13:04:40 into 04-Jun-29 13:04:40. Not exactly right, but re-ordering is a step in the right direction.
The earliest dates I'm working with are in 2002.  This makes the problem a lot easier, since I can just put a 2 and a 0 in front of the year and always get the right answer.  So, I change REPLACE to "20".$1."-$2-$3".  The periods in the middle of that just tell perl to join the strings together.  Now I've gotten the output as far as 2004-Jun-29 13:04:40.

The tricky part is that three-letter month.  It needs to become a two-digit number.  I'd wanted to find an elegant way to convert that, but I couldn't find anything.  Instead, I had to list all of the possibilities as a 'hash'.

In Perl, hashes are like arrays that are indexed by strings.  So, if I wanted to, I can have a hash take an input 'Jan' and return the string "01".

Sometimes, people define hashes with code that looks like, %hash = ('key1' => 'val1', 'key2' => 'val2').  This tells Perl that when when we call $hash('key1') the program should return 'val1'.  A slightly more terse way to write this is, %hash = ('key1' , 'val1', 'key2' , 'val2').

For my little program, I just wrote out all the possibilities.

%hash =  ('Jan', '01', 'Feb', '02', 'Mar', '03', 'Apr', '04', 'May', '05', 'Jun', '06', 'Jul', '07', 'Aug', '08', 'Sep', '09', 'Oct', '10', 'Nov', '11',  'Dec', '12')

I thought I should be able to do that, then have REPLACE be "20".$1."-".hash($2)."-$3".

Two things are broken with this approach.  The first is the quoting.  Notice how I'm using single-quotes when I call Perl?  It turned out that bad things happened when I defined '%hash' that way. The program got to the first double quote and interpreted it as the end of the FILTER COMMAND string.

The second problem is the case insensitivity.  I'm not sure if I'm going to encounter 'Jun' or 'JUN' or 'jun'.  Hashes will think all three of those are different.

I fixed both of those problems by changing my %hash definition to:

%hash = ("JAN", "01", "FEB", "02", "MAR", "03", "APR", "04", "MAY", "05", "JUN", "06", "JUL", "07", "AUG", "08", "SEP", "09", "OCT", "10", "NOV", "11", "DEC", "12")

Then, instead of $hash($2), I wrote $hash(uc($2)).  The 'uc' function converts the three-letter-month to upper case before using it.

That was actually enough to get me some working code.  My REPLACE became:
%hash =  ('Jan', '01', 'Feb', '02', 'Mar', '03', 'Apr', '04', 'May', '05', 'Jun', '06', 'Jul', '07', 'Aug', '08', 'Sep', '09', 'Oct', '10', 'Nov', '11',  'Dec', '12');
"20$3-".$hash{uc($2)}."-
$1"

The semicolon is really important there.  What's happening is that Perl executes the first line and gets hash defined.  Then, it executes the second line and creates a string from the bits of date-looking-text it's run across.
The next thing is that I wanted to do this for a bunch of files.  Perl can actually take a list of files, but that doesn't give me any idea of progress.  So, I wanted to use a little bit of shell coding.  In particular, I used some code that looks like this:

for file in PROCESSED/*.tsvdoecho "Processing file $file"done

The first line finds all of the tsv files in the directory processed.  Every time it finds one, it runs the do loop.  The current loop just tells me the filename.  But, if I write "perl -p -i.orig -e 'FILTER COMMAND' $file" in the line after the echo, then I'll process the files, too.

Putting everything together (and writing SEARCH and REPLACE the long ways) gave me:

for file in PROCESSED/*.tsv
do
    echo "Processing file $file"
    perl -i.orig -p -e 's/(\d{2})-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(\d{2})/%hash = ("JAN", "01", "FEB", "02", "MAR", "03", "APR", "04", "MAY", "05", "JUN", "06", "JUL", "07", "AUG", "08", "SEP", "09", "OCT", "10", "NOV", "11", "DEC", "12");"20$3"."-".$hash{uc($2)}."-$1/gie' $file
done


So, that's the kind of thing I'm using Perl for lately.   

No comments: