Text Processing script - advice?


Roland RoLaNd

21 Dec 2010

5:30 p.m.

Hello,

I have a log file with the following input:

X , ID , Date, Time, Y
01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

My goal is to count how many times ID has a TIME after 09:00:00 on each DATE. That would give me two outputs: first, the number of days ID has been late, and second, the day and time at which this ID was late.

I've started as such:

sort -t ',' -k 3,3 -k 4,4 file.log

This sorts the file by the DATE field and then by the TIME field. I've been stuck for the last 30 minutes trying to find a way to get the first line of each day (logically it will be the earliest, since I've already sorted by date and time). Once I know how to do this, I'll be able to compare the time and proceed.

Can anyone help? I looked into sort -u and uniq -f3, though I didn't get far with them.
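One possible way to pull the first line of each day out of the sorted output is awk's "seen" idiom, which prints a line only the first time its key appears. The sketch below is a minimal illustration under stated assumptions, not a tested production script: the helper name first_per_day, the grep that drops the header, and the extra sort key on ID are all additions made for the example.

```shell
# first_per_day is a hypothetical helper name, not from the thread.
# It reads the log on stdin and prints the earliest entry per ID and date.
first_per_day() {
    # Keep only data rows (this drops the header and blank lines),
    # sort by ID, then date, then time, and print a row only the
    # first time its ID/date pair is seen.
    grep -E '^[0-9]+,' |
        sort -t ',' -k 2,2 -k 3,3 -k 4,4 |
        awk -F ',' '!seen[$2 "," $3]++'
}

# Sample rows in the same layout as the log above, deliberately unsorted.
first_per_day <<'EOF'
X , ID , Date, Time, Y
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,09:02:00,Pass
EOF
```

For the sample rows this prints the 09:07:00 entry for 2010-12-02 and the 09:02:00 entry for 2010-12-03. The time in field 4 of each surviving row can then be compared against the cutoff.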


lhecking＠users.sourceforge.net

21 Dec

5:33 p.m.

...

sort -t ',' -k 3,3 -k 4,4 file.log # this will sort the file according to the DATE field as well as the TIME field. I'm stuck for the last 30 min to find a way to get the first line of each day (logically it'll be the earliest as I've sorted by date/time previously). Once I know how to do this, I'll be able to compare time and proceed.

If you're not afraid of perl, the Date-Manip module allows comparing time and date, among other things.


Eduardo Grosclaude

5:55 p.m.

On Tue, Dec 21, 2010 at 2:33 PM, lhecking@users.sourceforge.net wrote:

...

If you're not afraid of perl, the Date-Manip module allows comparing time and date, among other things.

A dirtier take could be

perl -ne '/,(\d+),(.*),(\d\d):.*/ && ($3>=9) and $s->{$1,$2}++ ; END {use Data::Dumper; print Dumper($s)}' < data

$VAR1 = {
          '01368 2010-12-02' => 4,
          '01368 2010-12-03' => 3
        };

-- Eduardo Grosclaude Universidad Nacional del Comahue Neuquen, Argentina

m.roth＠5-cent.us

5:58 p.m.

Roland RoLaNd wrote:

...

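awk 'BEGIN { FS = "," }
{
    if ($4 > "09:00:00") {
        array[$2][1]++                              # slot 1 holds the count of late entries for this ID
        array[$2][array[$2][1] + 1] = $3 "::" $4    # later slots hold each late date::time
    }
}
END {
    # arrays of arrays need gawk 4.0 or later
    for (j in array) {
        for (k in array[j]) {
            print j, array[j][k]
        }
    }
}' file.log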

It's been a while since I needed to do this, but I *think* the nested "for <var> in array" will work.

mark

Roland RoLaNd

6:30 p.m.

First of all, I'd like to apologize to those who helped me by giving advice using perl. I'm ashamed to say that I have no experience with it.

Mark, thanks for your effort in writing the below, though could you help me understand how it works? The best way to do things is to learn them for future reference.

I'm no expert with awk, so I need your help with the below if possible:

awk 'BEGIN { FS=",";} \    ## BEGIN triggers the commands afterwards to be executed in awk, with , as field delimiter
{ if ( $4 > "09:00:00" ) {    # condition that matches after 09 am
    array[ $2 ][1]++;    # incrementing the count by one, though I'm a bit at a loss with "array"
    array[ $2 ][ array[$2][1] + 1] = $3 "::" $4; }    # couldn't figure it out
} END {
    for (j in array) {
        for (k in array[j]) {
            print j, array[j][k];    # prints out what exactly?
        }
    }
}'


John Lundin

7:14 p.m.

On Tue, Dec 21, 2010 at 08:30:43PM +0200, Roland RoLaNd wrote:

(chuckle) That's a bit more verbose than necessary. As a one-liner:

awk -F, '($4>"09:00:00"){c[$2 "," $3]++};END{for (i in c){print i "," c[i]}}' $filename

01368,2010-12-02,4
01368,2010-12-03,3

(You might check if you want >="09:00:00", and include the edge case.)

-F,                      # set separator to comma

                         # (automatic loop over all data lines)
($4>"09:00:00"){         # do if fourth field greater than 09:...
  c[$2 "," $3]++         # increment hash element pointed to by
                         # second and third fields separated by comma
                         # (that is, hash on id,date)
}

END{                     # after finishing the data
  for (i in c){          # for each observed hash key in array c
    print i "," c[i]     # print the hash key, comma, count
  }
}

-- lundin@fini.net
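The comparison $4>"09:00:00" is a plain string comparison. It gives the right answer here because the times are zero-padded and fixed-width (HH:MM:SS), so character order matches chronological order. The sketch below illustrates the edge case mentioned above; the helper name late_flags and the three sample times are assumptions made for the example.

```shell
# late_flags is a hypothetical helper: it reads one HH:MM:SS time per line
# and reports how the strict (>) and inclusive (>=) comparisons classify it.
late_flags() {
    awk '{
        strict    = ($1 >  "09:00:00") ? "late" : "ok"
        inclusive = ($1 >= "09:00:00") ? "late" : "ok"
        print $1, strict, inclusive
    }'
}

printf '%s\n' 08:59:59 09:00:00 09:00:01 | late_flags
# prints:
# 08:59:59 ok ok
# 09:00:00 ok late
# 09:00:01 late late
```

Only the 09:00:00 row differs between the two columns, so the choice between > and >= matters only for someone who arrives exactly on the cutoff.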

m.roth＠5-cent.us

7:35 p.m.

John Lundin wrote:

...

On Tue, Dec 21, 2010 at 08:30:43PM +0200, Roland RoLaNd wrote:

(chuckle) That's a bit more verbose than necessary. As a one-liner:

awk -F, '($4>"09:00:00"){c[$2 "," $3]++};END{for (i in c){print i "," c[i]}}' $filename

Well, yes, but he also wanted a count....

mark


Roland RoLaNd

7:40 p.m.

Thanks to your help i've reached this step:

original data:

01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

awk -F , '{if ($4 > "09:10:00") print $2 " was late on", $3 " by coming at ",$4}' test | tee DaysLate ; wc -l DaysLate

01368 was late on 2010-12-02 by coming at 10:54:00

01368 was late on 2010-12-02 by coming at 13:07:04

01368 was late on 2010-12-02 by coming at 18:54:01

01368 was late on 2010-12-03 by coming at 13:53:00

01368 was late on 2010-12-03 by coming at 16:07:00

5 DaysLate

The only thing missing is a way to take just the earliest time of each day.

In other words, the above output should be:

0 DaysLate # on 12-02 he came in at 09:07, which is before 09:10, and on 12-03 he came in at 09:02, which is also before the set time


Les Mikesell

7:54 p.m.

On 12/21/2010 1:40 PM, Roland RoLaNd wrote:

...

awk -F , '{if ($4> "09:10:00") print $2 " was late on", $3 " by coming at ",$4}' test | tee DaysLate ; wc -l DaysLate

01368 was late on 2010-12-02 by coming at 10:54:00

01368 was late on 2010-12-02 by coming at 13:07:04

01368 was late on 2010-12-02 by coming at 18:54:01

01368 was late on 2010-12-03 by coming at 13:53:00

01368 was late on 2010-12-03 by coming at 16:07:00
    5 DaysLate

On my calendar 12-02 and 12-03 are only 2 days...

-- Les Mikesell lesmikesell@gmail.com

Roland RoLaNd

7:58 p.m.

Exactly, hence:

[quote] the only thing missing is to find a way to just take the earliest time of each day.

in other words the above output should be:

0 DaysLate

[/quote]


Les Mikesell

8:45 p.m.

On 12/21/2010 1:58 PM, Roland RoLaNd wrote:

...

the only thing missing is to find a way to just take the earliest time of each day.

in other words the above output should be:
   0 DaysLate

That means my perl script was wrong... This looks more like what you want, except for your last change to 9:10.

use strict;
use warnings;

my %id_count;       # late days per ID
my %id_date;        # date already seen
my %iddate_time;    # 1st time each day
while (<>) {
    chomp;
    my ($x, $id, $date, $time, $junk) = split /,/;
    next unless defined $time;                                               # skip blank or short lines
    next if $x =~ /^\s*X/;                                                   # skip header (string match, not ==)
    $iddate_time{$id . $date} = $time unless ($iddate_time{$id . $date});    # store earliest
    next if ($time le "09:00:00");                                           # not late
    next if ($iddate_time{$id . $date} le "09:00:00");                       # 1st wasn't late
    next if (defined $id_date{$id} && $id_date{$id} eq $date);               # already counted today
    print "Late: $id - $date - $time\n";
    $id_count{$id}++;
    $id_date{$id} = $date;
}
print "----\n";
while (my ($id, $count) = each(%id_count)) {
    print "$id late $count days\n";
}

-- Les Mikesell lesmikesell@gmail.com

Mihai T. Lazarescu

7:59 p.m.

On Tue, Dec 21, 2010 at 09:40:42PM +0200, Roland RoLaNd wrote:

...

original data:

01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

...

the only thing missing is to find a way to just take the earliest time of each day.

You may use mktime(datespec) (see man awk) to convert date and time into comparable integers.

Mihai
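Note that mktime() is a gawk extension and takes a string of the form "YYYY MM DD HH MM SS". Where only the time of day matters, a portable alternative is to split the time on ":" and convert it to seconds since midnight. The sketch below shows that conversion only; the helper name to_seconds is an assumption, not something from the thread.

```shell
# to_seconds is a hypothetical helper: it reads one HH:MM:SS time per line
# and prints the time followed by its seconds-since-midnight value.
to_seconds() {
    awk '{
        split($1, t, ":")                          # t[1]=hours, t[2]=minutes, t[3]=seconds
        print $1, t[1] * 3600 + t[2] * 60 + t[3]
    }'
}

printf '%s\n' 09:00:00 09:07:00 | to_seconds
# prints:
# 09:00:00 32400
# 09:07:00 32820
```

Subtracting the two integers gives the lateness in seconds (420 here, or 7 minutes), which a string comparison alone cannot provide.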

John Lundin

9:50 p.m.

On Tue, Dec 21, 2010 at 02:35:13PM -0500, m.roth@5-cent.us wrote:

...

John Lundin wrote:

...
On Tue, Dec 21, 2010 at 08:30:43PM +0200, Roland RoLaNd wrote:

[...]

...

Well, yes, but he also wanted a count....

Oh, lord, it's worse than that. I was solving the wrong problem. (And still am if he really wanted a count of after-nine entries.)

Once again with awk one-liners:

awk -F, '{k=$2 "," $3};(!e[k]||($4<e[k])){e[k]=$4}\
;END{for (i in e){if (e[i]>"09:00:00"){print i "," e[i]}}}' infile \
|tee latedays\
|awk -F, '{c[$1]++};END{for (i in c){print i "," c[i]}}' >latecounts

01368,2010-12-02,09:07:00
01368,2010-12-03,09:02:00

01368,2

You may now wince.

If earliest time seen for user and date is undefined or if this time is less, then set earliest time to this time. After all processed, print out the user, date and time if it's later than 09:00:00.

Second awk script just counts lines reported above, by user.

(I usually switch to perl or at least a bash script file before it gets this unreadable. And add some sanity testing.)

-- lundin@fini.net "I have a photographic memory. If only I could remember where I left the film..."
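For readers who find the one-liner hard to follow, the same idea (track the earliest time per ID and date, then report the days whose earliest time is past the cutoff) can be spread over several lines in a single awk pass. This is only a sketch under stated assumptions: the helper name late_report, the optional cutoff argument, the header-skipping pattern, and the wording of the summary line are all made up for the example. It also records keys in input order so the output order does not depend on awk's unspecified "for (i in array)" ordering.

```shell
# late_report is a hypothetical helper, not from the thread.
# Usage: late_report [cutoff] < logfile   (cutoff defaults to 09:00:00)
late_report() {
    awk -F ',' -v cutoff="${1:-09:00:00}" '
        $2 ~ /^[0-9]+$/ {                       # data rows only; skips header and blank lines
            key = $2 "," $3                     # key on ID,date
            if (!(key in first)) {
                order[++n] = key                # remember keys in input order
                first[key] = $4
            } else if ($4 < first[key]) {
                first[key] = $4                 # keep the earliest time for this ID and date
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                key = order[i]
                if (first[key] > cutoff) {      # earliest arrival was after the cutoff
                    print key "," first[key]    # ID,date,earliest time
                    split(key, part, ",")
                    if (!(part[1] in days)) ids[++m] = part[1]
                    days[part[1]]++
                }
            }
            for (i = 1; i <= m; i++)
                print ids[i] " late on " days[ids[i]] " day(s)"
        }
    '
}

late_report 09:00:00 <<'EOF'
X , ID , Date, Time, Y
01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
EOF
```

With a cutoff of 09:00:00 the sample rows yield two late days for ID 01368; with 09:10:00 the function prints nothing, which matches the "0 DaysLate" result asked for earlier in the thread.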

m.roth＠5-cent.us

10 p.m.

John Lundin wrote:

...

On Tue, Dec 21, 2010 at 02:35:13PM -0500, m.roth@5-cent.us wrote:

...
John Lundin wrote:

...
On Tue, Dec 21, 2010 at 08:30:43PM +0200, Roland RoLaNd wrote:

[...]

...
Well, yes, but he also wanted a count....

Oh, lord, it's worse than that. I was solving the wrong problem. (And still am if he really wanted a count of after-nine entries.)

Once again with awk one-liners:

Why? What do you have against more-than-one-line awk scripts?

asks the guy who's written 100 and 200 line awk scripts....

Les Mikesell

6:43 p.m.

On 12/21/2010 11:30 AM, Roland RoLaNd wrote:

...

Most logs are written in append mode so ascending date/time comes naturally. This perl should list each instance and the count:

my %id_count;    # late days per ID
my %id_date;     # date already seen
while (<>) {
    my ($x, $id, $date, $time) = split /,/;
    next unless defined $time;                                    # skip blank or short lines
    next if $x =~ /^\s*X/;                                        # skip header (string match, not ==)
    next if ($time le "09:00:00");
    next if (defined $id_date{$id} && $id_date{$id} eq $date);
    $id_date{$id} = $date;
    print "$id - $date - $time\n";
    $id_count{$id}++;
}
print "----\n";
while (my ($id, $count) = each(%id_count)) {
    print "$id late $count days\n";
}

-- Les Mikesell lesmikesell@gmail.com
