Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
I wonder if you can do this in two steps:
1. Parse out the third-column values that occur more than once into a file.
2. Run a second pass over the data to print every line whose third column matches one of those values (a sketch follows below).
I don't know how to do this in a script. I would write a simple Java program to do it.
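In shell, those two steps might look like this minimal sketch (assuming the data is in a file named data.txt; the file name is made up for the example):

# Step 1: collect the third-column values that occur more than once.
cut -d: -f3 data.txt | sort | uniq -d > dupes.txt
# Step 2: print every line whose third column is one of those values.
awk -F: 'NR==FNR { dup[$1] = 1; next } dup[$3]' dupes.txt data.txt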
Neil
-- Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU, 1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
_____
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Truejack
Sent: Wednesday, October 28, 2009 12:10 PM
To: centos@centos.org
Subject: [CentOS] Scripting help please....
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
2009/10/28 Neil Aggarwal neil@jammconsulting.com:
I don't know how to do this in a script.
Could be a job for awk.
Bit too busy at work to look into it further at the moment though.
Ben
From: Truejack truejack@gmail.com
To: centos@centos.org
Sent: Wed, October 28, 2009 6:09:41 PM
Subject: [CentOS] Scripting help please....
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
A quick and dirty example (it only prints the extra duplicate lines, not the first occurrence):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile
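Against the sample data, that prints:

host7:dev258mum.dd.mum.test.com:36:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no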
JD
From: John Doe jdmls@yahoo.com
A quick and dirty example (it only prints the extra duplicate lines, not the first occurrence):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile
Here's the version with the 1st duplicate included:

awk -F: '
{
    v[$3] = v[$3] + 1
    if (v[$3] == 1) {
        f[$3] = $0                      # remember the first line seen for this key
    } else {
        if (v[$3] == 2) print f[$3]     # flush the held first line once
        print $0
    }
}' datafile
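With the sample data, this prints every line whose third column is duplicated, first occurrences included:

host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no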
On 2009-10-28 18:09, Truejack wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
A long time ago (when I was still young and beautiful), having also run into the limitations of "uniq", I wrote a small program in C to do these kinds of things. It is designed to handle record-oriented data in groups, similar to uniq. Its primary purpose was as a preprocessor for awk/perl, but simple things like this are built in. You can find it here:
ftp://ftp.xplanation.com/utils/by-src.zip
Unpack it, run make, and copy the program "by" somewhere in your PATH.
Then, to solve your problem do:
sort -t: -k 3 InputFile | by -F: -f3 -D
This sorts the input on field 3, with fields separated by colons, and outputs all lines that are duplicates according to field 3 (-D).
The program can do more as well, and a little tutorial is included in the zip.
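For anyone who would rather not build it, roughly the same effect is available with stock sort and awk; a sketch (not the "by" program itself; data.txt stands in for the input file):

sort -t: -k3,3 data.txt | awk -F: '
    $3 == prev { if (!shown) print held; shown = 1; print; next }
    { prev = $3; held = $0; shown = 0 }'

Sorting on field 3 groups equal keys; the awk holds the first line of each group and prints it only when a second line with the same key shows up.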
I think it can be optimized, and if the programming language doesn't matter:

#!/usr/bin/python

filename = "test.txt"
fl = open(filename, 'r')
toParse = fl.readlines()
fl.close()

# First pass: collect the third-column values that occur more than once.
duplicates = []
firstOne = []
for ln in toParse:
    ln = ln.strip()
    lnMap = ln.split(':')
    target = lnMap[2]
    if target in firstOne:
        if target not in duplicates:
            duplicates.append(target)
    else:
        firstOne.append(target)

# Second pass: print every line whose third column is a duplicate.
for ln in toParse:
    ln = ln.strip()
    lnMap = ln.split(':')
    target = lnMap[2]
    if target in duplicates:
        print ln
On Wed, Oct 28, 2009 at 7:09 PM, Truejack truejack@gmail.com wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
m.roth@5-cent.us wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
This doesn't print the first of the duplicates. Also, the question wasn't clear as to whether every line with a matching 3rd field should be printed, or only ones where the other or preceding fields matched as well (but the sort options could control that).
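For instance, sorting on the third field alone makes every line with an equal third field adjacent, whatever the rest of the line contains (a plain sort only groups them when the earlier fields happen to sort together). A sketch, with data.txt standing in for the input file:

sort -t: -k3,3 data.txt | awk -f list.awk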
m.roth@5-cent.us wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
This doesn't print the first of the duplicates. Also, the question wasn't clear as to whether every line with a matching 3rd field should be printed, or only ones where the other or preceding fields matched as well (but the sort options could control that).
Oh, sorry:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        if ( first == 0 ) {
            print saved;
            first++;
        }
        print $0;
    } else {
        first = 0;
        last = $3;
        saved = $0;
    }
}
mark "did I mention that I've written 100 -200 line awk scripts?"
On Wed, Oct 28, 2009 at 10:39:41PM +0530, Truejack wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
A key to your answer is the --all-repeated option for uniq on a sorted file.
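uniq compares whole lines (or a fixed number of leading characters), so one way to apply it to column 3 is to copy that field to the front at a fixed width, let uniq group on it, and strip it off again. A sketch, assuming the third-column values fit in 10 characters, with data.txt standing in for the input file:

awk -F: '{ printf "%-10s%s\n", $3, $0 }' data.txt |
    sort |
    uniq -w 10 --all-repeated=separate |
    cut -c11-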
I call this "find-duplicates" -- this post makes it GPL
#! /bin/bash
#SIZER=' -size +10240k'
SIZER=' -size +0'
#SIZER=""
DIRLIST=". "
find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |\
    sort > /tmp/looking4duplicates
tput bel; sleep 2
cat /tmp/looking4duplicates | uniq --check-chars=32 --all-repeated=prepend | less