Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
I wonder if you can do this in two steps:
1. Parse out the third-column values that occur more than once into a file.
2. Run a second pass over the data to print every line whose third column matches one of those values (a sketch follows below).
I don't know how to do this in a script. I would write a simple Java program to do it.
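In shell, those two steps might look like this minimal sketch (assuming the data is in a file named data.txt; the file name is made up for the example):

# Step 1: collect the third-column values that occur more than once.
cut -d: -f3 data.txt | sort | uniq -d > dupes.txt
# Step 2: print every line whose third column is one of those values.
awk -F: 'NR==FNR { dup[$1] = 1; next } dup[$3]' dupes.txt data.txt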
Neil
-- Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU, 1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
_____
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Truejack
Sent: Wednesday, October 28, 2009 12:10 PM
To: centos@centos.org
Subject: [CentOS] Scripting help please....
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
2009/10/28 Neil Aggarwal neil@jammconsulting.com:
I don't know how to do this in a script.
Could be a job for awk.
Bit too busy at work to look into it further at the moment though.
Ben
From: Truejack truejack@gmail.com
To: centos@centos.org
Sent: Wed, October 28, 2009 6:09:41 PM
Subject: [CentOS] Scripting help please....
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
A quick and dirty example (it only prints the extra duplicate lines, not the first occurrence):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile
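Against the sample data, that prints:

host7:dev258mum.dd.mum.test.com:36:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no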
JD
From: John Doe jdmls@yahoo.com
A quick and dirty example (it only prints the extra duplicate lines, not the first occurrence):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile
Here's the version with the 1st duplicate included:

awk -F: '
{
    v[$3] = v[$3] + 1
    if (v[$3] == 1) {
        f[$3] = $0                      # remember the first line seen for this key
    } else {
        if (v[$3] == 2) print f[$3]     # flush the held first line once
        print $0
    }
}' datafile
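With the sample data, this prints every line whose third column is duplicated, first occurrences included:

host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no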
On 2009-10-28 18:09, Truejack wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
A long time ago (when I was still young and beautiful), having also run into the limitations of "uniq", I wrote a small program in C to do these kinds of things. It is designed to handle record-oriented data in groups, similar to uniq. Its primary purpose was as a preprocessor for awk/perl, but simple things like this are built in. You can find it here:
ftp://ftp.xplanation.com/utils/by-src.zip
Unpack it, run make, and copy the program "by" somewhere in your PATH.
Then, to solve your problem do:
sort -t: -k 3 InputFile | by -F: -f3 -D
This sorts the input on field 3, with fields separated by colons, and outputs all lines that are duplicates according to field 3 (-D).
The program can do more as well, and a little tutorial is included in the zip.
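For anyone who would rather not build it, roughly the same effect is available with stock sort and awk; a sketch (not the "by" program itself; data.txt stands in for the input file):

sort -t: -k3,3 data.txt | awk -F: '
    $3 == prev { if (!shown) print held; shown = 1; print; next }
    { prev = $3; held = $0; shown = 0 }'

Sorting on field 3 groups equal keys; the awk holds the first line of each group and prints it only when a second line with the same key shows up.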
I think it can be optimized, and if the programming language doesn't matter:

#!/usr/bin/python

filename = "test.txt"
fl = open(filename, 'r')
toParse = fl.readlines()
fl.close()

# First pass: collect the third-column values that occur more than once.
duplicates = []
firstOne = []
for ln in toParse:
    ln = ln.strip()
    lnMap = ln.split(':')
    target = lnMap[2]
    if target in firstOne:
        if target not in duplicates:
            duplicates.append(target)
    else:
        firstOne.append(target)

# Second pass: print every line whose third column is a duplicate.
for ln in toParse:
    ln = ln.strip()
    lnMap = ln.split(':')
    target = lnMap[2]
    if target in duplicates:
        print ln
On Wed, Oct 28, 2009 at 7:09 PM, Truejack truejack@gmail.com wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
m.roth@5-cent.us wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
This doesn't print the first of the duplicates. Also, the question wasn't clear as to whether every line with a matching 3rd field should be printed, or only ones where the other or preceding fields matched as well (but the sort options could control that).
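For instance, sorting on the third field alone makes every line with an equal third field adjacent, whatever the rest of the line contains (a plain sort only groups them when the earlier fields happen to sort together). A sketch, with data.txt standing in for the input file:

sort -t: -k3,3 data.txt | awk -f list.awk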
m.roth@5-cent.us wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I need to print the whole line if a duplicate entry exists in column 3.
I tried using a combination of "sort" and "uniq" but was not successful.
list.awk:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        print $0;
    }
    last = $3;
}

sort <file> | awk -f list.awk
mark "*how* long an awk script would you like?"
This doesn't print the first of the duplicates. Also, the question wasn't clear as to whether every line with a matching 3rd field should be printed, or only ones where the other or preceding fields matched as well (but the sort options could control that).
Oh, sorry:

BEGIN { FS=":"; }
{
    if ( $3 == last ) {
        if ( first == 0 ) {
            print saved;
            first++;
        }
        print $0;
    } else {
        first = 0;
        last = $3;
        saved = $0;
    }
}
mark "did I mention that I've written 100 -200 line awk scripts?"
On Wed, Oct 28, 2009 at 10:39:41PM +0530, Truejack wrote:
Need scripting help to sort a list and print all the duplicate lines.
My data looks something like this:
host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
A key to your answer is the --all-repeated option for uniq on a sorted file.
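uniq compares whole lines (or a fixed number of leading characters), so one way to apply it to column 3 is to copy that field to the front at a fixed width, let uniq group on it, and strip it off again. A sketch, assuming the third-column values fit in 10 characters, with data.txt standing in for the input file:

awk -F: '{ printf "%-10s%s\n", $3, $0 }' data.txt |
    sort |
    uniq -w 10 --all-repeated=separate |
    cut -c11-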
I call this "find-duplicates" -- this post makes it GPL
#! /bin/bash
#SIZER=' -size +10240k'
SIZER=' -size +0'
#SIZER=""
DIRLIST=". "
find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |\
    sort > /tmp/looking4duplicates
tput bel; sleep 2
cat /tmp/looking4duplicates | uniq --check-chars=32 --all-repeated=prepend | less