OT: grep regex pointer appreciated


Patrick Lists

5 Mar 2011

10:13 p.m.

Hi,

My grep regex-fu is not very good and googling is getting me nowhere, so hopefully someone is kind enough to give me some pointers.

Goal: grep the (non-.dbg) filenames and versions from an FTP directory listing and a raw HTML file:

$ wget --no-remove-listing -O ftp-index.txt ftp://127.0.0.1/test/
$ wget --no-remove-listing -O index.html http://127.0.0.1/test/

The relevant parts of the files above (first one is ftp listing, second part is the html file, both copied to test_regex.txt) are:

2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)

<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>

This is what I now have (improvements most welcome):

$ egrep -o ">([A-Za-z_-]+)([[:digit:]]{1,3}(.[[:digit:]]{1,3})*).+(.|t)gz" ./test_regex.txt | grep -v ".dbg" | tr -d '>'

Output:

foo-bar-1.2.3+1.2.3.tar.gz
bar-4.5.6.i686.tgz
bar-4.5.6.x86_64.tgz

So far so good, but now I also want to get the version numbers, which I can't figure out. Does anyone have a pointer on how to get the version number from these filenames (1.2.3+1.2.3 and 4.5.6)?

Thanks!
Patrick


Nico Kadel-Garcia

5 Mar 2011

11:18 p.m.

On Sat, Mar 5, 2011 at 5:13 PM, Patrick Lists centos-list@puzzled.xs4all.nl wrote:

...

2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)

<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>

...

So far so good but now I also want to get the version numbers which I can't figure out. Anyone have a pointer how to get the version number from these filenames (1.2.3+1.2.3 and 4.5.6)?

Separate the ".i686.tgz" with something like a '-' or '_', not a dot, and, if possible, be consistent about using .tar.gz instead of mixing .tar.gz and .tgz.
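To illustrate why a consistent separator helps, here is a minimal sketch that splits a filename using nothing but POSIX shell parameter expansion. It assumes a hypothetical NAME-VERSION-ARCH.tar.gz naming scheme (not the names in the listing above), and the function name split_name is made up for this example.

```shell
#!/bin/sh
# split_name FILE: print "name version arch" for a file named
# NAME-VERSION-ARCH.tar.gz. This naming scheme is an assumption made
# for illustration; it is not the scheme used in the original listing.
split_name() {
    base=${1%.tar.gz}      # drop the extension
    arch=${base##*-}       # text after the last '-'
    rest=${base%-*}        # everything before the last '-'
    version=${rest##*-}    # text after the last remaining '-'
    name=${rest%-*}        # what is left is the package name
    printf '%s %s %s\n' "$name" "$version" "$arch"
}

split_name bar-4.5.6-i686.tar.gz
split_name foo-bar-1.2.3+1.2.3-x86_64.tar.gz
```

Splitting from the right keeps names that themselves contain hyphens (such as foo-bar) intact, which is only possible because the version and architecture no longer share the dot with the extension.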

Robert Grasso

7 Mar 2011

11:23 a.m.

Hello,

In my opinion, grep is not powerful enough to achieve what you want. It would be preferable to use at least one of the old but powerful tools such as sed or awk, or even better, perl. What you actually need is a tool that provides a capture buffer (that is Perl jargon; sed calls them "back references") in which you can collect the string you want to extract, rather than trying to build up a positive matching regex, since the string boundaries seem easy enough to describe with regexes.
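A rough sketch of the capture-group idea with sed, offered as one possible reading of the advice above rather than a definitive solution. It assumes the sample lines from the original post are stored in test_regex.txt (the script recreates them so it is self-contained), that a version is a run of dotted numbers optionally followed by '+' and another such run, and that sed accepts -E (older GNU sed spells it -r). The function name extract_versions is invented for this example.

```shell
#!/bin/sh
# Sample data from the original post: four FTP listing lines and one HTML line.
cat > test_regex.txt <<'EOF'
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)
<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>
EOF

# extract_versions FILE: print "filename version" for every non-.dbg archive.
# Back reference \1 is the whole filename, \3 is the version part.
extract_versions() {
    grep -v '\.dbg\.' "$1" |
    sed -nE 's/.*>(([A-Za-z_-]+)-([0-9]+(\.[0-9]+)*(\+[0-9]+(\.[0-9]+)*)?)[^<>]*\.(tgz|tar\.gz))<.*/\1 \3/p'
}

extract_versions test_regex.txt
```

On the sample data this should print one line per archive, for example "bar-4.5.6.i686.tgz 4.5.6" and "foo-bar-1.2.3+1.2.3.tar.gz 1.2.3+1.2.3". Note that the dots are escaped here, unlike in the original egrep pattern, so they only match literal dots.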

Regards

---
Robert GRASSO
System engineer

CEDRAT S.A.
15 Chemin de Malacher - Inovallée - 38246 MEYLAN cedex - FRANCE
Phone: +33 (0)4 76 90 50 45 - Fax: +33 (0)4 56 38 08 30
mailto:robert.grasso@cedrat.com - http://www.cedrat.com


Patrick Lists

1:44 p.m.

On 03/07/2011 12:23 PM, Robert Grasso wrote:

In my opinion, grep is not powerful enough to achieve what you want. It would be preferable to use at least one of the old but powerful tools such as sed or awk, or even better, perl.

...

Thank you for your advice. After much fiddling I came up with something that seems to work. I have never dabbled with perl but will dig up my sed/awk book and see if there's a more elegant way to do this.

Regards, Patrick

Bill Campbell

5:55 p.m.

On Mon, Mar 07, 2011, Robert Grasso wrote:

In my opinion, grep is not powerful enough to achieve what you want.

...

One can use pcregrep, which is a grep that groks Perl regular expressions.
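As a hedged sketch of what that could look like: the script below uses pcregrep when it is installed and otherwise falls back to GNU grep -P, which accepts the same Perl-style syntax. The pattern relies on \K to discard the filename prefix from the reported match and on lookaheads to skip .dbg files and to require a .tgz or .tar.gz ending. The sample file contents come from the original post; the names pcre_grep and list_versions, and the assumed version format (dotted numbers, optionally joined by '+'), are illustrative only.

```shell
#!/bin/sh
# Sample data from the original post: four FTP listing lines and one HTML line.
cat > test_regex.txt <<'EOF'
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)
<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>
EOF

# Pick a PCRE-capable grep: pcregrep if installed, otherwise GNU grep -P.
if command -v pcregrep >/dev/null 2>&1; then
    pcre_grep() { pcregrep "$@"; }
else
    pcre_grep() { grep -P "$@"; }
fi

# \K drops everything matched so far (the ">name-" prefix) from the output;
# (?!...) rejects .dbg archives; (?=...) requires a .tgz or .tar.gz ending.
pattern='>[A-Za-z_-]+-\K(?![^<>]*\.dbg\.)[0-9]+(?:\.[0-9]+)*(?:\+[0-9]+(?:\.[0-9]+)*)?(?=[^<>]*\.(?:tgz|tar\.gz)<)'

# list_versions FILE: print only the version of each non-.dbg archive.
list_versions() {
    pcre_grep -o "$pattern" "$1"
}

list_versions test_regex.txt
```

Because the exclusion of .dbg files and the extraction of the version both live in one pattern, the extra grep -v and tr stages of the original pipeline are no longer needed.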

Bill

--
INTERNET: bill@celestial.com    Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
Voice: (206) 236-1676           Mercer Island, WA 98040-0820
Fax: (206) 232-9186             Skype: jwccsllc (206) 855-5792

If the government can take a man's money without his consent, there is no limit to the additional tyranny it may practise upon him; for, with his money, it can hire soldiers to stand over him, keep him in subjection, plunder him at discretion, and kill him if he resists.
Lysander Spooner, 1852
