
Need help parsing / extracting links from log files



I am looking for help with extracting links from log files, as doing it manually (which is what I do right now) is a pain. I have some log files that I need to check for ERROR messages, and I copy and paste the URLs I find into another text file.

 

My log file format looks like this:

 

INFO  <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
        at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO  <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>

 

 

What I am after is a piece of code that saves the http://domain.com/ part to a text file if the line starts with ERROR. There are many different error reasons, so the strings differ at the start and at the end. Maybe you know a way to open a log file, look for the word ERROR at the beginning of a line and, if it is there, either save the whole line to another text file or, if possible, just the domain part (which would be even better).
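As a minimal sketch of the check described above, a single log line could be handled roughly like this. It assumes, as in the sample, that the URL is the first http:// or https:// token on the line and that a colon separates it from the error reason. The function name `extract_error_url` is made up for illustration.

```php
<?php
/**
 * Return the URL from a log line that starts with ERROR, or null otherwise.
 * Hypothetical helper; the log layout is assumed to match the sample above.
 */
function extract_error_url($line)
{
    // Only lines that begin with the word ERROR are of interest.
    if (strncmp($line, "ERROR", 5) !== 0) {
        return null;
    }
    // Take everything from the scheme up to the next whitespace or ">".
    if (!preg_match('~https?://[^\s>]+~', $line, $match)) {
        return null;
    }
    // Drop the ":" that separates the URL from the error reason.
    return rtrim($match[0], ":");
}

$line = "ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>";
$url  = extract_error_url($line);
echo $url, "\n";                           // http://www.domain11.com/
echo parse_url($url, PHP_URL_HOST), "\n";  // www.domain11.com
```

`parse_url()` with `PHP_URL_HOST` is one way to get only the domain part once the full URL has been extracted.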

 

If possible, please post a fully functional code block, as I am extremely bad with anything that has to do with regex, opening and closing files etc.

 

Your help would be greatly appreciated :)

 

I attached a sample log file to this post in case it helps (same as the lines above)

17560_.txt

Hi salathe,

 

Given the fact that I really have no clue how to even get started with this, I would love to see a complete solution, maybe with some comments on "why this will work best", so I can learn from it when I need to complete similar tasks. So yes, I am looking for a "complete solution" rather than pointers, as I am currently doing this all by hand (about 100000 lines per day), so it would save me a lot of time.

 

 

A very rough hack, tested using the data you supplied. I am SURE there is a more efficient/elegant way; however, this worked...

 

<?PHP
/* create a test log */
$myfile = "mytest.log";
$contents = "INFO  <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
        at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO  <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>
";


file_put_contents($myfile, $contents);

/* FROM  HERE FORWARD IS WHERE YOU WILL USE YOUR REAL DATA */
/* read the log file into an array, one line per element (the log itself is not modified) */
$lines = file($myfile, FILE_IGNORE_NEW_LINES);

/* collect the URL from every line that starts with ERROR */
$new_content = "";
foreach ($lines as $line) {
	/* strpos() returns 0 only when ERROR is at the very beginning of the line */
	if (strpos($line, "ERROR") !== 0) {
		continue;
	}
	/* the URL runs from http:// or https:// up to the next space or ">" */
	if (preg_match('~https?://[^\s>]+~', $line, $match)) {
		/* drop the ":" that separates the URL from the error reason */
		$new_content .= rtrim($match[0], ":") . "\n";
	}
}
echo nl2br($new_content);
/* save the data to a new file */
$new_file = "test_log_" . time() . ".txt";
file_put_contents($new_file, $new_content);
?>

The end output can be seen here: http://www.nstoia.com/logtest.php
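The script above loads the whole log into memory, which is fine for the sample. For the roughly 100000 lines per day mentioned earlier, a line-by-line variant may be more comfortable. This is only a sketch: the function name `extract_error_urls` and the file names in the usage comment are placeholders, and the URL pattern assumes the same log layout as the sample.

```php
<?php
/**
 * Copy the URL of every line starting with ERROR from $logFile to $outFile.
 * Returns the number of URLs written. Hypothetical helper for illustration.
 */
function extract_error_urls($logFile, $outFile)
{
    $in = fopen($logFile, "r");
    if ($in === false) {
        return 0;
    }
    $out = fopen($outFile, "w");
    if ($out === false) {
        fclose($in);
        return 0;
    }
    $count = 0;
    // fgets() returns one line at a time, so memory use stays small.
    while (($line = fgets($in)) !== false) {
        // Skip INFO lines and the indented stack trace lines.
        if (strpos($line, "ERROR") !== 0) {
            continue;
        }
        // URL runs from the scheme up to the next whitespace or ">".
        if (preg_match('~https?://[^\s>]+~', $line, $match)) {
            // Drop the ":" that separates the URL from the error reason.
            fwrite($out, rtrim($match[0], ":") . "\n");
            $count++;
        }
    }
    fclose($in);
    fclose($out);
    return $count;
}

// Usage with placeholder file names:
// $n = extract_error_urls("mytest.log", "error_urls.txt");
```

Writing each URL as soon as it is found also means the output file is usable even if the run is interrupted partway through a large log.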
