Mcod Posted February 11, 2012

I am looking for some help with extracting links from log files, as it is a pain to do this manually (which I do right now). I basically have some log files which I need to check for ERROR messages, and I copy and paste the URLs I find into another text file. My log file format looks like this:

INFO <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
    at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
    at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
    at com.searchblox.scanner.Scanner.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
INFO <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
    at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
    at com.searchblox.scanner.Scanner.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
INFO <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>

What I am after is a piece of code which saves the http://domain.com/ part to a text file IF the line starts with ERROR. There are many different error reasons, so the strings are all different at the start and at the end. Maybe you know a way to open a log file, look for the word ERROR at the beginning of a line and, if it is there, either save the whole line to another text file or, if possible, just the domain part (which would be even better).

If possible, please post a fully functional code block, as I am extremely bad with anything that has to do with regex, opening and closing files etc. Your help would be greatly appreciated.

I attached a sample log file to this post in case it helps (same as the lines above): 17560_.txt
salathe Posted February 11, 2012

I'm confused, are you looking for help or someone to do the work for you? (It's fine either way, but dictates the responses you'll get here.)
Mcod (Author) Posted February 11, 2012

Hi salathe, given the fact that I really have no clue how to even get started with this, I would love to see a complete solution, maybe with some comments on "why this will work best", so I can learn from it when I need to complete similar tasks. So yes, I am looking more for a "complete solution" than for pointers, as I am currently doing this all by hand (about 100000 lines per day), so it would save me a lot of time.
litebearer Posted February 11, 2012

A very rough hack, tested using the data you supplied. I am SURE there is a more efficient/elegant way; however, this worked...

<?php
/* create a test log; a nowdoc is used so that "$ConnectionTimeoutException" is not read as a PHP variable */
$myfile = "mytest.log";
$contents = <<<'LOG'
INFO <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
    at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
    at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
    at com.searchblox.scanner.Scanner.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
INFO <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
    at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
    at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
    at com.searchblox.scanner.Scanner.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
INFO <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>
LOG;
file_put_contents($myfile, $contents);

/* FROM HERE FORWARD IS WHERE YOU WILL USE YOUR REAL DATA */

/* read the log file into an array, one element per line (the log itself is left untouched) */
$lines = file($myfile, FILE_IGNORE_NEW_LINES);

/* loop through the lines, keeping only the URL from lines that start with ERROR */
$new_content = "";
foreach ($lines as $line) {
    /* strpos() returns 0 only when ERROR is at the very start of the line */
    if (strpos($line, "ERROR") !== 0) {
        continue;
    }
    /* take everything after the first "http://" */
    $my_line = explode("http://", $line, 2);
    if (!isset($my_line[1])) {
        continue;
    }
    /* cut at the ": " or ">" that follows the URL in the log */
    $url_parts = preg_split('/:\s|>/', $my_line[1]);
    $new_content .= "http://" . $url_parts[0] . "\n";
}
echo nl2br($new_content);

/* save the data to a new file */
$new_file = "test_log_" . time() . ".txt";
file_put_contents($new_file, $new_content);
?>

End output here: http://www.nstoia.com/logtest.php
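For larger logs, such as the roughly 100000 lines per day mentioned in this thread, reading the file line by line may be easier on memory than loading it whole. The following is only a minimal sketch under a few assumptions: every log entry starts on its own line, the URL is the first http:// or https:// token on an ERROR line, and PHP 7 or later is available for the type declarations. The function name extract_error_urls is hypothetical and not part of any library.

```php
<?php
/**
 * Sketch: stream a log file and collect the URL from every line that
 * starts with "ERROR". The function name is illustrative only.
 */
function extract_error_urls(string $logPath): array
{
    $urls = [];
    $handle = fopen($logPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $logPath");
    }
    // fgets() returns one line at a time, so the whole log never sits in memory.
    while (($line = fgets($handle)) !== false) {
        // Skip INFO lines and stack-trace lines: only a leading "ERROR" counts.
        if (strncmp($line, 'ERROR', 5) !== 0) {
            continue;
        }
        // Match the URL up to the first whitespace or ">".
        if (preg_match('~https?://[^\s>]+~', $line, $m)) {
            // The log puts ":" between the URL and the error reason; strip it.
            $urls[] = rtrim($m[0], ':');
        }
    }
    fclose($handle);
    return $urls;
}
```

Calling extract_error_urls('mytest.log') and writing implode(PHP_EOL, $urls) with file_put_contents() would give one URL per line. A regular expression is used here instead of explode() so that https:// links are also picked up; whether the real logs contain any is an assumption.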