drewbee Posted March 5, 2008
I was wondering if anyone here has much experience with the DOM object in PHP 5? I am going to be parsing some data off of any given submitted site, and was originally developing regular expressions to handle this. However, I ran across the PHP DOM object (I never knew it existed) and I am wondering how forgiving it is of poorly formatted HTML/XHTML. Does the document have to be perfect for this to work well? Or does it cope with all the random garbage that makes up a poorly coded page? I am just looking for some insight on this... Thanks for your comments / thoughts.
Link to comment https://forums.phpfreaks.com/topic/94570-php-dom-object/
rhodesa Posted March 5, 2008
It follows the rules strictly. If you are going to be parsing something that might not be XHTML compliant, I would stick to Regex.
drewbee Posted March 5, 2008
Yeah. I need to find a badly written site to test it on and see how it handles it. So far I have been playing with the DOM object in combination with DOMXPath and it is really crazy cool stuff.
rhodesa Posted March 5, 2008
Check out SimpleXML too... another good one.
drewbee Posted March 6, 2008
Wow... it definitely isn't THAT strict.

<html>
<head>
<title>title test</title>
<body><a href="link.html">Normal Link</a>
<a href=link.html>Link no Quotes</a>
<a href=link.html rel=nofollow>Rel no follow no quotes</a>
<a rel=nofollow href=link.html>Rel no follow first no quotes</a>
</body>
</html>

Output once parsed by the DOM and DOMXPath:

Normal Link ()
Link no Quotes ()
Rel no follow no quotes (nofollow)
Rel no follow first no quotes (nofollow)

As far as I am concerned, if the HTML is bad enough not to be picked up by this, then I don't need to be scraping their page. This is more than satisfactory.
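For anyone who wants to repeat the test above, here is a rough sketch of how it could be done. The post doesn't show the actual script, so this is a guess at it: it assumes the markup is loaded with DOMDocument::loadHTML() (the lenient HTML parser, as opposed to load()/loadXML(), which expect well-formed XML), and the variable names ($html, $links) and the XPath expression are made up for illustration.

```php
<?php
// Sample markup from the post above: unclosed <head>, unquoted attributes.
$html = <<<HTML
<html>
<head>
<title>title test</title>
<body><a href="link.html">Normal Link</a>
<a href=link.html>Link no Quotes</a>
<a href=link.html rel=nofollow>Rel no follow no quotes</a>
<a rel=nofollow href=link.html>Rel no follow first no quotes</a>
</body>
</html>
HTML;

// Collect parser warnings about the sloppy markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // assumption: HTML parser, not the strict XML loader
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Gather the text, href and rel of every link that has an href attribute.
$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = array(
        'text' => trim($a->textContent),
        'href' => $a->getAttribute('href'),
        'rel'  => $a->getAttribute('rel'), // empty string when rel is absent
    );
}

// Print each link as "text (rel)", the same shape as the output in the post.
foreach ($links as $link) {
    echo $link['text'], ' (', $link['rel'], ")\n";
}
```

getAttribute() returns an empty string for a missing attribute, which is why links without rel show up with empty parentheses. If the same markup is fed to loadXML() instead, expect parse errors, which is probably where the "follows the rules strictly" impression comes from.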