Prismatic

Everything posted by Prismatic

  1. So here's what I'm trying to do. I haven't found any clear tutorials on how to properly navigate a DOMDocument object, at least not in the strict sense of PHP.

     I'm building a web scraper. I've had it working for some time now using more traditional methods (a combination of string manipulation and clever regex), but I've been told XPath can be much faster and more reliable for what I need. Sold.

     Let's say I'm parsing a forum. This forum wraps each reply in a thread in its own <li></li> with a class of "message":

         <li class="message">
             // Stuff here
         </li>
         <li class="message">
             // Stuff here
         </li>

     So far so good. These list items contain all the formatting for each post, including user info and the message text, each sitting in its own div:

         <li class="message">
             <div class="user info"> User info here </div>
             <div class="message text"> Message text here </div>
         </li>
         <li class="message">
             <div class="user info"> User info here </div>
             <div class="message text"> Message text here </div>
         </li>

     Still with me? Good. With this bit of code I can select each message list item and dump the text of everything inside it:

         $items = $xpath->query("//li[starts-with(@class, 'message')]");
         for ($i = 0; $i < $items->length; $i++) {
             echo $items->item($i)->nodeValue . "\n";
         }

     This produces a basic text dump of the entire forum. Close, but not what I need. What I'm trying to do is as follows:

         Step one: select all the class="message" list items [done]
         Step two: once those have been selected, run another $xpath->query to select the child nodes which contain the user info and message text

     Step one is done; step two is what is confusing me. How can I run a new query based on the output from the first query?

     Thanks guys
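For step two, one approach that should work is the optional second argument of DOMXPath::query(), which takes a context node and evaluates a relative expression against it. The sketch below is only an illustration under assumptions: the sample markup, the names "Alice" and "Bob", and the $posts array are made up, and exact-match tests on the class attribute assume the real forum uses exactly the class strings shown in the post.

```php
<?php
// Made-up sample markup modelled on the structure described in the post.
$html = '<ul>'
    . '<li class="message"><div class="user info">Alice</div>'
    . '<div class="message text">First reply</div></li>'
    . '<li class="message"><div class="user info">Bob</div>'
    . '<div class="message text">Second reply</div></li>'
    . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$posts = [];
foreach ($xpath->query("//li[starts-with(@class, 'message')]") as $item) {
    // The leading "./" makes the expression relative to the context node
    // passed as the second argument, so only this <li>'s children match.
    $user = $xpath->query("./div[@class='user info']", $item)->item(0);
    $text = $xpath->query("./div[@class='message text']", $item)->item(0);

    $posts[] = [
        'user' => $user ? trim($user->nodeValue) : null,
        'text' => $text ? trim($text->nodeValue) : null,
    ];
}

print_r($posts);
```

Note that an expression starting with "//" searches the whole document even when a context node is supplied, so the relative form ("./" or ".//") is what keeps each inner query scoped to one message.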
  2. That works great, thank you.

         [h1=/t][/h1]This text is inside the pre element, it will be parsed.
         [h1=/t][/h1]Tabbed text
         [h1=/s/s][/h1]Two spaces
         [h1=/s/s/s][/h1]Three spaces
         [h1=/s/s/s/s][/h1]four spaces[h1=/t][/h1]and a tab
         [h1=/t/t][/h1]Two tabs

     Adapted it for runs of two or more spaces:

         $output = preg_replace_callback("/[ ]{2,}/", "space_replace", $output);
         ...
         function space_replace($args) {
             $tmp = '[h1=' . str_replace(" ", '/s', $args[0]) . '][/h1]';
             return $tmp;
         }

     Cheers!

     Edit - where's the solved button at?
  3. What I'm working on is complicated, but my problem isn't. I'm trying to convert sets of tabs and spaces into other characters. For example, say I have the following, where each line starts with the number of tabs it names:

         One tab
         Two tabs
         Three tabs

     What I'm trying to do is end up with the following:

         [hl=/t][/hl]One tab
         [hl=/t/t][/hl]Two tabs
         [hl=/t/t/t][/hl]Three tabs

     Note the three /t's for the three tabs. My issue is that when the script generates the regex for the first line, which has one tab, the regex is /[\t]{1}/, but that converts all the tabs, not just the one on that line. It's hard to explain, ugh. All I can manage is:

         [hl=/t][/hl]One tab
         [hl=/t/[/hl][hl=/t][/hl]Two tabs
         [hl=/t/][/hl][hl=/t][/hl][hl=/t][/hl]Three tabs

     Help?
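One way to treat each run of tabs as a single unit is to match the whole run with one pattern and build the replacement in a callback, in the same spirit as the space_replace callback in the follow-up post. This is a minimal sketch, not necessarily what was suggested in the thread: the function name tab_replace and the sample $input are hypothetical.

```php
<?php
// Hypothetical helper: wraps one run of tabs in a single [hl=...][/hl] tag.
function tab_replace(array $match): string
{
    // $match[0] is the entire run of consecutive tabs; emit one "/t" per tab.
    return '[hl=' . str_repeat('/t', strlen($match[0])) . '][/hl]';
}

// Sample input: one, two, and three leading tabs.
$input = "\tOne tab\n\t\tTwo tabs\n\t\t\tThree tabs";

// \t+ matches each run of tabs once, so the callback fires once per run
// instead of once per tab.
$output = preg_replace_callback('/\t+/', 'tab_replace', $input);

echo $output, "\n";
```

Because the quantifier consumes the full run before the callback is invoked, there is no need to generate a separate regex per tab count, which is what caused the single-tab pattern to split longer runs.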