selma

Hi there, I hope one of you can help me with a problem I'm having. Let me start by explaining what I want to do: I'd like to collect the headlines from an HTML page and get the results in an array whose structure represents the DOM levels of the page.

Small example:

    <h1>test1.1</h1>
    <h2>test2.1</h2>
    <h2>test2.2</h2>
    <h3>test3</h3>
    <h1>test1.2</h1>

This should end up in an array structure like:

    array(
        'level1' => array(
            'sibling1' => array(
                'headline' => 'test1.1',
                'level2' => array(
                    'sibling1' => array(
                        'headline' => 'test2.1',
                        // empty, but still has to be processed to find gaps,
                        // because an h4 headline might exist below
                        'level3' => array(),
                    ),
                    'sibling2' => array(
                        'headline' => 'test2.2',
                        'level3' => array(
                            'sibling1' => array(
                                'headline' => 'test3',
                                // empty, but still has to be processed to find gaps
                                'level4' => array(),
                            ),
                        ),
                    ),
                ),
            ),
            'sibling2' => array(
                'headline' => 'test1.2',
                'level2' => array(),
            ),
        ),
    );

I hope this example shows what I want to do. "level" represents the headline level (1-9), and "sibling" is the name for the children on that headline level.

To extract the needed data from the HTML, I built a class with a recursive function that uses a regex to slice the HTML from one headline to the next. It first iterates over all children; when there are no more children, it moves on to the next sibling element.
As a running example, see this code:

    <?php
    class Application_Model_DomParser
    {
        // Length of the trailing "<hN>" tag that the regex matches one element too far.
        const PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH = 4;

        private $slicedFields = array();

        public function __construct()
        {
        }

        public function sliceFirstSectionToDataFields($level, $haystack)
        {
            preg_match_all("#(.*?)<h$level>(.*?)</h$level>(.*?)(<h$level>|$)#s", $haystack, $data);

            // prepare the chunk data
            $pageText = '';
            if (isset($data[1][0])) {
                $pageText = $data[1][0];
            }
            $headline = '';
            if (isset($data[2][0])) {
                $headline = $data[2][0];
            }
            $dataToProcessNextLevel = '';
            if (isset($data[3][0])) {
                $dataToProcessNextLevel = $data[3][0];
            }

            // When nothing matches, $data[0][0] is not set; reading it caused the
            // warning that was previously suppressed with "@".
            $matchLength = isset($data[0][0]) ? strlen($data[0][0]) : 0;
            $posOfNextChild = $matchLength - self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH;

            // debug output
            echo $headline . "::" . strlen($dataToProcessNextLevel) . "<br>";

            if (strlen($dataToProcessNextLevel) <= self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH) {
                if ($level < 9) {
                    $dataToProcessNextLevel = $haystack;
                } else {
                    return;
                }
            }

            // recursive check for the next level in $dataToProcessNextLevel
            $nextLevel = $level + 1;
            $this->sliceFirstSectionToDataFields($nextLevel, $dataToProcessNextLevel);

            $haystack = substr($haystack, $posOfNextChild);

            // sliced to the end
            if (self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH == strlen($haystack)) {
                return;
            }

            // recursive check for other children on the current level
            $this->sliceFirstSectionToDataFields($level, $haystack);
        }
    }

    $htmlString = 'test<h1>headkline1.1</h1> <p>test test</p> '
        . '<h2>headline 2.1</h2>test test '
        . '<h2>headline 2.2</h2> <p>tes test test</p> '
        . '<h3>headline 3.1</h3>test '
        . '<h3>headline 3.2</h3>test '
        . '<h2>headline 2.3</h2> <p> </p> <p>test</p> '
        . '<h1>headline 1.2</h1> '
        . '<h2>headline 2.4</h2> <p>11111111111112222222222222222222222</p> '
        . '<p> ewfwrefg upowmdg w3q09umq09wrt n3q089ty 3q0898943ty -98 41</p> '
        . '<h3>headline 3.3</h3> <p>test</p> <p>test</p> <p>test</p> '
        . '<h1>1.3 testtest</h1> <p>test</p> '
        . '<h3>head 3.4</h3> <p>sadfsadfsadf asdfsda f sdaas saf saddas</p> '
        . '<h3>head 3.4</h3> '
        . '<h3>test 1.3</h3> <p>test;</p> ';

    $model = new Application_Model_DomParser();
    // $htmlString is a plain variable here, not a property, so no $this
    $result = $model->sliceFirstSectionToDataFields(1, $htmlString);

When you run this code, the debug echo shows that every headline is found, and in the right order of occurrence. My problem now is that I don't see how to return the extracted values and collect them into a result like the structure shown above. I have spent hours trying to solve this without getting anywhere.

I know it's a hard problem and that helping me takes time, because the situation is complex. Still, I hope someone knows an answer. I need this problem solved to put my mind at rest!

Thanks, and greetings
Selma
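One possible way to collect the values is sketched below. It does not reuse the slicing approach from the class; instead it first flattens all headlines into a list and then nests them recursively, returning each level's array to its caller. The function names `buildHeadlineTree` and `collectLevel`, the key names `levelN` and `siblingN`, and the choice to fill a gap (for example an h3 directly under an h1) with a placeholder whose headline is `null` are all assumptions made for illustration, not something the class above already does.

```php
<?php
// Hypothetical helper: nests a flat headline list, starting at index $pos.
// Returns all siblings of $level and leaves $pos at the first headline
// that belongs to a shallower level.
function collectLevel(array $headlines, int &$pos, int $level): array
{
    $siblings = array();
    while ($pos < count($headlines) && $headlines[$pos]['level'] >= $level) {
        if ($headlines[$pos]['level'] === $level) {
            $title = $headlines[$pos]['text'];
            $pos++;
        } else {
            // Gap: a deeper headline without a parent on this level.
            // Assumption: represent the missing parent as a null placeholder.
            $title = null;
        }
        $siblings['sibling' . (count($siblings) + 1)] = array(
            'headline' => $title,
            // The children are whatever the next level returns.
            'level' . ($level + 1) => collectLevel($headlines, $pos, $level + 1),
        );
    }
    return $siblings;
}

// Hypothetical entry point: extracts <h1>..<h9> in document order and nests them.
function buildHeadlineTree(string $html): array
{
    preg_match_all('#<h([1-9])>(.*?)</h\1>#s', $html, $matches, PREG_SET_ORDER);
    $headlines = array();
    foreach ($matches as $match) {
        $headlines[] = array('level' => (int) $match[1], 'text' => trim($match[2]));
    }
    $pos = 0;
    return array('level1' => collectLevel($headlines, $pos, 1));
}

$tree = buildHeadlineTree(
    '<h1>test1.1</h1> <h2>test2.1</h2> <h2>test2.2</h2> <h3>test3</h3> <h1>test1.2</h1>'
);
print_r($tree);
```

The key point is that each recursive call returns its own array and the caller stores it under the `levelN` key, so nothing has to be collected in a shared property. Like the original regex, this sketch only matches headline tags without attributes.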