dgoosens Posted February 12, 2013

Hi,

I have been searching all over the web but have not found a useful answer. I hope you can help me.

This is the situation: I have a rather big XML file (13 MB, 4100 lines) in which I need to search the text elements for data using regex patterns. To do so, I parse the file with XMLReader (http://www.php.net/XMLReader). I tried DOMDocument (http://www.php.net/m...domdocument.php), which I prefer, but it really is too slow.

The script runs well: it returns all the matches without any issue and very fast. BUT I need to know the exact XPath of every matched node and, surprisingly, XMLReader does not come with an XPath property or method.

So, basically, what I am looking for is an effective way (speed is important) to get the XPath of any node parsed with XMLReader.

Any suggestions? Thanks for your time and feedback.

Link to comment: https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/
dgoosens Posted February 12, 2013

I just gave SimpleXML (Element and Iterator, http://www.php.net/manual/en/book.simplexml.php) a try. Same issue: it is very fast, but it seems impossible to get the XPath of a given node.
Barand Posted February 12, 2013

So you have told us you have a problem. Can you be more specific and show us what you are trying that doesn't work? At present, I haven't a clue what you want.
salathe Posted February 12, 2013 (edited February 12, 2013 by salathe)

You could use XMLReader's expand() method to get a DOM version of the node, then use getNodePath(), a method of the DOMNode base class, to return its location path.

Note that this will give different location paths than if you were to use SimpleXML/DOM, which load the entire document. XMLReader -> DOM -> getNodePath() will not include positional information (e.g. /example/something[3]), whereas (optionally, SimpleXML ->) DOM -> getNodePath() will include it. Hopefully you can see why XMLReader won't know that positional information: it only ever sees the node it is currently on, not the siblings around it.
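The expand() and getNodePath() route described above could look roughly like the sketch below. The function name expandedPaths(), the sample document and the pattern are made up for illustration; only XMLReader::XML(), XMLReader::read(), XMLReader::expand(), DOMNode::getNodePath() and the DOM properties used are real API. It assumes PHP 7 or later for the type declarations. No expected output is shown, because the expanded node is a detached copy and the exact path string it reports is best checked on your own data.

```php
<?php
// Hypothetical helper: collect getNodePath() for every leaf element whose
// text matches $pattern, using XMLReader::expand() on each element.
function expandedPaths(string $xml, string $pattern): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($file) instead

    $paths = [];
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        // expand() returns a DOM copy of the current node and its subtree.
        $node = $reader->expand();
        if ($node === false) {
            continue;
        }
        // Only look at leaf elements (no child elements of their own).
        $isLeaf = $node->getElementsByTagName('*')->length === 0;
        if ($isLeaf && preg_match($pattern, $node->textContent) === 1) {
            $paths[] = $node->getNodePath();
        }
    }
    $reader->close();

    return $paths;
}

// Usage with an invented sample document.
$sample = '<example><something>foo 1</something><something>bar</something></example>';
print_r(expandedPaths($sample, '/\d/'));
```

Because the copy has no knowledge of its earlier siblings, the result cannot carry an index such as [3], which is the limitation mentioned above.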
dgoosens Posted February 13, 2013

Quoting Barand: "So you have told us you have a problem. Can you be more specific and show us what you are trying that doesn't work? At present, I haven't a clue what you want."

It is very simple. I've got an XML document that I need to parse node by node with XMLReader. If the node's content matches a given regex pattern, I need to get the XPath to that node.
dgoosens Posted February 13, 2013 (edited February 13, 2013 by dgoosens)

Quoting salathe: "You could use XMLReader's expand() method to get a DOM version of the node. Then use the DOMNode base-class's method getNodePath() to return its location path. Note that this will give different location paths than if you were to use SimpleXML/DOM, which load the entire document. The XMLReader -> DOM -> getNodePath will not include positional information (e.g. /example/something[3]), whereas (optionally, SimpleXML ->) DOM -> getNodePath will include it. Hopefully you can see why XMLReader won't know that positional information."

Thanks Salathe... I owe you a beer. Or at least half of one.

XMLReader -> DOM -> getNodePath() was not what I was looking for: as you already mentioned, it does not include positional information. BUT, thanks to your idea (SimpleXML -> DOM -> getNodePath()), I found the dom_import_simplexml() function, which I did not know yet.

This does exactly what I needed:

    <?php
    function parse(SimpleXMLIterator $xi)
    {
        for ($xi->rewind(); $xi->valid(); $xi->next()) {
            if ($xi->hasChildren()) {
                // descend into elements that have children of their own
                parse($xi->current());
            } else {
                if (true) { // CONDITION HERE
                    // convert the leaf to a DOM element to get its path
                    $domEl = dom_import_simplexml($xi->current());
                    echo $domEl->getNodePath() . PHP_EOL;
                }
            }
        }
    }

    $file = 'PATH/test.xml';
    $xi = new SimpleXMLIterator($file, null, true);
    parse($xi);

Now I get all the exact node paths.

Cheers
salathe Posted February 13, 2013

Cheers!

P.S. There is no reason to iterate over the SimpleXMLIterator with for... foreach is an iterator's best friend.
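A foreach-based version of the recursive walk could look like the sketch below. The function name collectLeafPaths(), the sample document and the regex are invented for illustration, and PHP 7 or later is assumed for the type declarations. It uses SimpleXMLElement::count() rather than hasChildren() to decide whether to descend, so that the check does not depend on the iterator position that hasChildren() inspects.

```php
<?php
// Hypothetical helper: walk the tree with foreach and return the node path
// of every leaf element whose text matches $pattern.
function collectLeafPaths(SimpleXMLElement $node, string $pattern): array
{
    $paths = [];
    foreach ($node->children() as $child) {
        if ($child->count() > 0) {
            // The child has sub-elements: recurse into it.
            $paths = array_merge($paths, collectLeafPaths($child, $pattern));
        } elseif (preg_match($pattern, (string) $child) === 1) {
            // Leaf element with matching text: ask DOM for its location path.
            $paths[] = dom_import_simplexml($child)->getNodePath();
        }
    }

    return $paths;
}

// Usage with an invented sample document; for a file, simplexml_load_file()
// would take the place of simplexml_load_string().
$sample = '<root><item>a1</item><item>b</item><group><item>c3</item></group></root>';
print_r(collectLeafPaths(simplexml_load_string($sample), '/\d/'));
```

Note that children() without arguments only yields elements outside any namespace, which is an assumption that holds for the sample but may not for a real document.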
dgoosens Posted February 14, 2013

Quoting salathe: "Cheers! P.S. There is no reason to iterate over the SimpleXMLIterator with for... foreach is an iterator's best friend."

Thanks. My first attempt did not run with foreach; now that I have looked at it properly, it runs flawlessly.

Cheers once more
dgoosens Posted February 14, 2013

Damn... SimpleXML runs for 15 seconds; with XMLReader, it takes barely 2.

I'll keep on searching for XPath in XMLReader...
dgoosens Posted February 14, 2013 (edited February 14, 2013 by dgoosens)

If it does not exist, invent it...

There is still quite some work to be done, but for what I am doing, this is a very good start:

    class XMLReaderI extends \XMLReader
    {
        /**
         * depth of the current node
         *
         * @var int
         */
        private $_currentDepth = 0;

        /**
         * depth of the previous node
         *
         * @var int
         */
        private $_previousDepth = 0;

        /**
         * list of the parsed nodes
         *
         * @var array
         */
        private $_nodesParsed = array();

        /**
         * keep track of the node types
         *
         * @var array
         */
        private $_nodesType = array();

        /**
         * keeps track of the node number
         *
         * @var array
         */
        private $_nodesCounter = array();

        /**
         * Move to next node in document
         *
         * @link http://php.net/manual/en/xmlreader.read.php
         * @return bool TRUE on success or FALSE on failure.
         */
        public function read()
        {
            $r = parent::read();

            // moving back up the tree: drop everything deeper than the current depth
            if ($this->depth < $this->_previousDepth) {
                if (!isset($this->_nodesParsed[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesParsed');
                }
                if (!isset($this->_nodesCounter[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesCounter');
                }
                if (!isset($this->_nodesType[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesType');
                }
                $this->_nodesParsed = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
                $this->_nodesCounter = array_slice($this->_nodesCounter, 0, $this->depth + 1, true);
                $this->_nodesType = array_slice($this->_nodesType, 0, $this->depth + 1, true);
            }

            // same name and type as the previous node at this depth: increment its position
            if (isset($this->_nodesParsed[$this->depth])
                && $this->localName == $this->_nodesParsed[$this->depth]
                && $this->nodeType == $this->_nodesType[$this->depth]
            ) {
                $this->_nodesCounter[$this->depth] = $this->_nodesCounter[$this->depth] + 1;
            } else {
                $this->_nodesParsed[$this->depth] = $this->localName;
                $this->_nodesType[$this->depth] = $this->nodeType;
                $this->_nodesCounter[$this->depth] = 1;
            }

            $this->_previousDepth = $this->depth;

            return $r;
        }

        /**
         * getNodePath()
         *
         * @return string XPath of the current node
         */
        public function getNodePath()
        {
            // the three tracking arrays must always have the same size
            if (count($this->_nodesCounter) != count($this->_nodesParsed)
                || count($this->_nodesCounter) != count($this->_nodesType)
            ) {
                throw new \Exception('Counts do not match');
            }

            $nodePath = '';
            foreach ($this->_nodesParsed as $depth => $nodeName) {
                switch ($this->_nodesType[$depth]) {
                    case parent::ELEMENT:
                        $nodePath .= '/' . $nodeName . '[' . $this->_nodesCounter[$depth] . ']';
                        break;
                    case parent::TEXT:
                    case parent::CDATA:
                        $nodePath .= '/text()';
                        break;
                    case parent::COMMENT:
                        $nodePath .= '/comment()';
                        break;
                    case parent::ATTRIBUTE:
                        $nodePath .= '[@' . $nodeName . ']';
                        break;
                    default:
                        break;
                }
            }

            return $nodePath;
        }
    }
dgoosens Posted February 15, 2013

The above class contains errors. Do not use it yet.
dgoosens Posted February 21, 2013 (edited February 21, 2013 by dgoosens)

OK, this seems to work:

    use \XMLReader;

    class XMLReaderX extends XMLReader
    {
        /**
         * depth of the previous node
         *
         * @var int
         */
        protected $_previousDepth = 0;

        /**
         * list of the parsed nodes
         *
         * @var array
         */
        protected $_nodesParsed = array();

        /**
         * keep track of the node types
         *
         * @var array
         */
        protected $_nodesType = array();

        /**
         * keeps track of the node number
         *
         * @var array
         */
        protected $_nodesCount = array();

        /**
         * list of nodes that matter for XPath
         *
         * @var array
         */
        protected $_referencedNodeTypes = array(
            parent::ELEMENT,
            parent::ATTRIBUTE,
            parent::TEXT,
            parent::CDATA,
            parent::COMMENT
        );

        /**
         * keep track of all the parsed paths
         *
         * @var array
         */
        protected $_parsedPaths = array();

        /**
         * Move to next node in document
         *
         * @throws \Exception
         * @link http://php.net/manual/en/xmlreader.read.php
         * @return bool TRUE on success or FALSE on failure.
         */
        public function read()
        {
            $read = parent::read();

            // only node types that can appear in an XPath are tracked
            if (in_array($this->nodeType, $this->_referencedNodeTypes)) {
                // moving back up the tree: drop everything deeper than the current depth
                if ($this->depth < $this->_previousDepth) {
                    if (!isset($this->_nodesParsed[$this->depth])) {
                        throw new \Exception('Missing items in $_nodesParsed');
                    }
                    if (!isset($this->_nodesCount[$this->depth])) {
                        throw new \Exception('Missing items in $_nodesCount');
                    }
                    if (!isset($this->_nodesType[$this->depth])) {
                        throw new \Exception('Missing items in $_nodesType');
                    }
                    $this->_nodesParsed = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
                    $this->_nodesCount = array_slice($this->_nodesCount, 0, $this->depth + 1, true);
                    $this->_nodesType = array_slice($this->_nodesType, 0, $this->depth + 1, true);
                }

                if (isset($this->_nodesParsed[$this->depth])
                    && $this->localName == $this->_nodesParsed[$this->depth]
                    && $this->nodeType == $this->_nodesType[$this->depth]
                ) {
                    // same name and type as the previous node at this depth
                    $this->_nodesCount[$this->depth] = $this->_nodesCount[$this->depth] + 1;
                } else {
                    $this->_nodesParsed[$this->depth] = $this->localName;
                    $this->_nodesType[$this->depth] = $this->nodeType;
                    $logPath = $this->_getLogPath();
                    if (isset($this->_parsedPaths[$logPath])) {
                        // this path was seen before under the same parent: continue its count
                        $this->_nodesCount[$this->depth] = $this->_parsedPaths[$logPath] + 1;
                    } else {
                        $this->_nodesCount[$this->depth] = 1; // first node is 1, not 0
                    }
                }

                if ($this->nodeType == parent::ELEMENT) {
                    $this->_parsedPaths[$this->_getLogPath()] = $this->_nodesCount[$this->depth];
                }

                $this->_previousDepth = $this->depth;
            }

            return $read;
        }

        /**
         * getNodePath()
         *
         * @throws \Exception
         * @return string XPath of the current node
         */
        public function getNodePath()
        {
            // the three tracking arrays must always have the same size
            if (count($this->_nodesCount) != count($this->_nodesParsed)
                || count($this->_nodesCount) != count($this->_nodesType)
            ) {
                throw new \Exception('Counts do not match');
            }

            $nodePath = '';
            foreach ($this->_nodesParsed as $depth => $nodeName) {
                switch ($this->_nodesType[$depth]) {
                    case parent::ELEMENT:
                        $nodePath .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
                        break;
                    case parent::ATTRIBUTE:
                        $nodePath .= '[@' . $nodeName . ']';
                        break;
                    case parent::TEXT:
                    case parent::CDATA:
                        $nodePath .= '/text()';
                        break;
                    case parent::COMMENT:
                        $nodePath .= '/comment()';
                        break;
                    default:
                        throw new \Exception('Unknown node type');
                }
            }

            return $nodePath;
        }

        /**
         * get the path of the actual node for logging
         *
         * @return string
         */
        protected function _getLogPath()
        {
            $path = '';
            $localCopy = $this->_nodesParsed;
            // the current node is appended without an index, so leave it out of the loop
            if (isset($localCopy[$this->depth])) {
                unset($localCopy[$this->depth]);
            }
            foreach ($localCopy as $depth => $nodeName) {
                $path .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
            }
            $path .= '/' . $this->localName;

            return $path;
        }
    }

Let me know if you encounter any issues...

GITHUB: https://github.com/dGo/XMLReaderX
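For readers who only want the underlying idea, here is a stripped-down sketch of positional tracking while streaming. It is not the class above and does not reproduce its bookkeeping: it keeps one counter per element name per open parent, and the function name findMatchingPaths() is hypothetical. It assumes PHP 7 or later, always writes an index on every step (e.g. /root[1]/item[2]), and ignores attributes, comments and namespaces.

```php
<?php
// Hypothetical helper: stream the document once with XMLReader and return a
// positional path for every element that directly contains matching text.
function findMatchingPaths(string $xml, string $pattern): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($file) instead

    $stack  = []; // depth => "name[n]" step of the element currently open there
    $counts = []; // depth => [element name => siblings seen so far]
    $hits   = [];

    while ($reader->read()) {
        $depth = $reader->depth;

        if ($reader->nodeType === XMLReader::ELEMENT) {
            // A new element at this depth closes everything deeper.
            $stack  = array_slice($stack, 0, $depth);
            $counts = array_slice($counts, 0, $depth + 1);

            $name = $reader->name;
            $counts[$depth][$name] = ($counts[$depth][$name] ?? 0) + 1;
            $stack[$depth] = $name . '[' . $counts[$depth][$name] . ']';

            // Its children start counting from scratch.
            $counts[$depth + 1] = [];
        } elseif ($reader->nodeType === XMLReader::TEXT
            || $reader->nodeType === XMLReader::CDATA
        ) {
            // A text node at depth d belongs to the element at depth d - 1.
            if (preg_match($pattern, $reader->value) === 1) {
                $hits[] = '/' . implode('/', array_slice($stack, 0, $depth));
            }
        }
    }
    $reader->close();

    return $hits;
}

// Usage with an invented sample document.
$sample = '<root><item>a</item><item>b2</item><group><item>c3</item></group></root>';
print_r(findMatchingPaths($sample, '/\d/'));
```

Keeping only a stack and per-depth counters means memory use grows with the nesting depth rather than with the document size, which is the property that makes the XMLReader approach attractive for large files.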