
XMLReader & XPath


dgoosens


Hi,

 

I have been searching all over the web but have not found a useful answer...

Hope you guys can help me.

 

This is the situation.

I have a rather big XML file (13 MB, 4100 lines) in which I need to search the text elements for data using regex patterns.

To do so, I parse the file with XMLReader (http://www.php.net/XMLReader).

I have tried to use DOMDocument (http://www.php.net/m...domdocument.php), which I prefer, but it really is too slow.

 

The script works well: it returns all the matches without any issue, and very quickly.
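For readers following along, the kind of scan described above might look roughly like this. It is only a sketch: the helper name scanTextNodes() and the sample XML are made up for illustration, and the real script would open the file with $reader->open($path) rather than parse a string.

```php
<?php
// Hypothetical helper: returns the value of every text/CDATA node matching $pattern.
function scanTextNodes(string $xml, string $pattern): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($path) instead

    $hits = [];
    while ($reader->read()) {
        $isText = $reader->nodeType === XMLReader::TEXT
            || $reader->nodeType === XMLReader::CDATA;
        // Only text content is tested; elements, comments and whitespace are skipped
        if ($isText && preg_match($pattern, $reader->value) === 1) {
            $hits[] = $reader->value;
        }
    }
    $reader->close();

    return $hits;
}

// Made-up sample document
$sample = '<root><item>foo</item><item>bar 42</item></root>';
print_r(scanTextNodes($sample, '/\d+/'));
```

This finds the matches in a single streaming pass, but at the point of a match the reader only exposes the current node, not where it sits in the document.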

BUT I need to know the exact XPath of every matched node and, surprisingly, XMLReader does not come with an XPath property or method.

 

So, basically, what I am looking for is an efficient way (speed is important) to get the XPath of any node parsed with XMLReader...

Any suggestions?

 

Thanks for your time and feedback.

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/

You could use XMLReader's expand() method to get a DOM version of the node. Then use the DOMNode base-class's method getNodePath() to return its location path.

 

Note that this will give different location paths than if you were to use SimpleXML/DOM, which load the entire document. The XMLReader -> DOM -> getNodePath will not include positional information (e.g. /example/something[3]), whereas (optionally, SimpleXML ->) DOM -> getNodePath will include it. Hopefully you can see why XMLReader won't know that positional information.
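A minimal sketch of that suggestion, assuming a small made-up document. expand() hands back a DOM copy of the current node, so getNodePath() can only describe what that copy knows about, which is why the sibling position is missing.

```php
<?php
// Made-up sample document with two same-named siblings
$xml = '<example><something>a</something><something>b</something></example>';

$reader = new XMLReader();
$reader->XML($xml);

$paths = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'something') {
        // DOM copy of the current node and its subtree (false on failure)
        $node = $reader->expand();
        if ($node !== false) {
            // Path as seen from the expanded copy, not from the whole document
            $paths[] = $node->getNodePath();
        }
    }
}
$reader->close();

print_r($paths);
```

Printing $paths for both something elements is a quick way to check whether the result is precise enough for a given use case.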

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/#findComment-1412021

So you have told us you have a problem. Can you be more specific and show us what you are trying that doesn't work? At present, I haven't a clue what you want.

 

It is very simple...

I've got an XML document that I need to parse node by node with XMLReader.

If the node's content matches a given regex pattern, I need to get the XPath to that node.

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/#findComment-1412212


Thanks Salathe...

I owe you a beer... Or at least half one

 

XMLReader -> DOM -> getNodePath() was not what I was looking for...

As you mention already, it does not include positional information.

 

BUT... thanks to your idea (SimpleXML -> DOM -> getNodePath()),

I found the dom_import_simplexml() function, which I did not know about.

This does exactly what I needed:

 

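<?php
// Recursively walk the tree and print the XPath of every leaf element
function parse(SimpleXMLIterator $xi)
{
    for ($xi->rewind(); $xi->valid(); $xi->next()) {
        if ($xi->hasChildren()) {
            parse($xi->current());
        } else {
            if (true) { // CONDITION HERE
                // Convert to a DOMElement to get access to getNodePath()
                $domEl = dom_import_simplexml($xi->current());
                echo $domEl->getNodePath() . PHP_EOL;
            }
        }
    }
}

$file = 'PATH/test.xml';
// Second argument: libxml options (0 = none); third argument true: $file is a path, not an XML string
$xi = new SimpleXMLIterator($file, 0, true);
parse($xi);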

 

Now I get all the exact nodePaths
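As an illustration of what could go in place of the "CONDITION HERE" placeholder, here is a small self-contained variant that collects the paths of leaf elements whose text matches a regex. The function name matchingPaths() and the sample XML are made up for the example, and like the code above it loads the whole document into memory.

```php
<?php
// Hypothetical helper: collects getNodePath() for every leaf element whose text matches $pattern.
function matchingPaths(SimpleXMLElement $element, string $pattern, array &$paths = []): array
{
    foreach ($element->children() as $child) {
        if (count($child->children()) > 0) {
            // Not a leaf: descend into its children
            matchingPaths($child, $pattern, $paths);
        } elseif (preg_match($pattern, (string) $child) === 1) {
            // Leaf with matching text: ask DOM for its location path
            $paths[] = dom_import_simplexml($child)->getNodePath();
        }
    }

    return $paths;
}

// Made-up sample document
$root = simplexml_load_string(
    '<root><item>foo</item><item>bar 42</item><group><item>7</item></group></root>'
);
print_r(matchingPaths($root, '/\d+/'));
```

Because the full tree is available, getNodePath() can add an index such as item[2] wherever same-named siblings exist.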

 

 

Cheers

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/#findComment-1412214

If it does not exist, invent it...

It still needs quite a bit of work, but for what I am doing this is a very good start:

class XMLReaderI extends \XMLReader
{
    /**
     * depth of the current node
     *
     * @var int
     */
    private $_currentDepth = 0;

    /**
     * depth of the previous node
     *
     * @var int
     */
    private $_previousDepth = 0;

    /**
     * list of the parsed nodes
     *
     * @var array
     */
    private $_nodesParsed = array();

    /**
     * keep track of the node types
     *
     * @var array
     */
    private $_nodesType = array();

    /**
     * keeps track of the node number
     *
     * @var array
     */
    private $_nodesCounter = array();

    /**
     * Move to next node in document
     *
     * @link http://php.net/manual/en/xmlreader.read.php
     * @return bool <b>TRUE</b> on success or <b>FALSE</b> on failure.
     */
    public function read(): bool
    {
        $r = parent::read();

        // Moving back up the tree: drop the state kept for deeper levels
        if ($this->depth < $this->_previousDepth) {
            if (!isset($this->_nodesParsed[$this->depth])) {
                throw new \Exception('Missing items in $_nodesParsed');
            }
            if (!isset($this->_nodesCounter[$this->depth])) {
                throw new \Exception('Missing items in $_nodesCounter');
            }
            if (!isset($this->_nodesType[$this->depth])) {
                throw new \Exception('Missing items in $_nodesType');
            }
            $this->_nodesParsed  = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
            $this->_nodesCounter = array_slice($this->_nodesCounter, 0, $this->depth + 1, true);
            $this->_nodesType    = array_slice($this->_nodesType, 0, $this->depth + 1, true);
        }

        // Same name and type as the previous node at this depth: count it as the next sibling.
        // Note: every node type passes through here (END_ELEMENT and whitespace included),
        // so the counter restarts whenever one of those separates two siblings.
        if (isset($this->_nodesParsed[$this->depth])
            && $this->localName == $this->_nodesParsed[$this->depth]
            && $this->nodeType == $this->_nodesType[$this->depth]
        ) {
            $this->_nodesCounter[$this->depth] = $this->_nodesCounter[$this->depth] + 1;
        } else {
            $this->_nodesParsed[$this->depth]  = $this->localName;
            $this->_nodesType[$this->depth]    = $this->nodeType;
            $this->_nodesCounter[$this->depth] = 1;
        }
        $this->_previousDepth = $this->depth;

        return $r;
    }

    /**
     * getNodePath()
     *
     * @return string XPath of the current node
     */
    public function getNodePath()
    {
        // The three tracking arrays must always have the same length
        if (count($this->_nodesCounter) != count($this->_nodesParsed)
            || count($this->_nodesCounter) != count($this->_nodesType)
        ) {
            throw new \Exception('Counts do not match');
        }

        $nodePath = '';
        foreach ($this->_nodesParsed as $depth => $nodeName) {
            switch ($this->_nodesType[$depth]) {
                case parent::ELEMENT:
                    $nodePath .= '/' . $nodeName . '[' . $this->_nodesCounter[$depth] . ']';
                    break;

                case parent::TEXT:
                case parent::CDATA:
                    $nodePath .= '/text()';
                    break;

                case parent::COMMENT:
                    $nodePath .= '/comment()';
                    break;

                case parent::ATTRIBUTE:
                    $nodePath .= '[@' . $nodeName . ']';
                    break;

                default:
                    break;
            }
        }

        return $nodePath;
    }
}

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/#findComment-1412445

OK, this seems to work:

 

use \XMLReader;
class XMLReaderX extends XMLReader
{
    /**
     * depth of the previous node
     *
     * @var int
     */
    protected $_previousDepth = 0;

    /**
     * list of the parsed nodes
     *
     * @var array
     */
    protected $_nodesParsed = array();

    /**
     * keep track of the node types
     *
     * @var array
     */
    protected $_nodesType = array();

    /**
     * keeps track of the node number
     *
     * @var array
     */
    protected $_nodesCount = array();

    /**
     * list of nodes that matter for XPath
     *
     * @var array
     */
    protected $_referencedNodeTypes = array(
        parent::ELEMENT,
        parent::ATTRIBUTE,
        parent::TEXT,
        parent::CDATA,
        parent::COMMENT
    );

    /**
     * keep track of all the parsed paths
     *
     * @var array
     */
    protected $_parsedPaths = array();

    /**
     * Move to next node in document
     *
     * @throws \Exception
     * @link http://php.net/manual/en/xmlreader.read.php
     * @return bool <b>TRUE</b> on success or <b>FALSE</b> on failure.
     */
    public function read(): bool
    {
        $read = parent::read();

        // Ignore END_ELEMENT, whitespace and other nodes that never appear in a path
        if (in_array($this->nodeType, $this->_referencedNodeTypes)) {
            // Moving back up the tree: drop the state kept for deeper levels
            if ($this->depth < $this->_previousDepth) {
                if (!isset($this->_nodesParsed[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesParsed');
                }
                if (!isset($this->_nodesCount[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesCount');
                }
                if (!isset($this->_nodesType[$this->depth])) {
                    throw new \Exception('Missing items in $_nodesType');
                }
                $this->_nodesParsed = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
                $this->_nodesCount  = array_slice($this->_nodesCount, 0, $this->depth + 1, true);
                $this->_nodesType   = array_slice($this->_nodesType, 0, $this->depth + 1, true);
            }

            if (isset($this->_nodesParsed[$this->depth])
                && $this->localName == $this->_nodesParsed[$this->depth]
                && $this->nodeType == $this->_nodesType[$this->depth]
            ) {
                // Same name and type as the previous node at this depth: next sibling
                $this->_nodesCount[$this->depth] = $this->_nodesCount[$this->depth] + 1;
            } else {
                $this->_nodesParsed[$this->depth] = $this->localName;
                $this->_nodesType[$this->depth]   = $this->nodeType;

                // Resume counting if this name was already seen under the same parent
                $logPath = $this->_getLogPath();
                if (isset($this->_parsedPaths[$logPath])) {
                    $this->_nodesCount[$this->depth] = $this->_parsedPaths[$logPath] + 1;
                } else {
                    $this->_nodesCount[$this->depth] = 1; // first node is 1, not 0
                }
            }

            // Remember the latest index reached for this element path
            if ($this->nodeType == parent::ELEMENT) {
                $this->_parsedPaths[$this->_getLogPath()] = $this->_nodesCount[$this->depth];
            }

            $this->_previousDepth = $this->depth;
        }

        return $read;
    }

    /**
     * getNodePath()
     *
     * @return string XPath of the current node
     */
    public function getNodePath()
    {
        // The three tracking arrays must always have the same length
        if (count($this->_nodesCount) != count($this->_nodesParsed)
            || count($this->_nodesCount) != count($this->_nodesType)
        ) {
            throw new \Exception('Counts do not match');
        }

        $nodePath = '';
        foreach ($this->_nodesParsed as $depth => $nodeName) {
            switch ($this->_nodesType[$depth]) {
                case parent::ELEMENT:
                    $nodePath .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
                    break;

                case parent::ATTRIBUTE:
                    $nodePath .= '[@' . $nodeName . ']';
                    break;

                case parent::TEXT:
                case parent::CDATA:
                    $nodePath .= '/text()';
                    break;

                case parent::COMMENT:
                    $nodePath .= '/comment()';
                    break;

                default:
                    throw new \Exception('Unknown node type');
            }
        }

        return $nodePath;
    }

    /**
     * get the path of the actual node for logging
     *
     * @return string
     */
    protected function _getLogPath()
    {
        $path = '';

        // Indexed path of the ancestors, followed by the current name without an index
        $localCopy = $this->_nodesParsed;
        if (isset($localCopy[$this->depth])) {
            unset($localCopy[$this->depth]);
        }

        foreach ($localCopy as $depth => $nodeName) {
            $path .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
        }
        $path .= '/' . $this->localName;

        return $path;
    }
}

 

Let me know if you encounter any issues...

 

GITHUB : https://github.com/dGo/XMLReaderX
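For comparison, the same idea (track positions while streaming) can be sketched without subclassing. This is not the XMLReaderX code above but a simplified alternative: it keeps one counter array per depth, ignores namespaces, and the function name findMatchingPaths() and the sample XML are invented for the example.

```php
<?php
// Hypothetical helper: returns the indexed XPath of every text/CDATA node matching $pattern.
function findMatchingPaths(string $xml, string $pattern): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($path) instead

    $steps  = []; // $steps[$depth]  = indexed step of the open element at that depth, e.g. "item[2]"
    $counts = []; // $counts[$depth] = [name => number of siblings seen so far under the current parent]
    $paths  = [];

    while ($reader->read()) {
        $depth = $reader->depth;

        if ($reader->nodeType === XMLReader::ELEMENT) {
            // Forget steps and counters left behind by deeper, already closed subtrees
            $steps  = array_slice($steps, 0, $depth);
            $counts = array_slice($counts, 0, $depth + 1);

            $name = $reader->name;
            $counts[$depth][$name] = ($counts[$depth][$name] ?? 0) + 1;
            $steps[$depth] = $name . '[' . $counts[$depth][$name] . ']';
        } elseif ($reader->nodeType === XMLReader::TEXT
            || $reader->nodeType === XMLReader::CDATA
        ) {
            if (preg_match($pattern, $reader->value) === 1) {
                // Ancestors of a text node at depth N are the steps 0..N-1
                $paths[] = '/' . implode('/', array_slice($steps, 0, $depth)) . '/text()';
            }
        }
    }
    $reader->close();

    return $paths;
}

// Made-up sample document
$sample = '<root><item>foo</item><other>x</other><item>bar 42</item><item>baz 7</item></root>';
print_r(findMatchingPaths($sample, '/\d+/'));
```

Every step carries an explicit index, so the paths look a little different from what DOM's getNodePath() returns, but they can be checked against the loaded document with DOMXPath::query().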

https://forums.phpfreaks.com/topic/274397-xmlreader-xpath/#findComment-1413867
