XMLReader & XPath


dgoosens


Hi,

I have been searching all over the web but have not found a useful answer...

I hope you guys can help me.

This is the situation.

I have a rather big XML file (13 MB, 4100 lines) in which I need to search the text nodes using regex patterns.

To do so, I parse the file with XMLReader (http://www.php.net/XMLReader).

I have tried DOMDocument (http://www.php.net/m...domdocument.php), which I prefer, but it really is too slow.

The script runs well: it returns all the matches without any issue, and very fast.

BUT I need to know the exact XPath of every matched node and, surprisingly, XMLReader does not come with an XPath property or method.

So, basically, what I am looking for is an efficient way (speed is important) to get the XPath of any node parsed with XMLReader...

Any suggestions?

Thanks for your time and feedback.


You could use XMLReader's expand() method to get a DOM version of the node, then call getNodePath(), a method of the DOMNode base class, to get its location path.

Note that this will give different location paths than if you were to use SimpleXML/DOM, which load the entire document. The XMLReader -> DOM -> getNodePath route will not include positional information (e.g. /example/something[3]), whereas (optionally, SimpleXML ->) DOM -> getNodePath will include it. XMLReader reads the document as a forward-only stream, so it does not have the surrounding siblings at hand to work out that position.

Edited by salathe
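
A minimal sketch of the expand() -> getNodePath() route described above. The inline document in $xml and the element name "something" are made up for illustration; with a real file you would call open() instead of XML(). The sketch only prints whatever getNodePath() returns, so you can inspect how much of the path (ancestors, positions) survives on your setup.

```php
<?php
// Illustrative document; in practice use $reader->open('path/to/file.xml').
$xml = '<example><something>one</something><something>two</something></example>';

$reader = new XMLReader();
$reader->XML($xml);

$paths = [];
while ($reader->read()) {
    // Only look at the start tags of the elements we care about.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'something') {
        // expand() returns a DOMNode for the current node (false on failure).
        $node = $reader->expand();
        if ($node !== false) {
            $paths[] = $node->getNodePath();
        }
    }
}
$reader->close();

print_r($paths);
```

Both "something" elements produce a path, but nothing here guarantees that the two paths can be told apart, which is the limitation mentioned above.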

So you have told us you have a problem. Can you be more specific and show us what you are trying that doesn't work? At present, I haven't a clue what you want.

It is very simple...

I've got an XML document that I need to parse node by node with XMLReader.

If a node's content matches a given regex pattern, I need to get the XPath to that node.
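
To make the starting point concrete, here is a rough sketch of the "parse node by node and test each text node against a regex" part. The function name findMatchingText(), the sample document and the pattern are assumptions for illustration, not the original script. It shows what XMLReader gives you (the matching text) and what it does not (any path to the node).

```php
<?php
// Hypothetical helper: returns the text of every text/CDATA node matching $pattern.
function findMatchingText(string $xml, string $pattern): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($path) instead

    $matches = [];
    while ($reader->read()) {
        $isText = $reader->nodeType === XMLReader::TEXT
            || $reader->nodeType === XMLReader::CDATA;
        if ($isText && preg_match($pattern, $reader->value) === 1) {
            $matches[] = $reader->value;
        }
    }
    $reader->close();

    return $matches;
}

$xml = '<root><item>alpha</item><item>beta 42</item><other><![CDATA[gamma 7]]></other></root>';
$hits = findMatchingText($xml, '/\d+/');
print_r($hits);
```

The reader exposes depth, name and value for the current node only, so the location of each hit has to be worked out separately.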



Thanks Salathe...

I owe you a beer... Or at least half one

 

XMLReader -> DOM -> getNodePath() was not what I was looking for...

As you already mention, it does not include positional information.

BUT... thanks to your idea (SimpleXML -> DOM -> getNodePath()), I found the dom_import_simplexml() function, which I did not know about.

It does exactly what I needed:

 

<?php
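// Recursively walk the tree and print the XPath of every matching leaf element.
function parse(SimpleXMLIterator $xi)
{
    for ($xi->rewind(); $xi->valid(); $xi->next()) {
        if ($xi->hasChildren()) {
            parse($xi->current());
        } else {
            if (true) { // CONDITION HERE
                // Convert to a DOMElement to get access to getNodePath()
                $domEl = dom_import_simplexml($xi->current());
                echo $domEl->getNodePath() . PHP_EOL;
            }
        }
    }
}

$file = 'PATH/test.xml';
// Second argument: libxml options (0 = none); third argument true: $file is a path, not XML
$xi = new SimpleXMLIterator($file, 0, true);
parse($xi);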

 

Now I get all the exact nodePaths

 

 

Cheers

Edited by dgoosens
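
A small sketch of the same idea with an actual regex in place of the "CONDITION HERE" placeholder. The helper name collectMatchingPaths(), the sample document and the pattern are made up for illustration, and it uses a plain SimpleXMLElement with foreach rather than SimpleXMLIterator. Because the whole document is loaded, getNodePath() can include the position among same-named siblings.

```php
<?php
// Hypothetical helper: XPath of every leaf element whose text matches $pattern.
function collectMatchingPaths(SimpleXMLElement $node, string $pattern): array
{
    $paths = [];
    foreach ($node->children() as $child) {
        if ($child->count() > 0) {
            // Not a leaf: descend into its children.
            $paths = array_merge($paths, collectMatchingPaths($child, $pattern));
        } elseif (preg_match($pattern, (string) $child) === 1) {
            // Leaf with matching text: ask DOM for its location path.
            $paths[] = dom_import_simplexml($child)->getNodePath();
        }
    }
    return $paths;
}

$xml = '<root><item>alpha</item><item>beta 42</item><other>gamma 7</other></root>';
$paths = collectMatchingPaths(new SimpleXMLElement($xml), '/\d+/');
print_r($paths); // /root/item[2] and /root/other
```

Note that getNodePath() only adds an index where an element has same-named siblings, which is why "other" carries none. The trade-off is memory: this loads the full tree, unlike XMLReader.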

If it does not exist, invent it...

There is still quite some work to be done, but for what I am doing, this is a very good start:

class XMLReaderI extends \XMLReader
{
/**
 * depth of the current node
 *
 * @var int
 */
private $_currentDepth = 0;

/**
 * depth of the previous node
 *
 * @var int
 */
private $_previousDepth = 0;

/**
 * list of the parsed nodes
 *
 * @var array
 */
private $_nodesParsed = array();

/**
 * keep track of the node types
 *
 * @var array
 */
private $_nodesType = array();

/**
 * keeps track of the node number
 *
 * @var array
 */
private $_nodesCounter = array();


/**
 * Move to next node in document
 *
 * @link http://php.net/manual/en/xmlreader.read.php
 * @return bool <b>TRUE</b> on success or <b>FALSE</b> on failure.
 */
public function read(): bool
{
 $r = parent::read();

 if($this->depth < $this->_previousDepth) {
	 if(!isset($this->_nodesParsed[$this->depth])) {
		 throw new \Exception('Missing items in $_nodesParsed');
	 }
	 if(!isset($this->_nodesCounter[$this->depth])) {
		 throw new \Exception('Missing items in $_nodesCounter');
	 }
	 if(!isset($this->_nodesType[$this->depth])) {
		 throw new \Exception('Missing items in $_nodesType');
	 }
	 $this->_nodesParsed	 = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
	 $this->_nodesCounter = array_slice($this->_nodesCounter, 0, $this->depth + 1, true);
	 $this->_nodesType	 = array_slice($this->_nodesType, 0, $this->depth + 1, true);
 }
 if(isset($this->_nodesParsed[$this->depth])
	 && $this->localName == $this->_nodesParsed[$this->depth]
	 && $this->nodeType == $this->_nodesType[$this->depth])
 {
	 $this->_nodesCounter[$this->depth] = $this->_nodesCounter[$this->depth] + 1;
 } else {
	 $this->_nodesParsed[$this->depth] = $this->localName;
	 $this->_nodesType[$this->depth]	 = $this->nodeType;
	 $this->_nodesCounter[$this->depth] = 1;
 }
 $this->_previousDepth = $this->depth;

 return $r;
}

/**
 * getNodePath()
 *
 * @return string XPath of the current node
 */
public function getNodePath()
{
 // The three tracking arrays must always have the same number of entries
 if(count($this->_nodesCounter) != count($this->_nodesParsed)
	 || count($this->_nodesCounter) != count($this->_nodesType))
 {
	 throw new \Exception('Counts do not match');
 }

 $nodePath = '';
 foreach ($this->_nodesParsed as $depth => $nodeName)
    {
	    switch ($this->_nodesType[$depth]) {
		    case parent::ELEMENT:
			    $nodePath .= '/' . $nodeName . '[' . $this->_nodesCounter[$depth] . ']';
			    break;

		    case parent::TEXT:
		    case parent::CDATA:
			    $nodePath .= '/text()';
			    break;

		    case parent::COMMENT:
			    $nodePath .= '/comment()';
			    break;

		    case parent::ATTRIBUTE:
			    $nodePath .= '[@' . $nodeName . ']';
			    break;

		    default:
			    break;
	    }
    }
    return $nodePath;
   }
}

Edited by dgoosens

OK, this seems to work:

 

use \XMLReader;
class XMLReaderX extends XMLReader
{
/**
 * depth of the previous node
 *
 * @var int
 */
protected $_previousDepth = 0;

/**
 * list of the parsed nodes
 *
 * @var array
 */
protected $_nodesParsed = array();

/**
 * keep track of the node types
 *
 * @var array
 */
protected $_nodesType = array();

/**
 * keeps track of the node number
 *
 * @var array
 */
protected $_nodesCount = array();

/**
 * list of nodes that matter for XPath
 *
 * @var array
 */
protected $_referencedNodeTypes = array(
 parent::ELEMENT,
 parent::ATTRIBUTE,
 parent::TEXT,
 parent::CDATA,
 parent::COMMENT
);

/**
 * keep track of all the parsed paths
 *
 * @var array
 */
protected $_parsedPaths = array();

/**
 * Move to next node in document
 *
 * @throws \Exception if the internal tracking arrays are out of sync
 * @link http://php.net/manual/en/xmlreader.read.php
 * @return bool <b>TRUE</b> on success or <b>FALSE</b> on failure.
 */
public function read(): bool
{
 $read = parent::read();

 if(in_array($this->nodeType, $this->_referencedNodeTypes)) {
	 if($this->depth < $this->_previousDepth) {
		 if(!isset($this->_nodesParsed[$this->depth])) {
			 throw new \Exception('Missing items in $_nodesParsed');
		 }
		 if(!isset($this->_nodesCount[$this->depth])) {
			 throw new \Exception('Missing items in $_nodesCount');
		 }
		 if(!isset($this->_nodesType[$this->depth])) {
			 throw new \Exception('Missing items in $_nodesType');
		 }
		 $this->_nodesParsed	 = array_slice($this->_nodesParsed, 0, $this->depth + 1, true);
		 $this->_nodesCount = array_slice($this->_nodesCount, 0, $this->depth + 1, true);
		 $this->_nodesType	 = array_slice($this->_nodesType, 0, $this->depth + 1, true);
	 }
	 if(isset($this->_nodesParsed[$this->depth])
		 && $this->localName == $this->_nodesParsed[$this->depth]
		 && $this->nodeType == $this->_nodesType[$this->depth])
	 {
		 $this->_nodesCount[$this->depth] = $this->_nodesCount[$this->depth] + 1;
	 } else {
		 $this->_nodesParsed[$this->depth] = $this->localName;
		 $this->_nodesType[$this->depth]	 = $this->nodeType;

		 $logPath = $this->_getLogPath();
		 if(isset($this->_parsedPaths[$logPath])) {
			 $this->_nodesCount[$this->depth] = $this->_parsedPaths[$logPath] + 1;
		 } else {
			 $this->_nodesCount[$this->depth] = 1; // first node is 1, not 0
		 }
	 }

	 if($this->nodeType == parent::ELEMENT) {
		 $this->_parsedPaths[$this->_getLogPath()] = $this->_nodesCount[$this->depth];
	 }

	 $this->_previousDepth = $this->depth;
 }

 return $read;
}

/**
 * getNodePath()
 *
 * @return string XPath of the current node
 */
public function getNodePath()
{
 // The three tracking arrays must always have the same number of entries
 if(count($this->_nodesCount) != count($this->_nodesParsed)
	 || count($this->_nodesCount) != count($this->_nodesType))
 {
	 throw new \Exception('Counts do not match');
 }

 $nodePath = '';
 foreach ($this->_nodesParsed as $depth => $nodeName) {
	 switch ($this->_nodesType[$depth]) {
		 case parent::ELEMENT:
			 $nodePath .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
			 break;

		 case parent::ATTRIBUTE:
			 $nodePath .= '[@' . $nodeName . ']';
			 break;
		 case parent::TEXT:
		 case parent::CDATA:
			 $nodePath .= '/text()';
			 break;
		 case parent::COMMENT:
			 $nodePath .= '/comment()';
			 break;
		 default:
			 throw new \Exception('Unknown node type');
			 break;
	 }
 }
 return $nodePath;
}

/**
 * get the path of the actual node for logging
 *
 * @return string
 */
protected function _getLogPath()
{
 $path = '';

 $localCopy = $this->_nodesParsed;
 if(isset($localCopy[$this->depth])) {
	 unset($localCopy[$this->depth]);
 }

 foreach ($localCopy as $depth => $nodeName) {
	 $path .= '/' . $nodeName . '[' . $this->_nodesCount[$depth] . ']';
 }
 $path .= '/' . $this->localName;

 return $path;
}
}

 

Let me know if you encounter any issues...

 

GITHUB : https://github.com/dGo/XMLReaderX

Edited by dgoosens
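
For readers who only want the core idea, here is a stripped-down sketch of the bookkeeping behind a positional path while streaming. It is not the XMLReaderX code above: it is a standalone function (the name elementPaths() is made up), it handles elements only, and it keeps one counter per element name at each depth so that same-named siblings are numbered correctly even when other elements sit between them.

```php
<?php
// Hypothetical helper: positional path of every element, computed while streaming.
function elementPaths(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // for a file on disk, use $reader->open($path) instead

    $segments = []; // depth => "name[n]" for the current ancestor chain
    $counters = []; // depth => [name => siblings seen so far under the current parent]
    $paths    = [];

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        $depth = $reader->depth;
        $name  = $reader->name;

        // Forget everything deeper than the current element: those counters
        // belonged to the children of a previous sibling.
        $segments = array_slice($segments, 0, $depth);
        $counters = array_slice($counters, 0, $depth + 1);

        $counters[$depth][$name] = ($counters[$depth][$name] ?? 0) + 1;
        $segments[$depth] = $name . '[' . $counters[$depth][$name] . ']';

        $paths[] = '/' . implode('/', $segments);
    }
    $reader->close();

    return $paths;
}

$xml = '<root><item>a</item><item><sub/></item><other/><item/></root>';
$paths = elementPaths($xml);
print_r($paths);
```

Unlike DOM's getNodePath(), this sketch always writes an index, even for an only child (/root[1]/other[1]); both forms select the same node. Text, comment and attribute steps are left out to keep it short.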