
PHP code regex


beta0x64


Hey guys, do you think that this is a good regex to capture a class declaration?

 

/^(?P<type>abstract\s|final\s)*class\s(?P<name>[a-z0-9_]+)\s*(extends\s(?P<parent>[a-z0-9_]+)\s*)?(implements\s(?P<interfaces>([a-z0-9_,\s])+)\s*)?\{/imS

 

Am I missing anything? Is it possible for a class to be both abstract and final, too? Can I break the interfaces up into predictable named subpatterns (inter1, inter2, inter3, etc.), or will I have to split them afterwards with explode or something similar, as I'm assuming?

 

Output on example subject:

 

Array
(
    [0] => Array
        (
            [0] => class patsSQL extends MySQL implements patsInfo, patsDisplay {
            [1] => class Controller {
        )

    [type] => Array
        (
            [0] =>
            [1] =>
        )

    [1] => Array
        (
            [0] =>
            [1] =>
        )

    [name] => Array
        (
            [0] => patsSQL
            [1] => Controller
        )

    [2] => Array
        (
            [0] => patsSQL
            [1] => Controller
        )

    [3] => Array
        (
            [0] => extends MySQL
            [1] =>
        )

    [parent] => Array
        (
            [0] => MySQL
            [1] =>
        )

    [4] => Array
        (
            [0] => MySQL
            [1] =>
        )

    [5] => Array
        (
            [0] => implements patsInfo, patsDisplay
            [1] =>
        )

    [interfaces] => Array
        (
            [0] => patsInfo, patsDisplay
            [1] =>
        )

    [6] => Array
        (
            [0] => patsInfo, patsDisplay
            [1] =>
        )

    [7] => Array
        (
            [0] =>
            [1] =>
        )

)
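
On the interfaces question: PCRE keeps only the last repetition of a repeated group, so a variable number of interfaces cannot come back as inter1, inter2, and so on. Below is a minimal sketch of the explode-style approach, assuming the `interfaces` capture looks like the one in the output above. `splitInterfaces` is a made-up helper name, not part of any library.

```php
<?php
// Hypothetical helper: turn the raw "interfaces" capture into a clean list.
function splitInterfaces(string $captured): array
{
    // Split on commas with optional surrounding whitespace; drop empty pieces.
    return preg_split('/\s*,\s*/', trim($captured), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitInterfaces('patsInfo, patsDisplay'));
```

Using `preg_split` rather than a plain `explode(',', ...)` avoids a separate trimming pass when the list spans several lines.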


Well, I'm trying to make a program that will split source code files into classes, functions, and everything else.

 

const C_pattern = '/^(?P<type>abstract\s+|final\s+)?class\s(?P<name>[a-z_][a-z0-9_]*)\s*(extends\s(?P<parent>[a-z0-9_]+)\s*)?(implements\s(?P<interfaces>([a-z0-9_,\s])+)\s*)?\{/imS';
const F_pattern = '/^function\s+(?P<name>[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)\s*\((?P<operands>[\$a-z0-9_,\s]*)\)\s*\{/imS';

 

My one concern is nested declarations (functions inside functions, classes inside functions, functions inside classes, etc.) when the user does not indent properly, since both patterns anchor at the start of a line. I think I can determine the offset of each match and then replace the match with a require_once(), as if nothing had happened. This would work even inside a class, correct?
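
The offset idea can be sketched with `PREG_OFFSET_CAPTURE`, which makes `preg_match` return each capture together with its byte offset. This is only an illustration: `findClassHeader` and the simplified pattern are assumptions, and the match covers just the header up to the opening brace, so finding the end of the body would still need brace counting.

```php
<?php
// Hypothetical helper: locate the first class header and report where it starts.
function findClassHeader(string $source): ?array
{
    // Simplified pattern for illustration (no extends/implements handling).
    $pattern = '/^(?:abstract\s+|final\s+)?class\s+(?P<name>[a-z_][a-z0-9_]*)\s*\{/im';

    if (preg_match($pattern, $source, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
    return [
        'name'   => $m['name'][0],
        'offset' => $m[0][1],
        'length' => strlen($m[0][0]),
    ];
}

$source = "<?php\nclass Controller {\n}\n";
var_dump(findClassHeader($source));
```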

 

To handle the functions-inside-classes problem, I just parse the classes first and delete them. (Don't worry, I plan on doing all of this in a temp file, so deletion is not a problem.)

 

What do you guys think?

 

 


You know, there is no need to write a PHP parser yourself; there is already one in... PHP.

 

Consider the following file (test.php):

<?php
function foo() {
    echo 'hello world';
}

echo 2 + 4;

function
hello
()
{
    // aslkjdlakjds
    return 'abc' +
        
        1;}

echo 'more junk here';
?>

 

Then you can extract all the functions in that file like this:

<?php
$tokens = token_get_all(file_get_contents('test.php'));

$started = false;
$open = 0;
$functions = array();
$tmp = '';

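foreach ($tokens as $token) {
    // Array tokens are [id, text, line]; single characters such as '{' arrive as plain strings.
    if (!$started && is_array($token) && $token[0] === T_FUNCTION) {
        $started = true;
        $tmp .= $token[1];
    }
    else if ($started) {
        if (is_array($token)) {
            $tmp .= $token[1];

            // "{$var}" and "${var}" open with an array token but close with a plain '}'.
            if ($token[0] === T_CURLY_OPEN || $token[0] === T_DOLLAR_OPEN_CURLY_BRACES) {
                ++$open;
            }
        }
        else {
            $tmp .= $token;

            if ($token === '{') {
                ++$open;
            }
            else if ($token === '}') {
                // Back at depth zero: the function body has closed.
                if (--$open === 0) {
                    $started = false;
                    $functions[] = $tmp;
                    $tmp = '';
                }
            }
        }
    }
}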

var_dump($functions);

 

The output of running that would be:

array(2) {
  [0]=>
  string(42) "function foo() {
    echo 'hello world';
}"
  [1]=>
  string(79) "function
hello
()
{
    // aslkjdlakjds
    return 'abc' +
        
        1;}"
}
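
To see why the loop treats array tokens and plain strings differently, it can help to dump what `token_get_all` returns for a tiny input. A small sketch, where `describeTokens` is a hypothetical helper written for this illustration:

```php
<?php
// Hypothetical helper: describe each token as "NAME text" or "char x".
function describeTokens(string $code): array
{
    $out = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens carry a numeric id; token_name() turns it into T_FUNCTION etc.
            $out[] = token_name($token[0]) . ' ' . json_encode($token[1]);
        } else {
            // Single-character tokens such as '{', '}', '(' and ';' are plain strings.
            $out[] = 'char ' . $token;
        }
    }
    return $out;
}

print_r(describeTokens('<?php function foo() {}'));
```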


Awww, but now how will I show off my l33t regex skillz?

 

Anyway, the Tokenizer also grabs functions inside of classes (but strangely not functions inside of functions, hmmm), which is not what I want, per se. I think I should stick with my current M.O., especially because I've already coded most of it...

 

Thanks, though!


Well, it was not meant to be a complete solution for you. You were supposed to extend it in the same manner to capture classes.

 

Take a look at how it works. Once a function declaration starts, it stops inspecting the tokens and simply appends them to the string until the closing brace that ends the function. You can do the same with classes: when a class declaration starts, append everything until it ends.


For what it's worth, this will extract functions, classes and interfaces from a file. It took me ten minutes to write. I bet it took longer to write those regular expressions.

 

Sample file (test.php):

<?php
function foo() {
    echo 'hello world';
}

echo 2 + 4;

function
hello
()
{
    // aslkjdlakjds
    return 'abc' +
        
        1;}

class Hello
{
    function thisMethodWontGetIncludedInFunctions() {
        echo 'foo';
    }
}
abstract class Foo {}
final class Bar {}
interface Baz {}

echo 'more junk here';
?>

 

Parser:

<?php
class FileParser
{
    private $_path;
    private $_parsed = false;

    private $_classes = array();
    private $_functions = array();
    private $_interfaces = array();

    public function __construct($path)
    {
        $this->_path = $path;
    }

    private function _parse()
    {
        if ($this->_parsed) return;

        $parsing = null;
        $tmp = '';
        $open = 0;

        foreach (token_get_all(file_get_contents($this->_path)) as $token) {
            if ($parsing === null && is_array($token)) {
                switch ($token[0]) {
                    case T_FUNCTION:
                        $parsing = T_FUNCTION;
                        break;
                    case T_CLASS:
                    case T_ABSTRACT:
                    case T_FINAL:
                        $parsing = T_CLASS;
                        break;
                    case T_INTERFACE:
                        $parsing = T_INTERFACE;
                        break;
                }
                if ($parsing !== null) $tmp .= $token[1];
            }
            // Only collect tokens while inside a declaration; otherwise stray
            // characters such as '+' and ';' would leak into the next capture.
            else if ($parsing !== null) {
                if (is_array($token)) {
                    $tmp .= $token[1];
                }
                else {
                    $tmp .= $token;

                    switch ($token) {
                        case '{':
                            ++$open;
                            break;
                        case '}':
                            if (--$open === 0) {
                                switch ($parsing) {
                                    case T_FUNCTION:
                                        $this->_functions[] = $tmp;
                                        break;
                                    case T_CLASS:
                                        $this->_classes[] = $tmp;
                                        break;
                                    case T_INTERFACE:
                                        $this->_interfaces[] = $tmp;
                                        break;
                                }
                                $parsing = null;
                                $tmp = '';
                            }
                            break;
                    }
                }
            }
        }

        $this->_parsed = true;
    }

    public function getClasses()
    {
        $this->_parse();
        return $this->_classes;
    }

    public function getFunctions()
    {
        $this->_parse();
        return $this->_functions;
    }

    public function getInterfaces()
    {
        $this->_parse();
        return $this->_interfaces;
    }
}

$parser = new FileParser('test.php');

var_dump(
    $parser->getFunctions(),
    $parser->getClasses(),
    $parser->getInterfaces()
);

Output:

array(2) {
  [0]=>
  string(42) "function foo() {
    echo 'hello world';
}"
  [1]=>
  string(79) "function
hello
()
{
    // aslkjdlakjds
    return 'abc' +
        
        1;}"
}
array(3) {
  [0]=>
  string(95) "class Hello
{
    function thisMethodWontGetIncludedInFunctions() {
        echo 'foo';
    }
}"
  [1]=>
  string(21) "abstract class Foo {}"
  [2]=>
  string(18) "final class Bar {}"
}
array(1) {
  [0]=>
  string(16) "interface Baz {}"
}
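
If each extracted chunk is going to end up in its own file, the declared name is needed as well. One possible sketch, again using the tokenizer: `declaredName` is a hypothetical helper, and it assumes the chunk is a string like those returned by the parser above.

```php
<?php
// Hypothetical helper: return the name declared by an extracted chunk,
// or null if the chunk is anonymous or not a declaration.
function declaredName(string $chunk): ?string
{
    $seenKeyword = false;

    // The chunk has no opening tag, so prepend one for the tokenizer.
    foreach (token_get_all('<?php ' . $chunk) as $token) {
        if (!is_array($token)) {
            // Reaching '(' or '{' before a name means the declaration is anonymous.
            if ($seenKeyword && ($token === '(' || $token === '{')) {
                return null;
            }
            continue;
        }

        if (in_array($token[0], [T_CLASS, T_INTERFACE, T_FUNCTION], true)) {
            $seenKeyword = true;
        } elseif ($seenKeyword && $token[0] === T_STRING) {
            // The first identifier after the keyword is the declared name.
            return $token[1];
        }
    }

    return null;
}

var_dump(declaredName('abstract class Foo {}'));
```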
