Jump to content

Gurus needed to fine-tune REGEXP query


Wildbug

Recommended Posts

I'm writing a PHP wiki program. As with other wikis, a preset number (n) of old versions of pages are maintained before they are deleted, i.e., every time a page is edited, the version number is incremented and (new_version_number - n) is deleted. One of the functions related to this site involves finding pages that aren't linked from any other page. This means for every distinct page title, searching every version of every other page's content for its presence. For a developed site with many pages I expect this to be an expensive query due to the many REGEXP operations MySQL must perform. I've tried a few different approaches to this query...

On my small development site of 8 pages and 23 versions of those various pages:
[code]
With LEFT JOIN/RIGHT JOIN:
SELECT p2.title FROM pages AS p3 LEFT JOIN pages AS p1 ON p3.title=p1.title RIGHT JOIN pages AS p2 ON p1.title!=p2.title AND p1.content REGEXP CONCAT('[[:<:]]',p2.title,'[[:>:]]'),page_state WHERE p2.version=page_state.current_version AND p2.title=page_state.title AND p1.title IS NULL;
(0.09 sec)

With subselect:
SELECT DISTINCT p3.title FROM pages AS p3 LEFT JOIN (SELECT p1.id FROM pages AS p1,pages AS p2 WHERE p1.title!=p2.title AND p2.content REGEXP CONCAT('[[:<:]]',p1.title,'[[:>:]]')) AS p4 ON p4.id=p3.id WHERE p4.id IS NULL;
(0.05 sec)

With GROUP BY:
SELECT p1.title,SUM(p2.content REGEXP CONCAT('[[:<:]]',p1.title,'[[:>:]]')) AS found FROM pages AS p1,pages AS p2,page_state WHERE p1.title=page_state.title AND p1.version=page_state.current_version AND p1.title!=p2.title GROUP BY p1.title HAVING found=0;
(0.03 sec on 161 rows)
[/code]
... with the GROUP BY version working the best.

Although the GROUP BY version works the best, I can see some room for improvement. Since I'm using the HAVING clause, no optimization is occuring during the search (at least for that part). Think of a result table with (1) each distinct page name (2) every other version of every other page except (1), and (3) whether or not p1.title is in p2.content. The GROUP BY clause adds up every match, groups it together by the page title it's searching by (1), and after all that only returns the ones which add up to zero. Performance could be improved by stopping searching for a given title once a match is found for it, but I can't think of a way to get MySQL to do this. I can't put a "...WHERE NOT p2.content REGEXP..." in the query because another p2.content might contain that p1.title. I'd also rather not use subqueries, if at all possible.

Any ideas?

(Table explanations: "pages" contains the title, version, and content (among other things) for pages. "page_state" only contains one entry for each distinct pages.title; among other bits of data, this includes the MAX(pages.version) for a given title.)
Link to comment
Share on other sites

[!--quoteo(post=383384:date=Jun 13 2006, 01:41 PM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ Jun 13 2006, 01:41 PM) [snapback]383384[/snapback][/div][div class=\'quotemain\'][!--quotec--]
Can you post some sample DB records? I'm having trouble visualizing the problem you're trying to solve.
[/quote]
Sure, it'll be a long post, though. [img src=\"style_emoticons/[#EMO_DIR#]/smile.gif\" style=\"vertical-align:middle\" emoid=\":smile:\" border=\"0\" alt=\"smile.gif\" /] Note that the queries above [i]do[/i] work (and they all retreive the same data), but I'm afraid that for larger sites (more than my measly 8 test pages) this will be a slow query. I'd like to optimize it as best I can -- that's the question I'm posing on this forum. Specifically, the GROUP BY query (which currently is the fastest I can come up with) could be optimized by rejecting all instances of a title once it's found that it matches in another page's content, but I can't figure out how to get MySQL to do this (or if it's even possible).

My relevant table definitions:
[code]CREATE TABLE pages (
    id SMALLINT UNSIGNED ZEROFILL NOT NULL AUTO_INCREMENT PRIMARY KEY,
    version SMALLINT UNSIGNED NOT NULL,
    title VARCHAR(64) NOT NULL,
    description VARCHAR(128),
    updated_by VARCHAR(32),
    updated_ip INT UNSIGNED,
    updated TIMESTAMP,
    content TEXT
);
    
CREATE TABLE page_state (
    title VARCHAR(64) NOT NULL PRIMARY KEY,
    current_version SMALLINT UNSIGNED NOT NULL,
    locked CHAR(0),
    permEdit CHAR(0),
    permFile CHAR(0),
    magic CHAR(0),
    author VARCHAR(32),
    created TIMESTAMP
);[/code]

An excerpt of a table joined to itself so you can see what I need to search:
[code]
mysql> SELECT p1.title,p2.title,p2.version FROM pages AS p1 RIGHT JOIN pages AS p2 ON p1.title!=p2.title,page_state WHERE p1.title!=p2.title AND p1.title=page_state.title AND p1.version=page_state.current_version ORDER BY p1.title,p2.title,p2.version DESC;
+---------------------+---------------------+---------+
| title               | title               | version |
+---------------------+---------------------+---------+
| HomePage            | LinkedBy            |       0 |
| HomePage            | SandBox             |       8 |
| HomePage            | SandBox             |       7 |
| HomePage            | SandBox             |       6 |
| HomePage            | SandBox             |       5 |
| HomePage            | SandBox             |       4 |
| HomePage            | SandBox             |       3 |
| HomePage            | SandBox             |       2 |
| HomePage            | SandBox             |       1 |
| HomePage            | SandBox             |       0 |
| HomePage            | SessionTest         |       0 |
| HomePage            | SiteIndex           |       0 |
| HomePage            | SiteSearch          |       0 |
| HomePage            | TextFormattingRules |       1 |
| HomePage            | TextFormattingRules |       0 |
| HomePage            | WeeKiControl        |       0 |
| ...                 | ...                 |     ... |
| WeeKiControl        | HomePage            |       7 |
| WeeKiControl        | HomePage            |       6 |
| WeeKiControl        | HomePage            |       5 |
| WeeKiControl        | HomePage            |       4 |
| WeeKiControl        | HomePage            |       3 |
| WeeKiControl        | HomePage            |       2 |
| WeeKiControl        | HomePage            |       1 |
| WeeKiControl        | LinkedBy            |       0 |
| WeeKiControl        | SandBox             |       8 |
| WeeKiControl        | SandBox             |       7 |
| WeeKiControl        | SandBox             |       6 |
| WeeKiControl        | SandBox             |       5 |
| WeeKiControl        | SandBox             |       4 |
| WeeKiControl        | SandBox             |       3 |
| WeeKiControl        | SandBox             |       2 |
| WeeKiControl        | SandBox             |       1 |
| WeeKiControl        | SandBox             |       0 |
| WeeKiControl        | SessionTest         |       0 |
| WeeKiControl        | SiteIndex           |       0 |
| WeeKiControl        | SiteSearch          |       0 |
| WeeKiControl        | TextFormattingRules |       1 |
| WeeKiControl        | TextFormattingRules |       0 |
+---------------------+---------------------+---------+
161 rows in set (0.00 sec)[/code]So you see that I need to search for an instance of each distinct page title in each version of every other page's content. For example, I have to search for "HomePage" in sixteen pages.content rows including the nine versions of "SandBox." (pages.content is a TEXT column containing semi-rendered HTML. A page is considered linked by another page if it contains its name. So "SandBox" is considered to link to "SessionTest" if the word "SessionTest" shows up in SandBox's content.) Also, I'm just using page_state.current_version to isolate one instance of p1.title.

And what I need to return:
[code]mysql> SELECT p1.title,SUM(p2.content REGEXP CONCAT('[[:<:]]',p1.title,'[[:>:]]')) AS found FROM pages AS p1,pages AS p2,page_state WHERE p1.title=page_state.title AND p1.version=page_state.current_version AND p1.title!=p2.title GROUP BY p1.title;
+---------------------+-------+
| title               | found |
+---------------------+-------+
| HomePage            |     0 |
| LinkedBy            |     8 |
| SandBox             |     7 |
| SessionTest         |     6 |
| SiteIndex           |     0 |
| SiteSearch          |     0 |
| TextFormattingRules |     2 |
| WeeKiControl        |     0 |
+---------------------+-------+
8 rows in set (0.03 sec)
mysql> SELECT p1.title,SUM(p2.content REGEXP CONCAT('[[:<:]]',p1.title,'[[:>:]]')) AS found FROM pages AS p1,pages AS p2,page_state WHERE p1.title=page_state.title AND p1.version=page_state.current_version AND p1.title!=p2.title GROUP BY p1.title HAVING found=0;
+--------------+-------+
| title        | found |
+--------------+-------+
| HomePage     |     0 |
| SiteIndex    |     0 |
| SiteSearch   |     0 |
| WeeKiControl |     0 |
+--------------+-------+
4 rows in set (0.03 sec)[/code]Notice in the first query (where I've left off the HAVING clause) that "found" is 8 for "LinkedBy." This means that MySQL continues to perform the REGEXP on additional pages.content rows when it's already found "LinkedBy" in one of them (which, I realize, is exactly what I've asked it to do). Is there a way to have MySQL stop searching for "LinkedBy" once it's found it (since that's my rejection criteria)?

Does this help? Make sense?

Thanks for your interest.
Link to comment
Share on other sites

As you discovered, MySQL is doing exactly what you asked it to... and it won't "stop trying" based on your rejection criteria because that can't be translated into a JOIN/WHERE condition without having checked all of the rows.
Link to comment
Share on other sites

[!--quoteo(post=383641:date=Jun 14 2006, 02:26 AM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ Jun 14 2006, 02:26 AM) [snapback]383641[/snapback][/div][div class=\'quotemain\'][!--quotec--]
As you discovered, MySQL is doing exactly what you asked it to... and it won't "stop trying" based on your rejection criteria because that can't be translated into a JOIN/WHERE condition without having checked all of the rows.
[/quote]

So you can't think of another, more efficient way to render the same result set?
Link to comment
Share on other sites

[!--quoteo(post=383781:date=Jun 14 2006, 09:59 AM:name=Wildbug)--][div class=\'quotetop\']QUOTE(Wildbug @ Jun 14 2006, 09:59 AM) [snapback]383781[/snapback][/div][div class=\'quotemain\'][!--quotec--]
So you can't think of another, more efficient way to render the same result set?
[/quote]
Not really, unless you can do an initial query to "pre-select" or "exclude" certain results from the "main" query... I'll have to think about it some more.
Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.