MarcusZ Posted June 26, 2010
Hi, I need help parsing the data in a table cell:
<td class="start_no">(1)</td>
My regex isn't working:
$therow = '<td class="start_no">(1)</td>';
preg_match("'<td class=\"start_no\">(.*?)</td>'si", $therow, $startnr);
echo $startnr[1];
MarcusZ Posted June 26, 2010
It doesn't echo anything...
cags Posted June 26, 2010
Works for me.
MarcusZ Posted June 26, 2010
I now see the problem: the source actually contains backslashes before the quotes:
<td class=\"start_no\">(1)</td>
I'll try to solve it myself by deleting all the backslashes.
MarcusZ Posted June 26, 2010
It was not possible to delete the backslashes; they are still there in the browser. How should I change the regex to account for the backslashes? I believe they are treated as " " and ' '. Thank you.
Goldeneye Posted June 27, 2010
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror.
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ [as cited from http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454] Quote Link to comment Share on other sites More sharing options...
cags Posted June 27, 2010
What a useful post... Regex CAN be used for parsing HTML; it just isn't the best way of doing it when we have DOM tools for the job (having said that, it's quite possible some of these tools use regex at the back end). Either way, I have no idea what the OP is talking about. The code block in the first post, which reportedly wasn't working, worked perfectly fine. As for the more recent post, I'm not really sure what's being said. The implication seems to be that the backslashes are in the source, which means the source probably isn't an HTML document, so any form of DOM manipulation probably wouldn't work anyway. If that is the case, you just need to write a pattern that matches them.
'#<td class=\\"start_no\\">(.*?)</td>#s'
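For the DOM route cags mentions, a minimal sketch might look like the following. It assumes the input is ordinary HTML with no stray backslashes, and the surrounding <table> markup and the variable names are made up for illustration; only the td cell comes from the thread.

```php
<?php
// Hypothetical input: the cell from the first post, wrapped in a table so it is valid HTML.
$html = '<table><tr><td class="start_no">(1)</td></tr></table>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select every <td> whose class attribute is exactly "start_no".
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//td[@class="start_no"]');

// Take the text of the first matching cell, or null if there is none.
$startnr = $cells->length > 0 ? $cells->item(0)->textContent : null;

echo $startnr;
```

This only helps once the input really is HTML; with the backslashes still in place, the class attribute would not parse as expected.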
ZachMEdwards Posted June 27, 2010
$pattern = '%<td class=\\\\"start_no\\\\">(.*?)</td>%';
salathe Posted June 27, 2010
MarcusZ wrote: I now see the problem: <td class=\"start_no\">(1)</td>
Is that part of some JavaScript or something? Otherwise, why on earth would there be backslashes before double quotes in HTML?
@Zach, please don't just post code. Some cursory description of what it does, or why it works, or how it improves on previous posts, or, well, anything, would be better than just a one-line code snippet.
MarcusZ Posted June 27, 2010
The backslashes are there because I get the HTML code from a form. The backslashes are added automatically before each " so that the query works correctly. My PHP code is now working, thanks!
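If the backslashes really are added to the form value automatically, one alternative to changing the pattern is to strip them before matching. The sketch below assumes that is the situation; $submitted is a hypothetical stand-in for the form value, and stripslashes() is PHP's built-in function for removing that kind of escaping.

```php
<?php
// Hypothetical stand-in for the submitted form value, backslashes included.
// Inside single quotes, \" stays as a literal backslash followed by a quote.
$submitted = '<td class=\"start_no\">(1)</td>';

// Remove the escaping backslashes so the markup looks like plain HTML again.
$therow = stripslashes($submitted);

// The pattern from the first post now applies unchanged.
preg_match('#<td class="start_no">(.*?)</td>#si', $therow, $startnr);

echo $startnr[1];
```

Whether this is appropriate depends on where else the value is used, since the escaping was presumably added for a reason.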
ZachMEdwards Posted June 28, 2010
He already has all the code; he's just using the wrong pattern.
cags Posted June 28, 2010
Oops, I forgot a slash. You don't need 4, however; it only requires 3. Incidentally, if you are getting it from a submitted form value, it sounds like you have magic_quotes enabled on your server. The simplest thing would be to disable them if you have access.
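The two-, three- and four-backslash variants from this thread can be compared directly. The sketch below relies on PHP's single-quoted string rules, where only \\ and \' are escape sequences; the variable names are made up for illustration.

```php
<?php
// Input with literal backslashes before the quotes, as described in the thread.
$therow = '<td class=\"start_no\">(1)</td>';

// Two backslashes: the string holds \" so the regex matches a bare quote only.
$two = '#<td class=\\"start_no\\">(.*?)</td>#s';

// Three backslashes: the string holds \\" so the regex matches a backslash, then a quote.
$three = '#<td class=\\\"start_no\\\">(.*?)</td>#s';

// Four backslashes: the string also holds \\" and is identical to the three-backslash form.
$four = '#<td class=\\\\"start_no\\\\">(.*?)</td>#s';

var_dump($three === $four);              // same pattern string
var_dump(preg_match($two, $therow));     // no match against the escaped input
var_dump(preg_match($three, $therow, $m));
echo $m[1];
```

Three and four backslashes end up as the same pattern because the third backslash is not followed by a character that single quotes treat as an escape, so it is kept as-is.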