d.shankar Posted August 17, 2007
I need to extract the textbox names and form actions from the form tags in the HTML source of any site. Consider this code:
<form action="new.asp" method="post">
<input type="text" name="txt1">
</form>
From this code, I have to extract new.asp and txt1. Please help.
markjoe Posted August 18, 2007
If this is the greatest problem on Earth, there's a lot you're leaving out. (hehe) I am new to regular expressions, but I believe I'm on the right track here.
/action="(\w+.*\w+)"/ should match new.asp and similar.
/name="(\w+)"/ should match any name value.
You may want to add to the second one if you only want to find names of input elements.
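A caveat on the first pattern above, shown as a minimal sketch: `.*` is greedy and the `.` is unescaped, so when the whole form sits on one line the capture can run past the closing quote of the action attribute. The stricter `[^"]+` variant below is an assumption about what was intended, not something proposed in the thread, and it only handles double-quoted attributes.

```php
<?php
// Sample markup from the first post, flattened onto a single line.
$html = '<form action="new.asp" method="post"> <input type="text" name="txt1"> </form>';

// Pattern as posted: the greedy .* keeps going to the last word character
// that is followed by a double quote, so it captures far more than new.asp.
preg_match('/action="(\w+.*\w+)"/', $html, $greedy);

// Stricter variant (an assumption): stop at the first closing double quote.
preg_match('/action="([^"]+)"/', $html, $strict);

// The same idea applied to name attributes.
preg_match_all('/name="([^"]+)"/', $html, $names);

echo "posted pattern captured:  {$greedy[1]}\n";
echo "stricter pattern captured: {$strict[1]}\n";
echo "names: ", implode(', ', $names[1]), "\n";
```

On multi-line HTML the posted pattern behaves better, because `.` does not cross newlines unless the `s` modifier is set, but it still depends on how the page happens to be laid out.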
d.shankar (Author) Posted August 18, 2007
Mark! Once again I say this is the toughest part in the world. What if you have two or three forms in a single page? For example, I have this source:
<form action="form1.asp" method="get">
<input type="text" name="val1">
</form>
<form action="form2.asp" method="post">
<input type="text" name="val2">
</form>
I have to capture these results in an array, i.e.
array(0)=form1.asp & val1
array(1)=form2.asp & val2
Hope you understand my problem.
MadTechie Posted August 18, 2007
Try this:
<?php
$data = '<form action="form1.asp" method="get">
<input type="text" name="val1">
</form>
<form action="form2.asp" method="post">
<input type="text" name="val2">
</form>
';
preg_match_all('/(?<=\<form )action="(\w+\.\w+)".*?name="(\w+)"/si', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];
echo "<pre>";
print_r($forms);
print_r($values);
echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
    $newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>
d.shankar (Author) Posted August 18, 2007
Hi MT, thanks for the reply. You did a great job. But I actually capture the source of a website in the variable $data, which in the previous post held the <form> markup. I am not getting any results with this code:
<?php
$url="www.google.com";
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch);
//here $data variable contains the source of google.com
preg_match_all('/(?<=\<form )action="(\w+\.\w+)".*?name="(\w+)"/si', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];
echo "<pre>";
print_r($forms);
print_r($values);
echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
    $newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>
Any idea, buddy?
MadTechie Posted August 18, 2007
Yep. Not sure what you want this for, but:
<?php
$url="www.google.com";
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch);
//here $data variable contains the source of google.com
preg_match_all('/(?<=\<form )action="([\S]+)".*?name=(\w+)/s', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];
echo "<pre>";
print_r($forms);
print_r($values);
echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
    $newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>
d.shankar (Author) Posted August 18, 2007
MT, I am actually working on a spidering/crawling project and I know nothing about regex. That's why I need these details to be accurate. Back to the code: it works only for google.com, buddy. Why is that? Does the regex need to be changed, or is it fine as it is?
MadTechie Posted August 18, 2007
The only way I can think of doing this is something like this:
<?php
$url="www.google.com";
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch);
//here $data variable contains the source of google.com
preg_match_all('/.*action\s?=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?.*?name=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?/s', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];
echo "<pre>";
print_r($forms);
print_r($values);
echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
    $newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>
Have a go and see if it works.
d.shankar (Author) Posted August 18, 2007
Thanks, MT, for spending your precious time. Every time I post asking for help I feel guilty; I am really sorry, buddy. Your code works well, but I come back to the main requirement: the code fetches the values for only a single form.
FYI: the page www.dnschart.com/name.php contains two forms. Your superb code fetched one but ignored the other. It fetched the form "whois/results.php" and the variable "query"; it left out "recordres2.php" and the variable "domain".
I changed the for loop but it didn't help. Any suggestions, MT?
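A likely cause, sketched under assumptions: the pattern posted just above starts with `.*` under the `s` modifier, which swallows the whole page and then backtracks to the last `action` it can match, so preg_match_all() can report at most one form. The sample `$data` below is the two-form markup from earlier in the thread, not the dnschart.com page, and `$tail` is a hypothetical name for the rest of the posted pattern.

```php
<?php
$data = '<form action="form1.asp" method="get"> <input type="text" name="val1"> </form> '
      . '<form action="form2.asp" method="post"> <input type="text" name="val2"> </form>';

// Everything in the posted pattern after its leading ".*" (hypothetical variable name).
$tail = 'action\s?=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?.*?name=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?';

// As posted: the leading .* is greedy, so the single match covers the whole
// input and only the last action/name pair is captured.
$greedyCount = preg_match_all('/.*' . $tail . '/s', $data, $greedy);

// Without the leading .*: matching restarts after each pair, so both forms are found.
$lazyCount = preg_match_all('/' . $tail . '/s', $data, $lazy);

echo "with leading .*: $greedyCount match(es)\n";
print_r($greedy[1]);
echo "without leading .*: $lazyCount match(es)\n";
print_r($lazy[1]);
print_r($lazy[2]);
```

Dropping the `.*` is still fragile: a form with no named input would make the lazy `.*?name=` run on into the next form, and each form contributes only its first name.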
d.shankar (Author) Posted August 19, 2007
Hi. I have switched to DOM for the same problem. It is working, but I need a small alteration. As I already mentioned, two or more forms cause trouble; I actually need the output this way:
form1
    var1a
    var1b
form2
    var2b
form3
    var3a
    var3b
    var3c
Each variable should be grouped under its parent form. I have coded this with DOM and I am close to a solution, but I need help. Here is the code:
<?php
$target_url = "www.dnschart.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs1=$dom->getElementsByTagName("form");
for ($i = 0; $i < $hrefs1->length; $i++)
{
    $href = $hrefs1->item($i);
    $url = $href->getAttribute('action');
    echo $url;
    echo "<br>";
    flush();
    $hrefs2=$dom->getElementsByTagName("input");
    for ($j = 0;$j < $hrefs2->length; $j++)
    {
        $hrefx = $hrefs2->item($j);
        $urlx = $hrefx->getAttribute('text');
        if(strtolower($urlx)=='text')
        {
            $urlx = $hrefx->getAttribute('name');
            echo $urlx;
            flush();
        }
    }
}
?>
Azu Posted August 19, 2007
Quote: I need to extract the textbox names and form names under the form tag of any html source file of a site. Consider the code <form action="new.asp" method="post"> <input type="text" name="txt1"> </form> In this code , i have to extract new.asp and txt1. Please help.
Wow... THIS is the greatest problem on Earth!? AMAZING!!!
d.shankar (Author) Posted August 20, 2007
Could someone help me?
rea|and Posted August 20, 2007
Try something like this. I've tested this code only against the last page you posted. Currently the regexps work only with double/single quotes (so not something like name=namefield).
<pre><?php
$text='your html code';
$form=array();
preg_replace_callback('/<form[^>]+action=("|\')(.+?)(\\1).+?<\/form>/is','cb_form',$text);
function cb_form($mth){
    global $form;
    preg_match_all('/<input(?=[^>]+type="text")[^>]+name=("|\')(.+?)(\\1)/is',$mth[0],$names);
    // for more than one type attribute
    // '/<input(?=[^>]+type="(?:text|hidden)")[^>]+name=("|\')(.+?)(\\1)/is'
    $form[$mth[2]]=$names[2];
}
print_r($form);
?></pre>
d.shankar (Author) Posted August 20, 2007
Quote: Currently the regexps work only with double/single quotes (so not something like name=namefield).
What does that mean?
preg_match_all('/<input(?=[^>]+type="text")[^>]+name=("|\')(.+?)(\\1)/is',$mth[0],$names);
This regex works only for type="text". If I need it for both "password" and "text", where should I add that?
rea|and Posted August 20, 2007
Quote: Currently the regexps work only with double/single quotes (so not something like name=namefield). What does that mean ? preg_match_all('/<input(?=[^>]+type="text")[^>]+name=("|\')(.+?)(\\1)/is',$mth[0],$names); this regex works only for type="text" , if i need for both "password" and "text" where i should i add ?
It means that if a form doesn't quote its attribute values, as on Google, my code doesn't match the name fields. I could add that if you need it, but for now let's check whether it works. For the password problem, I wrote a comment just below the preg_match_all line that shows how to match more than one type value.
d.shankar (Author) Posted August 20, 2007
Yes, rea|and, it works. I changed the regex as you described in the comment. So how can you make it work with google.com?
I also have this site, http://www.vizual.co.in/enquiry_form.html, where the input boxes are present but their source looks like this:
<INPUT name=name22 id=name22 value="" class="formfield">
There is no type="text" attribute. Can you change the regex to handle that? Thanks in advance for the help.
MadTechie Posted August 20, 2007
Here's one for extraction. I broke it down to make it a little easier:
<?php
$data = ""; // the HTML page

// get every <input ...> tag
preg_match_all('/<input ([^>]*)>/si', $data, $regs, PREG_PATTERN_ORDER);
$inputs = $regs[1];
foreach($inputs as $input)
{
    $key = "type";
    if (preg_match('/type\s?=\s?(?:\'|")?(\w+)(?:\'|")?/si', $input, $regs)) {
        $type = $regs[1];
    }else{
        $type = "";
    }
    echo "$key=$type|";
    //----other key
    $key = "name";
    // match the name attribute here (the type pattern was repeated by mistake)
    if (preg_match('/name\s?=\s?(?:\'|")?(\w+)(?:\'|")?/si', $input, $regs)) {
        $name = $regs[1];
    }else{
        $name = "";
    }
    echo "$key=$name<br />";
}
?>
Problem is my client was just on the phone, so I have to do some paid-for work. I will look at this later tonight.
d.shankar (Author) Posted August 20, 2007
Thank you, MT, but the form names are not being fetched. There should be a link between each form and its input variables. Anyway, thanks; I am still waiting.
rea|and Posted August 20, 2007
Quote: yes realand it works i changed the regex as you told to do in the comment. so how can you make it to work with google.com ... ? also i have this site http://www.vizual.co.in/enquiry_form.html where the input boxes are available but their source is in this way <INPUT name=name22 id=name22 value="" class="formfield"> there is no type="text" attribute . so can you change the regex ? thanks for the advanced help.
Back from lunch. The page you posted doesn't have any form tags, so the regex can't match anything there. I've modified the regexp to handle inputs with no type attribute or with unquoted values, and it seems to work:
<pre><?php
$text='your html code';
$form=array();
preg_replace_callback('/<form[^>]+action=("|\')(.+?)(\\1).+?<\/form>/is','cb_form',$text);
function cb_form($mth){
    global $form;
    preg_match_all('/<input(?(?=[^>]+type=)(?=[^>]+type=(?(?="|\')("|\')(?:text|password)(?:\\1)|(?:text|password))))[^>]+name=(?(?="|\')("|\')(.+?)(?:\\2)|(\S+))/is',$mth[0],$names);
    $form[$mth[2]]=($names[2][0]!='')?$names[3]:$names[4];
}
print_r($form);
?></pre>
d.shankar (Author) Posted August 20, 2007
Thanks a lot, buddy. Your array declarations are really tricky, though. How can I access the entries one by one?
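A minimal sketch of one way to walk that array, assuming it has the shape the callback above builds: each key is a form's action and each value is a list of input names. The sample values are the ones reported for the dnschart.com page earlier in the thread; `$firstAction` and `$firstName` are hypothetical names used only for illustration.

```php
<?php
// Assumed shape of $form after the callback has run: action => list of input names.
$form = [
    'whois/results.php' => ['query'],
    'recordres2.php'    => ['domain'],
];

// Outer loop: one iteration per form. Inner loop: one per input name.
foreach ($form as $action => $names) {
    echo "Form action: $action\n";
    foreach ($names as $i => $name) {
        echo "  input[$i]: $name\n";
    }
}

// Direct access to a single entry, without looping.
$firstAction = array_keys($form)[0];
$firstName   = $form[$firstAction][0];
echo "First form: $firstAction, first input: $firstName\n";
```

Because the action is used as the array key, two forms that post to the same action would overwrite each other; a numerically indexed list of action/inputs pairs avoids that.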
d.shankar (Author) Posted August 21, 2007
Regex is really driving me mad. It doesn't match on all sites. Are you guys comfortable with DOM? Here is a small piece of code that returns the form actions of any website without any restrictions:
<?php
$target_url = "www.dnschart.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$hrefs = $dom->getElementsByTagName('form');
foreach($hrefs as $href)
{
    echo $href->getAttribute('action');
    echo "<br>";
}
?>
This code works fine, but I am unable to access the child nodes (i.e. the type=text inputs). Is it possible to build on this code and get it working?
d.shankar (Author) Posted August 21, 2007
Any ideas or suggestions?
effigy Posted August 21, 2007
Try using childNodes like this example.
d.shankar (Author) Posted August 22, 2007
I tried my best but I still can't figure it out. I am unable to fetch the input variables belonging to each form.
<?php
$target_url = "www.dnschart.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$params = $dom->getElementsByTagName('form');
foreach ($params as $param)
{
    echo $param -> getAttribute('name').'<br>';
    if($param->hasChildNodes())
    {
        echo "true";
        echo "<br>";
        $children = $param->childNodes;
        echo $children->getElementsByTagName('input').'<br>';
        foreach($children as $child)
        {
            echo $child->getAttribute('name');
        }
    }
    else
    {
        echo "false";
        echo "<br>";
    }
}
?>
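One way the DOM attempt above could be made to group inputs under their parent form, as a sketch rather than a definitive answer: childNodes returns a DOMNodeList (which has no getElementsByTagName() method) and includes text nodes (which have no getAttribute()), whereas calling getElementsByTagName('input') on each form element returns only that form's inputs at any nesting depth. The function name `extract_forms`, the `$types` parameter and the sample markup are assumptions made for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: returns one entry per <form>, each holding the form's
// action and the names of its inputs whose type is listed in $types.
function extract_forms(string $html, array $types = ['text', 'password']): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('form') as $form) {
        $names = [];
        // Called on the form element, not on $dom, so only this form's
        // descendants are returned.
        foreach ($form->getElementsByTagName('input') as $input) {
            // An input without a type attribute is treated as a text box.
            $type = strtolower($input->getAttribute('type')) ?: 'text';
            $name = $input->getAttribute('name');
            if ($name !== '' && in_array($type, $types, true)) {
                $names[] = $name;
            }
        }
        $result[] = ['action' => $form->getAttribute('action'), 'inputs' => $names];
    }
    return $result;
}

// Sample markup: quoted and unquoted attributes, a nested input, a submit button.
$html = '<form action="form1.asp" method="get"><input type="text" name="val1"></form>'
      . '<form action="form2.asp" method="post"><p><INPUT name=val2></p>'
      . '<input type="submit" name="go"></form>';

foreach (extract_forms($html) as $form) {
    echo $form['action'], ': ', implode(', ', $form['inputs']), "\n";
}
```

In the crawler, the string returned by curl_exec() would be passed in place of the sample `$html`. Badly nested markup can still attach an input to an unexpected form, since the result depends on how the parser repairs the page.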
shoaiblatif Posted September 5, 2007
Dear Shankar, I have solved your problem and can show you the results you want. Contact me at shoaiblatif786@hotmail.com. I will be available from 4:00 p.m. to 10:00 p.m. (GMT+5).
Regards,
Shoaib Latif