Greatest Problem On Earth !


d.shankar

:-[

 

 

I need to extract the form actions and the textbox names inside the form tags from the HTML source of any site.

 

Consider the code

 

<form action="new.asp" method="post">
<input type="text" name="txt1">
</form>

 

In this code, I have to extract new.asp and txt1.

 

Please help.


If this is the Greatest problem on Earth, there's a lot you're leaving out. (hehe)

I am new to regular expressions, but I believe I'm on the right track here.

/action="([^"]+)"/ should match new.asp and similar.

/name="(\w+)"/ should match any name value.

You may want to add to the second one if you only want to find names of input elements.
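
To make the two patterns above concrete, here is a minimal sketch run against the sample form from the first post. It uses a negated character class `[^"]+` so each match stops at the closing quote, and it ties the name lookup to `<input>` tags as suggested. The variable names are illustrative only, and it assumes double-quoted attribute values.

```php
<?php
// Sample markup from the first post.
$html = '<form action="new.asp" method="post">
<input type="text" name="txt1">
</form>';

// [^"]+ matches everything up to (not including) the next double quote.
preg_match('/<form[^>]*\baction="([^"]+)"/i', $html, $formMatch);
// Only look for name="..." inside an <input ...> tag.
preg_match('/<input[^>]*\bname="([^"]+)"/i', $html, $inputMatch);

// Fall back to null when a pattern did not match.
$action = $formMatch[1] ?? null;
$name   = $inputMatch[1] ?? null;

echo "$action & $name\n";
```

preg_match stops at the first hit, so a page with several forms needs preg_match_all instead.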


Mark! I say once again, this is the toughest part in the world :P ;D

 

What if you have two or three forms on a single page?

For example, I have this source:

 

<form action="form1.asp" method="get">

<input type="text" name="val1">

</form>

 

<form action="form2.asp" method="post">

<input type="text" name="val2">

</form>

 

I have to capture these results in an array, i.e.

array(0)=form1.asp & val1

array(1)=form2.asp & val2

I hope you understand my problem.

 


Try this:

 

<?php
$data = '<form action="form1.asp" method="get">
<input type="text" name="val1">
</form>

<form action="form2.asp" method="post">
<input type="text" name="val2">
</form>
';


preg_match_all('/(?<=\<form )action="(\w+\.\w+)".*?name="(\w+)"/si', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];

echo "<pre>";
print_r($forms);
print_r($values);

echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
$newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>


Hi MT, thanks for the reply.

You did a great job.

But :(

Actually, I have captured the source of a website in the variable $data, which in the previous post held the <form> markup.

I am not getting any results from this code:

 

<?php
$url="www.google.com";

$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch); //here $data variable contains the source of google.com

preg_match_all('/(?<=\<form )action="(\w+\.\w+)".*?name="(\w+)"/si', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];

echo "<pre>";
print_r($forms);
print_r($values);

echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{



$newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>

 

 

Any idea, buddy?

 


Yep.

Not sure what you want this for, but:

 

<?php
$url="www.google.com";

$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch); //here $data variable contains the source of google.com

preg_match_all('/(?<=\<form )action="([\S]+)".*?name=(\w+)/s', $data, $result, PREG_PATTERN_ORDER);
$forms = $result[1];
$values = $result[2];

echo "<pre>";
print_r($forms);
print_r($values);

echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{



$newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>


MT, I am actually working on a spidering/crawling project, and I know nothing about regex.

That's why I need these details to be accurate.

Back to the code.

The code works only for google.com, buddy. Why is that?

Does the regex need to be changed, or is it fine as is?

 


The only way I can think of doing this is something like this:

 

<?php
$url="www.google.com";

$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_FAILONERROR,true);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
$data = curl_exec($ch); //here $data variable contains the source of google.com

preg_match_all('/.*action\s?=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?.*?name=(?:\'|"|\s)?([^\'"\s\>]*)(?:\'|"|\s)?/s', $data, $result, PREG_PATTERN_ORDER);

$forms = $result[1];
$values = $result[2];

echo "<pre>";
print_r($forms);
print_r($values);

echo "or<br />";
$newarray = array();
foreach($forms as $K => $V)
{
$newarray[] = "$V & {$values[$K]}";
}
print_r($newarray);
?>

 

 

Have a go and see if it works.


Thanks, MT, for spending your precious time.

Every time I post asking for help, I feel really guilty.

I am really sorry, buddy.

Your code works well, but I have to come back to the main requirement.

The code fetches the values for only a single form.

FYI: the page www.dnschart.com/name.php contains two forms. Your superb code fetched one but ignored the other.

Your code fetched the form "whois/results.php" and the variable "query".

It left out "recordres2.php" and the variable "domain".

I changed the for loop, but that didn't help. Any suggestions, MT?
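
A likely cause of the single-form result is the leading `.*` in the pattern: with the `/s` flag it is greedy across newlines, so it swallows everything up to the last `action=` on the page and preg_match_all only ever reports one match. The sketch below reproduces that on two sample forms and compares it with a pattern that has no leading `.*`. The sample markup and variable names are assumptions for illustration, and both patterns assume double-quoted attributes.

```php
<?php
$html = '<form action="form1.asp" method="get"><input type="text" name="val1"></form>
<form action="form2.asp" method="post"><input type="text" name="val2"></form>';

// Leading greedy .* with /s: the single match runs from the start of the
// string up to the LAST action=, so earlier forms are consumed, not captured.
preg_match_all('/.*action="([^"]*)".*?name="([^"]*)"/s', $html, $greedy);

// No leading .*: preg_match_all scans forward and reports each form in turn.
preg_match_all('/<form[^>]*\baction="([^"]*)".*?\bname="([^"]*)"/s', $html, $scanned);

print_r($greedy[1]);   // only form2.asp
print_r($scanned[1]);  // form1.asp and form2.asp
print_r($scanned[2]);  // val1 and val2
```

The lazy `.*?` still has a weakness: if a form has no named input, the match can run on into the next form, and only the first input of each form is captured. Capturing each `<form>...</form>` block first and searching inside it avoids both problems.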


Hi. I have switched to DOM for the same problem. It's working, but I need a small alteration.

As I already mentioned, two or more forms on a page cause trouble. I actually need the output in this form:

 

form1

var1a

var1b

form2

var2b

form3

var3a

var3b

var3c

 

Each variable should be grouped under its parent form.

I have coded this with DOM. I am close to a solution, but I need help.

Here is the code:

 

<?php

$target_url = "www.dnschart.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs1=$dom->getElementsByTagName("form");


for ($i = 0; $i < $hrefs1->length; $i++)
{
$href = $hrefs1->item($i);
$url = $href->getAttribute('action');
echo $url;
echo "<br>";
flush();
$hrefs2=$dom->getElementsByTagName("input");
for ($j = 0;$j < $hrefs2->length; $j++)
{
$hrefx = $hrefs2->item($j);
$urlx = $hrefx->getAttribute('text');
	if(strtolower($urlx)=='text')
	{
	$urlx = $hrefx->getAttribute('name');
	echo $urlx;
	flush();
	}
}	

}

?>
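
Two things in the loop above look like the reason the inputs are not grouped: the inner lookup runs on `$dom` (the whole document) rather than on the current form element, and it reads an attribute called `text` where `type` is probably intended. A minimal sketch of the grouped version follows, assuming the DOM extension is enabled. The function name `forms_with_text_inputs` and the sample markup are hypothetical, and treating a missing type as text follows the HTML default.

```php
<?php
// Hypothetical helper: list each form's action with the names of its text inputs.
function forms_with_text_inputs(string $html): array
{
    $dom = new DOMDocument();
    // Suppress the warnings that non-validating real-world HTML usually triggers.
    @$dom->loadHTML($html);

    $result = [];
    foreach ($dom->getElementsByTagName('form') as $form) {
        $names = [];
        // Search inside this form element only, not the whole document.
        foreach ($form->getElementsByTagName('input') as $input) {
            // getAttribute() returns "" when the attribute is absent;
            // an <input> without a type behaves as type="text".
            $type = strtolower($input->getAttribute('type')) ?: 'text';
            if ($type === 'text' && $input->hasAttribute('name')) {
                $names[] = $input->getAttribute('name');
            }
        }
        $result[] = ['action' => $form->getAttribute('action'), 'inputs' => $names];
    }
    return $result;
}

$html = '<html><body>
<form action="form1.asp"><input type="text" name="var1a"><input name=var1b></form>
<form action="form2.asp"><input type="TEXT" name="var2b"><input type="submit" name="go"></form>
</body></html>';

foreach (forms_with_text_inputs($html) as $form) {
    echo $form['action'], "\n";
    foreach ($form['inputs'] as $name) {
        echo "  $name\n";
    }
}
```

The result is a list rather than an array keyed by action, so two forms that post to the same URL do not overwrite each other.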


Wow... THIS is the greatest problem on Earth!? AMAZING!!!

Try something like this. I've tested this code only against the last page you posted. Currently the regexps work only with double or single quotes (so not something like name=namefield).

 

<pre><?php 
$text='your html code';
$form=array();
preg_replace_callback('/<form[^>]+action=("|\')(.+?)(\\1).+?<\/form>/is','cb_form',$text);

function cb_form($mth){
global $form;
preg_match_all('/<input(?=[^>]+type="text")[^>]+name=("|\')(.+?)(\\1)/is',$mth[0],$names);
// for more than one type attribute 
// '/<input(?=[^>]+type="(?:text|hidden)")[^>]+name=("|\')(.+?)(\\1)/is'
$form[$mth[2]]=$names[2];
}

print_r($form);
?></pre>


"Currently the regexps work only with double/single quotes (so not something like name=namefield)."

What does that mean?

 

preg_match_all('/<input(?=[^>]+type="text")[^>]+name=("|\')(.+?)(\\1)/is',$mth[0],$names);

 

This regex works only for type="text". If I need both "password" and "text", where should I add that?


It means that if some forms don't use quotes, like Google's, my code doesn't match the name fields. I could add that if you need it, but for now let's check whether it works.

For the password problem, I wrote a comment just below the preg_match_all line explaining how to match more than one type value. :)
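
To illustrate the quoting point, here is one possible way to accept double-quoted, single-quoted and bare values for the name attribute with a three-way alternation. This is a sketch, not the pattern used in the code above; the sample tags and variable names are made up for the example.

```php
<?php
// Alternation: "double-quoted" | 'single-quoted' | bare value up to whitespace or '>'.
$pattern = '/<input\b[^>]*?\bname\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';

$html = '<input type="text" name="a1"> '
      . '<INPUT name=name22 id=name22 value="" class="formfield"> '
      . '<input name=\'c3\'>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$names = [];
foreach ($matches as $set) {
    // Only one of groups 1-3 takes part in a match. PREG_SET_ORDER drops
    // trailing unmatched groups, so the last element is the captured value.
    $names[] = end($set);
}

print_r($names);
```

Like any regex over HTML, this can be fooled, for example by `name=` appearing inside another attribute's value.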

 


Yes, realand, it works. I changed the regex as the comment said.

So how can you make it work with google.com? I also have this site: http://www.vizual.co.in/enquiry_form.html

where the input boxes are present, but their source looks like this:

<INPUT name=name22 id=name22 value="" class="formfield">

There is no type="text" attribute, so can you change the regex?

Thanks in advance for the help.


Here's one for extraction. I broke it down to make it a little easier:

 

<?php
$data ="";//the html page

//get input
preg_match_all('/<input ([^>]*)>/si', $data, $regs, PREG_PATTERN_ORDER);
$inputs = $regs[1];
foreach($inputs as $input)
{
$key = "type";
if (preg_match('/type\s?=\s?(?:\'|")?(\w+)(?:\'|")?/si', $input, $regs))
{
	$type = $regs[1];
}else{
	$type = "";
}
echo "$key=$type|";

//----other key: look up the name attribute (not type again)
$key = "name";
if (preg_match('/\bname\s?=\s?(?:\'|")?(\w+)(?:\'|")?/si', $input, $regs))
{
	$name = $regs[1];
}else{
	$name = "";
}
echo "$key=$name<br />";

}
?>

 

Problem is, my client was just on the phone, so I have to do some paid work. I will look at this later tonight.


Back from lunch.

That page you posted doesn't have any form tags, so the regex can't match anything. Instead, I've modified the regexp to handle inputs with no type attribute or with unquoted values. It seems to work:

<pre><?php
$form=array();
preg_replace_callback('/<form[^>]+action=("|\')(.+?)(\\1).+?<\/form>/is','cb_form',$text);

function cb_form($mth){
global $form;
preg_match_all('/<input(?(?=[^>]+type=)(?=[^>]+type=(?(?="|\')("|\')(?:text|password)(?:\\1)|(?:text|password))))[^>]+name=(?(?="|\')("|\')(.+?)(?:\\2)|(\S+))/is',$mth[0],$names);
$form[$mth[2]]=($names[2][0]!='')?$names[3]:$names[4];
}
print_r($form);
?></pre>


Regex is really driving me mad. It's not matching on all sites.

Are you guys comfortable with DOM?

Here is a small piece of code that returns the form actions of any website without any restrictions.

 

<?php
$target_url = "www.dnschart.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$hrefs = $dom->getElementsByTagName('form');
foreach($hrefs as $href)
{
echo $href->getAttribute('action');
echo "<br>";
} 
?>

 

This code works fine, but I am unable to access the child nodes (i.e. the type=text inputs).

Is it possible to build on this code and get it to work?


I tried my best, but I still can't figure it out.

I am unable to fetch the input variables of each form.

 

 

<?php

$target_url = "www.dnschart.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);

$params = $dom->getElementsByTagName('form');

foreach ($params as $param) {
       echo $param -> getAttribute('name').'<br>';
       if($param->hasChildNodes())
       {
       echo "true";
       echo "<br>";
       $children = $param->childNodes;
       echo $children->getElementsByTagName('input').'<br>';
       foreach($children as $child)
       {
	       echo $child->getAttribute('name');
       }
      }
else
{
echo "false";
echo "<br>";
}
?>

 

???
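
One probable reason the attempt above fails: `childNodes` returns a DOMNodeList, which has no getElementsByTagName() method, and it also contains text nodes, which have no getAttribute(). It only covers direct children as well, so inputs nested in a table or paragraph would be missed. A sketch of one alternative, using DOMXPath with a query relative to each form, is below. The sample markup and variable names are assumptions, and note that the XPath comparison on `@type` is case-sensitive.

```php
<?php
$html = '<form name="f1" action="a.php"><p><input type="text" name="q"></p></form>
<form name="f2" action="b.php"><input type="password" name="pw"><input type="hidden" name="h"></form>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from imperfect markup
$xpath = new DOMXPath($dom);

$found = [];
foreach ($xpath->query('//form') as $form) {
    $names = [];
    // The leading "." makes the query relative to this form;
    // "//" reaches inputs nested at any depth inside it.
    $inputs = $xpath->query('.//input[@type="text" or @type="password"]', $form);
    foreach ($inputs as $input) {
        $names[] = $input->getAttribute('name');
    }
    $found[$form->getAttribute('name')] = $names;
}

print_r($found);
```

Keying by the form's name attribute mirrors the code above; forms without a name would need a different key, such as their position on the page.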
