
How do you remotely access a website and create an autonomous script?


greenace92


What I am trying to do is to look up an ip-address automatically via the website ip-lookup.net for example.

 

To do this manually, I look at my IP-grabbing log of visitors, copy an IP address, paste it into ip-lookup.net, hit search, and they spit out some information.

I'm not denying that this could be a false result. I'm also not sure if I can discern between a "company" versus a private computer... usually I see a Network provider or something like that.

 

At any rate... I want this done automatically.

 

I began to work with web scraping a while back and I could target the input and submit button... or perhaps just do a submission but how can I access their site remotely through my website or from php?

 

I would have to:

 

Go to site

Target input field, paste ip

Search

Target result field, get info

Insert into database with ip-address

 

Any thoughts would be appreciated.


What's your actual goal? What kind of information do you need for what purpose? There are plenty of IP databases and services with proper APIs, but which one is appropriate depends on your specific requirements.

 

In any case, don't webscrape when there's no need to. It's fugly, fragile and possibly against the TOS.

Edited by Jacques1

I'm all for APIs, but when it comes to IP lookups they all have limits.

 

The biggest issue seems to be the spammers; at least the bots and indexers could bring you traffic.

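// REMOTE_ADDR normally holds a single address; the comma check only
// matters if a proxy setup hands over a list.
$remote_ip = $_SERVER['REMOTE_ADDR'];
if (strpos($remote_ip, ', ') !== false) {
    $ips = explode(', ', $remote_ip);
    $remote_ip = $ips[0];
}
// Ask Stop Forum Spam about this address (the API answers with XML).
$spam_url = "http://api.stopforumspam.org/api?ip=" . urlencode($remote_ip);
$spamdata = @simplexml_load_file($spam_url);
if ($spamdata) {
    // Convert the SimpleXMLElement into a plain array.
    $spamarray = json_decode(json_encode($spamdata), TRUE);

    if (isset($spamarray['appears']) && $spamarray['appears'] == "yes") {
        die('spammer');
    }
}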

There are some huge apache rules and lists around which block spammy servers and bots.

Just know that when you start using IP blocks or CIDR ranges, you are blocking an awful lot.

 

A robots.txt file can take care of legit bots; the bad ones will ignore it.

 

If you want to discover domain names from an IP and then match them against an array, you can use gethostbyaddr()

Some other related functions there as well.
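
A minimal sketch of the gethostbyaddr() idea: resolve each logged address once and keep the results in an array. The name resolveHosts and the injectable $resolver argument are made up for illustration; by default it calls PHP's built-in gethostbyaddr(), which returns the unchanged IP when no hostname is found and false on malformed input.

```php
<?php
// Hypothetical helper: map each unique IP to its reverse-DNS hostname (or null).
// $resolver defaults to PHP's gethostbyaddr(); it is injectable so the
// function can be tried out without network access.
function resolveHosts(array $ips, ?callable $resolver = null): array
{
    $resolver = $resolver ?? 'gethostbyaddr';
    $hosts = [];
    foreach (array_unique($ips) as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
            continue; // skip anything that is not a valid IPv4/IPv6 address
        }
        $host = $resolver($ip);
        // gethostbyaddr() hands back the IP itself when there is no PTR record.
        $hosts[$ip] = ($host === false || $host === $ip) ? null : $host;
    }
    return $hosts;
}
```

Reverse lookups can be slow, so it probably makes sense to run this from a cron job over the log rather than on every page request.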


It's still not really clear what you're trying to achieve.

 

Web crawlers announce themselves via the user agent and can also be easily identified with reverse DNS lookups, so there's no need for an IP blacklist. In fact, Google specifically recommends against that, because their IP addresses may change at any time.
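
A rough sketch of such a reverse DNS check, with a forward lookup to confirm the result. The function name isVerifiedCrawler, its parameters, and the suffix list you pass in are assumptions for illustration; take the actual domains from the crawler operator's documentation. Note that gethostbynamel() only returns IPv4 addresses.

```php
<?php
// Hypothetical helper: reverse-resolve the IP, check the hostname's domain,
// then forward-resolve the hostname and confirm it maps back to the same IP.
function isVerifiedCrawler(
    string $ip,
    array $allowedSuffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return false; // malformed input or no PTR record
    }
    $host = strtolower($host);

    $matches = false;
    foreach ($allowedSuffixes as $suffix) {
        // Require a dot before the suffix so "evilgooglebot.com" does not pass.
        $needle = '.' . strtolower($suffix);
        if (substr($host, -strlen($needle)) === $needle) {
            $matches = true;
            break;
        }
    }
    if (!$matches) {
        return false;
    }

    // Anyone can set a PTR record to any name, so confirm with a forward lookup.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

A call could look like isVerifiedCrawler($ip, ['googlebot.com', 'google.com']). The forward lookup is the important part, because the owner of an IP range controls its reverse DNS entries.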

 

Or are you talking about malicious bots which post spam? That's an entirely different story and may be prevented with

  • CAPTCHAs
  • powerful content filters (e. g. Bayesian or Markov filters known from e-mail)
  • as a last resort: blacklists

And then of course there are simple bots written by amateurs which don't cause any harm and should really be left alone.

 

So before you start to randomly implement all kinds of features, I strongly recommend you get clear about your goal. Trying to recognize bots is hardly a sensible objective, because there's such a big range of entirely different bots for entirely different purposes.


  • 3 weeks later...

I am simply trying to figure out whether the IP that visited my website is a real person or just a crawler or some other type of bot. I have been logging access times when someone or "something" visits my website, along with the URL they asked for. I've seen some pretty scary stuff (scary to me, as I don't know what it is), like triple forward slashes or deliberate URL queries like admin=something, where it seems like they are deliberately trying to get access to my server or do something they are not supposed to.

So what I would do manually is take the recorded ip-address and copy, then paste it into ip-lookup.net who would then tell me who it is.

For example, let me find something like what I have described above:

 

Okay here are a couple of examples that worry me as these url queries seem to deliberately be trying to find some sort of access:

 

http://mywebsite.com/?c=4e5e5d7364f443e28fbf0d3ae744a59a

http://w3.hideme.ru:80

http://159.ip-192-99-169.net/?x=()

http://www.southwest.comwww.southwest.com:443  not related to my website

http://192.99.169.159 Connection: Keep-Alive/

http://www.mywebsite.com/?C=D;O=D

 

Here's one with a triple slash. Is that wget or not?

 

http:///cgi-sys/entropysearch.cgi

 

 

See there are a bunch of those, and it scares me because I don't know what they mean. Am I safe?

My website is SSL protected with an A rating from Qualys, and I use password login. A session is required for almost every page; without a session, you are redirected to the main page.

I have some test pages / other websites that don't use sessions (is that bad?), but the SQL database is password protected.

I was told to implement bcrypt and blowfish as well as some other things.

I don't know if I am safe.
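
For reference, a minimal sketch of what the bcrypt advice usually looks like in PHP, assuming PHP 5.5 or newer. bcrypt is built on the Blowfish cipher, so the two names typically refer to the same password-hashing scheme. The helper name checkLogin and the sample password are made up.

```php
<?php
// Hash at registration time; store only the hash, never the plain password.
$hash = password_hash('correct horse battery staple', PASSWORD_BCRYPT);

// Hypothetical helper: at login, compare the submitted password
// against the hash stored for that account.
function checkLogin(string $submitted, string $storedHash): bool
{
    return password_verify($submitted, $storedHash);
}
```

password_hash() generates the salt itself and embeds it in the returned string, so there is no separate salt column to manage.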

 

So, I would take an ip-address and unfortunately I don't have any linked to those queries above, not sure why

Actually it looks like I do, I have a bunch of tables jeez... frequency counting and the actual url looked up.

 

For this one here:

http://mywebsite.com/?c=4e5e5d7364f443e28fbf0d3ae744a59a

 

This is the ip that requested it

183.60.244.46

 

So I go to ip-lookup.net and this is what they tell me:

 

IP: 183.60.244.46    Host: ?    Country: China

 

What did I have to do in order to get that?

1) Go to ip-lookup.net

2) clear my ip which is the default searched by their website

3) enter ip of interest

4) hit search

5) see result

 

I want to automate that with a web-scraper or something, I started working on it with something to do with php but lost interest

It's on my list.

So that is what I want to do, figure out how to stop or just deny any weird requests like that, if it doesn't have anything to do with existing directories.

I realize I could probably accomplish that with htaccess.

Edited by greenace92

Sorry, but what you're trying to do makes no sense.

 

It's only natural for a public website to receive all kinds of requests from all kinds of agents. There's nothing scary about that, and the only way to prevent “unwanted” requests is to not have a public website.

 

Assuming that all bots are evil while all human users are good doesn't make sense either. In fact, a human who actively tries to break into your site should scare you much more than some stupid bot scanning URLs. Needless to say that most bots are legitimate, useful tools which don't cause any harm whatsoever.

 

If you're worried about the security of your website, then do something about that. Learn the basics of the security and make sure your code, your webserver and your operating system are safe. You need to do this anyway, so you might as well start now instead of trying to fight off bots.

 

I'm sure somebody will recommend fail2ban, but I'd be careful about that. At best, this tool is a second line of defense which you apply at the very end. And in the worst case, it will give you a false sense of security and distract you from more important security measures.


Okay. I just feel really clueless as far as how to know if I am safe.

 

Currently I use:

 

Password login, session-based access

SSL

Parameterized binding

 

Not sure what else. It seems that more and more companies are getting hacked, so I agree with starting now to learn about security measures and good practices for secure coding.

Edited by greenace92

One note and not a dig but... You want to block people doing what you want to do to another site ;)

 

Also, when I do any form of scraping tests I use what looks like legitimate credentials.

 

To identify malicious crackers, go use a few penetration apps like nikto or ZAP and study what they test for. You can generally spot them by the 404's they generate.
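
A small sketch of the 404 idea: count 404 responses per client IP from access-log lines. It assumes the Apache common/combined log format and a made-up function name, count404sByIp. Lines whose request field contains escaped quotes will not match this simple pattern.

```php
<?php
// Hypothetical helper: count 404 responses per client IP.
// Assumes Apache common/combined log format:
//   IP - - [date] "METHOD /path HTTP/1.1" STATUS SIZE ...
function count404sByIp(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        if (preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /', $line, $m)
            && $m[2] === '404') {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    arsort($counts); // addresses with the most 404s first
    return $counts;
}
```

You could feed it with file($path, FILE_IGNORE_NEW_LINES), where $path points at your access log; the location depends on your server setup.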


The most common security vulnerabilities are summarized in the OWASP Top 10 list. In my experience, PHP applications typically struggle with injection vulnerabilities. So if you use parameterized queries and rigorous HTML-escaping, that's definitely a good start.

 

Besides the concrete defense mechanisms, it's important to develop a security-oriented way of thinking:

  • Keep privileges at a minimum.
  • Set up multiple layers of protection instead of relying on a single feature.
  • Don't trust anything, unless it's absolutely necessary.
  • Whitelisting is generally superior to blacklisting.
  • Established, well-tested libraries are generally superior to homegrown implementations.

Pádraic Brady does an excellent job at explaining this.


How can I even tell if I have been "infected", for lack of a better word?

 

I know the site has a lot of problems, thankfully in a way no one really cares about it.

 

I don't understand about html escaping if I am using parameterized binding. Don't you apply the html escape after the query?

 

I don't check for a real email address. Then once an account has been created you are past the point of session problems.

 

Oh man, so glad to have asked these questions, thanks for the help guys.


Well I have backed up my four vps's and reinstalled them, I figure I'll start from scratch. At this point I don't manage anyone or have any users yet so this is not a problem. It's good to catch the potential problems before they happen. I'm really thankful that you guys have caught me up, I have more to research/learn but this is really great. I want to do it right. Much of my work in other interests has been half-ass so, I really want to be competent in what I am doing.

 

 

Those are two unrelated security mechanisms. Parameterized queries prevent (most) SQL injection attacks, HTML-escaping is required to prevent cross-site scripting attacks. You need both.
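
To illustrate the difference, here is a sketch with one helper for each side. The table and column names, and the function names findCommentsByAuthor and e, are made up; the PDO connection is assumed to exist already.

```php
<?php
// SQL side: the value travels separately from the query text, so it
// cannot change the structure of the statement.
function findCommentsByAuthor(PDO $pdo, string $author): array
{
    $stmt = $pdo->prepare('SELECT body FROM comments WHERE author = :author');
    $stmt->execute([':author' => $author]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// HTML side: escape at output time, no matter where the value came from.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}
```

So the escaping does happen after the query, but not because of it: every value printed into HTML goes through something like e(), for example echo '<p>' . e($body) . '</p>';, whether it came from the database, a form, or the URL.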

 

 

Right, alright, I'm going to start from the ground up.

 

I think I might start a thread about the newest version of apache and openssl if I run into problems again.

 

I stuck with the older version of apache as it came preinstalled on debian 7 but there is debian 8 now and I figure I should choose the latest version.

 

Random thought, the age is off on this site, unless it is just me?

Edited by greenace92

FYI: You're not "stuck" with old Apache, and you don't have to upgrade the OS to install a newer one. Those are all topics for a different thread. I would suggest you do some research on what to do before just starting a thread on it. For development you could always set up virtual servers on your computer and run anything any way you want. There is VMware and plenty of other software for that, some free, some paid. I have just about every OS there is on my Windows 7 machine, and even several versions of some of them.


I just recently got into virtual boxing on my Windows computer.

 

Yeah, I don't know if it is a good idea to switch to the latest OS version if it is not stable. I tried to compile the latest Apache and ran into some errors with openssl. Yeah, I have a lot to work on; I started to get into nginx server setup too. So much to learn.


Right, I asked about the newest Apache version on webhostingtalk. With OVH and Debian 7 it seemed that 2.2.24? (an outdated version) came preinstalled. So to install the newest version I ended up following someone's compiling instructions verbatim, but I ran into problems with openssl. I meant the stable version on the actual VPS. But I really have to figure out what I am doing.

