Hi all,

Let me get right down to it without a lot of fuss :)

OBJECTIVE: Scrape data from a web page, organize it, and store it in a database.

PROGRESS: Stuck at "organizing" --> multidimensional arrays that need to be sliced, have their keys reset, etc.

TARGET: Not to kill self trying

So here we go. I have a scrape script that pulls the table I want from a URL and can spit it back at me. Fantastic. I then turn it into an array so I can store it in a MySQL database. Here is the array:

[code]Array
(
    [0] => Array
        (
            [0] => Overview
            [1] => Games
        )

    [1] => Array
        (
        )

    [2] => Array
        (
            [0] => Playlist
            [1] => Level
            [2] => Games Played
            [3] => Wins
        )

    [4] => Array
        (
            [0] => Rumble Pit
            [1] => 15
        )

    [5] => Array
        (
            [1] => 76
            [2] => 7
        )

    [6] => Array
        (
        )

    [8] => Array
        (
            [0] => Double Team
            [1] => 20
        )

    [9] => Array
        (
            [1] => 188
            [2] => 80
        )

    [10] => Array
        (
        )

    [12] => Array
        (
            [0] => Team Slayer
            [1] => 25
        )

    [13] => Array
        (
            [1] => 407
            [2] => 177
        )

    [14] => Array
        (
        )

    [16] => Array
        (
            [0] => Team Skirmish
            [1] => 19
        )

    [17] => Array
        (
            [1] => 533
            [2] => 183
        )

    [18] => Array
        (
        )

    [20] => Array
        (
            [0] => Team Snipers
            [1] => 20
        )

    [21] => Array
        (
            [1] => 69
            [2] => 41
        )

    [22] => Array
        (
        )

    [24] => Array
        (
            [0] => Team Hardcore
            [1] => 14
        )

    [25] => Array
        (
            [1] => 71
            [2] => 29
        )

    [26] => Array
        (
        )

    [28] => Array
        (
            [0] => BTB Skirmish
            [1] => 27
        )

    [29] => Array
        (
            [1] => 356
            [2] => 135
        )

    [30] => Array
        (
        )

    [32] => Array
        (
        )

    [33] => Array
        (
        )

    [34] => Array
        (
        )

    [35] => Array
        (
        )

    [36] => Array
        (
            [0] => Questions about Stats?
Stats Help: Halo 2 and Bungie.net
Gamertag Linking: Get additional features!
Halo 2 Matchmaking: Matchmaking Unveiled
Halo 2 Stats: Ranking Overview
Halo 2 Medals: Medal Info
        )

    [37] => Array
        (
        )

    [39] => Array
        (
            [0] => Games | Stats | Community | Inside Bungie | Bungie Store | Home |
            [1] => contact us  |
            [2] => help
        )

    [40] => Array
        (
            [0] => privacy statement |
            [1] => terms of use |
            [2] => code of conduct |
            [3] => jobs
        )

    [41] => Array
        (
            [0] =>  © 2006 Microsoft Corporation
    All rights reserved.
            [1] =>  Halo 3
            [2] => Halo 2 Xbox
            [3] => Halo 2 Vista
            [4] => Last Updated:


Halo 2 Vista   
Home   
        )

    [42] => Array
        (
            [1] => Halo 2 Stats
            [2] => Playlists
            [3] => Find Player
            [4] => Rank System
            [5] => My Stats
        )

    [43] => Array
        (
            [1] => Forums
            [2] => Find Group
            [3] => Events
            [4] => Fanclub
            [5] => Links   
        )

    [44] => Array
        (
            [1] => The Team
            [2] => Webcams
            [3] => Bungie History
            [4] => Last Updated:


Inside Bungie   
Section   
        )

    [45] => Array
        (
            [1] => T-Shirts
            [2] => Multi-Media
            [3] => Accessories
            [4] => Newest Item:


The entire   
store!   
        )

)
[/code]

PROBLEMS:
1. I can't get rid of the empty arrays.
2. The data I need is between "Rumble Pit" and "BTB Skirmish".
3. I need to combine those arrays in pairs, e.g.
[code][4] => Array
        (
            [0] => Rumble Pit
            [1] => 15
        )

    [5] => Array
        (
            [1] => 76
            [2] => 7
        )[/code]

needs to be
[code][4] => Array
        (
            [0] => Rumble Pit
            [1] => 15
            [2] => 76
            [3] => 7
        )[/code]

and so on for the 7 game types.

So my question is: do I try to tweak the scrape script (a nested table is making it split "Rumble Pit" etc. away from the rest of each game type's row), or should I just manipulate this array to heck?

I've spent two days looking up slices, unset functions, combining, user-defined functions... but I'm stuck.

Any guidance would be greatly appreciated! (I did learn that you can't unset an array...how disappointing).

Thanks!
Kate
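
For the "manipulate the array" route, here is a minimal sketch, assuming PHP 7+ and assuming every playlist row is immediately followed by its stats row once the empty arrays are gone. The function name `combinePlaylistRows` and the trimmed-down `$rows` sample are made up for illustration; the values are copied from the dump above. `array_filter()` with no callback drops empty sub-arrays (an empty array is falsy), and `array_merge()` renumbers numeric keys, which is what turns `[0, 1]` plus `[1, 2]` into `[0, 1, 2, 3]`. Single elements can also be removed with `unset($rows[6])`.

```php
<?php
// Trimmed-down sample shaped like the dump above (hypothetical subset).
$rows = [
    2  => ['Playlist', 'Level', 'Games Played', 'Wins'],
    4  => ['Rumble Pit', '15'],
    5  => [1 => '76', 2 => '7'],
    6  => [],
    8  => ['Double Team', '20'],
    9  => [1 => '188', 2 => '80'],
    10 => [],
    28 => ['BTB Skirmish', '27'],
    29 => [1 => '356', 2 => '135'],
    30 => [],
    36 => ['Questions about Stats?'],
];

// Hypothetical helper: keeps only the rows from $first to $last (inclusive)
// and merges each playlist row with the stats row that follows it.
function combinePlaylistRows(array $rows, string $first, string $last): array
{
    // Drop empty sub-arrays, then reset the outer keys to 0, 1, 2, ...
    $rows = array_values(array_filter($rows));

    $combined = [];
    $inRange = false;
    for ($i = 0; $i < count($rows) - 1; $i++) {
        $name = $rows[$i][0] ?? null;
        if ($name === $first) {
            $inRange = true;
        }
        if (!$inRange) {
            continue; // still in the header rows before the first playlist
        }
        // array_merge() renumbers numeric keys, so the result is 0..3.
        $combined[] = array_merge($rows[$i], $rows[$i + 1]);
        $i++; // skip the stats row that was just merged
        if ($name === $last) {
            break; // everything after this is page footer junk
        }
    }
    return $combined;
}

$result = combinePlaylistRows($rows, 'Rumble Pit', 'BTB Skirmish');
print_r($result);
```

Each element of `$result` then has the shape name, level, games played, wins, so it can be looped over to build the database inserts. If the page ever produces a playlist row without a stats row, the pairing assumption breaks and the loop would need a check.
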

Very good point. I'm looking into improving the array in the first place. Do you happen to know any good web resources for reading and parsing remote files? I've done some Google searches this morning and have a couple of starts, but nothing amazing.

There is no universal way to parse remote files, or any file. It all depends on the file's flow, consistency, and structure, and some files are not parseable at all. To make a file parseable, people often publish it in a structured format such as RSS.

Yeah... an array of random stuff like that is really just as useless as the source page itself.

Scraping a remote web page is more of an art form than a rigid 1-2-3 procedure. You just need to get really good at regex and at finding a consistent pattern in the target source. And even then, you can spend a lot of time perfecting the regex for the scrape and boom! They change their layout the next day. Very frustrating.

My first advice is to contact the site and see if they can offer some kind of XML version of their data for you to grab easily. It doesn't hurt to ask.
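
If no XML feed is available, the "tweak the scrape script" route could look something like the sketch below. It swaps in PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) as an alternative to the regex approach described above; that is not what anyone in this thread used. The markup, the `id="stats"` attribute, and the function name `extractRows` are all hypothetical, since the real page source is not shown here. The idea is to walk only the outer table's rows and collect every cell that does not itself wrap a nested table, so the nested name/level cells and the outer games/wins cells land in one row.

```php
<?php
// Hypothetical markup: the first cell of each outer row wraps a nested table.
$html = <<<HTML
<table id="stats">
  <tr><td><table><tr><td>Rumble Pit</td><td>15</td></tr></table></td><td>76</td><td>7</td></tr>
  <tr><td><table><tr><td>Double Team</td><td>20</td></tr></table></td><td>188</td><td>80</td></tr>
</table>
HTML;

// Hypothetical helper: returns one flat array of cell texts per outer row.
function extractRows(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];

    // Rows of the outer table only, with or without a <tbody> wrapper.
    $outerRows = $xpath->query(
        '//table[@id="stats"]/tr | //table[@id="stats"]/tbody/tr'
    );
    foreach ($outerRows as $tr) {
        $cells = [];
        // Leaf cells: any <td> under this row that contains no nested table.
        foreach ($xpath->query('.//td[not(.//table)]', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

print_r(extractRows($html));
```

This still breaks when the site changes its layout, but the XPath expressions tend to be easier to adjust than a long regex.
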