coolbeansdude51 Posted April 14, 2009

Hello folks,

I have a problem/question that I am hoping someone could help me think through. I have a script which runs every hour to pull data from Twitter, specifically statuses. The pulled data looks like this:

    <users type="array">
      <user>
        <id>182791273</id>
        <name>username</name>
        <screen_name>username</screen_name>
        <location>USA</location>
        <description>sorry!</description>
        <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/182783123/twitter_normal.jpg</profile_image_url>
        <url>http://google.com</url>
        <protected>false</protected>
        <followers_count>19</followers_count>
        <profile_background_color>C6E2EE</profile_background_color>
        <profile_text_color>663B12</profile_text_color>
        <profile_link_color>1F98C7</profile_link_color>
        <profile_sidebar_fill_color>DAECF4</profile_sidebar_fill_color>
        <profile_sidebar_border_color>C6E2EE</profile_sidebar_border_color>
        <friends_count>32</friends_count>
        <created_at>Mon Sep 29 18:43:44 +0000 2008</created_at>
        <favourites_count>1</favourites_count>
        <utc_offset>-21600</utc_offset>
        <time_zone>Central Time (US & Canada)</time_zone>
        <profile_background_image_url>http://static.twitter.com/images/themes/theme2/bg.gif</profile_background_image_url>
        <profile_background_tile>false</profile_background_tile>
        <statuses_count>26</statuses_count>
        <notifications>false</notifications>
        <following>false</following>
        <status>
          <created_at>Tue Apr 14 17:32:33 +0000 2009</created_at>
          <id>7128312612312</id>
          <text>In a meeting</text>
          <source><a href="http://help.twitter.com/index.php?pg=kb.page&id=75">txt</a></source>
          <truncated>false</truncated>
          <in_reply_to_status_id></in_reply_to_status_id>
          <in_reply_to_user_id></in_reply_to_user_id>
          <favorited>false</favorited>
          <in_reply_to_screen_name></in_reply_to_screen_name>
        </status>
      </user>
    </users>

(Note: I have changed the data so it is anonymous.)

From the status section I only want to pull the data which has changed. The great part is that the ID in the status section is unique, so I can store it and then re-compare each status the next time the script runs, so that I only get data which has changed. Meaning each time the script runs, it first checks a database full of these unique IDs, compares them against the IDs pulled for that hour, and then only keeps the IDs which are not duplicates.

Here's the problem: how can I save all of those IDs? Can I just stick them in a database, pull the data again, and then re-compare? Can I stick all of the IDs in an array and then place the array in a database? FYI, some of the accounts will have upward of 5000+ users, so that means 5000+ IDs every hour ... that could get big real quick. Processing that amount of data ... what's the lag time gonna be? Should I delete the data every week or so?

What do you all think?
Humpty Posted April 14, 2009

Why don't you just store the user's ID (I'm guessing the value below is unique per user)

    <user>
      <id>182791273</id>

...and also store the status ID you last got. If the status ID is different on any occasion, then the status has changed.

...then do whatever it is you are doing

...and then also update the stored status ID.

5000 records shouldn't hurt anything, I wouldn't think.
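For what it's worth, a minimal sketch of that idea, assuming PDO and MySQL; the table and column names here (followers, twitter_user_id, last_status_id) are made up for illustration, and the upsert assumes a unique key on twitter_user_id:

    <?php
    // Sketch only: compare the status id we last stored for a user with the
    // one just pulled from Twitter, and remember the new one on change.
    $pdo = new PDO('mysql:host=localhost;dbname=twitter_stats', 'dbuser', 'dbpass');

    function status_has_changed(PDO $pdo, $userId, $newStatusId)
    {
        // Look up the last status id we saved for this Twitter user
        $stmt = $pdo->prepare('SELECT last_status_id FROM followers WHERE twitter_user_id = ?');
        $stmt->execute(array($userId));
        $lastStatusId = $stmt->fetchColumn();

        if ($lastStatusId !== false && $lastStatusId == $newStatusId) {
            return false;   // unchanged since the last pull, skip it
        }

        // New follower or new status: store the latest status id
        $upsert = $pdo->prepare(
            'INSERT INTO followers (twitter_user_id, last_status_id) VALUES (?, ?)
             ON DUPLICATE KEY UPDATE last_status_id = VALUES(last_status_id)'
        );
        $upsert->execute(array($userId, $newStatusId));
        return true;        // caller should process this status
    }

The hourly run would call something like status_has_changed() for every follower it pulls and only analyze the ones that return true.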
coolbeansdude51 (Author) Posted April 14, 2009

Humpty,

Thanks for the response. Yes, the IDs for the users are unique. The script can and will pull the user IDs no matter what; I need it to. Each user's status changes, and each status has a unique ID. I need to make sure that when I call this script every hour I don't get overlap from the last pull. I store the created_at data and the status itself. If a specific user doesn't change their status, I will keep pulling that same data, and that is what I want to avoid.

Does that make sense? Anyone have any ideas!?
Humpty Posted April 15, 2009

I still believe I am understanding the point (perhaps not), and my previous suggestion still makes sense to me.

If you really want to store it as an array you probably can, but an array with 5000 entries in it makes less sense to me than using a database. A BLOB field (MySQL) may hold an array as it is:

    UPDATE tablename SET BlobFieldName = $ArrayName

I've never attempted it as such; when I need multiple things in one field I like to use "~value1~value2~Value3~" and so on. The only issue I see with storing the array in a table is that to parse the array you'd have to pull it back out of the table first.

If you think I am not understanding, then by all means ask me not to respond again, or explain it again with different wording. (I am happy either way, just trying to help; sometimes keeping out is the best help I can provide.)
nankoweap Posted April 15, 2009

if the twitter interface won't allow you to pull only user data that has changed in some timespan, then you're forced to get it all, parse it, determine whether it has changed, and do something with it. i think humpty's suggestion is a step in the right direction. i'd consider two tables - user and status - and borrow the IDs from the twitter interface. i'm not sure i'd store the last status ID in the user table; the cost of querying the max status ID won't be much as long as it's uniquely indexed, and it sounds like it should be. with a proper schema and prepared statements it shouldn't be an issue, even if you have to brute-force your way through all the data to determine what's new and what's old.

have you considered storing the xml as-is in a text field in mysql? you never know if you'll want to go back and extract more information. i do something similar in one of my webapps that communicates with yahoo fire eagle: i extract and store the elements i want, but keep the original xml just in case i need it in the future.

jason
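As a rough illustration of that two-table idea, with the raw XML kept alongside the extracted fields - every name and type below is made up for the sketch, not anything the thread has settled on:

    <?php
    // Illustrative schema only: two tables keyed by Twitter's own ids,
    // with a TEXT column holding the raw XML for later re-parsing.
    $pdo = new PDO('mysql:host=localhost;dbname=twitter_stats', 'dbuser', 'dbpass');

    $pdo->exec('CREATE TABLE IF NOT EXISTS twitter_user (
        id          BIGINT UNSIGNED NOT NULL PRIMARY KEY,   -- Twitter user id
        screen_name VARCHAR(64)     NOT NULL,
        raw_xml     TEXT            NULL
    )');

    $pdo->exec('CREATE TABLE IF NOT EXISTS twitter_status (
        id          BIGINT UNSIGNED NOT NULL PRIMARY KEY,   -- Twitter status id
        user_id     BIGINT UNSIGNED NOT NULL,
        created_at  DATETIME        NOT NULL,
        status_text VARCHAR(140)    NOT NULL,
        raw_xml     TEXT            NULL,
        KEY idx_user_created (user_id, created_at)
    )');

    // nankoweap's point: with the primary key in place, finding a user's
    // latest stored status is a cheap indexed query, e.g.
    // SELECT MAX(id) FROM twitter_status WHERE user_id = ?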
coolbeansdude51 (Author) Posted April 15, 2009

@Humpty -- Thanks for your knowledge! I didn't read your reply as well as I should have. My bad. You have been a tremendous help! Thanks so much!

@Jason -- thanks for your comments as well. I have thought about keeping the entire XML data ... just not in the database. I was thinking of using S3 to store it all; I am trying to keep my database as small and lean as possible. How would you suggest putting all of the status IDs in the database? One field that stores an array of them, something to that effect? That just seems like a waste ... having one field where everything is stored ...

I just had an idea!!! OK, so follow me here. All I need to know is whether a status is duplicated, so all I really need to compare is the IDs. If a stored ID doesn't show up again in the next pull, the user has tweeted something new and the old ID can be deleted. Get it? If an ID is found a second time, it needs to stay stored so it can be checked a third time. And once an ID stops showing up, that's perfect and it can be thrown away. What do you think? Does that make sense?
Humpty Posted April 15, 2009 Share Posted April 15, 2009 Isn't that what i said? Quote Link to comment Share on other sites More sharing options...
coolbeansdude51 (Author) Posted April 15, 2009

@Humpty -- ahhh Yes? HAHAHA! Sorry! I am real tired. Thanks for your help!
gizmola Posted April 15, 2009

Honestly, people can't advise you about whether or not your database structure is correct if we don't know what questions the database will need to answer. With what little info you've provided, nankoweap has the right structure as far as I can tell, since it seems that what you are interested in is only the status data, which I assume from what I know about Twitter is simply the last tweet a person has made.

So, a couple of things. First, you know that PHP has various XML parsing routines, and this looks to me like a job for SimpleXML, which should make reading the data you need trivial. Secondly, this seems like it's probably a read-intensive application. You can go a long way performance-wise in preserving your MySQL performance by using memcache to cache your entries. This way, when you parse, you really don't touch your MySQL db unless an updated status requires it. You can read up on PHP's memcache on the php.net page for the extension.
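To make the SimpleXML part concrete, here is a minimal sketch, assuming the followers XML from the top of the thread is in $xmlString; the printf is just a stand-in for the real change check:

    <?php
    // Sketch only: walk the <users> response and pull out the fields
    // needed for change detection.
    $users = simplexml_load_string($xmlString);

    foreach ($users->user as $user) {
        // Skip followers who have never tweeted (no <status> element)
        if (!isset($user->status)) {
            continue;
        }

        $userId    = (string) $user->id;
        $statusId  = (string) $user->status->id;
        $createdAt = strtotime((string) $user->status->created_at);

        // Compare $statusId against whatever was stored last hour
        // (database or memcache) and only keep the ones that changed.
        printf("user %s last tweeted status %s at %s\n",
               $userId, $statusId, date('Y-m-d H:i:s', $createdAt));
    }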
coolbeansdude51 (Author) Posted April 15, 2009

@gizmola -- Thanks for your input. I misspoke when I said XML; I can get either XML or JSON, and the majority of the data I gather is JSON. (It's faster than XML with Twitter ... go figure.)

Basically, what we do is gather data about the people who are "following" a certain account on Twitter. The problem lies in gathering the appropriate data at the right time. We run the script every hour so that we get enough data to make our findings statistically relevant. The problem comes into play when a person doesn't update their status between pulls. If they don't update their status for a full day, that is 12 pulls in which the data says this one person tweeted at the same time, which skews our results. So we have to eliminate the people who have not changed their status and keep only the ones who have. All statuses have unique IDs. We pull the unique IDs, place them in a DB, and compare them the next time. If they changed and there are no duplicates, perfect; if there are duplicates, we throw away that section of the data but analyze the rest. By "analyze" I mean we only keep totals. Here is an example of what the data looks like in the database:

    timestamp            id  monday  tuesday  wednesday  thursday  friday  saturday  sunday  UTCtime   bestday  num_records
    2009-04-14 17:00:22  22  0       22       0          2         1       2         0       18:07:06  Tuesday  27

The data is then summed up and averaged once more to display the totals for the week. This will be done via a cron job and emailed out to the user on a day they specify.

Does that help more??
gizmola Posted April 15, 2009

Sure, go with JSON if that works better for you -- it's certainly better from a bandwidth point of view in my experience. When you json_decode() you'll get a big array, so it's just as efficient for getting at the couple of keys you need, so long as the structure doesn't change. Depending on the code, the XML version might be a bit more resilient in case Twitter changes the structure significantly.

Can you simply do a DESCRIBE on the table and paste that in here, so we can clearly see your structure with datatypes? Also, it seems you have this summary table, but what of the detail table(s) that store the data for each person and their followers?

If I understand you correctly, what you want is something like this: an Account table, where each Account has 1-to-many FollowerUpdate rows. This is your transaction table. Based on your sampling period, you only want to insert a new FollowerUpdate row if the status ID value has changed. Obviously the query to determine this, with an index, is trivial and will come back in milliseconds; however, you need to do this query for every follower. This is why I suggested the use of memcache: you can add a simple caching layer where you attempt to read from memcache first. The memcache key simply needs to be something like the follower ID plus the status ID (you can store the date in there as well if you want). If the status hasn't changed, you will not need to query the MySQL db, because you'll get the memcache cache hit. If it has changed, the key won't be found, so you know you need to insert a new FollowerUpdate row. The FollowerUpdate table needs a timestamp column so you can do your summarization process later.

Your summarization is then a simple matter of applying a date range to a GROUP BY query for whatever time period you need to summarize, and you can then write those totals into your summary table. I'm not a big fan of denormalizing the way you apparently have, but I'll leave that issue be, as it's less important than the structure of your transaction data.
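A rough sketch of that caching layer follows; the key scheme, table name, and column names are invented for illustration, assuming the pecl memcache extension and PDO:

    <?php
    // Sketch only: check memcache first, and only hit MySQL when a
    // follower's status id has actually changed since the last pull.
    $memcache = new Memcache();
    $memcache->connect('localhost', 11211);

    $pdo = new PDO('mysql:host=localhost;dbname=twitter_stats', 'dbuser', 'dbpass');

    function record_update($pdo, $memcache, $accountId, $followerId, $statusId, $createdAt)
    {
        $key = $followerId . ':' . $statusId;

        if ($memcache->get($key) !== false) {
            return false;   // cache hit: same status as last pull, skip MySQL entirely
        }

        // Cache miss: this follower has a new status, so insert a transaction row
        $stmt = $pdo->prepare(
            'INSERT INTO follower_update (account_id, follower_id, status_id, created_at)
             VALUES (?, ?, ?, ?)'
        );
        $stmt->execute(array($accountId, $followerId, $statusId, $createdAt));

        // Remember the pair for a couple of hours so the next run gets a hit
        $memcache->set($key, 1, 0, 7200);
        return true;
    }

The weekly summary is then a GROUP BY over follower_update's timestamp column, along the lines of SELECT DAYNAME(created_at), COUNT(*) FROM follower_update WHERE account_id = ? AND created_at BETWEEN ? AND ? GROUP BY DAYNAME(created_at), with the totals written into the existing summary table.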