Pagination limited to 10,000 tweets per query, and how to retrieve only original tweets

api

#1

Hi all,

I am new to Python and Multivac, and I am trying to download tweets from last year's elections.

I am having trouble downloading more than 10,000 tweets for a specific query, and understanding the pagination system.

Here is the request:

import requests

# api_key holds my Multivac API key, defined earlier
total = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=1&from=1&api_key=' + api_key).json()['results']['total']
from_arg = 1
print('number of tweets', total)
while from_arg < total / 100:
    print('Doing tweet {}'.format(from_arg))
    # fetch page number from_arg (100 tweets per page)
    results = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=100&from=' + str(int(from_arg)) + '&api_key=' + api_key).json()['results']['hits']
    write_tweets(results, 'tweetMacronLepen.json')
    from_arg += 1

Is it possible to download more than 10k tweets in one go? After downloading 100 pages, I usually receive this error:

Doing tweet 99
Doing tweet 100
Doing tweet 101
Traceback (most recent call last):
  File "tweet_v5page.py", line 17, in <module>
    results = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=@MLP_Officiel&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=100&from=' + str(int(from_arg)) + '&api_key=' + api_key).json()['results']['hits']
KeyError: 'results'

What do I need to change in my code to download the next 100 pages? When I change "from=" to 1, 2, 3, it sometimes works and the next 100 pages of tweets are downloaded; other times it doesn't and the exact same tweets are downloaded.

Thank you for your help!

Edgar


(Maziyar Panahi) #2

Hi @edgartilly,

This is actually not your fault! I forgot to increase the maximum page number, which is why you can't go past page 100. I will raise the limit to 500 pages so you can download up to 50,000 tweets (500 pages × 100 tweets per page).

I will fix this tomorrow.

Also, let me know the total number of tweets for your query between 2017-05-03 and 2017-05-04. If it's not too many, I may be able to at least double the daily limit on your queries :slight_smile:


#3

hi @mpanahi,

Thank you for your quick reply! OK, I understand better now.

For the queries I am looking at (@EmmanuelMacron, @MLP_officiel, 2017LeDebat, Macron, LePen, Marine, présidentielle2017) there are approximately 1.2M tweets, but I guess there are a lot of duplicates across these searches.

Thank you for your help!


(Maziyar Panahi) #4

I have increased the "from" limit to 500.

If you don't care about retweets (useful if you want to build a network of users around a tweet) and only care about the original tweets, you can use this in your queries to filter out the retweeted ones:

q=!_exists_:retweeted_status AND @MLP_officiel

The !_exists_:retweeted_status clause makes sure the tweet is original, and you can combine it with AND/OR with any other queries you like.
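For example, from Python something like this should work (the date range and output fields here are only an illustration; requests takes care of URL-encoding the special characters in the query):

import requests

# Illustrative only: original (non-retweet) tweets mentioning @MLP_officiel
params = {
    'q': '!_exists_:retweeted_status AND @MLP_officiel',
    'output': 'id_str,ca,tx,usr.tmz',
    'since': '2017-05-03',
    'until': '2017-05-04',
    'count': 100,
    'from': 1,
    'api_key': api_key,  # your Multivac API key
}
hits = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search', params=params).json()['results']['hits']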

Maziyar


#5

Thank you for the limit increase and for the tip. I think I will keep the RTs for the moment.

I have a question concerning the pagination. When I download tweets with q=2017LeDebat, for instance, I get the first 100 pages, but once that's done I can't manage to download pages 101 to 200. Do you know what I need to change in my code?

Edgar


#6

I tried to download new tweets, but I received the same error as previously:

Doing tweet 99
Doing tweet 100
Doing tweet 101
Traceback (most recent call last):
  File "tweet_v5page.py", line 17, in <module>
    results = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=100&from=' + str(int(from_arg)) + '&api_key=' + api_key).json()['results']['hits']
KeyError: 'results'


(Maziyar Panahi) #7

There is normally an HTTP error coming back if the problem is related to the API request. If I can see the error on the Multivac server side I can help you more. As for the pagination, if you simply put this in your browser you can see the page limit is gone:

https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&count=10&from=200&api_key=YOUR-API-KEY


#8

I see that it's working from your link; the tweets are not the same if I change the parameters.

But when I download them and change the "from" parameter from 100 to 200 with count=1, it's always the same bunch of tweets that appear.

Do you see anything in the code that could be wrong?

import requests

total = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=1&from=1&api_key=' + api_key).json()['results']['total']
from_arg = 1
print('number of tweets', total)
while from_arg < total / 100:
    print('Doing tweet {}'.format(from_arg))
    results = requests.get('https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&output=id_str,ca,tx,usr.tmz&since=2017-05-03&until=2017-05-04&count=100&from=' + str(int(from_arg)) + '&api_key=' + api_key).json()['results']['hits']
    write_tweets(results, 'tweetMacrontest.json')
    from_arg += 1


(Maziyar Panahi) #9

Actually, they are not the same tweets. If you look at the id_str, usr.snm and ca fields, you can see they are different even though they all contain only the word #2017LeDebat.

This is because there are lots of tweets containing only #2017LeDebat, and since your query is just 2017LeDebat, these records end up next to each other based on their TF-IDF scores.

I suggest you break the query down by date:

First 50k:
since: 2017-05-03T00:00:00
until: 2017-05-03T19:20:00
This is 49K

Then make another query with a date range that gives you around 49K-50K. This way you can slide the window over time and collect all tweets/retweets.
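For example, the since/until parameters take full timestamps, so the first window above should look something like this (same URL format as before, with your own query and output fields):

https://api.iscpif.fr/v2/pvt/politic/france/twitter/search?q=2017LeDebat&count=100&from=1&since=2017-05-03T00:00:00&until=2017-05-03T19:20:00&api_key=YOUR-API-KEY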


#10

Thanks for the tip about using the time separation; I didn't know it was possible.

Actually, I am sure it's the same tweet: when I search for an id_str, I find it several times under different "from" values. For instance, the id 859872543162982401 keeps appearing for q=2017ledebat even when I change "from".

I still can't download more than 100 pages; the same error as before appears after 10k tweets are downloaded. Do you have any idea why it works for you and not for me?

Thanks for your help… I am struggling :frowning:


(Maziyar Panahi) #11

Hi Edgar,

Problem 1 (no more than 10K): This is due to a limitation in paging. You have to construct your queries so that each one returns no more than 10K results, since paging beyond 10K is not allowed. This is easy to address by changing your time frame (since, until) so that the total count stays under 10K.
This was my fault for thinking that raising the from limit to 500 would fix the issue. The page limit is effectively 100 because there can only be 10K results per query (100 pages × 100 tweets).

Problem 2 (duplicates): I made some changes, but sometimes, because of the way our large-scale data is load-balanced, the documents come from different servers. They might be scored differently, which is why from time to time you see the same document with a different score. You can overcome this issue by inserting/upserting your data based on id_str, which is always unique.

I did increase your limit to 200K per day, so you can proceed as follows:

first query: 9437 tweets
since: 2017-05-03T00:00:00
until: 2017-05-03T18:00:00

second query: 8375 tweets
since: 2017-05-03T18:00:00
until: 2017-05-03T19:00:00

Anything after 19:00 should be queried in 5-minute windows, for example:

since: 2017-05-03T20:00:00
until: 2017-05-03T20:05:00

That way each query/timeframe stays under 10K, which fits within the 100 pages of data.
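Putting it together, a rough sketch of the download loop (the windows below are only placeholders: choose boundaries so each one stays under 10K, and write_tweets is your own helper from your earlier script):

import requests

BASE = 'https://api.iscpif.fr/v2/pvt/politic/france/twitter/search'

# Placeholder time windows; pick boundaries so each window has fewer than 10K tweets
windows = [
    ('2017-05-03T00:00:00', '2017-05-03T18:00:00'),
    ('2017-05-03T18:00:00', '2017-05-03T19:00:00'),
]

seen_ids = set()  # dedupe on id_str, which is always unique
for since, until in windows:
    params = {
        'q': '2017LeDebat',
        'output': 'id_str,ca,tx,usr.tmz',
        'since': since,
        'until': until,
        'count': 1,
        'from': 1,
        'api_key': api_key,  # your Multivac API key
    }
    # get the total for this window, then page through it (at most 100 pages of 100 tweets)
    total = requests.get(BASE, params=params).json()['results']['total']
    pages = min(100, (total + 99) // 100)
    for page in range(1, pages + 1):
        params['count'] = 100
        params['from'] = page
        hits = requests.get(BASE, params=params).json()['results']['hits']
        new_hits = [h for h in hits if h['id_str'] not in seen_ids]
        seen_ids.update(h['id_str'] for h in new_hits)
        write_tweets(new_hits, 'tweets.json')  # your own helper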


#12

Hi Maziyar,

It's working! I no longer get the same tweets now that I split the queries by time slot.
Thank you for the tips and for extending the limit to 200k tweets per day!


#13

Hi Maziyar,

Would it be possible to extend the 200k limit for about a week? It's limited to 50k today.


(Maziyar Panahi) #14

Hi Edgar,

Happy to hear the problem is gone.

About the limit: it should allow you to download 200k tweets every day. Is there any issue with this?


#15

Yes, I think the 200k limit was just temporary; I can only download 50k tweets again. Would it be possible to extend it?


(Maziyar Panahi) #16

There must be something with your query. I just tested your API_KEY and it can go up to 200K every day, with no time limit.
Usually the allowed number of downloads is permanent, not temporary. Unless you are receiving a response that says "you have reached your limit", you can still send requests.


(Maziyar Panahi) #17

@edgartilly, has the problem been solved? Can I close this topic?

