Easily Scrape Twitter Following Data with Python and Selenium
Sample of scraped data
{
"userId": "95092020",
"isBlueVerified": true,
"following": false,
"canDm": false,
"canMediaTag": false,
"createdAt": "Sun Dec 06 23:33:02 +0000 2009",
"defaultProfile": false,
"defaultProfileImage": false,
"description": "Best-Selling Author | Clinical Psychologist | #1 Education Podcast | Enroll to @petersonacademy now:",
"fastFollowersCount": 0,
"favouritesCount": 161,
"followersCount": 5613000,
"friendCount": 1686,
"hasCustomTimelines": true,
"isTranslator": false,
"listedCount": 14572,
"location": "",
"mediaCount": 7318,
"name": "Dr Jordan B Peterson",
"normalFollowersCount": 5613000,
"pinnedTweetIdsStr": [
"1849105729438790067"
],
"possiblySensitive": false,
"profileImageUrlHttps": "https://pbs.twimg.com/profile_images/1407056014776614923/TKBC60e1_normal.jpg",
"profileInterstitialType": "",
"username": "jordanbpeterson",
"statusesCount": 51343,
"translatorType": "none",
"verified": false,
"wantRetweets": false,
"withheldInCountries": []
}
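Ready-to-run code
This guide provides complete, ready-to-run code for scraping Twitter following data. It uses Python and Selenium to automate the browser and reads the responses out of Chrome's performance log. Beyond installing the dependencies and ChromeDriver as described in the steps, no further setup is needed.
Step 1: Set up the environment
First, install Selenium, which handles browser automation, along with the other dependencies listed in requirements.txt: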
pip install -r requirements.txt
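Step 2: Download ChromeDriver
Download ChromeDriver so that Selenium can control the Chrome browser, choosing the version that matches your installed Chrome. Download link: ChromeDriver download
Step 3: Configure Chrome options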
# This fragment runs inside a method of the scraper class
self.options = webdriver.ChromeOptions()
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
self.options.add_argument(f'user-agent={user_agent}')
self.options.add_argument('--disable-gpu')
self.options.add_argument('--no-sandbox')
self.options.add_argument('--disable-dev-shm-usage')
# remote_debugging_port must be defined before this line
self.options.add_argument(f"--remote-debugging-port={remote_debugging_port}")
# modify_random_canvas_js() and get_browser() are helpers defined elsewhere in the full code
js_script_name = modify_random_canvas_js()
self.browser = self.get_browser(script_files=[js_script_name], record_network_log=True, headless=True)
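The record_network_log=True flag belongs to the project's own get_browser helper, whose body is not shown in this article. With plain Selenium 4, one plausible way to get a similar effect is to set the goog:loggingPrefs capability so that ChromeDriver records DevTools events in the performance log. The function names below are hypothetical, and this is a minimal sketch rather than the project's actual implementation:

```python
def enable_performance_logging(options):
    """Ask ChromeDriver to record DevTools events in the 'performance' log.

    `options` is expected to be a selenium.webdriver.ChromeOptions instance.
    """
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    return options


def create_driver(headless=True):
    """Hypothetical stand-in for the article's get_browser() helper."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = enable_performance_logging(webdriver.ChromeOptions())
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    return webdriver.Chrome(options=options)
```

Without this capability, a later call to get_log("performance") would typically have nothing useful to return, so it is worth setting before the browser starts.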
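Step 4: Open the target page
self.browser.switch_to.new_window('tab')
url = 'https://x.com/1_usd_promotion/following'
self.browser.get(url=url)
time.sleep(2)  # give the page time to issue its network requests
exist_entry_id = []
result_list = []  # filled in by get_network()
self.get_network(exist_entry_id, result_list)
print(f'Number of results = {len(result_list)}')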
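The URL in this step is hard-coded for one account. To point the scraper at other accounts, a small helper can build the same URL from a handle. The name following_url is hypothetical, and the validation rule (1 to 15 letters, digits, or underscores) is an assumption about handle syntax rather than something stated in this guide:

```python
import re

# Assumption: handles are 1-15 characters of letters, digits, or underscores.
_HANDLE_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def following_url(username):
    """Return the 'following' page URL for a handle, with or without a leading '@'."""
    handle = username.lstrip("@")
    if not _HANDLE_RE.fullmatch(handle):
        raise ValueError(f"not a valid handle: {username!r}")
    return f"https://x.com/{handle}/following"
```

Rejecting malformed handles early avoids loading a page that would never produce the expected network request.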
Step 5: Read the browser performance log
performance_log = self.browser.get_log("performance")
for packet in performance_log:
    msg = packet.get("message")
    message = json.loads(msg).get("message")
    packet_method = message.get("method")
    # Keep only Network events whose raw text mentions the Following request
    if "Network" in packet_method and 'Following' in msg:
        request_id = message.get("params").get("requestId")
        resp = self.browser.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
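The filter in this step matches any Network event whose text contains "Following", so the same request can match several times (once when it is sent, again when the response arrives). A narrower variation, sketched here under the assumption that only Network.responseReceived events are of interest, collects each requestId once. It works on the plain list of dicts that get_log("performance") returns, so it can be tried without a browser; find_request_ids is a hypothetical name:

```python
import json


def find_request_ids(performance_log, keyword="Following"):
    """Return unique requestIds of responseReceived events whose URL contains keyword."""
    request_ids = []
    for packet in performance_log:
        # Each log entry wraps the DevTools event as a JSON string.
        message = json.loads(packet["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params", {})
        url = params.get("response", {}).get("url", "")
        request_id = params.get("requestId")
        if request_id and keyword in url and request_id not in request_ids:
            request_ids.append(request_id)
    return request_ids
```

Each returned id can then be passed to execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id}) as in the step above.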
Step 6: Extract data from the response
# Continues inside the loop from step 5
body = resp.get("body")
body = json.loads(body)
instructions = body['data']['user']['result']['timeline']['timeline'].get('instructions', None)
if not instructions:
    continue
for instruction in instructions:
    entries = instruction.get('entries', None)
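The chained lookups in this step raise KeyError if the response has a different shape, for example an error payload. The sketch below walks the same path defensively and gathers the entries from all instructions into one list. The second helper mirrors the role that exist_entry_id appears to play, and it assumes each entry carries an entryId key, which this guide does not show. Both function names are hypothetical:

```python
def collect_entries(body):
    """Gather every entry from every instruction in a parsed response body."""
    try:
        timeline = body["data"]["user"]["result"]["timeline"]["timeline"]
    except (KeyError, TypeError):
        # The response does not have the expected shape.
        return []
    entries = []
    for instruction in timeline.get("instructions") or []:
        # Some instructions carry no 'entries' key at all.
        entries.extend(instruction.get("entries") or [])
    return entries


def new_entries(entries, seen_ids):
    """Return entries not seen before, recording their ids in seen_ids.

    Assumes each entry has an 'entryId' key; entries without one are skipped.
    """
    fresh = []
    for entry in entries:
        entry_id = entry.get("entryId")
        if entry_id is None or entry_id in seen_ids:
            continue
        seen_ids.append(entry_id)
        fresh.append(entry)
    return fresh
```

Returning an empty list instead of raising lets the surrounding loop simply move on to the next response.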
Step 7: Sample response data
{
"userId": "95092020",
"isBlueVerified": true,
"following": false,
"canDm": false,
"canMediaTag": false,
"createdAt": "Sun Dec 06 23:33:02 +0000 2009",
"defaultProfile": false,
"defaultProfileImage": false,
"description": "Best-Selling Author | Clinical Psychologist | #1 Education Podcast | Enroll to @petersonacademy now:",
"fastFollowersCount": 0,
"favouritesCount": 161,
"followersCount": 5613000,
"friendCount": 1686,
"hasCustomTimelines": true,
"isTranslator": false,
"listedCount": 14572,
"location": "",
"mediaCount": 7318,
"name": "Dr Jordan B Peterson",
"normalFollowersCount": 5613000,
"pinnedTweetIdsStr": [
"1849105729438790067"
],
"possiblySensitive": false,
"profileImageUrlHttps": "https://pbs.twimg.com/profile_images/1407056014776614923/TKBC60e1_normal.jpg",
"profileInterstitialType": "",
"username": "jordanbpeterson",
"statusesCount": 51343,
"translatorType": "none",
"verified": false,
"wantRetweets": false,
"withheldInCountries": []
}
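Once records like the one above are collected, the createdAt string usually needs converting before it can be sorted or compared. The sketch below parses that timestamp format with the standard library and reduces a record to a few fields. The function names and the choice of fields are illustrative only, and friendCount is read here as the number of accounts the user follows:

```python
from datetime import datetime

# Matches values such as "Sun Dec 06 23:33:02 +0000 2009"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a createdAt string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


def summarize(record):
    """Reduce one scraped record to a handful of commonly used fields."""
    created = parse_created_at(record["createdAt"])
    return {
        "username": record["username"],
        "followers": record["followersCount"],
        "following": record["friendCount"],
        "created": created.date().isoformat(),
    }
```

Note that %a and %b depend on the process locale, so this format string assumes English day and month names.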
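Step 8: Important notes
- Log in to Twitter and obtain your Twitter cookie (see: how to get the Twitter cookie).
- Alternatively, use the API available from Apify.
- Get the full code from GitHub.
- Join our discussion group.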
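For the first note, a logged-in session can be reproduced in Selenium by adding the cookie to the browser before opening the target page. The sketch below assumes that the relevant cookie is named auth_token, that it belongs to the .x.com domain, and that this one cookie is enough for the page to load; none of these is confirmed by this guide, and both function names are hypothetical:

```python
def build_auth_cookie(auth_token, domain=".x.com"):
    """Build a cookie dict in the shape Selenium's add_cookie() accepts."""
    if not auth_token:
        raise ValueError("auth_token must be a non-empty string")
    return {
        "name": "auth_token",
        "value": auth_token,
        "domain": domain,  # assumption: the cookie's domain
        "path": "/",
        "secure": True,
    }


def log_in_with_cookie(driver, auth_token):
    """Attach the cookie to an existing Selenium driver, then reload the page."""
    # A cookie can only be added for the domain that is currently loaded.
    driver.get("https://x.com")
    driver.add_cookie(build_auth_cookie(auth_token))
    driver.refresh()
```

Keep the token out of source control, since anyone holding it may be able to act as the logged-in account.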
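FAQ
- Q: What is web scraping?
Web scraping means using a tool to collect information from websites automatically. Think of a robot that gathers data from pages so you do not have to do it by hand. This guide focuses on scraping Twitter data with Python and Selenium.
- Q: How do I start scraping Twitter data?
First set up your computer. That means installing Selenium, which lets you control a web browser, and then downloading ChromeDriver, the tool that lets Selenium work with Google Chrome.
- Q: What is ChromeDriver, and why do I need it?
ChromeDriver acts as a translator between Selenium and Google Chrome. It tells Chrome how to carry out the commands Selenium sends. You need it so that Selenium can automate actions such as clicking buttons or typing text.
- Q: What is the performance log in scraping?
The performance log is a record of the events the browser generates while a page loads. It includes the network requests and responses exchanged with Twitter, which lets you see what the page requested and retrieve the response data.
- Q: What should I consider before scraping Twitter?
Before scraping Twitter, you need to log in to your Twitter account and obtain a value called auth_token, which shows that you are authorized to access Twitter's data. You should also follow Twitter's rules to avoid being banned.
- Q: How do I avoid being banned while scraping?
To avoid being banned, add delays between requests, rotate proxies, and avoid sending too many requests to Twitter's servers in a short period.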
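The advice about delays can be sketched as a small helper that sleeps for a base interval plus a random amount, so that requests do not arrive at a fixed rhythm. The default values are arbitrary placeholders, not recommended limits, and polite_delay is a hypothetical name:

```python
import random
import time


def polite_delay(base=2.0, jitter=1.5, sleep=time.sleep):
    """Sleep for base plus a random extra of up to jitter seconds.

    Returns the delay used. The sleep function is injectable for testing.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling polite_delay() before each page load in the scraping loop, in place of a fixed time.sleep(2), spreads requests out unevenly.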