Scrape Twitter Following Data Easily with Python and Selenium
Sample scraped data
{
"userId": "95092020",
"isBlueVerified": true,
"following": false,
"canDm": false,
"canMediaTag": false,
"createdAt": "Sun Dec 06 23:33:02 +0000 2009",
"defaultProfile": false,
"defaultProfileImage": false,
"description": "Best-Selling Author | Clinical Psychologist | #1 Education Podcast | Enroll to @petersonacademy now:",
"fastFollowersCount": 0,
"favouritesCount": 161,
"followersCount": 5613000,
"friendCount": 1686,
"hasCustomTimelines": true,
"isTranslator": false,
"listedCount": 14572,
"location": "",
"mediaCount": 7318,
"name": "Dr Jordan B Peterson",
"normalFollowersCount": 5613000,
"pinnedTweetIdsStr": [
"1849105729438790067"
],
"possiblySensitive": false,
"profileImageUrlHttps": "https://pbs.twimg.com/profile_images/1407056014776614923/TKBC60e1_normal.jpg",
"profileInterstitialType": "",
"username": "jordanbpeterson",
"statusesCount": 51343,
"translatorType": "none",
"verified": false,
"wantRetweets": false,
"withheldInCountries": []
}
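Ready-to-run code
This guide walks through complete, ready-to-run code for scraping the list of accounts a Twitter user follows. It uses Python and Selenium to automate the browser and reads the data out of Chrome's performance logs. Beyond installing the dependencies and ChromeDriver, no extra configuration is needed.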
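Step 1: Set up the environment
First, install Selenium (for browser automation) along with the project's other dependencies listed in requirements.txt: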
pip install -r requirements.txt
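Step 2: Download ChromeDriver
Download ChromeDriver so that Selenium can control the Chrome browser. Pick the version that matches your installed Chrome. Get it here: ChromeDriver download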
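Before launching the browser, it can help to confirm that a ChromeDriver binary is actually reachable. The sketch below uses only the standard library; `locate_driver` is a hypothetical helper name (not part of Selenium), and it assumes the binary is named `chromedriver` and sits somewhere on your PATH.

```python
import shutil
from typing import Optional


def locate_driver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver binary found on PATH, or None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("chromedriver not found on PATH; download it or pass its path explicitly")
else:
    print(f"Using ChromeDriver at {path}")
```

With Selenium 4, the path found this way can be passed as `webdriver.Chrome(service=Service(path), options=options)`, where `Service` is imported from `selenium.webdriver.chrome.service`. Recent Selenium releases (4.6 and later) also ship Selenium Manager, which can fetch a matching driver automatically when none is found.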
Step 3: Configure Chrome options
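# Excerpt from the scraper class. remote_debugging_port, modify_random_canvas_js()
# and self.get_browser() are defined in the full project code.
self.options = webdriver.ChromeOptions()
# Present a regular desktop Chrome user agent.
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
self.options.add_argument(f'user-agent={user_agent}')
# Flags commonly needed when Chrome runs headless or inside a container.
self.options.add_argument('--disable-gpu')
self.options.add_argument('--no-sandbox')
self.options.add_argument('--disable-dev-shm-usage')
self.options.add_argument(f"--remote-debugging-port={remote_debugging_port}")
# Project helper; judging by its name, it generates a script that randomizes canvas output.
js_script_name = modify_random_canvas_js()
# Start a headless browser with network logging turned on (needed in step 5).
self.browser = self.get_browser(script_files=[js_script_name], record_network_log=True, headless=True)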
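The excerpt above relies on the project's own `get_browser` helper, whose body is not shown here. As a rough, standalone sketch of what such a setup needs, the code below builds the same Chrome arguments as a plain list and names the logging capability that enables the performance log. `build_chrome_arguments` and `DEFAULT_DEBUG_PORT` are hypothetical names introduced for illustration, and the port value is an assumption.

```python
from typing import Dict, List

# Assumed default; the original code receives remote_debugging_port from elsewhere.
DEFAULT_DEBUG_PORT = 9222

# Chrome capability value that turns on the "performance" log read in step 5.
PERFORMANCE_LOGGING_PREFS: Dict[str, str] = {"performance": "ALL"}


def build_chrome_arguments(
    user_agent: str,
    remote_debugging_port: int = DEFAULT_DEBUG_PORT,
    headless: bool = True,
) -> List[str]:
    """Return the command-line arguments used in step 3 as a list of strings."""
    args = [
        f"user-agent={user_agent}",
        "--disable-gpu",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        f"--remote-debugging-port={remote_debugging_port}",
    ]
    if headless:
        # Chrome's newer headless mode (available since Chrome 109).
        args.append("--headless=new")
    return args


for arg in build_chrome_arguments("Mozilla/5.0 (example)"):
    print(arg)
```

To apply these with Selenium, call `options.add_argument(arg)` for each entry and `options.set_capability("goog:loggingPrefs", PERFORMANCE_LOGGING_PREFS)` on a `webdriver.ChromeOptions()` instance. Presumably `record_network_log=True` does something similar inside the project's helper, but that is an assumption.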
Step 4: Open the target page
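# Open a new tab and load the target account's "following" page.
self.browser.switch_to.new_window('tab')
url = 'https://x.com/1_usd_promotion/following'
self.browser.get(url=url)
# Give the page a moment to issue its API requests.
time.sleep(2)
exist_entry_id = []  # entry IDs that have already been processed
result_list = []     # parsed user records (defined here so the excerpt is self-contained)
self.get_network(exist_entry_id, result_list)
print(f'Number of results = {len(result_list)}')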
Step 5: Read the browser performance log
performance_log = self.browser.get_log("performance")
for packet in performance_log:
    # Each log entry wraps one DevTools event as a JSON string.
    msg = packet.get("message")
    message = json.loads(msg).get("message")
    packet_method = message.get("method")
    # Keep only Network events that mention the Following API call.
    if "Network" in packet_method and 'Following' in msg:
        request_id = message.get("params").get("requestId")
        # Fetch the response body for that request through the DevTools protocol.
        resp = self.browser.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
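The filtering logic in this step can be tried out without a browser. The sketch below is a stricter variant of the check above, not the original code: it keeps only `Network.responseReceived` events whose response URL contains the keyword, so each request ID is returned once. `find_request_ids` is a hypothetical helper, and the sample log entries and URLs are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def find_request_ids(
    performance_log: Iterable[Dict[str, Any]],
    url_keyword: str = "Following",
) -> List[str]:
    """Return request IDs of received responses whose URL contains url_keyword."""
    request_ids: List[str] = []
    for packet in performance_log:
        # Each entry's "message" field is a JSON string wrapping the DevTools event.
        message = json.loads(packet["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params", {})
        url = params.get("response", {}).get("url", "")
        request_id = params.get("requestId")
        if url_keyword in url and request_id and request_id not in request_ids:
            request_ids.append(request_id)
    return request_ids


def _entry(method: str, params: Dict[str, Any]) -> Dict[str, str]:
    """Build a fake log entry in the shape Selenium returns."""
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hand-written sample log; the URLs are invented for the example.
sample_log = [
    _entry("Network.responseReceived",
           {"requestId": "1.1", "response": {"url": "https://x.com/i/api/graphql/abc/Following"}}),
    _entry("Network.requestWillBeSent",
           {"requestId": "1.1", "request": {"url": "https://x.com/i/api/graphql/abc/Following"}}),
    _entry("Network.responseReceived",
           {"requestId": "1.2", "response": {"url": "https://x.com/home"}}),
]

print(find_request_ids(sample_log))  # → ['1.1']
```

Each returned ID can then be passed to `execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})` as in the excerpt above.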
Step 6: Extract data from the response
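# Continues inside the `if` block from step 5, where resp holds the response.
body = resp.get("body")
body = json.loads(body)
# The list of followed accounts lives under the timeline's instructions.
instructions = body['data']['user']['result']['timeline']['timeline'].get('instructions', None)
if not instructions:
    continue
for instruction in instructions:
    # Only some instructions carry entries; the rest yield None.
    entries = instruction.get('entries', None)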
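The walk over `instructions` and `entries` can be written as a small function that tolerates missing keys and skips entries seen in earlier responses, which is what the `exist_entry_id` list in step 4 appears to be for. This is a minimal sketch, not the project's `get_network` method: `collect_new_entries` is a hypothetical name, and the `entryId` key is an assumption about the response format.

```python
from typing import Any, Dict, List


def collect_new_entries(
    body: Dict[str, Any],
    seen_entry_ids: List[str],
) -> List[Dict[str, Any]]:
    """Return timeline entries not seen before and record their IDs."""
    try:
        timeline = body["data"]["user"]["result"]["timeline"]["timeline"]
    except (KeyError, TypeError):
        # The response does not have the expected shape; nothing to extract.
        return []
    new_entries: List[Dict[str, Any]] = []
    for instruction in timeline.get("instructions") or []:
        for entry in instruction.get("entries") or []:
            entry_id = entry.get("entryId")  # assumed key name
            if entry_id is None or entry_id in seen_entry_ids:
                continue
            seen_entry_ids.append(entry_id)
            new_entries.append(entry)
    return new_entries


# Made-up response body with the nesting used in step 6.
sample_body = {
    "data": {"user": {"result": {"timeline": {"timeline": {"instructions": [
        {"type": "example-instruction-without-entries"},
        {"entries": [{"entryId": "user-1"}, {"entryId": "user-2"}]},
    ]}}}}}
}

seen = ["user-1"]
print(collect_new_entries(sample_body, seen))  # → [{'entryId': 'user-2'}]
```

Passing the same `seen` list on every call means a second response that repeats earlier entries contributes only the new ones.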
Step 7: Sample response data
{
"userId": "95092020",
"isBlueVerified": true,
"following": false,
"canDm": false,
"canMediaTag": false,
"createdAt": "Sun Dec 06 23:33:02 +0000 2009",
"defaultProfile": false,
"defaultProfileImage": false,
"description": "Best-Selling Author | Clinical Psychologist | #1 Education Podcast | Enroll to @petersonacademy now:",
"fastFollowersCount": 0,
"favouritesCount": 161,
"followersCount": 5613000,
"friendCount": 1686,
"hasCustomTimelines": true,
"isTranslator": false,
"listedCount": 14572,
"location": "",
"mediaCount": 7318,
"name": "Dr Jordan B Peterson",
"normalFollowersCount": 5613000,
"pinnedTweetIdsStr": [
"1849105729438790067"
],
"possiblySensitive": false,
"profileImageUrlHttps": "https://pbs.twimg.com/profile_images/1407056014776614923/TKBC60e1_normal.jpg",
"profileInterstitialType": "",
"username": "jordanbpeterson",
"statusesCount": 51343,
"translatorType": "none",
"verified": false,
"wantRetweets": false,
"withheldInCountries": []
}
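Once a record like the one above is available, a little post-processing makes it easier to work with. The sketch below parses the `createdAt` string into a timezone-aware `datetime` and pulls out a few fields. `parse_created_at` and `summarize_user` are hypothetical helpers, and the format string assumes English day and month names (the default C locale).

```python
from datetime import datetime, timezone

# Matches timestamps such as "Sun Dec 06 23:33:02 +0000 2009".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Parse the createdAt field into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


def summarize_user(record: dict) -> dict:
    """Pick a few fields from a scraped record; friendCount is the number of accounts followed."""
    return {
        "username": record["username"],
        "followers": record["followersCount"],
        "following": record["friendCount"],
        "created": parse_created_at(record["createdAt"]).date().isoformat(),
    }


# Subset of the sample record shown above.
sample_record = {
    "username": "jordanbpeterson",
    "followersCount": 5613000,
    "friendCount": 1686,
    "createdAt": "Sun Dec 06 23:33:02 +0000 2009",
}

print(summarize_user(sample_record))
```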
Step 8: Important notes
- Log in to Twitter and obtain your Twitter cookie. See: How to get your Twitter cookie
- Use Apify's API
- Get the full code from GitHub
- Join our discussion group!
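For the cookie step, one common approach is to copy the cookie string from a logged-in browser session and convert it into the dictionaries that Selenium's `add_cookie` expects. The sketch below shows only the conversion; `cookie_header_to_selenium` is a hypothetical helper, the `.x.com` domain is an assumption, and the token value is a placeholder.

```python
from typing import Dict, List


def cookie_header_to_selenium(
    cookie_header: str,
    domain: str = ".x.com",  # assumed cookie domain
) -> List[Dict[str, str]]:
    """Convert a 'name=value; name2=value2' string into add_cookie() dictionaries."""
    cookies: List[Dict[str, str]] = []
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if not sep or not name:
            # Skip fragments that are not name=value pairs.
            continue
        cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    return cookies


# Placeholder values; substitute the cookie string from your own session.
for cookie in cookie_header_to_selenium("auth_token=YOUR_AUTH_TOKEN; lang=en"):
    print(cookie)
```

Selenium only accepts cookies for the domain of the page currently loaded, so the usual pattern is to open the site first, call `browser.add_cookie(cookie)` for each dictionary, and then reload the page.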
FAQ: Frequently asked questions
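- Q: What is web scraping?
Web scraping means using a program to collect information from websites automatically. Think of it as a robot that gathers data from pages so you don't have to do it by hand. This guide focuses on scraping Twitter data with Python and Selenium.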
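- Q: How do I start scraping Twitter data?
Start by setting up your machine. Install Selenium, the library that lets you control a web browser from code, then download ChromeDriver, the companion program for Google Chrome that Selenium needs in order to work.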
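- Q: What is ChromeDriver, and why do I need it?
ChromeDriver acts as a translator between Selenium and Google Chrome: it turns Selenium's commands into actions in the browser. You need it so that Selenium can automate operations such as clicking buttons or typing text.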
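- Q: What is the performance log in scraping?
The performance log is like a diary of everything the browser does during a scraping run. It records the network requests and responses exchanged between the Chrome instance that Selenium controls and Twitter, so you can see which requests the page made and locate the ones that carry the data you want.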
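- Q: What should I consider before scraping Twitter?
Before scraping Twitter, log in to your Twitter account and obtain the auth_token cookie, which identifies your logged-in session and lets the scraper load pages that require login. Also make sure you follow Twitter's rules to avoid being banned.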
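- Q: How can I avoid being banned while scraping?
To avoid being banned, add delays between requests, rotate proxies, and avoid sending too many requests to Twitter's servers in a short period.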
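The delay and proxy-rotation advice from the last answer can be sketched with the standard library alone. `jittered_delay` and `proxy_rotation` are hypothetical helpers, the proxy addresses are placeholders, and the tiny delay values are only there so the demo finishes quickly; real runs would typically wait several seconds per page.

```python
import itertools
import random
import time
from typing import Iterator, Sequence


def jittered_delay(base_seconds: float, jitter_seconds: float, rng: random.Random) -> float:
    """Return a delay of base_seconds plus a random extra of up to jitter_seconds."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def proxy_rotation(proxies: Sequence[str]) -> Iterator[str]:
    """Cycle through the given proxy URLs indefinitely, in order."""
    return itertools.cycle(proxies)


rng = random.Random(0)  # seeded only to make the demo repeatable
proxies = proxy_rotation(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])

for page in range(3):
    delay = jittered_delay(base_seconds=0.01, jitter_seconds=0.01, rng=rng)
    proxy = next(proxies)
    print(f"page {page}: via {proxy}, sleeping {delay:.3f}s")
    time.sleep(delay)
```

Randomizing the pause makes the request pattern less regular than a fixed `time.sleep(2)`. With Chrome, a proxy is usually applied when the browser starts, for example through the `--proxy-server=<url>` argument, so rotating proxies generally means starting a new browser session per proxy.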