Scraping Twitter Followers with Python and Selenium

Example of Scraped Data

{
  "userId": "1710236730010349568",
  "isBlueVerified": false,
  "following": false,
  "canDm": false,
  "canMediaTag": true,
  "createdAt": "Fri Oct 06 10:13:15 +0000 2023",
  "defaultProfile": true,
  "defaultProfileImage": true,
  "description": "",
  "fastFollowersCount": 0,
  "favouritesCount": 456,
  "followersCount": 64,
  "friendCount": 7320,
  "hasCustomTimelines": false,
  "isTranslator": false,
  "listedCount": 0,
  "location": "",
  "mediaCount": 0,
  "name": "Paislie Dimitrov",
  "normalFollowersCount": 64,
  "pinnedTweetIdsStr": [],
  "possiblySensitive": false,
  "profileImageUrlHttps": "https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
  "profileInterstitialType": "",
  "username": "PaisliDimit",
  "statusesCount": 0,
  "translatorType": "none",
  "verified": false,
  "wantRetweets": false,
  "withheldInCountries": []
}
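
A record like the one above is plain JSON, so it can be loaded with Python's standard library. The following is a minimal sketch, not part of the original code: the helper name parse_created_at is hypothetical, and the date format string is an assumption inferred from the createdAt value in the sample.

```python
import json
from datetime import datetime

# Trimmed copy of the sample record shown above.
SAMPLE = """{
  "userId": "1710236730010349568",
  "createdAt": "Fri Oct 06 10:13:15 +0000 2023",
  "followersCount": 64,
  "friendCount": 7320,
  "username": "PaisliDimit"
}"""


def parse_created_at(value: str) -> datetime:
    """Parse a createdAt string into a timezone-aware datetime.

    The format is an assumption based on the sample value.
    """
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


record = json.loads(SAMPLE)
created = parse_created_at(record["createdAt"])
print(record["username"], created.isoformat(), record["followersCount"])
```

Note that userId arrives as a string; keeping it that way avoids precision problems if the data is later handled by tools that store large integers as floats.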

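Run the Code Directly Without Setup
This guide provides complete, ready-to-run code for scraping Twitter follower data. It automates data collection with Python and Selenium and reads the results from the browser's performance log, so you can start gathering Twitter data without additional setup.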

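Step 1: Set Up the Environment

First, install Selenium, which automates browser actions, together with the other dependencies listed in requirements.txt: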

pip install -r requirements.txt
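
After the install finishes, it can be useful to confirm that Selenium is importable before going further. The sketch below is an illustration only; is_installed is a hypothetical helper name, and it relies solely on the standard library.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints False if the pip install step did not succeed in this environment.
    print("selenium installed:", is_installed("selenium"))
```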

Step 2: Download ChromeDriver

Download the ChromeDriver release that matches your installed Chrome version from the ChromeDriver download page.
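
"Matching" here means that the major version of ChromeDriver should equal the major version of Chrome. The sketch below compares two version strings; the function names and the sample strings are illustrative assumptions, not output captured from the original project.

```python
import re


def major_version(version_text: str) -> int:
    """Extract the major version number from text such as '110.0.5481.77'."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text: str, driver_text: str) -> bool:
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


# Illustrative strings only; substitute the versions reported on your machine.
print(versions_match("Google Chrome 110.0.5481.77", "ChromeDriver 110.0.5481.77"))
```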

Step 3: Configure Chrome Options

# Build the Chrome options for the scraping session.
self.options = webdriver.ChromeOptions()
# Present a regular desktop Chrome user agent.
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
self.options.add_argument(f'user-agent={user_agent}')
self.options.add_argument('--disable-gpu')
self.options.add_argument('--no-sandbox')
self.options.add_argument('--disable-dev-shm-usage')
# remote_debugging_port must be defined before this line runs.
self.options.add_argument(f"--remote-debugging-port={remote_debugging_port}")

# modify_random_canvas_js() and get_browser() are helpers from the full code (not shown here).
js_script_name = modify_random_canvas_js()
self.browser = self.get_browser(script_files=[js_script_name], record_network_log=True, headless=True)
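
The get_browser helper is not shown in this guide, so how it enables network logging is not confirmed here. As an assumption, one common way to make Chrome's performance log available to Selenium is the goog:loggingPrefs capability. The sketch below shows that approach; enable_performance_log and build_driver are hypothetical names.

```python
def enable_performance_log(options):
    """Ask Chrome to record DevTools events in the 'performance' log.

    `options` is expected to be a selenium ChromeOptions instance.
    """
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    return options


def build_driver():
    """Create a Chrome driver with performance logging turned on."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver

    options = enable_performance_log(webdriver.ChromeOptions())
    return webdriver.Chrome(options=options)
```

With logging enabled this way, browser.get_log("performance") returns the entries that Step 5 iterates over.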

Step 4: Open the Target Page

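# Open the followers page in a new tab.
self.browser.switch_to.new_window('tab')
url = 'https://x.com/1_usd_promotion/verified_followers'
self.browser.get(url=url)

# Give the page a moment to issue its network requests.
time.sleep(2)

# IDs of entries that have already been collected.
exist_entry_id = []

# result_list must already exist; get_network() appends the parsed results to it.
self.get_network(exist_entry_id, result_list)

print(f'tweet result length = {len(result_list)}')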
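
The exist_entry_id list suggests that entries are deduplicated across repeated reads of the network log. The body of get_network is not shown, so the following is only a sketch of that idea; collect_new_entries is a hypothetical helper, and the entryId key is an assumption about the shape of each entry.

```python
def collect_new_entries(entries, seen_ids, results):
    """Append entries whose ID has not been seen yet; return how many were added."""
    added = 0
    for entry in entries:
        entry_id = entry.get("entryId")  # assumed key name
        if entry_id is None or entry_id in seen_ids:
            continue
        seen_ids.append(entry_id)
        results.append(entry)
        added += 1
    return added


seen, results = [], []
first = collect_new_entries([{"entryId": "user-1"}, {"entryId": "user-2"}], seen, results)
second = collect_new_entries([{"entryId": "user-2"}, {"entryId": "user-3"}], seen, results)
print(first, second, len(results))
```

For large follower lists, a set would make the membership check faster; a list is used here only to mirror the variable in the step above.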

Step 5: Retrieve the Browser Performance Log

# Each entry wraps a Chrome DevTools event as a JSON string.
performance_log = self.browser.get_log("performance")
for packet in performance_log:

    msg = packet.get("message")
    message = json.loads(msg).get("message")
    packet_method = message.get("method")

    # Keep only Network events whose raw text mentions 'Following'.
    if "Network" in packet_method and 'Following' in msg:

        request_id = message.get("params").get("requestId")

        # Fetch the response body for this request through the DevTools protocol.
        resp = self.browser.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
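
The filtering step can be tried without a browser by feeding it a hand-built log. The sketch below is a stricter variant of the loop above, not the original code: it keeps only Network.responseReceived events and matches on the response URL. The function name, the sample URLs, and the request IDs are all made up for illustration.

```python
import json


def matching_request_ids(performance_log, url_keyword):
    """Return request IDs of received responses whose URL contains url_keyword."""
    request_ids = []
    for packet in performance_log:
        message = json.loads(packet["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params", {})
        url = params.get("response", {}).get("url", "")
        if url_keyword in url:
            request_ids.append(params.get("requestId"))
    return request_ids


def _packet(method, request_id, url):
    """Build a fake log entry shaped like a Selenium performance log entry."""
    event = {"method": method, "params": {"requestId": request_id, "response": {"url": url}}}
    return {"message": json.dumps({"message": event})}


fake_log = [
    _packet("Network.responseReceived", "1.1", "https://example.invalid/graphql/Following"),
    _packet("Network.responseReceived", "1.2", "https://example.invalid/static/app.js"),
    _packet("Network.loadingFinished", "1.1", "https://example.invalid/graphql/Following"),
]
print(matching_request_ids(fake_log, "Following"))
```

Restricting the match to one event type avoids requesting the same body several times, since one request usually produces several Network events.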

Step 6: Extract Data from the Response

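# The response body is a JSON string; decode it first.
body = resp.get("body")
body = json.loads(body)
# Timeline instructions hold the follower entries.
instructions = body['data']['user']['result']['timeline']['timeline'].get('instructions', None)
if not instructions:
    continue
for instruction in instructions:
    # Not every instruction carries entries, so this may be None.
    entries = instruction.get('entries', None)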
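
Chained indexing such as body['data']['user'] raises KeyError when a response has a different shape, for example an error payload. As a defensive alternative, the sketch below walks the same path with a small helper; dig and iter_entries are hypothetical names, and the sample body is fabricated purely to exercise the code.

```python
def dig(data, *keys, default=None):
    """Follow a chain of dictionary keys, returning default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def iter_entries(body):
    """Yield every entry from every timeline instruction in a decoded response body."""
    instructions = dig(
        body, "data", "user", "result", "timeline", "timeline", "instructions", default=[]
    )
    for instruction in instructions or []:
        for entry in instruction.get("entries") or []:
            yield entry


# Fabricated body that follows the key path used in the step above.
sample_body = {
    "data": {"user": {"result": {"timeline": {"timeline": {"instructions": [
        {"type": "TimelineClearCache"},
        {"entries": [{"entryId": "user-1"}, {"entryId": "user-2"}]},
    ]}}}}}
}
print([entry["entryId"] for entry in iter_entries(sample_body)])
```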

Step 7: Important Considerations

Log in to Twitter and obtain the auth_token (see the linked guide on how to get the auth token).
You can use Apify's API.
The full code is available on GitHub.
Join the discussion group.
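
This guide does not show how the auth_token is handed to the browser. One possible approach, offered here as an assumption rather than the original method, is to set it as a cookie through Selenium's add_cookie. In the sketch below the function names and the ".x.com" cookie domain are hypothetical choices.

```python
def build_auth_cookie(auth_token: str, domain: str = ".x.com") -> dict:
    """Build a cookie dictionary in the shape Selenium's add_cookie accepts."""
    if not auth_token:
        raise ValueError("auth_token must not be empty")
    return {
        "name": "auth_token",
        "value": auth_token,
        "domain": domain,  # assumed domain
        "path": "/",
        "secure": True,
    }


def apply_auth_cookie(browser, auth_token: str) -> None:
    """Load the site first, then attach the cookie and reload the page."""
    # Selenium only accepts cookies for the domain that is currently loaded.
    browser.get("https://x.com")
    browser.add_cookie(build_auth_cookie(auth_token))
    browser.refresh()
```

Treat the token like a password: read it from an environment variable or a local file that is kept out of version control.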

Frequently Asked Questions (FAQ)

Q: What is web scraping?
Web scraping means using tools to collect information from websites automatically. Picture a robot that gathers data from pages so that you do not have to do it by hand. This guide focuses on scraping Twitter data with Python and Selenium.
Q: How do I get started with scraping Twitter data?
Start by setting up your computer. Install Selenium, a library that controls a web browser, and then download ChromeDriver, the companion tool that lets Selenium drive Google Chrome.
Q: What is ChromeDriver, and why do I need it?
ChromeDriver acts as an interpreter between Selenium and Google Chrome: it turns Selenium's commands into actions in the Chrome browser. Selenium needs it to automate tasks on Twitter such as clicking buttons or entering text.
Q: What is the performance log in scraping?
The performance log is a record of what happens in the browser while scraping. It tracks the data exchanged between the Selenium-driven browser and the Twitter page, which helps you see which requests the page made.
Q: What should I watch out for before scraping Twitter?
Before scraping, log in to your Twitter account and obtain the auth_token, which proves that you have access to Twitter data. Also follow Twitter's rules so that you do not get blocked.
Q: How do I avoid being blocked while scraping?
Add delays between requests, rotate proxies, and avoid hitting Twitter's servers too frequently.
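
The delay advice can be sketched as a small helper that waits for a random interval between page loads. The function name and the default bounds below are illustrative assumptions; suitable values depend on your own usage.

```python
import random
import time


def polite_sleep(min_seconds: float = 2.0, max_seconds: float = 5.0, sleep=time.sleep) -> float:
    """Pause for a random duration between the two bounds and return it.

    The `sleep` parameter exists so the pause can be replaced in tests.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("bounds must satisfy 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Short bounds so the demonstration finishes quickly.
    waited = polite_sleep(0.1, 0.3)
    print(f"waited {waited:.2f} seconds")
```

Randomizing the interval, rather than sleeping for a fixed time as in Step 4, makes the request pattern less regular.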