Python Crawler: Batch-Extracting Douyin Videos
Author: Zhang Xiaoji (张小鸡), columnist for the Python爱好者社区 (Python Enthusiasts Community)
Zhihu ID: https://www.zhihu.com/people/mr.ji
Personal WeChat official account: 鸡仔说
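I finally had some free time over the holiday to sort out my notes. When I got to my Douyin videos, it turned into a chore: each time I had to export the video to local storage, send it to WeChat's File Transfer assistant, then download it again and upload it to Evernote (印象笔记). All that back and forth wastes a lot of time. Isn't this exactly the kind of job a crawler is for? Hence this article.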
Tools and environment
- Language: Python 3.6
- Editor: PyCharm
- Database: MongoDB
- Tool: Charles
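Preface:
Before using Charles you need some basic configuration: proxy your phone's network traffic through your local computer so that you can capture and analyze the packets. The following two articles may help.
Charles 从入门到精通 (Charles: From Beginner to Expert)
https://www.jianshu.com/p/a3f005628d07
移动应用抓包调试利器Charles (Charles, a Handy Tool for Capturing and Debugging Mobile App Traffic)
https://www.jianshu.com/p/68684780c1b0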
Crawling approach
Target site: https://www.douyin.com/
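The approach here is very simple, so simple that I worry this article may feel a bit thin. Once your packet-capture environment is configured correctly, open the Douyin app and perform a few simple actions; Charles will show the captured traffic. This is simply the data the server returns to you, and it contains everything we need, such as the links to the videos we have liked, which are what we want to download today.
While you operate the app, watch how the entries in Charles change. You will find that the links under your profile page correspond to three kinds of requests: videos, feed, and likes. Each time you perform the corresponding action, a few more request links appear below.
Setting everything else aside, let's check whether the data in each request contains what we want by looking at the response of any one of these links.
You can see a play_addr field here, and the link contains the word video, so we are almost certainly on the right track. I have already verified it: just as we guessed, this response contains all of the video's information.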
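When a captured response is large, it can help to search it for a field such as play_addr programmatically rather than by eye. The following is a minimal sketch, not part of the original project: find_key is a hypothetical helper, and the sample dictionary is a heavily trimmed, made-up stand-in for a real response.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested dict/list."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may also appear deeper down.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical, trimmed response; real payloads carry many more fields.
sample = {
    "aweme_list": [
        {"aweme_id": "1", "video": {"play_addr": {"url_list": ["https://example.com/a.mp4"]}}},
        {"aweme_id": "2", "video": {"play_addr": {"url_list": ["https://example.com/b.mp4"]}}},
    ]
}

# Take the first candidate URL of every play_addr found in the response.
urls = [addr["url_list"][0] for addr in find_key(sample, "play_addr")]
print(urls)
```

The same helper works on the result of json.loads applied to a response body exported from Charles.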
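So all we need to do is simulate these requests. First look at what information a request must carry. After inspecting a few of them, you will find that only a handful of fixed parameters actually change. Everything above the red line in the captured request is device-related and app information; the only core encrypted parameters are mas, as, and ts. I first searched online for an existing wheel I could reuse and, by sheer dumb luck, found one: https://github.com/AppSign/douyin
Just plug it in. All of this author's cracks are related to ByteDance, so I half suspect the project was released by the company's own employees. Anyway, once we have an implementation of the encrypted parameters, the rest is easy.
Looking at the video-extraction part of that author's code, the key video-related parameter is aweme_id. Once we have it, we can directly construct the request that fetches the original video.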
Enough talk, let's get to the code.
show me the code
Core request:
import time

import requests

# getSign comes from the AppSign/douyin project linked above.


def grab_favorite(self, user_id, max_cursor=0):
    favorite_params = self.FAVORITE_PARAMS
    favorite_params["user_id"] = user_id
    favorite_params["max_cursor"] = max_cursor
    # Merge the per-request parameters with the device/app parameters.
    query_params = {**favorite_params, **self.common_params}
    sign = getSign(self.gettoken(), query_params)
    # The signature fields are sent alongside the query parameters.
    params = {**query_params, **sign}
    resp = requests.get(self.FAVORITE_URL, params=params, verify=False, headers=self.HEADERS)
    favorite_info = resp.json()
    hasmore = favorite_info.get("has_more")
    max_cursor = favorite_info.get("max_cursor")
    # Fall back to an empty list if the response carries no videos.
    video_infos = favorite_info.get("aweme_list") or []
    for per_video in video_infos:
        author_nickname = per_video["author"].get("nickname")
        author_uid = per_video["author"].get("uid")
        video_desc = per_video.get("desc")
        download_item = {
            "author_nickname": author_nickname,
            "video_desc": video_desc,
            "author_uid": author_uid,
        }
        aweme_id = per_video.get("aweme_id")
        self.download_favorite_video(aweme_id, download_item)
        # Pause between downloads to avoid hammering the server.
        time.sleep(5)
    return hasmore, max_cursor
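Here we use the device parameters, app information, and user as query parameters, pass them together with the obtained token to the getSign function to construct the encrypted data, and finally merge everything into one dictionary and request the favorites URL (https://aweme.snssdk.com/aweme/v1/aweme/favorite/) to get the corresponding response data. You may notice that I glossed over the max_cursor parameter. On the first request its value is 0; after that, if the returned has_more is non-zero, there is more data, and the next request must carry the max_cursor from the previous response. Think of it as scrolling down to the next page.
That is why the function returns these two values: it makes things easy for the caller, which checks whether there is more data and, if so, keeps paging and downloading.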
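The way the query parameters and the signature fields end up in one dictionary can be sketched as follows. This is only an illustration under stated assumptions: fake_sign is a stand-in that returns placeholder values for mas, as, and ts (the real values come from getSign in the AppSign/douyin project), and build_params and the sample parameter values are hypothetical.

```python
import time


def fake_sign(token, query_params):
    """Stand-in for getSign: returns placeholder signature fields only."""
    return {"mas": "placeholder-mas", "as": "placeholder-as", "ts": str(int(time.time()))}


def build_params(common_params, extra_params, token, sign_func=fake_sign):
    # Merge into a new dict so the shared common_params is never mutated.
    query_params = {**common_params, **extra_params}
    signature = sign_func(token, query_params)
    # Later keys win on collision, so signature fields override query fields.
    return {**query_params, **signature}


common = {"device_id": "0000", "app_name": "aweme"}  # illustrative values only
params = build_params(common, {"user_id": "123", "max_cursor": 0}, token="demo-token")
print(sorted(params))
```

Building a fresh dictionary on every call keeps one request's user_id or max_cursor from leaking into the next request through a shared object.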
Pagination:
def grab_favorite_main(self, user_id):
    count = 1
    self.logger.info("Crawling page 👉 {} 👈 ...".format(count))
    hasmore, max_cursor = self.grab_favorite(user_id)
    # Keep paging while the server reports more data, passing the cursor along.
    while hasmore:
        count += 1
        self.logger.info("Crawling page 👉 {} 👈 ...".format(count))
        hasmore, max_cursor = self.grab_favorite(user_id, max_cursor)
After the first request we get both the has-more status and the max_cursor parameter, so the rest is simple: as long as there is more data, keep requesting.
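The cursor logic can also be separated from the network code, which makes it easy to try out offline. Below is a minimal sketch, assuming a response shaped like the one described above (aweme_list, has_more, max_cursor); iter_pages and FAKE_PAGES are hypothetical names, and the fake pages stand in for the real favorites endpoint.

```python
def iter_pages(fetch_page, start_cursor=0):
    """Yield item lists page by page until the server reports no more data."""
    cursor = start_cursor
    while True:
        page = fetch_page(cursor)
        yield page.get("aweme_list") or []
        if not page.get("has_more"):
            break
        # The next request must carry the cursor from the previous response.
        cursor = page.get("max_cursor")


# Fake backend standing in for the favorites endpoint: three pages keyed by cursor.
FAKE_PAGES = {
    0: {"aweme_list": ["a", "b"], "has_more": 1, "max_cursor": 10},
    10: {"aweme_list": ["c"], "has_more": 1, "max_cursor": 20},
    20: {"aweme_list": ["d"], "has_more": 0, "max_cursor": 20},
}

items = [item for page in iter_pages(FAKE_PAGES.__getitem__) for item in page]
print(items)
```

In real use, fetch_page would be a function that sends the signed request for a given cursor and returns the parsed JSON.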
Video download
def download_favorite_video(self, aweme_id, video_infos):
    video_content = self.download_video(aweme_id)
    author_nickname = video_infos.get("author_nickname")
    author_uid = video_infos.get("author_uid")
    video_desc = video_infos.get("video_desc")
    # str.join takes a single iterable; the "_" separator is an assumption.
    video_name = "_".join(str(part) for part in (author_nickname, author_uid, video_desc))
    self.logger.info("download_favorite_video: downloading video {}".format(video_name))
    if not video_content:
        self.logger.warning("The video you are downloading is gone, thanks to some mysterious force; skipping...")
        return
    with open("../videos/{}.mp4".format(video_name), "wb") as f:
        f.write(video_content)


def download_video(self, aweme_id, retrytimes=0):
    # Copy so the shared common_params dict is not modified.
    query_params = dict(self.common_params)
    query_params["aweme_id"] = aweme_id
    sign = getSign(self.gettoken(), query_params)
    params = {**query_params, **sign}
    post_data = {"aweme_id": aweme_id}
    resp = requests.get(self.VIDEO_DETAILURL, params=params, data=post_data, verify=False, headers=self.HEADERS)
    resp_result = resp.json()
    play_addr_raw = resp_result["aweme_detail"]["video"]["play_addr"]["url_list"]
    # url_list holds candidate URLs; taking the first one is an assumption,
    # because the original line that selected the URL was lost.
    play_addr = play_addr_raw[0]
    content = requests.get(play_addr).content
    return content
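Similarly, after constructing the sign signature, we request the video-detail URL with the corresponding aweme_id to get the video data we want, then write it straight to a file in binary mode. For the file name I use the user's nickname, the user's unique id, and the video description; if that feels too long, change it to whatever you prefer.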
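Nicknames and video descriptions can contain characters that are not valid in file names (slashes, colons, newlines), and descriptions can be long. One possible way to shorten and clean the name before writing the file is sketched below. safe_filename is a hypothetical helper, not part of the original project, and the 80-character limit is an arbitrary choice.

```python
import re


def safe_filename(*parts, max_length=80, extension=".mp4"):
    """Join parts with '_' and drop characters that are unsafe in file names."""
    name = "_".join(str(p) for p in parts if p)
    # Collapse whitespace runs first (descriptions often contain newlines).
    name = re.sub(r"\s+", " ", name)
    # Replace path separators, reserved characters, and control characters.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", name).strip(" .")
    # Truncate, and fall back to a generic name if nothing is left.
    return (name[:max_length].rstrip(" .") or "video") + extension


print(safe_filename("nick/name", "12345", "a fun: clip?\nwith #tags"))
```

The result can be passed to open(..., "wb") in place of the raw concatenated name.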
Finally, start the crawler and the liked videos are saved under ../videos/.
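Those are the steps for crawling all the videos you have liked on Douyin. You can walk through the whole process yourself, or just grab my code on GitHub (https://github.com/hacksman/spider_world).
Remember to change user_id to your own. I will keep adding more fun and practical crawlers to this repository; stars are welcome, and if you have any questions, send me feedback so we can learn and improve together.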
GitHub project: https://github.com/hacksman/spider_world
Personal website: http://www.zxiaoji.com/