Case requirements:
1. Scrape Tencent's job-posting data (搜索 | 腾讯招聘), including the post name, link, update time, and department name.
2. Scrape every page (pagination).
3. Parse the data with jsonpath.
4. Save the data in two forms: a txt file and an Excel file.

Walkthrough:
1. Determine whether the site renders its data synchronously or asynchronously. It is asynchronous, so inspect the XHR requests in the browser's developer tools.
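As a quick offline illustration of that check (the two sample bodies below are made up), an XHR data packet parses as JSON while the rendered page itself is HTML:

```python
import json

def looks_like_json(body: str) -> bool:
    # True when the body parses as JSON, i.e. what an XHR data packet returns;
    # a synchronously rendered page would come back as HTML instead.
    try:
        json.loads(body)
        return True
    except ValueError:
        return False

print(looks_like_json('{"Data": {"Count": 100}}'))    # True  -> asynchronous data packet
print(looks_like_json('<html><body></body></html>'))  # False -> rendered HTML page
```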
2. Find the packet that actually carries the data by checking each response body.
3. Copy the request URL:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1727929418908&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=cn
4. Strip the query parameters that are not needed (keeping them is also fine), leaving the base URL:
https://careers.tencent.com/tencentcareer/api/post/Query?
5. The site's anti-scraping measures are fairly strict, so disguise the request with headers and send the query parameters separately:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
}
data = {
    'timestamp': '1648355434381',
    'countryId': '',
    'cityId': '',
    'bgIds': '',
    'productId': '',
    'categoryId': '',
    'parentCategoryId': '40001',
    'attrId': '',
    'keyword': '',
    'pageIndex': i,
    'pageSize': 10,
    'language': 'zh-cn',
    'area': 'cn'
}
```
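Steps 3–4 (splitting the copied request URL into a base URL plus a parameter dict) can be sketched with the standard library; the URL below is a shortened version of the one captured above:

```python
from urllib.parse import urlsplit, parse_qsl

captured = ('https://careers.tencent.com/tencentcareer/api/post/Query'
            '?timestamp=1727929418908&parentCategoryId=&pageIndex=3&pageSize=10'
            '&language=zh-cn&area=cn')

parts = urlsplit(captured)
# Rebuild the trimmed base URL and keep the query as a dict we can edit per request.
base_url = '{}://{}{}?'.format(parts.scheme, parts.netloc, parts.path)
params = dict(parse_qsl(parts.query, keep_blank_values=True))

print(base_url)             # https://careers.tencent.com/tencentcareer/api/post/Query?
print(params['pageIndex'])  # 3
```

`keep_blank_values=True` preserves the empty parameters (such as `parentCategoryId=`), which is handy when deciding which ones can be dropped.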
6. Save to an Excel file. Create the workbook:

```python
wb = workbook.Workbook()  # create the Excel workbook object
ws = wb.active            # get the active sheet
ws.append(['职称', '链接', '时间', '公司名称'])
```

Excel saving:

```python
def save_excel(z, l, s, g):
    my_list = [z, l, s, g]  # write one row as a list
    ws.append(my_list)
    wb.save('腾讯社招.xlsx')
```

Local text saving:

```python
def save_text(n, u, t, p):
    with open('腾讯社招.txt', 'a', encoding='utf-8') as f:
        f.write(n + '\n')
        f.write(u + '\n')
        f.write(t + '\n')
        f.write(p + '\n')
```

7. Parse the data with jsonpath:

```python
names = jsonpath(r, '$..RecruitPostName')
urls = jsonpath(r, '$..PostURL')
times = jsonpath(r, '$..LastUpdateTime')
pronames = jsonpath(r, '$..ProductName')
```

8. Process the parsed data:

```python
for name, url, post_time, proname in zip(names, urls, times, pronames):
    save_text(name, url, post_time, proname)
    save_excel(name, url, post_time, proname)
```

9. Paginate — only `pageIndex` changes from page to page:

```python
for i in range(1, 6):
    # url = 'https://careers.tencent.com/search.html'
    data = {
        'timestamp': '1648355434381',
        'countryId': '',
        'cityId': '',
        'bgIds': '',
        'productId': '',
        'categoryId': '',
        'parentCategoryId': '40001',
        'attrId': '',
        'keyword': '',
        'pageIndex': i,  # page number changes on every request
        'pageSize': 10,
        'language': 'zh-cn',
        'area': 'cn'
    }
    print('第{}页已经保存完毕'.format(i))
```

Sample code:
```python
import requests
from jsonpath import jsonpath
from openpyxl import workbook
import time

# Example detail page: http://careers.tencent.com/jobdesc.html?postId=1685827130673340416

def get_data():
    response = requests.get(url, headers=headers, params=data)
    r = response.json()
    return r

def parse_data(r):
    names = jsonpath(r, '$..RecruitPostName')
    urls = jsonpath(r, '$..PostURL')
    times = jsonpath(r, '$..LastUpdateTime')
    pronames = jsonpath(r, '$..ProductName')
    for name, post_url, post_time, proname in zip(names, urls, times, pronames):
        save_text(name, post_url, post_time, proname)
        save_excel(name, post_url, post_time, proname)

# Save the data
def save_text(n, u, t, p):
    with open('腾讯社招.txt', 'a', encoding='utf-8') as f:
        f.write(n + '\n')
        f.write(u + '\n')
        f.write(t + '\n')
        f.write(p + '\n')

def save_excel(z, l, s, g):
    my_list = [z, l, s, g]  # write one row as a list
    ws.append(my_list)
    wb.save('腾讯社招.xlsx')

if __name__ == '__main__':
    wb = workbook.Workbook()  # create the Excel workbook object
    ws = wb.active            # get the active sheet
    ws.append(['职称', '链接', '时间', '公司名称'])
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query?'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
    }
    for i in range(1, 6):
        # url = 'https://careers.tencent.com/search.html'
        data = {
            'timestamp': '1648355434381',
            'countryId': '',
            'cityId': '',
            'bgIds': '',
            'productId': '',
            'categoryId': '',
            'parentCategoryId': '40001',
            'attrId': '',
            'keyword': '',
            'pageIndex': i,
            'pageSize': 10,
            'language': 'zh-cn',
            'area': 'cn'
        }
        time.sleep(2)  # throttle requests between pages
        h = get_data()
        parse_data(h)
        print('第{}页已经保存完毕'.format(i))
```

Run the script and the data is saved. The requests can also be routed through a proxy.
Adding a proxy:

```python
zhima_api = ('http://http.tiqu.letecs.com/getip3?num=1&type=1&pro=&city=0&yys=0'
             '&port=1&pack=225683&ts=0&ys=0&cs=0&lb=1&sb=0&pb=4&mr=1&regions=&gm=4')
proxie_ip = requests.get(zhima_api).json()['data'][0]
print(proxie_ip)
# Turn the extracted IP into a dict to build a complete HTTP proxy
proxies = {
    'http': 'http://' + str(proxie_ip['ip']) + ':' + str(proxie_ip['port']),
    # 'https': 'https://' + str(proxie_ip['ip']) + ':' + str(proxie_ip['port'])
}
```
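To make that last step concrete without hitting the proxy-extraction API, here is how the `proxies` dict comes out for a made-up payload (the field layout is assumed from the code above; the `ip` and `port` values are invented):

```python
# Hypothetical example of the JSON the proxy-extraction API returns (fields assumed).
sample_payload = {'code': 0, 'data': [{'ip': '203.0.113.7', 'port': 4216}]}

item = sample_payload['data'][0]
# Build the scheme://host:port string that requests expects in its proxies mapping.
proxies = {'http': 'http://{}:{}'.format(item['ip'], item['port'])}
print(proxies)  # {'http': 'http://203.0.113.7:4216'}

# It would then be passed along with the other request arguments, e.g.:
# requests.get(url, headers=headers, params=data, proxies=proxies)
```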