news 2025/12/29 15:35:24

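Contents

1. xpath
1.1 Installing the xpath extension and lxml
1.2 Basic use of xpath
1.3 Basic xpath syntax
2. JsonPath
2.1 Installing jsonpath
2.2 Using jsonpath
2.3 Basic jsonpath syntax
3. BeautifulSoup
3.1 Installing bs4 and creating a soup object
3.2 Using beautifulsoup
3.3 Basic beautifulsoup syntax

1. xpath

1.1 Installing the xpath extension and lxml

xpath is a language for locating information in XML documents. It can also be used on HTML documents, because a parsed HTML page has the same kind of element tree as an XML document. In web automation testing and web scraping it is used to locate elements on a page precisely, for example a particular button, text box, or table cell, so that later steps can click the button or read its text.

First install the xpath browser extension. Archive: xpath archive, extraction code: ttkx

How to install the extension:

1. Unzip the xpath extension archive.
2. Open the Extensions page in Chrome.
3. Drag the unzipped file with the .crx suffix onto the extension management page to add it.
4. The shortcut is Ctrl+Shift+X.

If the black box appears, the extension is installed.

Install the lxml library:

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml

1.2 Basic use of xpath

xpath can parse two kinds of input: a local file, and server response data (that is, response.read().decode('utf-8')).

Parse a local file: html_tree = etree.parse('filename.html')
Parse server response data: html_tree = etree.HTML(response.read().decode('utf-8'))

The local file used in the examples, 解析本地文件.html:

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8"/>  <!-- the closing slash is required here -->
        <title>Title</title>
    </head>
    <body>
        <ul>
            <li id="l1" class="c1">北京</li>
            <li id="l2">上海</li>
            <li id="c3">广州</li>
            <li id="c4">深圳</li>
        </ul>
        <ul>
            <li>郑州</li>
            <li>浙江</li>
            <li>南京</li>
            <li>重庆</li>
        </ul>
    </body>
    </html>

Parsing it:

    from lxml import etree

    # Parse the local file
    tree = etree.parse('解析本地文件.html')
    print(tree)
    # etree.parse() uses an XML parser, so every tag must be closed.
    # If the meta tag in 解析本地文件.html has no closing slash, parsing fails with:
    # lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: meta line 4 and head, line 6, column 8

1.3 Basic xpath syntax

    Path query:        //   selects all descendant nodes, regardless of depth
    Path query:        /    selects direct child nodes
    Predicate query:   //div[@id]
    Predicate query:   //div[@id="maincontent"]
    Attribute query:   //@class
    Fuzzy query:       //div[contains(@id, "ha")]
    Fuzzy query:       //div[starts-with(@id, "ha")]
    Content query:     //div/h1/text()
    Logical operation: //div[@id="head" and @class="s_down"]
    Logical operation: //title | //price

Examples against the local file:

    from lxml import etree

    tree = etree.parse('解析本地文件.html')
    # tree.xpath('xpath expression')

    # Find the li elements under ul
    # li_list = tree.xpath('body//li')     # all li descendants of body
    # li_list = tree.xpath('body/ul/li')   # follow the hierarchy: body's child ul, then ul's child li
    # len() gives the number of elements in the list
    # print(len(li_list))  # 8

    # Find all li tags that have an id attribute
    # li_list = tree.xpath('body//ul/li[@id]')   # attribute predicate
    # print(li_list)   # a list of the matching li elements

    # Get the text inside a tag: text()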
    # li_list = tree.xpath('body//ul/li[@id="l1"]/text()')   # if the inner quotes are double, the outer quotes must be single, and vice versa
    # print(li_list)  # ['北京']

    # Get the class attribute value of the li whose id is l1
    # li = tree.xpath('//ul/li[@id="l1"]/@class')
    # print(li)  # ['c1']

    # Fuzzy query, //div[contains(@id, "ha")]: li tags whose id contains "l"
    # li_list = tree.xpath('//ul/li[contains(@id, "l")]/text()')
    # print(li_list)  # ['北京', '上海']

    # li tags whose id starts with "l"
    # li_list = tree.xpath('//ul/li[starts-with(@id, "l")]/text()')
    # print(li_list)  # ['北京', '上海']

    # li tags whose id is l1 and whose class is c1
    # li_list = tree.xpath('//ul/li[@id="l1" and @class="c1"]/text()')
    # print(li_list)  # ['北京']

    # li tags whose id is l1 or l2
    # li_list = tree.xpath('//ul/li[@id="l1"]/text() | //ul/li[@id="l2"]/text()')
    li_list = tree.xpath('//ul/li[@id="l1" or @id="l2"]/text()')
    print(li_list)  # ['北京', '上海']

Case 1: get the text "百度一下" from the Baidu home page

    # Case 1: read the value of the search button on the Baidu home page
    import urllib.request
    from lxml import etree

    url = 'https://www.baidu.com'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')

    # Server response data is parsed with etree.HTML()
    tree = etree.HTML(content)
    result = tree.xpath('//input[@id="su"]/@value')[0]
    print(result)

Case 2: download the first 10 pages of photos under the yazhoumeinu tag on 站长素材 (sc.chinaz.com) to the local disk

    # Page 1: https://sc.chinaz.com/tag_tupian/yazhoumeinu.html
    # Page 2: https://sc.chinaz.com/tag_tupian/yazhoumeinu_2.html
    # Page 1 has no numeric suffix, so the URL is built with an if/else
    import urllib.request
    from lxml import etree

    def create_request(page):
        if page == 1:
            url = 'https://sc.chinaz.com/tag_tupian/yazhoumeinu.html'
        else:
            # Later pages use the suffix _<page>
            url = 'https://sc.chinaz.com/tag_tupian/yazhoumeinu_' + str(page) + '.html'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'
        }
        request = urllib.request.Request(url=url, headers=headers)
        return request

    def get_content(request):
        response = urllib.request.urlopen(request)
        content = response.read().decode('utf-8')
        return content

    def down_load(content):
        tree = etree.HTML(content)
        # The images are lazy-loaded, so the real address is in data-original
        jpg_path = tree.xpath('//img[@class="lazy"]/@data-original')
        jpg_name = tree.xpath('//img[@class="lazy"]/@alt')
        for i in range(len(jpg_path)):
            name = jpg_name[i]
            path = jpg_path[i]
            # The address in the page has no scheme, so prepend it
            url = 'https:' + path
            urllib.request.urlretrieve(url=url, filename=name + '.jpg')

    if __name__ == '__main__':
        start_page = int(input('Enter the start page: '))
        end_page = int(input('Enter the end page: '))
        for page in range(start_page, end_page + 1):
            # Build the request object
            request = create_request(page)
            # Fetch the page source
            content = get_content(request)
            # Download the images
            down_load(content)

2. JsonPath

2.1 Installing jsonpath

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple jsonpath

2.2 Using jsonpath

    obj = json.load(open('file.json', 'r', encoding='utf-8'))
    ret = jsonpath.jsonpath(obj, 'jsonpath expression')

Unlike xpath, which has one entry point for a local file and another for a response string, jsonpath.jsonpath() works on an already-parsed Python object. A local file is loaded with json.load(); a server response has to be converted with json.loads() first, as the case at the end of this section does.

2.3 Basic jsonpath syntax

Comparison of the basic jsonpath and xpath syntax:

    xpath   jsonpath   Description
    /       $          the root node
    .       @          the current element
    /       . or []    child element
    ..      n/a        parent element; not supported by jsonpath
    //      ..         all matching nodes at any depth
    *       *          matches all element nodes
    @       n/a        attribute access; not supported by jsonpath
    []      []         subscript (child element) operator
    |       [,]        multiple selection inside a subscript
    []      ?()        filter expression
    ()      n/a        grouping; not supported by jsonpath

Example file, 测试.json:

    {
        "store": {
            "book": [
                {"category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95},
                {"category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99},
                {"category": "fiction", "author": "Herman Melville", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
                {"category": "fiction", "author": "J. R. R. Tolkien", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99}
            ],
            "bicycle": {"color": "red", "price": 19.95}
        }
    }

Queries:

    import json
    import jsonpath

    obj = json.load(open('测试.json', 'r', encoding='utf-8'))

    # Authors of all the books in the store
    # author_list = jsonpath.jsonpath(obj, '$.store.book[*].author')
    # Only the books' authors are wanted here; if the bicycle had an author, $..author would return it too
    # print(author_list)  # ['Nigel Rees', 'Evelyn Waugh', 'Herman Melville', 'J. R. R. Tolkien']

    # All authors
    # author2_list = jsonpath.jsonpath(obj, '$..author')
    # print(author2_list)

    # The store object, which holds all the books and the bicycle
    a_list = jsonpath.jsonpath(obj, '$.store')
    print(a_list)

    # Every price under store
    price_list = jsonpath.jsonpath(obj, '$.store..price')
    print(price_list)  # [8.95, 12.99, 8.99, 22.99, 19.95]

    # The third book
    book_3 = jsonpath.jsonpath(obj, '$..book[2]')
    print(book_3)  # [{'category': 'fiction', 'author': 'Herman Melville', 'title': 'Moby Dick', 'isbn': '0-553-21311-3', 'price': 8.99}]

    # The last book
    book_end = jsonpath.jsonpath(obj, '$..book[(@.length-1)]')
    print(book_end)

    # The first two books
    book_list = jsonpath.jsonpath(obj, '$..book[0,1]')
    # book_list = jsonpath.jsonpath(obj, '$..book[:2]')
    print(book_list)

    # Books that have an isbn
    book_isbn_list = jsonpath.jsonpath(obj, '$..book[?(@.isbn)]')
    print(book_isbn_list)

    # Books priced above 10
    book_price_list = jsonpath.jsonpath(obj, '$..book[?(@.price>10)]')
    print(book_price_list)

Case: list the cities that have cinemas on the Taopiaopiao (淘票票) site

    import json
    import urllib.request

    import jsonpath

    url = 'https://dianying.taobao.com/cityAction.json?activityId&_ksTS=1733383129541_108&jsoncallback=jsonp109&action=cityAction&n_s=new&event_submit_doGetAllRegion=true'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
        'referer': 'https://dianying.taobao.com/'
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')

    # The response is JSONP: jsonp109({...});
    # Drop the callback prefix and the trailing ");" to leave plain JSON
    content = content.replace('jsonp109(', '')[:-2]

    # Keep a local copy of the JSON
    with open('淘票票.json', 'w', encoding='utf-8') as f:
        f.write(content)

    obj = json.loads(content)
    city_list = jsonpath.jsonpath(obj, '$..regionName')
    print(city_list)

3. BeautifulSoup

Introduction: BeautifulSoup, bs4 for short, is an HTML parser like lxml, and its main job is also to parse pages and extract data. Its drawback is that it is slower than lxml; its advantage is a friendlier interface that is easy to use.

3.1 Installing bs4 and creating a soup object

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple bs4

3.2 Using beautifulsoup

    # Import
    from bs4 import BeautifulSoup

    # Build the object from a server response
    soup = BeautifulSoup(response.read().decode(), 'lxml')

    # Build the object from a local file
    soup = BeautifulSoup(open('file.html', encoding='utf-8'), 'lxml')
    # Note: open() uses the platform default encoding (gbk on Chinese-locale Windows),
    # so pass the file's encoding explicitly

3.3 Basic beautifulsoup syntax

Example file, bs4的基本使用.html:

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <div>
            <ul>
                <li id="l1">张三</li>
                <li id="l2">李四</li>
                <li>王二</li>
                <a href="https://dwqttkx.blog.csdn.net" id="" class="a1">人间无解</a>
                <span>哈哈</span>
            </ul>
        </div>
        <a href="https://www.baidu.com" title="a2">百度</a>
        <div id="d1">
            <span>嘻嘻</span>
        </div>
        <p id="p1" class="p1">呵呵</p>
    </body>
    </html>

Basic syntax:

    from bs4 import BeautifulSoup

    # open() defaults to the platform encoding (gbk on Chinese-locale Windows), so pass utf-8
    soup = BeautifulSoup(open('bs4的基本使用.html', encoding='utf-8'), 'lxml')

    # Find a node by tag name
    print(soup.a)        # the first matching tag only
    print(soup.a.attrs)  # the tag's attributes: {'href': 'https://dwqttkx.blog.csdn.net', 'id': '', 'class': ['a1']}

    # Commonly used bs4 functions
    # (1) find
    print(soup.find('a'))                # the first match
    print(soup.find('a', title='a2'))    # <a href="https://www.baidu.com" title="a2">百度</a>
    print(soup.find('a', class_='a1'))   # class is a Python keyword, so the argument takes a trailing underscore

    # (2) find_all
    print(soup.find_all('a'))              # returns a list
    print(soup.find_all(['a', 'span']))    # pass a list to match several tag names
    print(soup.find_all('li', limit=2))    # limit caps the number of results

    # (3) select: find nodes with a CSS selector
    print(soup.select('a'))      # returns a list
    print(soup.select('.a1'))    # by class
    print(soup.select('#l1'))    # by id

    # Attribute selectors
    print(soup.select('li[id]'))         # li tags that have an id
    print(soup.select('li[id="l2"]'))

    # Hierarchy selectors
    print(soup.select('div li'))          # descendant selector
    print(soup.select('div > ul > li'))   # child selector
    print(soup.select('li,a'))            # all li tags and all a tags

    # Node content
    obj = soup.select('#d1')[0]
    # If the tag holds only text, .string and get_text() both work.
    # If it also holds child tags, as #d1 does, .string returns None; use get_text()
    print(obj.string)
    print(obj.get_text())

    # Node properties
    obj = soup.select('#p1')[0]
    print(obj.name)    # the tag name
    print(obj.attrs)   # the attributes as a dict

    # Reading one attribute
    obj = soup.select('#p1')[0]
    print(obj.attrs.get('class'))
    # print(obj['class'])

Case: scrape the classic snacks menu from the Dicos (德克士) site

    import urllib.request

    url = 'https://www.dicos.com.cn/product/index.html'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36',
        'cookie': 'PHPSESSID=fojuqr8gsm3crh2neh5n5815gi; Hm_lvt_2a236f187a73851700a681cacce60cdf=1733407994; HMACCOUNT=5C1851BB26ADB117; _gid=GA1.3.96677442.1733408008; _ga_89J95J2XEN=GS1.1.1733407994.1.1.1733409442.0.0.0; Hm_lpvt_2a236f187a73851700a681cacce60cdf=1733409442; _gat_gtag_UA_230824051_1=1; _ga_G95L9KVQWW=GS1.1.1733407994.1.1.1733409442.0.0.0; _ga=GA1.1.1693328412.1733407994'
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')

    # With bs4:
    # from bs4 import BeautifulSoup
    # soup = BeautifulSoup(content, 'lxml')
    # name_list = soup.select('.proul p')
    # for name in name_list:
    #     print(name.get_text())

    # With xpath: //ul[@class="proul"]//p/text()
    from lxml import etree

    tree = etree.HTML(content)
    result = tree.xpath('//ul[@class="proul"]//p/text()')
    print(result)

That's all for this post. Thanks for reading.
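To try the path and predicate queries from section 1.3 without a local file, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of lxml. This is a substitution, not the lxml API used in the article: ElementTree supports only a subset of XPath (no contains(), starts-with(), text(), or "or"), paths are written relative to the root element, and text is read from the .text attribute. The SAMPLE string is a shortened inline copy of the sample page and is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the sample page (assumption: well-formed markup,
# because ElementTree is a strict XML parser, like etree.parse in lxml).
SAMPLE = """<html>
<body>
  <ul>
    <li id="l1" class="c1">北京</li>
    <li id="l2">上海</li>
    <li id="c3">广州</li>
    <li id="c4">深圳</li>
  </ul>
  <ul>
    <li>郑州</li>
    <li>浙江</li>
  </ul>
</body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Descendant search: every <li> below the root, at any depth.
all_li = root.findall(".//li")

# Child steps: only <li> elements reached through body, then ul.
direct_li = root.findall("body/ul/li")

# Predicate on attribute presence: <li> elements that have an id.
with_id = root.findall(".//li[@id]")

# Predicate on attribute value, then read the text and another attribute.
first = root.find(".//li[@id='l1']")
first_text = first.text
first_class = first.get("class")

# ElementTree predicates have no "or", so two queries are combined by hand.
l1_or_l2 = [
    li.text
    for key in ("l1", "l2")
    for li in root.findall(f".//li[@id='{key}']")
]

print(len(all_li), len(direct_li), len(with_id))
print(first_text, first_class)
print(l1_or_l2)
```

For the fuzzy queries (contains, starts-with) and for text() or @attr at the end of a path, the lxml calls shown in the article are still needed.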

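The Taopiaopiao case in section 2 removes the JSONP wrapper with replace('jsonp109(', '') and [:-2], which depends on the exact callback name and on the response ending in ");". A minimal sketch of a less brittle variant, using only the standard library, could look like the following. The helper names unwrap_jsonp and find_all are hypothetical, the sample payload is made up for illustration, and find_all only imitates what the $..key recursive descent does; it is not how the jsonpath package is implemented.

```python
import json


def unwrap_jsonp(text):
    """Return the JSON text between the first '(' and the last ')'."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end < start:
        raise ValueError("not a JSONP response")
    return text[start + 1:end]


def find_all(node, key):
    """Collect every value stored under `key`, at any depth (like $..key)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain the key again.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Made-up payload shaped like a JSONP response (assumption for illustration).
sample = (
    'jsonp109({"returnValue": {'
    '"A": [{"id": 1, "regionName": "Anshan"}], '
    '"B": [{"id": 2, "regionName": "Beijing"}]}});'
)

data = json.loads(unwrap_jsonp(sample))
city_list = find_all(data, "regionName")
print(city_list)
```

Because unwrap_jsonp looks only at the outermost parentheses, it keeps working if the callback name in the URL changes.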