Open the Spiderbuf practice page "Python爬虫实战练习页面ajax动态加载数据的爬取_S07_Spiderbuf"; the data shown on the page should look familiar by now.
Right-click on the page -> View Page Source. The data does not appear anywhere in the HTML source, but at the very bottom of the source there is a block of JavaScript code:
fetch("/playground/iplist").then(function (response) {
    // Parse the response body as JSON
    return response.json();
}).then(function (data) {
    // Find the table with ID "mytable"
    var dataContent = document.getElementById('mytable');
    // Append one row per record, one cell per field
    data.forEach((value, index) => {
        var row = dataContent.insertRow();
        var ip = row.insertCell();
        ip.innerText = value.ip;
        var mac = row.insertCell();
        mac.innerText = value.mac;
        var name = row.insertCell();
        name.innerText = value.name;
        var manufacturer = row.insertCell();
        manufacturer.innerText = value.manufacturer;
        var system = row.insertCell();
        system.innerText = value.system;
        var ports = row.insertCell();
        ports.innerText = value.ports;
        var status = row.insertCell();
        status.innerText = value.status;
    })
});
If you are familiar with JavaScript, you can probably tell at a glance that this is the code that loads the data. If not, don't worry: we will go through it piece by piece.
fetch is the JavaScript function for making HTTP requests; its default method is GET, and it returns a Promise, which is typically unwrapped step by step with .then(). getElementById retrieves an element by its ID. insertRow and insertCell insert a new row into a table and a new cell into a row, respectively. innerText sets the text content of an element.
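To make the row-and-cell logic concrete for readers more comfortable with Python, here is a small sketch of the same idea: walk a list of records and emit one row per record, one column per field. The two sample records are copied from the JSON response shown further down this page; this is just an illustration, not part of the site's code.

```python
# Sample records copied from the JSON response shown below
devices = [
    {"ip": "172.18.117.6", "mac": "DD-D1-1C-32-89-E3", "name": "OA服务器",
     "type": "服务器", "manufacturer": "HUAWEI", "ports": "80,22,443", "status": "在线"},
    {"ip": "10.49.119.200", "mac": "34-59-71-41-63-8E", "name": "交换机",
     "type": "交换机", "manufacturer": "HUAWEI", "ports": "80,22,443", "status": "离线"},
]

rows = []
for value in devices:                      # corresponds to data.forEach(...) in the JS
    cells = [value["ip"], value["mac"], value["name"],
             value["manufacturer"], value["ports"], value["status"]]
    rows.append(" | ".join(cells))         # one "insertCell" per field

print("\n".join(rows))
```

Each iteration plays the role of insertRow, and each appended field plays the role of insertCell plus innerText.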
Putting this together, the JavaScript sends a GET request to /playground/iplist and fills the table with ID mytable using the returned data. Let's first try pasting that link into the browser's address bar and see what comes back:
[
{
"ip": "172.18.117.6",
"mac": "DD-D1-1C-32-89-E3",
"name": "OA服务器",
"type": "服务器",
"manufacturer": "HUAWEI",
"ports": "80,22,443",
"status": "在线"
},
{
"ip": "10.49.119.200",
"mac": "34-59-71-41-63-8E",
"name": "交换机",
"type": "交换机",
"manufacturer": "HUAWEI",
"ports": "80,22,443",
"status": "离线"
},
{
"ip": "172.29.31.61",
"mac": "EE-93-CB-06-0D-D3",
"name": "测试服务器",
"type": "服务器",
"manufacturer": "Linux",
"ports": "80,22,443",
"status": "在线"
},
{
"ip": "172.20.98.238",
"mac": "51-02-BF-FA-AB-8D",
"name": "交换机",
"type": "交换机",
"manufacturer": "HUAWEI",
"ports": "80,22,443",
"status": "在线"
},
{
"ip": "10.52.185.102",
"mac": "70-A5-E9-5E-EC-97",
"name": "存储服务器",
"type": "服务器",
"manufacturer": "HUAWEI",
"ports": "9000,5606",
"status": "在线"
},
{
"ip": "172.30.116.46",
"mac": "69-E6-56-A9-30-30",
"name": "存储服务器",
"type": "服务器",
"manufacturer": "Windows Server 2012",
"ports": "80,22,443",
"status": "离线"
},
{
"ip": "10.2.190.30",
"mac": "E6-E4-54-64-82-E9",
"name": "ERP服务器",
"type": "服务器",
"manufacturer": "Linux",
"ports": "80,22,443",
"status": "离线"
},
{
"ip": "10.46.235.52",
"mac": "BE-5C-49-AD-38-09",
"name": "堡垒机",
"type": "服务器",
"manufacturer": "HUAWEI",
"ports": "80,22,443",
"status": "在线"
},
{
"ip": "10.104.171.123",
"mac": "07-15-1C-0F-66-F3",
"name": "摄像头",
"type": "摄像头",
"manufacturer": "HUAWEI",
"ports": "80,22,443",
"status": "在线"
},
{
"ip": "172.28.109.198",
"mac": "08-C1-14-56-73-A7",
"name": "存储服务器",
"type": "服务器",
"manufacturer": "Linux",
"ports": "80,22,443",
"status": "在线"
}
]
Clearly this is the data we want, and it comes back in JSON format, so we can write a Python crawler to fetch this URL and parse the returned JSON. Parsing JSON in Python requires the json library.
import json
import requests

url = 'http://spiderbuf.cn/playground/iplist'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}

# Request the JSON endpoint directly, just like the page's fetch() call does
data_json = requests.get(url, headers=myheaders).text
print(data_json)

# Save the raw response to disk for later inspection
f = open('./data/7/07.html', 'w', encoding='utf-8')
f.write(data_json)
f.close()

# Parse the JSON string into a Python list of dicts
ls = json.loads(data_json)
print(ls)
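Once json.loads has turned the response into a list of dicts, you can process it like any other Python data. As a sketch, using two records copied from the response above in place of the live request, here is one way you might filter the online devices and export everything to CSV (the in-memory buffer and field choices are illustrative assumptions, not part of the exercise):

```python
import csv
import io

# Sample records copied from the JSON response above; in the real crawler
# this list would be the result of json.loads(data_json).
ls = [
    {"ip": "172.18.117.6", "mac": "DD-D1-1C-32-89-E3", "name": "OA服务器",
     "type": "服务器", "manufacturer": "HUAWEI", "ports": "80,22,443", "status": "在线"},
    {"ip": "10.49.119.200", "mac": "34-59-71-41-63-8E", "name": "交换机",
     "type": "交换机", "manufacturer": "HUAWEI", "ports": "80,22,443", "status": "离线"},
]

# Keep only the devices whose status is "在线" (online)
online = [d for d in ls if d["status"] == "在线"]

# Write all records to CSV; an in-memory buffer is used here so the sketch
# has no file-system side effects, but open('07.csv', 'w', newline='')
# works the same way.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(ls[0].keys()))
writer.writeheader()
writer.writerows(ls)
print(buf.getvalue())
```

Because the response is a plain list of dicts, the same pattern extends to counting devices by type, grouping by manufacturer, and so on.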
Some readers who look at the complete sample code may wonder why its url differs from the one here.
That is because the complete sample code targets an older version of the practice site, which used a third-party component that appended a parameter like order=asc by default, whereas the URL here is copied straight from the JavaScript code. Both return the same result.
In terms of the exercise sequence, this is the first appearance of the AJAX style of requesting data. In practice, many sites today rely heavily on this technique of loading data dynamically with JavaScript, and the code looks slightly different from one JavaScript library to another.
I recommend getting comfortable with vanilla JavaScript syntax: it will make the later exercises, such as JavaScript reverse engineering, noticeably easier, whereas skipping it tends to make things more confusing as you go.
Complete sample code: sample code