新知一下
海量新知
5 9 0 7 3 4 8

小白也能学VBA网抓(4)

米宏office学堂 | 专注提升职场办公效率 2021/10/14 22:32

我的目标: 让中国的大学生走出校门的那一刻就已经具备这些office技能,让职场人士能高效使用office为其服务。支持我,也为自己加油!

从前几节VBA网抓课程介绍中我们了解到,V BA网抓主要是XmlHttp对象向http服务器发送请求并获得DOM(Microsoft XML Document Object Model)对象,然后通过HTMLDocument对象对DOM进行处理,获得想要的信息

今天介绍下利用另外一个对象来进行VBA网抓,它叫 WinHttp ,与XmlHttp的用法大致相同,但是比 XmlHttp对象的方法和属性更多,用起来肯定就更加灵活。

按照我们前两节课写的代码,把XmlHttp对象直接替换为WinHttp,代码如下:

Sub test3()

Dim OWinHttp As Object: Set OWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")

Dim OHtmlfile As Object: Set OHtmlfile = CreateObject("Htmlfile")

Dim sUrl$, sHtml$, oTable As Object, er As Object, erc As Object, x%, y%

Dim arr, ec

sUrl = "https://www.boc.cn/sourcedb/whpj/index.html"

With OWinHttp

.Open "get", sUrl, 0

.send

sHtml = .responseText

End With

With OHtmlfile

.body.innerHTML = sHtml

Set oTable = .getElementsByTagName("table")(1)

With oTable

ReDim arr(1 To .Rows.Length, 1 To .Cells.Length)

x = 1: y = 1

For Each er In .Rows

For Each erc In er.Cells

arr(x, y) = erc.innerText

y = y + 1

Next

y = 1

x = x + 1

Next

End With

End With

Sheets(1).Cells(Rows.Count, 1).End(3)(2, 1).Resize(UBound(arr), UBound(arr, 2)) = arr

End Sub

试下效果:

新知达人, 小白也能学VBA网抓(4)

完全可以。

如果你在用WinHttp对象提取的内容中出现了乱码,那是因为不同的网站采用不同的编码字符集,如果你要提取的网站用的不 是UTF-8或者Unicode编码字符集,用ResponseText返回的字符中就会有乱码。

比如用如下VBA网抓基础代码得到的bResult信息中可能就有乱码,

Sub test4()

Dim oHtml As Object

Set oHtml = VBA.CreateObject("WinHttp.WinHttpRequest.5.1")

Dim sUrl As String

'指定要抓取的网站

sUrl = "https://www.boc.cn/sourcedb/whpj/index.html"

With oHtml

.Open "GET", sUrl, False

.send

'获取返回的字节数组

bResult = .responseBody

End With

End Sub

怎么办呢?

这时就得通过 Adodb.Stream对象对返回的responseBody进行转码,转码后再进行相应的处理,整个流程如

Sub test4()

Dim oHtml As Object

Set oHtml = VBA.CreateObject("WinHttp.WinHttpRequest.5.1")

Dim sUrl As String

'指定要抓取的网站

sUrl = "https://www.boc.cn/sourcedb/whpj/index.html"

Dim sCharset As String

'指定要抓取的网站的字符编码

sCharset = "utf-8"

With oHtml

.Open "GET", sUrl, False

.send

'获取返回的字节数组

bResult = .responseBody

'按照指定的字符编码显示

sResult = Byte2String(bResult, sCharset)

Call HtmlTable(sResult)

End With

Set oHtml = Nothing

End Sub

Function Byte2String(bContent, ByVal sCharset As String)

Const adTypeBinary = 1

Const adTypeText = 2

Const adModeRead = 1

Const adModeWrite = 2

Const adModeReadWrite = 3

Dim oStream As Object

'创建流对象

Set oStream = CreateObject("ADODB.Stream")

With oStream

'打开流

.Open

'设置为字节模式

.Type = adTypeBinary

'写入字节

.Write bContent

'将位置定位在第一个字节

.Position = 0

'设置为文本模式

.Type = adTypeText

'设置编码的字符集

.Charset = sCharset

'读取编码后的文本

Byte2String = .ReadText

'关闭流对象

.Close

End With

End Function

'提取网页表格的代码

Sub HtmlTable(ByVal sHtml As String)

'网页html文档对象

Dim oHtmlDom As Object

'网页表格对象

Dim oTable As Object

'网页表格行对象

Dim oRows As Object

'网页表格单元格对象

Dim oCells As Object

'抓取的数据存放的excel表格对象

Dim oWK As Worksheet

Set oWK = Sheet1

iRow = oWK.Range("a65536").End(xlUp).Row + 1

Set oHtmlDom = CreateObject("htmlfile")

With oHtmlDom

.body.innerHTML = sHtml

Set oTable = .getElementsByTagName("table")(1)

With oTable

Set oRows = .Rows

For i = 0 To oRows.Length - 1

Set oCells = oRows(i).Cells

For j = 0 To oCells.Length - 1

oWK.Cells(iRow, j + 1) = oCells(j).innerText

Next j

iRow = iRow + 1

Next i

End With

End With

Set oHtmlDom = Nothing

Set oTable = Nothing

Set oRows = Nothing

Set oCells = Nothing

End Sub

以上代码参考了http://www.exceloffice.net/archives/175网站里的代码,本节内容也是我在学习了该网站中的文章后进行的一个总结。

本节我们主要是了解了WinHttp对象,以及遇到乱码时怎么用 Adodb.Stream 对象进行转码。

本节的分享就到这里,祝大家每天都有进步。

更多“VBA”相关内容

更多“VBA”相关内容

新知精选

更多新知精选