htmlq: like jq (the JSON processor), but for HTML

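jq is like sed for JSON data: you can use it to slice, filter, map, and transform structured data. htmlq works in a similar way for HTML content, using CSS selectors to extract parts of an HTML document. Paired with a shell script, this command-line tool is enough to build a simple web scraper.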

Installation

htmlq can be installed with cargo or Homebrew. With Homebrew:

brew install htmlq
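
With cargo, the equivalent command is `cargo install htmlq`, which assumes a Rust toolchain is already set up. The short sketch below is one way to confirm that the tool ended up on your PATH; `have_cmd` is a hypothetical helper name introduced here for illustration, not part of htmlq.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd htmlq; then
    echo "htmlq found at: $(command -v htmlq)"
else
    # Install hints only; nothing is installed automatically.
    echo "htmlq not found; try 'brew install htmlq' or 'cargo install htmlq'" >&2
fi
```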

Usage examples

Use with cURL to find part of a page by ID

$ curl --silent https://www.rust-lang.org/ | htmlq '#get-help'
<div class="four columns mt3 mt0-l" id="get-help">
        <h4>Get help!</h4>
        <ul>
          <li><a href="https://doc.rust-lang.org">Documentation</a></li>
          <li><a href="https://users.rust-lang.org">Ask a Question on the Users Forum</a></li>
          <li><a href="http://ping.rust-lang.org">Check Website Status</a></li>
        </ul>
        <div class="languages">
            <label class="hidden" for="language-footer">Language</label>
            <select id="language-footer">
                <option title="English (US)" value="en-US">English (en-US)</option>
<option title="French" value="fr">Français (fr)</option>
<option title="German" value="de">Deutsch (de)</option>

            </select>
        </div>
      </div>

Find all the links in a page

$ curl --silent https://www.rust-lang.org/ | htmlq --attribute href a
/
/tools/install
/learn
/tools
/governance
/community
https://blog.rust-lang.org/
/learn/get-started
https://blog.rust-lang.org/2019/04/25/Rust-1.34.1.html
https://blog.rust-lang.org/2018/12/06/Rust-1.31-and-rust-2018.html
[...]
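
Many of the links above are root-relative (`/learn`, `/tools`), so a scraper usually needs to turn them into full URLs before fetching them. The following is a minimal sketch of that step, not a complete crawler. `absolutize` and `page_links` are hypothetical names; the sketch only handles root-relative and protocol-relative links and assumes `https` for the latter.

```shell
#!/bin/sh
# Hypothetical helper: read links on stdin and print absolute URLs.
# $1 is the site origin, e.g. https://www.rust-lang.org
absolutize() {
    base=${1%/}   # drop one trailing slash from the origin, if present
    while IFS= read -r link; do
        case $link in
            http://*|https://*) printf '%s\n' "$link" ;;        # already absolute
            //*)                printf 'https:%s\n' "$link" ;;  # assumption: https
            /*)                 printf '%s\n' "$base$link" ;;   # root-relative
            *)                  printf '%s\n' "$link" ;;        # left untouched
        esac
    done
}

# Hypothetical wrapper: unique absolute links on one page of a site.
# $1 is the origin, $2 the path (defaults to /). Needs curl and htmlq on PATH.
page_links() {
    origin=${1%/}
    path=${2:-/}
    curl --silent "$origin$path" | htmlq --attribute href a |
        absolutize "$origin" | sort -u
}
```

For example, `page_links https://www.rust-lang.org /` would print each distinct link on the home page as a full URL, ready to be fed back into curl.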

Get the text content of a page

$ curl --silent https://nixos.org/nixos/about.html | htmlq --text .main

          About NixOS

NixOS is a GNU/Linux distribution that aims to
improve the state of the art in system configuration management.  In
existing distributions, actions such as upgrades are dangerous:
upgrading a package can cause other packages to break, upgrading an
entire system is much less reliable than reinstalling from scratch,
you can’t safely test what the results of a configuration change will
be, you cannot easily undo changes to the system, and so on.  We want
to change that.  NixOS has many innovative features:

[...]
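
As the output above shows, `--text` keeps the indentation and blank lines of the original markup. If you want tidier text for further processing, a small filter can clean it up. The sketch below uses plain awk; `tidy_text` is a hypothetical helper name, not an htmlq feature.

```shell
#!/bin/sh
# Hypothetical helper: strip leading/trailing spaces and tabs from each line,
# drop blank lines at the start and end, and collapse each run of blank
# lines in between into a single blank line.
tidy_text() {
    awk '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }
        NF == 0 { blank++; next }
        {
            if (seen && blank) print ""
            blank = 0
            seen = 1
            print
        }
    '
}

# Possible usage (needs curl and htmlq on PATH):
#   curl --silent https://nixos.org/nixos/about.html | htmlq --text .main | tidy_text
```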

Syntax highlighting with bat

$ curl --silent example.com | htmlq 'body' | bat --language html

Image by RD LH from Pixabay
