Extracting Part of the HTML from a Site

Scraping with DOMDocument

DOMDocument

Get a list of the <～> elements inside the tag cloud in the side menu.

PHP

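// Fetch the full page source
$html = file_get_contents('http://example.com/');
// If the site is protected by Basic authentication
// $html = file_get_contents('http://user:password@example.com/');
if ($html === false) {
    exit('Failed to fetch the page.');
}

// Prevent garbled characters: DOMDocument does not assume UTF-8 input
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');

$domDocument = new DOMDocument();
// Parse the HTML; @ suppresses warnings caused by malformed markup
@$domDocument->loadHTML($html);

$src = [];
// Get the contents of <div id="tag"><ul>～</ul></div>
$tag = $domDocument->getElementById('tag');
$domElement = $tag ? $tag->getElementsByTagName('ul') : null;
if ($domElement && $domElement->length > 0) {
    // Take the first <ul> and walk its child nodes
    $nodes = $domElement->item(0)->childNodes;
    for ($i = 0; $i < $nodes->length; $i++) {
        // Serialize each child node back to markup
        $line = trim($domDocument->saveXML($nodes->item($i)));
        if ($line !== '') {
            $src[] = $line;
        }
    }
}

echo implode('', $src);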
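
To try the DOMDocument approach without hitting a real site, here is a minimal sketch that runs against an inline HTML string. The helper name `extractListItems`, the sample markup, and the use of DOMXPath are my own assumptions for illustration, not part of the snippet above. It also uses the common `<?xml encoding="UTF-8">` prefix as a UTF-8 hint instead of `mb_convert_encoding`, and it assumes the container id contains no double quotes.

```php
<?php
// Hypothetical helper: returns the markup of each child element of the
// first <ul> found inside the element with the given id.
function extractListItems(string $html, string $containerId): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint makes libxml read the input as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Child elements of the first <ul> under the container
    $nodes = $xpath->query(sprintf('(//*[@id="%s"]//ul)[1]/*', $containerId));

    $items = [];
    foreach ($nodes as $node) {
        // saveHTML() with a node argument serializes only that node
        $items[] = $doc->saveHTML($node);
    }
    return $items;
}

// Sample markup invented for this example
$sample = '<div id="tag"><ul><li><a href="/tag/php">PHP</a></li>'
        . '<li><a href="/tag/dom">DOM</a></li></ul></div>';

$items = extractListItems($sample, 'tag');
echo implode("\n", $items), "\n";
```

When the container or the `<ul>` is missing, the query simply matches nothing and the function returns an empty array, so no null checks are needed at the call site.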

PHP Simple HTML DOM Parser

For details on PHP Simple HTML DOM Parser, see "Generating sitemap.xml with PHP".

PHP

// Load the library
require_once('simple_html_dom.php');

// Fetch and parse the full page source
$html = file_get_html('http://example.com/');
// If the site is protected by Basic authentication
// $html = file_get_html('http://user:password@example.com/');

$src = [];
if (! empty($html)) {
    foreach ($html->find('#tag ul') as $line) {
        // Prevent garbled characters
        $src[] = mb_convert_encoding($line->outertext, 'HTML-ENTITIES', 'UTF-8');
    }
}

echo implode('', $src);
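
For the Basic authentication case, the commented-out lines put the credentials directly in the URL. As an alternative for `file_get_contents`, the sketch below sends them in an `Authorization` header through a stream context. The helper name `basicAuthContext` and the timeout value are assumptions of mine, and the actual fetch is left commented out so the example runs without network access.

```php
<?php
// Hypothetical helper: builds a stream context that sends Basic credentials.
function basicAuthContext(string $user, string $password, int $timeout = 10)
{
    // Basic auth is "user:password" encoded in Base64
    $header = 'Authorization: Basic ' . base64_encode($user . ':' . $password);

    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => $header,
            'timeout' => $timeout, // seconds; an arbitrary choice here
        ],
    ]);
}

$context = basicAuthContext('user', 'password');

// Uncomment to fetch; file_get_contents() returns false on failure
// $html = file_get_contents('http://example.com/', false, $context);
```

This keeps the password out of the URL string, which makes it less likely to end up in logs or error messages that print the URL.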

PHPをよく書いている人の備忘録 (Notes from someone who writes a lot of PHP)
