The Simplest Way to Scrape Data with PHP


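Without further ado, here is the code. It scrapes search results from Fliggy. Pagination can be handled through the pagenum query parameter, and the total page count can be extracted in the same way as the other fields. To capture more data, define your own expressions: anything that appears in the HTML returned by the server can be extracted (content that the browser renders later with JavaScript will not be in that HTML). In essence, the approach is to fetch the HTML source and cut it apart with regular expressions to get the final data: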

public function xs()
    {
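        $content = file_get_contents('https://travelsearch.fliggy.com/index.htm?searchType=product&keyword=%E4%BA%91%E5%8D%97&pagenum=1');
        // file_get_contents() returns false when the request fails.
        if ($content === false) {
            return;
        }

        // Keep only the product list: from the list container up to the page-jump widget.
        $pos1 = strpos($content, '<div class="page-products-block-left clear-fix">');
        $pos2 = strpos($content, '<span class="page-total">到第<input type="text" class="page-skip">页<span type="button" class="confirm-btn" data-spm-click="gostr=/tbtrip;locaid=dredirect">确定</span></span>');
        // strpos() returns false when a marker is missing, for example after a layout change.
        if ($pos1 === false || $pos2 === false) {
            return;
        }
        $content = substr($content, $pos1, $pos2 - $pos1);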
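        // Image URLs: lazy-loaded images keep the real address in data-src.
        preg_match_all('/<img alt="" class=\"lazy-image\".*? data-src="(.*?)"/si', $content, $matches);
        // Sample tag: <img alt="" class="lazy-image" data-src="" data-lazyid="44" style="transition: opacity 100ms ease 0s; opacity: 1;" src="">
        // array_unique() drops repeated URLs, so these indexes may not line up with the other arrays.
        $href = array_values(array_unique($matches[1]));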

        // Titles, taken from the <h3 class="main-title"> element.
        preg_match_all('/<h3 class=\"main-title\">(.*?)<\/h3>/i', $content, $matches);
        $title = $matches[1];

        // Alternative patterns that read the _src and title attributes instead:
        // preg_match_all('/_src=\"(.*?)\"/i', $content, $matches);
        // preg_match_all('/title=\"(.*?)\"/i', $content, $matches);
        // $title = $matches[1];

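        // Price: keep the whole price-box markup (it is escaped with htmlentities() below).
        // To capture only the number, this pattern can be used instead:
        // preg_match_all('/<span class=\"price\".*?><em>¥<\/em>(.*?)<\/span>/i', $content, $matches);
        preg_match_all('/<div class=\"price-box\".*?>(.*?)<\/div>/i', $content, $matches);
        $price = $matches[1];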

        preg_match_all('/<span class=\"tag-value\".*?>(.*?)<\/span>/i', $content, $matches);
        $tag = $matches[1];
        // print_r($tag);
        // return 1;
        $data = array();

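        // Iterate only over indexes that exist in every array, to avoid undefined-offset notices.
        $len = min(count($href), count($title), count($price), count($tag));
        for ($i = 0; $i < $len; $i++) {
            $data[] = array(
                'href' => $href[$i], // image URL taken from data-src
                // 'src' => $src[$i],
                'title' => $title[$i],
                'price' => htmlentities($price[$i], ENT_QUOTES, "UTF-8"),
                'tag' => $tag[$i],
            );
        }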
        print_r($data);
    }
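
One weakness of matching each field across the whole page is that a product with a missing field shifts every later index. A rough sketch of an alternative is to cut the page into one chunk per product first and then extract the fields inside each chunk. The sample markup below is made up for illustration: the `product-item` wrapper and the helper names `firstMatch` and `parseProducts` are assumptions, not taken from the Fliggy page.

```php
<?php
// Hypothetical sample markup. The inner class names mirror the ones used
// above, but the "product-item" wrapper is an assumption for illustration.
$html = <<<'HTML'
<div class="product-item">
  <img alt="" class="lazy-image" data-src="//img.example.com/a.jpg">
  <h3 class="main-title">Trip A</h3>
  <span class="price"><em>¥</em>100</span>
</div>
<div class="product-item">
  <h3 class="main-title">Trip B</h3>
  <span class="price"><em>¥</em>200</span>
</div>
HTML;

/** Return the first capture group of $pattern in $text, or null if absent. */
function firstMatch(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? trim($m[1]) : null;
}

/** Split the HTML into per-product chunks and extract fields from each. */
function parseProducts(string $html): array
{
    // One chunk per product, so a missing field cannot shift the others.
    // This simple pattern assumes the wrapper contains no nested <div>.
    preg_match_all('/<div class="product-item">(.*?)<\/div>/su', $html, $blocks);

    $items = [];
    foreach ($blocks[1] as $block) {
        $items[] = [
            'image' => firstMatch('/data-src="(.*?)"/su', $block),
            'title' => firstMatch('/<h3 class="main-title">(.*?)<\/h3>/su', $block),
            'price' => firstMatch('/<span class="price"[^>]*><em>¥<\/em>(.*?)<\/span>/su', $block),
        ];
    }
    return $items;
}

print_r(parseProducts($html));
```

In this sample the second product has no image, so its `image` entry is null while its title and price still belong to the right product.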

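For pagination, one possible approach is to rebuild the request URL with a different pagenum value for each page. The sketch below assumes that pagenum selects the results page, as the URL in xs() suggests; the function name `buildSearchUrl` and the three-page limit are hypothetical.

```php
<?php
/**
 * Build the search URL for one results page.
 * The base URL and parameter names are copied from the request in xs();
 * the function name is hypothetical.
 */
function buildSearchUrl(string $keyword, int $page): string
{
    $query = http_build_query([
        'searchType' => 'product',
        'keyword'    => $keyword, // percent-encoded by http_build_query()
        'pagenum'    => $page,
    ], '', '&');

    return 'https://travelsearch.fliggy.com/index.htm?' . $query;
}

// Collect the URLs for the first three pages. Each one could then be
// fetched with file_get_contents() and parsed the same way as in xs().
$urls = [];
for ($page = 1; $page <= 3; $page++) {
    $urls[] = buildSearchUrl('云南', $page);
}
print_r($urls);
```

Building the query with http_build_query() avoids encoding the keyword by hand. In practice the loop bound would come from the total page count read from the page rather than a fixed number.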

Below is a screenshot of the scraped data: