LiJell's Growth Journal

__02.data_analyzing_crawling (Crawling)

All_is_LiJell 2022. 1. 13. 18:47

Crawling

pandas basics for crawling

https://lime-jelly.tistory.com/37

2. Understanding HTML

2.1 Importing the required libraries

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np

2.2 Example

Example download: https://github.com/Play-with-data/datasalon/blob/master/02_%EA%B0%9C%EC%A0%95%ED%8C%90/2_Data_Analysis_Basic/2_2_Crawling.ipynb

html = '''
<html>
    <head>
    </head>
    <body>
        <h1> 우리동네시장</h1>
            <div class = 'sale'>
                <p id='fruits1' class='fruits'>
                    <span class = 'name'> 바나나 </span>
                    <span class = 'price'> 3000원 </span>
                    <span class = 'inventory'> 500개 </span>
                    <span class = 'store'> 가나다상회 </span>
                    <a href = 'http://bit.ly/forPlaywithData' > 홈페이지 </a>
                </p>
            </div>
            <div class = 'prepare'>
                <p id='fruits2' class='fruits'>
                    <span class ='name'> 파인애플 </span>
                </p>
            </div>
    </body>
</html>
'''

2.3 Finding information with BeautifulSoup

Parsing with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
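To check that parsing worked before running any selectors, here is a minimal sketch. It uses a shortened stand-in for the sample HTML (`sample_html` and `soup_demo` are names made up for this illustration) so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample HTML above (an assumption, so the block runs alone).
sample_html = """
<html>
    <body>
        <h1> 우리동네시장</h1>
        <div class='sale'>
            <p id='fruits1' class='fruits'>
                <span class='name'> 바나나 </span>
            </p>
        </div>
    </body>
</html>
"""

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup_demo = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns the first matching tag in the tree.
heading = soup_demo.h1.get_text(strip=True)
print(heading)
```

If `heading` comes back as the market name without surrounding spaces, the document was parsed into a tree and is ready for `select`.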

Finding tags by tag name

tags_span = soup.select('span')
print(tags_span)
print(tags_span[0])
print(tags_span[0].text)
'''
[<span class="name"> 바나나 </span>, <span class="price"> 3000원 </span>, <span class="inventory"> 500개 </span>, <span class="store"> 가나다상회 </span>, <span class="name"> 파인애플 </span>]
<span class="name"> 바나나 </span>
 바나나 
'''

tags_p = soup.select('p')
tags_p[0].text
'''
'\n 바나나 \n 3000원 \n 500개 \n 가나다상회 \n 홈페이지 \n'
'''
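`select` always returns a list, even when only one tag matches. When only the first match is needed, `select_one` may be more convenient. The sketch below compares the two on a small made-up snippet (`snippet` is not part of the original example).

```python
from bs4 import BeautifulSoup

# Small made-up snippet so the block runs on its own.
snippet = "<p><span class='name'> 바나나 </span><span class='price'> 3000원 </span></p>"
soup_demo = BeautifulSoup(snippet, "html.parser")

all_spans = soup_demo.select("span")       # list of every <span>
first_span = soup_demo.select_one("span")  # only the first <span>
missing = soup_demo.select_one("table")    # None when nothing matches

# get_text(strip=True) removes the surrounding whitespace in one step.
span_texts = [tag.get_text(strip=True) for tag in all_spans]
print(span_texts)
```

Because `select_one` returns `None` when there is no match, it is worth checking the result before reading `.text` from it.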

Finding by id

ids_fruits = soup.select('#fruits1')
ids_fruits
'''
[<p class="fruits" id="fruits1">
 <span class="name"> 바나나 </span>
 <span class="price"> 3000원 </span>
 <span class="inventory"> 500개 </span>
 <span class="store"> 가나다상회 </span>
 <a href="http://bit.ly/forPlaywithData"> 홈페이지 </a>
 </p>]
'''
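An id should be unique within a page, so fetching the single tag directly is often enough. The following sketch, on a shortened made-up snippet, shows `select_one` with an id selector and how to read the tag's attributes.

```python
from bs4 import BeautifulSoup

# Shortened made-up snippet so the block runs on its own.
snippet = "<div><p id='fruits1' class='fruits'><span class='name'> 바나나 </span></p></div>"
soup_demo = BeautifulSoup(snippet, "html.parser")

fruit = soup_demo.select_one("#fruits1")  # a single Tag instead of a list
fruit_id = fruit["id"]                    # attribute access works like a dict
fruit_classes = fruit.get("class")        # class is multi-valued, so this is a list

# find(id=...) is the non-CSS way to reach the same tag.
same_fruit = soup_demo.find(id="fruits1")
print(fruit_id, fruit_classes)
```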

Finding by class

class_name = soup.select('.name')
class_name[1].text
'''
' 파인애플 '
'''
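Since a class usually matches several tags, the result is typically looped over rather than indexed. This sketch, on a made-up snippet, collects every `.name` text into a list and shows the equivalent `find_all(class_=...)` call.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two tags that share the 'name' class.
snippet = """
<p id='fruits1'><span class='name'> 바나나 </span></p>
<p id='fruits2'><span class='name'> 파인애플 </span></p>
"""
soup_demo = BeautifulSoup(snippet, "html.parser")

# CSS selector version: '.name' matches any tag carrying that class.
names = [tag.get_text(strip=True) for tag in soup_demo.select(".name")]

# find_all version: class_ has a trailing underscore because class is a Python keyword.
names_via_find_all = [tag.get_text(strip=True) for tag in soup_demo.find_all(class_="name")]
print(names)
```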

Using the parent structure

soup.select('span.name')
soup.select('span.inventory')
'''
[<span class="inventory"> 500개 </span>]
'''

soup.select('#fruits1 > span.name')
soup.select('#fruits2 > span.name')
'''
[<span class="name"> 파인애플 </span>]
'''

soup.select('div.sale span.store') # when skipping over intermediate levels (here, the p.fruits tag)
'''
[<span class="store"> 가나다상회 </span>]
'''
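The difference between `>` (direct child) and a space (any descendant) is easy to miss. The sketch below, on a shortened made-up snippet, shows the same target reached or missed depending on the combinator.

```python
from bs4 import BeautifulSoup

# Shortened made-up snippet: div.sale > p#fruits1 > span.store
snippet = """
<div class='sale'>
    <p id='fruits1' class='fruits'>
        <span class='store'> 가나다상회 </span>
    </p>
</div>
"""
soup_demo = BeautifulSoup(snippet, "html.parser")

# '>' requires a direct parent-child relationship; the span sits under <p>, not <div>.
direct_child = soup_demo.select("div.sale > span.store")

# A space matches at any depth, so the intermediate <p> can be skipped.
descendant = soup_demo.select("div.sale span.store")

# Spelling out every level with '>' also works.
full_path = soup_demo.select("div.sale > p > span.store")
print(len(direct_child), len(descendant), len(full_path))
```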

Extracting values

tags_a = soup.select('a')[0]
tags_a
'''
<a href="http://bit.ly/forPlaywithData"> 홈페이지 </a>
'''

tags_a.text
'''
' 홈페이지 '
'''

tags_a.text.strip()
'''
'홈페이지'
'''

tags_a['href']
'''
'http://bit.ly/forPlaywithData'
'''
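The extracted text and attributes can then be gathered into a pandas DataFrame. This is a minimal sketch under a few assumptions: `sample_html` is a shortened copy of the example, `text_or_none` is a hypothetical helper, and missing fields are stored as `None`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the sample HTML (an assumption, so the block runs alone).
sample_html = """
<div class='sale'>
  <p id='fruits1' class='fruits'>
    <span class='name'> 바나나 </span>
    <span class='price'> 3000원 </span>
    <a href='http://bit.ly/forPlaywithData'> 홈페이지 </a>
  </p>
</div>
<div class='prepare'>
  <p id='fruits2' class='fruits'>
    <span class='name'> 파인애플 </span>
  </p>
</div>
"""


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


soup_demo = BeautifulSoup(sample_html, "html.parser")

rows = []
for fruit in soup_demo.select("p.fruits"):
    link = fruit.select_one("a")  # not every item has a link
    rows.append({
        "id": fruit["id"],
        "name": text_or_none(fruit, "span.name"),
        "price": text_or_none(fruit, "span.price"),
        "url": link["href"] if link is not None else None,
    })

df = pd.DataFrame(rows)
print(df)
```

Searching inside each `p.fruits` tag, rather than across the whole page, keeps the values of one item on the same row even when some fields are missing.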

Melon, Bugs, Genie crawling: https://lime-jelly.tistory.com/40

YouTube crawling: https://lime-jelly.tistory.com/41
