Reading files using Python 3 BeautifulSoup

Reading an HTML file using Python 3 BeautifulSoup

Reading HTML files is useful in data science, particularly for text mining. It can be done easily and accurately with the BeautifulSoup library in Python. To read the text of a full HTML document, the method below can be used anywhere.

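from bs4 import BeautifulSoup


def read_article(file_name):
    # Open the file as UTF-8 text; the with block closes it automatically.
    with open(file_name, "r", encoding="utf-8") as file:
        filedata = file.read()
    # Parse the markup with Python's built-in HTML parser.
    soup = BeautifulSoup(filedata, features="html.parser")
    # Return the document's text with the tags stripped.
    return soup.get_text()


print(read_article("anyhtml.html"))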

file_name can be any HTML document. Remember that the HTML file should be in the same folder as your source code, saved as an .html file. This method does not need a URL to read the HTML file; you only pass the file name with its extension.
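A bare file name is looked up relative to the current working directory, which is not always the folder that holds the script. The sketch below is one possible way to make the lookup explicit using only the standard library. The helper name resolve_local_file and its base_dir parameter are made up for this illustration and are not part of BeautifulSoup.

```python
from pathlib import Path


def resolve_local_file(file_name, base_dir=None):
    """Return the absolute path of file_name inside base_dir.

    Hypothetical helper for illustration: base_dir defaults to the
    current working directory when it is not given.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / file_name).resolve()
    # Fail early with a clear message instead of a later open() error.
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

Inside a script, something like read_article(resolve_local_file("anyhtml.html", Path(__file__).parent)) would then read the file that sits next to the script, no matter where the script is started from.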

Reading a PDF file using the Python 3 BeautifulSoup library

Most people use PyPDF2 to read PDF documents. It works well for some files, but it has limitations, so some PDF files cannot be read as they are. For that reason, the method below uses the tika library to extract the PDF content as XHTML and BeautifulSoup to pull the text out of each page. The following code shows this method so that anyone can use it easily.

from tika import parser
from bs4 import BeautifulSoup


def read_article(file_name):
    # Ask Tika for XHTML output so that the page boundaries are preserved.
    raw = parser.from_file(file_name, xmlContent=True)['content']
    data = BeautifulSoup(raw, 'lxml')
    # Each PDF page is wrapped in an element with class "page".
    pages = data.find_all(class_='page')
    # Join the text of all pages in order.
    return "".join(page.text for page in pages)


print(read_article("anypdf.pdf"))

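You can use any PDF file with the .pdf extension in place of anypdf.pdf. Remember that the PDF file and the source code should be in the same folder. The tika package runs Apache Tika behind the scenes, so a Java runtime must be installed, and the 'lxml' parser requires the lxml package.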
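To see what the page lookup does without installing Tika, the sketch below runs the same find_all(class_='page') idea on a small hand-written XHTML string. The sample markup is an assumption that only imitates the per-page div layout and was not captured from a real PDF. The function name split_pages is made up for this illustration, and the built-in html.parser is swapped in for lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for Tika's XHTML output (an assumption for this
# illustration, not real extractor output).
SAMPLE_XHTML = """
<html><body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page text.</p></div>
</body></html>
"""


def split_pages(xhtml):
    """Return one text string per element with class "page"."""
    soup = BeautifulSoup(xhtml, "html.parser")
    page_divs = soup.find_all("div", class_="page")
    # strip=True trims whitespace around each text fragment.
    return [div.get_text(separator=" ", strip=True) for div in page_divs]


pages = split_pages(SAMPLE_XHTML)
print(len(pages))
```

Keeping the pages in a list instead of joining them right away makes it possible to process a single page, for example pages[0], before building the full text.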
