Reading an HTML file using Python 3 and BeautifulSoup
Reading HTML files is a common first step in text mining for data science. It can be done easily with the BeautifulSoup library in Python. To extract all the text from a full HTML document, the function below can be used.
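from bs4 import BeautifulSoup

def read_article(file_name):
    # Open the file as UTF-8; the with block closes it automatically
    with open(file_name, "r", encoding="utf-8") as file:
        filedata = file.read()
    # Parse with Python's built-in HTML parser (no extra dependency)
    soup = BeautifulSoup(filedata, features="html.parser")
    # get_text() returns the document text with the tags removed
    return soup.get_text()

print(read_article("anyhtml.html"))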
file_name can be any HTML document. The function takes a local file path, not a URL. If the HTML file is in the same folder as your source code, the file name with its extension is enough.
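By default, get_text() joins text fragments with nothing between them, so the text of adjacent elements can run together. As a minimal sketch, the hypothetical helper html_to_text below (not part of the original code) passes the separator and strip arguments of get_text() to put one space between fragments. The sample string is made up for illustration.

```python
from bs4 import BeautifulSoup

def html_to_text(html, separator=" "):
    """Return the text of an HTML string with a separator between fragments."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims whitespace around each fragment and drops empty ones
    return soup.get_text(separator=separator, strip=True)

# Made-up sample document, used only for illustration
sample = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"

print(html_to_text(sample))  # prints: Title First bold paragraph.
```

Without the separator, the same document would come back as "TitleFirst bold paragraph.", which is usually harder to tokenize in a text-mining step.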
Reading a PDF file using Python 3 and the BeautifulSoup library
Most people use PyPDF2 to read PDF documents. It works for many files, but it has limitations, and some PDF files cannot be read correctly with it. For that reason, this method uses the tika package to convert the PDF to XHTML and then the BeautifulSoup library to extract the text page by page. The following code shows the method.
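from tika import parser
from bs4 import BeautifulSoup

def read_article(file_name):
    # Ask Tika for XHTML output so that page boundaries are preserved
    raw = parser.from_file(file_name, xmlContent=True)["content"]
    # The "lxml" parser requires the lxml package to be installed
    data = BeautifulSoup(raw, "lxml")
    # Each PDF page is an element with class "page" in the XHTML
    pages = data.find_all(class_="page")
    # Concatenate the text of all pages
    full_text = ""
    for page in pages:
        full_text = full_text + page.text
    return full_text

print(read_article("anypdf.pdf"))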
You can use any file with the .pdf extension in place of anypdf.pdf. If you pass only the file name, the PDF file and the source code should be in the same folder.
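If the pages are needed separately instead of as one string, the same class="page" lookup can return a list. The sketch below is only an illustration: split_pages is a hypothetical helper, and SAMPLE_XHTML is a hand-written stand-in for the XHTML that Tika returns, so the example runs without Tika or Java. It also assumes the content may be empty or None (for example for an image-only PDF) and returns an empty list in that case.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for Tika's XHTML output (an assumption, not real output)
SAMPLE_XHTML = (
    '<html><body>'
    '<div class="page"><p>Page one text</p></div>'
    '<div class="page"><p>Page two text</p></div>'
    '</body></html>'
)

def split_pages(xhtml):
    """Return a list with the text of each element of class "page"."""
    if not xhtml:
        # Nothing was extracted, so there are no pages to return
        return []
    soup = BeautifulSoup(xhtml, "html.parser")
    return [
        div.get_text(separator=" ", strip=True)
        for div in soup.find_all("div", class_="page")
    ]

for number, text in enumerate(split_pages(SAMPLE_XHTML), start=1):
    print(number, text)
```

With a real PDF, the string passed to split_pages would be the "content" value returned by parser.from_file(file_name, xmlContent=True).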