Update api/core/rag/extractor/pdf_extractor.py

Since page.extract_text() may return None when no text is found, consider adding a check before performing encoding operations to avoid potential AttributeError.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
pull/19056/head
weiheng 1 year ago committed by GitHub
parent 236c9d64c3
commit 370a785d48
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@@ -69,7 +69,7 @@ class PdfExtractor(BaseExtractor):
         with pdfplumber.open(file_obj) as pdf:
             for page_number, page in enumerate(pdf.pages):
                 # Extract text with layout preservation and encoding detection
-                content = page.extract_text(layout=True)
+                content = page.extract_text(layout=True) or ""
                 # Try to detect and fix encoding issues
                 try:
                     # First try to decode as UTF-8
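
The guard in this change can be tried in isolation. The following is a minimal sketch, not code from the repository: `FakePage` and `safe_page_text` are hypothetical names standing in for a pdfplumber page and the patched line, and the only behavior assumed is the one the commit message describes, namely that `extract_text()` may return `None`.

```python
from typing import Optional


class FakePage:
    """Hypothetical stand-in for a pdfplumber page (not part of the commit)."""

    def __init__(self, text: Optional[str]) -> None:
        self._text = text

    def extract_text(self, layout: bool = False) -> Optional[str]:
        # Mirrors the behavior described in the commit message:
        # the result may be None when no text is found.
        return self._text


def safe_page_text(page) -> str:
    """Return the page text, falling back to "" when extraction yields None."""
    return page.extract_text(layout=True) or ""


pages = [FakePage("hello"), FakePage(None), FakePage("")]
texts = [safe_page_text(p) for p in pages]

# Every entry is now a str, so string methods such as .encode() are safe.
encoded = [t.encode("utf-8") for t in texts]
```

Without the `or ""` fallback, calling `.encode()` on the second page's result would raise `AttributeError`, because `None` has no such method. The fallback also maps any other falsy result to an empty string, which is harmless here since an empty string is already falsy.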
