Python Forum
Hyperlinks
#1
This time I have been given a truckload of pdfs.
They are scans of historical articles. Quality is good, OCR is OK.
I can search the texts programmatically.

Here is where I could use some advice/ideas:
A researcher comes in and e.g. wants to find out which articles (pdfs)
contain the word "Napoleon" at least once. He wants to read the whole article for context.
I can do that; say I find 10 'hits'.
Now I have to present the researcher with some type of media
with 10 hyperlinks to the 10 pdfs.
My initial thought is to make an HTML page on the fly,
start the default browser, and somehow show the page there, if all that is possible?
When he/she wants to ask another question, I get rid of the temporary HTML page and do the same again.
We are NOT talking about an internet environment, just a desktop happening.
Any suggestions?
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
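A minimal sketch of the temporary-page idea, using only the standard library and no web server. The function names `build_results_page` and `show_results` are made up for illustration, and the list of matching PDF paths is assumed to come from the existing search code.

```python
import html
import tempfile
import webbrowser
from pathlib import Path


def build_results_page(query, pdf_paths):
    """Return an HTML page listing one file:// hyperlink per matching PDF."""
    items = []
    for path in pdf_paths:
        path = Path(path)
        # as_uri() needs an absolute path and yields a file:// URL.
        url = path.absolute().as_uri()
        items.append(
            f'<li><a href="{html.escape(url)}">{html.escape(path.name)}</a></li>'
        )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>Results for {html.escape(query)}</title></head>\n"
        f"<body><h3>{len(items)} hit(s) for {html.escape(query)}</h3>\n"
        "<ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    )


def show_results(query, pdf_paths):
    """Write the page to a temporary file and open it in the default browser."""
    page = build_results_page(query, pdf_paths)
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(page)
    result_file = Path(handle.name)
    webbrowser.open(result_file.as_uri(), new=2)
    return result_file  # the caller can unlink() this before the next query
```

Calling `show_results("Napoleon", hits)` would open the page in a new browser tab; whether a click on a link shows the PDF in the browser or in an external viewer depends on the desktop's settings.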
#2
(Feb-10-2023, 07:06 AM)DPaul Wrote: We are NOT talking about an internet environment, just a desktop happening.
You could start a Flask application serving html files on the local host for example (or Bottle, or whatever).
from bottle import route, run, template

@route('/hello/<name>')
def index(name):
    return template('<b>Hello {{name}}</b>!', name=name)

run(host='localhost', port=8080)
or
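from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"

# Start the development server when the file is run directly
if __name__ == "__main__":
    app.run(host="localhost", port=8080)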
#3
(Feb-10-2023, 08:26 AM)Gribouillis Wrote: You could start a Flask application serving html files on the local host for example (or Bottle, or whatever).
Wow, need to explore that possibility.
Thanks.
Paul
#4
(Feb-10-2023, 08:26 AM)Gribouillis Wrote: You could start a Flask application serving html files on the local host for example (or Bottle, or whatever).
Just to see if I get this right: (Internet Information Services are enabled)
1) I do a search with a python program and I come up with 10 "hits".
2) I create an on-the-fly web page (e.g. index.html) with 10 hyperlinks.
3) This page is posted to ..\wwwroot\index.html, and obviously replaces the old one.
4) In Python: webbrowser.open('http://localhost', new=2)

This might not be entirely what you suggested, but I'm trying to KISS.
thx,
Paul
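Steps 2 and 3 could be sketched roughly as below. This assumes the PDFs themselves are stored somewhere under the web root, so that relative links resolve over http; `publish_index` and its parameters are hypothetical names, and the real web root path depends on the local setup.

```python
import html
import os
from pathlib import Path
from urllib.parse import quote


def publish_index(web_root, pdf_names, query):
    """Replace <web_root>/index.html with links to PDFs stored under web_root."""
    web_root = Path(web_root)
    links = "\n".join(
        # Relative URLs, so the page works whatever host serves web_root.
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in pdf_names
    )
    page = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n'
        f"<h3>Results for {html.escape(query)}</h3>\n<ul>\n{links}\n</ul>\n"
        "</body></html>\n"
    )
    # Write to a sibling file first, then swap it in with a single rename,
    # so a browser never loads a half-written index.html.
    tmp = web_root / "index.html.tmp"
    tmp.write_text(page, encoding="utf-8")
    os.replace(tmp, web_root / "index.html")
    return web_root / "index.html"
```

After that, step 4 stays as it is: `webbrowser.open('http://localhost', new=2)`.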
#5
No, I mean a UI in a web browser, such as this one, created with the help of the Bottle tutorial. Start the program in a terminal, then open http://localhost:8080 in the web browser:
import html

from bottle import route, request, run

def search_form():
    return '''
        <form action="/search" method="post">
            Search: <input name="search" type="text" />
            <input value="Search" type="submit" />
        </form>
    '''

# Show the empty form on both the start page and /search
@route('/')
@route('/search')
def search():
    return search_form()

@route('/search', method='POST')
def do_search():
    # Escape the user's input before echoing it back into the page
    query = html.escape(request.forms.get('search', ''))
    return f'''
        {search_form()}
        <h3>Results</h3>
        <ul>
        <li>spam {query}</li>
        <li>ham {query}</li>
        <li>eggs {query}</li>
        </ul>
    '''

run(host='localhost', port=8080, debug=True)
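To go from the placeholder results to real hyperlinks, something along these lines might work. It is only a sketch: `render_hits`, `make_app` and the `find_pdfs` callback are hypothetical names, and `find_pdfs(word)` is assumed to return the file names (relative to the PDF folder) of the matching articles. Bottle's `static_file` is used to hand the PDFs to the browser.

```python
import html
from urllib.parse import quote

SEARCH_FORM = '''
    <form action="/search" method="post">
        Search: <input name="search" type="text" />
        <input value="Search" type="submit" />
    </form>
'''


def render_hits(search, pdf_names):
    """Return an HTML fragment with one link per PDF that matched `search`."""
    if not pdf_names:
        return f"<p>No articles contain {html.escape(search)}.</p>"
    items = "\n".join(
        f'<li><a href="/pdf/{quote(name)}">{html.escape(name)}</a></li>'
        for name in pdf_names
    )
    return f"<h3>Results for {html.escape(search)}</h3>\n<ul>\n{items}\n</ul>"


def make_app(pdf_dir, find_pdfs):
    """Build a Bottle app around a caller-supplied search function."""
    # Third-party import (pip install bottle), kept local to this factory.
    from bottle import Bottle, request, static_file

    app = Bottle()

    @app.route("/")
    def start_page():
        return SEARCH_FORM

    @app.route("/search", method="POST")
    def do_search():
        search = request.forms.get("search", "")
        return SEARCH_FORM + render_hits(search, find_pdfs(search))

    @app.route("/pdf/<filename:path>")
    def serve_pdf(filename):
        # static_file refuses paths that escape `root`.
        return static_file(filename, root=pdf_dir)

    return app
```

It would be started with something like `make_app("/path/to/pdfs", my_search).run(host="localhost", port=8080)`, where both arguments are placeholders for the real folder and search function.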
#6
(Feb-10-2023, 03:19 PM)Gribouillis Wrote: No I mean a UI in a web browser
Ok, I got halfway there using Bottle, and I downloaded the user guide.
But I have no experience running searches through an HTML page.
The example makes it a lot clearer. Thanks.
I will look into it; eventually it will be needed when this is released to the public via the internet.

Just F.Y.I., I discovered a stumbling block: a lot of the texts seem to be in German,
and they use "special" fonts. Only tesseract is good enough to figure those out,
but tesseract works on images, not on PDF text. So now it will be either hyperlinks to
PDFs, or we could show pages as images, like prayer cards.
Still trying to KISS.
Paul

Edit: users just told me they prefer hyperlinks.... :-(
#7
I think I have this covered in a very "raw" format,
but I'm getting links to pdfs.
One more thing though:
I'm sort of mimicking "Google Books" with those pdfs.
Only, "Google Books" shows not only the pages of your "hits",
but also highlights the search word in yellow. Every occurrence!
Probably they are not showing text, but a picture.
Pinpointing a word on a picture containing text is something else.

Are the algorithms that "Google Books" uses open source? :-)
Paul
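Whatever Google Books does internally, one possible way to pinpoint a word on a page image is to ask tesseract for word bounding boxes and shade the matching ones. The sketch below assumes the pytesseract and Pillow packages plus German language data for tesseract are installed; the function names and the `lang="deu"` default are illustrative choices, not a tested recipe for these scans.

```python
def matching_boxes(ocr_data, word):
    """Return (left, top, right, bottom) for every OCR'd word equal to `word`.

    `ocr_data` is the dict produced by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    target = word.casefold()
    boxes = []
    for i, text in enumerate(ocr_data["text"]):
        # Strip surrounding punctuation so "Napoleon," still counts as a hit.
        if text.strip(".,;:!?\"'()[]").casefold() == target:
            left, top = ocr_data["left"][i], ocr_data["top"][i]
            boxes.append(
                (left, top, left + ocr_data["width"][i], top + ocr_data["height"][i])
            )
    return boxes


def highlight_word(image_path, word, lang="deu"):
    """Return a copy of the page image with each hit shaded in yellow."""
    # Third-party packages, imported here so matching_boxes stays dependency-free.
    import pytesseract
    from PIL import Image, ImageDraw

    page = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(
        page, lang=lang, output_type=pytesseract.Output.DICT
    )
    overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    for box in matching_boxes(data, word):
        draw.rectangle(box, fill=(255, 255, 0, 96))  # translucent yellow
    return Image.alpha_composite(page.convert("RGBA"), overlay)
```

The returned image could be saved with `.save("page_hits.png")` and linked from the results page next to the PDF; how well the boxes line up depends entirely on the OCR quality for these fonts.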