Python Forum

Full Version: Web scraping from bar chart in image format with Python
Hello.

I want to extract data (the x and y values from the axes) from a bar chart saved as an image (an <IMG> tag in the HTML). I don't know if this is possible.

If it is possible, I would like to know which Python libraries are best suited to this kind of web scraping, with examples.

Here is the URL of the web page with the bar chart:

URL: https://www.ers.usda.gov/data-products/c...rtId=95392

Thanks.
Dacodac
If you look at the HTML code for this chart (open the page and press Ctrl-U, or right-click on the chart and click Inspect), you will see the URL where the image file is stored, so you can download the image directly from that URL.
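If you would rather do that lookup in Python than in the browser, here is a minimal sketch using only the standard library. The names ImgCollector, find_image_urls and download are made up for this example, and the User-Agent header is an assumption (some sites reject requests without one). It only finds images that are present in the static HTML, not ones added later by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> and <img> both arrive as "img".
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return absolute URLs of all images referenced in the HTML text."""
    parser = ImgCollector(base_url)
    parser.feed(html)
    return parser.sources


def download(url, dest):
    """Save the resource at url to the local file dest."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(req) as resp, open(dest, "wb") as fh:
        fh.write(resp.read())
```

Typical use would be to fetch the page, pass its text to find_image_urls(), pick the chart from the returned list, and hand that URL to download().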
Get the image like Larz60+ said, then just use tesseract on the command line:

Quote:tesseract --dpi 300 --psm 3 -c preserve_interword_spaces=1 Downloads/share_of_us_grocery_shoppers.png /home/pedro/Downloads/grocery_shoppers.txt -l eng

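If you add tsv at the end of the command you get a .tsv file, which opens in Excel. Note that tesseract treats the output argument as a base name and appends the extension itself, so with the command above the text ends up in grocery_shoppers.txt.txt.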
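If you want to read that .tsv in Python instead of Excel, here is a rough sketch with the csv module. It relies on the columns tesseract writes in its TSV output (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 rows are single words. The function name lines_from_tsv is made up for this example.

```python
import csv
import io
from collections import defaultdict


def lines_from_tsv(tsv_text):
    """Rebuild text lines from tesseract TSV output, words ordered left to right."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels describe pages, blocks and lines.
        if row["level"] != "5" or not text:
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        lines[key].append((int(row["left"]), text))
    # Sort lines by their position keys, and the words in each line by x offset.
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    ]
```

With a real file you would call lines_from_tsv(open(path, encoding="utf-8").read()). The left, top, width and height columns are also useful if you want to keep only the words in a particular region of the chart.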

Otherwise, cv2 is good for doing things with images, but I think you don't need that here!

I tried just now with cv2 and pytesseract. It worked OK, but the Python code is about 55 lines, as opposed to the one-liner above.
You should try to find the data source, so you don't have to parse a finished graph image.
Getting data out of an image can be an interesting problem, so as a test I used pytesseract and Pillow.
Using tesseract from the command line, as Pedroski55 does, is also a good solution.
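
For reference, a minimal sketch of what a pytesseract and Pillow test could look like. This is not the exact code used for this post. The row pattern in ROW is an assumption about how the OCR text might come out (a time label ending in am or pm, then two whole numbers); real output depends on the image and usually needs manual checking. ocr_image and parse_rows are names made up for this example.

```python
import re


def ocr_image(path):
    """Run Tesseract on an image file and return the recognized text."""
    # Imported here so parse_rows() can be used without these packages installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang="eng", config="--psm 3")


# Assumed row layout: "<time label ending in am/pm> <weekday %> <weekend %>".
ROW = re.compile(
    r"^(?P<label>.+?(?:am|pm))\s+(?P<weekday>\d+)\s+(?P<weekend>\d+)\s*$",
    re.IGNORECASE,
)


def parse_rows(text):
    """Turn OCR text lines that match ROW into dicts ready for pandas."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(
                {
                    "Time Interval": match["label"],
                    "Weekday (%)": int(match["weekday"]),
                    "Weekend (%)": int(match["weekend"]),
                }
            )
    return rows
```

The list returned by parse_rows(ocr_image(path)) can go straight into pd.DataFrame(). Lines that do not match the assumed pattern are skipped silently, so compare the result against the picture before trusting it.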

Then I gave the data to pandas and tried to recreate the same graph.
import pandas as pd
import matplotlib.pyplot as plt

adjusted_data = [
    {'Time Interval': '8 - 8:59 am', 'Weekday (%)': 6, 'Weekend (%)': 3},
    {'Time Interval': '9 - 9:59 am', 'Weekday (%)': 10, 'Weekend (%)': 7},
    {'Time Interval': '10 - 10:59 am', 'Weekday (%)': 12, 'Weekend (%)': 11},
    {'Time Interval': '11 - 11:59 am', 'Weekday (%)': 14, 'Weekend (%)': 17},
    {'Time Interval': 'Noon - 12:59 pm', 'Weekday (%)': 14, 'Weekend (%)': 16},
    {'Time Interval': '1 - 1:59 pm', 'Weekday (%)': 12, 'Weekend (%)': 14},
    {'Time Interval': '2 - 2:59 pm', 'Weekday (%)': 10, 'Weekend (%)': 13},
    {'Time Interval': '3 - 3:59 pm', 'Weekday (%)': 8, 'Weekend (%)': 10},
    {'Time Interval': '4 - 4:59 pm', 'Weekday (%)': 8, 'Weekend (%)': 9},
    {'Time Interval': '5 - 5:59 pm', 'Weekday (%)': 11, 'Weekend (%)': 8},
    {'Time Interval': '6 - 6:59 pm', 'Weekday (%)': 11, 'Weekend (%)': 7},
    {'Time Interval': '7 - 7:59 pm', 'Weekday (%)': 7, 'Weekend (%)': 5}
]

df = pd.DataFrame(adjusted_data)
fig, ax = plt.subplots(figsize=(10, 8))
df.plot(
    x='Time Interval',
    kind='barh',
    ax=ax,
    color=['skyblue', 'orange'],
    edgecolor='black',
    width=0.6
)

ax.set_title('Share of U.S. Grocery Shoppers by Time of Day (2014-17)', fontsize=14, weight='bold')
ax.set_xlabel('Percent on an average day', fontsize=12)
ax.set_ylabel('Time of Day', fontsize=12)
ax.legend(title='Day Type', loc='upper right')
ax.grid(True, axis='x', linestyle='--', alpha=0.7)
ax.invert_yaxis()
ax.set_xlim(0, 25)

plt.tight_layout()
plt.show()
[Image: mrAqkU.png]

Original image:
[Image: share_of_us_grocery_shoppers_768px.png]