Crawl email addresses from a specified website.
I have a list of company registration codes in a CSV file that is updated on a weekly basis.
I want to crawl the email address for each of those companies from the source website and write the addresses to a new CSV file.
The source URLs where the emails need to be crawled look like this:
http://www.somesite.com/result?country=en&q=1232498 (the "q" value is the variable company registration code; each code resolves to a different page containing the email).
Each registration code to crawl is located in the CSV file, in the second column, under the header "regcode".
(Source table structure: compname | regcode | othercol1 | othercol2; columns are separated by semicolons.)
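A minimal sketch of reading that semicolon-delimited source file with Python's standard csv module (the sample rows and company names here are made up for illustration; only the column layout comes from the description above):

```python
import csv
import io

def read_regcodes(csv_text):
    """Parse the semicolon-delimited source CSV and return the
    registration codes found in the "regcode" column."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return [row["regcode"] for row in reader if row.get("regcode")]

# Sample data matching the structure compname;regcode;othercol1;othercol2
sample = (
    "compname;regcode;othercol1;othercol2\n"
    "Acme Ltd;1232498;x;y\n"
    "Example OU;5550001;x;y\n"
)
print(read_regcodes(sample))  # → ['1232498', '5550001']
```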
The email that needs to be crawled is located between the following HTML tags on each page:
<table class="table-info">
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>..</tr>
<tr>
<td class="col-1"><div class="col-1-text">E-mail:</div></td>
<td class="col-2"><div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div></td>
</tr>
</table>
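Given a page shaped like the table above, the address can be pulled out of the mailto link. A regex is a quick sketch; an HTML parser such as Beautiful Soup would be more robust if the markup varies, but this assumes the mailto pattern shown:

```python
import re

def extract_email(html):
    """Return the address inside the first mailto link on the page,
    or None if no email row is present."""
    m = re.search(r'href="mailto:([^"]+)"', html)
    return m.group(1) if m else None

# Trimmed-down version of the sample page markup
page = (
    '<table class="table-info"><tr>'
    '<td class="col-1"><div class="col-1-text">E-mail:</div></td>'
    '<td class="col-2"><div class="col-2-text">'
    '<a href="mailto:[email protected]">[email protected]</a></div></td>'
    '</tr></table>'
)
print(extract_email(page))  # → [email protected]
```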
The crawled email should be written to a new CSV file called extracted.csv. The extracted.csv table structure should be as follows:
regcode | email
Explanation: the same company registration code that is used as the crawl query should be written into the new CSV file alongside the crawled email address.
This process should be triggered every week, and the automation should process only the new entries that have been added to the source CSV file.