How can we effectively crawl the web to find specific data?
$1,000 prize (top 4) · 12 submissions · Done · 22 months ago

Novarica conducts research and consulting work in the insurance industry. The company has its own proprietary dataset of insurance companies and information and uses a variety of methods to keep it up to date.

We are interested in scalable methods of crawling the web for publicly listed information. Specifically, we’re looking for (a) annual premiums or (b) annual revenue numbers.

Deliverables

Using the attached list of URLs as a starting point, create a script that crawls these webpages and finds annual premium or total annual revenue numbers.

Search specifically in annual reports, press releases, company about pages, or other relevant pages you discover in your own research.
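One possible way to narrow a crawl to the page types listed above is to score the links on each starting page by keyword. The sketch below is only an illustration: the keyword list, the class name LinkCollector, and the function name candidate_links are assumptions, not part of the brief, and real sites will need a tuned list.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; tune it against real company sites.
KEYWORDS = ("annual-report", "annual report", "financial",
            "press", "news", "about", "investor")


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def candidate_links(html, base_url):
    """Return same-site links that mention a keyword, best match first."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    scored = []
    for url, text in parser.links:
        if urlparse(url).netloc != host:
            continue  # stay on the company's own site
        haystack = (url + " " + text).lower()
        score = sum(1 for kw in KEYWORDS if kw in haystack)
        if score:
            scored.append((score, url))
    # Highest score first; ties are broken by URL so the order is stable.
    return [url for _, url in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling candidate_links(page_html, start_url) on each starting URL yields a short queue of pages to fetch next. Fetching, politeness delays, and PDF handling are left out of this sketch.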

Example (About Section): https://www.hastingsmutual.com/AboutUs/Company-Financials

Example (Annual Report): http://www.arbella.com/arbella-insurance/why-arbella/annual-report

Info on Premiums and Revenue Numbers:

These links go to the financial statements of Hastings and Arbella. Both are fairly standard financial reports (although no two reports are ever exactly alike). The numbers are a breakdown of the companies’ balance sheets. We do not need these breakdowns. We continue to be interested primarily in Revenue and Written Premium.
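Once a page has been reduced to plain text, one simple approach is to look for a revenue or written-premium label followed closely by a number. The sketch below assumes that layout; the pattern, the SCALE table, and the function name extract_figures are illustrative assumptions. It will miss figures that appear before their label, and it can mistake a year for an amount when a year sits between the label and the figure.

```python
import re
from decimal import Decimal

# Assumed layout: a label, up to 80 non-digit characters, then a number
# with an optional scale word or letter.
PATTERN = re.compile(
    r"(?P<label>(?:direct\s+|net\s+)?written\s+premiums?|(?:total\s+)?revenues?)"
    r"[^\d]{0,80}?"
    r"\$?\s*(?P<number>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<scale>billion|million|thousand|[bmk])?\b",
    re.IGNORECASE,
)

SCALE = {
    "billion": 10**9, "b": 10**9,
    "million": 10**6, "m": 10**6,
    "thousand": 10**3, "k": 10**3,
}


def extract_figures(text):
    """Return (label, amount_in_dollars) pairs found in plain text."""
    results = []
    for match in PATTERN.finditer(text):
        label = " ".join(match.group("label").lower().split())
        number = Decimal(match.group("number").replace(",", ""))
        scale = (match.group("scale") or "").lower()
        # Decimal avoids float rounding; int() drops any fractional dollars.
        results.append((label, int(number * SCALE.get(scale, 1))))
    return results
```

For example, extract_figures("Direct written premium: $808.742M in 2014.") yields one pair labelled "direct written premium" with the amount 808742000. Tables inside PDFs or charts rendered as images would need a separate extraction step.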

For Hastings, the 2013 written premium is included in the bar chart at the bottom of the screen (393.644M). Because no additional revenue streams are mentioned, revenue is likely not appreciably larger than the premiums.

Arbella’s report is longer but contains a similar table. It states its direct written premium for 2014 as 808.742M. Revenue is not explicitly stated in this table. However, the report mentions additional investment and other income of 43.381M, and this is consistent with the statement on page 4 that “Arbella earned more than $800 million in revenue.”
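The Arbella reasoning above can be checked with a few lines of arithmetic. This sketch reads the other-income figure as 43.381M and treats revenue as premium plus other income, which is a simplifying assumption; the helper name parse_millions is hypothetical.

```python
from decimal import Decimal


def parse_millions(figure):
    """Turn a string such as '808.742M' into a Decimal number of millions."""
    return Decimal(figure.rstrip("Mm"))


premium = parse_millions("808.742M")      # direct written premium, 2014
other_income = parse_millions("43.381M")  # investment and other income

# Simplifying assumption: revenue is roughly premium plus other income.
implied_revenue = premium + other_income
print(implied_revenue)  # 852.123

# Consistent with "more than $800 million in revenue" on page 4.
print(implied_revenue > Decimal("800"))  # True
```

A script could run the same check automatically: when a page states both a premium and a revenue claim, a revenue figure below the premium is a sign that something was misread.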

Criteria:

Top ideas will include well-documented code and be highly accurate at identifying premiums or revenue numbers.

UPDATE: If you submit your script at least 3 days before the deadline, we will respond with feedback to help guide you toward a winning submission.

Submissions will be graded on the following criteria:
  • Meets Deliverables
  • Creativity
  • Clarity
Additional Materials:
Leaderboard
Top 2 share $800 Next 2 share $200
$400.00
$400.00 Virginia Polytechnic Institute and State University
$100.00 Illinois Institute of Technology
$100.00 Indian Institute of Technology - Kanpur