web crawler - How to fix "HTTP error fetching URL. Status=500" in Java while crawling? -


I am trying to crawl the user ratings of IMDB movies from the review pages (my database holds approximately 600,000 titles). I used jsoup to parse the pages, as shown below. (Sorry, I have not posted the entire code here because it is too long.)

  try { // connecting to MySQL DB
      ResultSet res = st.executeQuery("SELECT id, title, production_year"
              + " WHERE kind_id = 1"
              + " LIMIT 0, 100000");
      while (res.next()) {
          .......
          String baseUrl = "http://www.imdb.com/search/title?release_date="
                  + year + "," + year + "&title=" + movieName
                  + "&title_type=feature,short,documentary,unknown";
          Document doc = Jsoup.connect(baseUrl).userAgent("Mozilla").timeout(0).get();
          ............ // insert the rating into the database

I tested it for the first 100, the first 500, and the first 2000 films in my DB and it worked well. But when I ran it for 100,000 films I got this error:

  org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Colombia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
      at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
      at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
      at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
      at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
      at imdb.main(imdb.java:47)

I searched a lot for this error and found that it is a server-side error (5xx status code).
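One thing worth ruling out first (my own assumption, not something established above): the failing URL contains raw quotes and spaces in the title, and unencoded query parameters can themselves provoke server errors. A minimal sketch of percent-encoding the title with the standard `java.net.URLEncoder` before building the URL (`SearchUrl`, `build`, and `movieName` are hypothetical names):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
    // Hypothetical helper: build the IMDB search URL with the title
    // percent-encoded, so quotes/spaces in movie names cannot break the request.
    static String build(int year, String movieName) {
        String encodedTitle = URLEncoder.encode(movieName, StandardCharsets.UTF_8);
        return "http://www.imdb.com/search/title?release_date=" + year + "," + year
                + "&title=" + encodedTitle
                + "&title_type=feature,short,documentary,unknown";
    }
}
```

Note that `URLEncoder` uses form encoding, so spaces become `+` rather than `%20`; both are accepted in query strings.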

Then I decided to add a condition: when the connection fails, retry up to 2 more times; if it still cannot connect, skip to the next URL. Since I'm new to Java, I searched for similar questions and read the answers on Stack Overflow.
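The retry-then-skip idea above can be sketched with a small generic helper (a sketch under my own naming, not code from the post; `RetryFetch` and `fetchWithRetry` are hypothetical):

```java
import java.util.concurrent.Callable;

public class RetryFetch {
    // Hypothetical helper: run the given fetch, retrying on any exception.
    // Returns null once all attempts are exhausted, so the caller can log
    // the URL and move on to the next one instead of crashing the crawl.
    static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (Exception e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        return null; // give up; skip this URL
    }
}
```

In the crawler this would wrap the jsoup call, e.g. `Document doc = fetchWithRetry(() -> Jsoup.connect(baseUrl).userAgent("Mozilla").get(), 3);` followed by a null check.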

However, when I try to use `Connection.Response`, the compiler tells me "Connection.Response cannot be resolved to a type."

I would appreciate it if someone could help me. I am just a newbie, and I know this may be easy, but I do not know how to fix it.


Well, I just fixed the HTTP error status 500 with `ignoreHttpErrors(true)`:

  org.jsoup.Connection con = Jsoup.connect(baseUrl)
          .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
  con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
  Response resp = con.execute();
  Document doc = null;
  if (resp.statusCode() == 200) {
      doc = con.get();
      ......

I hope this helps anyone with the same error.

However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error:
"Read timed out".

I appreciate any suggestions for fixing this error.
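"Read timed out" is a `java.net.SocketTimeoutException`: the server did not respond within the socket timeout, which becomes likely when you hit one site with tens of thousands of sequential requests. A common mitigation (my suggestion, not from the original post) is to raise the `timeout(...)` value, pause between requests, and back off exponentially before retrying a timed-out URL. A minimal sketch of just the delay computation (`Backoff` and `delayMs` are hypothetical names):

```java
public class Backoff {
    // Exponential backoff: baseMs * 2^attempt, capped at capMs.
    // attempt is 0-based (0 = first retry).
    static long delayMs(int attempt, long baseMs, long capMs) {
        long delay = baseMs << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, capMs);
    }
}
```

Before each retry the crawler would call `Thread.sleep(Backoff.delayMs(attempt, 1000, 60000))`; a small fixed sleep between ordinary requests also reduces the chance of the server throttling or dropping your connections.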

Promoting my comment to an answer:

For validation, e.g. checking that you got a valid HTTP code (200) before parsing, break your single call into 3 calls: `Connection`, `Response`, `Document`. The response type is `org.jsoup.Connection.Response`.

So the part of your code shown above would be modified to:

  while (res.next()) {
      .......
      String baseUrl = "http://www.imdb.com/search/title?release_date="
              + year + "," + year + "&title=" + movieName
              + "&title_type=feature,short,documentary,unknown";
      Connection con = Jsoup.connect(baseUrl)
              .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
              .timeout(10000);
      Connection.Response resp = con.execute();
      Document doc = null;
      if (resp.statusCode() == 200) {
          doc = con.get();
          ....
      }
  }
