python - Web scraping - Get the keys from a website


I downloaded all the files from the following website (which lists the bus routes of São Paulo, Brazil): and I was wondering whether there is a better way to do this.

What I actually did was loop over every value between 00000 and 99999, substituting the ##### in the URL to check whether that number exists on the website or not.
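For reference, a minimal sketch of what I mean (the URL pattern is just a placeholder, not the real one, and I am assuming a 200 status means the key exists):

    import requests

    # NOTE: hypothetical URL pattern -- substitute the real page that takes the
    # five-digit key where the "#####" placeholder goes.
    BASE_URL = "https://example.com/route/{key}"

    def find_existing_keys(start=0, end=99999):
        """Brute-force every five-digit key, as described above."""
        found = []
        for n in range(start, end + 1):
            key = "{:05d}".format(n)  # zero-pad, e.g. 42 -> "00042"
            resp = requests.get(BASE_URL.format(key=key), timeout=10)
            if resp.status_code == 200:  # assumption: 200 means the key exists
                found.append(key)
        return found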

Is there any way to know all the keys beforehand, to make this work more efficient?

I already have all the files, but this is a very common problem and I was wondering whether there is a shortcut to solve it.

Full disclosure: I am the creator of the site, and I apologize for not translating it from Portuguese into English, which would otherwise clarify these matters (and make this answer shorter).

As you may have seen, cruzalinhas shows the bus, subway and train routes of São Paulo in a way that makes it easier to find their connections and plan trips that use those connections (Google Maps and others do this automatically, but sometimes miss routes that a more interactive search would find).

It does a proximity search between routes ("linhas"), and it crawls its data from the São Paulo transportation company (SPTrans).

To answer your first question: those IDs come from the original site. However, they are not very stable (I have seen old IDs removed, and a route's ID change by a single digit), so cruzalinhas simply does a full crawl and updates the whole database every time (I would like to change this completely, but Google App Engine makes it a bit harder than usual).

Good news: the site is open source (under the MIT license). The documentation is still in Portuguese, but what will interest you most is the command-line crawler.

The most efficient way to get the data is to run sptscraper.py download, then sptscraper.py parse, then sptscraper.py dump. There are more options, and here is a quick translation of the help screen:

    Downloads and parses the data of public transportation routes from the
    SPTrans website. It parses the HTML files and stores the result in the
    linhas.sqlite file, which can be used by other applications, converted
    to JSON, or used to update cruzalinhas itself.

    Commands:
      info           Shows the number of pending updates.
      download [ID]  Downloads the HTML files from SPTrans (all of them, or starting from ID).
      resume         Resumes the download from the last saved line.
      parse          Adds/updates/(soft-)removes the parsed HTML lines into the database.
      list           Outputs a JSON with the route IDs in the database.
      dump [ID]      Outputs a JSON with every route in the database (or just the given one).
      hashes         Outputs a JSON with the hashes (which cruzalinhas uses to map crossing routes) for each pending line in the database.
      upload         Uploads the pending changes in the database to cruzalinhas.
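If you want to peek at what ended up in linhas.sqlite after the download and parse steps, something like this works from Python. This is just a sketch: the file name comes from the help text above, but the schema is whatever the scraper created, so list the tables first rather than assuming any names:

    import sqlite3

    # Open the database produced by "sptscraper.py parse" (file name taken from
    # the help text above). We only list the tables here, since the exact schema
    # is defined by the scraper itself.
    conn = sqlite3.connect("linhas.sqlite")
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        print("Tables in linhas.sqlite:", [name for (name,) in tables])
    finally:
        conn.close()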

Keep in mind that this data is not collected with SPTrans' consent, even though it is public information and should be available. The site and the scraper were created as an act of protest, before the specific digital freedom-of-information law was passed (although there were already earlier laws regulating the availability of public-service information, so nothing illegal was done, nor will be, as long as you use it responsibly).

For this reason (and because their back end is a bit... "challenging"), the scraper is very conservative in throttling its requests, to avoid overloading their servers. That makes a full crawl take several hours, but you really do not want to overload the service (which could also lead them to block the scraper, or to change the site in ways that make scraping harder).
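The throttling itself is nothing fancy; conceptually it is just a pause between consecutive requests, along these lines (the delay value here is illustrative, not the scraper's actual setting):

    import time
    import requests

    DELAY_SECONDS = 5  # illustrative value; pick something the target server can tolerate

    def polite_get(url, session=None):
        """Fetch a URL, then sleep so consecutive calls stay well spaced."""
        session = session or requests.Session()
        resp = session.get(url, timeout=30)
        time.sleep(DELAY_SECONDS)
        return resp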

I intend to rewrite that code eventually (it was possibly my first Python/App Engine code, written a few years ago as a quick hack focused on getting the public data out of SPTrans' website). The rewrite should have a cleaner crawling process, make the latest data available for download in a single click, and possibly expose the list of routes through an API.

For now, if you just want the latest crawl (which I did a month or two ago), just contact me and I will be happy to send you the SQLite/JSON file.
