List: python-list
Subject: Re: Python text file fetch specific part of line
From: cs () zip ! com ! au
Date: 2016-07-30 2:46:55
Message-ID: 20160730024655.GA1902 () cskk ! homeip ! net
On 29Jul2016 18:42, Gordon Levi <gordon@address.invalid> wrote:
>cs@zip.com.au wrote:
>
>>On 28Jul2016 19:28, Gordon Levi <gordon@address.invalid> wrote:
>>>Arshpreet Singh <arsh840@gmail.com> wrote:
>>>>I am writing Imdb scrapper, and getting available list of titles from IMDB
>>>>website which provide txt file in very raw format, Here is the one part of
>>>>file(http://pastebin.com/fpMgBAjc) as the file provides tags like
>>>>Distribution Votes,Rank,Title I want to parse title names, I tried with
>>>>readlines() method but it returns only list which is quite heterogeneous, is
>>>>it possible that I can parse each value comes under title section?
>>>
>>>Beautiful Soup will make your task much easier
>>><https://www.crummy.com/software/BeautifulSoup/>.
>>
>>Did you look at his sample data?
>
>No. I read he was "writing an IMDB scraper, and getting the available
>list of titles from the IMDB web site". It's here
><http://www.imdb.com/>.
>
>> Plain text, not HTML or XML. Beautiful Soup is
>>not what he needs here.
>
>Fortunately the OP told us his application rather than just telling us
>his current problem. His life would be much easier if he ignored the
>plain text he has obtained so far and started again using a Beautiful
>Soup tutorial.

Or bypass IMDB's computer unfriendliness and go straight to http://omdbapi.com/
You can get JSON directly from it and avoid BS entirely. BS is an amazing
library, but it is essentially a workaround for computer-hostile websites: those
that provide no clean machine-readable data, only unstable, mutable HTML
output.
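
A rough sketch of what that might look like, using only the standard
library. The "t" (title) query parameter, the optional "apikey" parameter
and the "Response"/"Error" fields in the reply are my assumptions about
the service, and the function names are made up for illustration; check
the site's own documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

# Base URL as given above; the query parameters below are assumptions.
OMDB_URL = "http://omdbapi.com/"


def build_query_url(title, api_key=None):
    """Build a lookup URL for a single title (hypothetical helper)."""
    params = {"t": title}
    if api_key:
        # Assumed: the service may want a key passed as "apikey".
        params["apikey"] = api_key
    return OMDB_URL + "?" + urllib.parse.urlencode(params)


def parse_response(body):
    """Decode the JSON text; raise LookupError if the service reports failure."""
    data = json.loads(body)
    # Assumed: a failed lookup is flagged with "Response": "False".
    if data.get("Response") == "False":
        raise LookupError(data.get("Error", "unknown error"))
    return data


def fetch_title(title, api_key=None):
    """Fetch one title's record as a dict (makes a network request)."""
    with urllib.request.urlopen(build_query_url(title, api_key)) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Calling fetch_title("Some Film") should then hand back an ordinary dict,
with no HTML parsing involved.
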
Cheers,
Cameron Simpson <cs@zip.com.au>
--
https://mail.python.org/mailman/listinfo/python-list