The Daily Click ::. Forums ::. Klik Coding Help ::. Is There an Easy Way to Scrape from Webpages Yet?
 

 


The_Antisony

At least I'm not Circy

Registered
  01/07/2002
Points
  1341

VIP MemberStarSnow
22nd February, 2020 at 18:21:20

I'm aware that scraping can be accomplished with Python extensions, but I'd rather the end user didn't have to install Python and a bunch of libraries to make that work. I've found a few command-line apps that seemed promising, but they still rely on Java or require Windows environment variables to be set up.

I'm aiming to make an MMF app that can scan a directory for movie names and years, then scrape IMDB to save a synopsis, rating, actors, and other movie information to a local file the app can later load and display, without being an installation burden to end users who likely won't know what Java, Python, and Windows environment variables are.

I've had some luck in the past with the Web Control extension, but it's a little limited in what it can do, and I'd rather not have to render an entire webpage in the app before scraping information out of it, because that's a little slow. Besides, in my experience, that extension likes to crash a lot. Since it was written years ago and still relies on whatever modern browser is installed, I'm sure there are some incompatibilities by now.

Are there any alternatives which don't require a bunch of dependencies, complicated installation steps, or OS configuration to make work?

 
ChrisD> Employer: Say, wanna see a magic trick?
ChrisD> Employee: Uhh… sure, boss.
ChrisD> Employer: Your job! It just disappeared! Pack your things and leave! Pretty good trick, huh?

Joshtek

Administrator
The Archivist

Registered
  02/01/2002
Points
  3841

Game of the Week WinnerHas Donated, Thank You!Mr BallPicture Me This Round 50 Winner!
23rd February, 2020 at 17:52:55

If you're using Clickteam Fusion 2.5 then the Get object can connect via HTTPS to a page on imdb.com to retrieve the data. You would then need code to process the raw HTML and extract the data you want. If you want to use a scripting language to extract the information then you could use the xlua object and then code it in Lua. Just remember that if IMDB changes their page then your application will need updating, so you might want a way for people to check for updates (to the software and/or to the Lua script).
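
To illustrate the "process the raw HTML" step: the text that comes back from a plain HTTP GET can be searched directly, without rendering the page at all. The following is a minimal Python sketch, not Fusion or xlua code. The sample HTML and the `MetaGrabber` class are made up for illustration, and real IMDB markup will differ.

```python
from html.parser import HTMLParser


class MetaGrabber(HTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page source, standing in for whatever the GET request returns.
SAMPLE_HTML = """<html><head>
<title>High School Musical (2006)</title>
<meta name="description" content="A popular high-school athlete and an academically-gifted girl get roles in the school musical.">
</head><body><p>Lots of markup we never need to render.</p></body></html>"""

grabber = MetaGrabber()
grabber.feed(SAMPLE_HTML)
print(grabber.title)
print(grabber.meta.get("description"))
```

The parser only walks the text once and keeps the two fields it cares about, so the cost is dominated by the download rather than by any rendering.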

If you do want to use Python, you can package your software so that the user does not need Python installed. I've historically used py2exe for this purpose which is available from https://github.com/albertosottile/py2exe but alternatives are available.

 
:: Joshtek ::


Oreos? GO! OREOS!

Joshtek

Administrator
The Archivist

Registered
  02/01/2002
Points
  3841

Game of the Week WinnerHas Donated, Thank You!Mr BallPicture Me This Round 50 Winner!
23rd February, 2020 at 20:28:30

Oh, and if you have the Developer version of Fusion 2.5 then you can also use the Regular Expressions object to help parse the results.
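
For a rough idea of what the regular-expression approach looks like, here is a Python sketch rather than the Fusion Regular Expressions object. The pattern assumes a title tag of the form `Name (Year)`, and the sample string is invented, so treat it as an illustration only.

```python
import re

# Hypothetical snippet of page source; real markup will differ.
html = "<head><title>High School Musical (2006)</title></head>"

# Capture the name and a four-digit year from inside the <title> tag.
pattern = re.compile(
    r"<title>\s*(?P<name>.+?)\s*\((?P<year>\d{4})\)\s*</title>",
    re.IGNORECASE | re.DOTALL,
)

match = pattern.search(html)
if match:
    name, year = match.group("name"), int(match.group("year"))
    print(name, year)
else:
    # The page layout changed or the tag is missing.
    name, year = None, None
```

A pattern like this breaks as soon as the site changes its markup, which is the same maintenance caveat mentioned above.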

 

The_Antisony

At least I'm not Circy

Registered
  01/07/2002
Points
  1341

VIP MemberStarSnow
13th May, 2021 at 20:51:20


Originally Posted by Joshtek
If you're using Clickteam Fusion 2.5 then the Get object can connect via HTTPS to a page on imdb.com to retrieve the data. You would then need code to process the raw HTML and extract the data you want. If you want to use a scripting language to extract the information then you could use the xlua object and then code it in Lua. Just remember that if IMDB changes their page then your application will need updating, so you might want a way for people to check for updates (to the software and/or to the Lua script).

If you do want to use Python, you can package your software so that the user does not need Python installed. I've historically used py2exe for this purpose which is available from https://github.com/albertosottile/py2exe but alternatives are available.



Either way, it sounds like there's no direct method of scraping an element from a webpage without rendering the entire page first. I was hoping I could create a simple movie collection list that could grab title and synopsis information on the fly, really quickly. Between having to render the full page and parse through everything, it sounds like it'd take more than a few seconds to produce results for the end user.

I have a pretty large Kodi library and browsing through it using my media center PC is a chore because of the Kodi interface. I can create a simple list of all of my movie titles which is far easier to browse through on a secondary device, but then there's no movie artwork or synopsis. I was hoping I could make something that could quickly scrape basic title information from IMDB just based on the movie file name, and without saving external data.

Oh well.

 

Sketchy

Cornwall UK

Registered
  06/11/2004
Points
  1971

VIP MemberWeekly Picture Me This Round 43 Winner!Weekly Picture Me This Round 47 WinnerPicture Me This Round 49 Winner!
14th May, 2021 at 10:18:12

You don't want to load the actual webpage - you should just access the raw data through an "API" (google "IMDB API"), using the "GET" extension. If you're not familiar with the concept of an API, have a read up on it first - it's definitely what you need.

eg. http://www.omdbapi.com
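
As a rough sketch of how a movie file name could be turned into a request URL for that API (Python, not Fusion events): the `t` and `apikey` parameters are the ones used in the example URL in this post, while the `y` (year) parameter, the helper names, and the file-name conventions handled here are my own assumptions.

```python
import re
from pathlib import Path
from urllib.parse import urlencode

API_ROOT = "http://www.omdbapi.com/"
# A four-digit year (19xx or 20xx), optionally wrapped in () or [].
YEAR = re.compile(r"[(\[]?(?<!\d)((?:19|20)\d{2})(?!\d)[)\]]?")


def clean(text):
    """Turn dots and underscores into spaces and trim stray separators."""
    return re.sub(r"[._]+", " ", text).strip(" -")


def title_and_year(filename):
    """Split a name like 'High.School.Musical.(2006).mkv' into (title, year)."""
    stem = Path(filename).stem
    matches = list(YEAR.finditer(stem))
    if not matches:
        return clean(stem), None
    # Use the last year-like number so "2001 A Space Odyssey (1968)" still works.
    last = matches[-1]
    return clean(stem[:last.start()]), last.group(1)


def omdb_url(title, year=None, api_key="YOUR_KEY"):
    """Build the query URL; urlencode handles spaces and special characters."""
    params = {"t": title, "apikey": api_key}
    if year:
        params["y"] = year  # assumed optional year filter
    return API_ROOT + "?" + urlencode(params)


title, year = title_and_year("High.School.Musical.(2006).mkv")
print(omdb_url(title, year))
```

This is a naive parser: release tags, multi-part files, and titles that end in a number would need extra rules.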

+ Start of frame
-> Get URL: "http://www.omdbapi.com/?t=high+school+musical&apikey=8d5bd4a9"
(I had to sign up for that key, but it's free)

+ On Get complete:
-> Editbox: Set text to Received$( "Get Object" )

...will give you:

{"Title":"High School Musical","Year":"2006","Rated":"TV-G","Released":"20 Jan 2006","Runtime":"98 min","Genre":"Comedy, Drama, Family, Music, Musical, Romance","Director":"Kenny Ortega","Writer":"Peter Barsocchini","Actors":"Zac Efron, Vanessa Hudgens, Ashley Tisdale, Lucas Grabeel","Plot":"A popular high-school athlete and an academically-gifted girl get roles in the school musical and develop a friendship that threatens East High's social order.","Language":"English, Spanish","Country":"USA","Awards":"Won 2 Primetime Emmys. Another 8 wins & 18 nominations.","Poster":"https://m.media-amazon.com/images/M/MV5BZmQ3MWEyNTYtOTY1OC00MTljLWI3OGUtMmU1ZDc2OTYxNDQ4L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"5.4/10"},{"Source":"Rotten Tomatoes","Value":"63%"}],"Metascore":"N/A","imdbRating":"5.4","imdbVotes":"82,061","imdbID":"tt0475293","Type":"movie","DVD":"05 Dec 2016","BoxOffice":"N/A","Production":"Walt Disney Pictures, Salty Pictures Inc., First Street Films","Website":"N/A","Response":"True"}

That's a JSON file, so it would be easiest to extract the values you want using the JSON extension, but to be honest, it would be simple enough to parse manually using just the standard string functions and fastloops or one of the string tokenizing extensions.
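
For anyone who wants to see the extraction step spelled out, here is a small Python sketch that pulls a few fields out of a trimmed copy of the response above. It is only an illustration of the idea; inside Fusion you would use the JSON extension or string functions instead.

```python
import json

# Trimmed copy of the response shown above.
received = (
    '{"Title":"High School Musical","Year":"2006","Rated":"TV-G",'
    '"Runtime":"98 min",'
    '"Actors":"Zac Efron, Vanessa Hudgens, Ashley Tisdale, Lucas Grabeel",'
    '"imdbRating":"5.4","Response":"True"}'
)

movie = json.loads(received)

# The API reports success as the string "True", not a JSON boolean.
if movie.get("Response") == "True":
    actors = [name.strip() for name in movie["Actors"].split(",")]
    # Some fields come back as "N/A", so guard the numeric conversion.
    raw_rating = movie.get("imdbRating", "N/A")
    rating = float(raw_rating) if raw_rating != "N/A" else None
    summary = f'{movie["Title"]} ({movie["Year"]}), rated {rating}/10'
    print(summary)
    print(actors)
```

Note that every value, including the year and the rating, arrives as a string and has to be converted if you want to sort or compare it.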



Edit: Here's another slightly more complicated one you can use: https://imdb-api.com
That will give you loads of info:

{"id":"tt0475293","title":"High School Musical","originalTitle":"","fullTitle":"High School Musical (2006)","type":"Movie","year":"2006","image":"https://imdb-api.com/images/original/MV5BZmQ3MWEyNTYtOTY1OC00MTljLWI3OGUtMmU1ZDc2OTYxNDQ4L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTczNjQwOTY@._V1_Ratio0.6791_AL_.jpg","releaseDate":"2006-01-20","runtimeMins":"98","runtimeStr":"1h 38mins","plot":"Troy Bolton and Gabriella Montez are two totally different teenagers who meet at a party while singing karaoke on New Year's Eve. The next week, Troy returns to East High, his high school in New Mexico, to find that Gabriella is a new student there. They quickly become close friends and accidentally audition for the school musical. They both get callbacks, infuriating drama queen Sharpay Evans and her sidekick brother Ryan. Then Chad, Troy's best friend and basketball teammate, and Taylor, Gabriella's new friend on the decathlon team, must make Gabriella hate Troy.","plotLocal":"","plotLocalIsRtl":false,"awards":"Won 2 Primetime Emmys. 
Another 8 wins & 18 nominations.","directors":"Kenny Ortega","directorList":[{"id":"nm0650905","name":"Kenny Ortega"}],"writers":"Peter Barsocchini","writerList":[{"id":"nm0058302","name":"Peter Barsocchini"}],"stars":"Zac Efron, Vanessa Hudgens, Ashley Tisdale, Lucas Grabeel","starList":[{"id":"nm1374980","name":"Zac Efron"},{"id":"nm1227814","name":"Vanessa Hudgens"},{"id":"nm0864308","name":"Ashley Tisdale"},{"id":"nm1727317","name":"Lucas Grabeel"}],"actorList":[{"id":"nm1374980","image":"https://imdb-api.com/images/original/MV5BMTUxNzY3NzYwOV5BMl5BanBnXkFtZTgwNzQ3Mzc4MTI@._V1_Ratio0.7273_AL_.jpg","name":"Zac Efron","asCharacter":"Troy Bolton"},{"id":"nm1227814","image":"https://imdb-api.com/images/original/MV5BZGY4NGU0NjgtNjc0Mi00OTk3LWFmMzktNjY4M2JlMDkzOTFkXkEyXkFqcGdeQXVyMTExNzQ3MzAw._V1_Ratio0.7273_AL_.jpg","name":"Vanessa Hudgens","asCharacter":"Gabriella Montez (as Vanessa Anne Hudgens)"},{"id":"nm0864308","image":"https://imdb-api.com/images/original/MV5BMjA0NDk1NDQzNF5BMl5BanBnXkFtZTgwNDAzMzEyNzM@._V1_Ratio0.7273_AL_.jpg","name":"Ashley Tisdale","asCharacter":"Sharpay Evans"},{"id":"nm1727317","image":"https://imdb-api.com/images/original/MV5BNzM2YjQzMDEtYzAzMS00NzVmLThmNGUtZWQ2OWUzMjg2YTc1XkEyXkFqcGdeQXVyMjA2OTY2MTg@._V1_Ratio0.7273_AL_.jpg","name":"Lucas Grabeel","asCharacter":"Ryan Evans"},{"id":"nm0088298","image":"https://imdb-api.com/images/original/MV5BMTY4MTA4Nzc2Ml5BMl5BanBnXkFtZTgwNDUwNDA1NTE@._V1_Ratio1.0000_AL_.jpg","name":"Corbin Bleu","asCharacter":"Chad Danforth"},{"id":"nm0170912","image":"https://imdb-api.com/images/original/MV5BNGIzMDlmYmEtOTBlOC00NTJkLWIzYjMtNTZjMzg0MTkwMTljXkEyXkFqcGdeQXVyNTA2MjYyMQ@@._V1_Ratio0.7273_AL_.jpg","name":"Monique Coleman","asCharacter":"Taylor McKessie"},{"id":"nm0424559","image":"https://imdb-api.com/images/original/MV5BNTczNTg3YzItMTk0OC00NzM0LTljNGEtNjY5MmQwODgyYWU1XkEyXkFqcGdeQXVyMjE4MTM1MzM@._V1_Ratio0.7273_AL_.jpg","name":"Bart Johnson","asCharacter":"Coach Jack 
Bolton"},{"id":"nm0715295","image":"https://imdb-api.com/images/original/MV5BMTc4OTQxNDIyNF5BMl5BanBnXkFtZTcwMjg4NzQzNA@@._V1_Ratio1.5000_AL_.jpg","name":"Alyson Reed","asCharacter":"Ms. Darbus"},{"id":"nm0912703","image":"https://imdb-api.com/images/original/MV5BNTJkZDYzYzctMmFjMS00YTExLThjNzgtZTczZGY0YzE0YjBmXkEyXkFqcGdeQXVyMjQwMDg0Ng@@._V1_Ratio0.7273_AL_.jpg","name":"Chris Warren","asCharacter":"Zeke Baylor (as Chris Warren Jr.)"},{"id":"nm0750037","image":"https://imdb-api.com/images/original/MV5BMjIwMTI4MDE3N15BMl5BanBnXkFtZTgwODQzMjEyNTE@._V1_Ratio0.7273_AL_.jpg","name":"Olesya Rulin","asCharacter":"Kelsi Nielsen"},{"id":"nm0760835","image":"https://imdb-api.com/images/original/MV5BMTU1MDcyNzEzNV5BMl5BanBnXkFtZTcwOTY4MjYzMQ@@._V1_Ratio0.7727_AL_.jpg","name":"Ryne Sanborn","asCharacter":"Jason Cross"},{"id":"nm0380518","image":"https://imdb-api.com/images/original/nopicture.jpg","name":"Socorro Herrera","asCharacter":"Mrs. Montez"},{"id":"nm0594458","image":"https://imdb-api.com/images/original/MV5BZWUwNTAyZmYtZTYzNi00YTg3LThhM2ItODkwMTA4ZjMzMTJiXkEyXkFqcGdeQXVyNjc3NDgwNzU@._V1_Ratio1.7727_AL_.jpg","name":"Joey Miyashima","asCharacter":"Principal Matsui"},{"id":"nm2150045","image":"https://imdb-api.com/images/original/nopicture.jpg","name":"Dutch Whitlock","asCharacter":"Skater Dude #1"},{"id":"nm1644288","image":"https://imdb-api.com/images/original/MV5BMjYyNDM2M2EtNTk3My00YWU1LTllYmEtZWIwMmZmNThlZTc0XkEyXkFqcGdeQXVyMjQwMDg0Ng@@._V1_Ratio0.7273_AL_.jpg","name":"Ryan Templeman","asCharacter":"Skater Dude #2"}],"fullCast":null,"genres":"Comedy, Drama, Family, Music, Musical, Romance","genreList":[{"key":"Comedy","value":"Comedy"},{"key":"Drama","value":"Drama"},{"key":"Family","value":"Family"},{"key":"Music","value":"Music"},{"key":"Musical","value":"Musical"},{"key":"Romance","value":"Romance"}],"companies":"Salty Pictures, First Street Films","companyList":[{"id":"co0150505","name":"Salty Pictures"},{"id":"co0003297","name":"First Street 
Films"}],"countries":"USA","countryList":[{"key":"USA","value":"USA"}],"languages":"English, Spanish","languageList":[{"key":"English","value":"English"},{"key":"Spanish","value":"Spanish"}],"contentRating":"TV-G","imDbRating":"5.4","imDbRatingVotes":"81714","metacriticRating":"","ratings":null,"wikipedia":null,"posters":null,"images":null,"trailer":null,"boxOffice":{"budget":"$4,200,000 (estimated)","openingWeekendUSA":"","grossUSA":"","cumulativeWorldwideGross":""},"tagline":"This School Rocks Like No Other!","keywords":"basketball movie,disney channel original movie,high school,singing,2000s","keywordList":["basketball movie","disney channel original movie","high school","singing","2000s"]...
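
This second response nests lists of objects (`actorList`, `genreList`) and uses `null` for fields it has no data for. A short Python sketch of reading those parts, using a heavily trimmed copy of the response above, could look like this:

```python
import json

# Heavily trimmed copy of the response shown above.
received = (
    '{"title":"High School Musical","year":"2006",'
    '"actorList":['
    '{"id":"nm1374980","name":"Zac Efron","asCharacter":"Troy Bolton"},'
    '{"id":"nm1227814","name":"Vanessa Hudgens",'
    '"asCharacter":"Gabriella Montez (as Vanessa Anne Hudgens)"}],'
    '"genreList":[{"key":"Comedy","value":"Comedy"},'
    '{"key":"Drama","value":"Drama"}],'
    '"fullCast":null}'
)

movie = json.loads(received)

# "or []" covers both a missing key and an explicit null.
cast = [(a["name"], a["asCharacter"]) for a in (movie.get("actorList") or [])]
genres = [g["value"] for g in (movie.get("genreList") or [])]
full_cast = movie.get("fullCast") or []

for name, role in cast:
    print(f"{name} as {role}")
print(", ".join(genres))
```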

Edited by Sketchy

 
n/a

UrbanMonk

BRING BACK MITCH

Registered
  07/07/2008
Points
  49667

Has Donated, Thank You!Little Pirate!ARGH SignKliktober Special Award TagPicture Me This Round 33 Winner!The Outlaw!VIP MemberHasslevania 2!I am an April FoolKitty
Picture Me This Round 32 Winner!Picture Me This Round 42 Winner!Picture Me This Round 44 Winner!Picture Me This Round 53 Winner!
17th May, 2021 at 20:13:45

You could always code it up in good ol' C++.

Performing GET requests, parsing JSON, etc. is cake.

I'm not sure if there is a JSON parsing extension for Fusion, but if there is, you could also use that.
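
The fetch-then-parse flow is short in most languages. Here is a Python sketch of it (not C++ and not Fusion) that also saves each result to a local file, as the original post describes. The function names and the cache-file layout are assumptions of mine, and the demo uses a fake fetcher so it runs without a network connection or an API key.

```python
import json
import tempfile
from pathlib import Path
from urllib.request import urlopen


def http_get(url):
    """Fetch a URL and return the body as text (a plain GET, nothing is rendered)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def movie_info(url, cache_file, fetch=http_get):
    """Return parsed JSON for `url`, reusing `cache_file` when it already exists."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = json.loads(fetch(url))
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


def fake_fetch(url):
    """Stand-in for http_get so the demo works offline."""
    return '{"Title": "High School Musical", "Year": "2006"}'


with tempfile.TemporaryDirectory() as folder:
    cache = Path(folder) / "tt0475293.json"
    # First call goes through the fetcher and writes the cache file.
    first = movie_info("http://example.invalid/", cache, fetch=fake_fetch)
    # Second call is served from the file, so the fetcher is never used.
    second = movie_info("http://example.invalid/", cache, fetch=None)
    print(first == second)
```

Swapping `fake_fetch` for the default `http_get` and passing a real API URL would hit the network once per movie and read from disk afterwards.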

 
n/a

The_Antisony

At least I'm not Circy

Registered
  01/07/2002
Points
  1341

VIP MemberStarSnow
29th May, 2021 at 19:21:22

Thanks Sketch and Urban. My initial interest was a product of Kodi's built-in scrapers being absolutely trash and leaving about a third of my collection without media information. Every time I'd have a friend over for movie night, they were skipping over really interesting movies because there was no way to figure out what they were without manually looking them up.

The idea was to make a Fusion app that can pull info directly from Kodi for movies with scraped information, and at least pull synopsis, year, genre, and actor information from IMDB on the fly for anything in my collection missing data. I went a while without a sensible answer and decided to manually create NFO files for everything missing. Hours and hours later, I don't really have a need for an app that scrapes movie data anymore.

Kodi has an Android app capable of browsing through an entire library, but the downside is that Kodi needs to be running first. Eventually, I would like to convert the library into an offline webpage so it can be browsed without my Nvidia Shield TV or Kodi running first, but that's not a big necessity.

 
   




 


