How to deal with redirects to a bookmark within a page in Scrapy (911 error)


I am very new to programming, so apologies if this is a rookie issue. I am a researcher, and I've been building spiders to allow me to crawl specific search results of IGN, the gaming forum. The first spider collects each entry in the search results, along with URLs, and then the second spider crawls each of those URLs for the content.

The problem is that IGN redirects URLs associated with a specific post to a new URL that incorporates a #bookmark at the end of the address. This allows the visitor to the page to jump directly down to the post in question, but I want my spider to crawl over the entire thread. In addition, my spider ends up with a (911) error after the redirect and returns no data. The only data retrieved is from any search results that linked directly to a thread rather than a post.

I am absolutely stumped, so any help would be amazing! Both spiders are included below.

Spider 1:

import scrapy

# Build the URLs for search-result pages 1 to 4.
myURLs = []
baselineURL = "https://www.ign.com/boards/search/186716896/?q=broforce&o=date&page="
for counter in range(1, 5):
    myURLs.append(baselineURL + str(counter))


class BroforceIGNScraper(scrapy.Spider):
    name = "foundation"
    start_urls = myURLs

    def parse(self, response):
        # One item per search result: title, author, and relative URL.
        for post in response.css("div.main"):
            yield {
                'title': post.css("h3.title a::text").extract_first(),
                'author': post.css("div.meta a.username::text").extract_first(),
                'URL': post.css('h3 a').xpath('@href').extract_first(),
            }

Spider 2:

import csv

import scrapy

# Thread and post URLs collected by the first spider.
URLlist = []
baseURL = "https://www.ign.com/boards/"

with open('BroforceIGNbase.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        URLlist.append(baseURL + row['URL'])


class BroforceIGNScraper(scrapy.Spider):
    name = "posts2"
    start_urls = URLlist

    # handle_httpstatus_list = [301]

    def parse(self, response):
        for post in response.css(".messageList"):
            yield {
                'URL': response.url,
                'content': post.css(".messageContent article").extract_first(),
                'commentauthor': post.css("div.messageMeta a::text").extract_first(),
                'commentDateTime': post.css('div.messageMeta a span.DateTime').xpath('@title').extract_first(),
            }

edited Nov 23 '18 at 19:34 by stranac

asked Nov 23 '18 at 17:21 by theresearchant

I can't reproduce either of the issues you describe, neither the redirect nor the 911 error. Can you share a particular URL where the problems occur? Also, the URL fragment (the # part) should be ignored when making the request, so that is unlikely to play any part in this.

– stranac
Nov 23 '18 at 19:39
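
To illustrate the point about the fragment: the part after `#` is handled on the client side and is not part of what the server is asked for. A minimal sketch using the standard library's `urllib.parse.urldefrag`, which separates the two; the thread URL below is invented for illustration and is not a real IGN link.

```python
from urllib.parse import urldefrag

# Hypothetical URL of the kind described in the question; the thread
# name and post ID are made up.
redirected = "https://www.ign.com/boards/threads/example-thread.123/#post-456"

# urldefrag returns a named tuple (url, fragment).
url_without_fragment, fragment = urldefrag(redirected)

print(url_without_fragment)  # the part that identifies the page
print(fragment)              # the in-page bookmark
```

If that holds here, the page fetched for a post link should be the same thread page with or without the bookmark.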

I just reran the spider and for some reason it is working now without me having made any change to the code. Sorry for sending you on a rabbit trail. But it's only pulling the first post on a page - might you know why that's the case? It should run through everything on the page.

– theresearchant
Nov 24 '18 at 2:10
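
One possible explanation for the "first post only" result, assuming `.messageList` matches the single container around all posts rather than each individual post: the loop then runs once, and `extract_first()` returns only the first match inside that container. The sketch below reproduces the pattern with the standard library's `xml.etree.ElementTree` in place of Scrapy selectors, on an invented fragment; the `message` class name and the markup are assumptions, not IGN's actual HTML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment standing in for a thread page.
PAGE = """
<ol class="messageList">
  <li class="message"><article>first post</article></li>
  <li class="message"><article>second post</article></li>
  <li class="message"><article>third post</article></li>
</ol>
"""

root = ET.fromstring(PAGE)

# Pattern from the question: loop over the container (a single element)
# and take the first article found inside it.
containers = [root]
first_only = [c.find(".//article").text for c in containers]

# Alternative: loop over each message inside the container instead.
messages = root.findall("./li[@class='message']")
all_posts = [m.find("article").text for m in messages]

print(first_only)
print(all_posts)
```

In Scrapy terms this would mean looping over the individual posts, for example something like `response.css(".messageList .message")`, though that selector is a guess and needs checking against the real page.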

python scrapy
