Discussion:
Django/Scrapy Model foreign keys?
Paul
2013-06-18 06:28:00 UTC
Currently I can't figure out a way to handle Django model foreign keys.

I set up my django models in a fashion similar to this:
http://blog.just2us.com/2012/07/setting-up-django-with-scrapy/

*Now when I try to "yield" a course item that is dependent on a Django
model as the foreign key, I get:*
#items.py
from mydjangoapp.models import School

class School(DjangoItem):
    django_model = University

#spider code
course = CourseItem(
    name=course_name,
    dept_name=dept_name,
    professors=course_professor,
    url=url,
    school=School.objects.filter(name=response.meta['school'])[0])
yield course

#result
exceptions.TypeError: <University: University object> is not JSON
serializable

*I have also tried yielding the actual Django object instead of using a
DjangoItem, but as you might imagine I also get an error.*
#spider code
from mydjangoapp.models import School, Course
course = Course(
    name=course_name,
    dept_name=dept_name,
    professors=course_professor,
    url=url,
    school=School.objects.filter(name=response.meta['school'])[0])

#result
2013-06-18 01:23:20-0500 [my_spider] ERROR: Spider must return Request,
BaseItem or None, got 'Course' in (url)

I have been wrestling with this problem for quite a while. Any ideas? There
isn't much that I could find about using DjangoItem (or raw Django models)
with Scrapy and foreign keys.

Thank you!
Paul
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To post to this group, send email to scrapy-users-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/groups/opt_out.
Paul Tremberth
2013-06-18 09:12:29 UTC
Hi,
I don't see the definition of University (nor where it's imported from).
Also, I would rename School(DjangoItem) to SchoolItem(DjangoItem) to avoid
clashing with your Django model.
In my (limited) experience with Django and Scrapy, the tricky part is
configuring access to your Django models inside Scrapy.
I haven't played with DjangoItem much, though.
Does your
School.objects.filter(name=response.meta['school'])[0]
give you the object you want?

Could you share more code and/or logs perhaps? (remove all
sensitive/proprietary code and anonymize at will)

Paul.
Paul
2013-06-18 14:02:41 UTC
University should be "School", sorry, I was in the process of switching the
name.

*So this should be:*
from mydjangoapp.models import School
class School(DjangoItem):
    django_model = School

*And the error is actually:*
#result
exceptions.TypeError: <School: School object> is not JSON serializable

I am actually intentionally using my Django model for "School" in this case
because that is how I retrieve the school object as the foreign key. If I
use a DjangoItem in that field as the key, it won't let me retrieve the
item from the database ("SchoolItem has no property 'objects'").

*Here is the Django Model definition for School:*
class School(models.Model):
    #id is implicit
    name = models.CharField(max_length=100)
    ...(more fields)
    date_updated = models.DateTimeField(default=datetime.now)

*And for Course:*
class Course(models.Model):
    #id is implicit
    #notice the foreign key attribute
    university = models.ForeignKey(University,max_length=100)
    name = models.CharField(max_length=100)
    ...(other)
    date_updated = models.DateTimeField(default=datetime.now)

School.objects.filter(name=response.meta['school'])[0] does in fact give me
the object I need. I pass the current school name via response.meta, and
retrieve it using Django's syntax for object lookup.

As you can see, School is a foreign key field within Course.

I have full access to my Django models and have imported my Django settings
correctly as far as I can tell (meaning that I can say from
mydjangoapp.models import Course within my Scrapy app, and it succeeds).

Does that help? Thank you so much for your time!
Paul
Paul Tremberth
2013-06-18 14:06:19 UTC
Would renaming to SchoolItem change anything?

class School*Item*(DjangoItem):
    django_model = School
Paul
2013-06-18 14:06:25 UTC
Note: my problem is pretty much the same as this one:
http://python.6.x6.nabble.com/Question-on-django-scrapy-integration-td5005527.html
Paul
2013-06-18 14:13:04 UTC
I get the same issue when I rename to "SchoolItem".

Note that I have to use 'School' (which is a Django model, not a Scrapy
DjangoItem) to retrieve the foreign key object from my database in Django.
course = CourseItem(name=course_name, ....,
    school=School.objects.filter(name=response.meta['school'])[0])

*Doing that yields the "School is not JSON Serializable" error: full trace
is:*
2013-06-18 09:07:39-0500 [spider] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.contrib.feedexport.FeedExporter object at 0xa5bbd0c>>
Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/core/scraper.py", line 213, in _itemproc_finished
    item=output, response=response, spider=spider)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
--- <exception caught here> ---
  File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/feedexport.py", line 191, in item_scraped
    slot.exporter.export_item(item)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/exporter/__init__.py", line 110, in export_item
    self.file.write(self.encoder.encode(itemdict))
  File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 89, in encode
    return super(ScrapyJSONEncoder, self).encode(o)
  File "/usr/lib/python2.6/json/encoder.py", line 367, in encode
    chunks = list(self.iterencode(o))
  File "/usr/lib/python2.6/json/encoder.py", line 309, in _iterencode
    for chunk in self._iterencode_dict(o, markers):
  File "/usr/lib/python2.6/json/encoder.py", line 275, in _iterencode_dict
    for chunk in self._iterencode(value, markers):
  File "/usr/lib/python2.6/json/encoder.py", line 317, in _iterencode
    for chunk in self._iterencode_default(o, markers):
  File "/usr/lib/python2.6/json/encoder.py", line 323, in _iterencode_default
    newobj = self.default(o)
  File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 109, in default
    return super(ScrapyJSONEncoder, self).default(o)
  File "/usr/lib/python2.6/json/encoder.py", line 344, in default
    raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: <School: School object> is not JSON serializable
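
The trace ends in the JSON encoder's default() hook, which is where the standard library gives up on a value it does not know how to encode. The sketch below reproduces that failure with the stdlib json module alone and shows one possible fallback. The School class here is a hypothetical stand-in (not the Django model), PkFallbackEncoder is an invented name, and nothing in this thread confirms how such an encoder would be plugged into Scrapy's feed exporter.

```python
import json


class School(object):
    """Hypothetical stand-in for the Django model; not the real class."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


class PkFallbackEncoder(json.JSONEncoder):
    """Encode any object that exposes a ``pk`` attribute as that key."""

    def default(self, o):
        if hasattr(o, "pk"):
            return o.pk
        # Defer to the base class, which raises TypeError for unknown types.
        return super(PkFallbackEncoder, self).default(o)


# An item-like dict whose "school" value is an object, as in the spider code.
item = {"name": "Algebra I", "school": School(5, "Example University")}

try:
    json.dumps(item)
    plain_failed = False
except TypeError:
    # Same class of error as in the trace above.
    plain_failed = True

encoded = json.dumps(item, cls=PkFallbackEncoder, sort_keys=True)
```

The simpler route, discussed in the replies, is to keep model instances out of the item entirely so that the default encoder never sees one.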
Paul Tremberth
2013-06-18 14:21:00 UTC
I don't know how DjangoItem handles that, but what about passing the
school_id in CourseItem?
CourseItem(name=course_name, ....,
    school_id=School.objects.filter(name=response.meta['school'])[0].id)

You will probably need to tweak the pipeline after that.
Paul
2013-06-18 14:31:10 UTC
Should I forgo the use of DjangoItems altogether? I was looking for a way
but couldn't figure out how to do that.

My current pipeline is:

class DjangoPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item

I'm not sure how I would change that though.

Unfortunately, although school_id is the primary key of the referenced
School, Django expects an object instead of an ID, so I get:
exceptions.ValueError: Cannot assign "5L": "Course.school" must be a
"School" instance.
Paul Tremberth
2013-06-18 14:41:38 UTC
Normally you can create Django objects using the "_id" suffix for your
referenced objects:
Course(name=..., school_id=5)
http://stackoverflow.com/questions/10622751/django-foreignkey-instance-vs-raw-id
http://stackoverflow.com/questions/2846029/django-set-foreign-key-using-integer

If that doesn't work with DjangoItem, you should look into its implementation:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/djangoitem.py

Otherwise, you could indeed do the insert in a second processing stage,
keeping the school_id in your regular Item (CourseItem).
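
A minimal sketch of that second stage, under stated assumptions: the item carries only plain values (including an integer school_id), so it stays JSON-serializable for the feed exporter, and the pipeline builds the model object from it. FakeCourse and CoursePipeline are hypothetical names invented for illustration. FakeCourse only records what it was given, so the sketch runs without Django; with the real model, the constructor call would be the Course(..., school_id=5) form mentioned above.

```python
import json


class FakeCourse(object):
    """Hypothetical stand-in for the Django ``Course`` model."""

    saved = []  # records every instance that save() was called on

    def __init__(self, **fields):
        self.fields = fields

    def save(self):
        FakeCourse.saved.append(self)


class CoursePipeline(object):
    """Second-stage processing: turn a plain item into a model instance."""

    model = FakeCourse

    def process_item(self, item, spider):
        # The item holds only plain values, including the integer school_id,
        # so it is copied straight into the model's keyword arguments.
        self.model(**dict(item)).save()
        return item


# The spider would yield something shaped like this (values are made up).
item = {"name": "Algebra I", "dept_name": "Math", "school_id": 5}

returned = CoursePipeline().process_item(item, spider=None)

# The returned item still contains nothing the JSON exporter would reject.
exported = json.dumps(returned, sort_keys=True)
```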
Paul
2013-06-18 14:52:24 UTC
Permalink
I'm going to need the second-stage processing; unfortunately, the other two
options did not pan out (though I had not known about the _id suggestion).

Can you suggest how I would do the second stage of processing? I know that
I could use a pipeline, but since many more item types than just "Course"
will be passing through it, I am hesitant to use one.

I assume what I would do is create a DjangoItem called "SchoolItem",
override "school_id" so that it is just a Field, and then have the pipeline
convert the item to a Django object and save it when the item reaches it? My
main question is how to do this efficiently, knowing that it will need to
be done for other foreign-key-dependent objects in my project.

I feel close!
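
One way the second stage described above could be sketched: a single pipeline
that looks up a handler by item class, so only the item types that carry a
foreign key get special treatment and everything else passes through
untouched. This is a minimal sketch under stated assumptions, not how
DjangoItem itself behaves: CourseItem and OtherItem are plain dict stand-ins
for Scrapy items, save_course and ForeignKeyPipeline are hypothetical names,
and the database write is replaced by a list append so the example runs
without Django or Scrapy installed. Only the process_item(item, spider)
method signature comes from Scrapy's pipeline interface.

```python
class CourseItem(dict):
    """Stand-in for a Scrapy item; stores school_id as a plain value."""


class OtherItem(dict):
    """Stand-in for any item type that needs no foreign key handling."""


saved_rows = []  # stand-in for the database


def save_course(item):
    # Assumption: in a real project the Django model would be built here,
    # for example Course(name=item['name'], school_id=item['school_id']).save()
    saved_rows.append(("course", item["name"], item["school_id"]))


class ForeignKeyPipeline(object):
    # Hypothetical pipeline: maps an item class to the function that stores it.
    handlers = {CourseItem: save_course}

    def process_item(self, item, spider):
        handler = self.handlers.get(type(item))
        if handler is not None:
            handler(item)
        return item  # always hand the item on to later pipelines and exporters


pipeline = ForeignKeyPipeline()
pipeline.process_item(CourseItem(name="Algebra", school_id=5), spider=None)
pipeline.process_item(OtherItem(title="ignored"), spider=None)
```

Supporting another foreign-key-dependent item type would then mean adding one
entry to the handlers mapping rather than writing a new pipeline.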
Post by Paul Tremberth
Normally you can create Django objects using the "_id" suffix for the
referenced objects:
Course(name=..., school_id=5)
http://stackoverflow.com/questions/10622751/django-foreignkey-instance-vs-raw-id
http://stackoverflow.com/questions/2846029/django-set-foreign-key-using-integer
If that doesn't work with DjangoItem, you should look into the implementation:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/djangoitem.py
Otherwise, you could indeed do the insert in a second processing stage,
keeping the school_id in your regular Item, CourseItem.
Post by Paul
Should I forgo the use of DjangoItems altogether? I was looking for a
way but couldn't figure out how to do that.
item.save()
return item
I'm not sure how I would change that though.
Unfortunately, though the school_id is the primary key of that foreign
exceptions.ValueError: Cannot assign "5L": "Course.school" must be a
"School" instance.
Post by Paul Tremberth
I don't know how DjangoItem handles that, but what about passing the
school_id in CourseItem?
CourseItem(name=course_name, ....,
school_id=School.objects.filter(name=response.meta['school'])[0].id)
You probably need to tweak the pipeline after that.
Post by Paul
I get the same issue when I rename to "SchoolItem".
Note that I have to use 'School' (which is a Django model, not a Scrapy
DjangoItem) to retrieve the foreign key object from my database in Django.
course = CourseItem(name=course_name, ....,
school=School.objects.filter(name=response.meta['school'])[0])
*Doing that yields the "School is not JSON Serializable" error; the full
trace is:*
2013-06-18 09:07:39-0500 [spider] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.contrib.feedexport.FeedExporter object at 0xa5bbd0c>>
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/usr/local/lib/python2.6/dist-packages/scrapy/core/scraper.py", line 213, in _itemproc_finished
item=output, response=response, spider=spider)
File "/usr/local/lib/python2.6/dist-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
return signal.send_catch_log_deferred(*a, **kw)
File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
*arguments, **named)
--- <exception caught here> ---
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.6/dist-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
return receiver(*arguments, **named)
File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/feedexport.py", line 191, in item_scraped
slot.exporter.export_item(item)
File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/exporter/__init__.py", line 110, in export_item
self.file.write(self.encoder.encode(itemdict))
File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 89, in encode
return super(ScrapyJSONEncoder, self).encode(o)
File "/usr/lib/python2.6/json/encoder.py", line 367, in encode
chunks = list(self.iterencode(o))
File "/usr/lib/python2.6/json/encoder.py", line 309, in _iterencode
File "/usr/lib/python2.6/json/encoder.py", line 275, in _iterencode_dict
File "/usr/lib/python2.6/json/encoder.py", line 317, in _iterencode
File "/usr/lib/python2.6/json/encoder.py", line 323, in _iterencode_default
newobj = self.default(o)
File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 109, in default
return super(ScrapyJSONEncoder, self).default(o)
File "/usr/lib/python2.6/json/encoder.py", line 344, in default
raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: <School: School object> is not JSON serializable
Paul Tremberth
2013-06-18 14:10:26 UTC
Permalink
Sorry, I hadn't fully read "If I use a DjangoItem in that field as the key,
it won't let me retrieve the item from the database ("SchoolItem has no
property 'objects'")."

Could you post (part of) your spider code to http://pastebin.com or a
similar service?
Paul
2013-06-18 14:19:30 UTC
Permalink
I don't think the problem is so much with the spider (there's a lot of
code, but it worked until this foreign key issue) as with handling
foreign keys in Scrapy models in general.

My setup is just like the one here:
http://python.6.x6.nabble.com/Question-on-django-scrapy-integration-td5005527.html

And I have the same error.

Mainly, I just need to know: if I have two Django models, and one is
referenced by a foreign key on the other, how can I insert into them from
Scrapy? Maybe there is some way to override the Course after I have created
it, and make the foreign key object JSON serializable before yielding it?

If you want to chat offline/screenshare for speed purposes let me know.
Thank you for your time and quick replies!
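
To illustrate the idea of making the foreign key value JSON serializable
before the item is yielded, here is a minimal sketch that runs without Django
or Scrapy. It reproduces the failure with the standard json module and then
swaps the model-like object for its primary key. The School class is a
hypothetical stand-in for the Django model (real Django instances do expose a
pk attribute), and to_exportable is an illustrative helper name, not part of
Scrapy or DjangoItem.

```python
import json


class School(object):
    """Hypothetical stand-in for the Django model; only pk matters here."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


def to_exportable(item):
    """Return a copy of item with model-like values replaced by their pk."""
    return dict((key, getattr(value, "pk", value)) for key, value in item.items())


course = {"name": "Algebra", "school": School(pk=5, name="Example University")}

try:
    json.dumps(course)
    raw_is_serializable = True
except TypeError:
    # Same failure mode as in the trace: the encoder has no rule for School.
    raw_is_serializable = False

# With the object replaced by its primary key, encoding succeeds.
exported = json.dumps(to_exportable(course), sort_keys=True)
```

The trade-off is that the exported item then carries only the id, so whatever
saves it to the database has to work with school_id rather than a School
instance.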